Commit 93ff306b authored by Niels-Oliver Walkowski

upd: Make changes requested by editor review

parent 5294045d
Due to file size limitations, we had to move the pretrained models to another repository.
Please download the pickled model here: http://gams.uni-graz.at/o:dispecs.word2vec.it/ITALIAN
\ No newline at end of file
Due to file size limitations, we had to move the pretrained models to another repository.
Please download the pickled model here: http://gams.uni-graz.at/o:dispecs.word2vec.es/SPANISH
\ No newline at end of file
Due to file size limitations on GitHub, we had to move the pretrained models to Google Drive.
Please download the pickled model here: https://drive.google.com/file/d/15ChW1dipL2ULU-yaa5oXN-ZUJ00HW5VV/view?usp=sharing
\ No newline at end of file
Due to file size limitations on GitHub, we had to move the pretrained models to Google Drive.
Please download the pickled model here: https://drive.google.com/file/d/14O2s3x9ZEG3Zx-SVA6LcDScpZuSAosiJ/view?usp=sharing
\ No newline at end of file
Due to file size limitations on GitHub, we had to move the pretrained models to Google Drive.
Please download the pickled model here: https://drive.google.com/file/d/1SME9EMCF8dnSJU458kaoV1dZUmgNHMel/view?usp=sharing
\ No newline at end of file
File mode changed from 100755 to 100644
@@ -24,7 +24,7 @@
 "More precisely, the first part requires manual annotations of seed words from which we transfer sentiment to other words occurring in a similar context as seed words. To transfer sentiment, we train word embeddings and use a machine learning classification task. In doing so, we computationally extend the list of annotated words and avoid a more time-consuming and tedious manual annotation process. Note that this procedure is also adaptable to other languages (also contemporary languages).\n",
 "In the second part, we provide a collection of ready-to-use sentiment dictionaries (which we created with the first part of the tool chain) as well as methods to perform the actual sentiment analysis. The implemented methods range from listing basic descriptive statistics to various kinds of plots that allow for an easy interpretation of sentiment expressed in a given text corpus. Further, our methods analyze sentiment on a macro- and microscopic level, as they can not only be applied to a whole corpus providing a bigger picture of data, but also on a document level, for example, by highlighting words that convey sentiment in any given text.\n",
 "\n",
-"This repository contains all the data we used for creating sentiment dictionaries, including manually annotated seed words, pre-trained word embedding models and other data resulting from intermediate steps. These can also be used in other contexts and NLP tasks and are not necessarily limited to sentiment analysis. As such, our tool chain serves as a foundation for further methods and approaches applicable to all projects focusing on the computational interpretation of texts.\n",
+"This repository contains all the data we used for creating sentiment dictionaries, including manually annotated seed words, pretrained word embedding models and other data resulting from intermediate steps. These can also be used in other contexts and NLP tasks and are not necessarily limited to sentiment analysis. As such, our tool chain serves as a foundation for further methods and approaches applicable to all projects focusing on the computational interpretation of texts.\n",
 "\n",
 "Before we continue, please import the necessary Python Packages for the interactive examples:"
 ]
@@ -37,12 +37,10 @@
 "source": [
 "import pandas\n",
 "import matplotlib.pyplot as plt\n",
-"import gensim.downloader as api\n",
 "import nltk\n",
 "import seaborn\n",
 "import re\n",
 "from nltk.tokenize import WordPunctTokenizer\n",
-"from mpl_toolkits.mplot3d import Axes3D\n",
 "from sklearn.feature_extraction.text import CountVectorizer\n",
 "from sklearn.neighbors import KNeighborsClassifier\n",
 "from gensim.models.word2vec import Word2Vec\n",
@@ -53,7 +51,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Additionally, please execute the following cell to download data (i.e., the Punkt tokenizer and the text8 corpus) provided by NLTK and Gensim which is necessary for the examples presented in this Notebook:"
+"Additionally, please execute the following cell to download data (i.e., the Punkt tokenizer) provided by NLTK which is necessary for the examples presented in this Notebook:"
 ]
 },
 {
@@ -69,11 +67,20 @@
 "[nltk_data] /Users/philippkoncar/nltk_data...\n",
 "[nltk_data] Package punkt is already up-to-date!\n"
 ]
+},
+{
+"data": {
+"text/plain": [
+"True"
+]
+},
+"execution_count": 2,
+"metadata": {},
+"output_type": "execute_result"
 }
 ],
 "source": [
-"nltk.download('punkt')\n",
-"word2vec_example_corpus = api.load(\"text8\")"
+"nltk.download('punkt')"
 ]
 },
 {
@@ -124,7 +131,7 @@
 "We can assess the sentiment of each sentence by considering whether the words contained in the dictionary occur in them.\n",
 "For that, let us define the sentiment *s* of a sentence as follows:\n",
 "\n",
-"$$s = W_p - W_n$$\n",
+"![Sentiment Formula for the Introductory Example](miscellaneous/sentiment_formula_1.png)\n",
 "\n",
 "where $W_p$ is the number of positive words in a sentence and $W_n$ is the number of negative words in a sentence.\n",
 "Thus, the formula subtracts the number of words with a negative sentiment from the number of words with a positive sentiment.\n",
@@ -146,7 +153,8 @@
 "example_sentences = [\n",
 " \"Today was a good day.\",\n",
 " \"I hate getting up early.\",\n",
-" \"This is both funny and sad at the same time.\"]"
+" \"This is both funny and sad at the same time.\"\n",
+"]"
 ]
 },
 {
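As an aside, the s = W_p - W_n scoring described in this hunk can be made concrete in a few lines of Python. A minimal sketch; the tiny positive/negative word sets are illustrative assumptions, not the project's dictionaries:

```python
# Minimal sketch of the introductory formula s = W_p - W_n.
# The two word sets are toy assumptions for illustration only.
from nltk.tokenize import WordPunctTokenizer

positive_words = {"good", "funny"}
negative_words = {"hate", "sad"}

tokenizer = WordPunctTokenizer()
for sentence in ["Today was a good day.",
                 "I hate getting up early.",
                 "This is both funny and sad at the same time."]:
    tokens = [t.lower() for t in tokenizer.tokenize(sentence)]
    s = sum(t in positive_words for t in tokens) - sum(t in negative_words for t in tokens)
    print(sentence, "->", s)  # 1, -1 and 0, respectively
```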
@@ -460,9 +468,7 @@
 "The former predicts a word from a given context whereas the latter predicts the context from a given word.\n",
 "Both have their advantages and disadvantages regarding the size of the underlying text corpus, which is why we consider both of them in our tool chain.\n",
 "\n",
-"Before talking about the specifics of our approach, we want to demonstrate the benefits of word2vec. In the following example, we use Gensim's word2vec implementation to train word embeddings of the publicly available *text8* dataset (you have already downloaded it in the *Introduction* section of this Notebook).\n",
-"\n",
-"We can train the model with one line of code (note that this can take several minutes to complete):"
+"Before talking about the specifics of our approach, we want to demonstrate the benefits of word2vec. For the following example, we used Gensim's word2vec implementation to train word embeddings of the publicly available *text8* dataset. We can load the pretrained word2vec model with the following line of code:"
 ]
 },
 {
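Side note on the CBOW/skip-gram distinction mentioned in this hunk: Gensim's `Word2Vec` selects between the two via its `sg` parameter. A minimal sketch on a toy corpus (placeholder data, far too small for real training):

```python
# sg=0 trains CBOW (the default); sg=1 trains skip-gram.
# The toy corpus is a placeholder; real embeddings need much more text.
from gensim.models.word2vec import Word2Vec

toy_corpus = [["the", "car", "drove", "down", "the", "road"],
              ["the", "driver", "parked", "the", "car"]] * 50

cbow_model = Word2Vec(toy_corpus, min_count=1, sg=0)
skipgram_model = Word2Vec(toy_corpus, min_count=1, sg=1)
print(cbow_model.wv["car"].shape)  # (100,) with the default vector size
```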
@@ -471,14 +477,14 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"word2vec_model = Word2Vec(word2vec_example_corpus)"
+"word2vec_model = Word2Vec.load(\"data/processed_data/word2vec_models/example.p\")"
 ]
 },
 {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"With the default setup of the `word2vec` method, we train vectors with a dimension of 100. Thus, instead of the three numbers in a vector from the count vector example above, we now have 100 numbers (i.e., a vector of length 100).\n",
+"As we used the default parameters for word2vec as implemented in Gensim, we trained vectors with a dimension of 100. Thus, instead of the three numbers in a vector from the count vector example above, we now have 100 numbers (i.e., a vector of length 100).\n",
 "Moreover, these numbers are now real-valued instead of integer and their interpretation is less self-evident.\n",
 "We can inspect the resulting vector of a word, for example *car*, with:"
 ]
@@ -492,23 +498,23 @@
 "name": "stdout",
 "output_type": "stream",
 "text": [
-"[-0.96366096 -1.3742831 0.73106915 1.2464461 0.19334741 -0.858718\n",
-" -0.01184064 -0.24652933 -1.0748054 -0.5054727 1.1855398 -0.62031776\n",
-" 0.39737985 1.8011881 -2.595472 -1.9477972 0.44193438 -1.2095671\n",
-" -0.03679457 0.22283824 0.8520565 0.19460759 0.601061 -0.25661194\n",
-" 0.5227143 -1.2684288 -0.6947603 0.71117973 -0.83271295 1.3840522\n",
-" 2.6297543 2.1552653 0.46458372 -0.15175714 -1.7563695 -0.18980268\n",
-" -0.19808206 -2.296005 -0.4825583 0.84871304 0.7676269 -0.23888186\n",
-" 1.5052361 1.7347597 2.9247804 -0.60031617 -2.162292 0.19464816\n",
-" 0.38104555 -0.25405452 0.56672496 0.18307838 1.9986428 -3.067654\n",
-" 1.2176334 1.1274312 0.52531594 0.21777926 0.16100219 -0.06693241\n",
-" -0.9263279 0.34482655 -0.7890902 -0.06024634 -0.6747798 -0.88505363\n",
-" -0.8940445 0.63983166 -0.28610742 -3.1221943 0.9004476 1.3235196\n",
-" 0.33159682 1.4678319 0.07791691 1.7263894 -1.1614616 1.1387984\n",
-" -1.2820066 -1.8504483 -0.41011232 -1.8067706 1.8940349 1.5587422\n",
-" -1.9671468 -0.20638618 1.795837 -0.37610653 1.1748948 0.65870816\n",
-" -1.9154986 -2.549661 -3.388793 -0.27740353 0.83503336 -2.0548694\n",
-" 1.2873101 -0.23276174 3.1180558 2.5557537 ]\n"
+"[-0.12308194 -0.39095813 1.000928 -0.04769582 0.9710199 -1.3676833\n",
+" -1.4404075 -0.7423445 1.6969757 -1.1585793 0.64239687 0.6409116\n",
+" 1.4341261 0.4111035 -2.4854403 -0.7219625 1.540313 0.42790467\n",
+" -0.44106185 -1.114401 1.3602705 0.8318557 2.0098808 0.7983357\n",
+" -1.5087671 -0.8375068 -0.61658716 1.2119923 -0.6702468 -0.3222445\n",
+" 0.9415244 2.6550481 -0.99794585 0.04601024 -1.6316731 -1.439617\n",
+" -1.431928 -2.4456704 -0.71318716 0.86487716 2.2365756 -0.89588803\n",
+" -0.7087643 0.9257705 1.712533 0.7194995 -2.4665103 -1.3497332\n",
+" 2.2958574 -0.54635614 0.9186488 0.51365685 0.09351666 -1.0833061\n",
+" -0.00810259 2.8242166 0.1252522 -1.207868 0.10782211 -0.34977445\n",
+" -0.30000007 0.33047387 -0.12232512 -0.52950805 -2.4587536 -0.37481222\n",
+" -2.2148058 0.14348628 -0.79030484 -2.5900028 2.7875724 -0.7795173\n",
+" 0.6641297 2.6237233 1.7713573 1.7022327 -0.04617653 1.2087046\n",
+" -1.4730823 0.74134797 -0.26776415 0.22373354 0.71002257 1.5748668\n",
+" -1.778043 0.48367617 1.0869575 -0.6362949 0.63211554 0.5351157\n",
+" -1.8014896 0.39312994 -2.118675 0.83928734 -0.3225636 -2.1843622\n",
+" -0.557146 -1.6596688 1.2372373 3.977601 ]\n"
 ]
 }
 ],
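The source cell that produced this output is collapsed in the diff; it is presumably a one-liner along these lines (an assumption consistent with Gensim's API, not shown in the commit):

```python
# Presumed source of the collapsed cell: print the 100-dimensional
# vector the model learned for the word "car".
print(word2vec_model.wv["car"])
```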
@@ -531,26 +537,26 @@
 {
 "data": {
 "text/plain": [
-"[('driver', 0.7858742475509644),\n",
-" ('cars', 0.7570940852165222),\n",
-" ('motorcycle', 0.7254790663719177),\n",
-" ('taxi', 0.7133187055587769),\n",
-" ('truck', 0.7059160470962524),\n",
-" ('tire', 0.6728494763374329),\n",
-" ('racing', 0.6664857268333435),\n",
-" ('cab', 0.6571638584136963),\n",
-" ('automobile', 0.655803918838501),\n",
-" ('glider', 0.6467955708503723),\n",
-" ('passenger', 0.6462221741676331),\n",
-" ('vehicle', 0.6456612944602966),\n",
-" ('motor', 0.6449832320213318),\n",
-" ('automobiles', 0.6323708891868591),\n",
-" ('mercedes', 0.6311836242675781),\n",
-" ('diesel', 0.6301203370094299),\n",
-" ('honda', 0.6140277981758118),\n",
-" ('powered', 0.613706111907959),\n",
-" ('pilot', 0.6132946610450745),\n",
-" ('stock', 0.6110790967941284)]"
+"[('driver', 0.7839363217353821),\n",
+" ('taxi', 0.7524467706680298),\n",
+" ('cars', 0.725511908531189),\n",
+" ('motorcycle', 0.7036831378936768),\n",
+" ('vehicle', 0.698715090751648),\n",
+" ('truck', 0.6913774609565735),\n",
+" ('passenger', 0.661078155040741),\n",
+" ('automobile', 0.6501474380493164),\n",
+" ('audi', 0.6245964169502258),\n",
+" ('glider', 0.6229903101921082),\n",
+" ('tire', 0.6213281154632568),\n",
+" ('cab', 0.6198135018348694),\n",
+" ('engine', 0.6183426380157471),\n",
+" ('volkswagen', 0.6164752840995789),\n",
+" ('engined', 0.6096624732017517),\n",
+" ('airplane', 0.6076435446739197),\n",
+" ('bmw', 0.6070380210876465),\n",
+" ('elevator', 0.6061339974403381),\n",
+" ('racing', 0.6031301617622375),\n",
+" ('stock', 0.6030023097991943)]"
 ]
 },
 "execution_count": 11,
@@ -567,7 +573,7 @@
 "metadata": {},
 "source": [
 "Another advantage of word2vec compared to count vectors is that it can capture various concepts and analogies.\n",
-"Perhaps the most famous example in that regard is as follows: If you subtract the vector of the word *man* from the vector of the word *king* and add the vector of the word *woman* it should result in a vector very close to that of the word *queen*. We can evaluate whether this is the case with our trained model through Gensim very easily:"
+"Perhaps the most famous example in that regard is as follows: If you subtract the vector of the word *man* from the vector of the word *king* and add the vector of the word *woman* it should result in a vector very close to that of the word *queen*. We can evaluate whether this is the case with our pretrained model through Gensim very easily:"
 ]
 },
 {
@@ -578,7 +584,7 @@
 {
 "data": {
 "text/plain": [
-"[('prince', 0.6386325359344482)]"
+"[('queen', 0.6670934557914734)]"
 ]
 },
 "execution_count": 12,
@@ -925,9 +931,9 @@
 "source": [
 "language = \"Spanish\"\n",
 "sentiment_dict = {}\n",
-"with open(\"{}{}_negative.txt\".format(\"data/dictionaries/computational_corrected/\", language.lower()), \"r\", encoding=\"utf-8\") as fr:\n",
+"with open(\"{}{}_negative.txt\".format(\"data/processed_data/dictionaries/computational_corrected/\", language.lower()), \"r\", encoding=\"utf-8\") as fr:\n",
 " sentiment_dict[\"neg\"] = fr.read().splitlines()\n",
-"with open(\"{}{}_positive.txt\".format(\"data/dictionaries/computational_corrected/\", language.lower()), \"r\", encoding=\"utf-8\") as fr:\n",
+"with open(\"{}{}_positive.txt\".format(\"data/processed_data/dictionaries/computational_corrected/\", language.lower()), \"r\", encoding=\"utf-8\") as fr:\n",
 " sentiment_dict[\"pos\"] = fr.read().splitlines()\n",
 "\n",
 "print(\"loaded {} negative words\".format(len(sentiment_dict[\"neg\"])))\n",
@@ -972,7 +978,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"For simplicity, we already transformed French, Italian and Spanish texts into respective pandas DataFrames. If you want to see the code for this process please refer to the corresponding Jupyter Notebook in the code folder. Further, we provide the individual .txt files for each language in the `data/example_texts/` directory of this repository if you wish to reconstruct the DataFrame creation yourself.\n",
+"For simplicity, we already transformed French, Italian and Spanish texts into respective pandas DataFrames. If you want to see the code for this process please refer to the corresponding Jupyter Notebook in the code folder. Further, we provide the individual .txt files for each language in the `data/raw_data/example_texts/` directory of this repository if you wish to reconstruct the DataFrame creation yourself.\n",
 "\n",
 "Now we can load the pandas DataFrame containing Spanish texts as well as four additional attributes (i.e., periodical title, issue number, author, year) as follows:"
 ]
@@ -991,7 +997,7 @@
 }
 ],
 "source": [
-"texts_df = pandas.read_pickle(\"data/example_texts/spanish.p\")\n",
+"texts_df = pandas.read_pickle(\"data/processed_data/example_texts/spanish.p\")\n",
 "print(\"loaded dataframe with {} texts and {} attributes\".format(texts_df.shape[0], texts_df.shape[1] - 1))"
 ]
 },
@@ -1004,7 +1010,7 @@
 "By now, we have loaded our sentiment dictionaries as well as our texts. To compute sentiment, we have to consider the occurrences of the words contained in the dictionaries in the text files.\n",
 "For that, we use the following formula, defining the sentiment *s* of a text with:\n",
 "\n",
-"$$s = \\frac{W_p - W_n}{W_p + W_n} $$\n",
+"![Sentiment Formula](miscellaneous/sentiment_formula_2.png)\n",
 "\n",
 "where $W_p$ is the number of positive words in a text and $W_n$ is the number of negative words in a text.\n",
 "Thus, the computed sentiment score is a value ranging between −1 and +1, where values close to −1 are considered as negative, values close to +1 as positive, and where values close to zero indicate a neutral sentiment.\n",
...
@@ -5,5 +5,5 @@ tqdm == 4.50.2
 nltk == 3.5
 ipywidgets == 7.5.1
 gensim == 3.8.3
-sklearn == 0.23.2
+scikit-learn == 0.23.2
 stop-words == 2018.7.23
\ No newline at end of file