Commit 93ff306b authored by Niels-Oliver Walkowski's avatar Niels-Oliver Walkowski

upd: Make changes requested by editor review results

parent 5294045d
Due to file size limitations, we had to move the pretrained models to another repository.
Please download the pickled model here: http://gams.uni-graz.at/o:dispecs.word2vec.it/ITALIAN
\ No newline at end of file
Due to file size limitations, we had to move the pretrained models to another repository.
Please download the pickled model here: http://gams.uni-graz.at/o:dispecs.word2vec.es/SPANISH
\ No newline at end of file
Due to file size limitations on GitHub, we had to move the pretrained models to Google Drive.
Please download the pickled model here: https://drive.google.com/file/d/15ChW1dipL2ULU-yaa5oXN-ZUJ00HW5VV/view?usp=sharing
\ No newline at end of file
Due to file size limitations on GitHub, we had to move the pretrained models to Google Drive.
Please download the pickled model here: https://drive.google.com/file/d/14O2s3x9ZEG3Zx-SVA6LcDScpZuSAosiJ/view?usp=sharing
\ No newline at end of file
Due to file size limitations on GitHub, we had to move the pretrained models to Google Drive.
Please download the pickled model here: https://drive.google.com/file/d/1SME9EMCF8dnSJU458kaoV1dZUmgNHMel/view?usp=sharing
\ No newline at end of file
File mode changed from 100755 to 100644
......@@ -15,42 +15,43 @@
The proposed tool chain comprises two different parts: (i) the optional creation of sentiment dictionaries and (ii) the actual sentiment analysis.
More precisely, the first part requires manually annotated seed words from which we transfer sentiment to other words that occur in contexts similar to those of the seed words. To transfer sentiment, we train word embeddings and use a machine learning classification task. In doing so, we computationally extend the list of annotated words and avoid a more time-consuming and tedious manual annotation process. Note that this procedure is also adaptable to other languages (including contemporary ones).
In the second part, we provide a collection of ready-to-use sentiment dictionaries (which we created with the first part of the tool chain) as well as methods to perform the actual sentiment analysis. The implemented methods range from basic descriptive statistics to various kinds of plots that allow for an easy interpretation of the sentiment expressed in a given text corpus. Further, our methods analyze sentiment on both a macroscopic and a microscopic level: they can be applied not only to a whole corpus, providing a bigger picture of the data, but also to individual documents, for example, by highlighting words that convey sentiment in any given text.
This repository contains all the data we used for creating sentiment dictionaries, including manually annotated seed words, pre-trained word embedding models and other data resulting from intermediate steps. These can also be used in other contexts and NLP tasks and are not necessarily limited to sentiment analysis. As such, our tool chain serves as a foundation for further methods and approaches applicable to all projects focusing on the computational interpretation of texts.
This repository contains all the data we used for creating sentiment dictionaries, including manually annotated seed words, pretrained word embedding models and other data resulting from intermediate steps. These can also be used in other contexts and NLP tasks and are not necessarily limited to sentiment analysis. As such, our tool chain serves as a foundation for further methods and approaches applicable to all projects focusing on the computational interpretation of texts.
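To give a first impression of the sentiment-transfer step described above, the following cell shows how a simple nearest-neighbour classifier could assign sentiment labels to unannotated words based on their word vectors. This is an illustrative sketch only: all words and vectors below are toy examples and do not correspond to the actual pipeline presented later in this Notebook.
%% Cell type:code id: tags:
``` python
# Illustrative sketch only: propagate sentiment from annotated seed words to a new
# word via k-nearest neighbours over word vectors. The vectors below are made up;
# in the actual pipeline they come from a trained word2vec model.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

toy_vectors = {
    "good":  np.array([0.9, 0.1]),
    "happy": np.array([0.8, 0.2]),
    "bad":   np.array([0.1, 0.9]),
    "sad":   np.array([0.2, 0.8]),
    "funny": np.array([0.7, 0.3]),
}
seed_words = {"good": "positive", "happy": "positive",
              "bad": "negative", "sad": "negative"}

X = [toy_vectors[word] for word in seed_words]
y = [seed_words[word] for word in seed_words]

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)

# "funny" was not annotated manually; the classifier transfers a label to it.
print(knn.predict([toy_vectors["funny"]]))
```
%% Cell type:markdown id: tags: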
Before we continue, please import the necessary Python packages for the interactive examples:
%% Cell type:code id: tags:
``` python
import pandas
import matplotlib.pyplot as plt
import gensim.downloader as api
import nltk
import seaborn
import re
from nltk.tokenize import WordPunctTokenizer
from mpl_toolkits.mplot3d import Axes3D
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import KNeighborsClassifier
from gensim.models.word2vec import Word2Vec
from IPython.display import display, HTML, Markdown
```
%% Cell type:markdown id: tags:
Additionally, please execute the following cell to download data (i.e., the Punkt tokenizer and the text8 corpus) provided by NLTK and Gensim which is necessary for the examples presented in this Notebook:
Additionally, please execute the following cell to download data (i.e., the Punkt tokenizer) provided by NLTK which is necessary for the examples presented in this Notebook:
%% Cell type:code id: tags:
``` python
nltk.download('punkt')
word2vec_example_corpus = api.load("text8")
```
%%%% Output: execute_result
True
%% Cell type:markdown id: tags:
[^1]: A cooperation between the Institute for Interactive Systems and Data Science at Graz University of Technology, the Know-Center GmbH, as well as the Centre for Information Modelling - Austrian Centre for Digital Humanities (ZIM-ACDH) and the Institute of Romance Studies, both at the University of Graz.
%% Cell type:markdown id: tags:
......@@ -87,11 +88,11 @@
| happy | positive |
We can assess the sentiment of each sentence by considering whether the words contained in the dictionary occur in it.
For that, let us define the sentiment *s* of a sentence as follows:
$$s = W_p - W_n$$
![Sentiment Formula for the Introductory Example](miscellaneous/sentiment_formula_1.png)
where $W_p$ is the number of positive words in a sentence and $W_n$ is the number of negative words in a sentence.
Thus, the formula subtracts the number of words with a negative sentiment from the number of words with a positive sentiment.
If the resulting value is negative, then we assume the sentiment of a sentence to be negative, whereas if the value is positive, we assume the sentiment of a sentence to be positive.
This value is zero if the two numbers are equal or if there are no words of the dictionary in a sentence. In this case, we assume that the sentence has a neutral sentiment.
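As a minimal sketch of this formula (the positive and negative word lists here are purely illustrative and much smaller than the dictionaries used later), the score could be computed as follows:
%% Cell type:code id: tags:
``` python
# Minimal sketch of the introductory formula s = W_p - W_n.
# The word lists are illustrative, not the dictionaries used later in this Notebook.
positive_words = {"good", "happy", "funny"}
negative_words = {"bad", "sad", "hate"}

def simple_sentiment(sentence):
    tokens = [token.lower().strip(".,!?") for token in sentence.split()]
    w_p = sum(token in positive_words for token in tokens)
    w_n = sum(token in negative_words for token in tokens)
    return w_p - w_n

print(simple_sentiment("Today was a good day."))                         # 1, i.e. positive
print(simple_sentiment("This is both funny and sad at the same time."))  # 0, i.e. neutral
```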
......@@ -103,11 +104,12 @@
``` python
example_sentences = [
"Today was a good day.",
"I hate getting up early.",
"This is both funny and sad at the same time."]
"This is both funny and sad at the same time."
]
```
%% Cell type:markdown id: tags:
Further, we need to create a dictionary that maps each word from our sentiment dictionary to a sentiment:
......@@ -271,23 +273,21 @@
For our dictionary creation pipeline we rely on word2vec (Mikolov et al. 2013), a state-of-the-art method to compute word embeddings.
Word2vec relies on a two-layer neural network to train word vectors that capture the linguistic contexts of words and comes in one of two model architectures: continuous bag-of-words (CBOW) or skip-gram.
The former predicts a word from a given context whereas the latter predicts the context from a given word.
Both have their advantages and disadvantages regarding the size of the underlying text corpus, which is why we consider both of them in our tool chain.
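For instance, Gensim's `Word2Vec` class lets us switch between the two architectures via the `sg` parameter. The toy corpus below is only a placeholder to illustrate the call (Gensim 3.8.3 API, using the `Word2Vec` class imported above):
%% Cell type:code id: tags:
``` python
# Illustrative only: CBOW vs. skip-gram on a tiny toy corpus (gensim 3.8.3 API).
toy_corpus = [["the", "king", "rules", "the", "land"],
              ["the", "queen", "rules", "the", "land"]] * 50

cbow_model = Word2Vec(toy_corpus, size=100, sg=0, min_count=1)      # CBOW (default)
skipgram_model = Word2Vec(toy_corpus, size=100, sg=1, min_count=1)  # skip-gram
```
%% Cell type:markdown id: tags: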
Before talking about the specifics of our approach, we want to demonstrate the benefits of word2vec. In the following example, we use Gensim's word2vec implementation to train word embeddings of the publicly available *text8* dataset (you have already downloaded it in the *Introduction* section of this Notebook).
We can train the model with one line of code (note that this can take several minutes to complete):
Before talking about the specifics of our approach, we want to demonstrate the benefits of word2vec. For the following example, we used Gensim's word2vec implementation to train word embeddings on the publicly available *text8* dataset. We can load the pretrained word2vec model with the following line of code:
%% Cell type:code id: tags:
``` python
word2vec_model = Word2Vec(word2vec_example_corpus)
word2vec_model = Word2Vec.load("data/processed_data/word2vec_models/example.p")
```
%% Cell type:markdown id: tags:
With the default setup of the `word2vec` method, we train vectors with a dimension of 100. Thus, instead of the three numbers in a vector from the count vector example above, we now have 100 numbers (i.e., a vector of length 100).
As we used the default parameters for word2vec as implemented in Gensim, we trained vectors with a dimension of 100. Thus, instead of the three numbers in a vector from the count vector example above, we now have 100 numbers (i.e., a vector of length 100).
Moreover, these numbers are now real-valued instead of integers, and their interpretation is less self-evident.
We can inspect the resulting vector of a word, for example *car*, with:
%% Cell type:code id: tags:
......@@ -305,45 +305,45 @@
word2vec_model.wv.most_similar("car", topn=20)
```
%%%% Output: execute_result
[('driver', 0.7858742475509644),
('cars', 0.7570940852165222),
('motorcycle', 0.7254790663719177),
('taxi', 0.7133187055587769),
('truck', 0.7059160470962524),
('tire', 0.6728494763374329),
('racing', 0.6664857268333435),
('cab', 0.6571638584136963),
('automobile', 0.655803918838501),
('glider', 0.6467955708503723),
('passenger', 0.6462221741676331),
('vehicle', 0.6456612944602966),
('motor', 0.6449832320213318),
('automobiles', 0.6323708891868591),
('mercedes', 0.6311836242675781),
('diesel', 0.6301203370094299),
('honda', 0.6140277981758118),
('powered', 0.613706111907959),
('pilot', 0.6132946610450745),
('stock', 0.6110790967941284)]
[('driver', 0.7839363217353821),
('taxi', 0.7524467706680298),
('cars', 0.725511908531189),
('motorcycle', 0.7036831378936768),
('vehicle', 0.698715090751648),
('truck', 0.6913774609565735),
('passenger', 0.661078155040741),
('automobile', 0.6501474380493164),
('audi', 0.6245964169502258),
('glider', 0.6229903101921082),
('tire', 0.6213281154632568),
('cab', 0.6198135018348694),
('engine', 0.6183426380157471),
('volkswagen', 0.6164752840995789),
('engined', 0.6096624732017517),
('airplane', 0.6076435446739197),
('bmw', 0.6070380210876465),
('elevator', 0.6061339974403381),
('racing', 0.6031301617622375),
('stock', 0.6030023097991943)]
%% Cell type:markdown id: tags:
Another advantage of word2vec compared to count vectors is that it can capture various concepts and analogies.
Perhaps the most famous example in that regard is as follows: If you subtract the vector of the word *man* from the vector of the word *king* and add the vector of the word *woman* it should result in a vector very close to that of the word *queen*. We can evaluate whether this is the case with our trained model through Gensim very easily:
Perhaps the most famous example in that regard is as follows: If you subtract the vector of the word *man* from the vector of the word *king* and add the vector of the word *woman*, it should result in a vector very close to that of the word *queen*. We can easily check whether this holds for our pretrained model using Gensim:
%% Cell type:code id: tags:
``` python
word2vec_model.wv.most_similar(positive=["woman", "king"], negative=["man"], topn=1)
```
%%%% Output: execute_result
[('prince', 0.6386325359344482)]
[('queen', 0.6670934557914734)]
%% Cell type:markdown id: tags:
We observe that, indeed, the trained word embedding correctly reflects this analogy.
Can you think of any other concepts or analogies to test? Feel free to try it with other word vectors.
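For instance, a capital-city analogy can be queried in the same way; whether the expected word comes out on top depends on the model and on the relatively small *text8* corpus it was trained on:
%% Cell type:code id: tags:
``` python
# Another analogy query: France is to Paris as Germany is to ...?
# The result depends on the pretrained model and may not always be the expected city.
word2vec_model.wv.most_similar(positive=["paris", "germany"], negative=["france"], topn=1)
```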
......@@ -612,13 +612,13 @@
%% Cell type:code id: tags:
``` python
language = "Spanish"
sentiment_dict = {}
with open("{}{}_negative.txt".format("data/dictionaries/computational_corrected/", language.lower()), "r", encoding="utf-8") as fr:
with open("{}{}_negative.txt".format("data/processed_data/dictionaries/computational_corrected/", language.lower()), "r", encoding="utf-8") as fr:
sentiment_dict["neg"] = fr.read().splitlines()
with open("{}{}_positive.txt".format("data/dictionaries/computational_corrected/", language.lower()), "r", encoding="utf-8") as fr:
with open("{}{}_positive.txt".format("data/processed_data/dictionaries/computational_corrected/", language.lower()), "r", encoding="utf-8") as fr:
sentiment_dict["pos"] = fr.read().splitlines()
print("loaded {} negative words".format(len(sentiment_dict["neg"])))
print("loaded {} positive words".format(len(sentiment_dict["pos"])))
```
......@@ -654,29 +654,29 @@
text=Si je prends la liberté de vous dédier cet Ouvrage; ce n’est en aucune maniere pour me ménager une favorable occasion d’instruire les hommes de votre mérite, & de vous donner, même avec sobrieté, les éloges dont vous êtes digne...
```
%% Cell type:markdown id: tags:
For simplicity, we already transformed French, Italian and Spanish texts into respective pandas DataFrames. If you want to see the code for this process please refer to the corresponding Jupyter Notebook in the code folder. Further, we provide the individual .txt files for each language in the `data/example_texts/` directory of this repository if you wish to reconstruct the DataFrame creation yourself.
For simplicity, we already transformed the French, Italian and Spanish texts into respective pandas DataFrames. If you want to see the code for this process, please refer to the corresponding Jupyter Notebook in the code folder. Further, we provide the individual .txt files for each language in the `data/raw_data/example_texts/` directory of this repository if you wish to reconstruct the DataFrame creation yourself.
Now we can load the pandas DataFrame containing Spanish texts as well as four additional attributes (i.e., periodical title, issue number, author, year) as follows:
%% Cell type:code id: tags:
``` python
texts_df = pandas.read_pickle("data/example_texts/spanish.p")
texts_df = pandas.read_pickle("data/processed_data/example_texts/spanish.p")
print("loaded dataframe with {} texts and {} attributes".format(texts_df.shape[0], texts_df.shape[1] - 1))
```
%% Cell type:markdown id: tags:
### Computing Sentiment
By now, we have loaded our sentiment dictionaries as well as our texts. To compute sentiment, we count how often the words contained in the dictionaries occur in the texts.
For that, we use the following formula, defining the sentiment *s* of a text with:
$$s = \frac{W_p - W_n}{W_p + W_n} $$
![Sentiment Formula](miscellaneous/sentiment_formula_2.png)
where $W_p$ is the number of positive words in a text and $W_n$ is the number of negative words in a text.
Thus, the computed sentiment score ranges between −1 and +1: values close to −1 are considered negative, values close to +1 positive, and values close to zero indicate a neutral sentiment.
The implementation of this formula in Python is:
......
......@@ -5,5 +5,5 @@ tqdm == 4.50.2
nltk == 3.5
ipywidgets == 7.5.1
gensim == 3.8.3
sklearn == 0.23.2
scikit-learn == 0.23.2
stop-words == 2018.7.23
\ No newline at end of file