Commit 93ff306b authored by Niels-Oliver Walkowski

upd: Make changes requested by editor review results

parent 5294045d
Due to file size limitations, we had to move the pretrained models to another repository.
Please download the pickled model here: http://gams.uni-graz.at/o:dispecs.word2vec.it/ITALIAN
Due to file size limitations, we had to move the pretrained models to another repository.
Please download the pickled model here: http://gams.uni-graz.at/o:dispecs.word2vec.es/SPANISH
Due to file size limitations on GitHub, we had to move the pretrained models to Google Drive.
Please download the pickled model here: https://drive.google.com/file/d/15ChW1dipL2ULU-yaa5oXN-ZUJ00HW5VV/view?usp=sharing
Due to file size limitations on GitHub, we had to move the pretrained models to Google Drive.
Please download the pickled model here: https://drive.google.com/file/d/14O2s3x9ZEG3Zx-SVA6LcDScpZuSAosiJ/view?usp=sharing
Due to file size limitations on GitHub, we had to move the pretrained models to Google Drive.
Please download the pickled model here: https://drive.google.com/file/d/1SME9EMCF8dnSJU458kaoV1dZUmgNHMel/view?usp=sharing
File mode changed from 100755 to 100644
%% Cell type:markdown id: tags:
# A Sentiment Analysis Tool Chain for 18<sup>th</sup> Century Periodicals
*Philipp Koncar, Christina Glatz, Elisabeth Hobisch, Yvonne Völkl, Bernhard C. Geiger, Sanja Sarić, Martina Scholger, Denis Helic*
%% Cell type:markdown id: tags:
## Introduction
Sentiment analysis is a common task in natural language processing (NLP) and aims for the automatic and computational identification of emotions, attitudes and opinions expressed in textual data (Pang et al. 2008). While sentiment analysis is especially tailored for and widely used in the context of social media analysis and other web data, such as product reviews (Pak and Paroubek 2010, Ortigosa et al. 2014, Hamilton et al. 2016), its application to literary texts remains challenging due to the lack of methods - with few but notable exceptions (Schmidt et al. 2018; Sprugnoli et al. 2020) - dedicated to languages other than English and to texts from earlier times.
In the currently ongoing project [^1] Distant Reading for Periodicals of the Enlightenment (DiSpecs - [link to project website](http://gams.uni-graz.at/dispecs)), funded by the Austrian Academy of Sciences, we analyze Spectator periodicals from the Digital Edition project The Spectators in the International Context (Ertler et al. 2011) in terms of their thematic, stylistic, and emotional orientation using different computational methods, including sentiment analysis. During the project, it became obvious that existing methods are only partly suitable for the 18<sup>th</sup> century texts that make up the spectatorial press. In particular, we encountered problems with shifts in word meanings and spellings between modern-day and 18<sup>th</sup> century languages. Therefore, we concluded that the development of appropriate methods for 18<sup>th</sup> century texts is necessary to further improve the quality of our analyses. Additional funding by CLARIAH-AT made this endeavour possible.
With the contribution presented here, we not only introduce new sentiment dictionaries for French, Italian and Spanish texts of the 18<sup>th</sup> century, but also provide a freely and publicly available tool chain based on Jupyter Notebooks, enabling researchers to apply our dictionary creation process and sentiment analysis methods to their own material and projects. Our Notebooks furthermore contain tutorial-style introductions to concepts such as word embeddings, k-nearest neighbor classification, and dictionary-based sentiment analysis.
The proposed tool chain comprises two parts: (i) the optional creation of sentiment dictionaries and (ii) the actual sentiment analysis.
More precisely, the first part requires manual annotations of seed words from which we transfer sentiment to other words occurring in similar contexts. To transfer sentiment, we train word embeddings and use them in a machine learning classification task. In doing so, we computationally extend the list of annotated words and avoid a more time-consuming and tedious manual annotation process. Note that this procedure is also adaptable to other languages (including contemporary ones).
In the second part, we provide a collection of ready-to-use sentiment dictionaries (which we created with the first part of the tool chain) as well as methods to perform the actual sentiment analysis. The implemented methods range from listing basic descriptive statistics to various kinds of plots that allow for an easy interpretation of the sentiment expressed in a given text corpus. Further, our methods analyze sentiment on a macroscopic and a microscopic level, as they can not only be applied to a whole corpus, providing a bigger picture of the data, but also on a document level, for example, by highlighting words that convey sentiment in any given text.
This repository contains all the data we used for creating sentiment dictionaries, including manually annotated seed words, pretrained word embedding models and other data resulting from intermediate steps. These can also be used in other contexts and NLP tasks and are not necessarily limited to sentiment analysis. As such, our tool chain serves as a foundation for further methods and approaches applicable to all projects focusing on the computational interpretation of texts.
Before we continue, please import the necessary Python packages for the interactive examples:
%% Cell type:code id: tags:
``` python
import pandas
import matplotlib.pyplot as plt
import nltk
import seaborn
import re
from nltk.tokenize import WordPunctTokenizer
from mpl_toolkits.mplot3d import Axes3D
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import KNeighborsClassifier
from gensim.models.word2vec import Word2Vec
from IPython.display import display, HTML, Markdown
```
%% Cell type:markdown id: tags:
Additionally, please execute the following cell to download data (i.e., the Punkt tokenizer) provided by NLTK which is necessary for the examples presented in this Notebook:
%% Cell type:code id: tags:
``` python
nltk.download('punkt')
```
%% Output
[nltk_data] Downloading package punkt to
[nltk_data] /Users/philippkoncar/nltk_data...
[nltk_data] Package punkt is already up-to-date!
True
%% Cell type:markdown id: tags:
[^1]: A cooperation between the Institute for Interactive Systems and Data Science at Graz University of Technology, the Know-Center GmbH as well as the Centre for Information Modelling - Austrian Centre for Digital Humanities (ZIM-ACDH) and the Institute of Romance Studies, both at the University of Graz.
%% Cell type:markdown id: tags:
## Sentiment Dictionaries for the 18<sup>th</sup> Century
In this section, we describe how we created our sentiment dictionaries and provide examples for the underlying methods. Please proceed with the *Sentiment Analysis for the 18<sup>th</sup> Century* section if you want to use our ready-to-use dictionaries for 18<sup>th</sup> century French, Italian and Spanish.
The Jupyter Notebooks, which can be easily used by everyone to create sentiment dictionaries for their own projects, can be found in the `code/dictionary_creation/` directory of this repository.
The process for creating sentiment dictionaries comprises three major steps, each of them described in more detail in the sections below. Overall, we first need to generate a set of seed words which serves as a basis to automatically transfer sentiment to other words in the text corpus. For this expansion, we train word embeddings to capture the context of individual words and use them in a classification task to transfer the sentiment of seed words to other words appearing in similar contexts. Note that this process is suitable for a plethora of languages and is not limited to 18<sup>th</sup> century languages. The following figure illustrates the whole dictionary creation pipeline:
![Overview of the Dictionary Creation Pipeline](miscellaneous/dictionary_creation_overview.png)
%% Cell type:markdown id: tags:
### Sentiment Dictionaries 101
Before we discuss the individual steps to create sentiment dictionaries, we want to give a short introduction to what sentiment dictionaries are and how they can be used for sentiment analysis.
Assume we want to assess sentiment on the sentence level and are given the following three sentences:
**Sentence 1:** Today was a good day.<br>
**Sentence 2:** I hate getting up early.<br>
**Sentence 3:** This is both funny and sad at the same time.
Further, we have a sentiment dictionary, which comprises a list of words, each associated with a sentiment (e.g., positive or negative):
| Word | Sentiment |
| --- | --- |
| good | positive |
| bad | negative |
| hate | negative |
| funny | positive |
| sad | negative |
| happy | positive |
We can assess the sentiment of each sentence by checking whether words from the dictionary occur in it.
For that, let us define the sentiment *s* of a sentence as follows:
$$s = W_p - W_n$$
where $W_p$ is the number of positive words in a sentence and $W_n$ is the number of negative words in a sentence.
Thus, the formula subtracts the number of words with a negative sentiment from the number of words with a positive sentiment.
If the resulting value is negative, we assume the sentiment of the sentence to be negative, whereas if the value is positive, we assume the sentiment to be positive.
The value is zero if the two numbers are equal or if no words of the dictionary occur in the sentence. In this case, we assume that the sentence has a neutral sentiment.
For our three examples, this means that Sentence 1 has a positive sentiment, Sentence 2 a negative sentiment and Sentence 3 a neutral sentiment.
To verify whether our assumption is true, we implement the example in Python. For that, we first create a list that contains the three example sentences from above:
%% Cell type:code id: tags:
``` python
example_sentences = [
    "Today was a good day.",
    "I hate getting up early.",
    "This is both funny and sad at the same time."
]
```
%% Cell type:markdown id: tags:
Further, we need to create a Python dictionary that maps each word of our sentiment dictionary to its sentiment:
%% Cell type:code id: tags:
``` python
sentiment_dictionary = {
    "good": "positive",
    "bad": "negative",
    "hate": "negative",
    "funny": "positive",
    "sad": "negative",
    "happy": "positive"
}
```
%% Cell type:markdown id: tags:
We then define a function that takes one sentence as input and returns its computed sentiment following the formula and rules stated above:
%% Cell type:code id: tags:
``` python
def compute_sentiment(sentence):
    # tokenize the sentence into individual words
    wpt = WordPunctTokenizer()
    words_list = wpt.tokenize(sentence)
    # count how many dictionary words of each sentiment class occur in the sentence
    number_of_negative_words = 0
    number_of_positive_words = 0
    for word, sentiment in sentiment_dictionary.items():
        if sentiment == "negative":
            number_of_negative_words += words_list.count(word)
        elif sentiment == "positive":
            number_of_positive_words += words_list.count(word)
    # apply the formula s = W_p - W_n and map the score to a sentiment class
    sentiment_score = number_of_positive_words - number_of_negative_words
    if sentiment_score < 0:
        computed_sentiment = "negative"
    elif sentiment_score > 0:
        computed_sentiment = "positive"
    else:
        computed_sentiment = "neutral"
    return computed_sentiment
```
%% Cell type:markdown id: tags:
We then iterate through the list of sentences and print the sentiment for each of them:
%% Cell type:code id: tags:
``` python
for sentence in example_sentences:
    print("The sentiment for '{}' is {}.".format(sentence, compute_sentiment(sentence)))
```
%% Output
The sentiment for 'Today was a good day.' is positive.
The sentiment for 'I hate getting up early.' is negative.
The sentiment for 'This is both funny and sad at the same time.' is neutral.
%% Cell type:markdown id: tags:
As we can observe, the computed sentiments confirm that our previous assumptions were all correct.
The example above highlights the advantages of dictionary-based sentiment analysis approaches: first, it is straightforward to conduct (as demonstrated above) and second, the produced results are transparent and easily interpretable.
%% Cell type:markdown id: tags:
### Step 1: Selecting Seed Words
Our sentiment dictionary creation depends on seed words for which the sentiment is known. Based on these seed words, we can automatically transfer sentiment to other words, allowing us to circumvent a more tedious and time-consuming annotation process. This step comprises three parts: First, we extract seed words from the text corpus. Second, we manually annotate them. Third, we select annotated seed words based on the agreement of multiple annotators.
#### Extracting Frequent Words
At first, we extract the 3,000 most frequent words from the entire text corpus (without removing stop words or conducting lemmatization). The number of extracted seed words depends on the corpus size; we opted for a compromise between the number of words and the effort required to annotate them.
In this case, the 3,000 most frequent words account for 3.9% of unique French words (76,035 in total), 2.5% of unique Italian words (121,504 in total) and 3.6% of unique Spanish words (84,185 in total).
Focusing on the most frequent words allows us to achieve a good coverage, as seed words that do not occur frequently in our texts are of no use in the subsequent classification task.
The Jupyter Notebook for the extraction of seed words (see `code/dictionary_creation/1_seed_words/1_extraction.ipynb`) allows for additional settings, such as a maximum document frequency of words to extract.
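%% Cell type:markdown id: tags:
To illustrate the idea, the following minimal sketch extracts the most frequent words from a small placeholder corpus with scikit-learn's `CountVectorizer`; the corpus, the `max_features` value and the `max_df` threshold are only examples and do not reproduce the settings of the extraction Notebook:
%% Cell type:code id: tags:
``` python
# Minimal sketch (not the project Notebook): extract the most frequent words from a corpus.
# The corpus and the parameter values are placeholders.
example_corpus = [
    "Today was a good day.",
    "I hate getting up early.",
    "This is both funny and sad at the same time."
]

frequency_vectorizer = CountVectorizer(max_features=3000, max_df=0.95)  # keep at most 3,000 words,
                                                                        # drop words in >95% of documents
frequency_counts = frequency_vectorizer.fit_transform(example_corpus)
total_counts = frequency_counts.toarray().sum(axis=0)  # total occurrences of each kept word
frequent_words = sorted(zip(frequency_vectorizer.get_feature_names(), total_counts),
                        key=lambda item: item[1], reverse=True)
print(frequent_words[:10])
```
%% Cell type:markdown id: tags: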
#### Annotating Frequent Words
Once we have extracted the frequent words, we let three experts, who are familiar with the content and context of the French, Italian and Spanish Spectator periodicals, annotate them.
In particular, we instructed them to assign each of the extracted words to one of three sentiment classes, (i) positive, (ii) negative or (iii) neutral, and to take into account the sociocultural circumstances of the 18<sup>th</sup> century.
As a result, the annotators captured the sentiment of words with regard to their intended meaning.
The provided Jupyter Notebook (see `code/dictionary_creation/1_seed_words/2_annotation.ipynb`) for the annotation of words includes a few simple lines of code to generate *.csv* files which can then be opened and annotated in an arbitrary spreadsheet program (e.g., Microsoft Excel or LibreOffice Calc).
Note that the annotated sentiment needs to be entered in the *sentiment* column of the generated file and, for the remaining Jupyter Notebooks to work without changes, the annotators must stick to the following three classes: *positive*, *negative* and *neutral*. For example, the annotations for the words *good*, *bad* and *house* could be:
word | sentiment
--- | ---
good | positive
bad | negative
house | neutral
However, other expressions of sentiment, such as additional classes or numerical values, may be used if the subsequent Jupyter Notebooks are adjusted accordingly.
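%% Cell type:markdown id: tags:
As a rough illustration of such an annotation template (not the code of the annotation Notebook), the following sketch writes a word list to a *.csv* file with an empty *sentiment* column; the word list and file name are placeholders:
%% Cell type:code id: tags:
``` python
# Sketch only: create an annotation template with an empty "sentiment" column.
# Annotators fill in "positive", "negative" or "neutral" in a spreadsheet program.
words_to_annotate = ["good", "bad", "house"]  # placeholder word list

annotation_template = pandas.DataFrame({"word": words_to_annotate, "sentiment": ""})
annotation_template.to_csv("annotation_template.csv", index=False)
```
%% Cell type:markdown id: tags: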
#### Selecting Seed Words
As past research has indicated that sentiment is very subjective and involves disagreement between multiple individuals (Mozetič et al. 2016), we need to implement a selection procedure to settle on the final sentiment of the frequent words.
For that, we use a simple majority vote: we only keep words for which at least two annotators agree on the sentiment and discard the remaining words.
Regarding the number of annotators, we suggest employing as many of them as possible, as it allows for better generalization.
Naturally, there is a trade-off between the number of annotators you can find/afford (i.e., it is very time-consuming to annotate words, especially when you have to consider the sociocultural context) and the quality of the resulting annotations.
In our case, we went with three annotators (which we consider the lower limit), who have all worked with the data for several years and have extensive knowledge about this period.
The Jupyter Notebook (see `code/dictionary_creation/1_seed_words/3_selection.ipynb`) for the selection of seed words provides a ready-to-use implementation based on a majority vote. In the following table, we provide the number of positive, negative and neutral selected seed words:
Language | # positive | # negative | # neutral
--- | --- | --- | ---
French | 803 | 381 | 1,738
Italian | 1,811 | 244 | 838
Spanish | 385 | 251 | 2,340
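%% Cell type:markdown id: tags:
The core of such a majority vote can be sketched as follows; the annotation files and column names are placeholders and the actual selection Notebook may be organized differently:
%% Cell type:code id: tags:
``` python
# Sketch of a majority vote over three annotators.
# Assumes placeholder .csv files that all list the same words in the same order.
from collections import Counter

annotations = [
    pandas.read_csv("annotations_annotator_1.csv"),
    pandas.read_csv("annotations_annotator_2.csv"),
    pandas.read_csv("annotations_annotator_3.csv"),
]

seed_words = {}
for i, word in enumerate(annotations[0]["word"]):
    votes = Counter(annotation["sentiment"][i] for annotation in annotations)
    sentiment, count = votes.most_common(1)[0]
    if count >= 2:  # keep the word only if at least two annotators agree
        seed_words[word] = sentiment
```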
%% Cell type:markdown id: tags:
### Step 2: Creating Word Embeddings
Computers have a hard time working with text or words directly.
To counteract this problem, we need to transform text or words into numerical form.
One way to achieve this is via so-called word embeddings, which represent words as vectors.
Existing research discerns two types of word embeddings: (i) frequency-based word embeddings, such as count vectors or TF-IDF vectors, and (ii) prediction-based word embeddings, such as word2vec (Mikolov et al. 2013).
Before describing the method we use, we want to demonstrate the principles of word embeddings with the following example.
For that, consider the following three sentences contained in a list:
%% Cell type:code id: tags:
``` python
example_sentences = [
    "Word embeddings are fun.",
    "It is fun to learn new things.",
    "Teaching word embeddings is also fun."
]
```
%% Cell type:markdown id: tags:
We can use the `CountVectorizer` from the `scikit-learn` package to create a count matrix and store it in a pandas `DataFrame` for easier interpretation:
%% Cell type:code id: tags:
``` python
vectorizer = CountVectorizer()
count_matrix = vectorizer.fit_transform(example_sentences)
count_vectors_df = pandas.DataFrame(count_matrix.todense(), columns=vectorizer.get_feature_names())
display(count_vectors_df)
```
%% Output
%% Cell type:markdown id: tags:
As we can now observe, this produces a table in which each row represents one of the three sentences and each column represents a distinct word occurring in any of the three sentences.
Each value in the cells of the table equals 1 if the respective word occurs in the respective sentence and 0 otherwise.
We can then find the vector representation of a word by considering the values in its column.
For example, the vector of the word *embeddings* is `[1, 0, 1]` and that of the word *fun* is `[1, 1, 1]`.
Thus, a word is represented by the set of sentences/documents in which it occurs.
Word embeddings also capture the context of a word in a document, including its semantic similarity and relations to other words.
As such, vectors of words that are frequently used together in texts are also very close to each other in the resulting vector space.
Conversely, vectors of words that are never or only rarely used in a similar context should be very distant from each other.
This is a perfect fit for our dictionary creation process: based on the vector representations of words, we can automatically transfer sentiment from our seed words to other words.
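%% Cell type:markdown id: tags:
To make the notion of closeness concrete, the following sketch compares two of the count vectors from the table above using cosine similarity, the standard similarity measure for word vectors; the chosen words are just examples:
%% Cell type:code id: tags:
``` python
# Sketch: cosine similarity between word vectors from the count matrix above.
# Values close to 1 indicate that two words occur in very similar sets of sentences.
from sklearn.metrics.pairwise import cosine_similarity

embeddings_vector = count_vectors_df["embeddings"].values.reshape(1, -1)  # [1, 0, 1]
fun_vector = count_vectors_df["fun"].values.reshape(1, -1)                # [1, 1, 1]

print(cosine_similarity(embeddings_vector, fun_vector))  # roughly 0.82 for this toy example
```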
%% Cell type:markdown id: tags:
#### Word2Vec
For our dictionary creation pipeline we rely on word2vec (Mikolov et al. 2013), a state-of-the-art method to compute word embeddings.
Word2vec relies on a two-layer neural network to train word vectors that capture the linguistic contexts of words and comes with two different model architectures: continuous bag-of-words (CBOW) and skip-gram.
The former predicts a word from a given context whereas the latter predicts the context from a given word.
Both have their advantages and disadvantages depending on the size of the underlying text corpus, which is why we consider both of them in our tool chain.
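%% Cell type:markdown id: tags:
In Gensim's implementation, the architecture is selected via the `sg` parameter; the following minimal sketch, using placeholder tokenized sentences, only illustrates this switch and is not part of our pipeline:
%% Cell type:code id: tags:
``` python
# Sketch: choosing between the two word2vec architectures in Gensim (placeholder sentences).
toy_sentences = [
    ["word", "embeddings", "are", "fun"],
    ["it", "is", "fun", "to", "learn", "new", "things"],
]

cbow_model = Word2Vec(toy_sentences, sg=0, min_count=1)      # sg=0: continuous bag-of-words (default)
skipgram_model = Word2Vec(toy_sentences, sg=1, min_count=1)  # sg=1: skip-gram
```
%% Cell type:markdown id: tags: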
Before talking about the specifics of our approach, we want to demonstrate the benefits of word2vec. For the following example, we used Gensim's word2vec implementation to train word embeddings on the publicly available *text8* dataset. We can load the pretrained word2vec model with the following line of code:
%% Cell type:code id: tags:
``` python
word2vec_model = Word2Vec.load("data/processed_data/word2vec_models/example.p")
```
%% Cell type:markdown id: tags:
As we used the default parameters of Gensim's word2vec implementation, the trained vectors have a dimension of 100. Thus, instead of the three numbers per vector from the count vector example above, we now have 100 numbers (i.e., a vector of length 100).
Moreover, these numbers are now real-valued instead of integers and their interpretation is less self-evident.
We can inspect the resulting vector of a word, for example *car*, with:
%% Cell type:code id: tags:
``` python
print(word2vec_model.wv["car"])
```
%% Output
[-0.12308194 -0.39095813 1.000928 -0.04769582 0.9710199 -1.3676833
 -1.4404075 -0.7423445 1.6969757 -1.1585793 0.64239687 0.6409116
 1.4341261 0.4111035 -2.4854403 -0.7219625 1.540313 0.42790467
 -0.44106185 -1.114401 1.3602705 0.8318557 2.0098808 0.7983357
 -1.5087671 -0.8375068 -0.61658716 1.2119923 -0.6702468 -0.3222445
 0.9415244 2.6550481 -0.99794585 0.04601024 -1.6316731 -1.439617
 -1.431928 -2.4456704 -0.71318716 0.86487716 2.2365756 -0.89588803
 -0.7087643 0.9257705 1.712533 0.7194995 -2.4665103 -1.3497332
 2.2958574 -0.54635614 0.9186488 0.51365685 0.09351666 -1.0833061
 -0.00810259 2.8242166 0.1252522 -1.207868 0.10782211 -0.34977445
 -0.30000007 0.33047387 -0.12232512 -0.52950805 -2.4587536 -0.37481222
 -2.2148058 0.14348628 -0.79030484 -2.5900028 2.7875724 -0.7795173
 0.6641297 2.6237233 1.7713573 1.7022327 -0.04617653 1.2087046
 -1.4730823 0.74134797 -0.26776415 0.22373354 0.71002257 1.5748668
 -1.778043 0.48367617 1.0869575 -0.6362949 0.63211554 0.5351157
 -1.8014896 0.39312994 -2.118675 0.83928734 -0.3225636 -2.1843622
 -0.557146 -1.6596688 1.2372373 3.977601 ]
%% Cell type:markdown id: tags:
We can find words similar to *car* by considering the cosine similarity between the vector of *car* and all other vectors. The higher the similarity, the more similar the words. Gensim provides an easy way to find the most similar words based on cosine similarity:
%% Cell type:code id: tags:
``` python
word2vec_model.wv.most_similar("car", topn=20)
```
%% Output
[('driver', 0.7839363217353821),
 ('taxi', 0.7524467706680298),
 ('cars', 0.725511908531189),
 ('motorcycle', 0.7036831378936768),
 ('vehicle', 0.698715090751648),
 ('truck', 0.6913774609565735),
 ('passenger', 0.661078155040741),
 ('automobile', 0.6501474380493164),
 ('audi', 0.6245964169502258),
 ('glider', 0.6229903101921082),
 ('tire', 0.6213281154632568),
 ('cab', 0.6198135018348694),
 ('engine', 0.6183426380157471),
 ('volkswagen', 0.6164752840995789),
 ('engined', 0.6096624732017517),
 ('airplane', 0.6076435446739197),
 ('bmw', 0.6070380210876465),
 ('elevator', 0.6061339974403381),
 ('racing', 0.6031301617622375),
 ('stock', 0.6030023097991943)]
%% Cell type:markdown id: tags:
Another advantage of word2vec over count vectors is that it can capture various concepts and analogies.
Perhaps the most famous example in that regard is the following: if you subtract the vector of the word *man* from the vector of the word *king* and add the vector of the word *woman*, the result should be a vector very close to that of the word *queen*. We can easily evaluate whether this is the case with our pretrained model through Gensim:
%% Cell type:code id: tags:
``` python
word2vec_model.wv.most_similar(positive=["woman", "king"], negative=["man"], topn=1)
```
%% Output
[('queen', 0.6670934557914734)]
%% Cell type:markdown id: tags:
We observe that, indeed, the trained word embedding correctly reflects this analogy.
Can you think of any other concepts or analogies to test? Feel free to try them with other word vectors.
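%% Cell type:markdown id: tags:
For instance, one further analogy you could try is *paris - france + italy*, which should ideally land close to *rome* (note that the *text8* corpus is lowercased, so all words must be entered in lowercase; the exact result depends on the trained model):
%% Cell type:code id: tags:
``` python
# One more analogy to try; the outcome may vary depending on the trained model.
word2vec_model.wv.most_similar(positive=["paris", "italy"], negative=["france"], topn=1)
```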
%% Cell type:markdown id: tags:
#### Training the Models
Before we start training the actual word2vec models for the Spectator periodicals in the respective languages, we need to preprocess our texts.
The amount of preprocessing required for word2vec is minimal, as we only need to remove stop words and extract individual sentences.
The latter is important because individual sentences are the required input format for the word2vec implementation of Gensim.
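%% Cell type:markdown id: tags:
A minimal preprocessing sketch could look as follows; the example text and the tiny stop word list are placeholders and do not reflect the actual corpora:
%% Cell type:code id: tags:
``` python
# Sketch only: split a raw text into sentences, tokenize them and remove stop words.
# The text and the stop word list are placeholders.
from nltk.tokenize import sent_tokenize

raw_text = "Today was a good day. I hate getting up early."  # placeholder document
stop_words = {"was", "a", "i", "up"}                          # placeholder stop word list

wpt = WordPunctTokenizer()
preprocessed_sentences = []
for sentence in sent_tokenize(raw_text):
    tokens = [token.lower() for token in wpt.tokenize(sentence) if token.isalpha()]
    preprocessed_sentences.append([token for token in tokens if token not in stop_words])

print(preprocessed_sentences)  # a list of token lists, the input format Gensim's Word2Vec expects
```
%% Cell type:markdown id: tags: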
Since word2vec has many hyperparameters (which all have an impact on the resulting word embeddings), we need to tune them to achieve the best possible performance.
One way to optimize hyperparameters is to conduct a *grid search*.
For this purpose, we define a set of possible hyperparameter values and train one individual model for each possible combination.
We then evaluate each model and select the one that yields the best performance.
The Jupyter Notebook (see `code/dictionary_creation/2_word_embeddings/1_grid_search.ipynb`) contains a selection of possible hyperparameters, but further adjustments may be necessary.
For that, we refer to the [documentation](https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec) (Řehůřek 2009-2021) of the word2vec implementation of Gensim.
All models trained this way are stored for later use and, depending on your text corpus and the number of hyperparameter combinations, the stored models can take up hundreds of gigabytes of disk space.
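%% Cell type:markdown id: tags:
The structure of such a grid search can be sketched as follows; the parameter grid, the training sentences and the file names are placeholders (the actual grid is defined in the Notebook) and the parameter names follow Gensim 4:
%% Cell type:code id: tags:
``` python
# Sketch of a grid search: train and store one word2vec model per hyperparameter combination.
# The grid, the sentences and the file names are placeholders; parameter names follow Gensim 4.
from itertools import product

training_sentences = [["word", "embeddings", "are", "fun"]]  # placeholder tokenized sentences

parameter_grid = {
    "vector_size": [100, 300],
    "window": [5, 10],
    "sg": [0, 1],  # 0 = CBOW, 1 = skip-gram
}

for vector_size, window, sg in product(*parameter_grid.values()):
    model = Word2Vec(training_sentences, vector_size=vector_size, window=window, sg=sg, min_count=1)
    model.save("word2vec_size{}_window{}_sg{}.model".format(vector_size, window, sg))
```
%% Cell type:markdown id: tags: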
#### Evaluating the Models
Once we have trained one model for each hyperparameter combination, we need to evaluate which of these combinations achieves the best performance.
For this evaluation, we rely on lists of manually annotated word pairs, in which every word pair is assigned a relation score.
In our case, this relation score ranges from 0 to 10, where 0 represents no similarity and 10 represents absolute similarity.
For example, when we consider the word pairs *old & new*, *easy & hard*, *beautiful & wonderful* as well as *rare & scarce*, the respective scores could be:
word pair | relation score
--- | ---
old & new | 0
easy & hard | 1.23
beautiful & wonderful | 7.15
rare & scarce | 9.89
Typically, such word pair lists are manually annotated, which is again a very time-consuming and tedious process and requires multiple annotators. In our case, we adapted previously existing lists for French, Italian and Spanish (Freitas et al. 2016; [GitHub Repository](https://github.com/siabar/Multilingual_Wordpairs)).
More precisely, we filter out all words that do not occur in our historic Spectator periodicals, extend the lists with 18<sup>th</sup> century spelling variations and check whether the relation scores are meaningful and applicable to the languages of the 18<sup>th</sup> century.
Using the adapted lists, we can compute Pearson correlation coefficients between the scores of the word pairs and the similarities of the respective word vectors from the trained models in our dedicated Jupyter Notebook (see `code/dictionary_creation/2_word_embeddings/2_evaluation.ipynb`).
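%% Cell type:markdown id: tags:
The idea behind this evaluation can be sketched as follows; the word pairs and relation scores are taken from the illustrative table above and the sketch assumes that all words are contained in the model vocabulary:
%% Cell type:code id: tags:
``` python
# Sketch: correlate human relation scores of word pairs with the cosine similarities of a model.
# Word pairs and scores are illustrative; all words are assumed to be in the model vocabulary.
from scipy.stats import pearsonr

word_pairs = [("old", "new"), ("easy", "hard"), ("beautiful", "wonderful"), ("rare", "scarce")]
human_scores = [0.0, 1.23, 7.15, 9.89]

model_similarities = [word2vec_model.wv.similarity(word_1, word_2) for word_1, word_2 in word_pairs]
correlation, p_value = pearsonr(human_scores, model_similarities)
print("Pearson correlation: {:.3f}".format(correlation))
```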