Commit 9319c47a authored by Niels-Oliver Walkowski's avatar Niels-Oliver Walkowski

upd: Solve wrong gensim version for word2vec model

parent e7961121
%% Cell type:markdown id: tags:
# A Sentiment Analysis Tool Chain for 18<sup>th</sup> Century Periodicals
*Philipp Koncar, Christina Glatz, Elisabeth Hobisch, Yvonne Völkl, Bernhard C. Geiger, Sanja Sarić, Martina Scholger, Denis Helic*
%% Cell type:markdown id: tags:
## Introduction
Sentiment analysis is a common task in natural language processing (NLP) and aims at the automatic, computational identification of emotions, attitudes and opinions expressed in textual data (Pang et al. 2008). While sentiment analysis is especially tailored for and widely used in the context of social media analysis and other web data, such as product reviews (Pak and Paroubek 2010, Ortigosa et al. 2014, Hamilton et al. 2016), its application to literary texts remains challenging due to the lack of methods - with few but notable exceptions (Schmidt et al. 2018; Sprugnoli et al. 2020) - dedicated to languages other than English and to texts from earlier periods.
In the currently ongoing project[^1] Distant Reading for Periodicals of the Enlightenment (DiSpecs - [link to project website]()), funded by the Austrian Academy of Sciences, we analyze Spectator periodicals from the Digital Edition project The Spectators in the International Context (Ertler et al. 2011) in terms of their thematic, stylistic, and emotional orientation using different computational methods, including sentiment analysis. During the project, it became obvious that existing methods are only partly suitable for the 18<sup>th</sup> century texts which comprise the spectatorial press. In particular, we encountered problems with shifts in word meanings and spellings between modern-day and 18<sup>th</sup> century languages. Therefore, we concluded that the development of appropriate methods for 18<sup>th</sup> century texts is necessary to further improve the quality of our analyses. Additional funding by CLARIAH-AT made this endeavour possible.
With the contribution presented here, we not only introduce new sentiment dictionaries for French, Italian and Spanish texts of the 18<sup>th</sup> century, but also build a freely and publicly available tool chain based on Jupyter Notebooks, enabling researchers to apply our dictionary creation process and sentiment analysis methods to their own material and projects. Our Notebooks furthermore contain tutorial-style introductions to concepts such as word embeddings, k-nearest neighbor classification, and dictionary-based sentiment analysis.
The proposed tool chain comprises two different parts: (i) the optional creation of sentiment dictionaries and (ii) the actual sentiment analysis.
More precisely, the first part requires manual annotations of seed words from which we transfer sentiment to other words occurring in a similar context as seed words. To transfer sentiment, we train word embeddings and use a machine learning classification task. In doing so, we computationally extend the list of annotated words and avoid a more time-consuming and tedious manual annotation process. Note that this procedure is also adaptable to other languages (also contemporary languages).
In the second part, we provide a collection of ready-to-use sentiment dictionaries (which we created with the first part of the tool chain) as well as methods to perform the actual sentiment analysis. The implemented methods range from listing basic descriptive statistics to various kinds of plots that allow for an easy interpretation of sentiment expressed in a given text corpus. Further, our methods analyze sentiment on a macro- and microscopic level, as they can not only be applied to a whole corpus providing a bigger picture of data, but also on a document level, for example, by highlighting words that convey sentiment in any given text.
This repository contains all the data we used for creating sentiment dictionaries, including manually annotated seed words, pretrained word embedding models and other data resulting from intermediate steps. These can also be used in other contexts and NLP tasks and are not necessarily limited to sentiment analysis. As such, our tool chain serves as a foundation for further methods and approaches applicable to all projects focusing on the computational interpretation of texts.
Before we continue, please import the necessary Python Packages for the interactive examples:
%% Cell type:code id: tags:
``` python
import pandas
import matplotlib.pyplot as plt
import nltk
import seaborn
import re
from nltk.tokenize import WordPunctTokenizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import KNeighborsClassifier
from gensim.models.word2vec import Word2Vec
from IPython.display import display, HTML, Markdown
%% Cell type:markdown id: tags:
Additionally, please execute the following cell to download data (i.e., the Punkt tokenizer) provided by NLTK which is necessary for the examples presented in this Notebook:
%% Cell type:code id: tags:
``` python'punkt')
%% Output
[nltk_data] Downloading package punkt to
[nltk_data] /Users/philippkoncar/nltk_data...
[nltk_data] Package punkt is already up-to-date!
%% Cell type:markdown id: tags:
[^1]: A cooperation between the Institute for Interactive Systems and Data Science at Graz University of Technology, the Know-Center GmbH as well as the Centre for Information Modelling - Austrian Centre for Digital Humanities (ZIM-ACDH) and the Institute of Romance Studies, both at the University of Graz.
%% Cell type:markdown id: tags:
## Sentiment Dictionaries for the 18<sup>th</sup> Century
In this section, we describe the process of how we created sentiment dictionaries and provide examples for the underlying methods. Please proceed with the *Sentiment Analysis for the 18<sup>th</sup> Century* section if you want to use our ready-to-use dictionaries for French, Italian and Spanish of the 18<sup>th</sup> century.
The Jupyter Notebooks, which can be easily used by everyone to create sentiment dictionaries for their own projects, can be found in the `code/dictionary_creation/` directory of this repository.
The process for creating sentiment dictionaries comprises three major steps, each of them described in more detail in the sections below. Overall, we first need to generate a set of seed words which serves as a basis to automatically transfer sentiment to other words in the text corpus. For this expansion, we train word embeddings to capture the context of individual words and use them in a classification task to transfer sentiment of seed words to other words in similar contexts. Note that this process is suitable for a plethora of languages and is not limited to the 18<sup>th</sup> century languages. The following figure illustrates the whole dictionary creation pipeline:
![Overview of the Dictionary Creation Pipeline](miscellaneous/dictionary_creation_overview.png)
%% Cell type:markdown id: tags:
### Sentiment Dictionaries 101
Before we discuss the individual steps to create sentiment dictionaries, we want to give a short introduction to what sentiment dictionaries are and how they can be used for sentiment analysis.
Assume we want to assess the sentiment on a sentence level and that we are given the following three sentences:
**Sentence 1:** Today was a good day.<br>
**Sentence 2:** I hate getting up early.<br>
**Sentence 3:** This is both funny and sad at the same time.
Further, we have a sentiment dictionary, which comprises a list of words, each associated with sentiment (e.g., positive or negative):
| Word | Sentiment |
| --- | --- |
| good | positive |
| bad | negative |
| hate | negative |
| funny | positive |
| sad | negative |
| happy | positive |
We can assess the sentiment of each sentence by considering whether the words contained in the dictionary occur in them.
For that, let us define the sentiment *s* of a sentence as follows:
![Sentiment Formula for the Introductory Example](miscellaneous/sentiment_formula_1.png)
where $W_p$ is the number of positive words in a sentence and $W_n$ is the number of negative words in a sentence.
Thus, the formula subtracts the number of words with a negative sentiment from the number of words with a positive sentiment.
If the resulting value is negative, then we assume the sentiment of a sentence to be negative, whereas if the value is positive, we assume the sentiment of a sentence to be positive.
This value is zero if the two numbers are equal or if there are no words of the dictionary in a sentence. In this case, we assume that the sentence has a neutral sentiment.
For our three examples, this means that Sentence 1 has a positive sentiment, Sentence 2 a negative sentiment and Sentence 3 a neutral sentiment.
To verify whether our assumption is true, we implement the example in Python. For that, we first create a list that contains the three example sentences from above:
%% Cell type:code id: tags:
``` python
example_sentences = [
    "Today was a good day.",
    "I hate getting up early.",
    "This is both funny and sad at the same time."
]
%% Cell type:markdown id: tags:
Further, we need a Python dictionary that maps each word of our sentiment dictionary to a sentiment:
%% Cell type:code id: tags:
``` python
sentiment_dictionary = {
    "good": "positive",
    "bad": "negative",
    "hate": "negative",
    "funny": "positive",
    "sad": "negative",
    "happy": "positive"
}
%% Cell type:markdown id: tags:
We then define a function that takes one sentence as input and returns its computed sentiment following the formula and rules stated above:
%% Cell type:code id: tags:
``` python
def compute_sentiment(sentence):
    wpt = WordPunctTokenizer()
    words_list = wpt.tokenize(sentence)
    number_of_negative_words = 0
    number_of_positive_words = 0
    for word, sentiment in sentiment_dictionary.items():
        if sentiment == "negative":
            number_of_negative_words += words_list.count(word)
        elif sentiment == "positive":
            number_of_positive_words += words_list.count(word)
    sentiment_score = number_of_positive_words - number_of_negative_words
    if sentiment_score < 0:
        computed_sentiment = "negative"
    elif sentiment_score > 0:
        computed_sentiment = "positive"
    else:
        computed_sentiment = "neutral"
    return computed_sentiment
%% Cell type:markdown id: tags:
We then iterate through the list of sentences and print the sentiment for each of them:
%% Cell type:code id: tags:
``` python
for sentence in example_sentences:
    print("The sentiment for '{}' is {}.".format(sentence, compute_sentiment(sentence)))
%% Output
The sentiment for 'Today was a good day.' is positive.
The sentiment for 'I hate getting up early.' is negative.
The sentiment for 'This is both funny and sad at the same time.' is neutral.
%% Cell type:markdown id: tags:
The Python implementation thus confirms that our previous assumptions were all correct.
The example above highlights the advantages of dictionary-based sentiment analysis approaches: first, it is straightforward to conduct (as demonstrated above) and second, the produced results are transparent and easily interpretable.
%% Cell type:markdown id: tags:
### Step 1: Selecting Seed Words
Our sentiment dictionary creation depends on seed words for which the sentiment is known. Based on these seed words, we can automatically transfer sentiment to other words, allowing us to circumvent a more tedious and time-consuming annotation process. This step comprises three parts: First, we extract seed words from the text corpus. Second, we manually annotate them. Third, we select annotated seed words based on the agreement of multiple annotators.
#### Extracting Frequent Words
At first, we extract the 3,000 most frequent words from the entire text corpus (without removing stop words or conducting lemmatization). The number of extracted seed words depends on the corpus size. We opted for a compromise between the number of words and the effort required to annotate them.
In this case, these 3,000 most frequent words account for 3.9% of unique French words (76,035 in total), 2.5% of unique Italian words (121,504 in total) and 3.6% of unique Spanish words (84,185 in total).
Focusing on most frequent words allows us to achieve a good coverage as seed words that do not occur frequently in our texts are of no use in the subsequent classification task.
The Jupyter Notebook for the extraction of seed words (see `code/dictionary_creation/1_seed_words/1_extraction.ipynb`) allows for additional settings, such as a maximum document frequency of words to extract.
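The core of this extraction step can be sketched with standard Python; the miniature corpus, the `max_df` value and the cut-off of five words below are hypothetical stand-ins for the actual settings used in the Notebook:

``` python
import re
from collections import Counter

# Hypothetical miniature corpus; the actual Notebook works on the Spectator periodicals.
corpus = [
    "The theatre was full and the play was good.",
    "The play was long but the music was beautiful.",
    "A good play needs good music.",
]

tokenized_docs = [re.findall(r"[a-z]+", doc.lower()) for doc in corpus]
n_docs = len(tokenized_docs)

# Count total occurrences and the number of documents each word appears in.
term_freq = Counter(w for doc in tokenized_docs for w in doc)
doc_freq = Counter(w for doc in tokenized_docs for w in set(doc))

# Skip words whose document frequency exceeds max_df (a simplified stand-in
# for the maximum document frequency setting mentioned above).
max_df = 0.9
most_frequent = [
    (word, count)
    for word, count in term_freq.most_common()
    if doc_freq[word] / n_docs <= max_df
][:5]  # the actual pipeline extracts the 3,000 most frequent words

print(most_frequent)
```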
#### Annotating Frequent Words
Once we have extracted the frequent words, we let three experts, who are familiar with the content and context of the French, Italian and Spanish Spectator periodicals, annotate the seed words.
In particular, we instructed them to assign each of the extracted words to one of three sentiment classes: (i) positive, (ii) negative or (iii) neutral, and to take into account the sociocultural circumstances of the 18<sup>th</sup> century.
As a result, the annotators captured the sentiment of words with regard to their intended meaning.
The provided Jupyter Notebook (see `code/dictionary_creation/1_seed_words/2_annotation.ipynb`) for the annotation of words includes a few simple code lines to generate *.csv* files which can then be opened and annotated in an arbitrary spreadsheet program (e.g., Microsoft Excel or LibreOffice Calc).
Note that the annotated sentiment needs to be entered in the *sentiment* column of the generated file and, for the remaining Jupyter Notebooks to work without changes, the annotators must stick to the following three classes: *positive*, *negative* and *neutral*. For example, the annotations for the words *good*, *bad* and *house* could be:
word | sentiment
--- | ---
good | positive
bad | negative
house | neutral
However, other expressions of sentiment, such as additional classes or numerical values, may be used if the subsequent Jupyter Notebooks are adjusted accordingly.
#### Selecting Seed Words
As past research has indicated that sentiment is very subjective and involves disagreement between multiple individuals (Mozetič et al. 2016), we need to implement a selection procedure to settle on the final sentiment of frequent words.
For that, we use a simple majority vote: we keep only those words for which at least two annotators agree and discard the remaining words.
Regarding the number of annotators, we suggest employing as many of them as possible as it allows for better generalization.
Naturally, there is a trade-off between the number of annotators you can find/afford (i.e., it is very time-consuming to annotate words, especially when you have to consider the sociocultural context) and the quality of resulting annotations.
In our case, we went with three annotators (which we consider the lower limit), who have all worked with the data for several years and have extensive knowledge of this period.
The Jupyter Notebook (see `code/dictionary_creation/1_seed_words/3_selection.ipynb`) for the selection of seed words provides a ready-to-use implementation based on a majority vote. In the following table, we provide the number of positive, negative and neutral selected seed words:
Language | # positive | # negative | # neutral
--- | --- | --- | ---
French | 803 | 381 | 1,738
Italian | 1,811 | 244 | 838
Spanish | 385 | 251 | 2,340
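The majority vote described above can be sketched as follows; the annotation table is hypothetical and merely illustrates the selection rule:

``` python
from collections import Counter

# Hypothetical annotations by three annotators; the real input comes from the
# annotated .csv files described above.
annotations = {
    "bon":    ["positive", "positive", "neutral"],
    "guerre": ["negative", "negative", "negative"],
    "maison": ["neutral", "positive", "negative"],  # no majority: discarded
}

selected_seed_words = {}
for word, labels in annotations.items():
    label, count = Counter(labels).most_common(1)[0]
    if count >= 2:  # keep the word only if at least two annotators agree
        selected_seed_words[word] = label

print(selected_seed_words)
```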
%% Cell type:markdown id: tags:
### Step 2: Creating Word Embeddings
Computers have a hard time working with text or words.
To counteract this problem, we need to transform text or words into numerical forms.
One way to achieve this are so-called word embeddings, which represent words by vectors.
Existing research discerns two types of word embeddings: (i) frequency-based word embeddings, such as count vectors or TF-IDF vectors, and (ii) prediction-based word embeddings, such as word2vec (Mikolov et al. 2013).
Before describing our utilized method, we want to demonstrate principles of word embeddings in the following example.
For that, consider the following three sentences contained in a list:
%% Cell type:code id: tags:
``` python
example_sentences = [
    "Word embeddings are fun.",
    "It is fun to learn new things.",
    "Teaching word embeddings is also fun."
]
%% Cell type:markdown id: tags:
We can use the `CountVectorizer` from the `scikit-learn` package to create a count matrix and store that in a pandas `DataFrame` for easier interpretation:
%% Cell type:code id: tags:
``` python
vectorizer = CountVectorizer()
count_matrix = vectorizer.fit_transform(example_sentences)
count_vectors_df = pandas.DataFrame(count_matrix.todense(), columns=vectorizer.get_feature_names())
count_vectors_df
%% Output
%% Cell type:markdown id: tags:
As we can now observe, this produces a table in which each row represents one of the three sentences and each column represents a distinct word occurring in either of the three sentences.
Each cell of the table contains the number of times the respective word occurs in the respective sentence (in this small example, either 1 or 0).
We can then find the vector representation of a word by considering the values in its column.
For example, the vector of the word *embeddings* is `[1, 0, 1]` and of the word *fun* is `[1, 1, 1]`.
Thus, a word is represented by the set of sentences/documents in which it occurs.
Word embeddings also capture the context of a word in a document, including the semantic similarity as well as the relations with other words.
As such, vectors of words that are frequently used together in texts are also very close to each other in the resulting vector space.
Conversely, vectors of words that are never or only rarely used in a similar context should be very distant from each other.
This suits our dictionary creation process perfectly: based on the vector representation of words, we can automatically transfer sentiment from our seed words to other words.
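To make this notion of closeness concrete, the following sketch computes the cosine similarity between the count vectors from the example above (the `cosine_similarity` helper is ours, not part of the tool chain):

``` python
import math

# Column vectors taken from the count matrix example above.
vectors = {
    "embeddings": [1, 0, 1],
    "fun": [1, 1, 1],
    "learn": [0, 1, 0],
}

def cosine_similarity(a, b):
    # Cosine similarity compares the directions of two vectors,
    # ignoring their magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity(vectors["embeddings"], vectors["fun"]))    # words sharing a sentence
print(cosine_similarity(vectors["embeddings"], vectors["learn"]))  # no shared sentence
```

Words that share many sentences get a similarity close to 1, words that never co-occur get 0.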
%% Cell type:markdown id: tags:
#### Word2Vec
For our dictionary creation pipeline we rely on word2vec (Mikolov et al. 2013), a state-of-the-art method to compute word embeddings.
Word2vec relies on a two-layer neural network to train word vectors that capture the linguistic contexts of words and comprises either one of two different model architectures: continuous bag-of-words (CBOW) or skip-gram.
The former predicts a word from a given context whereas the latter predicts the context from a given word.
Both have their advantages and disadvantages regarding the size of the underlying text corpus, which is why we consider both of them in our tool chain.
Before talking about the specifics of our approach, we want to demonstrate the benefits of word2vec. For the following example, we used Gensim's word2vec implementation to train word embeddings of the publicly available *text8* dataset. We can load the pretrained word2vec model with the following line of code:
%% Cell type:code id: tags:
``` python
word2vec_model = Word2Vec.load("data/processed_data/word2vec_models/example.p")
%% Cell type:markdown id: tags:
As we used the default parameters for word2vec as implemented in Gensim, we trained vectors with a dimension of 100. Thus, instead of the three numbers in a vector from the count vector example above, we now have 100 numbers (i.e., a vector of length 100).
Moreover, these numbers are now real-valued instead of integers, and their interpretation is less self-evident.
We can inspect the resulting vector of a word, for example *car*, with:
%% Cell type:code id: tags:
``` python
word2vec_model.wv["car"]
%% Output
[-0.12308194 -0.39095813 1.000928 -0.04769582 0.9710199 -1.3676833
-1.4404075 -0.7423445 1.6969757 -1.1585793 0.64239687 0.6409116
1.4341261 0.4111035 -2.4854403 -0.7219625 1.540313 0.42790467
-0.44106185 -1.114401 1.3602705 0.8318557 2.0098808 0.7983357
-1.5087671 -0.8375068 -0.61658716 1.2119923 -0.6702468 -0.3222445
0.9415244 2.6550481 -0.99794585 0.04601024 -1.6316731 -1.439617
-1.431928 -2.4456704 -0.71318716 0.86487716 2.2365756 -0.89588803
-0.7087643 0.9257705 1.712533 0.7194995 -2.4665103 -1.3497332
2.2958574 -0.54635614 0.9186488 0.51365685 0.09351666 -1.0833061
-0.00810259 2.8242166 0.1252522 -1.207868 0.10782211 -0.34977445
-0.30000007 0.33047387 -0.12232512 -0.52950805 -2.4587536 -0.37481222
-2.2148058 0.14348628 -0.79030484 -2.5900028 2.7875724 -0.7795173
0.6641297 2.6237233 1.7713573 1.7022327 -0.04617653 1.2087046
-1.4730823 0.74134797 -0.26776415 0.22373354 0.71002257 1.5748668
-1.778043 0.48367617 1.0869575 -0.6362949 0.63211554 0.5351157
-1.8014896 0.39312994 -2.118675 0.83928734 -0.3225636 -2.1843622
-0.557146 -1.6596688 1.2372373 3.977601 ]
%% Cell type:markdown id: tags:
We can find words similar to *car* by considering the cosine similarity between the vector of *car* and all other vectors. The higher the similarity, the more similar are the words. Gensim provides an easy way to find most similar words based on cosine similarity:
%% Cell type:code id: tags:
``` python
word2vec_model.wv.most_similar("car", topn=20)
%% Output
[('driver', 0.7839363217353821),
('taxi', 0.7524467706680298),
('cars', 0.725511908531189),
('motorcycle', 0.7036831378936768),
('vehicle', 0.698715090751648),
('truck', 0.6913774609565735),
('passenger', 0.661078155040741),
('automobile', 0.6501474380493164),
('audi', 0.6245964169502258),
('glider', 0.6229903101921082),
('tire', 0.6213281154632568),
('cab', 0.6198135018348694),
('engine', 0.6183426380157471),
('volkswagen', 0.6164752840995789),
('engined', 0.6096624732017517),
('airplane', 0.6076435446739197),
('bmw', 0.6070380210876465),
('elevator', 0.6061339974403381),
('racing', 0.6031301617622375),
('stock', 0.6030023097991943)]
%% Cell type:markdown id: tags:
Another advantage of word2vec compared to count vectors is that it can capture various concepts and analogies.
Perhaps the most famous example in that regard is as follows: If you subtract the vector of the word *man* from the vector of the word *king* and add the vector of the word *woman* it should result in a vector very close to that of the word *queen*. We can evaluate whether this is the case with our pretrained model through Gensim very easily:
%% Cell type:code id: tags:
``` python
word2vec_model.wv.most_similar(positive=["woman", "king"], negative=["man"], topn=1)
%% Output
[('queen', 0.6670934557914734)]
%% Cell type:markdown id: tags:
We observe that, indeed, the trained word embedding correctly reflects this analogy.
Can you think of any other concepts or analogies to test? Feel free to try it with other word vectors.
%% Cell type:markdown id: tags:
#### Training the Models
Before we start training the actual word2vec models for the Spectator periodicals in the respective languages, we need to preprocess our texts.
The amount of required preprocessing for word2vec is very minimal, as we only need to remove stop words and extract individual sentences.
The latter is important because individual sentences are the required input form for the word2vec implementation of Gensim.
Since word2vec has many hyperparameters (that all have an impact on the resulting word embeddings), we need to tune them to achieve the best possible performance.
One way to optimize hyperparameters is to conduct a *grid search*.
For this purpose, we define a set of possible hyperparameters and train one individual model for each possible hyperparameter combination.
We then evaluate each model and select the one that yielded the best performance.
The Jupyter Notebook (see `code/dictionary_creation/2_word_embeddings/1_grid_search.ipynb`) contains a selection of possible hyperparameters, but further adjustments may be necessary.
For that, we refer to the [documentation]( (Řehůřek 2009-2021) of the word2vec implementation of Gensim.
All models trained this way are stored for later use and, depending on your text corpus and the number of hyperparameter combinations, the stored models can take up hundreds of gigabytes of disk space.
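The enumeration of hyperparameter combinations for such a grid search can be sketched with `itertools.product`; the parameter grid below is a deliberately small, hypothetical example (see the Gensim documentation for all available parameters):

``` python
from itertools import product

# A hypothetical, deliberately small hyperparameter grid; the Notebook uses
# a larger selection of values.
param_grid = {
    "vector_size": [100, 300],
    "window": [5, 10],
    "sg": [0, 1],  # CBOW or skip-gram
}

# One dictionary per hyperparameter combination: 2 * 2 * 2 = 8 models to train.
combinations = [
    dict(zip(param_grid.keys(), values))
    for values in product(*param_grid.values())
]

print(len(combinations))
for params in combinations[:2]:
    print(params)
```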
#### Evaluating the Models
Once we trained one model for each hyperparameter combination, we need to evaluate which of these combinations achieved the best performance.
For this evaluation, we rely on lists of manually annotated word pairs, in which every word pair was assigned a relation score.
In our case, this relation score ranges from 0 to 10, where 0 represents no similarity and 10 represents absolute similarity.
For example, when we consider the word pairs *old & new*, *easy & hard*, *beautiful & wonderful* as well as *rare & scarce*, the respective scores could be:
word pair | relation score
--- | ---
old & new | 0
easy & hard | 1.23
beautiful & wonderful | 7.15
rare & scarce | 9.89
Typically, such word pair lists are manually annotated, which is again a very time-consuming and tedious process and requires multiple annotators. In our case, we adapt previously existing lists for French, Italian and Spanish (Freitas et al. 2016; [GitHub Repository]()).
More precisely, we filter out all words that do not exist in our historic Spectator periodicals, extend the lists with 18<sup>th</sup> century spelling variations, and check whether the relation scores are meaningful and applicable to the languages of the 18<sup>th</sup> century.
Using the adapted lists, we can compute Pearson correlation coefficients between scores of word pairs and similarities of respective word vectors from the trained models in our dedicated Jupyter Notebook (see `code/dictionary_creation/2_word_embeddings/2_evaluation.ipynb`).
We then select the model for which the correlation coefficients are the highest and report the following coefficients for the respective languages:
Language | Pearson Rho
--- | ---
French | 0.402
Italian | 0.157
Spanish | 0.310
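For illustration, the Pearson correlation coefficient used in this evaluation can be computed from scratch; the human scores and model similarities below are made-up values for four word pairs:

``` python
import math

# Hypothetical human relation scores and model cosine similarities.
human_scores = [0.0, 1.23, 7.15, 9.89]
model_similarities = [-0.05, 0.10, 0.62, 0.81]

def pearson(x, y):
    # Pearson correlation: covariance of x and y divided by the
    # product of their standard deviations.
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)

print(round(pearson(human_scores, model_similarities), 3))
```

A coefficient close to 1 means the model's similarities rank the word pairs almost exactly as the human annotators did.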
We provide the selected models in the data folder of this repository.
To use them for your own projects, download them and load the pickled Gensim word2vec model in Python:
``` python
import pickle
from gensim.models import Word2Vec

path_to_model = "" # set the path to where you saved the model (e.g., path_to_model = "french.p")
with open(path_to_model, "rb") as handle:
    model = pickle.load(handle)
```
%% Cell type:markdown id: tags:
### Step 3: Transferring Sentiment
In the third and final step of the dictionary creation pipeline, we use the generated word embeddings to transfer the sentiment from our seed words to other words that appear in a similar context of seed words.
For that, we use a *k*-nearest neighbors (KNN) classifier (Fix and Hodges 1951) that considers the distances between word vectors.
Remember that the trained word embeddings keep words that appear frequently in a similar context close together and words that are not related very distant from each other in the vector space.
As such, this straightforward and interpretable classification method is perfectly suited for our context-based transfer of sentiment.
#### Classifying Words
The *k*-nearest neighbors classifier is based on distances between vectors in a multidimensional feature space (in our case the vectors from the trained word embeddings).
It classifies an unlabeled instance based on the *k* nearest labeled neighbors of that instance, where *k* can take an arbitrary value.
To demonstrate the functioning of the KNN classifier, we implement a two-dimensional example in Python:
%% Cell type:code id: tags:
``` python
example_X_data = [
    [2, 2],
    [3, 5],
    [4, 8],
    [5, 2],
    [6, 9],
    [2, 6],
    [8, 7]
]
# labels chosen for illustration, consistent with the k = 3 and k = 5 outcomes discussed below
example_X_labels = ["blue", "blue", "red", "red", "red", "red", "blue"]
example_y_data = [
    [3, 3]
]
fig, ax = plt.subplots(figsize=(5, 5))
ax.scatter(*zip(*example_X_data), c=example_X_labels)
ax.scatter(*zip(*example_y_data), c="black", marker="x")
ax.set_xlim(0, 10)
ax.set_ylim(0, 10)
%% Output
%% Cell type:markdown id: tags:
In this example, we have two different classes (or labels): the *blue* points and the *red* points.
We have one unlabeled instance, in this case a black cross, which we want to assign to either of the two classes.
The resulting class for the unlabeled instance depends on how we set *k*.
Note that for a binary classification task (two classes), only odd numbers for *k* make sense as it could result in ties otherwise.
Now we can train the classifier based on the labeled instances using the `KNeighborsClassifier` method from `scikit-learn`, find the nearest neighbors of our unlabeled instance and assign it to the majority class of these neighbors. For this example, we set *k* to 3:
%% Cell type:code id: tags:
``` python
k = 3
neigh = KNeighborsClassifier(n_neighbors=k), example_X_labels)
nearest_neighbors = neigh.kneighbors(example_y_data)[1][0]
fig, ax = plt.subplots(figsize=(5, 5))
ax.scatter(*zip(*example_X_data), c=example_X_labels)
ax.scatter(*zip(*example_y_data), c="black", marker="x")
for n_i in nearest_neighbors:
    ax.plot([example_y_data[0][0], example_X_data[n_i][0]], [example_y_data[0][1], example_X_data[n_i][1]], 'gray', linestyle=':', marker='')
ax.set_xlim(0, 10)
ax.set_ylim(0, 10)
Markdown("The unlabeled instance is assigned to the *{}* class.".format(neigh.predict(example_y_data)[0]))
%% Output
The unlabeled instance is assigned to the *blue* class.
<IPython.core.display.Markdown object>
%% Cell type:markdown id: tags:
For *k* = 3, the algorithm considers the three nearest neighbors (indicated by the dashed gray lines) which include one instance of the *red* class and two instances of the *blue* class.
As the majority of neighbors are labeled *blue*, it assigns our unlabeled instance to the *blue* class.
But what about other values of *k*? Feel free to adjust *k* in the above code block and observe the different outcomes.
For example, in the case of *k* = 5, the classifier considers the five nearest neighbors, three of which are labeled *red* and two of which are labeled *blue*.
Thus, it assigns our unlabeled instance to the *red* class.
As you can see, the selection of *k* has a crucial impact on the result of the classification.
Further, there are additional hyperparameters for this classifier that affect the classification outcome, such as the distance measure to find the nearest neighbors.
For word vectors (e.g., TF-IDF or word2vec representations), we suggest using cosine similarity, because the directions of vectors matter more than their magnitudes.
We specifically advise against the Euclidean distance in such high-dimensional spaces, as vectors tend to become uniformly distant from one another under this metric.
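To illustrate the cosine-based neighbor search, the following sketch uses scikit-learn's `KNeighborsClassifier` with `metric="cosine"`. The four-dimensional toy vectors and class labels are illustrative assumptions, not our trained embeddings:
%% Cell type:code id: tags:
``` python
from sklearn.neighbors import KNeighborsClassifier

# Toy word vectors (hypothetical 4-dimensional embeddings for illustration only).
X = [[1.0, 0.2, 0.0, 0.1],
     [0.9, 0.3, 0.1, 0.0],
     [0.0, 0.1, 1.0, 0.8],
     [0.1, 0.0, 0.9, 1.0]]
y = ["positive", "positive", "negative", "negative"]

# metric="cosine" ranks neighbors by vector direction rather than magnitude.
knn = KNeighborsClassifier(n_neighbors=3, metric="cosine")
knn.fit(X, y)

# A scaled-up copy of a "positive"-like vector is still classified as positive,
# because scaling does not change its direction.
print(knn.predict([[10.0, 2.5, 0.5, 0.5]])[0])
```
%% Cell type:markdown id: tags: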
To address the problem of hyperparameter selection, one could conduct a grid search (similar to what we did for the word2vec model) and evaluate the classified words against a test set containing a ground truth.
For our case, we decided to set *k* to 5 (a common value for *k*) and to evaluate the performance of our model as described in the following section.
Regarding the classification based on our previously trained word embeddings, we provide all code to train the classifier with the word vectors of seed words and then to predict the sentiment class (i.e., positive, negative or neutral) for all remaining words in the text corpus in the Jupyter Notebook (see `code/dictionary_creation/3_classification/1_knn.ipynb`).
#### Evaluating the Classification
To evaluate the performance of the classifier, we randomly extract a maximum of 1,000 words from each of the three assigned sentiment classes and manually annotate them.
Note that the Jupyter Notebook (see `code/dictionary_creation/3_classification/2_evaluation.ipynb`) for the classifier evaluation provides a way to randomly extract words and to prepare *.csv* files for further annotation.
In our work, we again had three expert annotators label the combined 3,000 words for each language (except for Spanish, for which the classifier labeled only 649 positive and 440 negative words).
Similar to the annotation process for seed words, we only keep the annotated words for which at least two annotators agreed (majority vote).
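The majority-vote step can be sketched as follows; the helper `majority_label` and the example annotations are our own illustrations, not part of the project notebooks:
%% Cell type:code id: tags:
``` python
from collections import Counter

def majority_label(labels, min_agreement=2):
    """Return the label at least `min_agreement` annotators agreed on, else None."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= min_agreement else None

# Hypothetical annotations by three annotators.
annotations = [
    ("vertu",   ["positive", "positive", "neutral"]),   # kept: two annotators agree
    ("chagrin", ["negative", "negative", "negative"]),  # kept: all three agree
    ("esprit",  ["positive", "negative", "neutral"]),   # dropped: no majority
]
kept = {word: majority_label(votes) for word, votes in annotations
        if majority_label(votes) is not None}
print(kept)
```
%% Cell type:markdown id: tags: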
We then compute balanced accuracy scores (between the labels assigned by the classifier and the labels assigned by the annotators) to assess the prediction performance of our classifier and, thus, the quality of our computationally created dictionaries.
We report the following balanced accuracy scores:
Language | Balanced Accuracy | Bootstrap Confidence Interval
--- | --- | ---
French | 0.570 | [0.554, 0.597]
Italian | 0.544 | [0.526, 0.563]
Spanish | 0.554 | [0.530, 0.580]
These values indicate that we outperform a random baseline (0.33) and are similar to other automated sentiment classification tasks in terms of performance (Mozetič et al. 2016).
However, note that we did not evaluate all the words labeled by the KNN classifiers but only a small and randomly drawn subset of them. To mitigate the effect of the one-time random extraction of evaluation words, in the table above we also provide bootstrap 95% confidence intervals (over 1,000 iterations) to estimate the balanced accuracy of the whole population for each language.
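Such an interval can be obtained by resampling the evaluation pairs with replacement and recomputing the balanced accuracy on each resample. The sketch below uses synthetic annotator and classifier labels, so the resulting interval is illustrative only:
%% Cell type:code id: tags:
``` python
import numpy as np
from sklearn.metrics import balanced_accuracy_score

rng = np.random.default_rng(42)

# Synthetic stand-ins for annotator labels (y_true) and classifier labels (y_pred).
classes = np.array(["positive", "negative", "neutral"])
y_true = rng.choice(classes, size=500)
y_pred = np.where(rng.random(500) < 0.55, y_true, rng.choice(classes, size=500))

scores = []
for _ in range(1000):  # bootstrap iterations
    idx = rng.integers(0, len(y_true), len(y_true))  # resample pairs with replacement
    scores.append(balanced_accuracy_score(y_true[idx], y_pred[idx]))

lower, upper = np.percentile(scores, [2.5, 97.5])  # 95% confidence interval
print("balanced accuracy CI: [{:.3f}, {:.3f}]".format(lower, upper))
```
%% Cell type:markdown id: tags: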
This step concludes the dictionary creation process. We provide in-depth descriptions of how to use the created dictionaries for sentiment analysis in the following section.
%% Cell type:markdown id: tags:
## Sentiment Analysis for the 18<sup>th</sup> Century
In this section, we demonstrate how to utilize our created dictionaries to analyze sentiment of Spectator periodicals published during the 18<sup>th</sup> century. The related Jupyter Notebooks can be found in the `code/sentiment_analysis/` directory of this repository.
### French, Italian and Spanish Dictionaries
You can use our created sentiment dictionaries for French, Italian and Spanish for your own projects.
In particular, we provide three different forms of dictionaries for each of the three languages.
We provide detailed descriptions of them in the following sections.
#### Manually Annotated
The first kind of dictionary contains all the manually annotated words used for the seed word selection and the evaluation of the KNN classifiers.
For each language, we provide a list of positive, negative and neutral words.
All words have been manually annotated by three experts familiar with the Spectator periodicals.
To assess the agreement between annotators, we report Fleiss' kappa (Fleiss 1971) separately for the annotated words used in the seed word selection and for those used in the evaluation of the KNN classifiers:
*Fleiss' kappa between the three annotators for words used in the seed word selection:*
Language | kappa
--- | ---
French | 0.387
Italian | 0.385
Spanish | 0.370
*Fleiss' kappa between the three annotators for words used in the KNN classifiers evaluation:*
Language | kappa
--- | ---
French | 0.127
Italian | 0.328
Spanish | 0.300
These values suggest fair agreement between annotators (slight agreement in the case of the French evaluation words) and reflect the typical discrepancies between individuals when judging sentiment (Mozetič et al. 2016).
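Fleiss' kappa can be computed, for example, with `statsmodels`; the ratings matrix below is a made-up miniature (rows are words, columns are three annotators, values encode the sentiment classes), not our annotation data:
%% Cell type:code id: tags:
``` python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Hypothetical ratings: 0 = positive, 1 = negative, 2 = neutral.
ratings = np.array([
    [0, 0, 0],
    [0, 0, 2],
    [1, 1, 1],
    [2, 2, 1],
    [0, 1, 2],
    [2, 2, 2],
])

# aggregate_raters converts per-annotator labels into per-category counts per word.
table, _ = aggregate_raters(ratings)
print("Fleiss' kappa: {:.3f}".format(fleiss_kappa(table)))
```
%% Cell type:markdown id: tags: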
Note that we only keep the words for which at least two annotators agreed (majority vote).
In the following table, we list the number of positive, negative and neutral words in the manually annotated dictionaries for each language:
Language | # positive | # negative | # neutral
--- | --- | --- | ---
French | 1,045 | 1,071 | 3,150
Italian | 1,789 | 1,196 | 2,696
Spanish | 681 | 798 | 3,529
#### Computationally Extended
The second kind of dictionary contains all manually annotated seed words as well as words to which we transferred sentiment through our KNN classification.
Note that these dictionaries are very specific to our text corpus (Spectator periodicals).
If you are considering other data, you may want to use manually annotated words only or create extended dictionaries yourself.
In the following table, we list the number of positive, negative and neutral words in the computationally extended dictionaries for each language:
Language | # positive | # negative | # neutral
--- | --- | --- | ---
French | 4,713 | 2,350 | 17,499
Italian | 4,365 | 1,652 | 25,494
Spanish | 1,034 | 691 | 19,070
Note that the number of negative Spanish words in the computationally extended dictionary is smaller than in the manually annotated one, even though the seed words are included here.
This is because the manually annotated dictionary contains both the annotated seed words and the words annotated for the classifier evaluation. For Spanish, we have 251 negative seed words, and the classifier transferred negative sentiment to only 440 unlabeled words, making the total (691) smaller than in the manually annotated Spanish dictionary (798).
#### Computationally Extended and Corrected
The third group of dictionaries contains manually annotated seed words and computationally extended words but the latter were corrected using the manually annotated words used in the evaluation of the KNN classifiers.
For example, if the KNN classifier labeled a word as positive but at least two annotators labeled it differently for the evaluation, we changed the sentiment class to that of the agreeing annotators.
In the following table, we list the number of positive, negative and neutral words in the computationally extended and corrected dictionaries for each language:
Language | # positive | # negative | # neutral
--- | --- | --- | ---
French | 4,216 | 2,272 | 18,074
Italian | 4,387 | 1,674 | 25,450
Spanish | 692 | 812 | 19,291
#### Loading the Dictionaries
Since the dictionaries are plain text files, you can easily load them in Python. For the subsequent demonstration of analysis methods, we use the computationally extended and corrected Spanish dictionaries.
%% Cell type:code id: tags:
``` python
language = "Spanish"
sentiment_dict = {}
with open("{}{}_negative.txt".format("data/processed_data/dictionaries/computational_corrected/", language.lower()), "r", encoding="utf-8") as fr:
    sentiment_dict["neg"] =  # one word per line
with open("{}{}_positive.txt".format("data/processed_data/dictionaries/computational_corrected/", language.lower()), "r", encoding="utf-8") as fr:
    sentiment_dict["pos"] =
print("loaded {} negative words".format(len(sentiment_dict["neg"])))
print("loaded {} positive words".format(len(sentiment_dict["pos"])))
```
%% Output
loaded 812 negative words
loaded 692 positive words
%% Cell type:markdown id: tags:
### Preparing Your Data
We have now successfully loaded the sentiment dictionaries but before we can use them for the sentiment analysis, we have to prepare our data.
Analyzing sentiment with our Jupyter Notebook requires a specific data format which allows for additional attributes, such as author names or publication dates, to be considered.
To add custom attributes, one needs to append them before the text in the respective *.txt* files.
The Jupyter Notebook (see `code/sentiment_analysis/1_data_preparation.ipynb`) utilized for the data preparation extracts the text as well as additional attributes and stores them in a pandas DataFrame.
The following file format must be satisfied:
* Each additional attribute must be provided in an individual line in the text file.
* A line containing an attribute needs to begin with its name followed by an `=` and then the value for the attribute. No spaces between the `=` and attribute name/value are allowed. For example: `year=1786`.
* The number of additional attributes to provide is unlimited, however, attribute names must be unique.
* Attribute names can include spaces. For example: `periodical title=La Spectatrice`.
* The actual text must come last, following `text=` at the beginning of the line.
* Line breaks are allowed within the text, but not within attribute values.
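A minimal parser following these rules might look as follows; the function name and the example values are illustrative, and the actual implementation in the notebook may differ:
%% Cell type:code id: tags:
``` python
def parse_spectator_file(raw):
    """Parse 'name=value' attribute lines followed by a (possibly multi-line) 'text=...'."""
    attributes = {}
    lines = raw.splitlines()
    for i, line in enumerate(lines):
        name, _, value = line.partition("=")
        if name == "text":
            # Everything from here on belongs to the text, line breaks included.
            attributes["text"] = "\n".join([value] + lines[i + 1:])
            break
        attributes[name] = value
    return attributes

raw = "author=Justus Van Effen\nyear=1711\ntext=Si je prends la liberté\nde vous dédier cet Ouvrage..."
record = parse_spectator_file(raw)
print(record["author"], record["year"])
print(record["text"])
```
%% Cell type:markdown id: tags: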
The following snippet depicts an input *.txt* file in the correct format:
author=Justus Van Effen
text=Si je prends la liberté de vous dédier cet Ouvrage; ce n’est en aucune maniere pour me ménager une favorable occasion d’instruire les hommes de votre mérite, & de vous donner, même avec sobrieté, les éloges dont vous êtes digne...
Note that the inclusion of additional attributes is optional; a *.txt* file can also start with the text right away. For example:
text=Si je prends la liberté de vous dédier cet Ouvrage; ce n’est en aucune maniere pour me ménager une favorable occasion d’instruire les hommes de votre mérite, & de vous donner, même avec sobrieté, les éloges dont vous êtes digne...
%% Cell type:markdown id: tags:
For simplicity, we already transformed French, Italian and Spanish texts into respective pandas DataFrames. If you want to see the code for this process please refer to the corresponding Jupyter Notebook in the code folder. Further, we provide the individual .txt files for each language in the `data/raw_data/example_texts/` directory of this repository if you wish to reconstruct the DataFrame creation yourself.
Now we can load the pandas DataFrame containing Spanish texts as well as four additional attributes (i.e., periodical title, issue number, author, year) as follows:
%% Cell type:code id: tags:
``` python
texts_df = pandas.read_pickle("data/processed_data/example_texts/spanish.p")
print("loaded dataframe with {} texts and {} attributes".format(texts_df.shape[0], texts_df.shape[1] - 1))
```
%% Output
loaded dataframe with 137 texts and 4 attributes
%% Cell type:markdown id: tags:
### Computing Sentiment
By now, we have loaded our sentiment dictionaries as well as our texts. To compute sentiment, we have to consider the occurrences of the words contained in the dictionaries in the text files.
For that, we use the following formula, defining the sentiment *s* of a text with:
![Sentiment Formula](miscellaneous/sentiment_formula_2.png)
where $W_p$ is the number of positive words in a text and $W_n$ is the number of negative words, i.e., $s = \frac{W_p - W_n}{W_p + W_n}$.
Thus, the computed sentiment score is a value ranging between −1 and +1, where values close to −1 are considered as negative, values close to +1 as positive, and where values close to zero indicate a neutral sentiment.
The implementation of this formula in Python is:
%% Cell type:code id: tags:
``` python
def compute_sentiment(text):
    # tokenize and lowercase (nltk and its tokenizer data must be available)
    tokens = nltk.word_tokenize(text)
    tokens = [t.lower() for t in tokens]
    num_negative = 0
    num_positive = 0
    for nw in sentiment_dict["neg"]:
        num_negative += tokens.count(nw.lower())
    for pw in sentiment_dict["pos"]:
        num_positive += tokens.count(pw.lower())
    try:
        sentiment_score = (num_positive - num_negative) / (num_positive + num_negative)
    except ZeroDivisionError:
        # texts without any sentiment words are considered neutral
        sentiment_score = 0
    return sentiment_score
```
%% Cell type:markdown id: tags:
After defining our method to compute sentiment, we have to apply it to the respective texts. Doing this with pandas in Python requires one line of code:
%% Cell type:code id: tags:
``` python
texts_df["sentiment"] = texts_df["text"].apply(compute_sentiment)
```
%% Cell type:markdown id: tags:
### Analyzing Sentiment
After computing the sentiment for the individual texts, we can start analyzing the text corpus by considering descriptive statistics. We can compute and print these statistics with pandas through the `describe` method (see the official pandas documentation for details):
%% Cell type:code id: tags:
``` python
texts_df["sentiment"].describe()
```
%% Output
count 137.000000
mean 0.287341
std 0.141106
min -0.084337
25% 0.196507
50% 0.284211
75% 0.387560
max 0.737589
Name: sentiment, dtype: float64
%% Cell type:markdown id: tags:
This method returns the count, mean, median, standard deviation, minimum, maximum as well as the first (25<sup>th</sup> percentile) and third (75<sup>th</sup> percentile) quartile and, thus, provides an idea about the sentiment distribution in the analyzed texts. For better visualization, we can create a histogram plot:
%% Cell type:code id: tags:
``` python
texts_df["sentiment"].plot(kind="hist", bins=10)
plt.title("Histogram Plot Example")
```
%% Output
%% Cell type:markdown id: tags:
In the figure above, we can observe that the majority of texts in our dataset convey a positive sentiment and only a small number of texts a slightly negative one.
Whether this allows statements about the Spanish corpus of periodicals as a whole or whether there is a bias can only be judged by comparing the results with a different Spanish corpus of 18<sup>th</sup> century texts.
We thus concentrate on comparisons within the present corpus by considering the additional attributes.
We can use other types of plots to investigate the interaction of sentiment with the additional attributes we provided.
For example, we can create a box plot to assess sentiment distribution differences between two Spectator periodicals:
%% Cell type:code id: tags:
``` python
texts_df.boxplot("sentiment", by="periodical title")
plt.title("Box Plot Example")
```
%% Output
%% Cell type:markdown id: tags:
Box plots are a convenient approach to learn more about the distribution of the underlying data.
The horizontal green lines indicate medians, while the blue lines indicate the first and third quartiles.
The whiskers (horizontal black lines) indicate minimum and maximum values still within 1.5 interquartile ranges.
In the above figure, we observe that Joseph Clavijo y Faxardo's *El Pensador* (1762-63, 1767) conveyed a more positive sentiment than Beatriz Cienfuegos' *La Pensadora Gaditana* (1763-64).
Another interesting example is the development of sentiment over the publication period of the individual issues of the two periodicals. We can use seaborn's `lineplot` method to easily create a plot for this comparison:
%% Cell type:code id: tags:
``` python
seaborn.lineplot(data=texts_df, x="issue number", y="sentiment", hue="periodical title")
plt.title("Line Plot Example")
```
%% Output
%% Cell type:markdown id: tags:
In the above figure, we observe that sentiment fluctuates considerably across the issues of both periodicals.
Further, we can analyze sentiment of individual texts contained in our dataset.
The following code block prints and highlights all the words conveying sentiment in a single text (in this case we consider *Prólogo y Razón de la Obra* of *La Pensadora Gaditana*):
%% Cell type:code id: tags:
``` python
text_to_print = texts_df.loc["mws-08C-52.txt", "text"]
# If you want to try this here with your own texts, please replace 'texts_df.loc["mws-08C-52.txt", "text"]' with your own text.
# For example:
# text_to_print = "Yo, señores, gozo la suerte de ser hija de Cadiz: bastante he dicho para poder hablar sin verguenza..."
```
%% Cell type:code id: tags:
``` python
for nw in sentiment_dict["neg"]:
    if nw.lower() in text_to_print.lower() and nw not in ["span", "style", "color", "font", "size"]:
        text_to_print = re.sub(r"\b{}\b".format(nw), r"<span style='color:#E74C3C; font-size:20pt'><b>{}</b></span>".format(nw), text_to_print)
for pw in sentiment_dict["pos"]:
    if pw.lower() in text_to_print.lower() and pw not in ["span", "style", "color", "font", "size"]:
        text_to_print = re.sub(r"\b{}\b".format(pw), r"<span style='color:#27AE60; font-size:20pt'><b>{}</b></span>".format(pw), text_to_print)
HTML(text_to_print)  # render the highlighted text (HTML from IPython.display)
```
%% Output
<IPython.core.display.HTML object>
%% Cell type:markdown id: tags:
The above methods only demonstrate some examples of how to analyze and visualize sentiment. We provide a variety of other sentiment analysis methods in the corresponding Jupyter Notebook (see `code/sentiment_analysis/2_sentiment_analysis.ipynb`).
%% Cell type:markdown id: tags:
## Conclusions
%% Cell type:markdown id: tags:
In this work, we proposed a method to create sentiment dictionaries specifically designed for the analysis of Spectator periodicals published in French, Italian and Spanish during the 18<sup>th</sup> century.
We extracted, annotated and selected a set of seed words for each language which we then used to transfer sentiment to other words based on a classification task relying on trained word embeddings.
Further, we evaluated the performance of both the creation of word embeddings and the conducted classification.
This makes us confident that our work is a useful contribution to the ongoing discussion on the application of computational methods to historical texts, both in digital humanities in general and in (digital) literary studies in particular.
However, while sentiment analysis extends the knowledge about our data, results based on computational methods should always be treated with caution: manual interpretation by experts familiar with the data's content and context remains necessary.
Our tool chain can be easily adapted and can serve as a foundation for future work.
For example, one could implement a more sophisticated sentiment formula that considers the context of words, such as negations that may invert sentiment orientation. Further, at the current state, we only consider positive and negative words to compute sentiment, but our tool chain also creates lists of neutral words; these could likewise be incorporated into the sentiment formula.
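As a rough illustration of such a context-aware variant, the following sketch flips the polarity of a sentiment word when a negation token precedes it within a small window. The negation list, window size and example words are assumptions of ours, not part of the tool chain:
%% Cell type:code id: tags:
``` python
# Illustrative French negation tokens; a real list would need careful curation.
NEGATIONS = {"ne", "n'", "pas", "point", "jamais"}

def negation_aware_sentiment(tokens, positive, negative, window=2):
    """Sentiment score where negated words contribute with inverted polarity."""
    pos_count = neg_count = 0
    for i, token in enumerate(tokens):
        if token not in positive and token not in negative:
            continue
        negated = any(t in NEGATIONS for t in tokens[max(0, i - window):i])
        is_positive = (token in positive) != negated  # flip polarity under negation
        if is_positive:
            pos_count += 1
        else:
            neg_count += 1
    total = pos_count + neg_count
    return (pos_count - neg_count) / total if total else 0

# "heureux" is negated by "pas", "sage" is not, so the two cancel out.
tokens = "il n' est pas heureux mais il est sage".split()
print(negation_aware_sentiment(tokens, positive={"heureux", "sage"}, negative=set()))
```
%% Cell type:markdown id: tags: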
Our specific way of selecting seed words also has a significant impact on the computationally created dictionaries. One could try different seed words and observe how this affects the resulting dictionaries. Additionally, it is possible to investigate other classification methods to transfer sentiment from seed words to other words.
Finally, it may be interesting to compare sentiment dictionaries for the 18<sup>th</sup> century with sentiment dictionaries for modern times.
%% Cell type:markdown id: tags:
## Acknowledgments
This work was funded by CLARIAH-AT and partly funded by the go!digital program of the Austrian Academy of Sciences.
%% Cell type:markdown id: tags:
## References
Ertler, K.-D., Fuchs A., Fischer-Pernkopf M., Hobisch E., Scholger M. & Völkl Y. (2011-2021). The Spectators in the international context. Accessed 16 Feb 2021.
Fix, E., & Hodges, J. L. (1951). Nonparametric discrimination: Consistency properties. Randolph Field, Texas, Project, 21-49.
Fleiss, J. L. (1971). Measuring nominal scale agreement among many raters. Psychological bulletin, 76(5), 378.
Freitas, A., Barzegar, S., Sales, J. E., Handschuh, S., & Davis, B. (2016). Semantic relatedness for all (languages): A comparative analysis of multilingual semantic relatedness using machine translation. In European Knowledge Acquisition Workshop (pp. 212-222). Springer, Cham.
Hamilton, W. L., Clark, K., Leskovec, J., & Jurafsky, D. (2016). Inducing domain-specific sentiment lexicons from unlabeled corpora. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Conference on Empirical Methods in Natural Language Processing (Vol. 2016, p. 595). NIH Public Access.
Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
Mozetič, I., Grčar, M., & Smailović, J. (2016). Multilingual Twitter sentiment classification: The role of human annotators. PloS one, 11(5), e0155036.
Ortigosa, A., Martín, J. M., & Carro, R. M. (2014). Sentiment analysis in Facebook and its application to e-learning. Computers in human behavior (Vol. 31, pp. 527-541).
Pak, A., & Paroubek, P. (2010). Twitter as a corpus for sentiment analysis and opinion mining. In LREc (Vol. 10, No. 2010, pp. 1320-1326).
Pang, B. & Lee, L. (2008). Opinion Mining and Sentiment Analysis. In Foundations and Trends® in Information Retrieval (Vol. 2, No. 1–2, pp. 1-135). DOI:
Řehůřek, R. (2009-2021). Word2vec embeddings.
Schmidt, T., Burghardt, M. & Wolff, C. (2018). Herausforderungen für Sentiment Analysis-Verfahren bei literarischen Texten. In Burghardt, M. & Müller-Birn, C. (Eds.), INF-DH-2018. Bonn: Gesellschaft für Informatik e.V.. DOI: 10.18420/infdh2018-16
Sprugnoli, R., Passarotti, M., Corbetta, D. & Peverelli, A. (2020). Odi et Amo. Creating, Evaluating and Extending Sentiment Lexicons for Latin. Proceedings of the 12th Language Resources and Evaluation Conference, 3078-3086.