# A Sentiment Analysis Tool Chain for 18<sup>th</sup> Century Periodicals
# Abstract
Sentiment analysis is a common task in natural language processing (NLP) and aims at the automatic identification of emotions, attitudes and opinions expressed in textual data.
While sentiment analysis is typically tailored for and widely used in the context of Web data, applying it to literary texts remains challenging due to the lack of methods for languages other than English and for texts from earlier periods.
With the work we present here, we not only introduce new sentiment dictionaries for French, Italian and Spanish periodicals of the 18<sup>th</sup> century, but also build a freely and publicly available tool chain based on Jupyter Notebooks, enabling researchers to apply our dictionary creation process and sentiment analysis methods to their own material and projects.
The proposed tool chain comprises two different parts: (i) the optional creation of sentiment dictionaries and (ii) the actual sentiment analysis.
# Contents
This repository includes the following directories and files:
* _code/dictionary_creation/1_seed_words/:_ This directory contains the Jupyter Notebooks to conduct the first step (extraction, annotation and selection of seed words) of the dictionary creation pipeline.
* _code/dictionary_creation/2_word_embeddings/:_ This directory contains the Jupyter Notebooks to conduct the second step (training and evaluation of word embeddings) of the dictionary creation pipeline.
* _code/dictionary_creation/3_classification/:_ This directory contains the Jupyter Notebooks to conduct the third step (training and evaluation of classifiers) of the dictionary creation pipeline.
* _code/sentiment_analysis/:_ This directory contains the Jupyter Notebooks to conduct the actual sentiment analysis.
* _data/classifier_evaluation/:_ This directory contains the annotated words used for the classifier evaluation.
* _data/dictionaries/:_ This directory contains ready-to-use sentiment dictionaries for 18<sup>th</sup> century French, Italian and Spanish.
* _data/example_texts/:_ This directory contains pickled pandas DataFrames with example texts in French, Italian and Spanish (see the loading sketch after this list).
* _data/seed_words/:_ This directory contains the selected seed words used for the classification task.
* _data/word2vec/:_ This directory contains the best performing word2vec models as well as the lists of word pairs used for their evaluation.
* _miscellaneous/:_ This directory contains additional material, such as images, used for illustration purposes.
* _publication.ipynb:_ This Jupyter Notebook represents the main publication in which we explain the proposed method in detail and provide illustrative examples.
* _requirements.txt:_ This file lists all Python packages necessary to run the tool chain (typically used in conjunction with `pip`).
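If you just want a quick look at the example texts, the pickled DataFrames in *data/example_texts/* can be opened directly with pandas. A minimal sketch (the file name below is a placeholder; use one of the files actually contained in the directory):

``` python
import pandas as pd

# Load one of the pickled example DataFrames (placeholder file name).
example_df = pd.read_pickle("data/example_texts/french.p")
print(example_df.head())
```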
# Configuration
We recommend installing [Anaconda](https://www.anaconda.com/), as it comes pre-bundled with most of the required packages.
## Python
If you simply want to use our dictionaries and Jupyter Notebooks to analyze the sentiment of your texts, you need to have the following additional Python packages installed:
* pandas 1.1.3
* Matplotlib 3.2.0
* Seaborn 0.11.0
* tqdm 4.50.2
* nltk 3.5
* Jupyter Notebook 6.1.4 or Jupyter Lab 2.2.6
* ipywidgets 7.5.1
In order to create dictionaries yourself, you need to have the following additional Python packages installed:
* pandas 1.1.3
* gensim 3.8.3 (needs to be [installed separately](https://anaconda.org/anaconda/gensim); we recommend running `pip install gensim` in an Anaconda prompt, as the conda package is not up to date and cannot be installed with Anaconda running Python 3.8)
* sklearn 0.23.2
* nltk 3.5
* spacy 2.2.3 (needs to be [installed separately](https://anaconda.org/conda-forge/spacy))
* stop-words 2018.7.23 (needs to be [installed separately](https://anaconda.org/conda-forge/stop-words))
* Jupyter Notebook 6.1.4 or Jupyter Lab 2.2.6
* ipywidgets 7.5.1
Note that we tested our Jupyter Notebooks with the versions stated above.
While older and newer versions may work, the outcome may be impaired.
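To check whether your environment matches, one simple option is to print the installed versions from a Python prompt or a notebook cell. A minimal sketch covering the packages needed for the sentiment analysis part:

``` python
# Print the installed versions of the core packages to compare them with the lists above.
import pandas, matplotlib, seaborn, tqdm, nltk

for pkg in (pandas, matplotlib, seaborn, tqdm, nltk):
    print(pkg.__name__, pkg.__version__)
```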
## Dataset
If you want to create dictionaries based on your own data, make sure that you have a decent amount of text.
The more text, the better the output.
Also, make sure that you cleaned your data and that each document is contained in a single *.txt* file with UTF-8 encoding.
Our ready-to-use dictionaries and models are based on Spectator periodicals published during the 18<sup>th</sup> century.
In particular, we leverage **The Spectators in the international context**, a digital scholarly edition project which aims at building a central repository for Spectator periodicals (Ertler et al. 2011, Scholger 2018).
The annotated periodicals follow the XML-based Text Encoding Initiative (TEI) standard (TEI Consortium 2020), which provides a vocabulary for representing texts in digital form, and are publicly available through the [digital edition](https://gams.uni-graz.at/spectators).
This dataset contains multiple languages, but we set our focus on French, Italian and Spanish, as these three languages have the largest collections.
For this purpose, we extracted the texts from the TEI-encoded files into plain *.txt* files; an illustrative sketch of such an extraction follows below.
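The extraction script itself is not part of this repository. For illustration only, the following sketch shows one possible way to pull plain text out of TEI files with `lxml` and to write one UTF-8 *.txt* file per document; the paths and the XPath expression are assumptions and need to be adapted to your own markup:

``` python
import glob
import os
from lxml import etree

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}
os.makedirs("texts", exist_ok=True)

for tei_file in glob.glob("tei/*.xml"):
    tree = etree.parse(tei_file)
    # Collect all text nodes below <text> and normalize whitespace.
    fragments = tree.xpath("//tei:text//text()", namespaces=TEI_NS)
    plain_text = " ".join(" ".join(fragments).split())
    out_name = os.path.splitext(os.path.basename(tei_file))[0] + ".txt"
    with open(os.path.join("texts", out_name), "w", encoding="utf-8") as out_file:
        out_file.write(plain_text)
```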
## Hardware
Please keep in mind that your machine needs adequate hardware depending on the amount of text you want to consider.
This is especially important for the dictionary creation tool chain (e.g., we used a machine with 24 cores and 750 GB RAM and computations still took up to three days).
If you just want to analyze sentiment using existing dictionaries, a computer with common hardware should suffice.
# Contributors
name: Philipp Koncar
orcid: 0000-0001-5492-0644
institution: Institute of Interactive Systems and Data Science, Graz University of Technology
e-mail: philipp.koncar@tugraz.at
address: Inffeldgasse 16c, 8010, Graz, Austria
name: Christina Glatz
orcid:
institution: Institute of Romance Studies, University of Graz
e-mail: christina.glatz@uni-graz.at
address: Merangasse 70, 8010 Graz, Austria
name: Elisabeth Hobisch
orcid: 0000-0002-6051-4500
institution: Institute of Romance Studies, University of Graz
e-mail: elisabeth.hobisch@uni-graz.at
address: Merangasse 70, 8010 Graz, Austria
name: Yvonne Völkl
orcid: 0000-0001-8625-3663
institution: Institute of Romance Studies, University of Graz
e-mail: yvonne.voelkl@uni-graz.at
address: Merangasse 70, 8010 Graz, Austria
name: Bernhard C. Geiger
orcid: 0000-0003-3257-743X
institution: Know-Center GmbH
e-mail: geiger@ieee.org
address: Inffeldgasse 13, 8010, Graz, Austria
name: Sanja Sarić
orcid: 0000-0002-0802-6999
institution: Centre for Information Modelling - Austrian Centre for Digital Humanities, University of Graz
e-mail: sanja.saric@uni-graz.at
address: Elisabethstraße 59/III, 8010 Graz, Austria
name: Martina Scholger
orcid: 0000-0003-1438-3236
institution: Centre for Information Modelling - Austrian Centre for Digital Humanities, University of Graz
e-mail: martina.scholger@uni-graz.at
address: Elisabethstraße 59/III, 8010 Graz, Austria
name: Denis Helic
orcid: 0000-0003-0725-7450
institution: Institute of Interactive Systems and Data Science, Graz University of Technology
e-mail: dhelic@tugraz.at
address: Inffeldgasse 16c, 8010, Graz, Austria
# References
TEI Consortium (2020) Guidelines for Electronic Text Encoding and Interchange. http://www.tei-c.org/P5/. Accessed 16 Feb 2021.
Ertler, K.-D., Fuchs, A., Fischer-Pernkopf, M., Hobisch, E., Scholger, M., Völkl, Y. (2011) The Spectators in the international context. https://gams.uni-graz.at/spectators. Accessed 16 Feb 2021.
Scholger, M. (2018) “Spectators” in the International Context - A Digital Scholarly Edition. In: Discourses on Economy in the Spectators, 229–247. Verlag Dr. Kovac, Hamburg.
%% Cell type:markdown id: tags:
## Imports
%% Cell type:code id: tags:
``` python
import pandas as pd
import os
import glob
from sklearn.feature_extraction.text import CountVectorizer
```
%% Cell type:markdown id: tags:
## Configuration
*input_dir:* The path to the directory that contains your text files. Please make sure to use a '/' (slash) at the end. For example: `path/to/texts/`.
*output_dir:* The path to the directory where you want to save the extracted seed words. Please make sure to use a '/' (slash) at the end. For example: `path/to/output/`.
*seed_words_filename:* The filename for the resulting list of seed words. This must use the **.txt** extension.
*max_df & min_df*: Thresholds that exclude words occurring in too many or too few documents, respectively. Please refer to the [CountVectorizer documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) for details.
*num_words:* The number of words to extract.
%% Cell type:code id: tags:
``` python
input_dir = "../data/texts/"
output_dir = "results/raw/"
seed_words_filename = "seed_words.txt"
max_df = 0.8
min_df = 1
num_words = 3000
```
%% Cell type:markdown id: tags:
## Directory Setup (Optional)
Creates directories according to the configuration if not already created manually.
%% Cell type:code id: tags:
``` python
if not os.path.exists(input_dir):
    os.makedirs(input_dir)
if not os.path.exists(output_dir):
    os.makedirs(output_dir)
```
%% Cell type:markdown id: tags:
## Seed Word Extraction
%% Cell type:markdown id: tags:
### Load texts
%% Cell type:code id: tags:
``` python
text_file_names = glob.glob("{}*.txt".format(input_dir))
print("found {} texts".format(len(text_file_names)))
texts = []
for text_file_name in text_file_names:
    with open(text_file_name, "r", encoding="utf-8") as input_file:
        texts.append(input_file.read())
print("loaded {} texts".format(len(texts)))
```
%% Cell type:markdown id: tags:
### Extract seed words
%% Cell type:code id: tags:
``` python
# Count word occurrences; the token pattern keeps only tokens with at least three non-digit word characters.
cv = CountVectorizer(max_df=max_df, min_df=min_df, token_pattern=r"\b[^\d\W]{3,}\b")
tf_raw = cv.fit_transform(texts)
tf_df = pd.DataFrame(tf_raw.todense(), columns=cv.get_feature_names())
# The num_words most frequent words across all texts become the seed word candidates.
sorted_words = tf_df.sum().sort_values(ascending=False).head(num_words)
```
%% Cell type:markdown id: tags:
### Save seed words
%% Cell type:code id: tags:
``` python
with open("{}{}".format(output_dir, seed_words_filename), "w", encoding="utf-8") as textfile:
    for sw in sorted_words.index:
        textfile.write("{}\n".format(sw))
```
%% Cell type:code id: tags:
``` python
```
%% Cell type:markdown id: tags:
## Imports
%% Cell type:code id: tags:
``` python
import pandas as pd
import os
```
%% Cell type:markdown id: tags:
## Configuration
*seed_words_filename:* The complete path to the input seed words list. For example: `path/to/seed_words/seed_words.txt`.
*output_dir:* The path to the directory where you want to save the created annotation files. Please make sure to use a '/' (slash) at the end. For example: `path/to/output/`.
%% Cell type:code id: tags:
``` python
seed_words_filename = "results/raw/seed_words.txt"
output_dir = "results/annotated/"
```
%% Cell type:markdown id: tags:
## Directory Setup (Optional)
Creates directories according to the configuration if not already created manually.
%% Cell type:code id: tags:
``` python
if not os.path.exists(output_dir):
    os.makedirs(output_dir)
```
%% Cell type:markdown id: tags:
## Seed Word Annotation
%% Cell type:markdown id: tags:
### Load seed words
%% Cell type:code id: tags:
``` python
with open("{}".format(seed_words_filename), "r", encoding="utf-8") as inputfile:
    seed_words = [line.rstrip() for line in inputfile]
print("loaded {} seed words".format(len(seed_words)))
```
%% Cell type:markdown id: tags:
### Create annotation file
%% Cell type:code id: tags:
``` python
print("enter name of annotator: ")
annotator = input()
annotation_df = pd.DataFrame(index=seed_words, columns=["sentiment"])
annotation_df.index.name = "word"
annotation_df.to_csv("{}{}_seed_words.csv".format(output_dir, annotator.lower()))
print("set up annotation file for: {}".format(annotator))
```
%% Cell type:markdown id: tags:
### Annotate seed words
Please open the created annotation files (.csv files) with a spreadsheet program of your choice (e.g., Excel or LibreOffice Calc) and annotate the seed words.
Make sure you use one of the following sentiment classes:
* positive
* negative
* neutral
Example:
| word | sentiment |
| --- | --- |
| good | positive |
| bad | negative |
| house | neutral |
Once you are finished, make sure to save the file using the **.csv** extension.
%% Cell type:code id: tags:
``` python
```
%% Cell type:markdown id: tags:
## Imports
%% Cell type:code id: tags:
``` python
import pandas as pd
import os
import glob
```
%% Cell type:markdown id: tags:
## Configuration
*input_dir:* The path to the directory containing the annotated seed words. Please make sure to use a '/' (slash) at the end. For example: `path/to/annotated/seed_words/`.
*output_dir:* The path to the directory where you want to save the selected seed words. Please make sure to use a '/' (slash) at the end. For example: `path/to/output/`.
%% Cell type:code id: tags:
``` python
input_dir = "results/annotated/"
output_dir = "results/selected/"
```
%% Cell type:markdown id: tags:
## Directory Setup (Optional)
Creates directories according to the configuration if not already created manually.
%% Cell type:code id: tags:
``` python
if not os.path.exists(input_dir):
    os.makedirs(input_dir)
if not os.path.exists(output_dir):
    os.makedirs(output_dir)
```
%% Cell type:markdown id: tags:
## Seed Word Selection
%% Cell type:markdown id: tags:
### Load annotated seed words
%% Cell type:code id: tags:
``` python
annotation_file_names = glob.glob("{}*.csv".format(input_dir))
print("found {} annotations".format(len(annotation_file_names)))
annotations = []
for annotation_file_name in annotation_file_names:
    annotations.append(pd.read_csv(annotation_file_name, index_col="word"))
print("loaded {} annotations".format(len(annotations)))
```
%% Cell type:markdown id: tags:
### Select seed words
This is based on a majority vote.
%% Cell type:code id: tags:
``` python
annotations_df = pd.concat(annotations, axis=1).fillna("neutral")
pos_words = []
neg_words = []
neu_words = []
# mode() yields the most frequent label per word; ties produce more than one value
for w, row in annotations_df.mode(axis=1).iterrows():
    row = row.dropna()
    if len(row) > 1:  # no clear majority among annotators, skip this word
        continue
    if row[0] == "positive":
        pos_words.append(w)
    elif row[0] == "negative":
        neg_words.append(w)
    elif row[0] == "neutral":
        neu_words.append(w)
print("number of positive:", len(pos_words))
print("number of negative:", len(neg_words))
print("number of neutral:", len(neu_words))
```
%% Cell type:markdown id: tags:
### Save selected seed words
%% Cell type:code id: tags:
``` python
with open("{}positive.txt".format(output_dir), mode="wt", encoding="utf-8") as pos_file:
pos_file.write("\n".join(pos_words))
with open("{}negative.txt".format(output_dir), mode="wt", encoding="utf-8") as neg_file:
neg_file.write("\n".join(neg_words))
with open("{}neutral.txt".format(output_dir), mode="wt", encoding="utf-8") as neu_file:
neu_file.write("\n".join(neu_words))
```
%% Cell type:code id: tags:
``` python
```
%% Cell type:markdown id: tags:
## Imports
%% Cell type:code id: tags:
``` python
import glob
import pickle
import re
from gensim.models import Word2Vec
from gensim.models.phrases import Phrases, Phraser
from itertools import product
from stop_words import get_stop_words
from nltk import tokenize
```
%% Cell type:markdown id: tags:
## Configuration
*input_dir:* The path to the directory that contains your text files. Please make sure to use a '/' (slash) at the end. For example: `path/to/texts/`.
*language*: The language of your texts. This is used to select the right list of stop words.
*num_processes*: The number of processes to use. This depends on your hardware. The more cores you can use, the faster the training of the models.
*models_filename:* The filename for the resulting trained models. You may use the **.p** extension indicating a pickled file, but you are free to use whatever you like. Just make sure this is consistent in the evaluation step.
%% Cell type:code id: tags:
``` python
input_dir = "../data/texts/"
language = "french"
num_processes = 2
models_filename = "models.p"
```
%% Cell type:markdown id: tags:
### Grid search parameters
You should provide possible values for the hyperparameters of the word2vec model. Please refer to the [gensim documentation](https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec) for a list of all hyperparameters. The following values serve as an example and may be adjusted to your needs.
%% Cell type:code id: tags:
``` python
vector_sizes = [100, 200, 300]
skip_grams = [0, 1]
hs = [0, 1]
windows = [5, 10]
negatives = [5, 10]
iters = [5, 10]
hyperparameters = list(product(vector_sizes, skip_grams, hs, windows, negatives, iters))
num_hyperparameters = len(hyperparameters)
print("number of hyperparameter combinations:", num_hyperparameters)
```
%% Cell type:markdown id: tags:
## Grid Search
%% Cell type:markdown id: tags:
### Loading texts
%% Cell type:code id: tags:
``` python
text_file_names = glob.glob("{}*.txt".format(input_dir))
print("found {} texts".format(len(text_file_names)))
texts = []
for text_file_name in text_file_names:
    with open(text_file_name, "r", encoding="utf-8") as input_file:
        texts.append(input_file.read())
print("loaded {} texts".format(len(texts)))
combined_text = " ".join(texts)
```
%% Cell type:markdown id: tags:
### Conduct grid search
%% Cell type:code id: tags:
``` python
models = {}
stop_words = get_stop_words(language.lower())
sentences = tokenize.sent_tokenize(combined_text)
reg_exp_tok = tokenize.RegexpTokenizer(r"\w{3,}")
split_sentences = [reg_exp_tok.tokenize(s.lower()) for s in sentences]
split_sentences_wo_sw = []
for s in split_sentences:
    cleaned_tokens = [t for t in s if t not in stop_words]
    if len(cleaned_tokens) > 0:
        split_sentences_wo_sw.append(cleaned_tokens)
for hp in hyperparameters:
    model = Word2Vec(sentences=split_sentences_wo_sw, workers=num_processes, size=hp[0], sg=hp[1], hs=hp[2], window=hp[3], negative=hp[4], iter=hp[5])
    models[hp] = model
```
%% Cell type:markdown id: tags:
### Save models
%% Cell type:code id: tags:
``` python
with open(models_filename, "wb") as handle:
    pickle.dump(models, handle)
```
%% Cell type:code id: tags:
``` python
```
%% Cell type:markdown id: tags:
## Imports
%% Cell type:code id: tags:
``` python
import pickle
from gensim.models import Word2Vec
```
%% Cell type:markdown id: tags:
## Configuration
*models_filename:* The complete path to the pickled word2vec models.
*word_pairs_filename:* The complete path to the list of word pairs used for evaluation. This needs to be a **.csv** file.
*selected_model_filename*: The filename for the best performing model which will be used for the subsequent classification. You may use the **.p** extension indicating a pickled file, but you are free to use whatever you like.
%% Cell type:code id: tags:
``` python
models_filename = "models.p"
word_pairs_filename = "ready_to_use/word_pairs/French.csv"
selected_model_filename = "best_model.p"