Commit 5d7a1b1e authored by Philipp Koncar

initial commit

# A Sentiment Analysis Tool Chain for 18<sup>th</sup> Century Periodicals
# Abstract
Sentiment analysis is a common task in natural language processing (NLP) and aims for the automatic and computational identification of emotions, attitudes and opinions expressed in textual data.
While sentiment analysis is typically tailored for and widely used in the context of Web data, its application to literary texts remains challenging due to the lack of methods dedicated to languages other than English and to texts from earlier periods.
With the work we present here, we not only introduce new sentiment dictionaries for French, Italian and Spanish periodicals of the 18<sup>th</sup> century, but also build a freely and publicly available tool chain based on Jupyter Notebooks, enabling researchers to apply our dictionary creation process and sentiment analysis methods to their own material and projects.
The proposed tool chain comprises two different parts: (i) the optional creation of sentiment dictionaries and (ii) the actual sentiment analysis.
# Contents
This repository includes the following data:
* _code/dictionary_creation/1_seed_words/:_ This directory contains the Jupyter Notebooks to conduct the first step (extraction, annotation and selection of seed words) of the dictionary creation pipeline.
* _code/dictionary_creation/2_word_embeddings/:_ This directory contains the Jupyter Notebooks to conduct the second step (training and evaluation of word embeddings) of the dictionary creation pipeline.
* _code/dictionary_creation/3_classification/:_ This directory contains the Jupyter Notebooks to conduct the third step (training and evaluation of classifiers) of the dictionary creation pipeline.
* _code/sentiment_analysis/:_ This directory contains the Jupyter Notebooks to conduct the actual sentiment analysis.
* _data/classifier_evaluation/:_ This directory contains the annotated words used for the classifier evaluation.
* _data/dictionaries/:_ This directory contains ready-to-use sentiment dictionaries for 18<sup>th</sup> century French, Italian and Spanish.
* _data/example_texts/:_ This directory contains pickled pandas DataFrames with example texts in French, Italian and Spanish.
* _data/seed_words/:_ This directory contains the selected seed words used for the classification task.
* _data/word2vec/:_ This directory contains the best performing word2vec models as well as the lists of word pairs used for their evaluation.
* _miscellaneous/:_ This directory contains additional material, such as images, used for illustration purposes.
* _publication.ipynb:_ This Jupyter Notebook represents the main publication in which we explain the proposed method in detail and provide illustrative examples.
* _requirements.txt:_ This file lists all Python packages necessary to run the tool chain (typically used in conjunction with `pip`).
# Configuration
We recommend installing [Anaconda](https://www.anaconda.com/), as it comes pre-bundled with most of the required packages.
## Python
If you simply want to use our dictionaries and Jupyter Notebooks to analyze the sentiment of your texts, you need to have the following additional Python packages installed:
* pandas 1.1.3
* Matplotlib 3.2.0
* Seaborn 0.11.0
* tqdm 4.50.2
* nltk 3.5
* Jupyter Notebook 6.1.4 or Jupyter Lab 2.2.6
* ipywidgets 7.5.1
In order to create dictionaries yourself, you need to have the following additional Python packages installed:
* pandas 1.1.3
* gensim 3.8.3 (needs to be [installed separately](https://anaconda.org/anaconda/gensim); it is best to use `pip install gensim` in an Anaconda prompt, as the conda package is not up-to-date and cannot be installed with Anaconda running Python 3.8)
* sklearn 0.23.2
* nltk 3.5
* spacy 2.2.3 (needs to be [installed separately](https://anaconda.org/conda-forge/spacy))
* stop-words 2018.7.23 (needs to be [installed separately](https://anaconda.org/conda-forge/stop-words))
* Jupyter Notebook 6.1.4 or Jupyter Lab 2.2.6
* ipywidgets 7.5.1
Note that we tested our Jupyter Notebooks with the versions stated above.
While older and newer versions may work, the outcome may be impaired.
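If you are unsure which versions are installed in your environment, a quick check along the following lines can help (a minimal sketch; adjust the package list to the setup you need):

```python
# Print the installed version of each package used by the tool chain.
import importlib

for package in ["pandas", "matplotlib", "seaborn", "tqdm", "nltk", "sklearn", "gensim", "spacy", "ipywidgets"]:
    try:
        module = importlib.import_module(package)
        print(package, getattr(module, "__version__", "unknown"))
    except ImportError:
        print(package, "is not installed")
```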
## Dataset
If you want to create dictionaries based on your own data, make sure that you have a decent amount of text.
The more text, the better the output.
Also, make sure that you have cleaned your data and that each document is contained in a single *.txt* file with UTF-8 encoding.
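As a quick sanity check before running the notebooks, you can verify that your corpus directory actually contains UTF-8 encoded *.txt* files, for example with the following minimal sketch (the directory path is only a placeholder):

```python
# Verify that every file in the corpus directory is a UTF-8 encoded .txt file.
import glob

corpus_dir = "path/to/texts/"  # placeholder: replace with your corpus directory
text_file_names = glob.glob(corpus_dir + "*.txt")
print("found {} text files".format(len(text_file_names)))
for text_file_name in text_file_names:
    try:
        with open(text_file_name, "r", encoding="utf-8") as input_file:
            input_file.read()
    except UnicodeDecodeError:
        print("not UTF-8 encoded:", text_file_name)
```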
Our ready-to-use dictionaries and models are based on Spectator periodicals published during the 18<sup>th</sup> century.
In particular, we leverage **The Spectators in the international context**, a digital scholarly edition project which aims at building a central repository for spectator periodicals (Ertler et al. 2011, Scholger 2018).
The annotated periodicals follow the XML-based Text Encoding Initiative (TEI) standard (TEI Consortium 2020), which provides a vocabulary for representing texts in digital form, and are publicly available through the [digital edition](https://gams.uni-graz.at/spectators).
This dataset contains multiple languages, but we set our focus on French, Italian and Spanish, as these three languages have the largest collections.
For this purpose, we extracted texts from TEI encoded files into plain *.txt* files.
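The extraction step itself is not part of the tool chain; the following minimal sketch illustrates how such a TEI-to-plain-text conversion could look with Python's standard library (the directory paths and the restriction to the TEI `<text>` element are illustrative assumptions, not the exact script we used):

```python
# Extract the plain text content of the TEI <text> element from each XML file.
import glob
import os
from xml.etree import ElementTree as ET

TEI_NS = "{http://www.tei-c.org/ns/1.0}"   # default TEI namespace
tei_dir = "path/to/tei/"                   # placeholder: directory with TEI encoded files
output_dir = "path/to/texts/"              # placeholder: directory for plain text output
os.makedirs(output_dir, exist_ok=True)

for tei_file in glob.glob(tei_dir + "*.xml"):
    root = ET.parse(tei_file).getroot()
    text_node = root.find(".//" + TEI_NS + "text")
    if text_node is None:
        continue
    # Join all text fragments and normalize whitespace.
    plain_text = " ".join("".join(text_node.itertext()).split())
    txt_name = os.path.splitext(os.path.basename(tei_file))[0] + ".txt"
    with open(os.path.join(output_dir, txt_name), "w", encoding="utf-8") as output_file:
        output_file.write(plain_text)
```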
## Hardware
Please keep in mind that your machine needs adequate hardware depending on the amount of text you want to consider.
This is especially important for the dictionary creation tool chain (e.g., we used a machine with 24 cores and 750 GB RAM and computations still took up to three days).
If you just want to analyze sentiment using existing dictionaries, a computer with common hardware should suffice.
# Contributors
name: Philipp Koncar
orcid: 0000-0001-5492-0644
institution: Institute of Interactive Systems and Data Science, Graz University of Technology
e-mail: philipp.koncar@tugraz.at
address: Inffeldgasse 16c, 8010, Graz, Austria
name: Christina Glatz
orcid:
institution: Institute of Romance Studies, University of Graz
e-mail: christina.glatz@uni-graz.at
address: Merangasse 70, 8010 Graz, Austria
name: Elisabeth Hobisch
orcid: 0000-0002-6051-4500
institution: Institute of Romance Studies, University of Graz
e-mail: elisabeth.hobisch@uni-graz.at
address: Merangasse 70, 8010 Graz, Austria
name: Yvonne Völkl
orcid: 0000-0001-8625-3663
institution: Institute of Romance Studies, University of Graz
e-mail: yvonne.voelkl@uni-graz.at
address: Merangasse 70, 8010 Graz, Austria
name: Bernhard C. Geiger
orcid: 0000-0003-3257-743X
institution: Know-Center GmbH
e-mail: geiger@ieee.org
address: Inffeldgasse 13, 8010, Graz, Austria
name: Sanja Sarić
orcid: 0000-0002-0802-6999
institution: Centre for Information Modelling - Austrian Centre for Digital Humanities, University of Graz
e-mail: sanja.saric@uni-graz.at
address: Elisabethstraße 59/III, 8010 Graz, Austria
name: Martina Scholger
orcid: 0000-0003-1438-3236
institution: Centre for Information Modelling - Austrian Centre for Digital Humanities, University of Graz
e-mail: martina.scholger@uni-graz.at
address: Elisabethstraße 59/III, 8010 Graz, Austria
name: Denis Helic
orcid: 0000-0003-0725-7450
institution: Institute of Interactive Systems and Data Science, Graz University of Technology
e-mail: dhelic@tugraz.at
address: Inffeldgasse 16c, 8010, Graz, Austria
# References
TEI Consortium (2020) Guidelines for Electronic Text Encoding and Interchange. http://www.tei-c.org/P5/. Accessed 16 Feb 2021.
Ertler, K.-D., Fuchs A., Fischer-Pernkopf M., Hobisch E., Scholger M., Völkl Y. (2011) The Spectators in the international context. https://gams.uni-graz.at/spectators. Accessed 16 Feb 2021.
Scholger, M. (2018) "Spectators" in the International Context - A Digital Scholarly Edition. In: Discourses on Economy in the Spectators, 229–247. Verlag Dr. Kovac, Hamburg.
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Imports"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import os\n",
"import glob\n",
"from sklearn.feature_extraction.text import CountVectorizer"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Configuration\n",
"\n",
"*input_dir:* The path to the directory that contains your text files. Please make sure to use a '/' (slash) in the end. For example: `path/to/texts/`.\n",
"\n",
"*output_dir:* The path to the directory where you want to save extracted seed words. Please make sure to use a '/' (slash) in the end. For example: `path/to/output/`.\n",
"\n",
"*seed_words_filename:* The filename for the resulting list of seed words. This must use the **.txt** extension.\n",
"\n",
"*max_df & min_df*: Please refer to the [CountVectorizer documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) for these parameters.\n",
"\n",
"*num_words:* The number of words to extract."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"input_dir = \"../data/texts/\"\n",
"output_dir = \"results/raw/\"\n",
"seed_words_filename = \"seed_words.txt\"\n",
"max_df = 0.8\n",
"min_df = 1\n",
"num_words = 3000"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Directory Setup (Optional)\n",
"Creates directories according to the configuration if not already created manually."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"if not os.path.exists(input_dir):\n",
" os.makedirs(input_dir)\n",
"if not os.path.exists(output_dir):\n",
" os.makedirs(output_dir)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Seed Word Extraction"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Load texts"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"text_file_names = glob.glob(\"{}*.txt\".format(input_dir))\n",
"print(\"found {} texts\".format(len(text_file_names)))\n",
"texts = []\n",
"for text_file_name in text_file_names:\n",
" with open(text_file_name, \"r\", encoding=\"utf-8\") as input_file:\n",
" texts.append(input_file.read())\n",
"print(\"loaded {} texts\".format(len(texts)))"
]
},
{
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Imports"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import os"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Configuration\n",
"\n",
"*seed_words_filename:* The complete path to the input seed words list. For example: `path/to/seed_words/seed_words.txt`.\n",
"\n",
"*output_dir:* The path to the directory where you want to save the created annotation files. Please make sure to use a '/' (slash) in the end. For example: `path/to/output/`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"seed_words_filename = \"results/raw/seed_words.txt\"\n",
"output_dir = \"results/annotated/\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Directory Setup (Optional)\n",
"Creates directories according to the configuration if not already created manually."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"if not os.path.exists(output_dir):\n",
" os.makedirs(output_dir)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Seed Word Annotation"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Load seed words"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"with open(\"{}\".format(seed_words_filename), \"r\", encoding=\"utf-8\") as inputfile:\n",
" seed_words = [line.rstrip() for line in inputfile]\n",
"print(\"loaded {} seed words\".format(len(seed_words)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create annotation file"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(\"enter name of annotator: \")\n",
"annotator = input()\n",
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Imports"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import os\n",
"import glob"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Configuration\n",
"\n",
"*input_dir:* The path to the directory containg annoted seed words. Please make sure to use a '/' (slash) in the end. For example: `path/to/annotated/seed_words/`.\n",
"\n",
"*output_dir:* The path to the directory where you want to save selected seed words. Please make sure to use a '/' (slash) in the end. For example: `path/to/output/`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"input_dir = \"results/annotated/\"\n",
"output_dir = \"results/selected/\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Directory Setup (Optional)\n",
"Creates directories according to the configuration if not already created manually."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"if not os.path.exists(input_dir):\n",
" os.makedirs(input_dir)\n",
"if not os.path.exists(output_dir):\n",
" os.makedirs(output_dir)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Seed Word Selection"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Load annoteted seed words"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"annotation_file_names = glob.glob(\"{}*.csv\".format(input_dir))\n",
"print(\"found {} annotations\".format(len(annotation_file_names)))\n",
"annotations = []\n",
"for annotation_file_name in annotation_file_names:\n",
" annotations.append(pd.read_csv(annotation_file_name, index_col=\"word\"))\n",
"print(\"loaded {} annotations\".format(len(annotations)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Select seed words\n",
"This is based on a majority vote."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"annotations_df = pd.concat(annotations, axis=1).fillna(\"neutral\")\n",
"pos_words = []\n",
"neg_words = []\n",
"neu_words = []\n",
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Imports"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import glob\n",
"import pickle\n",
"import re\n",
"from gensim.models import Word2Vec\n",
"from gensim.models.phrases import Phrases, Phraser\n",
"from itertools import product\n",
"from stop_words import get_stop_words\n",
"from nltk import tokenize"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Configuration\n",
"\n",
"*input_dir:* The path to the directory that contains your text files. Please make sure to use a '/' (slash) in the end. For example: path/to/texts/.\n",
"\n",
"*language*: The language of your texts. This is used to get the right list of stops words.\n",
"\n",
"*num_processes*: The number of processes to use. This depends on your hardware. The more cores you can use, the faster the training of the models.\n",
"\n",
"*models_filename:* The filename for the resulting trained models. You may use the **.p** extension indicating a pickled file, but you are free to use whatever you like. Just make sure this is consistent in the evaluation step."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"input_dir = \"../data/texts/\"\n",
"language = \"french\"\n",
"num_processes = 2\n",
"models_filename = \"models.p\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Grid search parameters\n",
"You should provide possible values for hyperparameters of the word2vec model. Please refer to the [gensim documentation](https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec) to see a list of all hyperparameter. The following values serve as an example and may be adjusted to your needs."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"vector_sizes = [100, 200, 300]\n",
"skip_grams = [0, 1]\n",
"hs = [0, 1]\n",
"windows = [5, 10]\n",
"negatives = [5, 10]\n",
"iters = [5, 10]\n",
"\n",
"hyperparameters = list(product(vector_sizes, skip_grams, hs, windows, negatives, iters))\n",
"num_hyperparameters = len(hyperparameters)\n",
"print(\"number of hyperparameter combinations:\", num_hyperparameters)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Gird Search"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Loading texts"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"text_file_names = glob.glob(\"{}*.txt\".format(input_dir))\n",
"print(\"found {} texts\".format(len(text_file_names)))\n",
"texts = []\n",
"for text_file_name in text_file_names:\n",
" with open(text_file_name, \"r\", encoding=\"utf-8\") as input_file:\n",
" texts.append(input_file.read())\n",
"print(\"loaded {} texts\".format(len(texts)))\n",
"combined_text = \" \".join(texts) "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Conduct grid search"
]
},
{
"cell_type": "code",
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Imports"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pickle\n",
"from gensim.models import Word2Vec"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Configuration\n",
"\n",
"*models_filename:* The complete path to the pickeld word2vec models.\n",
"\n",
"*word_pairs_filename:* The complete path to the list of word pairs used for evaluation. This needs to be a **.csv** file.\n",
"\n",
"*selected_model_filename*: The filename for the best performing model which will be used for the subsequent classification. You may use the **.p** extension indicating a pickled file, but you are free to use whatever you like."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"models_filename = \"models.p\"\n",
"word_pairs_filename = \"ready_to_use/word_pairs/French.csv\"\n",
"selected_model_filename = \"best_model.p\""
]
},