Commit 93ff306b authored by Niels-Oliver Walkowski's avatar Niels-Oliver Walkowski

upd: Make changes requested by editor review results

parent 5294045d
This repository includes the following data:
* _code/dictionary_creation/2_word_embeddings/:_ This directory contains the Jupyter Notebooks to conduct the second step (training and evaluation of word embeddings) of the dictionary creation pipeline.
* _code/dictionary_creation/3_classification/:_ This directory contains the Jupyter Notebooks to conduct the third step (training and evaluation of classifiers) of the dictionary creation pipeline.
* _code/sentiment_analysis/:_ This directory contains the Jupyter Notebooks to conduct the actual sentiment analysis.
* _data/processed_data/classifier_evaluation/:_ This directory contains the annotated words used for the classifier evaluation.
* _data/processed_data/dictionaries/:_ This directory contains ready-to-use sentiment dictionaries for 18<sup>th</sup> century French, Italian and Spanish.
* _data/processed_data/example_texts/:_ This directory contains pickled pandas DataFrames with example texts in French, Italian and Spanish.
* _data/processed_data/seed_words/:_ This directory contains the selected seed words used for the classification task.
* _data/processed_data/word_pairs/:_ This directory contains the lists of word pairs used for the evaluation of word2vec models.
* _data/processed_data/word2vec_models/:_ This directory contains the pretrained word2vec models.
* _data/raw_data/example_texts/:_ This directory contains a ZIP archive with plain text files used to create the pandas DataFrames containing example texts.
* _miscellaneous/:_ This directory contains additional material, such as images, used for illustration purposes.
* _publication.ipynb:_ This Jupyter Notebook represents the main publication in which we explain the proposed method in detail and provide illustrative examples.
* _requirements.txt:_ This file lists all Python packages necessary to run the tool chain (typically used in conjunction with `pip`).
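The example texts in _data/processed_data/example_texts/_ are stored as pickled pandas DataFrames. A minimal sketch of that round trip (the column names and file name below are illustrative, not the repository's actual schema):

```python
import pandas as pd

# Tiny stand-in for the example-text DataFrames; the real files in
# data/processed_data/example_texts/ use their own column layout.
df = pd.DataFrame(
    {"title": ["Candide"], "text": ["Il y avait en Vestphalie..."]}
)
df.to_pickle("example_texts.pkl")

# Reading a pickled DataFrame back is a single call.
loaded = pd.read_pickle("example_texts.pkl")
print(loaded.loc[0, "title"])  # -> Candide
```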
We recommend installing [Anaconda](https://www.anaconda.com/), as it comes bundled with most of the required packages.
## Python
The Jupyter Notebooks presented in this repository require Python 3.8.5.
If you simply want to use our dictionaries and Jupyter Notebooks to analyze the sentiment of your texts, you need the following additional Python packages:
* pandas 1.1.3
Note that we tested our Jupyter Notebooks with the versions stated above.
While older or newer versions may work, the results may differ.
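Because the notebooks were tested against pinned versions, it can help to check what is installed before running them. A hedged sketch using only the `pandas` pin listed above; extend `required` with the remaining entries from `requirements.txt`:

```python
from importlib import metadata

# Pinned versions to check; only the pandas pin from the list above
# is included here -- add the rest of requirements.txt as needed.
required = {"pandas": "1.1.3"}

for package, wanted in required.items():
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        print(f"{package} is missing (want {wanted})")
        continue
    status = "ok" if installed == wanted else f"differs (want {wanted})"
    print(f"{package} {installed}: {status}")
```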
## Dataset
If you want to create dictionaries based on your own data, make sure that you have a sufficiently large corpus.
The more text, the better the output.
Also, make sure that you have cleaned your data and that each document is contained in a single *.txt* file with UTF-8 encoding.
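A quick way to spot files that violate the UTF-8 requirement is to try decoding each one. This sketch builds a tiny throwaway corpus for illustration; the `corpus/` path and file names are assumptions:

```python
from pathlib import Path

# Build a small illustrative corpus: one valid UTF-8 file and one
# Latin-1 file that should be flagged.
corpus_dir = Path("corpus")
corpus_dir.mkdir(exist_ok=True)
(corpus_dir / "doc1.txt").write_text("Les philosophes des Lumières", encoding="utf-8")
(corpus_dir / "doc2.txt").write_bytes(b"caf\xe9")  # Latin-1, not valid UTF-8

# Collect every *.txt file that fails to decode as UTF-8.
bad = []
for path in sorted(corpus_dir.glob("*.txt")):
    try:
        path.read_text(encoding="utf-8")
    except UnicodeDecodeError:
        bad.append(path.name)

print(bad)  # -> ['doc2.txt']
```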
This dataset contains multiple languages, but we set our focus on French, Italian and Spanish.
For this purpose, we extracted texts from TEI encoded files into plain *.txt* files.
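Extracting plain text from TEI can be sketched with the standard library. Real TEI files are far richer, so the inline document and the `<p>`-only selection below are simplifying assumptions, not our actual extraction pipeline:

```python
import xml.etree.ElementTree as ET

# Minimal inline TEI document for illustration only.
tei = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body><p>Premier paragraphe.</p><p>Second paragraphe.</p></body></text>
</TEI>"""

# TEI elements live in a namespace, which must be given explicitly.
ns = {"tei": "http://www.tei-c.org/ns/1.0"}
root = ET.fromstring(tei)

# Keep only paragraph text and join it into a plain-text string.
paragraphs = [p.text for p in root.findall(".//tei:p", ns)]
plain = "\n".join(paragraphs)
print(plain)
```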
## Hardware
Please keep in mind that your machine needs adequate hardware depending on the amount of text you want to consider.
This is especially important for the dictionary creation tool chain (e.g., we used a machine with 24 cores and 750 GB RAM and computations still took up to three days).
If you just want to analyze sentiment using existing dictionaries, a computer with common hardware should suffice.