{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Imports" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import nltk\n", "import os\n", "import re\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "from tqdm.auto import tqdm\n", "from scipy.stats import ks_2samp\n", "from IPython.display import display, HTML" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Executing the following cell is mandatory: it registers `progress_apply`, which adds progress bars to pandas computations." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "tqdm.pandas()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you have not yet downloaded the Punkt tokenizer models that `nltk.word_tokenize` relies on, please do so by executing the following cell. You only need to do this once!" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "nltk.download('punkt')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Configuration\n", "*language:* The language of your texts and the dictionaries you want to use.\n", "\n", "*dictionary_dir:* The path to the directory containing the dictionaries. If you use the default directory structure, you can use `dictionaries/manual/` for dictionaries containing only manually annotated words and `dictionaries/computational/` for dictionaries containing manually annotated and automatically extended words. Please make sure to use a '/' (slash) at the end. For example: `path/to/dictionaries/`.\n", "\n", "*dataframe_filename:* The full path to the file containing the pandas DataFrame created in the previous step. You may use the **.p** extension to indicate a pickled file.\n", "\n", "*text_results_dir:* The path to the directory used for text outputs. Please make sure to use a '/' (slash) at the end. 
For example: `path/to/output/`.\n", "\n", "*plot_results_dir:* The path to the directory used for plot outputs. Please make sure to use a '/' (slash) at the end. For example: `path/to/output/`.\n", "\n", "*plot_file_format:* The file format to use for your plots. Matplotlib typically supports `png`, `pdf`, `ps`, `eps` and `svg`.\n", "\n", "*output_dataframe_filename:* The full path to the file in which the resulting pandas DataFrame, including the computed sentiment, is stored. You may use the **.p** extension to indicate a pickled file." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "language = \"italian\"\n", "dictionary_dir = \"dictionaries/manual/\"\n", "dataframe_filename = \"texts_italian.p\"\n", "text_results_dir = \"results/texts/italian/\"\n", "plot_results_dir = \"results/plots/italian/\"\n", "plot_file_format = \"pdf\"\n", "output_dataframe_filename = \"texts_with_sentiment.p\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Directory Setup (Optional)\n", "Creates the directories named in the configuration if they do not already exist." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "if not os.path.exists(dictionary_dir):\n", "    os.makedirs(dictionary_dir)\n", "if not os.path.exists(text_results_dir):\n", "    os.makedirs(text_results_dir)\n", "if not os.path.exists(plot_results_dir):\n", "    os.makedirs(plot_results_dir)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Sentiment Dictionary" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sentiment_dict = {}\n", "with open(\"{}{}_negative.txt\".format(dictionary_dir, language.lower()), \"r\", encoding=\"utf-8\") as fr:\n", "    sentiment_dict[\"neg\"] = fr.read().splitlines()\n", "with open(\"{}{}_positive.txt\".format(dictionary_dir, language.lower()), \"r\", encoding=\"utf-8\") as fr:\n", "    sentiment_dict[\"pos\"] = fr.read().splitlines()\n", "\n", "print(\"loaded {} negative words\".format(len(sentiment_dict[\"neg\"])))\n", "print(\"loaded {} positive words\".format(len(sentiment_dict[\"pos\"])))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Text DataFrame" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "texts_df = pd.read_pickle(dataframe_filename)\n", "print(\"loaded dataframe with {} texts and {} attributes\".format(texts_df.shape[0], texts_df.shape[1] - 1))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Compute sentiment\n", "This function computes the sentiment of a text as the number of positive words minus the number of negative words, divided by their sum. The score lies between -1 and 1 and is 0 if the text contains no sentiment words:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def compute_sentiment(text):\n", "    tokens = nltk.word_tokenize(text)\n", "    tokens = [t.lower() for t in tokens]\n", "    num_negative = 0\n", "    num_positive = 0\n", "    for nw in sentiment_dict[\"neg\"]:\n", "        num_negative += tokens.count(nw.lower())\n", "    for pw in sentiment_dict[\"pos\"]:\n", "        num_positive += tokens.count(pw.lower())\n", "    try:\n", "        sentiment_score = (num_positive - num_negative) / (num_positive + num_negative)\n", "    except ZeroDivisionError:\n", "        sentiment_score = 0\n", "    return sentiment_score" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Apply `compute_sentiment` to each text cell in the DataFrame:" 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "texts_df[\"sentiment\"] = texts_df[\"text\"].progress_apply(compute_sentiment)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Save the DataFrame including the computed sentiment for later use:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "texts_df.to_pickle(output_dataframe_filename)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Sentiment Analysis" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Overall Sentiment Statistics" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This prints descriptive statistics of the sentiment scores:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "texts_df[\"sentiment\"].describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Histogram Plot\n", "Note that all plots are also individually saved to `plot_results_dir`." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "texts_df[\"sentiment\"].plot(kind=\"hist\", bins=10)\n", "plt.tight_layout()\n", "plt.savefig(\"{}sentiment_histogram.{}\".format(plot_results_dir, plot_file_format))\n", "plt.show()\n", "plt.close()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Attribute Plots\n", "Set the attribute you want to analyze:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "attribute = \"issue number\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Bar Plot" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "ax = texts_df.groupby(attribute)[\"sentiment\"].mean().plot(kind=\"bar\", ylabel=\"Sentiment Score\")\n", "plt.tight_layout()\n", "plt.savefig(\"{}{}_bar_plot.{}\".format(plot_results_dir, attribute, plot_file_format))\n", "plt.show()\n", "plt.close()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Box Plot" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "texts_df.boxplot(\"sentiment\", by=attribute)\n", "plt.tight_layout()\n", "plt.savefig(\"{}{}_box_plot.{}\".format(plot_results_dir, attribute, plot_file_format))\n", "plt.show()\n", "plt.close()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Line Plot" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "texts_df.groupby(attribute)[\"sentiment\"].mean().plot(kind=\"line\", ylabel=\"Sentiment Score\")\n", "plt.grid()\n", "plt.tight_layout()\n", "plt.savefig(\"{}{}_line_plot.{}\".format(plot_results_dir, attribute, plot_file_format))\n", "plt.show()\n", "plt.close()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Seaborn Lineplot\n", "Seaborn `lineplot` 
([documentation](https://seaborn.pydata.org/generated/seaborn.lineplot.html)) lets you separate attributes very easily using the `hue` parameter. Please change it to the name of the column you want to group your data by." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "sns.lineplot(data=texts_df, x=attribute, y=\"sentiment\", hue=\"periodical title\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Individual Text Analysis\n", "Set the filename of the individual text you want to analyze:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "filename = \"mws-099-374.txt\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Sentiment Development" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The next cell automatically partitions the text into chunks and computes the sentiment separately for each chunk.\n", "Please set `num_chunks` to the number of chunks you want to analyze." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "num_chunks = 10\n", "tokens = nltk.word_tokenize(texts_df.loc[filename, \"text\"])\n", "tokens = [token.lower() for token in tokens if token.isalpha()]\n", "\n", "chunks = np.array_split(tokens, num_chunks)\n", "chunks = [list(c) for c in chunks]\n", "\n", "distant_results = []\n", "for c in chunks:\n", "    num_positive = 0\n", "    num_negative = 0\n", "    for nw in sentiment_dict[\"neg\"]:\n", "        num_negative += c.count(nw.lower())\n", "    for pw in sentiment_dict[\"pos\"]:\n", "        num_positive += c.count(pw.lower())\n", "    try:\n", "        sentiment_score = (num_positive - num_negative) / (num_positive + num_negative)\n", "    except ZeroDivisionError:\n", "        sentiment_score = 0\n", "    distant_results.append({\"num_positive\": num_positive, \"num_negative\": num_negative, \"sentiment_score\": sentiment_score})\n", "\n", "distant_results_df = pd.DataFrame(distant_results)\n", "\n", "lst_pal = {\"num_positive\": \"g-\", \"num_negative\": \"r-\", \"sentiment_score\": \"k--\"}\n", "\n", "ax1 = distant_results_df[[\"num_positive\", \"num_negative\"]].plot(style=lst_pal, legend=False)\n", "ax1.set_xlabel(\"Text Progress ->\")\n", "ax1.set_ylabel(\"Number of Words\")\n", "\n", "ax2 = distant_results_df[\"sentiment_score\"].plot(secondary_y=True, style=lst_pal)\n", "ax2.set_ylabel(\"Sentiment Score\")\n", "\n", "ax1.grid()\n", "ax2.grid(linestyle=\"--\")\n", "\n", "handles, labels = [], []\n", "for ax in [ax1, ax2]:\n", "    for h, l in zip(*ax.get_legend_handles_labels()):\n", "        handles.append(h)\n", "        labels.append(l)\n", "\n", "plt.legend(handles, labels)\n", "plt.tight_layout()\n", "plt.savefig(\"{}{}_distant.{}\".format(plot_results_dir, filename.replace(\".\", \"_\"), plot_file_format))\n", "plt.show()\n", "plt.close()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Sentiment Words Highlighting\n", "The next cell highlights all the words conveying a sentiment in 
your individual text file. This is also saved to `text_results_dir` as an *.html* file for later use." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": false }, "outputs": [], "source": [ "# Assumed styling: negative words in red, positive words in green.\n", "text_to_print = texts_df.loc[filename, \"text\"]\n", "for nw in sentiment_dict[\"neg\"]:\n", "    if nw.lower() in text_to_print.lower() and nw not in [\"span\", \"style\", \"color\", \"font\", \"size\"]:\n", "        text_to_print = re.sub(r\"\\b{}\\b\".format(re.escape(nw)), r\"<span style='color: red'>{}</span>\".format(nw), text_to_print)\n", "\n", "for pw in sentiment_dict[\"pos\"]:\n", "    if pw.lower() in text_to_print.lower() and pw not in [\"span\", \"style\", \"color\", \"font\", \"size\"]:\n", "        text_to_print = re.sub(r\"\\b{}\\b\".format(re.escape(pw)), r\"<span style='color: green'>{}</span>\".format(pw), text_to_print)\n", "\n", "with open(\"{}{}.html\".format(text_results_dir, \".\".join(filename.split(\".\")[:-1])), \"w\", encoding=\"utf-8\") as outfile:\n", "    outfile.write(text_to_print)\n", "HTML(text_to_print)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Hypothesis Test Example (with Italian examples)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*La Frusta Letteraria di Aristarco Scannabue* is more scientific, while *La donna galante ed erudita* focuses on cultural issues, such as theatre or interpersonal relations. Thus, we would expect the sentiment in the two periodicals to differ, because the first may be more factual and the latter more emotional.\n", "\n", "The following two-sample Kolmogorov-Smirnov test has the null hypothesis that the two samples are drawn from the same distribution (i.e., the sentiment distribution is identical for both periodicals). If the resulting p-value is smaller than the chosen significance level (for example, the commonly used 0.05), we can reject the null hypothesis. This would suggest that the sentiment distributions of the two periodicals differ significantly. 
We now compare the two Italian periodicals using `ks_2samp` ([documentation](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ks_2samp.html)) from scipy. Please make sure that you have set `language = \"italian\"` and loaded the correct texts and sentiment dictionaries before executing this cell. First, we separate the two distributions:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "la_frusta = texts_df[texts_df[\"periodical title\"] == \"La Frusta Letteraria di Aristarco Scannabue\"][\"sentiment\"]\n", "la_donna = texts_df[texts_df[\"periodical title\"] == \"La donna galante ed erudita\"][\"sentiment\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And then we conduct our hypothesis test:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "statistic, pvalue = ks_2samp(la_frusta, la_donna)\n", "print(\"pvalue:\", pvalue)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since the p-value of 0.55 is greater than 0.05 (our significance level), we cannot reject the null hypothesis; thus, the difference in sentiment between the two periodicals is not statistically significant." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 4 }