{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Imports" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import os\n", "import glob\n", "from sklearn.metrics import balanced_accuracy_score" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Configuration\n", "\n", "*extended_words_dir:* The path to the directory where you saved resulting dictionaries. Please make sure to use a '/' (slash) in the end. For example: path/to/input/.\n", "\n", "*annotated_positive_words_filename:* The complete path to the **.txt** file containing annotated positive evaluation words.\n", "\n", "*annotated_negative_words_filename:* The complete path to the **.txt** file containing annotated negative evaluation words.\n", "\n", "*annotated_neutral_words_filename:* The complete path to the **.txt** file containing annotated neutral evaluation words.\n", "\n", "*num_words_to_annotate:* The number of words to randomly sample from each class (positive, negative, neutral) for annotation. This is only needed if you do not use the ready-to-use ecaluation words.\n", "\n", "*annotated_words_dir:* The path to the directory where you want to save randomly extracted evaluation words as well as selected evaluation words. Please make sure to use a '/' (slash) in the end. For example: path/to/output/." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "extended_words_dir = \"result/\"\n", "annotated_positive_words_filename = \"ready_to_use/French_positive.txt\"\n", "annotated_negative_words_filename = \"ready_to_use/French_negative.txt\"\n", "annotated_neutral_words_filename = \"ready_to_use/French_neutral.txt\"\n", "num_words_to_annotate = 10\n", "annotated_words_dir = \"evaluation_words/\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Directory Setup (Optional)\n", "Creates directories according to the configuration if not already created manually." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "if not os.path.exists(annotated_words_dir):\n", " os.makedirs(annotated_words_dir)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create annotation lists\n", "The following cells are only necessary if you want to create your own evaluation word lists. You can also use the ready-to-use lists and skip to the *Evaluate extended dictionaries* section." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Randomly sample evaluation words" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "with open(\"{}negative.txt\".format(extended_words_dir), \"r\", encoding=\"utf-8\") as fr:\n", " neg = fr.read().splitlines()\n", "with open(\"{}positive.txt\".format(extended_words_dir), \"r\", encoding=\"utf-8\") as fr:\n", " pos = fr.read().splitlines()\n", "with open(\"{}neutral.txt\".format(extended_words_dir), \"r\", encoding=\"utf-8\") as fr:\n", " neu = fr.read().splitlines()\n", "neg_s = pd.Series(index=neg, dtype=\"object\")\n", "pos_s = pd.Series(index=pos, dtype=\"object\")\n", "neu_s = pd.Series(index=neu, dtype=\"object\")\n", "neg_samples = neg_s.sample(num_words_to_annotate)\n", "pos_samples = pos_s.sample(num_words_to_annotate)\n", "neu_samples = neu_s.sample(num_words_to_annotate)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Save evaluation words to .csv" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "evaluation_words_s = pd.concat([neg_samples, pos_samples, neu_samples])\n", "\n", "print(\"enter name of annotator: \")\n", "annotator = input()\n", "\n", "evaluation_words_s.to_csv(\"{}{}_evaluation_words.csv\".format(annotated_words_dir, annotator.lower()), index_label=\"word\", header=[\"sentiment\"])\n", "\n", "print(\"set up annotation file for: {}\".format(annotator))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Annotate seed words\n", "Please open the created annotation files (.csv files) with a spreadsheet program of your choice (e.g., Excel or LibreOffice Calc) and annotate the seed words.\n", "Make sure you use either of the following sentiment classes:\n", "\n", "* positive\n", "* negative\n", "* neutral\n", "\n", "Example:\n", "\n", "| word | sentiment |\n", "| --- | --- |\n", "| good | positive |\n", "| bad | negative |\n", "| house | neutral |\n", "\n", "Once you are finished, make sure to save the file using the **.csv** extension." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Select evaluation words" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "annotation_file_names = glob.glob(\"{}*.csv\".format(annotated_words_dir))\n", "print(\"found {} annotations\".format(len(annotation_file_names)))\n", "annotations = []\n", "for annotation_file_name in annotation_file_names:\n", " annotations.append(pd.read_csv(annotation_file_name, index_col=\"word\"))\n", "print(\"loaded {} annotations\".format(len(annotations)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Select evaluation words\n", "This is similar to the procedure for seed words and based on a majority vote." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "annotations_df = pd.concat(annotations, axis=1).fillna(\"neutral\")\n", "pos_words = []\n", "neg_words = []\n", "neu_words = []\n", "for w, row in annotations_df.mode(axis=1).iterrows():\n", " row = row.dropna()\n", " if len(row) > 1:\n", " continue\n", " if row[0] == \"positive\":\n", " pos_words.append(w)\n", " elif row[0] == \"negative\":\n", " neg_words.append(w)\n", " elif row[0] == \"neutral\":\n", " neu_words.append(w)\n", "print(\"number of positive:\", len(pos_words))\n", "print(\"number of negative:\", len(neg_words))\n", "print(\"number of neutral:\", len(neu_words))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Save selected evaluation words" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "with open(\"{}positive.txt\".format(annotated_words_dir), mode=\"wt\", encoding=\"utf-8\") as pos_file:\n", " pos_file.write(\"\\n\".join(pos_words))\n", "with open(\"{}negative.txt\".format(annotated_words_dir), mode=\"wt\", encoding=\"utf-8\") as neg_file:\n", " neg_file.write(\"\\n\".join(neg_words))\n", "with open(\"{}neutral.txt\".format(annotated_words_dir), mode=\"wt\", encoding=\"utf-8\") as neu_file:\n", " neu_file.write(\"\\n\".join(neu_words))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Evaluate extended dictionaries" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load extended dictionaries" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "with open(\"{}negative.txt\".format(extended_words_dir), \"r\", encoding=\"utf-8\") as fr:\n", " pred_neg = fr.read().splitlines()\n", "with open(\"{}positive.txt\".format(extended_words_dir), \"r\", encoding=\"utf-8\") as fr:\n", " pred_pos = fr.read().splitlines()\n", "with open(\"{}neutral.txt\".format(extended_words_dir), \"r\", encoding=\"utf-8\") as fr:\n", " pred_neu = fr.read().splitlines()\n", "pred_y_neg_s = pd.Series(\"negative\", index=pred_neg)\n", "pred_y_pos_s = pd.Series(\"positive\", index=pred_pos)\n", "pred_y_neu_s = pd.Series(\"neutral\", index=pred_neu)\n", "pred_y_s = pd.concat([pred_y_neg_s, pred_y_pos_s, pred_y_neu_s])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load annotated evaluation words" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "with open(\"{}\".format(annotated_negative_words_filename), \"r\", encoding=\"utf-8\") as fr:\n", " true_neg = fr.read().splitlines()\n", "with open(\"{}\".format(annotated_positive_words_filename), \"r\", encoding=\"utf-8\") as fr:\n", " true_pos = fr.read().splitlines()\n", "with open(\"{}\".format(annotated_neutral_words_filename), \"r\", encoding=\"utf-8\") as fr:\n", " true_neu = fr.read().splitlines()\n", "true_y_neg_s = pd.Series(\"negative\", index=true_neg)\n", "true_y_pos_s = pd.Series(\"positive\", index=true_pos)\n", "true_y_neu_s = pd.Series(\"neutral\", index=true_neu)\n", "true_y_s = pd.concat([true_y_neg_s, true_y_pos_s, true_y_neu_s])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Evaluate\n", "Compute the balanced accuracy score. For three classes (i.e., positive, negative, neutral), the random baseline is 0.33. If the score is higher than that, you are better than random guessing." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pred_y_s = pred_y_s[pred_y_s.index.isin(true_y_s.index)]\n", "pred_y_s.sort_index(inplace=True)\n", "true_y_s.sort_index(inplace=True)\n", "print(\"balanced accuracy score:\", balanced_accuracy_score(true_y_s, pred_y_s))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 4 }