{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Imports" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import glob\n", "import pickle\n", "import re\n", "from gensim.models import Word2Vec\n", "from gensim.models.phrases import Phrases, Phraser\n", "from itertools import product\n", "from stop_words import get_stop_words\n", "from nltk import tokenize" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Configuration\n", "\n", "*input_dir:* The path to the directory that contains your text files. Please make sure to use a '/' (slash) in the end. For example: path/to/texts/.\n", "\n", "*language*: The language of your texts. This is used to get the right list of stops words.\n", "\n", "*num_processes*: The number of processes to use. This depends on your hardware. The more cores you can use, the faster the training of the models.\n", "\n", "*models_filename:* The filename for the resulting trained models. You may use the **.p** extension indicating a pickled file, but you are free to use whatever you like. Just make sure this is consistent in the evaluation step." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "input_dir = \"../data/texts/\"\n", "language = \"french\"\n", "num_processes = 2\n", "models_filename = \"models.p\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Grid search parameters\n", "You should provide possible values for hyperparameters of the word2vec model. Please refer to the [gensim documentation](https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec) to see a list of all hyperparameter. The following values serve as an example and may be adjusted to your needs." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "vector_sizes = [100, 200, 300]\n", "skip_grams = [0, 1]\n", "hs = [0, 1]\n", "windows = [5, 10]\n", "negatives = [5, 10]\n", "iters = [5, 10]\n", "\n", "hyperparameters = list(product(vector_sizes, skip_grams, hs, windows, negatives, iters))\n", "num_hyperparameters = len(hyperparameters)\n", "print(\"number of hyperparameter combinations:\", num_hyperparameters)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Gird Search" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Loading texts" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "text_file_names = glob.glob(\"{}*.txt\".format(input_dir))\n", "print(\"found {} texts\".format(len(text_file_names)))\n", "texts = []\n", "for text_file_name in text_file_names:\n", " with open(text_file_name, \"r\", encoding=\"utf-8\") as input_file:\n", " texts.append(input_file.read())\n", "print(\"loaded {} texts\".format(len(texts)))\n", "combined_text = \" \".join(texts) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Conduct grid search" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "models = {}\n", "stop_words = get_stop_words(language.lower())\n", "sentences = tokenize.sent_tokenize(combined_text)\n", "reg_exp_tok = tokenize.RegexpTokenizer(r\"\\w{3,}\")\n", "split_sentences = [reg_exp_tok.tokenize(s.lower()) for s in sentences]\n", "split_sentences_wo_sw = []\n", "for s in split_sentences:\n", " cleaned_tokens = [t for t in s if t not in stop_words]\n", " if len(cleaned_tokens) > 0:\n", " split_sentences_wo_sw.append(cleaned_tokens)\n", "for hp in hyperparameters:\n", " model = Word2Vec(sentences=split_sentences_wo_sw, workers=num_processes, size=hp[0], sg=hp[1], hs=hp[2], window=hp[3], negative=hp[4], iter=hp[5])\n", " models[hp] = model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Save models" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "with open(models_filename, \"wb\") as handle:\n", " pickle.dump(models, handle)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 2 }