"We first tell the notebook logic whether we have a full Bokeh server. This *is* the case for local Jupyter installations, but is *not* the case for notebooks running on mybinder; in the latter case interactivity is limited."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e8b6be29-90ac-4c33-a466-a1739dd4d241",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"\n",
"# re-import helper modules to pick up any code changes\n",
"import importlib\n",
"importlib.reload(nbutils)\n",
"importlib.reload(gold)\n",
"\n",
"nbutils.initialize(\"export\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f6288bc0-18c8-4bec-8300-382cbef61e65",
"metadata": {},
"outputs": [],
"source": [
"import bokeh.io\n",
"bokeh.io.output_notebook()"
]
},
{
"cell_type": "markdown",
"id": "exterior-texas",
...
@@ -104,10 +51,51 @@
"\n",
"* Static token embeddings: these operate on the token level. We experiment with GloVe (Pennington et al., 2014), fastText (Mikolov et al., 2017) and Numberbatch (Speer et al., 2018). We use all three to compute token similarity and combine this with alignment algorithms (such as Waterman-Smith-Beyer) to compute document similarity. We also investigate the effect of stacking two static embeddings (fastText and Numberbatch).\n",
"* Contextual token embeddings: these also operate on the token level, i.e. embeddings that change according to a specific token instance's context. In this notebook we experiment with using such token embeddings from a Sentence-BERT model.\n",
"* Document embeddings derived from specially trained models. Document embeddings represent one document via a single embedding. We use document embeddings obtained from a BERT model. More specifically, we use a Siamese BERT model named Sentence-BERT, which is trained specifically for the semantic textual similarity (STS) task (Reimers and Gurevych, 2019).\n",
"* Document embeddings derived from token embeddings. We also experiment with averaging different kinds of token embeddings (static and contextual) to derive document embeddings.\n"
]
},
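The two token-embedding routes above can be sketched with toy data. This is a minimal, self-contained sketch: the 4-d vectors in `toy_emb` are invented for illustration, and the alignment below is a simplified global (Needleman-Wunsch-style) scoring, not the Waterman-Smith-Beyer variant the notebook actually uses via its real GloVe/fastText/Numberbatch vectors.

```python
import numpy as np

# Hypothetical 4-d token embeddings standing in for real GloVe/fastText/
# Numberbatch vectors (values invented for illustration only).
toy_emb = {
    "sea":      np.array([0.9, 0.1, 0.0, 0.2]),
    "ocean":    np.array([0.8, 0.2, 0.1, 0.1]),
    "of":       np.array([0.0, 0.1, 0.9, 0.1]),
    "troubles": np.array([0.1, 0.9, 0.3, 0.0]),
}

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def token_similarity(t1, t2):
    # Token-level similarity from static embeddings.
    return cosine(toy_emb[t1], toy_emb[t2])

def align_score(doc1, doc2, gap=0.5):
    # Simplified global alignment over token similarities; the notebook
    # itself uses the Waterman-Smith-Beyer algorithm instead.
    n, m = len(doc1), len(doc2)
    D = np.zeros((n + 1, m + 1))
    D[:, 0] = -gap * np.arange(n + 1)
    D[0, :] = -gap * np.arange(m + 1)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = max(
                D[i - 1, j - 1] + token_similarity(doc1[i - 1], doc2[j - 1]),
                D[i - 1, j] - gap,
                D[i, j - 1] - gap,
            )
    return D[n, m] / max(n, m)  # normalize by the longer document

def doc_embedding(tokens):
    # Averaging token embeddings yields one document embedding.
    return np.mean([toy_emb[t] for t in tokens], axis=0)

a = ["sea", "of", "troubles"]
b = ["ocean", "of", "troubles"]
alignment_sim = align_score(a, b)
embedding_sim = cosine(doc_embedding(a), doc_embedding(b))
```

Both routes yield a single document-similarity score; the alignment route keeps word order in play, while the averaging route discards it.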
{
"cell_type": "markdown",
"id": "8c07714a-5fdd-49c8-9330-b1f58f044de8",
"metadata": {},
"source": [
"# Technical Setup"
]
},
{
"cell_type": "markdown",
"id": "0de6249d-9437-4653-842c-f0635d61aaec",
"metadata": {},
"source": [
"We need to tell the notebook logic whether a full Bokeh server is available. This *is* the case for local Jupyter installations, but is *not* the case for notebooks running on mybinder; in the latter case interactivity is limited."
]
},
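One way such a check might look (a hedged sketch: `has_full_bokeh_server` and the `BINDER_*` environment-variable heuristic are illustrative assumptions, not what `nbutils.initialize` actually does):

```python
import os

def has_full_bokeh_server():
    # Hypothetical heuristic: mybinder launches typically expose BINDER_*
    # environment variables; if any are present, assume we are running on
    # mybinder and therefore lack a full bokeh server.
    on_binder = any(k.startswith("BINDER_") for k in os.environ)
    return not on_binder
```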
{
"cell_type": "code",
"execution_count": null,
"id": "723265bc-0629-42c0-a350-40c735a9529d",
"metadata": {},
"outputs": [],
"source": [
"import sys\n",
"\n",
"sys.path.append(\"code\")\n",
"\n",
"import nbutils\n",
"import gold\n",
"import vectorian\n",
"import ipywidgets as widgets\n",
"from ipywidgets import interact\n",
"\n",
"def reload():\n",
" import importlib\n",
" importlib.reload(nbutils)\n",
"\n",
"reload()\n",
"nbutils.initialize(\"auto\")"
]
},
{
"cell_type": "markdown",
"id": "assigned-length",
...
@@ -316,7 +304,8 @@
"metadata": {},
"outputs": [],
"source": [
"nlp = nbutils.make_nlp()\n",
"nlp.pipeline"
]
},
{
...
@@ -564,9 +553,9 @@
" the_embeddings[\"sbert\"],\n",
" CosineSimilarity())\n",
"\n",
"a = list(session.documents[0].spans(session.partition(\"document\")))[0][2]\n",
"plotter1 = vis.make(\"rest is silence\", \"Fig for Fortune\")"
]
},
{
...
@@ -620,7 +609,7 @@
"metadata": {},
"outputs": [],
"source": [
"plotter1(\"silence\")"
]
},
{
...
@@ -630,7 +619,7 @@
"metadata": {},
"outputs": [],
"source": [
"plotter2 = vis.make(\"sea of troubles\", \"Book of Common Prayer\")"
]
},
{
...
@@ -656,7 +645,7 @@
"metadata": {},
"outputs": [],
"source": [
"plotter2(\"sea\")"
]
},
{
...
@@ -674,7 +663,7 @@
"metadata": {},
"outputs": [],
"source": [
"plotter2(\"troubles\")"
]
},
{
...
@@ -692,7 +681,7 @@
"metadata": {},
"outputs": [],
"source": [
"plotter2(\"troublesomest\")"
]
},
{
...
@@ -743,22 +732,6 @@
"sbert_encoder_name = nlp.meta[\"name\"]"
]
},
{
"cell_type": "markdown",
"id": "ec8a1966-bdf0-41a0-b637-5cba51758904",
...
@@ -1432,14 +1405,6 @@
"\n",
"Nagoudi, El Moatez Billah, and Didier Schwab. “Semantic Similarity of Arabic Sentences with Word Embeddings.” Proceedings of the Third Arabic Natural Language Processing Workshop, Association for Computational Linguistics, 2017, pp. 18–24. DOI.org (Crossref), doi:10.18653/v1/W17-1303."