"For our investigations we rely on a framework called The Vectorian, which we first introduced in 2020 (Liebl and Burghardt, 2020). By employing highly optimized algorithms and data structures, the Vectorian allows us to perform rapid searches over the gold standard texts using a variety of approaches and strategies."
"We instantiate an NLP parser that provides embeddings based on Sentence-BERT (Reimers and Gurevych, 2019)."
]
},
{
...
"metadata": {},
"outputs": [],
"source": [
"nlp = nbutils.make_nlp()"
]
},
{
"cell_type": "markdown",
"id": "8c772ad8-af05-4074-9392-b263a5d6b358",
"metadata": {},
"source": [
"Finally, we add a shim that allows us to use Sentence-BERT's contextual token embeddings in the Vectorian.\n",
"\n",
"Since the above representation is hard to grasp, we can visualize various words under one embedding, using colors to show the strength of activation in different vector components."
]
},
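The kind of per-component visualization described above can be sketched outside the notebook's widget machinery. The following is a minimal, self-contained illustration: it draws each word's vector as one heatmap row, so color encodes activation strength per dimension. The 8-dimensional random vectors are hypothetical stand-ins for a real embedding lookup such as the notebook's `the_embeddings`.

```python
# Minimal sketch (not the notebook's actual widget): render each word's
# embedding vector as one row of a heatmap, so that color encodes the
# activation strength of every vector component.
# NOTE: the random 8-dimensional vectors below are hypothetical
# stand-ins for real embedding vectors.
import numpy as np
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt

words = ["sail", "boat", "guitar", "piano", "coffee", "tea"]
rng = np.random.default_rng(0)
vectors = rng.normal(size=(len(words), 8))  # stand-in embedding matrix

fig, ax = plt.subplots(figsize=(6, 2))
im = ax.imshow(vectors, cmap="coolwarm", aspect="auto")
ax.set_yticks(range(len(words)))
ax.set_yticklabels(words)
ax.set_xlabel("vector component")
fig.colorbar(im, ax=ax, label="activation")
plt.close(fig)
```

With real embeddings, one would replace the random matrix by stacking the looked-up vectors for each word; the plotting code stays the same.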
{
"cell_type": "code",
"execution_count": null,
"id": "f344867c-e095-4d6d-baf5-1fa559b37f6b",
"metadata": {},
"outputs": [],
"source": [
"import ipywidgets as widgets\n",
"from ipywidgets import interact\n",
"\n",
"@interact(embedding=widgets.Dropdown(\n",
"    options=[(k, v) for k, v in the_embeddings.items() if not v.is_contextual],\n",
...
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Looking at these color patterns, we can gain some intuitive understanding of why and how word embeddings are suited to word similarity computations. For example, *sail* and *boat* both show a strong activation on dimension 27. Similarly, *guitar* and *piano* share similar values around dimension 24. The words *coffee* and *tea* also share similar patterns around dimensions 2 and 49, which set them apart from the other four words."
]
},
{
"cell_type": "markdown",
"id": "7a591ebe-dc7d-4109-a2f5-0480593057ca",
"metadata": {},
"source": [
"By multiplying the normalized vectors component by component, we obtain the terms that make up the so-called cosine similarity, which is commonly used to compute word similarity from word embeddings. The visualization below makes clear which dimensions contribute large positive components to the cosine similarity between two words."
]
},
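The identity described above can be sketched directly in NumPy. The vectors below are hypothetical toy embeddings, not the notebook's real word vectors; the point is only that summing the component-wise products of the normalized vectors yields exactly the cosine similarity.

```python
import numpy as np

# Hypothetical toy embedding vectors (stand-ins for real word vectors).
guitar = np.array([0.2, -0.5, 0.9, 0.1])
piano = np.array([0.3, -0.4, 0.8, -0.2])

# Normalize to unit length, then multiply component by component.
u = guitar / np.linalg.norm(guitar)
v = piano / np.linalg.norm(piano)
terms = u * v  # per-dimension contributions to the similarity

# Summing the per-dimension terms gives the cosine similarity.
cosine = terms.sum()
assert np.isclose(cosine, np.dot(u, v))
```

The `terms` array is what the visualization plots per dimension: large positive entries mark dimensions where both words point the same way with high magnitude.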
{
"cell_type": "code",
"execution_count": null,
"id": "da7436d6-bf5a-4e2d-a4b0-0174c0606cf0",
"metadata": {},
"outputs": [],
"source": [
"@interact(embedding=widgets.Dropdown(\n",
"    options=[(k, v) for k, v in the_embeddings.items() if not v.is_contextual],\n",
...
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A similar investigation into fastText shows comparable spots of positive contribution. The situation is more complex due to the higher number of dimensions."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1b603839-2495-411d-b9e6-e1d133277f0c",
"metadata": {},
"outputs": [],
"source": [
"@interact(embedding=widgets.Dropdown(\n",
"    options=[(k, v) for k, v in the_embeddings.items() if not v.is_contextual],\n",
...
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Computing the cosine similarity is mathematically equivalent to summing up the terms in the diagram above. The overall similarity between *guitar* and *piano* is measured at about 68% with the fastText embedding we use."