"For token embeddings, there are also various options, as the diagram below illustrates. The most recent option is contextual token embeddings (sometimes also called *dynamic* embeddings), which incorporate a specific token's context and can be obtained from architectures like ELMo or BERT. Another option is static token embeddings, which map one token to one embedding, independent of its specific occurrence in a text. For an overview of static and contextual embeddings, and their differences, see (Wang et al., 2020).\n",
"\n",
"For static embeddings, there is now a variety of established options, such as fastText or GloVe. We can also combine or stack embeddings (i.e. concatenate embedding vectors) to create new embeddings from existing ones.\n",
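Stacking embeddings amounts to a simple vector concatenation. A minimal numpy sketch, with made-up low-dimensional vectors standing in for real fastText and GloVe lookups (the dimensions of the two source embeddings need not match):

```python
import numpy as np

# Hypothetical pre-computed static embeddings for one token; in practice
# these would come from fastText and GloVe lookup tables.
fasttext_vec = np.array([0.12, -0.40, 0.33])      # 3-dim, for illustration
glove_vec = np.array([0.05, 0.81, -0.22, 0.10])   # 4-dim, for illustration

# Stacking concatenates the vectors into one longer embedding.
stacked = np.concatenate([fasttext_vec, glove_vec])
print(stacked.shape)  # (7,)
```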
"\n",
"\n",
"\n"
]
...
@@ -465,7 +467,7 @@
"id": "7a591ebe-dc7d-4109-a2f5-0480593057ca",
"metadata": {},
"source": [
"A large positive value (i.e. a small $\\theta$ between **u** and **v**) indicates high similarity, whereas a small or even negative value (i.e. a large $\\theta$) indicates low similarity. For a discussion of issues with this notion of similarity, see (Faruqui et al., 2016).\n",
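As a sketch of the quantity discussed above: cosine similarity is the dot product of two vectors divided by the product of their norms (illustrative numpy code with toy vectors, not the notebook's own helper):

```python
import numpy as np

def cosine_similarity(u, v):
    # cos(theta) = (u . v) / (|u| |v|)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

u = np.array([1.0, 2.0, 3.0])
v = np.array([2.0, 4.0, 6.0])     # same direction as u -> similarity 1
w = np.array([-1.0, -2.0, -3.0])  # opposite direction -> similarity -1

print(cosine_similarity(u, v))  # ~1.0 (small theta)
print(cosine_similarity(u, w))  # ~-1.0 (large theta)
```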
"\n",
"The visualization below encodes\n",
"\n",
...
@@ -634,7 +636,7 @@
"id": "427bc950-6e6d-4585-b98f-688484df28e9",
"metadata": {},
"source": [
"The example above is a situation where token similarity - and therefore embeddings - might not help much. While the syntactic structure is mirrored, the term \"silence\" is replaced with \"all but wind\". Even if we focus on nouns only, we would not expect \"silence\" and \"wind\" to be understood as similar. Still, an embedding approach should be able to recognize that the words at the beginning of the phrase are exact matches."
]
},
{
...
@@ -642,7 +644,7 @@
"id": "baa324fc-042b-474c-9ac2-6c28e61d4a41",
"metadata": {},
"source": [
"If we inspect the cosine similarity of the token \"silence\" with other tokens in the context under three of our embeddings, we see that there is more connection between \"silence\" and \"wind\" than we expected - especially with Numberbatch. Still, the absolute value of 0.3 for Numberbatch is low. Interestingly, GloVe associates \"silence\" with \"action\", i.e. an opposite. The phenomenon that embeddings sometimes cluster opposites is a common observation and can be an issue when we want to differentiate between synonyms and antonyms."
]
},
{
...
@@ -732,7 +734,12 @@
"id": "seven-aggregate",
"metadata": {},
"source": [
"Before we turn to alignment strategies to match sentences token by token, we first look at representing each document with a single embedding, in order to get a sense of how different embedding strategies relate to the nearness of documents. We will later return to individual token embeddings.\n",
"\n",
"We will use two strategies for computing document embeddings:\n",
"\n",
"* averaging over token embeddings\n",
"* computing document embeddings through a dedicated model"
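The first of these strategies can be sketched in a few lines of numpy. The token vectors below are made up for illustration; in the notebook they would come from a pre-trained model such as fastText or GloVe:

```python
import numpy as np

# Hypothetical token embeddings for a short document.
token_embeddings = {
    "the": np.array([0.1, 0.2, 0.0]),
    "rest": np.array([0.4, -0.1, 0.3]),
    "is": np.array([0.0, 0.1, 0.1]),
    "silence": np.array([0.2, 0.6, -0.2]),
}

def mean_document_embedding(tokens, embeddings):
    # Average the embeddings of all tokens that have a vector;
    # out-of-vocabulary tokens are simply skipped.
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    return np.mean(vectors, axis=0)

doc_vec = mean_document_embedding(["the", "rest", "is", "silence"], token_embeddings)
print(doc_vec.shape)  # (3,) - same dimensionality as the token embeddings
```

Averaging is crude (word order is lost entirely), but it is a surprisingly strong baseline for document nearness.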
]
},
{
...
@@ -740,7 +747,7 @@
"id": "average-controversy",
"metadata": {},
"source": [
"In order to achieve the latter, we compute document embeddings using Sentence-BERT."
]
},
{
...
@@ -767,9 +774,9 @@
"id": "12d93851-2895-422d-92d0-586aa540d480",
"metadata": {},
"source": [
"In order to achieve the former, we configure a helper class instance to use averaging to build document embeddings from token embeddings.\n",
"\n",
"Interactive readers may want to change the \"mean\" (averaging) method to other methods for computing document embeddings as well.\n",
"\n",
"Various approaches have been proposed. For an overview of sequence alignment algorithms, as well as adjacent approaches like Dynamic Time Warping, see Kruskal (Kruskal, 1983). In this section, we use the Waterman-Smith-Beyer (WSB) algorithm, which produces optimal local alignments and allows a general (e.g. non-affine) gap cost function (Waterman, Smith, and Beyer, 1976). Other commonly used alignment algorithms - such as Smith-Waterman and Gotoh - can be regarded as special cases of WSB. In contrast to Needleman-Wunsch, WSB produces local alignments. And in contrast to classic formulations of WSB - which often use a fixed substitution cost - we use the word distance from word embeddings to compute the substitution penalty for specific pairs of words."
]
},
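The combination of a local-alignment dynamic program with embedding-based substitution scores can be sketched as follows. This is an illustrative toy implementation, not the notebook's actual code: the 2-d vectors are made up, cosine similarity serves as the substitution score, and `gap` is a general (here linear) gap cost evaluated over the whole gap length, as in WSB:

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def wsb_local_score(a, b, emb, gap=lambda k: 0.5 * k):
    """WSB-style local alignment score with a general gap cost function."""
    n, m = len(a), len(b)
    H = np.zeros((n + 1, m + 1))
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Substitution score from word embeddings instead of a fixed cost.
            match = H[i - 1, j - 1] + cosine(emb[a[i - 1]], emb[b[j - 1]])
            # Gaps of arbitrary length k, penalized by the general cost gap(k).
            gap_a = max(H[i - k, j] - gap(k) for k in range(1, i + 1))
            gap_b = max(H[i, j - k] - gap(k) for k in range(1, j + 1))
            H[i, j] = max(0.0, match, gap_a, gap_b)  # 0 makes the alignment local
    return float(H.max())

# Toy 2-d embeddings, for illustration only.
emb = {
    "the": np.array([1.0, 0.0]),
    "rest": np.array([0.0, 1.0]),
    "is": np.array([1.0, 1.0]),
}

score = wsb_local_score(["the", "rest", "is"], ["the", "rest", "is"], emb)
print(score)  # ~3.0: three perfectly matching tokens, each contributing ~1
```

The cubic-time gap maximization is what buys the general (non-affine) cost function; Gotoh's algorithm recovers quadratic time for the affine special case.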
{
"cell_type": "markdown",
"id": "60da85ca-b472-4e59-846d-0c120454807a",
"metadata": {},
"source": [
"A different approach to computing a measure of similarity between bags of words is the so-called Word Mover's Distance introduced by Kusner et al. (Kusner et al., 2015)."
]
},
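Word Mover's Distance is the solution of a transport problem: mass moves from the words of one document to the words of the other, at a cost given by embedding distances. A small sketch that solves this transport problem directly as a linear program with `scipy.optimize.linprog` (toy embeddings; real implementations use specialized optimal-transport solvers):

```python
import numpy as np
from scipy.optimize import linprog

def word_movers_distance(d1, d2, emb):
    """Solve the transport problem underlying WMD as a linear program.

    d1, d2: dicts mapping token -> normalized frequency (nBOW weights).
    emb: dict mapping token -> embedding vector.
    """
    t1, t2 = list(d1), list(d2)
    n, m = len(t1), len(t2)
    # Travel cost between words: euclidean distance of their embeddings.
    C = np.array([[np.linalg.norm(emb[a] - emb[b]) for b in t2] for a in t1])
    # Flow constraints: each word's outgoing mass equals its weight in d1,
    # each word's incoming mass equals its weight in d2.
    A_eq, b_eq = [], []
    for i in range(n):
        row = np.zeros((n, m)); row[i, :] = 1.0
        A_eq.append(row.ravel()); b_eq.append(d1[t1[i]])
    for j in range(m):
        col = np.zeros((n, m)); col[:, j] = 1.0
        A_eq.append(col.ravel()); b_eq.append(d2[t2[j]])
    res = linprog(C.ravel(), A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                  bounds=(0, None))
    return res.fun

emb = {"silence": np.array([0.6, 0.8]), "wind": np.array([1.0, 0.0])}
d = {"silence": 0.5, "wind": 0.5}
print(word_movers_distance(d, d, emb))  # ~0: identical documents cost nothing
```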
{
"cell_type": "markdown",
"id": "retired-reverse",
...
@@ -1372,8 +1412,28 @@
"\n",
"Reimers, Nils, and Iryna Gurevych. “Sentence-BERT: Sentence Embeddings Using Siamese BERT-Networks.” ArXiv:1908.10084 [Cs], Aug. 2019. arXiv.org, http://arxiv.org/abs/1908.10084.\n",
"\n",
"Liebl, Bernhard, and Manuel Burghardt. “‘Shakespeare in the Vectorian Age’ – An Evaluation of Different Word Embeddings and NLP Parameters for the Detection of Shakespeare Quotes.” Proceedings of the 4th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, 2020, pp. 56–58.\n",
"\n",
"Kruskal, Joseph B. “An Overview of Sequence Comparison: Time Warps, String Edits, and Macromolecules.” SIAM Review, vol. 25, no. 2, Apr. 1983, pp. 201–37. DOI.org (Crossref), doi:10.1137/1025045.\n",
"\n",
"Kusner, Matt J., et al. “From Word Embeddings to Document Distances.” Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37, JMLR.org, 2015, pp. 957–66.\n",
"\n",
"Waterman, M. S., et al. “Some Biological Sequence Metrics.” Advances in Mathematics, vol. 20, no. 3, June 1976, pp. 367–87. DOI.org (Crossref), doi:10.1016/0001-8708(76)90202-4.\n",
"\n",
"Faruqui, Manaal, et al. “Problems With Evaluation of Word Embeddings Using Word Similarity Tasks.” ArXiv:1605.02276 [Cs], May 2016. arXiv.org, http://arxiv.org/abs/1605.02276.\n",
"\n",
"Wang, Yuxuan, et al. “From Static to Dynamic Word Representations: A Survey.” International Journal of Machine Learning and Cybernetics, vol. 11, no. 7, July 2020, pp. 1611–30. DOI.org (Crossref), doi:10.1007/s13042-020-01069-8.\n",
"\n",
"Nagoudi, El Moatez Billah, and Didier Schwab. “Semantic Similarity of Arabic Sentences with Word Embeddings.” Proceedings of the Third Arabic Natural Language Processing Workshop, Association for Computational Linguistics, 2017, pp. 18–24. DOI.org (Crossref), doi:10.18653/v1/W17-1303."