"There are now various established ways to compute embeddings for word similarity tasks. A first important distinction is between *token* embeddings and *document* embeddings (see diagram below) - note that we use the terms \"token embeddings\" and \"word embeddings\" interchangeably. While the former imply one embedding (i.e. one numeric vector) per token, the latter map a whole document (a set of tokens) to one single embedding.\n",
"\n",
"There are two common ways to compute document embeddings. One way is to derive them from token embeddings - for example by averaging over them. More complex approaches train dedicated models that are optimized to produce good document embeddings.\n",
"\n",
"So on this level, we can differentiate between three kinds of embeddings: pure token embeddings, document embeddings derived from token embeddings, and - finally - document embeddings from dedicated document embedding models - e.g. models like Sentence-BERT (Reimers and Gurevych, 2019).\n",
"\n",
"Before we turn to alignment strategies that match sentences token by token, we first look at representing each document with one single embedding, in order to gain an understanding of how different embedding strategies relate to the nearness of documents. We will later return to individual token embeddings.\n",
"\n",
"We will use the two strategies for computing document embeddings we mentioned earlier:\n",
"\n",
"* averaging over token embeddings\n",
"* computing document embeddings through a dedicated model"
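The first strategy can be sketched in a few lines. The snippet below is illustrative only and not the helper-class code used in this notebook; the function name is our own:

```python
import numpy as np

def mean_document_embedding(token_embeddings):
    """Collapse per-token vectors into one document vector by averaging."""
    # token_embeddings: array-like of shape (n_tokens, dim)
    return np.asarray(token_embeddings, dtype=float).mean(axis=0)

doc = [[1.0, 0.0], [0.0, 1.0]]  # two toy token embeddings
print(mean_document_embedding(doc))  # -> [0.5 0.5]
```

Mean pooling is order-invariant: shuffling the tokens leaves the document embedding unchanged, which is one reason alignments (covered later) can add information that a single averaged vector cannot.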
...
@@ -774,37 +774,34 @@
"id": "12d93851-2895-422d-92d0-586aa540d480",
"metadata": {},
"source": [
"In order to achieve the former, we configure a helper class instance to use averaging to build document embeddings from token embeddings. Interactive readers may want to try changing the \"mean\" (i.e. averaging) method to other methods for computing document embeddings as well."
" {\"encoder\": \"paraphrase_distilroberta\", \"locator\": (\"fixed\", \"an old man is twice\")}\n",
"]);"
]
},
...
@@ -903,7 +900,7 @@
"id": "60da85ca-b472-4e59-846d-0c120454807a",
"metadata": {},
"source": [
"A different approach to compute a measure of similarity between bags of words is the so-called Word Mover's Distance introduced by Kusner et al. (Kusner et al., 2015). The main idea is to compute similarity by solving a transportation problem between the words of the two documents."
]
},
{
...
@@ -982,6 +979,14 @@
"We first define a strategy for searching the corpus. In the summary below you will find the strategy used for the non-interactive version of this text. In the interactive version, you can click on \"Edit\" and change these settings and rerun the following sections of the notebook accordingly."
]
},
{
"cell_type": "markdown",
"id": "40a082f5-99cf-4b21-8d65-d824f2199e0f",
"metadata": {},
"source": [
"We investigate two variants of WMD. First, the classic variant as described by Kusner et al., where a transportation problem is solved over the normalized bag of words (nbow) vectors. We also introduce a new variant of WMD, where we keep the bag of words (bow) unnormalized - i.e. we pose the transportation problem on absolute word occurrence counts."
]
},
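To make the two WMD variants concrete, here is a minimal sketch that poses the transportation problem directly as a linear program via SciPy. The bag-of-words layout and embedding lookup are illustrative and not this notebook's actual implementation:

```python
import numpy as np
from scipy.optimize import linprog

def wmd(bow1, bow2, emb, normalize=True):
    """Word Mover's Distance between two bags of words.

    normalize=True gives the classic nbow variant (Kusner et al., 2015);
    normalize=False keeps absolute counts (equal total mass is assumed in
    this sketch, otherwise the equality constraints would need relaxing).
    """
    w1, w2 = list(bow1), list(bow2)
    a = np.array([bow1[w] for w in w1], dtype=float)
    b = np.array([bow2[w] for w in w2], dtype=float)
    if normalize:
        a, b = a / a.sum(), b / b.sum()
    # ground cost: euclidean distance between word embeddings
    C = np.array([[np.linalg.norm(np.subtract(emb[u], emb[v])) for v in w2]
                  for u in w1])
    n, m = C.shape
    # flow-matrix constraints: row sums equal a, column sums equal b
    A_eq, b_eq = [], []
    for i in range(n):
        row = np.zeros((n, m)); row[i, :] = 1.0
        A_eq.append(row.ravel()); b_eq.append(a[i])
    for j in range(m):
        col = np.zeros((n, m)); col[:, j] = 1.0
        A_eq.append(col.ravel()); b_eq.append(b[j])
    res = linprog(C.ravel(), A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                  bounds=(0, None), method="highs")
    return res.fun

emb = {"cat": (0.0, 0.0), "dog": (1.0, 0.0)}
print(wmd({"cat": 1}, {"dog": 1}, emb))  # -> 1.0
```

Dense LP solving scales poorly with vocabulary size; dedicated optimal-transport solvers (or the relaxed lower bounds from the WMD paper) are the practical choice for larger documents.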
{
"cell_type": "code",
"execution_count": null,
...
@@ -1171,7 +1176,9 @@
"id": "c637b6d2-6271-48ed-9375-41b62c2874d9",
"metadata": {},
"source": [
"The distributions of score contributions we just observed are the motivation for our approach to tag-weighted alignments, which is described in (Liebl and Burghardt, 2020). Nagoudi and Schwab used similar ideas of POS weighting for computing sentence similarity, but did not combine them with alignments (Nagoudi and Schwab, 2017).\n",
"\n",
"We now demonstrate this by setting up a tag-weighted alignment that weights nouns like \"madness\" and \"method\" 3 times more than other word types (\"NN\" is a Penn Treebank tag and identifies singular nouns)."
]
},
{
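As a rough illustration of the idea behind tag weighting (the function and data layout below are hypothetical, not this notebook's API), a tag weight simply scales each aligned token pair's contribution to the overall score:

```python
def weighted_alignment_score(alignment, tag_weights, default_weight=1.0):
    """Score an alignment as a weighted mean of per-pair similarities.

    alignment: list of (similarity, pos_tag) pairs for aligned tokens;
    tag_weights: e.g. {"NN": 3.0} to weight singular nouns 3x.
    """
    weights = [tag_weights.get(tag, default_weight) for _, tag in alignment]
    total = sum(weights)
    if total == 0:
        return 0.0
    return sum(w * sim for w, (sim, _) in zip(weights, alignment)) / total

# a perfectly matched noun dominates a mismatched determiner when weighted 3x
print(weighted_alignment_score([(1.0, "NN"), (0.0, "DT")], {"NN": 3.0}))  # -> 0.75
```

With uniform weights the same pair would score 0.5, so the noun weight shifts the ranking toward matches on content words.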
...
@@ -1204,11 +1211,7 @@
"id": "ca966318-3d37-464c-a73d-fd3d153dcabb",
"metadata": {},
"source": [
"Tag-weighting moves the correct results far to the top, namely to ranks 1, 2, 4 and 6. By increasing the NN weight to 5, it is possible to bring rank 73 up to rank 15; this is a rather extreme measure, though, and we will not investigate it further here. Instead, we investigate how the weighting affects the other queries. To this end, we re-run the NDCG computation and compare it against unweighted WSB."
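For reference, NDCG can be computed along these lines; this is a generic sketch over graded relevances in ranked order, not this notebook's evaluation code:

```python
import math

def ndcg(relevances, k=None):
    """Normalized discounted cumulative gain for one ranked result list."""
    def dcg(rels):
        # gains discounted logarithmically by rank position
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels))
    top = relevances[:k] if k is not None else relevances
    ideal = sorted(relevances, reverse=True)
    ideal = ideal[:k] if k is not None else ideal
    best = dcg(ideal)
    return dcg(top) / best if best > 0 else 0.0

print(ndcg([3, 2, 0, 1]))  # one swap away from the ideal order, ≈ 0.985
```

A perfectly ordered list scores 1.0, so comparing NDCG between the weighted and unweighted settings directly shows whether the tag weights help or hurt ranking quality across queries.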