%% Cell type:markdown id:190bc476 tags:
# “Embed, embed! There’s knocking at the gate.”
## Detecting Intertextuality with Embeddings and the Vectorian
<i>Bernhard Liebl & Manuel Burghardt <br>
Computational Humanities Group, Leipzig University</i>
## Table of Contents
* [1. Introduction](#section_1)
* [1.1 Enter: word embeddings](#section_1_1)
* [1.2 Outline of the notebook](#section_1_2)
* [1.3 Technical setup](#section_1_3)
* [2. Data and Tools](#section_2)
* [2.1 Introducing the gold standard dataset](#section_2_1)
* [2.2 Overview of different types of embeddings](#section_2_2)
* [2.3 "Shapespeare in the Vectorian Age" – Meet the Vectorian framework](#section_2_3)
* [2.3.1 Loading word embeddings](#section_2_3_1)
* [2.3.2 Creating the session](#section_2_3_2)
* [3. Embeddings as a tool for intertextuality research](#section_3)
* [3.1 Exploring word embeddings](#section_3_1)
* [3.1.1 An introduction to word embeddings and token similarity](#section_3_1_1)
* [3.1.2 Detecting Shakespearean intertextuality through word embeddings](#section_3_1_2)
* [3.2 Exploring document embeddings](#section_3_2)
* [3.3 Exploring word mappings: WSB vs. WMD](#section_3_3)
* [3.3.1 Mapping quote queries to longer text documents](#section_3_3_1)
* [3.3.2 Evaluation: Plotting the nDCG over the corpus](#section_3_3_2)
* [3.3.3 Focussing on single queries](#section_3_3_3)
* [3.4 The influence of different embeddings](#section_3_4)
* [4. Conclusion](#section_4)
* [5. Interactive searches with your own data](#section_5)
* [6. References](#section_6)
%% Cell type:markdown id:55a187a5 tags:
# 1. Introduction <a class="anchor" id="section_1"></a>
%% Cell type:markdown id:08900f82 tags:
The detection of intertextual references in text corpora is a topic in digital humanities that has gained a lot of attention in recent years (for instance Bamman & Crane, 2008; Burghardt et al., 2019; Büchler et al., 2013; Forstall et al., 2015; Scheirer et al., 2014). While intertextuality – from a literary studies perspective – describes the phenomenon of one text being present in another text (cf. Genette, 1993), the computational problem at hand is the task of text similarity detection (Bär et al., 2012), and more concretely, semantic similarity detection.
%% Cell type:markdown id:931e786e tags:
In the following example of Shakespearean intertextuality, the words *bleed* and *leak* are semantically (and phonetically) similar, leaving no doubt that *Star Trek* is quoting Shakespeare here:
> Shylock: If you prick *us*, do *we* not **bleed**? <br>
(Shakespeare; The Merchant of Venice)
> Data: If you prick *me*, do *I* not **leak**? <br>
(Star Trek: The Next Generation; The Measure of a Man)
%% Cell type:markdown id:d93736f9 tags:
## 1.1 Enter: word embeddings <a class="anchor" id="section_1_1"></a>
%% Cell type:markdown id:7d2d01c6 tags:
Over the years, there have been various attempts at measuring semantic similarity, some of them knowledge-based (e.g. based on WordNet), others corpus-based, like LDA (Chandrasekaran & Vijay, 2021). The advent of word embeddings (Mikolov et al., 2013) has changed the field considerably by introducing a new and fast way to tackle the notion of word meaning. On the one hand, word embeddings are building blocks that can be combined with a number of other methods, such as alignments, soft cosine or Word Mover's Distance, to implement some kind of sentence similarity (Manjavacas et al., 2019). On the other hand, the concept of embeddings can be extended to work on the sentence-level as well, which is a conceptually different approach (Wieting et al., 2016).
%% Cell type:markdown id:fe1c1073 tags:
We introduce the **<a href="https://github.com/poke1024/vectorian">Vectorian</a>** as a framework that allows researchers to try out different embedding-based methods for intertextuality detection. In contrast to previous versions of the Vectorian (Liebl & Burghardt, 2020a/b) as a mere web interface with a limited set of static parameters, we now present a clean and redesigned API that is showcased in this interactive Jupyter notebook.
We will first use the Vectorian to build queries where we plug in pre-trained static word embeddings such as <a href="https://fasttext.cc/">fastText</a> (Mikolov et al., 2018) and <a href="https://nlp.stanford.edu/projects/glove/">GloVe</a> (Pennington et al., 2014). We evaluate the influence of computing similarity through alignments such as <a href="http://rna.informatik.uni-freiburg.de/Teaching/index.jsp?toolName=Waterman-Smith-Beyer">Waterman-Smith-Beyer</a> (WSB; Waterman et al., 1976) and two variants of Word Mover’s Distance (WMD; Kusner et al., 2015). We also investigate the performance of state-of-the-art sentence embeddings like <a href="https://www.sbert.net/">Siamese BERT networks</a> (Reimers & Gurevych, 2019) for the task - both on a document level (as document embeddings) and as contextual token embeddings. Overall, we find that WSB with fastText offers highly competitive performance. We find some slight indication that POS tag-weighted WSB might offer further benefits in some scenarios. Readers can upload their own data for performing search queries and try out additional vector space metrics such as p-norms or improved sqrt-cosine similarity (Sohangir & Wang, 2017).
%% Cell type:markdown id:76ec578f tags:
## 1.2 Outline of the notebook <a class="anchor" id="section_1_2"></a>
%% Cell type:markdown id:3354577a tags:
In the notebook, we will go through different examples of intertextuality to demonstrate and explain the implications of different embeddings and similarity measures. To achieve this, we provide a small ground truth corpus of intertextual Shakespeare references that can be used for some controlled evaluation experiments. Our main goal is to provide an interactive environment, where researchers can test out different methods for text reuse and intertextuality detection. This notebook thus adds to a critical reflection of digital methods and can help to shed some light on their epistemological implications for the field of computational intertextuality detection. At the end of the notebook, researchers can also easily import their own data and investigate all the showcased methods for their specific texts.
%% Cell type:markdown id:83dd01e9 tags:
## 1.3 Technical setup <a class="anchor" id="section_1_3"></a>
%% Cell type:markdown id:81509d15 tags:
We import a couple of helper functions for visualizations and various computations (`nbutils`), a wrapper to load our gold standard data (`gold`), and finally the Vectorian library (`vectorian`), through which we will perform searches and evaluations later on.
In `nbutils.initialize` we check whether there is a [bokeh server](https://docs.bokeh.org/en/latest/index.html) available. This typically *is* the case for local Jupyter installations, but is *not* the case for notebooks running on <a href="https://mybinder.org/">mybinder</a>. In the latter case, the notebook has some limitations regarding interactivity.
%% Cell type:code id:2ad4b2be tags:
``` python
import sys; sys.path.append("code") # make importable
import nbutils, gold, vectorian
import ipywidgets as widgets
from ipywidgets import interact
nbutils.initialize("auto", export=True)
```
%% Output
%% Cell type:markdown id:cfedd8b6-31de-464e-8970-4ec5508e0eb9 tags:
# 2. Data and Tools <a class="anchor" id="section_2"></a>
%% Cell type:markdown id:d6be272b-e11c-47a4-a94e-8da5c061fa15 tags:
## 2.1 Introducing the gold standard dataset <a class="anchor" id="section_2_1"></a>
%% Cell type:markdown id:038a3660-df73-4444-97e8-f53827fe0bcd tags:
In the following, we use a collection of 100 short text snippets (= documents) that quote a total of 20 different Shakespeare phrases. All of these documents were derived from the [WordWeb IDEM portal](http://wordweb-idem.ch/about-us.html), where literary scholars collect intertextual references in a freely accessible database (Hohl-Trillini et al., 2020). Each document quotes exactly one of the 20 phrases. For some phrases, e.g. "to be or not to be", there are more quoting documents than for others (see the interactive overview of documents below). Where multiple documents quote the same phrase, we selected them so that each quotes it in a different way. There are no verbatim quotes in the documents, only more or less complex variations of the original phrase.
%% Cell type:markdown id:19440288-e2ab-4d0d-89c9-3698c4494a9a tags:
We use this collection of documents containing quotes as a gold standard in order to assess how well different embeddings and search algorithms are able to detect rephrasings of different types of quotes.
In technical terms, the gold standard data is represented as a directed graph, where nodes are phrases - e.g. "to be or not to be" - and edges model intertextuality - i.e. one phrase re-occurring in a different context. For example, Shakespeare's "to be or not to be" will have several outgoing edges that reference other phrases from other works that we consider intertextually related. Edges are directed, and start from the work containing query phrases (which is always by William Shakespeare in this notebook's gold standard data) and go to the work that contains dependent rephrasings. Note that this relationship is purely conceptual and does not imply a chronological timeline of text reuse. For example, "The rest is silence" occurs in Hamlet (1623 for the First Folio), whereas the rephrasing "the rest is all but wind" occurs in A Fig for Fortune (1596).
Nodes contain additional information on a phrase's context (i.e. surrounding text) and the containing work (and author), both of which allow us to understand where the phrase comes from and where it is used.
The visualization below shows the full gold data graph. Nodes are represented as circles. The 20 larger red nodes are source nodes, i.e. those nodes by Shakespeare that serve as queries for our investigations. The 100 smaller orange nodes are phrases that are related to the original Shakespeare phrase. By hovering over nodes, you can see the phrase itself, the work it occurs in, and the full context in which it is embedded. Re-occurrences of phrases are highlighted in bold.
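To make the graph structure concrete, the following minimal sketch (a hypothetical, simplified structure - not the actual schema of `gold.json` or the loader in `gold.py`) builds a tiny version of such a graph with `networkx`:

``` python
# Hypothetical, simplified sketch of the gold data structure: phrases are nodes,
# edges point from a Shakespeare phrase to a phrase that rephrases it elsewhere.
import networkx as nx

g = nx.DiGraph()
g.add_node("the rest is silence", work="Hamlet", author="William Shakespeare")
g.add_node("the rest is all but wind", work="A Fig for Fortune")
g.add_edge("the rest is silence", "the rest is all but wind")

# nodes with in-degree 0 are the source (query) phrases
print([n for n, deg in g.in_degree() if deg == 0])
```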
%% Cell type:code id:bab19081-815e-4b11-80e7-77581c4f2362 tags:
``` python
gold_data = gold.load_data("data/raw_data/gold.json")
nbutils.plot_gold(gold_data, title=f"The gold data is a {gold_data}")
```
%% Output
%% Cell type:markdown id:87bddd3e-b9dc-44b0-8382-c01a7205185f tags:
The browser widget below lets the reader explore the same graph data through a different UI. The specific example shown by default is the rephrasing of the Shakespeare phrase "to be or not to be" in a non-Shakespeare work titled "The Phoenix" by Thomas Middleton. For a deeper discussion of the intertextual provenience of this special phrase see (Trillini, 2020).
The phrase in Middleton's work is "to be named or not be named". The context, in which this rephrasing is embedded, is the whole line by "Fidelio".
%% Cell type:code id:aeba4fc4-7d00-4edb-824d-a887131c6ce3 tags:
``` python
nbutils.Browser(gold_data, "to be or not to be", "The Phoenix");
```
%% Output
%% Cell type:markdown id:33e17502-144a-447c-b180-181e9821ee4e tags:
While the structure of the gold data has been geared towards our specific use case in this notebook, the graph-based format of <a href="data/raw_data/gold.json">gold.json</a> should be easy to understand and straightforward to replace with custom datasets. Note that the loader inside <a href="code/gold.py">gold.py</a> is very simple and essentially just builds a graph. Also note that only nodes with in-degree 0 are considered as base nodes that are converted into queries later on in the notebook.
%% Cell type:markdown id:7917ba04 tags:
## 2.2 Overview of different types of embeddings <a class="anchor" id="section_2_2"></a>
%% Cell type:markdown id:d4c80d2b tags:
**Word embeddings** take up the linguistic concept of collocations. For each word, the other words with which it occurs in a corpus are recorded. These collocation profiles are then represented as vectors. If, for example, two words (e.g. "car" and "truck") occur with very similar words (e.g. "wheels", "drive", "street", etc.), then they also have very similar word vectors, i.e. they are semantically - or at least structurally - very similar.
There are various established ways to compute embeddings for word similarity tasks. A first important distinction to be made is between *token* / *word* embeddings and *document* embeddings. While **token embeddings** model one embedding per token, **document embeddings** try to map an entire document (i.e. an ordered sequence of tokens) into one single embedding. There are two common ways to compute document embeddings. One way is to derive them from token embeddings - for instance through averaging token embeddings vectors. More complex approaches train dedicated models that are optimized to produce good document embeddings.
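As a minimal illustration of the first approach (a sketch of our own, not the Vectorian's implementation), averaging the token vectors of a document yields a single document vector:

``` python
# Minimal sketch: a document embedding obtained by averaging token embeddings.
# The random matrix stands in for real token vectors (e.g. fastText with n=300).
import numpy as np

token_vecs = np.random.rand(12, 300)  # 12 tokens, 300 dimensions each
doc_vec = token_vecs.mean(axis=0)     # one 300-dimensional document vector
print(doc_vec.shape)                  # (300,)
```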
All in all, we can distinguish three types of embeddings:
* original token embeddings (these can be either static or contextual)
* document embeddings derived from token embeddings (e.g. through averaging)
* document embeddings from dedicated models, such as <a href="https://www.sbert.net/">Sentence-BERT</a> (Reimers & Gurevych, 2019).
The diagram below shows this taxonomy. Orange arrows indicate specific embeddings used in this notebook.
%% Cell type:code id:1718fe62-1f7c-4a36-adcb-d42985c73a34 tags:
``` python
nbutils.plot_dot("miscellaneous/diagram_embeddings_1.dot")
```
%% Output
<IPython.core.display.SVG object>
%% Cell type:markdown id:ab3a5ecf tags:
The following diagram showcases various options for token embeddings. The most recent option is using contextual token embeddings (also sometimes called *dynamic* embeddings), which incorporate a specific token's context and can be obtained from architectures like <a href="https://jalammar.github.io/illustrated-bert/">ELMo or BERT</a>. Another option is using static token embeddings, which map one token to one embedding, independent of its specific occurrence in a text. For an overview of static and contextual embeddings, and their differences, see Wang et al. (2020).
We have a variety of established options for static embeddings like <a href="https://fasttext.cc/">fastText</a> or <a href="https://nlp.stanford.edu/projects/glove/">GloVe</a>. We can also combine several embeddings into one single embedding - a common mechanism used for this is *stacking*, i.e. concatenating embedding vectors.
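To make the idea of stacking concrete, here is a minimal sketch (illustration only, not the Vectorian API): stacking simply concatenates a token's vectors from two embeddings into one longer vector.

``` python
# Minimal sketch of stacking: concatenating a token's fastText vector (300-d)
# and its Numberbatch vector (50-d) into a single 350-dimensional vector.
import numpy as np

fasttext_vec = np.random.rand(300)    # stand-in for a real fastText vector
numberbatch_vec = np.random.rand(50)  # stand-in for a real Numberbatch vector
stacked = np.concatenate([fasttext_vec, numberbatch_vec])
print(stacked.shape)                  # (350,)
```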
%% Cell type:code id:4360e891-7da5-4617-aef1-6987ef4e3979 tags:
``` python
nbutils.plot_dot("miscellaneous/diagram_embeddings_2.dot")
```
%% Output
<IPython.core.display.SVG object>
%% Cell type:markdown id:a2820b9f tags:
In this notebook, we showcase the following four variations of embeddings:
* **Static token embeddings**: these operate on the token level. We experiment with <a href="https://nlp.stanford.edu/projects/glove/">GloVe</a> (Pennington et al., 2014), <a href="https://fasttext.cc/">fastText</a> (Mikolov et al., 2018) and <a href="https://github.com/commonsense/conceptnet-numberbatch">Numberbatch</a> (Speer et al., 2017). We use these three embeddings to compute token similarity and combine them with alignment algorithms (such as <a href="http://rna.informatik.uni-freiburg.de/Teaching/index.jsp?toolName=Waterman-Smith-Beyer">Waterman-Smith-Beyer</a>) to compute document similarity. We also investigate the effect of stacking two static embeddings (fastText and Numberbatch) into a single new embedding.
* **Contextual token embeddings**: these also operate on the token level, but embeddings can change according to a specific token instance's context. In this notebook we experiment with using such token embeddings from the <a href="https://www.sbert.net/">Sentence-BERT</a> model (Reimers & Gurevych, 2019). Note that this model is usually used to produce document embeddings. For our experiments in this variant, we ignore this layer and access its underlying token embeddings.
* **Document embeddings derived from specially trained models**: document embeddings represent one document via one single embedding. Again, we use <a href="https://www.sbert.net/">Sentence-BERT</a> (Reimers & Gurevych, 2019), but this time we extract document embeddings. More specifically, we will use two Sentence-BERT models trained specifically for the semantic textual similarity (STS) task (Reimers & Gurevych, 2019).
* **Document embeddings derived from token embeddings**: We also experiment with averaging different kinds of token embeddings (static and contextual) to derive document embeddings.
%% Cell type:markdown id:e746071b tags:
## 2.3 "Shapespeare in the Vectorian Age" – Meet the Vectorian framework <a class="anchor" id="section_2_3"></a>
%% Cell type:markdown id:e5afbd3e tags:
To conduct our actual investigations, we rely on a framework called the **<a href="https://github.com/poke1024/vectorian">Vectorian</a>**, which we first introduced in 2020 (Liebl & Burghardt, 2020a/b). Using highly optimized algorithms and data structures, the Vectorian enables interactive real-time searches over text corpora using a variety of approaches and strategies.
%% Cell type:markdown id:f1cb0e38 tags:
In order to use the Vectorian, we need to map the gold standard data to Vectorian API concepts (which we highlight `like this`). As a first step, we take all contexts from the 100 gold standard phrases and use these as `Documents` in the Vectorian.
A `Document` in Vectorian terminology is something we can perform a search on. `Documents` in the Vectorian are created using different kinds of `Importers` that perform necessary natural language processing tasks using an additional `NLP` class. Since this step can be time-consuming, we pre-computed this step and use the `Corpus` class to quickly load these pre-processed Documents into the notebook. For details about the pre-processing, see `code/prepare_corpus.ipynb`.
Note that using the phrase contexts as `Documents` is a simplification of the search process that is necessary for a clean evaluation. While using a full book or work as a `Document` and searching over its parts (e.g. over all sentences or over a sliding window of its tokens) would be a more realistic setting, we would have had to manually re-check all results classified as false positives in such a setting, since the automatic search might reveal correct text reuse references which we were previously unaware of.
In contrast, our gold standard has been manually curated such that there is one and *only* one text reuse reference per context. By searching over contexts that carry exactly one correct text reuse reference, we can ensure that our performance evaluation of a search strategy is sound.
%% Cell type:markdown id:f467944b tags:
Using the loaded `Documents` and a set of `Embeddings`, we then create a `Session` that allows us to perform searches for instances of intertextuality. More details about the technical architecture we build on in this notebook can be found in the [source code](https://github.com/poke1024/vectorian) and the [API Documentation](https://poke1024.github.io/vectorian/index.html) for the Vectorian.
%% Cell type:markdown id:fe62e898 tags:
### 2.3.1 Loading word embeddings <a class="anchor" id="section_2_3_1"></a>
%% Cell type:markdown id:673a5d4b tags:
In terms of static embeddings, we will work with pre-trained versions of <a href="https://nlp.stanford.edu/projects/glove/" target="_blank">GloVe</a>, <a href="https://fasttext.cc/docs/en/crawl-vectors.html" target="_blank">fastText</a> and <a href="https://github.com/commonsense/conceptnet-numberbatch" target="_blank">Numberbatch</a>. GloVe uses a form of matrix factorization on a global co-occurrence matrix to compute embeddings for a finite set of predefined tokens (Pennington et al., 2014). In contrast, fastText training operates on local context windows (Mikolov et al., 2018). Unlike GloVe and the earlier word2vec, fastText additionally computes embeddings on character n-grams instead of tokens, which means there are no out-of-vocabulary tokens (Mikolov et al., 2018). GloVe and fastText only use data from a corpus, whereas Numberbatch embeddings additionally incorporate information from a knowledge graph (Speer et al., 2017).
For reasons of limited RAM in the interactive Binder environment (and to limit download times), we use small or compressed versions of the official pre-trained versions:
* for **<a href="https://nlp.stanford.edu/projects/glove/" target="_blank">GloVe</a>**, we use the official 50-dimensional version of the 6B variant
* for **<a href="https://fasttext.cc/docs/en/crawl-vectors.html" target="_blank">fastText</a>** we use a version that was trained on *Common Crawl* and *Wikipedia* using CBOW, and then compressed using the standard settings in https://github.com/avidale/compress-fasttext
* for **<a href="https://github.com/commonsense/conceptnet-numberbatch" target="_blank">Numberbatch</a>** we use version 19.08 that was reduced into a 50-dimension version using a standard PCA
We also use one **stacked embedding**, in which we combine fastText and Numberbatch. We will call this embedding `fasttext_numberbatch`.
Finally, we will use contextual embeddings based on the Sentence-BERT architecture (Reimers & Gurevych, 2019). We use two models, the second of which is newer and - as it has been trained for asymmetric search - more suitable for the task at hand:
* the pre-trained English <a href="https://huggingface.co/sentence-transformers/paraphrase-distilroberta-base-v1">paraphrase_distilroberta_base_v1</a> model, which is trained for <a href="https://www.sbert.net/examples/applications/semantic-search/README.html">symmetric semantic search </a>
* the pre-trained English <a href="https://huggingface.co/sentence-transformers/msmarco-distilbert-base-v4">msmarco-distilbert-base-v4</a> model, which is trained for <a href="https://www.sbert.net/examples/applications/semantic-search/README.html">asymmetric semantic search</a>.
We refer to both models as `sbert` variants. Note that many other models can be trained with the Sentence-BERT architecture, which might perform differently on the tasks at hand.
Also note that all embeddings we use in this notebook were trained from large generic corpora, i.e. no embedding was trained from the documents we search over.
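Outside of the Vectorian, these models can also be used directly via the <a href="https://www.sbert.net/">sentence-transformers</a> library. The following sketch shows standard library usage (not the wrapper classes used in this notebook):

``` python
# Minimal sketch using the sentence-transformers library directly.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/msmarco-distilbert-base-v4")
emb = model.encode(["to be or not to be", "to be named or not be named"])
print(util.cos_sim(emb[0], emb[1]))  # cosine similarity of the two document embeddings
```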
%% Cell type:markdown id:04037dac tags:
We first need to instantiate an NLP parser that provides us with standard NLP capabilities such as tokenization and POS tagging. Internally, we use <a href="https://spacy.io/">spaCy</a> to construct a suitable parser. `nlp.pipeline` will return the fully constructed NLP pipeline in case the reader is interested.
%% Cell type:code id:6d7d352e-6c64-4a2d-bb42-b8ebb8a28f1e tags:
``` python
nlp = nbutils.make_nlp()
```
%% Cell type:markdown id:994ec251 tags:
We now create the desired `sbert` embeddings (more specifically, suitable class instances compatible with the Vectorian) as well as the other static embeddings we described earlier. The <a href="data/raw_data/embeddings.yml">embeddings.yml</a> file referenced below contains a detailed technical description of what is loaded exactly.
%% Cell type:code id:16ef1ef5 tags:
``` python
the_embeddings = nbutils.load_embeddings("data/raw_data/embeddings.yml")
print("loaded:", ", ".join(the_embeddings.keys()))
```
%% Output
loaded: glove, fasttext, numberbatch, sbert_paraphrase, sbert_msmarco, fasttext_numberbatch
%% Cell type:markdown id:65d16656 tags:
### 2.3.2 Creating the session <a class="anchor" id="section_2_3_2"></a>
%% Cell type:markdown id:7f2c3ada tags:
The following code creates a `Session` in the Vectorian framework that will allow us to perform searches over the gold standard corpus using the desired embeddings:
%% Cell type:code id:7c2ab856 tags:
``` python
session = vectorian.session.LabSession(
vectorian.corpus.Corpus("data/processed_data/corpus", mutable=False),
embeddings=the_embeddings.values())
```
%% Output
%% Cell type:markdown id:5910a28b-704f-45e4-ae63-91d3ceb32fdd tags:
Finally, the following code will speed up searches later in the notebook by loading all contextual embedding vectors into RAM.
%% Cell type:code id:9d27028b-4f13-4d5a-8f6a-bc41715a6ed1 tags:
``` python
session.cache_contextual_embeddings()
```
%% Output
%% Cell type:markdown id:c09829d8-c158-43b3-be08-1a76f343d3bf tags:
# 3. Embeddings as a tool for intertextuality research <a class="anchor" id="section_3"></a>
%% Cell type:markdown id:4e52e6c6-f0c6-47ff-934b-ec5d8d2c51e0 tags:
## 3.1 Exploring word embeddings <a class="anchor" id="section_3_1"></a>
%% Cell type:markdown id:ec8c309d tags:
### 3.1.1 An introduction to word embeddings and token similarity <a class="anchor" id="section_3_1_1"></a>
%% Cell type:markdown id:936bfeb6 tags:
Before we dive into the actual analyses (of the instances of intertextuality), we first take a brief look at the inner workings of embeddings. Mathematically speaking, a word embedding is a vector **x** of dimension *n*, i.e. a vector consisting of *n* scalars.
$$\mathbf{x}=(x_1, x_2, ..., x_{n-1}, x_n)$$
For example, the compressed Numberbatch embedding we use has *n*=50 and thus represents the word "coffee" with the following 50 scalar values:
%% Cell type:code id:474c8b2e tags:
``` python
widgets.GridBox(
[
widgets.Label(f"{x:.2f}")
for x in session.word_vec(the_embeddings["numberbatch"], "coffee")
],
layout=widgets.Layout(grid_template_columns="repeat(10, 50px)"),
)
```
%% Output
%% Cell type:markdown id:9bea3512 tags:
Since the above representation is difficult to understand, we visualize the values of
$$x_1, x_2, ..., x_{n-1}, x_n$$
through different colors. By default, all values are normalized by ||**x**||&#x2082;, i.e. the dot product of these vectors gives the cosine similarity.
%% Cell type:code id:ea9c7758 tags:
``` python
@interact(
embedding=widgets.Dropdown(
options=[(k, v) for k, v in the_embeddings.items() if not v.is_contextual],
value=the_embeddings["numberbatch"],
),
normalize=True,
)
def plot(embedding, normalize):
nbutils.plot_embedding_vectors_val(
["sail", "boat", "coffee", "tea", "guitar", "piano"],
get_vec=lambda w: session.word_vec(embedding, w),
normalize=normalize,
)
```
%% Output
%% Cell type:markdown id:d59f2826 tags:
By looking at these color patterns, we can gain some intuitive understanding of why and how word embeddings are appropriate for word similarity calculations. For example, *sail* and *boat* both show a strong activation for dimension 27. Similarly, *guitar* and *piano* share similar values for dimension 24. The words *coffee* and *tea* also share similar values in dimensions 1 and 2, which slightly set them apart from the other four words.
%% Cell type:markdown id:90ef6760 tags:
A common approach to compute the similarity between two word vectors **u** and **v** in this kind of high-dimensional vector space is to compute the cosine of the angle &theta; between the vectors, which is called **cosine similarity**:
$$\cos \theta = \frac{\mathbf{u} \cdot \mathbf{v}}{||\mathbf{u}||_2 ||\mathbf{v}||_2} = \frac{\sum_{i=1}^n \mathbf{u}_i \mathbf{v}_i}{\sqrt{\sum_{i=1}^n \mathbf{u}_i^2} \sqrt{\sum_{i=1}^n \mathbf{v}_i^2}} = \sum_{i=1}^n \left( \frac{\mathbf{u}}{||\mathbf{u}||_2} \right)_i \left( \frac{\mathbf{v}}{||\mathbf{v}||_2} \right)_i$$
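The formula translates directly into a few lines of numpy. The sketch below is for illustration only; within the Vectorian, this measure is provided by `CosineSim`.

``` python
# Cosine similarity of two vectors, following the formula above.
import numpy as np

def cosine_similarity(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

u = np.array([1.0, 2.0, 3.0])
v = np.array([2.0, 4.0, 5.0])
print(cosine_similarity(u, v))  # close to 1, i.e. a small angle between u and v
```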
%% Cell type:markdown id:3d8876b1 tags:
A large positive value (i.e. a small &theta; between **u** and **v**) indicates higher similarity, whereas a small or even negative value (i.e. a large &theta;) indicates lower similarity. For a discussion of issues with this notion of similarity, see Faruqui et al. (2016).
The visualization below encodes
$$\left( \frac{\mathbf{u}}{||\mathbf{u}||_2} \right)_i \left( \frac{\mathbf{v}}{||\mathbf{v}||_2} \right)_i$$
for different i, 1 &le; i &le; n, through colors to illustrate how different vector components (i.e. which values of *i*) contribute to the cosine similarity for two words. Brighter colors (orange/yellow) indicate dimensions with higher contribution.
%% Cell type:code id:70e2ade9 tags:
``` python
@interact(
embedding=widgets.Dropdown(
options=[(k, v) for k, v in the_embeddings.items() if not v.is_contextual],
value=the_embeddings["numberbatch"],
)
)
def plot(embedding):
nbutils.plot_embedding_vectors_mul(
[("sail", "boat"), ("coffee", "tea"), ("guitar", "piano")],
get_vec=lambda w: session.word_vec(embedding, w),
)
```
%% Output
%% Cell type:markdown id:ff150311 tags:
As in the earlier plot, dimension 27 pops out as a strong link between *sail* and *boat*.
A comparable investigation of fastText shows similar spots of strong contributions. The plot here is somewhat more complex due to the higher number of dimensions (*n* = 300).
%% Cell type:code id:adc1d2ed tags:
``` python
@interact(
embedding=widgets.Dropdown(
options=[(k, v) for k, v in the_embeddings.items() if not v.is_contextual],
value=the_embeddings["fasttext"],
)
)
def plot(embedding):
nbutils.plot_embedding_vectors_mul(
[("sail", "boat"), ("coffee", "tea"), ("guitar", "piano")],
get_vec=lambda w: session.word_vec(embedding, w),
)
```
%% Output
%% Cell type:markdown id:9e9c6ff5 tags:
Computing the overall cosine similarity for two words is mathematically equivalent to summing up the terms in the diagram above. The overall similarity between *guitar* and *piano* is approx. 68% with the fastText embedding we use. For *guitar* and *coffee* it is significantly lower with a similarity of approx. 20%.
%% Cell type:code id:4b9d13ae tags:
``` python
from vectorian.sim.token import EmbeddingTokenSim
from vectorian.sim.vector import CosineSim
token_sim = EmbeddingTokenSim(the_embeddings["fasttext"], CosineSim())
[session.similarity(token_sim, "guitar", x) for x in ["piano", "coffee"]]
```
%% Output
[0.68097234, 0.19857687]
%% Cell type:markdown id:5fea91e0 tags:
Note that for contextual embeddings, we need to compute the similarity between the actual instances of tokens within a text document.
%% Cell type:code id:b45a1647 tags:
``` python
token_sim = EmbeddingTokenSim(the_embeddings["sbert_paraphrase"], CosineSim())
a = list(session.documents[0].spans(session.partition("document")))[0][2]
b = list(session.documents[6].spans(session.partition("document")))[0][10]
[a.text, b.text, session.similarity(token_sim, a, b)]
```
%% Output
['dare', 'bear', 0.42446673]
%% Cell type:markdown id:ee9d77e8 tags:
### 3.1.2 Detecting Shakespearean intertextuality through word embeddings <a class="anchor" id="section_3_1_2"></a>
%% Cell type:markdown id:5c4ebd8b tags:
We now explore the usefulness of embeddings and token similarity with the gold standard dataset that was introduced earlier. In the following example the phrase "the rest is silence" is quoted as "the rest is all but wind". While the syntactic structure is mirrored between original phrase and its re-occurrence, the term "silence" is replaced with "all but wind".
%% Cell type:code id:53cb1328 tags:
``` python
vis = nbutils.TokenSimPlotterFactory(session, nlp, gold_data)
```
%% Cell type:code id:08494bc8 tags:
``` python
plotter1 = vis.make("rest is silence", "Fig for Fortune")
```
%% Output
%% Cell type:markdown id:52f2a451 tags:
Intuitively, we expect "silence" and "wind" to be related to a certain degree. To investigate how well this intuition transfers to our measurements through embeddings, we inspect the cosine similarity of the token "silence" with other tokens in the document's ("A Fig for Fortune, 1596") context for three different embedding models.
It becomes clear that for all three embeddings there is a strong connection between "silence" and "wind". The cosine similarity is particularly high with the Numberbatch model. Nevertheless, the absolute value of 0.3 for Numberbatch is still in a rather low range. Interestingly, GloVe associates "silence" with "action", which can be understood as quite the opposite of silence. The phenomenon that embeddings sometimes cluster opposites is a common observation and can be a problem when trying to distinguish between synonyms and antonyms.
%% Cell type:code id:3df8a4d6 tags:
``` python
plotter1("silence")
```
%% Output
%% Cell type:markdown id:19d78987 tags:
Another quote example involving the phrase "sea of troubles" is shown below. We see that the word "sea" is paraphrased as "waves", whereas "troubles" gets substituted by "troublesome". If we take a closer look at the cosine similarities of the tokens "sea" and "troubles" with all the other tokens in the document's context, we see that they are – expectedly – rather high, which means we should be able to detect such kinds of rephrasing.
%% Cell type:code id:3fb62321 tags:
``` python
plotter2 = vis.make("sea of troubles", "Book of Common Prayer")
```
%% Output
%% Cell type:code id:5f7b8069 tags:
``` python
plotter2("sea")
```
%% Output
%% Cell type:code id:97a3192d tags:
``` python
plotter2("troubles")
```
%% Output
%% Cell type:markdown id:5cee2f24 tags:
It is also interesting to investigate how out-of-vocabulary words like "troublesomest" produce zero similarities with standard key-value embeddings, whereas fastText is still able to produce a vector thanks to subword information.
%% Cell type:code id:ac79d82c tags:
``` python
plotter2("troublesomest")
```
%% Output
%% Cell type:markdown id:b06e6051 tags:
## 3.2 Exploring document embeddings <a class="anchor" id="section_3_2"></a>
%% Cell type:markdown id:22375ca1 tags:
Next, we consider the representation of each document with a single embedding to gain an understanding of how different embedding strategies relate to document similarity. We will later return to individual token embeddings.
For this purpose, we will use the two strategies already mentioned for computing document embeddings:
* averaging over token embeddings
* computing document embeddings through a dedicated model
%% Cell type:markdown id:a2cb44d3 tags:
In order to achieve the latter, we compute document embeddings through Sentence-BERT encoders.
%% Cell type:code id:8a3a06ec tags:
``` python
doc_encoders = nbutils.make_doc_encoders(the_embeddings, session)
embedder = nbutils.DocEmbedder(
session=session,
nlp=nlp,
doc_encoders=doc_encoders,
encoder="paraphrase [doc]",
)
embedder.display()
```
%% Output
%% Cell type:markdown id:db0755a1 tags:
Similar to the investigation of token embedding values, we now look at the feature dimensions of the document embeddings. In the following plot we observe that the phrase "an old man is twice a child" and the corresponding text reuses from the gold standard (i.e. the true positives) show some salient contribution around dimensions 25 and 300 (see the 5 upper rows and contrast them to the lower 5 rows). When comparing the same pattern with non-matching text reuse occurrences from the "go, by Saint Hieronimo" pattern on the other hand (see the 5 lower rows), there is less activation in these areas. Therefore these areas seem to offer some good features to differentiate the matching of a pattern with the correct occurrences.
%% Cell type:code id:2d2e9473 tags:
``` python
bars = nbutils.DocEmbeddingBars(embedder, session, gold_data)
bars.plot("an old man is twice a child", "Saint Hieronimo")
```
%% Output
%% Cell type:markdown id:cc59876d tags:
Instead of focusing on only one phrase, we now look at a plot of the embeddings of all documents in our gold standard data. The plot uses a dimensionality reduction technique known as t-Distributed Stochastic Neighbor Embedding (t-SNE), which allows us to reduce the high-dimensional embeddings to just two dimensions.
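The sketch below (our own illustration with random stand-in vectors, not how `nbutils` implements the plot) shows how such a two-dimensional reduction could be obtained with scikit-learn's t-SNE:

``` python
# Minimal sketch: reducing document embeddings to two dimensions with t-SNE.
import numpy as np
from sklearn.manifold import TSNE

doc_vectors = np.random.rand(100, 384)  # stand-in for 100 real document embeddings
coords = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(doc_vectors)
print(coords.shape)  # (100, 2) - one x/y coordinate pair per document
```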
%% Cell type:code id:6a3a0c1c tags:
``` python
doc_embedding_explorer = nbutils.DocEmbeddingExplorer(
session=session,
nlp=nlp,
gold=gold_data,
doc_encoders=doc_encoders,
)
doc_embedding_explorer.plot(
[
{
"encoder": "paraphrase [doc]",
"locator": ("fixed", "carry coals")
},
{
"encoder": "paraphrase [doc]",
"locator": ("fixed", "an old man is twice"),
},
]
)
pass
```
%% Output
%% Cell type:markdown id:247dee41 tags:
In the t-SNE visualization above, the dots represent documents and the colors represent the phrase that is linked to this document in our gold standard (more details on the underlying documents are shown when hovering the mouse cursor over the nodes). Dots that are close to each other indicate that the underlying documents share a certain similarity. Nearby dots of the same color indicate that the embedding tends to cluster documents in a way that mirrors the ground truth in our gold standard.
%% Cell type:markdown id:d8d40738 tags:
In the left plot, we searched for the phrase "we will not carry coals" (visualized as large yellow circle with a cross). The plot shows that the query is in fact part of a document cluster (smaller yellow circles) that contains a variation of that phrase. Similarly, on the right we see that the phrase "an old man is twice a child" loosely clusters with the actual (green) documents we associate with it in our gold standard.
In summary, for these phrases and documents, the `paraphrase_distilroberta` model automatically produces a document embedding that replicates some structure of our gold standard ground truth.
%% Cell type:markdown id:118dd3d5 tags:
In the following plot we look at token-based embeddings, document embeddings and how the two are related. The document embeddings on the left are averaged from token embeddings. On the right side, we see a t-SNE plot of the token embeddings that make up the document embeddings that are currently selected on the left. The colors differentiate which token embedding belongs to which document embedding. By showing the constituents of the document embeddings, this visualization makes more transparent how such document embeddings come to be and why certain documents on the left are clustered.
%% Cell type:code id:13942a2e tags:
``` python
doc_embedding_explorer.plot(
[
{
"encoder": "numberbatch",
"selection": [
"ww_32c26a7909c83bda",
"ww_b5b8083a6a1282bc",
"ww_9a6cb20b0b157545",
"ww_a6f4b0e3428ad510",
"ww_8e68a517bc3ecceb",
],
}
]
)
pass
```
%% Output
%% Cell type:markdown id:d3cda845 tags:
In the specific example shown above, the red circles on the left represent contexts that our gold standard lists as containing rephrasings of the phrase "a horse, a horse, my kingdom for a horse". We included two other unrelated documents that are color-coded as lilac and light rose.
To understand the document clustering on the left, we might expect that the term "horse" from the investigated phrase plays a central role. Indeed, the red token embeddings in the right plot show a cluster around "horse" in the lower left. However, it seems that this is not the main ingredient of the cluster of these documents on the left. On the contrary, we find that - for example - the documents represented by the two close red dots in the upper left corner of the document embeddings view, do not refer to "horse", but instead refer to a topic of water and ships. Looking at all three documents shown in red, we observe these terms:
* The term "boat" from "A boat, a boat" in the document "Eastward Ho!"
* The term "boat" in "muscle boat" in the document "The Poor Man's Comfort"
* The term "swim" from "To swim the river villain" in the document "The Battle of Alcazar"
To reiterate: the three red documents do not seem to be clustered around a concept of "horse" or "kingdom", as might be expected from their grouping in our gold standard. Instead, all three red documents seem to get clustered through some common notion of ship or water, which is not useful for recovering them when querying for the phrase 'a horse, a horse, my kingdom for a horse'
Note that there is a token cluster of sailing and water (e.g. "boat", "swim", "sail" and "river") on the left side of the right plot, which shows these terms are considered similar.
Especially concerning for the resulting document clustering is the fact that "The Battle of Alcazar" contains the term "swim" rather incidentally and not as an annotated part of the rephrased quote. Still, this term "swim", and not the term "horse", seems to make it cluster with the other two documents, making it a "swim" cluster rather than a "horse" cluster.
This short investigation serves as a caveat regarding the effects of unsupervised document clustering. We do see groups that form due to inherent qualities, but these qualities (e.g. "horse" vs. "water") might not at all mirror what we expect.
%% Cell type:markdown id:e5041ac3 tags:
Note that the plot above is interactive and can be customized (simply drag the mouse to lasso different documents).
%% Cell type:markdown id:0c10659f tags:
## 3.3 Exploring word mappings: WSB vs. WMD <a class="anchor" id="section_3_3"></a>
%% Cell type:markdown id:1c044ad9 tags:
So far, we have experimented with different token embeddings and seen how similarity comparison can be implemented for single tokens. We have also looked at document embeddings to compare documents. We now return to token embeddings, but instead of comparing single tokens, we now turn to the detection of intertextual references by comparing longer token sequences with each other. In contrast to document embeddings, we will work with one embedding per token.
The problem when comparing token sequences for this task is to identify the relevant parts or segments in a sequence. For example, a quotation like "to be or not to be" will occur as a local phenomenon, i.e. only at a certain position in a document. The rest of the document will likely be sentences that do not match with the quote phrase at all. Furthermore the phrase might be changed through the insertion, deletion or mutation of tokens.
In order to compute document similarity based on token embeddings, we turn to two kinds of approaches.
%% Cell type:markdown id:16977478 tags:
One popular class of techniques consists of sequence alignment algorithms and adjacent approaches like Dynamic Time Warping, see Kruskal (1983). In this section, we introduce the **<a href="http://rna.informatik.uni-freiburg.de/Teaching/index.jsp?toolName=Waterman-Smith-Beyer" target="_blank">Waterman-Smith-Beyer</a> (WSB)** algorithm, which produces optimal local alignments and provides a general (e.g. non-affine) cost function (Waterman, Smith & Beyer, 1976). Other commonly used alignment algorithms - such as <a href="http://rna.informatik.uni-freiburg.de/Teaching/index.jsp?toolName=Smith-Waterman" target="_blank">Smith-Waterman</a> and <a href="http://rna.informatik.uni-freiburg.de/Teaching/index.jsp?toolName=Gotoh" target="_blank">Gotoh</a> - can be regarded as special cases of WSB. Unlike the popular <a href="https://en.wikipedia.org/wiki/Needleman%E2%80%93Wunsch_algorithm">Needleman-Wunsch</a> global alignment algorithm, WSB produces local alignments. In contrast to classic formulations of WSB - which often use a fixed substitution cost - we use the word distance from word embeddings to compute the substitution penalty for specific pairs of words.
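To give an intuition of how alignment-based scoring works, here is a highly simplified sketch: a Smith-Waterman-style local alignment with a constant gap cost (our own illustration, not the Vectorian's WSB implementation, which supports general gap cost functions), where the substitution score of two tokens is the cosine similarity of their embeddings.

``` python
# Simplified local alignment over token embeddings (illustration only):
# substitution scores come from cosine similarity, gaps get a constant penalty.
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def local_alignment_score(query_vecs, doc_vecs, gap_cost=0.5):
    n, m = len(query_vecs), len(doc_vecs)
    H = np.zeros((n + 1, m + 1))
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = H[i - 1, j - 1] + cosine(query_vecs[i - 1], doc_vecs[j - 1])
            H[i, j] = max(0.0, match, H[i - 1, j] - gap_cost, H[i, j - 1] - gap_cost)
    return H.max()  # score of the best local alignment

query_vecs = np.random.rand(5, 300)  # stand-ins for real token embeddings
doc_vecs = np.random.rand(40, 300)
print(local_alignment_score(query_vecs, doc_vecs))
```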
%% Cell type:markdown id:03983ce2 tags:
Another approach to compute a measure of similarity between documents - more specifically their bag of words (bow) representation - is the so-called **<a href="https://nbviewer.jupyter.org/github/vene/vene.github.io/blob/pelican/content/blog/word-movers-distance-in-python.ipynb" target="_blank">Word Mover's Distance</a>** introduced by Kusner et al. (2015). The main idea is computing similarity through finding the optimal solution of a transportation problem between words.
In the following, we will experiment with two variants of the WMD. In addition to the classic WMD, where a transportation problem is solved over the normalized bag of words (nbow) vector, we also introduce a new variant of WMD that keeps the bag of words (bow) unnormalized, i.e. we pose the transportation problem on absolute word occurrence counts.
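The difference between the two variants only concerns the word weights that enter the transportation problem, as this small sketch illustrates:

``` python
# bow vs. nbow weights for a toy document (illustration only).
from collections import Counter

tokens = "to be or not to be".split()
bow = Counter(tokens)                                       # absolute counts
nbow = {w: c / sum(bow.values()) for w, c in bow.items()}   # weights summing to 1
print(dict(bow))  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
print(nbow)       # {'to': 0.33..., 'be': 0.33..., 'or': 0.16..., 'not': 0.16...}
```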
%% Cell type:markdown id:cd2e18bd tags:
Note: document embeddings do not need any of the above techniques, as they embed documents into a vector space in a way that queries and target documents that share similar features are close to each other in that space.
%% Cell type:markdown id:351b6557 tags:
### 3.3.1 Mapping quote queries to longer text documents <a class="anchor" id="section_3_3_1"></a>
%% Cell type:code id:58f33423 tags:
``` python
def make_index_builder(nlp, **kwargs):
return nbutils.InteractiveIndexBuilder(
session, nlp,
partition_encoders=dict((k, v.encoder) for k, v in doc_encoders.items()),
**kwargs)
```
%% Cell type:code id:3448bfae tags:
``` python
index_builder = make_index_builder(nlp)
index_builder
```
%% Output
%% Cell type:markdown id:525a45d8 tags:
What can be seen above is the description of a search strategy that we will employ in the following sections of this notebook. By switching to the "Edit" part, it is possible to explore the settings in more detail and even change them to something completely different. Note that various parameters in the "Edit" GUI - e.g. mixing of embeddings - are beyond the scope of this notebook. For more details see Liebl & Burghardt (2020a/b).
%% Cell type:markdown id:afeab1a4 tags:
Example: For the phrase "old men's crotchets" we find the following top match ("We old men have our crotchets") with a similarity score of 77.7%. By increasing the value of n we can display more ranked results.
%% Cell type:code id:2e79f163 tags:
``` python
index_builder.build_index().find("old men's crotchets", n=3)
```
%% Output
<vectorian.session.LabResult at 0x7fb1bd29ad00>
%% Cell type:markdown id:c70e87a9 tags:
### 3.3.2 Evaluation: Plotting the nDCG over the corpus <a class="anchor" id="section_3_3_2"></a>
%% Cell type:markdown id:d56b2b2f tags:
In the following we will systematically evaluate different strategies for identifying intertextuality in our gold standard data. We investigate WSB and the two variants of WMD (bow vs. nbow). To compute token embeddings, we use (compressed) fastText. As another variant we evaluate the performance of Sentence-BERT, when computing one embedding per document. The evaluation metric is **normalized discounted cumulative gain** [(nDCG)](https://en.wikipedia.org/wiki/Discounted_cumulative_gain), which we already used in earlier similar studies (also see Liebl & Burghardt, 2020b). It is computed as follows.
Each specific query (using a specific *phrase*) operates on a corpus consisting of a set of documents
$D = d_1, ..., d_n$
In our case n = 100. We call the set of relevant documents for this query R, with
$R = r_1, ..., r_k$
R models the ground truth encoded in the gold standard, i.e. the results we regard optimal for a specific query. In terms of the graph description of our gold standard, R is the set of nodes directly connected to the query node.
If the documents we actually retrieve through a search algorithm are numbered
$x_1, ..., x_n$
in order of their score (highest first), then the nDCG for that specific retrieval is defined as follows:
%% Cell type:markdown id:302b969a-7562-4ac5-bec3-3150514cd7ec tags:
$$
rel_i=\begin{cases}
1 & \text{if } x_i \in R,\\
0 & \text{if } x_i \notin R,
\end{cases} \hspace{1em}
DCG_n=\sum_{i=1}^{n}\frac{rel_i}{\log_2 (i+1)}, \hspace{1em}
IDCG_k=\sum_{i=1}^{k}\frac{1}{\log_2 (i+1)}, \hspace{1em}
nDCG=\frac{DCG_n}{IDCG_k}
$$
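As a sketch of these formulas (for illustration only; the actual evaluation below uses `nbutils.plot_ndcgs`), the nDCG of a single ranked result list can be computed as follows:

``` python
# nDCG for one query: ranked_ids is the retrieval order (highest score first),
# relevant_ids is the set R of gold standard matches for that query.
import math

def ndcg(ranked_ids, relevant_ids):
    dcg = sum(1.0 / math.log2(i + 2)
              for i, x in enumerate(ranked_ids) if x in relevant_ids)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(len(relevant_ids)))
    return dcg / idcg if idcg > 0 else 0.0

print(ndcg(["d3", "d7", "d1", "d9"], {"d1", "d3"}))  # relevant documents at ranks 1 and 3
```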
%% Cell type:markdown id:f22449b9 tags:
In the summary below you will find more detailed descriptions of the search strategies that will be evaluated in the following. By using "Edit", it is possible to change these settings to something else - a rerun of the following sections of the notebook would then be necessary.
%% Cell type:code id:203ab1e3 tags:
``` python
import collections
def strategy_evaluation(embedding, show_ui=True, doc=True):
index_builders = collections.OrderedDict(
{
"wsb": make_index_builder(
nlp,
strategy="Alignment",
strategy_options={
"alignment": vectorian.alignment.LocalAlignment(
gap={
"s": vectorian.alignment.smooth_gap_cost(5),
"t": vectorian.alignment.smooth_gap_cost(5)
}
),
"similarity": {"embedding": embedding}
},
),
"wmd nbow": make_index_builder(
nlp,
strategy="Alignment",
strategy_options={
"alignment": vectorian.alignment.WordMoversDistance.wmd("nbow"),
"similarity": {"embedding": embedding}
},
),
"wmd bow": make_index_builder(
nlp,
strategy="Alignment",
strategy_options={
"alignment": vectorian.alignment.WordMoversDistance.wmd("bow"),
"similarity": {"embedding": embedding}
},
)
})
if doc:
index_builders["doc sbert paraphrase"] = make_index_builder(
the_embeddings["sbert_paraphrase"].nlp,
strategy="Partition Embedding",
strategy_options={"encoder_index": 0})
index_builders["doc sbert msmarco"] = make_index_builder(
the_embeddings["sbert_msmarco"].nlp,
strategy="Partition Embedding",
strategy_options={"encoder_index": 1})
if show_ui:
# present UI of various options that allows for editing
accordion = widgets.Accordion(children=[x.displayable for x in index_builders.values()])
for i, k in enumerate(index_builders.keys()):
accordion.set_title(i, k)
display(accordion)
def make_plotter():
return nbutils.plot_ndcgs(
gold_data, dict((k, v.build_index()) for k, v in index_builders.items()))
return make_plotter
```
%% Cell type:code id:eebd39cf-599b-4ff0-8e99-a5e12f2e6e47 tags:
``` python
make_p1 = strategy_evaluation(the_embeddings["fasttext"])
```
%% Output
%% Cell type:markdown id:4b5bf126 tags:
With the following command we will get an overview of the quality of the results we obtain when using the index configured with `index_builder` by computing the nDCG over the 20 queries in our gold standard with regard to the known optimal results (this may take a few seconds).
%% Cell type:code id:d34c972d tags:
``` python
p1 = make_p1()
p1.plot()
```
%% Output
%% Cell type:code id:a8dcdbf8-d2f1-4339-958d-c8cc6b389a77 tags:
``` python
p1.plot_hist()
```
%% Output
%% Cell type:markdown id:3c628c94 tags:
In terms of overall performance (mean and median) Waterman-Smith-Beyer (WSB) performs better than all other tested approaches.
As the histogram above shows, WSB especially outperforms the other approaches in terms of the number of fully correct (nDCG=100%) queries. On the other hand, WSB also produces one query with a low nDCG of about 30%, which the other approaches avoid.
One advantage of WSB over the full WMD variants is how easy it is to interpret the results. WSB produces an alignment that relates one document token to at most one query token. For WMD, this assumption often breaks down, which makes the results harder to understand. We use this characteristic of WSB in the following section to illustrate which mappings actually occur.
%% Cell type:markdown id:dff7ace5-7dac-4333-8823-cc0800768484 tags:
The experiments above were performed with fastText. Note that running the same evaluation on the GloVe embeddings shows different results: WMD now outperforms WSB in terms of mean performance. The absolute performance is worse than with fastText, though.
%% Cell type:code id:42194025-cc5d-4f63-9af5-011ae543cfae tags:
``` python
make_p2 = strategy_evaluation(the_embeddings["glove"], show_ui=False, doc=False)
make_p2().plot()
```
%% Output
%% Cell type:markdown id:ec7c966e-b709-4f39-a013-e51ae61d155a tags:
For the sake of completeness, here are the results for Numberbatch, which are somewhat similar to those for GloVe.
%% Cell type:code id:b8ba624e-fd9b-41c6-9d3c-918c12bb220e tags:
``` python
make_p3 = strategy_evaluation(the_embeddings["numberbatch"], show_ui=False, doc=False)
make_p3().plot()
```
%% Output
%% Cell type:markdown id:26fd85e5 tags:
### 3.3.3 Focussing on single queries <a class="anchor" id="section_3_3_3"></a>
%% Cell type:markdown id:8f66f030 tags:
We now investigate some queries, for which the performance for WSB is bad, in order to better understand why our search fails to obtain the optimal (true positive) results at the top of the result list.
%% Cell type:code id:a94f0e44 tags:
``` python
index_builder = make_index_builder(nlp)
index_builder
```
%% Output
%% Cell type:markdown id:96320ac9 tags:
We turn to the query that scored lowest in the previous evaluation ("though this be madness, yet there is a method in it"), and look at its results in some more detail.
%% Cell type:code id:bdd1f1f6 tags:
``` python
plot_a = nbutils.plot_results(
gold_data, index_builder.build_index(), "though this be madness", rank=7
)
```
%% Output
%% Cell type:markdown id:71aa1df7 tags:
The best match obtained here (red bar on rank 7) is anchored on two word matches, namely `madness` (a 100% match) and `methods` (a 72% match). The other words are quite different and there is no good alignment.
%% Cell type:code id:a52b9ee2 tags:
``` python
plot_b = nbutils.plot_results(
gold_data, index_builder.build_index(), "though this be madness", rank=3
)
```
%% Output
%% Cell type:markdown id:08472e31 tags:
Above we see the rank 3 result from the same query, which is a false positive - i.e. our search claims it is a better result than the one we saw before, but in fact this result is not relevant according to our gold standard. If we analyze why this result gets such a high score nevertheless, we see that "is" and "in" both contribute 100% scores. In contrast to the scores before, 100% for "madness" and 72% for "methods", this partially explains the higher overall score (if we assume for now that the contributions from the other tokens are somewhat similar).
We will now try to understand why the true positive results are ranked rather low. The following plot breaks down how the overall scores are composed from single token scores:
%% Cell type:code id:c42e0bb4 tags:
``` python
nbutils.vis_token_scores(
plot_b.matches[:50],
highlight={"token": ["madness", "method"], "rank": [7, 21, 35, 46]},
)
```
%% Output
<function nbutils.vis_token_scores.<locals>.plot(indicate_gap_penalty)>
%% Cell type:markdown id:8193d37f tags:
The true positive results are marked with black triangles. We see that our current search strategy isn't doing a very good job of ranking them highly. Looking at the score composition of the relevant results, we can identify two distinct features: all relevant results show a rather large contribution of either "madness" (look at ranks 7 and 35, for example) and/or a rather large contribution of "method" (ranks 7 and 46). However, these contributions do not lead to higher ranks necessarily, since other words such as "is", "this" and "though" score higher for other results: for example, look at the contribution of words like "in" and "is" in ranks 1, 3 and 5.
In the plot below, we visualize this observation using ranks 1, 7 and 35. Comparing the rank 1 result on the left - which is a false positive - with the two relevant results on the right, we see that "in", "though" and "is" make up large parts of the score for rank 1, whereas "madness" is a considerable factor for the two relevant matches. Unfortunately, this contribution is not sufficient to bring these results to higher ranks.
%% Cell type:code id:5baf690f tags:
``` python
@widgets.interact(plot_as=widgets.ToggleButtons(options=["bar", "pie"], value="bar"))
def plot(plot_as):
nbutils.vis_token_scores(
plot_b.matches, kind=plot_as, ranks=[1, 7, 35], plot_width=800
)
```
%% Output
%% Cell type:markdown id:1a44136c tags:
The distributions of score contributions we just observed are the motivation for our approach to tag-weighted alignments as they are described in Liebl & Burghardt (2020a/b). Nagoudi and Schwab (2017) used a similar idea of POS tag-weighting for computing sentence similarity, but did not combine it with alignments.
We now demonstrate POS tag-weighted alignments by using an alignment that weights nouns like "madness" and "method" three times higher than other word types ("NN" is the Penn Treebank tag for singular nouns).
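The basic idea can be sketched as follows (a simplified illustration, not the Vectorian's internal implementation; it assumes the `en_core_web_sm` spaCy model is installed):

``` python
import spacy

tag_weights = {"NN": 3}  # weight singular nouns three times higher

nlp_sm = spacy.load("en_core_web_sm")
query = nlp_sm("though this be madness, yet there is method in it")

# hypothetical per-token similarity scores against some candidate passage
sims = {tok.text: 0.5 for tok in query if not tok.is_punct}

# each token's contribution is scaled according to its Penn Treebank tag
weighted = {
    tok.text: sims[tok.text] * tag_weights.get(tok.tag_, 1)
    for tok in query
    if not tok.is_punct
}
print(weighted)  # "madness" and "method" now contribute three times as much
```

In the actual Vectorian call below, the same weights are passed via `strategy_options={"tag_weights": {"NN": 3}}`.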
%% Cell type:code id:7311b2cf tags:
``` python
tag_weighted_index_builder = make_index_builder(
nlp, strategy="Tag-Weighted Alignment", strategy_options={"tag_weights": {"NN": 3}}
)
tag_weighted_index_builder
```
%% Output
%% Cell type:code id:663137b4 tags:
``` python
nbutils.plot_results(
gold_data, tag_weighted_index_builder.build_index(), "though this be madness"
)
```
%% Output
%% Cell type:markdown id:dac2ee24 tags:
Tag-weighting moves the correct results far to the top, namely to ranks 1, 2, 4 and 6. By increasing the NN weight to 5, it is even possible to promote the true positive at rank 67 to rank 11. This is a rather extreme measure, though, and we will not investigate it further here. Instead, we look at how the weighting affects the other queries by re-running the nDCG computation and comparing it against unweighted WSB.
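As a reminder, the nDCG rewards rankings that place relevant results near the top. A minimal sketch of the standard definition follows (the computation inside `nbutils.plot_ndcgs` may differ in details such as cutoffs):

``` python
import numpy as np

def dcg(relevances):
    # discounted cumulative gain: relevance discounted by log2 of the rank
    return sum(rel / np.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg(ranked_relevances, n_relevant):
    # normalize by the DCG of an ideal ranking with all relevant results on top
    ideal = [1] * n_relevant + [0] * (len(ranked_relevances) - n_relevant)
    return dcg(ranked_relevances) / dcg(ideal)

# e.g. four relevant results found at ranks 1, 2, 4 and 6 of a result list
print(round(ndcg([1, 1, 0, 1, 0, 1, 0, 0, 0, 0], n_relevant=4), 3))
```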
%% Cell type:code id:06013d75 tags:
``` python
index_builder_unweighted = make_index_builder(nlp)
index_builder_unweighted
```
%% Output
%% Cell type:code id:6b88c896 tags:
``` python
tw_plot = nbutils.plot_ndcgs(
gold_data,
{
"wsb_unweighted": index_builder_unweighted.build_index(),
"wsb_weighted": tag_weighted_index_builder.build_index(),
})
tw_plot.plot()
```
%% Output
%% Cell type:code id:1004ea4b-f465-4951-ba32-e2fc45be4224 tags:
``` python
tw_plot.plot_hist()
```
%% Output
%% Cell type:markdown id:79a49983 tags:
The evaluation shows that even though there is an improvement in the mean nDCG when employing POS tag weighting for the corpus, the effect vanishes when computing the median. The histogram above shows that POS tag weighting manages to shift two low-quality queries that are present in the unweighted version (at the 30% and 60% bands) into higher bands, which considerably improves the mean. Apart from these two outliers, however, POS tag weighting does not show improvement in the higher bands: it accumulates more queries in the 90% band, but loses some in the 100% band compared to the unweighted alignment. Therefore, the median does not improve.
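The following toy example (with made-up nDCG values, not our actual evaluation data) illustrates why lifting a few low outliers moves the mean noticeably while leaving the median untouched:

``` python
import numpy as np

# 20 hypothetical per-query nDCG values before and after tag weighting:
# two low outliers are lifted, while a few top queries drop slightly
unweighted = np.array([1.0] * 14 + [0.9] * 4 + [0.6, 0.3])
weighted = np.array([1.0] * 12 + [0.9] * 6 + [0.95, 0.9])

for name, values in (("unweighted", unweighted), ("tag-weighted", weighted)):
    print(f"{name:>12}: mean={values.mean():.3f}  median={np.median(values):.3f}")
```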
The two improved low-nDCG queries are listed in the column called *better* below. The first query is the one we used as the rationale for the whole design. More research is needed to understand whether tag-weighted alignment might benefit a larger subset of queries in general.
%% Cell type:code id:0b09f971-c1eb-4a69-ab1a-6a246a2f2f70 tags:
``` python
nbutils.eval_strategies(tw_plot.data, gold_data)
```
%% Output
%% Cell type:markdown id:5283d93a tags:
## 3.4 The influence of different embeddings <a class="anchor" id="section_3_4"></a>
%% Cell type:markdown id:9265e37a tags:
While we have experimented with different strategies like WSB and WMD in the previous sections, we will now use tag-weighted alignments only, and take a look into the effect that different embeddings might have on the results.
In contrast to earlier experiments, we only use token-based embeddings. For example, for Sentence-BERT we extract one embedding per token, instead of computing one embedding per document as in the experiments before. Thus, these embeddings only serve as input to a local alignment computation here.
One caveat here is that we are using compressed embeddings, i.e. we would need to verify these results with uncompressed embeddings. Still, the performance even of compressed fastText seems quite solid.
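For illustration, token-level embeddings can be obtained from a Sentence-BERT model roughly as follows (a sketch using the sentence-transformers library; the model name is only an example and the Vectorian's internal wrapping may differ):

``` python
from sentence_transformers import SentenceTransformer

# assumes the sentence-transformers package and this msmarco model are available
model = SentenceTransformer("msmarco-distilbert-base-v4")

sentence = "though this be madness, yet there is method in it"

# one embedding per (wordpiece) token instead of one per sentence
token_vecs = model.encode(sentence, output_value="token_embeddings")
print(token_vecs.shape)  # (number of wordpiece tokens, embedding dimension)
```

Note that these are wordpiece tokens; they still need to be mapped back onto word-level tokens before they can serve as input to an alignment.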
%% Cell type:code id:e78507f9 tags:
``` python
index_builders = {}
# for each embedding, define a search strategy based on tag-weighted alignments
for e in the_embeddings.values():
index_builders[e.name] = make_index_builder(
nlp=e.nlp if e.is_contextual else nlp,
strategy="Tag-Weighted Alignment",
strategy_options={"tag_weights": {"NN": 3}, "similarity": {"embedding": e}},
)
# present an UI to interactively edit these search strategies
accordion = widgets.Accordion(children=[x.displayable for x in index_builders.values()])
for i, k in enumerate(index_builders.keys()):
accordion.set_title(i, k)
accordion
```
%% Output
%% Cell type:code id:db53567e tags:
``` python
p2 = nbutils.plot_ndcgs(
gold_data, dict((k, v.build_index()) for k, v in index_builders.items()))
p2.plot()
```
%% Output
%% Cell type:code id:87ad7ceb-1f73-4d35-98fd-1ca07f68c70e tags:
``` python
p2.plot_hist()
```
%% Output
%% Cell type:markdown id:1ac03a32 tags:
Some observations to be made here (based on the detailed plots that are available in the interactive version):
* In a few queries ("llo, ho, ho my lord", "frailty, thy name is woman", "hell itself should gape"), GloVe gives slightly better results than fastText, but this cannot be generalized to the overall performance.
* For some queries ("I do bear a brain.", "O all you host of heaven!") the embedding does not seem to matter at all.
* A real competitor for fastText are contextual token embeddings from the Sentence-BERT *msmarco* model, which achieves a stunning 99.8% as median performance when used as token embedding with tag-weighted alignments, and outperforms its own document embedding median of 91.6% that we obtained earlier. Note that the mean performance does not show this effect though. As the histogram above explains, the effect can be attributed to the fact that the *msmarco* model achieves a 100% nDCG for over half of the performed queries (which dominate the median), while the other half has low outliers (e.g. in the range of 60%) which in turn affect the mean.
%% Cell type:markdown id:221deb5d tags:
# 4. Conclusion <a class="anchor" id="section_4"></a>
%% Cell type:markdown id:419cf643 tags:
In this interactive notebook we have demonstrated how different types of word embeddings can be used to detect intertextual phenomena in a semi-automatic way. We also provide a basic ground truth dataset of 100 short documents that contain 20 different quotes from Shakespeare's plays. This setup enables us to investigate the inner workings of different embeddings and to evaluate their suitability for the case of Shakespearean intertextuality.
%% Cell type:markdown id:ebb3b3ac tags:
The following main findings – that also open up perspectives for future research – were obtained from this evaluation study:
1. POS tag-weighted alignments achieve the highest overall mean nDCG for our specific data - however this is not true for the median nDCG. Since tag-weighting seems to improve the results of only a small subset of queries considerably, our future work will focus on better understanding the structure of queries and the role of outliers.
2. Document embeddings show a strong performance. A special appeal of these models lies in their ease of use (assuming a pre-trained model), as they do not rely on additional WSB or WMD mappings.
3. In terms of static embeddings, compressed fastText embeddings clearly outperform compressed GloVe and Numberbatch embeddings in our evaluations. Since we used low-dimensional embeddings for this notebook, these results are only a first hint and need to be verified through the use of full (high-dimensional) embeddings.
4. Contextual token embeddings extracted from Sentence-BERT, when used as an input to tag-weighted alignments, seem to offer the best overall performance. The *msmarco* model's median result of 99.8% seems to indicate that future research might want to focus on hybrid methods combining high-quality contextual token embeddings with other approaches such as alignments.
%% Cell type:markdown id:9829e464 tags:
While we were able to gather some interesting insights for the specific use case of Shakespearean intertextuality, the main goal of this notebook is to provide an interactive platform that enables other researchers to investigate other parameter combinations as well. The code blocks and widgets above offer many ways to play around with the settings and explore their effects on the ground truth data.
More importantly, researchers can also import their own (English language) data into the notebook and experiment with all of the functions and parameters that are part of the [Vectorian API](https://github.com/poke1024/vectorian-2021). We hope this notebook provides low-threshold access to the toolbox of embeddings for other researchers in the field of intertextuality studies and thus adds to a critical reflection of these new methods.
%% Cell type:markdown id:a4ce0575 tags:
# 5. Interactive searches with your own data <a class="anchor" id="section_5"></a>
%% Cell type:markdown id:400f94ab-7679-431f-9317-ffa037912142 tags:
This section allows you to upload your own text corpora and search in them.
%% Cell type:markdown id:ab53cf6d tags:
First, specify the text documents you want to search by using the upload widget below. Note that it expects plain text files with a `.txt` extension. The upload widget is provided through a helper class called `CustomSearch` that knows about the embeddings used for searching.
A good source of public domain plain text files for this demo is <a href="https://wikisource.org">wikisource.org</a>. For example, you can <a href="https://ws-export.wmcloud.org/?lang=en&title=The_Origin_of_Species_(1872)&format=txt">download Charles Darwin's "The Origin of Species" via wikisource</a>. To download other titles, you need to enter the exact title identifier used by wikisource. Always make sure that you are getting the plain text format.
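If you prefer to fetch such a file programmatically, the following sketch downloads the export linked above and stores it as a `.txt` file ready for upload (the local file name is arbitrary):

``` python
import requests

url = (
    "https://ws-export.wmcloud.org/"
    "?lang=en&title=The_Origin_of_Species_(1872)&format=txt"
)
response = requests.get(url, timeout=60)
response.raise_for_status()

# save as a plain text file that can be passed to the upload widget below
with open("the_origin_of_species.txt", "w", encoding="utf-8") as f:
    f.write(response.text)
```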
%% Cell type:code id:55478ee3 tags:
``` python
search = nbutils.CustomSearch(
[the_embeddings[x] for x in ["numberbatch", "fasttext"]])
search
```
%% Output
%% Cell type:markdown id:fd3ba2ba tags:
From the file or files stored in the upload widget above, we now build a Vectorian `Session`. For this, we need an `nlp` instance for importing the text documents. Depending on the size and number of documents, the initial processing can take some time.
Once processing has finished, you are presented with the full interactive search interface that the Vectorian offers (we have hidden it so far and focused on a subset). Note that in contrast to our earlier experiments, we do not search on the *document* level by default, but on the *sentence* level - i.e. we split each document into sentences and then check each sentence for occurrences of the given query phrase. You can change this behavior in the "Partition" dropdown.
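Conceptually, the sentence partition corresponds to something like the following sketch (assuming `nlp` is the spaCy pipeline created earlier in the notebook and includes a sentence segmenter; the Vectorian handles this splitting internally):

``` python
text = "There is a tide in the affairs of men. On such a full sea are we now afloat."
doc = nlp(text)

# each sentence becomes one searchable unit under the "sentence" partition
sentences = [sent.text for sent in doc.sents]
print(sentences)
```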
Use the "Query" edit field and the "Search" button to perform searches.
%% Cell type:code id:2b767872 tags:
``` python
search.interact(nlp)
```
%% Output
%% Cell type:markdown id:64fd87c8 tags:
# 6. References
<a class="anchor" id="section_6"></a>
Bamman, David & Crane, Gregory (2008). The logic and discovery of textual allusion. In Proceedings of the 2008 LREC Workshop on Language Technology for Cultural Heritage Data.
Bär, Daniel, Zesch, Torsten & Gurevych, Iryna (2012). Text reuse detection using a composition of text similarity measures. In Proceedings of COLING 2012, p. 167–184.
Büchler, Marco, Geßner, Annette, Berti, Monica & Eckart, Thomas (2013). Measuring the influence of a work by text re-use. Bulletin of the Institute of Classical Studies. Supplement, p. 63–79.
Burghardt, Manuel, Meyer, Selina, Schmidtbauer, Stephanie & Molz, Johannes (2019). “The Bard meets the Doctor” – Computergestützte Identifikation intertextueller Shakespearebezüge in der Science Fiction-Serie Dr. Who. Book of Abstracts, DHd.
Chandrasekaran, Dhivya & Mago, Vijay (2021). Evolution of Semantic Similarity – A Survey. ACM Computing Surveys (CSUR), 54(2), p. 1-37.
Faruqui, Manaal, Tsvetkov, Yulia, Rastogi, Pushpendre & Dyer, Chris (2016). Problems With Evaluation of Word Embeddings Using Word Similarity Tasks. In Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP, p. 30-35.
Forstall, Christopher, Coffee, Neil, Buck, Thomas, Roache, Katherine & Jacobson, Sarah (2015). Modeling the scholars: Detecting intertextuality through enhanced word-level n-gram matching. Digital Scholarship in the Humanities, 30(4), p. 503–515.
Genette, Gérard (1993). Palimpseste. Die Literatur auf zweiter Stufe. Suhrkamp.
Hohl-Trillini, Regula, Burghardt, Manuel, Molz, Johannes, Pichler, Alex, Reiter, Nils, Sulzbacher, Ben & Nantke, Julia (2020). Intertextualität in literarischen Texten und darüber hinaus. Book of Abstracts, DHd 2020, Paderborn.
Kusner, Matt, Sun, Yu, Kolkin, Nicholas & Weinberger, Kilian (2015). From word embeddings to document distances. In International conference on machine learning, p. 957-966.
Liebl, Bernhard & Burghardt, Manuel (2020a). „The Vectorian“ – Eine parametrisierbare Suchmaschine für intertextuelle Referenzen. Book of Abstracts, DHd 2020, Paderborn.
Liebl, Bernhard & Burghardt, Manuel (2020b). “Shakespeare in The Vectorian Age” – An Evaluation of Different Word Embeddings and NLP Parameters for the Detection of Shakespeare Quotes. Proceedings of the 4th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature (LaTeCH), co-located with COLING’2020.
Manjavacas, Enrique, Long, Brian & Kestemont, Mike (2019). On the feasibility of automated detection of allusive text reuse. Proceedings of the 3rd Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature.
Mikolov, Tomas, Chen, Kai, Corrado, Greg & Dean, Jeffrey (2013). Efficient estimation of word representations in vector space. In Proceedings of International Conference on Learning Representations (ICLR 2013). arXiv preprint arXiv:1301.3781.
Mikolov, Tomas, Grave, Edouard, Bojanowski, Piotr, Puhrsch, Christian & Joulin, Armand (2018). Advances in pretraining distributed word representations. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). arXiv preprint arXiv:1712.09405.
Nagoudi, El Moatez Billah & Schwab, Didier (2017). Semantic Similarity of Arabic Sentences with Word Embeddings. In Proceedings of the 3rd Arabic Natural Language Processing Workshop, Association for Computational Linguistics, 2017, p. 18–24.
Pennington, Jeffrey, Socher, Richard & Manning, Christopher D. (2014). Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), p. 1532-1543.
Reimers, Nils & Gurevych, Iryna (2019). Sentence-BERT: Sentence Embeddings Using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP).
Scheirer, Walter, Forstall, Christopher & Coffee, Neil (2014). The sense of a connection: Automatic tracing of intertextuality by meaning. Digital Scholarship in the Humanities, 31(1), p. 204–217.
Sohangir, Sahar & Wang, Dingding (2017). Document Understanding Using Improved Sqrt-Cosine Similarity. In Proceedings of the 2017 IEEE 11th International Conference on Semantic Computing (ICSC), p. 278-279.
Speer, Robyn, Chin, Joshua & Havasi, Catherine (2017). Conceptnet 5.5: An open multilingual graph of general knowledge. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17), p. 4444–4451.
Trillini, Regula (2020). Casual Shakespeare: Three Centuries of Verbal Echoes. Routledge.
Wang, Yuxuan, Hou, Yutai, Che, Wanxiang & Liu, Ting (2020). From static to dynamic word representations: A survey. International Journal of Machine Learning and Cybernetics 11, p. 1611–1630.
Waterman, Michael S., Smith, Temple F. & Beyer, William A. (1976). Some biological sequence metrics. Advances in Mathematics 20(3), p. 367-387.
Werner, Matheus & Laber, Eduardo (2020). Speeding up Word Mover's Distance and its variants via properties of distances between embeddings. In Proceedings of the 24th European Conference on Artificial Intelligence (ECAI 2020), p. 2204-2211.
Wieting, John, Bansal, Mohit, Gimpel, Kevin & Livescu, Karen (2015). Towards universal paraphrastic sentence embeddings. Proceedings of the 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico.