%% Cell type:markdown id:190bc476 tags:
# “Embed, embed! There’s knocking at the gate.”
## Detecting Intertextuality with the Vectorian Notebook of Embeddings
<i>Bernhard Liebl & Manuel Burghardt <br>
Computational Humanities Group, Leipzig University</i>
%% Cell type:markdown id:55a187a5 tags:
## Introduction
%% Cell type:markdown id:08900f82 tags:
The detection of intertextual references in text corpora is a digital humanities topic that has gained a lot of attention in recent years (for instance Bamman & Crane, 2008; Burghardt et al., 2019; Büchler et al., 2013; Forstall et al., 2015; Scheirer et al., 2014). While intertextuality – from a literary studies perspective – describes the phenomenon of one text being present in another text (cf. Genette, 1993), the computational problem at hand is the task of text similarity detection (Bär et al., 2012), and more concretely, semantic similarity detection.
%% Cell type:markdown id:931e786e tags:
In the following example of Shakespearean intertextuality, the words *bleed* and *leak* are semantically (and phonetically) similar, demonstrating that *Star Trek* here is quoting Shakespeare without any doubt:
> Shylock: If you prick *us*, do *we* not **bleed**? <br>
(Shakespeare; The Merchant of Venice)
> Data: If you prick *me*, do *I* not **leak**? <br>
(Star Trek: The Next Generation; The Measure of a Man)
%% Cell type:markdown id:d93736f9 tags:
### Enter: word embeddings
%% Cell type:markdown id:7d2d01c6 tags:
Over the years, there have been various attempts at measuring semantic similarity, some of them knowledge-based (e.g. based on WordNet), others corpus-based, like LDA (Chandrasekaran & Vijay, 2021). The arrival of word embeddings (Mikolov et al., 2013) has changed the field considerably by introducing a new and fast way to tackle the notion of word meaning. On the one hand, word embeddings are building blocks that can be combined with a number of other methods, such as alignments, soft cosine or Word Mover's Distance, to implement some kind of sentence similarity (Manjavacas et al., 2019). On the other hand, the concept of embeddings can be extended to work on the sentence level as well, which is a conceptually different approach (Wieting et al., 2016).
%% Cell type:markdown id:fe1c1073 tags:
We introduce the **Vectorian** as a framework that allows researchers to try out different embedding-based methods for intertextuality detection. In contrast to previous versions of the Vectorian (Liebl & Burghardt, 2020a/b), which were mere web interfaces with a limited set of static parameters, we now present a clean and completely redesigned API that is showcased in an interactive Jupyter notebook. In this notebook, we first use the Vectorian to build queries where we plug in static word embeddings such as fastText (Mikolov et al., 2018) and GloVe (Pennington et al., 2014). We evaluate the influence of computing similarity through alignments such as Waterman-Smith-Beyer (WSB; Waterman et al., 1976) and two variants of Word Mover’s Distance (WMD; Kusner et al., 2015). We also investigate the performance of state-of-the-art sentence embeddings like Siamese BERT networks (Reimers & Gurevych, 2019) for the task - both on a document level (as document embeddings) and as contextual token embeddings. Overall, we find that POS tag-weighted WSB with fastText offers highly competitive performance. Readers can upload their own data for performing search queries and try out additional vector space metrics such as p-norms or improved sqrt-cosine similarity (Sohangir & Wang, 2017).
%% Cell type:markdown id:76ec578f tags:
### Outline of the notebook
%% Cell type:markdown id:3354577a tags:
In the notebook, we will go through different examples of intertextuality to demonstrate and explain the implications of different embeddings and similarity measures. To achieve this we provide a small ground truth corpus of intertextual Shakespeare references that can be used for some controlled evaluation experiments. Our main goal is to provide an interactive environment, where researchers can test out different methods for text reuse and intertextuality detection. This notebook thus adds to a critical reflection of digital methods and can help to shed some light on their epistemological implications for the field of computational intertextuality detection. At the end of the notebook, researchers can also easily import their own data and investigate all the showcased methods for their specific texts.
%% Cell type:markdown id:81127404 tags:
## Overview of different types of embeddings
%% Cell type:markdown id:37c44475 tags:
**Word embeddings** take up the linguistic concept of collocations. For each word in a corpus, it is recorded with which other words it occurs. These collocation profiles are then represented as vectors. If, for example, two words (e.g. "car" and "truck") occur with very similar words (e.g. "wheels", "drive", "street", etc.), then they will also have very similar word vectors, i.e. they will be semantically - or at least structurally - very similar.
There are now various established ways to compute embeddings for word similarity tasks. A first important distinction to be made is between *token* / *word* embeddings and *document* embeddings (see diagram below). While **token embeddings** model one embedding per token, **document embeddings** try to map an entire document (i.e. a set of tokens) into one single embedding. There are two common ways to compute document embeddings. One way is to derive them from token embeddings, for instance by averaging those. More complex approaches train dedicated models that are optimized to produce good document embeddings.
This means that, all in all, we can distinguish three types of embeddings:
* original token embeddings
* document embeddings derived from token embeddings
* document embeddings from dedicated models, such as Sentence-BERT (Reimers & Gurevych, 2019)
![Different kinds of embeddings](miscellaneous/diagram_embeddings_1.svg)
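%% Cell type:markdown id:1a2b3c01 tags:
To make the second type above concrete, the following minimal NumPy sketch derives a document embedding from token embeddings by averaging them. The tiny lookup table is purely illustrative and is not one of the models used later in this notebook.
%% Cell type:code id:1a2b3c02 tags:
``` python
import numpy as np
# purely illustrative 4-dimensional "token embeddings"
toy_embeddings = {
    "car": np.array([0.9, 0.1, 0.3, 0.0]),
    "truck": np.array([0.8, 0.2, 0.4, 0.1]),
    "drives": np.array([0.2, 0.7, 0.1, 0.5]),
}
def doc_embedding(tokens, lookup):
    # average all token vectors that are present in the lookup table
    vectors = [lookup[t] for t in tokens if t in lookup]
    return np.mean(vectors, axis=0)
doc_embedding(["car", "drives"], toy_embeddings)
```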
%% Cell type:markdown id:c5fe82b4 tags:
For token embeddings, there are also various options, as the diagram below illustrates. The most recent option is contextual token embeddings (sometimes also called *dynamic* embeddings), which incorporate a specific token's context and can be obtained from architectures like ELMo or BERT. Another option is static token embeddings, which map one token to one embedding, independent of its specific occurrence in a text. For an overview of static and contextual embeddings and their differences, see Wang et al. (2020).
For static embeddings there is now a variety of established options like fastText or GloVe. We can also combine embeddings or stack them (i.e. concatenate embedding vectors) to create new embeddings from existing ones (a short sketch of this follows after the diagram below).
![Different kinds of embeddings](miscellaneous/diagram_embeddings_2.svg)
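%% Cell type:markdown id:2b3c4d01 tags:
Stacking is conceptually nothing more than vector concatenation. The short sketch below illustrates this with two made-up vectors (not the actual fastText or Numberbatch models, which are loaded further down).
%% Cell type:code id:2b3c4d02 tags:
``` python
import numpy as np
# two made-up embedding vectors for the same word, e.g. from two different models
vec_a = np.array([0.1, 0.5, 0.2])  # e.g. a 3-dimensional vector from model A
vec_b = np.array([0.7, 0.3])       # e.g. a 2-dimensional vector from model B
# stacking concatenates the two vectors into one 5-dimensional embedding
np.concatenate([vec_a, vec_b])
```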
%% Cell type:markdown id:df308da6 tags:
In this notebook, we showcase the following four classes of embeddings:
* **Static token embeddings**: these operate on the token level. We experiment with GloVe (Pennington et al., 2014), fastText (Mikolov et al., 2018) and Numberbatch (Speer et al., 2017). We use these three embeddings to compute token similarity and combine them with alignment algorithms (such as Waterman-Smith-Beyer) to compute document similarity. We also investigate the effect of stacking two static embeddings (fastText and Numberbatch).
* **Contextual token embeddings**: these also operate on the token level, but embeddings can change according to a specific token instance's context. In this notebook we experiment with using such token embeddings from the Sentence-BERT model (Reimers & Gurevych, 2019).
* **Document embeddings derived from specially trained models**: document embeddings represent one document via one single embedding. We use document embeddings obtained from a BERT model. More specifically, we use a Siamese BERT model named Sentence-BERT, which is trained specifically for the semantic textual similarity (STS) task (Reimers & Gurevych, 2019).
* **Document embeddings derived from token embeddings**: We also experiment with averaging different kinds of token embeddings (static and contextual) to derive document embeddings.
%% Cell type:markdown id:2b0dbc82 tags:
## Technical setup
%% Cell type:markdown id:43f94243 tags:
We import a couple of helper functions for visualizations and various computations (`nbutils`), a wrapper to load our gold standard data (`gold`), and finally the Vectorian library (`vectorian`), through which we will perform searches and evaluations later on.
In `nbutils.initialize` we check whether there is a [bokeh server](https://docs.bokeh.org/en/latest/index.html) available. This typically *is* the case for local Jupyter installations, but is *not* the case for notebooks running on *mybinder*. In the latter case, the notebook has some limitations regarding interactivity.
%% Cell type:code id:6bb0f486 tags:
``` python
import sys
# make "nbutils" and "code" importable
sys.path.append("code")
import nbutils
import gold
import vectorian
import ipywidgets as widgets
import importlib
from ipywidgets import interact
# initialize nbutils
nbutils.initialize("auto")
```
%% Output
%% Cell type:markdown id:69a0586f tags:
## Introducing the gold standard dataset
%% Cell type:markdown id:9e9b0560 tags:
In the following we use a collection of 100 short text snippets (=documents) that quote a total of 20 different Shakespeare phrases. All of these documents were derived from the [WordWeb IDEM portal](http://wordweb-idem.ch/about-us.html). Each document quotes exactly one of the 20 phrases. For some phrases, e.g. "to be or not to be", there are more quoting documents than for others (see the interactive overview of documents below). If multiple documents quote the same phrase, we selected them so that each quotes it in a different way. There are no verbatim quotes in the documents, but always more or less complex variations of the original phrase.
%% Cell type:markdown id:2e212a0b tags:
We use this collection of quote documents as a gold standard, to be able to assess how well different embeddings work for different types of quotes.
%% Cell type:code id:dc9fb99d tags:
``` python
gold_data = gold.Data("data/raw_data/gold.json")
```
%% Cell type:markdown id:1d2043de tags:
Technically speaking, our gold standard consists of a number of `Patterns`. Each `Pattern` is associated with a phrase, e.g. "to be or not to be", which occurs in a rephrased form in other works and contexts. These reoccurences, which model text reuse, are called `Occurrences` in our data. Each such `Occurrence` carries the actual phrase and a larger context in which it occurs, which together we call the `Evidence`. The data layout for our gold standard looks as follows:
%% Cell type:markdown id:964b86d2 tags:
![UML of gold standard data](miscellaneous/gold_uml.svg)
%% Cell type:markdown id:0baa21e8 tags:
One specific example in this data is the `Occurrence` of the `Pattern` "to be or not to be" in a `Source` titled "The Phoenix". The `Evidence` to be found here is the phrase "to be named or not be named".
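%% Cell type:markdown id:3c4d5e01 tags:
Schematically, this layout could be expressed with the following (hypothetical) Python dataclasses. The actual attribute names in `gold.json` may differ; the sketch is only meant to clarify how `Pattern`, `Occurrence`, `Evidence` and `Source` relate to each other.
%% Cell type:code id:3c4d5e02 tags:
``` python
from dataclasses import dataclass
from typing import List

@dataclass
class Source:
    title: str            # e.g. "The Phoenix"

@dataclass
class Evidence:
    phrase: str           # the rephrased quote, e.g. "to be named or not be named"
    context: str          # the larger passage in which the phrase occurs

@dataclass
class Occurrence:
    source: Source        # the work in which the text reuse occurs
    evidence: Evidence    # the phrase plus its surrounding context

@dataclass
class Pattern:
    phrase: str           # the original phrase, e.g. "to be or not to be"
    occurrences: List[Occurrence]
```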
%% Cell type:markdown id:870a650d tags:
All of the 20 quote patterns can be browsed in the 100 associated documents via the following widget.
%% Cell type:code id:64ca0683 tags:
``` python
nbutils.Browser(gold_data, "to be or not to be", "The Phoenix")
pass
```
%% Output
%% Cell type:markdown id:88bebe94 tags:
For a further exploration of the dataset, we also provide a visualization of the gold standard dataset, with `Patterns` indicated as blue circles and `Evidence` indicated as green circles. Matching evidence and patterns are connected via edges, and each bouquet consists of one pattern and the matching instances of text reuse. Hovering the mouse over the nodes reveals their actual contents.
%% Cell type:code id:29d62e2c tags:
``` python
nbutils.plot_gold(gold_data)
```
%% Output
%% Cell type:markdown id:e746071b tags:
## "Shapespeare in the Vectorian Age" – Meet the Vectorian framework
%% Cell type:markdown id:e5afbd3e tags:
To conduct our actual investigations, we rely on a framework called the *Vectorian*, which we first introduced in 2020 (Liebl & Burghardt, 2020a/b). Using highly optimized algorithms and data structures, the Vectorian enables fast searching over the gold standard data using a variety of approaches and strategies.
%% Cell type:markdown id:f1cb0e38 tags:
In order to use the Vectorian, we need to map the gold standard concepts to Vectorian concepts as follows: From the gold standard, we want to gather all texts that contain some kind of text reuse. These texts reside in the `context` attribute of the `Evidence` instances (see figure above). In the Vectorian, we then create `Documents` from those texts. A `Document` in Vectorian terminology is something we can perform a search on. `Documents` in the Vectorian are created using different kinds of `Importers` that perform the necessary natural language processing tasks using an additional `NLP` class (see diagram below). Since this processing can be very time-consuming, we precomputed it and use the `Corpus` class to quickly load the preprocessed `Documents` into the notebook. For details about the full preprocessing, see `code/prepare_corpus.ipynb`.
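%% Cell type:markdown id:4d5e6f01 tags:
As a rough sketch of what that precomputation involves (without running it here), the `Evidence` contexts would be turned into Vectorian `Documents` along the following lines. The `StringImporter` call mirrors the one used at the end of this notebook; how the contexts are accessed from the gold data is an assumption made purely for illustration - see `code/prepare_corpus.ipynb` for the real pipeline.
%% Cell type:code id:4d5e6f02 tags:
``` python
from vectorian.importers import StringImporter

def build_documents(gold_data, nlp):
    im = StringImporter(nlp)  # runs the NLP pipeline on each imported text
    docs = []
    # NOTE: the attribute names used below are illustrative assumptions
    for pattern in gold_data.patterns:
        for occurrence in pattern.occurrences:
            docs.append(im(
                occurrence.evidence.context,
                title=occurrence.source.title,
                unique_id=occurrence.source.title,
            ))
    return docs
```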
%% Cell type:markdown id:f467944b tags:
Using the loaded `Documents` and a set of `Embeddings` we want to work with, we can then create a `Session` that allows us to perform intertextuality searches. More about the specific steps we take can be found in [Vectorian's Python API](https://github.com/poke1024/vectorian-2021).
%% Cell type:markdown id:fe62e898 tags:
## Loading word embeddings
%% Cell type:markdown id:a9406706 tags:
In this step the static embeddings that were described above are loaded from Vectorian's model zoo. This zoo contains a number of prebuilt embeddings for various languages, for instance GloVe, fastText and Numberbatch.
%% Cell type:code id:994c5c17 tags:
``` python
from vectorian.embeddings import Zoo
Zoo.list()[::20]
```
%% Output
('fasttext-af',
'fasttext-ba',
'fasttext-bs',
'fasttext-de',
'fasttext-fa',
'fasttext-gv',
'fasttext-id',
'fasttext-kn',
'fasttext-mai',
'fasttext-mt',
'fasttext-nl',
'fasttext-pl',
'fasttext-sah',
'fasttext-sq',
'fasttext-tl',
'fasttext-vo',
'glove-6B-50',
'numberbatch-19.08-eo',
'numberbatch-19.08-io',
'numberbatch-19.08-oc',
'numberbatch-19.08-vi')
%% Cell type:markdown id:673a5d4b tags:
For reasons of limited RAM in the interactive Binder environment (and to limit download times), we use smaller or compressed versions of the static embeddings:
* for **GloVe**, we use the official 50-dimensional version of the 6B variant
* for **fastText** we use a version that was compressed using the standard settings in https://github.com/avidale/compress-fasttext
* for **Numberbatch** we use a 50-dimensional version that was reduced using a standard PCA
%% Cell type:code id:d1bc8611 tags:
``` python
the_embeddings = {}
the_embeddings["glove"] = Zoo.load("glove-6B-50")
the_embeddings["numberbatch"] = Zoo.load("numberbatch-19.08-en").pca(50)
the_embeddings["fasttext"] = Zoo.load("fasttext-en-mini")
```
%% Cell type:markdown id:fae47fc7 tags:
We also use one **stacked embedding**, in which we combine fastText and Numberbatch.
%% Cell type:code id:e7d9f16c tags:
``` python
from vectorian.embeddings import StackedEmbedding
the_embeddings["fasttext_numberbatch"] = StackedEmbedding(
[the_embeddings["fasttext"], the_embeddings["numberbatch"]]
)
```
%% Cell type:markdown id:04037dac tags:
Next, we instantiate an NLP parser that is able to provide embeddings based on Sentence-BERT (Reimers & Gurevych, 2019).
%% Cell type:code id:84dd961f tags:
``` python
nlp = nbutils.make_nlp()
nlp.pipeline
```
%% Output
[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x7fe8dd216270>),
('tagger', <spacy.pipeline.tagger.Tagger at 0x7fe8dbbb8680>),
('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x7fe8dd15ab20>),
('attribute_ruler',
<spacy.pipeline.attributeruler.AttributeRuler at 0x7fe8dd1a8ac0>),
('lemmatizer',
<spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x7fe8dd1cb340>),
('sentence_bert',
<spacy_sentence_bert.language.SentenceBert at 0x7fe8c1f42fa0>)]
%% Cell type:markdown id:994ec251 tags:
Finally, we add a wrapper that allows us to use Sentence-BERT's contextual token embeddings in the Vectorian.
%% Cell type:code id:16ef1ef5 tags:
``` python
from vectorian.embeddings import SentenceBertEmbedding
the_embeddings["sbert"] = SentenceBertEmbedding(nlp, 768)
```
%% Cell type:markdown id:65d16656 tags:
## Creating the session
%% Cell type:markdown id:7f2c3ada tags:
The Vectorian `Session` is created with the specified embeddings and the preprocessed documents, which are loaded via the `Corpus` class:
%% Cell type:code id:7c2ab856 tags:
``` python
from vectorian.session import LabSession
from vectorian.corpus import Corpus
session = LabSession(
Corpus.load("data/processed_data/corpus"),
embeddings=the_embeddings.values(),
normalizers="default",
)
```
%% Output
Opening glove-6B-50: 100%|██████████
Opening numberbatch-19.08-en: 100%|██████████
1587it [00:00, 31729.89it/s]
1587it [00:00, 19887.96it/s]
%% Cell type:markdown id:24786e14 tags:
The session now contains all embeddings we will work with as well as the list of documents that contain the texts from the gold standard `Evidence` items.
%% Cell type:markdown id:ec8c309d tags:
## An introduction to word embeddings and token similarity
%% Cell type:markdown id:936bfeb6 tags:
Before we dive into the actual intertextuality analyses, we first take a brief look at the inner workings of embeddings. Mathematically speaking, a word embedding is a vector $\mathbf{x}$ of dimension $n$, i.e. a vector consisting of $n$ scalars:
$$\mathbf{x}=(x_1, x_2, ..., x_{n-1}, x_n)$$
For example, the compressed Numberbatch embedding we use has $n=50$ and thus represents the word "coffee" with the following 50 scalar values:
%% Cell type:code id:474c8b2e tags:
``` python
widgets.GridBox(
[
widgets.Label(f"{x:.2f}")
for x in session.word_vec(the_embeddings["numberbatch"], "coffee")
],
layout=widgets.Layout(grid_template_columns="repeat(10, 50px)"),
)
```
%% Output
%% Cell type:markdown id:9bea3512 tags:
Since the above representation is difficult to understand, we visualize the values of
$$x_1, x_2, ..., x_{n-1}, x_n$$
through different colors. By default, all values are normalized by $||\mathbf{x}||_2$, i.e. the dot product of these values gives the cosine similarity.
%% Cell type:code id:ea9c7758 tags:
``` python
@interact(
    embedding=widgets.Dropdown(
        options=[(k, v) for k, v in the_embeddings.items() if not v.is_contextual],
        value=the_embeddings["numberbatch"],
    ),
    normalize=True,
)
def plot(embedding, normalize):
    nbutils.plot_embedding_vectors_val(
        ["sail", "boat", "coffee", "tea", "guitar", "piano"],
        get_vec=lambda w: session.word_vec(embedding, w),
        normalize=normalize,
    )
```
%% Output
%% Cell type:markdown id:d59f2826 tags:
By looking at these color patterns, we can gain some intuitive understanding of why and how word embeddings are appropriate for word similarity calculations. For example, *sail* and *boat* both show a strong activation for dimension 27. Similarly, *guitar* and *piano* share similar values for dimension 24. The words *coffee* and *tea* also share some similar patterns for dimension 2 and dimension 49, which also set them apart from the other four words.
%% Cell type:markdown id:90ef6760 tags:
A common approach to compute the similarity between two word vectors $\mathbf{u}$ and $\mathbf{v}$ in this kind of high-dimensional vector space is to compute the cosine of the angle $\theta$ between the vectors, which is called **cosine similarity**:
$$\cos \theta = \frac{\mathbf{u} \cdot \mathbf{v}}{||\mathbf{u}||_2\,||\mathbf{v}||_2} = \frac{\sum_{i=1}^n \mathbf{u}_i \mathbf{v}_i}{\sqrt{\sum_{i=1}^n \mathbf{u}_i^2} \sqrt{\sum_{i=1}^n \mathbf{v}_i^2}} = \sum_{i=1}^n \left( \frac{\mathbf{u}}{||\mathbf{u}||_2} \right)_i \left( \frac{\mathbf{v}}{||\mathbf{v}||_2} \right)_i$$
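%% Cell type:markdown id:5e6f7a01 tags:
To make the formula concrete, here is a minimal NumPy check - independent of the Vectorian API, with two made-up vectors - that the dot product of the two L2-normalized vectors equals the usual cosine similarity formulation:
%% Cell type:code id:5e6f7a02 tags:
``` python
import numpy as np
u = np.array([0.3, 0.8, 0.1, 0.4])
v = np.array([0.2, 0.9, 0.0, 0.3])
# cosine similarity as the dot product of the two L2-normalized vectors
cos_sim = np.dot(u / np.linalg.norm(u), v / np.linalg.norm(v))
# equivalent formulation: u·v / (||u||_2 * ||v||_2)
assert np.isclose(cos_sim, np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
cos_sim
```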
%% Cell type:markdown id:3d8876b1 tags:
A large positive value (i.e. a small $\theta$ between $\mathbf{u}$ and $\mathbf{v}$) indicates higher similarity, whereas a small or even negative value (i.e. a large $\theta$) indicates lower similarity. For a discussion of issues with this notion of similarity, see Faruqui et al. (2016).
The visualization below encodes
$$\left( \frac{\mathbf{u}}{||\mathbf{u}||_2} \right)_i \left( \frac{\mathbf{v}}{||\mathbf{v}||_2} \right)_i$$
for each $i$, $1 \le i \le n$, through colors to illustrate how the different components contribute to the cosine similarity of two words. Brighter colors (orange/yellow) indicate dimensions with higher values.
%% Cell type:code id:70e2ade9 tags:
``` python
@interact(
    embedding=widgets.Dropdown(
        options=[(k, v) for k, v in the_embeddings.items() if not v.is_contextual],
        value=the_embeddings["numberbatch"],
    )
)
def plot(embedding):
    nbutils.plot_embedding_vectors_mul(
        [("sail", "boat"), ("coffee", "tea"), ("guitar", "piano")],
        get_vec=lambda w: session.word_vec(embedding, w),
    )
```
%% Output
%% Cell type:markdown id:ff150311 tags:
A comparable investigation of fastText shows similar spots of positive dimensions. The plot here is somewhat more complex due to the higher number of dimensions (*n* = 300).
%% Cell type:code id:adc1d2ed tags:
``` python
@interact(
    embedding=widgets.Dropdown(
        options=[(k, v) for k, v in the_embeddings.items() if not v.is_contextual],
        value=the_embeddings["fasttext"],
    )
)
def plot(embedding):
    nbutils.plot_embedding_vectors_mul(
        [("sail", "boat"), ("coffee", "tea"), ("guitar", "piano")],
        get_vec=lambda w: session.word_vec(embedding, w),
    )
```
%% Output
%% Cell type:markdown id:9e9c6ff5 tags:
Computing the overall cosine similarity for two words is mathematically equivalent to summing up the terms in the diagram above. The overall similarity between *guitar* and *piano* is approx. 68% with the fastText embedding we use. For *guitar* and *coffee* it is significantly lower with a similarity of approx. 20%.
%% Cell type:code id:4b9d13ae tags:
``` python
from vectorian.metrics import TokenSimilarity, CosineSimilarity
token_sim = TokenSimilarity(the_embeddings["fasttext"], CosineSimilarity())
[session.similarity(token_sim, "guitar", x) for x in ["piano", "coffee"]]
```
%% Output
[0.68097234, 0.19857685]
%% Cell type:markdown id:5fea91e0 tags:
To compute the similarity between tokens of a contextual embedding, we need to reference the actual token instances within a text document.
%% Cell type:code id:b45a1647 tags:
``` python
token_sim = TokenSimilarity(the_embeddings["sbert"], CosineSimilarity())
a = list(session.documents[0].spans(session.partition("document")))[0][2]
b = list(session.documents[6].spans(session.partition("document")))[0][5]
[a.text, b.text, session.similarity(token_sim, a, b)]
```
%% Output
['Hermit', 'men', 0.5877105]
%% Cell type:code id:83e40913 tags:
``` python
list(session.documents[0].spans(session.partition("token")))
```
%% Output
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-17-e110dd300f60> in <module>
----> 1 list(session.documents[0].spans(session.partition("token")))
/opt/miniconda3/envs/vectorian-demo/lib/python3.7/site-packages/vectorian/corpus/document.py in spans(self, partition)
635 def spans(self, partition):
636 get = self._spans_getter(partition)
--> 637 for i in range(self.n_spans(partition)):
638 yield get(i)
639
/opt/miniconda3/envs/vectorian-demo/lib/python3.7/site-packages/vectorian/corpus/document.py in n_spans(self, partition)
602
603 def n_spans(self, partition):
--> 604 n = self._spans[partition.level]['start'].shape[0]
605 k = n // partition.window_step
606 if (k * partition.window_step) < n:
KeyError: 'token'
%% Cell type:markdown id:ee9d77e8 tags:
## Detecting Shakespearean intertextuality through word embeddings
%% Cell type:markdown id:5c4ebd8b tags:
We explore the usefulness of embeddings and token similarity with the gold standard dataset that was introduced earlier. In the following example the pattern "the rest is silence" is quoted as "the rest is all but wind". While the syntactic structure is mirrored between pattern and occurrence, the term "silence" is replaced with "all but wind". If we focus on nouns only, we can expect "silence" and "wind" to be semantically related - at least to a certain degree.
%% Cell type:code id:53cb1328 tags:
``` python
vis = nbutils.TokenSimPlotterFactory(session, nlp, gold_data)
```
%% Cell type:code id:08494bc8 tags:
``` python
plotter1 = vis.make("rest is silence", "Fig for Fortune")
```
%% Output
%% Cell type:markdown id:52f2a451 tags:
In the following, we inspect the cosine similarity of the token "silence" with other tokens in the document's context ("A Fig for Fortune", 1596) for three different embedding models. It becomes clear that for all three embeddings there is a strong connection between "silence" and "wind". The cosine similarity is particularly high with the Numberbatch model. Nevertheless, the absolute value of 0.3 for Numberbatch is still in a rather low range. Interestingly, GloVe associates "silence" with "action", which can be understood as quite the opposite of silence. The phenomenon that embeddings sometimes cluster opposites is a common observation and can be a problem when trying to distinguish between synonyms and antonyms.
%% Cell type:code id:3df8a4d6 tags:
``` python
plotter1("silence")
```
%% Output
%% Cell type:markdown id:19d78987 tags:
In an example for the pattern "sea of troubles", we see that the word "sea" in one document is paraphrased as "waves", and "troubles" is substituted by "troublesome". If we take a closer look at the cosine similarities of the tokens "sea" and "troubles" with all the other tokens in the document's context, we see that they are – expectedly – rather high.
%% Cell type:code id:3fb62321 tags:
``` python
plotter2 = vis.make("sea of troubles", "Book of Common Prayer")
```
%% Output
%% Cell type:code id:5f7b8069 tags:
``` python
plotter2("sea")
```
%% Output
%% Cell type:code id:97a3192d tags:
``` python
plotter2("troubles")
```
%% Output
%% Cell type:markdown id:5cee2f24 tags:
It is also interesting to investigate how out-of-vocabulary words like "troublesomest" produce zero similarities with standard key-value embeddings, whereas fastText is still able to produce a vector thanks to subword information.
%% Cell type:code id:ac79d82c tags:
``` python
plotter2("troublesomest")
```
%% Output
%% Cell type:markdown id:b06e6051 tags:
## Exploring document embeddings
%% Cell type:markdown id:22375ca1 tags:
Next, we consider the representation of each document with a single embedding to gain an understanding of how different embedding strategies relate to document similarity. We will later return to individual token embeddings.
For this purpose, we will use the two strategies already mentioned for computing document embeddings:
* averaging over token embeddings
* computing document embeddings through a dedicated model
%% Cell type:markdown id:a2cb44d3 tags:
In order to achieve the latter, we compute document embeddings using Sentence-BERT. The `CachedPartitionEncoder` in the following code will encode and cache the document embeddings.
%% Cell type:code id:2f78d07d tags:
``` python
from vectorian.embeddings import CachedPartitionEncoder, SpanEncoder
# create an encoder that basically calls nlp(t).vector
sbert_encoder = CachedPartitionEncoder(
SpanEncoder(lambda texts: [nlp(t).vector for t in texts])
)
# compute encodings and/or save cached data
sbert_encoder.try_load("data/processed_data/doc_embeddings")
sbert_encoder.cache(session.documents, session.partition("document"))
sbert_encoder.save("data/processed_data/doc_embeddings")
# extract name of encoder for later use
sbert_encoder_name = nlp.meta["name"]
```
%% Cell type:markdown id:a323d646 tags:
We first present a strategy that creates embeddings for a document using the Sentence-BERT encoder.
%% Cell type:code id:8a3a06ec tags:
``` python
embedder = nbutils.DocEmbedder(
session=session,
nlp=nlp,
doc_encoders={sbert_encoder_name: sbert_encoder},
encoder="paraphrase_distilroberta",
)
embedder.display()
```
%% Output
%% Cell type:markdown id:db0755a1 tags:
Similar to the investigation of token embedding values, we now look at the feature dimensions of the document embeddings. In the following plot we observe that the pattern "an old man is twice a child" and the corresponding text reuses from the gold standard (i.e. the true positives) show some salient contribution around dimensions 25 and 300 (see the 5 upper rows). When comparing the same pattern with non-matching text reuse occurrences from the "go, by Saint Hieronimo" pattern, on the other hand (see the 5 lower rows), there is less activation in these areas. These areas therefore seem to offer good features for differentiating the correct occurrences of a pattern from incorrect ones.
%% Cell type:code id:2d2e9473 tags:
``` python
bars = nbutils.DocEmbeddingBars(embedder, session, gold_data)
bars.plot("an old man is twice a child", "Saint Hieronimo")
```
%% Output
%% Cell type:markdown id:cc59876d tags:
Instead of focusing on only one pattern, we now look at a plot of the embeddings of all documents in our gold standard data. The plot uses a dimensionality reduction technique known as t-Distributed Stochastic Neighbor Embedding (t-SNE), which allows us to reduce the many embedding dimensions to just two dimensions.
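%% Cell type:markdown id:6f7a8b01 tags:
The plot itself is produced by `nbutils.DocEmbeddingExplorer` below. Purely to illustrate the kind of reduction involved, a bare-bones t-SNE projection of a matrix of document vectors could look like the following sketch (assuming scikit-learn is available; the random matrix merely stands in for real document embeddings):
%% Cell type:code id:6f7a8b02 tags:
``` python
import numpy as np
from sklearn.manifold import TSNE
# a stand-in matrix of 100 documents x 768 embedding dimensions
doc_vectors = np.random.RandomState(0).rand(100, 768)
# project the high-dimensional document embeddings down to 2 dimensions
xy = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(doc_vectors)
xy.shape  # (100, 2)
```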
%% Cell type:code id:6a3a0c1c tags:
``` python
doc_embedding_explorer = nbutils.DocEmbeddingExplorer(
session=session,
nlp=nlp,
gold=gold_data,
doc_encoders={sbert_encoder_name: sbert_encoder},
)
doc_embedding_explorer.plot(
[
{"encoder": "paraphrase_distilroberta", "locator": ("fixed", "carry coals")},
{
"encoder": "paraphrase_distilroberta",
"locator": ("fixed", "an old man is twice"),
},
]
)
```
%% Output
%% Cell type:markdown id:247dee41 tags:
In the t-SNE visualization above, the dots represent documents and the colors represent the query that results in this document in our gold standard (more details on the underlying documents are shown when hovering the mouse cursor over the nodes). Dots that are close to each other indicate that the underlying documents share a certain similarity. Nearby dots of the same color indicate that the embedding tends to cluster documents similar to our gold standard.
%% Cell type:markdown id:d8d40738 tags:
In the left plot, we searched for the phrase "we will not carry coals" (visualized as a large yellow circle with a cross). The plot shows that the query is in fact part of a document cluster (smaller green-yellow circles) that contains a variation of that phrase. Similarly, on the right we see that the phrase "an old man is twice a child" clusters with the actual (green) documents we associate with it in our gold standard.
For these phrases and documents, the `paraphrase_distilroberta` model automatically produces a document embedding that actually recognizes and separates inherent structures.
%% Cell type:markdown id:118dd3d5 tags:
In the plot above we looked at the document embedding produced by a **token-based** embedding. This has the advantage that we can actually look at the token embeddings that make up the document embedding (through averaging), which we will do in the following plot. On the right side, we see a t-SNE plot of all the token embeddings that occur in the documents that are selected on the left. This visualization makes more transparent why certain documents on the left are clustered as similar.
%% Cell type:code id:13942a2e tags:
``` python
doc_embedding_explorer.plot(
[
{
"encoder": "numberbatch",
"selection": [
"ww_32c26a7909c83bda",
"ww_b5b8083a6a1282bc",
"ww_9a6cb20b0b157545",
"ww_a6f4b0e3428ad510",
"ww_8e68a517bc3ecceb",
],
}
]
)
```
%% Output
%% Cell type:markdown id:d3cda845 tags:
The red circles on the left represent contexts that match the phrase "a horse, a horse, my kingdom for a horse". When we look at the token embeddings in the right plot (which includes other documents as well), we see that a group is formed due to word embeddings that cluster around "horse", but we also see a cluster around "boat", "sail" and "river" on the left. In fact, document 1 contains "muscle boat", document 2 contains "To swim the river villain", and document 3 contains "A boat, a boat". We see that this kind of unsupervised document clustering groups items due to inherent qualities that might not actually match the initial query criteria.
%% Cell type:markdown id:e5041ac3 tags:
Custom token embedding plots can be generated by selecting different documents from the left plot (drag the mouse to lasso).
%% Cell type:markdown id:0c10659f tags:
## Mapping quote queries to longer text documents: WSB vs. WMD
%% Cell type:markdown id:1c044ad9 tags:
So far, we have experimented with different token embeddings and seen how similarity comparison can be implemented for single tokens. However, for the detection of intertextual references it is necessary to compare longer token sequences with each other. The problem here is to identify the right segment in the target text, because a quotation like "to be or not to be" will occur as a local phenomenon, only at a certain position in a document. The rest of the document will likely be sentences that do not match with the quote phrase at all. To identify the segment where a potential quote occurs, there are different approaches.
%% Cell type:markdown id:16977478 tags:
One popular class of techniques are sequence alignment algorithms as well as adjacent approaches like Dynamic Time Warping, see Kruskal (1983). In this section, we introduce the **Waterman-Smith-Beyer (WSB)** algorithm, which produces optimal local alignments and supports a general (e.g. non-affine) gap cost function (Waterman, Smith & Beyer, 1976). Other commonly used alignment algorithms - such as Smith-Waterman and Gotoh - can be regarded as special cases of WSB. In contrast to the popular Needleman-Wunsch global alignment algorithm, WSB produces local alignments. And in contrast to classic formulations of WSB - which often use a fixed substitution cost - we use the word distance from word embeddings to compute the substitution penalty for specific pairs of words.
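%% Cell type:markdown id:7a8b9c01 tags:
To illustrate the general idea of embedding-based local alignment - not the Vectorian's actual WSB implementation, which supports arbitrary gap cost functions - the following sketch computes a simple Smith-Waterman-style score in which the substitution score of two words is their cosine similarity under a small, made-up embedding lookup:
%% Cell type:code id:7a8b9c02 tags:
``` python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def local_alignment_score(query, doc, lookup, gap_cost=0.5):
    # Smith-Waterman-style local alignment where substitution scores come
    # from word embedding similarity instead of a fixed substitution matrix
    n, m = len(query), len(doc)
    H = np.zeros((n + 1, m + 1))
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sim = cosine(lookup[query[i - 1]], lookup[doc[j - 1]])
            H[i, j] = max(
                0.0,                     # start a new local alignment
                H[i - 1, j - 1] + sim,   # align query[i-1] with doc[j-1]
                H[i - 1, j] - gap_cost,  # skip a query token
                H[i, j - 1] - gap_cost,  # skip a document token
            )
    return H.max()

# purely illustrative 2-dimensional "embeddings"
toy = {"rest": [1.0, 0.0], "is": [0.7, 0.7], "silence": [0.2, 1.0],
       "the": [0.6, 0.6], "all": [0.5, 0.5], "but": [0.4, 0.6], "wind": [0.3, 0.9]}
toy = {w: np.array(v) for w, v in toy.items()}

local_alignment_score(
    ["rest", "is", "silence"], ["the", "rest", "is", "all", "but", "wind"], toy
)
```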
%% Cell type:markdown id:03983ce2 tags:
Another approach to compute a measure of similarity between bags of words is the so-called **Word Mover's Distance** introduced by Kusner et al. (2015). The main idea is computing similarity through finding the optimal solution of a transportation problem between words.
In the following, we will actually experiment with two variants of the WMD. In addition to the classic WMD, where a transportation problem is solved over the normalized bag of words (nbow) vector, we also introduce a new variant of WMD. In this new variant, we keep the bag of words (bow) unnormalized, i.e. we pose the transportation problem on absolute word occurrence counts.
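%% Cell type:markdown id:8b9c0d01 tags:
The difference between the two variants boils down to the weight vector that enters the transportation problem. A minimal sketch of the two weightings (leaving out the optimal transport solve itself) might look like this:
%% Cell type:code id:8b9c0d02 tags:
``` python
from collections import Counter
import numpy as np

def bow_weights(tokens, vocab):
    # unnormalized bag of words: absolute occurrence counts per vocabulary word
    counts = Counter(tokens)
    return np.array([counts[w] for w in vocab], dtype=float)

def nbow_weights(tokens, vocab):
    # normalized bag of words (as in Kusner et al., 2015): weights sum to 1
    w = bow_weights(tokens, vocab)
    return w / w.sum()

vocab = ["sea", "of", "troubles", "waves", "troublesome"]
doc = ["waves", "of", "troublesome", "waves"]
bow_weights(doc, vocab), nbow_weights(doc, vocab)
```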
%% Cell type:markdown id:cd2e18bd tags:
Note: document embeddings do not need any of the above techniques, as they embed documents into a vector space in a way that queries and target documents that share similar features are close to each other in that space.
%% Cell type:markdown id:351b6557 tags:
### A search query using alignment over similar tokens
%% Cell type:code id:58f33423 tags:
``` python
def make_index_builder(**kwargs):
    return nbutils.InteractiveIndexBuilder(
        session, nlp, partition_encoders={sbert_encoder_name: sbert_encoder}, **kwargs
    )
```
%% Cell type:code id:3448bfae tags:
``` python
index_builder = make_index_builder()
index_builder
```
%% Output
%% Cell type:markdown id:525a45d8 tags:
What can be seen above is the description of a search strategy that we will employ in the following sections of this notebook. By switching to the "Edit" part, it is possible to explore the settings in more detail and even change them to something completely different. Note that various parameters in the "Edit" GUI - e.g. the mixing of embeddings - are beyond the scope of this notebook. For more details, see Liebl & Burghardt (2020a/b).
%% Cell type:markdown id:afeab1a4 tags:
Example: For the pattern "to be or not to be" we find the following top match ("to be named or not to be named") with a similarity score of 96.6%. By increasing the value of `n`, we can display more ranked results.
%% Cell type:code id:5cad0598 tags:
``` python
gold_data.patterns[0].phrase
```
%% Output
'to be or not to be'
%% Cell type:code id:2e79f163 tags:
``` python
index_builder.build_index().find(gold_data.patterns[0].phrase, n=1)
```
%% Output
<vectorian.session.LabResult at 0x7f9f58d04250>
%% Cell type:markdown id:c70e87a9 tags:
### Evaluation: Plotting the nDCG over the corpus
%% Cell type:markdown id:d56b2b2f tags:
In the following we systematically evaluate different strategies for identifying intertextuality in our gold standard data. We investigate WSB and the two variants of WMD (bow vs. nbow) in combination with the fastText embedding. As another variant, we evaluate the performance of Sentence-BERT document embeddings. The evaluation metric is the **normalized discounted cumulative gain** [(nDCG)](https://en.wikipedia.org/wiki/Discounted_cumulative_gain) (also see Liebl & Burghardt, 2020b).
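%% Cell type:markdown id:9c0d1e01 tags:
As a reference for how the metric works (this is not the Vectorian's own implementation), nDCG can be computed from a list of binary relevance judgments, ordered by result rank, as follows:
%% Cell type:code id:9c0d1e02 tags:
``` python
import numpy as np

def dcg(relevances):
    # discounted cumulative gain: relevance discounted by log2 of the rank
    return sum(rel / np.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg(relevances):
    # normalize by the DCG of an ideally ordered result list
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# e.g. relevant results returned at ranks 1 and 4 out of 5 results
ndcg([1, 0, 0, 1, 0])
```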
%% Cell type:markdown id:f22449b9 tags:
In the summary below you will find more detailed descriptions of the search strategies that will be evaluated in the following. By using "Edit", it is possible to change these settings to something else (a rerun of the following sections of the notebook would then be necessary).
%% Cell type:code id:203ab1e3 tags:
``` python
import collections
import ipywidgets as widgets
# define 4 different search strategies via make_index_builder
index_builders = collections.OrderedDict(
    {
        "wsb": make_index_builder(
            strategy="Alignment",
            strategy_options={"alignment": vectorian.alignment.WatermanSmithBeyer()},
        ),
        "wmd nbow": make_index_builder(
            strategy="Alignment",
            strategy_options={
                "alignment": vectorian.alignment.WordMoversDistance.wmd("nbow")
            },
        ),
        "wmd bow": make_index_builder(
            strategy="Alignment",
            strategy_options={
                "alignment": vectorian.alignment.WordMoversDistance.wmd("bow")
            },
        ),
        "doc embedding": make_index_builder(strategy="Partition Embedding"),
    }
)
# present a UI of the various options that allows for editing
accordion = widgets.Accordion(children=[x.displayable for x in index_builders.values()])
for i, k in enumerate(index_builders.keys()):
    accordion.set_title(i, k)
accordion
```
%% Output
%% Cell type:markdown id:4b5bf126 tags:
With the following command, we get an overview of the quality of the results obtained with the indices configured via `index_builders`, by computing the nDCG over the 20 queries in our gold standard with regard to the known optimal results (this may take a few seconds).
%% Cell type:code id:d34c972d tags:
``` python
nbutils.plot_ndcgs(
gold_data, dict((k, v.build_index()) for k, v in index_builders.items())
)
```
%% Output
100%|██████████| 80/80 [00:22<00:00, 3.56it/s]
%% Cell type:markdown id:3c628c94 tags:
We see that some queries obtain 100%, i.e. the top results match the optimal ones given in our gold standard. Waterman-Smith-Beyer (WSB) tends to perform a little better than Word Mover's Distance (WMD), with the exception of "though this be madness...", where WMD outperforms WSB. In general, the Vectorian modification of WMD, which does not use nbow, performs better than the original formulation of WMD. The one exception here is "livers white as milk".
One advantage of WSB over the full WMD variants is its easy interpretability. WSB produces an alignment that relates one document token to at most one query token. For WMD, this assumption often breaks down, which makes the results harder to understand. We use this characteristic of WSB in the following section to illustrate which mappings actually occur.
%% Cell type:markdown id:26fd85e5 tags:
### Focussing on single queries
%% Cell type:markdown id:8f66f030 tags:
We now investigate some queries for which WSB performs poorly, to get a better understanding of why our search fails to obtain the optimal (true positive) results at the top of the result list.
%% Cell type:code id:a94f0e44 tags:
``` python
index_builder = make_index_builder()
index_builder
```
%% Output
%% Cell type:markdown id:96320ac9 tags:
We turn to the query with the lowest score ("though this be madness, yet there is a method in it") in the previous evaluation, and look at its results in some more detail.
%% Cell type:code id:bdd1f1f6 tags:
``` python
plot_a = nbutils.plot_results(
gold_data, index_builder.build_index(), "though this be madness", rank=6
)
```
%% Output
%% Cell type:markdown id:71aa1df7 tags:
The best match obtained here (red bar on rank 6) is anchored on two word matches, namely `madness` (a 100% match) and `methods` (a 72% match). The other words are quite different and there is no good alignment.
%% Cell type:code id:a52b9ee2 tags:
``` python
plot_b = nbutils.plot_results(
gold_data, index_builder.build_index(), "though this be madness", rank=3
)
```
%% Output
%% Cell type:markdown id:08472e31 tags:
Above we see the rank 3 result from the same query, which is a false positive - i.e. our search proclaims it is a better result than the one we saw before, but in fact this result is not relevant according to our gold standard. If we analyze why this result gets such a high score nonetheless, we see that "is" and "in" both contribute 100% scores. In contrast to the scores before, 100% for "madness" and 72% for "methods", this partially explains the higher overall score (if we assume for now that the contributions from the other tokens are somewhat similar).
We will now try to understand why the true positive results are ranked rather low. To find a solution for this issue, we will be looking at the composition of scores for each result we obtain for this query:
%% Cell type:code id:c42e0bb4 tags:
``` python
nbutils.vis_token_scores(
plot_b.matches[:50],
highlight={"token": ["madness", "method"], "rank": [6, 20, 35, 45]},
)
```
%% Output
%% Cell type:markdown id:8193d37f tags:
The true positive results are marked with black triangles. We see that our current search strategy is not doing a very good job of ranking them highly. Looking at the score composition of the relevant results, we can identify two distinct features: all relevant results show a rather large contribution of either "madness" (look at rank 6 and rank 35, for example) and/or a rather large contribution of "method" (ranks 45 and 6). However, these contributions do not necessarily lead to higher ranks, since other words such as "is", "this" and "though" score higher for other results: for example, look at the contribution of "in" for rank 1.
In the plot below, we visualize this observation using ranks 1, 6 and 35. Comparing the rank 1 result on the left - which is a false positive - with the two relevant results on the right, we see that "in", "though" and "is" make up large parts of the score for rank 1, whereas "madness" is a considerable factor for the two relevant matches. Unfortunately, this contribution is not sufficient to bring these results to higher ranks.
%% Cell type:code id:5baf690f tags:
``` python
@widgets.interact(plot_as=widgets.ToggleButtons(options=["bar", "pie"], value="pie"))
def plot(plot_as):
    nbutils.vis_token_scores(
        plot_b.matches, kind=plot_as, ranks=[1, 6, 35], plot_width=800
    )
```
%% Output
%% Cell type:markdown id:1a44136c tags:
The distributions of score contributions we just observed are the motivation for our approach to tag-weighted alignments as they are described in Liebl & Burghardt (2020a/b). Nagoudi and Schwab (2017) used a similar idea of POS tag-weighting for computing sentence similarity, but did not combine it with alignments.
We now demonstrate POS tag-weighted alignments, using a weighting that scores nouns like "madness" and "method" 3 times higher than other word types. "NN" is a Penn Treebank tag that identifies singular nouns.
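%% Cell type:markdown id:0d1e2f01 tags:
Schematically, tag weighting scales each token's contribution to the overall score by a weight attached to its POS tag. The simplified sketch below conveys the idea (it is not the Vectorian's exact scoring):
%% Cell type:code id:0d1e2f02 tags:
``` python
def weighted_score(matched_tokens, tag_weights, default_weight=1.0):
    # matched_tokens: (similarity, pos_tag) pairs coming out of an alignment;
    # each similarity is scaled by the weight assigned to the token's POS tag
    weights = [tag_weights.get(tag, default_weight) for _, tag in matched_tokens]
    sims = [sim for sim, _ in matched_tokens]
    return sum(w * s for w, s in zip(weights, sims)) / sum(weights)

# with {"NN": 3}, nouns like "madness" and "method" count three times as much
# as the other tokens (all similarities here are made up for illustration)
weighted_score(
    [(1.00, "NN"), (0.72, "NN"), (0.95, "VBZ"), (0.90, "IN")],
    tag_weights={"NN": 3},
)
```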
%% Cell type:code id:7311b2cf tags:
``` python
tag_weighted_index_builder = make_index_builder(
strategy="Tag-Weighted Alignment", strategy_options={"tag_weights": {"NN": 3}}
)
tag_weighted_index_builder
```
%% Output
%% Cell type:code id:663137b4 tags:
``` python
nbutils.plot_results(
gold_data, tag_weighted_index_builder.build_index(), "though this be madness"
)
```
%% Output
%% Cell type:markdown id:dac2ee24 tags:
Tag-weighting moves the correct results far to the top, namely to ranks 1, 2, 4 and 6. By increasing the NN weight to 5, it is possible to promote the true positive on rank 73 to rank 15. This is a bit of an extreme measure though and we will not investigate it further here. Instead we investigate how the weighting affects the other queries. Therefore, we re-run the nDCG computation and compare it against unweighted WSB.
%% Cell type:code id:06013d75 tags:
``` python
index_builder_unweighted = make_index_builder()
index_builder_unweighted
```
%% Output
%% Cell type:code id:6b88c896 tags:
``` python
nbutils.plot_ndcgs(
gold_data,
{
"wsb_unweighted": index_builder_unweighted.build_index(),
"wsb_weighted": tag_weighted_index_builder.build_index(),
},
)
```
%% Output
100%|██████████| 40/40 [00:11<00:00, 3.37it/s]
%% Cell type:markdown id:79a49983 tags:
In the end, the evaluation shows that we considerably increased our accuracy through employing POS tag weighting in this scenario.
%% Cell type:markdown id:5283d93a tags:
## The influence of different embeddings
%% Cell type:markdown id:9265e37a tags:
While we have varied the similarity metrics in the previous example, we will now look into the effect that different embeddings might have on the results. The caveat here is that we are using compressed embeddings, i.e. we would need to verify these results with uncompressed embeddings. Still, the performance of compressed fastText seems very solid.
%% Cell type:code id:e78507f9 tags:
``` python
index_builders = {}
# for each embedding, define a search strategy based on tag-weighted alignments
for e in the_embeddings.values():
    index_builders[e.name] = make_index_builder(
        strategy="Tag-Weighted Alignment",
        strategy_options={"tag_weights": {"NN": 3}, "similarity": {"embedding": e}},
    )
# present a UI to interactively edit these search strategies
accordion = widgets.Accordion(children=[x.displayable for x in index_builders.values()])
for i, k in enumerate(index_builders.keys()):
    accordion.set_title(i, k)
accordion
```
%% Output
%% Cell type:code id:db53567e tags:
``` python
nbutils.plot_ndcgs(
gold_data, dict((k, v.build_index()) for k, v in index_builders.items())
)
```
%% Output
100%|██████████| 100/100 [00:32<00:00, 3.08it/s]
%% Cell type:markdown id:1ac03a32 tags:
Some observations to be made here:
* In a few queries ("llo, ho, ho my lord", "frailty, thy name is woman", "hell itself should gape"), GloVe gives slightly better results than fastText, but this cannot be generalized to the overall performance.
* For some queries ("I do bear a brain.", "O all you host of heaven!") the embedding does not seem to matter at all.
* Real competitors for fastText are the contextual embeddings from Sentence-BERT - however, these are much more expensive in terms of computation time, storage space and code complexity.
%% Cell type:markdown id:221deb5d tags:
## Conclusion
%% Cell type:markdown id:419cf643 tags:
In this interactive notebook we have demonstrated how different types of word embeddings can be used to detect intertextual phenomena in a semi-automatic way. We also provide a basic ground truth dataset of 100 short documents that contain 20 different quotes from Shakespeare's plays. This setup enables us to investigate the inner workings of different embeddings and to evaluate their suitability for the case of Shakespearean intertextuality.
%% Cell type:markdown id:ebb3b3ac tags:
The following main findings – that also open up perspectives for future research – were obtained from this evaluation study:
1. POS tag-weighted alignments achieve the highest overall performance (in terms of nDCG) for our specific data. As a result, we want to encourage other researchers to explore the use of tag-weighting in alignments, as it seems to be a powerful approach that is not widely known in the literature.
2. Document embeddings also show a strong performance. Their special appeal lies in easy use (assuming a pretrained model), as they do not rely on additional WSB or WMD mappings.
3. When using WMD on short texts, our bow variant outperforms the original nbow variant by Kusner et al. (2015) by quite a margin. Further research must show whether these findings also hold for longer documents and how they evaluate when compared to optimized variants like RWMD (Kusner et al., 2015) and other variants (Werner & Laber, 2020).
4. In terms of embeddings, compressed fastText embeddings clearly outperform compressed GloVe and Numberbatch embeddings in our evaluations. Since we used low-dimensional embeddings for this notebook, these results are only a first hint and need to be verified through the use of full (high-dimensional) embeddings. Contextual token embeddings extracted from Sentence-BERT provide similar performance to fastText - however, they are far more demanding in terms of computation time and memory. All in all, our evaluation seems to highlight the solid performance of fastText.
%% Cell type:markdown id:9829e464 tags:
While we were able to gather some interesting insights for the specific use case of Shakespearean intertextuality, the main goal of this notebook is to provide an interactive platform that enables other researchers to investigate other parameter combinations as well. The code blocks and widgets above offer many ways to play around with the settings and explore their effects on the ground truth data.
More importantly, researchers can also import their own (English language) data to the notebook and experiment with all of the functions and parameters that are part of the [Vectorian API](https://github.com/poke1024/vectorian-2021). We hope this notebook provides low-threshold access to the toolbox of embeddings for other researchers in the field of intertextuality studies and thus adds to a critical reflection of these new methods.
%% Cell type:markdown id:a4ce0575 tags:
## Interactive searches with your own data
%% Cell type:markdown id:ab53cf6d tags:
First, specify the text documents you want to search through via the upload widget below:
%% Cell type:code id:55478ee3 tags:
``` python
import ipywidgets as widgets
upload = widgets.FileUpload(accept=".txt", multiple=True)
upload
```
%% Output
%% Cell type:markdown id:fd3ba2ba tags:
From the contents of this upload widget, we now build a Vectorian session we can perform searches on. As always with a Vectorian session, we need to specify the embeddings we want to employ for searching. We also need an `nlp` instance for importing the text documents. Depending on the size and number of documents, this step can take some time.
%% Cell type:code id:2b767872 tags:
``` python
from vectorian.importers import StringImporter
from vectorian.session import LabSession
import codecs
def files_to_session(upload):
    if not upload.value:
        raise RuntimeError(
            "cannot run on empty upload. please provide at least one text file."
        )
    im = StringImporter(nlp)
    # for each uploaded file, import it via importer "im" and add to "docs"
    docs = []
    for k, data in upload.value.items():
        docs.append(
            im(codecs.decode(data["content"], encoding="utf-8"), title=k, unique_id=k)
        )
    return LabSession(
        docs,
        embeddings=[the_embeddings["numberbatch"], the_embeddings["fasttext"]],
        normalizers="default",
    )
upload_session = files_to_session(upload)
```
%% Output
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-48-826143823b45> in <module>
25 normalizers="default")
26
---> 27 upload_session = files_to_session(upload)
<ipython-input-48-826143823b45> in files_to_session(upload)
9
10 if not upload.value:
---> 11 raise RuntimeError("cannot run on empty upload. please provide at least one text file.")
12
13 docs = []
RuntimeError: cannot run on empty upload. please provide at least one text file.
%% Cell type:markdown id:b850e23d tags:
Now we present the full interactive search interface the Vectorian offers (we have hidden it so far and focussed on a subset). Note that in contrast to our experiments earlier, we do not search on the *document* level by default, but rather the *sentence* level - i.e. we split each document into sentences and then search on each sentence. You can change this in the "Partition" dropdown.
%% Cell type:code id:022a1acb tags:
``` python
upload_session.interact(nlp)
```
%% Cell type:markdown id:64fd87c8 tags:
## References
Bamman, David & Crane, Gregory (2008). The logic and discovery of textual allusion. In Proceedings of the 2008 LREC Workshop on Language Technology for Cultural Heritage Data.
Bär, Daniel, Zesch, Torsten & Gurevych, Iryna (2012). Text reuse detection using a composition of text similarity measures. In Proceedings of COLING 2012, p. 167–184.
Büchler, Marco, Geßner, Annette, Berti, Monica & Eckart, Thomas (2013). Measuring the influence of a work by text re-use. Bulletin of the Institute of Classical Studies. Supplement, p. 63–79.
Burghardt, Manuel, Meyer, Selina, Schmidtbauer, Stephanie & Molz, Johannes (2019). “The Bard meets the Doctor” – Computergestützte Identifikation intertextueller Shakespearebezüge in der Science Fiction-Serie Dr. Who. Book of Abstracts, DHd.
Chandrasekaran, Dhivya & Mago, Vijay (2021). Evolution of Semantic Similarity – A Survey. ACM Computing Surveys (CSUR), 54(2), p. 1-37.
Faruqui, Manaal, Tsvetkov, Yulia, Rastogi, Pushpendre & Dyer, Chris (2016). Problems With Evaluation of Word Embeddings Using Word Similarity Tasks. In Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP, p. 30-35.
Forstall, Christopher, Coffee, Neil, Buck, Thomas, Roache, Katherine & Jacobson, Sarah (2015). Modeling the scholars: Detecting intertextuality through enhanced word-level n-gram matching. Digital Scholarship in the Humanities, 30(4), p. 503–515.
Genette, Gérard (1993). Palimpseste. Die Literatur auf zweiter Stufe. Suhrkamp.
Kusner, Matt, Sun, Yu, Kolkin, Nicholas & Weinberger, Kilian (2015). From word embeddings to document distances. In International conference on machine learning, p. 957-966.
Liebl, Bernhard & Burghardt, Manuel (2020a). „The Vectorian“ – Eine parametrisierbare Suchmaschine für intertextuelle Referenzen. Book of Abstracts, DHd 2020, Paderborn.
Liebl, Bernhard & Burghardt, Manuel (2020b). “Shakespeare in The Vectorian Age” – An Evaluation of Different Word Embeddings and NLP Parameters for the Detection of Shakespeare Quotes”. Proceedings of the 4th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature (LateCH), co-located with COLING’2020.
Manjavacas, Enrique, Long, Brian & Kestemont, Mike (2019). On the feasibility of automated detection of allusive text reuse. Proceedings of the 3rd Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature.
Mikolov, Tomas, Chen, Kai, Corrado, Greg & Dean, Jeffrey (2013). Efficient estimation of word representations in vector space. In Proceedings of International Conference on Learning Representations (ICLR 2013). arXiv preprint arXiv:1301.3781.
Mikolov, Tomas, Grave, Edouard, Bojanowski, Piotr, Puhrsch, Christian & Joulin, Armand (2018). Advances in pretraining distributed word representations. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). arXiv preprint arXiv:1712.09405.
Nagoudi, El Moatez Billah & Schwab, Didier (2017). Semantic Similarity of Arabic Sentences with Word Embeddings. In Proceedings of the 3rd Arabic Natural Language Processing Workshop, Association for Computational Linguistics, 2017, p. 18–24.
Pennington, Jeffrey, Socher, Richard & Manning, Christopher D. (2014). Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), p. 1532-1543.
Reimers, Nils & Gurevych, Iryna (2019). Sentence-BERT: Sentence Embeddings Using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP).
Scheirer, Walter, Forstall, Christopher & Coffee, Neil (2014). The sense of a connection: Automatic tracing of intertextuality by meaning. Digital Scholarship in the Humanities, 31(1), p. 204–217.
Sohangir, Sahar & Wang, Dingding (2017). Document Understanding Using Improved Sqrt-Cosine Similarity. In Proceedings of the 2017 IEEE 11th International Conference on Semantic Computing (ICSC), p. 278-279.
Speer, Robyn, Chin, Joshua & Havasi, Catherine (2017). Conceptnet 5.5: An open multilingual graph of general knowledge. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17), p. 4444–4451.
Wang, Yuxuan, Hou, Yutai, Che, Wanxiang & Liu, Ting (2020). From static to dynamic word representations: A survey. International Journal of Machine Learning and Cybernetics 11, p. 1611–1630.
Waterman, Michael S., Smith, Temple F. & Beyer, William A. (1976). Some biological sequence metrics. Advances in Mathematics 20(3), p. 367-387.
Werner, Matheus & Laber, Eduardo (2020). Speeding up Word Mover's Distance and its variants via properties of distances between embeddings. In Proceedings of the 24th European Conference on Artificial Intelligence (ECAI 2020), p. 2204-2211.
Wieting, John, Bansal, Mohit, Gimpel, Kevin & Livescu, Karen (2016). Towards universal paraphrastic sentence embeddings. In Proceedings of the 4th International Conference on Learning Representations (ICLR 2016), San Juan, Puerto Rico.