
How Machines Understand Words

Stefan Libiseller
August 14th, 2019 · 3 min read

Machine learning is, very much simplified, the transformation of input numbers by learned numbers. This means that if words are to be the input for a machine learning model, they somehow have to be translated into numbers first.
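As a minimal sketch of what "transforming input numbers by learned numbers" can look like in practice (the layer sizes here are arbitrary and only for illustration):

import torch
import torch.nn as nn

# a toy layer: 300 input numbers are transformed by learned weights into 2 outputs
layer = nn.Linear(in_features=300, out_features=2)

x = torch.randn(300)   # some numeric input
y = layer(x)           # output = learned_weights @ x + learned_bias
print(y.shape)         # torch.Size([2])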

One possibility is to encode each character and then feed that sequence to the ML model. While this is possible in theory, it is very hard for the ML model to pick up on the task you want it to perform. This is because, before it can focus on the task, it has to learn the complex mechanisms behind language and the meaning of words. It's the equivalent of taking a quiz in a language you don't speak.
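For illustration only, a character-level encoding could be as simple as replacing each character with its Unicode code point; this sketch shows the idea, not a recommended input representation:

# character-level encoding: each character becomes one number
def encode_characters(text):
    return [ord(char) for char in text]

print(encode_characters("dog"))   # [100, 111, 103]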

Encoding words as a whole is a much better idea, as it opens up the possibility of adding pre-trained knowledge of language and meaning to that encoding. This words-to-numbers translation is called a word embedding.

Word Embedding

In a word embedding, each word is assigned a word vector which encodes its meaning relative to all other words. These vectors typically have between 50 and 300 dimensions (independent numbers). A common vocabulary size for word embeddings is about two million words.
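Conceptually, a word embedding is just a lookup table from words to vectors. A toy version with made-up three-dimensional vectors might look like this (real embeddings use 50 to 300 dimensions and millions of words):

# toy lookup table with made-up vectors, for illustration only
toy_embedding = {
    'dog':   [0.17, -0.20, 0.31],
    'puppy': [0.15, -0.18, 0.35],
    'car':   [-0.40, 0.62, -0.08],
}

print(toy_embedding['dog'])   # [0.17, -0.2, 0.31]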

To show you more about what word embeddings can do, I am using fastText, a publicly available pre-trained word embedding by Facebook. You can download the cc.en.300.vec word vector file here.

Take a Look at a Word Vector

This Python code loads the fastText word vector for "dog".

import torchtext

# load fastText vectors
fasttext = torchtext.vocab.Vectors('cc.en.300.vec')

def get_vector(word):
    assert word in fasttext.stoi, f'*{word}* is not in vocab!'
    return fasttext.vectors[fasttext.stoi[word]]

get_vector('dog')

# output
# ---------------------------------------------------------------------
[ 0.1680, -0.0013,  0.0162,  0.2779, -0.1062,  0.0366,  0.2043,  0.0642,
 -0.0115,  0.0582, -0.2252, -0.2130, -0.0762, -0.0495,  0.0449,  0.2431,
  0.0446, -0.0288, -0.3035,  0.0158, -0.2097, -0.0215, -0.0963, -0.0459,
 -0.0209, -0.2514,  0.1053, -0.1832,  0.0060,  0.2452,  0.0032,  0.2207,
 -0.0141,  0.0683, -0.0712, -0.0064, -0.0015,  0.0694, -0.1512, -0.4159,
  0.0808,  0.0432, -0.1890,  0.0269, -0.2053,  0.0283,  0.0146,  0.0388,
 -0.2020,  0.2738, -0.2366, -0.1278, -0.0665, -0.1274, -0.2438,  0.1801,
 -0.0407, -0.0155, -0.1460,  0.1093,  0.0273,  0.0163,  0.1462,  0.0856,
 -0.1293, -0.0084,  0.1568,  0.2373, -0.1708,  0.1281, -0.0095,  0.0350,
 -0.0718, -0.1996, -0.0901, -0.0601,  0.1511, -0.0249,  0.0367,  0.0767,
  0.0178, -0.1069, -0.0110, -0.1920, -0.0224, -0.0404,  0.1455, -0.0236,
  0.1104, -0.0976,  0.0238, -0.2057,  0.0172,  0.0320,  0.0082,  0.0866,
  0.1850,  0.1840,  0.1067,  0.0374, -0.4075, -0.0402,  0.0846, -0.1112,
 -0.2529,  0.1772, -0.1850,  0.2514, -0.0127, -0.1483,  0.1600, -0.0588,
 -0.0614,  0.1117, -0.1457,  0.1475, -0.3153,  0.0108, -0.1519, -0.0436,
 -0.0635, -0.0888, -0.0578, -0.0983, -0.0251,  0.0774, -0.0807,  0.1271,
  0.1698, -0.1946, -0.1263, -0.0550, -0.0597, -0.1529, -0.0905,  0.0596,
  0.1855,  0.0218,  0.2297, -0.1333, -0.0720, -0.0312, -0.0077, -0.0386,
 -0.0635,  0.0168, -0.3063,  0.3933, -0.0754, -0.1283,  0.0095, -0.2939,
 -0.0505, -0.1281, -0.1555,  0.1101,  0.0319,  0.0221, -0.1495,  0.1655,
 -0.1755,  0.1453,  0.1828, -0.1498, -0.2188, -0.1255,  0.1867, -0.1273,
 -0.0232,  0.0352,  0.0901,  0.1168, -0.2179, -0.0116,  0.0472, -0.1177,
  0.1580,  0.0814, -0.1904, -0.1378,  0.0857, -0.0967, -0.0752, -0.2005,
  0.1006,  0.0772,  0.2077, -0.0425, -0.0078,  0.1553, -0.2352, -0.0190,
  0.0103,  0.0056,  0.1036,  0.0051, -0.0062,  0.1506, -0.0222,  0.1142,
  0.0601,  0.0364,  0.0585,  0.0437,  0.0291,  0.1614,  0.0338, -0.1743,
  0.0866,  0.1908,  0.0800, -0.1523, -0.0601, -0.1148, -0.1047, -0.3520,
 -0.0891, -0.0627, -0.0143, -0.0135,  0.1672,  0.0007,  0.0710, -0.0440,
  0.1362,  0.0377,  0.1690, -0.0459, -0.1022,  0.0346, -0.1959,  0.0451,
 -0.0774,  0.1307, -0.0142, -0.0253, -0.1935,  0.0333, -0.0448,  0.1531,
 -0.0086, -0.0767, -0.2097, -0.1825,  0.1158, -0.1706,  0.0685, -0.0045,
  0.0069,  0.0382,  0.0310, -0.0462,  0.0433, -0.1529, -0.4071,  0.1019,
 -0.0417, -0.1270,  0.0347,  0.1016, -0.0407, -0.1196, -0.0041, -0.0848,
 -0.1461,  0.0328, -0.1638,  0.0261, -0.0199, -0.1041,  0.0212,  0.1466,
 -0.1706,  0.0447, -0.2523,  0.0423, -0.0611,  0.1119, -0.0780, -0.1129,
  0.0558,  0.0450, -0.1698, -0.0260, -0.0644,  0.1335, -0.1240, -0.0888,
 -0.0556, -0.1717, -0.0921, -0.0390, -0.1410,  0.0748,  0.2265, -0.2045,
 -0.1585,  0.3027,  0.0942,  0.1540]

As you can see, the word 'dog' translates to an array of 300 float values, which can now be processed by a machine learning model. Every word included in the embedding translates to such an array of 300 pre-trained numbers.

Often, the word embedding is the first layer in a deep learning model. This has the advantage that the word vectors themselves, just like the weights and biases of the neurons, can be further optimized during the training process.
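As a rough sketch of this setup, assuming the fastText vectors loaded above, the pre-trained vectors can be copied into a trainable embedding layer in PyTorch:

import torch
import torch.nn as nn

# sketch: use the fastText vectors as the first layer of a model
# freeze=False keeps the word vectors trainable alongside the rest of the network
embedding_layer = nn.Embedding.from_pretrained(fasttext.vectors, freeze=False)

# vocabulary indices go in, 300-dimensional word vectors come out
indices = torch.tensor([fasttext.stoi['dog'], fasttext.stoi['cat']])
word_vectors = embedding_layer(indices)   # shape: [2, 300]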

Related Words

Good word embeddings have properties that make it easier to process text. The most important one is that related words are close to each other.

The code below finds the closest n words by comparing all word vectors to the input word vector and calculating the vector distance between them. This is similar to how you would find the distance between two points on a 2D plane, just with 300 dimensions instead of two. It then sorts the words by distance and prints the n closest ones.

import torch

def pretty_print(word_distance_list):
    for word, distance in word_distance_list:
        print(f"{word:{25}} {distance:{10}.{7}}")

def closest_words(word_or_vector, n=3):
    # accepts either a word or an already computed vector
    if isinstance(word_or_vector, str):
        vector = get_vector(word_or_vector)
    else:
        vector = word_or_vector
    distances = [(w, torch.dist(vector, get_vector(w)).item())
                 for w in fasttext.itos]
    return sorted(distances, key=lambda w: w[1])[:n]

pretty_print(closest_words('dog'))

# output
# ---------------------------------------------------------------------
dog                         0.0
dogs                        1.289913
puppy                       1.546394
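The distance used here is the plain Euclidean distance: torch.dist with its default setting computes the square root of the summed squared differences. A quick sanity check, reusing the vectors loaded above:

import torch

a = get_vector('dog')
b = get_vector('puppy')

# Euclidean distance written out by hand ...
manual = torch.sqrt(((a - b) ** 2).sum())

# ... gives the same value as torch.dist (which defaults to the 2-norm)
print(manual.item(), torch.dist(a, b).item())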

As you can see, dog, dogs and puppy are close together in the 300-dimensional vector space. This property of word embeddings lets the model understand synonyms and gives it the ability to process meaning.

Analogy

Somewhat surprisingly, word vectors are also able to solve analogy problems. This works by applying simple math operations to the vectors. For example: subtract dog from puppy, which leaves you with a fictional baby vector. Add cat to this vector and you should get kitty.

puppy - dog = kitty - cat
puppy - dog + cat = kitty

Sounds crazy? It somewhat is. But it works!

Here is the implementation in Python code:

def x_to_y_like_a_to(x, y, a, n=3):
    print(f"{x} is to {y} like {a} is to...")
    # build the analogy vector and search for its nearest neighbours
    b = get_vector(y) - get_vector(x) + get_vector(a)
    possible_words = closest_words(b, n=n+3)
    # drop the input words themselves from the candidates
    solution = [(word, distance) for word, distance in possible_words
                if word not in [x, y, a]][:n]
    pretty_print(solution)

x_to_y_like_a_to('dog', 'puppy', 'cat')

# output
# ---------------------------------------------------------------------
# dog is to puppy like cat is to...
kitty                       1.40337
kitten                      1.40879
kittens                     1.53741

As you can see, the vectors in the word embedding were chosen in such a way that these math operations are possible. There are many examples where such analogies work. Full disclosure: they don't work for every example.

Let's take a look at whether fastText has a sense of tenses:

x_to_y_like_a_to('walk', 'walked', 'swim')

# output
# ---------------------------------------------------------------------
# walk is to walked like swim is to...
swam                        1.21056
swimming                    1.31170
swimmers                    1.32381

Yes, it does. And it even works for irregular verbs.

What about geography?

x_to_y_like_a_to('France', 'Paris', 'Germany')

# output
# ---------------------------------------------------------------------
# France is to Paris like Germany is to...
Berlin                      0.68127
Munich                      0.71688
Frankfurt                   0.78771

Turns out you can learn a lot about the world if you read (or process) the internet.

Out of Vocabulary Words

But what about words that are not in the vocabulary? No vector means no way to process that word. What now?

Often, unknown words are simply replaced by an <unk> token, which translates to a vector of all zeros. As long as the number of unknown words is small and the task isn't too complex, the model will still be able to give the correct prediction. For more complicated tasks, such as translation, it is not possible to simply ignore an unknown word.
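A minimal sketch of that fallback, building on the get_vector function from above (the function name and the made-up word are just for illustration):

import torch

def get_vector_or_unk(word):
    # known words get their pre-trained vector,
    # everything else falls back to the all-zeros <unk> vector
    if word in fasttext.stoi:
        return fasttext.vectors[fasttext.stoi[word]]
    return torch.zeros(300)

get_vector_or_unk('dog')             # pre-trained fastText vector
get_vector_or_unk('notaword1234')    # tensor of 300 zeros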

Byte Pair Encoding

Byte pair encoding (BPE) solves this problem by working with frequent letter combinations instead. It dismantles words into predefined subword units and translates those into pre-trained vectors. For example, the word ending ing_ (the underscore represents the trailing space) would be translated into its own 'participle vector'. Including the space in the encoding allows it to distinguish between ing as a subword in the middle of a word and ing_ as a word ending.

The model learns to interpret sequences of word chunks rather than whole words. This technique gives it more flexibility to process unknown words while keeping the advantages of word-based embeddings over character-based processing.
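To illustrate the idea of dismantling words into subword units (this is a toy greedy segmentation with a made-up subword vocabulary, not the exact algorithm any particular BPE implementation uses):

# toy subword vocabulary; a real BPE vocabulary is learned from data
subwords = {"swim", "walk", "danc", "ing_", "ed_", "s_"}

def segment(word):
    # greedy longest-match segmentation into known subword units
    word = word + "_"   # mark the word ending with a trailing space
    pieces = []
    while word:
        for length in range(len(word), 0, -1):
            if word[:length] in subwords:
                pieces.append(word[:length])
                word = word[length:]
                break
        else:
            pieces.append(word[0])   # unknown single character
            word = word[1:]
    return pieces

print(segment("dancing"))   # ['danc', 'ing_']
print(segment("swims"))     # ['swim', 's_']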

The name byte pair encoding comes from the method, originally a compression algorithm, that is used to determine which letters should be combined into a single token (word part).

Conclusion

For machines to process words, they must first be converted to numbers. Word embeddings translate the meaning of words into a dense vector representation. In this representation, words with similar meanings are close to each other. One problem with word embeddings is unknown words. For this reason, recent publications often use byte pair encoding, which vectorizes subwords instead of entire words.

Popular word embeddings:
fastText by Facebook
GloVe by Stanford
spaCy by Explosion AI

Other Links:
TorchText - text processing for PyTorch
pytorch-sentiment-analysis - Tutorials on getting started with PyTorch and TorchText

Let's automate the boring stuff!

I am a freelance Data Scientist from Vienna and available for projects and consulting. If you have any questions, feedback or just want to say "hi" drop me a line on Twitter or send me an email. :)
