Wednesday, April 29, 2015

Analogy recovery from the Wikipedia corpus - a natural language processing task

Natural language processing is a field of science that aims to teach machines our messy human language, whether spoken or written. At the lowest level of abstraction, computers talk to each other in sequences of bits, grouped into chunks that form instructions. The instruction set is limited and well-defined - nothing like human language. We can express rather complex concepts with a single word, while a computer would struggle, spending a great number of its words (a machine word usually consists of a few bytes) to say the same thing. Also, families of human languages are the result of thousands of years of evolution, so they are built as much on tradition as on logic. All in all, getting machines to understand our most innate form of communication is no easy sport. In fact, it's difficult as hell - so complex that so far we have mostly had to rely on heuristics, nowhere near a full-blown neural net that would imitate the language-processing parts of our brains.

Articulating analogies _is_ a big deal in human communication and understanding, so giving machines this skill would serve the goals of NLP quite well. An analogy comes in the form of 4 interconnected expressions; e.g. candy is to kids as alcohol is to adults. My implementation of Levy and Goldberg's paper aims at recovering analogies from a large collection of writings (a corpus), so that machines can quantify (or vectorize) the meanings of words and do arithmetic on them. But how would you express an analogy in terms of word arithmetic? Let's say you've got 3 words of the analogy: candy, kid and adult. You may then ask the computer: a kid is to a candy as an adult is to [what]? Depending on the size and content of the corpus, the computer would answer "alcohol", "smoking", "mary jane", "informative murder porn", or a seemingly random expression if not enough text is provided in the first place. In word arithmetic it looks like:

[result] := adult - kid + candy

Now, [result] (a vector of floating-point numbers) is usually not in the vocabulary, but we can search for the word that most resembles it - the one whose vector is closest to [result]. In the calculation, the meaning of kid is removed from the meaning of adult, possibly eliminating concepts such as organism or human - leaving us with a notion of maturity or something close to it. When maturity is combined with candy, the outcome might bear the meaning ~"addictive substance for grown-ups". Then the word with the most similar meaning (most similar vector elements) is selected from the vocabulary, and the analogy is made. Play the same game with the man:woman - king:queen analogy to understand it better!
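
To make the search concrete, here is a minimal Python sketch of this vector-offset method (the additive objective Levy and Goldberg call 3CosAdd). It assumes we already have a dict of unit-length word vectors; the function name and data layout are just for illustration, not the exact code from my repository:

import numpy as np

def recover_analogy(vectors, a, b, c, topn=3):
    """Answer 'a is to b as c is to [what]?' by vector offset.

    vectors: dict mapping words to unit-length numpy arrays.
    Returns the topn words whose vectors are closest (by cosine
    similarity) to b - a + c, skipping the three query words.
    """
    target = vectors[b] - vectors[a] + vectors[c]
    target /= np.linalg.norm(target)  # renormalize the offset vector
    scores = {w: float(np.dot(v, target))
              for w, v in vectors.items() if w not in (a, b, c)}
    return sorted(scores, key=scores.get, reverse=True)[:topn]

# e.g. recover_analogy(vectors, 'kid', 'candy', 'adult')
# might return ['alcohol', 'smoking', ...] on a large enough corpus

Since the vectors are unit length, the dot product equals the cosine similarity, so ranking by it picks the nearest neighbor of the offset vector.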

But where does the corpus come into the picture? To translate natural words into vectors of numeric values that the machine can manipulate, the algorithm runs through the texts of the corpus and makes connections among words. The strength of these connections is determined by how often the given words occur in the same sentence, just a few words away from each other. If the corpus is big enough - I used the Wikicorpus for testing - the vectorized words can gain a numeric meaning analogous to what pops up in our head when we interpret the written or spoken form of the same word.
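
As an illustration, here is a small sketch of such window-based co-occurrence counting (the function and variable names are my own; in practice the raw counts are also reweighted, e.g. with pointwise mutual information, before being used as word vectors):

from collections import Counter

def cooccurrence_counts(sentences, window=5):
    """Count how often two words appear within `window` positions
    of each other inside the same sentence.

    sentences: iterable of token lists,
               e.g. [['a', 'kid', 'eats', 'candy'], ...]
    Returns a Counter keyed by (word, word) pairs.
    """
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            for other in tokens[i + 1 : i + 1 + window]:
                counts[(word, other)] += 1
                counts[(other, word)] += 1
    return counts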

You can find the analogy recovery implementation in my repository at https://bitbucket.org/csiki/analogyrecovery; have fun with it! The paper and the Wikicorpus mentioned above are listed in the references below.

References

O. Levy and Y. Goldberg, “Linguistic Regularities in Sparse and Explicit Word Representations,” Proceedings of the Eighteenth Conference on Computational Natural Language Learning, pp. 171–180, 2014.

S. Reese, G. Boleda, M. Cuadros, L. Padró, and G. Rigau, “Wikicorpus: A Word-Sense Disambiguated Multilingual Wikipedia Corpus,” Proceedings of the 7th Language Resources and Evaluation Conference (LREC'10), Valletta, Malta, May 2010.
