To present the idea and implementation of Thoughtflow, I start off by asking the following. What is common between an Overwatch server outage rage forum, a terrorist nest, a Bruce Lee quote enthusiasts gathering place, a puppy death cry-over, a chat group full of people who just farted, an ad hoc hiking trip or swingers party organizing interface and whatever you can imagine where resonating thoughts meet and empathy flourishes? Let me cut the BS and get to the answer.
Thoughtflow is designed to extract sentiment (meaning) out of sentences and group them according to similarity. Think of a search bar, you type in your thought as a sentence and you get grouped up with others who typed in analogous sentences.
One level deeper: Thoughtflow uses machine intelligence to turn human readable text to a numerical vector, which is then used to address a high dimensional space. In this metric space, distances between vectors can be defined and used to tie close ones together, hence to establish groups of related sentences - that is, cluster them.
Outline of Thoughtflow: web UI accepts sentences, forwards it to Skip-Thought, which then spits out the encoded sentence as a numerical vector. The sentence vector is consumed by DBSCAN and is grouped up by semantic similarity with other live sentences/thoughts in the system. In groups, users may post on the wall, comment, or vote - at the moment.
So, how could you turn a sentence into a vector, while holding onto the meaning? Or a (simpler) alternative, how could you turn a word into a vector in the same manner? That's what word2vec models are for! Here's how they work.
Sentence embeddings
"You shall know a word by the company it keeps" - said John Rupert Firth an influential linguist from the 50's. This idea gives basis to the majority of word2vec models today. If we believe that the meaning of a word is implicitly given in the contexts it keeps, then one solution to turn the word to a vector would be to keep a frequency counter for each possible context, and increase that counter if the word is present [1]. Then by concatenating these frequency counts, we can obtain a vector that represents the word - and in the same way any other word can be encoded, obviously having somewhat different frequency counts. As an example let our word2vec model learn 2 words on the following corpus: I LOVE TREES. I LOVE PEOPLE. I HATE KALE. I HATE TRUMP.
CONTEXT: [I, TREES] [I, PEOPLE] [I, KALE] [I, TRUMP] LOVE: [ 1 , 1 , 0 , 0 ] HATE: [ 0 , 0 , 1 , 1 ]
These are the frequency word vectors we capture in the end, if we only construct contexts out of words that are 1 distance away. As you can imagine the number of contexts could go up pretty high (given the corpus), especially if we increase the size of contexts (surrounding 4, 6... words). Nevertheless, in the long run, it can be seen that similar words would be represented by similar vectors, taken Firth's theory. I had a project to build such representation of words from scratch in C++, btw.
You can imagine the same working on sentences, right? Just take a sentence and its context - the surrounding sentences - and build a frequency vector. Now, the problem you might encounter is that the number of contexts just explodes exponentially, as more sentences are taught to the model. The space of sensible sentences is huge, not to mention the space of contexts (set of sensible sentences). It requires unfeasible amount of resources just to be able to keep track of the contexts. That's for sure not how the human brain understands texts. Aaaand that's where autoencoder models come into picture. Let's jump straight into an example:
3 girls play. The first tells the second an arbitrary long story without the third hearing any of it. The second has to compress the story to 10 words, and forward it to the third. The third is then tells the story she thinks the first girl came up with. We then deliver punishment to the third, some of which she relays (backpropagate) to the second - the first is not part of the learning model, she just tells random stories.
If the story is far from what was initially said by the first, the punishment to share is proportionally large. If the mistake is negligible, or the story is perfectly transferred, little or no punishment is given. No dolls, no candy ever; an AI's life is hard. So let's imagine we play this game for a while. We expect the girls / the layers of neurons to learn how to compress stories to 10 words, or to a given length vector. The magic happens in how the punishment is given, how the task is described.
What if we instruct them to take a sentence as input, and give the surrounding sentences, the context, as an output. Then, the second would try to encode the sentence she gets from the first, so the created representation contains as much context information as possible; without actually hearing the context. In turn, the third can come up with the surrounding sentences. Playing this game long enough, the girls would be capable of representing any sentence in a concise format that still contains (back to Firth) the sentiment of the original input sentence.
Let's translate the girls' game to autoencoders: the second girl is the encoder, the third is the decoder. The first girl is the one who takes the human readable text and turns it into a sequence of word vectors - similar word vector that's been described above - and forward it.. or actually whatever, you can imagine any arbitrary analogy for the first girl; I just needed three to entertain my Powerpuff obsession.
3 girls playing aka the Skip-Thought autoencoder being trained. Bubbles turns a sentence (taken from a context) to a sequence of word vectors by means of a word2vec method. After receiving the sequence, Blossom encode it into one single vector using a specific kind of Recurrent Neural Network (GRU). Using this compressed sentence representation, Buttercup tries to figure out the context of the sentence - the surrounding 2 sentences - and presents it to Sedusa. Sedusa assess the error of the game, which is the difference between the word vectors of the actual context (si-1, si+1) and the one the girls come up with. The error is then backpropagated first to Buttercup, then to Blossom. Sedusa's (constructive) punishment helps the 2 girls find ways to perform better, while Bubbles giggles and acts the fool as always.
At the heart of Thoughtflow, the so called Skip-Thought model lies [2]. It works similarly to how I described the latter game - word vector representations are fed to a GRU [3] Recurrent Neural Network (the encoder), the output of which is then propagated to the decoder unit. The decoder then builds the context word-by-word, vector-by-vector. The context is then checked against the desired context and the produced error is backpropagated through the autoencoder. After training, the output of the encoder unit is used to represent sentences.
Grouping method
Finally, we know how to express sentences as numerical vectors. What's left is to cluster them into groups of analogous sentences. First of all, a distance metric has to be defined between vectors. I chose cosine similarity in the end, as it is suitable for high dimensional spaces and empirically works well with word embeddings; also I manually tested and compared it with Euclidean distance and cosine yielded more sensible groups.
There are lots of different flavors of clustering methods. To pick the most fitting for our problem, it's effective to start off by grouping these methods according to the input parameters they require.
Centroid-based algorithms - like k-means - require the number of clusters (k) to be provided prior to clustering - essentially, the number of groups that is expected to emerge from the thoughts. This is not given and quite difficult even to approximate. Clustering for each possible k and then choosing the best one, would spend too much resource - e.g. k-means is O(n2). It also assumes the clusters to be about similar size, which is definitely not true for Thoughtflow.
Distribution-based clustering algorithms also require the number of distributions to capture clusters in the data and assumes Gaussian distribution of each cluster, which is too strict for our problem.
Hierarchical clustering methods would be just too slow O(n2logn) for large amount of sentences, though the parameters it needs can be given in our case.
After mercilessly excluding a variety of methods, I found density-based clustering the most fitting for the cause and chose DBSCAN in the end. This algorithm only takes the maximum distance between neighboring elements (ε) and the minimum amount of neighboring elements that makes up a cluster. The latter - number of sentences needed for a group - is determined easily and can be kept as a constant. To figure out the optimal ε, I use a simple global optimization algorithm - simulated annealing - and I optimize on the Silhouette score of the clustering result. This means that I run DBSCAN over and over again until the optimal ε value is found. Now you ask, yeye, you were whining about running k-means a couple of times, now you're gonna run DBSCAN probably thousands of times. Yeah, but there's a difference: after a sparse distance matrix (di,j; i,j = 1..|Sentences|), e.g. CSR, is computed only for neighboring sentence pairs - di,j ≤ ε - the complexity of running DBSCAN on that distance matrix is linear. Constructing such a distance matrix can be done in O(nlogn), and then a couple thousand runs of DBSCAN are achieved fast. Not to mention that DBSCAN finely yields itself to the MapReduce programming model [4], which makes it a good candidate for parallelization.
That's it. At least the theory. I've implemented all these mentioned above, and packaged it in a web application, which is run on the Google Cloud Platform. Check it out - thoughtflow.io [inactive until further interest]! It might not have enough users, or it might still be a bit slow. It is in alpha version as of March 2017.
Application
On the index page, an input field is waiting for a sentence to be typed in. The given sentece is immediately turned into a 4800 long numerical vector, which is then saved in the database. Clustering of all live sentences is repeated in a every 30 seconds. This time interval is going to be lowered as soon as I'm able to convert DBSCAN to Google Dataflow, which should finely scale up to millions of sentences. So, after you've got your sentence in the system and the first clustering happens, you are either redirected to a group or you stay on the loading screen in case there are not enough sentences having similar semantic meaning to yours.
Whichever happens, your thought is accessible on the index page, typically within 1 hour of its creation. So in case you're tired of the loading page and its animation - seriously, check it out, I worked a whole night on it :D - you can just keep going back to the index page and type in another sentence, until you find a group. A thought is alive for 1 hour after the last interaction performed on it. Once in the group, you can only post, comment, or vote, at the moment. As a cumulative effect of this rule, the user has higher chance to meet other active users in groups. By lowering the lifetime of sentences, this chance can be further increased.
btw, going back to the intro, this platform would not function quite well for terrorist organizations, as all the group conversations are public and accessible.
Plans for the future
I anticipate that however great the Skip-Thought model is - it was taught on more than 11 thousand books [5] -, it has to be updated continuously, as it should be able to represent sentences containing recently developed expressions - e.g. video-game names. I haven't mentioned it previously, but I've also implemented vocabulary expansion of Skip-Thought - I trained a transformation matrix through simple linear regression in order to turn word vectors from the spaCy representation to Skip-Thought word vectors. Now, I can just learn new word representations in spaCy and use them in Skip-Thought. Apart from this "hacky" solution, I plan to keep on capturing texts from the wild internet and feed them to the Skip-Thought model in a streaming fashion. Time tells if the model is effective enough to bear that amount of knowledge and variety.
In the long term, I envision grouping of brain signals instead of sentences. Starting by capturing EEG features of emotional information - both handcrafted features, like Frontal Alpha Asymmetry [6] and generated ones through Sparse Coding [7]. Just imagine, people with similar brain patterns matched up in a group.
References
[1] O. Levy and Y. Goldberg, “Linguistic Regularities in Sparse and Explicit Word Representations,” Proceedings of the Eighteenth Conference on Computational Natural Language Learning, pp. 171–180, 2014.
[2] R. Kiros et al., “Skip-Thought Vectors,” arXiv:1506.06726 [cs], Jun. 2015.
[3] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. NIPS Deep Learning Workshop, 2014.
[4] “A new scalable parallel DBSCAN algorithm using the disjoint-set data structure,” ResearchGate. [Online]. Available: https://www.researchgate.net/publication/261212964_A_new_scalable_parallel_DBSCAN_algorithm_using_the_disjoint-set_data_structure.
[5] Yukun Zhu, Ryan Kiros, Richard Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, Sanja Fidler. "Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books." arXiv preprint arXiv:1506.06724 (2015).
[6] J. A. Coan and J. J. B. Allen, “Frontal EEG asymmetry as a moderator and mediator of emotion,” Biol Psychol, vol. 67, no. 1–2, pp. 7–49, Oct. 2004.
[7] Q. Barthélemy, C. Gouy-Pailler, Y. Isaac, A. Souloumiac, A. Larue, and J. I. Mars, “Multivariate Temporal Dictionary Learning for EEG,” arXiv:1303.0742 [cs, q-bio, stat], Mar. 2013.
Thank you for this wonderful article.. I have one question wrt the vectors itself. Can i use the raw vectors themselves to to cluster the similar documents (after applying SOM, T-SNE etc. that is). Also how do skip thoughts perform against Doc2Vec ? Any ideas on that?
ReplyDeleteIsn't inferring from Doc2vec simpler?
Also, if the number of documents were large, how would you compute cosine similarity b/w all the documents? for 1000 documents, you would require a 1000*1000 matrix size ?