word2vec, Tensorflow and Vector Counts

28-01-2017

So in addition to getting married, helping plan a move to the USA and learning all about bioinformatics, I've been working a lot with word2vec, tensorflow and various word vector counts to figure out similarities between verbs. It's a fun little project that sits right at the heart of natural language processing.

The idea we have is to see if we can automatically find whether verbs are similar or dissimilar to each other. We have a nice list of english verbs from this page at Oxford University. This project has a set of verbs and a human rating as to how similar they are. There are test sets and training sets.

word2vec is an algorithm designed by google that creates a set of vectors that are designed to be added and subtracted together to allow some basic vector operations on words. Word2vec can do interesting things such as finding what word is most similar to another word or predicting what word would come next in a sentence or what word would we get if we ran the following equation : royal + queen - female = ? (king would be a good answer to that potentially).

Tensorflow has been the deep-learning darling of the internet for a while now. It contains a couple of word2vec examples, with some tutorials on how these work. You can pose the same questions to this as you can to the original version.

You need a large corpus/training set for word2vec and tensorflow. We use the ukwac dataset. It's quite large (several gigabytes) as it is a snapshot of the .uk domain (from around 2008 I believe). The text is parsed into sentences and then tagged. So each word has it's type (verb, adverb, noun etc) and its role in sentence (object, root, subject etc) listed. This comes in very handy later on.

So how do you start with all this? Well, its possible to remove all the tags and smush all the sentences back together into one large file and have word2vec process this one large file. That works quite well. Tensorflow is much more difficult. It requires a dictionary lookup and a whole set of integers. Rather than work on words, you create a dictionary of, say, the top 50,000 most common words and give them a number. You then process every sentence, replace it with a number or an 'unknown' placeholder if the word is not so common.

I decided to pre-process the ukwac files with a program written in C++. I wanted to try memory mapped files from the boost library, mostly because I was interested but also because this program could then run on smaller machines. It works quite well and produces a set of files that tensorflow can read easily. I use openmp to split the large files into blocks and process them in parallel, giving us some speedup. While a bit more complicated, it seems to work great.

Vector counts are a different kettle of fish. Most vectors produced by Tensorflow or Word2vec are about 200 items long (maybe 400 if we are being cheeky), but vector counts are typically on the order of 5000 or so. Each vector is associated with a word and represents what other words appear near this word. We use a sliding window of a certain size to create this vector. For example, consider the last sentence. Withing 5 words of the word certain, we have "a, sliding, window, of, a, size, to,create, this and vector". The word 'a' occurs twice so the entry for 'a' in the 'certain' vector would go up by two.

There are more details to it though. The basis is smaller than vocab for instance. But are we bothered by words like 'the' or 'a' for example? We choose our basis carefully and then compute these large vectors. Rather than use the raw counts, we work out the probability that a word will occur near another word.

With all these vectors, we can then perform some maths, such as vector addition, multiplication and such, or more complicated things, such as adding together all the vectors of an intransitive verb's subjects. We can also use the kronecker product to find any similar factors in pairs of vectors. I use numpy and scipy to perform most of the maths. I've looking into using Cython to try and speed things up, though the jury is out on that one. We use some basic stats such as spearmans rho and the permutation test.

With our python scripts, our preprocessed files and vectors, we can upload all these to a cluster. Using virtualenv we can create custom jobs on our cluster and leverage the power of supercomputing Rarrgh! So far, the python side of things is a bit slow and is still running as I write this, but I'm quite keen to see the final results!