R2kS - A Novel Measure for Comparing Gene Expression Based on Ranked Gene Lists


One of the first jobs I've worked on for Queen Mary University is a small program to compared two ordered lists. Pretty simple right? Well, not when you have lists 200 thousand plus elements wrong and several hundred thousand of theses lists! Now we have a lot of computing power for this but we have to apply a bit of brain power to harness all that metal. This is based on this paper by Ni and Vingron

A biologist came to me, asking for more runtime on the QMUL machines. Apparently, the job he wanted was going to take about 15 days or thereabouts! :O We looked at his code and it was a bit hard to follow to say the least! There was some OpenMP lines in there but whether or not this was really making things faster was up for debate. Also, OpenMP is of somewhat limited use on our cluster because jobs are distributed based on cores that could be across the network. OpenMP isn't much use when you have different machines and nodes working on the same data. You have what is known as Non-uniform-memory-access going on.

I decided that this was probably a small enough, and fun enough problem to have a go writing it from scratch. I'm glad I did as the final code was much cleaner and properly formatted for use with MPI. We used IntelMPI here at QMUL but I suspect this will work with OpenMPI and MPICH as well.

The original paper describes the problem and solution thusly:

Bioinformatics analyses frequently yield results in the form of lists of genes sorted by, for example, sequence similarity to a query sequence or degree of differential expression of a gene upon a change of cellular condition. Comparison of such results may depend strongly on the particular scoring system throughout the entire list, although the crucial information resides in which genes are ranked at the top of the list.

Here, we propose to reduce the lists to the mere ranking of the genes and to compare only the ranked lists. To this end, we introduce a measure of similarity between ranked lists. Our measure puts particular emphasis on finding the same items near the top of the list, while the genes further down should not have a strong influence.

Our approach can be understood as a special version of a two-dimensional Kolmogorov-Smirnov statistic. We present a dynamic programming algorithm for its computation and study the distribution of the similarity values. The performance on simulated and on real biological data is studied in comparison to other available measures.

The key here is the use of Dynamic Programming, which makes sure we keep space (and often time) complexity down to a minimum. It took me a while to figure it all out but after a little help understanding the mathematic notation, it seemed easy enough to work through some small examples on paper. With the basic algorithm finished, it was reasonably easy to divide up the problem as each pair of lists can be compared separately.

You can download, comment play or submit modifications to the code on the github page