Knowledge discovery by accuracy maximization

Available from ResearchGate, with a web page as well.  The authors are Stefano Cacciatore, Claudio Luchinat and Leonardo Tenori, collectively from the University of Florence, Harvard Medical School, Rovira i Virgili University, and the FiorGen Foundation.  (Some authors are apparently at several of these; see the paper for the precise list.)

I’m generally not a huge fan of machine learning, and this paper triggers some of that feeling.  ML often feels to me much more like a traditional science, with lots of ideas that might work, and lots of evaluation to find the actual good ones.  There’s not a lot of good theory backing a lot of it; some ideas work and some don’t, and there often doesn’t feel like any reason for it.  It’s just the way the world is.

That’s not to say that ML techniques aren’t incredibly useful; it’s just that ML papers often tend to feel less principled and more like trial and error.  The results may be useful, but that’s not what I want to read about.

Anyway, this paper feels like one cute idea that’s explained very briefly in one paragraph, and then a huge amount of supporting work showing that, with proper tuning, it can work well.  I don’t have the background to judge it, but I think I understand the cute idea, at the very least.

In short, the idea is to group elements in a clever way: basically assuming a grouping, then checking how well a standard algorithm does at finding this grouping.  If it comes close but disagrees in some cases, assume it was right and update the assumed grouping.  Iterate until either the algorithm finds the right grouping, or it’s been a while and it’s time to give up.  Do this a bunch of times, and then weight how similar elements are by how often they were grouped together.

That gives grouping weights; now we divide the Euclidean distance between two points by the likelihood of those points being in the same group.  This produces a “distance”.  By then applying an all-pairs shortest-path algorithm, we effectively get a description of an underlying topology – e.g., if the dataset was a sheet that was curled up in three dimensions, this will do a good job of finding that sheet.

Grouping elements

The first step is to figure out groupings.  This is done by finding many groupings, then translating them into adjacency matrices and averaging them.

First, we pick some fraction of the elements (0.75 by default).  As far as I can tell, this is to create gaps in the data which will result in well-defined clusters.  However, I don’t see it explained in the paper at all.

Next, we assign initial groupings.  The simplest way to do this is to create a new group for each element; an alternative is to apply a standard clustering algorithm.  Now we do 10-fold cross-validation of a standard supervised algorithm.

Unwrapping that last sentence, a supervised algorithm is one that takes a classified learning set, builds up some model, and then applies that model to new data.  10-fold cross-validation means splitting the elements into 10 groups, and picking one group as the test set.  We then train the algorithm on the 9 remaining groups, and then see how well it does on the test set.  We do this with each of the 10 groups as the test set, and average the 10 results to determine how well the algorithm did.
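Concretely, the cross-validation loop can be sketched like this (a minimal illustration with names of my own choosing, not code from the paper):

```python
import numpy as np

def cross_validated_accuracy(X, y, fit_predict, n_folds=10, seed=0):
    """Estimate accuracy by k-fold cross-validation.

    fit_predict(X_train, y_train, X_test) -> predicted labels for X_test.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    folds = np.array_split(idx, n_folds)      # 10 roughly equal slices
    correct = 0
    for k in range(n_folds):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        pred = fit_predict(X[train], y[train], X[test])
        correct += (pred == y[test]).sum()
    return correct / len(X)                   # averaged over all 10 folds
```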

An example of a standard supervised classifier is k-Nearest-Neighbors, or kNN.  To classify a new point, it finds the k points in the training set closest to the new point, and assigns the new point to the class that most of those neighbors belong to.  Simple as this seems, it’s one of the standard classifiers.  (Others, such as SVM, are a bit more complicated.)
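A bare-bones kNN classifier, just to make the mechanics concrete (my own sketch, not the paper’s implementation):

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k=3):
    """Classify each test point by majority vote among its k nearest
    training points, under Euclidean distance."""
    preds = []
    for x in X_test:
        d = np.linalg.norm(X_train - x, axis=1)   # distance to every training point
        nearest = np.argsort(d)[:k]               # indices of the k closest
        votes = np.bincount(y_train[nearest])     # count labels among them
        preds.append(votes.argmax())              # majority label wins
    return np.array(preds)
```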

So all that’s happening at this point is we have a grouping (which we basically made up).  We pick 10% of the points as the test set, and train k-Nearest-Neighbors on the remaining 90%.  We then run kNN on that 10% and see how well it does.  Then we repeat this with each of the 10 slices as the test set.

All told, we now have a notion of how “good” the grouping is based on how kNN does on it.  Moreover, we know, for each element, how kNN classified it.  If kNN got everything exactly right, then we’ve converged, and found the “true” classification.  (It may or may not be actually right, but it’s as good as we’ll get here.)  Otherwise, we can suppose that our original grouping was wrong, and kNN is right in some cases, so update the groups of some random subset of elements that kNN disagreed on.

Now we can rerun the cross-validation, and see if kNN does better.  If it does, we take this new set as our basis, and iterate.  If not, we try again with a different random set to update.

So basically, this randomly converges, over multiple iterations, on a grouping which kNN is able to learn and predict reasonably well.  This effectively converts the supervised kNN (which requires labeled input) into an unsupervised method (which just takes unlabeled input, and derives its own labels).
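Putting the loop together, here is roughly how I understand the iteration.  This is a sketch: the flip fraction, the acceptance rule, and all the names are my guesses, not the paper’s defaults.

```python
import numpy as np

def refine_labels(X, labels, classify, max_iter=100, seed=0):
    """Iteratively nudge a guessed labeling toward one the classifier
    can reproduce.  classify(X, labels) must return cross-validated
    predictions for every point (e.g. 10-fold kNN)."""
    rng = np.random.default_rng(seed)
    labels = labels.copy()
    pred = classify(X, labels)
    best_acc = (pred == labels).mean()
    for _ in range(max_iter):
        wrong = np.flatnonzero(pred != labels)
        if wrong.size == 0:          # classifier reproduces labels exactly: converged
            break
        trial = labels.copy()
        # accept the classifier's opinion on a random subset of disagreements
        flip = rng.choice(wrong, size=max(1, wrong.size // 2), replace=False)
        trial[flip] = pred[flip]
        trial_pred = classify(X, trial)
        acc = (trial_pred == trial).mean()
        if acc >= best_acc:          # keep the change only if accuracy improved
            labels, pred, best_acc = trial, trial_pred, acc
    return labels
```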

Once we have a set of groupings, we generate the proximity matrix: the element at row i, column j is 1 if elements i and j are in the same group, and 0 otherwise.  We then repeat this whole process a bunch of times with different random subsets of the data (remember, we initially picked a random 75% of the data to work on).  This generates a bunch of matrices, and we average all of them to get the overall proximity matrix (where each value is between 0 and 1, indicating what fraction of the time its row and column elements were in the same group.)
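A sketch of the averaging step, assuming each run reports labels only for its sampled subset.  The handling of never-co-sampled pairs here is my own choice; the paper may average differently.

```python
import numpy as np

def proximity_matrix(label_runs, n):
    """Average co-membership over many runs.

    label_runs: list of (indices, labels) pairs, one per run; each run
    labels only the sampled subset of the n elements.
    """
    counts = np.zeros((n, n))   # times each pair shared a group
    seen = np.zeros((n, n))     # times each pair was sampled together
    for idx, lab in label_runs:
        same = (lab[:, None] == lab[None, :]).astype(float)
        counts[np.ix_(idx, idx)] += same
        seen[np.ix_(idx, idx)] += 1.0
    # fraction of co-sampled runs in which the pair shared a group
    return np.where(seen > 0, counts / np.maximum(seen, 1), 0.0)
```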

The dissimilarity matrix

One useful thing, though, is not only to find out which points are similar, but to find the underlying topology of the set.  For example, the dataset might be the “Swiss Roll”; imagine a sheet of paper curled up into a spiral.  (Or look at an actual swiss roll.)  In this case, the dataset is three dimensional, but it can be expressed in two dimensions (how far along the spiral, and where along the depth of it.)  Finding this structure can be extremely useful.

So we construct a dissimilarity matrix, which is effectively the distance between each pair of points along this structure.  (So in the Swiss Roll example, distance is measured along the rolled structure, not directly in three dimensions.)

To do this, first take the proximity matrix computed in the first part, and truncate values below some cutoff (0.05 by default) to 0.  This theoretically cuts off points that occasionally ended up in the same group by luck, but aren’t really related.  Now for each pair of points, take the standard Euclidean distance between them (i.e., the square root of the sum of the squared differences of their coordinates) and divide it by the proximity.  This gives a distance matrix where two points will be close together only if they are both nearby in the standard sense and likely to be in the same group.  So it indicates the rough structure of the data.
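As a sketch (the 0.05 cutoff is the paper’s default; the function name and the handling of zeroed entries are mine):

```python
import numpy as np

def dissimilarity(X, prox, cutoff=0.05):
    """Euclidean distance divided by truncated proximity; pairs whose
    proximity falls below the cutoff become unreachable (infinite)."""
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))       # pairwise Euclidean distances
    p = np.where(prox < cutoff, 0.0, prox)    # drop chance co-groupings
    safe = np.where(p > 0, p, 1.0)            # avoid dividing by zero
    d = np.where(p > 0, dist / safe, np.inf)
    np.fill_diagonal(d, 0.0)
    return d
```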

Finally, apply an all-pairs shortest-path algorithm to find distances between all pairs of points.  If we just used normal Euclidean distance, this would be redundant because of the triangle inequality.  But with the proximity matrix factored in, two points may not have been clustered together often, yet both often clustered with a third point; this can make the path through the third point shorter than going directly from one to the other.  For an extreme example, two rarely clustered points will have infinite distance between them (since their proximity is truncated to 0).  But there is likely a path connecting them through other points, so they will end up a finite distance apart after the shortest-path step.
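The shortest-path step can be done with the classic Floyd–Warshall algorithm; the idea doesn’t hinge on which algorithm is used, this is just one standard choice:

```python
import numpy as np

def all_pairs_shortest(d):
    """Floyd-Warshall over the dissimilarity matrix: the final distance
    between two points is the cheapest route through intermediate points."""
    d = d.copy()
    for k in range(len(d)):
        # allow every path to route through point k
        d = np.minimum(d, d[:, k:k+1] + d[k:k+1, :])
    return d
```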

And now we have effectively a graph indicating the underlying structure of the data which we can use for pretty visualizations or to find interesting facts about it.


This idea is pretty in theory, but does it work in practice?

First, the authors generate data following various shapes (including the Swiss Roll), add some noise, and try to recover the shape.  KODAMA does really well here, better than most other approaches.

Next up is clustering on various datasets; again, KODAMA does well.

Then come some real-world datasets; one on gene expression and one on metabolites; these are apparently standard benchmarks, and there are pretty pictures showing that KODAMA does really well.

Then they try some new examples.  One is early-type galaxies – various parameters are collected and KODAMA successfully finds the difference between Fast Rotators and Slow Rotators, suggesting that these are actually meaningful distinctions.

The other was picked up by the media, and that’s State of the Union addresses.  They gave each State of the Union address from 1900 to 2014 to KODAMA as “term frequency–inverse document frequency” (tf-idf) vectors; basically, weighting each word by how often it appears in a given address relative to how often it appears across the whole collection.  The result was that there’s a sharp change during Ronald Reagan’s years.  Examining the indicators, “total”, “understanding”, “expenditures”, “resources”, “peace”, “production”, and “employment” were more common pre-Reagan, while “bless”, “parents”, “businesses”, “thank”, “tonight”, “child”, and “children” were more common after.
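For reference, a textbook tf-idf weighting looks like this; the paper’s exact variant may differ:

```python
import math
from collections import Counter

def tfidf(docs):
    """Weight each term in each document by frequency times rarity:
    tf-idf(t, d) = count(t in d) * log(N / number of docs containing t)."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))   # document frequency per term
    return [{t: c * math.log(n / df[t]) for t, c in Counter(d).items()}
            for d in docs]
```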


I normally don’t like machine learning that much, but this paper had a cool idea and showed that it worked.  I would have liked to see more motivation and explanation, but it apparently was sufficient for me to figure it out, so I can’t complain.

The use of a supervised algorithm combined with cross-validation to generate an unsupervised algorithm is clever and works very well; then building on it to generate the structure seems a bit ad-hoc to me (why use Euclidean distance?  Why divide by the proximity, and not its square or square root or log?).  But it works, and that’s what counts in machine learning.


