Data mining algorithms in plain English
Maybe not interesting if you're a data mining guru, but this plain-English explanation of the top 10 most influential data mining algorithms is a good read for the rest of us, though the “plain English” part is perhaps debatable.
Here's a good one, on k-means:
You might be wondering:
Given this set of vectors, how do we cluster together patients that have similar age, pulse, blood pressure, etc.?
Want to know the best part?
You tell k-means how many clusters you want. K-means takes care of the rest.
How does k-means take care of the rest? k-means has lots of variations to optimize for certain types of data.
At a high level, they all do something like this:
1. k-means picks points in multi-dimensional space to represent each of the k clusters. These are called centroids.
2. Every patient will be closest to 1 of these k centroids. They hopefully won’t all be closest to the same one, so they’ll form a cluster around their nearest centroid.
3. What we have are k clusters, and each patient is now a member of a cluster.
4. k-means then finds the center for each of the k clusters based on its cluster members (yep, using the patient vectors!).
5. This center becomes the new centroid for the cluster.
6. Since the centroid is in a different place now, patients might now be closer to other centroids. In other words, they may change cluster membership.
7. Steps 2-6 are repeated until the centroids no longer change, and the cluster memberships stabilize. This is called convergence.
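For the code-minded, here is a rough Python sketch of the loop described above. It is not from the quoted article: the function name kmeans, the max_iters and seed parameters, and the patient numbers are all made up for illustration, and picking the starting centroids by sampling k of the data points is just one common choice among the many variations.

```python
import math
import random


def kmeans(points, k, max_iters=100, seed=0):
    """Cluster points (a list of equal-length tuples) into k groups."""
    rng = random.Random(seed)
    # Step 1: pick k starting centroids (assumption: k distinct data points).
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iters):
        # Steps 2-3: each point joins the cluster of its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Step 7: memberships stopped changing, so we have converged.
        if new_labels == labels:
            break
        labels = new_labels
        # Steps 4-5: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its old centroid
                centroids[c] = tuple(
                    sum(dim) / len(members) for dim in zip(*members)
                )
    return centroids, labels


# Hypothetical patient vectors: (age, pulse, systolic blood pressure).
patients = [
    (25, 60, 110), (30, 65, 115), (28, 62, 112),
    (70, 90, 150), (75, 95, 160), (72, 88, 155),
]
centroids, labels = kmeans(patients, k=2)
print(labels)
```

With two groups this far apart, the three younger patients should land in one cluster and the three older ones in the other. Real data is rarely this tidy, and a different starting pick can give a different result.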
This seems like a great idea for a book: the central data algorithms of the third industrial revolution, this networked, online age. One chapter per algorithm, with a discussion of how it manifests itself on the key websites, applications, hardware, and other services we use all the time now. If you are a data mining expert in need of someone to be the “plain English” side of a writing team, call me maybe.