Basic Machine Learning - Clustering

Originally posted on Mar 27, 2018 10:00:00 AM
Last updated on July 24, 2024
Bart Maertens

Data architect and developer with over 20 years of experience in data engineering and analytics. Founder and lead of the know.bi expert team, Apache Hop co-founder and PMC member.

How is this related?

In this post, we'll look at clustering as a way to find out how data is structured or related.

Clustering, or cluster analysis, groups data points so that points in the same group are more similar to each other than to points in other groups.
For instance, when dealing with geographical data such as sightings of unidentified flying objects (UFOs), it may be interesting to see whether these sightings are clustered around certain points or how they are related.

The dataset we'll use in this post is a collection of UFO sightings from the last century. We'll use the latitude and longitude to see if the sightings are clustered around certain points.

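    import matplotlib.pyplot as plt

    # df is the pandas DataFrame holding the UFO sightings (loading not shown)
    df = df.dropna(how='any')               # remove rows with missing values
    geo = df[['long', 'lat']].to_numpy()    # (n, 2) array: longitude, latitude
    plt.scatter(geo[:, 0], geo[:, 1])       # longitude on x, latitude on y
    plt.show()                              # show plot
    plt.clf()                               # clear plot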

When we simply plot latitude and longitude we can clearly see the sightings happen all around the world (which gives us a neat world map).

But how does this fit together?

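A simple way to find out is K-Means clustering. With K-Means we choose a number of centers up front. Each point is then assigned to its nearest center, and every center is moved to the mean of the points assigned to it, repeating until the grouping stops changing.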
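To make the grouping step concrete, here is a minimal sketch of the loop behind K-Means (often called Lloyd's algorithm), written in plain NumPy. It is not the scikit-learn implementation used in this post. The function names `nearest_center` and `kmeans_sketch`, their parameters, and the two synthetic blobs are assumptions made purely for illustration.

```python
import numpy as np


def nearest_center(points, centers):
    """Return, for every point, the index of the closest center (Euclidean)."""
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1)


def kmeans_sketch(points, k, n_iter=20, seed=0):
    """Illustrative K-Means on an (n, 2) array; returns (centers, labels)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct points picked at random from the data.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: attach every point to its nearest center.
        labels = nearest_center(points, centers)
        # Update step: move each center to the mean of its points.
        # A center that lost all of its points stays where it was.
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break  # centers no longer move, so the grouping is stable
        centers = new_centers
    # Final labels, consistent with the centers that are returned.
    return centers, nearest_center(points, centers)


# Synthetic (long, lat)-style demo data: two well-separated blobs (made up).
demo_rng = np.random.default_rng(42)
blob_a = demo_rng.normal(loc=(-100.0, 40.0), scale=1.0, size=(50, 2))
blob_b = demo_rng.normal(loc=(10.0, 50.0), scale=1.0, size=(50, 2))
demo_points = np.vstack([blob_a, blob_b])

demo_centers, demo_labels = kmeans_sketch(demo_points, k=2)
print(demo_centers)
```

The result depends on the random starting centers, which is one reason library implementations usually run several initializations and keep the best one.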


A downside of this method is that we have to choose the number of clusters ourselves. Since we plotted the data above, we can estimate a sensible number by looking at how the data is spread.
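When eyeballing a plot is not enough, one common heuristic is the "elbow" method: fit K-Means for several values of k and look for the point where the within-cluster sum of squares (exposed by scikit-learn as `inertia_`) stops dropping sharply. The sketch below is an assumption about how one might do this, not part of the original analysis. It runs on made-up synthetic points rather than the UFO data, and the variable names are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the (long, lat) array: three made-up blobs.
rng = np.random.default_rng(0)
blob_centers = [(-100.0, 40.0), (10.0, 50.0), (135.0, -25.0)]
sample = np.vstack([
    rng.normal(loc=center, scale=2.0, size=(100, 2)) for center in blob_centers
])

candidate_ks = list(range(1, 7))
inertias = []
for k in candidate_ks:
    # n_init and random_state are fixed so that repeated runs are comparable.
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(sample)
    inertias.append(model.inertia_)  # within-cluster sum of squared distances

for k, inertia in zip(candidate_ks, inertias):
    print(f"k={k}: inertia={inertia:.1f}")
```

On data like this the inertia falls steeply up to k=3 and only slowly afterwards, which suggests three clusters. On real data the elbow is often less pronounced, so it is a hint rather than a rule.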

    from sklearn.cluster import KMeans

    # geo and plt come from the plotting snippet above
    kmeans = KMeans(n_clusters=4)           # 4 clusters
    kmeans.fit(geo)                         # fit on the (long, lat) array
    y_kmeans = kmeans.predict(geo)          # cluster label for every sighting
    plt.scatter(geo[:, 0], geo[:, 1], c=y_kmeans, s=50, cmap='plasma')
    centers = kmeans.cluster_centers_       # cluster centers
    plt.scatter(centers[:, 0], centers[:, 1], c='black', s=150, alpha=0.5)
    plt.show()                              # show plot
    plt.clf()                               # clear plot


These four clusters are an example of how clustering can reveal relations in data. In this example, they correspond to (parts of) continents.
By calculating the centers of these clusters (or, more realistically, of clusters at a finer level of detail), we could for instance compare them to landmarks or tourist areas such as memorials. This gives us a starting point for exploring certain areas further.
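As a rough illustration of such a comparison, the sketch below finds the landmark closest to a cluster center using the haversine great-circle distance. This is an assumed approach, not something taken from the original code. The helper names, the tiny landmark table (approximate coordinates), and the example center are all hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Illustrative landmarks as (lat, long) in degrees; coordinates are approximate.
landmarks = {
    "Statue of Liberty": (40.6892, -74.0445),
    "Eiffel Tower": (48.8584, 2.2945),
}


def nearest_landmark(center_long_lat, landmarks):
    """Return (name, distance_km) of the landmark closest to a cluster center.

    The center is given as (long, lat), matching the column order used for
    the clustering input in this post.
    """
    lon, lat = center_long_lat
    name = min(landmarks, key=lambda n: haversine_km(lat, lon, *landmarks[n]))
    return name, haversine_km(lat, lon, *landmarks[name])


# Hypothetical cluster center; in practice this would be a row of
# kmeans.cluster_centers_.
example_center = (-75.0, 41.0)
name, distance = nearest_landmark(example_center, landmarks)
print(f"{name}: {distance:.0f} km")
```

Note the column order: the clustering input stores longitude first, while the distance helper takes latitude first, so the values are swapped inside `nearest_landmark`.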

Get the code here, or get in touch if you want to dive deeper!

