Basic Machine Learning - Anomaly Detection

What's weird about this?

At certain times you might be faced with unexpected patterns or events appearing in your data. Let's take a look at how we can tackle anomalies by detecting them.

Imagine you're exploring a data set and suddenly notice some anomalies.

As an example we'll take a look at unexpected locations of player kills in a video game. Every record has a map and an x and y position attached to it.

According to the data there are two maps, so first we need to filter the data down to just one of them:

import pandas as pd
df = pd.read_csv(filePath, usecols=['map','victim_position_x','victim_position_y'], nrows=20000)  # filePath: path to the CSV; first 20K rows
df = df.loc[df['map'] == 'ERANGEL']  # keep only one of the maps

If we then plot this data we get a good-looking cluster of points:

import matplotlib.pyplot as plt
deaths = df[['victim_position_x','victim_position_y']].to_numpy()  # as_matrix() was removed in pandas 1.0
plt.scatter(deaths[:,0], deaths[:,1])
plt.show()
plt.clf()


Immediately we can spot quite a few outliers in our data, but how do we decide which points are anomalies and which aren't? To do this we can use the Gaussian (also called normal) distribution.

The Gaussian distribution is a function that describes how observations spread around their mean. Once fitted to the data through its mean and variance, it can be used to identify extreme values that fall outside the general pool of observations.
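
For reference, here is the density a position (x, y) would get if the two coordinates are modelled as independent Gaussians, each with its own mean and variance. This independence is an assumption made for illustration; the post does not state whether the notebook uses this form or a full multivariate one.

```latex
p(x, y) =
  \frac{1}{\sqrt{2\pi\sigma_x^2}} \exp\!\left(-\frac{(x-\mu_x)^2}{2\sigma_x^2}\right)
  \cdot
  \frac{1}{\sqrt{2\pi\sigma_y^2}} \exp\!\left(-\frac{(y-\mu_y)^2}{2\sigma_y^2}\right)
```

Positions far from the mean on either axis get a very small p, which is what makes them candidates for being anomalies.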

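mu = deaths.mean(axis=0)    # mean per axis
sigma = deaths.var(axis=0)  # variance per axis (not the standard deviation, despite the name)
print(mu, sigma)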

[5.71298987 5.35145847] [7.36143001 6.82879176]
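
The post later uses probabilities p without showing how they are computed. The sketch below shows one way to get them from a per-axis mean and variance, assuming the x and y coordinates are treated as independent. The function name gaussian_probability and the toy data are hypothetical and not taken from the notebook.

```python
import numpy as np

def gaussian_probability(points, mu, var):
    """Density of each row of `points` under independent per-axis Gaussians."""
    points = np.asarray(points, dtype=float)
    # One-dimensional normal density per coordinate, evaluated element-wise.
    per_axis = np.exp(-((points - mu) ** 2) / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)
    # Independence assumption: the joint density is the product over the axes.
    return per_axis.prod(axis=1)

# Toy data, made up for illustration: a tight cluster plus one far-away point.
toy = np.array([[5.0, 5.0], [5.5, 4.5], [4.5, 5.5], [6.0, 5.0], [20.0, 20.0]])
toy_mu = toy.mean(axis=0)   # per-axis mean
toy_var = toy.var(axis=0)   # per-axis variance
toy_p = gaussian_probability(toy, toy_mu, toy_var)
print(toy_p.argmin())  # index of the least likely point
```

On the real data the call would look like p = gaussian_probability(deaths, mu, sigma), since sigma holds the variances here.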

Next we compute, for every death, the probability of its position under the fitted normal distribution, and determine a probability threshold below which a point counts as an outlier (see the notebook for the select_threshold function).

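import numpy as np
epsilon, f1 = select_threshold(pval, yval)  # epsilon: the chosen probability threshold
outliers = np.where(p < epsilon)  # indices of the deaths whose probability lies below the threshold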
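
The select_threshold function itself lives in the notebook and is not shown here. As a rough idea of what such a function can do, the sketch below tries a grid of candidate thresholds and keeps the one with the best F1 score on labelled validation data. This is an assumption about the approach, not the notebook's code: the name select_threshold_sketch, the steps parameter, the meaning of yval (1 = anomaly, 0 = normal) and the toy values are all hypothetical.

```python
import numpy as np

def select_threshold_sketch(pval, yval, steps=1000):
    """Return (best_epsilon, best_f1) over a grid of candidate thresholds.

    pval: probabilities of the validation points.
    yval: labels of the validation points, 1 = anomaly, 0 = normal (assumed).
    """
    pval = np.asarray(pval, dtype=float)
    yval = np.asarray(yval).astype(bool)
    best_epsilon, best_f1 = 0.0, 0.0
    for epsilon in np.linspace(pval.min(), pval.max(), steps):
        predicted = pval < epsilon          # points flagged as anomalies
        tp = np.sum(predicted & yval)       # true positives
        fp = np.sum(predicted & ~yval)      # false positives
        fn = np.sum(~predicted & yval)      # false negatives
        if tp == 0:
            continue                        # F1 is zero or undefined, skip
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best_f1:
            best_epsilon, best_f1 = epsilon, f1
    return best_epsilon, best_f1

# Made-up validation data: two labelled anomalies with very low probability.
toy_pval = np.array([0.30, 0.25, 0.28, 0.001, 0.27, 0.002])
toy_yval = np.array([0, 0, 0, 1, 0, 1])
toy_epsilon, toy_f1 = select_threshold_sketch(toy_pval, toy_yval)
print(toy_epsilon, toy_f1)
```

Scoring on F1 rather than plain accuracy matters because anomalies are rare: a threshold that flags nothing would still be "accurate" most of the time.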

We can then apply this threshold to indicate which deaths are normal and which are anomalies. Plotting the data, we can show the normal points in blue and the outliers as red dots:

# plot data
plt.scatter(deaths[:,0], deaths[:,1])
# plot outliers
plt.scatter(deaths[outliers[0],0], deaths[outliers[0],1], s=50, color='r', marker='o')
plt.show()

Of course this is only one way of doing anomaly detection. In the future we may look at other techniques to tackle this problem.

Get the code here!
