Using Statistics to Your Advantage

Predictive Analytics: Know What You’re Up Against

'Today's business applications are raking in mountains of new customer, market, social listening, and real-time app, cloud, or product performance data. Predictive analytics is one way to leverage all of that information, gain tangible new insights, and stay ahead of the competition' - www.pcmag.com

The idea that historical data can tell us something about the future is the core concept of time series forecasting. By systematically collecting, storing and analyzing data, it is possible to gain useful insights and business value through the use of statistics. Imagine, for example, that a retail store could estimate the quantities of fresh products it is most likely to sell next week, not only from the experience of its employees or by guessing, but on statistically valid grounds. The store could then restock based on these statistics and minimize the amount of unsold fresh produce that has to be thrown away.

In this blog post I will demonstrate that, with the appropriate tools (PDI and Weka), it is possible to quickly gain insights into collected data and make statistically valid predictions about the future.

The example used in this blog post is based on a flights dataset, collected over 14 years (1995 – 2008), containing information about flights in the United States. By analyzing the data from 1995 until 2007, a prediction will be made for the number of flights in 2008. This blog post contains two examples: one forecasting the number of flights per month for 2008 and one forecasting the number of flights per day in 2008. To demonstrate the accuracy of the forecasts, the statistical predictions will be compared with the actual numbers from the year 2008.

To start off, all the flight data is loaded into a single Vertica node installed on a CentOS virtual machine. The fact table contains 86 289 323 records divided over 14 years.

To make two types of forecasts, one per month and one per day, the historical data from 1995 until 2007 has to be grouped per month and per day, and each result written to a CSV file. This way, the Weka software can be used to generate two models, one for each type of forecast. How to generate a model using Weka is explained in a previous blog post. After several tests comparing the forecast accuracy of different algorithms, the following algorithms were chosen to generate the models:

  • Gaussian process algorithm for forecasting per day
  • Multilayer perceptron for forecasting per month
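The grouping step described above (counting flights per month and per day and writing each result to a CSV file) can be sketched in a few lines of Python. This is only an illustration, not the PDI transformation or the database query used for this post: the function names, the column headers and the four sample dates are all made up, and in-memory buffers stand in for the real CSV files.

```python
import csv
import io
from collections import Counter
from datetime import date


def count_flights(flight_dates):
    """Count flights per (year, month) and per calendar day.

    flight_dates is an iterable with one datetime.date per flight.
    """
    per_month = Counter()
    per_day = Counter()
    for d in flight_dates:
        per_month[(d.year, d.month)] += 1
        per_day[d] += 1
    return per_month, per_day


def write_counts(counter, header, out, fmt):
    """Write one 'period,flights' row per key, in chronological order."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header)
    for key in sorted(counter):
        writer.writerow([fmt(key), counter[key]])


# Hypothetical sample: one date per flight (NOT taken from the real dataset).
sample = [date(1995, 1, 1), date(1995, 1, 1), date(1995, 1, 2), date(1995, 2, 1)]
per_month, per_day = count_flights(sample)

# StringIO buffers stand in for the two CSV files handed to Weka.
monthly_csv = io.StringIO()
write_counts(per_month, ["month", "flights"], monthly_csv,
             lambda k: f"{k[0]}-{k[1]:02d}")

daily_csv = io.StringIO()
write_counts(per_day, ["date", "flights"], daily_csv,
             lambda d: d.isoformat())

print(monthly_csv.getvalue())
print(daily_csv.getvalue())
```

Keeping the rows in chronological order matters here, because a time series model reads the file as an ordered sequence of observations.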

The generated models are used in the Weka forecasting steps of two separate PDI transformations to forecast 12 units (the 12 months of 2008) and 366 units (the 366 days of 2008) respectively. The resulting data (the months or days, the actual and predicted number of flights for 2008, and the difference between the two) is written to two separate database tables for easier reporting through the Pentaho report analyzer. The result will graphically demonstrate the accuracy of the predictions. Note that all the forecasting results are based on a model generated from the 1995 – 2007 data.

The first forecast graph shows the actual number of flights, the forecasted number of flights and the difference between the two for 2008 per month.

The graph is based on the following data.

The forecasted numbers appear to follow the general trend of the curve. The difference between the two varies from 10095 flights on a total of 2444755 (June) to 254389 flights on a total of 1970431 (October), which means the percentage difference varies between 0.4% and 11.4%. The average percentage difference based on this data is 2.27%.
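For readers who want to reproduce this kind of comparison, the percentage difference can be computed as in the sketch below. It assumes that the difference is taken as an absolute value and divided by the quoted total, and that the average is a plain mean of the per-period percentages; the post does not spell out either choice, so treat both as assumptions. The function names and the three-element sample lists are made up for illustration; only the June figures come from the text above.

```python
def as_percentage(difference, total):
    """Express an absolute difference as a percentage of a total."""
    if total == 0:
        raise ValueError("total must be non-zero")
    return abs(difference) / total * 100


def percentage_difference(actual, forecast):
    """Absolute forecast error as a percentage of the actual value (assumed definition)."""
    return as_percentage(actual - forecast, actual)


def mean_percentage_difference(actuals, forecasts):
    """Plain average of the per-period percentage differences (assumed definition)."""
    if len(actuals) != len(forecasts):
        raise ValueError("actuals and forecasts must have the same length")
    values = [percentage_difference(a, f) for a, f in zip(actuals, forecasts)]
    return sum(values) / len(values)


# June figures quoted in the post: a difference of 10095 flights on 2444755.
june = as_percentage(10095, 2444755)
print(f"June: {june:.2f}%")  # June: 0.41%

# Made-up numbers (NOT from the flights data) to show the averaging step.
actuals = [100, 200, 400]
forecasts = [110, 190, 400]
mean_diff = mean_percentage_difference(actuals, forecasts)
print(f"Mean: {mean_diff:.2f}%")  # Mean: 5.00%
```

Taking the absolute value before averaging keeps over- and under-predictions from cancelling each other out, which would make the forecast look more accurate than it is.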

The second forecast graph shows the actual number of flights, the forecasted number of flights and the percentage difference between the two for January 2008 per day.

This graph is based on the following data.

Again, the forecasted number appears to follow the general trend of the actual number of flights. The percentage difference varies from 0.04% (a difference of 33 flights on 01-01-2008) to 9.86% (a difference of 6623 flights on 26-01-2008). The average percentage difference based on this data is 2.54%.

As the graphs show, forecasting algorithms can detect trends in historical data quite accurately and (in this example) manage to predict the number of flights with an average difference of 2.27% per month and 2.56% per day.

While a crystal ball predicting the future remains a fantasy, knowing what you’re up against in the foreseeable future, backed by statistically valid arguments, does give a business a considerable advantage.