Using the Weka scoring and forecasting plugin in Pentaho Data Integration

Weka Plugins for PDI: Scoring - Forecasting

‘When it comes to big data, the real opportunity for enterprises is in advanced data analytics, specifically machine learning.’ - http://www.information-age.com/

Weka, or Waikato Environment for Knowledge Analysis, is a software tool in the Pentaho platform that allows users to apply machine learning algorithms to data. Machine learning algorithms are mechanisms that allow computers to recognize patterns in data, learn from them and apply these patterns to other data. A training dataset ‘teaches’ the computer the patterns (in the form of a model) so it can recognize these patterns in other datasets. Using these models, data miners try to recognize patterns in datasets in order to learn more about the available data.

Pentaho Data Integration (PDI) supports Weka machine learning directly in the form of two plugins: Weka scoring and Weka forecasting. This enables PDI users to directly call on scoring and forecasting models generated in Weka to apply to the data in their transformations. For more information about the Weka support in Pentaho, visit the Pentaho data-mining forums at http://community.pentaho.com/projects/data-mining/.

This blog will demonstrate how to use the Weka scoring and forecasting plugins in PDI.

First, check which Weka version came with your PDI installation. When working with Weka and PDI, it is very important to use the same version of Weka consistently. The general working method is to first create a model using the Weka software, and then use that model in PDI to configure, in this case, either the Weka scoring or the Weka forecasting step. The version of the Weka software used to generate the model has to match the Weka version in your PDI plugin folders.

The Weka files that came with PDI can be found in two folders:

  • Data-integration/plugins/weka-forecasting/lib
  • Data-integration/plugins/weka-scoring/lib

The jar file to check is the pdm-3.X-ce-3.X.X.X jar file, where 3.X represents the Weka version.

The Weka forecasting plugin requires Weka version 3.8 or higher. If the Weka version that came with the PDI installation is lower than 3.8, replace the Weka jar file (‘pdm-3.X-ce-3.X.X.X.jar’) with a newer version. You can download Weka via https://sourceforge.net/projects/weka/files/.

After downloading, copy the ‘weka.jar’ file to both plugin folders to replace the existing jar file. Using the same Weka version for both plugins reduces the chance of accidentally generating a model with the wrong version.

TIP: to avoid version issues with generated models, start Weka from the jar file that was copied to the plugin folders. This ensures that the model is generated with the same Weka version that the scoring and forecasting steps in PDI call on.
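
The version check described above can also be scripted. The following is a minimal sketch in Python, assuming the jar file follows the pdm-3.X-ce-3.X.X.X.jar naming pattern mentioned earlier. The function names and the example file name are hypothetical and not part of PDI or Weka.

```python
import re
from pathlib import Path

# Assumed naming pattern, taken from the text: pdm-3.X-ce-3.X.X.X.jar
JAR_PATTERN = re.compile(r"^pdm-(\d+)\.(\d+)-ce-[\d.]+\.jar$")


def weka_version_from_jar(jar_name):
    """Return (major, minor) parsed from a jar file name, or None if it does not match."""
    match = JAR_PATTERN.match(jar_name)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def find_weka_versions(plugin_lib_dir):
    """Map every matching jar file name in a plugin lib folder to its (major, minor) version."""
    versions = {}
    for jar in Path(plugin_lib_dir).glob("*.jar"):
        version = weka_version_from_jar(jar.name)
        if version is not None:
            versions[jar.name] = version
    return versions


def meets_forecasting_minimum(version, minimum=(3, 8)):
    """True if the version is at least the 3.8 minimum stated for the forecasting plugin."""
    return version >= minimum


# Hypothetical file name, used only to illustrate the pattern.
example = weka_version_from_jar("pdm-3.7-ce-3.7.12.0.jar")
print(example, meets_forecasting_minimum(example))  # (3, 7) False
```

Running find_weka_versions on both plugin lib folders and comparing the results shows whether the two plugins use the same version. A plain ‘weka.jar’ carries no version in its name, so this name-based check returns None for it.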

The dataset used in the examples contains European unemployment figures. The dataset is freely available via https://developers.google.com/public-data/docs/exa...

Weka scoring

Scoring means attaching a prediction (score) for a field value to a new piece of incoming data, based on a previously built model.

A scoring mechanism determines the value of a chosen field based on the values of other fields. In this example, that value is determined by a decision tree that is stored in the model. A decision tree contains decisions based on values in the data, and each end point of the tree yields a score for the field value.

The Weka scoring plugin allows PDI to use a decision tree model (or a different type of model) in a transformation to score the data handled by the transformation.

To start using the Weka scoring plugin in PDI, first start the Weka software and open the Explorer to generate the model that PDI will use.

To generate a model, Weka requires an example file to use as a reference. Open a file by clicking ‘Open file’ in the Preprocess tab.

This opens a dialog for specifying the file type and selecting the file. Checking ‘Invoke options dialog’ opens an additional dialog for specifying how the data in the file should be read.

This can be very useful for specifying dates and date formats so Weka can recognize the dates in the file more easily. This is not necessary for this test case, but will be especially important when working with the forecasting plugin later on.

When the file is successfully imported, Weka should recognize the fields and data in the file.

The Classify tab allows the user to choose (via the Choose button) a classification type and the field for which the scoring model will be generated. A typical classification type for scoring data is a decision tree.

A decision tree is a model of consecutive decisions based on field values, resulting in multiple possible outcomes. A series of decisions always leads to an outcome, and a different sequence of decisions can lead to a different outcome. Every decision narrows down the field value to score. In the example case, when scoring the sex of the group to which the data applies, the values of the fields in each incoming row help decide whether the row applies to a male or a female group.
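
To make the idea of scoring with a decision tree concrete, here is a minimal sketch in Python. The tree, the field names and the thresholds are all made up for illustration. A real model generated by Weka is learned from the data and will look different.

```python
# Hypothetical decision tree: field names and thresholds are illustrative only.
TREE = {
    "field": "unemployment_rate",
    "threshold": 9.0,
    "below": {
        "field": "unemployment_thousands",
        "threshold": 500,
        "below": {"leaf": "female"},
        "at_or_above": {"leaf": "male"},
    },
    "at_or_above": {"leaf": "female"},
}


def score(row, node=TREE):
    """Walk the tree from the root until a leaf is reached and return its label."""
    while "leaf" not in node:
        branch = "below" if row[node["field"]] < node["threshold"] else "at_or_above"
        node = node[branch]
    return node["leaf"]


rows = [
    {"unemployment_rate": 7.5, "unemployment_thousands": 620},
    {"unemployment_rate": 10.2, "unemployment_thousands": 300},
]

# Append the predicted value to each incoming row.
for row in rows:
    row["predicted_sex"] = score(row)
    print(row)
```

Each row follows one path through the tree, and the leaf at the end of that path supplies the predicted value that is added to the row.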

There are multiple decision tree classification types available.

  • RandomTree: generates a single decision tree, considering a randomly chosen subset of the attributes at each node. The score is predicted solely from this one tree.
  • REPTree (Reduced Error Pruning Tree): a fast decision tree learner that first builds a tree and then prunes it: branches that do not reduce the error on a held-out part of the training data are removed. For a numeric target, such an error can be measured as the mean squared error, the average of the squared differences between predicted and actual values.
  • M5P model tree: combines a conventional decision tree with the possibility of linear regression functions at the nodes. Linear regression models the relationship between one scalar, dependent variable and one or more explanatory, independent variables.
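
The mean squared error mentioned for REPTree is straightforward to compute. The sketch below, in Python, compares the predictions of two hypothetical candidate trees against a few held-out values. The numbers are invented, and this is not Weka's pruning code, only an illustration of the error measure.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Invented held-out values and predictions from two hypothetical candidate trees.
actual = [8.0, 9.0, 10.0]
candidates = {
    "tree_a": [8.0, 9.5, 10.5],
    "tree_b": [7.0, 9.0, 12.0],
}

errors = {name: mean_squared_error(actual, preds) for name, preds in candidates.items()}
best = min(errors, key=errors.get)
print(errors, best)
```

The candidate with the lowest mean squared error (tree_a here) fits the held-out values best.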

For scoring purposes, there are many other modeling types available. Different classifying modeling types use different underlying algorithms, all of them with the purpose of attaching a prediction of a field value to an incoming row of data.

Pressing Start generates the model. The Classifier output window displays the results of applying the chosen algorithm to the file data. Right-clicking a result in the result list gives the option to save the model. This is the model PDI will use in its scoring step.

Open PDI and create a transformation with the following steps: CSV file input and Weka scoring.

Configure the CSV file input to import the CSV file used to generate the model. Configure the Weka scoring step to load the model generated by Weka. The Fields mapping tab should show the data types of the model attributes and the incoming fields.

If the data types don’t match, adjust the imported data types in the CSV file input step until they do. Once the data types match, the Model tab should show the model.

If the Model tab doesn’t show the model even though the data types match, check whether the Weka version used to generate the model matches the Weka version in your PDI plugins/weka-scoring folder.

Previewing the Weka scoring step will show the predicted field value according to the model used.

Weka forecasting

Forecasting means predicting field values over time based on previously known values over a period of time.

A forecasting mechanism uses known data over a period of time in the past to predict how this data is going to further evolve. For example, known unemployment figures of a country over the past 20 years allow for predicting how the unemployment figures are going to evolve over the next months/years. The base of this forecasting mechanism is a model generated from template data. This model is used to predict future data values.

The Weka forecasting plugin allows PDI to use a forecasting model generated by Weka, in combination with the data in the transformation, to predict the further evolution of this data.

Using PDI, some of the data in the European unemployment dataset has been modified to better suit the Weka time series package. The modification makes the data easier to import and the results of the model easier to understand.

In the modification, the date format is edited and the data is filtered for only one country and one seasonality. The modification transformation looks as follows and is available for download here.

To start using the PDI Weka forecasting plugin, a forecasting model has to be generated using Weka. This requires an additional package to be installed via the Package manager, found under the Tools menu item.

Locate the timeseriesForecasting package under the Time Series category and install it. Preferably install the latest version, although the versions on offer can be limited by the Weka version in use.

Once the timeseriesForecasting package is installed, the Explorer GUI shows an extra ‘Forecast’ tab.

To generate a model based on the modified file, import the file in the Explorer. This time, check ‘Invoke options dialog’ for extra import specifications. Make sure to fill in the column number of your date attribute and the date format. In the example, the date column was the fourth column, in the format dd-MM-yyyy. Specifying this information allows Weka to recognize the date attribute and automatically use it as the time stamp in the forecasting model.
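
Before importing, it can help to confirm that every value in the date column really follows the dd-MM-yyyy format. The sketch below does this in Python, where the equivalent format string is %d-%m-%Y. The sample values are invented and the helper name is hypothetical.

```python
from datetime import datetime

# The Java-style pattern dd-MM-yyyy corresponds to %d-%m-%Y in Python.
DATE_FORMAT = "%d-%m-%Y"


def unparseable_dates(values, fmt=DATE_FORMAT):
    """Return the values that cannot be parsed with the given date format."""
    bad = []
    for value in values:
        try:
            datetime.strptime(value, fmt)
        except ValueError:
            bad.append(value)
    return bad


# Invented sample values: one in the wrong format, one that is not a real date.
sample = ["01-01-2010", "2010-02-01", "31-02-2010", "01-03-2010"]
print(unparseable_dates(sample))  # ['2010-02-01', '31-02-2010']
```

Any value reported here would need to be fixed or reformatted, for example in the PDI modification transformation, before the file is loaded in Weka.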

Generating the forecasting model can be configured in the Forecast tab.

The Target Selection area specifies the fields to predict in the model. The Parameters area allows for more specific configuration. The most important parameter is the Time stamp, which, after specifying the date attribute in the options dialog, should be automatically filled in with the Date field. The periodicity is set to <Detect automatically>. The Parameters area also specifies the number of time units to forecast. To see the effect of Weka forecasting, set the number of time units to 24 (2 years of monthly data) and press Start.

In the advanced configuration tab, the algorithm used to generate the model can be specified. Weka provides several algorithms for forecasting. Deciding which algorithm will be used to generate the model is an important task and is based on the data on which the forecast will be based.

  • Linear Regression: Linear regression is the default algorithm chosen by Weka for forecasting. The algorithm relies on two types of variables: a scalar, dependent variable and one or more explanatory, independent variables. It estimates the influence of the explanatory variables on the dependent variable, and the resulting model forecasts the dependent variable from new values of the explanatory variables.
  • Multilayer Perceptron: The multilayer perceptron is a supervised artificial neural network that consists of at least three layers: an input layer that receives the data, one or more hidden layers that combine the inputs through weighted connections, and an output layer that produces the prediction. During training, the output is compared with the expected result (hence supervised) and the weights are adjusted to reduce the error.
  • Gaussian Process: A Gaussian process is a statistical model in which every point in a given time or space is associated with a normally distributed random variable. Based on the values observed at some points, values for other points in time or space can be predicted.
  • Holt-Winters: The triple exponential smoothing method, taking into account trend and seasonality (periodicity) in data.
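
As an illustration of the linear regression idea in the list above, the following Python sketch fits a straight line to a series using the time index as the only explanatory variable, and extends that line into the future. This is a deliberately simplified stand-in and not the way the Weka forecasting package builds its model. The function names and the data are hypothetical.

```python
def fit_line(values):
    """Ordinary least squares fit of value = intercept + slope * t, for t = 0, 1, 2, ..."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    mean_t = (n - 1) / 2
    mean_y = sum(values) / n
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in enumerate(values))
    sxx = sum((t - mean_t) ** 2 for t in range(n))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_t
    return intercept, slope


def linear_forecast(values, steps):
    """Extend the fitted line by the given number of future time units."""
    intercept, slope = fit_line(values)
    n = len(values)
    return [intercept + slope * (n + k) for k in range(steps)]


# Invented series that grows by 2 per time unit, starting at 1.
history = [1.0 + 2.0 * t for t in range(10)]
print(linear_forecast(history, 3))
```

Because the invented series is perfectly linear, the three forecast values continue the line (approximately 21, 23 and 25). Real data with seasonal swings would not be captured by a single straight line.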

This means that, for example, for the sales figures of a toy store (which are greatly influenced by seasonality due to the holiday period), the Holt-Winters triple exponential smoothing method will typically provide a more accurate forecast than the linear regression model.
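
To show how a seasonality-aware method works, here is a minimal sketch of additive Holt-Winters (triple exponential smoothing) in Python. The initialization and the smoothing parameters alpha, beta and gamma are simple assumptions chosen for illustration. Weka's own implementation may initialize and tune these differently, and the data is invented.

```python
def holt_winters_additive(values, season_length, steps, alpha=0.5, beta=0.1, gamma=0.1):
    """Forecast future values with additive level, trend and seasonal components."""
    m = season_length
    if len(values) < 2 * m:
        raise ValueError("need at least two full seasons of data")

    first, second = values[:m], values[m:2 * m]
    # Assumed initialization: level is the mean of the first season,
    # trend is the average per-step change between the first two seasons.
    level = sum(first) / m
    trend = (sum(second) - sum(first)) / (m * m)
    seasonal = [y - level for y in first]

    for t in range(m, len(values)):
        y = values[t]
        s = seasonal[t % m]
        previous_level = level
        # Smooth the deseasonalized observation against the previous projection.
        level = alpha * (y - s) + (1 - alpha) * (previous_level + trend)
        trend = beta * (level - previous_level) + (1 - beta) * trend
        seasonal[t % m] = gamma * (y - level) + (1 - gamma) * s

    n = len(values)
    return [level + (k + 1) * trend + seasonal[(n + k) % m] for k in range(steps)]


# Invented series with a repeating pattern of length 4 and no trend.
history = [10.0, 12.0, 14.0, 12.0] * 3
print(holt_winters_additive(history, season_length=4, steps=4))
```

For this invented, perfectly periodic series the forecast repeats the seasonal pattern (approximately 10, 12, 14, 12), which a single straight line fitted to the same data could not do.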

Clicking Start generates the model. The Output tab then shows the data with the added predictions, and the Train future pred. tab shows a graph of the data. Right-click the result in the list to save the model.

To use the model in the PDI Weka forecasting plugin, create a transformation with the following steps: CSV file input and Weka forecasting.

Configure the CSV file input step to import the CSV file used to generate the model. Configure the Weka forecasting step to use the model generated in Weka. Like the Weka scoring step, the Fields mapping should show the model attributes compared to the incoming fields and the Model tab should show the model used.

If the Model tab doesn’t show the model correctly, the Weka version used to generate the model might not match the Weka version in your PDI installation.

Previewing the Weka forecasting step reveals the data from the imported file appended with additional predictions for future data.

Both the Weka scoring and forecasting plugins have a wide range of use cases. With the amount of data growing as fast as it does, understanding what this data means is a vital part of understanding a business and the world around it.

The combined power of Weka and PDI helps ensure that vast amounts of data can be easily processed, managed and understood, giving data scientists and business owners the knowledge they need to make a business work in the modern-day world.
