Pentaho Kettle Thin JDBC Driver

When Matt announced the Pentaho Kettle JDBC Thin Driver earlier this summer, he opened up a whole new world of possibilities for Java-based applications to access data.
Whereas JDBC is normally used to access databases, it is now possible to use a Kettle transformation as a data source and as such unlock about any data source or combination of data sources. This has already been used to run BIRT reports on top of transformations, to query transformations in SQuirreL, to stream data from a transformation into QlikView, etc...

A commonly requested use case when working with Mondrian (Pentaho’s OLAP engine) cubes is the comparison of actual vs forecast sales figures. Forecast data is rarely loaded into data warehouses, making it hard to compare actual and forecast figures in one OLAP cube. One Kettle transformation and the Kettle JDBC driver, however, turns this comparison into a triviality.

We’ll have a look at what it takes to run a transformation as the source of a Mondrian cube. First of all, the JDBC Thin Driver is not part of a released Kettle version, so a recent build 
needs to be downloaded from (or use these direct links for Windows or Linux). After unzipping, any transformation can be used as a JDBC data source. In the example transformation below, actual sales data is read from the data warehouse (errr, Steel Wheels), and combined with the forecast figures per month and per product line for the first half of 2005 (Excel file). 
Note: this transformation uses the BI server's SampleData Hypersonic database, so the Hypersonic server (included in the BI server) needs to be started before you can use the example transformation.

Next, a carte config file is needed. Carte, Kettle's lightweight web server, serves as the entry point to accept JDBC calls and pass them to the transformation. The configuration file, carte-jdbc-config.xml for this example, will tell Carte which step in which transformation will be used as a data source. Apart from the transformation and step, the host name and port are also specified (more information about Carte config files can be found here).

<?xml version="1.0" encoding="UTF-8"?>
<service_step>Get Performanceservice_step>

Carte can be started with this config file’s full path as a command line parameter.
e.g. ./ /opt/pentaho/pdi/carte-jdbc-config.xml.

INFO  12-09 14:25:33,748 - Carte - Installing timer to purge stale objects after 1440 minutes.
INFO  12-09 14:25:33,795 - Logging to org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via org.mortbay.log.Slf4jLog
INFO  12-09 14:25:33,871 - Carte - Created listener for webserver @ address : localhost:10001
INFO  12-09 14:25:33,871 - jetty-6.1.21
INFO  12-09 14:25:33,933 - Started SocketConnector@localhost:10001

Before the transformation can be accessed through a query, the necessary jar files have to be copied as jdbc drivers to the client application. 
The actuals to budget comparison in this example will use the Pentaho BI Server and Analyzer, so all jar files from /lib/ need to be copied to /tomcat/webapps/pentaho/WEB-INF/lib. 
As soon as the pentaho BI server is started, a new data source (File --> New Data Source, or just click the 'New Data Source' icon) can be added. Select 'Database Tables', and create a new data source.  
The database type is 'Generic Database', the url format is 'jdbc:pdi://:/kettle', e.g. 'jdbc:pdi://localhost:10001/kettle'. The class name (Custom Driver Class Name) to use is 'org.pentaho.di.core.jdbc.ThinDriver'. Carte's default username/password combination is cluster/cluster (check the wiki for information on how to change these defaults).

After a connection has been successfully made, the transformation's table ('ActualsVsForecast' in this case) can be added and set as the fact table (see lower right corner).

The default Analysis model (Mondrian cube) generates a measure for all numerical columns, and a single hierarchy, single level dimension for all columns. 'Customize Model Now' in the next window can be used to modify this generated model into something more usable.

The model is now ready to start analyzing actuals versus budget results using the Analyzer in the Enterprise Edition, or Saiku in the Community Edition. Any changes to the budget data will be immediately picked up by the transformation, and are readily available in the Analyzer. 

The PDI JDBC Thin Driver opens a world of possibilities to unlock new data sources, while being increadibly easy to set up and use. There are still limits to the SQL the JDBC driver understands, but as with most other Pentaho releases, this is expected to grow at a very rapid pace.

Download a Kettle CI release, grab the examples and try this out for yourself.