Configuring Kettle for Hadoop

Because of the popularity and press attention given to "Big Data", every major and minor BI and DI vendor claims to support it.  For the majority of those vendors the claim is rather hollow, amounting to little more than the ability to get data from Hadoop through a Hive JDBC driver.  Obviously there is more to the world of Big Data than that, and if you like open source software plus solid, wide support for the most popular Big Data platforms, there is no getting around the latest drop of Pentaho Data Integration (Kettle).

While connecting to a NoSQL database like MongoDB is fairly straightforward, configuring a connection to Hadoop has traditionally not been that easy, for the simple reason that there are different Hadoop versions and distributions out there that all have different library dependencies.  Fortunately, with the latest version of the Pentaho Big Data plugin that ships with PDI 4.4.0, configuring Hadoop has become remarkably easy.

All you need to do is go to the plugins/pentaho-big-data-plugin folder and edit the plugin.properties file.  At the very top of that file you specify the active Hadoop configuration to use.

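The relevant entry looks roughly like this; active.hadoop.configuration is the property name as we remember it from plugin.properties, so double-check it against your own copy:

  # plugins/pentaho-big-data-plugin/plugin.properties
  active.hadoop.configuration=hadoop-20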

This "hadoop-20" value signifies that you want to use Apache Hadoop version 0.20.  "hadoop-20" refers to a folder in the plugins/pentaho-big-data-plugin/hadoop-configurations directory.  Each of the folders in that directory contains a set of libraries.  By default, PDI 4.4.0 ships with libraries for the following Hadoop releases:

  • cdh3u4 : Cloudera's Distribution including Apache Hadoop, version 3 update 4 (CDH3u4)
  • cdh4 : Cloudera's Distribution including Apache Hadoop, version 4 (CDH4)
  • hadoop-20 : Apache Hadoop 0.20
  • mapr : MapR
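
To make this concrete, here is a rough sketch of how the pieces fit together on disk; the exact contents of each folder differ per distribution, and the tree below is our reconstruction rather than a verbatim listing:

  plugins/pentaho-big-data-plugin/
    plugin.properties              <- selects the active Hadoop configuration
    hadoop-configurations/
      cdh3u4/
      cdh4/
      hadoop-20/
        lib/
          client/                  <- libraries used by PDI itself
          pmr/                     <- libraries shipped to your cluster nodes
      mapr/
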
Obviously it's not possible to ship with each and every distribution version out there.  For example, Apache Hadoop 1.0 is not included.  So how would we support this release?  Simple: copy the "hadoop-20" folder to "hadoop1", remove the hadoop-core-0.20.2.jar file from lib/client and replace it with the hadoop-core jar corresponding to the Hadoop version you're using, for example hadoop-core-1.0.4.jar.  Please note that in a lot of cases Hadoop 1.x also needs commons-configuration-1.7.jar, so make sure to add this to either the lib/client folder or simply to PDI itself in libext/commons.  Finally, point the active.hadoop.configuration property at the new "hadoop1" folder, as sketched below.
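
A rough command-line version of those steps; the /path/to/... locations are placeholders for wherever you downloaded the Hadoop 1.0.4 and commons-configuration jars, and the commands are run from the root of the PDI 4.4.0 installation:

  cd plugins/pentaho-big-data-plugin/hadoop-configurations

  # start from the stock Apache Hadoop 0.20 configuration
  cp -r hadoop-20 hadoop1

  # swap the Hadoop core library for the 1.0.4 one
  rm hadoop1/lib/client/hadoop-core-0.20.2.jar
  cp /path/to/hadoop-core-1.0.4.jar hadoop1/lib/client/

  # Hadoop 1.x usually also needs commons-configuration
  cp /path/to/commons-configuration-1.7.jar hadoop1/lib/client/

  # finally, make the new configuration active in plugin.properties:
  #   active.hadoop.configuration=hadoop1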

While the lib/client folder in the hadoop-configurations/* subfolders is meant for PDI itself, you can also add libraries to the classpath of your Hadoop cluster nodes.  To do this, simply drop the libraries into the lib/pmr folder.
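
For example, to make a custom library available to the mappers and reducers of a Pentaho Map/Reduce job, something along these lines should do (my-custom-udf.jar is just an illustrative name):

  cp my-custom-udf.jar plugins/pentaho-big-data-plugin/hadoop-configurations/hadoop1/lib/pmr/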

We are hopeful that Pentaho will keep up with the most popular Hadoop distributions and versions so that users only have to change plugin.properties to start running Pentaho Map/Reduce jobs inside their Hadoop cluster.  For now we are very happy with this new configuration system, and we encourage you to try it out if you haven't done so yet.