Because of the popularity of and press attention given to "Big Data", every major and minor BI and DI vendor claims to support it. For most of these vendors the claim is rather hollow, amounting to little more than the ability to read data from Hadoop through a Hive JDBC driver. Obviously there is more to the world of Big Data than that, and if you like open source software plus solid, wide support for the most popular Big Data platforms, there is no getting around the latest drop of Pentaho Data Integration (Kettle).
While connecting to a NoSQL database like MongoDB is fairly straightforward, configuring a connection to Hadoop has traditionally been harder, for the simple reason that there are many different Hadoop versions and distributions out there, each with its own library dependencies. Fortunately, with the latest version of the Pentaho Big Data plugin that ships with PDI 4.4.0, configuring Hadoop has become remarkably easy.
All you need to do is go to the plugins/pentaho-big-data-plugin folder and edit the plugin.properties file. At the very top you can specify the active Hadoop configuration to use:
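For illustration, the setting in question is a single key at the top of plugin.properties; the exact property name below reflects the PDI 4.4.0 plugin, but check your own copy of the file if it differs:

```
# plugins/pentaho-big-data-plugin/plugin.properties
# Selects which folder under hadoop-configurations/ supplies the Hadoop libraries.
active.hadoop.configuration=hadoop-20
```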
This "hadoop-20" value signifies that you want to use Apache Hadoop version 0.20. "hadoop-20" refers to a folder in the plugins/pentaho-big-data-plugin/hadoop-configurations directory. Each of the folders in that directory contains a set of libraries. By default, PDI 4.4.0 ships with libraries for the following Hadoop releases:
While the lib/client folders in the hadoop-configurations/* subfolders are meant for PDI itself, you can also add libraries to the class path of your Hadoop cluster nodes. To do this, simply add the libraries to the lib/pmr folder.
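A quick sketch of what that looks like on disk. In a real install the base folder would be plugins/pentaho-big-data-plugin inside your PDI directory; here we use a temporary directory, and "my-custom-udf.jar" stands in for whatever library you want on the cluster nodes:

```shell
# Stand-in for plugins/pentaho-big-data-plugin (use your real PDI path instead).
PLUGIN_DIR=$(mktemp -d)
PMR_DIR="$PLUGIN_DIR/hadoop-configurations/hadoop-20/lib/pmr"
mkdir -p "$PMR_DIR"

# Hypothetical jar of your own that the cluster-side code needs.
touch my-custom-udf.jar

# Dropping it into lib/pmr makes it available on the Hadoop nodes' class path.
cp my-custom-udf.jar "$PMR_DIR/"

ls "$PMR_DIR"
```

After this, the jar travels along with the Pentaho MapReduce job rather than having to be installed on every node by hand.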
We are hopeful that Pentaho will keep up with the most popular Hadoop distributions and versions, so that users only have to change the plugin.properties file to start running Pentaho MapReduce jobs inside their Hadoop cluster. For now, know.bi is very happy with this new configuration system and we encourage you to try it out if you haven't done so yet.