Scalable PDI architecture using Docker

Shipping PDI in a Docker container

Docker isn't the new cool anymore: either you're using it, or you simply haven't tried it yet. For those still unfamiliar with the tool: think of Docker as a way to box up your software with all its dependencies, making it host-agnostic. Your 'box', or image, is then used as a blueprint for lightweight virtual machines* whose sole purpose is to execute a given process. This process can be anything from a simple web frontend to an analytics engine, or, more specific to this post, an ETL process.

(*take this with a grain of salt)

So why would we want to run our ETL process in Docker? Ease of deployment comes to mind: we may want to eliminate the need to provision underlying infrastructure for our process to run on, run multiple versions of the same ETL in parallel, or deploy remotely, in the cloud for example. A second reason is scalability: we may have a cluster we want to scale up or down, or the process may simply require more memory (the host still needs to have sufficient free memory, of course).


Pentaho Data Integration (PDI) on Docker - architecture overview

In this post we'll look at building a scalable Pentaho Data Integration (PDI) architecture using Docker: different ways to provision the ETL and data, scheduling, monitoring and automation.


First things first, we'll need a Docker image (I'm going to assume you can run hello-world on Docker). A Docker image is built by following instructions defined in a build file, or Dockerfile. For the purpose of this post we'll be looking at a Community Edition image, although an EE build is equally possible but may require you to add a way to apply licenses. To build the final PDI image we'll use two images: one for the PDI base, and one provisioned image which contains plugins, JARs, properties files and other project dependencies.

Pentaho Data Integration (PDI) on Docker - Base Dockerfile

Example of a Dockerfile for a base image.
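As a rough sketch of what such a base Dockerfile might contain (the base OS, package names and paths here are my assumptions, not the post's exact file):

```dockerfile
# Hypothetical base image for PDI CE.
FROM ubuntu:22.04

# PDI needs Java; unzip, git and ssh support provisioning later on.
RUN apt-get update && apt-get install -y \
      openjdk-11-jre-headless unzip git openssh-client \
    && rm -rf /var/lib/apt/lists/*

# Copy the pre-unzipped PDI distribution from the build directory.
COPY data-integration /opt/data-integration
```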

It's also possible to copy the PDI zip and unzip it within the Dockerfile. However, this will lead to a larger image size: on build, Docker copies everything from its build directory (the build context) into the image, so the zip would end up in a layer alongside the extracted files. If you want a minimal image size, unzip PDI beforehand so that your build directory contains just the Dockerfile and a data-integration directory, then run "docker build -t pdi .".


The Dockerfile contains install commands for Java (needed by PDI), unzip, git (which we'll use to clone the ETL from a repository) and ssh (to authenticate with the git repository).



Pentaho Data Integration (PDI) on Docker - Dockerfile 

Example of a Dockerfile for a provisioning image.
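A minimal sketch of such a provisioning Dockerfile, assuming the base image was tagged pdi and that file names like id_rsa and docker-entrypoint.sh are placeholders for your own:

```dockerfile
# Hypothetical provisioning image built on top of the base image.
FROM pdi

# Create a pentaho user and hand it the PDI install.
RUN useradd -m pentaho && chown -R pentaho:pentaho /opt/data-integration

# SSH key used to authenticate against the git repository (assumed path).
COPY --chown=pentaho:pentaho id_rsa /home/pentaho/.ssh/id_rsa
RUN chmod 600 /home/pentaho/.ssh/id_rsa

# Project dependencies: properties files and plugins.
COPY --chown=pentaho:pentaho kettle.properties /home/pentaho/.kettle/kettle.properties
COPY plugins/ /opt/data-integration/plugins/

# Entry point bash script (see below).
COPY docker-entrypoint.sh /docker-entrypoint.sh
RUN chmod +x /docker-entrypoint.sh

USER pentaho
ENTRYPOINT ["/bin/bash"]
```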

In our provisioning image we start from the previous image, add an SSH key, create a pentaho user, and add properties files, plugins and an entry point bash script (see below). This type of setup will quickly confront you with some design choices. One of the things you should decide is: do you want to add the ETL at build time, or feed it to the container at run time?

If we feed the container the ETL at run time, we create a "build once, run many" configuration, which enables more uptime and quicker deployment. Adding the ETL at build time instead requires us to re-build the image each time we want to run it. The latter is, presumably, the more Docker way of doing things, but I like to treat the ETL as metadata which is consumed by PDI.

We may also want to make the image a bit smarter. Docker allows us to choose an entry point from which the container starts. This can be the program itself, or something more interesting, like a script that lets us choose between Kitchen and Pan.

Pentaho Data Integration (PDI) on Docker - entry point bash script
Example of an entry point bash script.
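A sketch of what such an entry point script could look like. The launcher names and install path are assumptions based on a standard PDI layout under /opt/data-integration, and the .kjb/.ktr convention decides between Kitchen and Pan:

```shell
#!/bin/bash
set -e

# Hypothetical dispatch: a .kjb file is a job (run with Kitchen),
# a .ktr file is a transformation (run with Pan).
launcher_for() {
  case "$1" in
    *.kjb) echo "kitchen.sh" ;;
    *.ktr) echo "pan.sh" ;;
    *) return 1 ;;
  esac
}

REPO="$1"    # git repository containing the ETL
TARGET="$2"  # job (.kjb) or transformation (.ktr) to run

if [ -n "$REPO" ]; then
  # Clone the ETL at run time, then hand the file to the right launcher.
  git clone "$REPO" /tmp/etl
  LAUNCHER="$(launcher_for "$TARGET")" || { echo "expected a .kjb or .ktr file" >&2; exit 1; }
  "/opt/data-integration/$LAUNCHER" -file="/tmp/etl/$TARGET"
fi
```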
After adding this script you can change the Dockerfile line 'ENTRYPOINT ["/bin/bash"]' to ENTRYPOINT [""].

This allows us to run the container with the following command:
  • docker run --privileged -v /data:/[external-dir] provisioned-pdi [entry point script] [git repository] [trans or job name]

This single command mounts an external shared directory where data can be stored, then passes the image name, the entry point script, the git repository containing our ETL, and the transformation/job name, which determines whether Pan or Kitchen is run.
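Filled in with concrete (hypothetical) values, an invocation might look like:

```shell
# Assumed values: /data is shared on the host, the entry point script sits
# at /docker-entrypoint.sh, and the ETL repo contains load_sales.kjb.
docker run --privileged \
  -v /data:/data \
  provisioned-pdi \
  /docker-entrypoint.sh \
  git@example.com:etl/repo.git \
  load_sales.kjb
```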


Pictured below is an example setup: an external share holds the data to process, the ETL lives in a git repository, and the Docker images are stored in a Docker repository. The image is deployed on a host as a Docker container process, writing to a database.




Pentaho Data Integration (PDI) on Docker - architecture overview



This setup is of course just an example; using Docker with PDI enables you to de-couple your loads and deploy easily at will, instead of having to provision a server with the needed dependencies.


You can find the source here!
