Apache Hop is a visual, metadata-driven data engineering platform that allows data professionals to build and run data pipelines without the need to write code.
Apache Hop was designed and built to support data projects throughout the entire life cycle, from the moment a data point arrives in your organization until it lands in your data warehouse or analytics platform.
Apache Hop has built-in support for hundreds of source and target data platforms, including file formats, relational, graph, and NoSQL databases, and many others. Apache Hop is built to process any volume of data: from edge devices in IoT projects, through standard data warehousing projects, to distributed platforms that process petabytes of data.
Hop users visually develop pipelines and workflows using metadata to describe how data should be processed. This visual design enables data developers to focus on what they want to do, without the need to spend countless hours on technical details.
This visual design and the abstraction of technical details make data professionals, including citizen developers, more productive when developing data pipelines and workflows than they would be with "real" source code. Even more so, maintaining and updating your own (or, even worse, someone else's) workflows and pipelines after a couple of weeks or months is a lot easier when you can visually follow the flow of data through the pipeline.
As with any platform or any project, a good start is half the battle. Hop has all the functionality required to organize your work in projects and keep a strict separation between code (project) and configuration (environments and environments files).
Setting up your Hop projects according to the best practices described below will make your projects easier to develop, maintain, and deploy.
Whether you have previous experience with Pentaho Data Integration (Kettle) and want to upgrade to Apache Hop or are completely new to it, and no matter how far along you are in adopting a DevOps way of working, a well-designed project and its corresponding environments will make your life a lot easier.
Rather than just diving in and creating workflows and pipelines, here are 5 steps you should follow to start working with Hop.
To download and install Apache Hop, use the following guide:
Start the Apache Hop GUI by running hop-gui.bat (Windows) or ./hop-gui.sh (Linux or Mac). After starting the Apache Hop GUI, you’ll be presented with the window below:
PRO TIP: Apache Hop stores your configuration in <hop>/config/ by default. Set an environment (system) variable HOP_CONFIG_FOLDER and point it to a folder on your file system to store your Apache Hop configuration outside of your installation. This will let you switch seamlessly between Hop versions and installations.
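On Linux or Mac, this could look like the following sketch (the folder path is just an example; on Windows, use `setx HOP_CONFIG_FOLDER C:\hop-config` instead):

```shell
# Keep Hop configuration outside the installation folder so it survives
# upgrades and can be shared between Hop versions. The path is an example.
export HOP_CONFIG_FOLDER="$HOME/hop-config"
mkdir -p "$HOP_CONFIG_FOLDER"
echo "Hop will read its configuration from: $HOP_CONFIG_FOLDER"
```

Add the `export` line to your shell profile (e.g. `.bashrc`) so every Hop installation on the machine picks up the same configuration.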
To create a new project click the Add a new project button. This button opens the following dialog:
Fill in all fields and click OK.
After clicking OK, the system will show the following dialogs:
The environment can be configured as in the following example:
The initial variables we set in “Step 2: Create a project and an environment” are just the start.
Your projects should be transparent and portable: using variables for all file paths, relational or NoSQL database connections, email server configuration, and so on is crucial. Hard-coded values should always raise an alarm. They may not cause problems right away, but sooner or later (probably sooner) one of those pesky hard-coded values will pop up and wreak havoc when you least expect it.
Let’s explore using variables to create a relational database connection. Relational Database Connections are a typical type of metadata item that is used throughout your project. You don't want to be reading from or writing to a different database than the one you had in mind.
To create a Relational Database Connection, go to the metadata perspective, right-click on "Relational Database Connection" and select New.
The New Relational Database Connection editor opens, with the following fields ready to be configured.
Note that for the configuration fields, you can use variables that are specified in an environment file. The connection can be configured as in the following example:
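Hop environment files are plain JSON files that list variables. A minimal sketch of such a file could look like this (the variable names and values below are hypothetical examples, not required names):

```json
{
  "variables": [
    { "name": "DB_HOSTNAME", "value": "localhost", "description": "Database host" },
    { "name": "DB_PORT",     "value": "5432",      "description": "Database port" },
    { "name": "DB_NAME",     "value": "hop_demo",  "description": "Database name" },
    { "name": "DB_USERNAME", "value": "hop",       "description": "Database user" }
  ]
}
```

In the connection editor you would then refer to these values as `${DB_HOSTNAME}`, `${DB_PORT}`, and so on. Switching from development to production then only means switching environment files; the project itself stays untouched.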
Test the connection by clicking on the Test button.
Pipelines in Hop perform the heavy data lifting: in a pipeline, you read data from one or more sources, perform a number of operations (joins, lookups, filters, and lots more), and finally, write the processed data to one or more target platforms.
To create a pipeline, hit CTRL-N, click New -> Pipeline, or use the File -> New -> Pipeline option.
Your new pipeline is created, and you’ll see the dialog below.
Now you are ready to add the first transform. Click anywhere in the pipeline canvas and you will see the following dialog:
In this case, we are going to add a Generate rows transform. This transform allows us to generate a number of empty rows (though you could also add fixed-value fields). To do so, search for 'generate' and select Generate rows.
Now it’s time to configure the Generate rows transform. Open the transform and set your values as in the following example:
Next step? Add and connect an Add sequence transform. The Add sequence transform adds a sequence to the Hop stream: an incrementing integer value with a configurable start and increment.
To connect the "Generate Rows" and "Add Sequence" transforms, we'll create a "hop", the black arrow you see below. There are multiple ways to create hops; the easiest are to drag from the first to the second transform while holding down the Shift key, or while holding down your scroll wheel (middle mouse button) instead of your primary mouse button.
To configure the sequence, open the transform and set your values as in the following example:
To preview the results of the pipeline, select the transform and use the Preview the pipeline option:
The results are shown as follows:
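Conceptually, this two-transform pipeline behaves like the following plain Python sketch (this is not Hop code; the field name, start, and increment mirror typical defaults and are assumptions):

```python
def generate_rows(count):
    """Mimic Hop's Generate rows transform: emit `count` empty rows."""
    return [{} for _ in range(count)]

def add_sequence(rows, field_name="valuename", start=1, increment=1):
    """Mimic Hop's Add sequence transform: attach an incrementing integer to each row."""
    value = start
    for row in rows:
        row[field_name] = value
        value += increment
    return rows

rows = add_sequence(generate_rows(5))
print([row["valuename"] for row in rows])  # → [1, 2, 3, 4, 5]
```

Each empty row flows from Generate rows into Add sequence, which stamps it with the next value in the sequence; that is exactly the column you see in the preview.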
Finally, run your pipeline by clicking on the Run -> Launch option:
You can verify the execution results in the Transform Metrics and Logging tabs.
You are now ready to start working with Apache Hop.
Before you can run your Apache Hop project in production, you'll need to manage it in version control.
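Since Hop pipelines, workflows, and metadata are stored as plain text files, a standard git setup works well. A minimal sketch (the project path and ignore pattern are hypothetical examples):

```shell
# Put a (hypothetical) Hop project folder under version control.
mkdir -p /tmp/my-hop-project && cd /tmp/my-hop-project
git init -q
# Keep environment-specific configuration out of the repository, in line
# with the code/configuration separation described above:
echo "*-env.json" > .gitignore
git add .gitignore
git -c user.email=dev@example.com -c user.name=dev commit -q -m "Initial Hop project"
git log --oneline
```

Because the project files are text, they diff and merge like any other source code.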
You'll want to create unit tests to make sure your workflows and pipelines not only run without errors, but also process your data exactly the way you want them to. Check our guide on unit testing for more information.
If you want to run workflows and pipelines periodically, Apache Hop integrates with any scheduler that can kick off a script, command, or container. Check our guide on running Apache Hop workflows and pipelines from Airflow for an example.
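For a plain cron-based setup, the scheduled command could look like the following sketch (paths, project, environment, and file names are assumptions; adjust them to your installation):

```shell
# Hypothetical crontab entry: run a Hop workflow every night at 02:00
# using the hop-run command-line tool shipped with Apache Hop.
0 2 * * * /opt/hop/hop-run.sh --project my-project --environment prod --file main-workflow.hpw --runconfig local
```

The same command works from an Airflow BashOperator, a CI runner, or any other scheduler that can execute a shell command.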
Not only have you built and run your first Apache Hop pipeline, but you've also done it like a pro!
Working with correctly configured projects and environments will save you a lot of headaches further down the road.
Let us know in the comments if you'd like to see more hands-on guides like this.
Want to find out more? Download our free Hop fact sheet now!