5 minutes to get started with Apache Hop

Apache Hop, short for Hop Orchestration Platform, is a data orchestration and data engineering platform that facilitates all aspects of data and metadata orchestration.

Hop allows data professionals to work visually, using metadata to describe how data should be processed. Visual design enables data developers to focus on what they want to do instead of how that task needs to be done. This focus on the task at hand lets Hop developers be more productive than they would be when writing code.

As with any platform or project, a good start is half the battle. Hop has all the functionality required to organize your work in projects and to keep a strict separation between code (the project) and configuration (environments and environment files).

Setting up your Hop projects according to the best practices described below will make your projects easier to develop, maintain and deploy. Whether you have previous experience with Pentaho Data Integration (Kettle) or are new to Hop, and no matter how far you are in adopting a DevOps way of working, well-designed projects and environments will make your life a lot easier.

Rather than just diving in and creating workflows and pipelines, here are 5 steps you should follow to start working with Hop.

Step 1: Download and start Apache Hop

To download and install Apache Hop, follow these steps:

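  1. Download Hop from the download page.
  2. Check the Java documentation to download and install Java 8 or higher for your operating system. Newer Hop releases may require a more recent Java version, so check the requirements of the release you downloaded.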
  3. Unzip Hop to a folder of your choice.
  4. You now have access to the different Hop tools through their corresponding scripts.
  5. Start Hop GUI by running hop-gui.bat (Windows) or ./hop-gui.sh (Linux or Mac).

After starting the Apache Hop GUI, you’ll be presented with the window below:

[Image: hop menu options]

  • The menu bar includes options mainly for managing pipelines and workflows and the Apache Hop GUI configuration.
  • The main toolbar includes the New option to create files (pipelines and workflows) and metadata, and the options to manage projects and environments.
  • The perspectives toolbar includes switcher icons between the various perspectives.

Step 2: Create a project and an environment

PROTIP: Project and environment locations are stored in config/hop-config.json in your Hop folder by default. Set the HOP_CONFIG_FOLDER operating system environment variable to the path of a folder outside your Hop installation to keep your project information when you switch between Hop installations or versions.
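
As a minimal sketch of this tip, the snippet below prepares a process environment with HOP_CONFIG_FOLDER set before Hop is launched. The folder /opt/hop-config and the helper name hop_launch_env are hypothetical, chosen only for illustration.

```python
import os


def hop_launch_env(config_folder, base_env=None):
    """Return a copy of the environment with HOP_CONFIG_FOLDER set."""
    env = dict(os.environ if base_env is None else base_env)
    env["HOP_CONFIG_FOLDER"] = str(config_folder)
    return env


# Hypothetical folder outside the Hop installation.
env = hop_launch_env("/opt/hop-config", base_env={"PATH": "/usr/bin"})
print(env["HOP_CONFIG_FOLDER"])

# To start Hop GUI with this setting you could pass `env` to subprocess.run,
# for example: subprocess.run(["./hop-gui.sh"], cwd="<your Hop folder>", env=env)
```

Setting the variable in your shell profile or in the Windows system settings achieves the same result without a launcher script.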

To create a new project, click the Add a new project button. This button opens the following dialog:

[Image: new project]

The project can be configured as in the following example:

[Image: project]

  • Name: Choose a unique project name (5-minutes-to).
  • Home folder: This is the folder where the project is located (C:\Users\Default\Documents).
  • Configuration file (relative path): This is the path, relative to the home folder, of the project’s configuration JSON file, by default: project-config.json.
  • Parent project to inherit from: You can select a parent project to inherit metadata from (none selected).
  • Description: A description for this project (project 5 minutes to).
  • Metadata base folder: This is the folder where this project’s metadata will be stored, by default: ${PROJECT_HOME}/metadata
  • Unit test base path: The folder where this project’s unit tests will be stored, by default: ${PROJECT_HOME}
  • Data Sets CSV Folder: The folder where this project’s data files will be stored, by default: ${PROJECT_HOME}/datasets
  • Enforce execution in environment home: Give an error when trying to execute a pipeline or workflow which is not located in the project home directory or in a sub-directory, by default: checked.
  • Project variables to set: A list of variable names, values, and variable descriptions to use with this project.

Fill in all fields and click OK.

After you click OK, Hop asks you to confirm that you want to create the first environment for your project:

[Image: create environment]

If you accept by clicking OK, you will see the following dialog for creating the environment:

[Image: new environment]

The environment can be configured as in the following example:

[Image: environment]

  • Name: The environment name (env-dev).
  • Purpose: select the purpose of the environment (Development).
  • Project: note that the created project is selected by default (5-minutes-to).
  • Click the New button and select a directory for the environment file. Click Open and notice that the environment file is added to the Configuration files list.
    [Image: env-dev-config]
  • Click OK to save.
  • You can use the Edit button to add variables to your environment file.

[Image: variables dialog]

TIP: OUTPUT_DIR and INPUT_DIR are sample variables. You can set them to the input and output directories that you want to use throughout your environment.
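
To give an idea of what ends up in the environment file, here is a small sketch that builds such a file in Python. It assumes the file is a JSON document with a "variables" list of name, value and description entries; check a file written by Hop GUI for the exact layout. The directory values are hypothetical.

```python
import json


def build_environment_config(variables):
    """Build a dict in the assumed layout of a Hop environment file."""
    return {
        "variables": [
            {"name": name, "value": value, "description": description}
            for name, value, description in variables
        ]
    }


# Hypothetical directories for the env-dev environment.
config = build_environment_config([
    ("INPUT_DIR", "/data/5-minutes-to/input", "Folder with input files"),
    ("OUTPUT_DIR", "/data/5-minutes-to/output", "Folder for output files"),
])

text = json.dumps(config, indent=2)
print(text)
```

Keeping these values in the environment file rather than in the project means the same pipelines can run unchanged against a test or production environment.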

Step 3: Use variables everywhere

The initial variables we set in “Step 2: Create a project and an environment” are just the start. Your projects should be transparent and portable, using variables for all file paths, relational or NoSQL database connections, email server configuration, and so on. Hard-coded values should always raise an alarm.

Let’s explore using variables to create a relational database connection.

A Relational Database Connection is defined at the project level and can be reused across multiple transforms, or across instances of other plugin types.

To create a Relational Database Connection click on the New -> Relational Database Connection option or click on the Metadata -> Relational Database Connection option.

The system displays the New Relational Database Connection view with the following fields to be configured.

[Image: new connection]

Note that for the configuration fields you can use variables that can be specified in an environment file.

[Image: connection variables]

In this case, you can:
  • Add a file that contains the connection variables to the relational database that you are going to configure or …
  • Add the variables to the development environment that we configured in the previous step.

Example:

[Image: example]

The connection can be configured as in the following example:

[Image: configure connection]

  • Connection name: the name of the metadata object (staging-connection).
  • Server or IP address: the host name or IP address of the database server (${STAGING_SERVER}).
  • Database name: the name of the database (${STAGING_DATABASE}).
  • Port: the database port number (${STAGING_PORT}).
  • Username: specify your username (${STAGING_USERNAME}).
  • Password: specify your password (${STAGING_PASSWORD}).
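
To make the idea of variable resolution concrete, the sketch below replaces ${NAME} placeholders with values from a dictionary. It only illustrates the concept and is not Hop's own implementation. The values (localhost, 5432, staging) are hypothetical, and leaving unknown variables untouched is a choice made for this sketch.

```python
import re

# Matches ${NAME} and captures NAME.
VARIABLE_PATTERN = re.compile(r"\$\{([^}]+)\}")


def resolve(text, variables):
    """Replace each ${NAME} with its value; leave unknown names untouched."""
    def substitute(match):
        name = match.group(1)
        return variables.get(name, match.group(0))

    return VARIABLE_PATTERN.sub(substitute, text)


# Hypothetical values, as they might be defined in the env-dev environment.
staging = {
    "STAGING_SERVER": "localhost",
    "STAGING_PORT": "5432",
    "STAGING_DATABASE": "staging",
}

print(resolve("${STAGING_SERVER}:${STAGING_PORT}/${STAGING_DATABASE}", staging))
print(resolve("${STAGING_USERNAME}", staging))
```

Because the connection only stores the placeholders, pointing it at another database is a matter of switching environments, not editing the connection.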

Test the connection by clicking on the Test button.

[Image: test button]

Step 4: Create a pipeline

Pipelines in Hop perform the heavy data lifting: in a pipeline, you read data from one or more sources, perform a number of operations (joins, lookups, filters, and lots more), and finally, write the processed data to one or more target platforms.

To create a Pipeline click on the New -> Pipeline option or click on the File -> New -> Pipeline option.

Your new pipeline is created, and you’ll see the dialog below.

[Image: tap or click]

Now you are ready to add the first transform. Click anywhere in the pipeline canvas and you will see the following dialog:

[Image: search]

In this case, we are going to add a Generate rows transform. This transform allows us to generate a number of empty or identical rows. To do so, search for 'generate' and select Generate rows.

[Image: generate rows]

Now it’s time to configure the Generate rows transform. Open the transform and set your values as in the following example:

[Image: generate rows]

  • Transform name: choose a name for your transform. The name must be unique within your pipeline (generate-c-rate).
  • Limit: set the maximum number of rows you want to generate (100).
  • Name: the name of the field (c_rate).
  • Type: select the field type (Integer).
  • Value: specify a value (4.5487).
  • Click on the Preview button to see the generated field and the OK button to save.

[Image: preview data]

Next step? Add and connect an Add sequence transform. The Add sequence transform adds a sequence to the Hop stream. A sequence is an ever-changing integer value with a specific start and increment value.

[Image: sample pipeline]

To configure the sequence, open the transform and set your values as in the following example:

[Image: add sequence]

  • Transform name: The name of the transform as it appears in the pipeline workspace. This name must be unique within a single pipeline (add-seq).
  • Name of value: Name of the new sequence value that is added to the stream (seq).
  • Start at: The value to begin the sequence with (1).
  • Increment by: The amount by which the sequence increases or decreases (1).
  • Click OK to save.
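
If it helps to think of the pipeline in code terms, the two transforms behave roughly like the chained generators below. This is a conceptual sketch in plain Python, not Hop code: it uses the example values from this post and does not model Hop's data type handling for the c_rate field.

```python
from itertools import count


def generate_rows(limit, field, value):
    """Emit `limit` identical rows, like the Generate rows transform."""
    for _ in range(limit):
        yield {field: value}


def add_sequence(rows, name, start=1, increment=1):
    """Add an incrementing value to every row, like Add sequence."""
    counter = count(start, increment)
    for row in rows:
        yield {**row, name: next(counter)}


rows = list(
    add_sequence(generate_rows(100, "c_rate", 4.5487), "seq", start=1, increment=1)
)

print(rows[0])
print(rows[-1])
print(len(rows))
```

Each row flows from one step to the next, which mirrors how rows travel over the hop between the two transforms.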

To preview the result of a pipeline to see how it performs, select the transform and use the Preview the pipeline option:

[Image: preview the pipeline]

The results are shown as follows:

[Image: preview data]

TIP: As you can see, there is a long and varied list of options, but don't be alarmed: you will get to know each one as you need it to make changes to your data. Each transform has a help description, and you can also consult the official Apache Hop documentation.

Step 5: Run a pipeline

Finally, run your pipeline by clicking on the Run -> Launch option:

[Image: run options]

TIP: Note the selected value in the Pipeline run configuration field: local. Hop workflows and pipelines can run on the native Hop engine, both locally and remotely. Pipelines can also run on Apache Spark, Apache Flink, and Google Dataflow through the Apache Beam runtime configurations.

You can verify the execution results in the Transform Metrics and Logging tabs.

[Image: run pipeline]
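
Once a pipeline runs from the GUI, you may also want to start it from the command line with the hop-run script that ships with Hop. The sketch below only assembles such a command without executing it. The pipeline file name generate-c-rate.hpl is hypothetical, and the option names are our understanding of the hop-run tool, so verify them against the help output of your Hop version.

```python
import shlex


def hop_run_command(pipeline_file, project, environment,
                    run_config="local", script="./hop-run.sh"):
    """Assemble (but do not execute) a hop-run command line."""
    return [
        script,
        "--project", project,
        "--environment", environment,
        "--file", pipeline_file,
        "--runconfig", run_config,
    ]


# Hypothetical pipeline file inside the 5-minutes-to project.
command = hop_run_command("generate-c-rate.hpl", "5-minutes-to", "env-dev")
print(shlex.join(command))
```

Passing the project and environment by name keeps the command free of hard-coded paths, in line with the advice in Step 3.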
