5 minutes to build unit tests in Apache Hop

Why unit testing?

Apache Hop is a data engineering and data orchestration platform that allows data engineers and data developers to visually design workflows and data pipelines to build powerful solutions.

However, building data pipelines is just the start. You want to run your workflows and pipelines reliably in production, and you want to be sure your data is processed exactly the way you intend. That last part is where unit testing comes in.

In standard software development, developers often write tests before they develop the actual functionality (TDD, or Test Driven Development). These unit tests validate an application's functionality, from very low level to high level, and are often integrated into the build process and executed automatically as part of a CI/CD pipeline.

Apache Hop is a low-code, often even zero-code data development platform that doesn't build traditional applications. However, the same software development principles and best practices still apply, and unit testing is an indispensable one. Hop unit tests let you work test-driven, but they also let you build regression tests: by adding a test for an issue you fixed to your test library, you can be sure that issue stays fixed.

Testing lets you build more reliable data engineering projects, so let's take a closer look at how unit tests work in Apache Hop.

Unit tests in Apache Hop

To test whether your pipelines process your data exactly the way you expect, Apache Hop compares the data generated by a unit test run with a known result. This known result, also called a golden data set, is a data set that was added to your project with guaranteed correct results.

When the results that are produced by your pipeline unit test exactly match the Golden Dataset, the test passes. If there are any differences, the test fails. 
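Conceptually, that comparison is strict: every field of every row produced during the test has to match the corresponding row in the golden data set. The short Python sketch below only illustrates that idea; it is not Hop's internal implementation, and the file names are placeholders.

```python
import csv

def compare_to_golden(actual_csv: str, golden_csv: str) -> list:
    """Compare a test run's output with a golden data set, row by row.

    Returns a list of human-readable differences; an empty list means the test
    passes. This is a conceptual sketch only, not how Hop implements it.
    """
    with open(actual_csv, newline="") as a, open(golden_csv, newline="") as g:
        actual_rows = list(csv.reader(a))
        golden_rows = list(csv.reader(g))

    differences = []
    if len(actual_rows) != len(golden_rows):
        differences.append(
            f"row count differs: got {len(actual_rows)}, expected {len(golden_rows)}"
        )
    for i, (actual, golden) in enumerate(zip(actual_rows, golden_rows)):
        if actual != golden:
            differences.append(f"row {i} differs: got {actual}, expected {golden}")
    return differences

# An empty result means the test would pass; any difference fails it.
print(compare_to_golden("pipeline-output.csv", "datasets/golden.csv") or "test passed")
```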

There are a number of tweaks you can apply to your pipeline unit tests that we'll cover in a later post, but let's have a look at the basics first. 

Hop unit tests can speed up development in a number of cases:

  • Pipelines without design time input: mappings, single threader, etc.
  • When input data doesn’t exist yet, is in development, or where there is no direct access to the source system.
  • When getting to the input data takes a long time (long-running queries, etc.).

Main components of a unit test

Hop uses the following concepts (metadata objects) to work with pipeline unit tests:

  • Dataset: a set of rows with a certain layout, stored in a CSV data set. When used as input we call it an input data set. When used to validate a transform’s output we call it a golden data set.
  • Unit test: The combination of input data sets, golden data sets, tweaks, and a pipeline.
  • Unit test tweak: The ability to remove or bypass a transform during a test.

You can have 0, 1, or more input or golden data sets defined in a unit test, just like you can have multiple unit tests defined per pipeline.
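To make these concepts a bit more tangible, here is a rough Python model of how they relate. The class and field names below are my own shorthand for illustration, not Hop's actual metadata classes.

```python
from dataclasses import dataclass, field

@dataclass
class DataSet:
    """A named set of rows with a fixed layout, backed by a CSV file."""
    name: str
    base_file_name: str      # CSV file inside the data sets folder
    field_names: list        # the data set's column layout

@dataclass
class UnitTestTweak:
    """Remove or bypass a single transform while the test runs."""
    transform_name: str
    action: str              # "remove" or "bypass"

@dataclass
class PipelineUnitTest:
    """A unit test combines a pipeline with input/golden data sets and tweaks."""
    name: str
    pipeline_filename: str
    input_data_sets: dict = field(default_factory=dict)   # transform name -> DataSet
    golden_data_sets: dict = field(default_factory=dict)  # transform name -> DataSet
    tweaks: list = field(default_factory=list)            # list of UnitTestTweak

# A test with no data sets or tweaks yet is perfectly valid.
test = PipelineUnitTest(name="ave-age-by-state-unit",
                        pipeline_filename="./hop/unit-testing/ave-age-by-state.hpl")
```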

The default data sets folder can be specified in the project dialog: check the 'Data Sets CSV Folder (HOP_DATASETS_FOLDER)' field. By default, the value of the ${HOP_DATASETS_FOLDER} variable is set to ${PROJECT_HOME}/datasets.
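In practice, that means a data set's CSV file is looked up in whatever folder ${HOP_DATASETS_FOLDER} resolves to. The snippet below only illustrates that resolution order; it assumes the two variables are exposed as environment variables, whereas Hop itself resolves them through the project configuration.

```python
import os

def resolve_dataset_path(base_file_name: str) -> str:
    """Resolve a data set's CSV location as documented:
    HOP_DATASETS_FOLDER if set, otherwise <PROJECT_HOME>/datasets."""
    project_home = os.environ.get("PROJECT_HOME", ".")
    datasets_folder = os.environ.get(
        "HOP_DATASETS_FOLDER", os.path.join(project_home, "datasets")
    )
    return os.path.join(datasets_folder, base_file_name)

print(resolve_dataset_path("data-set-customers.csv"))
```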

Unit test in runtime

When a pipeline is executed in Hop GUI and a unit test is selected, the following happens:

  • All transforms marked with an input data set are replaced with an Injector transform.
  • All transforms marked with a golden data set are replaced with a dummy transform.
  • All transforms marked with a "Bypass" tweak are replaced with a dummy transform.
  • All transforms marked with a "Remove" tweak are removed.

These operations take place on an in-memory copy of the pipeline, unless you specify a pipeline file location in the unit test dialog.

After execution, the output of every transform that has a golden data set assigned is validated against that golden data and logged. If the generated output for a transform doesn't exactly match the corresponding data set, the test fails and a dialog pops up when running in Hop GUI.
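Put differently, Hop builds a modified variant of your pipeline for the test run and executes that. The sketch below condenses the substitution rules above into a few lines of Python; it is a simplification for illustration, not Hop's engine code, and the intermediate transform names in the example call are made up.

```python
def prepare_test_pipeline(transforms, input_sets, golden_sets, bypassed, removed):
    """Return (transform, role) pairs for the in-memory copy used by a test run.

    Transforms with an input data set become injectors, transforms with a golden
    data set or a bypass tweak become dummies, and removed transforms disappear.
    """
    test_pipeline = []
    for name in transforms:
        if name in removed:
            continue                                  # "Remove" tweak: drop it
        if name in input_sets:
            test_pipeline.append((name, "injector"))  # feeds the input data set
        elif name in golden_sets or name in bypassed:
            test_pipeline.append((name, "dummy"))     # rows captured or passed through
        else:
            test_pipeline.append((name, "unchanged"))
    return test_pipeline

# read-customers gets an input data set, write-ave-age-by-state a golden one;
# the two middle transform names are hypothetical.
print(prepare_test_pipeline(
    ["read-customers", "extract-year", "count-by-year", "write-ave-age-by-state"],
    input_sets={"read-customers"},
    golden_sets={"write-ave-age-by-state"},
    bypassed=set(),
    removed=set(),
))
```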

Unit test and dataset options

The 'Unit Testing' category in the transform context dialog (click the transform's icon to open it) contains the available unit testing options:

options

  • Set input data set: For the active unit test, defines which data set to use instead of this transform's output.
  • Clear input data set: Remove a defined input data set from this transform for the unit test.
  • Set golden data set: The input to this transform is taken and compared to the golden data set you select.
  • Clear golden data set: Remove a defined golden data set from this transform for the unit test.
  • Create data set: Create an empty data set with the output fields of this transform.
  • Write rows to data set: Run the current pipeline and write this transform's rows to a data set.
  • Remove from test: When this unit test is run, do not include this transform.
  • Include in test: Include this transform in the unit test again (undoes 'Remove from test').
  • Bypass in test: When this unit test is run, bypass this transform (replace it with a dummy).
  • Remove bypass in test: Do not bypass this transform in the current pipeline during testing.

Creating data sets is also possible from the 'New' context menu or metadata perspective.

Step 1: Create the datasets

Consider the basic pipeline below. It reads data from a CSV file, extracts the years from the dates of birth, counts rows by year, sorts, and writes the output to a file.

We'll use this example to create a test that verifies the pipeline's output is what we expect.

pipeline

We are going to create a dataset for the read-customers transform and a dataset for the write-ave-age-by-state transform.
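To picture what those two data sets will hold: the input data set contains raw customer rows, and the golden data set contains the rows we expect the pipeline to produce from them. The snippet below writes a tiny, purely hypothetical pair of CSV files in that shape; the column names and values are illustrative assumptions, and in the tutorial itself Hop GUI creates and fills the real files for you in steps 1 and 2.

```python
import csv
import os

os.makedirs("datasets", exist_ok=True)

# Hypothetical input rows for the read-customers transform (columns are assumptions).
customers = [
    ["id", "name", "state", "date_of_birth"],
    ["1", "Alice", "CA", "1985-03-12"],
    ["2", "Bob", "NY", "1985-07-30"],
    ["3", "Carol", "CA", "1992-11-02"],
]

# Hypothetical expected output: one row per birth year with a row count.
expected = [
    ["year", "customer_count"],
    ["1985", "2"],
    ["1992", "1"],
]

for file_name, rows in [("data-set-customers.csv", customers),
                        ("data-set-expected-output.csv", expected)]:
    with open(os.path.join("datasets", file_name), "w", newline="") as f:
        csv.writer(f).writerows(rows)
```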

To create the first dataset, click on the read-customers transform icon to open the context dialog:

ut-select-action

  • Click the Create data set option. The popup dialog already shows the field layout in the bottom half of the dialog:

ut-create-dataset

Now it's time to configure the data set. Set your values as in the following example:

ut-create-dataset1

 

  • Name: The data set name (data-set-customers)
  • Set folder (or use HOP_DATASETS_FOLDER): The name of the folder where the base file is located (leave it empty and create a datasets folder in your Hop project)
  • Base file name: The name of the file to be used (data-set-customers.csv)
  • The data set fields and their column names in the file: This table maps the data set's fields to their column names in the file; it's generated automatically.
  • The Enter Mapping dialog allows you to map transform output fields to data set fields. For this example, just click Guess → OK.

ut-create-dataset-mapping

The sort order for the dataset can be modified.

ut-mapping-order

  • Do the same for the output transform you’ll want to check the data for (write-ave-age-by-state in the example). We configured the data set as in the following image:

data-set-2

Check the metadata perspective. You should now have two data sets available.

data-set-3

  • Run your pipeline after creating the datasets:

run-options-1

run-options-2

Step 2: Write data to the data sets

To write data to the newly created data sets, follow these steps:

  • Click the read-customers transform icon again, and then click Write rows to data set.
  • You'll get a popup dialog asking you to select the data set; select data-set-customers and click OK.

ut-write-data-set2


  • The Enter Mapping dialog allows you to map transform output fields to data set fields. For this example, just click Guess → OK.

ut-write-dataset3

  • The Run Options dialog will appear; click Launch to write data to the data set:

ut-write-dataset4

  • Check the data-set-customers.csv file in the datasets folder.

Repeat for the write-ave-age-by-state transform and data set.
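Outside of Hop GUI, a quick way to double-check what ended up in a data set file is to print its first few rows. This snippet assumes the default ${PROJECT_HOME}/datasets folder and is run from the project home.

```python
import csv
import itertools

# Peek at the first rows of the data set written in the previous step.
with open("datasets/data-set-customers.csv", newline="") as f:
    for row in itertools.islice(csv.reader(f), 5):
        print(row)
```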

Step 3: Create the unit test

Click the + icon (highlighted) in the unit testing toolbar to create a new unit test. Previously created unit tests will be available from the dropdown box for editing.

ut-create-unit-test

The Pipeline Unit Test dialog displays the following values by default:

ut-create-unit-test1

To configure the unit test, set your values as in the following example:

  • Name: The unit test name (ave-age-by-state-unit).
  • Type of test: Choose Unit test or Development (choose Unit test for this example).
  • The pipeline to test: The pipeline this test applies to. By default, you should see the active pipeline's filename here (./hop/unit-testing/ave-age-by-state.hpl).

You’ll get a popup dialog:

ut-create-unit-test2

Since we’re creating a unit test for the active pipeline in this example, confirming is fine.

Step 4: Set input and golden data sets

To set the input data set, click the read-customers transform icon again and select Set input data set.

ut-create-unit-test3

  • You'll get a popup dialog asking you to select the data set; select data-set-customers and click OK.

ut-set-dataset

  • The Enter Mapping dialog allows you to map transform output fields to data set fields. For this example, just click Guess → OK.

ut-set-dataset1

  • Confirm or modify the sort order for the data set.

ut-set-dataset2

Note the data set indicator on the read-customers transform.

ut-set-dataset3

  • Repeat for write-ave-age-by-state, but using the Set golden data set option.

ut-set-dataset4

Your pipeline now shows two new indicators: one for the input data set and one for the golden data set.

ut-set-dataset5

Step 5: Run the unit test

If the pipeline runs with all tests passed, you’ll receive a notification in the logs:

2022/04/21 21:16:43 - read-customers - Unit test 'ave-age-by-state-unit' passed succesfully
2022/04/21 21:16:43 - read-customers - ----------------------------------------------
2022/04/21 21:16:43 - read-customers - customers by year out - customers-by-year : Test passed succesfully against golden data set
2022/04/21 21:16:43 - read-customers - Test passed succesfully against unit test
2022/04/21 21:16:43 - read-customers - ----------------------------------------------
2022/04/21 21:16:43 - read-customers - Pipeline duration : 0.108 seconds [ 0.108 ]
2022/04/21 21:16:43 - read-customers - Execution finished on a local pipeline engine with run configuration 'local'

If changes to the pipeline cause the test to fail, a popup will be shown for the failed rows.

In the example below, the number of rows doesn't match the golden data set, causing the test to fail:

ut-run-ut

ut-run-ut1
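Because the result also ends up in the log, an automated build can act on it once you run your tests outside of Hop GUI. The sketch below is only a rough illustration: it assumes you have captured the test run's log output to a file and simply looks for the pass message shown in the excerpt above; adapt the exact wording, and the way you execute the tests, to your own setup.

```python
import sys

# Rough illustration: fail a CI step when a captured Hop log does not contain
# the pass message shown in the log excerpt above.
# Usage: python check_unit_test_log.py hop-run.log
log_text = open(sys.argv[1], encoding="utf-8").read()

if "passed succesfully against unit test" in log_text:   # message as shown in the excerpt
    print("unit test passed")
    sys.exit(0)

print("no passing unit test result found in the log")
sys.exit(1)
```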

With these basic unit tests in place, there may be situations where you want to exclude certain transforms or even entire substreams from testing in your pipelines. To accommodate this, you can bypass and remove transforms in a pipeline, which is what we'll discuss in the next post in this series. 
