Loops in Apache Hop

Apr 25, 2023 11:00:00 AM
Bart Maertens

Data architect and developer with over 20 years of experience in data engineering and analytics. Founder and lead of the know.bi expert team, Apache Hop co-founder and PMC member.

In any data engineering project, there are lots of use cases where you'll want the same process to run multiple times, e.g. to loop over a number of folders or to run for every available month in a date range.

Apache Hop offers multiple ways to loop over the same workflow or pipeline. Let's take a closer look at the different options.

Deprecated: copy rows to result + execute for each row

As stated in the section title, this option is deprecated and is only available in Apache Hop for historical reasons. DO NOT use this option in new development. It does work, but it's a lot harder to figure out what is going on inside your pipelines or workflows.
If you have this type of loop in your project, e.g. as part of an imported Pentaho Data Integration (Kettle) project, have a look at the other ways to build loops in this post to refactor those loops.

In this scenario, you'll need at least three Apache Hop files:

in a first pipeline, we'll build a list of values to loop over. These rows are placed in memory with a Copy rows to result transform.
in a second pipeline, we'll consume each of the values in the loop. Each value in the loop is accepted as a parameter in this pipeline.
both pipelines are executed by a workflow. The first pipeline action puts the values to loop over in memory. In the second pipeline action, we'll enable the `Execute for every result row` option and map the field name(s) we copied to memory (the "Stream column name") to the parameter(s) of the pipeline that processes the loop values.
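
The mechanics of this three-file setup can be sketched in plain Python. This is only a conceptual model, not Hop code: `build_loop_values`, `process_one_value` and `run_workflow` are hypothetical names standing in for the first pipeline, the second pipeline and the workflow. The `counter` field name is an assumption; `PRM_COUNTER` is the parameter used in the example in this post.

```python
from typing import Dict, List

Row = Dict[str, int]


def build_loop_values(n: int = 10) -> List[Row]:
    """Stand-in for the first pipeline: generate rows, add a counter,
    and copy the rows to the result (kept in memory)."""
    return [{"counter": i} for i in range(1, n + 1)]


def process_one_value(parameters: Dict[str, str]) -> str:
    """Stand-in for the second pipeline: receives one loop value as a parameter."""
    return f"the value for PRM_COUNTER is now {parameters['PRM_COUNTER']}"


def run_workflow() -> List[str]:
    """Stand-in for the workflow with 'Execute for every result row' enabled."""
    result_rows = build_loop_values()  # first pipeline action
    log_lines = []
    for row in result_rows:  # one run of the second pipeline per result row
        # Map the stream column ("counter", an assumed name) to the parameter.
        parameters = {"PRM_COUNTER": str(row["counter"])}
        log_lines.append(process_one_value(parameters))
    return log_lines


if __name__ == "__main__":
    for line in run_workflow():
        print(line)
```

Note that nothing in `run_workflow` shows which values are in `result_rows` unless you log them yourself, which mirrors the transparency problem of this approach.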

This is what that looks like in a very basic example:

Create 10 rows with a counter to loop over. Copy these rows to memory.

Apache Hop loops - copy rows to result

Process each of the values in the loop individually. This example receives the loop value as a `${PRM_COUNTER}` parameter and prints it to the logs.

loops-copy-rows-to-result-process-one-row

Both pipelines are executed from a workflow.

Apache Hop loops - copy rows to result in workflow

The second pipeline action in this workflow runs the pipeline where we process the loop values. The "Execute for every result row" option runs this pipeline for every counter value we placed in memory in the first pipeline.

Apache Hop loops - copy rows to result parameters

The logs for this workflow will look similar to the output below:

2023/04/24 11:25:07 - Hop - Starting workflow

2023/04/24 11:25:07 - loops-process-rows-from-memory - Start of workflow execution

2023/04/24 11:25:07 - loops-process-rows-from-memory - Starting action [loops-copy-rows-to-result.hpl]

2023/04/24 11:25:07 - loops-copy-rows-to-result.hpl - Using run configuration [local]

2023/04/24 11:25:07 - loops-copy-rows-to-result - Executing this pipeline using the Local Pipeline Engine with run configuration 'local'

2023/04/24 11:25:07 - loops-copy-rows-to-result - Execution started for pipeline [loops-copy-rows-to-result]

2023/04/24 11:25:07 - generate 10 rows.0 - Finished processing (I=0, O=0, R=0, W=10, U=0, E=0)

2023/04/24 11:25:07 - add counter.0 - Finished processing (I=0, O=0, R=10, W=10, U=0, E=0)

2023/04/24 11:25:07 - Copy rows to result.0 - Finished processing (I=0, O=0, R=10, W=10, U=0, E=0)

2023/04/24 11:25:07 - loops-copy-rows-to-result - Pipeline duration : 0.052 seconds [ 0.052" ]

2023/04/24 11:25:07 - loops-process-rows-from-memory - Starting action [loops-copy-rows-to-result-log-counter.hpl]

...

2023/04/24 11:25:07 - loops-copy-rows-to-result-log-counter - Executing this pipeline using the Local Pipeline Engine with run configuration 'local'

2023/04/24 11:25:07 - loops-copy-rows-to-result-log-counter - Execution started for pipeline [loops-copy-rows-to-result-log-counter]

2023/04/24 11:25:08 - generate 1 row.0 - Finished processing (I=0, O=0, R=0, W=1, U=0, E=0)

2023/04/24 11:25:08 - log ${PRM_COUNTER}.0 -

2023/04/24 11:25:08 - log ${PRM_COUNTER}.0 - ------------> Linenr 1------------------------------

2023/04/24 11:25:08 - log ${PRM_COUNTER}.0 - #################################

2023/04/24 11:25:08 - log ${PRM_COUNTER}.0 - the vaule for PRM_COUNTER is now 10

2023/04/24 11:25:08 - log ${PRM_COUNTER}.0 - #################################

2023/04/24 11:25:08 - log ${PRM_COUNTER}.0 -

2023/04/24 11:25:08 - log ${PRM_COUNTER}.0 - ====================

2023/04/24 11:25:08 - log ${PRM_COUNTER}.0 - Finished processing (I=0, O=0, R=1, W=1, U=0, E=0)

2023/04/24 11:25:08 - loops-copy-rows-to-result-log-counter - Pipeline duration : 0.035 seconds [ 0.035" ]

2023/04/24 11:25:08 - loops-process-rows-from-memory - Finished action [loops-copy-rows-to-result-log-counter.hpl] (result=[true])

2023/04/24 11:25:08 - loops-process-rows-from-memory - Finished action [loops-copy-rows-to-result.hpl] (result=[true])

2023/04/24 11:25:08 - loops-process-rows-from-memory - Workflow execution finished

2023/04/24 11:25:08 - Hop - Workflow execution has ended

2023/04/24 11:25:08 - loops-process-rows-from-memory - Workflow duration : 0.65 seconds [ 0.650" ]

2023/04/24 11:25:08 - loops-copy-rows-to-result-log-counter - Execution finished on a local pipeline engine with run configuration 'local'

As you may have noticed, this way of looping is not very transparent. There is no way to see or pick the stream values that are passed to the second pipeline, so you'll need to write information to the logs if you want a clear view of what is happening in your loop.
All of this combined makes this type of loop hard to maintain and debug.

Pipeline and Workflow executor

The Workflow executor and Pipeline executor offer flexible and elegant ways to run workflows and pipelines from within an existing pipeline.

Pipeline Executor

The pipeline executor is a relatively simple but very powerful transform.

Specify a name for the pipeline you want to execute (or accept the pipeline name from a field), specify a run configuration, map the child pipeline's parameters to fields in your current pipeline, and done.

The pipeline executor transform will send rows to the child pipeline one by one by default. This default behavior can be changed in the "Row grouping" tab. Use a Get rows from result transform in the child pipeline to fetch the rows if you're sending more than one row to the child pipeline.
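
As a rough mental model of row grouping, the sketch below splits the incoming rows into batches, where each batch stands for one run of the child pipeline. `group_rows` and `group_size` are hypothetical names chosen for illustration; they do not correspond to the actual options in the "Row grouping" tab.

```python
from typing import Dict, Iterator, List

Row = Dict[str, int]


def group_rows(rows: List[Row], group_size: int = 1) -> Iterator[List[Row]]:
    """Yield consecutive batches of rows; each batch models one child pipeline run."""
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    for start in range(0, len(rows), group_size):
        yield rows[start:start + group_size]


rows = [{"counter": i} for i in range(1, 8)]  # 7 illustrative input rows

# Default behavior: one row per child run, so 7 runs.
runs_one_by_one = list(group_rows(rows))

# Grouped: up to 3 rows per child run, so 3 runs (3 + 3 + 1 rows).
runs_grouped = list(group_rows(rows, group_size=3))

if __name__ == "__main__":
    print(len(runs_one_by_one), [len(batch) for batch in runs_grouped])
```

With a group size of 1, a child pipeline can take its single row as parameters. With larger groups, the child has more than one row to read, which is where Get rows from result comes in.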

Looping over a list of values to send to your child pipeline is not necessarily the last action you want to perform in your main pipeline.

There are 5 possibilities to create hops from the pipeline executor transform to later transforms in the pipeline.

Apache Hop loops - pipeline executor configuration

Apache Hop loops - pipeline executor

Execution results

This hop type returns execution results and metrics from the various child pipeline runs.

It's a good idea to at least check if there have been any issues in one of your child pipelines with the "ExecutionResult", "ExecutionExitStatus" or "ExecutionNrErrors" fields.

Fieldname	Type	Description
ExecutionTime	Integer	the time it took to execute the child pipeline
ExecutionResult	Boolean	the result of the child pipeline execution (Y/N)
ExecutionNrErrors	Integer	the number of errors encountered in the child pipeline execution
ExecutionLinesRead	Integer	number of lines read from previous transforms (in the child pipeline)
ExecutionLinesWritten	Integer	number of lines written to following transforms (in the child pipeline)
ExecutionLinesInput	Integer	number of lines read from physical I/O like files or databases
ExecutionLinesOutput	Integer	number of lines written to physical I/O like files or databases
ExecutionLinesRejected	Integer	number of rejected lines in the child pipeline
ExecutionLinesUpdated	Integer	number of updated lines in the child pipeline
ExecutionLinesDeleted	Integer	number of deleted lines in the child pipeline
ExecutionFilesRetrieved	Integer	number of retrieved files in the child pipeline
ExecutionExitStatus	Integer	exit status of the child pipeline
ExecutionLogText	String	the full logging text for the child pipeline’s execution
ExecutionLogChannelId	String	log channel id for the child pipeline’s execution
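
The check suggested above could look like the sketch below if the execution result rows were exported and inspected in Python. The field names come from the table; `failed_runs` is a hypothetical helper, the sample values are invented for illustration, and treating a non-zero exit status as a failure is an assumption.

```python
from typing import Any, Dict, List

ExecutionRow = Dict[str, Any]


def failed_runs(execution_rows: List[ExecutionRow]) -> List[ExecutionRow]:
    """Return the rows that describe a child run with a problem."""
    return [
        row
        for row in execution_rows
        if not row["ExecutionResult"]        # child run reported failure
        or row["ExecutionNrErrors"] > 0      # at least one error was counted
        or row["ExecutionExitStatus"] != 0   # assumption: 0 means success
    ]


# Invented sample values, one row per child pipeline run.
sample = [
    {"ExecutionResult": True, "ExecutionNrErrors": 0, "ExecutionExitStatus": 0},
    {"ExecutionResult": False, "ExecutionNrErrors": 2, "ExecutionExitStatus": 1},
    {"ExecutionResult": True, "ExecutionNrErrors": 0, "ExecutionExitStatus": 0},
]

problems = failed_runs(sample)

if __name__ == "__main__":
    print(f"{len(problems)} of {len(sample)} child runs had issues")
```

Inside Hop itself, the same idea would be a filter on these fields in the transforms that follow the execution results hop.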

Result rows after execution

This rowset receives data that was copied to memory by the child pipeline, e.g. with a Copy rows to result transform. Use the "Result rows" tab in the pipeline executor transform to specify which fields you expect to receive from the child pipelines.

Result file names after execution

This rowset will contain any filename that was copied to the results, e.g. with the `Add filenames to result` option in the "Content" tab of the Text file input transform.

Copy of the executor transform's input data

This rowset passes on the data stream as it was received by the pipeline executor transform.

Main output of the transform

This rowset mimics the input for this pipeline executor transform.

Workflow Executor

The workflow executor transform is similar to the pipeline executor transform but, as the name implies, lets you run workflows from within a pipeline.

Specify a name for the workflow you want to execute, specify a run configuration, map the child workflow's parameters to fields in your pipeline, and done.

The workflow executor transform will send rows to the workflow one by one by default. This default behavior can be changed in the "Row grouping" tab.

Again, similar to the pipeline executor transform, looping over a list of values to send to your child workflow is not necessarily the last action you want to perform in your main pipeline.

There are 4 possibilities to create hops from the workflow executor transform to later transforms in the pipeline.

Execution results

This hop type returns execution results and metrics from the various child workflow runs.

It's a good idea to at least check if there have been any issues in one of your child workflow runs with the "ExecutionResult", "ExecutionExitStatus" or "ExecutionNrErrors" fields.

Fieldname	Type	Description
ExecutionTime	Integer	the time it took to execute the child workflow
ExecutionResult	Boolean	the result of the child workflow execution (Y/N)
ExecutionNrErrors	Integer	the number of errors encountered in the child workflow execution
ExecutionLinesRead	Integer	number of lines read from previous transforms (in the child workflow)
ExecutionLinesWritten	Integer	number of lines written to following transforms (in the child workflow)
ExecutionLinesInput	Integer	number of lines read from physical I/O like files or databases
ExecutionLinesOutput	Integer	number of lines written to physical I/O like files or databases
ExecutionLinesRejected	Integer	number of rejected lines in the child workflow
ExecutionLinesUpdated	Integer	number of updated lines in the child workflow
ExecutionLinesDeleted	Integer	number of deleted lines in the child workflow
ExecutionFilesRetrieved	Integer	number of retrieved files in the child workflow
ExecutionExitStatus	Integer	exit status of the child workflow
ExecutionLogText	String	the full logging text for the child workflow’s execution
ExecutionLogChannelId	String	log channel id for the child workflow’s execution

Result rows after execution

This rowset receives data that was copied to memory by the child workflow. Use the "Result rows" tab in the workflow executor transform to specify which fields you expect to receive from the child workflows.

Result file names after execution

This rowset will contain any filename that was copied to the results by the child workflow.

Repeat and End Repeat actions

In addition to the workflow and pipeline executor transforms, the Repeat and End Repeat actions let you build loops from a workflow.

The repeat action in itself is pretty simple: it requires a workflow or pipeline and the run configuration to use.

The action will continue to execute the specified workflow or pipeline until a condition is met: either a variable is set with an (optional) value, or an "End repeat" action is triggered in a child workflow.

The example below runs a pipeline that increments a "${COUNTER}" variable with each run. If the variable's value exceeds 10, a variable "${END_LOOP}" is set. This variable is detected by the Repeat action, and the loop stops.

Apache Hop loops - repeat action

Apache Hop loops - repeat pipeline
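
The control flow of this example can be approximated in a few lines of Python. This is a sketch under assumptions, not Hop's implementation: `increment_counter` and `repeat` are hypothetical stand-ins for the child pipeline and the Repeat action, the starting value of 0 and the value `"true"` for `END_LOOP` are assumptions, and `max_runs` is a safety guard added only for this sketch.

```python
from typing import Dict


def increment_counter(variables: Dict[str, str]) -> None:
    """Stand-in for the child pipeline: bump COUNTER and set END_LOOP
    once the value exceeds 10."""
    counter = int(variables.get("COUNTER", "0")) + 1
    variables["COUNTER"] = str(counter)
    if counter > 10:
        variables["END_LOOP"] = "true"


def repeat(variables: Dict[str, str],
           stop_variable: str = "END_LOOP",
           max_runs: int = 1000) -> int:
    """Stand-in for the Repeat action: rerun the child until the stop variable is set."""
    runs = 0
    while stop_variable not in variables:
        if runs >= max_runs:
            raise RuntimeError("stop variable was never set")
        increment_counter(variables)
        runs += 1
    return runs


variables: Dict[str, str] = {"COUNTER": "0"}
runs = repeat(variables)

if __name__ == "__main__":
    print(runs, variables)
```

The point to take away is that the child is responsible for setting the stop condition; the repeating side only checks for it after each run.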

Conclusion

The options discussed here give you all the tools you need to build and run loops in your Apache Hop projects.

Check the samples discussed here in our GitHub repository. This post and the samples discussed here have been contributed to the Apache Hop documentation and samples project (#2559, #2337, #2338) and will be available in the 2.5 release.

If you upgraded your projects from Pentaho Data Integration (Kettle) or intend to upgrade, now's the time to refactor your deprecated "Copy rows to result" loops to any of the options discussed here.

Let us know in the comments if you need more information about loops in Apache Hop, or get in touch if you'd like to find out how we can help you to be more successful with Apache Hop.

