In any data engineering project, there are lots of use cases where you'll want the same process to run multiple times, e.g. to loop over a number of folders, run once for every available month in a date range, etc.
Apache Hop offers multiple ways to loop over the same workflow or pipeline. Let's take a closer look at the different options.
This first option, looping over result rows with "Copy rows to result", is deprecated and is only available in Apache Hop for historical reasons. DO NOT use this option in new development. It does work, but it makes it a lot harder to figure out what is going on inside your pipelines or workflows.
If you have this type of loop in your project, e.g. as part of an imported Pentaho Data Integration (Kettle) project, have a look at the other ways to build loops in this post to refactor those loops.
In this scenario, you'll need at least three Apache Hop files:
This is what that looks like in a very basic example:
Create 10 rows with a counter to loop over. Copy these rows to memory.
Process each of the values in the loop individually. This example receives the loop value as a `${PRM_COUNTER}` parameter and prints it to the logs.
Both pipelines are executed from a workflow.
The second pipeline action in this workflow runs the pipeline where we process the loop values. The "Execute for every result row" option runs this pipeline for every counter value we placed in memory in the first pipeline.
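Conceptually, this deprecated pattern boils down to the sketch below (plain Python pseudocode of the execution flow, not actual Hop code; all function and field names are illustrative):

```python
# Conceptual sketch of the deprecated "copy rows to result" loop.
# The functions below are hypothetical stand-ins for Hop executing
# .hpl files; they are not a real Hop API.

def generate_rows():
    # First pipeline: create 10 counter rows and copy them to memory.
    return [{"counter": i} for i in range(1, 11)]

def process_counter(prm_counter):
    # Second pipeline: receives the loop value as ${PRM_COUNTER}
    # and writes it to the logs.
    print(f"PRM_COUNTER = {prm_counter}")

# The workflow: "Execute for every result row" re-runs the second
# pipeline once for every row the first pipeline left in memory.
for row in generate_rows():
    process_counter(row["counter"])
```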
The logs for this workflow will show the second pipeline being executed once for each of the ten counter values.
The Workflow executor and Pipeline executor offer flexible and elegant ways to run workflows and pipelines from within an existing pipeline.
The pipeline executor is a relatively simple but very powerful transform.
Specify a name for the pipeline you want to execute (or accept the pipeline name from a field), specify a run configuration, map the child pipeline's parameters to fields in your current pipeline, and done.
The pipeline executor transform will send rows to the child pipeline one by one by default. This default behavior can be changed in the "Row grouping" tab. Use a Get rows from result transform in the child pipeline to fetch the rows if you're sending more than one row to the child pipeline.
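The effect of the row grouping setting can be sketched as follows (again illustrative Python pseudocode, not a Hop API; the group size is an assumption for the example):

```python
# Conceptual sketch of the pipeline executor's row grouping.
# With group size 1 (the default) the child pipeline runs once per row;
# with a larger group size it runs once per batch, and the child pipeline
# reads the whole batch with a "Get rows from result" transform.

def run_child_pipeline(rows):
    # Hypothetical stand-in for one execution of the child pipeline.
    print(f"child pipeline received {len(rows)} row(s): {rows}")

def pipeline_executor(incoming_rows, group_size=1):
    batch = []
    for row in incoming_rows:
        batch.append(row)
        if len(batch) == group_size:
            run_child_pipeline(batch)
            batch = []
    if batch:  # flush the last, possibly smaller, group
        run_child_pipeline(batch)

# 10 rows with group size 4 -> child pipeline runs 3 times (4, 4 and 2 rows).
pipeline_executor([{"counter": i} for i in range(1, 11)], group_size=4)
```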
Looping over a list of values to send to your child pipeline is not necessarily the last action you want to perform in your main pipeline; you'll often want to continue processing the results afterwards.
There are five ways to create hops from the pipeline executor transform to later transforms in the pipeline.
This hop type returns execution results and metrics from the various child pipeline runs.
It's a good idea to at least check whether there have been any issues in any of your child pipelines with the "ExecutionResult", "ExecutionExitStatus" or "ExecutionNrErrors" fields; see the sketch after the table below.
Fieldname | Type | Description |
--- | --- | --- |
ExecutionTime | Integer | the time it took to execute the child pipeline |
ExecutionResult | Boolean | the result of the child pipeline execution (Y/N) |
ExecutionNrErrors | Integer | the number of errors encountered in the child pipeline execution |
ExecutionLinesRead | Integer | number of lines read from previous transforms (in the child pipeline) |
ExecutionLinesWritten | Integer | number of lines written to following transforms (in the child pipeline) |
ExecutionLinesInput | Integer | number of lines read from physical I/O like files or databases |
ExecutionLinesOutput | Integer | number of lines written to physical I/O like files or databases |
ExecutionLinesRejected | Integer | number of rejected lines in the child pipeline |
ExecutionLinesUpdated | Integer | number of updated lines in the child pipeline |
ExecutionLinesDeleted | Integer | number of deleted lines in the child pipeline |
ExecutionFilesRetrieved | Integer | number of retrieved files in the child pipeline |
ExecutionExitStatus | Integer | exit status of the child pipeline |
ExecutionLogText | String | the full logging text for the child pipeline’s execution |
ExecutionLogChannelId | String | log channel id for the child pipeline’s execution |
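As a minimal illustration of why these checks matter, the sketch below filters failed child runs (Python with made-up data; in Hop itself this would typically be done with a Filter rows transform):

```python
# Filter the execution results rowset for failed child pipeline runs,
# using the field names from the table above. The rows are invented
# for illustration only.

execution_results = [
    {"ExecutionResult": True,  "ExecutionNrErrors": 0, "ExecutionExitStatus": 0},
    {"ExecutionResult": False, "ExecutionNrErrors": 2, "ExecutionExitStatus": 1},
]

failed_runs = [r for r in execution_results
               if not r["ExecutionResult"] or r["ExecutionNrErrors"] > 0]

for run in failed_runs:
    print("child pipeline run failed:", run)
```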
This rowset receives data that was copied to memory by the child pipeline, e.g. with a Copy rows to result transform. Use the "Result rows" tab in the pipeline executor transform to specify which fields you expect to receive from the child pipelines.
This rowset will contain any filename that was copied to the results, e.g. with the `Add filenames to result` in the "Content" tab of the Text file input transform.
This rowset passes on the data stream as it was received by the pipeline executor transform.
This rowset mimics the input for this pipeline executor transform.
The workflow executor transform is similar to the pipeline executor transform but, as the name implies, lets you run workflows from within a pipeline.
Specify a name for the workflow you want to execute, specify a run configuration, map the child workflow's parameters to fields in your pipeline, and done.
The workflow executor transform will send rows to the workflow one by one by default. This default behavior can be changed in the "Row grouping" tab.
Again, just like with the pipeline executor transform, looping over a list of values to send to your child workflow is not necessarily the last action you want to perform in your main pipeline.
There are four ways to create hops from the workflow executor transform to later transforms in the pipeline.
This hop type returns execution results and metrics from the various child workflow runs.
It's a good idea to at least check if there have been any issues in one of your child workflow runs with the "ExecutionResult", "ExecutionExitStatus" or "ExecutionNrErrors" fields.
Fieldname | Type | Description |
--- | --- | --- |
ExecutionTime | Integer | the time it took to execute the child workflow |
ExecutionResult | Boolean | the result of the child workflow execution (Y/N) |
ExecutionNrErrors | Integer | the number of errors encountered in the child workflow execution |
ExecutionLinesRead | Integer | number of lines read from previous transforms (in the child workflow) |
ExecutionLinesWritten | Integer | number of lines written to following transforms (in the child workflow) |
ExecutionLinesInput | Integer | number of lines read from physical I/O like files or databases |
ExecutionLinesOutput | Integer | number of lines written to physical I/O like files or databases |
ExecutionLinesRejected | Integer | number of rejected lines in the child workflow |
ExecutionLinesUpdated | Integer | number of updated lines in the child workflow |
ExecutionLinesDeleted | Integer | number of deleted lines in the child workflow |
ExecutionFilesRetrieved | Integer | number of retrieved files in the child workflow |
ExecutionExitStatus | Integer | exit status of the child workflow |
ExecutionLogText | String | the full logging text for the child workflow’s execution |
ExecutionLogChannelId | String | log channel id for the child workflow’s execution |
This rowset receives data that was copied to memory by the child workflow. Use the "Result rows" tab in the workflow executor transform to specify which fields you expect to receive from the child workflows.
This rowset will contain any filename that was copied to the results by the child workflow.
In addition to the workflow and pipeline executor transforms, the Repeat and End Repeat actions let you build loops from a workflow.
The repeat action in itself is pretty simple: it requires a workflow or pipeline and the run configuration to use.
The action will continue to execute the specified workflow or pipeline until a condition is met: either a variable is set with an (optional) value, or an "End repeat" action is triggered in a child workflow.
The example below runs a pipeline that increments a `${COUNTER}` variable with each run. If the variable value exceeds 10, a variable `${END_LOOP}` is set. This variable is detected by the Repeat action, and the loop stops.
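The control flow of that example can be summarized with this sketch (illustrative Python, not Hop internals):

```python
# Conceptual sketch of the Repeat action in the example above.

variables = {"COUNTER": 0}

def child_pipeline(vars):
    # Increments ${COUNTER}; sets ${END_LOOP} once the counter exceeds 10.
    vars["COUNTER"] += 1
    if vars["COUNTER"] > 10:
        vars["END_LOOP"] = "Y"

# The Repeat action keeps executing the child until the stop variable is set.
while "END_LOOP" not in variables:
    child_pipeline(variables)

print("loop stopped after", variables["COUNTER"], "runs")
```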
The options discussed here give you all the tools you need to build and run loops in your Apache Hop projects.
Check out the samples discussed here in our GitHub repository. This post and the samples have been contributed to the Apache Hop documentation and samples project (#2559, #2337, #2338) and will be available in the 2.5 release.
If you upgraded your projects from Pentaho Data Integration (Kettle) or intend to upgrade, now's the time to refactor your deprecated "Copy rows to result" loops to any of the options discussed here.
Let us know in the comments if you need more information about loops in Apache Hop, or get in touch if you'd like to find out how we can help you to be more successful with Apache Hop.