Project Hop - Exploring the future of data integration

Project Hop was announced at KCM19 back in November 2019. The first preview release has been available since April 10th. We’ve been posting about it on our social media accounts, but what exactly is Project Hop? Let’s explore the project in a bit more detail. In this post, we'll have a look at what Project Hop is, why the project was started and why we want to go all in on it.

What is Project Hop?

As the project’s tagline says, Project Hop intends to explore the future of data integration. We take that quite literally. We’ve seen massive changes in the data processing landscape over the last decade (the rise and fall of the Hadoop ecosystem, just to name one). All of these changes need to be supported and integrated into your data engineering and data processing systems.

Apart from these purely technical challenges, the data processing life cycle has become a software life cycle. Robust and reliable data processing requires testing, a fast and flexible deployment process and a strict separation between data and metadata.

Project Hop wants to be your go-to tool for data processing. Our main goals are:

  • Open Source: this is to state the obvious. The only way to build an innovative software platform in this day and age is by relying on open standards and open source software, so open source is the only viable option for us.
  • Visual design: data processes need to be easy to design, easy to test, easy to run and easy to deploy. We believe that visually designing data processes greatly increases developer productivity. Although visually designed, all of our work items can be managed like any other piece of software: version control, testing, CI/CD, documentation are all first class citizens in the Hop platform. Let’s put the prejudice to rest: visually designed code is code and can be handled just like any other type of code.
  • Metadata driven: a strict separation of data and metadata allows you to design data processes regardless of the data itself.
  • Runtime agnostic: design once, run anywhere. We’re all working to solve data problems, not Spark, Flink, AirFlow or any other engine-specific problems. We want you to be able to design a data process and run it on any engine you want.
  • Pluggable: all of the components in the Hop platform should be pluggable. As a developer, this makes it easy to add new functionality. As a system administrator, it gives you full control over the functionality you want to allow in your systems. As a data designer, it gives you full control to pick and choose the functionality you want to use.

Why Project Hop?

As longtime Kettle (Pentaho Data Integration, or PDI) users, there’s a lot we’ve been able to do towards these goals with the Kettle platform. However, Kettle has a history of almost two decades, and a large installed customer base that requires stability and backward compatibility.

To build a platform that is ready for the next two decades, we need to look forward, not backwards. After long discussions with Matt Casters, the initial developer and former lead architect for the Kettle platform, we decided to leave the past behind and part ways with Kettle. 

We used Kettle as our starting point, but made some drastic changes:

  • Naming conventions: these are now more in line with modern technologies. Transformations are now pipelines, jobs are now workflows, and so on. These are not just new names; we did major code refactoring to reflect these changes in the Hop code architecture.
  • Code cleanup and refactoring: we removed a lot of outdated code, updated a lot of dependencies and made major changes to the overall software architecture to support pluggable runtimes and make every item of the Hop platform pluggable.
  • Documentation: documentation is a first class citizen in Hop. We treat documentation as code (written in asciidoc): it is included in version control, and documentation bugs can be reported and fixed like any other bug.
  • UI rewrite: Kettle’s UI (Spoon) didn’t meet our architectural requirements. To have a pluggable, extensible environment, we rewrote the entire UI and believe we have created a much more user friendly and modern user interface. Hop GUI now starts in seconds, not minutes.

Last but not least, we strongly believe open source software should be in the public domain. We still have some work to do, but intend to start the incubation process at the Apache Software Foundation and donate the Hop source code to the ASF sooner rather than later.
With Apache Hop, we are convinced that a lot of organizations can benefit from having an easy to use and powerful meta-driven platform.

Hop is a journey, not a destination.

Hop will never be finished. We’re getting closer to a first release, but there will always be more work to do.

Next on our roadmap are:

  • Environment support will allow you to dynamically switch between projects and environments. As you switch from one environment to another, your entire configuration, last used files, etc. will be updated.
  • Pluggable Runtime Support: you’ll be able to design Hop workflows and pipelines in the Hop Gui, and deploy to any supported runtime. The first engine we’ll support is Apache Beam, which will allow you to run your pipelines on e.g. Spark, Flink or Google Dataflow.
  • With integrated testing you’ll be able to create regression tests on your workflow and pipeline code, and to define a “golden” data set to test against.

We have many other long term plans in mind to further develop Hop that you’ll hear about in due course.

We invite you to join us on this journey and hope you’ll enjoy seeing the project evolve and grow in terms of functionality and community.

Where does that leave me as a PDI/Kettle user?

Hop is a new platform, with a roadmap and future of its own. With the drastic architectural changes we made, we had to make the unavoidable choice of breaking compatibility.

This doesn’t mean we’ll ignore the existing Kettle community and user base who want to join us on our journey. We’ll start working on a migration tool that will allow you to import your existing Kettle/PDI code into Hop.

Our team has been engaged in dozens of international Pentaho projects since 2012. Kettle/PDI has been the common thread in all of these projects, and made up the lion's share of the work in almost all of them. Kettle is one of our most indispensable tools, but needed some solid refurbishing after 20 years of development. Project Hop is a lot more than a Kettle face lift. We want to use the solid Kettle foundations to build a new data integration platform that is ready to work with modern data platforms and is ready for the future.

Although Hans and Bart are actively involved in Project Hop's development, we are and will remain a system integrator and team of consultants. We are convinced Project Hop will allow us to serve our customers better and faster, which is our ultimate goal as a solution provider.

Our existing Pentaho and Kettle services will continue for the foreseeable future. We consider a full switch to Project Hop in due time to be a logical and natural evolution.

We hope you're as excited as we are about Project Hop and we're looking forward to building successful data engineering solutions with you. 
