Incubator - the Apache Way and Community
Just before the end of 2021, Apache Hop graduated from...
Before going into the details of how you should upgrade to Apache Hop, let's have a look at a couple of reasons why upgrading to Apache Hop is a good idea.
Since Apache Hop started as an Incubating project at the Apache Software Foundation, the project has built release after release. In a year and a half, there have been the 0.60, 0.70, 0.99, 1.0, 1.1, 1.2 and 2.0 releases. The last three releases came after graduation as an Apache Top Level Project.
Not only is Apache Hop now a lot smaller and faster than Pentaho Data Integration, the platform also contains a lot more functionality. All of this functionality is built as plugins that can be added by unzipping a plugin to your Hop folder or can be removed by deleting a plugin folder from your Hop installation.
Some of the highlights of the new functionality that Apache Hop brings are
As an Apache project, the entire development on and roadmap of Apache Hop are truly open source.
Even though Pentaho Data Integration is open source by definition, the roadmap and development are in fact driven by a single company.
The source code for all Apache projects is released by the Apache Software Foundation under the Apache License 2.0, which means it stays free and open source.
The roadmap and development for Apache projects are driven by a community with a democratic decision process: all discussions happen on publicly available mailing lists, and all decisions are made through voting rounds. Numerous companies have contributed to Apache Hop over the past couple of years, without any of them "owning" Apache Hop.
Apache Hop is designed and developed by an active community. This global community of Apache Hop developers, contributors, testers and users is growing fast.
There is an increasing number of active user groups in Brazil, Spain, Italy, Japan, Germany, the Benelux and other places. The Hop community tests, discusses and criticizes new and existing functionality in Hop, pushing the development continuously forward.
Hop as a project has a very low bar of entry, because the Hop community considers almost everything a contribution. Source code is the obvious one, but documentation, bug reports, community building, and even discussions and (constructive) criticism count as contributions too.
Now, how do you get from PDI to Apache Hop? Converting your existing Pentaho Data Integration projects to Apache Hop is as simple as selecting 'File -> Import from PDI/Kettle' in Hop Gui.
However, since Hop is an entirely new platform, things work a little differently under the hood. Metadata is managed a lot more strictly in Hop than it is in PDI, and the concepts of projects and environments are new, as are unit tests and more.
Installing Hop is easy. Check our post on how to get started with Apache Hop according to best practices to make sure you hit the ground running before we move on to the actual upgrade.
Let's walk through 7 key points that will make your switch from Pentaho Data Integration to Apache Hop not just a migration but a true upgrade for your project.
Converting PDI/Kettle jobs and transformations to Apache Hop workflows and pipelines is trivial. Upgrading an entire project that has evolved and grown over many years, however, is not without risk. Over time, your jobs and transformations may have built up some technical debt, or may just not be as clean as you expect them to be.
A pre-upgrade audit detects as many unpleasant surprises as possible, as early as possible, ideally before you even start the upgrade process.
Apache Hop manages metadata a lot more strictly than Kettle/PDI, so that's where your focus should be. In general, the more you know about a project before you start the upgrade, the better.
A number of useful checks:
Building a library of checks and analyzing the results of those checks will help you detect any possible issues as soon as possible.
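As an illustration of what one such check could look like, here is a minimal Python sketch that looks for database connections that share a name but are defined differently across files. It assumes that PDI `.ktr` and `.kjb` files are XML and store each connection as a `<connection>` element with child tags such as `<name>`, `<server>` and `<type>`. The function names and the list of compared fields are our own choices for this example, so adapt them to what your project actually contains.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict
from pathlib import Path

# Assumption: these child tags describe a connection well enough to compare definitions.
FIELDS = ("server", "type", "database", "port", "username")


def connection_definitions(xml_text):
    """Yield (name, settings) for every connection definition in one file."""
    root = ET.fromstring(xml_text)
    for conn in root.iter("connection"):
        # Steps may reference a connection as <connection>name</connection>;
        # those have no <name> child and are skipped here.
        name = (conn.findtext("name") or "").strip()
        if not name:
            continue
        settings = tuple((f, (conn.findtext(f) or "").strip()) for f in FIELDS)
        yield name, settings


def find_conflicting_connections(files):
    """Return {name: {settings: [files]}} for names with more than one definition."""
    seen = defaultdict(lambda: defaultdict(list))
    for label, xml_text in files.items():
        for name, settings in connection_definitions(xml_text):
            seen[name][settings].append(label)
    return {name: dict(defs) for name, defs in seen.items() if len(defs) > 1}


def load_project(folder):
    """Read all .ktr and .kjb files below a folder into {path: xml_text}."""
    paths = [
        p for p in Path(folder).rglob("*")
        if p.is_file() and p.suffix in (".ktr", ".kjb")
    ]
    return {str(p): p.read_text(encoding="utf-8") for p in paths}
```

Calling `find_conflicting_connections(load_project("path/to/your/pdi/project"))` returns an empty dictionary when every connection name maps to a single definition. Anything else is worth cleaning up before the upgrade, since each connection ends up as one metadata item in Hop.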
If you need help in building these checks, feel free to reach out. We have the tools to check your projects inside out and are happy to help.
Make sure you and all the project stakeholders are on the same page about the scope of the upgrade. Architectural changes and code refactoring can be included, but they make the upgrade more complex and harder to test.
Apache Hop and PDI/Kettle are two completely different platforms. One aspect of an upgrade is the code conversion from jobs and transformations to workflows and pipelines, but just about everything else in a project's organization is different in Hop than it was in PDI/Kettle.
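For the code conversion part, a quick completeness check after the import can be sketched in a few lines of Python. Hop pipelines use the `.hpl` extension and workflows use `.hwf`. The sketch assumes that the import keeps relative paths and base file names unchanged, which you should verify for your own project. The function names are made up for this example.

```python
from pathlib import PurePosixPath

# PDI transformations (.ktr) become Hop pipelines (.hpl),
# PDI jobs (.kjb) become Hop workflows (.hwf).
EXTENSION_MAP = {".ktr": ".hpl", ".kjb": ".hwf"}


def expected_hop_path(pdi_path):
    """Return the relative path we expect in the Hop project, or None for other files."""
    path = PurePosixPath(pdi_path)
    new_suffix = EXTENSION_MAP.get(path.suffix.lower())
    if new_suffix is None:
        return None
    return str(path.with_suffix(new_suffix))


def missing_after_import(pdi_files, hop_files):
    """List PDI files that have no matching pipeline or workflow in the Hop project."""
    hop = set(hop_files)
    missing = []
    for pdi_file in pdi_files:
        target = expected_hop_path(pdi_file)
        if target is not None and target not in hop:
            missing.append(pdi_file)
    return sorted(missing)
```

Both arguments are lists of paths relative to the project root, written with forward slashes. An empty result only tells you that every job and transformation has a counterpart. It says nothing yet about whether that counterpart behaves the same.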
A couple of quick guidelines for a successful upgrade are:
Breaking the upgrade of a large project into smaller chunks makes the entire process a lot more manageable.
Identify which parts of your project can be upgraded separately, what the dependencies are between those modules or sub-projects, and take them through the upgrade process one at a time.
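Once the dependencies are written down, deriving an upgrade order is a standard topological sort. The sketch below uses `graphlib` from the Python standard library (Python 3.9 or later). The module names in the usage note are hypothetical and only serve as an illustration.

```python
from graphlib import TopologicalSorter


def upgrade_order(depends_on):
    """Return modules in an order where each one comes after the modules it depends on.

    depends_on maps a module name to the set of modules it depends on.
    A circular dependency raises graphlib.CycleError (a ValueError subclass).
    """
    return list(TopologicalSorter(depends_on).static_order())
```

For example, `upgrade_order({"reporting": {"staging"}, "staging": {"common"}, "common": set()})` puts `common` first and `reporting` last. A `CycleError` is useful information in itself: it points to modules that cannot be upgraded separately and need to go through the process together.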
Some of the advantages of upgrading your project in smaller parts:
With the projects and modules to upgrade in place, build an upgrade plan and timeline, and enforce a code freeze on each module while it is being upgraded.
Sometimes changes to a module that is being upgraded will be unavoidable, for example to backport a hotfix that needs to be applied to production. For these scenarios, it helps to create a procedure to apply these hotfixes to both the production environment (that is still PDI) and the Hop project that is being upgraded.
Apache Hop is no different from any other software platform: a lot of the success of an Apache Hop project depends on the implementation.
Applying the Hop best practices puts you on a fast track to operational excellence.
Apache Hop gives you a lot of freedom to implement projects the way you want. The Apache Hop best practices are advice based on real-life experiences and are intended to improve the overall quality of Hop projects.
The best practices are available on the Apache Hop website. We'll cover these in more detail in later posts.
With your Apache Hop and PDI/Kettle infrastructure set up in parallel, testing is quite straightforward.
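One way to use the parallel setup is to run the same load on both platforms, export the results and compare them. The Python sketch below compares two CSV exports while ignoring row order and column order. It assumes both exports have a header row and fit in memory. It is not part of Hop or PDI, and the function names are our own.

```python
import csv
import io
from collections import Counter


def row_counts(csv_text):
    """Count identical rows in a CSV export, independent of row and column order."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(tuple(sorted(row.items())) for row in reader)


def compare_outputs(pdi_csv, hop_csv):
    """Return the rows that appear on one side only, with how often they appear."""
    pdi, hop = row_counts(pdi_csv), row_counts(hop_csv)
    return {"only_in_pdi": dict(pdi - hop), "only_in_hop": dict(hop - pdi)}


def outputs_match(pdi_csv, hop_csv):
    """True when both exports contain exactly the same rows."""
    diff = compare_outputs(pdi_csv, hop_csv)
    return not diff["only_in_pdi"] and not diff["only_in_hop"]
```

Values are compared as text, so formatting differences between the two platforms, such as date or number formats, show up as mismatches. That is usually what you want to know during an upgrade.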
Use Hop's integrated unit testing to its full potential to automate testing where possible.
Unit testing in Apache Hop is a fully integrated core functionality. Make testing a first-class citizen in your upgrade projects and beyond:
Apache Hop, even though it is still relatively young, is a large and quickly growing platform. A lot has changed since the project forked from PDI/Kettle, so upgrading your skills and knowledge is as important as upgrading the technology.
Getting your own and your team members' knowledge up to date, and keeping it there, is crucial for your project's success.
Apache Hop considers documentation a crucial part of the platform. The documentation at the Apache Hop website is available for all Hop versions since 1.0. The docs contain a lot of detailed information and a growing number of how-to guides.
Training, again, is only half of the work. A team culture of sharing knowledge and constantly coaching each other on best practices and new ways of working with Apache Hop, just like with any other component in your data architecture, is crucial and will benefit the project and everyone involved.
Apache Hop 2.0.0 is available for download at the Apache Hop download page.