Read our overview of the Keynotes
Read our overview of the talks in the Business room
Dan explained how serverless PDI allows you to spend time on the solution rather than on getting the PDI server and infrastructure up and running.
Although virtualization already takes some of the infrastructure management pain away, there's still quite a bit of overhead involved, whereas no infrastructure management is needed when running in the cloud.
Dan presented the use case of a small company that has no infrastructure and has all its applications outsourced.
He used 3 services for this use case:
Dan showed a demo where a configuration asks PDI for data, launches and runs PDI, and sends information to Lambda.
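To make the pattern concrete, here is a minimal sketch (not Dan's actual code) of what a Lambda handler that launches a PDI job on demand could look like. The PDI install path, job name and parameters are all assumptions for illustration.

```python
import subprocess

# Hypothetical sketch of the serverless pattern: an AWS Lambda handler that
# builds a PDI (Kitchen) command line and runs a job on demand.
# PDI_HOME and the event shape are assumptions, not from the talk.

PDI_HOME = "/opt/pdi"  # assumed location of the PDI install in the deployment package

def build_kitchen_command(job_path, params):
    """Build the kitchen.sh command line for a PDI job with named parameters."""
    cmd = [f"{PDI_HOME}/kitchen.sh", f"-file={job_path}", "-level=Basic"]
    for name, value in params.items():
        cmd.append(f"-param:{name}={value}")
    return cmd

def handler(event, context):
    """Lambda entry point: run the PDI job requested in the event payload."""
    cmd = build_kitchen_command(event["job"], event.get("params", {}))
    result = subprocess.run(cmd, capture_output=True, text=True)
    return {"returncode": result.returncode, "log": result.stdout[-1000:]}
```

The point of the pattern is that nothing runs (or costs money) until an event arrives; the infrastructure question disappears entirely.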
With the rise of IoT, data is produced on devices that are too small to do on-board processing, so a new approach is required.
This approach consists of 4 steps of data processing:
As data moves to higher levels, more knowledge can be gained from it.
Kleyson took the stage to talk about the CTools NewMapComponent, which provides out-of-the-box API functionality and extensions.
The component works with Google Maps and OpenLayers, and has at least the following capabilities:
Francesco came to present the new edition of the Pentaho Reporting book.
Although the goal of his talk was not to sell books, the crowd was given a discount coupon code that works perfectly well for selling and buying books ;-)
Pentaho Reporting hasn't received a lot of love from Pentaho in recent releases, so there were no big new announcements in this -otherwise very entertaining- talk.
The code samples for the new edition of the book are available on github.
Guilherme discussed a number of Machine Learning plugins in the PDI marketplace:
FOREX market prediction was used as an example, using Spark MLlib random forests. The example is available on github.
Caio started by explaining that although machine learning models are never completely accurate, they can still be useful.
To create a good model, a project needs
If there is a lack of time (as there always is), Automated Machine Learning can come in to help. AutoML can't replace data scientists, but can definitely make their life easier.
When using AutoML, PDI can be useful in data onboarding, data preparation, data blending, model orchestration and visualization.
Caio continued with a demo, based on the Kaggle Titanic scenario:
Nelson took the stage after Uwe for the next iteration of his "10 WTF moments in PDI" presentation.
The '10' WTF moments are:
Bonus 1: there's no support for variable values instead of hard coded or field values in the Filter Rows step
Bonus 2: there's no way to provide the kettle.properties file location on the command line
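There is a well-known workaround for that last point: PDI looks for kettle.properties under $KETTLE_HOME/.kettle/, so you can point each run at a different properties file by switching KETTLE_HOME per invocation. A rough sketch (the paths are illustrative):

```python
import os
import subprocess

# Workaround for the missing kettle.properties command-line option:
# PDI reads kettle.properties from $KETTLE_HOME/.kettle/kettle.properties,
# so switching KETTLE_HOME per run effectively selects the properties file.
# All paths below are illustrative assumptions.

def kettle_env(kettle_home):
    """Return an environment that makes PDI pick up .kettle/ under kettle_home."""
    env = dict(os.environ)
    env["KETTLE_HOME"] = kettle_home
    return env

def run_job(job_path, kettle_home):
    """Run a PDI job with the kettle.properties found under kettle_home."""
    return subprocess.run(
        ["kitchen.sh", f"-file={job_path}"],
        env=kettle_env(kettle_home),
    )
```

This keeps per-environment configuration (dev/test/prod) in separate .kettle directories without touching the PDI installation itself.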
Diethard started his talk by stating the obvious: 'ETL developers are not code developers'.
While this is very true in most projects, ETL developers are required to act very much like 'real' developers, including the use of VCS tools, most notably git.
The main observation Diethard made is that chaos is everywhere: inconsistency abounds, and standardization and conventions are hard to find.
However, one of the bare necessities of a robust project, where ETL code can be deployed in a repeatable, reliable and supportable way, is having consistency, standardization and conventions. Standards should therefore be enforced where possible. One way to enforce them is through git, and git hooks specifically.
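As an illustration of enforcing standards with git hooks (not Diethard's actual hook), here is a sketch of a pre-commit check that rejects PDI files whose names break a hypothetical lowercase snake_case convention:

```python
import re
import subprocess
import sys

# Sketch of enforcing a (hypothetical) naming convention with a git pre-commit
# hook: staged .ktr transformations and .kjb jobs must be lowercase snake_case.

NAME_RULE = re.compile(r"^[a-z0-9_]+\.(ktr|kjb)$")

def violates_convention(path):
    """Return True if a staged PDI file breaks the naming convention."""
    filename = path.rsplit("/", 1)[-1]
    if not filename.endswith((".ktr", ".kjb")):
        return False  # only PDI artifacts are checked
    return NAME_RULE.match(filename) is None

def main():
    """List staged files and abort the commit if any PDI file is misnamed."""
    staged = subprocess.run(
        ["git", "diff", "--cached", "--name-only"],
        capture_output=True, text=True,
    ).stdout.splitlines()
    bad = [p for p in staged if violates_convention(p)]
    if bad:
        print("Naming convention violations:", ", ".join(bad))
        sys.exit(1)  # non-zero exit aborts the commit
```

Saved as .git/hooks/pre-commit (with main() invoked and the file made executable), this makes the convention impossible to ignore rather than merely documented.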
Another point Diethard made is that code and config should be stored, managed and maintained in separate repositories, and as part of separate release cycles.
Two other recommendations:
The second-to-last speaker in the technical room was Slawo, back at PCM after a couple of years of absence.
Slawo argued that ETL should be tested like code, but, because of the lack of a full ETL testing framework for PDI, he created one himself.
The folder structure Slawo suggests for a testing infrastructure is
All tests are run from Jenkins.
An important part of the testing framework is the ability to reset an environment completely (or to a given point in time, e.g. release?).
Slawo stressed the importance of applying best practices to all ETL that is developed. Not only is f-ed up ETL hard or impossible to read, maintain and tune, it is also untestable.
The types of tests that are supported in Slawo's framework are:
Finally, Slawo declared his love for JRuby and RSpec.
JRuby is ideal to test PDI ETL code, because all code (testing and PDI) runs within the Java Virtual Machine.
Slawo loves RSpec because, as a Behaviour Driven Development framework, it allows developers to define what will be tested and what the desired outcome of the test is. This allows very brief code to describe powerful test scenarios.
More information about Slawo's testing framework is available here.
Last on the agenda (aka top of the bill) was #PCM16 hero Hiromu Hota, who introduced himself as an 'unofficial member of Pentaho Labs'.
Hiromu started by walking the audience through a number of updates in WebSpoon:
Hiromu elaborated on how WebSpoon streamlines the machine learning workflow.
He then explained a number of scenarios where WebSpoon is a better fit than 'fat client' Spoon:
After the WebSpoon updates, with a quite theatrical "one more thing", Hiromu got the crowd quiet and blew the audience's socks off like he did at #PCM16.
His new "WOW moment" is GitSpoon: a visual integration of git and PDI.
GitSpoon not only displays the version history of a job or transformation in Spoon, but can also show a visual diff between two versions of a job or transformation.
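GitSpoon's diff is visual and built into Spoon; purely to illustrate the underlying idea, here is a rough command-line analogue that pulls two versions of a transformation from git and compares the step names in the .ktr XML (all names and paths are assumptions):

```python
import subprocess
import xml.etree.ElementTree as ET

# Rough analogue of GitSpoon's diff idea, NOT GitSpoon itself: fetch two
# versions of a .ktr from git and compare the step names in the XML.

def step_names(ktr_xml):
    """Extract the set of step names from a PDI transformation's XML."""
    root = ET.fromstring(ktr_xml)
    return {step.findtext("name") for step in root.iter("step")}

def show_file_at(rev, path):
    """Fetch a file's content at a given git revision."""
    return subprocess.run(
        ["git", "show", f"{rev}:{path}"],
        capture_output=True, text=True, check=True,
    ).stdout

def diff_steps(old_xml, new_xml):
    """Return (steps added, steps removed) between two versions."""
    old, new = step_names(old_xml), step_names(new_xml)
    return sorted(new - old), sorted(old - new)
```

A textual diff like this only tells you *that* steps changed; the appeal of GitSpoon is seeing the change laid out on the transformation canvas itself.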
When Hiromu asked if we wanted him to make his private github repository public, the PCM17 crowd went wild.
Hiromu Hota, aka "Hota San", has proven himself to be the PCM rock star for the second year in a row. It'll be exciting to see what he comes up with for PCM18; expectations are definitely set high now!