PCM17 - Technical Room
Read our overview of the Keynotes
Read our overview of the talks in the Business room
Data Pipelines - Running PDI on AWS Lambda - Dan Keeley
Dan explained how serverless PDI lets you spend time on the solution rather than on getting the PDI server and infrastructure up and running.
Although virtualization already takes some of the infrastructure management pain away, there's still quite a bit of overhead involved, whereas no infrastructure management is needed when running serverless in the cloud.
Dan presented the use case of a small company that has no infrastructure and has all its apps outsourced.
He used 3 services for this use case:
- AWS Lambda (compute)
- AWS API Gateway (connectivity)
- AWS S3 (storage)
Dan showed a demo where a configuration requests data, launches and runs PDI, and sends the results back through Lambda.
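As a rough sketch of what a serverless PDI trigger could look like (this is not Dan's actual code; the paths, parameter names and handler shape are assumptions), a Lambda-style handler can translate an incoming API Gateway event into a Pan command line that runs a transformation:

```python
# Hypothetical sketch: a Lambda-style handler that turns an event into
# a PDI transformation run via Pan. Paths and parameter names are
# illustrative, not from Dan's actual demo.
PAN = "/opt/pdi/pan.sh"                  # PDI command-line runner (assumed location)
TRANSFORMATION = "/opt/etl/extract.ktr"  # transformation to execute (assumed)

def build_pan_command(params):
    """Build the Pan command line for a transformation run."""
    cmd = [PAN, "-file", TRANSFORMATION, "-level", "Basic"]
    # Pan accepts named parameters as -param:NAME=VALUE
    for name, value in sorted(params.items()):
        cmd.append(f"-param:{name}={value}")
    return cmd

def handler(event, context=None):
    """Lambda entry point: translate the event into a Pan invocation."""
    cmd = build_pan_command(event.get("params", {}))
    # In a real function this command would be executed and the output
    # uploaded to S3; here we only return it for inspection.
    return {"command": cmd}

if __name__ == "__main__":
    print(handler({"params": {"REGION": "eu-west-1"}}))
```

The division of labour mirrors the three services above: API Gateway delivers the event, Lambda provides the compute that runs PDI, and S3 would hold the output.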
A framework for IOT, rules for implementation at the Edge, Gateway, and Data Lake. - Dominik Claßen
With the rise of IoT, data is produced on devices that are too small to do on-board processing, so a new approach is required.
This approach consists of 4 steps of data processing:
- edge: realtime streaming of data from the device itself
- gateway: first level of accumulating, syncing and aggregating data, on e.g. a Raspberry Pi
- data lake: used to store all (not only IoT) data
- application: processing
As data moves to higher levels, more knowledge can be gained from it.
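The gateway level (step 2) can be sketched in a few lines: accumulate raw readings from edge devices and aggregate them per device before forwarding to the data lake. The field names and statistics here are illustrative assumptions, not from Dominik's talk:

```python
from statistics import mean

# Illustrative sketch of the 'gateway' level: accumulate raw sensor
# readings from edge devices and aggregate them per device before
# forwarding to the data lake. Field names are assumptions.
def aggregate_readings(readings):
    """Group (device_id, value) readings and compute per-device stats."""
    per_device = {}
    for device_id, value in readings:
        per_device.setdefault(device_id, []).append(value)
    return {
        device: {"count": len(vals), "mean": mean(vals), "max": max(vals)}
        for device, vals in per_device.items()
    }

readings = [("rpi-1", 20.0), ("rpi-1", 22.0), ("rpi-2", 19.5)]
print(aggregate_readings(readings))
```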
Understanding the NewMapComponent - Kleyson Rios
Kleyson took the stage to talk about the CTools NewMapComponent, which provides out-of-the-box API functionality and extensions.
The component works with Google Maps and OpenLayers, and has at least the following capabilities:
- map controls (pan, zoom, selection)
- feature states (e.g. markers): selected, unselected
- marker styling: states + attributes, shapes (dot, cross, custom SVG)
- events, e.g. onclick
Pentaho 8 Reporting for Java Developers - Francesco Corti
Francesco came to present the new edition of the Pentaho Reporting book.
Although the goal of his talk was not to sell books, the crowd was given a discount coupon code that works perfectly well to sell and buy books ;-)
Pentaho Reporting hasn't received a lot of love from Pentaho in the latest releases, so there were no big new announcements in this (otherwise very entertaining) talk.
The code samples for the new edition of the book are available on GitHub.
Machine Learning in PDI - What's new in the Marketplace? - Guilherme Raimundo
Guilherme discussed a number of Machine Learning plugins in the PDI marketplace:
- Recurrent Neural Network Forecaster: this plugin makes data forecasts based on Recurrent Neural Network models trained on Weka using the wekaDeeplearning4j package.
- PDI Weka Forecasting step
- Weka DeepLearning4J
- Deep Learning Forecast
FOREX market prediction was used as an example, using Spark MLlib random forests. The example is available on GitHub.
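The random-forest idea behind that example can be illustrated with a toy, dependency-free sketch (the real demo used Spark MLlib; this is only the voting-ensemble concept, with made-up data):

```python
import random

# Toy sketch of the random-forest idea behind the FOREX example
# (the real demo used Spark MLlib): an ensemble of weak learners,
# each built from a random view of the data, voting on the outcome.
def train_stump(data, rng):
    """A 'stump' that thresholds the feature at a random sample point."""
    x, _ = rng.choice(data)
    return lambda v: 1 if v > x else 0

def train_forest(data, n_trees, seed=42):
    rng = random.Random(seed)
    return [train_stump(data, rng) for _ in range(n_trees)]

def predict(forest, v):
    """Majority vote over all stumps."""
    votes = sum(stump(v) for stump in forest)
    return 1 if votes * 2 > len(forest) else 0

data = [(1.0, 0), (2.0, 0), (8.0, 1), (9.0, 1)]  # (feature, label)
forest = train_forest(data, n_trees=25)
print(predict(forest, 10.0), predict(forest, 0.5))  # 1 0
```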
Working with Automated Machine Learning (AutoML) and Pentaho - Caio Moreno de Souza
Caio started by explaining that although machine learning models are never completely accurate, they can still be useful.
To create a good model, a project needs time. If there is a lack of time (as there always is), Automated Machine Learning can help. AutoML can't replace data scientists, but it can definitely make their lives easier.
When using AutoML, PDI can be useful in data onboarding, data preparation, data blending, model orchestration and visualization.
Caio continued with a demo, based on the Kaggle Titanic scenario:
- PDI + H2O AutoML: best model based on the data, with an optional time variable (more time --> better model)
- PDI: orchestration of R-code
- use H2O to find the best model based on R output
- pass best model to predictive analysis step
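The core AutoML loop in that demo can be sketched as "try candidate models within a budget, keep the one with the lowest validation error". This toy version (not the H2O AutoML API; the candidate models and error metric are made up for illustration) also shows why more time tends to yield a better model — a bigger budget means more candidates tried:

```python
# Toy illustration of the AutoML idea from Caio's demo: fit a set of
# candidate models within a (simulated) time budget and keep the one
# with the lowest validation error. This is not the H2O AutoML API.
def mean_model(train):
    avg = sum(train) / len(train)
    return lambda: avg

def last_value_model(train):
    last = train[-1]
    return lambda: last

def auto_select(train, validation, candidates, budget):
    """Fit up to `budget` candidates, return (name, model) with lowest MAE."""
    best_name, best_model, best_err = None, None, float("inf")
    for name, fit in candidates[:budget]:
        model = fit(train)
        err = sum(abs(model() - v) for v in validation) / len(validation)
        if err < best_err:
            best_name, best_model, best_err = name, model, err
    return best_name, best_model

candidates = [("mean", mean_model), ("last", last_value_model)]
name, model = auto_select([1, 2, 3, 9], [9, 9], candidates, budget=2)
print(name)  # last
```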
10 WTF moments in PDI - Nelson Sousa
Nelson took the stage after Uwe for the next iteration of his "10 WTF moments in PDI" presentation.
The '10' WTF moments are:
- unparseable dates because of daylight savings time
- deadlocks when too much data is stored in memory
- disappearing files in file repository (if the repository folder can't be found, it goes to the repository root)
- connection updates in file repository (file repository's .kdb files are only updated from View menu, not from transformation step dialog)
- 'enableFastMode=true': use database specific connections to significantly improve performance
- connection fails: if a password in a properties file ends in a space, that space is considered part of the password. Nelson suggests ending password values with a hash sign.
Bonus 1: there's no support for variable values instead of hard coded or field values in the Filter Rows step
Bonus 2: there's no possibility to provide the kettle.properties file from the command line
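Nelson's trailing-space gotcha is easy to reproduce. This naive properties-style parser (a stand-in for how kettle.properties values are read, not the actual Kettle code) keeps any whitespace at the end of the line as part of the value:

```python
# Demonstration of Nelson's trailing-space gotcha: with a naive
# properties-style parser (a stand-in for kettle.properties handling),
# a space at the end of the line becomes part of the value.
def parse_property(line):
    """Split 'KEY=VALUE', stripping the key but NOT the value."""
    key, _, value = line.partition("=")
    return key.strip(), value  # the value keeps any trailing whitespace

key, password = parse_property("DB_PASSWORD=s3cret ")
print(repr(password))  # 's3cret ' -- the trailing space is part of the password
```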
Version Controlling Pentaho Solution Code Artefacts with Big Agile Teams - Diethard Steiner
Diethard started his talk by stating the obvious: 'ETL developers are not code developers'.
While this is very true in most projects, ETL developers are required to act very much like 'real' developers, including the use of VCS tools, most notably git.
The main observation Diethard made is that chaos is everywhere. Inconsistency is everywhere, standardization and conventions are hard to find.
However, consistency, standardization and conventions are bare necessities of a robust project, one where ETL code can be deployed in a repeatable, reliable and supportable way. Standards should therefore be enforced where possible, and one way to enforce them is through git, and git hooks specifically.
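As an illustration of the kind of check a git pre-commit hook could run, here is a small sketch that enforces a naming convention on staged PDI artefacts. The convention itself (lowercase snake_case names for .ktr/.kjb files) is a made-up example, not a standard from Diethard's talk:

```python
import re

# Sketch of a check a git pre-commit hook could run to enforce a
# naming convention on PDI artefacts. The convention used here
# (lowercase snake_case .ktr/.kjb names) is a made-up example.
NAME_PATTERN = re.compile(r"^[a-z0-9_]+\.(ktr|kjb)$")

def violations(staged_files):
    """Return the staged PDI files that break the naming convention."""
    return [
        f for f in staged_files
        if f.endswith((".ktr", ".kjb"))
        and not NAME_PATTERN.match(f.rsplit("/", 1)[-1])
    ]

staged = ["etl/load_sales.ktr", "etl/Load Customers.ktr", "README.md"]
print(violations(staged))  # ['etl/Load Customers.ktr']
```

A hook would run such a check over `git diff --cached --name-only` and reject the commit if the list is non-empty.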
Another point Diethard made is that code and config should be stored, managed and maintained in separate repositories, and as part of separate release cycles.
Two other recommendations:
- Spoon can be pre-configured for certain environments and configurations to enforce standardizations and conventions
- code should be structured in components, with code reuse where possible
Testing Pentaho ETL solutions - Slawomir Chodnicki
The second last speaker in the technical room was Slawo, back at PCM after a couple of years of absence.
Slawo argued that ETL should be tested like code; because there was no full ETL testing framework for PDI, he created one himself.
The folder structure Slawo suggests for a testing infrastructure is:
- entry points (access tests)
- environment configuration
- test specs
All tests are run from Jenkins.
An important part of the testing framework is the ability to reset an environment completely (or to a given point in time, e.g. a release).
Slawo stressed the importance of applying best practices to all ETL that is developed. Not only is f-ed up ETL hard or impossible to read, maintain and tune, it is also untestable.
The types of tests that are supported in Slawo's framework are:
- Non functional
Finally, Slawo declared his love for JRuby and RSpec.
JRuby is ideal to test PDI ETL code, because all code (testing and PDI) runs within the Java Virtual Machine.
Slawo loves RSpec because, as a Behaviour Driven Development framework, it allows developers to define what will be tested and what the desired outcome of the test is. This allows very brief code to describe powerful test scenarios.
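Slawo's framework uses JRuby and RSpec; as a language-neutral sketch of the same given/when/then idea, the structure of such a behaviour-style test looks like this (the 'transformation' below is a plain-Python stand-in for a real PDI transformation, not part of his framework):

```python
# Language-neutral sketch of the behaviour-driven testing idea:
# describe the input, run the unit under test, state the expected
# outcome. The 'transformation' is a stand-in for a PDI transformation.
def transformation(rows):
    """Stand-in for a PDI transformation: drop invalid rows, uppercase names."""
    return [{"name": r["name"].upper()} for r in rows if r.get("name")]

def describe_transformation():
    # given: an input data set with one invalid row
    rows = [{"name": "alice"}, {"name": ""}]
    # when: the transformation runs
    output = transformation(rows)
    # then: the invalid row is dropped and names are normalized
    assert output == [{"name": "ALICE"}]

describe_transformation()
print("transformation behaves as described")
```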
More information about Slawo's testing framework is available here.
Updates on webSpoon and other innovations from Hitachi R&D - Hiromu Hota
Last on the agenda (aka top of the bill) was #PCM16 hero Hiromu Hota, who introduced himself as an 'unofficial member of Pentaho Labs'.
Hiromu started by walking the audience through a number of updates in WebSpoon:
- various fixes
- automated UI testing
- CI: nightly builds on every commit
- transformation steps, job entries confirmed to be compatible
- carte integration
- file dialog: open from server, import from client
Hiromu elaborated on how WebSpoon streamlines the machine learning workflow.
He then explained a number of scenarios where WebSpoon is a better fit than 'fat client' Spoon:
- easier version, plugin, configuration management
- keep data onsite, allow ETL developers to work remotely
After the WebSpoon updates, with a quite theatrical "one more thing", Hiromu got the crowd quiet and then blew the audience off their socks like he did at #PCM16.
His new "WOW moment" is GitSpoon: a visual integration of git and PDI.
GitSpoon not only displays the version history of a job or transformation in Spoon, it can also show a visual diff between two versions of a job or transformation.
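To give a feel for what comparing two versions of a transformation involves, here is a heavily simplified sketch that diffs step names between two .ktr-style XML documents. The XML is a stripped-down stand-in for the real .ktr format, and GitSpoon's diff is visual and far richer than this:

```python
import xml.etree.ElementTree as ET

# Sketch of the idea behind a transformation diff: compare the step
# names of two versions of a .ktr-style XML file. The XML below is a
# heavily simplified stand-in for the real .ktr format.
OLD = "<transformation><step><name>Input</name></step></transformation>"
NEW = ("<transformation><step><name>Input</name></step>"
       "<step><name>Filter rows</name></step></transformation>")

def step_names(ktr_xml):
    """Collect the set of step names in a transformation document."""
    root = ET.fromstring(ktr_xml)
    return {step.findtext("name") for step in root.iter("step")}

added = step_names(NEW) - step_names(OLD)
removed = step_names(OLD) - step_names(NEW)
print("added:", sorted(added), "removed:", sorted(removed))
```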
When Hiromu asked if we wanted him to make his private GitHub repository public, the PCM17 crowd went wild.
Hiromu Hota, aka "Hota San", has proven himself to be the PCM rock star for the second year in a row. It'll be exciting to see what he comes up with for PCM18; expectations are definitely set high now!