Follow the 2016 Pentaho Community Meeting live!

PCM Live Blog - Technical Room

This blog will be updated throughout the day to keep everyone up-to-date about the event!

Pentaho 7.0 – Pedro Alves

Pedro Alves had the honor of introducing Pentaho 7.0, claiming it would be a very big release, and he wasn't lying …

With a growing amount of data, there is also a growing demand for tools, and with this growth of tools comes the issue that things can get complicated. Business, Business Analytics and IT have always been considered separate domains, but Pentaho 7.0 tackles this problem by merging them together.

Pentaho 7.0 merges the development cycle, starting from the data and going all the way to the analytics. The analytics will from now on be integrated into PDI. Users can simply connect to their BA server from within PDI and do (pretty much) everything they would do on the server: access files on the BA repository, generate analytical charts and tables, generate reports, and more.

With this comes a solution to an issue all of us developers know too well: working all the way through the cycle, discovering an error in the analytical phase and, as a result, having to start over.

Two Enterprise Edition-only functionalities make this possible:

  • Check the analytical result at any step in the flow (generate tables, charts, …)
  • Immediately make the result available on the server

Pedro also picked up on a few major focus points for Big Data in 7.0:

  • Spark integration via a new step 'Spark submit' using simplified SQL queries
  • Hadoop security, allowing secure multi-user authentication to a cluster
  • Metadata injection, which is now supported by 40+ new steps

Another big addition to the group of PDI steps is the annotate stream step. This step allows PDI to display the data in a model view, the same way we'd want to see it in a report. With the annotate stream step we can now mark what a data field really represents (for example, marking a field as an average measure). Rather than having to create a Mondrian schema by hand, PDI creates one in the background and makes it available for use in the (new) model view. And as mentioned earlier, this result can then immediately be published to the server for use in Analyzer.

So to sum it all up, Pentaho 7.0 takes a huge step forward in making our lives as Pentaho developers just that bit easier, by merging the cycle from data integration all the way to analytics into a single application. From now on, we will no longer speak of PDI, the BA server and the DI server as separate entities, but of the Pentaho server as a whole.

As an added bonus, Pentaho 7.0 is already available on SourceForge. Find it here:

https://sourceforge.net/projects/pentaho/files/Data%20Integration/

Alternative Big Data DevOps – Tom Barber

“Making life easier for ops”

Tom Barber came to present a project he's working on for the company Canonical, namely Juju: a system that makes installing and configuring other systems easier, whilst also automating the process.

What is Juju?

  • Open source app modelling platform (developed by Canonical)
  • Allows deployment to the cloud, bare metal, containers, etc.

Juju & LXD:

  • Easy deployment via a single command line (e.g. a full Hadoop stack; see the sketch after this list)
  • Scaling and testing
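
As a quick illustration of that single-command deployment, a local LXD setup looks roughly like this (the bundle and application names are examples from the charm store, not necessarily what Tom demoed, and the bootstrap argument order depends on the Juju version):

    juju bootstrap localhost lxd-test    # create a controller backed by local LXD containers
    juju deploy hadoop-processing        # one command pulls in a whole Hadoop stack bundle
    juju status                          # watch the units come up
    juju add-unit slave -n 2             # scale out by adding worker units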

Juju GUI (For those of us who despise command lines)


Relations:

  • Apache Bigtop: the reference implementation of Hadoop
  • Apache Drill
  • Link with Pentaho (work in progress): Data Integration (PDI) can create clusters (scaling) and automatically configure them (and automatically update them if necessary)
  • Monitoring: with Nagios (logging) and with Ganglia (graphical / analytical) (a sketch follows below)
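
A hedged sketch of how such monitoring relations are wired up (the charm and application names below are illustrative, not taken from the demo):

    juju deploy nagios                   # deploy the Nagios charm
    juju deploy ganglia                  # deploy the Ganglia charm
    juju add-relation namenode nagios    # let Nagios log and alert on the NameNode
    juju add-relation namenode ganglia   # let Ganglia graph the cluster metrics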

Interested? For more information, go to: http://jujucharms.com


Data models for Hadoop: Kimball without updates – UbiquisBI

The fellows over at UbiquisBI gave an excellent presentation on slowly changing dimensions and snapshots on Hadoop.

Even better: they wrote a five-part blog series of their own explaining everything that needs to be known.

Go check it out!

http://ubiquis.co.uk/dwh/status-change-fact-table-...

VizAPI 3.0 – Duarte Leão

Duarte Leão presented a complete redesign for VizAPI 3.0. 

The points he raised about version 2.0 were:

  • Not compatible with Google’s VizAPI
  • Irremediably coupled with Pentaho Analyzer
  • “Mind your own business”

The new design has a clear split into layers:

Platform layer

  • An AMD plugin with AMD ids
  • Message bundles
  • Stylesheets

Object layer

  • There's always something missing, pentaho/‘type’
  • Identification, metadata, reflection, extensibility, configuration, validation, serialization, changes

Data layer

  • Pentaho/data
  • Data set and metadata, formatting, filter

Visualization layer

  • pentaho/visual
  • Model/view, visual roles, rendering, interaction, color palettes, printing
  • Translating visual elements selected by the user

After this, he showed a demo of the inner workings of VizAPI 3.0.

Managing multi-project/multi-environment scenarios with Docker – João Gameiro

João Gameiro showed us a BASH script to use multiple projects with multiple versions of Pentaho with Docker.

CBF1: 

  • Since 2007
  • Used ant, compiling the Java source
  • Was a tool to facilitate the setup and deployment of Pentaho CE projects

CBF2 takes a different approach from CBF1:

  • A BASH script
  • Runs on different operating systems, using Docker

Main features:

  • Binaries instead of source code
  • Supports CE & EE
  • All environment setups
  • Ability to easily switch projects
  • VCS friendly

Core image: a clean install with a choice of Pentaho version, plus the ability to merge different Pentaho projects with different versions.
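
To give a feel for what such a script wraps, here is an illustrative sketch in plain Docker commands; the image names, paths and Dockerfile are assumptions, not the actual CBF2 interface:

    # Build a clean "core" image for one Pentaho version (assuming a Dockerfile that installs it)
    docker build -t pentaho-server-ce:7.0 .
    # Overlay one project's solution files on the core image and run it
    docker run -d --name project-a -p 8080:8080 \
      -v "$PWD/project-a/solution:/pentaho/server/pentaho-solutions/project" \
      pentaho-server-ce:7.0
    # Tear the container down again to switch to another project or Pentaho version
    docker stop project-a && docker rm project-a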

Jens Bleuel – New features in 7.0

Jens Bleuel presented the new features of PDI 7.0.

A lot of progress has been made on metadata injection, the general idea being to support injection into more and more steps.

Metadata injection is now even possible into the metadata injection step itself. Some people yelled ‘insane!’, but this does have its use cases.

A lot of big data steps already support it, and a lot more will follow.

Data services are also being worked on a lot. This includes data visualization in PDI: it is possible to create a virtual data set within PDI and visualize your data right there. Data services can be pushed to the Pentaho server and are immediately available as data sources.

Further new features include updated Elasticsearch steps, Salesforce steps, JSON steps, the addition of last log dates for jobs and updates to the Google Analytics Input step. A lot of functional features were also added within individual steps.

To finish, Jens announced that the CTRL + C copy function in the Spoon log now works properly in 7.0. APPLAUSE!

Jens also mentioned Pentaho Labs: Pentaho Labs and Hitachi labs are now working together on research to improve innovation.

Matt Casters – PDI Unit testing

Unit testing is widely accepted as one of the essential components of software development. Since PDI is a development tool, it would be awesome to have unit testing available in PDI.

With PDI unit testing it is possible to inject sample data into your transformation for the sole purpose of a test run. This works through data sets: you can pick any data set from a local (or remote) source and use its data for a test run of your transformation. The data set is adjustable and can be used dynamically.

The idea behind this is to start from an empty transformation and create a unit test. The unit test provides output in the logging with the results of your test. Errors in unit tests are displayed as a popup window with appropriate feedback about your test transformation. During testing, it is also possible to bypass steps to prevent them from hurting your test. At runtime, bypassed steps are replaced by dummy steps to keep the transformation in its entirety working.

To start working with the unit tests, download the jar file from github, throw it in the plugins folder and start testing!
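
In shell terms that boils down to something like the sketch below; the repository, file and folder names are assumptions, so check the project's GitHub page for the actual artifact:

    # Grab the plugin jar from the project's releases page (placeholder URL and version)
    wget https://github.com/mattcasters/pentaho-pdi-dataset/releases/download/<version>/pdi-dataset.jar
    # Drop it into PDI's plugins folder
    mkdir -p "$PDI_HOME/plugins/pdi-dataset"
    cp pdi-dataset.jar "$PDI_HOME/plugins/pdi-dataset/"
    # Restart Spoon and start defining data sets and unit tests
    "$PDI_HOME/spoon.sh"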

As a second part of his presentation, Matt explained some pieces of code that enable the Meta Store Attributes to be stored. He also explained how to add your own extension point plugins into the java code if you want your own tweaks in the code.

Hiromu Hota – webSpoon

Hiromu Hota works for Hitachi America and came to introduce webSpoon, a web browser based version of Spoon. WebSpoon is basically Spoon running in your browser, easy as that. By accessing a server URL, you can create, preview, save and run transformations and jobs in your browser. WebSpoon runs on the server side, so all your transformations are stored and executed on the server.

If you want to deploy webSpoon for yourself, you can download the .war file from the repository, copy it to the Tomcat webapps folder and restart your server. After doing so, webSpoon will be accessible through the URL of your running server.
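
A minimal sketch of that deployment on a standard Tomcat install; the war file name, download location and context path are assumptions, so check the webSpoon repository for the real ones:

    # Download the webSpoon war (placeholder URL and version)
    wget https://github.com/HiromuHota/pentaho-kettle/releases/download/<version>/spoon.war
    # Tomcat auto-deploys war files dropped into its webapps folder
    cp spoon.war "$CATALINA_HOME/webapps/"
    # Restart Tomcat
    "$CATALINA_HOME/bin/shutdown.sh"; "$CATALINA_HOME/bin/startup.sh"
    # webSpoon should then be reachable at http://<your-server>:8080/spoon/spoon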

Several use cases come to mind for a browser based Spoon:

  • PDI on the go: run PDI on your smartphone or tablet.
  • Security: transformations and jobs run on the server so the data remains within the server.
  • No installation required.
  • No difference in UI between BI server and DI server.

To start developing yourself and contribute to the project, clone the repository, install Eclipse with RAP and import the cloned UI folder as an Eclipse project.

Andre Simoes and Julien Hofstede – Pentaho BA + OpenID Connect SSO

Reelmetrics provides analytics services for the casino industry. Pentaho is embedded in a Rails application for the analytics and reporting visualizations, but is not the core of their setup.

Reelmetrics wanted to use single sign-on so that users can provide one set of credentials for all applications they want to access. To enable this, they deployed all applications within one domain and let Rails fetch the visualizations from the Pentaho server.

Because SAML and CAS rely on browser redirects, they were not an option. OpenID Connect was chosen as the solution.

OpenID Connect works by checking, on a login attempt against the server you want to access, whether an active session exists. If the server cannot find an active OpenID session, it redirects the user to the OpenID Connect provider, which issues a token to the user attempting the login.
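
For reference, the token step of a generic OpenID Connect authorization-code flow looks roughly like this with curl; the endpoint, client id and secret are placeholders, not Reelmetrics' actual configuration:

    # Exchange the authorization code for tokens at the provider's token endpoint
    curl -X POST https://openid-provider.example.com/token \
      -d grant_type=authorization_code \
      -d code="$AUTH_CODE" \
      -d redirect_uri="https://apps.example.com/callback" \
      -d client_id="$CLIENT_ID" \
      -d client_secret="$CLIENT_SECRET"
    # The JSON response contains an id_token (a signed JWT identifying the user)
    # plus an access token the application can present to the Pentaho server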

This OpenID Connect approach allowed the team to implement security while keeping all components of the Pentaho server working.

Wael Elrifai – IoT: Data to information, information to knowledge

Wael Elrifai came to present a use case on machine learning and IoT, more specifically on Hitachi Rail.

The challenge of today’s train business is to modernize and improve rail transportation reliability in the United Kingdom and reduce the maintenance costs of the trains.

They want to improve today's train business by building sensors into trains and having them send large amounts of information to a centralized environment that processes this data and makes it possible to base decisions on it.

By ingesting this data, analyzing it and exploring it in depth, patterns can be identified and described, correlations between different factors can be found and anomalies can be exposed.

Wael explained some of the main principles of analytics, machine learning and AI and how this can help industries gain an advantage in the future.

Extra – know.bi & Neo4j: Loading data into Neo4j using PDI

Rik Van Bruggen explained the how, what and why of graph databases, and more specifically Neo4j.

Graph databases are databases where not only the data is stored, but also the relations between the objects in the database themselves. This means that, at write time, the ‘joins’ (i.e. the relations between data objects) are stored within the data model in the database. In systems that require a lot of connections through joins, a graph database will be a lot faster than regular operational database systems.

After the Neo4j introduction, Bart introduced the Neo4j output step he developed to easily load data into Neo4j through PDI. The demo loaded different Belgian beer types into a Neo4j database to expose different relationships. This resulted in the crowd enjoying some nice Belgian beers exposed by a Neo4j graph database.