PCMAMS - Pentaho Community Meeting 2012


With Jan Aertsen, last year's PCM blogger, being sick, live coverage of the Pentaho Community Meeting 2012 will be done by yours truly. The room is rather dark and the pictures were taken with my smartphone, so apologies for the poor quality...

Update 2012-10-01: added links to shared presentations and group photo.

Coffee and Introduction

After coffee and cake, Doug starts this year's community meeting with a round of introductions. 

Will Gorman - Engineering @ Pentaho

Will Gorman introduces the Pentaho development and test team at the Pentaho Ivory Towers in Orlando. After the introduction, he talks about the importance of big data, the open sourcing of the big data software components and the move to GitHub.
After this overview, Will talks about the major enhancements in the Sugar release, with features like the new JCR-based repository, data source management, REST APIs and a new scheduler. The publish password will be replaced by action-based security, which lets users publish content based on their role. The Administration Console will be put to rest and replaced by an administration perspective in the PUC (Pentaho User Console).

At the end of his session, Will mentions the new Pentaho Market Place, a collaboration between Pentaho and WebDetails, to be released in the fall of 2012.
Will ends his session with a Q&A, covering, among other topics, repository import/export and migration automation.

Slawomir Chodnicki - Pentaho CE Ecosystem Overview

Slawo starts by saying he is working on smoothing the Pentaho CE contribution process. He then promises to fix a PDI bug for anyone who guesses the correct number of 3D pie charts in his presentation.

Slawo uses the business user persona 'Linda' to give an overview of the BI stack: ETL, reporting, PME (Pentaho Metadata Editor), WAQR (sublimely renamed to 'Wanker'), ... From PME and WAQR, he jumps to Saiku and Saiku adhoc reporting as production-ready CE replacements for JPivot and Wanker respectively.


Slawo then jumps to CTools, using Ambient BI's CIO Dashboard example to show the power of the platform. In the end, he mentions the power of the latest mobile BI CTools enhancements. 

Matt Casters - Kettle 5.0 Update 

Matt starts by claiming he didn't want to write a presentation, then pulls out his PowerPoint presentation and bashes Apple and Dutch 'gastronomy' within one minute. 

The current status of Kettle is: 

  • Analyzer improvements 
  • Tons of improvements (large xlsx support, ...) 
  • new Big Data version
  • InstaView improvements (dubbed James' Brainfart)
  • Spoon Features

Planned features for Kettle 5.0 are:

Matt showing off the uebercool Kettle Metrics Gantt chart
  • new architectures and major features 
  • easier looping, named parameters 
  • Kettle metrics (logging tables + Gantt chart). Inspired by Mozilla Metrics, this new feature will allow your Kettle installation to keep track of the time it takes Kettle to perform internal tasks like connecting to a database. All this data p0rn can be written to a logging table, allowing you to tune your ETL like there's no tomorrow.
  • job restartability (the Jens Project): transactional jobs, job checkpoints
  • data federation: mainly the Kettle JDBC driver, which allows a transformation step to be used as a table in a SQL query, which generates a transformation in the background. Future improvements include Julian Hyde's OptiQ integration. Impressive stuff! (By the way: an example use case of the Kettle JDBC functionality can be found on our blog.)
  • last but not least, there may be some room for metadata in Kettle 5.0 as well. 
Metadata in Kettle 5.0. Color effects are offered for free by the Christmas light effects on the ceiling.
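
To make the Kettle metrics idea a bit more concrete, here is a minimal Python sketch of the general technique: time named internal tasks, keep the snapshots as rows, and render them as a text Gantt chart. This is not how Kettle implements it. The class name MetricsRecorder, the task names and the chart layout are all made up for illustration.

```python
import time
from contextlib import contextmanager


class MetricsRecorder:
    """Hypothetical recorder: collects (task, start, end) snapshots."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock
        self._origin = clock()
        self.snapshots = []  # rows that could also be written to a logging table

    @contextmanager
    def measure(self, task):
        # Record when the task started and ended, relative to recorder creation.
        start = self._clock() - self._origin
        try:
            yield
        finally:
            end = self._clock() - self._origin
            self.snapshots.append((task, start, end))

    def gantt(self, width=40):
        """Return one text bar per task, scaled to the last recorded end time."""
        total = max((end for _, _, end in self.snapshots), default=0.0) or 1.0
        lines = []
        for task, start, end in self.snapshots:
            offset = int(start / total * width)
            length = max(1, int((end - start) / total * width))
            lines.append(f"{task:<20}|{' ' * offset}{'#' * length}")
        return "\n".join(lines)
```

Wrapping a block in `with recorder.measure("connect to database"):` records one snapshot; printing `recorder.gantt()` afterwards shows where the time went.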


PCMAM's interpretation of CCC: Coffee, Cake, Chat

Edwin Weber - PDI Data Vault Framework

Edwin starts with a quick introduction of Dan Lindstedt's Data Vault: hubs, links and satellites. 

He then discusses how he manages a number of Data Vault projects through Kettle, mainly at the St. Antonius Ziekenhuis in Utrecht. He walks through the design decisions and shows how metadata is maintained in Excel files, and how hubs, links and satellites are generated from that Excel metadata. Version management is done in Git.
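
As a rough illustration of the metadata-driven approach (not Edwin's actual framework), the Python sketch below turns a few metadata records, as they might be read from a spreadsheet, into table definitions for hubs, links and satellites. All names, column conventions (load_dts, record_source) and data types are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Hub:
    name: str          # business concept, e.g. "patient" (hypothetical)
    business_key: str  # natural key column from the source system


@dataclass
class Link:
    name: str
    hubs: list         # names of the hubs this link connects


@dataclass
class Satellite:
    parent: str        # hub the satellite describes
    attributes: list   # descriptive columns tracked over time


# Audit columns shared by every table (an assumed convention).
AUDIT = "  load_dts TIMESTAMP NOT NULL,\n  record_source VARCHAR(50) NOT NULL"


def hub_ddl(hub):
    return (f"CREATE TABLE hub_{hub.name} (\n"
            f"  {hub.name}_id INTEGER PRIMARY KEY,\n"
            f"  {hub.business_key} VARCHAR(100) NOT NULL UNIQUE,\n"
            f"{AUDIT}\n)")


def link_ddl(link):
    refs = "".join(
        f"  {h}_id INTEGER NOT NULL REFERENCES hub_{h} ({h}_id),\n"
        for h in link.hubs
    )
    return (f"CREATE TABLE link_{link.name} (\n"
            f"  {link.name}_id INTEGER PRIMARY KEY,\n"
            f"{refs}{AUDIT}\n)")


def satellite_ddl(sat):
    cols = "".join(f"  {a} VARCHAR(100),\n" for a in sat.attributes)
    return (f"CREATE TABLE sat_{sat.parent} (\n"
            f"  {sat.parent}_id INTEGER NOT NULL "
            f"REFERENCES hub_{sat.parent} ({sat.parent}_id),\n"
            f"{cols}{AUDIT},\n"
            f"  PRIMARY KEY ({sat.parent}_id, load_dts)\n)")


# Hypothetical metadata rows, standing in for the Excel sheets.
patient = Hub("patient", "patient_number")
ward = Hub("ward", "ward_code")
admission = Link("admission", ["patient", "ward"])
patient_details = Satellite("patient", ["name", "birth_date"])
```

Calling `print(hub_ddl(patient))` shows the generated statement. The point is that adding a hub, link or satellite becomes a metadata change rather than hand-written ETL.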

Great stuff! Get Edwin's framework from Sourceforge.net here

Update 2012-10-01: slides!

Jens Bleuel - Funky Kettle stuff that will blow your socks off

After the family mutiny Jens had to deal with after stealing his son's model trains last year, we had to make do with a video of Jens's Kettle-controlled model trains. Steel Wheels like never before, or as Doug stated, 'this already blew one sock off'. Jens continues with a video of a Kettle-controlled model helicopter. But can it make coffee? If it can fly a helicopter, that can't be much of a challenge...

On to the serious stuff!  

Integrate and embed Kettle into PostgreSQL. Jens uses Windows 7 (seriously?) 64-bit, Java, PostgreSQL + PL/Java + PDI to call Kettle from within PostgreSQL. There is still some work to be done (e.g. PostgreSQL Java calls are single-threaded, so this needs to use the Kettle Single Threader), but this is neat stuff! More information at kettle.bleuel.com.

Cora and Yvonne preparing lunch while the bunch of us are growing sitzfleisch

Julian Hyde/Luc Boudreau - Mondrian 4

Mondrian 4 holds relatively few visible changes, but will make the life of schema developers far easier.
New in Mondrian 4, planned for beta release next week, are attributes, measure groups (groupings of measures at different levels of granularity, etc.), a physical schema and internal improvements (performance, reliability). Because of the number of changes, this is going to be a long beta.
While Paul is preparing his demo of Saiku on top of Mondrian 4, Luc mentions that each level in a hierarchy can be used as a standalone object. This will offer a lot more flexibility and, as Pedro points out, will allow a year and a month level to be used on different axes.

Vampire Luc drinks blood after dusk 

Mondrian 4 will contain:

  • measure groups: possibility to include measures from different fact tables (making virtual cubes obsolete)
  • many-to-many associations between measure groups and dimensions
  • different ways to link dimensions to fact tables
  • aggregate tables are measure groups 

Mondrian 4 is an omelette, so existing stuff had to be broken: 

  • Good News (tm): there will be an upgrader, that translates a Mondrian 3 into a Mondrian 4 schema. 
  • aggregate recognizer: 'automagical' recognition of aggregate tables from the database catalog will be phased out, in favor of measure groups (and thus different fact tables). 
  • schema workbench will get the boot. There's no replacement so far, so get out your text editor for some 'good' old xml hacking. 
  • XMLA server: olap4j-xmlaserver on GitHub.
  • Hierarchy syntax will move to SSAS-style syntax, e.g. [Time.Weekly].[Day] will change to [Time].[Weekly].[Day]

2511 of 2770 tests pass, but work still needs to be done on:

  • ragged hierarchies
  • analyzer upgrade
  • aggregate table api 
  • complex schema mappings
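
The hierarchy syntax change mentioned earlier ([Time.Weekly].[Day] becoming [Time].[Weekly].[Day]) is easy to picture as a string rewrite. The Python sketch below is only an illustration of the naming change, not the Mondrian upgrader; the function name to_mondrian4_name is made up, and it assumes the old dimension.hierarchy form only ever appears in the first bracketed segment.

```python
import re

# Matches a leading segment like [Time.Weekly]; later segments are left alone,
# since member names such as [St. Louis] may legitimately contain dots.
_OLD_STYLE = re.compile(r"^\[([^\]\.]+)\.([^\]\.]+)\]")


def to_mondrian4_name(name):
    """Rewrite [Dim.Hierarchy].[Level] as [Dim].[Hierarchy].[Level]."""
    return _OLD_STYLE.sub(r"[\1].[\2]", name, count=1)
```

Names that are already in the new style pass through unchanged, so the function can be applied to a whole list of MDX identifiers in one go.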

Downloads are available from Pentaho CI and will be pushed to Sourceforge in the course of next week.  Test, file bugs, contribute if you want to speed this up! 

Coming up: the Mondrian book, eta May 2013!

The road ahead: 

  • shelved aggregate tables: Mondrian will only go to aggregate tables for historical data, and will use the fact table for relatively new data.
  • some other stuff, skipped because of scheduling drama.  
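
The shelved aggregate table idea can be sketched as a simple routing decision: the part of a date range before some cutoff is answered from the aggregate table, the part after it from the fact table. The Python below is only a guess at the general shape. The cutoff, the table names and the function plan_query are hypothetical, and the talk did not go into how Mondrian will actually do this.

```python
from datetime import date


def plan_query(start, end, cutoff,
               agg_table="agg_sales_month", fact_table="sales_fact"):
    """Split the half-open range [start, end) at the cutoff date.

    Returns a list of (table, range_start, range_end) tuples: historical
    data comes from the aggregate table, recent data from the fact table.
    """
    if start >= end:
        return []
    plan = []
    if start < cutoff:
        plan.append((agg_table, start, min(end, cutoff)))
    if end > cutoff:
        plan.append((fact_table, max(start, cutoff), end))
    return plan
```

A query that spans the cutoff yields two parts whose results would then have to be combined; a query entirely on one side of the cutoff touches only one table.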

Update 2012-10-01: slides!

Lunch Break 

Luc Boudreau - Scaling Mondrian to Yahoo's Demands

Luc explains why and how Mondrian was scaled to run on top of 140 petabytes (compared to 140 years of HD video). Apart from the amount of data, security (through programmatic roles) and scalability turned out to be the main challenges.

On the scalability side, the specific topics that needed to be covered were caching, synchronization without locks and blocks, memory rollup, indexing and aggregation.

Vampire Luc, the second coming

Paul Stoellberger - Saiku Update

The main new features in Saiku 2.4 will be a switch to the Apache license, an updated Excel export with a summary sheet (contributed by Sergio Ramazzina, @sramazzina) and an explain plan.

Fun stuff that Paul has been working on are sparklines, heat grids, subtotals, parameters, new visualizations, and drilling. 

Almost as a 'One more thing', Paul mentioned and showed crosstabs in Saiku adhoc. Way cool! 

Look through the colored bands to see the sparklines (right) in Saiku 

Update 2012-10-01: slides!

Julian Hyde - OptiQ: a front-end for everything 

Julian shows how OptiQ allows you to query data through SQL from big data sources, or from two or more data sources.

"OptiQ does a lot of database-like stuff, but it is not a database."

OptiQ is a really, really smart JDBC driver, a framework and a data source management system. 

Thomas Morgner - "What's so hard about crosstabs? I could do it in a day! - Matt C"

Thomas only brought 1 slide, and shifts to demoing PRD crosstabs immediately.

Rendering a crosstab takes a (whoooole) lot of time and modifying the layout is still a very tedious -or as Thomas calls it- "developerish" task. This functionality is -imho- not ready for prime time, but it definitely is a step in the right direction. The Big Release is planned for the Sugar release, spring 2013.

Roland Bouman - xmla4js update

Roland gives us an update about his xmla4js project.
"We had to port HttpRequest to Node.js, because it didn't exist." Oh, it's just that...
xmla4js lets you create a thin client through REST, without having to deal with XML/A directly.
Roland goes on to show xmla4js as a browser XML/A command line tool to work on XML/A directly, or as a query tool.
xmla4js is also available as a BI server plugin.
Download xmla4js here.

Alain Debecker - Introducing OrgBox

OrgBox is a drag and drop ui to draw organization charts. Employees can be assigned to posts, files can be associated, KPIs can be identified for what-if scenarios etc. A tablet version of OrgBox is on the roadmap. 

OrgBox is not open source (tssss....) and not really stable yet. The executable of OrgBox is available for free, but if an extension is requested, there is a cost involved (cough up and/or provide a customer reference).     

A Mondrian schema can be run on top of OrgBox data.

Update 2012-10-17: presentation

Cees Van Kemenade - Community Dashboard Manager/Community Data Processor

CDM provides version control of dashboards, synchronization of multiple dashboards and support for multitenancy. For example, CDM can detect what changes have been made to a dashboard, and apply those changes to other dashboards.

CDP writes data from the BI server to databases, parameterizes SQL and code, and provides hot swappability of code, which Cees demos by hacking into one of the CDP files and showing the changes in a dashboard.

Next, Cees demonstrates version management tools in CDM (commit, diff, drop last commit, ...). 

WebDetails (Pedro Martins) - CDB - Community Data Browser

CDB aims to provide a central repository for your data sources, based on CDA.
In short, what CDB does is:

  • explore cubes, build queries
  • save queries (organized in groups and more)
  • use result sets in visualization tools (dashboards)

WebDetails (Pedro Vale) - CDC - Community Distributed Cache

CDC is a Hazelcast implementation that allows you to:

  • switch between the default and the CDC cache for CDA and Mondrian
  • add and remove cache nodes
  • selectively clear the cache of specific CDE dashboards
  • selectively clear the cache of specific schemas, cubes and dashboards (optionally from the outside, e.g. after running ETL)

With CDC, a cluster of caching servers can be put in place, to provide more caching memory than a single machine can offer and/or to take the memory load away from the BA server.

WebDetails (Pedro Alves) - CDV - Community Data Validator

After showing the WebDetails timeline, Pedro says that nothing annoys him more than a customer telling him that the data for their project is wrong, no matter what the reason is. This annoyance triggered the development of CDV.

CDV runs a set of tests, written in JavaScript, against your CDA data. These scripts can run on the server or the client side. Not only can CDV check whether your queries ran successfully or failed (with various error levels), you can also schedule tests, specify timeouts for queries and send out notifications.

After showing CDV, Pedro did a bit of freewheeling with upcoming CCC charts and other CTools work. 

Jos van Dongen / Aly van Zalk - Antonius Intelligence update/meta data driven dashboards

The previous sessions have taken more time than expected, so Aly and Jos promise to do a St. Antonius presentation on steroids in 5 instead of 15 minutes. 

Aly shows a number of dashboards that would have gotten an approving nod from Stephen Few. These dashboards allow users to click through to the individual patient record.

Jos explains how the hospital uses a KPI generating framework that is totally metadata driven. A management interface was written in WaveMaker.

Jos mentions a contribution to PRD by Slawo that lets you set a color for a given data value. This functionality was needed because patients in the ER get a triage color code. That color code needs to be represented in the chart, whether there is data for a given triage code or not.
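
The requirement behind that contribution can be sketched in a few lines of Python: keep a fixed mapping from data value to color, and emit a series for every known code, even when the data contains no rows for it. This is an illustration of the idea only, not PRD code. The triage codes and hex colors below are assumptions, as the talk did not list the hospital's actual codes.

```python
# Hypothetical triage codes with a fixed color each; order defines chart order.
TRIAGE_COLORS = {
    "red": "#d62728",
    "orange": "#ff7f0e",
    "yellow": "#ffdd57",
    "green": "#2ca02c",
    "blue": "#1f77b4",
}


def chart_series(counts):
    """Return (code, count, color) for every triage code, zero-filling gaps.

    Because the color is looked up by code rather than by series position,
    a code keeps its color even when other codes have no data.
    """
    return [(code, counts.get(code, 0), color)
            for code, color in TRIAGE_COLORS.items()]
```

With position-based coloring, a missing "orange" row would shift every later series one color up; binding the color to the value avoids that.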

Unplanned Coffee break and group picture

Slawomir Chodnicki - Kettle Plugin Development (hands on session)

After sending off everyone who is not interested in compiling Java code, about 15 people were left in the room.

Slawo starts by explaining that this will be a hands on session, but not a full-fledged class. 

The rest of this session involves writing and running Java code.