A look at what's new in Pentaho 8.0
On September 19, 2017, Hitachi introduced Hitachi Vantara, a new digital company committed to solving the world's toughest business and societal challenges. The Pentaho software is a key part of this new company. On November 16, 2017, Hitachi Vantara launched Pentaho 8.0, which adds real-time data processing to fast-track digital insights for enterprise customers. So what is this new release all about?
The first thing that catches the eye is the Hitachi rebranding: the user console and the applications have all been adjusted to the Hitachi color scheme.
Let's dive a little deeper into the technical enhancements.
AEL: Enhanced and Simplified
The Adaptive Execution Layer (AEL) is used to run transformations in different engines. The AEL translates the steps in your transformation to native operators in the engine you selected, for example Spark in a Hadoop cluster. This allows ETL code that was developed in PDI to run on Spark natively without modification. Compatibility is enabled for Spark libraries packaged with Cloudera, Hortonworks and Apache distributions. After Spark, support for other engines will be added in future releases.
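To make the idea of translating steps into engine-native operators a bit more concrete, here is a minimal Python sketch. It is not how Pentaho implements AEL: the step names, the `ENGINES` registry and the `run_transformation` helper are all hypothetical, and the two "engines" are plain in-process functions (one eager, one lazy) that merely stand in for the Kettle engine and Spark.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Row = Dict[str, object]
Step = Tuple[str, dict]  # (step type, settings); hypothetical, not PDI's own model
Operator = Callable[[Iterable[Row], dict], Iterable[Row]]


# "local" engine: eager operators that build full lists.
def filter_eager(rows, cfg):
    return [r for r in rows if r[cfg["field"]] >= cfg["min"]]


def uppercase_eager(rows, cfg):
    return [{**r, cfg["field"]: str(r[cfg["field"]]).upper()} for r in rows]


# "lazy" engine: generator-based operators, standing in for a remote engine.
def filter_lazy(rows, cfg):
    return (r for r in rows if r[cfg["field"]] >= cfg["min"])


def uppercase_lazy(rows, cfg):
    return ({**r, cfg["field"]: str(r[cfg["field"]]).upper()} for r in rows)


# Each engine maps the same step types to its own operators.
ENGINES: Dict[str, Dict[str, Operator]] = {
    "local": {"filter_rows": filter_eager, "uppercase": uppercase_eager},
    "lazy": {"filter_rows": filter_lazy, "uppercase": uppercase_lazy},
}


def run_transformation(steps: List[Step], rows: Iterable[Row], engine: str) -> List[Row]:
    """Look up each step's operator for the chosen engine and apply it in order."""
    operators = ENGINES[engine]
    for step_type, cfg in steps:
        rows = operators[step_type](rows, cfg)
    return list(rows)


# One transformation definition, written once.
TRANSFORMATION: List[Step] = [
    ("filter_rows", {"field": "amount", "min": 100}),
    ("uppercase", {"field": "country"}),
]

SAMPLE_ROWS: List[Row] = [
    {"country": "nl", "amount": 250},
    {"country": "be", "amount": 40},
    {"country": "de", "amount": 100},
]

if __name__ == "__main__":
    for engine in ENGINES:
        print(engine, run_transformation(TRANSFORMATION, SAMPLE_ROWS, engine))
```

The point of the sketch is that `TRANSFORMATION` is defined only once; choosing a different engine swaps the operators underneath it without touching the definition.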
Kafka and Streaming Ingestion in PDI
Pentaho added Kafka streaming and data publishing to PDI with a number of Kafka steps. Kafka was already available via input and output plugins in the marketplace, but Pentaho has now added steps of its own to PDI. With the 'Get records from stream' step you can connect to a streaming data source such as Kafka to process the records. This enables real-time processing, monitoring and aggregation. Other streaming sources will be added in the future.
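As a rough illustration of what aggregating a stream means, the Python sketch below cuts an incoming stream into small batches and totals each batch per key. It does not use Kafka or PDI at all: `record_stream`, `windows` and `windowed_totals` are hypothetical names, the record layout is made up, and the "stream" is a short in-memory generator standing in for a real, unbounded source.

```python
from collections import defaultdict
from itertools import islice
from typing import Dict, Iterable, Iterator, List, Tuple

Record = Tuple[str, int]  # (sensor id, reading); made-up sample schema


def record_stream() -> Iterator[Record]:
    """Stand-in for a streaming source; a real stream would not end."""
    yield from [("a", 3), ("b", 5), ("a", 4), ("b", 1), ("a", 10)]


def windows(stream: Iterable[Record], size: int) -> Iterator[List[Record]]:
    """Cut the stream into consecutive batches of at most `size` records."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def windowed_totals(stream: Iterable[Record], size: int) -> Iterator[Dict[str, int]]:
    """Yield one per-key total for every batch, as soon as the batch is complete."""
    for batch in windows(stream, size):
        totals: Dict[str, int] = defaultdict(int)
        for key, value in batch:
            totals[key] += value
        yield dict(totals)


if __name__ == "__main__":
    for totals in windowed_totals(record_stream(), size=2):
        print(totals)
```

Because results are emitted per batch instead of after the whole input has been read, downstream monitoring can react while the stream is still running.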
Big Data Security: Named Clusters and Knox Support
Pentaho added support for the Apache Knox Gateway, which simplifies Hadoop security management. This enhancement provides a secure, single point of access to Hadoop components on a cluster. Apache Knox is a gateway security tool that provides perimeter security for the Hortonworks Distribution of Hadoop services.
The biggest change in Pentaho 8 is the addition of worker nodes. Worker nodes can dynamically distribute and scale work items, such as PDI jobs, transformations and report executions, across multiple nodes.
Worker nodes make it possible to:
- Run PDI workloads at scale
- Coordinate and monitor the items sent to the worker nodes
The worker nodes, based on Lumada technology, contain two parts:
- the container framework based on Docker (the company driving the container movement)
- the orchestration framework based on Mesos (an open-source project to manage computer clusters) and Marathon (a container orchestration platform for Mesos)
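Distributing work items across nodes can be pictured with a small Python sketch. This is only an assumption-laden illustration, not the scheduling logic of Pentaho, Mesos or Marathon: the `distribute` function, the node names and the per-item cost estimates are hypothetical, and the strategy shown is a simple "give the next item to the least-loaded node" rule.

```python
import heapq
from typing import Dict, List, Tuple


def distribute(work_items: List[Tuple[str, int]], node_names: List[str]) -> Dict[str, List[str]]:
    """Assign each (name, estimated cost) item to the node with the least load so far."""
    # Min-heap of (accumulated load, node name); ties break on the node name.
    heap = [(0, name) for name in node_names]
    heapq.heapify(heap)
    assignment: Dict[str, List[str]] = {name: [] for name in node_names}
    for item, cost in work_items:
        load, node = heapq.heappop(heap)
        assignment[node].append(item)
        heapq.heappush(heap, (load + cost, node))
    return assignment


if __name__ == "__main__":
    # Hypothetical work items with made-up cost estimates.
    items = [("job_a", 5), ("trans_b", 2), ("report_c", 2), ("trans_d", 1)]
    print(distribute(items, ["node-1", "node-2"]))
```

In this toy run the expensive job occupies one node while the three cheaper items all go to the other, which is the kind of balancing an orchestration layer automates at a much larger scale.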
Filters to Inspect Your Data
Filters can now be added to the visualizations of your data within PDI:
- Drill Down
- Keep or Exclude Selected data
- Filters panel
Additional Big Data Formats
To extend the range of supported Big Data formats, Pentaho added support for Avro and Parquet.
Avro is an open source data format that provides data serialization and data exchange services for Apache Hadoop. Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem. Input/output transformation steps are provided to make the process of gathering raw data and moving data into the Hadoop ecosystem easier.
The Avro and Parquet steps can be used in transformations running on the Kettle engine or on the Spark engine via AEL.
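The practical difference between a row-oriented format such as Avro and a columnar format such as Parquet can be shown with plain Python data structures. The sketch below does not read or write real Avro or Parquet files; the sample records and the `to_columns` helper are hypothetical and only mimic the two layouts in memory.

```python
from typing import Dict, List

# Row-oriented layout: each record keeps all of its fields together.
ROWS: List[Dict[str, object]] = [
    {"id": 1, "city": "Utrecht", "amount": 10.0},
    {"id": 2, "city": "Gent", "amount": 5.5},
    {"id": 3, "city": "Utrecht", "amount": 4.5},
]


def to_columns(rows: List[Dict[str, object]]) -> Dict[str, List[object]]:
    """Pivot rows into a columnar layout; assumes every row has the same fields."""
    columns: Dict[str, List[object]] = {}
    for row in rows:
        for field, value in row.items():
            columns.setdefault(field, []).append(value)
    return columns


if __name__ == "__main__":
    columns = to_columns(ROWS)
    # An aggregate over one field only has to touch that single column.
    print(sum(columns["amount"]))
```

A row layout suits writing and exchanging complete records, while a column layout suits analytical queries that scan a few fields across many records.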