Easily load data to Neo4J with Pentaho Data Integration

Load data to Neo4J

Whether you’re a Neo4J rock star or are just getting your feet wet, the biggest problem you’re probably facing is getting your data into Neo4J as quickly and easily as possible. Of course you can create and import CSV files, but that process quickly becomes tedious and time-consuming. We think there is a better way.

As huge fans of both Kettle and Neo4J, we decided to bring the two together, and are proud to announce a new version of our Neo4J plugin.
This plugin allows you to do the data preparation and loading of your nodes and relationships, with their labels and properties, all from a visual development environment. 

We’ll assume you’re familiar with getting both Neo4J and Kettle up and running. If you’re not, check here to find out how to get started with Kettle, and here to get started with Neo4J.

To install the plugin from within Spoon, go to the Pentaho Marketplace through Tools → Marketplace. 


PDI - Install Neo4J Output from the marketplace


In the marketplace, search for ‘neo4j’ and click the ‘install’ button next to the ‘Neo4J Output’ plugin. 


PDI - Install Neo4J Output from the marketplace


The plugin will download and install, after which you'll need to restart Spoon. Once Spoon is back up, you’ll find a new ‘Neo4J Output’ step in the 'Output' category.


PDI - Neo4J Output step


A sample graph, cheers! 

Graphs are everywhere, and so is Belgian beer, so we didn't have to look hard for a sample graph to create through the plugin. We’ve recreated Neo4J rock star Rik Van Bruggen’s Beer Graph demo with the Neo4J Output step to show you how easy creating nodes and relationships can be. Get the ETL to create this sample graph here.

The sample job (jb_beer_graph.kjb) consists of two transformations:

  • tr_beer_nodes.ktr: create nodes
  • tr_beer_relationships.ktr: create relationships

The beer graph is created in two separate transformations (first nodes, then relationships). This ensures that overlapping transactions in the transformation that creates the relationships cannot produce duplicate nodes. An alternative approach would have been to create each of the three relationships in its own transformation. As always, there are many ways to skin a cat, so YMMV.
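The nodes-first, relationships-second ordering can be sketched outside of Kettle as well. The Python snippet below is only an illustration of the idea, not what the plugin does internally: it builds plain Cypher statements, and the input field names (`brand`, `brewery`), the `Brewery` label and the `BREWS` relationship type are assumptions made for the example.

```python
def two_phase_statements(rows):
    """Return (cypher, params) pairs: all node statements first, then relationships.

    `rows` is a list of dicts with hypothetical 'brand' and 'brewery' keys.
    """
    node_statements = []
    seen = set()
    # Phase 1: one MERGE per distinct node, so every node exists exactly once
    # before any relationship is created.
    for row in rows:
        for label, key in (("BeerBrand", "brand"), ("Brewery", "brewery")):
            node = (label, row[key])
            if node not in seen:
                seen.add(node)
                node_statements.append(
                    (f"MERGE (:{label} {{name: $name}})", {"name": row[key]})
                )

    # Phase 2: relationships only MATCH existing nodes, so this phase
    # can never create a node, duplicate or otherwise.
    relationship_statements = [
        (
            "MATCH (b:Brewery {name: $brewery}), (a:BeerBrand {name: $brand}) "
            "MERGE (b)-[:BREWS]->(a)",
            {"brewery": row["brewery"], "brand": row["brand"]},
        )
        for row in rows
    ]
    return node_statements + relationship_statements


# Illustrative input only; the brewery and second beer are placeholders.
rows = [
    {"brand": "Orval", "brewery": "Example Brewery"},
    {"brand": "Another Beer", "brewery": "Example Brewery"},
]
for cypher, params in two_phase_statements(rows):
    print(cypher, params)
```

Because the relationship statements use MATCH rather than MERGE for their endpoints, a missing node results in no relationship instead of a silently created duplicate.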

You'll need to add key/value pairs to your kettle.properties, for example:

# the BOLT protocol port (default 7687), not the browser port (default 7474)
NEO4J_PORT=7687

After running the job, the graph can be queried from the Neo4J browser (e.g. http://localhost:7474) with the query below, which reads as "give me all nodes that have a 'BeerBrand' label and a 'name' property of 'Orval'":


Neo4J - query the beer graph for "Orval"
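For readers who cannot see the screenshot, a query matching that description could look like the Cypher built in the sketch below. This is a reconstruction from the description above, not a copy of the original query. The Python wrapper is a minimal sketch: `node_lookup_query` and `run_query` are hypothetical helper names, and the URI and credentials are placeholders to adapt to your own instance.

```python
import re


def node_lookup_query(label, prop, value):
    """Build a Cypher query matching nodes by label and a single property."""
    # Labels and property keys cannot be passed as Cypher parameters, so they
    # are validated and interpolated; the value itself is sent as a parameter.
    for identifier in (label, prop):
        if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", identifier):
            raise ValueError(f"unsafe identifier: {identifier!r}")
    query = f"MATCH (n:{label} {{{prop}: $value}}) RETURN n"
    return query, {"value": value}


def run_query(query, params, uri="bolt://localhost:7687",
              auth=("neo4j", "password")):
    """Run the query over BOLT; needs the third-party `neo4j` driver package."""
    from neo4j import GraphDatabase  # imported lazily so the builder works without it

    driver = GraphDatabase.driver(uri, auth=auth)
    try:
        with driver.session() as session:
            return [record["n"] for record in session.run(query, params)]
    finally:
        driver.close()


query, params = node_lookup_query("BeerBrand", "name", "Orval")
print(query)   # MATCH (n:BeerBrand {name: $value}) RETURN n
print(params)  # {'value': 'Orval'}
```

In the Neo4J browser the same lookup can be typed directly with the value inlined: MATCH (n:BeerBrand {name: 'Orval'}) RETURN n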


This query will return the node for the delicious Orval beer. By double-clicking the node, you'll expand its brewery, beer type and alcohol percentage, which will look very similar to the graph below:


Neo4J - Beer graph - query results


Creating nodes and relationships

Using the plugin is straightforward. First, set the connection properties and verify that the connection works. Then select the label, relationship and property fields for your nodes and relationships. By default, properties for nodes and relationships use the field name as the property name, but this can be overridden by manually entering a value in the 'Property Name' field.


PDI - Neo4J output step configuration


Detailed documentation about how to use this step can be found on GitHub.


Try the step, beat the hell out of it and let us know if you find any issues, by mail, Twitter or directly on GitHub.
