Loading data to Neo4J

Update 2017-09: a PDI 'Neo4J Output' step was released to make this process easier. Find out more!

In this blogpost I will describe how to load data and work with Neo4j.

This includes:

  • Inserting data
  • Selecting data from a Neo4j database
  • Performing lookups

Neo4j is a fast and open source  graph database. Graph databases bring a great new way and interesting perspectives to analyze your data. In short this is how graph databases work: compared to relational databases, data is stored in nodes rather than tables. Relationships between nodes are created in a free-form way instead of defined within the data itself as in a relational database. These relationships allow complex hierarchies to be traversed quickly, which is one of the advantages over traditional relational databases. A graph can be represented visually as below:

Nodes can have relationships with each other and other node types. In regular database terms we would be talking about self joins and join to other tables.

The real power of graph databases is that they are true relational databases: regular databases claim to be relational but require costly joins to create relationships. If you, for example, want to calculate the shortest route between two airports ‘A and B,’ you need to create several joins, or need a table that contains all possible paths. Relationships in Neo4j can contain additional information, properties, which can be compared to using a join table in a traditional database. 

In the example data, we will be working with airports and flights. The airports will be nodes, and flights between airports will be created as relationships between airports (origin and destination). There is also a direct relationship between airports to represent the distance between them, as shown in the image above. 

As most databases Neo4j has a JDBC driver which can be used to access the data. This means that some of the steps that are available in PDI will be usable with Neo4j. Neo4j, like most NoSQL solutions, use a specficic query language. Neo4j uses a language called Cypher, which is a declarative and easily readable query language. Learning Cypher from a SQL background shouldn't be too hard. There are no dedicated steps for Neo4j in PDI (yet), but we can do a great deal of work using the default PDI steps.

Writing data to Neo4J

Inserting records into Neo4j isn’'t possible with the default insert/update and table output steps. Most NoSQL databases have bulk loader steps available in PDI, which is not the case for Neo4j. The best and fastest way to insert data into Neo4j is using the bulk loading capabilities through the “command line, orchestrated through PDI.

  1. write all the data you wish to insert to a csv file
  2. wait for the csv file to finish processing
  3. create a bulk insert statement
    1. for example the statement I have to insert airport
    2. LOAD CSV WITH HEADERS FROM "file:///tmp/airport.csv" AS csvLine CREATE (a:Airport {name: csvLine.airport});
  4. use a SQL row executor to trigger the query
    1. important! the commit size has to be changed to 0 Neo4J does not have a commit causing the step to wait infinitely

Reading from Neo4j

Because of PDI's flexible data architecture and the Neo4J JDBC driver, reading data from Neo4j is a painless process, since the default Table input step can be used to run Cypher queries. In these samples, flight and delay data provided by the FAA was used.

The query below returns all the airports available in this graph database.

MATCH (f:Flight) RETURN f

There are no conditions in this query, it's a simple '‘select * from flights'’. Since Neo4j returns the data in JSON format, it can be streamed from the select query to a JSON input step and be processed as any other data stream in PDI.

Performing Lookups

Performing lookups is an important part of any data integration process. Since Neo4J doesn't process the SQL this step uses behind the scenes, the database lookup step can't be used with Neo4j. An alternative desgin pattern in PDI, familiar to most users, is the combination of an input stream, based on a Neo4J Table Input in this case, and a Stream Lookup step.

This approach allows common PDI functionality to work with Neo4J, easy to develop, easy to maintain. Remember that all data used with the Stream Lookup needs to be loaded in memory. 

Table Input

The Table input step can be (ab)used to perform database lookups. In this case, the Cypher query use was:

MATCH(a:Airport) WHERE a.name = ? RETURN a

To avoid errors when running this transformation, arguments need to be passed to Cypher in the format "{#}", where # is the number of the argument, which make the final query look like:

MATCH(a:Airport) WHERE a.name = {1} RETURN a

Dynamic SQL row step

An alternative approach to perform a lookup is to create a query and use the Dynamic sql row step to retrieve the data. The full Cypher statement can be generated and executed through this step. The only thing to keep in mind is to replace the prepared statement variable with an actual value.

Querying data in Neo4J

Now, let's take a look at the data  loaded to Neo4j. Neo4j comes with a web based query and visualization tool. To find the distance between JFK and SFO, the cypher query below could be used to lookup the direct distance between these airports:

MATCH (a:Airport{name:'JFK'})-[dist:DISTANCE]->(b:Airport{name:'SFO'}) RETURN dist.distance

This will return 2586. There is another way to do the exact same query by using the where clause, which also returns 2586mi. 

MATCH (a:Airport)-[dist:DISTANCE]->(b:Airport)WHERE a.name = 'JFK' and b.name = 'SFO'RETURN dist.distance

Another question one could ask is: 'What is the distance between JFK and SFO when no direct link is available. Show all the options we have when using 1 extra stop'.

MATCH airports=(a:Airport{name:'JFK'})-[dist:DISTANCE*2..4]->(b:Airport{name:'SFO'}) RETURN nodes(airports) as links, reduce(acc=0, n in dist| acc + toInt(n.distance)) as total_distance ORDER BY total_distance

The visual result for the query above is the following. This ability to find chained relationships (the shortest route from JFK to SFO is when no direct link is available) shows the power of a graph database. 

By default, results are shown visually. The example below shows the top 10 airports and the relationships (distance) between them.

This post aimed to introduce Neo4J and graph databases in general, and how they can be used from Pentaho Data Integration.

Sample transformations and sample data to load data into Neo4J can be found here. Questions and remarks are welcome in the comments.