Imagine the power of a Distributed Graph Database engine with the flexibility of a Document Database all in one product. This is OrientDB - http://orientdb.com/
This blogpost will give you a brief introduction in OrientDB and using it with Pentaho Data Integration. After reading this blog post, you will be able to judge if the sales pitch quoted above is correct.
OrientDB is a document-graph database, with the flexibility and speed of a document store combined with the advanced relationship features of a graph database.
A well known example of a document database is MongoDB. In a document database, each record and its associated data is thought of as a document. A document stores everything related to a database object together. The main advantage is that data can be stored unstructured, semi-structured or even mixed-structured.
The purpose of a graph database is explained in the blog post 'Loading data to Neo4j': a system that describes information (nodes) and the relationship between the information (edges). Since the graph database stores the relationships between data it is easy to execute queries that are very difficult in a relational database.
OrientDB combines the flexible data structure and the storage of the relationships in its database. OrientDB has all the native features of a graph database with the speed and flexibility of a document store database.
The key elements in OrientDB are document, vertices, edges, links and properties. Documents and vertices are the primary storage containers of data. Unstructured and inherent data can be represented via documents. Each document can have a different set of properties and one document can be embedded in another. In OrientDB, each document belongs to a class which enables you to create a schema in your document. A vertex is a storage for a graph similar to a document. A vertex can store data, contain properties and can be queried. But, different from a document, it also has incoming and outgoing edges to define the relationships to other vertices in the graph.
Edges and links stand for the relationships in OrientDB. Where edges are the connection between two vertices, links are the connection between two documents. An edge provides the fast performance when traversing data. Edges are stored as a document, where the document class is a member of the E class. Each edge has an incoming and an outgoing vertex to define its relationship.
Documents, vertices and edges can all have properties. Each property has a unique name and a data type. Properties can supply restrictions and requirements to their element such as min/max, mandatory, read-only.
Clusters provided the storage for a group of records, the default being one cluster per class. Multiple clusters can be defined per class. The clusters provide the storage flexibility, they can also be used for optimized queries. In general they are useful for storage, scaling and archiving purposes.
However, the main question remains: how is data queried? Unlike Neo4j, OrientDB does not use Cypher (or any other proprietary query language) to write queries. OrientDB uses SQL to make queries, mainly for its familiarity among developers. Using the SQL language results in a faster onboarding and less training.
Installation of OrientDB
One can choose between two editions to install OrientDB. The Community Edition is a free version, the Enterprise Edition requires a subscription. Both version are developed by the same team and company. The Enterprise Edition serves as an extension of the Community Edition but provides additional features such as: Query profiler, Live Monitoring in Workbench, etc, but both have the same functionalities. The Community Edition was used for this blog post.
The latest version can be downloaded from the OrientDB download page, OrientDB 2.0.12 Community Edition was used for this post. After downloading the software, simply unzip , navigate to the bin directory and launch the server script <SCRIPT> to launch the server.
There are different ways to work with OrientDB: studio, console, http requests or integrated in other platforms. After launching the server script, the OrientDB Studio is available at <SERVER>:2480.
For more information regarding the installation and how to get started with OrientDB, visit the OrientDB website. In the learn section you can find a brief tutorial explaining the basics for OrientDB.
Use in PDI
On the download page of OrientDB you can download the JDBC + OrientDB libraries. Add the orientdb-jdbc-2.0.12-all jar file to the lib folder of your Pentaho data integration installation folder and Run PDI.
For this blogpost the VehicleHistoryGraph database is used. A Connection is set up with following parameters:
PDI was designed to easy work with data from different datasources even if they are not supported. You can use the default Table input step, insert a query and run it. In the example below we want to retrieve all persons called Ann. As you can see the query is a simple select statement.
When retrieving the Select query via the Get SQL select statement-button all fields are included except for the OrientDB fields:@rid, @version and @ class. When including this field manually in the select statement the whole vertex is returned. as showed in the figure below.
The Record ID of @rid is the physical position of the record inside the database. This means that loading a record by its @rid is blazing fast.
A lookup can be performed by simply using the Database lookup step. Notice that the OrientDB @ fields cannot be retrieved via this step.
The Table Output step can be used to write to OrientDB. OrientDB automatically adds the OrientDB rid, version and class when written to.
In this example we have added Bette Davis via the ETL in the database. When executing the query: "select * from Person where fullName = 'Bette Davis'" OrientDB returns the following result.
This was a brief introduction into OrientDB. Feel free to post questions or remarks in the comment section below.
The sample transformations can be found here.