What is a graph database?
Although graph theory has been around for centuries, graph databases started to appear relatively recently.
‘Traditional’ relational databases store data in tables. These tables have a fixed format (fixed number of columns, each with a fixed data type). Tables are linked through the primary key in one table and the corresponding foreign keys in other tables. When a query is executed, the database engine fetches the primary keys from one table and links them to the corresponding foreign keys in other tables through SQL joins.
This works well for inserting, selecting, updating and deleting individual records in one or a small number of tables (CRUD operations). Performing joins during query execution over large schemas and data volumes, however, is expensive and slow.
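To make the key matching concrete, here is a minimal sketch using Python's built-in sqlite3 module. The person, company and employment tables and their contents are made up for illustration; the point is only that the link between a person and a company does not exist as stored data and is rebuilt by the joins every time the query runs.

```python
import sqlite3

# Hypothetical schema: 'employment' holds foreign keys to 'person' and 'company'.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE company (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE employment (
        person_id INTEGER NOT NULL REFERENCES person(id),
        company_id INTEGER NOT NULL REFERENCES company(id)
    );
    INSERT INTO person VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO company VALUES (10, 'Acme');
    INSERT INTO employment VALUES (1, 10), (2, 10);
""")

# Each JOIN matches a primary key to a foreign key at query time.
rows = conn.execute("""
    SELECT p.name, c.name
    FROM person AS p
    JOIN employment AS e ON e.person_id = p.id
    JOIN company AS c ON c.id = e.company_id
    ORDER BY p.name
""").fetchall()

print(rows)  # [('Alice', 'Acme'), ('Bob', 'Acme')]
```

With two rows this is instant; the cost discussed above only shows up when many such joins run over large tables.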
Property graph databases (like the one offered by market leader Neo4j) store data in nodes with labels and properties instead of in tables and columns. Rather than fitting all data into a fixed, predefined table structure, each node is a separate instance whose structure can differ from that of other nodes.
This schema-less structure offers more flexibility than a relational table. On top of that flexibility, graph databases treat relationships as first-class citizens: instead of being built during query execution, relationships are persisted in the database. Having all relationships stored with the data not only allows graph databases to significantly outperform relational databases on relationship-heavy queries, it also enables use cases that are impractical in relational databases.
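As a rough illustration of the property graph model, the sketch below represents nodes and relationships as plain Python objects. The Node and Relationship classes, the labels and the sample data are all assumptions made for this example; this is not how Neo4j or any other product stores data internally.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    labels: set
    properties: dict = field(default_factory=dict)


@dataclass
class Relationship:
    start: Node
    type: str
    end: Node
    properties: dict = field(default_factory=dict)


# Nodes with the same label do not need the same set of properties.
alice = Node({"Person"}, {"name": "Alice", "age": 34})
bob = Node({"Person"}, {"name": "Bob"})  # no 'age' property
acme = Node({"Company"}, {"name": "Acme", "sector": "Retail"})

# Relationships are stored as data, not derived at query time.
relationships = [
    Relationship(alice, "KNOWS", bob, {"since": 2015}),
    Relationship(alice, "WORKS_FOR", acme),
]


def neighbours(node, rel_type):
    """Follow stored relationships of one type that start at the given node."""
    return [r.end for r in relationships if r.start is node and r.type == rel_type]


print([n.properties["name"] for n in neighbours(alice, "KNOWS")])  # ['Bob']
```

The linear scan in neighbours keeps the example short; a real graph database would look up a node's relationships directly rather than scanning all of them.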
In this post, we’ll have a look at a couple of analytical use cases with graph databases.
Social network analysis
Social networks are everywhere, with LinkedIn and Facebook as the most popular examples. Social networks are natural graph implementations.
In these social networks, people, companies, skills, interests etc. are stored as nodes, each with a varying number of properties. A table that stores all possible properties of a ‘person’ would require either a huge number of (very sparsely populated) columns or a significant number of related tables. In a graph, a ‘person’ node only needs to contain the properties that are actually available for that person.
Relationships can be created between nodes, e.g. ‘knows’, ‘works for’ or ‘lives in’ relationships.
In a relational database, finding the people someone knows would require something like a self join on a person table. Although this is doable for the first-level network, it becomes a lot harder for the typical FOAF (friend of a friend) analysis. Although theoretically possible, a query to find 3rd, 4th or 5th level relationships (e.g. 4 self joins on the person table) in a network of significant size would be extremely slow, if it didn’t bring the database to its knees.
In a graph database, where both the nodes and their relationships are stored, and with a query language like Cypher that has built-in support for path finding, this kind of query is straightforward to write and typically much cheaper to run.
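The traversal behind a friend-of-a-friend query can be sketched as a breadth-first search over stored ‘knows’ relationships. The KNOWS adjacency map and the people_at_distance helper below are hypothetical and only illustrate the idea; in Cypher this kind of traversal is normally expressed with a variable-length pattern rather than hand-written code.

```python
from collections import deque

# Hypothetical 'knows' relationships, stored per person (undirected).
KNOWS = {
    "Alice": {"Bob", "Carol"},
    "Bob": {"Alice", "Dave"},
    "Carol": {"Alice", "Dave"},
    "Dave": {"Bob", "Carol", "Erin"},
    "Erin": {"Dave"},
}


def people_at_distance(graph, start, depth):
    """Return people whose shortest 'knows' path from start is exactly depth hops."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        person = queue.popleft()
        if distance[person] == depth:
            continue  # no need to look beyond the requested level
        for friend in graph.get(person, ()):
            if friend not in distance:
                distance[friend] = distance[person] + 1
                queue.append(friend)
    return {p for p, d in distance.items() if d == depth}


print(sorted(people_at_distance(KNOWS, "Alice", 1)))  # ['Bob', 'Carol']
print(sorted(people_at_distance(KNOWS, "Alice", 2)))  # ['Dave']
print(sorted(people_at_distance(KNOWS, "Alice", 3)))  # ['Erin']
```

Each step only touches the relationships of the people already reached, so going one level deeper does not require another pass over the whole person set, which is what an extra self join amounts to.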
Fraud schemes, whether they target banks, insurers, ecommerce or other sectors, are typically carried out by setting up ‘rings’ of fake accounts, accidents, transactions etc.
As with social network analysis, querying the entire database for fraudulent accounts, transactions etc. (entity link analysis) requires a huge number of joins and self joins that are hard to build and expensive to run, and this only gets worse as the number of accounts (in the database and in the fraud ring) grows. Since fraud needs to be detected as soon as possible, heavy and slow queries are unaffordable.
Graph databases are built to work with relationships, and query languages like Neo4j’s Cypher have built-in semantics for detecting rings by traversing the graph, which makes it possible to navigate the graph and detect fraud in memory, in real time.
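One simple way to think about ring detection is to link accounts through the identifiers they share and then collect everything that is reachable from each account. The sketch below does this in plain Python; the account records, the field names and the find_rings helper are invented for illustration, and real fraud detection would use far richer signals than a shared phone number or address.

```python
from collections import defaultdict

# Hypothetical account records; every field name and value here is made up.
ACCOUNTS = {
    "acc1": {"phone": "555-0101", "address": "1 Main St"},
    "acc2": {"phone": "555-0101", "address": "9 Oak Ave"},
    "acc3": {"phone": "555-0199", "address": "9 Oak Ave"},
    "acc4": {"phone": "555-0142", "address": "3 Elm Rd"},
}


def find_rings(accounts, min_size=2):
    """Group accounts linked, directly or transitively, by a shared identifier."""
    # Undirected graph with account nodes and (kind, value) identifier nodes.
    graph = defaultdict(set)
    for account, identifiers in accounts.items():
        for kind, value in identifiers.items():
            identifier = (kind, value)
            graph[account].add(identifier)
            graph[identifier].add(account)

    seen = set()
    rings = []
    for account in accounts:
        if account in seen:
            continue
        # Depth-first traversal collects everything reachable from this account.
        stack = [account]
        component = set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            if node in accounts:
                component.add(node)
            stack.extend(graph[node] - seen)
        if len(component) >= min_size:
            rings.append(sorted(component))
    return rings


print(find_rings(ACCOUNTS))  # [['acc1', 'acc2', 'acc3']]
```

Here acc1 and acc3 share nothing directly, but both are tied to acc2, which is exactly the kind of indirect link that is awkward to express with joins and natural to find by traversal.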
We’re all familiar with “Customers who bought ‘x’ also bought ‘y’” suggestions in online shops, friend recommendations on Facebook and “Who to follow” on Twitter.
Graph databases make it possible to create numerous relationships between nodes without impacting the model. For example, if several people (person nodes) have the same 'located in' relationship and all like sushi (have 'like' relationships to sushi restaurants), it is easy to recommend to each of them the sushi restaurants in town that their peers have visited but they haven't yet discovered.
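The sushi example can be sketched as a two-hop lookup: from a person to the peers in the same town, and from those peers to the restaurants they like. The LOCATED_IN and LIKES maps, the names in them and the recommend function are hypothetical, and the sketch assumes every restaurant listed is a sushi restaurant.

```python
from collections import Counter

# Hypothetical 'located in' and 'like' relationships; all names are made up.
LOCATED_IN = {"Alice": "Ghent", "Bob": "Ghent", "Carol": "Ghent", "Dave": "Antwerp"}
LIKES = {
    "Alice": {"Sushi Palace"},
    "Bob": {"Sushi Palace", "Tokyo Table"},
    "Carol": {"Tokyo Table", "Maki House"},
    "Dave": {"Harbour Sushi"},
}


def recommend(person):
    """Rank restaurants liked by peers in the same town that person has not liked yet."""
    town = LOCATED_IN[person]
    already_liked = LIKES.get(person, set())
    counts = Counter()
    for peer, peer_town in LOCATED_IN.items():
        if peer == person or peer_town != town:
            continue
        for restaurant in LIKES.get(peer, set()) - already_liked:
            counts[restaurant] += 1
    # Most popular among peers first; ties broken alphabetically.
    return sorted(counts, key=lambda r: (-counts[r], r))


print(recommend("Alice"))  # ['Tokyo Table', 'Maki House']
print(recommend("Dave"))   # []
```

Adding a new kind of relationship, such as 'visited' or 'rated', would only mean adding another map here, which mirrors the point that new relationships do not require changing the model.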