
GRAPH DATABASES

Graph databases have started to grab a lot of attention in recent years. With the Panama Papers investigation, the perception may have grown that graph databases are mainly used for fraud detection and similar explorative use cases, but is that really the case? Let’s dive into what graph databases are exactly, what their use cases are and what they can do for you.

Graph database market is growing

  • 2019: $1.0 billion
  • 2024: $2.9 billion
  • 2026: $3.7 billion

Proven Graph Technology Use Cases

  • Machine Learning
  • Fraud Detection
  • Regulatory Compliance
  • Identity and Access Management
  • Supply Chain Transparency
  • And many more

Graph hype

The graph paradigm goes well beyond databases and application development; it’s a reimagining of what’s possible around the idea of connections. And just like any new problem-solving framework, approaching a challenge from a different dimension often produces an orders-of-magnitude change in possible solutions.

Let's Get Started

Just want to see us in action?

See for yourself which insights a graph database can provide. Unlike other databases, relationships take first priority in graph databases. We at know.bi like to adopt this philosophy with our customers. Fill out the form to request a demo customized for your specific needs, or give us a call.


Introduction

What are graphs?

A Brief History of Graph Databases 

Graphs aren’t new. In 1736 (that’s right, almost 300 years ago), the Swiss mathematician Leonhard Euler published his paper on the seven bridges of Königsberg. Euler was not only a mathematician, he was also a physicist, astronomer, geographer, logician and engineer.
 
In this paper, which is considered the mathematical foundation of graph theory, Euler tackled what may seem a trivial problem: the city of Königsberg (now Kaliningrad, Russia) was set on both sides of the Pregel river. Euler needed to find a walk through the city that would cross each of the city’s 7 bridges exactly once.
To find this ideal walk, Euler divided the city into a number of land masses (vertices) which are connected by bridges (edges).

"Although Euler proved that no such walk exists, he effectively turned the entire city of Königsberg into the first ever mathematical graph."

 

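The Königsberg problem described above can be sketched in a few lines of Python. The sketch relies on the well-known criterion that a connected graph has a walk using every edge exactly once only if zero or two of its vertices have an odd number of edges. The land-mass names A to D and the function names are labels chosen for this illustration.

```python
from collections import Counter

# The four land masses (vertices) and seven bridges (edges) of Königsberg.
# "A" is the central island; the letter names are made up for this sketch.
BRIDGES = [
    ("A", "B"), ("A", "B"),   # two bridges between the island and one bank
    ("A", "C"), ("A", "C"),   # two bridges between the island and the other bank
    ("A", "D"), ("B", "D"), ("C", "D"),
]


def odd_degree_vertices(edges):
    """Return the vertices touched by an odd number of edges."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return sorted(v for v, d in degree.items() if d % 2 == 1)


def has_euler_walk(edges):
    """Check the odd-degree criterion (assumes the graph is connected)."""
    return len(odd_degree_vertices(edges)) in (0, 2)


print(odd_degree_vertices(BRIDGES))  # ['A', 'B', 'C', 'D']
print(has_euler_walk(BRIDGES))       # False
```

All four land masses have an odd number of bridges, so no walk can cross each bridge exactly once.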

What is a graph database?

Types of graph databases 

With the basic understanding of what a graph is, let’s have a look at how this translates to graph databases.

There are a number of graph implementations we’ll look at in some detail, but remember this is not an exhaustive list: there are other implementations, and some databases have graph additions bolted on to their relational engine. The bottom line, however, is that graph database popularity is skyrocketing.


By Storage Model

  • Native Graph Storage: these graph databases have been designed from the ground up to work with graphs (e.g. TigerGraph, Neo4j).
  • Relational Storage: data is stored in a relational(-ish) model and is transformed into a graph at runtime, i.e. when queried (e.g. GraphX).
  • Key-Value Store: similar to relational storage, but with a key-value or other NoSQL store as the underlying persistence layer (e.g. JanusGraph).

Native graph databases, being designed from the ground up to work with graphs, generally outperform graph databases with other storage types on traversal-heavy workloads.


By Data Model

Labeled Property Graphs

In a labeled property graph, data is organized as nodes and relationships, both of which can contain properties (key-value pairs). 

Nodes can be tagged with a number (0 or more) of labels to represent different roles in a graph or business domain. 

Relationships provide directed (see “Graph Theory”), named and semantically relevant connections between two nodes. Just like nodes, relationships can have properties, which can add weight or cost to a relationship. When, for example, you’re trying to find the shortest route between two nodes, it may be more efficient to follow a path that leads through three low-cost (e.g. short distance) relationships instead of one costly (e.g. long distance) relationship. Although relationships are created with a direction, this direction can be ignored when traversing the graph.
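The trade-off between one costly relationship and several cheap ones can be illustrated with Dijkstra's algorithm over a tiny in-memory property graph. This is a minimal sketch: the place names, the "distance" property and the function name are made up for the example and do not correspond to any particular graph database API.

```python
import heapq

# Relationships as (start, end, properties); "distance" plays the role of cost.
RELATIONSHIPS = [
    ("Home", "Office", {"type": "ROAD", "distance": 10}),
    ("Home", "Park", {"type": "ROAD", "distance": 2}),
    ("Park", "Station", {"type": "ROAD", "distance": 3}),
    ("Station", "Office", {"type": "ROAD", "distance": 2}),
]


def cheapest_path(relationships, source, target, cost_key="distance"):
    """Dijkstra's algorithm; relationship direction is ignored on traversal."""
    neighbours = {}
    for start, end, props in relationships:
        neighbours.setdefault(start, []).append((end, props[cost_key]))
        neighbours.setdefault(end, []).append((start, props[cost_key]))
    queue = [(0, source, [source])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == target:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for nxt, weight in neighbours.get(node, []):
            if nxt not in visited:
                heapq.heappush(queue, (cost + weight, nxt, path + [nxt]))
    return None  # no route between source and target


print(cheapest_path(RELATIONSHIPS, "Home", "Office"))
# (7, ['Home', 'Park', 'Station', 'Office'])
```

The three-hop route costs 7, which beats the single direct relationship with a cost of 10.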

Examples of labeled property graph databases are Neo4j and AWS Neptune.

Hypergraph

A relationship in a hypergraph can connect any number of nodes. This model is especially useful for data that contains a large number of many-to-many relationships. A hypergraph can always be represented as a labeled property graph, but the reverse is not always true.

An example of a hypergraph database is HyperGraphDB.
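One common way to express a hypergraph relationship in a labeled property graph is to turn the relationship itself into a node and connect it to each participant with an ordinary two-node relationship. The sketch below assumes this approach; the "INVOLVES" relationship type, the dictionary layout and the function name are hypothetical.

```python
def hyperedge_to_property_graph(edge_id, edge_type, members):
    """Represent one hyperedge as an extra node plus one relationship per member."""
    # The hyperedge becomes a node labeled with its type.
    nodes = [{"id": edge_id, "labels": [edge_type], "properties": {}}]
    # Each participant is a plain node.
    nodes += [{"id": member, "labels": [], "properties": {}} for member in members]
    # One binary relationship links the hyperedge node to each participant.
    relationships = [(edge_id, "INVOLVES", member) for member in members]
    return nodes, relationships


nodes, relationships = hyperedge_to_property_graph(
    "meeting-1", "Meeting", ["ann", "bob", "cem"]
)
print(relationships)
```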

Triple Store (RDF)

A triple store or RDF (Resource Description Framework) stores data as triples. A triple (e.g. “Bob is 35”, “Bob knows John”) consists of 

  • the subject: the object or concept the triple provides information about (“Bob”)
  • the predicate: describes how the subject relates to the object (“is” or “knows”)
  • the object: the value or resource the statement points to (“35” or “John”)

A triple can be compared to a relationship in a labeled property graph: the subject is the start node, the object is the end node (or a literal value), and the predicate is the ‘arc’, i.e. the type of relationship, that connects them.

Since arcs create logically linked triples or nodes, triple stores are considered graph databases. However, since their architecture is oriented towards individual triples, they are not as well suited for fast graph traversal as native graph databases, especially property graphs.

Examples of triple stores are AWS Neptune, AllegroGraph and Stardog.


How are graph databases different from relational databases?

Relational databases store data in tables. These tables consist of a highly structured, predefined set of columns with strict data types. 

Relationships are defined as a combination of columns that serve as row identifiers in one table (primary keys) and references to those row identifiers in other tables (foreign keys). Relationships in relational databases are not stored as objects in the database, but are built at runtime (through JOIN statements in queries). Because relationships do not exist as database objects, they can’t contain any additional meta-information. Building relationships at runtime is expensive, which makes it hard to work with highly connected data.

In short, the limitations of relational databases in highly connected use cases are: 

  • Rigidity: the fixed structure of relational database tables makes it hard to work in today’s agile environment, with quickly changing business requirements
  • No “relationship” concept: in today’s connected world, information about relationships between data points often is more important than the data itself. Relational databases haven’t been designed to capture this relationship information.
  • Performance becomes a problem when working with highly connected data. Large numbers of (self) joins can bring performance to its knees, and are one of the symptoms of what is known as “SQL Strain”.
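The difference between rebuilding relationships with joins and following stored relationships can be sketched with Python's built-in sqlite3 module. This is an illustration only, not a benchmark: the table name, the sample names and both function names are invented for the example.

```python
import sqlite3

# Directed "knows" pairs used by both approaches.
FRIENDSHIPS = [("Ann", "Bob"), ("Bob", "Cem"), ("Cem", "Dee"), ("Ann", "Eve")]


def friends_of_friends_sql(person):
    """Relational approach: the relationship is rebuilt with a self-join."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE knows (person TEXT, friend TEXT)")
    con.executemany("INSERT INTO knows VALUES (?, ?)", FRIENDSHIPS)
    rows = con.execute(
        "SELECT DISTINCT k2.friend FROM knows k1 "
        "JOIN knows k2 ON k2.person = k1.friend "
        "WHERE k1.person = ? ORDER BY k2.friend",
        (person,),
    ).fetchall()
    con.close()
    return [row[0] for row in rows]


def friends_of_friends_graph(person):
    """Graph approach: follow stored adjacency, one hop at a time."""
    adjacency = {}
    for a, b in FRIENDSHIPS:
        adjacency.setdefault(a, []).append(b)
    result = set()
    for friend in adjacency.get(person, []):
        result.update(adjacency.get(friend, []))
    return sorted(result)


print(friends_of_friends_sql("Ann"))    # ['Cem']
print(friends_of_friends_graph("Ann"))  # ['Cem']
```

Every additional hop requires one more self-join in the SQL version, whereas the graph version simply repeats the same lookup.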

The graph database landscape

Graph databases represented a $1 billion market in 2019, and the market is projected to grow to almost $3 billion by 2024.

Below is an overview of the 5 largest (pure) graph databases according to DB-Engines. The DB-Engines score is calculated based on the number of mentions on search engines, Google Trends, the number of questions on StackOverflow, relevance in job offerings and more.

Neo4j

Neo4j is the absolute market leader in the graph market. The platform provides native graph storage and processing, an extensive library of graph algorithms, clustered deployments and much more. 

JanusGraph

JanusGraph is a graph database optimized for distributed clusters. It runs on top of distributed NoSQL/key-value storage engines like Cassandra, HBase or Google Bigtable.

DGraph

DGraph is a horizontally scalable transactional graph database with fast arbitrary-depth joins using a GraphQL-like query language. 

Giraph

Apache Giraph is an iterative graph processing system built for high scalability.

TigerGraph

TigerGraph is a complete, distributed, parallel graph computing platform supporting web-scale data analytics in real time.

 

Not included in the DB-Engines top 5 ranking, but worth an honorable mention, is AWS Neptune, which is a hybrid RDF/property graph database. In this case, “hybrid” only means that one of two options can be chosen when a database is created: there is no hybrid functionality, nor is there an option to switch from RDF to property graph or vice versa after creation.


Graphs in the real world

Graph database use cases

Social network analysis

Modern social networks allow people to communicate and share information with other individuals in large networks. These networks can range from intense interactions with a number of close friends to being part of a larger network (or “community”). 

Social network analysis (SNA) focuses on these relationships. It tries to find out how individuals’ interactions with others influence their behavior or decisions.

Social networks tend to get very complex, consisting of thousands of individuals and millions of interactions (relations) between them. Analyzing these amounts of data requires building a model that simplifies the social network while at the same time remaining representative.

Fraud Detection

Whether you’re looking at credit card fraud, ecommerce, insurance or other types of fraud, fraudulent behavior is becoming increasingly complex.

In a graph database like Neo4j, transactions are stored as a graph where related pieces of data are connected, which makes it easy to traverse those relationships in real time and to find the fraudulent patterns quickly. 


Geo

A lot of geographical or navigational problems map naturally onto graphs. Finding the best route from point A to point B is a matter of finding the shortest path through a network (graph) of points along the way. Similarly, finding locations nearby is a matter of finding all points within a given total distance of point A.

Recommendation engines

Real-time recommendation engines are key to the success of any online business. To make relevant recommendations in real time requires the ability to correlate product, customer, inventory, supplier, logistics and even social sentiment data. Moreover, a real-time recommendation engine requires the ability to instantly capture any new interests shown in the customer’s current visit – something that batch processing can’t accomplish. Matching historical and session data is trivial for a graph database.

Graph databases easily outperform relational and other NoSQL data stores for connecting masses of buyer and product data (and connected data in general) to gain insight into customer needs and product trends.
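As a rough illustration of how historical and session data can be combined, the sketch below walks from a customer to products, on to other customers who bought those products, and then to the products those customers bought. It is a simplified, in-memory stand-in for a graph traversal; the customers, products, scoring rule and function name are all assumptions made for this example.

```python
from collections import Counter

# Historical purchases: customer -> set of products (made-up sample data).
PURCHASES = {
    "alice": {"laptop", "mouse"},
    "bob": {"laptop", "mouse", "keyboard"},
    "carol": {"laptop", "monitor", "keyboard"},
}


def recommend(purchases, customer, session_items=()):
    """Rank products bought by customers who share a product with `customer`."""
    # Combine purchase history with items viewed in the current session.
    owned = purchases.get(customer, set()) | set(session_items)
    scores = Counter()
    for other, items in purchases.items():
        if other == customer:
            continue
        overlap = len(owned & items)
        if overlap == 0:
            continue
        # Products the other customer has that this customer lacks
        # are weighted by how similar the two customers are.
        for item in items - owned:
            scores[item] += overlap
    return [item for item, _ in sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))]


print(recommend(PURCHASES, "alice"))                             # ['keyboard', 'monitor']
print(recommend(PURCHASES, "alice", session_items=["monitor"]))  # ['keyboard']
```

Passing the current session's items changes the result immediately, without recomputing anything in a batch.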

Authorization and Access Control

Traditionally, information about (groups of) people and resources (files, devices, products, legal documents, …) and the access rights that link them has been stored in directory services (e.g. Active Directory). As these hierarchical systems grow, they start to struggle with a number of challenges:

  • Highly interconnected identity and access permissions data 
  • Productivity and customer satisfaction (performance)
  • Dynamic structure

With a graph engine that can traverse these complex networks of relationships in milliseconds, analyzing access to resources, identifying duplicate roles, etc. becomes far easier.

Master Data Management

Consistent operational data needs to be managed across the entire organization. This requires a central area where master data about customers, products, processes and more is managed.

Not only do you need to make sense of all this distributed data, which lives in various systems, formats and locations and varies in quality; even more important is a good understanding of the relationships between all of this data.


Lineage, Auditing, Impact Analysis

Keeping an overview of how various infrastructure components and systems are interconnected in today’s large and complex organizations is a daunting task in the relational world. 

Since all (virtual) hardware infrastructure, software servers, applications, data flows and user actions are connected, they already are a real-world graph that only needs to be persisted in a graph database. 

Queries against such a graph could then answer questions like:

  • Lineage: I have this incoming data point A. What happens to this data point, where does it end up? Who modifies it at which point? 
  • Impact Analysis: what is the impact of changing column x in table y on database z? Which users will be impacted by this change further down the chain? 
  • Auditing: what are my most popular dashboards? Which are my most popular reports? What is the data nobody ever uses?
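Lineage and impact analysis both come down to reachability in a directed graph: impact analysis follows the data flows downstream, lineage follows them upstream. The sketch below shows this with a breadth-first search; the system names, the flows and the function name are hypothetical.

```python
from collections import deque

# Directed data flows (source, target); all names are made up for this sketch.
FLOWS = [
    ("crm.customers", "staging.customers"),
    ("staging.customers", "dwh.dim_customer"),
    ("dwh.dim_customer", "sales_dashboard"),
    ("dwh.dim_customer", "churn_report"),
    ("erp.orders", "sales_dashboard"),
]


def reachable(flows, start, downstream=True):
    """Breadth-first search; downstream=False reverses every flow."""
    adjacency = {}
    for source, target in flows:
        if not downstream:
            source, target = target, source
        adjacency.setdefault(source, set()).add(target)
    seen, queue = set(), deque([start])
    while queue:
        current = queue.popleft()
        for nxt in adjacency.get(current, ()):
            if nxt not in seen and nxt != start:
                seen.add(nxt)
                queue.append(nxt)
    return sorted(seen)


# Impact analysis: what is affected if staging.customers changes?
print(reachable(FLOWS, "staging.customers"))
# Lineage: where does the data on the sales dashboard come from?
print(reachable(FLOWS, "sales_dashboard", downstream=False))
```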

Network and infrastructure monitoring

Network and IT infrastructure easily become complex to manage, which sooner rather than later requires a configuration management database (CMDB). Where relational databases quickly start to struggle in managing the large number of interconnected systems, this is another use case that graph databases excel at. 

A graph database enables you to keep track of your entire infrastructure. It also makes it easy to connect to your many monitoring tools and gain critical insights into the complex relationships between different network or data center operations. From dependency management to automated microservice monitoring, the uses for graphs in network and IT operations are endless.

Graph Modeling

Most people who have been involved in the design or modeling of a relational database schema consider the process to be hard. A number of notable reasons that feed this perception are:

  • The real world model needs to be adapted to a technical design. A lot of decisions (e.g. normalization, deduplicating data, identifying keys, data types, etc.) have to be made for purely technical reasons, to make the real world model fit the technical restrictions enforced by the database.
  • Thinking ahead is hard but unavoidable. Once a database schema has been created and is populated with data, changes are hard and expensive. Changes to the schema need to be made with forward and backward compatibility in mind, and the changes themselves are often slow and tedious.

Compared to relational modeling, graph modeling is almost a walk in the park: 

  • Humans think in terms of objects (nodes) and what connects these objects (relationships). We basically think of the world as a set of graphs. This makes graph modeling very “whiteboard friendly”. A conceptual design of a system no longer needs a technical translation to fit into a relational database schema, but can almost immediately be applied to a graph database. 
  • Since there’s no need to have a fixed database schema, graph models can easily be changed or extended further down the road. The original nodes and relationships can stay in place and remain compatible, while new functionality can use an updated or changed data model. The graph database model effortlessly grows and improves with the business needs. 

Graph Query Languages

SPARQL

SPARQL (pronounced “Sparkle”) is a recursive acronym for “SPARQL Protocol and RDF Query Language”. It is an RDF query language (a semantic query language for databases) that is able to retrieve and manipulate data stored in Resource Description Framework (RDF) format. 

SPARQL queries come in 4 different forms: 

  • SELECT query: used to extract raw values from a SPARQL endpoint; the results are returned in a table format.
  • CONSTRUCT query: used to extract information from the SPARQL endpoint and transform the results into valid RDF.
  • ASK query: used to provide a simple True/False result for a query on a SPARQL endpoint.
  • DESCRIBE query: used to extract an RDF graph from the SPARQL endpoint; the content is left to the endpoint to decide, based on what the maintainer deems useful information.

Example (SELECT) query:

[Image: example SPARQL SELECT query]

This query joins together all of the triples with a matching subject, where the type predicate, "a", is a person (foaf:Person), and the person has one or more names (foaf:name) and mailboxes (foaf:mbox).
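The join described above can be mimicked in plain Python over a list of (subject, predicate, object) tuples. This is a sketch of the matching logic only, not of a SPARQL engine: the "ex:" subjects, the sample names and mailboxes, and the function name are invented for the example, while the FOAF and rdf:type identifiers are the standard ones.

```python
FOAF = "http://xmlns.com/foaf/0.1/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"  # written "a" in SPARQL

# Sample triples; subjects and literal values are made up for this sketch.
TRIPLES = [
    ("ex:bob", RDF_TYPE, FOAF + "Person"),
    ("ex:bob", FOAF + "name", "Bob"),
    ("ex:bob", FOAF + "mbox", "mailto:bob@example.org"),
    ("ex:john", RDF_TYPE, FOAF + "Person"),
    ("ex:john", FOAF + "name", "John"),
]


def names_and_mailboxes(triples):
    """Join three triple patterns on their shared subject."""
    persons = {s for s, p, o in triples if p == RDF_TYPE and o == FOAF + "Person"}
    names = [(s, o) for s, p, o in triples if p == FOAF + "name"]
    mboxes = [(s, o) for s, p, o in triples if p == FOAF + "mbox"]
    return sorted(
        (name, mbox)
        for s1, name in names if s1 in persons
        for s2, mbox in mboxes if s2 == s1
    )


print(names_and_mailboxes(TRIPLES))  # [('Bob', 'mailto:bob@example.org')]
```

John is a person with a name but no mailbox, so he does not satisfy all three patterns and is left out of the result.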


Cypher

Cypher is Neo4j’s graph query language that allows users to store and retrieve data from the graph database. It is a declarative, SQL-inspired language for describing visual patterns in graphs using ASCII-Art syntax.

Similar to other query languages, Cypher contains a variety of keywords for specifying patterns, filtering patterns, and returning results. Among the most common are: MATCH, WHERE, and RETURN. These operate slightly differently than the SELECT and WHERE in SQL; however, they have similar purposes.

The most important keywords to mention are: 

  • MATCH is used before describing the search pattern for finding nodes, relationships, or combinations of nodes and relationships together. 
  • WHERE is used to add additional constraints to patterns and filter out any unwanted patterns.
  • RETURN formats and organizes how the results should be output, including specific properties, lists, ordering, and more.

Example query:

[Image: example Cypher query]

This query matches: 

  • (nicole:Actor {name: 'Nicole Kidman'}): all nodes labeled “Actor” whose “name” property is “Nicole Kidman”. These nodes are aliased “nicole”.
  • (movie:Movie): all nodes labeled “Movie”, aliased “movie”.
  • [:ACTED_IN]: of the nodes matched above, only those pairs that are connected by an “ACTED_IN” relationship.
  • Finally, the query returns all nodes aliased “movie”.
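The matching steps listed above can be imitated over a small in-memory property graph. This is a sketch in Python rather than Cypher, meant only to show what the pattern filters on. The node ids, the second actor and the helper name are assumptions, as is the direction of the relationship (from actor to movie).

```python
# Nodes keyed by an arbitrary id, each with labels and properties.
NODES = {
    1: {"labels": {"Actor"}, "props": {"name": "Nicole Kidman"}},
    2: {"labels": {"Actor"}, "props": {"name": "Ewan McGregor"}},
    3: {"labels": {"Movie"}, "props": {"title": "Moulin Rouge!"}},
    4: {"labels": {"Movie"}, "props": {"title": "Trainspotting"}},
}
# Relationships as (start node id, type, end node id).
RELATIONSHIPS = [(1, "ACTED_IN", 3), (2, "ACTED_IN", 3), (2, "ACTED_IN", 4)]


def movies_for_actor(name):
    """Return titles of Movie nodes reached via ACTED_IN from a matching Actor."""
    titles = []
    for start, rel_type, end in RELATIONSHIPS:
        actor, movie = NODES[start], NODES[end]
        if (
            rel_type == "ACTED_IN"                     # relationship type filter
            and "Actor" in actor["labels"]             # label filter on start node
            and actor["props"].get("name") == name     # property filter
            and "Movie" in movie["labels"]             # label filter on end node
        ):
            titles.append(movie["props"]["title"])
    return titles


print(movies_for_actor("Nicole Kidman"))  # ['Moulin Rouge!']
```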

The core Cypher language can be extended through plugins. Two notable plugins that add hundreds of procedures and functions are:

  • APOC (Awesome Procedures On Cypher): a library of extensions to the core Cypher language
  • Algo: graph algorithms

GQL

Standardization is required to avoid fragmentation in the increasingly popular world of labeled property graphs, which resulted in the creation of GQL (Graph Query Language). In June of 2019, the ISO/IEC’s Joint Technical Committee 1 (responsible for IT standards) started the voting process for GQL. With GQL as the ISO/IEC’s first new database query language standard since SQL, this is quite something!

GQL is created by combining the strengths of 3 graph query languages: 

  • Cypher (Neo4j, openCypher community)
  • PGQL (Oracle)
  • G-CORE: a research language proposal from the Linked Data Benchmark Council, co-authored by world-class researchers from the Netherlands, Germany, Chile and the U.S., and technical staff from SAP, Oracle, Capsenta and Neo4j.

By combining the strengths of the leading graph query languages in the industry into what is intended to become the ‘SQL for graphs’, the ISO/IEC intends to prevent fragmentation and move the entire graph space forward.


Loading data to a graph

Each graph database has its own implementations and preferred ways of loading data. A common practice is importing structured (CSV, RDF etc) data formats from text files. 

Let’s have a closer look at the data loading options in market leader Neo4j: 

  • CSV import: import pre-modeled data (for nodes and relationships) into your graph through Cypher’s LOAD CSV clause or the neo4j-admin command line tool.
  • APOC: Neo4j’s APOC (Awesome Procedures On Cypher), a library of extensions to the core Cypher language, allows more complex operations and transformations on the data while loading (e.g. apoc.load.json and apoc.load.jdbc to load data from JSON files and relational databases respectively).
  • Neo4j’s ETL Tool supports a number of import scenarios to load your data into Neo4j. Although useful for smaller scale problems, the Neo4j ETL Tool quickly runs out of options.
  • Custom code through drivers: language drivers exist for .NET, JavaScript, Java, Go, Python and others. Being language drivers, the possibilities are limited only by the restrictions of the programming language (or lack thereof). With great flexibility comes great responsibility (and an awful lot of work).
  • Kettle/Hop: the open source Kettle data engineering platform, also known as Pentaho Data Integration, provides a set of integrations for Neo4j. There are visual transformation steps to read data from a Cypher query, and to load data to nodes and relationships or directly to a graph that can be modeled in Kettle itself. Development of Project Hop (started as a new incarnation of Kettle and quickly evolving beyond being “just a Kettle fork”) is sponsored and supported by Neo4j, and Hop will treat Neo4j as a first class citizen.
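Whichever of the options above is used, loading pre-modeled CSV data boils down to two passes: create the nodes first, then create the relationships that refer to them. The sketch below shows that idea with Python's csv module and an in-memory result. It is not Neo4j's importer; the column names, the "Person" label, the "KNOWS" type and the function name are assumptions for this example.

```python
import csv
import io

# Two small CSV "files", inlined as strings so the example is self-contained.
PEOPLE_CSV = "id,name\n1,Bob\n2,John\n"
KNOWS_CSV = "from_id,to_id,since\n1,2,2015\n"


def load_graph(nodes_csv, rels_csv):
    """Load nodes first, then relationships that reference existing nodes."""
    nodes = {
        row["id"]: {"labels": ["Person"], "name": row["name"]}
        for row in csv.DictReader(io.StringIO(nodes_csv))
    }
    relationships = []
    for row in csv.DictReader(io.StringIO(rels_csv)):
        # Reject relationships that point at nodes missing from the node file.
        if row["from_id"] not in nodes or row["to_id"] not in nodes:
            raise ValueError(f"relationship references unknown node: {row}")
        relationships.append(
            (row["from_id"], "KNOWS", row["to_id"], {"since": int(row["since"])})
        )
    return nodes, relationships


nodes, relationships = load_graph(PEOPLE_CSV, KNOWS_CSV)
print(nodes)
print(relationships)
```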

Graph Analytics and Algorithms

Where machine learning usually is applied to relational or tabular data, graphs provide a great alternative way to analyze connected data from a different, relationship-oriented angle. 

This is another area where Neo4j is quite far ahead of the competition. A number of algorithm types that are supported through the algo library are:

  • Path finding: find the shortest path or evaluate the availability and quality of routes
  • Centrality: determine the importance of distinct nodes in a network
  • Community Detection: evaluate how a group is clustered or partitioned, as well as its tendency to strengthen or break apart
  • Similarity: help calculate the similarity of nodes
  • Labs: the Neo4j labs team continuously experiments with and works on graph algorithms that are not yet ready for prime time or production use. 
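To give a feel for two of the categories above, the sketch below computes degree centrality (the simplest centrality measure) and connected components (a very basic form of community detection) in plain Python. It does not use or reproduce Neo4j's algo library; the sample edges and function names are made up, and the graph is assumed to be undirected without self-loops.

```python
from collections import defaultdict

# Undirected sample graph: a small star around "a" plus a separate pair.
EDGES = [("a", "b"), ("a", "c"), ("a", "d"), ("e", "f")]


def build_adjacency(edges):
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    return adjacency


def degree_centrality(edges):
    """Share of the other nodes that each node is directly connected to."""
    adjacency = build_adjacency(edges)
    n = len(adjacency)
    return {node: len(nbrs) / (n - 1) for node, nbrs in adjacency.items()}


def connected_components(edges):
    """Group nodes that can reach each other, using depth-first search."""
    adjacency = build_adjacency(edges)
    seen, components = set(), []
    for node in sorted(adjacency):
        if node in seen:
            continue
        component, stack = set(), [node]
        while stack:
            current = stack.pop()
            if current in component:
                continue
            component.add(current)
            stack.extend(adjacency[current] - component)
        seen |= component
        components.append(sorted(component))
    return components


print(degree_centrality(EDGES)["a"])  # 0.6
print(connected_components(EDGES))    # [['a', 'b', 'c', 'd'], ['e', 'f']]
```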
