Catching the "bad guys" using graphs.
Figure 1: Gartner layered model for fraud detection
Although graph theory has been around for centuries, graph databases began their rise to popularity relatively recently. A graph database like Neo4j is a lot more than a data store.
‘Traditional’ relational databases store data in tables. These tables have a fixed format (fixed number of columns, each with a fixed data type). Tables are linked through the primary key in one table and the corresponding foreign keys in other tables. When a query is executed, the database engine fetches the primary keys from one table and links them to the corresponding foreign keys in other tables through SQL joins.
Although this works well for inserting, selecting, updating and deleting individual records in one or a limited number of tables (CRUD operations), performing joins during query execution over large schemas and data volumes can be expensive and slow.
Property graph databases (like market leader Neo4j) use nodes with labels and properties to store data instead of the relational database tables and columns. Instead of fitting all data into a fixed, predefined table structure, each node is a new instance with a structure that can be different from other nodes.
This schema-less structure offers more flexibility than a relational database table. On top of that flexibility, graph databases treat relationships as first-class citizens. Instead of being created at query time through joins, relationships are persisted in the graph database. Having all relationships stored with the data not only lets graph databases significantly outperform relational databases on relationship-heavy queries, it also enables use cases that are impractical in relational databases. Through the algorithms that ship with your Neo4j database, machine learning analytics becomes available to anyone who has a basic understanding of the problem at hand and knows a little Cypher, Neo4j's query language.
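To make the property graph model more concrete, here is a minimal sketch in plain Python. It is not how Neo4j is implemented; the `Node` and `PropertyGraph` classes, the sample people and the relationship names are all hypothetical and only illustrate two ideas: nodes that carry different properties, and relationships that are stored next to the data rather than computed by a join.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A node with labels and free-form properties (no fixed columns)."""
    node_id: int
    labels: set
    properties: dict = field(default_factory=dict)


class PropertyGraph:
    """Toy in-memory property graph (hypothetical, for illustration only)."""

    def __init__(self):
        self.nodes = {}
        # node_id -> list of (relationship type, target node_id, properties)
        self.outgoing = {}

    def add_node(self, node_id, labels, **properties):
        self.nodes[node_id] = Node(node_id, set(labels), dict(properties))
        self.outgoing.setdefault(node_id, [])

    def add_relationship(self, source_id, rel_type, target_id, **properties):
        # The relationship is persisted with the data, not derived at query time.
        self.outgoing[source_id].append((rel_type, target_id, properties))

    def neighbours(self, node_id, rel_type):
        """Follow the stored relationships of one type from a node."""
        return [t for (r, t, _) in self.outgoing[node_id] if r == rel_type]


graph = PropertyGraph()
graph.add_node(1, ["Person"], name="Alice", age=34)
graph.add_node(2, ["Person"], name="Bob")  # same label, but no 'age' property
graph.add_node(3, ["Company"], name="Acme", sector="Retail")
graph.add_relationship(1, "KNOWS", 2, since=2015)
graph.add_relationship(1, "WORKS_FOR", 3)

known_by_alice = graph.neighbours(1, "KNOWS")
```

Note that the two `Person` nodes have different sets of properties, which in a table would mean either a nullable column or an extra table.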
In this post, we’ll have a look at a couple of analytical use cases with graph databases.
Social networks are everywhere, with LinkedIn and Facebook as the most popular examples. Social networks are natural graph implementations.
In these social networks, people, companies, skills, interests etc. are stored as nodes, each with a varying number of properties. A table to store all possible properties of a ‘person’ would either require a huge number of (very sparsely populated) columns, or a significant number of related tables. In graphs, a ‘person’ node only needs to contain the properties that are available for a person.
Relationships can be created between nodes, e.g. ‘knows’, ‘works for’, ‘lives in’ relationships.
In a relational database, finding the people someone knows would require something like a self join on a person table. Although this is doable for the first-level network, it becomes a lot harder for the typical FOAF (friend of a friend) analysis. Although theoretically possible, running a query to find 3rd, 4th or 5th level relationships (e.g. 4 self joins on the person table) in a network of significant size would be extremely slow, if it didn’t bring the database to its knees altogether.
In graph databases (where both the nodes and their relationships are stored in the database), and with a query language like Cypher that has functions for path finding etc., this kind of traversal is straightforward.
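The friend-of-a-friend traversal can be sketched as a breadth-first search over stored ‘knows’ relationships. This is plain Python rather than Cypher, and the function name `people_within` and the sample network are made up for illustration; it only shows why following stored relationships hop by hop avoids one self join per level.

```python
from collections import deque


def people_within(knows, start, max_depth):
    """Return {person: depth} for everyone reachable within max_depth hops."""
    depth_of = {start: 0}
    queue = deque([start])
    while queue:
        person = queue.popleft()
        if depth_of[person] == max_depth:
            continue  # do not expand beyond the requested level
        for friend in knows.get(person, ()):
            if friend not in depth_of:  # first visit is the shortest path
                depth_of[friend] = depth_of[person] + 1
                queue.append(friend)
    del depth_of[start]  # the starting person is not their own friend
    return depth_of


# Hypothetical 'knows' relationships, stored in both directions.
knows = {
    "Ann": ["Bob", "Cid"],
    "Bob": ["Ann", "Dee"],
    "Cid": ["Ann"],
    "Dee": ["Bob", "Eve"],
    "Eve": ["Dee"],
}

friends_of_friends = people_within(knows, "Ann", 2)
```

Going one level deeper only means changing `max_depth`; the relational equivalent would need another self join on the person table.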
Typical examples of social network analysis include finding the "weight" or influence of people in a network, community detection (discovering the hidden groups or clusters of people in a network), discovering hidden relationships and more.
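As a rough illustration of two of these analyses, the sketch below uses degree centrality as a very simple stand-in for "influence" and connected components as the crudest possible form of community detection. These are deliberately basic substitutes, not the algorithms a graph database ships with, and the function names and sample network are assumptions made for this example.

```python
def degree_centrality(graph):
    """Share of the other people each person is directly connected to."""
    others = max(len(graph) - 1, 1)
    return {person: len(nbrs) / others for person, nbrs in graph.items()}


def connected_components(graph):
    """Group people who are linked by any chain of relationships."""
    seen = set()
    components = []
    for start in graph:
        if start in seen:
            continue
        component = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        components.append(component)
    return components


# Hypothetical undirected network: each person maps to the set of people they know.
network = {
    "Ann": {"Bob", "Cid"},
    "Bob": {"Ann", "Cid"},
    "Cid": {"Ann", "Bob", "Dee"},
    "Dee": {"Cid"},
    "Eve": {"Fay"},
    "Fay": {"Eve"},
}

influence = degree_centrality(network)
most_connected = max(influence, key=influence.get)
communities = connected_components(network)
```

On real networks, measures such as PageRank and dedicated community detection algorithms give far more useful answers than these two, but the shape of the computation (walking stored relationships) is the same.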
Fraud schemes, whether they target banks, insurance, eCommerce or other sectors, are typically carried out by setting up ‘rings’ of fake accounts, accidents, transactions etc.
Similar to social network analysis, querying the entire database for fraudulent accounts, transactions etc. (entity link analysis) requires a huge number of joins and self joins that are hard to build and expensive to run, which only gets worse as the number of accounts (in the database and in the fraud ring) grows. Since fraud needs to be detected as soon as possible, preferably in real time, heavy and slow queries won't work.
Graph databases are built to work with relationships, and query languages like Neo4j’s Cypher have built-in functions and algorithms to detect rings by traversing the graph, which makes it possible to navigate the graph and detect fraud in memory, in real time.
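One common way to picture a fraud ring is as a group of accounts that are tied together, directly or through intermediaries, by shared identifiers such as a phone number or an address. The sketch below assumes that heuristic (it is not taken from any particular product), and the function `find_rings`, the `min_size` threshold and the sample accounts are all hypothetical.

```python
from collections import defaultdict


def find_rings(accounts, min_size=3):
    """Group accounts linked, directly or transitively, by shared identifiers."""
    # Index each identifier to the accounts that use it.
    holders = defaultdict(set)
    for account, identifiers in accounts.items():
        for identifier in identifiers:
            holders[identifier].add(account)

    seen = set()
    rings = []
    for start in accounts:
        if start in seen:
            continue
        ring = set()
        stack = [start]
        while stack:
            account = stack.pop()
            if account in ring:
                continue
            ring.add(account)
            # Hop account -> identifier -> other accounts using that identifier.
            for identifier in accounts[account]:
                stack.extend(holders[identifier] - ring)
        seen |= ring
        if len(ring) >= min_size:  # small groups are treated as harmless here
            rings.append(ring)
    return rings


# Made-up accounts and identifiers.
accounts = {
    "acc1": {"phone:555-0101", "addr:12 Elm St"},
    "acc2": {"phone:555-0101", "addr:9 Oak Ave"},
    "acc3": {"phone:555-0199", "addr:9 Oak Ave"},
    "acc4": {"phone:555-0142", "addr:3 Pine Rd"},
}

rings = find_rings(accounts)
```

Here `acc1` and `acc3` share nothing directly, yet both end up in the same group through `acc2`. That transitive link is exactly what is awkward to express with a fixed number of joins.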
We’re all familiar with “Customers who bought ‘x’ also bought ‘y’” suggestions in online stores, friend recommendations on Facebook and “Who to follow” on Twitter.
This is another use case that is trivial with a graph database like Neo4j. By scanning the historical data in your graph and combining personal, product and sales data, Neo4j is able to determine in real-time which products you could add to your shopping basket, which stores, restaurants or movies you could be interested in etc.
For example, if several people have the same 'located in' relationship and all like sushi (have 'like' relationships to sushi restaurants), it is easy to recommend to each of them sushi restaurants in town that their peers have visited but they haven't yet discovered.
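The sushi example can be sketched in a few lines of Python. This is only an illustration of the idea under simple assumptions: a "peer" is someone in the same town who shares at least one liked restaurant, and suggestions are ranked by how many peers like them. The `recommend` function, the people, towns and restaurant names are all invented.

```python
def recommend(person, located_in, likes):
    """Suggest places liked by similar local peers that `person` has not liked yet."""
    town = located_in[person]
    own_likes = likes.get(person, set())
    scores = {}
    for peer, peer_town in located_in.items():
        if peer == person or peer_town != town:
            continue  # only people with the same 'located in' relationship
        peer_likes = likes.get(peer, set())
        if not peer_likes & own_likes:
            continue  # only peers who share at least one liked place
        for place in peer_likes - own_likes:
            scores[place] = scores.get(place, 0) + 1
    # Most frequently liked first; ties broken alphabetically.
    return sorted(scores, key=lambda place: (-scores[place], place))


# Hypothetical 'located in' and 'like' relationships.
located_in = {"Ann": "Ghent", "Bob": "Ghent", "Cid": "Ghent", "Dee": "Antwerp"}
likes = {
    "Ann": {"Sushi Ichi"},
    "Bob": {"Sushi Ichi", "Sushi Ni"},
    "Cid": {"Sushi Ichi", "Sushi Ni", "Sushi San"},
    "Dee": {"Sushi Ichi", "Sushi Yon"},
}

suggestions = recommend("Ann", located_in, likes)
```

Dee's favourite is left out because she is located in a different town, even though she shares a liked restaurant with Ann.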