PostgreSQL vs. Vertica - time is money

An essential part of providing the best possible BI solution is the choice of database.

In an ever evolving world of technological possibilities, choosing a solution that is both capable of solving the current problem and scales for the foreseeable future is of critical importance.

First off, based on the practical use of your database system, an important choice is the type of database system.

This article focuses on the choice between two database architectures: row-oriented vs column-oriented databases, more specifically (row-oriented) PostgreSQL and (column-oriented) HPE Vertica.

Row-based databases (like PostgreSQL) are the best known database architecture. As the name implies, data is stored and manipulated in rows. Row-based operations work really well in an application context, where the database load consists of relatively large numbers of CRUD-operations (create, read, update, delete).

In analytical contexts, where the workload consists of a relatively small number of queries over a small number of columns but large numbers of records, the row based approach is not ideal. Columnar databases have been developed to worked around the limitations of row-based databases for analytical purposes. They store data compressed and per column, much like an index in a row-based database.

Vertica is not just an analytical database, it is a  distributed analytical database. By running on a cluster of machines, Vertica is capable of handling huge volumes of data, providing snappy response times through standard SQL.

Although the true power of Vertica only shows when running on a cluster, handling huge volumes of data, the architectural benefits already show on a single machine and more modest amounts of data. We’ll compare both databases in a BI solution using a sufficiently large amount of flight data.

The core functionality for our BI solution test case will be the calculation and aggregation of flight data for reporting purposes. A possible example report contains the average departures and arrivals per airport per year. The key factor in this test case is the speed of calculation and aggregation of the data to report.

Calculating and aggregating data in database queries are tasks that span large amounts of data. The way these large amounts of data are handled differs between the types of database systems.

In a row oriented database table, every record contains 1 item of the information stored in the table. Every field in this record specifies this item of information.

for example

ID operator origin destination
1 brussels airlines brussels berlin
2 brussels airlines rome brussels
3 jetair fly charleroi barcelona-girona

As discussed earlier, row-oriented database systems are designed to query for specific operations records and link these to specific records in other tables through joins. This optimizes create, read, update and deletes (CRUD operations). To optimize the search for specific records, an index can be created on the search identifier.

A row-oriented database is thus optimized for the performance of frequent, item-specific CRUD operations.

A column-oriented database system is, in essence, structured differently. Rather than store the information row by row, the information is stored per column in combination with the unique identifier for the specific item.

1:brussels airlines 1:brussels 1: berlin
2:brussels airlines 2: rome 2:brussels
3: jetair fly 3:charleroi 3:barcelona-girona

Column oriented database systems are optimized for queries that require only specific columns of a table. Because the different columns are stored separately on disk, a query only reads the columns necessary to perform the query. Columnar data is stored on disk in a compressed format, which further reduces the amount of I/O required, and thus further speeds up the query response time.

To perform the tests for this use case, we use a centOS (7.1) virtual machine in Oracle Virtualbox. This virtual machine has the following database systems installed:

  • PostgreSQL 9.2.15
  • Vertica Analytic Database v7.2.0-0

Both database systems have a database instance set up named dbflights. Both dbflights instances contain the following tables, modeled according to the datawarehouse star scheme:

  • flight_fact
  • airport_dim
  • carrier_dim
  • flight_dim
  • plane_dim
  • time_dim
  • date_dim

The flight fact table contains 35874731 records.

Using ETL developed with Pentaho Data Integration (PDI), both dbflights instances are loaded with a sufficient amount of relevant flight data. All the flight data used is available online.

To compare the performance of the two database systems in the specific usecase, queries relevant for reporting purposes are sent to both instances a couple of times using the freely available SQL client DBeaver.

All query execution times are registered. For every query, the execution times are compared to draw a conclusion based on sufficient facts.

Executed queries vary from simply counting rows to slightly heavier grouping and aggregating. For every query tested, the query, the query result and the execution times are displayed. Execution times are registered in milliseconds (ms).

General row count of the flight_fact table

select count(*) from flight_fact;

execution count(*) PostgreSQL Vertica % of PostgreSQL response time
1 35874731 30951ms 44ms 0.14%
2 35874731 30989ms 53ms 0.17%
3 35874731 29973ms 36ms 0.12%

Counting the amount of flights per airport without joins

For testing the effect of an aggregation on the response time of the query, the flight_fact table is grouped by airport_origin_id.

select airport_origin_id, count(*)

from flight_fact

group by airport_origin_id;

Airport_origin_id Count(*)
251 1326
1074 1682
264 2969
887 42750
802 14278
1070 6897
1080 290
3081 182849
3041 3721
496 37329
2125 7223
455 8704
1420 758
2306 6311
4632 3318
4452 5622
209 1608
1101 13774
2182 30538
276 749

execution PostgreSQL Vertica % of PostgreSQL response time
1 28100ms 883ms 3.14%
2 27904ms 869ms 3.11%
3 28228ms 818ms 2.90%

Counting the amount of flights per year

select d.year, count(*)

from flight_fact f join date_dim d on f.date_id = d.date_id

group by d.year;

2005 14281192
2006 14283844
2007 14906430
2004 14258540
2008 14019456

execution PostgreSQL Vertica % of PostgreSQL response time
1 49223ms 5142ms 10.44%
2 50105ms 5117ms 10.21 %
3 50894ms 5119ms 10.06%

Getting the top 20 airports with the most departure flights

select a.airport_fullname_name, count(*)

from flight_fact f

join airport_dim a on f.airport_origin_id = a.airport_id

group by a.airport_fullname_name

order by count(*) desc

limit 20

Hartsfield Jackson Atlanta International Airport 2088085
Chicago O'Hare International Airport 1813778
Dallas Fort Worth International Airport 1532301
Los Angeles International Airport 1150478
Denver International Airport 1052901
George Bush Intercontinental Houston Airport 1023696
Phoenix Sky Harbor International Airport 970260
McCarran International Airport 866988
Newark Liberty International Airport 764695
Detroit Metropolitan Wayne County Airport 750612
Cincinnati Northern Kentucky International Airport 745110
Salt Lake City International Airport 728214
Minneapolis-St Paul International/Wold-Chamberlain Airport 696961
San Francisco International Airport 669903
General Edward Lawrence Logan International Airport 637630
La Guardia Airport 624536
Orlando International Airport 607063
Charlotte Douglas International Airport 592780
Philadelphia International Airport 561025
John F Kennedy International Airport 548991

execution PostgreSQL Vertica % of PostgreSQL response time
1 28548ms 6253ms 21.16 %
2 27237ms 4966ms 18.23%
3 26390ms 5103ms 19.34%

Counting the busiest days per year

select d.year, d.fullday, count(*)

from flight_fact f

join date_dim d on f.date_id = d.date_id

group by d.year, d.fullday

order by d.year, count(*) desc;

2004 Friday 2122894
2004 Thursday 2120490
2004 Monday 2089016
2004 Wednesday 2073042
2004 Tuesday 2067726
2004 Sunday 1977758
2004 Saturday 1807614
2005 Friday 2101970
2005 Monday 2097936
2005 Thursday 2095498
2005 Wednesday 2086416
2005 Tuesday 2074578
2005 Sunday 1985910
2005 Saturday 1838884
2006 Friday 2113212
2006 Thursday 2105898
2006 Monday 2096108
2006 Wednesday 2085904
2006 Tuesday 2060644
2006 Sunday 2023016
2006 Saturday 1799062
2007 Monday 2224948
2007 Friday 2203378
2007 Thursday 2195476
2007 Wednesday 2177716
2007 Tuesday 2157124
2007 Sunday 2081112
2007 Saturday 1866676
2008 Wednesday 2079330
2008 Monday 2072402
2008 Friday 2070332
2008 Thursday 2064448
2008 Tuesday 2064098
2008 Sunday 1953774
2008 Saturday 1715072

execution PostgreSQL Vertica % of PostgreSQL response time
1 46200ms 8912ms 19.29%
2 52165ms 7892ms 15.13%
3 51785ms 7103ms 13.72%

Calculating largest percentage of cancellations per carrier

select c.carrier_description_name, sum(f.cancelled_qty), count(*), (sum(f.cancelled_qty)*1.00/count(*))*100 as PCT

from flight_fact f

join carrier_dim c on f.carrier_id = c.carrier_id

group by c.carrier_description_name

order by PCT desc;

American Eagle Airlines Inc. 95646 2596395 3.68
Mesa Airlines Inc. 30050 854056 3.51
Comair Inc. 47174 1464176 3.22
Atlantic Southeast Airlines 42171 1423460 2.96
Pinnacle Airlines Inc. 15039 521059 2.88
Independence Air 11164 401447 2.78
Expressjet Airlines Inc. 45964 2022223 2.27
Skywest Airlines Inc. 58889 2694052 2.18
American Airlines Inc. 67645 3254456 2.07
United Air Lines Inc. 45552 2481255 1.83
Delta Air Lines Inc. 49477 2779846 1.77
Alaska Airlines Inc. 13432 795783 1.68
US Airways Inc. (Merged with America West 9/05. Reporting for both starting 10/07.) 36240 2289408 1.58
America West Airlines Inc. (Merged with US Airways 9/05. Stopped reporting 10/07.) 5595 393792 1.42
Aloha Airlines Inc. 1175 89547 1.31
Northwest Airlines Inc. 28521 2183176 1.30
JetBlue Airways 8973 744157 1.20
ATA Airlines d/b/a ATA 1452 139244 1.04
AirTran Airways Corporation 11240 1120438 1.00
Southwest Airlines Co. 50195 5496384 0.91
Continental Air Lines Inc. 12939 1526985 0.84
Frontier Airlines Inc. 1778 336958 0.52
Hawaiian Airlines Inc. 1292 266434 0.48

execution PostgreSQL Vertica % of PostgreSQL response time
1 25943ms 6609ms 25.48%
2 27681ms 6895ms 24.91%
3 24534ms 6629ms 27.02%

It would be very interesting to see in which airports these cancellations took place. If we join the existing query with a second dimension, airport_dim, we get the following query:

select a.airport_fullname_name, c.carrier_description_name, sum(f.cancelled_qty), count(*), (sum(f.cancelled_qty)*1.00/count(*))*100 as PCT

from flight_fact f

join carrier_dim c on f.carrier_id = c.carrier_id

join airport_dim a on f.airport_origin_id = a.airport_id

group by a.airport_fullname_name, c.carrier_description_name

order by PCT desc;

This query takes around 11000ms to fetch 1000 rows on the Vertica instance whereas the PostgreSQL instance doesn't succeed in completing the query after several minutes.

With differences varying from only 0.12% (36ms – 29973ms) up to about 27% (6609ms – 25943ms) of the calculation time, the execution time of the operations on the Vertica instance are significantly lower than the execution times on the PostgreSQL instance.

These statistics show that the Vertica instance computes the operations a lot quicker than the PostgreSQL instance. Therefore, this is an excellent example of why you shouldn't waste time tuning performance when the choice of system can resolve the performance isssue. And yet, the true power of a Vertica system resides in its clustering capabilities. If we increase the amount of data, divide it over a few nodes and then compare it to a PostgreSQL instance, the difference in response times will only increase. This is a very interesting subject for a follow-up blogpost!