An essential part of providing the best possible BI solution is the choice of database.
In an ever evolving world of technological possibilities, choosing a solution that is both capable of solving the current problem and scales for the foreseeable future is of critical importance.
First off, based on the practical use of your database system, an important choice is the type of database system.
This article focuses on the choice between two database architectures: row-oriented vs column-oriented databases, more specifically (row-oriented) PostgreSQL and (column-oriented) HPE Vertica.
Row-based databases (like PostgreSQL) are the best known database architecture. As the name implies, data is stored and manipulated in rows. Row-based operations work really well in an application context, where the database load consists of relatively large numbers of CRUD-operations (create, read, update, delete).
In analytical contexts, where the workload consists of a relatively small number of queries over a small number of columns but large numbers of records, the row based approach is not ideal. Columnar databases have been developed to worked around the limitations of row-based databases for analytical purposes. They store data compressed and per column, much like an index in a row-based database.
Vertica is not just an analytical database, it is a distributed analytical database. By running on a cluster of machines, Vertica is capable of handling huge volumes of data, providing snappy response times through standard SQL.
Although the true power of Vertica only shows when running on a cluster, handling huge volumes of data, the architectural benefits already show on a single machine and more modest amounts of data. We’ll compare both databases in a BI solution using a sufficiently large amount of flight data.
The core functionality for our BI solution test case will be the calculation and aggregation of flight data for reporting purposes. A possible example report contains the average departures and arrivals per airport per year. The key factor in this test case is the speed of calculation and aggregation of the data to report.
Calculating and aggregating data in database queries are tasks that span large amounts of data. The way these large amounts of data are handled differs between the types of database systems.
In a row oriented database table, every record contains 1 item of the information stored in the table. Every field in this record specifies this item of information.
for example
ID | operator | origin | destination |
1 | brussels airlines | brussels | berlin |
2 | brussels airlines | rome | brussels |
3 | jetair fly | charleroi | barcelona-girona |
As discussed earlier, row-oriented database systems are designed to query for specific operations records and link these to specific records in other tables through joins. This optimizes create, read, update and deletes (CRUD operations). To optimize the search for specific records, an index can be created on the search identifier.
A row-oriented database is thus optimized for the performance of frequent, item-specific CRUD operations.
A column-oriented database system is, in essence, structured differently. Rather than store the information row by row, the information is stored per column in combination with the unique identifier for the specific item.
1:brussels airlines | 1:brussels | 1: berlin |
2:brussels airlines | 2: rome | 2:brussels |
3: jetair fly | 3:charleroi | 3:barcelona-girona |
Column oriented database systems are optimized for queries that require only specific columns of a table. Because the different columns are stored separately on disk, a query only reads the columns necessary to perform the query. Columnar data is stored on disk in a compressed format, which further reduces the amount of I/O required, and thus further speeds up the query response time.
To perform the tests for this use case, we use a centOS (7.1) virtual machine in Oracle Virtualbox. This virtual machine has the following database systems installed:
Both database systems have a database instance set up named dbflights. Both dbflights instances contain the following tables, modeled according to the datawarehouse star scheme:
The flight fact table contains 35874731 records.
Using ETL developed with Pentaho Data Integration (PDI), both dbflights instances are loaded with a sufficient amount of relevant flight data. All the flight data used is available online.
To compare the performance of the two database systems in the specific usecase, queries relevant for reporting purposes are sent to both instances a couple of times using the freely available SQL client DBeaver.
All query execution times are registered. For every query, the execution times are compared to draw a conclusion based on sufficient facts.
Executed queries vary from simply counting rows to slightly heavier grouping and aggregating. For every query tested, the query, the query result and the execution times are displayed. Execution times are registered in milliseconds (ms).
General row count of the flight_fact table
select count(*) from flight_fact;
execution | count(*) | PostgreSQL | Vertica | % of PostgreSQL response time |
1 | 35874731 | 30951ms | 44ms | 0.14% |
2 | 35874731 | 30989ms | 53ms | 0.17% |
3 | 35874731 | 29973ms | 36ms | 0.12% |
Counting the amount of flights per airport without joins
For testing the effect of an aggregation on the response time of the query, the flight_fact table is grouped by airport_origin_id.
select airport_origin_id, count(*)
from flight_fact
group by airport_origin_id;
Airport_origin_id | Count(*) |
251 | 1326 |
1074 | 1682 |
264 | 2969 |
887 | 42750 |
802 | 14278 |
1070 | 6897 |
1080 | 290 |
3081 | 182849 |
3041 | 3721 |
496 | 37329 |
2125 | 7223 |
455 | 8704 |
1420 | 758 |
2306 | 6311 |
4632 | 3318 |
4452 | 5622 |
209 | 1608 |
1101 | 13774 |
2182 | 30538 |
276 | 749 |
execution | PostgreSQL | Vertica | % of PostgreSQL response time |
1 | 28100ms | 883ms | 3.14% |
2 | 27904ms | 869ms | 3.11% |
3 | 28228ms | 818ms | 2.90% |
Counting the amount of flights per year
select d.year, count(*)
from flight_fact f join date_dim d on f.date_id = d.date_id
group by d.year;
2005 | 14281192 |
2006 | 14283844 |
2007 | 14906430 |
2004 | 14258540 |
2008 | 14019456 |
execution | PostgreSQL | Vertica | % of PostgreSQL response time |
1 | 49223ms | 5142ms | 10.44% |
2 | 50105ms | 5117ms | 10.21 % |
3 | 50894ms | 5119ms | 10.06% |
Getting the top 20 airports with the most departure flights
select a.airport_fullname_name, count(*)
from flight_fact f
join airport_dim a on f.airport_origin_id = a.airport_id
group by a.airport_fullname_name
order by count(*) desc
limit 20
Hartsfield Jackson Atlanta International Airport | 2088085 |
Chicago O'Hare International Airport | 1813778 |
Dallas Fort Worth International Airport | 1532301 |
Los Angeles International Airport | 1150478 |
Denver International Airport | 1052901 |
George Bush Intercontinental Houston Airport | 1023696 |
Phoenix Sky Harbor International Airport | 970260 |
McCarran International Airport | 866988 |
Newark Liberty International Airport | 764695 |
Detroit Metropolitan Wayne County Airport | 750612 |
Cincinnati Northern Kentucky International Airport | 745110 |
Salt Lake City International Airport | 728214 |
Minneapolis-St Paul International/Wold-Chamberlain Airport | 696961 |
San Francisco International Airport | 669903 |
General Edward Lawrence Logan International Airport | 637630 |
La Guardia Airport | 624536 |
Orlando International Airport | 607063 |
Charlotte Douglas International Airport | 592780 |
Philadelphia International Airport | 561025 |
John F Kennedy International Airport | 548991 |
execution | PostgreSQL | Vertica | % of PostgreSQL response time |
1 | 28548ms | 6253ms | 21.16 % |
2 | 27237ms | 4966ms | 18.23% |
3 | 26390ms | 5103ms | 19.34% |
Counting the busiest days per year
select d.year, d.fullday, count(*)
from flight_fact f
join date_dim d on f.date_id = d.date_id
group by d.year, d.fullday
order by d.year, count(*) desc;
2004 | Friday | 2122894 |
2004 | Thursday | 2120490 |
2004 | Monday | 2089016 |
2004 | Wednesday | 2073042 |
2004 | Tuesday | 2067726 |
2004 | Sunday | 1977758 |
2004 | Saturday | 1807614 |
2005 | Friday | 2101970 |
2005 | Monday | 2097936 |
2005 | Thursday | 2095498 |
2005 | Wednesday | 2086416 |
2005 | Tuesday | 2074578 |
2005 | Sunday | 1985910 |
2005 | Saturday | 1838884 |
2006 | Friday | 2113212 |
2006 | Thursday | 2105898 |
2006 | Monday | 2096108 |
2006 | Wednesday | 2085904 |
2006 | Tuesday | 2060644 |
2006 | Sunday | 2023016 |
2006 | Saturday | 1799062 |
2007 | Monday | 2224948 |
2007 | Friday | 2203378 |
2007 | Thursday | 2195476 |
2007 | Wednesday | 2177716 |
2007 | Tuesday | 2157124 |
2007 | Sunday | 2081112 |
2007 | Saturday | 1866676 |
2008 | Wednesday | 2079330 |
2008 | Monday | 2072402 |
2008 | Friday | 2070332 |
2008 | Thursday | 2064448 |
2008 | Tuesday | 2064098 |
2008 | Sunday | 1953774 |
2008 | Saturday | 1715072 |
execution | PostgreSQL | Vertica | % of PostgreSQL response time |
1 | 46200ms | 8912ms | 19.29% |
2 | 52165ms | 7892ms | 15.13% |
3 | 51785ms | 7103ms | 13.72% |
Calculating largest percentage of cancellations per carrier
select c.carrier_description_name, sum(f.cancelled_qty), count(*), (sum(f.cancelled_qty)*1.00/count(*))*100 as PCT
from flight_fact f
join carrier_dim c on f.carrier_id = c.carrier_id
group by c.carrier_description_name
order by PCT desc;
American Eagle Airlines Inc. | 95646 | 2596395 | 3.68 |
Mesa Airlines Inc. | 30050 | 854056 | 3.51 |
Comair Inc. | 47174 | 1464176 | 3.22 |
Atlantic Southeast Airlines | 42171 | 1423460 | 2.96 |
Pinnacle Airlines Inc. | 15039 | 521059 | 2.88 |
Independence Air | 11164 | 401447 | 2.78 |
Expressjet Airlines Inc. | 45964 | 2022223 | 2.27 |
Skywest Airlines Inc. | 58889 | 2694052 | 2.18 |
American Airlines Inc. | 67645 | 3254456 | 2.07 |
United Air Lines Inc. | 45552 | 2481255 | 1.83 |
Delta Air Lines Inc. | 49477 | 2779846 | 1.77 |
Alaska Airlines Inc. | 13432 | 795783 | 1.68 |
US Airways Inc. (Merged with America West 9/05. Reporting for both starting 10/07.) | 36240 | 2289408 | 1.58 |
America West Airlines Inc. (Merged with US Airways 9/05. Stopped reporting 10/07.) | 5595 | 393792 | 1.42 |
Aloha Airlines Inc. | 1175 | 89547 | 1.31 |
Northwest Airlines Inc. | 28521 | 2183176 | 1.30 |
JetBlue Airways | 8973 | 744157 | 1.20 |
ATA Airlines d/b/a ATA | 1452 | 139244 | 1.04 |
AirTran Airways Corporation | 11240 | 1120438 | 1.00 |
Southwest Airlines Co. | 50195 | 5496384 | 0.91 |
Continental Air Lines Inc. | 12939 | 1526985 | 0.84 |
Frontier Airlines Inc. | 1778 | 336958 | 0.52 |
Hawaiian Airlines Inc. | 1292 | 266434 | 0.48 |
execution | PostgreSQL | Vertica | % of PostgreSQL response time |
1 | 25943ms | 6609ms | 25.48% |
2 | 27681ms | 6895ms | 24.91% |
3 | 24534ms | 6629ms | 27.02% |
It would be very interesting to see in which airports these cancellations took place. If we join the existing query with a second dimension, airport_dim, we get the following query:
select a.airport_fullname_name, c.carrier_description_name, sum(f.cancelled_qty), count(*), (sum(f.cancelled_qty)*1.00/count(*))*100 as PCT
from flight_fact f
join carrier_dim c on f.carrier_id = c.carrier_id
join airport_dim a on f.airport_origin_id = a.airport_id
group by a.airport_fullname_name, c.carrier_description_name
order by PCT desc;
This query takes around 11000ms to fetch 1000 rows on the Vertica instance whereas the PostgreSQL instance doesn't succeed in completing the query after several minutes.
With differences varying from only 0.12% (36ms – 29973ms) up to about 27% (6609ms – 25943ms) of the calculation time, the execution time of the operations on the Vertica instance are significantly lower than the execution times on the PostgreSQL instance.
These statistics show that the Vertica instance computes the operations a lot quicker than the PostgreSQL instance. Therefore, this is an excellent example of why you shouldn't waste time tuning performance when the choice of system can resolve the performance isssue. And yet, the true power of a Vertica system resides in its clustering capabilities. If we increase the amount of data, divide it over a few nodes and then compare it to a PostgreSQL instance, the difference in response times will only increase. This is a very interesting subject for a follow-up blogpost!