As discussed in a previous article, where PostgreSQL was pitted against HPE Vertica, the choice of database is critical to a good BI solution. This article dives further into columnar databases with a comparison of HPE Vertica, MonetDB, and the still-in-alpha MariaDB ColumnStore (MariaDBCS).
MonetDB was developed at the CWI in Amsterdam and released as open-source software in 2004.
Its architecture consists of three layers: a query-interface front end, an optimization middle layer, and the MonetDB Assembly Language (MAL) interpreter. It aims to get maximum efficiency out of the hardware for large-scale processing. MonetDB also has (limited) support for clustering.
HPE Vertica is a cluster-based analytics platform acquired by HP Software in 2011.
It is designed to manage large, fast-growing volumes of data with fast query performance, combining compression with sorting and encoding for ‘cost-effective loads and reads’. Vertica is built to run in a cluster, so the results of this test would look vastly different with even bigger datasets and multiple nodes. For the purpose of this test, however, Vertica is set up as a single node.
The new analytics engine MariaDB ColumnStore is MariaDB’s effort to provide a product that is scalable, reliable, open-source, and one hundred percent SQL-compatible. The engine runs under the familiar MariaDB interface and is therefore backwards-compatible with existing components such as JDBC drivers.
Running some tests
Testing is done in a VirtualBox VM running Ubuntu 14.04 (Trusty Tahr) with 8 GB of memory and 50 GB of storage. Each database gets full rein of the machine, as the test runs are consecutive. The dataset is inserted into a data warehouse on each database using Pentaho Data Integration; it contains 20,520,383 records in total.
The DDL statements for each table (the leading movieId key columns, used in the join queries below, are restored here):

CREATE TABLE dim_movies (
  movieId DECIMAL(20)
, title VARCHAR(200)
, genres VARCHAR(120)
);

CREATE TABLE dim_links (
  movieId DECIMAL(20)
, imdbId DECIMAL(25)
, tmdbId DECIMAL(25)
);

CREATE TABLE dim_tags (
  movieId DECIMAL(20)
, tag VARCHAR(500)
, "timestamp" VARCHAR(50)
);

CREATE TABLE "fact_ratings" (
  "fact_id" BIGINT
, "userId" DECIMAL
, "movieId" DECIMAL
, "rating" DOUBLE
, "timestamp" VARCHAR(40)
);
Keep in mind that all three of these databases require modifications to the CREATE TABLE statements above.
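As an illustration of such a modification, here is a hedged sketch of what a ColumnStore-adapted table might look like: MariaDB selects the storage engine per table, so the ColumnStore-specific part is the ENGINE clause, and the exact type substitutions (such as BIGINT instead of DECIMAL here) depend on which types your ColumnStore version accepts.

```sql
-- Hypothetical MariaDB ColumnStore variant of dim_movies;
-- ENGINE=ColumnStore routes the table to the columnar engine.
CREATE TABLE dim_movies (
  movieId BIGINT
, title VARCHAR(200)
, genres VARCHAR(120)
) ENGINE=ColumnStore;
```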
Noticeable when testing MariaDBCS was that it rejected some data types and did not allow the LOCAL keyword in LOAD DATA statements.
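For reference, a sketch of how each database typically bulk-loads a CSV; the file paths are placeholders, and since MariaDBCS rejects LOCAL, its input file has to sit on the server side:

```sql
-- MariaDB ColumnStore: server-side load only, no LOCAL keyword
LOAD DATA INFILE '/tmp/ratings.csv' INTO TABLE fact_ratings
FIELDS TERMINATED BY ',';

-- MonetDB: delimiters are field, record, and quote characters
COPY INTO fact_ratings FROM '/tmp/ratings.csv'
USING DELIMITERS ',', '\n', '"';

-- HPE Vertica: DIRECT loads straight into read-optimized storage
COPY fact_ratings FROM '/tmp/ratings.csv' DELIMITER ',' DIRECT;
```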
The DWH writing times for each database:

| 3m 23s | 3m 49s | 4m 13s |
Each test is a simple, everyday query.
The first test is a simple average query: “SELECT AVG(fact_id) FROM fact_ratings;”
Result: 10000131. MariaDBCS takes a very long time to calculate and display the average, while the other two complete the task within a few seconds.
The second test query is a count of all the ratings:
“SELECT rating, COUNT(*) FROM fact_ratings GROUP BY rating ORDER BY rating;”
The third test is somewhat more complex. The goal is to list the top 20 movies (of the table) and their average ratings:

“select dim_movies.title, avg(fact_ratings.rating)
from dim_movies
join fact_ratings on dim_movies.movieId = fact_ratings."movieId"
group by dim_movies.title
order by dim_movies.title
limit 20;”
|"Great Performances" Cats (1998)||2.75|
|chicagoGirl: The Social Network Takes on a Dictator (2013)||3.67|
|$ (Dollars) (1971)||2.83|
|$5 a Day (2008)||2.87|
|$ellebrity (Sellebrity) (2012)||2|
|'Hellboy': The Seeds of Creation (2004)||3.03|
|'Human' Factor, The (Human Factor, The) (1975)||2.5|
|'Neath the Arizona Skies (1934)||2.32|
|'R Xmas (2001)||2.58|
|'Round Midnight (1986)||3.61|
|'Salem's Lot (2004)||3.11|
|'Til There Was You (1997)||2.85|
|'Twas the Night Before Christmas (1974)||3.37|
|'burbs, The (1989)||3.07|
|'night Mother (1986)||3.45|
|(500) Days of Summer (2009)||3.80|
|(Absolutions) Pipilotti's Mistakes ((Entlastungen)) Pipilottis Fehler) (1988)||1|
The fourth test builds on the third one: now we want the top 20 movies, with their rating and imdbId, whose rating is larger than the average rating:

“select dim_movies.title, fact_ratings.rating, dim_links.imdbid
from dim_movies
left outer join fact_ratings on dim_movies.movieId = fact_ratings."movieId"
left outer join dim_links on dim_movies.movieId = dim_links.movieid
where fact_ratings.rating > (select avg(fact_ratings.rating) from fact_ratings)
order by fact_ratings.rating DESC
limit 20;”
| title | rating | imdbId |
|---|---|---|
| District 9 (2009) | 5 | 1.136.608 |
| Lord of the Rings: The Two Towers, The (2002) | 5 | 167.261 |
| Lord of the Rings: The Return of the King, The (2003) | 5 | 167.260 |
| Mr. Holland's Opus (1995) | 5 | 113.862 |
| From Dusk Till Dawn (1996) | 5 | 116.367 |
| Star Wars: Episode IV - A New Hope (1977) | 5 | 76.759 |
| Legends of the Fall (1994) | 5 | 110.322 |
| Jurassic Park (1993) | 5 | 107.290 |
| Blade Runner (1982) | 5 | 83.658 |
| Terminator 2: Judgment Day (1991) | 5 | 103.064 |
| 2001: A Space Odyssey (1968) | 5 | 62.622 |
| Star Wars: Episode V - The Empire Strikes Back (1980) | 5 | 80.684 |
| Star Wars: Episode VI - Return of the Jedi (1983) | 5 | 86.190 |
| Femme Nikita, La (Nikita) (1990) | 5 | 100.263 |
| Darjeeling Limited, The (2007) | 5 | 838.221 |
| Diving Bell and the Butterfly, The (Scaphandre et le papillon, Le) (2007) | 5 | 401.383 |
| Kung Fu Panda (2008) | 5 | 441.773 |
| Lord of the Rings: The Fellowship of the Ring, The (2001) | 5 | 120.737 |
From these results it's clear that MonetDB and HPE Vertica are the winners of this benchmark. MonetDB is the fastest choice when joining multiple tables together to produce an end result, whereas HPE Vertica offers the best results when executing simple queries and calculations. As the newcomer to this type of storage, MariaDB has offered up a valuable candidate which, albeit slow in comparison, does offer full SQL support.
Again, I feel the need to throw in a reminder that if we were using a larger dataset (read: tera- or petabytes) with Vertica properly set up, it would dwarf the other two in comparison.
My suggestion: when choosing a DBMS, look at the dataset and the types of queries that will be run. Each database has its pitfalls, so which is the best option for you?
To conclude: speed matters most here, and it's clear there isn't a monopoly on it just yet.