Why move your BI to the cloud?
As discussed in a previous post, there are many reasons to move your BI to the cloud.
Security, the ability to work from anywhere, faster delivery, more flexible resources and lower cost are just a few.
In this post, we'll have a look at some of the key components of a BI architecture on AWS.
The focus for this post will be on the 'traditional BI' components. Components for big data and data science will be discussed in later posts.
First of all, you'll need to build your infrastructure. There is at least one AWS equivalent for every component in your physical, on-premises infrastructure:
- A VPC, or Virtual Private Cloud, is your own isolated corner of the AWS cloud, your private cloud in the cloud. You'll want to build your entire infrastructure within one or more VPCs.
- S3, or Simple Storage Service, is the AWS object storage service. Use S3 as your central file storage to write to and retrieve data from.
- EC2, or Elastic Compute Cloud, is the AWS service where you'll build your virtual servers for ETL, visualization platforms, etc. Pre-built images (AMIs, or Amazon Machine Images) are available, or you can build your own.
- RDS, or Relational Database Service, is the 'standard' AWS database service. RDS is available in several flavors (Oracle, MS SQL Server, PostgreSQL, MySQL, Aurora), and is a good fit for staging areas, landing areas and similar needs. However, use Redshift (see below) as your data warehouse.
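As a small illustration of using S3 as central file storage, the sketch below builds object keys for a landing area. The `landing/` prefix, the `year=/month=/day=` folder convention and the function name `landing_key` are assumptions made for this example, not something AWS prescribes; any consistent layout will do.

```python
from datetime import date
from pathlib import PurePosixPath


def landing_key(source: str, table: str, load_date: date, filename: str) -> str:
    """Build an S3 object key for a file delivered on a given load date.

    The layout (landing/<source>/<table>/year=/month=/day=/<file>) is a
    hypothetical convention chosen for this example.
    """
    path = PurePosixPath(
        "landing",
        source,
        table,
        f"year={load_date:%Y}",
        f"month={load_date:%m}",
        f"day={load_date:%d}",
        filename,
    )
    return str(path)


if __name__ == "__main__":
    # Example: a customer extract from a (hypothetical) CRM source system.
    print(landing_key("crm", "customers", date(2024, 3, 7), "customers.csv"))
```

Keeping source, table and load date in the key makes it easy to find, reload or expire a single delivery later on.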
Ad hoc querying: Athena
Amazon Athena is a service that allows quick, ad hoc SQL querying directly on your data in S3.
There's no need to develop ETL or to build a data warehouse. All Athena requires is a table structure, defined over your CSV, JSON, log or other files in S3, and you're good to go.
Data sources defined in Athena can be used in QuickSight for visualization.
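To make "a table structure defined over your files" concrete, here is a minimal sketch that generates the kind of CREATE EXTERNAL TABLE statement Athena accepts for comma-separated files. The helper name `athena_csv_table_ddl`, the database, table, columns and bucket are all made up for the example, and it assumes simple CSV files with a single header row.

```python
def athena_csv_table_ddl(database, table, columns, s3_location, skip_header=True):
    """Return a CREATE EXTERNAL TABLE statement for CSV files under s3_location.

    columns is a list of (name, type) pairs, e.g. [("order_id", "bigint")].
    """
    column_lines = ",\n".join(f"  `{name}` {dtype}" for name, dtype in columns)
    # The LOCATION should point at a folder (prefix), so make sure it ends in "/".
    location = s3_location.rstrip("/") + "/"
    ddl = (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {database}.{table} (\n"
        f"{column_lines}\n"
        ")\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','\n"
        f"LOCATION '{location}'"
    )
    if skip_header:
        # Tell Athena to ignore the first line of every file.
        ddl += "\nTBLPROPERTIES ('skip.header.line.count'='1')"
    return ddl


if __name__ == "__main__":
    print(
        athena_csv_table_ddl(
            "sales_db",
            "orders",
            [("order_id", "bigint"), ("order_date", "date"), ("amount", "double")],
            "s3://example-bucket/landing/orders",
        )
    )
```

The generated statement can be pasted into the Athena query editor. The files themselves stay where they are in S3; only the table definition is created.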
ETL: Data Pipeline, Glue
AWS currently provides two ETL services: Data Pipeline and Glue.
- Data Pipeline is similar to your on-premises ETL platform. It is a managed orchestration service that lets you control the what, how and who of the ETL for your AWS resources. Data Pipeline even allows you to periodically import on-premises data.
- Glue is a managed, serverless ETL service based on Apache Spark. With Glue, there's no need to provision or manage resources yourself. Glue comes with a metadata repository, an engine that can generate ETL code in Python and a scheduler that handles dependencies, job monitoring and retries. Glue can read data from a variety of sources, including Athena and Redshift (Spectrum).
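If you want to trigger a Glue job from your own code rather than from the Glue scheduler, a minimal sketch could look like the one below. It takes a Glue client as a parameter and uses its `start_job_run` and `get_job_run` calls; the function name `run_glue_job`, the job name and the job argument are hypothetical, and the set of final states is an assumption you should check against the Glue documentation.

```python
import time

# Job run states after which the run is assumed not to change any more.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT", "ERROR"}


def run_glue_job(glue_client, job_name, arguments=None, poll_seconds=30):
    """Start a Glue job run, wait until it finishes and return its final state."""
    response = glue_client.start_job_run(JobName=job_name, Arguments=arguments or {})
    run_id = response["JobRunId"]
    while True:
        run = glue_client.get_job_run(JobName=job_name, RunId=run_id)
        state = run["JobRun"]["JobRunState"]
        if state in TERMINAL_STATES:
            return state
        # Still starting or running: wait before asking again.
        time.sleep(poll_seconds)
```

With boto3 installed and AWS credentials configured, this could be called as `run_glue_job(boto3.client("glue"), "load_sales", {"--load_date": "2024-03-07"})`. Passing the client in, instead of creating it inside the function, keeps the function easy to test with a stand-in object.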
Analytical database: Redshift
One of the key components in a modern BI or analytics architecture is an analytical database or column store.
Redshift is the AWS service that provides a fully managed, distributed analytical option for your data warehouse. Redshift allows you to start small and grow the number of nodes in the cluster and the complexity of your Redshift implementation as your data grows. As with other column stores, there's no need to constantly create and maintain indexes to keep your data warehouse performance acceptable. With a sensible table design, typical analytical queries return within seconds.
Redshift can be used as a source for your visualization platform of choice, or with Amazon QuickSight.
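Instead of indexes, a Redshift table design typically relies on a distribution key (how rows are spread over the nodes) and a sort key (how rows are ordered on disk). The sketch below generates such a table definition; the helper name `redshift_table_ddl`, the table and its columns are invented for the example, and the choice of keys is only an illustration, not a recommendation for your data.

```python
def redshift_table_ddl(table, columns, dist_key, sort_keys):
    """Return a CREATE TABLE statement with a distribution key and sort key.

    columns is a list of (name, type) pairs; dist_key and sort_keys must
    refer to columns in that list.
    """
    column_names = {name for name, _ in columns}
    missing = ({dist_key} | set(sort_keys)) - column_names
    if missing:
        raise ValueError(f"unknown key column(s): {sorted(missing)}")
    column_lines = ",\n".join(f"  {name} {dtype}" for name, dtype in columns)
    return (
        f"CREATE TABLE {table} (\n"
        f"{column_lines}\n"
        ")\n"
        "DISTSTYLE KEY\n"
        f"DISTKEY ({dist_key})\n"
        f"SORTKEY ({', '.join(sort_keys)});"
    )


if __name__ == "__main__":
    print(
        redshift_table_ddl(
            "fact_sales",
            [
                ("sale_id", "BIGINT"),
                ("customer_id", "INTEGER"),
                ("sale_date", "DATE"),
                ("amount", "DECIMAL(12,2)"),
            ],
            dist_key="customer_id",  # rows for one customer land on the same slice
            sort_keys=["sale_date"],  # date-range filters can skip blocks
        )
    )
```

Distributing on a column you often join on, and sorting on a column you often filter on, is a common starting point that you can revisit as query patterns become clear.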
QuickSight is an AWS visualization and analysis service. Although QuickSight is not a complete replacement for most full BI platforms, it does allow you to quickly develop ad hoc visualizations and analyses on a variety of data sources. Once development is done, visualizations can be distributed to large numbers of users, who can use them in their browser or on their mobile devices.
AWS has all the required components to build full BI or analytics projects. Operating in the cloud may require changes to the way you're used to operating, but it also opens up plenty of opportunities for scalability and flexibility that are not possible on-premises or in a self-managed data center.
Contact us to find out how we can help you become successful with your cloud analytics!