Overview of AWS Analytics Services
AWS offers a variety of analytics services designed for different data processing needs. In this post, we will focus on Amazon Athena, Amazon EMR, AWS Glue, and Amazon Kinesis, as these services are the most likely to appear on the AWS Certified Cloud Practitioner exam. You can follow the links provided to learn more about other AWS analytics services such as Amazon CloudSearch, Amazon OpenSearch Service, Amazon QuickSight, AWS Data Pipeline, AWS Lake Formation, and Amazon MSK.
Amazon Elastic MapReduce (EMR)
Amazon EMR is a web service that allows businesses, researchers, data analysts, and developers to process vast amounts of data efficiently and cost-effectively. EMR uses a hosted Hadoop framework running on Amazon EC2 and Amazon S3 and supports Apache Spark, HBase, Presto, and Flink. Common use cases include log analysis, financial analysis, and ETL activities.
A Step is a programmatic task that processes data, while a cluster is a collection of EC2 instances provisioned by EMR to run these Steps. EMR uses Apache Hadoop, an open-source Java software framework, as its distributed data processing engine.
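To make the Step and cluster relationship concrete, here is a minimal sketch of a step definition written as a plain Python dictionary. The field names follow the shape of the EMR `AddJobFlowSteps` request; the step name, bucket, and script path are hypothetical placeholders, and nothing here contacts AWS.

```python
def build_spark_step(name, script_s3_uri, args=()):
    """Return a step definition in the shape the EMR AddJobFlowSteps API accepts."""
    return {
        "Name": name,
        # What the cluster does if this step fails; CONTINUE keeps it running.
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            # command-runner.jar lets a step run a command such as spark-submit.
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster", script_s3_uri, *args],
        },
    }


# Hypothetical step: the bucket and script names are placeholders.
step = build_spark_step(
    "nightly-log-analysis",
    "s3://example-bucket/jobs/analyze_logs.py",
    ["--date", "2024-01-01"],
)
print(step["HadoopJarStep"]["Args"])
```

With boto3 installed and credentials configured, a dictionary like this could be submitted to a running cluster with `add_job_flow_steps(JobFlowId=..., Steps=[step])` on an EMR client.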
EMR is an excellent platform for deploying Apache Spark, an open-source distributed processing framework for big data workloads that uses in-memory caching and optimized query execution. You can also launch clusters running Presto, an open-source distributed SQL query engine designed for fast analytic queries against large datasets. All nodes for a given cluster are launched in the same Amazon EC2 Availability Zone.
You can access Amazon EMR through the AWS Management Console, Command Line Tools, SDKs, or the EMR API. With EMR, you have access to the underlying operating system and can SSH in.
Amazon Athena
Amazon Athena is an interactive query service that allows you to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries you run. Athena is easy to use: simply point to your data in Amazon S3, define the schema, and start querying using standard SQL.
Athena uses Presto with full standard SQL support and works with various data formats, including CSV, JSON, ORC, Apache Parquet, and Avro. It is ideal for quick ad-hoc querying and integrates with Amazon QuickSight for easy visualization. Athena can handle complex analysis, including large joins, window functions, and arrays, and uses a managed Data Catalog to store information and schemas about the databases and tables you create for your data stored in Amazon S3.
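The "point at S3, define a schema, run SQL" workflow can be sketched as the request you would hand to Athena. The parameter names below follow the Athena `StartQueryExecution` request; the database, table, and bucket names are assumptions made up for illustration, and the sketch only builds the request without calling AWS.

```python
def build_athena_query_request(sql, database, output_s3_uri):
    """Return parameters in the shape of an Athena StartQueryExecution request."""
    return {
        "QueryString": sql,
        # The Data Catalog database that holds the table definitions.
        "QueryExecutionContext": {"Database": database},
        # Athena writes query results to this S3 location.
        "ResultConfiguration": {"OutputLocation": output_s3_uri},
    }


# Hypothetical table "access_logs" in a hypothetical database "weblogs".
sql = (
    "SELECT status, COUNT(*) AS hits "
    "FROM access_logs "
    "GROUP BY status "
    "ORDER BY hits DESC "
    "LIMIT 10"
)
request = build_athena_query_request(sql, "weblogs", "s3://example-bucket/athena-results/")
print(request["QueryString"])
```

With boto3, a request like this could be passed to `start_query_execution(**request)` on an Athena client.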
AWS Glue
AWS Glue is a fully managed, pay-as-you-go, extract, transform, and load (ETL) service that automates data preparation for analytics. AWS Glue automatically discovers and profiles data via the Glue Data Catalog, recommends and generates ETL code to transform your source data into target schemas, and runs the ETL jobs on a fully managed, scale-out Apache Spark environment to load your data into its destination.
AWS Glue allows you to set up, orchestrate, and monitor complex data flows, and you can create and run an ETL job with a few clicks in the AWS Management Console. Glue can discover both structured and semi-structured data stored in data lakes on Amazon S3, data warehouses in Amazon Redshift, and various databases running on AWS. It provides a unified view of data via the Glue Data Catalog, which is available for ETL, querying, and reporting using services like Amazon Athena, Amazon EMR, and Amazon Redshift Spectrum. Glue generates Scala or Python code for ETL jobs that you can customize further using familiar tools. Because Glue is serverless, there are no compute resources to configure or manage.
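The idea of transforming source data into a target schema can be illustrated with a small, plain-Python sketch. This is not Glue-generated code and does not use the Glue libraries; the field names, the mapping format, and the `apply_mapping` helper are all hypothetical and exist only to show what a field-by-field ETL mapping does.

```python
# Supported target types for this sketch and the Python casts that implement them.
CASTS = {"string": str, "int": int, "double": float}


def apply_mapping(record, mappings):
    """Rename and cast fields; source fields not listed in the mapping are dropped."""
    out = {}
    for source_field, target_field, target_type in mappings:
        value = record.get(source_field)
        if value is not None:
            out[target_field] = CASTS[target_type](value)
    return out


def run_etl(source_rows, mappings):
    """Extract rows, transform each one to the target schema, and return the result."""
    return [apply_mapping(row, mappings) for row in source_rows]


# Hypothetical source rows, e.g. parsed from a CSV file where every value is a string.
source_rows = [
    {"id": "1", "amt": "9.50", "debug": "x"},
    {"id": "2", "amt": "12.00", "debug": "y"},
]
mappings = [("id", "order_id", "int"), ("amt", "amount", "double")]
target_rows = run_etl(source_rows, mappings)
print(target_rows)
```

In a real Glue job, the equivalent transformation would run on the managed Spark environment and write to the destination store rather than returning a list.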
Data Analysis and Query Use Cases
AWS offers several query services and data processing frameworks to address different needs and use cases, such as Amazon Athena, Amazon Redshift, and Amazon EMR.
- Amazon Redshift provides the fastest query performance for enterprise reporting and business intelligence workloads, especially those involving complex SQL with multiple joins and sub-queries.
- Amazon EMR makes it simpler and more cost-effective to run highly distributed processing frameworks like Hadoop, Spark, and Presto than on-premises deployments. It is flexible, allowing you to run custom applications and code and to define specific compute, memory, storage, and application parameters to suit your analytic requirements.
- Amazon Athena offers the easiest way to run ad-hoc queries for data in S3 without needing to set up or manage any servers.
Below is a summary of primary use cases for a few AWS query and analytics services:
| AWS Service | Primary Use Case | When to Use |
|---|---|---|
| Amazon Athena | Query | Run interactive queries against data directly in Amazon S3 without worrying about data formatting or infrastructure management. Can be used with other services such as Amazon Redshift. |
| Amazon Redshift | Data Warehouse | Pull data from multiple sources, format and organize it, store it, and support complex, high-speed queries for business reports. |
| Amazon EMR | Data Processing | Highly distributed processing frameworks like Hadoop, Spark, and Presto. Run scale-out data processing tasks for applications such as machine learning, graph analytics, data transformation, and streaming data. |
| AWS Glue | ETL Service | Transform and move data to various destinations. Used to prepare and load data for analytics. Data sources can be S3, Redshift, or other databases. Glue Data Catalog can be queried by Athena, EMR, and Redshift Spectrum. |
Amazon Kinesis
Amazon Kinesis makes it easy to collect, process, and analyze real-time, streaming data for timely insights and quick reactions to new information. It is a collection of services for processing different kinds of streaming data, with stream data processed in units called “shards.” There are four Kinesis services:
Kinesis Video Streams
Kinesis Video Streams securely streams video from connected devices to AWS for analytics, machine learning (ML), and other processing. It durably stores, encrypts, and indexes video data streams and provides access to the data through easy-to-use APIs. Data producers send data into a stream, where it is stored for 24 hours by default and for up to 7 days. Consumers receive and process the data. A stream can contain multiple shards, and server-side encryption with AWS KMS is supported using a customer master key.
Kinesis Data Streams
Kinesis Data Streams lets you build custom applications that process or analyze streaming data for specialized needs. It supports real-time processing of streaming big data by rapidly moving data off data producers and continuously processing it. Unlike Kinesis Data Firehose, which delivers data directly to AWS services, Kinesis Data Streams stores data for later processing by applications.
Common use cases include:
- Accelerated log and data feed intake
- Real-time metrics and reporting
- Real-time data analytics
- Complex stream processing
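To show how records end up in shards, the sketch below mimics the way Kinesis Data Streams routes a record: the record's partition key is hashed with MD5 into a 128-bit integer, and the shard whose hash key range contains that integer receives the record. The even split of the hash space across shards and the `shard_for_key` helper are simplifying assumptions for illustration; real shard ranges can be uneven after splits and merges.

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 produces a 128-bit value


def shard_for_key(partition_key, shard_count):
    """Return the index of the shard that would receive this partition key,
    assuming the hash space is divided evenly across shard_count shards."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_key = int.from_bytes(digest, "big")
    range_size = HASH_SPACE // shard_count
    # The last shard absorbs any remainder left by integer division.
    return min(hash_key // range_size, shard_count - 1)


# Hypothetical device IDs used as partition keys.
for key in ["device-1", "device-2", "device-3"]:
    print(key, "-> shard", shard_for_key(key, 4))
```

Because the mapping is deterministic, all records with the same partition key land in the same shard, which is what preserves per-key ordering for consumers.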
Kinesis Data Firehose
Kinesis Data Firehose is the easiest way to load streaming data into data stores and analytics tools. It captures, transforms, and loads streaming data, enabling near real-time analytics with existing business intelligence tools and dashboards. Firehose can use Kinesis Data Streams as a source, and it can batch, compress, and encrypt data before loading it. It synchronously replicates data across three Availability Zones (AZs) as the data is transported to destinations. Each delivery stream stores data records for up to 24 hours.
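The batch-then-compress behavior described above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not how Firehose is implemented: the `BatchingBuffer` class, its size threshold, and the newline-delimited payload format are assumptions, and "delivery" here just appends to an in-memory list.

```python
import gzip


class BatchingBuffer:
    """Collect records until a size threshold is reached, then deliver one
    gzip-compressed batch."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self._records = []
        self._size = 0
        self.delivered = []  # stands in for a destination such as S3

    def put(self, record):
        self._records.append(record)
        self._size += len(record)
        if self._size >= self.max_bytes:
            self.flush()

    def flush(self):
        """Deliver whatever is buffered, if anything, as one compressed batch."""
        if not self._records:
            return
        payload = b"\n".join(self._records) + b"\n"
        self.delivered.append(gzip.compress(payload))
        self._records = []
        self._size = 0


buffer = BatchingBuffer(max_bytes=20)
for record in [b"record-001", b"record-002", b"record-003"]:
    buffer.put(record)
buffer.flush()  # deliver the final partial batch
print(len(buffer.delivered), "batches delivered")
```

A real delivery stream also flushes on a time interval, so a slow trickle of records is still delivered; that timer is left out here to keep the sketch short.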
Kinesis Data Analytics
Kinesis Data Analytics is the easiest way to process and analyze real-time, streaming data using standard SQL queries. It provides real-time analysis with use cases including:
- Generating time-series analytics
- Feeding real-time dashboards
- Creating real-time alerts and notifications
- Quickly authoring and running powerful SQL code against streaming sources
Kinesis Data Analytics can ingest data from Kinesis Data Streams and Kinesis Data Firehose, and it can output results to S3, Redshift, Elasticsearch, and Kinesis Data Streams.
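Time-series analytics over a stream usually means aggregating events into fixed time windows. In Kinesis Data Analytics this would be written as SQL against the streaming source; the plain-Python sketch below only illustrates the windowing idea. The event format (a timestamp in seconds plus a payload) and the `tumbling_window_counts` helper are hypothetical.

```python
from collections import Counter


def tumbling_window_counts(events, window_seconds):
    """Count events per fixed, non-overlapping window.

    Each event is a (timestamp_seconds, payload) pair; the result maps each
    window's start time to the number of events that fell inside it."""
    counts = Counter()
    for timestamp, _payload in events:
        window_start = (timestamp // window_seconds) * window_seconds
        counts[window_start] += 1
    return dict(sorted(counts.items()))


# Hypothetical click events with timestamps in seconds.
events = [(0, "a"), (12, "b"), (59, "c"), (60, "d"), (125, "e")]
counts = tumbling_window_counts(events, 60)
print(counts)  # {0: 3, 60: 1, 120: 1}
```

Per-window counts like these are the kind of result that could feed a real-time dashboard or trigger an alert when a window exceeds a threshold.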
