A data engineer needs Amazon Athena queries to finish faster. The data engineer notices that all the
files the Athena queries use are currently stored in uncompressed .csv format. The data engineer also
notices that users perform most queries by selecting a specific column.
Which solution will MOST speed up the Athena query performance?
C
Explanation:
Amazon Athena is a serverless interactive query service that allows you to analyze data in Amazon
S3 using standard SQL. Athena supports various data formats, such as CSV, JSON, ORC, Avro, and
Parquet. However, not all data formats are equally efficient for querying. Some data formats, such as
CSV and JSON, are row-oriented, meaning that they store data as a sequence of records, each with
the same fields. Row-oriented formats are suitable for loading and exporting data, but they are not
optimal for analytical queries that often access only a subset of columns. Row-oriented text formats
also lack the columnar compression and encoding techniques that reduce the amount of data scanned
and improve query performance.
On the other hand, some data formats, such as ORC and Parquet, are column-oriented, meaning that
they store data as a collection of columns, each with a specific data type. Column-oriented formats
are ideal for analytical queries that often filter, aggregate, or join data by columns. Column-oriented
formats also support compression and encoding techniques that can reduce the data size and
improve the query performance. For example, Parquet supports dictionary encoding, which replaces
repeated values with numeric codes, and run-length encoding, which replaces consecutive identical
values with a single value and a count. Parquet also supports various compression algorithms, such
as Snappy, GZIP, and ZSTD, that can further reduce the data size and improve the query performance.
Therefore, changing the data format from CSV to Parquet and applying Snappy compression will
most speed up the Athena query performance. Parquet is a column-oriented format that allows
Athena to scan only the relevant columns and skip the rest, reducing the amount of data read from
S3. Snappy is a fast compression algorithm that reduces the data size with very little decompression
overhead, and because Parquet applies compression within each column chunk, Snappy-compressed
Parquet files remain splittable for parallel reads. This solution will also reduce the cost of Athena
queries, as Athena charges based on the amount of data scanned from S3.
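For illustration, a conversion along these lines produces a Snappy-compressed Parquet file that Athena can query; this is a minimal sketch that assumes the pyarrow library, and the file names are placeholders rather than anything from the original question.

```python
# Minimal sketch (not from the original question): convert an uncompressed CSV
# file to Snappy-compressed Parquet with pyarrow before uploading it to S3.
# The file names are placeholders; column types are inferred from the CSV.
import pyarrow.csv as pv
import pyarrow.parquet as pq

# Read the uncompressed CSV file into an Arrow table.
table = pv.read_csv("daily_metrics.csv")

# Write a column-oriented Parquet file; Snappy is the default codec but is
# stated explicitly here for clarity.
pq.write_table(table, "daily_metrics.parquet", compression="snappy")

print(table.schema)  # Inspect the inferred column types before uploading to S3.
```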
The other options are not as effective as changing the data format to Parquet and applying Snappy
compression. Changing the data format from CSV to JSON and applying Snappy compression will not
improve the query performance significantly, as JSON is also a row-oriented format that does not
support columnar access or encoding techniques. Compressing the CSV files by using Snappy
compression will reduce the data size, but it will not improve the query performance significantly, as
CSV is still a row-oriented format that does not support columnar access or encoding techniques.
Compressing the CSV files by using gzip compression will reduce the data size, but it can degrade the
query performance, because gzip is not a splittable compression format and each file must be
decompressed before it can be read.
Reference:
Amazon Athena
Choosing the Right Data Format
AWS Certified Data Engineer - Associate DEA-C01 Complete Study Guide, Chapter 5: Data Analysis and Visualization, Section 5.1: Amazon Athena
A manufacturing company collects sensor data from its factory floor to monitor and enhance
operational efficiency. The company uses Amazon Kinesis Data Streams to publish the data that the
sensors collect to a data stream. Then Amazon Kinesis Data Firehose writes the data to an Amazon S3
bucket.
The company needs to display a real-time view of operational efficiency on a large screen in the
manufacturing facility.
Which solution will meet these requirements with the LOWEST latency?
C
Explanation:
This solution will meet the requirements with the lowest latency because it uses Amazon Managed
Service for Apache Flink to process the sensor data in real time and write it to Amazon Timestream, a
fast, scalable, and serverless time series database. Amazon Timestream is optimized for storing and
analyzing time series data, such as sensor data, and can handle trillions of events per day with
millisecond latency. By using Amazon Timestream as a source, you can create an Amazon QuickSight
dashboard that displays a real-time view of operational efficiency on a large screen in the
manufacturing facility.
Amazon QuickSight is a fully managed business intelligence service that can
connect to various data sources, including Amazon Timestream, and provide interactive
visualizations and insights.
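As a rough sketch of the write path (not part of the original answer), processed sensor readings could be written to Timestream with boto3; the database, table, and dimension names below are placeholders.

```python
# Hedged sketch: write one processed operational-efficiency measurement to
# Amazon Timestream. The database, table, and dimension names are placeholders.
import time
import boto3

timestream = boto3.client("timestream-write")

def write_efficiency_record(machine_id: str, efficiency_pct: float) -> None:
    """Write a single operational-efficiency measurement to Timestream."""
    timestream.write_records(
        DatabaseName="factory_metrics",          # placeholder database
        TableName="operational_efficiency",      # placeholder table
        Records=[
            {
                "Dimensions": [{"Name": "machine_id", "Value": machine_id}],
                "MeasureName": "efficiency_pct",
                "MeasureValue": str(efficiency_pct),
                "MeasureValueType": "DOUBLE",
                "Time": str(int(time.time() * 1000)),  # epoch milliseconds
            }
        ],
    )

write_efficiency_record("press-07", 92.4)
```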
The other options are not optimal for the following reasons:
A . Use Amazon Managed Service for Apache Flink (previously known as Amazon Kinesis Data
Analytics) to process the sensor data. Use a connector for Apache Flink to write data to an Amazon
Timestream database. Use the Timestream database as a source to create a Grafana dashboard. This
option is similar to option C, but it uses Grafana instead of Amazon QuickSight to create the
dashboard. Grafana is an open source visualization tool that can also connect to Amazon
Timestream, but it requires additional steps to set up and configure, such as deploying a Grafana
server on Amazon EC2, installing the Amazon Timestream plugin, and creating an IAM role for
Grafana to access Timestream. These steps can increase the latency and complexity of the solution.
B . Configure the S3 bucket to send a notification to an AWS Lambda function when any new object is
created. Use the Lambda function to publish the data to Amazon Aurora. Use Aurora as a source to
create an Amazon QuickSight dashboard. This option is not suitable for displaying a real-time view of
operational efficiency, as it introduces unnecessary delays and costs in the data pipeline. First, the
sensor data is written to an S3 bucket by Amazon Kinesis Data Firehose, which can have a buffering
interval of up to 900 seconds. Then, the S3 bucket sends a notification to a Lambda function, which
can incur additional invocation and execution time. Finally, the Lambda function publishes the data
to Amazon Aurora, a relational database that is not optimized for time series data and can have
higher storage and performance costs than Amazon Timestream.
D . Use AWS Glue bookmarks to read sensor data from the S3 bucket in real time. Publish the data to
an Amazon Timestream database. Use the Timestream database as a source to create a Grafana
dashboard. This option is also not suitable for displaying a real-time view of operational efficiency, as
it uses AWS Glue bookmarks to read sensor data from the S3 bucket. AWS Glue bookmarks are a
feature that helps AWS Glue jobs and crawlers keep track of the data that has already been
processed, so that they can resume from where they left off. However, AWS Glue jobs and crawlers
are not designed for real-time data processing, as they can have a minimum frequency of 5 minutes
and a variable start-up time. Moreover, this option also uses Grafana instead of Amazon QuickSight
to create the dashboard, which can increase the latency and complexity of the solution.
Reference:
Amazon Managed Service for Apache Flink
Amazon Timestream
Amazon QuickSight
Analyze data in Amazon Timestream using Grafana
Amazon Kinesis Data Firehose
Amazon Aurora
AWS Glue Bookmarks
AWS Glue Job and Crawler Scheduling
A company stores daily records of the financial performance of investment portfolios in .csv format in
an Amazon S3 bucket. A data engineer uses AWS Glue crawlers to crawl the S3 data.
The data engineer must make the S3 data accessible daily in the AWS Glue Data Catalog.
Which solution will meet these requirements?
B
Explanation:
To make the S3 data accessible daily in the AWS Glue Data Catalog, the data engineer needs to create
a crawler that can crawl the S3 data and write the metadata to the Data Catalog. The crawler also
needs to run on a daily schedule to keep the Data Catalog updated with the latest data. Therefore,
the solution must include the following steps:
Create an IAM role that has the necessary permissions to access the S3 data and the Data Catalog. The AWSGlueServiceRole managed policy grants these permissions. Associate the role with the crawler.
Specify the S3 bucket path of the source data as the crawler's data store. The crawler will scan the data and infer the schema and format.
Create a daily schedule to run the crawler. The crawler will run at the specified time every day and update the Data Catalog with any changes in the data.
Specify a database name for the output. The crawler will create or update a table in the Data Catalog under the specified database. The table will contain the metadata about the data in the S3 bucket, such as the location, schema, and classification.
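A minimal boto3 sketch of these steps follows; the bucket path, IAM role ARN, database name, crawler name, and cron expression are hypothetical.

```python
# Minimal sketch of the crawler setup; the role ARN, bucket path, database,
# crawler name, and schedule are hypothetical placeholders.
import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="daily-portfolio-crawler",                        # placeholder name
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole", # role with AWSGlueServiceRole attached
    DatabaseName="portfolio_db",                           # Data Catalog database for the output
    Targets={"S3Targets": [{"Path": "s3://example-portfolio-bucket/daily/"}]},
    Schedule="cron(0 2 * * ? *)",                          # run every day at 02:00 UTC
)

# The crawler can also be started on demand while testing the configuration.
glue.start_crawler(Name="daily-portfolio-crawler")
```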
Option B is the only solution that includes all these steps. Therefore, option B is the correct answer.
Option A is incorrect because it configures the output destination to a new path in the existing S3
bucket. This is unnecessary and may cause confusion, as the crawler does not write any data to the
S3 bucket, only metadata to the Data Catalog.
Option C is incorrect because it allocates data processing units (DPUs) to run the crawler every day.
This is also unnecessary, as DPUs are only used for AWS Glue ETL jobs, not crawlers.
Option D is incorrect because it combines the errors of option A and C. It configures the output
destination to a new path in the existing S3 bucket and allocates DPUs to run the crawler every day,
both of which are irrelevant for the crawler.
Reference:
AWS managed (predefined) policies for AWS Glue - AWS Glue
Data Catalog and crawlers in AWS Glue - AWS Glue
Scheduling an AWS Glue crawler - AWS Glue
Parameters set on Data Catalog tables by crawler - AWS Glue
AWS Glue pricing - Amazon Web Services (AWS)
A company loads transaction data for each day into Amazon Redshift tables at the end of each day.
The company wants to have the ability to track which tables have been loaded and which tables still
need to be loaded.
A data engineer wants to store the load statuses of Redshift tables in an Amazon DynamoDB table.
The data engineer creates an AWS Lambda function to publish the details of the load statuses to
DynamoDB.
How should the data engineer invoke the Lambda function to write load statuses to the DynamoDB
table?
C
Explanation:
The Amazon Redshift Data API enables you to interact with your Amazon Redshift data warehouse in
an easy and secure way. You can use the Data API to run SQL commands, such as loading data into
tables, without requiring a persistent connection to the cluster. The Data API also integrates with
Amazon EventBridge, which allows you to monitor the execution status of your SQL commands and
trigger actions based on events. By using the Data API to publish an event to EventBridge, the data
engineer can invoke the Lambda function that writes the load statuses to the DynamoDB table. This
solution is scalable, reliable, and cost-effective. The other options are either not possible or not
optimal. You cannot use a second Lambda function to invoke the first Lambda function based on
CloudWatch or CloudTrail events, as these services do not capture the load status of Redshift tables.
You can use the Data API to publish a message to an SQS queue, but this would require additional
configuration and polling logic to invoke the Lambda function from the queue. This would also
introduce additional latency and cost.
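As a rough illustration, the Lambda function could record the status carried in the EventBridge event roughly as follows; the DynamoDB table name and the event detail fields used here are assumptions, not part of the original question.

```python
# Hedged sketch of the Lambda handler that writes a Redshift load status to
# DynamoDB when EventBridge delivers a Redshift Data API state-change event.
# The table name and the event detail field names are assumptions.
import boto3

dynamodb = boto3.resource("dynamodb")
status_table = dynamodb.Table("redshift_load_status")  # placeholder table name

def lambda_handler(event, context):
    # Pull the statement identifier and final state from the event detail
    # (field names assumed here; defaults keep the handler safe if they differ).
    detail = event.get("detail", {})
    status_table.put_item(
        Item={
            "statement_id": detail.get("statementId", "unknown"),
            "state": detail.get("state", "unknown"),   # e.g. FINISHED or FAILED
            "event_time": event.get("time", ""),       # EventBridge event timestamp
        }
    )
    return {"recorded": True}
```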
Reference:
Using the Amazon Redshift Data API
Using Amazon EventBridge with Amazon Redshift
AWS Certified Data Engineer - Associate DEA-C01 Complete Study Guide, Chapter 2: Data Store Management, Section 2.2: Amazon Redshift
A data engineer needs to securely transfer 5 TB of data from an on-premises data center to an
Amazon S3 bucket. Approximately 5% of the data changes every day. Updates to the data need to be
regularly propagated to the S3 bucket. The data includes files that are in multiple formats. The data
engineer needs to automate the transfer process and must schedule the process to run periodically.
Which AWS service should the data engineer use to transfer the data in the MOST operationally
efficient way?
A
Explanation:
AWS DataSync is an online data movement and discovery service that simplifies and accelerates data
migrations to AWS, as well as moving data to and from on-premises storage, edge locations, other
cloud providers, and AWS Storage services. AWS DataSync can copy data to and from various
sources and targets, including Amazon S3, and handles files in multiple formats. AWS DataSync also
supports incremental transfers, meaning it can detect and copy only the changes to the data,
reducing the amount of data transferred and improving performance. AWS DataSync can automate
and schedule the transfer process by using task schedules, and can monitor the progress and status
of the transfers by using CloudWatch metrics and events.
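For illustration, a recurring DataSync task could be configured with boto3 along these lines; the location ARNs and schedule are placeholders and assume the on-premises and S3 locations have already been created.

```python
# Illustrative sketch only: schedules a recurring DataSync task between an
# existing on-premises location and an existing S3 location. The ARNs and the
# cron expression are placeholders.
import boto3

datasync = boto3.client("datasync")

response = datasync.create_task(
    SourceLocationArn="arn:aws:datasync:us-east-1:123456789012:location/loc-onprem-example",
    DestinationLocationArn="arn:aws:datasync:us-east-1:123456789012:location/loc-s3-example",
    Name="nightly-incremental-sync",
    Schedule={"ScheduleExpression": "cron(0 1 * * ? *)"},  # run nightly at 01:00 UTC
)

# DataSync copies only new and changed files on each execution, so roughly the
# 5% daily delta is transferred instead of the full 5 TB data set.
print(response["TaskArn"])
```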
AWS DataSync is the most operationally efficient way to transfer the data in this scenario, as it meets
all the requirements and offers a serverless and scalable solution. AWS Glue, AWS Direct Connect,
and Amazon S3 Transfer Acceleration are not the best options for this scenario, as they have some
limitations or drawbacks compared to AWS DataSync.
AWS Glue is a serverless ETL service that can
extract, transform, and load data from various sources to various targets, including Amazon S3.
However, AWS Glue is not designed for large-scale file transfers from on-premises storage, and it has
quotas and limits on the number and size of files it can process. Using AWS Glue for this workload
would also require custom logic to detect and copy only the changed files, which would be inefficient
and costly compared with DataSync's built-in incremental transfers.
AWS Direct Connect is a service that establishes a dedicated network connection between your on-
premises data center and AWS, bypassing the public internet and improving the bandwidth and
performance of the data transfer. However, AWS Direct Connect is not a data transfer service by
itself, as it requires additional services or tools to copy the data, such as AWS DataSync, AWS Storage
Gateway, or AWS CLI. AWS Direct Connect also has some hardware and location requirements, and
charges you for the port hours and data transfer out of AWS.
Amazon S3 Transfer Acceleration is a feature that enables faster data transfers to Amazon S3 over
long distances, using the AWS edge locations and optimized network paths. However, Amazon S3
Transfer Acceleration is not a data transfer service by itself, as it requires additional services or tools
to copy the data, such as AWS CLI, AWS SDK, or third-party software. Amazon S3 Transfer
Acceleration also charges you for the data transferred over the accelerated endpoints, and does not
guarantee a performance improvement for every transfer, as it depends on various factors such as
the network conditions, the distance, and the object size.
Reference:
AWS DataSync
AWS Glue
AWS Glue quotas and limits
AWS Direct Connect
Data transfer options for AWS Direct Connect
Amazon S3 Transfer Acceleration
Using Amazon S3 Transfer Acceleration
A company uses an on-premises Microsoft SQL Server database to store financial transaction data.
The company migrates the transaction data from the on-premises database to AWS at the end of
each month. The company has noticed that the cost to migrate data from the on-premises database
to an Amazon RDS for SQL Server database has increased recently.
The company requires a cost-effective solution to migrate the data to AWS. The solution must cause
minimal downtime for the applications that access the database.
Which AWS service should the company use to meet these requirements?
B
Explanation:
AWS Database Migration Service (AWS DMS) is a cloud service that makes it possible to migrate
relational databases, data warehouses, NoSQL databases, and other types of data stores to AWS
quickly, securely, and with minimal downtime and zero data loss. AWS DMS supports migration
between more than 20 database and analytics engines, including Microsoft SQL Server to Amazon RDS
for SQL Server. AWS DMS takes over many of the difficult or tedious tasks involved in a migration
project, such as capacity analysis, hardware and software procurement, installation and
administration, testing and debugging, and ongoing replication and monitoring. AWS DMS is also
cost-effective, because you pay only for the compute resources and additional log storage used during
the migration process. AWS DMS is therefore the best solution for migrating the financial transaction
data from the on-premises Microsoft SQL Server database to AWS, as it meets the requirements of
minimal downtime, zero data loss, and low cost.
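As a hedged sketch (the ARNs and table mappings are placeholders, and the DMS endpoints and replication instance are assumed to already exist), a full-load-plus-CDC task could be created with boto3 like this:

```python
# Illustrative sketch only: creates a DMS task that performs a full load and
# then ongoing change data capture (CDC) from SQL Server to RDS for SQL Server.
# The ARNs, schema name, and task identifier are hypothetical placeholders.
import json
import boto3

dms = boto3.client("dms")

table_mappings = {
    "rules": [
        {
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "include-transactions",
            "object-locator": {"schema-name": "finance", "table-name": "%"},
            "rule-action": "include",
        }
    ]
}

dms.create_replication_task(
    ReplicationTaskIdentifier="sqlserver-to-rds-monthly",
    SourceEndpointArn="arn:aws:dms:us-east-1:123456789012:endpoint:source-example",
    TargetEndpointArn="arn:aws:dms:us-east-1:123456789012:endpoint:target-example",
    ReplicationInstanceArn="arn:aws:dms:us-east-1:123456789012:rep:instance-example",
    MigrationType="full-load-and-cdc",   # keeps the target in sync with minimal downtime
    TableMappings=json.dumps(table_mappings),
)
```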
Option A is not the best solution, as AWS Lambda is a serverless compute service that lets you run
code without provisioning or managing servers, but it does not provide any built-in features for
database migration. You would have to write your own code to extract, transform, and load the data
from the source to the target, which would increase the operational overhead and complexity.
Option C is not the best solution, as AWS Direct Connect is a service that establishes a dedicated
network connection from your premises to AWS, but it does not provide any built-in features for
database migration. You would still need to use another service or tool to perform the actual data
transfer, which would increase the cost and complexity.
Option D is not the best solution, as AWS DataSync is a service that makes it easy to transfer data
between on-premises storage systems and AWS storage services, such as Amazon S3, Amazon EFS,
and Amazon FSx for Windows File Server, but it does not support Amazon RDS for SQL Server as a
target. You would have to use another service or tool to migrate the data from Amazon S3 to Amazon
RDS for SQL Server, which would increase the latency and complexity.
Reference:
Database Migration - AWS Database Migration Service - AWS
What is AWS Database Migration Service?
AWS Database Migration Service Documentation
AWS Certified Data Engineer - Associate DEA-C01 Complete Study Guide
A data engineer is building a data pipeline on AWS by using AWS Glue extract, transform, and load
(ETL) jobs. The data engineer needs to process data from Amazon RDS and MongoDB, perform
transformations, and load the transformed data into Amazon Redshift for analytics. The data updates
must occur every hour.
Which combination of tasks will meet these requirements with the LEAST operational overhead?
(Choose two.)
A D
Explanation:
The correct answer is to configure AWS Glue triggers to run the ETL jobs every hour and to use AWS
Glue connections to establish connectivity between the data sources and Amazon Redshift. AWS Glue
triggers schedule and orchestrate ETL jobs with the least operational overhead. AWS Glue connections
provide secure connectivity to data sources and targets by using JDBC or MongoDB drivers. AWS Glue
DataBrew is a visual data preparation tool that does not support MongoDB as a data source. AWS
Lambda functions are a serverless option for scheduling and running ETL code, but they have a
15-minute execution limit, which may not be enough for complex transformations. The Redshift Data
API runs SQL commands on Amazon Redshift clusters without a persistent connection, but it is not the
mechanism AWS Glue ETL jobs use to load data into Redshift.
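A minimal sketch of the hourly schedule, assuming the ETL job and its Glue connections already exist; the job name, trigger name, and cron expression are placeholders.

```python
# Minimal sketch: create a scheduled Glue trigger that runs an existing ETL job
# every hour. The job name, trigger name, and cron expression are placeholders.
import boto3

glue = boto3.client("glue")

glue.create_trigger(
    Name="hourly-etl-trigger",
    Type="SCHEDULED",
    Schedule="cron(0 * * * ? *)",        # run at the top of every hour
    Actions=[{"JobName": "rds-mongodb-to-redshift"}],
    StartOnCreation=True,
)
```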
Reference:
AWS Glue triggers
AWS Glue connections
AWS Glue DataBrew
AWS Lambda functions
Redshift Data API
A company uses an Amazon Redshift cluster that runs on RA3 nodes. The company wants to scale
read and write capacity to meet demand. A data engineer needs to identify a solution that will turn
on concurrency scaling.
Which solution will meet this requirement?
B
Explanation:
Concurrency scaling is a feature that allows you to support thousands of concurrent users and
queries, with consistently fast query performance. When you turn on concurrency scaling, Amazon
Redshift automatically adds query processing power in seconds to process queries without any
delays. You can manage which queries are sent to the concurrency-scaling cluster by configuring
WLM queues. To turn on concurrency scaling for a queue, set the Concurrency Scaling mode value to
auto. The other options are either incorrect or irrelevant, as they do not enable concurrency scaling
for the existing Redshift cluster on RA3 nodes.
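A hedged sketch of applying that setting programmatically; the parameter group name and the single-queue WLM layout are assumptions, not part of the original question.

```python
# Hedged sketch: set Concurrency Scaling mode to "auto" for a WLM queue by
# updating the cluster's wlm_json_configuration parameter. The parameter group
# name and the queue definition are assumptions for illustration only.
import json
import boto3

redshift = boto3.client("redshift")

wlm_config = [
    {
        "query_group": [],
        "user_group": [],
        "query_concurrency": 5,          # example manual WLM queue
        "concurrency_scaling": "auto",   # turn on concurrency scaling for this queue
    }
]

redshift.modify_cluster_parameter_group(
    ParameterGroupName="custom-ra3-parameter-group",  # placeholder name
    Parameters=[
        {
            "ParameterName": "wlm_json_configuration",
            "ParameterValue": json.dumps(wlm_config),
            "ApplyType": "dynamic",
        }
    ],
)
```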
Reference:
Working with concurrency scaling - Amazon Redshift
Amazon Redshift Concurrency Scaling - Amazon Web Services
Configuring concurrency scaling queues - Amazon Redshift
AWS Certified Data Engineer - Associate DEA-C01 Complete Study Guide (Chapter 6, page 163)
A data engineer must orchestrate a series of Amazon Athena queries that will run every day. Each
query can run for more than 15 minutes.
Which combination of steps will meet these requirements MOST cost-effectively? (Choose two.)
A B
Explanation:
Option A and B are the correct answers because they meet the requirements most cost-effectively.
Using an AWS Lambda function and the Athena Boto3 client start_query_execution API call to invoke
the Athena queries programmatically is a simple and scalable way to orchestrate the queries.
Creating an AWS Step Functions workflow and adding two states to check the query status and
invoke the next query is a reliable and efficient way to handle the long-running queries.
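A minimal sketch of the Lambda side of this pattern, assuming a hypothetical database, query, and output location; the Step Functions workflow would call the status check repeatedly until the query finishes.

```python
# Minimal sketch of the Lambda side of this pattern. The database name, query
# text, and output location are hypothetical; a Step Functions workflow would
# invoke check_query repeatedly until the state is terminal.
import boto3

athena = boto3.client("athena")

def start_query(event, context):
    """Start one Athena query and return its execution ID to Step Functions."""
    response = athena.start_query_execution(
        QueryString="SELECT report_date, SUM(amount) FROM daily_sales GROUP BY report_date",
        QueryExecutionContext={"Database": "analytics_db"},
        ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
    )
    return {"QueryExecutionId": response["QueryExecutionId"]}

def check_query(event, context):
    """Return the current state (QUEUED, RUNNING, SUCCEEDED, FAILED, or CANCELLED)."""
    result = athena.get_query_execution(QueryExecutionId=event["QueryExecutionId"])
    return {
        "QueryExecutionId": event["QueryExecutionId"],
        "State": result["QueryExecution"]["Status"]["State"],
    }
```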
Option C is incorrect because using an AWS Glue Python shell job to invoke the Athena queries
programmatically is more expensive than using a Lambda function, as it requires provisioning and
running a Glue job for each query.
Option D is incorrect because using an AWS Glue Python shell script to run a sleep timer that checks
every 5 minutes to determine whether the current Athena query has finished running successfully is
not a cost-effective or reliable way to orchestrate the queries, as it wastes resources and time.
Option E is incorrect because using Amazon Managed Workflows for Apache Airflow (Amazon
MWAA) to orchestrate the Athena queries in AWS Batch is an overkill solution that introduces
unnecessary complexity and cost, as it requires setting up and managing an Airflow environment and
an AWS Batch compute environment.
Reference:
AWS Certified Data Engineer - Associate DEA-C01 Complete Study Guide, Chapter 5: Data Orchestration, Section 5.2: AWS Lambda, Section 5.3: AWS Step Functions, Pages 125-135
Building Batch Data Analytics Solutions on AWS, Module 5: Data Orchestration, Lesson 5.1: AWS Lambda, Lesson 5.2: AWS Step Functions, Pages 1-15
AWS Documentation Overview, AWS Lambda Developer Guide, Working with AWS Lambda Functions, Configuring Function Triggers, Using AWS Lambda with Amazon Athena, Pages 1-4
AWS Documentation Overview, AWS Step Functions Developer Guide, Getting Started, Tutorial: Create a Hello World Workflow, Pages 1-8
A company is migrating on-premises workloads to AWS. The company wants to reduce overall
operational overhead. The company also wants to explore serverless options.
The company's current workloads use Apache Pig, Apache Oozie, Apache Spark, Apache Hbase, and
Apache Flink. The on-premises workloads process petabytes of data in seconds. The company must
maintain similar or better performance after the migration to AWS.
Which extract, transform, and load (ETL) service will meet these requirements?
B
Explanation:
AWS Glue is a fully managed, serverless ETL service that can process petabytes of data. AWS Glue
runs Apache Spark jobs without requiring any infrastructure provisioning or management, so the
existing Spark workloads can be migrated directly, and Pig and Flink transformation logic can be
reimplemented as Spark jobs. AWS Glue workflows can take over Apache Oozie orchestration duties,
and the AWS Glue Data Catalog provides a central metadata store for the migrated data sets. AWS
Glue reduces the overall operational overhead by automating data discovery, data preparation, and
data loading, and it can optimize the cost and performance of ETL jobs by using job bookmarks,
crawlers, and the AWS Glue Schema Registry.
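As a hedged illustration, a Spark-based Glue job for one migrated workload could be defined with boto3 along these lines; the script location, IAM role, and worker sizing are placeholders.

```python
# Hedged illustration only: defines a serverless Spark ETL job in AWS Glue for
# one migrated workload. The script location, IAM role, and worker sizing are
# placeholders chosen for the example.
import boto3

glue = boto3.client("glue")

glue.create_job(
    Name="migrated-spark-etl",
    Role="arn:aws:iam::123456789012:role/GlueJobRole",
    Command={
        "Name": "glueetl",                                     # Spark ETL job type
        "ScriptLocation": "s3://example-scripts/migrated_spark_etl.py",
        "PythonVersion": "3",
    },
    GlueVersion="4.0",
    WorkerType="G.2X",
    NumberOfWorkers=20,       # scale workers to match the on-premises throughput
)
```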
Reference:
AWS Glue
AWS Glue Data Catalog
AWS Glue Workflows
AWS Glue Job Bookmarking
AWS Glue Crawlers
AWS Glue Schema Registry
AWS Certified Data Engineer - Associate DEA-C01 Complete Study Guide