Databricks Certified Data Engineer Professional Practice Test

Exam Title: Databricks Certified Data Engineer Professional

Last update: Nov 27, 2025
Question 1

The data engineering team has configured a job to process customer requests to be forgotten (have
their data deleted). All user data that needs to be deleted is stored in Delta Lake tables using default
table settings.
The team has decided to process all deletions from the previous week as a batch job at 1am each
Sunday. The total duration of this job is less than one hour. Every Monday at 3am, a batch job
executes a series of VACUUM commands on all Delta Lake tables throughout the organization.
The compliance officer has recently learned about Delta Lake's time travel functionality. They are
concerned that this might allow continued access to deleted data.
Assuming all delete logic is correctly implemented, which statement correctly addresses this
concern?

  • A. Because the vacuum command permanently deletes all files containing deleted records, deleted records may be accessible with time travel for around 24 hours.
  • B. Because the default data retention threshold is 24 hours, data files containing deleted records will be retained until the vacuum job is run the following day.
  • C. Because Delta Lake time travel provides full access to the entire history of a table, deleted records can always be recreated by users with full admin privileges.
  • D. Because Delta Lake's delete statements have ACID guarantees, deleted records will be permanently purged from all storage systems as soon as a delete job completes.
  • E. Because the default data retention threshold is 7 days, data files containing deleted records will be retained until the vacuum job is run 8 days later.
Answer:

E


Explanation:
By default, VACUUM only removes data files that are no longer referenced by the current table version and are older than the retention threshold, which defaults to 7 days. Records deleted in the Sunday 1am job therefore remain in older data files, and are still reachable via time travel, until a VACUUM run after the 7-day threshold has elapsed, i.e. the job that runs 8 days later. See https://learn.microsoft.com/en-us/azure/databricks/delta/vacuum
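A minimal sketch of this behavior, assuming a hypothetical table named user_data and an arbitrary earlier version number:

# Sketch only; table name and version number are hypothetical.
spark.sql("VACUUM user_data")  # default retention: only files older than 7 days are removed
# Until those files are removed, time travel can still surface the deleted rows:
spark.sql("SELECT * FROM user_data VERSION AS OF 5").show()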

Question 2

A junior data engineer has configured a workload that posts the following JSON to the Databricks
REST API endpoint 2.0/jobs/create.

Assuming that all configurations and referenced resources are available, which statement describes
the result of executing this workload three times?

  • A. Three new jobs named "Ingest new data" will be defined in the workspace, and they will each run once daily.
  • B. The logic defined in the referenced notebook will be executed three times on new clusters with the configurations of the provided cluster ID.
  • C. Three new jobs named "Ingest new data" will be defined in the workspace, but no jobs will be executed.
  • D. One new job named "Ingest new data" will be defined in the workspace, but it will not be executed.
  • E. The logic defined in the referenced notebook will be executed three times on the referenced existing all purpose cluster.
Answer:

C


Explanation:
This is the correct answer because the JSON posted to the Databricks REST API endpoint
2.0/jobs/create defines a new job with a name, an existing cluster id, and a notebook task. However,
it does not specify any schedule or trigger for the job execution. Therefore, three new jobs with the
same name and configuration will be created in the workspace, but none of them will be executed
until they are manually triggered or scheduled. Verified Reference: [Databricks Certified Data
Engineer Professional], under “Monitoring & Logging” section; [Databricks Documentation], under
“Jobs API - Create” section.
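For illustration only (the JSON referenced in the question is not reproduced here), a hedged sketch of the kind of payload and call the question describes; the cluster ID, notebook path, workspace URL, and token are all placeholders:

import requests

# Hypothetical values throughout; this is not the question's actual JSON.
payload = {
    "name": "Ingest new data",
    "existing_cluster_id": "0123-456789-abcd123",               # placeholder cluster ID
    "notebook_task": {"notebook_path": "/Repos/prod/ingest"},   # placeholder path
    # no "schedule" block, so the job is defined but never triggered
}

# Each POST to 2.0/jobs/create registers a new job, even if the name repeats.
resp = requests.post(
    "https://<workspace-url>/api/2.0/jobs/create",   # placeholder workspace URL
    headers={"Authorization": "Bearer <personal-access-token>"},
    json=payload,
)
print(resp.json())  # returns a job_id; running the job would require 2.0/jobs/run-now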

Question 3

An upstream system is emitting change data capture (CDC) logs that are being written to a cloud
object storage directory. Each record in the log indicates the change type (insert, update, or delete)
and the values for each field after the change. The source table has a primary key identified by the
field pk_id.
For auditing purposes, the data governance team wishes to maintain a full record of all values that
have ever been valid in the source system. For analytical purposes, only the most recent value for
each record needs to be recorded. The Databricks job to ingest these records occurs once per hour,
but each individual record may have changed multiple times over the course of an hour.
Which solution meets these requirements?

  • A. Create a separate history table for each pk_id; resolve the current state of the table by running a union all and filtering the history tables for the most recent state.
  • B. Use merge into to insert, update, or delete the most recent entry for each pk_id into a bronze table, then propagate all changes throughout the system.
  • C. Iterate through an ordered set of changes to the table, applying each in turn; rely on Delta Lake's versioning ability to create an audit log.
  • D. Use Delta Lake's change data feed to automatically process CDC data from an external system, propagating all changes to all dependent tables in the Lakehouse.
  • E. Ingest all log information into a bronze table; use merge into to insert, update, or delete the most recent entry for each pk_id into a silver table to recreate the current table state.
Answer:

E


Explanation:
This is the correct answer because it meets the requirements of maintaining a full record of all values
that have ever been valid in the source system and recreating the current table state with only the
most recent value for each record. This approach ingests all log information into a bronze table, which
preserves the raw CDC data as it is. Then, it uses merge into to perform an upsert operation on a
silver table, which means it will insert new records or update or delete existing records based on the
change type and the pk_id columns. This way, the silver table will always reflect the current state of
the source table, while the bronze table will keep the history of all changes. Verified Reference:
[Databricks Certified Data Engineer Professional], under “Delta Lake” section; Databricks
Documentation, under “Upsert into a table using merge” section.
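A hedged sketch of the bronze-to-silver pattern in option E. The table names (bronze_cdc, silver_current) and the change_timestamp ordering column are assumptions, and it assumes the silver table shares the schema of the change records so that UPDATE SET * / INSERT * line up:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

bronze = spark.table("bronze_cdc")  # raw CDC log, appended as-is for the audit trail

# Keep only the latest change per pk_id within this batch
w = Window.partitionBy("pk_id").orderBy(F.col("change_timestamp").desc())
latest = (bronze
          .withColumn("rn", F.row_number().over(w))
          .filter("rn = 1")
          .drop("rn"))
latest.createOrReplaceTempView("latest_changes")

spark.sql("""
    MERGE INTO silver_current AS t
    USING latest_changes AS s
    ON t.pk_id = s.pk_id
    WHEN MATCHED AND s.change_type = 'delete' THEN DELETE
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED AND s.change_type != 'delete' THEN INSERT *
""")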

Question 4

An hourly batch job is configured to ingest data files from a cloud object storage container where
each batch represents all records produced by the source system in a given hour. The batch job to
process these records into the Lakehouse is sufficiently delayed to ensure no late-arriving data is
missed. The user_id field represents a unique key for the data, which has the following schema:
user_id BIGINT, username STRING, user_utc STRING, user_region STRING, last_login BIGINT,
auto_pay BOOLEAN, last_updated BIGINT
New records are all ingested into a table named account_history which maintains a full record of all
data in the same schema as the source. The next table in the system is named account_current and is
implemented as a Type 1 table representing the most recent value for each unique user_id.
Assuming there are millions of user accounts and tens of thousands of records processed hourly,
which implementation can be used to efficiently update the described account_current table as part
of each hourly batch job?

  • A. Use Auto Loader to subscribe to new files in the account_history directory; configure a Structured Streaming trigger-once job to batch update newly detected files into the account_current table.
  • B. Overwrite the account_current table with each batch using the results of a query against the account_history table grouping by user_id and filtering for the max value of last_updated.
  • C. Filter records in account_history using the last_updated field and the most recent hour processed, as well as the max last_login by user_id; write a merge statement to update or insert the most recent value for each user_id.
  • D. Use Delta Lake version history to get the difference between the latest version of account_history and one version prior, then write these records to account_current.
  • E. Filter records in account_history using the last_updated field and the most recent hour processed, making sure to deduplicate on username; write a merge statement to update or insert the most recent value for each username.
Answer:

C


Explanation:
This is the correct answer because it efficiently updates the account_current table with only the most recent value for each user_id. This approach filters records in account_history using the last_updated field and the most recent hour processed, which means it only processes the latest batch of data. It also filters by the max last_login per user_id, which means it keeps only the most recent record for each user_id within that batch. Then, a merge statement updates or inserts the most recent value for each user_id into account_current, performing an upsert keyed on the user_id column. Verified Reference: [Databricks Certified Data Engineer Professional], under “Delta Lake” section; Databricks Documentation, under “Upsert into a table using merge” section.
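A hedged sketch of option C's hourly merge. The epoch-second batch bounds are hypothetical, and deduplication here keeps the latest record per user_id by last_updated:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Hypothetical epoch-second bounds for the hour being processed
batch_start, batch_end = 1_700_000_000, 1_700_003_600

hourly = (spark.table("account_history")
          .filter((F.col("last_updated") >= batch_start) &
                  (F.col("last_updated") < batch_end)))

# Most recent record per user_id within the hour
w = Window.partitionBy("user_id").orderBy(F.col("last_updated").desc())
latest = hourly.withColumn("rn", F.row_number().over(w)).filter("rn = 1").drop("rn")
latest.createOrReplaceTempView("hourly_updates")

spark.sql("""
    MERGE INTO account_current AS t
    USING hourly_updates AS s
    ON t.user_id = s.user_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")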

Question 5

A table in the Lakehouse named customer_churn_params is used in churn prediction by the machine
learning team. The table contains information about customers derived from a number of upstream
sources. Currently, the data engineering team populates this table nightly by overwriting the table
with the current valid values derived from upstream data sources.
The churn prediction model used by the ML team is fairly stable in production. The team is only
interested in making predictions on records that have changed in the past 24 hours.
Which approach would simplify the identification of these changed records?

  • A. Apply the churn model to all rows in the customer_churn_params table, but implement logic to perform an upsert into the predictions table that ignores rows where predictions have not changed.
  • B. Convert the batch job to a Structured Streaming job using the complete output mode; configure a Structured Streaming job to read from the customer_churn_params table and incrementally predict against the churn model.
  • C. Calculate the difference between the previous model predictions and the current customer_churn_params on a key identifying unique customers before making new predictions; only make predictions on those customers not in the previous predictions.
  • D. Modify the overwrite logic to include a field populated by calling spark.sql.functions.current_timestamp() as data are being written; use this field to identify records written on a particular date.
  • E. Replace the current overwrite logic with a merge statement to modify only those records that have changed; write logic to make predictions on the changed records identified by the change data feed.
Answer:

E


Explanation:
The approach that would simplify the identification of the changed records is to replace the current
overwrite logic with a merge statement to modify only those records that have changed, and write
logic to make predictions on the changed records identified by the change data feed.
This approach leverages the Delta Lake features of merge and change data feed, which are designed to handle upserts and track row-level changes in a Delta table. By using merge, the data engineering team
can avoid overwriting the entire table every night, and only update or insert the records that have
changed in the source data. By using change data feed, the ML team can easily access the change
events that have occurred in the customer_churn_params table, and filter them by operation type
(update or insert) and timestamp. This way, they can only make predictions on the records that have
changed in the past 24 hours, and avoid re-processing the unchanged records.
The other options are not as simple or efficient as the proposed approach, because:
Option A would require applying the churn model to all rows in the customer_churn_params table,
which would be wasteful and redundant. It would also require implementing logic to perform an
upsert into the predictions table, which would be more complex than using the merge statement.
Option B would require converting the batch job to a Structured Streaming job, which would involve
changing the data ingestion and processing logic. It would also require using the complete output
mode, which would output the entire result table every time there is a change in the source data,
which would be inefficient and costly.
Option C would require calculating the difference between the previous model predictions and the
current customer_churn_params on a key identifying unique customers, which would be
computationally expensive and prone to errors. It would also require storing and accessing the
previous predictions, which would add extra storage and I/O costs.
Option D would require modifying the overwrite logic to include a field populated by calling
spark.sql.functions.current_timestamp() as data are being written, which would add extra complexity
and overhead to the data engineering job. It would also require using this field to identify records
written on a particular date, which would be less accurate and reliable than using the change data
feed.
Reference: Merge, Change data feed
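A hedged sketch of the scoring-side read, assuming delta.enableChangeDataFeed has been set on customer_churn_params and using a placeholder starting timestamp roughly 24 hours before the run:

# Sketch only; requires change data feed to be enabled on the table.
changes = (spark.read
           .format("delta")
           .option("readChangeFeed", "true")
           .option("startingTimestamp", "2025-11-26 00:00:00")  # placeholder: ~24 hours ago
           .table("customer_churn_params"))

# Score only rows that were inserted or updated; _change_type is added by the feed.
recent = changes.filter("_change_type IN ('insert', 'update_postimage')")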

Question 6

A table is registered with the following code:

Both users and orders are Delta Lake tables. Which statement describes the results of querying
recent_orders?

  • A. All logic will execute at query time and return the result of joining the valid versions of the source tables at the time the query finishes.
  • B. All logic will execute when the table is defined and store the result of joining tables to the DBFS; this stored data will be returned when the table is queried.
  • C. Results will be computed and cached when the table is defined; these cached results will incrementally update as new records are inserted into source tables.
  • D. All logic will execute at query time and return the result of joining the valid versions of the source tables at the time the query began.
  • E. The versions of each source table will be stored in the table transaction log; query results will be saved to DBFS with each query.
Answer:

B


Question 7

A production workload incrementally applies updates from an external Change Data Capture feed to
a Delta Lake table as an always-on Structured Stream job. When data was initially migrated for this
table, OPTIMIZE was executed and most data files were resized to 1 GB. Auto Optimize and Auto
Compaction were both turned on for the streaming production job. Recent review of data files shows
that most data files are under 64 MB, although each partition in the table contains at least 1 GB of
data and the total table size is over 10 TB.
Which of the following likely explains these smaller file sizes?

  • A. Databricks has autotuned to a smaller target file size to reduce duration of MERGE operations
  • B. Z-order indices calculated on the table are preventing file compaction
  • C. Bloom filter indices calculated on the table are preventing file compaction
  • D. Databricks has autotuned to a smaller target file size based on the overall size of data in the table
  • E. Databricks has autotuned to a smaller target file size based on the amount of data in each partition
Answer:

A


Explanation:
This is the correct answer because Databricks automatically tunes the target file size of Delta tables based on the workload. Auto Optimize (optimized writes and auto compaction) coalesces small files into larger ones, but Databricks also weighs the trade-off between file size and MERGE performance, and will choose a smaller target file size to reduce the duration of MERGE operations for workloads that frequently rewrite existing records, such as a CDC stream. Therefore, the table has most likely been autotuned to a smaller target file size based
on the characteristics of the streaming production job. Verified Reference: [Databricks Certified Data
Engineer Professional], under “Delta Lake” section; Databricks Documentation, under “Auto
Optimize” section. https://docs.databricks.com/en/delta/tune-file-size.html#autotune-table
'Autotune file size based on workload'
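A hedged sketch of inspecting and pinning this behavior, using a hypothetical table name; delta.tuneFileSizesForRewrites is the table property the linked tuning page documents for MERGE-heavy tables:

# Inspect current table properties (look for file-size tuning settings)
spark.sql("DESCRIBE DETAIL bronze_cdc_target").select("properties").show(truncate=False)

# Explicitly opt in to file-size tuning for frequently rewritten tables
spark.sql("""
    ALTER TABLE bronze_cdc_target
    SET TBLPROPERTIES (delta.tuneFileSizesForRewrites = true)
""")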

Question 8

Which statement regarding stream-static joins and static Delta tables is correct?

  • A. Each microbatch of a stream-static join will use the most recent version of the static Delta table as of each microbatch.
  • B. Each microbatch of a stream-static join will use the most recent version of the static Delta table as of the job's initialization.
  • C. The checkpoint directory will be used to track state information for the unique keys present in the join.
  • D. Stream-static joins cannot use static Delta tables because of consistency issues.
  • E. The checkpoint directory will be used to track updates to the static Delta table.
Answer:

A


Explanation:
This is the correct answer because stream-static joins are supported by Structured Streaming when
one of the tables is a static Delta table. The static side of the join is not itself streamed or checkpointed; instead, it is re-read as an ordinary Delta table when each microbatch is processed, so it may change between microbatches. In this
case, each microbatch of a stream-static join will use the most recent version of the static Delta table
as of each microbatch, which means it will reflect any changes made to the static Delta table before
the start of each microbatch. Verified Reference: [Databricks Certified Data Engineer Professional],
under “Structured Streaming” section; Databricks Documentation, under “Stream and static joins”
section.
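A minimal sketch of a stream-static join, with hypothetical table names, join key, and checkpoint path:

# Streaming fact table joined to a static Delta dimension table.
orders_stream = spark.readStream.table("bronze_orders")   # streaming side
customers = spark.read.table("dim_customers")             # static side, re-read each microbatch

joined = orders_stream.join(customers, on="customer_id", how="left")

query = (joined.writeStream
         .option("checkpointLocation", "/tmp/checkpoints/orders_enriched")
         .trigger(availableNow=True)
         .toTable("silver_orders_enriched"))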

Question 9

A junior data engineer has been asked to develop a streaming data pipeline with a grouped
aggregation using DataFrame df. The pipeline needs to calculate the average humidity and average
temperature for each non-overlapping five-minute interval. Events are recorded once per minute per
device.
Streaming DataFrame df has the following schema:
"device_id INT, event_time TIMESTAMP, temp FLOAT, humidity FLOAT"
Code block:

Choose the response that correctly fills in the blank within the code block to complete this task.

  • A. to_interval("event_time", "5 minutes").alias("time")
  • B. window("event_time", "5 minutes").alias("time")
  • C. "event_time"
  • D. window("event_time", "10 minutes").alias("time")
  • E. lag("event_time", "10 minutes").alias("time")
Answer:

B


Explanation:
This is the correct answer because the window function is used to group streaming data by time
intervals. The window function takes a time column and a window duration (with optional slide duration and start offset arguments). The window duration specifies how long each window is; in this case it is “5 minutes”, so each window covers a non-overlapping five-
minute interval. The window function also returns a struct column with two fields: start and end,
which represent the start and end time of each window. The alias function is used to rename the
struct column as “time”. Verified Reference: [Databricks Certified Data Engineer Professional], under
“Structured Streaming” section; Databricks Documentation, under “WINDOW” section.
https://www.databricks.com/blog/2017/05/08/event-time-aggregation-watermarking-apache-sparks-structured-streaming.html
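A hedged sketch of the completed aggregation using the schema from the question; the watermark duration and the choice to group only by the window are assumptions, since the question's code block is not reproduced here (it may also group by device_id):

from pyspark.sql import functions as F

agg = (df
       .withWatermark("event_time", "10 minutes")   # watermark duration is an assumption
       .groupBy(F.window("event_time", "5 minutes").alias("time"))
       .agg(F.avg("humidity").alias("avg_humidity"),
            F.avg("temp").alias("avg_temp")))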

Question 10

A data architect has designed a system in which two Structured Streaming jobs will concurrently
write to a single bronze Delta table. Each job is subscribing to a different topic from an Apache Kafka
source, but they will write data with the same schema. To keep the directory structure simple, a data
engineer has decided to nest a checkpoint directory to be shared by both streams.
The proposed directory structure is displayed below:

Which statement describes whether this checkpoint directory structure is valid for the given scenario
and why?

  • A. No; Delta Lake manages streaming checkpoints in the transaction log.
  • B. Yes; both of the streams can share a single checkpoint directory.
  • C. No; only one stream can write to a Delta Lake table.
  • D. Yes; Delta Lake supports infinite concurrent writers.
  • E. No; each of the streams needs to have its own checkpoint directory.
Answer:

E


Explanation:
This is the correct answer because checkpointing is a critical feature of Structured Streaming that
provides fault tolerance and recovery in case of failures. Checkpointing stores the current state and
progress of a streaming query in a reliable storage system, such as DBFS or S3. Each streaming query
must have its own checkpoint directory that is unique and exclusive to that query. If two streaming
queries share the same checkpoint directory, they will interfere with each other and cause
unexpected errors or data loss. Verified Reference: [Databricks Certified Data Engineer Professional],
under “Structured Streaming” section; Databricks Documentation, under “Checkpointing” section.
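A minimal sketch of the corrected layout, giving each stream its own checkpoint directory; the broker address, topics, paths, and table name are hypothetical:

# Each stream writes to the same bronze table but uses its own checkpoint directory.
def start_ingest(topic, checkpoint_dir):
    raw = (spark.readStream
           .format("kafka")
           .option("kafka.bootstrap.servers", "broker:9092")   # placeholder broker
           .option("subscribe", topic)
           .load())
    return (raw.writeStream
            .option("checkpointLocation", checkpoint_dir)
            .toTable("bronze_events"))

query_a = start_ingest("topic_a", "/checkpoints/bronze_events/topic_a")
query_b = start_ingest("topic_b", "/checkpoints/bronze_events/topic_b")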
