Given the code fragment:
import pyspark.pandas as ps
psdf = ps.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
Which method is used to convert a Pandas API on Spark DataFrame (pyspark.pandas.DataFrame) into
a standard PySpark DataFrame (pyspark.sql.DataFrame)?
A
Explanation:
Pandas API on Spark (pyspark.pandas) allows interoperability with PySpark DataFrames. To convert a
pyspark.pandas.DataFrame to a standard PySpark DataFrame, you use .to_spark().
Example:
df = psdf.to_spark()
This is the officially supported method as per Databricks Documentation.
Incorrect options:
B, D: Invalid or nonexistent methods.
C: Converts to a local pandas DataFrame, not a PySpark DataFrame.
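A minimal round-trip sketch for illustration (the reverse conversion via pandas_api() is an extra detail not asked about in the question):
import pyspark.pandas as ps

psdf = ps.DataFrame({'col1': [1, 2], 'col2': [3, 4]})

# pandas-on-Spark DataFrame -> standard PySpark DataFrame
sdf = psdf.to_spark()
sdf.printSchema()

# Converting back, if needed (pandas_api() is available in recent Spark releases)
psdf_again = sdf.pandas_api()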
A Spark engineer is troubleshooting a Spark application that has been encountering out-of-memory
errors during execution. By reviewing the Spark driver logs, the engineer notices multiple "GC
overhead limit exceeded" messages.
Which action should the engineer take to resolve this issue?
C
Explanation:
The message "GC overhead limit exceeded" typically indicates that the JVM is spending too much
time in garbage collection with little memory recovery. This suggests that the driver or executor is
under-provisioned in memory.
The most effective remedy is to increase the driver memory using:
--driver-memory 4g
This is confirmed in Spark's official troubleshooting documentation:
“If you see a lot of GC overhead limit exceeded errors in the driver logs, it’s a sign that the driver is
running out of memory.”
— Spark Tuning Guide
Why others are incorrect:
A may help but does not directly address the driver memory shortage.
B is not a valid action; GC cannot be disabled.
D increases memory usage, worsening the problem.
A DataFrame df has columns name, age, and salary. The developer needs to sort the DataFrame by
age in ascending order and salary in descending order.
Which code snippet meets the requirement of the developer?
D
Explanation:
To sort a PySpark DataFrame by multiple columns with mixed sort directions, the correct usage is:
df.orderBy("age", "salary", ascending=[True, False])
age will be sorted in ascending order
salary will be sorted in descending order
The orderBy() and sort() methods in PySpark accept a list of booleans to specify the sort direction for
each column.
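As an illustrative sketch, the same ordering can also be written with Column expressions (column names taken from the question; both lines assume df is already defined):
from pyspark.sql import functions as F

# Boolean-list form, as in the correct option
df.orderBy("age", "salary", ascending=[True, False]).show()

# Equivalent Column-expression form
df.orderBy(F.col("age").asc(), F.col("salary").desc()).show()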
Documentation Reference:
PySpark API - DataFrame.orderBy
What is the difference between df.cache() and df.persist() in Spark DataFrame?
D
Explanation:
df.cache() is shorthand for df.persist(StorageLevel.MEMORY_AND_DISK)
df.persist() allows specifying any storage level such as MEMORY_ONLY, DISK_ONLY,
MEMORY_AND_DISK_SER, etc.
By default, persist() uses MEMORY_AND_DISK, unless specified otherwise.
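A minimal sketch showing both calls side by side (the DataFrame and the DISK_ONLY level are assumed purely for demonstration):
from pyspark.sql import SparkSession
from pyspark import StorageLevel

spark = SparkSession.builder.getOrCreate()
df = spark.range(1_000_000)

# cache() uses the default storage level for DataFrames (MEMORY_AND_DISK)
df.cache()
df.count()          # an action materializes the cache
df.unpersist()

# persist() accepts an explicit storage level
df.persist(StorageLevel.DISK_ONLY)
df.count()
print(df.storageLevel)   # shows the level that was actually applied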
Reference:
Spark Programming Guide - Caching and Persistence
A data analyst builds a Spark application to analyze finance data and performs the following
operations: filter, select, groupBy, and coalesce.
Which operation results in a shuffle?
A
Explanation:
The groupBy() operation causes a shuffle because it requires all values for a specific key to be
brought together, which may involve moving data across partitions.
In contrast:
filter() and select() are narrow transformations and do not cause shuffles.
coalesce() tries to reduce the number of partitions and avoids shuffling by moving data to fewer
partitions without a full shuffle (unlike repartition()).
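One way to see the difference, sketched here with an assumed toy DataFrame, is to compare physical plans: the grouped aggregation contains an Exchange (shuffle) operator, while the narrow transformations do not:
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.range(100).withColumn("key", F.col("id") % 10)

df.filter(F.col("id") > 5).select("id").explain()   # narrow: no Exchange in the plan
df.groupBy("key").count().explain()                 # wide: plan contains an Exchange (shuffle)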
Reference:
Apache Spark - Understanding Shuffle
A data engineer is asked to build an ingestion pipeline for a set of Parquet files delivered by an
upstream team on a nightly basis. The data is stored in a directory structure with a base path of
"/path/events/data". The upstream team drops daily data into the underlying subdirectories
following the convention year/month/day.
A few examples of the directory structure are:
Which of the following code snippets will read all the data within the directory structure?
B
Explanation:
To read all files recursively within a nested directory structure, Spark requires the recursiveFileLookup
option to be explicitly enabled. According to Databricks official documentation, when dealing with
deeply nested Parquet files in a directory tree (as shown in this example), you should set:
df = spark.read.option("recursiveFileLookup", "true").parquet("/path/events/data/")
This ensures that Spark searches through all subdirectories under /path/events/data/ and reads any
Parquet files it finds, regardless of the folder depth.
Option A is incorrect because the inferSchema option it sets is irrelevant here and does not enable
recursive file reading.
Option C is incorrect because wildcards may not reliably match deep nested structures beyond one
directory level.
Option D is incorrect because it will only read files directly within /path/events/data/ and not
subdirectories like /2023/01/01.
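As a sketch of the correct option (the base path comes from the question; the count is only a quick way to confirm that files in nested subdirectories were picked up):
# Recursively read every Parquet file under the base path, regardless of depth
df = (
    spark.read
    .option("recursiveFileLookup", "true")
    .parquet("/path/events/data/")
)
print(df.count())   # rows from all nightly drops, e.g. .../2023/01/01, .../2023/01/02, ...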
Databricks documentation reference:
"To read files recursively from nested folders, set the recursiveFileLookup option to true. This is
useful when data is organized in hierarchical folder structures" — Databricks documentation on
Parquet files ingestion and options.
Given a DataFrame df that has 10 partitions, after running the code:
result = df.coalesce(20)
How many partitions will the result DataFrame have?
A
Explanation:
The .coalesce(numPartitions) function is used to reduce the number of partitions in a DataFrame. It
does not increase the number of partitions. If the specified number of partitions is greater than the
current number, it will not have any effect.
From the official Spark documentation:
“coalesce() results in a narrow dependency, e.g. if you go from 1000 partitions to 100 partitions,
there will not be a shuffle, instead each of the 100 new partitions will claim one or more of the
current partitions.”
However, if you try to increase partitions using coalesce (e.g., from 10 to 20), the number of
partitions remains unchanged.
Hence, df.coalesce(20) will still return a DataFrame with 10 partitions.
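A quick sketch that demonstrates this behavior (the DataFrame is assumed for illustration; an active SparkSession named spark is assumed):
df = spark.range(1000).repartition(10)
print(df.rdd.getNumPartitions())        # 10

result = df.coalesce(20)                # coalesce() cannot increase the partition count
print(result.rdd.getNumPartitions())    # still 10

print(df.repartition(20).rdd.getNumPartitions())   # 20, because repartition() can increase, at the cost of a full shuffle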
Reference: Apache Spark 3.5 Programming Guide → RDD and DataFrame Operations → coalesce()
Given the following code snippet in my_spark_app.py:
What is the role of the driver node?
A
Explanation:
In the Spark architecture, the driver node is responsible for orchestrating the execution of a Spark
application. It converts user-defined transformations and actions into a logical plan, optimizes it into
a physical plan, and then splits the plan into tasks that are distributed to the executor nodes.
As per Databricks and Spark documentation:
“The driver node is responsible for maintaining information about the Spark application, responding
to a user's program or input, and analyzing, distributing, and scheduling work across the executors.”
This means:
Option A is correct because the driver schedules and coordinates the job execution.
Option B is incorrect because the driver does more than just UI monitoring.
Option C is incorrect since data and computations are distributed across executor nodes.
Option D is incorrect; results are returned to the driver but not stored long-term by it.
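As a generic, annotated sketch of the driver's role (this is not the my_spark_app.py snippet from the question, which is not reproduced here; the path and column name are hypothetical):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("my_spark_app").getOrCreate()   # driver process starts here

df = spark.read.parquet("/some/input/path")   # driver only builds a logical plan; no data is read yet
agg = df.groupBy("some_col").count()          # still plan construction on the driver

agg.show()   # action: the driver optimizes the plan, splits it into tasks,
             # schedules them on executors, and receives a small result back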
Reference: Databricks Certified Developer Spark 3.5 Documentation → Spark Architecture → Driver
vs Executors.
A Spark developer wants to improve the performance of an existing PySpark UDF that runs a hash
function that is not available in the standard Spark functions library. The existing UDF code is:
import hashlib
import pyspark.sql.functions as sf
from pyspark.sql.types import StringType
def shake_256(raw):
    return hashlib.shake_256(raw.encode()).hexdigest(20)
shake_256_udf = sf.udf(shake_256, StringType())
The developer wants to replace this existing UDF with a Pandas UDF to improve performance. The
developer changes the definition of shake_256_udf to this:CopyEdit
shake_256_udf = sf.pandas_udf(shake_256, StringType())
However, the developer receives the error:
What should the signature of the shake_256() function be changed to in order to fix this error?
D
Explanation:
When converting a standard PySpark UDF to a Pandas UDF for performance optimization, the
function must operate on a Pandas Series as input and return a Pandas Series as output.
In this case, the original function signature:
def shake_256(raw: str) -> str
operates on individual string values, which is not compatible with a Pandas UDF that expects a
pandas.Series as input and output.
According to the official Spark documentation:
“Pandas UDFs operate on pandas.Series and return pandas.Series. The function definition should be:
def my_udf(s: pd.Series) -> pd.Series:
and it must be registered using pandas_udf(...).”
Therefore, to fix the error:
The function should be updated to:
def shake_256(df: pd.Series) -> pd.Series:
    return df.apply(lambda x: hashlib.shake_256(x.encode()).hexdigest(20))
This will allow Spark to efficiently execute the Pandas UDF in vectorized form, improving
performance compared to standard UDFs.
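A minimal, self-contained sketch of the corrected setup (the decorator form and the commented usage line are illustrative assumptions, not part of the question):
import hashlib
import pandas as pd
import pyspark.sql.functions as sf
from pyspark.sql.types import StringType

@sf.pandas_udf(StringType())
def shake_256_udf(raw: pd.Series) -> pd.Series:
    # Each batch arrives as a pandas Series; apply the hash element-wise and return a Series
    return raw.apply(lambda x: hashlib.shake_256(x.encode()).hexdigest(20))

# Hypothetical usage, assuming df has a string column named "raw":
# df.withColumn("hashed", shake_256_udf("raw")).show()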
Reference: Apache Spark 3.5 Documentation → User-Defined Functions → Pandas UDFs
A developer is working with a pandas DataFrame containing user behavior data from a web
application.
Which approach should be used for executing a groupBy operation in parallel across all workers in
Apache Spark 3.5?
A.
Use the applyInPandas API:
df.groupby("user_id").applyInPandas(mean_func, schema="user_id long, value double").show()
B.
Use the mapInPandas API:
df.mapInPandas(mean_func, schema="user_id long, value double").show()
C.
Use a regular Spark UDF:
from pyspark.sql.functions import mean
df.groupBy("user_id").agg(mean("value")).show()
D.
Use a Pandas UDF:
@pandas_udf("double")
def mean_func(value: pd.Series) -> float:
    return value.mean()
df.groupby("user_id").agg(mean_func(df["value"])).show()
A
Explanation:
The correct approach to perform a parallelized groupBy operation across Spark worker nodes using
Pandas API is via applyInPandas. This function enables grouped map operations using Pandas logic in
a distributed Spark environment. It applies a user-defined function to each group of data represented
as a Pandas DataFrame.
As per the Databricks documentation:
"applyInPandas() allows for vectorized operations on grouped data in Spark. It applies a user-defined
function to each group of a DataFrame and outputs a new DataFrame. This is the recommended
approach for using Pandas logic across grouped data with parallel execution."
Option A is correct and achieves this parallel execution.
Option B (mapInPandas) applies to the entire DataFrame, not grouped operations.
Option C uses built-in aggregation functions, which are efficient but not customizable with Pandas
logic.
Option D creates a scalar Pandas UDF which does not perform a group-wise transformation.
Therefore, to run a groupBy with parallel Pandas logic on Spark workers, Option A using
applyInPandas is the only correct answer.
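A runnable sketch of option A, with the grouped function and toy data assumed for illustration:
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 1.0), (1, 2.0), (2, 3.0)], ["user_id", "value"])

def mean_func(pdf: pd.DataFrame) -> pd.DataFrame:
    # Receives all rows for one user_id group as a pandas DataFrame on a worker
    return pd.DataFrame({"user_id": [pdf["user_id"].iloc[0]],
                         "value": [pdf["value"].mean()]})

df.groupby("user_id").applyInPandas(mean_func, schema="user_id long, value double").show()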
Reference: Apache Spark 3.5 Documentation → Pandas API on Spark → Grouped Map Pandas UDFs
(applyInPandas)