In this article, we will delve into the various default configurations of a Spark session and explore how to display and modify them using both PySpark and SparkSQL. By gaining a deeper understanding of these configurations, you can fine-tune your Spark applications to achieve better efficiency and scalability.
Before diving into the most commonly used Spark configurations, let's first see how to display all of the default configurations.
Using PySpark:
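The snippet below (a condensed version of the complete code example at the end of this post; it assumes an active SparkSession named spark, as in a Databricks notebook or the PySpark shell) pulls every configuration pair from the session's SparkConf and shows it as a sorted DataFrame:
from pyspark.sql.functions import col
#Get all (key, value) pairs currently held by the session's SparkConf
configs = spark.sparkContext.getConf().getAll()
#Turn the list of pairs into a DataFrame so it is easy to sort and inspect
df = spark.createDataFrame(configs, ["Key", "Value"])
df.orderBy(col("Key")).show(truncate=False)  #on Databricks, display(df.orderBy(col("Key"))) renders a richer table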
Using SparkSQL:
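In a SQL cell (or through spark.sql), the SET command does the same job; this mirrors the SQL in the complete code example at the end of the post, with SET -v as an optional extra that also lists defaults and descriptions:
--List the configuration properties that have been set for this session
SET;
--Include defaults and a short description for each property
SET -v;
--Inspect a single property
SET spark.sql.shuffle.partitions;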
Commonly Used PySpark Configurations
- spark.master
Description: Specifies the cluster manager for Spark to connect to, such as local, YARN, or Mesos.
Usage: Determines where the Spark application runs. Use local[*] for local testing or point it at a cluster manager. (A full session-builder sketch appears right after this list.)
Default Value: local[*] when launched from the PySpark shell; on a cluster it is supplied by spark-submit or the cluster manager.
Example: SparkSession.builder.master("local[4]") (this is a startup setting and cannot be changed with spark.conf.set on a running session)
- spark.executor.memory
Description: Allocates the amount of memory each executor can use.
Usage: Increase this value to give executors more headroom for memory-heavy computations.
Default Value: 1g
Example: SparkSession.builder.config("spark.executor.memory", "4g") (set when the session is created or via spark-submit --executor-memory; not changeable with spark.conf.set at runtime)
- spark.driver.memory
Description: Specifies the amount of memory available to the driver process.
Usage: Increase this value for larger workloads or when the driver has to collect or process significant amounts of data.
Default Value: 1g
Example: SparkSession.builder.config("spark.driver.memory", "2g") (set at session creation or via spark-submit --driver-memory; not changeable at runtime)
- spark.executor.cores
Description: Sets the number of CPU cores each executor can use.
Usage: Increase this value so each executor can run more tasks in parallel.
Default Value: 1 on YARN; all available cores of the worker in standalone mode.
Example: SparkSession.builder.config("spark.executor.cores", "2") (set at session creation or via spark-submit --executor-cores; not changeable at runtime)
- spark.sql.shuffle.partitions
Description: Defines the number of partitions used for shuffle operations such as joins and aggregations.
Usage: Lower this value for smaller datasets and increase it for large datasets to balance performance.
Default Value: 200
Example: spark.conf.set("spark.sql.shuffle.partitions", "50")
- spark.default.parallelism
Description: Specifies the default number of partitions for RDD operations that do not set one explicitly.
Usage: Adjust this value based on the cluster's resources and workload.
Default Value: Based on the number of cores available in the cluster (or on the local machine in local mode).
Example: SparkSession.builder.config("spark.default.parallelism", "8") (set at session creation; it cannot be changed with spark.conf.set on a running session)
- spark.sql.autoBroadcastJoinThreshold
Description: Sets the maximum size of a table that will be broadcast to all executors for join operations.
Usage: Increase this value so that somewhat larger tables can still be broadcast, which speeds up joins; set it to -1 to disable automatic broadcast joins. (A short runtime demo appears just before the complete code example below.)
Default Value: 10MB
Example: spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "20MB")
- spark.serializer
Description: Specifies the serializer used when shuffling and caching data; KryoSerializer is faster and more memory-efficient than the default Java serializer.
Usage: Use KryoSerializer for better performance when dealing with large datasets.
Default Value: org.apache.spark.serializer.JavaSerializer
Example: SparkSession.builder.config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") (set at session creation; not changeable at runtime)
- spark.network.timeout
Description: Sets the default timeout for network interactions between nodes.
Usage: Increase this value for heavy workloads or slow networks to prevent timeout errors.
Default Value: 120s
Example: SparkSession.builder.config("spark.network.timeout", "300s") (set at session creation or via spark-submit --conf)
- spark.rdd.compress
Description: Enables compression of serialized RDD partitions to save memory and storage at the cost of some extra CPU time.
Usage: Use this option for memory-intensive workloads to improve resource efficiency.
Default Value: false
Example: SparkSession.builder.config("spark.rdd.compress", "true") (set at session creation)
Each Spark session comes with several hundred configuration parameters, which control various aspects of Spark's behavior. These configurations define how Spark interacts with the cluster, manages resources, and processes data. Using the spark.conf.get and spark.conf.set methods, you can view and modify settings programmatically within your Spark application. With spark.conf.get, you can retrieve the current value of any configuration, while spark.conf.set lets you override defaults to tailor the session's behavior to your workload. Keep in mind that spark.conf.set only applies to runtime settings (mostly spark.sql.* properties); cluster-level settings such as spark.master, spark.executor.memory, and spark.serializer have to be supplied when the session is created or through spark-submit.
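As a rough illustration of that difference (a sketch assuming an active session named spark; exact plans and partition counts can vary, for example when Adaptive Query Execution coalesces shuffle partitions):
from pyspark.sql.functions import col

#Runtime SQL settings can be read and changed on the fly
print(spark.conf.get("spark.sql.shuffle.partitions"))
spark.conf.set("spark.sql.shuffle.partitions", "50")

#The new value drives the partition count of shuffles such as this aggregation
agg = spark.range(1_000_000).groupBy((col("id") % 10).alias("bucket")).count()
print(agg.rdd.getNumPartitions())  #around 50; AQE may coalesce to fewer

#Lowering the broadcast threshold changes the join strategy Spark picks
small = spark.range(100).withColumnRenamed("id", "k")
large = spark.range(1_000_000).withColumnRenamed("id", "k")
large.join(small, "k").explain()  #expect a BroadcastHashJoin for the small table
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")  #disable automatic broadcasting
large.join(small, "k").explain()  #typically falls back to a SortMergeJoin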
Complete Code Example
For your convenience, here's the complete code used in this blog to work with Spark configurations:
#Using PySpark
from pyspark.sql.functions import col

#Collect all (key, value) pairs from the session's SparkConf
configs = spark.sparkContext.getConf().getAll()
df = spark.createDataFrame(configs, ["Key", "Value"])
display(df.orderBy(col("Key")))

#Display a single configuration property
spark.conf.get("spark.sql.shuffle.partitions", defaultValue=None)
#defaultValue (optional): a fallback value returned if the configuration is not set. If it is not provided, the method throws an error when the configuration doesn't exist.

#Modify the default value
spark.conf.set("spark.sql.shuffle.partitions", "50")

%sql
--Using SQL
SET;

--To modify the default value
SET spark.sql.shuffle.partitions = 50