When you are tuning Spark you can do what I like to call ‘black box tuning’. You treat the Spark job as a black box: instead of looking at the code, you play with the Spark configuration and see what performance you can pull out of it. This can be done before tuning the code, to check whether you can hit your service level agreement without code changes. Often, though, the real improvements come from actually changing the code. If you do decide to make code changes, do those first and do the black box tuning afterwards.
The simplest thing you can do is to just play with the Spark parameters. It is good to think about how these settings are passed to your Spark application.
The lowest priority settings are the ones from the Spark environment you are running in (the spark-defaults file):
$SPARK_HOME/conf/spark-defaults.conf
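This file is just whitespace-separated key/value pairs, one setting per line. The entries below are only an illustrative sketch, not recommended values:
spark.sql.shuffle.partitions   200
spark.executor.memory          2g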
The next priority level is the spark-submit parameters:
--conf spark.sql.shuffle.partitions=300
The highest priority settings are the ones that are set in the code:
spark.conf.set("spark.sql.shuffle.partitions", "300")
Here we want to change the settings a lot (to test different configurations), so the right choice is to use the spark-submit parameters. This lets us use a simple Linux script to test multiple settings, as sketched below. (You should therefore make sure no conflicting settings are made in the code, since those would override the spark-submit values.)
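As a minimal sketch of such a script, assuming a hypothetical application jar my-spark-job.jar with main class com.example.MySparkJob (swap in your own spark-submit command):
#!/bin/bash
# Try several shuffle partition counts and time each run.
# my-spark-job.jar and com.example.MySparkJob are placeholders for your own job.
for partitions in 100 200 300 400; do
  echo "Running with spark.sql.shuffle.partitions=$partitions"
  time spark-submit \
    --class com.example.MySparkJob \
    --conf spark.sql.shuffle.partitions=$partitions \
    my-spark-job.jar
done
Each run prints its wall-clock time, so you can compare the settings side by side.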
This article continues in the next part with the testing harness.