Memory recommendation for Spark execution
Spark can be configured to run in standalone mode or on top of Hadoop YARN or Mesos. In the Syncfusion Big Data Platform, Spark is configured to run on top of YARN. In a Hadoop cluster, YARN allocates the resources that applications need to run in the cluster. Spark applications run as independent sets of processes (executors) on a cluster, coordinated by the SparkContext object in your main program (called the driver program). With the default configuration, the Spark command line interface (Spark shell) runs with one driver and two executors.
To learn more about Spark execution, refer to the following link:
http://spark.apache.org/docs/latest/cluster-overview.html
The following table lists the properties used to configure Spark driver and executor memory:
Properties | Default / Configured Value | Description |
spark.executor.memory | 512 MB | Amount of memory to use per executor process. |
spark.executor.instances | 2 | The number of executors to be run. |
spark.driver.memory | 1024 MB | Amount of memory to use for the driver process, i.e. the process in which SparkContext is initialized. |
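As a rough illustration of how the values in the table map onto a launch command, the sketch below builds a `spark-shell` command line from them. The helper name `spark_shell_command` is hypothetical, and the master value `yarn` is an assumption (older Spark releases use `yarn-client` instead); the flags `--driver-memory`, `--executor-memory` and `--num-executors` are the standard Spark launcher options corresponding to the three properties.

```python
import shlex


def spark_shell_command(driver_memory_mb=1024, executor_memory_mb=512, num_executors=2):
    """Build a spark-shell command line with explicit memory settings.

    Hypothetical helper: the defaults mirror the values in the table above.
    """
    args = [
        "spark-shell",
        "--master", "yarn",                            # assumption: newer Spark naming
        "--driver-memory", f"{driver_memory_mb}m",      # spark.driver.memory
        "--executor-memory", f"{executor_memory_mb}m",  # spark.executor.memory
        "--num-executors", str(num_executors),          # spark.executor.instances
    ]
    # Quote each argument so the string can be pasted safely into a shell.
    return " ".join(shlex.quote(arg) for arg in args)


print(spark_shell_command())
```

The function only assembles the command string; it does not start Spark.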
Spark shell required memory = (Driver Memory + 384 MB) + (Number of executors * (Executor memory + 384 MB))
Here, 384 MB is the memory overhead that Spark may use, in addition to the configured driver or executor memory, when executing jobs.
Example:
Spark required memory = (1024 + 384) + (2*(512+384)) = 3200 MB
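The same calculation can be sketched in a few lines of Python. The function name `spark_required_memory_mb` and its parameter names are illustrative only; the 384 MB overhead is taken from the formula above.

```python
OVERHEAD_MB = 384  # per-process overhead used in the formula above


def spark_required_memory_mb(driver_mb, executor_mb, num_executors, overhead_mb=OVERHEAD_MB):
    """Return the memory (in MB) the Spark shell needs from YARN."""
    driver_total = driver_mb + overhead_mb
    executors_total = num_executors * (executor_mb + overhead_mb)
    return driver_total + executors_total


# Values from the table: 1024 MB driver, two executors of 512 MB each.
print(spark_required_memory_mb(1024, 512, 2))  # 3200
```

Changing any of the three properties in the table changes the result accordingly, so the function can be used to estimate the requirement before editing the configuration.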
To learn more about Spark configuration, refer to the following link:
http://spark.apache.org/docs/latest/running-on-yarn.html
Use the following expression to check whether enough memory is available in YARN for the Spark shell to function properly:
Enough Memory for Spark (Boolean) = (Memory Total – Memory Used) > Spark required memory
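The check above is a single comparison, sketched here in Python. The function name `has_enough_memory` is hypothetical, and the cluster figures in the example (8192 MB total, 2048 MB used) are made-up values for illustration, not measurements.

```python
def has_enough_memory(memory_total_mb, memory_used_mb, spark_required_mb):
    """Return True when the free YARN memory exceeds what Spark requires."""
    return (memory_total_mb - memory_used_mb) > spark_required_mb


# Illustrative cluster: 8192 MB total, 2048 MB already in use,
# and the 3200 MB requirement from the example above.
print(has_enough_memory(8192, 2048, 3200))  # True
```

Note that the comparison is strict, as in the expression above: free memory exactly equal to the requirement is reported as not enough.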
You can verify that the memory required by Spark is available in the YARN Resource Manager web interface.
Resource Manager URL: http://<name_node_host>:8088/cluster
Here, Memory Total is the total memory available to YARN, as configured on the cluster nodes using the property “yarn.nodemanager.resource.memory-mb”. You can get these details from the Resource Manager UI, as illustrated in the screenshot below.
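The same figures can also be read programmatically. The sketch below assumes the Resource Manager REST endpoint `/ws/v1/cluster/metrics` is reachable on the same host and port as the web interface, and that its `totalMB` and `allocatedMB` fields correspond to Memory Total and Memory Used in the UI. The function names and the sample payload are illustrative, not taken from a real cluster.

```python
import json
from urllib.request import urlopen


def parse_cluster_memory(metrics_json):
    """Extract (total MB, used MB) from a cluster metrics JSON document."""
    metrics = json.loads(metrics_json)["clusterMetrics"]
    return metrics["totalMB"], metrics["allocatedMB"]


def fetch_cluster_memory(resource_manager_host, port=8088):
    """Query the Resource Manager and return (total MB, used MB)."""
    url = f"http://{resource_manager_host}:{port}/ws/v1/cluster/metrics"
    with urlopen(url, timeout=10) as response:
        return parse_cluster_memory(response.read().decode("utf-8"))


# Made-up payload showing only the fields used here.
sample = '{"clusterMetrics": {"totalMB": 8192, "allocatedMB": 2048, "availableMB": 6144}}'
print(parse_cluster_memory(sample))  # (8192, 2048)
```

Only `parse_cluster_memory` runs in the example; `fetch_cluster_memory` needs a live Resource Manager and should be called with your own host name.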
Note:
In addition to confirming the required memory through YARN metrics, you must also check the available physical memory (RAM). Even when YARN reports enough available memory, other applications or processes outside Hadoop and Spark on the same machine may consume physical memory, in which case the Spark shell cannot run properly. An equivalent amount of free physical memory is therefore required as well.
To learn more about editing the configuration of Hadoop and its ecosystem, including Spark, using our Cluster Manager application, refer to the link below.
To fine-tune Spark for maximum performance based on the available machines and their hardware specifications, refer to the following link:
https://help.syncfusion.com/bigdata/cluster-manager/performance-improvements#spark