Set up a local Spark cluster on Mac and explore the UI for Master, Workers, and Jobs (pyspark)
What is covered
- In this blog, I describe my experiments with setting up a local (standalone) Spark cluster on a Mac M1 machine.
- Start pyspark with an option to use the master set up in the step above.
- Explore the UI for:
- Master
- Worker
- Jobs (transformation and action operations, Directed Acyclic Graph)
Why a local (standalone) server
- A local (standalone) server is a simple deployment mode that makes it possible to run all the daemons on a single node.
- Jobs submitted as Python programs can be explored using the same UI (to be covered in a future blog).
Start the master and worker
- After installing pyspark (using Homebrew), the scripts for starting the master and worker are available at the following location:
cd /opt/homebrew/Cellar/apache-spark/3.2.1/libexec/sbin
- start the master (note that the log file path will be specific to your machine)
% ./start-master.sh
starting org.apache.spark.deploy.master.Master, logging to /opt/homebrew/Cellar/apache-spark/3.2.1/libexec/logs/<your-local-logfile>.out
- check the log file for the Spark master server details (<your-local-logfile> is obtained from the previous step)
% tail -10 /opt/homebrew/Cellar/apache-spark/3.2.1/libexec/logs/<your-local-logfile>.out
...
22/05/18 07:41:57 INFO Utils: Successfully started service 'sparkMaster' on port 7077.
22/05/18 07:41:57 INFO Master: Starting Spark master at spark://<your-spark-master>:7077
22/05/18 07:41:57 INFO Master: Running Spark version 3.2.1
22/05/18 07:41:57 INFO Utils: Successfully started service 'MasterUI' on port 8080.
...
- check the master UI (the 'MasterUI' service runs on port 8080, as shown in the log above)
- notice that the number of workers is zero, since we have not yet started any
- start the worker, selecting the number of cores and memory based on your system configuration
- check the log file from the previous step and replace <your-spark-master> with the appropriate value
% ./start-worker.sh --cores 2 --memory 2G spark://<your-spark-master>:7077
starting org.apache.spark.deploy.worker.Worker, logging to /opt/homebrew/Cellar/apache-spark/3.2.1/libexec/logs/<your-local-worker-logfile>.out
- check the master UI for worker information (it now shows 1 worker)
- also, check the worker UI for the cores and memory allocated
Submit jobs using pyspark and observe the Jobs UI
- we will start pyspark so that it uses the master set up above
- operations are performed on RDDs (Resilient Distributed Dataset)
- spark has two kinds of operations:
- Transformation → operations such as map, filter, join, or union that are performed on an RDD and yield a new RDD containing the result
- Action → operations such as reduce, first, count that return a value after running a computation on an RDD
Actions are displayed as jobs in the UI, and transformations can be observed while exploring the DAG (Directed Acyclic Graph) output, as illustrated by the sketch below.
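As an illustration, here is a hypothetical pyspark shell sketch (the names words, lengths, and long_words are made up for this example; sc is the SparkContext provided by the shell). The transformations only build up lineage lazily, and each action triggers a job that appears in the Jobs UI:
>>> words = sc.parallelize(["hello", "world", "pyspark", "local", "mac", "m1"])
>>> lengths = words.map(lambda w: len(w))             # transformation: no job yet
>>> long_words = words.filter(lambda w: len(w) > 4)   # transformation: no job yet
>>> long_words.count()                                # action: triggers a job
4
>>> lengths.collect()                                 # action: triggers another job
[5, 5, 7, 5, 3, 2]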
for pyspark to submit jobs to the master, start it as follows (replace <your-spark-master> with the appropriate value):
% pyspark --master spark://<your-spark-master>:7077
Now observe in the master UI that the application is listed.
we will write a simple snippet in the pyspark shell that performs one transformation (parallelize) and one action (count)
...
Using Python version 3.9.7 (default, Sep 16 2021 08:50:36)
Spark context Web UI available at http://venkatas-mini.lan:4040
Spark context available as 'sc' (master = spark://Venkatas-Mini.lan:7077, app id = app-20220518222803-0001).
SparkSession available as 'spark'.
>>> a=("hello","world","pyspark","local","mac","m1")
>>> b = sc.parallelize(a)
>>> b.count()
6
>>>
From the master UI, click on the application to check the application UI:
- In the application UI, click on "Application Detail UI" to get details on the jobs
- to see the DAG (Directed Acyclic Graph) visualization, click on the job (under the Description column)
- this is a simple DAG with only one transformation and one action; a slightly richer example is sketched below
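To see a DAG with more than one stage, a sketch like the following could be tried in the same shell (the names pairs and counts are made up for this example). The reduceByKey transformation introduces a shuffle, so the resulting job should be drawn as two stages in the DAG visualization: one covering parallelize and map, and a second one for reduceByKey:
>>> pairs = sc.parallelize(["hello", "world", "hello", "mac"]).map(lambda w: (w, 1))
>>> counts = pairs.reduceByKey(lambda x, y: x + y)    # transformation with a shuffle boundary
>>> counts.collect()                                  # action: returns the word counts (ordering may vary)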
Next steps
- submit Spark jobs as Python programs and check the UI for master, worker, and jobs
- try out additional Spark transformations and actions, and check additional details, such as storage, in the UI
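As a preview of the first item, here is a minimal sketch of what such a standalone program might look like (the file name wordcount_demo.py and the app name are hypothetical; replace <your-spark-master> with the value from the master log, as before):
# wordcount_demo.py: a minimal standalone PySpark program
from pyspark.sql import SparkSession

# Connect to the standalone master started earlier in this blog
spark = (SparkSession.builder
         .appName("standalone-cluster-demo")
         .master("spark://<your-spark-master>:7077")
         .getOrCreate())

sc = spark.sparkContext
words = sc.parallelize(["hello", "world", "pyspark", "local", "mac", "m1"])
print(words.count())   # action: appears as a job under this application in the UI

spark.stop()
It could then be launched with spark-submit --master spark://<your-spark-master>:7077 wordcount_demo.py, and the application should show up in the master UI just like the pyspark shell session did.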