Apache Spark with Python - PySpark Assessment


Compute the average of an RDD of numbers with Spark using Python

In this question, you will write a Python function that takes a Spark RDD of numbers and returns their average.

Keep the computation as distributed as possible: avoid collecting all of the elements to the driver just to average them there.

INSTRUCTIONS

Initialize the Spark context (sc) by following the steps below in the Jupyter notebook on the right-hand side.

Step 1 - Add the Spark and Python location to the path

import os
import sys
os.environ["SPARK_HOME"] = "/usr/hdp/current/spark2-client"
os.environ["PYLIB"] = os.environ["SPARK_HOME"] + "/python/lib"
# In the two lines below, use /usr/bin/python2.7 if you want to use Python 2
os.environ["PYSPARK_PYTHON"] = "/usr/local/anaconda/bin/python" 
os.environ["PYSPARK_DRIVER_PYTHON"] = "/usr/local/anaconda/bin/python"
sys.path.insert(0, os.environ["PYLIB"] +"/py4j-0.10.4-src.zip")
sys.path.insert(0, os.environ["PYLIB"] +"/pyspark.zip")
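If the import in the next step fails, the py4j version in the filename above probably does not match the one shipped with your Spark install. A quick check (assuming the same paths as above) is:

import glob
# Lists the py4j archive actually shipped with this Spark install;
# fix the sys.path line above if the version differs from 0.10.4.
print(glob.glob(os.environ["PYLIB"] + "/py4j-*-src.zip"))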

Step 2 - Initialize the Spark context

from pyspark import SparkContext, SparkConf
# Initialize the config object; you may need to correct this line for your setup
conf = SparkConf().setAppName("appName").setMaster("yarn")
# Create Spark Context
sc = SparkContext(conf=conf)
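
Before moving on, it is worth confirming that the context actually works. A minimal smoke test (any small job will do) is:

# Runs a trivial distributed job; should print 10 if sc is healthy.
print(sc.parallelize(range(10)).count())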

Question -

Now write a Python function named myaverage that takes an RDD and returns the average as a decimal (floating-point) number.

def myaverage(rdd):
    #
    # Your code goes here
    #
    return somenumber

Hint - You can test your function with the code below.

no_rdd = sc.parallelize([1, 2, 3])
myaverage(no_rdd)

The above code should return 2.0
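
For reference, below is one possible way to write myaverage that stays fully distributed. Treat it as a sketch, not the only accepted answer: it folds a (sum, count) pair across partitions with aggregate, so only that single pair, never the elements themselves, reaches the driver.

def myaverage(rdd):
    # Keep a running (sum, count) pair on the executors;
    # assumes a non-empty RDD (count would be 0 otherwise).
    total, count = rdd.aggregate(
        (0, 0),
        lambda acc, x: (acc[0] + x, acc[1] + 1),  # fold one element in
        lambda a, b: (a[0] + b[0], a[1] + b[1]),  # merge partition results
    )
    return float(total) / count

no_rdd = sc.parallelize([1, 2, 3])
print(myaverage(no_rdd))  # prints 2.0

PySpark RDDs also ship a built-in mean() that computes the same result; the aggregate version is shown because it makes the distributed mechanics explicit.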



