As part of this question, you need to write a Python function that takes a Spark RDD and computes the average of the numbers in the RDD.
Please make sure you keep your program as distributed as possible, that is, avoid collecting all of the data to the driver.
First, initialize the Spark context (sc) by following the steps below in the Jupyter notebook on the right-hand side.
Step 1 - Add the Spark and Python locations to the path
import os
import sys
os.environ["SPARK_HOME"] = "/usr/hdp/current/spark2-client"
os.environ["PYLIB"] = os.environ["SPARK_HOME"] + "/python/lib"
# In the two lines below, use /usr/bin/python2.7 if you want to use Python 2
os.environ["PYSPARK_PYTHON"] = "/usr/local/anaconda/bin/python"
os.environ["PYSPARK_DRIVER_PYTHON"] = "/usr/local/anaconda/bin/python"
sys.path.insert(0, os.environ["PYLIB"] +"/py4j-0.10.4-src.zip")
sys.path.insert(0, os.environ["PYLIB"] +"/pyspark.zip")
Step 2 - Initialize Spark
from pyspark import SparkContext, SparkConf
# Initialize the config object; you will have to correct this line for your setup
conf = SparkConf().setAppName("appName").setMaster("yarn")
# Create Spark Context
sc = SparkContext(conf=conf)
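Once the context is created, a quick sanity check confirms that jobs actually run. This is a minimal sketch; the small RDD below is only an illustration.
# Should print 5 if the Spark context is working
print(sc.parallelize([1, 2, 3, 4, 5]).count())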
Question -
Now write a Python function named myaverage
that takes an RDD and returns the average as a decimal number.
def myaverage(rdd):
    #
    # Your code will be here
    #
    return somenumber
Hint - You can test your function with the code below.
no_rdd = sc.parallelize([1, 2, 3])
myaverage(no_rdd)
The above code should return 2.0
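For reference, here is one possible implementation, a sketch rather than the official solution. It computes the sum and the count in a single distributed pass using aggregate, so nothing is collected to the driver except the final (sum, count) pair.
def myaverage(rdd):
    # aggregate folds each partition into a (sum, count) pair,
    # then merges the partial pairs across partitions.
    total, count = rdd.aggregate(
        (0, 0),
        lambda acc, x: (acc[0] + x, acc[1] + 1),   # within a partition
        lambda a, b: (a[0] + b[0], a[1] + b[1])    # across partitions
    )
    # float() keeps the division decimal under Python 2 as well
    return total / float(count)
With no_rdd = sc.parallelize([1, 2, 3]), myaverage(no_rdd) returns 2.0 as expected.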