Apache Spark Basics with Python


Apache Spark with Python - Preparing the environment

In this project, we will learn the basics of Apache Spark using Python. However, before you can run the code in this course, there are a couple of things you need to do:

  1. Append the Spark and Python locations to the path
  2. Initialize the SparkContext

Note 1: To run the code in this course, it is mandatory to first run the code given in this slide.

Note 2: It is mandatory to run the code in the default Jupyter notebook on the right side of this split screen so that the assessment engine can detect and assess your code.

Note 3: Some steps in this course depend on the output or successful execution of previous steps, so it is mandatory to go through the course sequentially.

Note 4: Do not open more than one Jupyter notebook while completing this course, as that would result in an error. If you do open more than one Jupyter notebook by mistake, close the other tabs, shut down the kernel of this Jupyter notebook, and then restart it to resolve the error.

Here is a link to a post in our discussion forum that walks through various debugging steps you can follow to resolve most of the issues you might come across in this course. If these do not resolve your issue, please reach out to us by leaving a comment on the slide where you are facing the issue (preferably with a screenshot), and we would be more than happy to help.

Happy learning!

INSTRUCTIONS
  • To append the Spark and Python locations to the path, copy and paste the code given below as-is on the right side of this split screen and run it

    import os
    import sys
    # Point Spark and Py4J to the cluster's Spark 2 installation
    os.environ["SPARK_HOME"] = "/usr/hdp/current/spark2-client"
    os.environ["PYLIB"] = os.environ["SPARK_HOME"] + "/python/lib"
    # Use the Anaconda Python interpreter for both the executors and the driver
    os.environ["PYSPARK_PYTHON"] = "/usr/local/anaconda/bin/python"
    os.environ["PYSPARK_DRIVER_PYTHON"] = "/usr/local/anaconda/bin/python"
    # Make the bundled Py4J and PySpark libraries importable
    sys.path.insert(0, os.environ["PYLIB"] + "/py4j-0.10.4-src.zip")
    sys.path.insert(0, os.environ["PYLIB"] + "/pyspark.zip")
    
  • Now, to initialize Spark, first import SparkContext and SparkConf from pyspark. Then initialize the config object using SparkConf, and finally initialize the SparkContext and assign it to the variable sc (an optional sanity check is sketched after these instructions)

    from pyspark import SparkContext, SparkConf
    # Create a Spark configuration with an application name of your choice
    conf = SparkConf().setAppName("appName")
    # Create the SparkContext and assign it to the variable sc
    <<your code goes here>> = SparkContext(conf=conf)
    
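If you would like to confirm that everything is set up correctly, the short sketch below is one way to do it. It is not part of the graded code, and it assumes the two cells above ran successfully and that you assigned the SparkContext to the variable sc as instructed.

    # Optional sanity check (assumes the cells above ran and sc is the SparkContext)
    import os

    print(os.environ["SPARK_HOME"])            # should print /usr/hdp/current/spark2-client
    print(sc.version)                          # should print the Spark 2.x version
    print(sc.parallelize(range(100)).count())  # runs a trivial Spark job; should print 100

If the last line prints 100 without errors, Spark is initialized and able to run jobs, and you can move on to the next slide.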



