Apache Spark Basics with Python


Apache Spark with Python - Preparing the environment

In this project, we will learn the basics of Apache Spark using Python. However, before you can run the code in this course, there are a couple of things you need to do:

  1. Append the Spark and Python locations to the path
  2. Initialize the SparkContext

Note 1: To run the code in this course, it is mandatory to first run the code given in this slide.

Note 2: It is mandatory to run the code in the default Jupyter notebook on the right side of this split screen so that the assessment engine can detect and assess your code.

Note 3: Some steps in this course depend on the output or successful execution of previous steps, so it is mandatory to go through the course sequentially.

Note 4: Do not open more than one Jupyter notebook while completing this course, as that would result in an error. If you do open more than one Jupyter notebook by mistake, close the other tabs, shut down the kernel of this Jupyter notebook, and then restart it to resolve the error.

Here is a link to a post in our discussion forum that walks through various debugging steps you can follow to resolve most of the issues you might come across in this course. If these do not resolve your issue, please reach out to us by leaving a comment on the slide where you are facing the issue (preferably with a screenshot), and we would be more than happy to help.

Happy learning!

INSTRUCTIONS
  • To append the Spark and Python locations to the path, copy and paste the code given below as-is on the right side of this split screen and run it

    import os
    import sys
    # Point Spark and Py4J to the cluster's Spark 2 installation
    os.environ["SPARK_HOME"] = "/usr/hdp/current/spark2-client"
    os.environ["PYLIB"] = os.environ["SPARK_HOME"] + "/python/lib"
    # Use the Anaconda Python interpreter for both the executors and the driver
    os.environ["PYSPARK_PYTHON"] = "/usr/local/anaconda/bin/python"
    os.environ["PYSPARK_DRIVER_PYTHON"] = "/usr/local/anaconda/bin/python"
    # Make the bundled Py4J and PySpark libraries importable
    sys.path.insert(0, os.environ["PYLIB"] + "/py4j-0.10.4-src.zip")
    sys.path.insert(0, os.environ["PYLIB"] + "/pyspark.zip")
    
  • Now, to initialize Spark, first import SparkContext and SparkConf from pyspark. Then initialize the config object using SparkConf, and finally initialize the SparkContext and assign it to the variable sc (an optional sanity check is sketched after these instructions)

    from pyspark import SparkContext, SparkConf
    # Create a Spark configuration with an application name of your choice
    conf = SparkConf().setAppName("appName")
    # Create the SparkContext and assign it to the variable sc
    <<your code goes here>> = SparkContext(conf=conf)
    
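If you would like to confirm that everything is set up correctly, the short sketch below is one way to do it. It is not part of the graded code, and it assumes the two cells above ran successfully and that you assigned the SparkContext to the variable sc as instructed.

    # Optional sanity check (assumes the cells above ran and sc is the SparkContext)
    import os

    print(os.environ["SPARK_HOME"])            # should print /usr/hdp/current/spark2-client
    print(sc.version)                          # should print the Spark 2.x version
    print(sc.parallelize(range(100)).count())  # runs a trivial Spark job; should print 100

If the last line prints 100 without errors, Spark is initialized and able to run jobs, and you can move on to the next slide.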



