Running PySpark in Jupyter / IPython notebook

You can run PySpark code in a Jupyter notebook on CloudxLab. The following instructions cover both Spark 1.x and Spark 2.x versions of Apache Spark.

What is the Jupyter Notebook?

The IPython Notebook is now known as the Jupyter Notebook. It is an interactive computational environment, in which you can combine code execution, rich text, mathematics, plots and rich media. For more details on the Jupyter Notebook, please see the Jupyter website.

Please follow the steps below to access the Jupyter notebook on CloudxLab:

Step 1 – Log in to the web console

Step 2 – Run the following command on the web console
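A minimal sketch of such a launch command (the exact flags used on CloudxLab may differ):

    jupyter notebook --no-browser --ip 0.0.0.0 --port 8888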

The above command launches a Jupyter notebook on the next available port if port 8888 is being used by another user. We’ve opened ports 8888 to 8920, so your notebook will get one of the available ports even when many users are running notebooks at the same time.

If you are going to use Jupyter for a longer duration, the web console connection might time out and close. To avoid this, launch Jupyter in the background using the nohup Unix command:
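For example, reusing the assumed command from Step 2:

    nohup jupyter notebook --no-browser --ip 0.0.0.0 --port 8888 &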

This launches Jupyter in the background: even if you close the console, Jupyter keeps running. To kill Jupyter later, you can use the following command:
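A sketch of one way to do this (find the notebook's process ID, then kill it):

    ps -ef | grep jupyter    # note the PID of your own jupyter process
    kill <pid>               # replace <pid> with that process ID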

Once the notebook is successfully launched, it shows a URL containing a token. The output will include lines like the following:
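Illustrative output (timestamps trimmed; the token shown is a placeholder, yours will differ):

    The Jupyter Notebook is running at: http://0.0.0.0:8890/?token=<your-token>
    Copy/paste this URL into your browser when you connect for the first time,
    to login with a token:
        http://0.0.0.0:8890/?token=<your-token>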

Please note that here the notebook is launched on port 8890, since ports 8888 and 8889 are in use by other users. Copy the entire URL.

Replace 0.0.0.0 with the domain of your web console. Say your web console is at f.cloudxlab.com; then replace 0.0.0.0 with f.cloudxlab.com. The final URL becomes:
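    http://f.cloudxlab.com:8890/?token=<your-token>

(with <your-token> standing in for the actual token from the launch output)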

Paste this final URL into your browser to access the notebook. Please note that you cannot access the notebook without a valid token.

PS – Please do not share the above URL with anyone, as it gives full access to your notebook as well as your terminal (web console).

Step 3 – Set up environment variables
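The original setup is environment-specific. A sketch for a typical HDP-style Spark 1.x layout, run in a notebook cell (the paths and the py4j version are assumptions; check the actual files under $SPARK_HOME/python/lib):

    import os
    import sys

    # Point at the Spark 1.x client install (assumed HDP path)
    os.environ["SPARK_HOME"] = "/usr/hdp/current/spark-client"
    os.environ["PYLIB"] = os.environ["SPARK_HOME"] + "/python/lib"

    # Make the pyspark and py4j packages importable from the notebook
    sys.path.insert(0, os.environ["PYLIB"] + "/py4j-0.9-src.zip")
    sys.path.insert(0, os.environ["PYLIB"] + "/pyspark.zip")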

Step 4 – Load the PySpark module
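A minimal sketch to verify the module loads, assuming the Step 3 cell ran first in the same notebook:

    from pyspark import SparkConf, SparkContext

    # Create a SparkContext; if pyspark imported cleanly, this should succeed
    conf = SparkConf().setAppName("JupyterPySpark")
    sc = SparkContext(conf=conf)
    print(sc.version)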

You should get a result like the image below if the pyspark module is loaded properly.

[Image: PySpark Jupyter Notebook]

Running Spark 2.0.1 using Jupyter

Start Jupyter using the following commands:
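The original commands are environment-specific. A sketch, assuming Spark 2.0.1 is installed under /usr/spark2.0.1 (a hypothetical path; adjust it to your lab's layout):

    export SPARK_HOME=/usr/spark2.0.1    # assumed install path
    jupyter notebook --no-browser --ip 0.0.0.0 --port 8888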

And in the notebook, please use the following code:
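A sketch of such notebook code; the SPARK_HOME path and the py4j zip name are assumptions, so check the actual file under $SPARK_HOME/python/lib:

    import os
    import sys

    os.environ["SPARK_HOME"] = "/usr/spark2.0.1"   # assumed install path
    # Make the Spark 2.x Python packages importable
    sys.path.insert(0, os.environ["SPARK_HOME"] + "/python")
    sys.path.insert(0, os.environ["SPARK_HOME"] + "/python/lib/py4j-0.10.3-src.zip")  # check the exact py4j version

    from pyspark.sql import SparkSession

    # SparkSession is the Spark 2.x entry point
    spark = SparkSession.builder.appName("Spark2.0.1-Jupyter").getOrCreate()
    print(spark.version)   # should print 2.0.1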

The output should look like the following:

[Image: Running Python Spark 2.0.1 using Jupyter]

Running Spark 2.0.2 using Jupyter

We have created the command jupyter-spark2.0.2 to launch Spark 2.0.2 on Jupyter. Log in to the web console and type jupyter-spark2.0.2 to launch the notebook.

Alternatively, you can launch Jupyter using the following commands:
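A sketch mirroring the 2.0.1 launch above, with an assumed 2.0.2 install path:

    export SPARK_HOME=/usr/spark2.0.2    # assumed install path
    jupyter notebook --no-browser --ip 0.0.0.0 --port 8888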

And in the notebook, please use the following code:
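The notebook code follows the same pattern as for 2.0.1, pointed at the 2.0.2 installation (paths again assumptions):

    import os
    import sys

    os.environ["SPARK_HOME"] = "/usr/spark2.0.2"   # assumed install path
    sys.path.insert(0, os.environ["SPARK_HOME"] + "/python")
    sys.path.insert(0, os.environ["SPARK_HOME"] + "/python/lib/py4j-0.10.3-src.zip")  # check the exact file name

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("Spark2.0.2-Jupyter").getOrCreate()
    print(spark.version)   # should print 2.0.2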

The output should look like the following:

  • Sravan Ch

    Could you please let us know how to get the JDBC connection working?

    I tried adding the library to SPARK_CLASSPATH instead of adding it to spark.driver.extraLibraryPath or spark.executor.extraLibraryPath:
    os.environ['SPARK_CLASSPATH'] = r"/home/sravandata002869/mysql-connector-java-5.1.40/mysql-connector-java-5.1.40-bin.jar"

    Would really appreciate it if you can get me a working example with a JDBC connection.

    Thanks,
    Sravan

    • abhinav singh

      Hi Sravan,

      Just curious if the above code is working from the command line?

      • Sravan Ch

        I’m doing all this work in the Jupyter notebook

      • Sravan Ch

        Never mind, I got this working.

        • abhinav singh

          Hi Sravan,

          Great!

          Can you please post the solution here so that it may help other users 🙂

          Thanks

  • Sravan Ch

    To give you the complete code that I was running in the notebook:

    from pyspark import SparkContext, SparkConf
    from pyspark.sql import SQLContext, HiveContext

    sconf = SparkConf()
    sc = SparkContext(conf=sconf)
    sqlc = SQLContext(sc)

    username = "sqoopuser"
    pwd = "NHkkP876rp"
    hostname = "ip-172-31-13-154"
    dbname = "sqoopex"
    port = 3306
    table = "ani_country"

    df = sqlc.read.format("jdbc").options(
        url="jdbc:mysql://{0}:{1}/{2}".format(hostname, port, dbname),
        driver="com.mysql.jdbc.Driver",
        dbtable=table,
        user=username,
        password=pwd,
    ).load()

    It’s not working, as it cannot recognize the driver, and I was not able to add the jar properly.

  • Sravan Ch

    I’ve put the mysql jar in here:
    /home/sravandata002869
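
For readers who hit the same error: a common way to make a JDBC driver jar visible to PySpark is to pass it through PYSPARK_SUBMIT_ARGS before the SparkContext is created. A sketch, assuming the jar path mentioned in the thread (Sravan's actual fix was never posted, so this is an illustration, not his solution):

    import os

    # Ship the MySQL connector to the driver and executors.
    # PYSPARK_SUBMIT_ARGS must be set before the SparkContext is created.
    jar = "/home/sravandata002869/mysql-connector-java-5.1.40/mysql-connector-java-5.1.40-bin.jar"
    os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars {0} --driver-class-path {0} pyspark-shell".format(jar)

    from pyspark import SparkConf, SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(conf=SparkConf())
    sqlc = SQLContext(sc)

    df = sqlc.read.format("jdbc").options(
        url="jdbc:mysql://ip-172-31-13-154:3306/sqoopex",
        driver="com.mysql.jdbc.Driver",
        dbtable="ani_country",
        user="sqoopuser",
        password="NHkkP876rp",
    ).load()
    df.show()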