Running PySpark in Jupyter / IPython notebook

You can run PySpark code in a Jupyter notebook on CloudxLab. The following instructions cover Apache Spark versions 1.6 and 2.3.

What is Jupyter notebook?

The IPython Notebook is now known as the Jupyter Notebook. It is an interactive computational environment, in which you can combine code execution, rich text, mathematics, plots and rich media. For more details on the Jupyter Notebook, please see the Jupyter website.

Follow the steps below to access the Jupyter notebook on CloudxLab.

Step 1 – Click the “Jupyter” button under My Lab

Step 2.1 – To access Spark 2.3, use the code below

Step 2.2 – To access Spark 1.6, use the code below
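As a rough sketch of what a notebook cell for this step typically has to do, the helper below points the Python process at one Spark installation before pyspark is imported. It assumes the common layout in which the Python bindings ship as zip files under python/lib inside the Spark directory. The function name configure_spark and both example paths are hypothetical, not CloudxLab's actual locations; substitute the directories where Spark 2.3 or Spark 1.6 is installed on your machine.

```python
import glob
import os
import sys


def configure_spark(spark_home):
    """Point this Python process at the Spark install under spark_home.

    Assumes the Python bindings are shipped as zip files under
    <spark_home>/python/lib (typically pyspark.zip and py4j-*-src.zip).
    """
    lib_dir = os.path.join(spark_home, "python", "lib")
    zips = sorted(glob.glob(os.path.join(lib_dir, "*.zip")))
    if not zips:
        raise FileNotFoundError("no Python bindings found under " + lib_dir)

    os.environ["SPARK_HOME"] = spark_home
    for path in zips:
        if path not in sys.path:
            sys.path.insert(0, path)  # make "import pyspark" resolve to this install
    return zips


# Hypothetical locations; adjust them to wherever each Spark version lives.
SPARK_HOMES = {
    "2.3": "/path/to/spark-2.3",
    "1.6": "/path/to/spark-1.6",
}

# Usage in a notebook cell, before importing pyspark:
#   configure_spark(SPARK_HOMES["2.3"])
```

Switching between the two versions then only means passing a different directory, which is why the sketch takes the Spark home as a parameter rather than hard-coding it.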


The steps below are no longer valid. They only worked in the old setup.

Step 1 – Log in to the web console

Step 2 – Run the commands below on the web console

The above command launches a Jupyter notebook on the next available port if port 8888 is already in use by another user. We have opened ports 8888 to 8920, so your notebook will get one of the available ports even when many users are running notebooks at the same time.
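To illustrate the idea of "next available port", here is a minimal sketch that probes the 8888 to 8920 range and returns the first port that can be bound. This is not Jupyter's own implementation; the function name first_free_port is hypothetical and the sketch only shows the general technique.

```python
import socket


def first_free_port(start=8888, end=8920, host="0.0.0.0"):
    """Return the first port in [start, end] that can be bound, else None."""
    for port in range(start, end + 1):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
            try:
                probe.bind((host, port))
            except OSError:
                continue  # port is taken (or not permitted); try the next one
            return port
    return None


print(first_free_port())
```

A port reported as free here can still be claimed by someone else a moment later, so a real server has to be ready to retry when its own bind fails.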

If you are going to use Jupyter for a long time, the connection might close when the web console times out. To avoid this, run Jupyter in the background with the Unix nohup command:

This launches Jupyter in the background. Even if you close the console, Jupyter keeps running. To kill Jupyter later, you can use the following command:
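The same pattern (start a long-running process detached from the console, log its output to a file, stop it later) can be sketched in Python. This is only a rough analogue of nohup, not the command used on CloudxLab. The helper names are hypothetical, and a short sleep process stands in for the Jupyter launch command so that the sketch runs anywhere.

```python
import os
import subprocess
import sys
import tempfile


def start_in_background(cmd, log_path):
    """Start cmd detached from this session, appending its output to log_path."""
    with open(log_path, "ab") as log:
        return subprocess.Popen(
            cmd,
            stdin=subprocess.DEVNULL,
            stdout=log,
            stderr=subprocess.STDOUT,
            start_new_session=True,  # POSIX: detach from the console's session
        )


def stop(proc, timeout=10):
    """Ask the process to exit, and force-kill it if it does not."""
    proc.terminate()
    try:
        proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()
        proc.wait()
    return proc.returncode


# Stand-in command; on the web console this would be the Jupyter launch command.
demo_log = os.path.join(tempfile.gettempdir(), "background-demo.out")
demo = start_in_background([sys.executable, "-c", "import time; time.sleep(60)"], demo_log)
print("started PID", demo.pid)
stop(demo)
print("stopped")
```

If the script that started the process has already exited, the printed PID is what you would need to stop the process later.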

Once the notebook has launched successfully, it prints a URL containing a token. The output will look like this:

Note that in this example the notebook is launched on port 8890, because ports 8888 and 8889 are in use by other users. Copy the entire URL.

Replace 0.0.0.0 with the domain of your web console. For example, if your web console is at f.cloudxlab.com, replace 0.0.0.0 with f.cloudxlab.com. The final URL then becomes

Paste this final URL into the browser to access the notebook. Note that you cannot access the notebook without a valid token.
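The copy-and-replace procedure above can also be sketched in code: find the token URL in the console output, then swap 0.0.0.0 for the web console's domain while keeping the port and token. The helper names are hypothetical, the token xxxx is a placeholder, and the sample text assumes Jupyter's usual "Copy/paste this URL" message.

```python
import re
from urllib.parse import urlsplit, urlunsplit

TOKEN_URL = re.compile(r"https?://\S+\?token=\w+")


def find_token_url(log_text):
    """Return the first tokenized notebook URL found in the log text, or None."""
    match = TOKEN_URL.search(log_text)
    return match.group(0) if match else None


def with_console_domain(url, domain):
    """Swap the host part of url for domain, keeping port, path and token."""
    parts = urlsplit(url)
    netloc = domain if parts.port is None else "{}:{}".format(domain, parts.port)
    return urlunsplit(parts._replace(netloc=netloc))


sample_log = """
    Copy/paste this URL into your browser when you connect for the first time,
    to login with a token:
        http://0.0.0.0:8890/?token=xxxx
"""

local_url = find_token_url(sample_log)
print(with_console_domain(local_url, "f.cloudxlab.com"))
# prints http://f.cloudxlab.com:8890/?token=xxxx
```

Parsing the URL instead of doing a plain string replace keeps the port and token untouched even if the host appears elsewhere in the text.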

PS – Please do not share the above URL with anyone, as it gives full access to your notebook as well as your terminal (web console).

Step 3 – Open a new notebook

Select either of the two Python environments when you click “New”:

  • Python [default]
  • Python [conda root]
Selecting Python Environment

Step 4 – Set up environment variables

Step 5 – Load PySpark module

If the pyspark module loads properly, you should see a result like the image below.
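A quick way to check this from a notebook cell is sketched below. It first tests whether the pyspark package can be found at all, and only then offers a tiny job as a smoke test. The function names are hypothetical, and smoke_test assumes a working Spark and Java installation, so it is defined here but not called automatically.

```python
import importlib.util


def pyspark_importable():
    """True if the pyspark package can be found on the current sys.path."""
    return importlib.util.find_spec("pyspark") is not None


def smoke_test():
    """Run a tiny job; needs a working Spark and Java installation."""
    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()
    try:
        return sc.parallelize(range(10)).sum()  # 0 + 1 + ... + 9
    finally:
        sc.stop()


if pyspark_importable():
    print("pyspark found; call smoke_test() to run a small job")
else:
    print("pyspark is not on sys.path; revisit the environment variables in Step 4")
```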

PySpark Jupyter Notebook

Running Spark 2.0.1 using Jupyter

Start Jupyter using the following commands:

And in the notebook, please use the following code:

The output should look like the following:

Running Python Spark 2.0.1 using jupyter

Running Spark 2.0.2 using Jupyter

We have created the command jupyter-spark2.0.2 to launch Spark 2.0.2 on Jupyter. Log in to the web console and type jupyter-spark2.0.2 to launch the notebook.

Alternatively, you can launch Jupyter using the commands below:

And in the notebook, please use the following code:

The output should look like the following:

  • Sravan Ch

    Could you please let us know how to get the JDBC connection working?

    I tried adding the library to SPARK_CLASSPATH instead of adding it to spark.driver.extraLibraryPath or spark.executor.extraLibraryPath:
    os.environ['SPARK_CLASSPATH'] = r"/home/sravandata002869/mysql-connector-java-5.1.40/mysql-connector-java-5.1.40-bin.jar"

    I would really appreciate it if you could give me a working example with a JDBC connection.

    Thanks,
    Sravan

    • abhinav singh

      Hi Sravan,

      Just curious if above code is working from command line?

      • Sravan Ch

        I’m doing all this work on jupyter notebook

      • Sravan Ch

        Never Mind, I got this working..

        • abhinav singh

          Hi Sravan,

          Great!

          Can you please post the solution here so that it may help other users 🙂

          Thanks

          • Manish Verma

            I tried entering my password to open the Jupyter notebook, but it is not working. I also tried with the token, and that is not working either. The messages are “invalid password” and “invalid token”.

          • abhinav singh

            Hi Manish,

            That is strange. Could you please share the screenshot?

  • Sravan Ch

    To give you the complete code that I was running in the notebook

    from pyspark import SparkContext, SparkConf
    from pyspark.sql import SQLContext, HiveContext

    sconf = SparkConf()
    sc = SparkContext(conf=sconf)
    sqlc = SQLContext(sc)

    username = "sqoopuser"
    pwd = "NHkkP876rp"
    hostname = "ip-172-31-13-154"
    dbname = "sqoopex"
    port = 3306
    table = "ani_country"

    df = sqlc.read.format("jdbc").options(
        url="jdbc:mysql://{0}:{1}/{2}".format(hostname, port, dbname),
        driver="com.mysql.jdbc.Driver",
        dbtable=table,
        user=username,
        password=pwd,
    ).load()

    It’s not working as it cannot recognize the driver and I was not able to add the jar properly

  • Sravan Ch

    I’ve put mysql jar in here
    /home/sravandata002869

  • Arun Kumar VR

    Is there a way to access Spark using Scala like this?

    • Abhinav Singh

      Hi Arun,

      I think you can use the same steps for Scala as well. Just make sure that you specify the Scala path instead of the Python path.

      Could you give it a try once?

      Thanks

      Regards,
      Abhinav

      • Arun Kumar VR

        Hi Abhinav,

        Works like a charm.
        Thank you

        Rgds,
        Arun

        • Abhinav Singh

          Great 🙂

          Regards,
          Abhinav

  • Nithin Ts

    The notebook is not launching in my browser, even though the console shows:
    Copy/paste this URL into your browser when you connect for the first time,
    to login with a token:
    http://0.0.0.0:8890/?token=xxxxxxxxxxx
    I tried http://f.cloudxlab.com:8890/?token=xxxx and it is not working. Can someone please help me?

    • Abhinav Singh

      Hi @nithin_ts:disqus,

      Can you please let me know what error you are getting? I hope you are replacing xxxx with the token generated by the command.

      Hope this helps.

      Thanks

      • Nithin Ts

        I am replacing token with the generated token 🙂
        This site can’t be reached
        The webpage at http://0.0.0.0:8920/?token=4XXXXXXXXXXXXXXXXXX1 might be temporarily down or it may have moved permanently to a new web address.
        ERR_ADDRESS_INVALID

        This site can’t be reached
        f.cloudxlab.com refused to connect.
        Search Google for cloudxlab 8920
        ERR_CONNECTION_REFUSED

        I tried 3-4 open ports; nothing worked.

  • sudhindra r

    Hi, I am facing the same issue as Nithin. I am unable to launch the browser. I copied and pasted the URL and changed it to http://f.cloudxlab.com:8890/?token=xxxxxxx. The error message on the browser is “Server not found”.

    • Shahrukh

      Hi Sudhir,
      Please make sure you are replacing the token value correctly. Here is a checklist –

      1. Make sure you have specified a correct port number in the command
      2. The URL where your notebook is running is shown in the console once you hit enter
      3. In case you cannot see your URL, you can view the contents of the file nohup.out using the command cat nohup.out
      4. Make sure to replace the 0.0.0.0 with the domain name of your web console
      5. Make sure your URL has http:// and not https:// at the beginning.
      6. Also, make sure to copy the entire URL and paste it into a new browser tab

      • sudhindra r

        Hi Shahrukh,
        I figured that this would run only on f.cloudxlab.com due to the latest version of Python.

        • Abhinav Singh

          Hi @sudhindrar:disqus,

          The above steps will work in all the consoles.

      • Nithin Ts

        I have tried this and it is not working for me. I have been struggling with this for the past 15 days and it is still not resolved. A frustrating experience.

  • Femi A

    Hi Singh,
    I just want to be sure I am doing the right thing. I want to set up Jupyter for PySpark. Please check whether the instructions below are correct.

    Running Jupyter for pyspark
    If you are going to use Jupyter for a longer duration, the connection might close causing the web console to timeout.
    rm nohup.out
    nohup jupyter notebook --no-browser --ip xx.cloudxlab.com --port 8890 & tail -f nohup.out &