Introduction to NumPy and Pandas – A Simple Tutorial

Python is increasingly being used as a scientific language. Matrix and vector manipulations are extremely important for scientific computations. Both NumPy and Pandas have emerged as essential libraries for scientific computation in Python due to their intuitive syntax and high-performance matrix computation capabilities.

In this post, we will provide an overview of the common functionalities of NumPy and Pandas. Along the way, we will see how similar these libraries are to existing toolboxes in R and MATLAB. This similarity and the added flexibility have led to wide acceptance of Python in the scientific community. The topics covered in this post are:

  1. Overview of NumPy
  2. Overview of Pandas
  3. Using Matplotlib

This post is an excerpt from a live hands-on training conducted by CloudxLab on 25th Nov 2017. It was attended by more than 100 learners around the globe. The participants were from the following countries: United States, Canada, Australia, Indonesia, India, Thailand, Philippines, Malaysia, Macao, Japan, Hong Kong, Singapore, United Kingdom, Saudi Arabia, Nepal, and New Zealand.

What is NumPy?

NumPy stands for ‘Numerical Python’ or ‘Numeric Python’. It is an open source module of Python which provides fast mathematical computation on arrays and matrices. Since arrays and matrices are an essential part of the Machine Learning ecosystem, NumPy, along with modules like Scikit-learn, Pandas, Matplotlib and TensorFlow, completes the Python Machine Learning ecosystem.

NumPy provides the essential multi-dimensional array-oriented computing functionalities designed for high-level mathematical functions and scientific computation. NumPy can be imported into the notebook using import numpy as np, which makes it available under the conventional alias np.

NumPy’s main object is the homogeneous multidimensional array. It is a table of elements that all have the same type (homogeneous), such as integers, floats or strings; in practice the elements are usually numbers. In NumPy, dimensions are called axes. The number of axes is called the rank.

There are several ways to create an array in NumPy, like np.array, np.zeros, np.ones, etc. Each of them provides some flexibility.

Commands to create an array:

  • np.array
  • np.ones
  • np.full
  • np.arange
  • np.linspace
  • np.random.rand(2,3)
  • np.empty((2,3))
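As a rough sketch of how these creation commands might be used (the shapes and values below are illustrative choices, not taken from the original session):

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])   # build an array from a nested Python list
zeros = np.zeros((2, 3))               # 2x3 array filled with 0.0
ones = np.ones((2, 3))                 # 2x3 array filled with 1.0
sevens = np.full((2, 3), 7)            # 2x3 array filled with the value 7
steps = np.arange(0, 10, 2)            # start at 0, stop before 10, step by 2
grid = np.linspace(0, 1, 5)            # 5 evenly spaced values from 0 to 1 inclusive
noise = np.random.rand(2, 3)           # 2x3 array of uniform random values in [0, 1)
scratch = np.empty((2, 3))             # 2x3 array left uninitialized (arbitrary contents)

print(steps)        # [0 2 4 6 8]
print(grid.shape)   # (5,)
```

Note that np.empty only allocates memory, so its contents should be overwritten before they are read.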
 

Some of the important attributes of a NumPy array object are:

  1. ndim: the number of dimensions (axes) of the array
  2. shape: a tuple of integers indicating the size of the array along each dimension
  3. size: the total number of elements in the NumPy array
  4. dtype: the type of the elements in the array, e.g., int64 or float64
  5. itemsize: the size in bytes of each item
  6. reshape: a method (rather than an attribute) that returns the array's data arranged in a new shape
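The attributes listed above can be inspected as in the following minimal sketch; the 3x4 array is an arbitrary example:

```python
import numpy as np

# An illustrative 3x4 array of 64-bit integers
a = np.arange(12, dtype=np.int64).reshape(3, 4)

print(a.ndim)      # 2
print(a.shape)     # (3, 4)
print(a.size)      # 12
print(a.dtype)     # int64
print(a.itemsize)  # 8 (bytes per int64 element)

# reshape returns the same 12 elements arranged as 2 rows of 6
b = a.reshape(2, 6)
print(b.shape)     # (2, 6)
```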

NumPy array elements can be accessed using indexing. Below are some of the useful examples:

  • A[2:5] will print items 2 to 4. Indexing in NumPy arrays starts from 0
  • A[2::2] will print every second item from item 2 to the end
  • A[::-1] will print the array in the reverse order
  • A[1:] will print from row 1 to the end
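The indexing patterns above might look like this on a small array; A and M are illustrative names:

```python
import numpy as np

A = np.arange(10)     # [0 1 2 3 4 5 6 7 8 9]

print(A[2:5])         # [2 3 4]
print(A[2::2])        # [2 4 6 8]
print(A[::-1])        # [9 8 7 6 5 4 3 2 1 0]

# On a 2D array, a single slice applies to rows
M = np.arange(6).reshape(3, 2)
rows_from_1 = M[1:]   # rows 1 and 2, i.e. [[2, 3], [4, 5]]
print(rows_from_1)
```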

The session covers these and some important attributes of the NumPy array object in detail.

Vectors and Machine learning

Machine learning uses vectors. Vectors are one-dimensional arrays. A vector can be represented either as a row or as a column array.

What are vectors? A vector quantity is one that is defined by a magnitude and a direction. For example, force is a vector quantity. It is defined by the magnitude of the force as well as its direction. It can be represented as an array [a,b] of 2 numbers, such as [2,180], where ‘a’ represents the magnitude of 2 Newton and ‘b’ represents the angle of 180 degrees.

As another example, say a rocket is going up at a slight angle: it has a vertical speed of 5,000 m/s, a slight speed towards the East at 10 m/s, and a slight speed towards the North at 50 m/s. The rocket’s velocity may be represented by the following vector: [10, 50, 5000], which represents the speed in each of the x, y and z directions.

Similarly, vectors have several usages in Machine Learning, most notably to represent observations and predictions.

For example, say we built a Machine Learning system to classify videos into 3 categories (good, spam, clickbait) based on what we know about them. For each video, we would have a vector representing what we know about it, such as: [10.5, 5.2, 3.25, 7.0]. This vector could represent a video that lasts 10.5 minutes, but only 5.2% of viewers watch for more than a minute, it gets 3.25 views per day on average, and it was flagged 7 times as spam.

As you can see, each axis may have a different meaning. Based on this vector, our Machine Learning system may predict that there is an 80% probability that it is a spam video, 18% that it is clickbait, and 2% that it is a good video. This could be represented as the following vector: class_probabilities = [0.8,0.18,0.02].

As can be observed, vectors can be used in Machine Learning to define observations and predictions. The properties representing the video, such as its duration and the percentage of viewers watching for more than a minute, are called features.
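A minimal sketch of how the video example could be held in NumPy arrays. The values come from the text above; the labels list and the use of np.argmax to pick the most likely class are illustrative assumptions:

```python
import numpy as np

# Observation: duration (min), % watching > 1 min, views per day, spam flags
video = np.array([10.5, 5.2, 3.25, 7.0])

# Prediction: probabilities for spam, clickbait and good, in that order
class_probabilities = np.array([0.8, 0.18, 0.02])
labels = ["spam", "clickbait", "good"]   # assumed ordering, matching the text

# Pick the label with the highest probability
predicted = labels[int(np.argmax(class_probabilities))]

print(video.shape)   # (4,)
print(predicted)     # spam
```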

Why NumPy and Pandas over regular Python arrays?

In Python, a vector can be represented in many ways, the simplest being a regular Python list of numbers. Since Machine Learning requires lots of scientific calculations, it is much better to use NumPy’s ndarray, which provides a lot of convenient and optimized implementations of essential mathematical operations on vectors.

Vectorized operations run faster than the same matrix manipulations written as loops in Python. For example, for a 100 * 100 matrix multiplication, vectorized NumPy operations are about two orders of magnitude faster than performing it with loops.
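One way to check this claim on your own machine is sketched below. The helper matmul_loops is a hypothetical name introduced here, and the measured timings will vary with hardware and NumPy build, so no specific speed-up is asserted:

```python
import time
import numpy as np

n = 100
rng = np.random.default_rng(0)
A = rng.random((n, n))
B = rng.random((n, n))

def matmul_loops(X, Y):
    """Multiply two square matrices given as nested lists, using plain loops."""
    size = len(X)
    result = [[0.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(size):
            total = 0.0
            for k in range(size):
                total += X[i][k] * Y[k][j]
            result[i][j] = total
    return result

# Time the pure-Python triple loop
start = time.perf_counter()
C_loops = matmul_loops(A.tolist(), B.tolist())
loop_seconds = time.perf_counter() - start

# Time the vectorized NumPy matrix product
start = time.perf_counter()
C_numpy = A @ B
numpy_seconds = time.perf_counter() - start

# Both approaches should produce the same matrix
same = np.allclose(C_loops, C_numpy)
print(f"loops: {loop_seconds:.4f} s, NumPy: {numpy_seconds:.6f} s, equal: {same}")
```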

Some ways in which NumPy arrays are different from regular Python lists are:

  1. If you assign a single value to an ndarray slice, it is copied across the whole slice.

So, it is easier to assign values to a slice of a NumPy array than to a slice of a regular list, where the replacement must itself be a sequence of values, typically built with a loop or list repetition.
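The difference in slice assignment can be sketched as follows; the array values are arbitrary:

```python
import numpy as np

# NumPy: a single value is copied across the whole slice
a = np.array([1, 5, 3, 19, 13, 7, 3])
a[2:5] = -1
print(a.tolist())   # [1, 5, -1, -1, -1, 7, 3]

# Regular list: assigning a single value to a slice raises a TypeError
b = [1, 5, 3, 19, 13, 7, 3]
try:
    b[2:5] = -1
except TypeError as err:
    print("list slice assignment failed:", err)

# The list needs a sequence of the right length instead
b[2:5] = [-1] * 3
print(b)            # [1, 5, -1, -1, -1, 7, 3]
```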

  2. ndarray slices are actually views on the same data buffer. If you modify a slice, the original ndarray is modified as well. Slicing a regular Python list, in contrast, creates a new list.

If we need a copy of the NumPy array, we need to use the copy method, as in another_slice = a[2:6].copy(). If we modify another_slice, a remains the same.
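A short sketch of the view and copy behaviour, using arbitrary values:

```python
import numpy as np

a = np.array([1, 5, 3, 19, 13, 7, 3])

# A slice is a view: writing to it changes the original array
a_slice = a[2:6]
a_slice[1] = 1000
print(a[3])          # 1000

# An explicit copy is independent of the original
another_slice = a[2:6].copy()
another_slice[1] = 3000
print(a[3])          # still 1000

# Slicing a regular list already creates a new list
b = [1, 5, 3, 19, 13, 7, 3]
b_slice = b[2:6]
b_slice[1] = 1000
print(b[3])          # still 19
```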

  3. The way multidimensional arrays are accessed in NumPy is different from how nested Python lists are accessed. The generic format for NumPy multi-dimensional arrays is:

Array[row_start_index:row_end_index, column_start_index:column_end_index]
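For illustration, the format above can be compared with the equivalent access on a nested list. The 3x4 matrix m is an arbitrary example:

```python
import numpy as np

m = np.arange(12).reshape(3, 4)
# m is:
# [[ 0  1  2  3]
#  [ 4  5  6  7]
#  [ 8  9 10 11]]

# NumPy: rows 0-1 and columns 1-2 in a single indexing expression
sub = m[0:2, 1:3]
print(sub.tolist())   # [[1, 2], [5, 6]]
print(m[1, 2])        # 6

# Nested list: rows and columns have to be sliced separately
nested = m.tolist()
sub_list = [row[1:3] for row in nested[0:2]]
print(sub_list)       # [[1, 2], [5, 6]]
```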

NumPy arrays can also be accessed using boolean indexing, where a boolean array selects the elements whose corresponding value is True.
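A minimal sketch of boolean indexing; the array and the threshold of 5 are arbitrary choices:

```python
import numpy as np

a = np.array([1, 5, 3, 19, 13, 7, 3])

# Comparing an array with a value yields a boolean array of the same shape
mask = a > 5
print(mask.tolist())   # [False, False, False, True, True, True, False]

# Indexing with the mask keeps only the elements where it is True
print(a[mask])         # [19 13  7]
```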

 

NumPy arrays are capable of performing all basic operations such as addition, subtraction, element-wise product, matrix dot product, element-wise division, element-wise modulo, element-wise exponents and conditional operations.
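The operations listed above might be written as follows; a and b are arbitrary example vectors:

```python
import numpy as np

a = np.array([14, 23, 32, 41])
b = np.array([5, 4, 3, 2])

print(a + b)      # element-wise addition:     [19 27 35 43]
print(a - b)      # element-wise subtraction:  [ 9 19 29 39]
print(a * b)      # element-wise product:      [70 92 96 82]
print(a / b)      # element-wise division (floating point result)
print(a % b)      # element-wise modulo:       [4 3 2 1]
print(a ** b)     # element-wise exponent:     [537824 279841  32768   1681]
print(a.dot(b))   # dot product:               340
print(a >= 30)    # conditional operation:     [False False  True  True]
```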

An important feature with NumPy arrays is broadcasting.

In general, when NumPy expects arrays of the same shape but finds that this is not the case, it applies the so-called broadcasting rules.

Basically, there are 2 rules of Broadcasting to remember:

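  1. If the arrays do not have the same rank, a 1 is prepended to the shape of the lower-rank arrays until their ranks match.
  2. Arrays with a size of 1 along a particular dimension act as if they had the size of the array with the largest shape along that dimension. For example, on adding a 2D array of shape (2,1) to a 2D ndarray of shape (2, 3), NumPy applies this second rule and repeats the single column across all 3 columns.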
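Both rules can be seen in a small sketch; the shapes are chosen to match the descriptions above and the values are arbitrary:

```python
import numpy as np

# Rule 1: the 1D array of shape (5,) is treated as if it had shape (1, 1, 5)
h = np.arange(5).reshape(1, 1, 5)
r1 = h + np.array([10, 20, 30, 40, 50])
print(r1.shape)      # (1, 1, 5)
print(r1.tolist())   # [[[10, 21, 32, 43, 54]]]

# Rule 2: the (2, 1) column is repeated across the 3 columns of the (2, 3) array
k = np.arange(6).reshape(2, 3)    # [[0, 1, 2], [3, 4, 5]]
col = np.array([[100], [200]])    # shape (2, 1)
r2 = k + col
print(r2.tolist())   # [[100, 101, 102], [203, 204, 205]]
```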

NumPy provides basic mathematical and statistical functions like mean, min, max, sum, prod, std, var, summation across different axes, transposing of a matrix, etc.
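A brief sketch of these functions on an arbitrary 2x3 array:

```python
import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6]])

print(a.mean())        # 3.5
print(a.min())         # 1
print(a.max())         # 6
print(a.sum())         # 21
print(a.prod())        # 720
print(a.std())         # standard deviation of all six values
print(a.var())         # variance of all six values

# Summation across axes: axis=0 sums down each column, axis=1 across each row
print(a.sum(axis=0))   # [5 7 9]
print(a.sum(axis=1))   # [ 6 15]

# Transposing swaps rows and columns
print(a.T.shape)       # (3, 2)
```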

A particular NumPy feature of interest is solving a system of linear equations. A system written in matrix form as Ax = b, where A holds the coefficients and b the right-hand side values, can be solved in NumPy using the np.linalg.solve function.
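As an illustration, consider the hypothetical system 2x + 6y = 6 and 5x + 3y = -9 (chosen here for the example; the system used in the original session is not preserved in this text). A sketch of solving it with np.linalg.solve:

```python
import numpy as np

# Illustrative system (an assumption, not from the original post):
#   2x + 6y =  6
#   5x + 3y = -9
coeffs = np.array([[2, 6],
                   [5, 3]])
depvars = np.array([6, -9])

solution = np.linalg.solve(coeffs, depvars)
print(solution)   # approximately x = -3, y = 2

# Check the solution by substituting it back into the equations
print(np.allclose(coeffs.dot(solution), depvars))   # True
```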

What is Pandas?

Similar to NumPy, Pandas is one of the most widely used Python libraries in data science. It provides high-performance, easy-to-use data structures and data analysis tools. Unlike the NumPy library, which provides objects for multi-dimensional arrays, Pandas provides an in-memory 2D table object called DataFrame. It is like a spreadsheet with column names and row labels.

Hence, with 2D tables, Pandas is capable of providing many additional functionalities like creating pivot tables, computing columns based on other columns and plotting graphs. Pandas can be imported into Python using import pandas as pd, which makes it available under the conventional alias pd.

Some commonly used data structures in pandas are:

  1. Series objects: 1D array, similar to a column in a spreadsheet
  2. DataFrame objects: 2D table, similar to a spreadsheet
  3. Panel objects: Dictionary of DataFrames, similar to sheets in MS Excel (Panel has since been deprecated and was removed in pandas 0.25)

A Pandas Series object is created using the pd.Series function. Each row is provided with an index, which by default is assigned numerical values starting from 0. Like NumPy, Pandas also provides basic mathematical functionalities like addition, subtraction, conditional operations and broadcasting.
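A minimal sketch of these Series features; the values and the names in the weights index are made up for illustration:

```python
import pandas as pd

# Default index: 0, 1, 2, 3
s = pd.Series([2, -1, 3, 5])
print(s.index.tolist())          # [0, 1, 2, 3]

# Broadcasting a scalar and a conditional selection, as with NumPy arrays
shifted = s + 1000
positive = s[s > 0]
print(shifted.tolist())          # [1002, 999, 1003, 1005]
print(positive.tolist())         # [2, 3, 5]

# An explicit index with labels (hypothetical names)
weights = pd.Series([68, 83, 112], index=["alice", "bob", "charles"])
print(weights["bob"])            # 83
```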

A Pandas DataFrame object represents a spreadsheet with cell values, column names, and row index labels. A DataFrame can be visualized as a dictionary of Series. DataFrame rows and columns are simple and intuitive to access. Pandas also provides SQL-like functionality to filter and sort rows based on conditions.

New columns and rows can be easily added to the dataframe. In addition to the basic functionalities, pandas dataframe can be sorted by a particular column.
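The DataFrame operations described above could look like the following sketch. The people table, its column names and its values are hypothetical and only serve the illustration:

```python
import pandas as pd

# Build a DataFrame from a dictionary of Series (hypothetical data)
names = ["alice", "bob", "charles"]
people = pd.DataFrame(
    {
        "weight": pd.Series([68, 83, 112], index=names),
        "birthyear": pd.Series([1985, 1984, 1992], index=names),
    }
)

# Access a column by name and a row by its index label
birthyears = people["birthyear"]
bob = people.loc["bob"]

# SQL-like filtering: keep rows matching a condition
young = people[people["birthyear"] > 1984]

# Add a column computed from another column
people["age_in_2017"] = 2017 - people["birthyear"]

# Add a row by assigning to a new index label
people.loc["dave"] = [75, 1990, 27]

# Sort by a particular column, heaviest first
by_weight = people.sort_values(by="weight", ascending=False)
print(by_weight)
```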

DataFrames can also be easily exported to and imported from CSV, Excel, JSON, HTML and SQL databases. Some other essential methods that are present in DataFrames are:

  1. head(): returns the first 5 rows of the DataFrame by default
  2. tail(): returns the last 5 rows of the DataFrame by default
  3. info(): prints the summary of the dataframe
  4. describe(): gives a nice overview of the main aggregated values over each column
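A small sketch of a CSV round trip followed by the methods above. It uses an in-memory text buffer instead of a file on disk so that it is self-contained; the columns x and y are made up:

```python
import io
import pandas as pd

# Illustrative data: ten numbers and their squares
df = pd.DataFrame({"x": range(10), "y": [v * v for v in range(10)]})

# Export to CSV text, then import it back
csv_text = df.to_csv(index=False)
restored = pd.read_csv(io.StringIO(csv_text))

first_rows = restored.head()     # first 5 rows by default
last_rows = restored.tail(3)     # last 3 rows when a count is given
restored.info()                  # prints a summary of columns and dtypes
summary = restored.describe()    # count, mean, std, min, quartiles, max per column
print(summary)
```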

What is Matplotlib?

Matplotlib is a 2D plotting library which produces publication-quality figures in a variety of hardcopy formats and interactive environments. Matplotlib can be used in Python scripts, the Python and IPython shells, Jupyter Notebook, web application servers and GUI toolkits.

matplotlib.pyplot is a collection of functions that make Matplotlib work like MATLAB. The majority of plotting commands in pyplot have MATLAB analogs with similar arguments. Let us take a couple of examples:

Example 1: Plotting a line graph
Example 2: Plotting a histogram
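The two examples might be sketched with pyplot as follows. The data, titles and labels are illustrative assumptions, and the figure is written to an in-memory buffer with a non-interactive backend so that the sketch also runs without a display; in a notebook you would typically call plt.show() instead:

```python
import io

import matplotlib
matplotlib.use("Agg")   # non-interactive backend (an assumption for headless runs)
import matplotlib.pyplot as plt
import numpy as np

plt.figure(figsize=(10, 4))

# Example 1: line graph of y = x^2
x = np.linspace(-2, 2, 100)
plt.subplot(1, 2, 1)
plt.plot(x, x ** 2)
plt.title("Line graph")
plt.xlabel("x")
plt.ylabel("x squared")

# Example 2: histogram of 1000 normally distributed samples
data = np.random.default_rng(0).normal(size=1000)
plt.subplot(1, 2, 2)
counts, bin_edges, _ = plt.hist(data, bins=20)
plt.title("Histogram")

# Render the figure to a PNG held in memory
buffer = io.BytesIO()
plt.savefig(buffer, format="png")
plt.close()
```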


Summary

Hence, we observe that NumPy and Pandas make matrix manipulation easy. This flexibility makes them very useful in Machine Learning model development.
Check out the free course on Python for Machine Learning by CloudxLab. You can find the in-depth video tutorials on NumPy, Pandas, and Matplotlib in the course.

Python Setup Using Anaconda For Machine Learning and Data Science Tools

Python for Machine Learning

In this post, we will learn how to configure tools required for CloudxLab’s Python for Machine Learning course. We will use Python 3 and Jupyter notebooks for hands-on practicals in the course. Jupyter notebooks provide a really good user interface to write code, equations, and visualizations.

Please choose one of the options listed below for practicals during the course.

GraphFrames on CloudxLab

GraphFrames is quite a useful library for Spark that brings DataFrames and the GraphX package together.

From the website of Graphframes:

GraphFrames is a package for Apache Spark which provides DataFrame-based Graphs. It provides high-level APIs in Scala, Java, and Python. It aims to provide both the functionality of GraphX and extended functionality taking advantage of Spark DataFrames. This extended functionality includes motif finding, DataFrame-based serialization, and highly expressive graph queries.

You can use GraphFrames very easily with spark-shell at CloudxLab by using the --packages option in the following way.

How to install Python packages on CloudxLab?

In this blog post, we will learn how to install Python packages on CloudxLab.

Step 1-

Create the virtual environment for your project. A virtual environment is a tool to keep the dependencies required by different projects in separate places, by creating virtual Python environments for them. Login to CloudxLab web console and create a virtual environment for your project.

Running PySpark in Jupyter / IPython notebook

You can run PySpark code in Jupyter notebook on CloudxLab. The following instructions cover both versions 1 and 2 of Apache Spark.

What is Jupyter notebook?

The IPython Notebook is now known as the Jupyter Notebook. It is an interactive computational environment, in which you can combine code execution, rich text, mathematics, plots and rich media. For more details on the Jupyter Notebook, please see the Jupyter website.

Please follow the steps below to access the Jupyter notebook on CloudxLab

Step 1 – Login to web console

Using TensorFlow on CloudxLab

We are glad to inform you that TensorFlow is now available on CloudxLab. In this example, we will walk you through a basic tutorial on how to use TensorFlow.

What is TensorFlow?
TensorFlow is an Open Source Software Library for Machine Intelligence. It is developed and supported by Google and is being adopted very fast.

What is CloudxLab?
CloudxLab provides a real cloud-based environment for practicing and learning various tools. You can start learning right away by just signing up online.

Access S3 Files in Spark

In this blog post, we will learn how to access S3 files using Spark on CloudxLab.
Please follow the steps below to access S3 files:

Access Spark 1.2.1, Spark 1.6 and Spark 2.0 on CloudxLab

In this blog post we will learn how to access various versions of Spark on CloudxLab. Spark 1.2.1 will be helpful if you are preparing for CCA (Cloudera Certified Associate). Spark 1.6 will be useful for practicing SparkR. Please note that Spark 1.2.1, Spark 1.6 and Spark 2.0.1 may not integrate tightly with Hadoop, but you will be able to run most of the commands.

How to access Spark 1.2.1?

CloudxLab Getting Started Guide

Please use the resources below to make the most of your CloudxLab subscription

CloudxLab hands-on videos

Hadoop videos on CloudxLab

Spark videos on CloudxLab

CloudxLab Introduction

What is CloudxLab?

CloudxLab is a cloud based virtual lab for practicing Big Data (Hadoop, Spark etc), Machine Learning and Deep Learning technologies.

Origins

While training students on Big Data technologies at KnowBigData, we realized that our learners were facing a lot of trouble downloading and configuring virtual machines (VM) provided by major Hadoop vendors. Most often, these virtual machines were slow and would not allow for use of any other application on the same computer.

Moreover, working on a VM did not give a real-world experience, as one is still dealing with only one machine instead of a cluster of machines. Working on a cluster is the whole idea of Big Data technologies, which are primarily based on distributed computing.

This is how CloudxLab was conceptualized, in an effort to resolve these pain points of learners. The video below will help you understand how one of our clients – Simplilearn – is using CloudxLab to provide a better learning experience to their course takers.
