How to access databases using Jupyter Notebook

SQL is a very important skill. With it you can access not only relational databases but also big data systems through Hive, Spark SQL, and the like. Learning SQL can help you excel in roles such as Business Analyst, Web Developer, Mobile Developer, Data Engineer, Data Scientist, and Data Analyst. Having access to a SQL client right in the browser is therefore very handy. In this blog, we walk through examples of interacting with SQLite and MySQL using a Jupyter notebook.

A Jupyter notebook is a great tool for analytics and interactive computing. You can interact with tools such as Python, Linux, the file system, Scala, Lua, Spark, R, and SQL from the comfort of the browser. For almost every interactive tool, there is a corresponding Jupyter kernel. Let us walk through how you would use SQL to interact with various databases from the comfort of your browser.

Using Jupyter to access databases such as SQLite and MySQL.
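As a minimal sketch of the kind of interaction the post covers, here is how you might query a SQLite database from a notebook cell using Python's built-in sqlite3 module; the database file name and table are placeholders, not part of the original post:

```python
import sqlite3

# Connect to a local SQLite database file (created if it does not exist)
conn = sqlite3.connect("example.db")
cur = conn.cursor()

# Create a small table, insert a row, then query it back
cur.execute("CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("INSERT INTO users (name) VALUES (?)", ("Alice",))
conn.commit()

for row in cur.execute("SELECT id, name FROM users"):
    print(row)

conn.close()
```

For MySQL, the flow is the same idea with a different driver and connection string.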
Continue reading “How to access databases using Jupyter Notebook”

Getting Started with Apache Airflow

Apache Airflow

When you are building a production system, whether it is a machine learning model deployment or simple data cleaning, you need to run multiple steps with multiple different tools, and you often want to trigger some processes periodically. Doing this manually is not practical beyond a run or two. Therefore, you need a workflow manager and a scheduler. In the workflow manager, you define which processes to run and their interdependencies, and with the scheduler, you execute them on a certain schedule.

When I started using Apache Hadoop in 2012, we cleaned HDFS data using multiple streaming jobs written in Python, plus shell scripts and so on. It was cumbersome to run these manually. So we started using Azkaban, and later on Oozie arrived. Honestly, Oozie was less than impressive, but it stayed due to the lack of alternatives.

As of today, Apache Airflow seems to be the best solution for creating workflows. Unlike Oozie, Airflow is not specific to Hadoop. It is an independent tool – more like a combination of Apache Ant and Unix cron jobs – and it has many more integrations. Check out Apache Airflow’s website.
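As a rough illustration of what a workflow definition looks like, here is a minimal sketch of an Airflow DAG with two dependent tasks, assuming Airflow 2.x; the DAG id, commands, and schedule are made up for the example:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# A tiny DAG with two steps: clean the data, then build a report.
with DAG(
    dag_id="daily_cleanup",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    clean = BashOperator(task_id="clean_data", bash_command="python clean.py")
    report = BashOperator(task_id="build_report", bash_command="python report.py")

    clean >> report  # report runs only after clean succeeds
```

The `>>` operator is how interdependencies between tasks are declared; the scheduler then runs the DAG on the given schedule.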

Continue reading “Getting Started with Apache Airflow”

Zookeeper: Case Study

Introduction

Now that we have a decent idea of big data and distributed systems, locking in distributed systems, and Zookeeper, we are all set to go through a case study where we investigate the use of Zookeeper in a real-world scenario. Let’s get started.

Scenario

Consider a situation where we have an email inbox full of emails. Our task is to process those emails and classify each of them as spam or non-spam. This email inbox is read-only.

We have an email-processor program running on various machines that are physically distributed from each other.

Now these machines need to somehow coordinate such that:

  • No email is processed twice
  • No email is left unprocessed

Solution 1:

Usage of flags: we could mark each email as read or unread once any machine has picked it up, and only consider those emails which are not yet marked as read.

Cons:

Suppose processor 1 reads an email and marks it as read, and then the processor dies. That email would never be touched by any other processor in the future, because it was already marked as read by the first processor, and thus it would be left unprocessed.

Solution 2:

There could be a manager that handles the workload and distributes the work to workers.

Cons:

This manager could become a bottleneck, as it has to coordinate a large number of machines and would be overloaded. Also, what if the manager dies?

Solution 3:

We need central storage that notes down who is doing what: for example, the email id, the timestamp at which it was taken up by a processor, the status of processing, and so on.

Distributed systems with central storage service for coordination

Cons:

The central storage system can itself be a bottleneck. If the email-processor programs are running on a lot of machines, the central storage system would be in high demand; it would be overloaded and might also die.

Solution 4:

A distributed coordination service like Zookeeper could be an ideal solution to this problem.

Zookeeper:

  • provides simple primitives like set/get, so easy to program
  • has an easy data model, like a directory tree
  • is a resilient and highly available tool
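To make these primitives concrete, here is a small sketch using the kazoo Python client; the connection string, paths, and payload are assumptions for illustration, and the post itself does not prescribe a particular client library:

```python
from kazoo.client import KazooClient

# Connect to a Zookeeper ensemble (the host:port is an assumption for this sketch)
zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

# The data model is a directory tree of znodes; set/get/create are the basic primitives
zk.ensure_path("/email-processing")
zk.create("/email-processing/task-",
          b'{"email_id": 100, "status": "picked"}',
          ephemeral=True, sequence=True)

children = zk.get_children("/email-processing")
print(children)  # e.g. ['task-0000000001']

zk.stop()
```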

How could it solve the problem?

Suppose the process on Machine 1 wants to read some data from the email inbox. Say it has successfully picked 100 emails to process, and it notes down this information in Zookeeper by creating a sequential ephemeral znode, along with info such as the email_id, timestamp, and status. Since this process is creating a znode, it is obviously a write operation. When a process carries out a write operation on Zookeeper, it acquires a lock (with its session id identifying who is performing the write).

In the meanwhile, another process (maybe on another machine) may want to read emails and make a note of them in Zookeeper; that is, it wants to create a znode recording the emails it intends to pick. This is not possible while the first process still holds the lock. Once the first process releases the lock, the second process checks whether the emails it has picked up have already been processed by some other process, which ensures that no email is processed more than once. The second process can also check the timestamps at which emails were taken up by other processes and their status (whether an email was processed successfully after being picked up). If the timestamp is old and the status is still unsuccessful, the next process can pick up that email, ensuring that no email is left unprocessed. In this way, Zookeeper makes sure that no email is processed more than once and no email is left behind unprocessed.

As long as the first process holds the lock for its write operation, all the other processes that wish to acquire the lock and perform a write will have to wait, each creating a sequential ephemeral znode. The sequential znodes get suffixes with incrementing numbers for each newly created znode. Once the current process releases its lock, its znode is removed, and the process whose znode has the minimum number acquires the lock next. Thus, by creating sequential znodes, the order of operations is preserved.

Ephemeral znodes, in turn, help in tracking whether clients are alive or dead. If a client is active, it sends regular signals (called heartbeats) to Zookeeper to mark its presence. If it misses heartbeats due to a temporary network failure or the like, the session is still alive; but if the heartbeats cease for longer than the session timeout, Zookeeper concludes that the client is dead, the session times out, and the ephemeral znode disappears.

Thus, the reason for creating sequential ephemeral znodes is that sequentiality preserves the order in which the operations should be performed, and ephemerality tracks whether the clients are alive. A watcher can be placed to detect when any of the processes gets disconnected; a notification can then be sent to the appropriate resources so that a new process can come up and continue the work previously handled by the dead process, making the whole system fault-tolerant and highly available.
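This wait-in-line behaviour is exactly what Zookeeper lock recipes implement with sequential ephemeral znodes. As a hedged sketch, the kazoo client ships such a recipe; the lock path, identifier, and payload below are assumptions for illustration:

```python
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

# The Lock recipe queues waiters using sequential ephemeral znodes under the path
lock = zk.Lock("/email-processing/lock", "processor-1")

with lock:  # blocks until this client's znode has the lowest sequence number
    # Critical section: record which emails this processor is claiming
    zk.create("/email-processing/claims/claim-",
              b'{"email_ids": [1, 2, 3], "status": "in-progress"}',
              ephemeral=True, sequence=True, makepath=True)

zk.stop()
```

If the process dies while holding the lock, its ephemeral znode disappears when the session times out, so the next waiter in the queue acquires the lock automatically.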

If a Zookeeper server dies, a new server can come up, or the client can connect to some other server in the ensemble. Thus, Zookeeper is a distributed service: even if one Zookeeper server fails, it can still resiliently maintain coordination amongst the distributed systems.

Conclusions

Zookeeper is a distributed coordination service that provides the following mechanisms to promote coordination amongst distributed systems:

  • a distributed key-value store to store small JSON data
  • various types of znodes suitable for different use cases
  • monotonically increasing unique ids for the znodes
  • the Zookeeper ensemble
  • watches
  • notifications

The above mechanisms thus make Zookeeper:

  • resilient
  • highly available
  • fault tolerant
  • an efficient intermediary for coordination amongst distributed systems

If you are eager to know more about Zookeeper, feel free to visit here. To know more about CloudxLab courses, here you go!

Distributed Computing with Locks

Introduction

Having seen how prevalent big data is in real-world scenarios, it’s time for us to understand how such systems work. This is a very important topic for understanding the principles behind system design and coordination among machines in big data. So let’s dive in.

Scenario:

Consider a scenario where there is a data resource, and there is a worker machine that has to accomplish some task using that resource; for example, the worker has to process the data by accessing that resource. Remember that the data source holds a huge amount of data; that is, the data to be processed for the task is very large.

Continue reading “Distributed Computing with Locks”

Online Courses Free of Cost during #NoPayJan

Malcolm X once said, “Education is our passport to the future”. This has become more relevant than ever in the last year. The COVID-19 pandemic gave a big jolt to the economy and the existing strata of professions across the world. Many people lost their jobs or faced extensive pay cuts.

People who were up-to-date with technology made it through the darkest times, making online education the next big thing across the globe. According to a recent LinkedIn survey, more than 60% of professionals increased the amount of time spent on online learning for upskilling during the lockdown period. But the challenge was that online education was becoming more and more expensive, with a consistent fall in the quality of content. Online education slowly started becoming a far-fetched dream for the common man.

At CloudxLab, we strive to ensure that education does not feel like a luxury but a basic need that everybody is entitled to. Keeping this in mind, we bring forth “#NoPayJan”, during which you can access some of the most sought-after and industry-relevant courses completely free of cost. During #NoPayJan, anybody who signs up at CloudxLab will be able to access the contents of all the self-paced courses. This offer will run from January 1 till January 31, 2021. CloudxLab provides an online learning platform where you can learn and practice Data Science, Deep Learning, Machine Learning, Big Data, Python, etc.

While highly competitive and commercialized education providers have cluttered the online learning space, CloudxLab tries to break through with a disruptive change by making upskilling affordable, accessible, and thus achievable.

Happy New Year & Happy Learning!

Improving the Performance of Deep-Learning based Flask App with ZMQ

Introduction

It is a well-known fact that deep learning models are heavy, with a lot of weights in their deep layers. It is obviously an overhead to load the model every time we need predictions from it, and thus this is costly in terms of execution time.

In this project, we will mainly focus on addressing this issue by integrating the networking functionality provided by the ZMQ library. We will build a server-client architecture so that the model is loaded exactly once (that is, during the start of the app). The predictions from the model will be served by the model server, which listens for requests from its Flask client asking for predictions for an input image.
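As a rough sketch of this pattern (the socket address, message format, and model stub are assumptions, not the post’s exact implementation), the model server binds a ZMQ REP socket, loads the model once, and then answers requests sent by the Flask side over a REQ socket:

```python
import zmq

def load_model():
    # Placeholder for the expensive deep learning model load; in the real app
    # this would load the trained weights exactly once at startup (assumption)
    return lambda data: f"prediction for {len(data)} bytes"

# ---- model server: load the model once, then serve predictions over ZMQ ----
context = zmq.Context()
server = context.socket(zmq.REP)
server.bind("tcp://*:5555")

model = load_model()
while True:
    image_bytes = server.recv()    # blocking wait for a request from the Flask client
    result = model(image_bytes)    # run inference with the already-loaded model
    server.send_string(result)     # reply with the prediction
```

On the Flask side, a REQ socket connected to tcp://localhost:5555 would send the image bytes for each incoming request and wait for the reply, so the heavy model never has to be reloaded per request.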

Continue reading “Improving the Performance of Deep-Learning based Flask App with ZMQ”

REVA University partners with CloudxLab for setting up Center of Excellence in AI and Deep Technologies

REVA University signs an MoU with CloudxLab to set up a Center of Excellence in AI and Deep Technologies. In the picture: Dr. K.M Babu, Vice Chancellor; Dr. Dhanamjaya, Pro-Vice Chancellor; and Sandeep Giri, Founder, CloudxLab.

REVA University’s REVA Academy for Corporate Excellence (RACE) has inked a memorandum of understanding with CloudxLab for setting up a Center of Excellence in AI and Deep Technologies and providing a platform for promoting research and innovation.

The REVA University and CloudxLab research collaboration intends to work on technologies involving deep learning, reinforcement learning, curiosity-based machines, and distributed computing, and to launch specialized courses in these advanced technologies.

This collaboration will be aimed at providing and launching some highly sought-after courses in deep technologies involving experts from Academia as well as the industry. These courses will be delivered in hybrid mode – a combination of physical classroom, online instructor-led, self-paced, and project-based modes.

Dr. P. Shyama Raju, honorable Chancellor, REVA University, said, “This one-of-a-kind collaboration is aimed at being a launchpad for those who are planning to step into the world of AI, Deep Learning and other advanced technologies. It affirms REVA University’s commitment to make high-end technical education available to everyone in the world.”

“With Artificial Intelligence, machine learning, and other high-end technologies influencing every aspect of our lives, we are optimistic that this collaboration will help professionals in shaping their careers”, says Sandeep Giri, CEO and Founder at CloudxLab.

About REVA University 

REVA University is one of the top-ranked private Universities in Bangalore, India, offering a wide range of UG, PG and PhD programs. REVA Academy for Corporate Excellence (RACE) is one of the initiatives of REVA University focused on corporate training to develop visionary enterprise leaders through progressive and integrated learning capabilities. RACE offers best-in-class, specialized, techno-functional, and interdisciplinary programs that are designed to suit the needs of working professionals. 

How to label custom images for YOLO – YOLO 3

In this blog, we will show how to label custom images for making your own YOLO detector. We have other blogs that cover how to set up YOLO with Darknet and run object detection on images, videos, and live CCTV streams. If you want to detect items not covered by the general model, you need custom training.

In our case, we will build a truck-type detector. There are 4 types of trucks we will try to identify.
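For context on what the labels end up looking like, each image gets a companion .txt file with one line per object in the Darknet YOLO format: the class index followed by the box centre, width, and height, all normalised to the image size (values between 0 and 1). The class numbers and coordinates below are just an illustration for a hypothetical 4-class truck detector:

```
0 0.512 0.430 0.280 0.310
2 0.185 0.660 0.150 0.220
```

Labeling tools such as LabelImg can export annotations directly in this format, which is what the custom training step expects.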

Continue reading “How to label custom images for YOLO – YOLO 3”