Practice questions for Machine Learning Engineer Roles

Welcome!

You might have come across several posts which focus only on the theoretical questions for you to prepare for a machine learning engineer role. But is the theoretical preparation enough?

The ML Engineers in the real world do much more than just making models. They spend enough time understanding the data before actually building a model. For this, they should be able to perform different operations on the data, make intuitions and manipulate the data as per the needs. So an ML Engineer must be able to how to play with data and tell some intuition stories.

Pandas is a library for Python to perform various operations on data. Numpy is a famous Python library for numerical computations. It is often expected that an ML Engineer is well-versed with both of these libraries. But where to practice?

ClouldxLab offers a solution. We have come up with some amazing questions which would help you practice Python, Pandas and Numpy hands-on and make you interview ready.

So what are you waiting for? Encourage the aspiring ML Engineer in you, by waking up the problem solver in you. Practice the following questions: https://cloudxlab.com/assessment/playlist-intro/862/machine-learning-prerequisite-mains-10th-july-2021.

All the best!

Getting Started with various tools at CloudxLab

Welcome!

We are happy to announce that we have come up with a new consolidated playlist, which summaries about various tools present at CloudxLab environment, how to use them and where to learn about them.

This would be incrementally improved as new technologies keep getting installed on the lab.

You may find the playlist here.

In this playlist, there is a dedicated slide for each technology. For example, if you want to understand how to use Pandas on the lab, go to the slide named Pandas.

Upon clicking on Pandas, you would be able to see the Pandas guide as follows:

As you could see, this slide contains all the basic information needed such as:

  • the purpose of the library
  • link for the official home page
  • link for the official documentation
  • related resources you could use to learn about the library.
  • instructions on how to use it on the CloudxLab environment.
  • 1-2 lines of sample examples to use it, such as how to inport the library and how to check the version.

We hope that this will be a great starting guide for our users and makes their job of getting started easier.

Happy learning!

When to use While, For, and Map for iterations in Python?

Python has a really sophisticated way of handling iterations. The only thing it does not have “GOTO Labels” which I think is good.

Let us compare the three common ways of iterations in Python: While, For and Map by the way of an example. Imagine that you have a list of numbers and you would like to find the square of each number.

nums = [1,2,3,5,10]
result = []
for num in nums:
    result.append(num*num)
print(result)

It would print [1, 4, 9, 25, 100]

Continue reading “When to use While, For, and Map for iterations in Python?”

How to handle Command Line Arguments in Python?

When you are running python programs from the command line, you can pass various arguments to the program and your program can handle it.

Here is a quick snippet of code that I will be explaining later:

import sys
if __name__ == "__main__":
    print("You passed: ", sys.argv)

When you run this program from the command line, you will get this kind of results:

$ python cmdargs.py
 You passed:  ['cmdargs.py']

Notice that the sys.argv is an array of strings containing all arguments passed to the program. And the first value(at zeroth index) of this array is the name of the program itself. You can put all kinds of check on it.

Continue reading “How to handle Command Line Arguments in Python?”

How does YARN interact with Zookeeper to support High Availability?

In the Hadoop ecosystem, YARN, short for Yet Another Resource negotiator, holds the responsibility of resource allocation and job scheduling/management. The Resource Manager(RM), one of the components of YARN, is primarily responsible for accomplishing these tasks of coordinating with the various nodes and interacting with the client.

To learn more about YARN, feel free to visit here.

Architecture of YARN

Hence, Resource Manager in YARN is a single point of failure – meaning, if the Resource Manager is down for some reason, the whole of the system gets disturbed due to interruption in the resource allocation or job management, and thus we cannot run any jobs on the cluster. 

To avoid this issue, we need to enable the High Availability(HA) feature in YARN. When HA is enabled, we run another Resource Manager parallelly on another node, and this is known as Standby Resource Manager. The idea is that, when the Active Resource Manager is down, the Standby Resource Manager becomes active, and ensures smooth operations on the cluster. And the process continues.

Continue reading “How does YARN interact with Zookeeper to support High Availability?”

Parallel Computing with Dask

Dask collections and schedulers
Source: dask.org

I recently discovered a nice simple library called Dask.

Parallel computing basically means performing multiple tasks in parallel – it could be on the same machine or on multiple machines. When it is on multiple machines, it is called distributed computing.

There are various libraries that support parallel computing such as Apache Spark, Tensorflow. A common characteristic you would find in most parallel computing libraries you would is the computational graph. A computational graph is essentially a directed acyclic graph or dependency graph.

Continue reading “Parallel Computing with Dask”

How to use a library in Apache Spark and process Avro and XML Files

What is Serialization? And why it’s needed?

Before we start with the main topic, let me explain a very important idea called serialization and its utility.

The data in the RAM is accessed based on the address that is why the name Random Access Memory but the data in the disc is stored sequentially. In the disc, the data is accessed using a file name and the data inside a file is kept in a sequence of bits. So, there is inherent mismatch in the format in which data is kept in memory and data is kept in the disc. You can watch this video to understand serialization further.

Serialization is converting an object into a sequence of bytes.
Continue reading “How to use a library in Apache Spark and process Avro and XML Files”

How to access databases using Jupyter Notebook

SQL is a very important skill. You not only can access the relational databases but also big data using Hive, Spark-SQL etcetera. Learning SQL could help you excel in various roles such as Business Analytics, Web Developer, Mobile Developer, Data Engineer, Data Scientist, and Data Analyst. Therefore having access to SQL client is very important via browser. In this blog, we are going to walk through the examples of interacting with SQLite and MySQL using Jupyter notebook.

A Jupyter notebook is a great tool for analytics and interactive computing. You can interact with various tools such as Python, Linux, File System, Scala, Lua, Spark, R, and SQL from the comfort of the browser. For almost every interactive tool, there is a kernel in Jupyter. Let us walk through how would you use SQL to interact with various databases from the comfort of your browser.

Using Jupyter to access databases such SQLite and MySQL.
Continue reading “How to access databases using Jupyter Notebook”

Getting Started with Apache Airflow

Apache Airflow

When you are building a production system whether it’s a machine learning model deployment or simple data cleaning, you would need to run multiple steps with multiple different tools and you would want to trigger some processes periodically. This is not possible to do it manually more than once. Therefore, you need a workflow manager and a scheduler. In workflow manager, you would define which processes to run and their interdependencies and in scheduler, you would want to execute them at a certain schedule.

When I started using Apache Hadoop in 2012, we used to get the HDFS data cleaned using our multiple streaming jobs written in Python, and then there were shell scripts and so on. It was cumbersome to run these manually. So, we started using Azkaban for the same, and later on Oozie came. Honestly, Oozie was less than impressive but it stayed due to the lack of alternatives.

As of today, Apache Airflow seems to be the best solution for creating your workflow. Unlike Oozie, Airflow is not really specific to Hadoop. It is an independent tool – more like a combination of Apache Ant and Unix Cron jobs. It has many more integrations. Check out Apache Airflow’s website.

Continue reading “Getting Started with Apache Airflow”

How to design a large-scale system to process emails using multiple machines [Zookeeper Use Case Study]?

Introduction

As part of this blog we are going to discuss various ways of large scale system design and the pros-cons of each.

To get a fair understanding of this post, you should know what is distributed computing, what is deadlock and race conditions, locking in distributed systems and Zookeeper etc. Let’s get started.

Scenario

Consider a situation where we have an email inbox that consists of emails, and emails are to be processed. For example, processing those emails and classifying each of the emails as spam or non-spam. The other example of the processing could be we are indexing the email so that the search could be performed.

We have an email-processor program, running on various machines distributed physically from each other.

Email processor program running on distributed systems

Now these machines need to somehow coordinate such that:

  • No email is processed two times
  • No email is left unprocessed
Continue reading “How to design a large-scale system to process emails using multiple machines [Zookeeper Use Case Study]?”