Natural Languages and AI

During one of the keynote speeches in India, an elderly person asked a question: why don't we use Sanskrit for coding in AI? Though this question might look very strange to researchers at first, it has some deep background to it.

Long back, when people were trying to build language translators, the main idea was to have an intermediate language to and from which we could translate any language. If we build direct translation from every language A to every language B, there will be too many combinations: with 10 languages, we would have to build 90 (10*9) such translators. With an intermediate language, we would just need 10 encoders to convert each language into the intermediate language and 10 decoders to convert the intermediate language back into each language. Therefore, there would be only 20 models in total.

So, it was obvious that there was a need for an intermediate language. The question was what the intermediate language should be. Some scientists proposed Sanskrit as the intermediate language because it has a well-defined grammar. Others thought a programming language that could be loaded dynamically would be better, and designed programming languages such as Lisp. Soon enough, they all realized that neither natural languages nor programming languages such as Lisp would suffice, for multiple reasons: first, there may not be enough words to represent every emotion in different languages; second, all of this would have to be coded manually.

The approach that became successful was the one in which the intermediate representation is a list of numbers, along with another bunch of numbers that represent the context. Also, instead of manually coding the meaning of each word, the idea that worked out was representing a word or a sentence with a bunch of numbers. This idea of representing words as lists of numbers has brought a revolution in natural language understanding, and there is a humongous amount of research happening in this domain today. Please check out GPT-3, DALL-E, and Imagen.

If you subtract Woman from Queen and add Man, what should the result be? It should be King, right? This can be easily demonstrated using word embeddings.

Queen - Woman + Man = King

Similarly, Emperor - Man + Woman = Empress

Yes, this works. Each of these words is represented by a list of numbers. So, we are truly able to represent the meaning of words with a bunch of numbers. If you think about it, we learned the meaning of each word in our mother tongue without using a dictionary. Instead, we figured the meaning out using the context.
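
Here is a minimal sketch of the Queen - Woman + Man arithmetic using tiny made-up 3-dimensional vectors (real embeddings such as word2vec or GloVe learn hundreds of dimensions; the numbers below are purely illustrative):

import numpy as np

# Toy, hand-picked "embeddings" -- purely illustrative values
vectors = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.8]),
}

# Queen - Woman + Man
result = vectors["queen"] - vectors["woman"] + vectors["man"]

# Pick the word whose vector is closest to the result
closest = min(vectors, key=lambda w: np.linalg.norm(vectors[w] - result))
print(closest)  # prints: king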

In our mind, we have a sort of representation of each word which is definitely not in the form of some other natural language. Based on the same principle, the algorithms also figure out the meaning of words in terms of a bunch of numbers. It is very interesting to understand how these algorithms work. They work on principles similar to how humans learn: they go through a large corpus of data, such as Wikipedia or news archives, and figure out the numbers with which each word can be represented. The problem is one of optimization: come up with numbers to represent each word such that the distance between words that appear in similar contexts is small compared to the distance between words that appear in different contexts.

The word Cow is closer to Buffalo than to Cup because Cow and Buffalo usually appear in similar contexts in sentences.
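
As a sketch of what "closer" means here, cosine similarity is one common way to measure it; the vectors below are again made up purely for illustration:

import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors; values near 1 mean very similar
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical toy embeddings -- real values come from training on a large corpus
cow     = np.array([0.8, 0.7, 0.1])
buffalo = np.array([0.7, 0.8, 0.2])
cup     = np.array([0.1, 0.2, 0.9])

print(cosine_similarity(cow, buffalo))  # high, roughly 0.99
print(cosine_similarity(cow, cup))      # much lower, roughly 0.31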

So, in summary, it is not reasonable to insist that a natural language should be used to represent the meaning of a word or sentence.

I hope this makes sense to you. Please post your opinions in the comments.

How to use Numpy Meshgrid to generate data?

When you are generating data, the meshgrid function of Numpy helps us generate coordinate data from individual arrays.

Say you have a set of values for x: 0.1, 0.2, 0.3, 0.4. You want to generate all possible points by combining these four values with, say, three values for y: 4, 5, 6.

This can be done very easily by using the meshgrid function of Numpy as follows:

import numpy as np
import matplotlib.pyplot as plt

# Build a 3x4 grid of (x, y) coordinates from the two 1-D arrays
x, y = np.meshgrid([0.1, 0.2, 0.3, 0.4], [4, 5, 6])
plt.scatter(x, y)  # plot all 12 points of the grid
plt.show()
The Scatter plot generated after meshgrid
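
For reference, the two arrays returned by meshgrid look like this; each is 3x4, pairing every x value with every y value:

print(x)
# [[0.1 0.2 0.3 0.4]
#  [0.1 0.2 0.3 0.4]
#  [0.1 0.2 0.3 0.4]]
print(y)
# [[4 4 4 4]
#  [5 5 5 5]
#  [6 6 6 6]]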

To learn more about it, please visit Numpy Meshgrid Reference

MLOps (Machine Learning Operations) – A Complete Hands-On Guide with Case Study

Learn about MLOps with a case study and guided projects. This will also cover DevOps, Software Engineering, and Systems Engineering in the right proportions to ensure your understanding is complete.

Introduction

As part of this series of blogs, I want to help you understand MLOps. MLOps stands for Machine Learning Operations: it is basically a combination of Machine Learning, Software Development, and Operations. It is a vast topic. I want to first establish the value of MLOps and then discuss the various concepts to learn by way of guided projects, in a very hands-on manner. If you are looking for a theoretical foundation on MLOps, please follow this documentation by Google.


Case Study

Objective

When I was working with an organization, we wanted to build and show recommendations. The main idea was that we needed to show something interesting to the users in a small ad space, something people would want to use. Since we had the anonymized historical behavior of the users, we settled on showing recommendations of apps that they would also like, based on their usage of various apps at the time when the ads were being displayed.

Continue reading “MLOps (Machine Learning Operations) – A Complete Hands-On Guide with Case Study”

Practice questions on Data Structures and Algorithms for Software Engineer Roles

Welcome!

You might have seen many people getting anxious about coding interviews. In a coding interview, you are mostly tested on Data Structures and Algorithms. It can be quite challenging and stressful considering the vastness of the topic.

Software Engineers in the real world have to do a lot of problem-solving. They spend a good amount of time understanding the problem before actually coding it. The main reason to practice Data Structures and Algorithms is to improve your problem-solving skills, so a Software Engineer must have a good understanding of both. But where to practice?

CloudxLab offers a solution. We have come up with some amazing questions which will help you practice Data Structures and Algorithms and make you interview-ready.

So what are you waiting for? Encourage the aspiring Software Engineer in you by waking up the problem solver within. Practice the following questions: https://cloudxlab.com/assessment/playlist-intro/566/data-structures-and-algorithms-questions

All the best!

When to use While, For, and Map for iterations in Python?

Python has a really sophisticated way of handling iterations. The only thing it does not have is "GOTO" labels, which I think is a good thing.

Let us compare the three common ways of iterating in Python: while, for, and map, by way of an example. Imagine that you have a list of numbers and you would like to find the square of each number.

nums = [1, 2, 3, 5, 10]
result = []
for num in nums:
    result.append(num * num)  # square each number
print(result)

It would print [1, 4, 9, 25, 100]
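
For comparison, here is a sketch of the same computation using a while loop and using map; the rest of the post discusses when each style is the better fit:

nums = [1, 2, 3, 5, 10]

# Using a while loop -- you manage the index yourself
result = []
i = 0
while i < len(nums):
    result.append(nums[i] * nums[i])
    i += 1
print(result)  # [1, 4, 9, 25, 100]

# Using map -- apply a function to every element of the list
result = list(map(lambda num: num * num, nums))
print(result)  # [1, 4, 9, 25, 100]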

Continue reading “When to use While, For, and Map for iterations in Python?”

How to handle Command Line Arguments in Python?

When you are running Python programs from the command line, you can pass various arguments to the program, and your program can handle them.

Here is a quick snippet of code that I will be explaining later:

import sys
if __name__ == "__main__":
    print("You passed: ", sys.argv)

When you run this program from the command line, you will get results like this:

$ python cmdargs.py
 You passed:  ['cmdargs.py']

Notice that sys.argv is a list of strings containing all the arguments passed to the program, and the first value (at the zeroth index) is the name of the program itself. You can put all kinds of checks on these values.
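
For instance, if you run the same script with a few arguments (the argument values here are just illustrative), you would see something like this:

$ python cmdargs.py input.csv --verbose 42
You passed:  ['cmdargs.py', 'input.csv', '--verbose', '42']

Note that every argument, including 42, arrives as a string, so numeric arguments need to be converted explicitly.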

Continue reading “How to handle Command Line Arguments in Python?”

Parallel Computing with Dask

Dask collections and schedulers (Source: dask.org)

I recently discovered a nice simple library called Dask.

Parallel computing basically means performing multiple tasks in parallel – it could be on the same machine or on multiple machines. When it is on multiple machines, it is called distributed computing.

There are various libraries that support parallel computing, such as Apache Spark and TensorFlow. A common characteristic you would find in most parallel computing libraries is the computational graph. A computational graph is essentially a directed acyclic graph, or dependency graph.
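
As a minimal sketch of how Dask builds such a graph (assuming Dask is installed; the functions below are just stand-ins), dask.delayed wraps ordinary Python functions into lazy tasks and .compute() executes the resulting graph:

from dask import delayed

@delayed
def load(x):
    return x * 2          # stand-in for reading or preparing data

@delayed
def combine(a, b):
    return a + b          # stand-in for an aggregation step

# No work happens yet -- these calls only build the dependency graph
a = load(10)
b = load(20)
total = combine(a, b)

print(total.compute())    # executes the graph and prints 60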

Continue reading “Parallel Computing with Dask”

How to use a library in Apache Spark and process Avro and XML Files

What is Serialization? And why is it needed?

Before we start with the main topic, let me explain a very important idea called serialization and its utility.

The data in RAM is accessed based on an address, which is why it is called Random Access Memory, but the data on disk is stored sequentially. On disk, data is accessed using a file name, and the data inside a file is kept as a sequence of bytes. So, there is an inherent mismatch between the format in which data is kept in memory and the format in which it is kept on disk. You can watch this video to understand serialization further.

Serialization is converting an object into a sequence of bytes.
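
As a small Python illustration (the post itself deals with Avro and XML in Spark; pickle here is just a familiar stand-in for the idea), serialization turns an in-memory object into bytes that can be written to disk and read back:

import pickle

record = {"app": "maps", "daily_users": 1200}   # an in-memory object (made-up data)

# Serialize: object -> sequence of bytes
data = pickle.dumps(record)
print(type(data))            # <class 'bytes'>

# Deserialize: bytes -> object again
restored = pickle.loads(data)
print(restored == record)    # True
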
Continue reading “How to use a library in Apache Spark and process Avro and XML Files”

How to access databases using Jupyter Notebook

SQL is a very important skill. With it, you can access not only relational databases but also big data using Hive, Spark SQL, etcetera. Learning SQL can help you excel in various roles such as Business Analyst, Web Developer, Mobile Developer, Data Engineer, Data Scientist, and Data Analyst. Therefore, having access to an SQL client via the browser is very important. In this blog, we are going to walk through examples of interacting with SQLite and MySQL using a Jupyter notebook.

A Jupyter notebook is a great tool for analytics and interactive computing. You can interact with various tools such as Python, Linux, File System, Scala, Lua, Spark, R, and SQL from the comfort of the browser. For almost every interactive tool, there is a kernel in Jupyter. Let us walk through how you would use SQL to interact with various databases from the comfort of your browser.
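
As a minimal sketch of the SQLite side (the table and values below are made up for illustration), the built-in sqlite3 module can be used directly in a notebook cell:

import sqlite3

# An in-memory database keeps the example self-contained
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.execute("CREATE TABLE courses (name TEXT, enrollments INTEGER)")
cur.execute("INSERT INTO courses VALUES ('MLOps', 120), ('Spark', 95)")
conn.commit()

for row in cur.execute("SELECT * FROM courses ORDER BY enrollments DESC"):
    print(row)   # ('MLOps', 120) then ('Spark', 95)

conn.close()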

Using Jupyter to access databases such as SQLite and MySQL.
Continue reading “How to access databases using Jupyter Notebook”

Getting Started with Apache Airflow

When you are building a production system, whether it is a machine learning model deployment or simple data cleaning, you need to run multiple steps with multiple different tools, and you want to trigger some processes periodically. It is not practical to do this manually more than once. Therefore, you need a workflow manager and a scheduler. In the workflow manager, you define which processes to run and their interdependencies, and with the scheduler, you execute them on a certain schedule.

When I started using Apache Hadoop in 2012, we used to get the HDFS data cleaned using multiple streaming jobs written in Python, and then there were shell scripts and so on. It was cumbersome to run these manually. So, we started using Azkaban, and later on Oozie came along. Honestly, Oozie was less than impressive, but it stayed due to the lack of alternatives.

As of today, Apache Airflow seems to be the best solution for creating workflows. Unlike Oozie, Airflow is not really specific to Hadoop. It is an independent tool, more like a combination of Apache Ant and Unix cron jobs, and it has many more integrations. Check out Apache Airflow's website.
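
As a minimal sketch of what an Airflow workflow looks like (assuming Airflow 2.x; the DAG name, schedule, and commands are just illustrative), you define a DAG and the dependencies between its tasks in plain Python:

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

# A toy two-step workflow: clean the data, then build a report
with DAG(
    dag_id="daily_cleanup",          # illustrative name
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",      # run once a day
    catchup=False,
) as dag:
    clean = BashOperator(task_id="clean_data",
                         bash_command="echo cleaning data")
    report = BashOperator(task_id="build_report",
                          bash_command="echo building report")

    clean >> report  # build_report runs only after clean_data succeeds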

Continue reading “Getting Started with Apache Airflow”