How to design a large-scale system to process emails using multiple machines [Zookeeper Use Case Study]?

Zookeeper playing crucial role to achieve coordination among distributed systems

Introduction

As part of this blog we are going to discuss various ways of large scale system design and the pros-cons of each.

To get a fair understanding of this post, you should know what is distributed computing, what is deadlock and race conditions, locking in distributed systems and Zookeeper etc. Let’s get started.

Scenario

Consider a situation where we have an email inbox that consists of emails, and emails are to be processed. For example, processing those emails and classifying each of the emails as spam or non-spam. The other example of the processing could be we are indexing the email so that the search could be performed.

We have an email-processor program, running on various machines distributed physically from each other.

Email processor program running on distributed systems

Now these machines need to somehow coordinate such that:

  • No email is processed two times
  • No email is left unprocessed

Solution 1:

Usage of flags: we could mark the emails to be read or unread by any machine previously, and only consider those emails which are not yet read.

CONS:

While processor 1 reads an email and marks it as read, and then the processor dies, then the email would not be touched by any other processor in future, because it was already marked as read by the first processor, and thus this email would be left unprocessed.

SOLUTION 2:

There should be a manager that could handle the workload and distribute the work to workers.

Cons:

This manager could be a bottleneck as it has to maintain a large number of systems, and thus it would be overloaded. Also, what is the manager dies?

SOLUTION 3:

We need a central storage which could note down who is doing what, like email id, timestamp it was taken up by a processor, status of completion of processing, etc.

Zookeeper playing crucial role to achieve coordination among distributed systems

CONS:

The central storage system can be a bottleneck. Say the email processor programs are running on a lot of machines, then the central storage system would be on high demand and thus it will be overloaded, and it may also die.

Solution 4:

By using a distributed system that provides locking such as Zookeeper. You can also use the standard RDBMS system with locking but that would not be highly available.

Zookeeper :

  • provides simple primitives like set/get, so easy to program
  • has an easy data model, like a directory tree
  • is a resilient and highly available tool

To know more about CloudxLab courses, here you go!