Improving the Performance of Deep-Learning based Flask App with ZMQ


It is well known that deep learning models are heavy, with a large number of weights across their deep layers. Loading the model every time we need predictions is therefore a significant overhead, and it makes each request costly in terms of execution time.

In this project, we address this issue by integrating the networking functionality provided by the ZMQ library. We will build a server-client architecture so that the model is loaded exactly once (when the app starts). The model server then serves predictions for as long as it is running, listening for its Flask client, which requests predictions for an input image.

The full version of the guided project is here. And the full code is present here on the GitHub.


This project is in continuation with Project – How to Deploy an Image Classification Model using Flask.

In this project, we work exclusively on improving the performance of that image classification app and bring the execution time down to roughly 0.3 seconds on average.

So it is expected that you are already aware of the working of the code of the Project – How to Deploy an Image Classification Model using Flask.

  • Please do not delete the virtual environment as mentioned in the last step of that project, since we will be treating that project as the very first step for this project.

Checking the Runtime

Q: What is the pain point of the previous project?


(1) The amount of time taken to predict the classes of an input image

  • We could see the amount of time taken to give out the predictions.
  • If we experiment with different images, we observe that the time taken is around 10-21 seconds for each input image.
  • This is far too long to be tolerable in real-time environments. Users expect the applications they use to be not only accurate but also fast.

(2) It is a monolith program, thus no flexibility

  • A big service built as a single unit is generally called a monolith, meaning made out of a single stone.
  • When we make something from a single stone, we can’t break it into parts and separate them: we cannot make it modular, and it cannot be spread across multiple machines. For example, the deep-learning service might be computation-intensive and require GPUs, while CPUs may suffice for the web service, which does not need such heavy computational resources. Thus, breaking down the services based on their needs and resource consumption provides optimal resource usage (and so lower cost), modularisation of the code (and thus the services), flexibility in scaling, and isolation of team responsibilities (the ML engineer doesn’t need to bother about the work of the software engineer, and vice versa).

Q: But what is causing the problem?

A: Re-execution of the code for each request due to the monolith programming style

  • Model and web server are coded in monolith style, so no flexibility.
  • Also, the same heavy model executions are repeating due to this style of programming.
  • The way the previous app works is this:
    • It is coded in a monolith programming style, by putting the flask server and the model loading/inference, all in the same file.
    • Each request from the user will be re-directed to the dedicated URL.
    • Each URL invokes that corresponding function in
    • Each time is in use, all the imports, and all the execution in that function invoked will be done from the beginning.
    • Although we want to reuse the same model for every input image, the model is loaded afresh for each request. This is costly and therefore time-consuming.
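The cost described in the list above can be illustrated with a small simulation. This is only a sketch: `load_model` is a hypothetical stand-in that sleeps briefly instead of loading real weights, and the handler names are invented for the illustration. It compares rebuilding the model inside every request with building it once and reusing it.

```python
import time

LOAD_SECONDS = 0.05  # stand-in for the multi-second cost of loading real weights
load_count = 0       # how many times the "model" was constructed


def load_model():
    """Pretend to load a heavy model (hypothetical; no real weights involved)."""
    global load_count
    load_count += 1
    time.sleep(LOAD_SECONDS)
    return lambda image: "label-for-" + image


def handle_request_monolith(image):
    """Monolith style: the model is rebuilt inside every request."""
    model = load_model()
    return model(image)


def make_cached_handler():
    """Load once up front, then reuse the same model object for every request."""
    model = load_model()
    return lambda image: model(image)


def timed(handler, images):
    """Run the handler on every image and return (results, elapsed seconds)."""
    start = time.perf_counter()
    results = [handler(img) for img in images]
    return results, time.perf_counter() - start


images = ["cat.jpg", "dog.jpg", "car.jpg"]

_, slow = timed(handle_request_monolith, images)
loads_monolith = load_count

load_count = 0
cached_handler = make_cached_handler()
_, fast = timed(cached_handler, images)
loads_cached = load_count

print(f"monolith:  {loads_monolith} loads, {slow:.2f}s for {len(images)} requests")
print(f"load-once: {loads_cached} load, {fast:.2f}s for {len(images)} requests")
```

With a real model the gap is far larger, since each load takes seconds rather than a few hundredths of a second.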

Q: Then what could be a potential solution?

A: Construct the model once, and keep it ready-to-use in the RAM

  • Now that we understand that the problem is due to the monolith programming style and thus due to the costly process of loading the model repeatedly for each prediction request, we need to find a way where we could overcome the overhead caused by this step.
  • Java has the static keyword, with which variables can be declared static. Static members are loaded into memory exactly once, i.e. when the Java class is loaded, and they can be used by different threads.
  • But this method has a disadvantage: the service will not be scalable, because static variables live on one computer only and are not available on any other. So there are complications involved when you start using static variables.
  • Also, Python has no static keyword.
  • Hence, we switch to an approach that partially uses the static-variable concept, but in a wiser way, by making use of a server-client architecture.

Q: So how is this project going to address the issue with execution time?

A: To address the pain point, we introduce asynchrony by integrating ZMQ (an asynchronous networking library) with the Flask server

  • The idea is to import the modules and construct the graph exactly once, run a server, and let the server use these loaded modules or variables any number of times as long as the server is active and it is receiving requests.
  • This can be done by breaking the code down into 2 services: a web service and a model service.
  • We will achieve this by defining a service (here, let us call it a model server) which imports and constructs the graph once, keeps it ready for inference, and keeps listening for client requests. Once a client (here, the flask server) sends a request, the server (which has already constructed the graph and kept it ready for inference) responds to the client with predictions. The flask server then renders the corresponding HTML page with the predictions.
  • ZMQ is a library that provides the networking capabilities with which we can build a custom server-client mechanism as per our need.
  • We achieve this as follows:
    • We separate the model and the
    • We create a server class named Server, which invokes the RequestHandler class each time the server gets a request. We will be defining these classes in the file.
      (a) As long as the Server is not stopped, it listens on a specified port (say 5576).
      (b) As soon as the server receives a request from a client, it invokes the RequestHandler class, which handles the request by (1) converting the base64-encoded image into a normal image, (2) preprocessing the image, (3) feeding the image to the model, (4) getting the predictions, and (5) returning the resultant predictions to the server.
      (c) The server responds to the client with the resultant predictions.
    • When a user submits an image for its predictions, the corresponding URL will be invoked(say
    • That URL invokes the corresponding function in upload_file function in file).
    • This invoked function acts as a client. It connects a socket to port 5576, sends the image in an encoded form (here we will use base64 encoding for the input image), and keeps polling (that is, waiting) for the response from the server.
    • The server receives the request, keeps track of the client through a unique id, and routes the request to a request handler. The request handler converts it back to a normal image, preprocesses it as required by the pre-trained model, feeds the preprocessed image to the model, and returns the predictions (in the form of a JSON object) to the server. The server sends this response to the client that requested it.
    • This JSON object will be received back by the same function(which requested the server previously) in and the predictions will be sent to the corresponding template which would be rendered in that function.
    • In this workflow, it is ZMQ that provides the mechanism for connecting through the specified sockets and keeps the server listening on that port.
    • At the beginning of the file, we load the model. Then we start the server, which keeps listening on a port and responds to the client. We don’t load the model again as long as the server is active.
    • The server listens as long as it is not stopped, so there is no need to reload the model. The model is retained in memory rather than being loaded from disk for each request, so the time consumed per request goes down.
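The encoding steps in the workflow above can be sketched with the standard library alone. The function names and the `num_bytes` field are hypothetical, chosen only for this illustration; a real request handler would preprocess the decoded image and run the model where the comment indicates.

```python
import base64
import json


def encode_image(image_bytes):
    """Client side: turn raw image bytes into ASCII-safe base64 bytes."""
    return base64.b64encode(image_bytes)


def handle_request(encoded_image):
    """Server side: decode the image and answer with a JSON string.

    The "result" here is only the decoded size. A real handler would
    preprocess the image and feed it to the model at this point.
    """
    image_bytes = base64.b64decode(encoded_image)
    response = {"num_bytes": len(image_bytes)}
    return json.dumps(response)


# Hypothetical stand-in for the contents of an uploaded image file.
raw_image = bytes(range(256)) * 4

encoded = encode_image(raw_image)
reply = json.loads(handle_request(encoded))

print("raw bytes:", len(raw_image))
print("encoded bytes:", len(encoded))
print("server reply:", reply)
```

Base64 makes the payload roughly a third larger, which is the price paid for sending binary image data as plain ASCII text.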

A Quick Introduction to ZMQ


ZMQ is a high-performance messaging library that we can use to improve the performance of our app.

The official words about ZMQ are:

ZeroMQ (also known as ØMQ, 0MQ, or zmq) looks like an embeddable networking library but acts like a concurrency framework. It gives you sockets that carry atomic messages across various transports like in-process, inter-process, TCP, and multicast. You can connect sockets N-to-N with patterns like fan-out, pub-sub, task distribution, and request-reply. It’s fast enough to be the fabric for clustered products. Its asynchronous I/O model gives you scalable multicore applications, built as asynchronous message-processing tasks. It has a score of language APIs and runs on most operating systems.

It supports asynchronous operation, so multiple tasks can be processed in parallel, and it provides customizable networking options.

ZMQ is neither a client nor a server. Rather, we could make our own client and server by making use of the networking functionalities provided by ZMQ.

ZMQ provides sockets of various types, which could be used in different scenarios. For example, the PUB/SUB sockets are used in the publisher-subscriber messaging system. In our scenario, we will be using the mechanism involving ROUTER/DEALER. Let us briefly discuss this mechanism:

(1) In ROUTER/DEALER sockets, ROUTER is used to accept the requests from clients, route the requests, receive the response and send the response to the client, while the DEALER deals with the workers who perform the task. Workers perform the task and return the results to the ROUTER via the DEALER. A ROUTER may have one or more DEALERs.

(2) Here, the ROUTER can be thought of as the frontend that clients communicate with, while the workers work in the backend. Workers perform the task in the backend and return the results to the client via the frontend. The DEALER, which is created in the same context as the ROUTER, acts as the main gateway in the backend, through which the workers return their results to the frontend.

(3) Note that the ROUTER also keeps track of the client through a unique id using which the responses will be returned to the client by the ROUTER.
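A minimal sketch of the ROUTER/DEALER mechanism described above, using pyzmq. Several things here are assumptions made for the illustration: the endpoints use `inproc://` so the example runs inside one process (the article's server listens on a TCP port instead), the worker simply upper-cases the payload in place of model inference, and `zmq.proxy` is used to shuttle messages between the frontend and the backend.

```python
import threading
import zmq

FRONTEND_URL = "inproc://frontend"  # assumption: the article uses a TCP port (5576)
BACKEND_URL = "inproc://backend"    # in-process endpoint the workers connect to


def broker(context, ready):
    """Bind the ROUTER frontend and DEALER backend, then relay messages."""
    frontend = context.socket(zmq.ROUTER)
    frontend.bind(FRONTEND_URL)
    backend = context.socket(zmq.DEALER)
    backend.bind(BACKEND_URL)
    ready.set()
    try:
        zmq.proxy(frontend, backend)  # blocks until the context is terminated
    except zmq.ContextTerminated:
        pass
    finally:
        frontend.close(linger=0)
        backend.close(linger=0)


def worker(context):
    """Receive [client_id, payload] and reply with [client_id, result]."""
    sock = context.socket(zmq.DEALER)
    sock.connect(BACKEND_URL)
    try:
        while True:
            client_id, payload = sock.recv_multipart()
            # A real worker would run model inference on the payload here.
            sock.send_multipart([client_id, payload.upper()])
    except zmq.ContextTerminated:
        pass
    finally:
        sock.close(linger=0)


context = zmq.Context()
ready = threading.Event()
broker_thread = threading.Thread(target=broker, args=(context, ready), daemon=True)
worker_thread = threading.Thread(target=worker, args=(context,), daemon=True)
broker_thread.start()
ready.wait(timeout=2)
worker_thread.start()

# The client only ever talks to the frontend; the ROUTER adds and strips its id.
client = context.socket(zmq.DEALER)
client.connect(FRONTEND_URL)
client.send(b"hello")
reply = client.recv() if client.poll(2000) else None
client.close(linger=0)

context.term()  # unblocks the broker and the worker so they can exit
broker_thread.join(timeout=2)
worker_thread.join(timeout=2)
print(reply)
```

The worker never sees a client socket directly. It only echoes back the identity frame it was handed, and the ROUTER uses that frame to route the reply to the right client.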

Project Architecture

In our project, the model gets loaded when the server starts.

Then, the server starts listening for any client request.

(1) User uploads an image in web-app.

(2) Flask Server acts as a client of the Model Server. It sends the input image(in some encoded form) to the Model Server.

(3) Model Server invokes the model via a RequestHandler for predictions.

(4) Model yields Predictions and sends them to the Model Server.

(5) Model Server responds to Flask Server with predictions.

(6) Flask Server renders an HTML template along with the predictions displayed.

Brief Project Workflow

Now let’s look at the deeper view of the architecture. Observe the following image(it is just for our intuitive understanding):

Deeper View into the Architecture

(1) We shall create a flask server to serve the web-app and a model server to serve the model. An image is uploaded through the web-app, and the resultant predictions are returned by the model server to the client (here, the flask server).

(2) Upon submitting the image through the web-app, the corresponding URL(here /uploader) invokes the corresponding function(here upload_file function) in the

(3) In this function, we create a socket and connect it to the same port(here 5576) through which the frontend of the model server communicates.

(4) In this function, the image is encoded in a format that can be transmitted over the network from the flask server to the model server.

(5) At the server, we define the frontend(ROUTER) and the backend(DEALER). The frontend is bound to the port through which it receives client requests, and the backend is bound to the endpoint to which the workers are connected through an in-process communication protocol. As discussed, the DEALER deals with workers.

(6) The frontend of the model server(ROUTER) receives this encoded image along with the id of the client.

(7) Once the frontend receives the request (we refer to this as receiving a message or data), the request handler is invoked. This is where we will define the workers, which connect to the backend through the in-process communication endpoint that the backend is bound to.

(8) In the request handler, we define the workers to connect with the backend. Then, the workers deliver the results in JSON format from the request handler to the backend.

(9) The backend receives the results and these results are transmitted to the client through the frontend. Remember, clients can only talk to the frontend, and the work is done in the backend, and hence the choice of names.

(10) The client receives the results in JSON format, which may be further used in the rendered templates of the function to display these results.
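The client side of the steps above (create a socket, connect, send the encoded image, poll for the reply) might look like the following sketch using pyzmq. The function name `request_predictions`, the DEALER socket type, and the timeout value are assumptions made for this illustration, not details from the project code. The demo call deliberately targets an address where no model server is running, to show that polling with a timeout keeps the web request from hanging.

```python
import base64
import json
import zmq


def request_predictions(image_bytes, server_url, timeout_ms=3000):
    """Send a base64-encoded image and wait up to timeout_ms for a JSON reply.

    Returns the decoded JSON object, or None if no reply arrived in time.
    """
    context = zmq.Context.instance()
    sock = context.socket(zmq.DEALER)
    sock.setsockopt(zmq.LINGER, 0)  # do not block on close if the server is down
    sock.connect(server_url)
    try:
        sock.send(base64.b64encode(image_bytes))
        # Poll instead of calling recv() directly, so we never wait forever.
        if sock.poll(timeout_ms, zmq.POLLIN):
            return json.loads(sock.recv())
        return None
    finally:
        sock.close()


# No server is assumed to be listening here, so the call times out.
result = request_predictions(b"fake image bytes", "tcp://127.0.0.1:5576", timeout_ms=200)
print("reply:", result)
```

In a Flask view, a `None` result could be turned into an error page instead of leaving the user waiting indefinitely.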

Understanding the Directory Structure

Let us understand the directories we will be using for the app:

The Flask-ZMQ-App-Folder is the main project directory, in which we have:

  • Model-Server-Folder: It contains the virtual environment, model server code, and the requirements.txt file.
    • model-env: The virtual environment for the model server.
    • This is the file where we import the pre-trained resnet50 model, receive the encoded image, and predict the classes of the given input image by feeding it to the resnet50 pre-trained model we imported previously. The top 3 predictions will be returned to the client in the form of a JSON object.
    • This acts as a temporary client for our server, in order to check if the communication between both of them is happening properly and if the predictions are received without any issues. Once this is successful, we could modify the code in file so that it acts as the client to
    • requirements.txt: The list of all the necessary packages along with their corresponding versions, used for the running of
  • Flask-Server-Folder: It contains the virtual environment, flask server code, and the requirements.txt file.
    • flask-env: The virtual environment for the flask server.
    • This is the file where we initialize Flask. This acts as the client to the server defined in
    • requirements.txt: The list of all the necessary packages along with their corresponding versions, used for the running of
    • static: This folder contains static files, like CSS and images.
    • templates: This folder contains the HTML templates for the web pages we render.
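As described for the model server file above, the top 3 predictions are returned as a JSON object. A minimal sketch of that selection step is shown below. The function name, the field names, and the scores are all made up for the illustration, and a real model would output scores for many more classes.

```python
import heapq
import json


def top_k_predictions(probabilities, labels, k=3):
    """Return the k most probable labels as a JSON string, best first."""
    ranked = heapq.nlargest(k, zip(probabilities, labels))
    return json.dumps([
        {"label": label, "probability": round(prob, 4)}
        for prob, label in ranked
    ])


# Made-up scores for illustration only.
labels = ["tabby", "tiger_cat", "lynx", "beagle", "sports_car"]
probabilities = [0.61, 0.22, 0.09, 0.05, 0.03]

predictions_json = top_k_predictions(probabilities, labels)
print(predictions_json)
```

The resulting string can be sent over the socket as-is, and the client can parse it with `json.loads` before passing it to a template.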


After implementing the project in this way, the runtime performance improved drastically, by at least 30 times.

Again, the hands-on guided project could be found here at Project – How to Deploy an Image Classification Model using Flask. And the full code is present here on the GitHub.