Project - How to Build a Low-Latency Deep-Learning-Based Flask App


What is the Idea?

Now that we are familiar with how the project "How to Deploy an Image Classification Model using Flask" works, let us discuss how the current project differs from it.

Q: What is the pain point of the previous project?


(1) The amount of time taken to predict the classes of an input image

If we run the command

time python
  • We can see how long the program takes to produce its predictions.
  • Experimenting with different images shows that it typically takes around 10-21 seconds per input image.
  • This is far too long for real-time environments. Users expect the applications they use to be not only accurate but also fast.
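The same wall-clock measurement can be scripted. The following is a minimal sketch, assuming we time a whole Python process from start to exit, as the shell's time command does. The helper name time_process and the stand-in command are hypothetical; a real measurement would pass the prediction script and an image path instead.

```python
import subprocess
import sys
import time


def time_process(args):
    """Run a command to completion and return its wall-clock duration in seconds."""
    start = time.perf_counter()
    subprocess.run(args, check=True)
    return time.perf_counter() - start


# Stand-in command (an assumption): a real run would name the prediction
# script and an input image here instead of this trivial one-liner.
command = [sys.executable, "-c", "import json"]

durations = [time_process(command) for _ in range(3)]
average_seconds = sum(durations) / len(durations)
print(f"average wall-clock time: {average_seconds:.3f} s")
```

Because a fresh process is started each time, the figure includes interpreter start-up, imports, and model loading, not just inference.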

(2) It is a monolithic program, so it offers no flexibility

  • When a big service is built as one piece, we call it a monolith, meaning made out of a single stone.

  • When we make something from a single stone, we can't break it into parts: it cannot be made modular, and it cannot be spread across multiple machines. For example, the deep-learning service is computation-intensive and may require GPUs, while CPUs may suffice for the web service. Breaking the application into services based on their needs and resource consumption therefore brings several advantages: optimal resource usage (and thus lower cost), modular code (and thus modular services), flexible scaling, and isolation of team responsibilities (the ML engineer doesn't need to worry about the software engineer's work, and vice versa).

Q: But what is causing the problem?

A: Re-execution of the code for each request, caused by the monolithic programming style

  • The model and the web server are coded in a monolithic style, so there is no flexibility.

  • Also, the same heavy model code is re-executed again and again because of this style of programming.

  • The way works is:

    • All the imports and loading of the model happen every time we run the program.

    • Loading the model means constructing the huge deep learning graph by stacking the layers and associating their weights.

    • Constructing the same model again and again is costly, since loading it takes a long time on every run of the program.

  • Very similarly, the way works is this:

    • It is coded in a monolithic style, with the Flask server and the model loading/inference all in the same file.

    • Each request from the user is routed to its dedicated URL.

    • Each URL invokes that corresponding function in

    • Each time is in use, all the imports, and all the execution in that function invoked will be done from the beginning.

    • Although we want to use the same model every time we need predictions for a different input image, the model is loaded afresh for each request. This is costly and time-consuming.
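The cost pattern described above can be illustrated with a small sketch. Everything here is an assumption made for illustration: load_model is a hypothetical stand-in that sleeps briefly instead of building a real graph, and handle_request plays the role of the function a URL invokes.

```python
import time


def load_model():
    """Hypothetical stand-in for importing libraries and building the graph."""
    time.sleep(0.05)  # pretend that loading is the slow part
    return lambda image: "placeholder label"


def handle_request(image):
    """Monolithic style: the model is rebuilt inside every request."""
    t0 = time.perf_counter()
    model = load_model()          # repeated for each request
    t1 = time.perf_counter()
    label = model(image)          # the actual inference
    t2 = time.perf_counter()
    return {
        "label": label,
        "load_seconds": t1 - t0,
        "inference_seconds": t2 - t1,
    }


results = [handle_request(name) for name in ["a.jpg", "b.jpg"]]
for result in results:
    print(result)
```

In this toy setup the load step dominates each request, which is the overhead the rest of the section sets out to remove.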

Q: Then what could be a potential solution?

A: Construct the model once, and keep it ready-to-use in the RAM

  • Now that we understand that the problem comes from the monolithic programming style, and thus from the costly process of loading the model again for each prediction request, we need a way to avoid the overhead of this step.
  • Java has the static keyword, which lets variables be declared as static. Static members are loaded into memory exactly once, i.e. when the Java class is loaded, and they can be used by different threads.
  • But this method has a disadvantage: the service will not be scalable, because static variables live on one machine and are not available on another. So there are complications involved once you start using static variables.
  • Also, Python has no static keyword like Java's.

  • Hence, we take an approach that partially uses the static-variable concept, but in a wiser way: we make use of a server-client architecture.
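The "construct once, keep in RAM" idea can be sketched inside a single Python process. This is only an illustration under stated assumptions: build_model is a hypothetical stand-in for the real loader, and functools.lru_cache is one possible way to keep the object alive. The cached model exists only in this one process, so the sketch shares the scalability limitation noted above and is not the server-client design itself.

```python
import functools

load_count = 0  # counts how many times the expensive build step runs


def build_model():
    """Hypothetical stand-in for constructing the deep-learning graph."""
    global load_count
    load_count += 1
    return lambda image: f"prediction for {image}"


@functools.lru_cache(maxsize=1)
def get_model():
    """Build the model on first use, then return the same object from memory."""
    return build_model()


def predict(image):
    return get_model()(image)


predictions = [predict(name) for name in ["a.jpg", "b.jpg", "c.jpg"]]
print(predictions, "loads:", load_count)
```

Three predictions are served here, yet the build step runs only once.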

Q: So how is this project going to address the issue with execution time?

A: To address the pain point, we introduce asynchrony by integrating ZMQ (an asynchronous networking library) with the Flask server

  • The idea is to import the modules and construct the graph exactly once, run a server, and let the server use these loaded modules or variables any number of times as long as the server is active and it is receiving requests.

  • This could be done by breaking down the code into 2 services: web service and model service.

  • We will achieve this by defining a service (let us call it the model server) which imports the modules and constructs the graph once, keeps the model ready for inference, and keeps listening for client requests. Once a client (here the Flask server) sends a request, the model server, which has already constructed the graph, responds with the predictions. The Flask server then renders the corresponding HTML page with the predictions.

  • ZMQ is a library that provides the networking capabilities with which we can build a custom server-client mechanism to suit our needs.

  • We achieve this as follows:

    • We separate the model and the

    • We create a server class named Server which invokes the RequestHandler class each time the server gets a request. We will be defining these classes in the file.

      (a) As long as the Server is not stopped, it listens on a specified port (say 5576).

      (b) As soon as the server receives a request from a client, it invokes the RequestHandler class, which handles the request by (1) converting the base64-encoded image into a normal image, (2) preprocessing the image, (3) feeding the image to the model, (4) getting the predictions, and (5) returning the resulting predictions to the server.

      (c) The server responds to the client with the resulting predictions.

    • When a user submits an image for its predictions, the corresponding URL will be invoked(say

    • That URL invokes the corresponding function in upload_file function in file).

    • This invoked function acts as a client. It connects to port 5576, sends the image in encoded form (here we will use base64 encoding for the input image), and keeps polling (that is, keeps waiting) for the response from the server.

    • The server receives the request, keeps track of the client through a unique ID, and routes the request to a request handler. The request handler converts the payload back to a normal image, preprocesses it as required by the pre-trained model, feeds the preprocessed image to the model, and returns the predictions (in the form of a JSON object) to the server. The server sends this response to the client that requested it.

    • This JSON object will be received back by the same function(which requested the server previously) in and the predictions will be sent to the corresponding template which would be rendered in that function.

    • In this workflow, it is ZMQ that provides the mechanism for connecting through the specified sockets and keeps the server listening on that port.

    • At the beginning of the file, we load the model. Then we start the server, which keeps listening on a port. The server is therefore always ready to respond to the client, and we don't load the model again as long as the server is active.

    • The server listens as long as it is not stopped, so there is no need to load the model afresh. The model is retained in memory rather than being loaded from disk for each request, so the time taken per request should go down.
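The workflow above can be sketched end to end with pyzmq, the Python binding for ZMQ (a third-party package). This is a minimal sketch under several assumptions, not the course's implementation: it uses the simple REQ/REP socket pair rather than whatever socket types the project uses to track client IDs, load_model is a hypothetical stand-in that returns a fake model, request_prediction plays the role of the Flask view function, and the server is capped at two requests so the demo can exit. Only the class names Server and RequestHandler, base64 encoding, and port 5576 come from the text.

```python
import base64
import threading

import zmq  # pyzmq, a third-party package

ENDPOINT = "tcp://127.0.0.1:5576"  # port number taken from the text


def load_model():
    """Hypothetical stand-in for building the real deep-learning graph."""
    return lambda image_bytes: {"label": "placeholder", "num_bytes": len(image_bytes)}


class RequestHandler:
    """Decodes one request and runs the already-loaded model on it."""

    def __init__(self, model):
        self.model = model

    def handle(self, encoded_image):
        image_bytes = base64.b64decode(encoded_image)  # base64 -> raw image bytes
        # Real preprocessing (resizing, normalising) would go here.
        return self.model(image_bytes)                 # predictions as a dict


class Server:
    """Loads the model once, then answers requests on a REP socket."""

    def __init__(self, endpoint):
        self.handler = RequestHandler(load_model())    # model is built exactly once
        self.socket = zmq.Context.instance().socket(zmq.REP)
        self.socket.bind(endpoint)

    def serve(self, max_requests):
        # A real server would loop until stopped; the cap lets this demo exit.
        for _ in range(max_requests):
            reply = self.handler.handle(self.socket.recv())
            self.socket.send_json(reply)
        self.socket.close()


def request_prediction(image_bytes, endpoint=ENDPOINT, timeout_ms=5000):
    """Client side: send an encoded image and wait for the JSON reply."""
    socket = zmq.Context.instance().socket(zmq.REQ)
    socket.setsockopt(zmq.LINGER, 0)   # do not block on close if no reply arrived
    socket.connect(endpoint)
    try:
        socket.send(base64.b64encode(image_bytes))
        if socket.poll(timeout_ms):    # poll until a reply is ready or time runs out
            return socket.recv_json()
        return None
    finally:
        socket.close()


server = Server(ENDPOINT)
worker = threading.Thread(target=server.serve, args=(2,), daemon=True)
worker.start()

first = request_prediction(b"fake image one")
second = request_prediction(b"img")
worker.join(timeout=5)
print(first, second)
```

Both requests are answered by the same model object, which was created once when the Server was constructed. In a real deployment the server and the Flask app would run as separate processes rather than as threads in one script.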
