Now that we are familiar with how the project "How to Deploy an Image Classification Model using Flask" works, let us discuss how our current project is going to differ from it.
(1) The amount of time taken to predict the classes of an input image
We can measure this by running the command:
time python test_client_without_zmq.py
(2) It is a monolithic program, and thus offers no flexibility
A big service built as one unit is generally called a monolith, meaning made out of a single stone.
Something made from a single stone cannot be broken into parts and separated: it cannot be made modular, and it cannot be spread across multiple machines. For example, the deep-learning-based service may be computation-intensive and require GPUs, while CPUs may suffice for the web service, which does not need such heavy computational resources. Breaking the services down by their needs and resource consumption therefore provides several advantages: optimal resource consumption (and thus lower cost), modularisation of the code (and thus of the services), flexibility in scaling, and isolation of team responsibilities (the ML engineer need not worry about the work of the software engineer, and vice versa).
A: Re-execution of the code for each request due to the monolith programming style
The model and the web server are coded in a monolithic style, so there is no flexibility.
Also, the same heavy model execution is repeated because of this style of programming.
The way test_client_without_zmq.py works is this:
All the imports and the loading of the model happen every time we run the program.
Loading the model means constructing the huge deep learning graph by stacking the layers and associating their weights.
Constructing the same model again and again is therefore costly, since loading it takes time on every run.
Very similarly, the way app.py works is this:
It is coded in a monolithic programming style, with the Flask server and the model loading/inference all in the same file.
Each request from the user is directed to its dedicated URL.
Each URL invokes the corresponding function in app.py. Whenever app.py is in use, all the imports and all the execution in the invoked function are done from the beginning.
Although we want to use the same model every time we need predictions for a different input image, the model is loaded afresh for each request. This is costly because it is time-consuming.
A: Construct the model once, and keep it ready-to-use in the RAM
Java has a static keyword, using which variables can be declared as static. These static members are loaded into memory exactly once, i.e. when the Java class is loaded, and they can be used by different threads.
Python, however, doesn't provide a static keyword.
Hence, we switch to an approach that partially uses the static variable concept, but in a wiser way, by making use of a server-client architecture.
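To make the "load exactly once" idea concrete, here is a minimal sketch of how a class attribute in Python can play a role similar to a Java static member. Everything in it is hypothetical and for illustration only: build_model stands in for the expensive graph construction, and ModelHolder and handle_request are made-up names that do not come from the project.

```python
import threading


def build_model():
    """Hypothetical stand-in for the expensive graph construction."""
    build_model.calls += 1
    return {"name": "stub-model"}


build_model.calls = 0  # counts how many times the "model" was built


class ModelHolder:
    # Class attributes are shared by all callers and threads in this process,
    # much like a static member in Java.
    _model = None
    _lock = threading.Lock()

    @classmethod
    def get(cls):
        # Double-checked locking: only the first caller builds the model.
        if cls._model is None:
            with cls._lock:
                if cls._model is None:
                    cls._model = build_model()
        return cls._model


def handle_request(results, index):
    # Every simulated request reuses the one shared model object.
    results[index] = ModelHolder.get()


results = [None] * 8
threads = [
    threading.Thread(target=handle_request, args=(results, i)) for i in range(8)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print("model built", build_model.calls, "time(s)")
```

Note that this sharing only works inside a single running process. That is why the approach described here keeps the model in a dedicated, long-running server process rather than relying on the language feature alone.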
A: To address this pain point, we introduce asynchrony by integrating ZMQ (an asynchronous networking library) with the Flask server
The idea is to import the modules and construct the graph exactly once, run a server, and let the server use these loaded modules or variables any number of times as long as the server is active and it is receiving requests.
This could be done by breaking down the code into 2 services: web service and model service.
We will achieve this by defining a service (here, let us call it a model server) which imports the modules and constructs the graph once, keeps it ready for inference, and keeps listening for client requests. Once a client (here, the Flask server) sends a request, the server (which has already constructed the graph and kept it ready for inference) responds to the client with predictions. The Flask server then renders the corresponding HTML page with the predictions.
ZMQ is a library that provides the networking capabilities with which we can build a custom server-client mechanism to suit our needs.
We achieve this as follows:
We separate the model and the web server into two services.
We create a server class named Server, which invokes the RequestHandler class each time the server gets a request. We will define these classes in the model server code.
(a) As long as the Server is not stopped, it listens on a specified port.
(b) As soon as the server receives a request from a client, it invokes the RequestHandler class, which handles the request by (1) converting the base64-encoded image into a normal image, (2) preprocessing the image, (3) feeding the image to the model, (4) getting the predictions, and (5) returning the resultant predictions to the server.
(c) The server responds to the client with the resultant predictions.
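Steps (1) to (5) of the request handler can be sketched as follows. This is only an illustration under stated assumptions, not the project's actual code: StubModel is a hypothetical stand-in for the pre-trained network, the "preprocessing" is a simple scaling of raw bytes rather than real image decoding, and the handle method name is made up.

```python
import base64
import json


class StubModel:
    """Hypothetical stand-in for the pre-trained network."""

    def predict(self, pixels):
        # Fake "classification" based on the mean pixel value.
        mean = sum(pixels) / len(pixels)
        label = "bright" if mean > 0.5 else "dark"
        return [(label, round(mean, 3))]


class RequestHandler:
    def __init__(self, model):
        # The model is passed in already constructed; it is never rebuilt here.
        self.model = model

    def handle(self, encoded_image):
        raw = base64.b64decode(encoded_image)      # (1) base64 -> raw bytes
        pixels = [b / 255.0 for b in raw]          # (2) stand-in preprocessing
        predictions = self.model.predict(pixels)   # (3) + (4) run the model
        return json.dumps({"predictions": predictions})  # (5) JSON result


model = StubModel()                 # constructed once, at start-up
handler = RequestHandler(model)

encoded = base64.b64encode(bytes([255, 255, 0, 255]))
response = handler.handle(encoded)
print(response)
```

The point of the design is that the handler receives a model that already exists, so each request only pays for decoding, preprocessing, and inference.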
When a user submits an image for prediction, the corresponding URL is invoked.
That URL invokes the corresponding function in app.py, say the upload_file function.
This invoked function acts as a client. It connects to the socket on port 5576, sends the image in an encoded form (here we will use base64 encoding for the input image), and keeps polling (that is, keeps waiting) for the response from the server.
The server receives the request, keeps track of the client through a unique id, and routes the request to a request handler. The request handler converts it back to a normal image, preprocesses it as required by the pre-trained model, feeds the preprocessed image to the model, and returns the predictions (in the form of a JSON object) to the server. The server sends this response to the client that requested it.
This JSON object is received by the same function in app.py that sent the request, and the predictions are passed to the corresponding template, which that function renders.
In this workflow, it is ZMQ that provides the mechanism for connecting through the specified sockets and keeps the server listening on that port.
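The round trip described above can be sketched with pyzmq roughly as follows. This is a simplified illustration, not the project's implementation: it assumes the plain REQ/REP socket pair (the socket types the project really uses are not stated here), binds to a random local port instead of 5576 so that it can run anywhere, and uses a hypothetical fake_predict function in place of the real model.

```python
import base64
import threading

import zmq


def fake_predict(image_bytes):
    # Hypothetical placeholder for the real model's inference.
    return {"class": "placeholder", "num_bytes": len(image_bytes)}


def serve(socket, max_requests):
    # Model-server side: answer a fixed number of requests, then stop.
    for _ in range(max_requests):
        request = socket.recv_json()
        image_bytes = base64.b64decode(request["image"])
        socket.send_json(fake_predict(image_bytes))


def request_prediction(context, endpoint, image_bytes, timeout_ms=5000):
    # Client side: what the Flask view function would do for each upload.
    client = context.socket(zmq.REQ)
    client.setsockopt(zmq.LINGER, 0)
    client.connect(endpoint)
    try:
        encoded = base64.b64encode(image_bytes).decode("ascii")
        client.send_json({"image": encoded})
        # Poll for the reply instead of blocking forever.
        if client.poll(timeout_ms) == 0:
            raise TimeoutError("model server did not answer in time")
        return client.recv_json()
    finally:
        client.close()


context = zmq.Context()
server_socket = context.socket(zmq.REP)
port = server_socket.bind_to_random_port("tcp://127.0.0.1")
worker = threading.Thread(target=serve, args=(server_socket, 2), daemon=True)
worker.start()

endpoint = f"tcp://127.0.0.1:{port}"
first = request_prediction(context, endpoint, b"abc")
second = request_prediction(context, endpoint, b"hello")
print(first, second)

worker.join()
server_socket.close()
context.term()
```

In the real setup the server would run in its own process and loop until stopped. The thread and the two-request limit here exist only to keep the sketch self-contained.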
At the beginning of the resnet_model_server.py file, we load the model. Then we start the server, which keeps listening on a port. The server is thus always actively listening on the port and responding to clients, and we don't load the model again and again as long as the server is active.
The server listens as long as it is not stopped, so there is no need to load the model afresh. The model is retained in memory instead of being loaded from disk for each request, so the time taken per request should ideally go down.