
Deploying PyTorch to Production

Stefan Libiseller
December 12th, 2019 · 2 min read

When I started out in data science it was pretty hard to find material on how to deploy the models I had just trained. It seemed like everyone was busy blogging about fancy new optimizers or the latest model architecture. Sure, deploying machine learning models is one of the less fancy sides of data science, but in my opinion it is equally important. After all, what good is the best machine learning model if you can't use it?

In this post I'll show how to build an easy, self-hosted web API for small- to medium-volume production using Flask. For more power and auto-scaling I recommend Cortex, a platform that deploys models to AWS with very little effort. I intentionally won't cover deployment using TorchScript, as that could be a blog post of its own.

Flask Web API

Python frameworks like Flask make it easy to create simple web APIs. While this is not the most performant method to deploy PyTorch models, it is ideal for small to medium volume production. It's quick to adapt and deploy, which is often more important than performance.

I highly suggest using virtual environments like conda to keep the dependencies of different projects separate. For this API you'll need to install Flask and Gunicorn into your environment:

conda install flask gunicorn

The code below is the whole application; you just need to insert your model and the predict function. Also, don't forget to set your model to evaluation mode - no need to compute gradients during inference!

# Stefan Libiseller
# libiseller.work/deploying-pytorch-to-production

from flask import Flask, json, request

app = Flask(__name__)
api_endpoint = "/my_endpoint"

# load your model here, e.g.:
# model = torch.load('model.pt')
# model.eval()  # no gradients needed for inference


def predict(message):
    # do your pytorch magic here
    # return the result as a dict
    return {'class1': 0.3, 'class2': 0.7}


@app.route(api_endpoint, methods=['GET'], endpoint=api_endpoint)
def api():
    try:
        # extract the message inside the try block, so malformed
        # requests also come back as a JSON error instead of an HTML 500
        req_data = request.get_json()
        message = req_data['message']
        response, status = predict(message), 200
    except Exception as e:
        response, status = {"error": str(e)}, 500

    return app.response_class(
        response=json.dumps(response),
        status=status,
        mimetype='application/json'
    )


if __name__ == '__main__':
    app.run(debug=False, host='0.0.0.0', port=5000)

Essentially, this code just grabs the message field from the request, puts it through the predict function and returns the response as JSON.

Tip: You can raise exceptions in the predict function and the API will return an error with the exception message. This lets you implement assertions or see exactly which piece of code failed.
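A minimal sketch of such a predict function - the input validation and the class names here are placeholders, not part of the original code:

```python
def predict(message):
    # Fail fast on bad input - the API turns this into a JSON error response.
    assert isinstance(message, str) and message.strip(), \
        "message must be a non-empty string"
    # ... run your PyTorch model here ...
    return {'class1': 0.3, 'class2': 0.7}

predict("hello")  # -> {'class1': 0.3, 'class2': 0.7}
# predict("")     # raises AssertionError -> API responds with status 500
```

Because the route wraps predict in a try/except, the assertion message ends up directly in the error field of the JSON response.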

Testing

You can test the API by executing the file, which should give you output similar to this:

python api.py

* Serving Flask app "api" (lazy loading)
* Environment: production
  WARNING: This is a development server. Do not use it in a production deployment.
  Use a production WSGI server instead.
* Debug mode: off
* Running on http://0.0.0.0:5000/ (Press CTRL+C to quit)

Don't be confused by the warning; we'll fix it in a minute. Since we are only testing our application code, Flask is correctly warning us that it is not running in a production-ready way.

To submit test requests to the API I recommend using Postman. You can import my preconfigured request by copying and pasting this link:
https://www.getpostman.com/collections/638a05ae6864c4f11bc1

Requests need an application/json Content-Type header and a body with a message field like this:

{
    "message": "literally anything"
}
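If you'd rather test from a script than from Postman, the same request can be built with Python's standard library. This is a sketch under the assumption that the development server above is running locally on port 5000:

```python
import json
from urllib import request

payload = json.dumps({"message": "literally anything"}).encode("utf-8")
req = request.Request(
    "http://localhost:5000/my_endpoint",
    data=payload,
    headers={"Content-Type": "application/json"},
    method="GET",  # the endpoint above is registered for GET
)

# With the server running:
# with request.urlopen(req) as resp:
#     print(json.loads(resp.read()))
```

Note the explicit method="GET" - urllib would otherwise switch to POST whenever a request has a body.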

Deploy

To start the application in production use this command:

gunicorn api:app

Gunicorn uses a pre-fork worker model, which essentially means it can handle multiple requests at once (you can set the number of workers with the -w flag, e.g. gunicorn -w 4 api:app). So make sure to use it when deploying.

Be aware that big models are also going to consume a lot of RAM. Make sure to check that you have enough available before deploying! As a rule of thumb, the loaded weights take up at least their file size in RAM.
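As a back-of-the-envelope check, you can estimate the weight memory from the parameter count alone; the 110 million figure below is just an illustrative size (roughly BERT-base), not a measurement:

```python
def weights_size_mb(num_params, bytes_per_param=4):
    """Approximate weight memory in MiB, assuming float32 (4-byte) parameters."""
    return num_params * bytes_per_param / 1024 ** 2

# A model with ~110 million float32 parameters:
print(round(weights_size_mb(110_000_000)))  # ~420 MiB of weights alone
```

With the model already loaded, PyTorch can give you the exact number by summing p.numel() * p.element_size() over model.parameters() - and remember that activations and framework overhead come on top of the weights.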

Cortex Auto Scale AWS Deployment

If you don't have your own server or need something that scales with demand, I recommend Cortex. It's an open-source platform that essentially does the same as the API above, but on auto-scaling AWS instances that can also be GPU-accelerated.

They have great documentation and tutorials on their website. I suggest also taking a look at one of their examples; it helped me a lot to understand how it works in practice.

Conclusion

These are just two examples of how PyTorch models can be deployed to production with little to no pain. The Flask API is a great start, and Cortex can take it to the next level if your application requires it. What we didn't cover are optimization techniques that reduce model size, such as quantization and pruning close-to-zero weights; they only become necessary if you want to deploy to resource-restricted environments like smartphones.

Happy shipping!

Other Links:
ONNX - Open neural network exchange format
