The challenge of scaling GPUs in the cloud
As a startup, we’ve had the luxury of using AWS as our cloud provider from the very beginning. This has given us time to experiment hands-on with the wide range of services that AWS offers, from load balancing and serverless technologies through to clustered databases and storage.
Part of the solution we deliver to our customers comes in the form of a RESTful API. Scaling this in real time for our business customers comes as second nature to the technologies we’ve picked, with API Gateway, Elastic Beanstalk, CloudFront, and RDS being the four primary AWS services we use, all of which scale.
The first challenge we face is that part of our solution relies on a GPU backend rather than a CPU one. We need to be able to render realistic 360-degree images from a “finger sketch”, where the user draws out the layout of their kitchen or bathroom and decorates it to their taste before sending it to be rendered.
Our second challenge is that, unlike our competitors who send the rendered output “offline” several minutes later, we aim to render “whilst you wait”, before you even leave the website. We therefore need highly optimized rendering techniques, with substantial GPU hardware available at all times.
Fortunately, AWS provides a range of GPU-based servers to choose from. However, these tend to suit companies with a static set of EC2 instances (i.e. a render farm), rather than our auto-scaling setup with its lower number of users.
Secondly, we have our scaling technology, Elastic Beanstalk, to contend with, which makes scaling GPU resources in real time a challenge. When we get sudden, unpredictable spikes in usage, we need to be able to scale within seconds, allowing Elastic Beanstalk to “ramp up” servers in bulk to cope with demand.
And finally, we teach all our backend teams to develop and deliver using Docker containers, bringing another element into the picture.
Fortunately, there is a way to combine Nvidia GPUs, Elastic Beanstalk, and Docker containers.
The first hurdle is making Nvidia devices such as the Tesla visible at the driver level to CUDA, which is sitting inside a Docker container. This is done with the Nvidia Container Runtime (available here).
The second hurdle is getting this working with Elastic Beanstalk. We don’t want to go down the road of creating a custom AMI for the p2.xlarge server, as this would require constant updating and maintenance. Rather, we would prefer to use a fully supported AMI from the AWS Marketplace, and then use the tools within “.ebextensions” to install any additional drivers or scripts at the EC2 level before the Docker container starts up. A quick tutorial on .ebextensions can be found here.
The final hurdle is the art of scaling itself, which brings two distinct problems. The first is how you scale. Elastic Beanstalk allows you to choose from many scaling triggers, such as CPU usage, memory usage, disk usage, IOPS etc., but doesn’t yet offer a GPU usage one. This is partially down to Nvidia, who only offer a crude command-line tool for seeing GPU utilisation in real time. There is no fancy API to hook into at the time of writing, only the nvidia-smi interface.
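As a rough illustration of working with that command-line tool, the Python sketch below shells out to nvidia-smi using its CSV query flags and averages the reported utilisation across GPUs. The function names are hypothetical and this is not our production code. Feeding the resulting number into a scaling trigger (for example as a custom metric) is left out and would be an assumption on top of this.

```python
import subprocess


def parse_utilisation(csv_text):
    """Turn nvidia-smi CSV output (one percentage per line) into floats.

    Lines that are not numeric, such as "[N/A]", are skipped.
    """
    values = []
    for line in csv_text.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            values.append(float(line))
        except ValueError:
            continue
    return values


def average_utilisation(values):
    """Average utilisation across all GPUs, or 0.0 if none were reported."""
    return sum(values) / len(values) if values else 0.0


def read_gpu_utilisation():
    """Ask nvidia-smi for the current utilisation of every GPU on the host."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=utilization.gpu",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_utilisation(result.stdout)
```

On a GPU server, average_utilisation(read_gpu_utilisation()) gives a single figure that could be sampled every few seconds.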
The other scaling challenge is speed. Docker containers and AWS EC2 have been tweaked to start up in seconds when they’re lightweight. Unfortunately, starting up a CUDA-based system is far from lightweight, with the Docker container image coming in at 5 gigabytes. If you’re rebuilding an entire EC2 server on the fly, you can imagine how long it takes to download and start up this image from Docker Hub over the internet.
To get round this, we compile special run-time versions of our containers with development and debugging code removed, and we are very conservative when it comes to installing third-party tools. Our original Docker image came in at 5 gigabytes, and we are actively working to reduce this. A final improvement is to host the Docker image within your own cloud provider, close to the servers that pull it, which reduces download times significantly.
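To put rough numbers on why both image size and image location matter, here is a back-of-the-envelope Python sketch. The bandwidth figures are illustrative assumptions, not measurements from our setup, and the estimate ignores layer extraction and container start-up time.

```python
def download_seconds(image_gigabytes, bandwidth_megabits_per_second):
    """Estimate the transfer time for a container image.

    Uses decimal units: 1 gigabyte = 8000 megabits.
    """
    image_megabits = image_gigabytes * 8 * 1000
    return image_megabits / bandwidth_megabits_per_second


# Hypothetical link speeds, chosen only to show the shape of the trade-off.
over_internet = download_seconds(5, 500)        # 5 GB image, 500 Mbit/s
inside_provider = download_seconds(5, 5000)     # same image, 5 Gbit/s
slimmer_image = download_seconds(2, 5000)       # 2 GB image, 5 Gbit/s

print(over_internet, inside_provider, slimmer_image)
```

Under these assumed speeds the same 5 gigabyte image takes 80 seconds over the slower link and 8 seconds over the faster one, which is the difference between scaling in seconds and scaling in minutes.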
A final note: the ratio of CPU demand to GPU demand is rarely constant enough to rely on. It is therefore always advisable to have a CPU system that scales independently of your GPU system. Without this, you’ll find you either don’t have enough of one resource, or have idle GPU servers making a substantial dent in your hosting bill.
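The point about independent scaling can be sketched in a few lines of Python. The request rates and per-server capacities below are made-up numbers for illustration, and instances_needed is a hypothetical helper rather than anything Elastic Beanstalk provides.

```python
import math


def instances_needed(requests_per_second, capacity_per_instance, minimum=1):
    """Servers required for a given load, never dropping below a floor."""
    return max(minimum, math.ceil(requests_per_second / capacity_per_instance))


# Assumed capacities: one CPU server handles 50 API calls per second,
# one GPU server handles 0.5 renders per second.
busy_cpu = instances_needed(120, 50)    # API traffic stays high...
busy_gpu = instances_needed(4, 0.5)     # ...while renders are in demand
quiet_gpu = instances_needed(1, 0.5)    # ...and later when renders drop off

print(busy_cpu, busy_gpu, quiet_gpu)
```

With these assumed figures the CPU tier needs 3 servers throughout, while the GPU tier swings between 8 and 2. Tying the two tiers to a fixed ratio would leave one of them over- or under-provisioned at some point.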
Cloud providers have come a long way over the past 5 years in supporting GPUs, and AWS allows us to render realistic images at scale. But we’re still a long way from being able to swap out CPU for GPU solutions in any cloud service we wish. However, with a little bit of coding and thoughtful planning, combined with clever use of cloud architecture, it can certainly be done with today’s technologies.