Deep Learning Inference Service at Microsoft
- The system covers hardware configuration, model placement, request routing, and model execution.
- It serves models on CPUs, GPUs, and FPGAs.
- Model Master tracks the load and resource availability on each server.
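A rough sketch of the load-tracking and placement idea, assuming hypothetical class and field names (the notes do not spell out an API): servers report load via heartbeats, and placement picks the least-loaded server with enough free memory.

```python
from dataclasses import dataclass

@dataclass
class Server:
    name: str
    load: float          # e.g. fraction of CPU/GPU busy
    free_mem_gb: float

class ModelMaster:
    """Hypothetical sketch: track per-server load and place models greedily."""

    def __init__(self, servers):
        self.servers = {s.name: s for s in servers}

    def report(self, name, load, free_mem_gb):
        # A heartbeat from a server refreshes its load/resource snapshot.
        s = self.servers[name]
        s.load, s.free_mem_gb = load, free_mem_gb

    def place(self, model_mem_gb):
        # Pick the least-loaded server that still has room for the model.
        fits = [s for s in self.servers.values() if s.free_mem_gb >= model_mem_gb]
        return min(fits, key=lambda s: s.load) if fits else None

mm = ModelMaster([Server("gpu-01", 0.3, 48.0), Server("gpu-02", 0.7, 16.0)])
print(mm.place(model_mem_gb=24.0).name)  # -> gpu-01
```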
- Isolation: run in containers
- To keep performance high inside the isolation boundary: processor affinity, NUMA affinity, and memory restrictions (process-level sketch below).
- “Memory restrictions ensure that the model never needs to access disk.” This likely means the memory limit is sized so the model stays fully resident in RAM and is never paged out to swap, so inference never blocks on disk I/O.
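A minimal, Linux-only sketch of what those knobs look like at the process level (container runtimes apply the same limits through cgroups; the core numbers and memory cap below are assumptions for illustration):

```python
import os
import resource

# Assumptions for illustration: cores 0-3 sit on NUMA node 0, and 8 GB is
# enough to keep the model fully resident in RAM.
NUMA0_CORES = {0, 1, 2, 3}
MEM_LIMIT_BYTES = 8 * 1024 ** 3

# Processor/NUMA affinity: restrict the serving process to one socket's cores
# so its memory accesses stay local (strict memory binding would need libnuma).
os.sched_setaffinity(0, NUMA0_CORES)

# Memory restriction: cap the address space so the process fails fast instead
# of growing past RAM and being paged out to disk.
resource.setrlimit(resource.RLIMIT_AS, (MEM_LIMIT_BYTES, MEM_LIMIT_BYTES))

print("pinned to cores:", sorted(os.sched_getaffinity(0)))
```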
- Customized communication over UDP and shared memory to improve container networking performance (minimal shared-memory sketch below).
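A minimal sketch of the shared-memory idea using Python's multiprocessing.shared_memory; the segment name and payload are made up, and between real containers this also requires a shared IPC namespace or a shared /dev/shm mount:

```python
from multiprocessing import shared_memory

payload = b'{"model": "resnet50", "input_id": 42}'

# "Router" side: create a named segment and write the request into it.
shm = shared_memory.SharedMemory(name="dlis_req_0", create=True, size=len(payload))
shm.buf[:len(payload)] = payload

# "Model server" side: attach to the same segment by name and read the request,
# skipping the kernel network stack entirely.
peer = shared_memory.SharedMemory(name="dlis_req_0")
print(bytes(peer.buf[:len(payload)]))

peer.close()
shm.close()
shm.unlink()
```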
- Traffic is frequently bursty, so a server's load is hard to predict. Two techniques (a combined sketch follows the list):
- Send a backup request to another server if the first attempt risks missing the SLA; the trigger can be a fixed delay such as “after 5 ms” or an adaptive one such as “after the 95th-percentile latency”.
- When one server dequeues a task, it cancels the backup request on the other servers.
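An asyncio sketch of both techniques together, under stated assumptions: serve() is a stand-in for a real RPC to a model server, and the 5 ms threshold comes from the note above (an adaptive p95 threshold would work the same way).

```python
import asyncio
import random

HEDGE_DELAY_S = 0.005  # assumption: 5 ms hedging threshold from the notes

async def serve(server, request):
    # Stand-in for an RPC to a model server, with variable queueing delay.
    await asyncio.sleep(random.uniform(0.001, 0.02))
    return f"{server} handled {request}"

async def hedged_request(request):
    primary = asyncio.create_task(serve("server-A", request))
    done, _ = await asyncio.wait({primary}, timeout=HEDGE_DELAY_S)
    if done:
        return primary.result()          # answered within the hedge delay

    # Risk of missing the SLA: fire a backup request at a second server.
    backup = asyncio.create_task(serve("server-B", request))
    done, pending = await asyncio.wait(
        {primary, backup}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:                 # cross-server cancellation of the loser
        task.cancel()
    return done.pop().result()

print(asyncio.run(hedged_request("req-1")))
```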