Deep Learning Inference Service at Microsoft
- The system covers hardware configuration, model placement, request routing, and model execution.
- It serves models on CPUs, GPUs, and FPGAs.
- Model Master tracks the load and resource availability on each server.
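A rough sketch of the load-tracking and placement idea, assuming hypothetical class and field names (the notes do not spell out an API): servers report load via heartbeats, and placement picks the least-loaded server with enough free memory.

```python
from dataclasses import dataclass

@dataclass
class Server:
    name: str
    load: float          # e.g. fraction of CPU/GPU busy
    free_mem_gb: float

class ModelMaster:
    """Hypothetical sketch: track per-server load and place models greedily."""

    def __init__(self, servers):
        self.servers = {s.name: s for s in servers}

    def report(self, name, load, free_mem_gb):
        # A heartbeat from a server refreshes its load/resource snapshot.
        s = self.servers[name]
        s.load, s.free_mem_gb = load, free_mem_gb

    def place(self, model_mem_gb):
        # Pick the least-loaded server that still has room for the model.
        fits = [s for s in self.servers.values() if s.free_mem_gb >= model_mem_gb]
        return min(fits, key=lambda s: s.load) if fits else None

mm = ModelMaster([Server("gpu-01", 0.3, 48.0), Server("gpu-02", 0.7, 16.0)])
print(mm.place(model_mem_gb=24.0).name)  # -> gpu-01
```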
- Isolation: run in containers
- To keep performance high inside the isolation boundary: processor affinity, NUMA affinity, and memory restrictions (process-level sketch below).
- “Memory restrictions ensure that the model never needs to access disk.” This likely means the memory limit is sized so the model stays fully resident in RAM and is never paged out to swap, so inference never blocks on disk I/O.
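A minimal, Linux-only sketch of what those knobs look like at the process level (container runtimes apply the same limits through cgroups; the core numbers and memory cap below are assumptions for illustration):

```python
import os
import resource

# Assumptions for illustration: cores 0-3 sit on NUMA node 0, and 8 GB is
# enough to keep the model fully resident in RAM.
NUMA0_CORES = {0, 1, 2, 3}
MEM_LIMIT_BYTES = 8 * 1024 ** 3

# Processor/NUMA affinity: restrict the serving process to one socket's cores
# so its memory accesses stay local (strict memory binding would need libnuma).
os.sched_setaffinity(0, NUMA0_CORES)

# Memory restriction: cap the address space so the process fails fast instead
# of growing past RAM and being paged out to disk.
resource.setrlimit(resource.RLIMIT_AS, (MEM_LIMIT_BYTES, MEM_LIMIT_BYTES))

print("pinned to cores:", sorted(os.sched_getaffinity(0)))
```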
- Customized communication over UDP and shared memory to improve container networking performance (minimal shared-memory sketch below).
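A minimal sketch of the shared-memory idea using Python's multiprocessing.shared_memory; the segment name and payload are made up, and between real containers this also requires a shared IPC namespace or a shared /dev/shm mount:

```python
from multiprocessing import shared_memory

payload = b'{"model": "resnet50", "input_id": 42}'

# "Router" side: create a named segment and write the request into it.
shm = shared_memory.SharedMemory(name="dlis_req_0", create=True, size=len(payload))
shm.buf[:len(payload)] = payload

# "Model server" side: attach to the same segment by name and read the request,
# skipping the kernel network stack entirely.
peer = shared_memory.SharedMemory(name="dlis_req_0")
print(bytes(peer.buf[:len(payload)]))

peer.close()
shm.close()
shm.unlink()
```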
- Traffic is frequently bursty, so a server's load is hard to predict. Two techniques (a combined sketch follows the list):
- Send a backup request to another server if the first attempt risks missing the SLA; the trigger can be a fixed delay such as “after 5 ms” or an adaptive one such as “after the 95th-percentile latency”.
- When one server dequeues a task, it cancels the backup request on the other servers.
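An asyncio sketch of both techniques together, under stated assumptions: serve() is a stand-in for a real RPC to a model server, and the 5 ms threshold comes from the note above (an adaptive p95 threshold would work the same way).

```python
import asyncio
import random

HEDGE_DELAY_S = 0.005  # assumption: 5 ms hedging threshold from the notes

async def serve(server, request):
    # Stand-in for an RPC to a model server, with variable queueing delay.
    await asyncio.sleep(random.uniform(0.001, 0.02))
    return f"{server} handled {request}"

async def hedged_request(request):
    primary = asyncio.create_task(serve("server-A", request))
    done, _ = await asyncio.wait({primary}, timeout=HEDGE_DELAY_S)
    if done:
        return primary.result()          # answered within the hedge delay

    # Risk of missing the SLA: fire a backup request at a second server.
    backup = asyncio.create_task(serve("server-B", request))
    done, pending = await asyncio.wait(
        {primary, backup}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:                 # cross-server cancellation of the loser
        task.cancel()
    return done.pop().result()

print(asyncio.run(hedged_request("req-1")))
```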