• The system includes functionalities of hardware configuration, model placement, request routing, and model execution.
  • They use CPU, GPU, and FPGA.
  • Model Master tracks the load and resource availability on each server.
  • Isolation: run in containers
  • Increase isolation performance: Processor affinity, NUMA affinity, and Memory restrictions.
    • “Memory restrictions ensure that the model never needs to access disk.” don’t quite understand.
  • Customized communication over UDP/shared-memory to increase the networking performance of containers.
  • There is frequent burst traffic. Hard to predict server’s load. Two techniques:
    • Send the request to a backup server if the first attempt has a risk of missing SLA. Conditions can be like “after 5ms” or “after 95th percentile latency”.
    • When one server dequeues a task, it cancels the backup request on the other servers.