PRETZEL: Opening the Black Box of Machine Learning Prediction Serving Systems

Limitations of existing serving systems
- Resource Waste. Resources are isolated across models. This is good for performance and tail latency but bad for resource utilization.
- No Consideration of Operator Characteristics. During inference, logistic regression takes only 0.3% of the time, whereas Ngram and Concat take the majority. A white-box system could have chosen better execution plans.
- Lazy Initialization. Leads to long tail latency. This problem is ML.Net specific but probably also applies to other serving systems. Things that take time on a cold start: JIT, memory allocation, …
End-to-end Optimizations
- AoT compilation
- Memory pre-allocation
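A minimal sketch of the memory pre-allocation idea, to make the cold-start point concrete. This is not PRETZEL's implementation: `BufferPool` and its methods are hypothetical names, and plain `bytearray` objects stand in for whatever vectors a real runtime would pool.

```python
from collections import defaultdict


class BufferPool:
    """Hypothetical pool that allocates buffers once, ahead of the first request."""

    def __init__(self):
        self._free = defaultdict(list)  # buffer size -> list of idle buffers
        self.allocations = 0            # counts real allocations, for illustration

    def preallocate(self, size, count):
        # Pay the allocation cost at registration time, not on the request path.
        for _ in range(count):
            self._free[size].append(bytearray(size))
            self.allocations += 1

    def acquire(self, size):
        # Reuse an idle buffer when one exists; otherwise fall back to allocating.
        if self._free[size]:
            return self._free[size].pop()
        self.allocations += 1
        return bytearray(size)

    def release(self, buf):
        # Return the buffer to the pool so later requests can reuse it.
        self._free[len(buf)].append(buf)


pool = BufferPool()
pool.preallocate(size=1024, count=2)   # done once, before serving

for _ in range(100):                   # simulated request path
    buf = pool.acquire(1024)
    buf[0] = 1                         # stand-in for real work
    pool.release(buf)

print(pool.allocations)  # 2: nothing was allocated while serving requests
```

The first request then sees the same allocation cost as the hundredth, which is the property the cold-start numbers care about.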
Multi-model Optimizations
- Object Store: maintain a single copy of the same model parameters.
- Sub-plan Materialization: Save intermediate results so that other models can reuse them.
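The two multi-model optimizations above can be sketched as follows. Everything here is an assumption made for illustration: `ObjectStore`, `MaterializationCache`, `char_ngrams`, and the sub-plan id string are hypothetical names, and the scenario (one input sent to two models that share a featurizer) is only one way the sharing could arise.

```python
import hashlib
import json


class ObjectStore:
    """Hypothetical store that keeps a single copy of identical parameters."""

    def __init__(self):
        self._objects = {}

    def intern(self, params):
        # Key by content so that equal parameters coming from different
        # models resolve to the same in-memory object.
        blob = json.dumps(params, sort_keys=True).encode("utf-8")
        key = hashlib.sha256(blob).hexdigest()
        return self._objects.setdefault(key, params)

    def __len__(self):
        return len(self._objects)


class MaterializationCache:
    """Hypothetical cache of sub-plan outputs keyed by (sub-plan id, input)."""

    def __init__(self):
        self._results = {}
        self.hits = 0

    def run(self, subplan_id, fn, text):
        key = (subplan_id, text)
        if key in self._results:
            self.hits += 1              # another model already computed this
        else:
            self._results[key] = fn(text)
        return self._results[key]


def char_ngrams(text, n=3):
    # Toy featurizer standing in for a shared sub-plan.
    return tuple(text[i:i + n] for i in range(len(text) - n + 1))


store = ObjectStore()
vocab_a = store.intern({"ngrams": ["goo", "ood", "bad"]})  # loaded by model A
vocab_b = store.intern({"ngrams": ["goo", "ood", "bad"]})  # loaded by model B
print(vocab_a is vocab_b)  # True
print(len(store))          # 1

cache = MaterializationCache()
features_a = cache.run("char_ngrams_3", char_ngrams, "good")  # model A computes
features_b = cache.run("char_ngrams_3", char_ngrams, "good")  # model B reuses
print(features_a)  # ('goo', 'ood')
print(cache.hits)  # 1
```

Note that the cache only pays off when both the sub-plan and the input match, which is why the sharing has to be visible to the system (white box) in the first place.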
System components: Flour (Intermediate Representation), Oven (Compiler/Optimizer), Runtime, FrontEnd (User facing)
Offline Phase
- Translate model to Flour: Automated, because there is a one-to-one mapping from an ML.Net program to Flour.
- Oven builds and optimizes a model execution plan. Similar to database systems. Rule-based optimizer. Lots of details that I don’t understand here.
- Register the model plan with the Runtime. Pick the most efficient implementation (“Physical Stage”).
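A toy sketch of what a rule-based optimizer followed by physical-implementation selection might look like. None of this is taken from Oven: the plan is flattened to a list of operator names, both rewrite rules are invented, and the density threshold in `pick_physical` is arbitrary. It only illustrates the general shape of "apply rules until nothing changes, then choose an implementation".

```python
def drop_noops(plan):
    # Toy rule: remove operators that do nothing.
    return [op for op in plan if op != "Identity"]


def fuse_concat_predictor(plan):
    # Toy rule: merge a Concat with the LogReg that directly follows it.
    out = []
    for op in plan:
        if out and out[-1] == "Concat" and op == "LogReg":
            out[-1] = "ConcatLogReg"
        else:
            out.append(op)
    return out


RULES = [drop_noops, fuse_concat_predictor]


def optimize(plan):
    # Apply every rule repeatedly until no rule changes the plan (a fixpoint).
    changed = True
    while changed:
        changed = False
        for rule in RULES:
            new_plan = rule(plan)
            if new_plan != plan:
                plan, changed = new_plan, True
    return plan


def pick_physical(op, density):
    # Toy selection: the 0.5 threshold is arbitrary, chosen for illustration.
    return f"{op}[dense]" if density >= 0.5 else f"{op}[sparse]"


logical = ["Tokenizer", "Ngram", "Identity", "Concat", "LogReg"]
optimized = optimize(logical)
physical = [pick_physical(op, density=0.1) for op in optimized]

print(optimized)  # ['Tokenizer', 'Ngram', 'ConcatLogReg']
print(physical)   # ['Tokenizer[sparse]', 'Ngram[sparse]', 'ConcatLogReg[sparse]']
```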
Online Phase (Runtime, FrontEnd)
Evaluation
- Latency: 3x faster at the 99th percentile on cold start; 45x faster in the worst case on cold start
- Memory: 25x smaller than ML.Net; 62x smaller than ML.Net+Clipper
- Throughput: 10x higher, with close to linear scalability.
Questions
- Inputs differ from request to request, so how does sub-plan materialization help?
  - Author: Some identical inputs occur frequently. (Bad answer)
  - Author: An input is connected to multiple models. Those models might share some components. (Better answer)
- How many of the problems this paper solves are specific to ML.Net?
- Memory pool seems like a good idea. I guess TensorFlow’s memory pool isn’t shared across models. But how do they deal with fragmentation?