• Limitations of existing serving systems
    • Resource Waste. Resources are isolated across models. Good for performance / tail latency. Bad for resource utilization.
    • Unawareness of Operators’ Characteristics. During inference, logistic regression takes only 0.3% of the time, while Ngram and Concat take the majority. With a white-box view of the model, better execution plans could be chosen.
    • Lazy Initialization. Leads to long tail latency. This problem is ML.Net-specific but probably also applies to other serving systems. Things that take time on cold start: JIT compilation, memory allocation, …
  • End-to-end Optimizations
    • AoT compilation
    • Memory pre-allocation
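The two end-to-end optimizations above can be sketched together. This is a minimal illustration (names like `PreallocatedPlan` are mine, not Pretzel’s API): all intermediate buffers are allocated once when the plan is registered, so the hot inference path does no allocation.

```python
# Hypothetical sketch of memory pre-allocation: buffers for every
# intermediate result are sized and allocated once at plan-registration
# time, so the hot inference path performs no allocations.
# All names (PreallocatedPlan, upper_stage) are illustrative.

class PreallocatedPlan:
    def __init__(self, stage_output_sizes):
        # One reusable buffer per stage, allocated up front (cold path).
        self.buffers = [bytearray(n) for n in stage_output_sizes]

    def run(self, stages, request):
        # Hot path: each stage writes into its pre-allocated buffer.
        data = request
        for stage, buf in zip(stages, self.buffers):
            data = stage(data, buf)
        return data

# Toy stage: write the uppercased input into the reused buffer.
def upper_stage(data, buf):
    out = data.upper()
    buf[:len(out)] = out
    return bytes(buf[:len(out)])

plan = PreallocatedPlan([64])
print(plan.run([upper_stage], b"hello"))  # b'HELLO'
```

AoT compilation plays the same role for code that pre-allocation plays for memory: both move cold-start work (JIT, allocation) out of the request path.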
  • Multi-model Optimizations
    • Object Store: maintain a single copy of the same model parameters.
    • Sub-plan Materialization: Save intermediate results so that other models can reuse them.
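A rough sketch of both multi-model optimizations (my own illustration, not Pretzel’s actual interfaces): the object store dedups identical parameter blobs by content hash, and the sub-plan cache materializes a shared sub-plan’s output so a second model with the same component skips recomputation.

```python
import hashlib

# Illustrative sketch (not Pretzel's real API) of the two multi-model
# optimizations: an object store keeping one copy of identical
# parameters, and a cache materializing shared sub-plan outputs.

class ObjectStore:
    def __init__(self):
        self._store = {}  # content hash -> parameter blob

    def put(self, params: bytes) -> str:
        key = hashlib.sha256(params).hexdigest()
        self._store.setdefault(key, params)  # dedup identical weights
        return key

    def get(self, key: str) -> bytes:
        return self._store[key]

class SubPlanCache:
    def __init__(self):
        self._cache = {}
        self.hits = 0

    def run(self, subplan_id, fn, inp):
        key = (subplan_id, inp)
        if key in self._cache:
            self.hits += 1          # a second model reuses the result
        else:
            self._cache[key] = fn(inp)
        return self._cache[key]

store = ObjectStore()
k1 = store.put(b"weights-v1")   # model A registers its weights
k2 = store.put(b"weights-v1")   # model B shares the same weights
assert k1 == k2                 # only one copy is kept

cache = SubPlanCache()
featurize = lambda s: s.lower().split()
cache.run("ngram", featurize, "Hello World")  # model A computes it
cache.run("ngram", featurize, "Hello World")  # model B hits the cache
print(cache.hits)  # 1
```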
  • System components: Flour (Intermediate Representation), Oven (Compiler/Optimizer), Runtime, FrontEnd (User facing)
  • Offline Phase
    • Translate model to Flour: Automated, because there is a one-to-one mapping from an ML.Net program to Flour.
    • Oven builds and optimizes a model execution plan. Similar to database systems. Rule-based optimizer. Lots of details that I don’t understand here.
    • Register the model plan with the Runtime. Pick the most efficient implementation (“Physical Stage”).
  • Online Phase (Runtime, FrontEnd)
  • Evaluation
    • Latency: 99th-percentile cold-start latency 3x faster; worst-case cold start 45x faster.
    • Memory: 1/25 the memory of ML.Net; 1/62 of ML.Net + Clipper.
    • Throughput: 10x throughput, with close-to-linear scalability.
  • Questions
    • Inputs are different. How does sub-plan materialization work?
      • Author: Some same input happens frequently. (Bad answer)
      • Author: An input is connected to multiple models. Those models might share some components. (Better answer)
    • How many of the problems this paper solves are specific to ML.Net?
    • Memory pool seems like a good idea. I guess TensorFlow’s memory pool isn’t shared across models. But how do they deal with fragmentation?
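One common answer to the fragmentation question is a pool of fixed-size slabs: when every buffer has the same size, any freed slab can satisfy any later request, so holes never form. This is my guess at a plausible design, not how Pretzel’s pool is actually implemented.

```python
# Hedged sketch of fragmentation avoidance via fixed-size slabs.
# SlabPool is a hypothetical name; this is not Pretzel's allocator.

class SlabPool:
    def __init__(self, slab_size, count):
        self.slab_size = slab_size
        # All slabs are allocated up front and have identical size,
        # so the free list never fragments.
        self.free = [bytearray(slab_size) for _ in range(count)]

    def alloc(self):
        if not self.free:
            raise MemoryError("pool exhausted")
        return self.free.pop()

    def release(self, slab):
        self.free.append(slab)  # reusable by any model/request

pool = SlabPool(slab_size=4096, count=2)
a = pool.alloc()
pool.release(a)
b = pool.alloc()   # reuses the slab just freed, no new allocation
print(b is a)  # True
```

The trade-off is internal waste: a request needing 100 bytes still consumes a whole 4 KB slab, which may be why real pools (e.g., size-class allocators) use several slab sizes.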