• Although this paper looks terrible, I think the technical content is great and the ideas are probably novel.
  • GPU computation resource isolation (SM partition):
  • Reverse engineering GPU memory and cache hierarchy
    • L1 cache and shared memory are within SM. No contention.
    • Constant memory is small in size (64 KB in GTX 1080) and cached. Hence, we do not expect much contention over constant memory.
    • None of the programs use texture memory, so it causes no contention either.
    • Find the mapping from physical address to DRAM bank

      The algorithm exploits the fact that accessing two addresses that lie on the same bank but different rows (which causes row buffer eviction) will be noticeably slower than accessing two addresses that lie on different banks.
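
      The timing test can be sketched as a small Python simulation. Everything named here is an assumption made for illustration: the bank hash `hidden_bank`, the row size, and the latencies are invented, and `pair_latency` stands in for a GPU kernel that would time alternating uncached loads to two addresses.

```python
# Simulated sketch only: the hash, sizes, and latencies are invented.
ROW_BYTES = 2048
NUM_BANKS = 4
HIT_NS, CONFLICT_NS = 200, 350
THRESHOLD_NS = (HIT_NS + CONFLICT_NS) / 2


def hidden_bank(addr):
    """Hypothetical address-to-bank hash that the probe cannot see."""
    chunk = addr // ROW_BYTES
    return (chunk ^ (chunk >> 2)) % NUM_BANKS


def hidden_row(addr):
    """Hypothetical row index inside a bank."""
    return addr // (ROW_BYTES * NUM_BANKS)


def pair_latency(a, b):
    """Stand-in for timing alternating accesses to a and b on real hardware."""
    same_bank = hidden_bank(a) == hidden_bank(b)
    different_row = hidden_row(a) != hidden_row(b)
    # Same bank, different row: each access evicts the other's row buffer.
    return CONFLICT_NS if same_bank and different_row else HIT_NS


def group_by_bank(addresses, threshold_ns=THRESHOLD_NS):
    """Put each address in the group whose representative it conflicts with."""
    groups = []
    for addr in addresses:
        for group in groups:
            if pair_latency(group[0], addr) > threshold_ns:
                group.append(addr)
                break
        else:
            groups.append([addr])  # no slow pair found: a new bank
    return groups


addresses = [i * ROW_BYTES for i in range(64)]
bank_groups = group_by_bank(addresses)
```

      Once addresses are grouped this way, the bits that differ between groups can be used to guess the hash function itself.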

    • Find the mapping from physical address to cache set

      Our algorithm to reverse-engineer the hash function that maps a physical address to a cache set relies on two facts: 1) Cache lines that lie on the same cache set can evict each other, and 2) Accessing a cached word is much faster than accessing a word from DRAM.
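
      The two facts above are enough to shrink a large pool of addresses down to a minimal eviction set for one target address. The Python sketch below runs against a simulated LRU cache; the set hash `hidden_set`, the cache geometry, and the reduction strategy are assumptions for illustration and are not taken from the paper. On a GPU, `evicts` would be a latency measurement compared against a hit/miss threshold.

```python
from collections import OrderedDict

# Simulated sketch only: geometry and hash are invented.
LINE_BYTES = 128
NUM_SETS = 8
WAYS = 4


def hidden_set(addr):
    """Hypothetical address-to-set hash that the probe cannot see."""
    line = addr // LINE_BYTES
    return (line ^ (line >> 3)) % NUM_SETS


class SimulatedCache:
    """Set-associative cache with LRU replacement."""

    def __init__(self):
        self.sets = [OrderedDict() for _ in range(NUM_SETS)]

    def access(self, addr):
        """Return True on a hit (fast access), False on a miss (DRAM access)."""
        lines = self.sets[hidden_set(addr)]
        tag = addr // LINE_BYTES
        hit = tag in lines
        if hit:
            lines.move_to_end(tag)
        else:
            if len(lines) >= WAYS:
                lines.popitem(last=False)  # evict the least recently used line
            lines[tag] = True
        return hit


def evicts(target, candidates):
    """Load target, touch the candidates, then check whether target was evicted."""
    cache = SimulatedCache()
    cache.access(target)
    for addr in candidates:
        cache.access(addr)
    return not cache.access(target)


def minimal_eviction_set(target, pool):
    """Drop every address whose removal still leaves the target evicted."""
    kept = list(pool)
    for addr in list(pool):
        trial = [a for a in kept if a != addr]
        if evicts(target, trial):
            kept = trial
    return kept


target = 0
pool = [line * LINE_BYTES for line in range(1, 64)]
eviction_set = minimal_eviction_set(target, pool)
```

      Every address that survives the reduction shares a cache set with the target, so repeating this for different targets yields address groups per set.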

  • GPU memory bandwidth isolation:

    Physical pages can be allocated such that no two GPU partitions can access the same cache sets and therefore cannot evict each other’s cache lines. Similarly, they can be allocated such that GPU partitions do not share DRAM banks, thereby avoiding row buffer conflicts.
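
    This is essentially page coloring. A minimal Python sketch follows, assuming a hypothetical `page_color` function that summarizes which cache sets and DRAM banks a physical page maps to; the hash, the number of colors, and the allocator interface are all invented for illustration.

```python
# Sketch only: the color hash and allocator interface are invented.
NUM_COLORS = 8


def page_color(page_number):
    """Hypothetical color (cache-set / DRAM-bank group) of a physical page."""
    return (page_number ^ (page_number >> 3)) % NUM_COLORS


def build_free_lists(num_pages, num_partitions):
    """Give each partition a disjoint, contiguous range of colors."""
    colors_per_partition = NUM_COLORS // num_partitions
    free_lists = {p: [] for p in range(num_partitions)}
    for page in range(num_pages):
        owner = page_color(page) // colors_per_partition
        if owner < num_partitions:  # leftover colors stay unassigned
            free_lists[owner].append(page)
    return free_lists


def allocate(free_lists, partition, count):
    """Hand out pages drawn only from the partition's own colors."""
    pages = free_lists[partition]
    if count > len(pages):
        raise MemoryError("partition has too few pages of its colors")
    taken, free_lists[partition] = pages[:count], pages[count:]
    return taken


free_lists = build_free_lists(num_pages=64, num_partitions=2)
pages_a = allocate(free_lists, 0, 10)
pages_b = allocate(free_lists, 1, 10)
```

    Because the color ranges are disjoint, pages handed to different partitions never share a color. The cost is that each partition can use only its own fraction of the cache sets and banks.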

  • Evaluation
    • Terminology
      • Compute Partitioning only (CP)
      • Both Compute and Memory Partitioning (CMP)
    • GTX 1080

      CMP gives much more predictable performance than CP.

      There is a tradeoff between predictability and performance: every application runs faster under CP than under CMP.

      Some applications under-utilize GPU resources: a normalized runtime below 1 indicates that the application is under-utilizing the GPU.

    • Tesla V100

      CP gives more predictable performance than NVIDIA MPS: the average variation is 135.8% under MPS but only 48.1% under CP. This is because MPS partitions at thread granularity whereas CP partitions at SM granularity.

      CP performs better on Tesla V100 than on GTX 1080. This is because Tesla V100 has higher memory bandwidth.