Fractional GPUs: Software-based Compute and Memory Bandwidth Reservation for GPUs
- Although this paper looks terrible, the technical content is great and the ideas are probably novel.
- GPU computation resource isolation (SM partition):
- Each kernel can specify which SMs it wants to run on.
- For a kernel that originally requires nb blocks with nt threads per block, it instead runs with npb blocks with nt threads per block.
- npb * nt == (number of SMs) * (threads per SM)
- i.e., each kernel launch occupies every hardware thread on the GPU.
- Threads exit immediately if they are not running on a target SM (determined by reading the `%smid` register).
- The first thread in each block atomically increments a counter; the returned value serves as the original blockIdx.
- This requires minor changes to the kernel code.
- MPS is used to run multiple kernels concurrently.
- Their approach nevertheless provides better isolation than MPS's QoS mechanism.
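The SM-partitioning transform above can be sketched in CUDA roughly as follows. This is a minimal sketch under stated assumptions: the counter name, the SM-mask layout, and the kernel body are illustrative, not the paper's actual code.

```cuda
// Global counter handing out "virtual" block indices; the host zeroes it
// (e.g., via cudaMemcpyToSymbol) before each launch.
__device__ unsigned int next_block;

// Read this thread's SM id from the %smid special register.
__device__ unsigned int smid() {
    unsigned int id;
    asm("mov.u32 %0, %%smid;" : "=r"(id));
    return id;
}

// Launched with npb blocks of nt threads (covering all hardware threads).
// Blocks on SMs outside the partition exit immediately; surviving blocks
// loop, claiming original block indices until all nb are processed.
__global__ void partitioned_kernel(const int *sm_mask, int num_orig_blocks,
                                   float *data) {
    if (!sm_mask[smid()]) return;  // not one of this partition's SMs

    __shared__ unsigned int vblock;
    while (true) {
        if (threadIdx.x == 0)
            vblock = atomicAdd(&next_block, 1u);  // claim an original blockIdx
        __syncthreads();
        if (vblock >= num_orig_blocks) return;    // all original blocks done
        // --- original kernel body, with blockIdx.x replaced by vblock ---
        int i = vblock * blockDim.x + threadIdx.x;
        data[i] = 2.0f * data[i];
        __syncthreads();  // keep vblock stable until everyone is done with it
    }
}
```

The trailing `__syncthreads()` matters: `vblock` is shared, so thread 0 must not re-claim a new index while other threads are still using the current one.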
- Reverse engineering GPU memory and cache hierarchy
- L1 cache and shared memory are private to each SM, so there is no cross-partition contention over them.
- Constant memory is small in size (64 KB in GTX 1080) and cached. Hence, we do not expect much contention over constant memory.
- The benchmark programs do not use texture memory.
- Find the mapping from physical address to DRAM bank
The algorithm exploits the fact that accessing two addresses that lie on the same bank but different rows (which causes row buffer eviction) will be noticeably slower than accessing two addresses that lie on different banks.
- Find the mapping from physical address to cache set
Our algorithm to reverse-engineer the hash function that maps a physical address to a cache set relies on two facts: 1) Cache lines that lie on the same cache set can evict each other, and 2) Accessing a cached word is much faster than accessing a word from DRAM.
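Both reverse-engineering steps reduce to timing pairs of addresses. A minimal device-side probe might look like the sketch below; the iteration count and the host-side sweep over physical-address bits (not shown) are assumptions, not the paper's code.

```cuda
// Latency probe underlying both reverse-engineering steps (illustrative).
// Bank hash: if a and b map to the same bank but different rows, alternating
// accesses repeatedly evict the row buffer and the loop runs measurably slower.
// Set hash: after priming the cache with candidate conflicting lines, a slow
// reload indicates the lines share a cache set.
__global__ void probe_pair(volatile unsigned int *a, volatile unsigned int *b,
                           long long *elapsed) {
    unsigned int sink = 0;
    long long start = clock64();
    for (int i = 0; i < 10000; i++) {
        sink += *a;  // volatile: each iteration issues a real memory access
        sink += *b;
    }
    *elapsed = clock64() - start;
    if (sink == 0xdeadbeefu) *elapsed = -1;  // keep `sink` live past the loop
}
```

Sweeping address pairs that differ in a single physical-address bit and clustering the measured latencies reveals which bits feed the bank and cache-set hash functions.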
- GPU memory bandwidth isolation:
Physical pages can be allocated such that no two GPU partitions can access the same cache sets and therefore cannot evict each other’s cache lines. Similarly, they can be allocated such that GPU partitions do not share DRAM banks, thereby avoiding row buffer conflicts.
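This allocation policy amounts to page coloring. Assuming the recovered hashes are XORs of selected physical-address bits (an assumption; the real hashes are GPU-specific and come out of the reverse-engineering step), the admission check could be sketched as host-side code:

```cuda
#include <cstdint>

// Hypothetical XOR-fold hash: the bank/set masks stand in for whatever hash
// the reverse-engineering step recovered for the target GPU.
// (__builtin_parityll is a GCC/Clang host-compiler builtin.)
static int color_of(uint64_t phys_addr, uint64_t hash_mask) {
    return __builtin_parityll(phys_addr & hash_mask);  // XOR of selected bits
}

// Grant a physical page to a partition only if both its DRAM-bank color and
// its cache-set color match the colors reserved for that partition.
static bool page_usable_by(uint64_t phys_addr, int bank_color, int set_color,
                           uint64_t bank_mask, uint64_t set_mask) {
    return color_of(phys_addr, bank_mask) == bank_color &&
           color_of(phys_addr, set_mask) == set_color;
}
```

Because partitions never share a bank color or a set color, they cannot cause row buffer conflicts or cache-line evictions in each other's pages.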
- Evaluation
- Terminology
- Compute Partitioning only (CP)
- Both Compute and Memory Partitioning (CMP)
- GTX 1080
CMP provides much more predictable performance than CP.
There is a tradeoff between predictability and performance: the runtimes of all applications are shorter under CP than under CMP.
Some applications under-utilize GPU resources: a normalized runtime below 1 indicates that the application is under-utilizing the GPU.
- Tesla V100
CP is better than Nvidia MPS for predictable performance. The average variation for MPS is 135.8%, whereas for CP it is considerably lower at 48.1%. This is because MPS partitions at thread granularity, whereas CP partitions at SM granularity.
CP performs better on Tesla V100 than on GTX 1080. This is because Tesla V100 has higher memory bandwidth.