Fast networks enable remote memory systems:

These systems use the Virtual Memory Subsystem (VMS) in the operating system (OS) as the primary mechanism for two major tasks:

  1. demand-paging memory pages from remote hosts to cache them in local DRAM, and
  2. evicting locally cached memory pages back to the remote host that owns them

Overheads of using virtual memory to implement remote memory systems:

  1. (Key factor) “[VMS] uses 4KB or larger pages, while applications access and modify data at a finer granularity.”
  2. Page faults are expensive, especially on writes to pages
  3. Write-protecting pages, flushing TLBs, etc. cost extra time

As remote memory operates at microsecond latencies [7] in contrast to traditional millisecond-latency devices [11], the overhead of page faults in caching remote pages is prohibitive.

This is very related to the Attack of the Killer Microseconds paper.

In Section 2.1 and Table 1, they didn’t discuss the size of each data entry. I assume they write <4KB entries, which might be what happens in production. I think the point of tracking dirty pages is to batch the network I/O together. In this sense, ~90% or even ~100% amplification is indeed tragic: even if the system batches the writes, performs the network I/O in the background, and clears out all dirty pages, all pages will become dirty again very quickly, so the network bandwidth will always be occupied.

Cache-coherent FPGAs:

These are FPGAs that share memory with a CPU, where accesses to that memory go through a cache-coherence protocol to keep the memory consistent between the CPU and the FPGA. In such architectures, the FPGA is connected to the CPU by a point-to-point interconnect that implements a cache-coherence protocol (Fig. 1), such as MESI [61] or one of its variants.

PBerry’s goal:

PBerry’s key goal is to enable fast and transparent application memory access tracking at cache line granularity. PBerry also accelerates local and remote data copy.

I skipped the details of PBerry, but Section 6 (Table 2) is very interesting: it enumerates several use cases of cache-coherence-based tracking.

People generally don’t believe in remote memory, and this paper doesn’t justify the premise either. The idea is just hard to buy.

I think the PBerry paper’s idea of reusing the cache-coherence protocol to reduce overhead is very clever. They also enumerated several use cases for the proposed coherence-tracking primitives, which I find more interesting than remote memory. Since people generally don’t believe in remote memory, I wonder whether switching the research focus to the other use cases they listed would produce more valuable work.

I don’t quite understand the purpose of FPGA-Mem. It seems to me that FPGA-Mem is used as a cache of remote memory. I suppose that if they were to focus on other applications like binary translation, FPGA-Mem would not be necessary, because for applications that care more about tracking than about copying, only the cache-coherence protocol matters. It’s also unclear to me whether they need the reprogrammability of an FPGA. I’m curious whether they chose an FPGA because it is a cheaper and easier platform for developing the system (compared to an ASIC).