The paper first points out that preprocessing (decoding, resizing, cropping, and normalization) accounts for a large share of time and energy in visual analytics pipelines, which matches what I observed when running experiments on Nexus. Prior systems (BlazeIt and Tahoma) trade accuracy for throughput by choosing among differently specialized models; this paper augments that search space with the input format. The first author, Daniel Kang, clearly has a deep background in image and video decoding (see his website). The paper shows techniques for acquiring lower-fidelity inputs at lower resource cost from popular image and video encodings (Table 4).
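As an illustration of what "lower-fidelity inputs from popular encodings" can mean (this is a hedged sketch using Pillow, not the paper's actual implementation): JPEG decoders can produce a 1/2, 1/4, or 1/8 scale image directly during decoding, which is much cheaper than decoding at full resolution and resizing afterward. Pillow exposes this through `Image.draft`:

```python
from io import BytesIO
from PIL import Image

def decode_low_fidelity(jpeg_bytes, target_size):
    """Decode a JPEG at reduced fidelity by hinting the decoder to
    use its built-in power-of-two DCT scaling (1/2, 1/4, 1/8),
    avoiding a full-resolution decode."""
    im = Image.open(BytesIO(jpeg_bytes))
    im.draft("RGB", target_size)  # decoder picks the nearest scale >= target
    im.load()
    return im

# build a synthetic 256x256 JPEG in memory just for demonstration
buf = BytesIO()
Image.new("RGB", (256, 256), (128, 64, 32)).save(buf, format="JPEG")
small = decode_low_fidelity(buf.getvalue(), (64, 64))
```

The decoded image is smaller than the full-resolution original, so later resize/normalize stages also touch fewer pixels; the accuracy impact of the coarser input is exactly what the paper's search has to account for.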

They built a cost-model-based engine, Smol, that searches the space of specialized DNNs × input formats and identifies the Pareto frontier in terms of accuracy versus end-to-end runtime. Based on the cost model, the engine decides whether to place preprocessing on the GPU or the CPU, and which specialized DNN and input format to use.
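To make the search concrete, here is a minimal sketch of finding the Pareto frontier over (model, input format) configurations. The models, formats, and all accuracy/cost numbers are made up for illustration; this is not Smol's actual cost model.

```python
from itertools import product

# hypothetical per-configuration measurements (illustrative only)
model_acc  = {"resnet18": 0.92, "mobilenet": 0.88, "tiny-cnn": 0.80}
model_cost = {"resnet18": 12.0, "mobilenet": 6.0,  "tiny-cnn": 2.0}   # ms/image
fmt_cost    = {"full-res JPEG": 5.0, "reduced-res JPEG": 0.5}          # decode ms
fmt_penalty = {"full-res JPEG": 0.0, "reduced-res JPEG": 0.02}         # accuracy drop

def pareto_frontier(configs):
    """Keep configurations that no other configuration strictly dominates
    (higher accuracy with no more time, or less time with no less accuracy)."""
    def dominates(o, c):
        return ((o["acc"] > c["acc"] and o["time"] <= c["time"]) or
                (o["acc"] >= c["acc"] and o["time"] < c["time"]))
    return [c for c in configs
            if not any(dominates(o, c) for o in configs)]

configs = [
    {"model": m, "fmt": f,
     "acc":  model_acc[m] - fmt_penalty[f],
     "time": model_cost[m] + fmt_cost[f]}
    for m, f in product(model_acc, fmt_cost)
]
best = pareto_frontier(configs)
```

With these numbers, "tiny-cnn on full-res JPEG" falls off the frontier because "mobilenet on reduced-res JPEG" is both faster and more accurate; the remaining configurations trade accuracy against runtime, and the engine would pick among them given an accuracy target.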