Lotus: Characterize Architecture Level CPU-based Preprocessing in Machine Learning Pipelines

Rajveer Bachkaniwala, Harshith Lanka, Kexin Rong, Ada Gavrilovska

The 2nd Workshop on Hot Topics in System Infrastructure (HotInfra’24), co-located with SOSP’24

[PDF] [Code] [Slides]



1. Problem

As ML pipelines grow more complex, data preprocessing—loading, decoding, and transforming data—has become a major bottleneck, often consuming more compute and power than expected. Yet most profiling tools overlook this stage, especially how it interacts with the CPU hardware.

2. Motivation

Today’s ML systems are built on heterogeneous infrastructure. To make smart decisions about hardware (e.g., which CPUs to choose), we need detailed visibility into how preprocessing performs—not just at the code level, but down to the CPU microarchitecture.

3. Contribution / Solution

Lotus is a lightweight profiling tool purpose-built for ML preprocessing. It captures:

  • Fine-grained timings for each preprocessing step (e.g., crop, flip, decode)
  • Hardware-level performance metrics like cache stalls and instruction throughput
  • Mappings between high-level Python code and low-level CPU operations

Lotus makes it easy to answer questions such as: Where is preprocessing slow? Is the CPU saturated? Would more cores help?
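To make the first capability concrete, here is a minimal sketch of per-step timing for a toy preprocessing pipeline. This is a generic illustration of the kind of measurement Lotus automates, not Lotus's actual API; the step names (decode, crop, flip) mirror the paper's examples, but the functions themselves are stand-ins.

```python
import time
from collections import defaultdict

def decode(raw):   # stand-in for image decoding: bytes -> list of ints
    return list(raw)

def crop(img):     # stand-in for a center crop: keep the middle half
    return img[len(img) // 4 : 3 * len(img) // 4]

def flip(img):     # stand-in for a horizontal flip
    return img[::-1]

def profile_pipeline(samples, steps):
    """Run each sample through the steps, accumulating wall time per step."""
    timings = defaultdict(float)
    for sample in samples:
        x = sample
        for step in steps:
            start = time.perf_counter()
            x = step(x)
            timings[step.__name__] += time.perf_counter() - start
    return dict(timings)

samples = [bytes(range(256)) for _ in range(100)]
timings = profile_pipeline(samples, [decode, crop, flip])
for name, total in timings.items():
    print(f"{name}: {total * 1e3:.3f} ms total")
```

A real profiler would additionally read hardware counters (e.g., via `perf_event_open` on Linux) inside the timed region to attribute cache stalls and instruction throughput to each step, which is where Lotus's mapping between Python steps and CPU-level metrics comes in.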

4. Results / Observations

  • Scaling dataloaders helps—until it doesn’t: More parallelism improves performance, but gains flatten as CPUs become saturated.
  • Hardware bottlenecks matter: Lotus shows where CPUs struggle—whether from frontend stalls, memory contention, or underutilized cores.
  • Better hardware decisions: With Lotus, ML infra teams can match workloads to the right CPU configuration, avoiding overprovisioning or underperformance.
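The first observation above—parallelism helps until the CPU saturates—can be reproduced with a simple experiment. The sketch below is not from the paper; it uses a hypothetical CPU-bound `preprocess` function and a process pool to mimic scaling dataloader workers, measuring throughput at each worker count so the flattening curve becomes visible once workers exceed available cores.

```python
import time
from concurrent.futures import ProcessPoolExecutor

def preprocess(i):
    # Hypothetical CPU-bound work standing in for decode/transform steps.
    return sum(j * j for j in range(20_000))

def throughput(num_workers, n_items=64):
    """Items processed per second with the given number of worker processes."""
    start = time.perf_counter()
    with ProcessPoolExecutor(max_workers=num_workers) as pool:
        list(pool.map(preprocess, range(n_items)))
    return n_items / (time.perf_counter() - start)

if __name__ == "__main__":
    for w in (1, 2, 4, 8):
        print(f"{w} workers: {throughput(w):.1f} items/s")
```

On a machine with, say, 4 physical cores, throughput roughly doubles from 1 to 2 workers but changes little from 4 to 8—the saturation point a tool like Lotus pinpoints, along with *why* (frontend stalls, memory contention, or simply no idle cores left).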