1. Problem
As ML pipelines grow more complex, data preprocessing (loading, decoding, transforming) has become a major bottleneck, often consuming more compute and power than practitioners expect. Yet most profiling tools overlook this stage, especially how it interacts with CPU hardware.
2. Motivation
Today’s ML systems are built on heterogeneous infrastructure. To make smart decisions about hardware (e.g., which CPUs to choose), we need detailed visibility into how preprocessing performs—not just at the code level, but down to the CPU microarchitecture.
3. Contribution / Solution
Lotus is a lightweight profiling tool purpose-built for ML preprocessing. It captures:
- Fine-grained timings for each preprocessing step (e.g., crop, flip, decode)
- Hardware-level performance metrics like cache stalls and instruction throughput
- Mappings between high-level Python code and low-level CPU operations
Lotus makes it easy to answer: Where is preprocessing slow? Is the CPU saturated? Would more cores help?
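The per-step timing idea can be illustrated with a minimal sketch. This is not Lotus's actual API; it is a stdlib-only Python example (the `decode`/`crop`/`flip` bodies are toy stand-ins) showing how a profiler can attribute wall-clock time to named preprocessing steps:

```python
import time
from contextlib import contextmanager
from collections import defaultdict

# Accumulated wall-clock time per preprocessing step.
step_times = defaultdict(float)

@contextmanager
def timed(step):
    """Record elapsed time for one named preprocessing step."""
    start = time.perf_counter()
    try:
        yield
    finally:
        step_times[step] += time.perf_counter() - start

def preprocess(raw):
    # Toy stand-ins for real decode/crop/flip kernels.
    with timed("decode"):
        img = [b for b in raw]          # pretend byte decode
    with timed("crop"):
        img = img[: len(img) // 2]      # pretend center crop
    with timed("flip"):
        img = img[::-1]                 # pretend horizontal flip
    return img

for sample in (b"abcdefgh",) * 100:
    preprocess(sample)

for step, t in step_times.items():
    print(f"{step}: {t * 1e6:.1f} us total")
```

The hardware-level counters Lotus reports (cache stalls, instruction throughput) have standard Linux analogues: `perf stat -e cycles,instructions,cache-misses <command>` samples the same class of microarchitectural events for an entire process.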
4. Results / Observations
- Scaling dataloaders helps—until it doesn’t: More parallelism improves performance, but gains flatten as CPUs become saturated.
- Hardware bottlenecks matter: Lotus shows where CPUs struggle—whether from frontend stalls, memory contention, or underutilized cores.
- Better hardware decisions: With Lotus, ML infra teams can match workloads to the right CPU configuration, avoiding overprovisioning or underperformance.
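The first observation, that dataloader parallelism saturates, can be reproduced with a small experiment. This is an illustrative stdlib sketch (not Lotus itself, and `transform` is a synthetic CPU-bound stand-in): it mirrors what scaling a framework dataloader's worker count does, and throughput typically flattens once workers exceed physical cores:

```python
import time
from multiprocessing import Pool

def transform(x):
    # Synthetic CPU-bound stand-in for decode + augmentation work.
    acc = 0
    for i in range(20_000):
        acc += (x * i) % 7
    return acc

def throughput(num_workers, n_items=200):
    """Items/sec when preprocessing n_items with num_workers processes."""
    start = time.perf_counter()
    with Pool(num_workers) as pool:
        pool.map(transform, range(n_items))
    return n_items / (time.perf_counter() - start)

if __name__ == "__main__":
    # Gains typically grow with workers, then flatten once physical
    # cores (and memory bandwidth) are saturated.
    for w in (1, 2, 4, 8):
        print(f"{w} workers: {throughput(w):.0f} items/s")
```

Plotting items/s against worker count on a given machine shows the knee of the curve, which is exactly the saturation point a profiler like Lotus helps locate.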