Lotus: Characterization of Machine Learning Preprocessing Pipelines via Framework and Hardware Profiling

Rajveer Bachkaniwala, Harshith Lanka, Kexin Rong, Ada Gavrilovska

2024 IEEE International Symposium on Workload Characterization (IISWC 2024)

Award: 🏆 Best paper nominee

Artifact: Available, Reviewed, Reproduced

[PDF] [Code] [Slides]



1. Problem

ML pipelines often experience significant slowdowns in the data preprocessing stage. Existing tools either lack visibility into fine-grained operation timings or cannot link high-level Python functions with low-level hardware behavior, making bottlenecks hard to diagnose.

2. Motivation

Data preprocessing can consume up to 65% of training time, leaving GPUs underutilized. No existing tool bridges the semantic gap between Python-level operations and CPU microarchitectural activity, which limits actionable insight for optimization.

3. Contribution / Solution

The authors introduce Lotus, a profiling framework consisting of:

  • LotusTrace: Captures fine-grained (<10ms) timings of individual preprocessing steps in PyTorch’s DataLoader with minimal overhead.
  • LotusMap: Bridges Python and C++ layers by mapping preprocessing operations to their corresponding low-level C/C++ functions, enabling correlation with hardware counters from tools like Intel VTune or AMD uProf.

Together, they provide full-stack visibility into preprocessing performance at both the software and hardware levels.
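The core idea behind LotusTrace can be sketched as per-operation instrumentation of a composed transform chain. The sketch below is a minimal stdlib-only illustration, not Lotus's actual implementation: `TimedCompose` and the `decode`/`resize`/`normalize` stand-ins are hypothetical names invented for this example.

```python
# Minimal sketch (NOT Lotus's implementation) of per-operation timing
# inside a preprocessing pipeline: each transform in a composed chain is
# wrapped so its individual wall-clock time is recorded, analogous to how
# LotusTrace instruments preprocessing steps in PyTorch's DataLoader.
import time

class TimedCompose:
    """Apply transforms in sequence, recording per-op wall-clock durations."""
    def __init__(self, transforms):
        self.transforms = transforms
        self.records = []  # (op_name, seconds) for each applied transform

    def __call__(self, sample):
        for t in self.transforms:
            start = time.perf_counter()
            sample = t(sample)
            self.records.append((t.__name__, time.perf_counter() - start))
        return sample

# Hypothetical stand-ins for typical image-preprocessing steps.
def decode(x):
    return [float(v) for v in x]

def resize(x):
    return x[: len(x) // 2]

def normalize(x):
    m = max(x) or 1.0
    return [v / m for v in x]

pipeline = TimedCompose([decode, resize, normalize])
out = pipeline(list(range(8)))
for name, secs in pipeline.records:
    print(f"{name}: {secs * 1e6:.1f} us")
```

Reporting microsecond-scale durations per operation is the point: as the results below note, many preprocessing ops run in under 100µs, below the sampling resolution of conventional profilers.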

4. Results / Observations

  • Short-lived ops dominate: Most preprocessing operations take less than 10ms; many are under 100µs, making them invisible to traditional profilers.
  • High variance in batch times: Variability in image sizes and randomness in transforms lead to a 5–15% standard deviation in batch times, complicating resource provisioning.
  • Out-of-order arrivals hurt performance: Shared queues between DataLoader workers cause batches to arrive out of order, introducing main-process wait times and delaying GPU consumption.
  • Diminishing returns with more workers (cores): Increasing the number of DataLoader workers initially reduces job time, but beyond a threshold (e.g., 20), additional workers increase CPU contention with minimal end-to-end gains.
  • Lower overhead, richer insight: Compared to profilers such as py-spy, Austin, and Scalene, Lotus provides fine-grained batch-level instrumentation while incurring under 2% runtime overhead.
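The out-of-order-arrival observation can be illustrated with a toy model (hypothetical, not from the paper): workers finish batches at given times, but the main process must hand them to the GPU in batch-index order, so a straggling early-indexed batch delays everything behind it even when the same total preprocessing work is done.

```python
# Toy model of out-of-order batch arrivals from DataLoader-style workers.
# A batch can only be delivered to the GPU once it AND all earlier-indexed
# batches have finished, so its delivery time is the running max of finish
# times up to its index.
def delivery_times(finish_times):
    """finish_times[i] = time at which batch i becomes available.
    Returns the time each batch is actually handed to the GPU, assuming
    strict in-index-order, instantaneous consumption."""
    out, now = [], 0.0
    for t in finish_times:
        now = max(now, t)  # stall until this batch and all prior ones exist
        out.append(now)
    return out

# Same three batches, same per-batch finish times, but produced in
# opposite index order: every batch is now delivered at t=3.0.
in_order = delivery_times([1.0, 2.0, 3.0])
shuffled = delivery_times([3.0, 2.0, 1.0])
print(in_order, shuffled)
```

In the in-order case batches stream to the GPU as they finish; in the shuffled case batches 1 and 2 sit idle in the shared queue while the main process waits for batch 0, which is the wait-time effect the paper measures.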