Lotus: Characterization of Machine Learning Preprocessing Pipelines via Framework and Hardware Profiling

Rajveer Bachkaniwala, Harshith Lanka, Kexin Rong, Ada Gavrilovska

2024 IEEE International Symposium on Workload Characterization (IISWC 2024)

Award: 🏆 Best paper nominee

Artifact: Available, Reviewed, Reproduced

[PDF] [Code] [Slides]



1. Problem

ML pipelines often experience significant slowdowns in the data preprocessing stage. Existing tools either lack visibility into fine-grained operation timings or cannot link high-level Python functions with low-level hardware behavior, making bottlenecks hard to diagnose.

2. Motivation

Data preprocessing can consume up to 65% of training time, leaving GPUs underutilized. No existing tool bridges the semantic gap between Python-level operations and CPU microarchitectural activity, which limits actionable insight for optimization.

3. Contribution / Solution

The authors introduce Lotus, a profiling framework consisting of:

  • LotusTrace: Captures fine-grained (<10ms) timings of individual preprocessing steps in PyTorch’s DataLoader with minimal overhead.
  • LotusMap: Bridges Python and C++ layers by mapping preprocessing operations to their corresponding low-level C/C++ functions, enabling correlation with hardware counters from tools like Intel VTune or AMD uProf.

Together, they provide full-stack visibility into preprocessing performance at both the software and hardware levels.
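The core idea behind LotusTrace can be sketched as per-operation instrumentation of a composed transform chain. The sketch below is a minimal stdlib-only illustration, not Lotus's actual implementation: `TimedCompose` and the `decode`/`resize`/`normalize` stand-ins are hypothetical names invented for this example.

```python
# Minimal sketch (NOT Lotus's implementation) of per-operation timing
# inside a preprocessing pipeline: each transform in a composed chain is
# wrapped so its individual wall-clock time is recorded, analogous to how
# LotusTrace instruments preprocessing steps in PyTorch's DataLoader.
import time

class TimedCompose:
    """Apply transforms in sequence, recording per-op wall-clock durations."""
    def __init__(self, transforms):
        self.transforms = transforms
        self.records = []  # (op_name, seconds) for each applied transform

    def __call__(self, sample):
        for t in self.transforms:
            start = time.perf_counter()
            sample = t(sample)
            self.records.append((t.__name__, time.perf_counter() - start))
        return sample

# Hypothetical stand-ins for typical image-preprocessing steps.
def decode(x):
    return [float(v) for v in x]

def resize(x):
    return x[: len(x) // 2]

def normalize(x):
    m = max(x) or 1.0
    return [v / m for v in x]

pipeline = TimedCompose([decode, resize, normalize])
out = pipeline(list(range(8)))
for name, secs in pipeline.records:
    print(f"{name}: {secs * 1e6:.1f} us")
```

Reporting microsecond-scale durations per operation is the point: as the results below note, many preprocessing ops run in under 100µs, below the sampling resolution of conventional profilers.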

4. Results / Observations

  • Short-lived ops dominate: Most preprocessing operations take less than 10ms; many are under 100µs, making them invisible to traditional profilers.
  • High variance in batch times: Variability in image sizes and randomness in transforms lead to a 5–15% standard deviation in batch times, complicating resource provisioning.
  • Out-of-order arrivals hurt performance: Shared queues between DataLoader workers cause batches to arrive out of order, introducing main-process wait times and delaying GPU consumption.
  • Diminishing returns with more workers (cores): Increasing the number of DataLoader workers initially reduces job time, but beyond a threshold (e.g., 20), additional workers increase CPU contention with minimal end-to-end gains.
  • Lower overhead, richer insight: Compared to profilers such as py-spy, Austin, and Scalene, Lotus provides fine-grained batch-level instrumentation while incurring under 2% runtime overhead.
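The out-of-order-arrival observation can be illustrated with a toy model (hypothetical, not from the paper): workers finish batches at given times, but the main process must hand them to the GPU in batch-index order, so a straggling early-indexed batch delays everything behind it even when the same total preprocessing work is done.

```python
# Toy model of out-of-order batch arrivals from DataLoader-style workers.
# A batch can only be delivered to the GPU once it AND all earlier-indexed
# batches have finished, so its delivery time is the running max of finish
# times up to its index.
def delivery_times(finish_times):
    """finish_times[i] = time at which batch i becomes available.
    Returns the time each batch is actually handed to the GPU, assuming
    strict in-index-order, instantaneous consumption."""
    out, now = [], 0.0
    for t in finish_times:
        now = max(now, t)  # stall until this batch and all prior ones exist
        out.append(now)
    return out

# Same three batches, same per-batch finish times, but produced in
# opposite index order: every batch is now delivered at t=3.0.
in_order = delivery_times([1.0, 2.0, 3.0])
shuffled = delivery_times([3.0, 2.0, 1.0])
print(in_order, shuffled)
```

In the in-order case batches stream to the GPU as they finish; in the shuffled case batches 1 and 2 sit idle in the shared queue while the main process waits for batch 0, which is the wait-time effect the paper measures.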