1. Problem

ML pipelines often experience significant slowdowns in the data preprocessing stage. Existing tools either lack visibility into fine-grained operation timings or cannot link high-level Python functions with low-level hardware behavior, making bottlenecks hard to diagnose.
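To make "fine-grained operation timings" concrete, here is a minimal sketch of per-operation timing in a Python preprocessing pipeline. It is not Lotus's implementation: the `TimedCompose` class and the toy `normalize` and `square` transforms are hypothetical names introduced only for illustration, and the sketch measures wall-clock time at the Python level only, so it does not reach the hardware behavior described above.

```python
import time
from collections import defaultdict


class TimedCompose:
    """Hypothetical helper: apply transforms in order and record
    the wall-clock time spent in each one."""

    def __init__(self, transforms):
        self.transforms = list(transforms)
        self.elapsed = defaultdict(float)  # op name -> total seconds
        self.calls = defaultdict(int)      # op name -> number of calls

    def __call__(self, sample):
        for op in self.transforms:
            # Use the function name, or the class name for callable objects.
            name = getattr(op, "__name__", type(op).__name__)
            start = time.perf_counter()
            sample = op(sample)
            self.elapsed[name] += time.perf_counter() - start
            self.calls[name] += 1
        return sample

    def report(self):
        """Return (op name, total seconds) pairs, slowest first."""
        return sorted(self.elapsed.items(), key=lambda kv: kv[1], reverse=True)


# Toy transforms standing in for real preprocessing steps (assumed).
def normalize(xs):
    hi = max(xs)
    return [x / hi for x in xs]


def square(xs):
    return [x * x for x in xs]


pipeline = TimedCompose([normalize, square])
out = pipeline([1.0, 2.0, 4.0])

for name, seconds in pipeline.report():
    print(f"{name}: {seconds:.6f} s over {pipeline.calls[name]} call(s)")
```

A wrapper like this shows which Python-level step is slow, but not why it is slow on the CPU, which is the gap the summary describes.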

2. Motivation

Preprocessing can consume up to 65% of training time, which leaves GPUs underutilized while they wait for data. No existing tool bridges the semantic gap between Python-level operations and CPU microarchitectural activity, so profiling results rarely translate into actionable optimizations.

3. Contribution / Solution

The authors introduce Lotus, a profiling framework consisting of:

Together, they provide full-stack visibility into preprocessing performance at both the software and hardware level.

4. Results / Observations