Clean data is not an IT hobby — it's a performance lever. For quant-minded cyclists and coaches, small improvements in training data quality yield outsized gains in adaptive systems: more reliable CTL/ATL trends, clearer readiness signals, and smarter next-session prescriptions. This short guide gives practical, science-based steps you can apply today to reduce noise, eliminate duplicates, and keep your time series consistent so the algorithm — and your coach — can do the right thing.
Why data hygiene matters (quick physiology link)
Adaptive coaches use rolling summaries (CTL, ATL, TSB), HRV trends, and interval detection to make decisions. Garbage in → misleading load and readiness outputs. A duplicate ride can double-count TSS; a 20-minute gap in power can shift TSB by days. Clean data protects the physiology model and your training margin.
Core principles: simplicity over complexity
- Idempotency: ingest a ride once and only once. If the same file reappears, merge or drop it.
- Provenance: track device, firmware, and upload path (Garmin, Wahoo, Strava). Metadata explains weird spikes.
- Minimal transformation at write-time: validate on ingest; postpone heavy analytics to downstream jobs.
Practical checks to improve training data quality
1) Basic validation rules
- Range checks: power 0–2,500 W, HR 30–220 bpm, cadence 0–200 rpm. Flag values outside plausible ranges.
- Monotonic timestamp checks: no timestamps that go backwards; reject or repair files with negative time deltas.
- Duration sanity: rides under 1 minute or exceeding 24 hours should be reviewed.
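The range and timestamp rules above can be sketched as a small validator. This is a minimal illustration, not a real FIT/TCX parser: the sample schema (`t`, `power`, `hr`, `cadence` keys) and the flag labels are assumptions for the example.

```python
def validate_samples(samples):
    """Flag implausible sensor values and broken timestamps.

    `samples` is a list of dicts with illustrative keys:
    t (unix seconds), power (W), hr (bpm), cadence (rpm).
    Returns a list of (sample_index, reason) flags.
    """
    flags = []
    prev_t = None
    for i, s in enumerate(samples):
        # Range checks: outside plausible physiology/hardware bounds.
        if not (0 <= s.get("power", 0) <= 2500):
            flags.append((i, "power_out_of_range"))
        if "hr" in s and not (30 <= s["hr"] <= 220):
            flags.append((i, "hr_out_of_range"))
        if "cadence" in s and not (0 <= s["cadence"] <= 200):
            flags.append((i, "cadence_out_of_range"))
        # Monotonic timestamps: flag negative or zero time deltas.
        if prev_t is not None and s["t"] <= prev_t:
            flags.append((i, "non_monotonic_timestamp"))
        prev_t = s["t"]
    return flags
```

A flagged sample is presented for review rather than silently repaired, which keeps the ingest step simple and auditable.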
2) Metadata matters — always capture it
- Device model and serial
- Firmware version
- Upload source and timestamps (device recorded vs server receipt)
This metadata lets you identify systemic issues (e.g., an old firmware that doubles cadence samples).
Deduplication: stop double-counting your progress
Duplicates create the largest, most obvious distortion in load-based coaching. Use a layered approach:
- Exact-match dedupe: compute a file hash (SHA-256) and reject re-uploads with the same hash.
- Key-match dedupe: match by (device id, start_time_utc, duration, total_distance). If all match, mark duplicate.
- Fuzzy-match dedupe: for near-duplicates (small time offsets, truncated files), compute a similarity score on key metrics (avg power, TSS, moving-average power curve). If similarity exceeds a threshold, prompt the user to merge.
When merging, keep the best-quality stream (longest contiguous power, fewest missing samples) and preserve provenance fields.
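The three dedupe layers can be sketched as follows. The ride-summary field names (`device_id`, `start_time_utc`, `avg_power`, etc.) and the tolerance values are assumptions for illustration; a production system would also compare the power curve, not just averages.

```python
import hashlib

def file_sha256(data: bytes) -> str:
    # Exact-match layer: identical bytes -> identical digest.
    return hashlib.sha256(data).hexdigest()

def dedupe_key(ride: dict) -> tuple:
    # Key-match layer: same device, start time, duration, and
    # distance almost certainly means the same ride.
    return (ride["device_id"], ride["start_time_utc"],
            ride["duration_s"], round(ride["distance_m"]))

def is_fuzzy_duplicate(a: dict, b: dict,
                       time_tol_s: int = 120,
                       power_tol: float = 0.02) -> bool:
    # Fuzzy layer: catch truncated or time-shifted re-uploads by
    # comparing start times and average power within tolerances.
    if a["device_id"] != b["device_id"]:
        return False
    if abs(a["start_time_utc"] - b["start_time_utc"]) > time_tol_s:
        return False
    ref = max(a["avg_power"], b["avg_power"], 1)
    return abs(a["avg_power"] - b["avg_power"]) / ref <= power_tol
```

Run the cheap exact check first, then the key match, and only fall back to fuzzy matching for uploads that slip past both.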
For a user-facing explanation of why duplicates appear and how to prevent them, see: /knowledge-base/archive-strava-duplicates-merge
Time series consistency: align samples so models behave
Adaptive algorithms expect comparable, regularized inputs. Inconsistent sampling rates or timezone drift adds noise.
- Normalize sampling frequency: resample power/HR/cadence to a standard base (e.g., 1 Hz, one sample per second) using forward-fill for short gaps and NaN for longer gaps.
- Flag long gaps (>30s) vs micro-gaps (<5s): treat them differently — interpolate micro-gaps, mark long gaps for imputation or exclusion.
- Use UTC everywhere: store and compare start times in UTC to avoid DST/timezone issues.
- Keep lap/interval boundaries consistent: when mapping completed rides to planned workouts, use start-time tolerance (±60s) and interval shape matching, not just name matching.
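The resampling and gap-flagging rules above can be sketched in a few lines. This is a simplified illustration assuming integer unix-second timestamps in UTC; the 5-second micro-gap cutoff matches the guideline above and is configurable.

```python
import math

def resample_1hz(points, micro_gap_s=5):
    """Resample (t, value) pairs to a 1 Hz grid.

    Forward-fills gaps up to `micro_gap_s` seconds; longer gaps
    become NaN so downstream code can impute or exclude them
    explicitly instead of silently inheriting stale values.
    """
    if not points:
        return []
    points = sorted(points)
    t0, t_end = points[0][0], points[-1][0]
    out, idx = [], 0
    last_t, last_v = points[0]
    for t in range(t0, t_end + 1):
        # Advance to the latest real sample at or before t.
        while idx < len(points) and points[idx][0] <= t:
            last_t, last_v = points[idx]
            idx += 1
        if t - last_t <= micro_gap_s:
            out.append((t, last_v))    # exact or forward-filled
        else:
            out.append((t, math.nan))  # long gap: mark missing
    return out
```

Keeping long gaps as NaN rather than filled values is what lets a later step decide between imputation and exclusion.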
Missing data and imputation — be conservative
- Never invent metrics for long gaps. Short, physiologically plausible interpolation (power for <10s) is acceptable; larger holes should be marked as missing.
- Use domain-specific imputation: if cadence is missing but power and speed exist, avoid fabricating cadence-driven metrics that feed automatic analysis.
Signal cleaning: keep physiological meaning
- Smoothing vs. distortion: light rolling median (3–5s) removes spikes without depressing peak power. Avoid heavy low-pass filters that remove sprints.
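A light rolling median of the kind described above might look like this sketch: single-sample spikes are removed, while a genuine multi-second surge passes through because the median of several consecutive high samples is still high.

```python
from statistics import median

def rolling_median(values, window=3):
    """Centered rolling median over a power/HR stream.

    `window` should be small and odd (3-5 samples at 1 Hz): large
    enough to kill one-sample spikes, small enough to leave real
    sprints intact. Edges use a truncated window.
    """
    half = window // 2
    out = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        out.append(median(values[lo:hi]))
    return out
```

Compare this with a moving average, which would smear a 1,500 W spike across its neighbors instead of discarding it.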
- Drift detection: compare start vs end power bias on long rides; large drifts indicate power meter drift. See power meter calibration best practices in N+One (link internal) for daily habits that reduce drift.
Lightweight ETL checklist for athletes and coaches
- Ingest file → compute hash → check exact duplicates.
- Parse metadata; validate ranges and monotonic timestamps.
- Resample to a canonical rate; mark gaps and their length.
- Apply fuzzy dedupe against recent rides (48–72h window).
- Flag anomalies and present them to the user for review (e.g., "High probability duplicate").
- Store both raw and cleaned streams; keep lineage for audits.
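The first steps of this checklist can be tied together in a minimal in-memory sketch. `RideStore`, its `ingest` method, and the caller-supplied `parse` function are hypothetical names for illustration; a real pipeline would persist raw and cleaned streams with full lineage rather than hold them in memory.

```python
import hashlib

class RideStore:
    """Toy ingest pipeline: hash, exact-dedupe, parse, store."""

    def __init__(self):
        self.seen_hashes = set()
        self.rides = []

    def ingest(self, raw: bytes, parse):
        digest = hashlib.sha256(raw).hexdigest()
        if digest in self.seen_hashes:
            # Exact duplicate: reject before any parsing work.
            return {"status": "duplicate", "hash": digest}
        self.seen_hashes.add(digest)
        ride = parse(raw)  # caller-supplied parser/validator
        # Keep both the raw bytes and the parsed result for audits.
        self.rides.append({"hash": digest, "raw": raw, "ride": ride})
        return {"status": "ingested", "hash": digest}
```

Hashing before parsing means a re-uploaded file costs one digest computation, not a full parse-validate-dedupe pass.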
Low-effort habits cyclists can adopt today
- Zero-offset (calibrate) your power meter regularly. Small hardware habits prevent large downstream corrections.
- Connect integrations in the recommended order (device → platform → coach) to avoid duplicate uploads from multiple paths. If you see delayed or missing syncs, consult /knowledge-base/archive-garmin-sync-troubleshooting
- Name sessions and use manual tags for interval days; consistent naming improves fuzzy matching and planned-workout mapping.
- Avoid in-ride file splitting: some devices create multiple files for pauses — merge them on export or enable auto-merge settings.
Why this matters for adaptive systems (N+One’s edge)
Adaptive coaches update your plan based on recent stress and recovery. A single duplicate ride can inflate ATL, lower TSB, and push the system to soft-prescribe recovery weeks you don’t need. Conversely, missing or corrupted rides can hide fitness gains.
Clean inputs produce clearer readiness signals (HRV, RHR, duration-weighted TSS) and more confident next-session recommendations — the N+One promise of "The Next Session."
Troubleshooting cheat-sheet (fast wins)
- Duplicate TSS? Check hashes and device upload paths first.
- Unexpected TSB swing? Look for overlapping rides or timezone-shifted start times.
- Weird HR spikes? Compare device firmware and sampling rate; apply a 3–5 s rolling median.
- Inconsistent power between indoor/outdoor? See indoor–outdoor power differences guidance and check calibration.
Conclusion — Key takeaways
- Data hygiene is a direct performance lever. Clean, consistent inputs let adaptive models recommend the right next session.
- Prevent duplicates with hashing + fuzzy matching. Merge conservatively; preserve raw streams.
- Standardize time series (sampling rate, UTC) and flag long gaps. Conservative imputation preserves physiological meaning.
- Adopt easy habits: routine power-meter zero-offset, consistent upload order, and clear session naming.
Try these tactics this week: fix one recurring duplicate or add a hash-based check to your workflow, then watch CTL/ATL/TSB stability improve. For hands-on help with sync issues and best setup order, see /knowledge-base/archive-garmin-sync-troubleshooting and /knowledge-base/archive-strava-duplicates-merge.
Ready to turn cleaner data into smarter sessions? Sign up for N+One and let adaptive coaching use your clean signal to pick The Next Session.