# Data engineering tips for cyclists

Clean data is not an IT hobby — it's a performance lever. For quant-minded cyclists and coaches, small improvements in training data quality yield outsized gains in adaptive systems: more reliable CTL/ATL trends, clearer readiness signals, and smarter next-session prescriptions. This short guide gives practical, science-based steps you can apply today to reduce noise, eliminate duplicates, and keep your time series consistent so the algorithm — and your coach — can do the right thing.

## Why data hygiene matters (quick physiology link)

Adaptive coaches use rolling summaries (CTL, ATL, TSB), HRV trends, and interval detection to make decisions. **Garbage in → misleading load and readiness outputs**. A duplicate ride can double-count TSS; a 20-minute gap in power can shift TSB by days. Clean data protects the physiology model and your training margin.

## Core principles: simplicity over complexity

- **Idempotency:** ingest a ride once and only once. If the same file reappears, merge or drop it.
- **Provenance:** track device, firmware, and upload path (Garmin, Wahoo, Strava). Metadata explains weird spikes.
- **Minimal transformation at write-time:** validate on ingest; postpone heavy analytics to downstream jobs.

## Practical checks to improve training data quality

### 1) Basic validation rules

- **Range checks:** power 0–2,500 W, HR 30–220 bpm, cadence 0–220 rpm. Flag values outside plausible ranges.
- **Monotonic timestamp checks:** no timestamps that go backwards; reject or repair files with negative time deltas.
- **Duration sanity:** rides under 1 minute or exceeding 24 hours should be reviewed.
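The three rules above can be sketched as a single validator. This is a minimal sketch, not a production schema — the function name and thresholds simply mirror the list:

```python
def validate_ride(timestamps, power, hr):
    """Return a list of issue strings; an empty list means the ride passes.

    timestamps: seconds (monotonic expected), power: watts, hr: bpm.
    """
    issues = []
    # Range checks: flag implausible sensor values.
    if any(not (0 <= w <= 2500) for w in power):
        issues.append("power out of range")
    if any(not (30 <= b <= 220) for b in hr):
        issues.append("hr out of range")
    # Monotonic timestamp check: reject negative time deltas.
    if any(b < a for a, b in zip(timestamps, timestamps[1:])):
        issues.append("non-monotonic timestamps")
    # Duration sanity: under 1 minute or over 24 hours needs review.
    duration_s = timestamps[-1] - timestamps[0] if timestamps else 0
    if duration_s < 60 or duration_s > 24 * 3600:
        issues.append("implausible duration")
    return issues
```

Run it at ingest and surface the issue list to the user rather than silently repairing the file.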

### 2) Metadata matters — always capture it

- Device model and serial
- Firmware version
- Upload source and timestamps (device recorded vs server receipt)

This metadata lets you identify systemic issues (e.g., an old firmware that doubles cadence samples).
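A small provenance record is enough to make those diagnoses possible. Field names here are illustrative assumptions, not a fixed schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RideProvenance:
    """Immutable provenance attached to every ingested ride."""
    device_model: str       # e.g. head unit or trainer model
    device_serial: str
    firmware_version: str   # explains systemic artifacts after updates
    upload_source: str      # e.g. "garmin_connect", "wahoo", "strava"
    recorded_at_utc: str    # timestamp written by the device
    received_at_utc: str    # timestamp at server ingest
```

Because the record is frozen, it can travel with the ride through every downstream job without being mutated.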

## Deduplication: stop double-counting your progress

Duplicates create the largest, most obvious distortion in load-based coaching. Use a layered approach:

1. **Exact-match dedupe:** compute a file hash (SHA-256) and reject re-uploads with the same hash.
2. **Key-match dedupe:** match by (device id, start_time_utc, duration, total_distance). If all match, mark duplicate.
3. **Fuzzy-match dedupe:** for near-duplicates (small time offsets, truncated files), compute a similarity score on key metrics (avg power, TSS, moving-average power curve). If similarity exceeds threshold, prompt to merge.

When merging, **keep the best-quality stream** (longest contiguous power, least missingness) and preserve provenance fields.
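The three layers can be sketched as follows. The ride dictionaries, metric names, and the 0.95 threshold are assumptions for illustration; a real similarity score would also compare the moving-average power curve:

```python
import hashlib

def file_hash(raw_bytes: bytes) -> str:
    # Layer 1: exact-match dedupe via a SHA-256 content hash.
    return hashlib.sha256(raw_bytes).hexdigest()

def key_match(a: dict, b: dict) -> bool:
    # Layer 2: same device, start time, duration, and distance.
    keys = ("device_id", "start_time_utc", "duration_s", "distance_m")
    return all(a[k] == b[k] for k in keys)

def fuzzy_score(a: dict, b: dict) -> float:
    # Layer 3: similarity on summary metrics for near-duplicates
    # (truncated files, small clock offsets).
    def rel_diff(x, y):
        return abs(x - y) / max(x, y, 1e-9)
    return 1.0 - max(rel_diff(a["avg_power"], b["avg_power"]),
                     rel_diff(a["tss"], b["tss"]))

def is_duplicate(a: dict, b: dict, fuzzy_threshold: float = 0.95) -> bool:
    return key_match(a, b) or fuzzy_score(a, b) >= fuzzy_threshold
```

Run the exact hash check first (it is cheap and unambiguous), and only escalate to fuzzy matching against a recent window of rides.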

For a user-facing explanation of why duplicates appear and how to prevent them, see: /knowledge-base/archive-strava-duplicates-merge

## Time series consistency: align samples so models behave

Adaptive algorithms expect comparable, regularized inputs. Inconsistent sampling rates or timezone drift adds noise.

- **Normalize sampling frequency:** resample power/HR/cadence to a standard base (e.g., 1s or 1Hz) using forward-fill for short gaps and NaN for longer gaps.
- **Flag long gaps (>30s) vs micro-gaps (<5s):** treat them differently — interpolate micro-gaps, mark long gaps for imputation or exclusion.
- **Use UTC everywhere:** store and compare start times in UTC to avoid DST/timezone issues.
- **Keep lap/interval boundaries consistent:** when mapping completed rides to planned workouts, use start-time tolerance (±60s) and interval shape matching, not just name matching.
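One way to sketch the resampling and gap-flagging steps with pandas, assuming 1 Hz as the canonical base and a 5 s forward-fill limit for micro-gaps:

```python
import pandas as pd

def resample_power(timestamps_utc, watts, micro_gap_s=5):
    """Resample an irregular power stream to a canonical 1 Hz grid.

    Forward-fill covers micro-gaps (<= micro_gap_s seconds); longer
    gaps stay NaN and are returned as a boolean mask for review.
    """
    s = pd.Series(watts, index=pd.to_datetime(timestamps_utc, utc=True))
    regular = s.resample("1s").mean()           # canonical 1 s grid (UTC)
    filled = regular.ffill(limit=micro_gap_s)   # bridge micro-gaps only
    long_gap = filled.isna()                    # mark long gaps explicitly
    return filled, long_gap
```

Downstream jobs can then decide per-gap whether to impute or exclude, instead of inheriting silently interpolated watts.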

## Missing data and imputation — be conservative

- **Never invent metrics for long gaps.** Short, physiologically plausible interpolation (power for <10s) is acceptable; larger holes should be marked as missing.
- **Use domain-specific imputation:** if cadence is missing but power and speed exist, avoid fabricating cadence-driven metrics that feed automatic analysis.
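A conservative gap policy might look like this sketch: linearly bridge only short power gaps with valid samples on both sides, and leave everything else missing. The 10 s cutoff matches the guideline above; the 1 Hz assumption is illustrative:

```python
def impute_power(power, max_gap_s=10):
    """power: list of 1 Hz watt samples with None for missing values."""
    out = list(power)
    i = 0
    while i < len(out):
        if out[i] is None:
            j = i
            while j < len(out) and out[j] is None:
                j += 1                      # find the end of the gap
            gap = j - i
            # Bridge only short, physiologically plausible gaps that
            # have real samples on both sides; never invent long holes.
            if gap < max_gap_s and i > 0 and j < len(out):
                step = (out[j] - out[i - 1]) / (gap + 1)
                for k in range(gap):
                    out[i + k] = out[i - 1] + step * (k + 1)
            i = j
        else:
            i += 1
    return out
```

Keeping `None` for long holes means downstream metrics can exclude them instead of averaging over fabricated watts.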

## Signal cleaning: keep physiological meaning

- **Smoothing vs. distortion:** light rolling median (3–5s) removes spikes without depressing peak power. Avoid heavy low-pass filters that remove sprints.
- **Drift detection:** compare early-ride vs late-ride power bias on long rides; a large, consistent offset suggests power-meter drift. See power meter calibration best practices in N+One (link internal) for daily habits that reduce drift.
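The smoothing point can be sketched with a short rolling median (window of 3–5 samples at an assumed 1 Hz), which removes single-sample spikes while leaving sustained sprints intact:

```python
import statistics

def rolling_median(power, window=3):
    """Centered rolling median; window shrinks at the edges."""
    half = window // 2
    out = []
    for i in range(len(power)):
        lo, hi = max(0, i - half), min(len(power), i + half + 1)
        out.append(statistics.median(power[lo:hi]))
    return out
```

A lone 1500 W glitch disappears, but a multi-second sprint survives because every sample in the window agrees with it.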

## Lightweight ETL checklist for athletes and coaches

1. Ingest file → compute hash → check exact duplicates.
2. Parse metadata; validate ranges and monotonic timestamps.
3. Resample to a canonical rate; mark gaps and their length.
4. Apply fuzzy dedupe against recent rides (48–72h window).
5. Flag anomalies and present them to the user for review (e.g., "High probability duplicate").
6. Store both raw and cleaned streams; keep lineage for audits.

## Low-effort habits cyclists can adopt today

- **Zero-offset (calibrate) your power meter regularly.** Small hardware habits prevent large downstream corrections.
- **Connect integrations in the recommended order** (device → platform → coach) to avoid duplicate uploads from multiple paths. If you see delayed or missing syncs, consult /knowledge-base/archive-garmin-sync-troubleshooting
- **Name sessions and use manual tags** for interval days; consistent naming improves fuzzy matching and planned-workout mapping.
- **Avoid in-ride file splitting:** some devices create multiple files for pauses — merge them on export or enable auto-merge settings.

## Why this matters for adaptive systems (N+One’s edge)

Adaptive coaches update your plan based on recent stress and recovery. **A single duplicate ride can inflate ATL, lower TSB, and push the system to soft-prescribe recovery weeks you don’t need.** Conversely, missing or corrupted rides can hide fitness gains.

Clean inputs produce clearer readiness signals (HRV, RHR, duration-weighted TSS) and more confident next-session recommendations — the N+One promise of "The Next Session."

## Troubleshooting cheat-sheet (fast wins)

- Duplicate TSS? Check hashes and device upload paths first.
- Unexpected TSB swing? Look for overlapping rides or timezone-shifted start times.
- Weird HR spikes? Compare device firmware and sampling rate; smooth 3–5s median.
- Inconsistent power between indoor/outdoor? See indoor–outdoor power differences guidance and check calibration.

## Conclusion — Key takeaways

- **Data hygiene is a direct performance lever.** Clean, consistent inputs let adaptive models recommend the right next session.
- **Prevent duplicates with hashing + fuzzy matching.** Merge conservatively; preserve raw streams.
- **Standardize time series (sampling rate, UTC) and flag long gaps.** Conservative imputation preserves physiological meaning.
- **Adopt easy habits:** routine power-meter zero-offset, consistent upload order, and clear session naming.

Try these tactics this week: fix one recurring duplicate or add a hash-based check to your workflow, then watch CTL/ATL/TSB stability improve. For hands-on help with sync issues and best setup order, see /knowledge-base/archive-garmin-sync-troubleshooting and /knowledge-base/archive-strava-duplicates-merge.

Ready to turn cleaner data into smarter sessions? Sign up for N+One and let adaptive coaching use your clean signal to pick The Next Session.

