Gradient Descent, Three Reps at a Time
May 2026
This is a companion to Linear Regression, Explained Through CrossFit. That post introduced gradient descent as the mechanism that adjusts weights to reduce error — this one walks through what that actually looks like, step by step, using the same 1RM dataset.
The setup
In the main post, the model learns four weights — one for reps, one for weight lifted, one for weeks trained, one for sleep. For the purpose of walking through the arithmetic, we're going to simplify: one feature, one weight, one bias. Everything that's true here is equally true for all four features. There's just more arithmetic.
The feature we'll use is weeks_trained. The training data (same rows as before):
| weeks_trained | actual 1RM (kg) |
|---|---|
| 4 | 140 |
| 6 | 148 |
| 8 | 155 |
| 10 | 152 |
| 12 | 165 |
The model we're trying to learn:
predicted 1RM = w × weeks_trained + b
We start with both unknowns set to zero:
w = 0, b = 0
That's the blank logbook. The model knows nothing yet.
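If it helps to see that setup as code, here's a minimal sketch (the variable and function names are mine, not anything from the main post):

```python
# The five training rows from the table above: (weeks_trained, actual 1RM in kg).
data = [(4, 140), (6, 148), (8, 155), (10, 152), (12, 165)]

# The model: predicted 1RM = w * weeks_trained + b
def predict(w, b, weeks_trained):
    return w * weeks_trained + b

# The blank logbook: both parameters start at zero, so every prediction is 0 kg.
w, b = 0.0, 0.0
print([predict(w, b, x) for x, _ in data])  # [0.0, 0.0, 0.0, 0.0, 0.0]
```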
The update rule
When the model makes a prediction and gets it wrong, it adjusts w and b in proportion to the error. The update rule for a single data point is:
error = predicted − actual
w_new = w − α × error × weeks_trained
b_new = b − α × error
α (alpha) is the learning rate — how large a step to take after each mistake. We'll use α = 0.001. Small enough to stay stable, large enough to see the weights move.
The sign matters. If error is negative (we predicted too low), subtracting a negative nudges the weight up — which is what we want. If error is positive (we predicted too high), it nudges the weight down.
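In code, the whole rule fits in a few lines. This is only a sketch; update_step and its argument names are my own labels, with alpha defaulting to the 0.001 we'll use throughout:

```python
ALPHA = 0.001  # learning rate

def update_step(w, b, weeks_trained, actual, alpha=ALPHA):
    """One gradient-descent update from a single (weeks_trained, actual 1RM) pair."""
    predicted = w * weeks_trained + b
    error = predicted - actual              # negative means we predicted too low
    w = w - alpha * error * weeks_trained   # weight moves with the error and the feature value
    b = b - alpha * error                   # bias moves with the error alone
    return w, b
```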
Iteration 1 — first session, cold start
We train on the first row: weeks_trained = 4, actual 1RM = 140 kg.
Prediction:
predicted = 0 × 4 + 0 = 0 kg
The model predicts zero. It knows nothing — this is expected.
Error:
error = 0 − 140 = −140
Off by the full 140 kg, on the low side.
Weight updates:
w_new = 0 − 0.001 × (−140 × 4) = 0 + 0.56 = 0.56
b_new = 0 − 0.001 × (−140) = 0 + 0.14 = 0.14
Because the error was negative (we guessed too low), the update is positive: both parameters move up. The model has seen one data point and made its first ever adjustment.
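If you'd rather let Python do the arithmetic, the same step looks like this (a standalone sketch; the numbers match the ones above up to floating-point rounding):

```python
alpha = 0.001
w, b = 0.0, 0.0
x, actual = 4, 140              # row 1: weeks_trained, actual 1RM

error = (w * x + b) - actual    # 0 - 140 = -140
w = w - alpha * error * x       # 0 + 0.56
b = b - alpha * error           # 0 + 0.14
print(w, b)                     # roughly 0.56 and 0.14
```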
Iteration 2 — second session, slight improvement
We train on the 8-week row: weeks_trained = 8, actual 1RM = 155 kg.
Prediction using the weights from iteration 1:
predicted = 0.56 × 8 + 0.14 = 4.48 + 0.14 = 4.62 kg
Better than zero, but still badly off.
Error:
error = 4.62 − 155 = −150.38
Still large and negative. The model still underestimates.
Weight updates:
w_new = 0.56 − 0.001 × (−150.38 × 8) = 0.56 + 1.20 = 1.76
b_new = 0.14 − 0.001 × (−150.38) = 0.14 + 0.15 = 0.29
w jumped from 0.56 to 1.76 — a much larger jump than the first iteration because the error term was bigger and it's multiplied by weeks_trained = 8 (double the 4 from row 1). The model is correcting more aggressively.
Iteration 3 — third session, weights gaining ground
We train on the 12-week row: weeks_trained = 12, actual 1RM = 165 kg.
Prediction:
predicted = 1.76 × 12 + 0.29 = 21.12 + 0.29 = 21.41 kg
Error:
error = 21.41 − 165 = −143.59
Still a big undershoot — 12 weeks in, the model is only predicting 21 kg.
Weight updates:
w_new = 1.76 − 0.001 × (−143.59 × 12) = 1.76 + 1.72 = 3.48
b_new = 0.29 − 0.001 × (−143.59) = 0.29 + 0.14 = 0.43
Where we stand after three reps
| After iteration | w | b | Prediction on row 1 (x=4) | Actual 1RM | Error |
|---|---|---|---|---|---|
| 0 (start) | 0.00 | 0.00 | 0 kg | 140 kg | −140 |
| 1 | 0.56 | 0.14 | 2.38 kg | 140 kg | −138 |
| 2 | 1.76 | 0.29 | 7.33 kg | 140 kg | −133 |
| 3 | 3.48 | 0.43 | 14.35 kg | 140 kg | −126 |
The predictions are still terrible. The error on row 1 has shrunk from −140 to −126 across three iterations — progress, but not nearly enough to be useful.
This is normal. Three iterations is three reps into a 10-week strength cycle. The model needs to pass through the full dataset thousands of times before the weights stabilize.
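A short loop replays all three iterations and reproduces the table. One detail: the walkthrough rounds to two decimals after each step, so the code does the same, which keeps the printed numbers identical (a sketch; the names and layout are mine):

```python
alpha = 0.001
w, b = 0.0, 0.0
steps = [(4, 140), (8, 155), (12, 165)]   # the three rows used above, in order

for i, (x, actual) in enumerate(steps, start=1):
    error = (w * x + b) - actual
    w = round(w - alpha * error * x, 2)   # round like the hand arithmetic above
    b = round(b - alpha * error, 2)
    print(f"after iteration {i}: w={w:.2f}, b={b:.2f}, "
          f"row-1 prediction={w * 4 + b:.2f} kg")
```

Run it and you get the same trajectory: w climbing 0.56 → 1.76 → 3.48 while the row-1 prediction crawls from 2.38 kg to 14.35 kg.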
What convergence actually looks like
A fully trained model on this simplified dataset (one feature, five data points) converges to roughly:
w ≈ 2.7, b ≈ 130
Meaning: each additional week of training adds about 2.7 kg to the predicted 1RM, and a lifter at zero weeks starts from a baseline of roughly 130 kg. That is close to the eyeball estimate from the data: the endpoints span 140–165 kg across 4–12 weeks, a spread of 25 kg over 8 weeks, or about 3.1 kg/week. The fitted slope comes out a little flatter because the 10-week row (152 kg) sits below the trend.
After iteration 3, w = 3.48 has already overshot that converged slope, and it will keep climbing for a while before easing back down: with b still near zero, w is temporarily covering for the missing baseline. The bias is the slow part. It only changes by α × error each step, with no feature value multiplying it, so closing the gap from 0.43 to roughly 130 takes thousands of passes through the data. It'll get there, but it needs more reps.
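You can watch both halves of that (w settling while b slowly grinds upward) by cycling through all five rows for a long time. A sketch under the same α = 0.001; the epoch count is arbitrary, just large enough for the bias to get there:

```python
alpha = 0.001
data = [(4, 140), (6, 148), (8, 155), (10, 152), (12, 165)]
w, b = 0.0, 0.0

for _ in range(20_000):             # thousands of full passes; the bias needs most of them
    for x, actual in data:
        error = (w * x + b) - actual
        w -= alpha * error * x
        b -= alpha * error

print(f"w = {w:.2f}, b = {b:.1f}")  # lands near w ≈ 2.7, b ≈ 130
```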
Why the feature value is in the weight update
Look at the update rule for w again:
w_new = w − α × error × weeks_trained
The × weeks_trained term means: the bigger the feature value, the larger the step. This makes sense — if a feature had a value of zero, it contributed nothing to the prediction, so its weight doesn't need to change regardless of the error. Only features that were actually "active" in the prediction get adjusted.
In CrossFit terms: if you didn't touch the barbell last session, the coach isn't going to critique your bar path.
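A quick check makes the same point. With the feature at zero, the weight's update term is zero no matter how big the error is; the parameter values here are made up purely for illustration:

```python
alpha = 0.001
w, b = 2.0, 0.5          # hypothetical current parameters
x, actual = 0.0, 150.0   # the feature is zero this session

error = (w * x + b) - actual
w_new = w - alpha * error * x   # error * 0 = 0, so w is untouched
b_new = b - alpha * error       # the bias still moves
print(w_new, b_new)             # 2.0 and roughly 0.65
```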
The four-feature version is identical in structure
With all four features from the main post, each weight gets its own update:
error = predicted − actual
w₁_new = w₁ − α × error × reps
w₂_new = w₂ − α × error × weight_lifted
w₃_new = w₃ − α × error × weeks_trained
w₄_new = w₄ − α × error × hours_of_sleep
b_new = b − α × error
Five updates instead of two, applied simultaneously. The intuition is the same: each weight shifts in the direction that would have reduced the error on this training point. Do that thousands of times across all training rows and the model converges.
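Here's a sketch of what that looks like in code. The feature names come from the main post; the function name, the list layout, and the example session at the bottom are mine and purely illustrative:

```python
ALPHA = 0.001

def update_step(weights, bias, features, actual, alpha=ALPHA):
    """One update with four features: reps, weight_lifted, weeks_trained, hours_of_sleep."""
    predicted = sum(w * x for w, x in zip(weights, features)) + bias
    error = predicted - actual
    # Every weight shifts at once, each in proportion to the error and its own feature value.
    new_weights = [w - alpha * error * x for w, x in zip(weights, features)]
    new_bias = bias - alpha * error
    return new_weights, new_bias

# Hypothetical session: 5 reps, 120 kg lifted, 8 weeks trained, 7 hours of sleep, actual 1RM 150 kg.
weights, bias = update_step([0.0, 0.0, 0.0, 0.0], 0.0, [5, 120, 8, 7], 150)
print(weights, bias)  # every weight moves up; the one on weight_lifted moves most (its feature is 120)
```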
The part gradient descent can't see
One thing worth noting: gradient descent optimizes the weights to fit the training data you gave it. It has no way to know whether those weights generalize to new sessions it's never seen.
That's why the main post emphasizes the test set — holding back data to check whether the converged weights actually predict well, or whether they just memorized your logbook.
The math of convergence is the easy part. The harder part is knowing when to stop and whether what you've learned is real signal or just noise.