
The convergence rate of the OLS estimator for linear regression

A lot of machine learning and modern statistics concerns the rate at which certain statistical estimators converge to the quantities they are estimating. For instance, in classification, you use a collection of observations of covariates $x_i \in \mathbb{R}^p$ and classifications $y_i \in \{\pm 1\}$ to fit an estimated classification function $h : \mathbb{R}^p \to \{\pm 1\}$, and the relevant question is: how close is this estimated function to the actual classification function as the number of observations increases?

In this post, I’ll consider the consistency rate of the simple ordinary least-squares estimator for linear regression coefficients. The estimators of more practical interest in a modern setting (high-dimensional with fewer observations than variables) are more complicated, but OLS estimators are a natural starting point, and comparatively much easier to work with and understand.

To establish the setting, assume that our observations take the form $y_i = x_i^T \beta + \varepsilon_i$, where the $x_i$ are our observed covariates and the $\varepsilon_i$ are i.i.d. $N(0, 1/n)$ noise terms independent of the covariates. We can write this as the matrix–vector equation

$$y = X\beta + \varepsilon.$$

The rows of the $n \times p$ matrix $X$ constitute the observed covariates.

Given data $X$ and $y$ that we expect conform to this model, the OLS estimator for the regression coefficients is $\hat\beta = (X^T X)^{-1} X^T y$, if we assume that $X$ is full-rank. We are interested in how fast $\hat\beta$ converges to $\beta$ as the number of observations $n$ is increased. It seems most natural to consider the $\ell_2$ error, since we are dealing with least-squares estimation.
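To make this concrete, here is a minimal numerical sketch in Python: the dimensions $n$ and $p$, the Gaussian draw of $X$, and the choice of $\beta$ are illustrative assumptions on my part, not part of the model.

```python
# A minimal sketch of the model y = X beta + eps and the OLS fit.
# (Dimensions, the Gaussian design, and beta are illustrative choices.)
import numpy as np

rng = np.random.default_rng(0)
n, p = 1000, 20                            # observations, covariates (n > p)

X = rng.standard_normal((n, p))            # rows of X are the covariates x_i
beta = rng.standard_normal(p)              # true regression coefficients
eps = rng.standard_normal(n) / np.sqrt(n)  # i.i.d. N(0, 1/n) noise
y = X @ beta + eps

# OLS estimate; lstsq is a numerically safer route to (X^T X)^{-1} X^T y
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.linalg.norm(beta_hat - beta))     # l2 estimation error
```

Using `np.linalg.lstsq` rather than explicitly forming $(X^TX)^{-1}$ computes the same estimator but is the numerically standard route.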

If $X$ is full-rank with its $p$-th (smallest) singular value $\sigma_p(X)$ bounded away from $0$, then the pseudoinverse $X^\dagger = (X^T X)^{-1} X^T$ satisfies $X^\dagger X = I_p$, and we can estimate

$$\|\hat\beta - \beta\|_2 = \|(X^T X)^{-1} X^T \varepsilon\|_2 \le \|X^\dagger\|_2 \, \|X (X^T X)^{-1} X^T \varepsilon\|_2.$$

Since $X (X^T X)^{-1} X^T$ is the orthogonal projection onto a $p$-dimensional subspace of $\mathbb{R}^n$, the projected noise behaves like a $p$-dimensional $N(0, I_p/n)$ vector, so its $\ell_2$ norm concentrates around $\sqrt{p/n}$. Combined with $\|X^\dagger\|_2 = \sigma_p(X)^{-1}$, we see that in this case

$$\|\hat\beta - \beta\|_2 \lesssim \sigma_p(X)^{-1} \sqrt{p/n}.$$
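As a rough sanity check of this rate (again under the illustrative Gaussian design from the sketch above), one can watch the empirical $\ell_2$ error stay below $\sigma_p(X)^{-1}\sqrt{p/n}$ as $n$ grows with $p$ fixed:

```python
# Rough check of ||beta_hat - beta||_2 <~ sqrt(p/n) / sigma_p(X);
# the Gaussian design here is an illustrative assumption, as above.
import numpy as np

rng = np.random.default_rng(1)
p = 20
beta = np.ones(p)
for n in (100, 1_000, 10_000):
    X = rng.standard_normal((n, p))
    y = X @ beta + rng.standard_normal(n) / np.sqrt(n)
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    err = np.linalg.norm(beta_hat - beta)
    sigma_p = np.linalg.svd(X, compute_uv=False)[-1]  # smallest singular value
    print(f"n={n:6d}  error={err:.2e}  bound={np.sqrt(p / n) / sigma_p:.2e}")
```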

It’s natural to ask whether this bound is optimal. To see that it is, consider the case of an orthogonal design, i.e. $X^T X = I_p$. Then

$$\|\hat\beta - \beta\|_2 = \|X^T \varepsilon\|_2.$$

Since $X^T \varepsilon$ is a $N(0, I_p/n)$ r.v., we see that the right-hand side quantity concentrates around $\sqrt{p/n}$.
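A quick simulation of this orthogonal-design case (a sketch; building $X$ with orthonormal columns via a QR factorization is just one convenient choice) shows the error concentrating around $\sqrt{p/n}$:

```python
# Orthogonal design X^T X = I_p: here beta_hat - beta = X^T eps exactly,
# and ||X^T eps||_2 should concentrate near sqrt(p/n).
import numpy as np

rng = np.random.default_rng(2)
n, p, trials = 1000, 20, 500

X, _ = np.linalg.qr(rng.standard_normal((n, p)))  # orthonormal columns
errs = [np.linalg.norm(X.T @ (rng.standard_normal(n) / np.sqrt(n)))
        for _ in range(trials)]
print(np.mean(errs), np.sqrt(p / n))              # these should roughly agree
```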