A lot of machine learning and modern statistics concerns the rate at which certain statistical estimators converge to the quantities they are estimating. For instance, in classification, you use a collection of observations of covariates \(\mathbf{x}_i \in \mathbb{R}^p \) and classifications \(y_i \in \{\pm 1\}\) to fit an estimated classification function \(h : \mathbb{R}^p \rightarrow \{\pm 1\},\) and the relevant question is how close is this estimated function to the actual classification function as the number of observations increase?

In this post, I’ll consider the consistency rate of the simple ordinary least-squares estimator for linear regression coefficients. The estimators of more practical interest in a modern setting (high-dimensional with fewer observations than variables) are more complicated, but OLS estimators are a natural starting point, and comparatively much easier to work with and understand.

To establish the setting, assume that our observations take the form \(y_i = \mathbf{x}_i^T \mathbf{\beta} + \varepsilon_i,\) where \(\mathbf{x}_i\) are our observations, and \(\varepsilon_i\) are i.i.d. \(\mathcal{N}(0,1/n)\) noise that are independent of the covariates. We can write this as the matrix–vector equation

\[

\mathbf{y} = \mathbf{X} \mathbf{\beta} + \mathbf{\varepsilon}.

\]

The rows of the \(n \times p\) matrix \(\mathbf{X}\) constitute the observed covariates.

Given data \(\mathbf{X}\) and \(\mathbf{y}\) that we expect conform to this model, the OLS estimator for the regression coefficients is \(\hat{\mathbf{\beta}} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y},\) if we assume that \(\mathbf{X}\) is full-rank. We are interested in how fast \(\hat{\mathbf{\beta}}\) converges to \(\mathbf{\beta}\) as the number of observations \(p\) is increased. It seems most natural to consider the \(\ell_2\) error since we are dealing with least-squares estimation.

If \(\mathbf{X}\) is full-rank with \(p\)-largest singular value bounded away from \(0\), then \(\mathbf{X}^\dagger \mathbf{X} = \mathbf{I}\) and we can estimate

\[

\|\hat{\mathbf{\beta}} – \mathbf{\beta} \|_2 = \|(\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{\varepsilon}\|_2 \leq \|\mathbf{X}^\dagger\|_2 \|\mathbf{X} (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{\varepsilon}\|_2.

\]

Since \(\mathbf{X} (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T\) is a projection onto a \(p\) dimensional-subspace of \(\mathbb{R}^n,\) we see that in this case

\[

\|\hat{\mathbf{\beta}} – \mathbf{\beta} \|_2 \lesssim \sigma_p(\mathbf{X})^{-1} \sqrt{p/n}.

\]

It’s natural to ask whether this bound is optimal. To see that it is, consider the case of an orthogonal design, i.e. \(\mathbf{X}^T \mathbf{X} = \mathbf{I}_p.\) Then

\[

\|\hat{\mathbf{\beta}} – \mathbf{\beta} \|_2 = \|\mathbf{X}^T \mathbf{\varepsilon}\|_2.

\]

Since \(\mathbf{X}^T \mathbf{\varepsilon}\) is a \(\mathbf{N}(0, \mathbf{I}_p/n)\) r.v., we see that the right-hand side quantity concentrates around \(\sqrt{p/n}\).