Finding the true number of components or clusters
Using split-half methods
I was frustrated enough by the usual methods for selecting the right number of components in Principal Component Analysis, or of clusters in k-means clustering, that I made some new methods that are, at least, currently more satisfactory to me.
Both these methods, instead of using a very arbitrary heuristic like the scree criterion, use a slightly less arbitrary criterion based on random split-halves of the data. The rationale is that true components or clusters should be similar across split halves, whereas those that represent mere sampling error will tend to vary.
The methods, with simulation scripts to test whether they work for a given use case, are available, Open Source, in Python on GitHub:
PCA: https://github.com/thomasgladwin/teg_get_best_n (DOI:10.5281/zenodo.7803738)
K-means: https://github.com/thomasgladwin/teg_CCC (DOI:10.5281/zenodo.7857078)
The “true” number of components in PCA
In principle, of course, there is no “true” number of PCA components except for all of them, as there’s no statistical model. But we can define “true” in terms of the number of independent, generating, latent variables, to which random noise is added. Can we recover the number of those generating latent variables? That seems to be how PCA is often used conceptually.
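As a minimal illustration of this data-generating view (a hypothetical example, not taken from the repository), data with a known number of generating latent variables plus noise could be simulated along these lines:

```python
import numpy as np

rng = np.random.default_rng(0)
n_observations, n_variables, n_latent = 500, 12, 3

# Independent latent generating variables and a random mixing matrix.
latent = rng.normal(size=(n_observations, n_latent))
mixing = rng.normal(size=(n_latent, n_variables))

# Observed data: mixed latent signals plus independent random noise.
noise_sd = 0.5
X = latent @ mixing + noise_sd * rng.normal(size=(n_observations, n_variables))
```

The question is then whether a method applied to X recovers n_latent = 3.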
Details: The SHEM method (for Split-Half Eigenvector Matching) works as follows. For an nd-array X, with shape == (nObservations, nVariables), a number of random splits are performed. For each split, a PCA is performed on each half separately, via eigendecomposition of the covariance matrix. Each of the first half's eigenvectors is matched to the most similar of the second half's eigenvectors, with similarity measured via the dot product. The vector of similarities is sorted from high to low, and these sorted vectors are averaged over all random splits. Finally, the optimal separation between the high and low similarities is determined by a between-within variance criterion. An estimate of zero components is possible.
So, rather than looking at which components have “big versus small” eigenvalues, as is typical, we look at whether components have “high versus low” similarity. Eigenvalues still matter implicitly, in the ordering of components, but do not determine the cut-off. SHEM, at least with some idealized simulated data, recovers the true number of generating variables well.
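A simplified sketch of that procedure, in Python with NumPy, might look as follows. This is my reading of the steps above rather than the reference implementation in teg_get_best_n: in particular, the use of absolute dot products (to handle the arbitrary sign of eigenvectors) and the exact form of the between-within variance criterion are assumptions, and the possibility of returning zero components is omitted for brevity.

```python
import numpy as np

def shem_sketch(X, n_splits=20, seed=0):
    """Illustrative split-half eigenvector matching (not the reference code)."""
    rng = np.random.default_rng(seed)
    n_obs, n_var = X.shape
    sims_per_split = np.zeros((n_splits, n_var))
    for i_split in range(n_splits):
        # Random split of the observations into two halves.
        perm = rng.permutation(n_obs)
        halves = X[perm[:n_obs // 2]], X[perm[n_obs // 2:]]
        # PCA per half via eigendecomposition of the covariance matrix,
        # with eigenvectors ordered from largest to smallest eigenvalue.
        eigvecs = []
        for half in halves:
            evals, evecs = np.linalg.eigh(np.cov(half, rowvar=False))
            eigvecs.append(evecs[:, np.argsort(evals)[::-1]])
        # Match each eigenvector of half 1 to the most similar eigenvector
        # of half 2; the absolute dot product ignores arbitrary sign flips.
        sims = np.max(np.abs(eigvecs[0].T @ eigvecs[1]), axis=1)
        sims_per_split[i_split] = np.sort(sims)[::-1]
    mean_sims = sims_per_split.mean(axis=0)
    # Separate "high" from "low" similarities with an Otsu-style
    # between-versus-within variance criterion (assumed form).
    best_k, best_crit = 1, -np.inf
    for k in range(1, n_var):
        high, low = mean_sims[:k], mean_sims[k:]
        within = k * np.var(high) + (n_var - k) * np.var(low)
        between = (k * (high.mean() - mean_sims.mean()) ** 2
                   + (n_var - k) * (low.mean() - mean_sims.mean()) ** 2)
        crit = between / (within + 1e-12)
        if crit > best_crit:
            best_crit, best_k = crit, k
    return best_k, mean_sims
```

With the simulated X from above, shem_sketch(X)[0] should typically come out near 3, though this is only a toy check rather than a substitute for the simulation scripts in the repository.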
The true number of clusters in k-means clustering
The Cluster Consistency Criterion (CCC) algorithm, similarly to SHEM, follows the rationale that true cluster centres should be similar in random split-halves of the data. If too many clusters are specified, the cluster centres will become driven by random sampling error.
Details: The CCC implements this as follows. For each candidate number of clusters, the data are split into random halves a given number of times (e.g., 20 splits). For each split, a k-means cluster analysis is run on each half separately. Each cluster centre in one half is paired with the most similar centre in the other half, and the distances between paired centres are summed. The similarity score is e^(-distance_sum). The mean similarity score over random splits is the score for the given number of clusters.
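Again as an illustrative sketch rather than the reference teg_CCC code, and using scikit-learn's KMeans purely for convenience, the split-half score for a given number of clusters could be computed roughly as follows:

```python
import numpy as np
from sklearn.cluster import KMeans

def ccc_score_sketch(X, n_clusters, n_splits=20, seed=0):
    """Illustrative split-half consistency score (not the reference code)."""
    rng = np.random.default_rng(seed)
    n_obs = X.shape[0]
    scores = []
    for _ in range(n_splits):
        # Random split of the observations into two halves.
        perm = rng.permutation(n_obs)
        halves = X[perm[:n_obs // 2]], X[perm[n_obs // 2:]]
        # k-means on each half separately.
        centres = [KMeans(n_clusters=n_clusters, n_init=10).fit(half).cluster_centers_
                   for half in halves]
        # Pair each centre in half 1 with its nearest centre in half 2
        # and sum the distances between paired centres.
        dists = np.linalg.norm(centres[0][:, None, :] - centres[1][None, :, :], axis=2)
        distance_sum = dists.min(axis=1).sum()
        scores.append(np.exp(-distance_sum))
    return float(np.mean(scores))
```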
The best estimate of the true number of clusters is determined by where the improvement in the score drops off, which occurs when the number of clusters becomes higher than the true number of clusters.
As clusters are added up to the true number, the sampling error that determines where cluster centres are placed (because the fitted centres cannot yet adequately match the true cluster centres) is reduced; adding a cluster beyond the true number forces an arbitrary, sample-specific placement that will vary more between random splits.
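Putting this together, one could scan candidate numbers of clusters with the ccc_score_sketch function above and look for the point where the score stops improving. The specific stopping rule below (a fixed improvement tolerance) is only an assumed simplification; the repository's actual drop-off criterion may differ.

```python
def estimate_n_clusters_sketch(X, max_clusters=8, tol=0.01, n_splits=20, seed=0):
    """Pick the number of clusters at which the split-half score stops
    improving by more than tol (a simplified, assumed drop-off rule)."""
    scores = [ccc_score_sketch(X, k, n_splits=n_splits, seed=seed)
              for k in range(1, max_clusters + 1)]
    best_k = 1
    for k in range(2, max_clusters + 1):
        if scores[k - 1] - scores[k - 2] > tol:
            best_k = k
        else:
            break
    return best_k, scores
```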
Conclusion
As has been pointed out (e.g., Björklund, 2019), using the wrong number of components or clusters undermines the interpretability of results based on such methods. Simulations suggest that SHEM and the CCC can, at least under some circumstances, address this concern, although more study is needed to determine their relative value and range of applicability.

