Regress out the effects of confounding variables using a linear least squares regression model.

## Usage

`regress_out(mat, latent_data, prediction_axis = c("row", "col"))`

## Arguments

- mat
Input IterableMatrix

- latent_data
Data to regress out, as a `data.frame` where each column is a variable to regress out.

- prediction_axis
Which axis corresponds to prediction outputs from the linear models (e.g. the gene axis in typical single-cell analysis). Options are `"row"` (default) and `"col"`.
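A hedged usage sketch (the matrix construction and the `nUMI` covariate name are illustrative, not prescribed by the function):

```r
library(BPCells)

# Hypothetical setup: `mat` is an IterableMatrix with genes as rows
# and cells as columns; we regress out per-cell sequencing depth.
latent_data <- data.frame(nUMI = colSums(mat))

# One model per gene (row), since prediction_axis = "row"
corrected <- regress_out(mat, latent_data, prediction_axis = "row")
```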

## Details

Conceptually, `regress_out` calculates a linear least squares best-fit model for each row of the matrix (or each column if `prediction_axis` is `"col"`).
The input data for each regression model are the columns of `latent_data`, and each model tries to
predict the values in the corresponding row (or column) of `mat`. After fitting each model, `regress_out`
will subtract the model predictions from the input values, aiming to retain only effects that are
not explained by the variables in `latent_data`.
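The fit-then-subtract step above is equivalent to keeping the residuals of one `lm()` fit per row; a toy check on a plain dense matrix (variable names are illustrative):

```r
set.seed(42)
mat <- matrix(rnorm(3 * 10), nrow = 3)       # 3 features x 10 cells
latent_data <- data.frame(depth = rnorm(10)) # one confounder per cell

# Fit a linear model per row and keep only what the confounder
# cannot explain, i.e. the residuals
corrected <- t(apply(mat, 1, function(y) {
  resid(lm(y ~ depth, data = latent_data))
}))

dim(corrected)  # unchanged: 3 x 10
```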

These models can be fit efficiently since they all share the same input data and so most of the calculations for the closed-form best fit solution are shared. A QR factorization of the model matrix and a dense matrix-vector multiply are sufficient to fully calculate the residual values.
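The shared-input trick can be sketched in plain R: one QR factorization of the model matrix serves every per-row regression, and `qr.resid()` applied to the whole (transposed) matrix yields all residuals in a single dense multiply. This is a minimal illustration with made-up data, not the BPCells internals:

```r
set.seed(1)
mat <- matrix(rnorm(50 * 20), nrow = 50)      # 50 features x 20 cells
latent_data <- data.frame(depth = rnorm(20))  # one confounder per cell

# Model matrix with an intercept, shared by all 50 per-row regressions
X <- model.matrix(~ depth, data = latent_data)  # 20 x 2
qr_X <- qr(X)

# qr.resid() projects each response onto the orthogonal complement of
# the column space of X; passing t(mat) handles every row in one call
residuals <- t(qr.resid(qr_X, t(mat)))          # 50 x 20, dense

# Each row of `residuals` is now orthogonal to the confounders
max(abs(crossprod(X, t(residuals))))  # numerically ~0
```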

*Efficiency considerations*: As the output matrix is dense rather than sparse, mean and variance calculations may
run comparatively slowly. However, PCA and matrix/vector multiply operations can be performed at nearly the same
cost as the input matrix due to mathematical simplifications. Memory usage scales with `n_features * (nrow(mat) + ncol(mat))`.
Generally, `n_features == ncol(latent_data)`, but for categorical variables in `latent_data`, each
category will be expanded into its own indicator variable. Memory usage will therefore be higher when
using categorical input variables with many (i.e. >100) distinct values.
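The categorical expansion can be seen with base R's `model.matrix()` (an illustration only; note that R's default treatment contrasts absorb one factor level into the intercept, so the exact column count BPCells produces may differ):

```r
latent_data <- data.frame(
  depth = c(1.2, 0.8, 1.5, 0.9),
  batch = factor(c("a", "b", "c", "a"))  # categorical with 3 levels
)

# The 3-level factor expands into indicator columns
X <- model.matrix(~ depth + batch, data = latent_data)
colnames(X)  # "(Intercept)" "depth" "batchb" "batchc"
```

A factor with many distinct values expands into correspondingly many columns, which is why memory usage grows with high-cardinality categorical inputs.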