Apply a custom R function to each row or column of a BPCells matrix. This runs slower than the built-in C++-backed functions, but keeps most of the memory benefits of disk-backed operations.

Usage

apply_by_row(mat, fun, ...)

apply_by_col(mat, fun, ...)

Arguments

mat

IterableMatrix object

fun

function(val, row, col) that takes in a row/col of values and returns a summary output. Argument details:

  1. val - Vector of length (# non-zero values) holding the value of each non-zero matrix entry

  2. row - one-based row index (apply_by_col: vector of length (# non-zero values), apply_by_row: single integer)

  3. col - one-based col index (apply_by_col: single integer, apply_by_row: vector of length (# non-zero values))

  4. ... - Optional additional arguments (should not be named row, col, or val)

...

Optional additional arguments passed to fun
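
As a minimal sketch of how extra arguments reach fun through ..., the example below counts the entries in each column that exceed a cutoff. The helper count_above and its threshold argument are hypothetical names chosen for illustration, and the small input matrix is made up by hand.

```r
library(BPCells)
library(Matrix)

# Small hand-written sparse matrix; a dgCMatrix converts to a col-major IterableMatrix
m <- Matrix::sparseMatrix(i = c(1, 3, 2), j = c(1, 1, 3), x = c(5, 2, 7), dims = c(3, 3))
mat <- as(m, "IterableMatrix")

# `threshold` is a hypothetical extra argument; it must not be named val, row, or col
count_above <- function(val, row, col, threshold) sum(val > threshold)

# `threshold = 3` is forwarded to count_above() via `...`
res <- apply_by_col(mat, count_above, threshold = 3)
counts <- unlist(res)
```

Because val only contains the non-zero entries, a function like this never sees the implicit zeros, so any summary that depends on them has to account for them separately.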

Value

apply_by_row - A list of length nrow(matrix) with the results returned by fun() on each row

apply_by_col - A list of length ncol(matrix) with the results returned by fun() on each col

Details

These functions require row-major matrix storage for apply_by_row and col-major storage for apply_by_col, so matrices stored in the wrong order may need a re-ordered copy created using transpose_storage_order() first. This requirement keeps memory usage low and allows the result to be calculated in a single streaming pass over the input matrix.
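
One way to handle the storage-order requirement is to check the order first and copy only when needed. The sketch below assumes BPCells' storage_order() accessor, which reports "row" or "col"; the helper ensure_storage_order is a hypothetical wrapper, not part of BPCells.

```r
library(BPCells)
library(Matrix)

# Hypothetical helper (not part of BPCells): return `mat` in the requested
# storage order, creating a re-ordered copy only if it is stored the other way
ensure_storage_order <- function(mat, order = c("row", "col")) {
  order <- match.arg(order)
  if (storage_order(mat) == order) mat else transpose_storage_order(mat)
}

# Small hand-written example; a dgCMatrix converts to col-major storage
m <- Matrix::sparseMatrix(i = c(1, 2, 2), j = c(1, 1, 3), x = c(4, 1, 6), dims = c(2, 3))
mat <- as(m, "IterableMatrix")

row_major <- ensure_storage_order(mat, "row")
row_sums <- unlist(apply_by_row(row_major, function(val, row, col) sum(val)))
```

The re-ordered copy is a one-time cost, so it is worth keeping row_major around if several row-wise passes are planned.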

If vector/matrix outputs are desired instead of lists, calling unlist(x) or do.call(cbind, x) or do.call(rbind, x) can convert the list output.
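
The conversions mentioned above can be tried with base R alone. In this sketch the lists scalar_res and vector_res are hand-written stand-ins for apply_by_col() output, not results computed by BPCells.

```r
# Stand-in for the output of a fun() that returns one integer per column
scalar_res <- list(3L, 1L, 2L)
as_vector <- unlist(scalar_res)           # integer vector of length 3

# Stand-in for the output of a fun() that returns a named length-2 vector per column
vector_res <- list(
  c(min = 1, max = 2),
  c(min = 0, max = 0),
  c(min = 1, max = 3)
)
by_rows <- do.call(rbind, vector_res)     # 3 x 2 matrix, one row per list element
by_cols <- do.call(cbind, vector_res)     # 2 x 3 matrix, one column per list element

# vapply() is a stricter base R alternative: it errors if any element
# is not a single integer
checked <- vapply(scalar_res, identity, integer(1))
```

unlist() silently concatenates elements of differing lengths, so vapply() may be the safer choice when fun is expected to return exactly one value per row or column.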

See also

For an interface more similar to base::apply, see the BPCellsArray project. For calculating colMeans on a sparse single-cell RNA matrix, that interface is about 8x slower than apply_by_col, because the base::apply interface is not sparsity-aware. (See pull request #104 for benchmarking.)

Examples

mat <- matrix(rbinom(40, 1, 0.5) * sample.int(5, 40, replace = TRUE), nrow = 4)
rownames(mat) <- paste0("gene", 1:4)
mat
#>       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
#> gene1    0    0    2    0    0    0    3    3    0     2
#> gene2    1    0    3    0    0    3    4    1    1     2
#> gene3    2    0    1    0    0    2    0    0    0     0
#> gene4    0    0    0    0    3    0    1    0    0     0

mat <- mat %>% as("dgCMatrix") %>% as("IterableMatrix")

#######################################################################
## apply_by_row() example
#######################################################################
## Get mean of every row

## Expect an error when a col-major matrix is passed
apply_by_row(mat, function(val, row, col) {sum(val) / ncol(mat)}) %>% 
 unlist()
#> Error in apply_by_row(mat, function(val, row, col) {    sum(val)/ncol(mat)}): Cannot call apply_by_row on a col-major matrix. Please call transpose_storage_order() first

## Need to transpose the storage order so the matrix is row-major
mat_row_order <- transpose_storage_order(mat)

## Works as expected for a row-major matrix
apply_by_row(mat_row_order, 
 function(val, row, col) sum(val) / ncol(mat_row_order)
) %>% unlist()
#> [1] 1.0 1.5 0.5 0.4

# Equivalent to rowMeans(), except that the result above is unnamed
rowMeans(mat)
#> gene1 gene2 gene3 gene4 
#>   1.0   1.5   0.5   0.4 


#######################################################################
## apply_by_col() example
#######################################################################
## Get argmax (row index of the largest value) of every col; all-zero cols return 1
apply_by_col(mat, 
 function(val, row, col) if (length(val) > 0) row[which.max(val)] else 1L
) %>% unlist()
#>  [1] 3 1 2 1 4 2 2 1 2 1