Read/write sparse matrices — write_matrix

BPCells matrices are stored in sparse format, meaning only the non-zero entries are stored. Matrices can hold integer count data or decimal numbers (float or double). See Details for more information.

Usage

write_matrix_memory(mat, compress = TRUE)

write_matrix_dir(
  mat,
  dir,
  compress = TRUE,
  buffer_size = 8192L,
  overwrite = FALSE
)

open_matrix_dir(dir, buffer_size = 8192L)

write_matrix_hdf5(
  mat,
  path,
  group,
  compress = TRUE,
  buffer_size = 8192L,
  chunk_size = 1024L,
  overwrite = FALSE,
  gzip_level = 0L
)

open_matrix_hdf5(path, group, buffer_size = 16384L)

Arguments

compress: Whether or not to compress the data.
dir: Directory to save the data into
buffer_size: For performance tuning only. The number of items to be buffered in memory before calling writes to disk.
overwrite: If TRUE, write to a temp dir then overwrite existing data. Alternatively, pass a temp path as a string to customize the temp dir location.
path: Path to the HDF5 file on disk.
group: The group within the HDF5 file to write the data to. If writing to an existing HDF5 file, this group must not already be in use.
chunk_size: For performance tuning only. The chunk size used for the HDF5 array storage.
gzip_level: Gzip compression level. Default is 0 (no compression). Gzip is recommended only when both compression and compatibility with outside programs are required. Otherwise, compress=TRUE is recommended, as it is >10x faster and often reaches similar compression levels.
mat: Input matrix, either IterableMatrix or dgCMatrix.
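
The choice between compress and gzip_level for HDF5 output can be sketched as below. This is a minimal sketch, not a benchmark: it uses only the write_matrix_hdf5() arguments listed above, the small counts matrix and the file names are made up for illustration, and the BPCells calls run only if the BPCells and Matrix packages are installed. Because a dgCMatrix holds double values, compress = TRUE here compresses only the indices.

```r
# Hypothetical output paths, one file per compression setting.
bitpacked_path <- file.path(tempdir(), "demo_bitpacked.h5")
gzip_path <- file.path(tempdir(), "demo_gzip.h5")
unlink(c(bitpacked_path, gzip_path))

if (requireNamespace("BPCells", quietly = TRUE) &&
    requireNamespace("Matrix", quietly = TRUE)) {
  # Small made-up counts matrix in dgCMatrix format.
  set.seed(1)
  counts <- Matrix::sparseMatrix(
    i = sample(100, 500, replace = TRUE),
    j = sample(50, 500, replace = TRUE),
    x = rpois(500, 2) + 1,
    dims = c(100, 50)
  )

  # Default: BPCells compression, no gzip.
  m_bitpacked <- BPCells::write_matrix_hdf5(
    counts, bitpacked_path, group = "mat",
    compress = TRUE
  )

  # Alternative: BPCells compression off, gzip applied through HDF5,
  # for cases where outside programs must read the compressed data.
  m_gzip <- BPCells::write_matrix_hdf5(
    counts, gzip_path, group = "mat",
    compress = FALSE, gzip_level = 4L
  )

  # Compare the resulting file sizes in bytes.
  print(file.size(c(bitpacked_path, gzip_path)))
}
```

The gzip level of 4 is an arbitrary choice for the sketch; the resulting sizes depend on the data.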

Value

BPCells matrix object

Details

Storage locations

Matrices can be stored in a directory on disk, in memory, or in an HDF5 file. Saving in a directory on disk is a good default for local analysis, as it provides the best I/O performance and lowest memory usage. The HDF5 format allows saving within existing HDF5 files to group data together, and the in-memory format provides the fastest performance when memory usage is not a concern.

Bitpacking Compression

For typical RNA counts matrices holding integer counts, bitpacking compression results in 6-8x less space than an R dgCMatrix, and 4-6x less space than a scipy csc_matrix. The compression is more effective when the count values in the matrix are small, and when the rows of the matrix are sorted by rowMeans. In tests on RNA-seq data, optimal row ordering could save up to 40% of storage space.

For non-integer data matrices, bitpacking compression is much less effective, as it can only be applied to the row indices of each entry but not to the values themselves. There will still be some space savings, but far less than for counts matrices.
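
Why small count values compress better can be illustrated with a short base R sketch. It is not the BPCells encoder; it only shows the general bitpacking idea that a chunk of non-negative integers can be stored using as many bits per value as the largest value in the chunk requires. The function name bits_needed and both example chunks are hypothetical.

```r
# Bits needed per value to represent the largest entry in a chunk
# of non-negative integers (0 bits if every entry is zero).
bits_needed <- function(x) {
  m <- max(x)
  if (m == 0) 0L else as.integer(floor(log2(m)) + 1)
}

# Hypothetical chunk of small counts.
small_counts <- c(1L, 1L, 2L, 1L, 3L, 1L, 1L, 2L)
# The same chunk with one large count in it.
large_counts <- c(1L, 1L, 2L, 1L, 900L, 1L, 1L, 2L)

bits_small <- bits_needed(small_counts)  # 2 bits per value
bits_large <- bits_needed(large_counts)  # 10 bits per value

# Fraction of a plain 32-bit integer representation each chunk needs.
ratio_small <- bits_small / 32
ratio_large <- bits_large / 32
```

In this toy case a single large value raises the per-value cost of the whole chunk from 2 to 10 bits, which matches the observation above that matrices with small counts compress best.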

Examples

## Create temporary directory to keep demo matrix
data_dir <- file.path(tempdir(), "mat")
if (dir.exists(data_dir)) unlink(data_dir, recursive = TRUE)
dir.create(data_dir, recursive = TRUE, showWarnings = FALSE)

mat <- get_demo_mat()
mat
#> 3582 x 2600 IterableMatrix object with class MatrixDir
#> 
#> Row names: ENSG00000272602, ENSG00000250312 ... ENSG00000255512
#> Col names: TTTAGCAAGGTAGCTT-1, AGCCGGTTCCGGAACC-1 ... TACTAAGTCCAATAGC-1
#> 
#> Data type: uint32_t
#> Storage order: column major
#> 
#> Queued Operations:
#> 1. Load compressed matrix from directory /home/imman/.local/share/R/BPCells/demo_data/demo_mat_filtered_subsetted

#######################################################################
## write_matrix_memory() example
#######################################################################
mat_memory <- write_matrix_memory(mat)
mat_memory
#> 3582 x 2600 IterableMatrix object with class PackedMatrixMem_uint32_t
#> 
#> Row names: ENSG00000272602, ENSG00000250312 ... ENSG00000255512
#> Col names: TTTAGCAAGGTAGCTT-1, AGCCGGTTCCGGAACC-1 ... TACTAAGTCCAATAGC-1
#> 
#> Data type: uint32_t
#> Storage order: column major
#> 
#> Queued Operations:
#> 1. Load compressed matrix from memory


#######################################################################
## write_matrix_dir() example
#######################################################################
mat %>% write_matrix_dir(
 file.path(data_dir, "demo_mat"),
 overwrite = TRUE
)
#> 3582 x 2600 IterableMatrix object with class MatrixDir
#> 
#> Row names: ENSG00000272602, ENSG00000250312 ... ENSG00000255512
#> Col names: TTTAGCAAGGTAGCTT-1, AGCCGGTTCCGGAACC-1 ... TACTAAGTCCAATAGC-1
#> 
#> Data type: uint32_t
#> Storage order: column major
#> 
#> Queued Operations:
#> 1. Load compressed matrix from directory /tmp/RtmpsGFdDm/mat/demo_mat


#######################################################################
## open_matrix_dir() example
#######################################################################
mat <- open_matrix_dir(
 file.path(data_dir, "demo_mat")
)
mat
#> 3582 x 2600 IterableMatrix object with class MatrixDir
#> 
#> Row names: ENSG00000272602, ENSG00000250312 ... ENSG00000255512
#> Col names: TTTAGCAAGGTAGCTT-1, AGCCGGTTCCGGAACC-1 ... TACTAAGTCCAATAGC-1
#> 
#> Data type: uint32_t
#> Storage order: column major
#> 
#> Queued Operations:
#> 1. Load compressed matrix from directory /tmp/RtmpsGFdDm/mat/demo_mat


#######################################################################
## write_matrix_hdf5() example
#######################################################################
mat %>% write_matrix_hdf5(path = file.path(data_dir, "demo_mat.h5"), group = "mat")
#> 3582 x 2600 IterableMatrix object with class MatrixH5
#> 
#> Row names: ENSG00000272602, ENSG00000250312 ... ENSG00000255512
#> Col names: TTTAGCAAGGTAGCTT-1, AGCCGGTTCCGGAACC-1 ... TACTAAGTCCAATAGC-1
#> 
#> Data type: uint32_t
#> Storage order: column major
#> 
#> Queued Operations:
#> 1. Load compressed matrix in hdf5 file /tmp/RtmpsGFdDm/mat/demo_mat.h5, group mat


#######################################################################
## open_matrix_hdf5() example
#######################################################################
mat_hdf5 <- open_matrix_hdf5(
 file.path(data_dir, "demo_mat.h5"),
 group = 'mat'
)
mat_hdf5
#> 3582 x 2600 IterableMatrix object with class MatrixH5
#> 
#> Row names: ENSG00000272602, ENSG00000250312 ... ENSG00000255512
#> Col names: TTTAGCAAGGTAGCTT-1, AGCCGGTTCCGGAACC-1 ... TACTAAGTCCAATAGC-1
#> 
#> Data type: uint32_t
#> Storage order: column major
#> 
#> Queued Operations:
#> 1. Load compressed matrix in hdf5 file /tmp/RtmpsGFdDm/mat/demo_mat.h5, group mat