Matrix and Fragment Storage Formats
Source:vignettes/web-only/bitpacking-format.Rmd
bitpacking-format.Rmd
This provides an overview of the BPCells compressed and uncompressed matrix formats. While BPCells is pre-1.0 these formats may be updated, though BPCells aims to maintain backwards read-compatibility across updates.
Format version changes:
- v2 (March 2023): Add support for
non-zero entries or fragments in a single file.
-
idxptr
: uint32 -> uint64 -
chr_ptr
: uint32 -> uint64 - Add
*_idx_offsets
fields to bitpacked data.
-
- v1 (August 2022): Initial stability of v1 file formats
Matrix Logical Storage Layout
For data storage, we use a storage abstraction of named data arrays, stored as e.g. a single group hdf5 or a directory of files. The matrix format is a compressed sparse column/row (CSC/CSR) format with the following data arrays:
Name | Type | Length |
---|---|---|
val |
uint32/float32/float64 | # non-zeros |
index |
uint32 | # non-zeros |
idxptr |
uint64 | # cols + 1 |
shape |
uint32 | 2 |
row_names |
string | 0 or # rows |
col_names |
string | 0 or # cols |
storage_order |
string | 1 |
The interpretation of each array is as follows:
-
val
- Values of non-zero entries in increasing order of (column, row) position. -
index
-index[i]
provides the 0-based row index for the value found inval[i]
(or column index for row-major storage order) -
idxptr
- The indexes inidx
andval
for the entries in columnj
can be found fromidxptr[j]
toidxptr[j+1] - 1
, inclusive. (or rowj
for row-major storage order) -
shape
- number of rows in matrix, followed by number of columns -
row_names
- Names for each row of the matrix (optional) -
col_names
- Names for each column of the matrix (optional) -
storage_order
-col
for compressed-sparse-column, orrow
for compressed-sparse-row
Bitpacked compressed matrices consist of the following modifications:
-
val
: For unsigned 32-bit integers, we replaceval
withval_data
,val_idx
, andval_idx_offsets
corresponding to a BP-128m1 encoding as described below. The total number of values is already stored as the last value inidxptr
. For 32-bit and 64-bit floatsval
remains unchanged. -
index
: We replace theindex
array with a BP-128d1z encoded data in arraysindex_data
,index_idx
,index_idx_offsets
, andindex_starts
Each matrix is stored as a single directory, HDF5 group, or R S4
object. The storage format for each matrix is encoded as a version
string. The current version string is of the format
[compression]-[datatype]-matrix-v2
, where
[compression]
can be either packed
or
unpacked
, and [datatype]
can be one of
uint
, float
, or double
corresponding to 32-bit unsigned integer, 32-bit float, and 64-bit
double respectively. In v1 formats, the only difference is that
idxptr
had type uint32.
Genomic fragments logical storage layout
BPCells fragment files store (chromosome, start, end, cell ID) for each fragment, sorted by (chromosome, start). The coordinate system follows the bed format convention, where the first base of a chromosome is numbered 0 and the end coordinate of each fragment is non-inclusive. This means a 10 base pair long fragment starting at the first base of the genome will have start=0 and end=10. End coordinates are always guaranteed to be at least as large as start coordinates.
Uncompressed fragment data is stored in the following arrays:
Name | Type | Length |
---|---|---|
cell |
uint32 | # fragments |
start |
uint32 | # fragments |
end |
uint32 | # fragments |
end_max |
uint32 | # fragments/128 |
chr_ptr |
uint64 | 2 # chromosomes |
cell_names |
string | # cells |
chr_names |
string | # chromosomes |
These arrays have the following contents:
-
cell
: List of numeric cell IDs, one per fragment. The smallest cell ID is 0.
-
start
: List of fragment start coordinates. The first base in a chromosome is 0. -
end
: List of fragment end coordinates. The base of the end coordinate is one past the last base in the fragment.
-
end_max
:end_max[i]
is the maximum end coordinate of all fragments from the start of the chromosome to the fragment at indexi*128-127
. If multiple chromosomes have fragments in a given chunk of 128 fragments,end_max
is the maximum of all those end coordinates. Theend_max
array allows for quickly seeking to fragments overlapping a given genomic region.
-
chr_ptr
:chr_ptr[2*i]
is the index of the first fragment in chromosomei
in thecell
,start
, andend
arrays.chr_ptr[2*i + 1]-1
is the index of the last fragment in chromosomei
. Fragments need not necessarily be sorted in order of increasing chromosome ID, though all fragments for a given chromosome must still be stored contiguously. This allows logically re-ordering chromosomes at write-time even if the input data source does not support reading chromosomes out-of-order (i.e. 10x fragment files without a genome index).
-
cell_names
: string identifiers for each numeric cell ID. -
chr_names
: string identifiers for each numeric chromosome ID.
Compressed fragments are stored with the following modifications:
-
cell
is replaced withcell_data
,cell_idx
, andcell_idx_offsets
, compressed according to BP-128 encoding. -
start
is replaced withstart_data
,start_idx
,start_idx_offsets
, andstart_starts
, compressed according to BP-128d1 encoding. -
end
is replaced withend_data
,end_idx
, andend_idx_offsets
, which storesstart - end
for each fragment, encoded using BP-128 encoding.
The current version string is equal to
unpacked-fragments-v2
for uncompressed fragments, and
packed-fragments-v2
for compressed fragments. In v1
formats, the only difference is that chr_ptr
had type
uint32.
Bitpacking formats
Our bitpacked formats are based on the formats described in a paper by Lemire and Boytsov.
BP-128
The vanilla BP-128 format is stored in 3 arrays as follows:
-
data
- stream of bitpacked data, represented as 32-bit integers with the interleaved bit layout as shown in Lemire and Boytsov figure 6. A chunk of 128 32-bit input integers with bits per integer will be stored using 32-bit integers holding the bitpacked data. -
idx
- list of 32-bit integers, where the encoded data for integers index128*i
to128*i + 127
can be found in data from indexidx[i]
to indexidx[i+1]-1
. For lists with (4 billion) entries or greater, idx stores the index modulo -
idx_offsets
- list of 64-bit integers, where the values ofidx
with indices fromidx_offsets[i]
toidx_offsets[i+1]-1
should havei*(2^32)
added to them.
BP-128d1
Equivalent to the BP-128* algorithm from Lemire and Boytsov where integers are difference encoded prior to bitpacking. This is best for lists of sorted integers.
-
data
- Encoding as with vanilla BP-128, but we do difference encoding prior to bitpacking: , , , …, -
idx
,idx_offsets
- identical to BP-128 -
starts
- list of 32-bit integers, wherestarts[i]
is the decoded value for the integer at index128*i
BP-128d1z
Similar to BP128d1 but with zigzag encoding applied after difference encoding. This is best for lists of close but not fully sorted runs of integers.
-
data
- Encoding as with BP-128d1, but between difference encoding and bitpacking, the results are zigzag encoded, where if , and if . -
idx
,idx_offsets
- identical to BP-128 -
starts
- identical to BP128-d1
Illustrative reference code for BP-128 and the d1 and zigzag transformations can be found here.
Physical storage layout
The abstraction of named data arrays can be realized by a few different formats. The three currently supported by BPCells are:
Directory of files format:
This is the default storage backend due to its simplicity and high
performance. Arrays are stored as binary files within a directory.
Numeric array files have an 8-byte header followed by data values in
little-endian binary format for integers and IEEE-754 for 32-bit and
64-bit floating point numbers. Header values are 8-byte ASCII text as
follows: unsigned 32-bit integer UINT32v1
, unsigned 64-bit
integer UINT64v1
, 32-bit float FLOATSv1
,
64-bit float DOUBLEv1
. Arrays of strings are stored as
ASCII text with one array value per line with no header. The version
string is stored as a file named “version” containing the version string
followed by a newline.
Hdf5 file format:
This storage backend can be useful for embedding BPCells formats as a group within an h5ad or other HDF5 file. Arrays of numbers are stored as HDF5 datasets using the built-in HDF5 encoding format. Arrays of strings are stored as HDF5 variable length string datasets.
The version string is stored as a version attribute on the HDF5 group.
R object format:
This storage backend is primarily useful for testing, or when bitpacking compression of in-memory data is desired to avoid disk bandwidth bottlenecks. Strings are stored as native R character arrays. Unsigned integers and 32-bit floats are stored in native R integer arrays by bitcasting the R signed integers into the required data types. 64-bit floats are stored in native R numeric arrays. 64-bit integers are stored as doubles in R numeric arrays. This reduces the highest representable value from to (about 9 quadrillion), which we do not expect to pose practical problems. Named collections of arrays are stored in R lists (when writing) or S4 objects (when reading). The version string is stored as a string vector named “version” of length 1.