Basepair insertion counts tutorial#
BPCells python bindings can be used to query basepair-level coverage for predefined cell types. This works in three steps:

1. 10x or ArchR arrow files are converted to BPCells format. This is most flexible to do with the BPCells R bindings, though single-sample 10x import is supported through the python bindings.
2. The BPCells python bindings use the input fragment files to create a large matrix of dimensions (# cell types, # basepairs in genome). This is where cell type groupings are determined.
3. The BPCells python bindings can slice arbitrary genomic regions, returning a numpy array of dimensions (regions, cell types, basepairs).
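In code, the three steps map onto a handful of calls, each demonstrated with real data in the sections below. The condensed sketch here is only for orientation; every path and grouping in it is a placeholder.

import bpcells.experimental
import pandas as pd

# Placeholder inputs; the usage demo below constructs real versions of each
fragments_10x = "atac_fragments.tsv.gz"       # cellranger fragment file
fragments_dir = "bpcells_fragments"           # BPCells-format fragments (output)
matrix_dir = "insertions_matrix"              # precalculated matrix (output)
chrom_sizes = "hg38.chrom.sizes"              # chromosome sizes file
barcodes = pd.Series(["AAACTGCAGACTCGGA-1"])  # cell barcodes
clusters = pd.Series(["celltype_A"])          # one group label per barcode
cluster_order = ["celltype_A"]                # row order of the output matrix

# Step 1: convert 10x fragments to BPCells format
bpcells.experimental.import_10x_fragments(
    input=fragments_10x, output=fragments_dir, shift_end=1
)

# Step 2: group cells, then precalculate the (cell types, basepairs) matrix
groups = bpcells.experimental.build_cell_groups(
    fragments_dir, barcodes, clusters, cluster_order
)
bpcells.experimental.precalculate_insertion_counts(
    fragments_dir, matrix_dir, groups, chrom_sizes
)

# Step 3: slice arbitrary same-width regions as a numpy array
mat = bpcells.experimental.PrecalculatedInsertionMatrix(matrix_dir)
regions = pd.DataFrame({"chrom": ["chr1"], "start": [1_000_000], "end": [1_001_000]})
counts = mat.get_counts(regions)  # shape: (regions, cell types, basepairs)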
Benchmark estimates#
- Benchmark dataset: 600K cell subset of the Catlas paper, with 2.5 billion fragments
- Benchmark task: Load 128 random 501-bp peak regions from 111 cell types at basepair resolution
- Storage location: Local SSD. Networked file systems will be slower
|               | BPCells                | BigWigs     |
|---------------|------------------------|-------------|
| Creation time | 4.7 minutes, 8 threads | ?           |
| File size     | 6.2 GB                 | 13 GB       |
| Query time    | 0.37 seconds           | 2.2 seconds |
Main benefits of BPCells#
- Cell type count aggregation can be re-run fully from Python
- Query time is about 6x faster than BigWigs
Caveat for this prototype: due to development time limitations, this insertion matrix implementation does not support >=2^32 non-zero entries (about 4.29 billion). The Catlas dataset had about 3.2 billion non-zero entries. This limitation can be removed with additional technical work; as a workaround, multiple matrix objects can be created that each individually have <2^32 non-zero entries, as sketched below.
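A minimal sketch of that workaround, reusing variable names from the usage demo below (`barcodes`, `clusters`, `cluster_order`, `chrom_sizes`, `query_regions`); splitting the cell types in half is an arbitrary illustrative choice:

import numpy as np

# Split the cell types across two matrices so each stays under 2^32 non-zeros
mid = len(cluster_order) // 2
parts = [cluster_order[:mid], cluster_order[mid:]]

mats = []
for i, subset in enumerate(parts):
    keep = clusters.isin(subset)  # drop cells outside this half
    groups = bpcells.experimental.build_cell_groups(
        fragments_bpcells_path, barcodes[keep], clusters[keep], subset
    )
    out_path = f"insertions_part{i}"
    bpcells.experimental.precalculate_insertion_counts(
        fragments_bpcells_path, out_path, groups, chrom_sizes, threads=4
    )
    mats.append(bpcells.experimental.PrecalculatedInsertionMatrix(out_path))

# Stitch query results back together along the cell-type axis
counts = np.concatenate([m.get_counts(query_regions) for m in mats], axis=1)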
Usage Demo#
import bpcells.experimental
import pandas as pd
Data download#
We use a public 500-cell 10x dataset.
import os.path
import subprocess
import tempfile
tmpdir = tempfile.TemporaryDirectory()
fragments_10x_path = os.path.join(tmpdir.name, "atac_fragments.tsv.gz")
data_url = "https://cf.10xgenomics.com/samples/cell-atac/2.0.0/atac_pbmc_500_nextgem/atac_pbmc_500_nextgem_fragments.tsv.gz"
subprocess.run(["curl", "--silent", data_url], stdout=open(fragments_10x_path, "w"))
CompletedProcess(args=['curl', '--silent', 'https://cf.10xgenomics.com/samples/cell-atac/2.0.0/atac_pbmc_500_nextgem/atac_pbmc_500_nextgem_fragments.tsv.gz'], returncode=0)
metadata_url = "https://cf.10xgenomics.com/samples/cell-atac/2.0.0/atac_pbmc_500_nextgem/atac_pbmc_500_nextgem_singlecell.csv"
metadata_path = os.path.join(tmpdir.name, "cell_metadata.csv")
subprocess.run(["curl", "--silent", metadata_url], stdout=open(metadata_path, "w"))
cell_metadata = pd.read_csv(metadata_path)
cell_metadata = cell_metadata[cell_metadata.is__cell_barcode == 1].reset_index()
cell_metadata
|     | index | barcode | total | duplicate | chimeric | unmapped | lowmapq | mitochondrial | nonprimary | passed_filters | is__cell_barcode | excluded_reason | TSS_fragments | DNase_sensitive_region_fragments | enhancer_region_fragments | promoter_region_fragments | on_target_fragments | blacklist_region_fragments | peak_region_fragments | peak_region_cutsites |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1093 | AAACTGCAGACTCGGA-1 | 26219 | 15751 | 6 | 184 | 1310 | 0 | 3 | 8965 | 1 | 0 | 4499 | 0 | 0 | 0 | 4499 | 0 | 6264 | 11966 |
| 1 | 1807 | AAAGATGCACCTATTT-1 | 43137 | 25369 | 7 | 239 | 2544 | 0 | 7 | 14971 | 1 | 0 | 7135 | 0 | 0 | 0 | 7135 | 0 | 9825 | 18819 |
| 2 | 1828 | AAAGATGCAGATACAA-1 | 129867 | 91043 | 28 | 839 | 9577 | 0 | 40 | 28340 | 1 | 0 | 16771 | 0 | 0 | 0 | 16771 | 0 | 21166 | 40512 |
| 3 | 3144 | AAAGGGCTCGCTCTAC-1 | 97970 | 63834 | 16 | 492 | 5080 | 0 | 1 | 28547 | 1 | 0 | 13504 | 0 | 0 | 0 | 13504 | 0 | 18688 | 35366 |
| 4 | 3300 | AAATGAGAGTCCCGCA-1 | 16066 | 7440 | 0 | 71 | 805 | 0 | 0 | 7750 | 1 | 0 | 3619 | 0 | 0 | 0 | 3619 | 0 | 5287 | 10165 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 479 | 248048 | TTTGGCCAGATCTAAG-1 | 97255 | 63290 | 22 | 551 | 8536 | 0 | 1643 | 23213 | 1 | 0 | 11383 | 0 | 0 | 0 | 11383 | 0 | 16123 | 30530 |
| 480 | 248113 | TTTGGCCAGTACAGAT-1 | 15054 | 7596 | 3 | 115 | 794 | 13 | 0 | 6533 | 1 | 0 | 3264 | 0 | 0 | 0 | 3264 | 0 | 4617 | 8813 |
| 481 | 248137 | TTTGGCCAGTGCTCGC-1 | 54672 | 32244 | 8 | 300 | 3455 | 0 | 2 | 18663 | 1 | 0 | 11252 | 0 | 0 | 0 | 11252 | 0 | 13312 | 25376 |
| 482 | 248322 | TTTGGCCGTCCCGTGA-1 | 60394 | 32487 | 9 | 546 | 3196 | 0 | 0 | 24156 | 1 | 0 | 10596 | 0 | 0 | 0 | 10596 | 0 | 14413 | 27039 |
| 483 | 248733 | TTTGGTTCAGAGATGC-1 | 66942 | 36246 | 23 | 369 | 3904 | 174 | 11 | 26215 | 1 | 0 | 13128 | 0 | 0 | 0 | 13128 | 0 | 17938 | 34256 |

484 rows × 20 columns
cell_metadata.is__cell_barcode.sum()
np.int64(484)
Convert to BPCells format#
Notice that the conversion allows adjusting the start/end coordinates, as well as subsetting to only the barcodes passing QC. Adding 1 to the end coordinate is necessary for 10x inputs produced by cellranger.
%%time
fragments_bpcells_path = os.path.join(tmpdir.name, "bpcells_fragments")
bpcells.experimental.import_10x_fragments(
    input=fragments_10x_path,
    output=fragments_bpcells_path,
    shift_end=1,
    keeper_cells=cell_metadata.barcode[cell_metadata.is__cell_barcode == 1]
)
CPU times: user 3.43 s, sys: 74.9 ms, total: 3.51 s
Wall time: 3.45 s
Create the insertion matrix#
To calculate the insertion matrix, we first define our cell groups, as well as the ordering of the cell groups we want for our output matrix. Here we use the first two characters of the cell barcode, since annotated cell types are not available. Note that it is possible to leave cells out when calling `build_cell_groups`, in which case their data will not be included in the precalculated matrix (a sketch of this variant follows the cell below).

Next, we precalculate the insertion counts matrix, which can use parallelization to speed up portions of the work.
%%time
barcodes = cell_metadata.barcode
clusters = cell_metadata.barcode.str.slice(0,2)
cluster_order = sorted(set(clusters))
cell_groups_array = bpcells.experimental.build_cell_groups(fragments_bpcells_path, barcodes, clusters, cluster_order)
# We could provide a dict or local file path, but URL is easier
chrom_sizes = "http://hgdownload.cse.ucsc.edu/goldenpath/hg38/bigZips/hg38.chrom.sizes"
insertions_matrix_path = os.path.join(tmpdir.name, "bpcells_insertions_matrix")
bpcells.experimental.precalculate_insertion_counts(
    fragments_bpcells_path,
    insertions_matrix_path,
    cell_groups_array,
    chrom_sizes,
    threads=4
)
CPU times: user 3min 8s, sys: 710 ms, total: 3min 8s
Wall time: 1min 45s
<PrecalculatedInsertionMatrix with 16 pseudobulks and 24 chromomsomes stored in
/tmp/tmpenistmm7/bpcells_insertions_matrix
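As noted above, `build_cell_groups` can be called with only a subset of cells, in which case fragments from the omitted cells never enter the precalculated matrix. A minimal sketch of that variant, using a hypothetical QC cutoff on the `passed_filters` metadata column (the threshold is illustrative, not part of this tutorial's pipeline):

# Illustrative subset: keep only cells with >= 5,000 fragments passing filters
keep = cell_metadata.passed_filters >= 5_000  # hypothetical QC threshold
subset_groups = bpcells.experimental.build_cell_groups(
    fragments_bpcells_path,
    barcodes[keep],
    clusters[keep],
    sorted(set(clusters[keep])),
)
# subset_groups can then replace cell_groups_array in
# precalculate_insertion_counts(); the excluded cells contribute no counts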
Querying the insertion matrix#
We can load the precalculated matrix back from the path where it was saved.
mat = bpcells.experimental.PrecalculatedInsertionMatrix(insertions_matrix_path)
mat
<PrecalculatedInsertionMatrix with 16 pseudobulks and 24 chromomsomes stored in
/tmp/tmpenistmm7/bpcells_insertions_matrix
mat.shape
(np.uint32(16), np.uint32(3088269832))
To query the matrix, we use a pandas DataFrame with columns (chrom, start, end). All the regions must be the same length.
query_regions = pd.DataFrame({
    "chrom": ["chr1", "chr1", "chr6"],
    "start": [1_000_000, 2_000_000, 10_000_000],
})
query_regions["end"] = query_regions.start + 1000
query_regions
|   | chrom | start    | end      |
|---|-------|----------|----------|
| 0 | chr1  | 1000000  | 1001000  |
| 1 | chr1  | 2000000  | 2001000  |
| 2 | chr6  | 10000000 | 10001000 |
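Since all regions must share one width, variable-width inputs (e.g. called peaks) need to be standardized first. A small helper sketch, not part of the BPCells API, that re-centers each region on its midpoint:

def resize_regions(regions, width=1000):
    # Force every region to a fixed width centered on its midpoint
    mid = (regions.start + regions.end) // 2
    out = regions.copy()
    out["start"] = mid - width // 2
    out["end"] = out["start"] + width
    return out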
BPCells returns a numpy array of dimensions (regions, cell types, basepairs), holding the per-base counts for each cell type.
x = mat.get_counts(query_regions)
x
array([[[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]],

       [[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]],

       [[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]]], dtype=int32)
x.shape
(3, 16, 1000)
x.sum()
np.int64(363)
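Because the result is a plain numpy array, downstream summaries need no special API. For example (a small sketch beyond the original demo), summing over the basepair axis gives total insertion counts per region and pseudobulk:

# Collapse the basepair axis, leaving shape (regions, cell types)
per_region_totals = x.sum(axis=2)
per_region_totals.shape  # (3, 16)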
PyTorch-compatible dataset#
It is simple to wrap this matrix as a PyTorch-compatible dataset, given a set of regions for the training set. Note the use of the non-standard `__getitems__()` method, which PyTorch uses to provide batched loading for higher performance. This dataset object can be passed directly to `torch.utils.data.DataLoader`.
class BPCellsDataset:
    def __init__(self, regions, matrix_dir):
        self.regions = regions[["chrom", "start", "end"]]
        matrix_dir = str(os.path.abspath(os.path.expanduser(matrix_dir)))
        self.mat = bpcells.experimental.PrecalculatedInsertionMatrix(matrix_dir)
        # All regions must share a single width
        peak_width = self.regions.end.iloc[0] - self.regions.start.iloc[0]
        assert (self.regions.end - self.regions.start == peak_width).all()

    def __getitem__(self, i):
        return self.__getitems__([i])[0]

    def __getitems__(self, idx):
        # Adding this function allows for batched loading
        # See: https://github.com/pytorch/pytorch/issues/107218
        # Returns a numpy array of shape (batch_size, n_tasks, basepairs)
        return self.mat.get_counts(self.regions.iloc[idx])

    def __len__(self):
        return self.regions.shape[0]
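As a usage sketch (assuming PyTorch is installed; the training regions below are arbitrary placeholders), the dataset can be handed directly to `DataLoader`, whose default collate function routes batched index lists through `__getitems__()`:

from torch.utils.data import DataLoader

# Hypothetical training regions: any equal-width (chrom, start, end) table works
train_regions = pd.DataFrame({
    "chrom": ["chr1", "chr1", "chr6", "chr6"],
    "start": [1_000_000, 2_000_000, 10_000_000, 11_000_000],
})
train_regions["end"] = train_regions.start + 1000

dataset = BPCellsDataset(train_regions, insertions_matrix_path)
loader = DataLoader(dataset, batch_size=2, shuffle=True)  # illustrative settings

for batch in loader:
    print(batch.shape)  # torch.Size([2, 16, 1000])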