Basepair insertion counts tutorial#

BPCells python bindings can be used to query basepair-level coverage for predefined cell types.

This works in three steps (a condensed code preview follows the list):

  1. 10x fragment files or ArchR arrow files are converted to BPCells format. This is most flexibly done with the BPCells R bindings, though single-sample 10x import is supported through the python bindings.

  2. The BPCells python bindings use the input fragment files to create a large matrix of dimensions (# cell types, # basepairs in genome). This is the step where cell type groupings are determined.

  3. The BPCells python bindings can slice arbitrary genomic regions, returning a numpy array of dimensions (regions, cell types, basepairs).
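
As a condensed preview, the whole pipeline looks roughly like the sketch below. Each call is demonstrated in detail in the usage demo; the inputs (passing_barcodes, cell_types, cell_type_order, regions) are hypothetical placeholders here.

import bpcells.experimental

# Step 1: convert a 10x fragment file to BPCells format
bpcells.experimental.import_10x_fragments(
    input="atac_fragments.tsv.gz",
    output="bpcells_fragments",
    shift_end=1,
    keeper_cells=passing_barcodes,  # placeholder: QC-passing barcodes
)

# Step 2: aggregate into a (# cell types, # basepairs) matrix
groups = bpcells.experimental.build_cell_groups(
    "bpcells_fragments", passing_barcodes, cell_types, cell_type_order
)
bpcells.experimental.precalculate_insertion_counts(
    "bpcells_fragments", "insertion_matrix", groups,
    "http://hgdownload.cse.ucsc.edu/goldenpath/hg38/bigZips/hg38.chrom.sizes",
    threads=4,
)

# Step 3: slice arbitrary same-width genomic regions
mat = bpcells.experimental.PrecalculatedInsertionMatrix("insertion_matrix")
counts = mat.get_counts(regions)  # numpy array: (regions, cell types, basepairs)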

Benchmark estimates#

  • Benchmark dataset: 600K-cell subset of the Catlas paper, with 2.5 billion fragments

  • Benchmark task: load 128 random 501-bp peak regions from 111 cell types at basepair resolution

  • Storage location: local SSD (networked file systems will be slower)

|               | BPCells                | BigWigs     |
|---------------|------------------------|-------------|
| Creation time | 4.7 minutes, 8 threads | ?           |
| File size     | 6.2 GB                 | 13 GB       |
| Query time    | 0.37 seconds           | 2.2 seconds |

Main benefits of BPCells#

  • Cell type count aggregation can be re-run fully from Python

  • Query time is about 6x faster than BigWigs

Caveat for this prototype: due to development time limitations, this insertion matrix implementation does not support >=2^32 non-zero entries (about 4.29 billion). The Catlas dataset had about 3.2 billion non-zero entries. This limitation can be removed with additional technical work; as a workaround, multiple matrix objects can be created that each individually stay below 2^32 non-zero entries (see the sketch at the end of the "Create the insertion matrix" section).

Usage demo#

import bpcells.experimental

import pandas as pd

Data download#

We use a public 500-cell 10x PBMC ATAC dataset.

import os.path
import subprocess
import tempfile
tmpdir = tempfile.TemporaryDirectory()
fragments_10x_path = os.path.join(tmpdir.name, "atac_fragments.tsv.gz")

data_url = "https://cf.10xgenomics.com/samples/cell-atac/2.0.0/atac_pbmc_500_nextgem/atac_pbmc_500_nextgem_fragments.tsv.gz"
subprocess.run(["curl", "--silent", data_url], stdout=open(fragments_10x_path, "w"))
CompletedProcess(args=['curl', '--silent', 'https://cf.10xgenomics.com/samples/cell-atac/2.0.0/atac_pbmc_500_nextgem/atac_pbmc_500_nextgem_fragments.tsv.gz'], returncode=0)
metadata_url = "https://cf.10xgenomics.com/samples/cell-atac/2.0.0/atac_pbmc_500_nextgem/atac_pbmc_500_nextgem_singlecell.csv"
metadata_path = os.path.join(tmpdir.name, "cell_metadata.csv")
subprocess.run(["curl", "--silent", metadata_url], stdout=open(metadata_path, "w"))

cell_metadata = pd.read_csv(metadata_path)
cell_metadata = cell_metadata[cell_metadata.is__cell_barcode == 1].reset_index()
cell_metadata
index barcode total duplicate chimeric unmapped lowmapq mitochondrial nonprimary passed_filters is__cell_barcode excluded_reason TSS_fragments DNase_sensitive_region_fragments enhancer_region_fragments promoter_region_fragments on_target_fragments blacklist_region_fragments peak_region_fragments peak_region_cutsites
0 1093 AAACTGCAGACTCGGA-1 26219 15751 6 184 1310 0 3 8965 1 0 4499 0 0 0 4499 0 6264 11966
1 1807 AAAGATGCACCTATTT-1 43137 25369 7 239 2544 0 7 14971 1 0 7135 0 0 0 7135 0 9825 18819
2 1828 AAAGATGCAGATACAA-1 129867 91043 28 839 9577 0 40 28340 1 0 16771 0 0 0 16771 0 21166 40512
3 3144 AAAGGGCTCGCTCTAC-1 97970 63834 16 492 5080 0 1 28547 1 0 13504 0 0 0 13504 0 18688 35366
4 3300 AAATGAGAGTCCCGCA-1 16066 7440 0 71 805 0 0 7750 1 0 3619 0 0 0 3619 0 5287 10165
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
479 248048 TTTGGCCAGATCTAAG-1 97255 63290 22 551 8536 0 1643 23213 1 0 11383 0 0 0 11383 0 16123 30530
480 248113 TTTGGCCAGTACAGAT-1 15054 7596 3 115 794 13 0 6533 1 0 3264 0 0 0 3264 0 4617 8813
481 248137 TTTGGCCAGTGCTCGC-1 54672 32244 8 300 3455 0 2 18663 1 0 11252 0 0 0 11252 0 13312 25376
482 248322 TTTGGCCGTCCCGTGA-1 60394 32487 9 546 3196 0 0 24156 1 0 10596 0 0 0 10596 0 14413 27039
483 248733 TTTGGTTCAGAGATGC-1 66942 36246 23 369 3904 174 11 26215 1 0 13128 0 0 0 13128 0 17938 34256

484 rows × 20 columns

cell_metadata.is__cell_barcode.sum()
np.int64(484)

Convert to BPCells format#

Notice that the conversion allows adjusting the start/end coordinates, as well as subsetting to only the barcodes passing QC. Adding 1 to the end coordinate is necessary for 10x fragment files produced by cellranger.

%%time
fragments_bpcells_path = os.path.join(tmpdir.name, "bpcells_fragments")
bpcells.experimental.import_10x_fragments(
    input = fragments_10x_path, 
    output = fragments_bpcells_path, 
    shift_end=1, 
    keeper_cells=cell_metadata.barcode[cell_metadata.is__cell_barcode == 1]
)
CPU times: user 3.43 s, sys: 74.9 ms, total: 3.51 s
Wall time: 3.45 s

Create the insertion matrix#

To calculate the insertion matrix, we first define our cell groups, as well as the ordering of the cell groups we want in our output matrix. Here we group by the first two characters of the cell barcode, since annotated cell types are not available for this dataset. Note that it is possible to leave cells out when calling build_cell_groups, in which case their data will not be included in the precalculated matrix (see the sketch at the end of this section).

Next, we precalculate the insertion counts matrix, which can use parallelization to speed up portions of the work.

%%time
barcodes = cell_metadata.barcode
clusters = cell_metadata.barcode.str.slice(0,2)
cluster_order = sorted(set(clusters))

cell_groups_array = bpcells.experimental.build_cell_groups(fragments_bpcells_path, barcodes, clusters, cluster_order)

# We could provide a dict or local file path, but URL is easier
chrom_sizes = "http://hgdownload.cse.ucsc.edu/goldenpath/hg38/bigZips/hg38.chrom.sizes"

insertions_matrix_path = os.path.join(tmpdir.name, "bpcells_insertions_matrix")

bpcells.experimental.precalculate_insertion_counts(
    fragments_bpcells_path, 
    insertions_matrix_path, 
    cell_groups_array, 
    chrom_sizes, 
    threads=4
)
CPU times: user 3min 8s, sys: 710 ms, total: 3min 8s
Wall time: 1min 45s
<PrecalculatedInsertionMatrix with 16 pseudobulks and 24 chromosomes stored in 
	/tmp/tmpenistmm7/bpcells_insertions_matrix>
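
As noted above, cells can be left out when building cell groups, which also gives a workaround for the 2^32 non-zero entry limit described in the benchmark section: split the cell groups across several matrices that each stay under the limit. Below is a minimal, hypothetical sketch reusing the variables from this section; the half-and-half split and output paths are arbitrary.

# Hypothetical sketch: split the clusters into two halves so that each
# precalculated matrix individually stays below 2^32 non-zero entries.
half_1 = cluster_order[: len(cluster_order) // 2]
half_2 = cluster_order[len(cluster_order) // 2 :]

for name, subset in [("half_1", half_1), ("half_2", half_2)]:
    # Cells whose cluster is not in `subset` are left out, so their
    # data is simply excluded from this matrix.
    keep = clusters.isin(subset)
    groups = bpcells.experimental.build_cell_groups(
        fragments_bpcells_path, barcodes[keep], clusters[keep], subset
    )
    bpcells.experimental.precalculate_insertion_counts(
        fragments_bpcells_path,
        os.path.join(tmpdir.name, f"insertions_{name}"),
        groups,
        chrom_sizes,
        threads=4,
    )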

Querying the insertion matrix#

We can load the pre-calculated matrix from its input path.

mat = bpcells.experimental.PrecalculatedInsertionMatrix(insertions_matrix_path)
mat
<PrecalculatedInsertionMatrix with 16 pseudobulks and 24 chromosomes stored in 
	/tmp/tmpenistmm7/bpcells_insertions_matrix>
mat.shape
(np.uint32(16), np.uint32(3088269832))

The second dimension of the shape is the combined length of the 24 hg38 chromosomes in basepairs. To query the matrix, we use a pandas DataFrame with columns (chrom, start, end). All the regions must be the same length.

query_regions = pd.DataFrame({
    "chrom": ["chr1", "chr1", "chr6"],
    "start": [1_000_000, 2_000_000, 10_000_000],
})
query_regions["end"] = query_regions.start + 1000
query_regions
chrom start end
0 chr1 1000000 1001000
1 chr1 2000000 2001000
2 chr6 10000000 10001000

BPCells returns a numpy array of dimensions (regions, cell types, basepairs), holding the per-base insertion counts for each cell type.

x = mat.get_counts(query_regions)
x
array([[[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]],

       [[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]],

       [[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]]], dtype=int32)
x.shape
(3, 16, 1000)
x.sum()
np.int64(363)
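
For example, summing over the basepair axis gives total insertion counts per region and cell group. This is a small numpy/pandas sketch on top of the array above; the region names are made up for labeling.

# Collapse the basepair axis: one total per (region, cell group)
per_group_totals = x.sum(axis=2)  # shape: (3 regions, 16 cell groups)

totals = pd.DataFrame(
    per_group_totals.T,
    index=cluster_order,  # one row per cell group
    columns=[f"region_{i}" for i in range(len(query_regions))],  # made-up names
)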

PyTorch-compatible dataset#

It is simple to wrap this matrix as a PyTorch-compatible dataset, given a set of regions for the training set. Note the use of the non-standard __getitems__() method, which PyTorch uses to provide batched loading for higher performance. This dataset object can be passed directly to torch.utils.data.DataLoader, as shown in the sketch after the class definition.

class BPCellsDataset:
    def __init__(self, regions, matrix_dir):
        self.regions = regions[["chrom", "start", "end"]]

        matrix_dir = os.path.abspath(os.path.expanduser(matrix_dir))
        self.mat = bpcells.experimental.PrecalculatedInsertionMatrix(matrix_dir)

        # All regions must share the same width
        peak_width = self.regions.end.iloc[0] - self.regions.start.iloc[0]
        assert (self.regions.end - self.regions.start == peak_width).all()

    def __getitem__(self, i):
        return self.__getitems__([i])[0]

    def __getitems__(self, idx):
        # Adding this function allows for batched loading
        # See: https://github.com/pytorch/pytorch/issues/107218

        # Return array of shape (batch_size, n_tasks, basepairs)
        return self.mat.get_counts(self.regions.iloc[idx])

    def __len__(self):
        return self.regions.shape[0]
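
A minimal usage sketch follows, reusing query_regions as a stand-in training set; this wiring is an assumption about the training setup, not part of BPCells. Since __getitems__() already returns one numpy array per batch, passing torch.as_tensor as the collate_fn is enough to get tensors out.

import torch

dataset = BPCellsDataset(query_regions, insertions_matrix_path)

# __getitems__ hands the DataLoader a single numpy array per batch,
# so converting it with torch.as_tensor is all the collation needed.
loader = torch.utils.data.DataLoader(
    dataset, batch_size=2, shuffle=True, collate_fn=torch.as_tensor
)

for batch in loader:
    print(batch.shape)  # (batch_size, n_cell_groups, peak_width)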