Basepair insertion counts tutorial#

BPCells python bindings can be used to query basepair-level coverage for predefined cell types.

This works in three steps (a condensed code preview follows the list):

  1. 10x fragment files or ArchR arrow files are converted to BPCells format. This is most flexibly done with the BPCells R bindings, though single-sample 10x import is supported through the python bindings.

  2. The BPCells python bindings use the input fragment files to create a large matrix of dimensions (# cell types, # basepairs in genome). This is the step where cell type groupings are determined.

  3. The BPCells python bindings can slice arbitrary genomic regions, returning a numpy array of dimensions (regions, cell types, basepairs).
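
As a condensed preview, the whole pipeline looks roughly like the sketch below. Each call is demonstrated in detail in the usage demo; the inputs (passing_barcodes, cell_types, cell_type_order, regions) are hypothetical placeholders here.

import bpcells.experimental

# Step 1: convert a 10x fragment file to BPCells format
bpcells.experimental.import_10x_fragments(
    input="atac_fragments.tsv.gz",
    output="bpcells_fragments",
    shift_end=1,
    keeper_cells=passing_barcodes,  # placeholder: QC-passing barcodes
)

# Step 2: aggregate into a (# cell types, # basepairs) matrix
groups = bpcells.experimental.build_cell_groups(
    "bpcells_fragments", passing_barcodes, cell_types, cell_type_order
)
bpcells.experimental.precalculate_insertion_counts(
    "bpcells_fragments", "insertion_matrix", groups,
    "http://hgdownload.cse.ucsc.edu/goldenpath/hg38/bigZips/hg38.chrom.sizes",
    threads=4,
)

# Step 3: slice arbitrary same-width genomic regions
mat = bpcells.experimental.PrecalculatedInsertionMatrix("insertion_matrix")
counts = mat.get_counts(regions)  # numpy array: (regions, cell types, basepairs)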

Benchmark estimates#

  • Benchmark dataset: 600K-cell subset of the Catlas paper, with 2.5 billion fragments

  • Benchmark task: load 128 random 501-bp peak regions from 111 cell types at basepair resolution

  • Storage location: local SSD (networked file systems will be slower)

|               | BPCells                | BigWigs     |
|---------------|------------------------|-------------|
| Creation time | 4.7 minutes, 8 threads | ?           |
| File size     | 6.2 GB                 | 13 GB       |
| Query time    | 0.37 seconds           | 2.2 seconds |

Main benefits of BPCells#

  • Cell type count aggregation can be re-run fully from Python

  • Query time is about 6x faster than BigWigs

Caveat for this prototype: due to development time limitations, this insertion matrix implementation does not support >=2^32 non-zero entries (about 4.29 billion). The Catlas dataset had about 3.2 billion non-zero entries. This limitation can be removed with additional technical work; as a workaround, multiple matrix objects can be created that each individually stay below 2^32 non-zero entries (see the sketch at the end of the "Create the insertion matrix" section).

Usage demo#

import bpcells.experimental

import pandas as pd

Data download#

We use a public 500-cell 10x PBMC ATAC dataset.

import os.path
import subprocess
import tempfile
tmpdir = tempfile.TemporaryDirectory()
fragments_10x_path = os.path.join(tmpdir.name, "atac_fragments.tsv.gz")

data_url = "https://cf.10xgenomics.com/samples/cell-atac/2.0.0/atac_pbmc_500_nextgem/atac_pbmc_500_nextgem_fragments.tsv.gz"
subprocess.run(["curl", "--silent", data_url], stdout=open(fragments_10x_path, "w"))
CompletedProcess(args=['curl', '--silent', 'https://cf.10xgenomics.com/samples/cell-atac/2.0.0/atac_pbmc_500_nextgem/atac_pbmc_500_nextgem_fragments.tsv.gz'], returncode=0)
metadata_url = "https://cf.10xgenomics.com/samples/cell-atac/2.0.0/atac_pbmc_500_nextgem/atac_pbmc_500_nextgem_singlecell.csv"
metadata_path = os.path.join(tmpdir.name, "cell_metadata.csv")
subprocess.run(["curl", "--silent", metadata_url], stdout=open(metadata_path, "w"))

cell_metadata = pd.read_csv(metadata_path)
cell_metadata = cell_metadata[cell_metadata.is__cell_barcode == 1].reset_index()
cell_metadata
index barcode total duplicate chimeric unmapped lowmapq mitochondrial nonprimary passed_filters is__cell_barcode excluded_reason TSS_fragments DNase_sensitive_region_fragments enhancer_region_fragments promoter_region_fragments on_target_fragments blacklist_region_fragments peak_region_fragments peak_region_cutsites
0 1093 AAACTGCAGACTCGGA-1 26219 15751 6 184 1310 0 3 8965 1 0 4499 0 0 0 4499 0 6264 11966
1 1807 AAAGATGCACCTATTT-1 43137 25369 7 239 2544 0 7 14971 1 0 7135 0 0 0 7135 0 9825 18819
2 1828 AAAGATGCAGATACAA-1 129867 91043 28 839 9577 0 40 28340 1 0 16771 0 0 0 16771 0 21166 40512
3 3144 AAAGGGCTCGCTCTAC-1 97970 63834 16 492 5080 0 1 28547 1 0 13504 0 0 0 13504 0 18688 35366
4 3300 AAATGAGAGTCCCGCA-1 16066 7440 0 71 805 0 0 7750 1 0 3619 0 0 0 3619 0 5287 10165
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
479 248048 TTTGGCCAGATCTAAG-1 97255 63290 22 551 8536 0 1643 23213 1 0 11383 0 0 0 11383 0 16123 30530
480 248113 TTTGGCCAGTACAGAT-1 15054 7596 3 115 794 13 0 6533 1 0 3264 0 0 0 3264 0 4617 8813
481 248137 TTTGGCCAGTGCTCGC-1 54672 32244 8 300 3455 0 2 18663 1 0 11252 0 0 0 11252 0 13312 25376
482 248322 TTTGGCCGTCCCGTGA-1 60394 32487 9 546 3196 0 0 24156 1 0 10596 0 0 0 10596 0 14413 27039
483 248733 TTTGGTTCAGAGATGC-1 66942 36246 23 369 3904 174 11 26215 1 0 13128 0 0 0 13128 0 17938 34256

484 rows × 20 columns

cell_metadata.is__cell_barcode.sum()
np.int64(484)

Convert to BPCells format#

Notice that the conversion allows adjusting the start/end coordinates, as well as subsetting to only the barcodes passing QC. Adding 1 to the end coordinate is necessary for 10x fragment files produced by cellranger.

%%time
fragments_bpcells_path = os.path.join(tmpdir.name, "bpcells_fragments")
bpcells.experimental.import_10x_fragments(
    input = fragments_10x_path, 
    output = fragments_bpcells_path, 
    shift_end=1, 
    keeper_cells=cell_metadata.barcode[cell_metadata.is__cell_barcode == 1]
)
CPU times: user 3.43 s, sys: 74.9 ms, total: 3.51 s
Wall time: 3.45 s

Create the insertion matrix#

To calculate the insertion matrix, we first define our cell groups, as well as the ordering of the cell groups we want in our output matrix. Here we group by the first two characters of the cell barcode, since annotated cell types are not available for this dataset. Note that it is possible to leave cells out when calling build_cell_groups, in which case their data will not be included in the precalculated matrix (see the sketch at the end of this section).

Next, we precalculate the insertion counts matrix, which can use parallelization to speed up portions of the work.

%%time
barcodes = cell_metadata.barcode
clusters = cell_metadata.barcode.str.slice(0,2)
cluster_order = sorted(set(clusters))

cell_groups_array = bpcells.experimental.build_cell_groups(fragments_bpcells_path, barcodes, clusters, cluster_order)

# We could provide a dict or local file path, but URL is easier
chrom_sizes = "http://hgdownload.cse.ucsc.edu/goldenpath/hg38/bigZips/hg38.chrom.sizes"

insertions_matrix_path = os.path.join(tmpdir.name, "bpcells_insertions_matrix")

bpcells.experimental.precalculate_insertion_counts(
    fragments_bpcells_path, 
    insertions_matrix_path, 
    cell_groups_array, 
    chrom_sizes, 
    threads=4
)
CPU times: user 3min 8s, sys: 710 ms, total: 3min 8s
Wall time: 1min 45s
<PrecalculatedInsertionMatrix with 16 pseudobulks and 24 chromosomes stored in 
	/tmp/tmpenistmm7/bpcells_insertions_matrix>
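
As noted above, cells can be left out when building cell groups, which also gives a workaround for the 2^32 non-zero entry limit described in the benchmark section: split the cell groups across several matrices that each stay under the limit. Below is a minimal, hypothetical sketch reusing the variables from this section; the half-and-half split and output paths are arbitrary.

# Hypothetical sketch: split the clusters into two halves so that each
# precalculated matrix individually stays below 2^32 non-zero entries.
half_1 = cluster_order[: len(cluster_order) // 2]
half_2 = cluster_order[len(cluster_order) // 2 :]

for name, subset in [("half_1", half_1), ("half_2", half_2)]:
    # Cells whose cluster is not in `subset` are left out, so their
    # data is simply excluded from this matrix.
    keep = clusters.isin(subset)
    groups = bpcells.experimental.build_cell_groups(
        fragments_bpcells_path, barcodes[keep], clusters[keep], subset
    )
    bpcells.experimental.precalculate_insertion_counts(
        fragments_bpcells_path,
        os.path.join(tmpdir.name, f"insertions_{name}"),
        groups,
        chrom_sizes,
        threads=4,
    )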

Querying the insertion matrix#

We can load the pre-calculated matrix from its input path.

mat = bpcells.experimental.PrecalculatedInsertionMatrix(insertions_matrix_path)
mat
<PrecalculatedInsertionMatrix with 16 pseudobulks and 24 chromosomes stored in 
	/tmp/tmpenistmm7/bpcells_insertions_matrix>
mat.shape
(np.uint32(16), np.uint32(3088269832))

The second dimension of the shape is the combined length of the 24 hg38 chromosomes in basepairs. To query the matrix, we use a pandas DataFrame with columns (chrom, start, end). All the regions must be the same length.

query_regions = pd.DataFrame({
    "chrom": ["chr1", "chr1", "chr6"],
    "start": [1_000_000, 2_000_000, 10_000_000],
})
query_regions["end"] = query_regions.start + 1000
query_regions
chrom start end
0 chr1 1000000 1001000
1 chr1 2000000 2001000
2 chr6 10000000 10001000

BPCells returns a numpy array of dimensions (regions, cell types, basepairs), holding the per-base insertion counts for each cell type.

x = mat.get_counts(query_regions)
x
array([[[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]],

       [[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]],

       [[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]]], dtype=int32)
x.shape
(3, 16, 1000)
x.sum()
np.int64(363)
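
For example, summing over the basepair axis gives total insertion counts per region and cell group. This is a small numpy/pandas sketch on top of the array above; the region names are made up for labeling.

# Collapse the basepair axis: one total per (region, cell group)
per_group_totals = x.sum(axis=2)  # shape: (3 regions, 16 cell groups)

totals = pd.DataFrame(
    per_group_totals.T,
    index=cluster_order,  # one row per cell group
    columns=[f"region_{i}" for i in range(len(query_regions))],  # made-up names
)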

PyTorch-compatible dataset#

It is simple to wrap this matrix as a PyTorch-compatible dataset, given a set of regions for the training set. Note the use of the non-standard __getitems__() method, which PyTorch uses to provide batched loading for higher performance. This dataset object can be passed directly to torch.utils.data.DataLoader, as shown in the sketch after the class definition.

class BPCellsDataset:
    def __init__(self, regions, matrix_dir):
        self.regions = regions[["chrom", "start", "end"]]

        matrix_dir = os.path.abspath(os.path.expanduser(matrix_dir))
        self.mat = bpcells.experimental.PrecalculatedInsertionMatrix(matrix_dir)

        # All regions must share the same width
        peak_width = self.regions.end.iloc[0] - self.regions.start.iloc[0]
        assert (self.regions.end - self.regions.start == peak_width).all()

    def __getitem__(self, i):
        return self.__getitems__([i])[0]

    def __getitems__(self, idx):
        # Adding this function allows for batched loading
        # See: https://github.com/pytorch/pytorch/issues/107218

        # Return array of shape (batch_size, n_tasks, basepairs)
        return self.mat.get_counts(self.regions.iloc[idx])

    def __len__(self):
        return self.regions.shape[0]
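
A minimal usage sketch follows, reusing query_regions as a stand-in training set; this wiring is an assumption about the training setup, not part of BPCells. Since __getitems__() already returns one numpy array per batch, passing torch.as_tensor as the collate_fn is enough to get tensors out.

import torch

dataset = BPCellsDataset(query_regions, insertions_matrix_path)

# __getitems__ hands the DataLoader a single numpy array per batch,
# so converting it with torch.as_tensor is all the collation needed.
loader = torch.utils.data.DataLoader(
    dataset, batch_size=2, shuffle=True, collate_fn=torch.as_tensor
)

for batch in loader:
    print(batch.shape)  # (batch_size, n_cell_groups, peak_width)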