---
jupytext:
  formats: ipynb,md:myst
  text_representation:
    extension: .md
    format_name: myst
    format_version: 0.13
    jupytext_version: 1.15.1
kernelspec:
  display_name: Python 3 (ipykernel)
  language: python
  name: python3
mystnb:
  execution_timeout: 120
    
---

# Basepair insertion counts tutorial

BPCells python bindings can be used to query basepair-level coverage for predefined cell types. 

The way this works is in two steps: 

1. 10x or ArchR arrow files are converted to BPCells format. This is most flexible to do with the BPCells R bindings, though single-sample 10x import is supported through the python bindings.
2. The BPCells python bindings use the input fragment files to create large matrix of dimensions (# cell types, # basepairs in genome). This is where cell type groupings are determined.
3. The BPCells python bindings can slice arbitrary genomic regions, returning a numpy array of dimensions (regions, cell types, basepairs)

## Benchmark estimates
Benchmark dataset: 600K cell subset of the [Catlas paper](https://www.sciencedirect.com/science/article/pii/S0092867421012794), with 2.5 billion fragments
Benchmark task: Load 128 random 501-bp peak regions from 111 cell types at basepair resolution
Storage location: Local SSD. Networked file systems will be slower

|               | BPCells                | BigWigs     |
|---------------|------------------------|-------------|
| Creation time | 4.7 minutes, 8 threads | ?           |
| File size     | 6.2 GB                 | 13 GB       |
| Query time    | 0.37 seconds           | 2.2 seconds |

## Main benefits of BPCells
- Cell type count aggregation can be re-run fully from Python
- Query time is about 6x faster than BigWigs

**Caveat for this prototype**: due to development time limitations, this insertion matrix implementation does not support >=2^32 non-zero entries (4.29 billion). The catlas dataset had about 3.2 billion non-zero entries. This limitation can be removed with additional technical work, or as a workaround multiple matrix objects can be created that each individually have <2^32 non-zero entries.

+++

# Usage Demo

```{code-cell} ipython3
import bpcells.experimental

import pandas as pd
```

## Data download
We use a public 500-cell 10x dataset

```{code-cell} ipython3
import os.path
import subprocess
import tempfile
tmpdir = tempfile.TemporaryDirectory()
fragments_10x_path = os.path.join(tmpdir.name, "atac_fragments.tsv.gz")

data_url = "https://cf.10xgenomics.com/samples/cell-atac/2.0.0/atac_pbmc_500_nextgem/atac_pbmc_500_nextgem_fragments.tsv.gz"
subprocess.run(["curl", "--silent", data_url], stdout=open(fragments_10x_path, "w"))
```

```{code-cell} ipython3
metadata_url = "https://cf.10xgenomics.com/samples/cell-atac/2.0.0/atac_pbmc_500_nextgem/atac_pbmc_500_nextgem_singlecell.csv"
metadata_path = os.path.join(tmpdir.name, "cell_metadata.csv")
subprocess.run(["curl", "--silent", metadata_url], stdout=open(metadata_path, "w"))

cell_metadata = pd.read_csv(metadata_path)
cell_metadata = cell_metadata[cell_metadata.is__cell_barcode == 1].reset_index()
cell_metadata
```

```{code-cell} ipython3
cell_metadata.is__cell_barcode.sum()
```

## Convert to BPCells format
Notice that the conversion allows for adjusting the start/end coordinates, as well as subsetting to only the barcodes passing QC. Adding 1 to the end coordinate is necessary for 10x inputs produced by cellranger

```{code-cell} ipython3
%%time
fragments_bpcells_path = os.path.join(tmpdir.name, "bpcells_fragments")
bpcells.experimental.import_10x_fragments(
    input = fragments_10x_path, 
    output = fragments_bpcells_path, 
    shift_end=1, 
    keeper_cells=cell_metadata.barcode[cell_metadata.is__cell_barcode == 1]
)
```

## Create the insertion matrix
To calculate the insertion matrix, we first define our cell groups, as well as the ordering of the cell groups we want for our output matrix. Here we use the first two characters of the cell barcode since annotated cell types are not available. Note that it is possible to leave cells out when calling `build_cell_groups`, in which case that data will not be included in the precalculated matrix

Next, we precalculate the insertion counts matrix, which can use parallelization to speed up portions of the work.

```{code-cell} ipython3
%%time
barcodes = cell_metadata.barcode
clusters = cell_metadata.barcode.str.slice(0,2)
cluster_order = sorted(set(clusters))

cell_groups_array = bpcells.experimental.build_cell_groups(fragments_bpcells_path, barcodes, clusters, cluster_order)

# We could provide a dict or local file path, but URL is easier
chrom_sizes = "http://hgdownload.cse.ucsc.edu/goldenpath/hg38/bigZips/hg38.chrom.sizes"

insertions_matrix_path = os.path.join(tmpdir.name, "bpcells_insertions_matrix")

bpcells.experimental.precalculate_insertion_counts(
    fragments_bpcells_path, 
    insertions_matrix_path, 
    cell_groups_array, 
    chrom_sizes, 
    threads=4
)
```

## Querying the insertion matrix

We can load the pre-calculated matrix from its input path.

```{code-cell} ipython3
mat = bpcells.experimental.PrecalculatedInsertionMatrix(insertions_matrix_path)
mat
```

```{code-cell} ipython3
mat.shape
```

To query the matrix, we use a pandas DataFrame, with columns (chrom, start, end). All the regions must be the same length

```{code-cell} ipython3
query_regions = pd.DataFrame({
    "chrom": ["chr1", "chr1", "chr6"],
    "start": [1_000_000, 2_000_000, 10_000_000],
})
query_regions["end"] = query_regions.start + 1000
query_regions
```

BPCells returns a numpy array of dimensions (regions, cell types, basepairs), holding the per-base counts for each cell type

```{code-cell} ipython3
x = mat.get_counts(query_regions)
x
```

```{code-cell} ipython3
x.shape
```

```{code-cell} ipython3
x.sum()
```

## Pytorch-compatible dataset

It is simple to wrap this matrix as a pytorch-compatible dataset, given a set of regions for the training set. Note the use of the non-standard `__getitems__()` function which pytorch uses to provide batched loading for higher performance. This dataset object can be directly passed to `torch.utils.data.DataLoader`.

```{code-cell} ipython3
class BPCellsDataset:
    def __init__(self, regions, matrix_dir):
        self.regions = regions[["chrom", "start", "end"]]

        matrix_dir = str(os.path.abspath(os.path.expanduser(matrix_dir)))
        self.mat = bpcells.experimental.PrecalculatedInsertionMatrix(matrix_dir)
        
        peak_width = self.regions.end[0] - self.regions.start[0]
        assert (self.regions.end - self.regions.start == peak_width).all()

    def __getitem__(self, i):
        return self.__getitems__([i])[0]

    def __getitems__(self, idx):
        # Adding this function allows for batched loading
        # See: https://github.com/pytorch/pytorch/issues/107218

        # Return tensor of shape (batch_size, n_tasks, basepairs)
        return self.mat.get_counts(
            self.regions.iloc[idx,]
        )

    def __len__(self):
        return self.regions.shape[0]
```