Read gene annotations from GTF format into a data frame. The source can be a URL, a GTF file on disk, or a GENCODE release version.
Usage
read_gtf(
path,
attributes = c("gene_id"),
tags = character(0),
features = c("gene"),
keep_attribute_column = FALSE,
backup_url = NULL,
timeout = 300
)
read_gencode_genes(
dir,
release = "latest",
annotation_set = c("basic", "comprehensive"),
gene_type = "lncRNA|protein_coding|IG_.*_gene|TR_.*_gene",
attributes = c("gene_id", "gene_type", "gene_name"),
tags = character(0),
features = c("gene"),
timeout = 300
)
read_gencode_transcripts(
dir,
release = "latest",
transcript_choice = c("MANE_Select", "Ensembl_Canonical", "all"),
annotation_set = c("basic", "comprehensive"),
gene_type = "lncRNA|protein_coding|IG_.*_gene|TR_.*_gene",
attributes = c("gene_id", "gene_type", "gene_name", "transcript_id"),
features = c("transcript", "exon"),
timeout = 300
)
Arguments
- path
Path to file (or desired save location if backup_url is used)
- attributes
Vector of GTF attribute names to parse out as columns
- tags
Vector of tags to parse out as boolean presence/absence
- features
List of feature types to keep from the GTF (e.g. gene, transcript, exon, intron)
- keep_attribute_column
Boolean for whether to preserve the raw attribute text column
- backup_url
URL to download the GTF from if path does not exist
- timeout
Maximum time in seconds to wait for download from backup_url
- dir
Output directory to cache the downloaded gtf file
- release
Release version (prefix with M for mouse versions). For the most recent version, use "latest" or "latest_mouse"
- annotation_set
Either "basic" or "comprehensive" annotation sets (see details section).
- gene_type
Regular expression matching the gene types to keep. Defaults to protein_coding, lncRNA, and IG/TR genes
- transcript_choice
Method for selecting representative transcripts. Choices are:
MANE_Select: human-only, most conservative
Ensembl_Canonical: human+mouse, superset of MANE_Select for human
all: Preserve all transcript models (not recommended for plotting)
Value
Data frame with coordinates using the 0-based convention. Columns are:
chr
source
feature
start
end
score
strand
frame
attributes (optional; named according to listed attributes)
tags (named according to listed tags)
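As a rough illustration of the 0-based convention, the sketch below converts GTF coordinates (1-based, both ends inclusive) by shifting start down by one and leaving end unchanged. This assumes half-open, end-exclusive intervals, which matches the start and end values in the example output but is not stated explicitly here. The data frame and its values are illustrative only.

```r
# Illustrative rows in GTF coordinates (1-based, both ends inclusive).
gtf_rows <- data.frame(
  chr = c("chr1", "chr1"),
  start = c(11869L, 65419L),
  end = c(14409L, 71585L)
)

# Assumed conversion to 0-based, end-exclusive coordinates:
# shift start down by one and leave end unchanged.
zero_based <- transform(gtf_rows, start = start - 1L)

# Under this convention the interval width is simply end - start.
zero_based$width <- zero_based$end - zero_based$start
print(zero_based)
```

Under this assumption, interval lengths need no +1 correction, which is the usual reason tools adopt the convention.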
Details
read_gtf
Read a GTF from a file or URL
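To make the difference between attributes and tags concrete, here is a minimal base R sketch of how key/value attributes and repeated tag entries could be pulled out of a GTF attribute string. The helpers get_attribute() and has_tag() are hypothetical names for illustration and are not part of this package; the attribute text is made up in the style of a GENCODE record.

```r
# Illustrative attribute text from the ninth column of a GTF line.
attr_text <- paste(
  'gene_id "ENSG00000186092.7";',
  'gene_type "protein_coding";',
  'gene_name "OR4F5";',
  'tag "Ensembl_canonical";',
  'tag "MANE_Select";'
)

# Hypothetical helper: return the quoted value of a key, or NA if absent.
get_attribute <- function(text, key) {
  pattern <- sprintf('(^|; ?)%s "([^"]*)"', key)
  hit <- regmatches(text, regexec(pattern, text))[[1]]
  if (length(hit) == 0) NA_character_ else hit[3]
}

# Hypothetical helper: TRUE if the given tag value is present.
has_tag <- function(text, tag) {
  grepl(sprintf('tag "%s"', tag), text, fixed = TRUE)
}

gene_name <- get_attribute(attr_text, "gene_name")   # a character value
missing   <- get_attribute(attr_text, "transcript_id") # NA: key not present
is_mane   <- has_tag(attr_text, "MANE_Select")        # logical presence flag
```

Attributes yield one character column per key, while each requested tag yields a logical column, as with the MANE_Select column in the transcript example output.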
read_gencode_genes
Read gene annotations directly from GENCODE. The file name will vary depending
on the release and annotation set requested, but will be of the format
gencode.v42.annotation.gtf.gz. GENCODE currently recommends the basic set:
https://www.gencodegenes.org/human/. In release 42, both the comprehensive and
basic sets had identical gene-level annotations, but the comprehensive set had
additional transcript variants annotated.
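The default gene_type pattern can be checked against candidate gene types with base R. The sketch below assumes the pattern is applied as an ordinary unanchored regular expression, as grepl() does; how the function applies it internally is not stated here. The list of gene types is an illustrative sample.

```r
# Default pattern from the gene_type argument.
gene_type_regex <- "lncRNA|protein_coding|IG_.*_gene|TR_.*_gene"

# Illustrative sample of GENCODE-style gene types.
gene_types <- c(
  "protein_coding", "lncRNA", "IG_V_gene", "TR_C_gene",
  "miRNA", "IG_V_pseudogene"
)

# Assumed matching rule: keep any type the pattern matches.
keep <- grepl(gene_type_regex, gene_types)
kept_types <- gene_types[keep]
print(kept_types)
```

In this sample the pseudogene and miRNA entries are dropped because neither contains a substring matching any alternative in the pattern.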
read_gencode_transcripts
Read transcript models from GENCODE, for use with trackplot_gene()
Examples
#######################################################################
## read_gtf() example
#######################################################################
species <- "Saccharomyces_cerevisiae"
version <- "GCF_000146045.2_R64"
head(read_gtf(
path = sprintf("./reference/%s_genomic.gtf.gz", version),
backup_url = sprintf(
"https://ftp.ncbi.nlm.nih.gov/genomes/refseq/fungi/%s/reference/%s/%s_genomic.gtf.gz",
species, version, version
)
))
#> # A tibble: 6 × 9
#> chr source feature start end score strand frame gene_id
#> <chr> <chr> <chr> <dbl> <int> <chr> <chr> <chr> <chr>
#> 1 NC_001133.9 RefSeq gene 1806 2169 . - . YAL068C
#> 2 NC_001133.9 RefSeq gene 2479 2707 . + . YAL067W-A
#> 3 NC_001133.9 RefSeq gene 7234 9016 . - . YAL067C
#> 4 NC_001133.9 RefSeq gene 11564 11951 . - . YAL065C
#> 5 NC_001133.9 RefSeq gene 12045 12426 . + . YAL064W-B
#> 6 NC_001133.9 RefSeq gene 13362 13743 . - . YAL064C-A
#######################################################################
## read_gencode_genes() example
#######################################################################
read_gencode_genes("./references", release = "42")
#> # A tibble: 39,319 × 11
#> chr source feature start end score strand frame gene_id gene_type
#> <chr> <chr> <chr> <dbl> <int> <chr> <chr> <chr> <chr> <chr>
#> 1 chr1 HAVANA gene 11868 14409 . + . ENSG00000290… lncRNA
#> 2 chr1 HAVANA gene 29553 31109 . + . ENSG00000243… lncRNA
#> 3 chr1 HAVANA gene 34553 36081 . - . ENSG00000237… lncRNA
#> 4 chr1 HAVANA gene 57597 64116 . + . ENSG00000290… lncRNA
#> 5 chr1 HAVANA gene 65418 71585 . + . ENSG00000186… protein_…
#> 6 chr1 HAVANA gene 89294 133723 . - . ENSG00000238… lncRNA
#> 7 chr1 HAVANA gene 89550 91105 . - . ENSG00000239… lncRNA
#> 8 chr1 HAVANA gene 139789 140339 . - . ENSG00000239… lncRNA
#> 9 chr1 HAVANA gene 141473 173862 . - . ENSG00000241… lncRNA
#> 10 chr1 HAVANA gene 160445 161525 . + . ENSG00000241… lncRNA
#> # ℹ 39,309 more rows
#> # ℹ 1 more variable: gene_name <chr>
#######################################################################
## read_gencode_transcripts() example
#######################################################################
## If read_gencode_genes() has already been run on the same release,
## the previously downloaded annotations will be reused
read_gencode_transcripts("./references", release = "42")
#> # A tibble: 220,296 × 13
#> chr source feature start end score strand frame gene_id gene_type
#> <chr> <chr> <chr> <dbl> <int> <chr> <chr> <chr> <chr> <chr>
#> 1 chr1 HAVANA transcript 65418 71585 . + . ENSG00000… protein_…
#> 2 chr1 HAVANA exon 65418 65433 . + . ENSG00000… protein_…
#> 3 chr1 HAVANA exon 65519 65573 . + . ENSG00000… protein_…
#> 4 chr1 HAVANA exon 69036 71585 . + . ENSG00000… protein_…
#> 5 chr1 HAVANA transcript 450739 451678 . - . ENSG00000… protein_…
#> 6 chr1 HAVANA exon 450739 451678 . - . ENSG00000… protein_…
#> 7 chr1 HAVANA transcript 685715 686654 . - . ENSG00000… protein_…
#> 8 chr1 HAVANA exon 685715 686654 . - . ENSG00000… protein_…
#> 9 chr1 HAVANA transcript 923922 944574 . + . ENSG00000… protein_…
#> 10 chr1 HAVANA exon 923922 924948 . + . ENSG00000… protein_…
#> # ℹ 220,286 more rows
#> # ℹ 3 more variables: gene_name <chr>, transcript_id <chr>, MANE_Select <lgl>