Read gene annotations from GTF format into a data frame. The source can be a URL, a GTF file on disk, or a GENCODE release version.
Usage
read_gtf(
path,
attributes = c("gene_id"),
tags = character(0),
features = c("gene"),
keep_attribute_column = FALSE,
backup_url = NULL,
timeout = 300
)
read_gencode_genes(
dir,
release = "latest",
annotation_set = c("basic", "comprehensive"),
gene_type = "lncRNA|protein_coding|IG_.*_gene|TR_.*_gene",
attributes = c("gene_id", "gene_type", "gene_name"),
tags = character(0),
features = c("gene"),
timeout = 300
)
read_gencode_transcripts(
dir,
release = "latest",
transcript_choice = c("MANE_Select", "Ensembl_Canonical", "all"),
annotation_set = c("basic", "comprehensive"),
gene_type = "lncRNA|protein_coding|IG_.*_gene|TR_.*_gene",
attributes = c("gene_id", "gene_type", "gene_name", "transcript_id"),
features = c("transcript", "exon"),
timeout = 300
)
Arguments
- path
Path to file (or desired save location if backup_url is used)
- attributes
Vector of GTF attribute names to parse out as columns
- tags
Vector of tags to parse out as boolean presence/absence columns
- features
List of feature types to keep from the GTF (e.g. gene, transcript, exon, intron)
- keep_attribute_column
Boolean for whether to preserve the raw attribute text column
- backup_url
If path does not exist, a URL to download the GTF from
- timeout
Maximum time in seconds to wait for download from backup_url
- dir
Output directory to cache the downloaded gtf file
- release
Release version (prefix with M for mouse versions). For the most recent version, use "latest" or "latest_mouse"
- annotation_set
Either "basic" or "comprehensive" annotation sets (see details section).
- gene_type
Regular expression with which gene types to keep. Defaults to protein_coding, lncRNA, and IG/TR genes
- transcript_choice
Method for selecting representative transcripts. Choices are:
MANE_Select: human-only, most conservative
Ensembl_Canonical: human+mouse, superset of MANE_Select for human
all: Preserve all transcript models (not recommended for plotting)
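To make the default gene_type filter concrete, the sketch below applies the same regular expression to a handful of GENCODE-style gene type labels using base R's grepl(). The labels are illustrative, and how the package applies the pattern internally (for example, whether it anchors the match) is an assumption here, not something this page states.

```r
# Default pattern from the read_gencode_genes() usage above
gene_type_pattern <- "lncRNA|protein_coding|IG_.*_gene|TR_.*_gene"

# Illustrative gene type labels (not read from any file)
gene_types <- c(
  "protein_coding", "lncRNA", "IG_V_gene", "TR_J_gene",
  "miRNA", "processed_pseudogene", "IG_V_pseudogene"
)

# Assumption: the filter behaves like an unanchored grepl() match
keep <- grepl(gene_type_pattern, gene_types)
kept_types <- gene_types[keep]
print(kept_types)
```

Under the same assumption, a narrower filter such as gene_type = "protein_coding" would keep only the first label in this vector.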
Value
Data frame with coordinates using the 0-based convention. Columns are:
chr
source
feature
start
end
score
strand
frame
attributes (optional; named according to listed attributes)
tags (named according to listed tags)
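GTF files store 1-based, inclusive coordinates, so a 0-based result implies a shift of the start position. The sketch below shows that conversion in base R, assuming the common half-open (BED-style) convention in which the end value is left unchanged. The helper name gtf_to_zero_based and the input values are hypothetical.

```r
# Hypothetical helper: convert 1-based inclusive GTF coordinates to
# 0-based, end-exclusive coordinates (assumed BED-style convention)
gtf_to_zero_based <- function(df) {
  df$start <- df$start - 1L  # shift the start down by one
  # end is unchanged: a 1-based inclusive end equals a 0-based exclusive end
  df
}

# Illustrative records as they might appear in a GTF file
gtf_coords <- data.frame(
  chr = c("chr1", "chr1"),
  start = c(1807L, 2480L),
  end = c(2169L, 2707L)
)

zero_based <- gtf_to_zero_based(gtf_coords)

# Under this convention, feature width is simply end - start
zero_based$width <- zero_based$end - zero_based$start
print(zero_based)
```

One practical consequence of the half-open convention is that widths need no "+ 1" correction, which the last step illustrates.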
Details
read_gtf
Read a GTF from a file or URL
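The attributes and tags arguments both draw on the ninth GTF column, which holds semicolon-separated key "value" pairs. The following base R sketch shows one way such a column could be parsed. It is not the package's implementation, and the helper names extract_attribute and has_tag, as well as the example records, are hypothetical.

```r
# Hypothetical raw attribute strings in GTF style (made-up identifiers)
attr_text <- c(
  'gene_id "GENE_A"; gene_type "lncRNA"; tag "basic"; tag "Ensembl_canonical";',
  'gene_id "GENE_B"; gene_type "protein_coding"; tag "MANE_Select";'
)

# Hypothetical helper: pull the quoted value for one attribute name,
# returning NA when the attribute is absent from a record
extract_attribute <- function(x, name) {
  pattern <- sprintf('(^|;[[:space:]]*)%s "([^"]*)"', name)
  hits <- regmatches(x, regexec(pattern, x))
  vapply(
    hits,
    function(hit) if (length(hit) == 3) hit[3] else NA_character_,
    character(1)
  )
}

# Hypothetical helper: TRUE where a record carries the given tag
has_tag <- function(x, tag) {
  grepl(sprintf('tag "%s"', tag), x, fixed = TRUE)
}

parsed <- data.frame(
  gene_id = extract_attribute(attr_text, "gene_id"),
  gene_type = extract_attribute(attr_text, "gene_type"),
  MANE_Select = has_tag(attr_text, "MANE_Select")
)
print(parsed)
```

Each requested attribute becomes a character column and each requested tag becomes a logical column, mirroring the column layout described in the Value section.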
read_gencode_genes
Read gene annotations directly from GENCODE. The file name will vary depending
on the release and annotation set requested, but will be of the format
gencode.v42.annotation.gtf.gz. GENCODE currently recommends the basic set:
https://www.gencodegenes.org/human/. In release 42, both the comprehensive and
basic sets had identical gene-level annotations, but the comprehensive set had
additional transcript variants annotated.
read_gencode_transcripts
Read transcript models from GENCODE, for use with trackplot_gene()
Examples
#######################################################################
## read_gtf() example
#######################################################################
species <- "Saccharomyces_cerevisiae"
version <- "GCF_000146045.2_R64"
head(read_gtf(
path = sprintf("./reference/%s_genomic.gtf.gz", version),
backup_url = sprintf(
"https://ftp.ncbi.nlm.nih.gov/genomes/refseq/fungi/%s/reference/%s/%s_genomic.gtf.gz",
species, version, version
)
))
#> # A tibble: 6 × 9
#> chr source feature start end score strand frame gene_id
#> <chr> <chr> <chr> <dbl> <int> <chr> <chr> <chr> <chr>
#> 1 NC_001133.9 RefSeq gene 1806 2169 . - . YAL068C
#> 2 NC_001133.9 RefSeq gene 2479 2707 . + . YAL067W-A
#> 3 NC_001133.9 RefSeq gene 7234 9016 . - . YAL067C
#> 4 NC_001133.9 RefSeq gene 11564 11951 . - . YAL065C
#> 5 NC_001133.9 RefSeq gene 12045 12426 . + . YAL064W-B
#> 6 NC_001133.9 RefSeq gene 13362 13743 . - . YAL064C-A
#######################################################################
## read_gencode_genes() example
#######################################################################
read_gencode_genes("./references", release = "42")
#> # A tibble: 39,319 × 11
#> chr source feature start end score strand frame gene_id gene_type
#> <chr> <chr> <chr> <dbl> <int> <chr> <chr> <chr> <chr> <chr>
#> 1 chr1 HAVANA gene 11868 14409 . + . ENSG00000290… lncRNA
#> 2 chr1 HAVANA gene 29553 31109 . + . ENSG00000243… lncRNA
#> 3 chr1 HAVANA gene 34553 36081 . - . ENSG00000237… lncRNA
#> 4 chr1 HAVANA gene 57597 64116 . + . ENSG00000290… lncRNA
#> 5 chr1 HAVANA gene 65418 71585 . + . ENSG00000186… protein_…
#> 6 chr1 HAVANA gene 89294 133723 . - . ENSG00000238… lncRNA
#> 7 chr1 HAVANA gene 89550 91105 . - . ENSG00000239… lncRNA
#> 8 chr1 HAVANA gene 139789 140339 . - . ENSG00000239… lncRNA
#> 9 chr1 HAVANA gene 141473 173862 . - . ENSG00000241… lncRNA
#> 10 chr1 HAVANA gene 160445 161525 . + . ENSG00000241… lncRNA
#> # ℹ 39,309 more rows
#> # ℹ 1 more variable: gene_name <chr>
#######################################################################
## read_gencode_transcripts() example
#######################################################################
## If read_gencode_genes() has already been run on the same release,
## the previously downloaded annotations will be reused
read_gencode_transcripts("./references", release = "42")
#> # A tibble: 220,296 × 13
#> chr source feature start end score strand frame gene_id gene_type
#> <chr> <chr> <chr> <dbl> <int> <chr> <chr> <chr> <chr> <chr>
#> 1 chr1 HAVANA transcript 65418 71585 . + . ENSG00000… protein_…
#> 2 chr1 HAVANA exon 65418 65433 . + . ENSG00000… protein_…
#> 3 chr1 HAVANA exon 65519 65573 . + . ENSG00000… protein_…
#> 4 chr1 HAVANA exon 69036 71585 . + . ENSG00000… protein_…
#> 5 chr1 HAVANA transcript 450739 451678 . - . ENSG00000… protein_…
#> 6 chr1 HAVANA exon 450739 451678 . - . ENSG00000… protein_…
#> 7 chr1 HAVANA transcript 685715 686654 . - . ENSG00000… protein_…
#> 8 chr1 HAVANA exon 685715 686654 . - . ENSG00000… protein_…
#> 9 chr1 HAVANA transcript 923922 944574 . + . ENSG00000… protein_…
#> 10 chr1 HAVANA exon 923922 924948 . + . ENSG00000… protein_…
#> # ℹ 220,286 more rows
#> # ℹ 3 more variables: gene_name <chr>, transcript_id <chr>, MANE_Select <lgl>
