
Read gene annotations from GTF format into a data frame. The source can be a URL, a GTF file on disk, or a GENCODE release version.

Usage

read_gtf(
  path,
  attributes = c("gene_id"),
  tags = character(0),
  features = c("gene"),
  keep_attribute_column = FALSE,
  backup_url = NULL,
  timeout = 300
)

read_gencode_genes(
  dir,
  release = "latest",
  annotation_set = c("basic", "comprehensive"),
  gene_type = "lncRNA|protein_coding|IG_.*_gene|TR_.*_gene",
  attributes = c("gene_id", "gene_type", "gene_name"),
  tags = character(0),
  features = c("gene"),
  timeout = 300
)

read_gencode_transcripts(
  dir,
  release = "latest",
  transcript_choice = c("MANE_Select", "Ensembl_Canonical", "all"),
  annotation_set = c("basic", "comprehensive"),
  gene_type = "lncRNA|protein_coding|IG_.*_gene|TR_.*_gene",
  attributes = c("gene_id", "gene_type", "gene_name", "transcript_id"),
  features = c("transcript", "exon"),
  timeout = 300
)

Arguments

path

Path to file (or desired save location if backup_url is used)

attributes

Vector of GTF attribute names to parse out as columns

tags

Vector of tags to parse out as boolean presence/absence

features

Vector of feature types to keep from the GTF (e.g. gene, transcript, exon, intron)

keep_attribute_column

Boolean for whether to preserve the raw attribute text column

backup_url

URL to download the GTF from if path does not exist

timeout

Maximum time in seconds to wait for download from backup_url

dir

Output directory in which to cache the downloaded GTF file

release

Release version (prefix with M for mouse versions). For the most recent version, use "latest" or "latest_mouse".

annotation_set

Either "basic" or "comprehensive" annotation sets (see details section).

gene_type

Regular expression with which gene types to keep. Defaults to protein_coding, lncRNA, and IG/TR genes

transcript_choice

Method for selecting representative transcripts. Choices are:

  • MANE_Select: human-only, most conservative

  • Ensembl_Canonical: human+mouse, superset of MANE_Select for human

  • all: Preserve all transcript models (not recommended for plotting)

Value

Data frame with coordinates using the 0-based convention. Columns are:

  • chr

  • source

  • feature

  • start

  • end

  • score

  • strand

  • frame

  • attributes (optional; named according to listed attributes)

  • tags (named according to listed tags)
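
The coordinate convention can be illustrated with a small sketch. It assumes that "0-based" here means 0-based, end-exclusive coordinates (the BED-style convention), which the page does not spell out, and that the GTF input is 1-based and end-inclusive. The records below are hypothetical and the code uses base R only, not the package's own implementation.

```r
# Hypothetical GTF-style records: 1-based, end-inclusive coordinates
gtf_records <- data.frame(
  chr = c("chr1", "chr1"),
  start = c(1807, 2480),
  end = c(2169L, 2707L),
  stringsAsFactors = FALSE
)

# Convert to 0-based, end-exclusive coordinates (assumed convention):
# the start shifts down by one and the end is left unchanged
zero_based <- transform(gtf_records, start = start - 1)

# Under this convention the feature width is simply end - start
zero_based$width <- zero_based$end - zero_based$start
print(zero_based)
```

Under this assumption, only the start column differs from the values in the GTF file.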

Details

read_gtf

Read a GTF from a file or URL.
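
To make the attributes and tags arguments concrete, here is a minimal base-R sketch of how key/value attributes and tags might be pulled out of the raw GTF attribute text. The helper names extract_attribute() and has_tag() and the sample records are hypothetical; this is an illustration of the idea, not how read_gtf() is implemented.

```r
# Hypothetical raw attribute strings, as found in column 9 of a GTF file
attr_text <- c(
  'gene_id "GENE0001"; gene_type "protein_coding"; tag "basic"; tag "MANE_Select";',
  'gene_id "GENE0002"; gene_type "lncRNA"; tag "basic";'
)

# Hypothetical helper: return the quoted value that follows `key`,
# or NA when the key is absent from the attribute text
extract_attribute <- function(text, key) {
  pattern <- sprintf('(^|; *)%s "([^"]*)"', key)
  matches <- regmatches(text, regexec(pattern, text))
  vapply(
    matches,
    function(m) if (length(m) == 3) m[3] else NA_character_,
    character(1)
  )
}

# Hypothetical helper: TRUE when the given tag appears in the attribute text
has_tag <- function(text, tag) {
  grepl(sprintf('tag "%s"', tag), text, fixed = TRUE)
}

# Attributes become character columns, tags become logical columns
parsed <- data.frame(
  gene_id = extract_attribute(attr_text, "gene_id"),
  gene_type = extract_attribute(attr_text, "gene_type"),
  MANE_Select = has_tag(attr_text, "MANE_Select"),
  stringsAsFactors = FALSE
)
print(parsed)
```

This mirrors the shape of the returned data frame: one column per requested attribute and one presence/absence column per requested tag.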

read_gencode_genes

Read gene annotations directly from GENCODE. The file name will vary depending on the release and annotation set requested, but will be of the format gencode.v42.annotation.gtf.gz. GENCODE currently recommends the basic set: https://www.gencodegenes.org/human/. In release 42, both the comprehensive and basic sets had identical gene-level annotations, but the comprehensive set had additional transcript variants annotated.
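
The effect of the default gene_type pattern can be checked with a short sketch. It assumes the regular expression is applied as an unanchored match, as grepl() does; the page does not state how the function applies it, and the list of gene types below is only a hypothetical sample.

```r
# Default pattern from the Usage section
gene_type_pattern <- "lncRNA|protein_coding|IG_.*_gene|TR_.*_gene"

# Hypothetical sample of gene_type values
gene_types <- c(
  "protein_coding", "lncRNA", "IG_V_gene", "TR_J_gene",
  "miRNA", "IG_V_pseudogene"
)

# Keep the gene types that match the pattern (assumed unanchored matching)
keep <- grepl(gene_type_pattern, gene_types)
kept_types <- gene_types[keep]
print(kept_types)
```

To keep a different selection, pass another regular expression, for example "protein_coding" alone.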

read_gencode_transcripts

Read transcript models from GENCODE, for use with trackplot_gene().
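
As a rough illustration of selecting representative transcripts, the sketch below filters a hypothetical table on a logical MANE_Select column like the one in the example output. The table contents and the filtering step are assumptions made for illustration; the function's actual selection logic is not described on this page.

```r
# Hypothetical transcript table with a logical MANE_Select tag column
transcripts <- data.frame(
  feature = c("transcript", "exon", "transcript", "exon"),
  transcript_id = c("TX1", "TX1", "TX2", "TX2"),
  MANE_Select = c(TRUE, TRUE, FALSE, FALSE),
  stringsAsFactors = FALSE
)

# Keep only the rows belonging to tagged (representative) transcripts
representative <- transcripts[transcripts$MANE_Select, ]

# Count the exons retained for each representative transcript
is_exon <- representative$feature == "exon"
exon_counts <- table(representative$transcript_id[is_exon])
print(representative)
print(exon_counts)
```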

Examples

#######################################################################
## read_gtf() example
#######################################################################
species <- "Saccharomyces_cerevisiae"
version <- "GCF_000146045.2_R64"
head(read_gtf(
 path = sprintf("./reference/%s_genomic.gtf.gz", version),
 backup_url = sprintf(
   "https://ftp.ncbi.nlm.nih.gov/genomes/refseq/fungi/%s/reference/%s/%s_genomic.gtf.gz",
   species, version, version
 )
))
#> # A tibble: 6 × 9
#>   chr         source feature start   end score strand frame gene_id  
#>   <chr>       <chr>  <chr>   <dbl> <int> <chr> <chr>  <chr> <chr>    
#> 1 NC_001133.9 RefSeq gene     1806  2169 .     -      .     YAL068C  
#> 2 NC_001133.9 RefSeq gene     2479  2707 .     +      .     YAL067W-A
#> 3 NC_001133.9 RefSeq gene     7234  9016 .     -      .     YAL067C  
#> 4 NC_001133.9 RefSeq gene    11564 11951 .     -      .     YAL065C  
#> 5 NC_001133.9 RefSeq gene    12045 12426 .     +      .     YAL064W-B
#> 6 NC_001133.9 RefSeq gene    13362 13743 .     -      .     YAL064C-A


#######################################################################
## read_gencode_genes() example
#######################################################################
read_gencode_genes("./references", release = "42")
#> # A tibble: 39,319 × 11
#>    chr   source feature  start    end score strand frame gene_id       gene_type
#>    <chr> <chr>  <chr>    <dbl>  <int> <chr> <chr>  <chr> <chr>         <chr>    
#>  1 chr1  HAVANA gene     11868  14409 .     +      .     ENSG00000290… lncRNA   
#>  2 chr1  HAVANA gene     29553  31109 .     +      .     ENSG00000243… lncRNA   
#>  3 chr1  HAVANA gene     34553  36081 .     -      .     ENSG00000237… lncRNA   
#>  4 chr1  HAVANA gene     57597  64116 .     +      .     ENSG00000290… lncRNA   
#>  5 chr1  HAVANA gene     65418  71585 .     +      .     ENSG00000186… protein_…
#>  6 chr1  HAVANA gene     89294 133723 .     -      .     ENSG00000238… lncRNA   
#>  7 chr1  HAVANA gene     89550  91105 .     -      .     ENSG00000239… lncRNA   
#>  8 chr1  HAVANA gene    139789 140339 .     -      .     ENSG00000239… lncRNA   
#>  9 chr1  HAVANA gene    141473 173862 .     -      .     ENSG00000241… lncRNA   
#> 10 chr1  HAVANA gene    160445 161525 .     +      .     ENSG00000241… lncRNA   
#> # ℹ 39,309 more rows
#> # ℹ 1 more variable: gene_name <chr>


#######################################################################
## read_gencode_transcripts() example
#######################################################################
## If read_gencode_genes() was already run on the same release,
## the previously downloaded annotations are reused
read_gencode_transcripts("./references", release = "42")
#> # A tibble: 220,296 × 13
#>    chr   source feature     start    end score strand frame gene_id    gene_type
#>    <chr> <chr>  <chr>       <dbl>  <int> <chr> <chr>  <chr> <chr>      <chr>    
#>  1 chr1  HAVANA transcript  65418  71585 .     +      .     ENSG00000… protein_…
#>  2 chr1  HAVANA exon        65418  65433 .     +      .     ENSG00000… protein_…
#>  3 chr1  HAVANA exon        65519  65573 .     +      .     ENSG00000… protein_…
#>  4 chr1  HAVANA exon        69036  71585 .     +      .     ENSG00000… protein_…
#>  5 chr1  HAVANA transcript 450739 451678 .     -      .     ENSG00000… protein_…
#>  6 chr1  HAVANA exon       450739 451678 .     -      .     ENSG00000… protein_…
#>  7 chr1  HAVANA transcript 685715 686654 .     -      .     ENSG00000… protein_…
#>  8 chr1  HAVANA exon       685715 686654 .     -      .     ENSG00000… protein_…
#>  9 chr1  HAVANA transcript 923922 944574 .     +      .     ENSG00000… protein_…
#> 10 chr1  HAVANA exon       923922 924948 .     +      .     ENSG00000… protein_…
#> # ℹ 220,286 more rows
#> # ℹ 3 more variables: gene_name <chr>, transcript_id <chr>, MANE_Select <lgl>