Read gene annotations from GTF format into a data frame. The source can be a URL, a GTF file on disk, or a GENCODE release version.

Usage

read_gtf(
  path,
  attributes = c("gene_id"),
  tags = character(0),
  features = c("gene"),
  keep_attribute_column = FALSE,
  backup_url = NULL,
  timeout = 300
)

read_gencode_genes(
  dir,
  release = "latest",
  annotation_set = c("basic", "comprehensive"),
  gene_type = "lncRNA|protein_coding|IG_.*_gene|TR_.*_gene",
  attributes = c("gene_id", "gene_type", "gene_name"),
  tags = character(0),
  features = c("gene"),
  timeout = 300
)

read_gencode_transcripts(
  dir,
  release = "latest",
  transcript_choice = c("MANE_Select", "Ensembl_Canonical", "all"),
  annotation_set = c("basic", "comprehensive"),
  gene_type = "lncRNA|protein_coding|IG_.*_gene|TR_.*_gene",
  attributes = c("gene_id", "gene_type", "gene_name", "transcript_id"),
  features = c("transcript", "exon"),
  timeout = 300
)

Arguments

path

Path to file (or desired save location if backup_url is used)

attributes

Vector of GTF attribute names to parse out as columns

tags

Vector of tags to parse out as boolean presence/absence

features

Vector of feature types to keep from the GTF (e.g. gene, transcript, exon, intron)

keep_attribute_column

Boolean for whether to preserve the raw attribute text column

backup_url

URL to download the GTF from if path does not exist

timeout

Maximum time in seconds to wait for download from backup_url

dir

Output directory to cache the downloaded gtf file

release

GENCODE release version (prefix with M for mouse releases). For the most recent release, use "latest" or "latest_mouse"

annotation_set

Either "basic" or "comprehensive" annotation sets (see details section).

gene_type

Regular expression matching the gene types to keep. Defaults to protein_coding, lncRNA, and IG/TR genes

transcript_choice

Method for selecting representative transcripts. Choices are:

  • MANE_Select: human-only, most conservative

  • Ensembl_Canonical: human+mouse, superset of MANE_Select for human

  • all: Preserve all transcript models (not recommended for plotting)
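The relationship between the first two choices can be illustrated with a small base R sketch. GENCODE GTFs mark such transcripts with tag entries in the attribute text (for example tag "MANE_Select"). Whether read_gencode_transcripts() selects transcripts in exactly this way is an assumption, and the toy table and the has_tag() helper below are hypothetical.

```r
# Toy transcript table; the attribute strings imitate GENCODE-style text.
transcripts <- data.frame(
  transcript_id = c("tx1", "tx2", "tx3"),
  attributes = c(
    'gene_id "g1"; tag "basic"; tag "Ensembl_canonical"; tag "MANE_Select";',
    'gene_id "g1"; tag "basic";',
    'gene_id "g2"; tag "basic"; tag "Ensembl_canonical";'
  ),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE where the attribute text carries the given tag.
has_tag <- function(attribute_text, tag) {
  grepl(sprintf('tag "%s";', tag), attribute_text, fixed = TRUE)
}

# Transcript IDs carrying each tag.
mane <- transcripts$transcript_id[has_tag(transcripts$attributes, "MANE_Select")]
canonical <- transcripts$transcript_id[has_tag(transcripts$attributes, "Ensembl_canonical")]

# In this toy data every MANE_Select transcript is also Ensembl canonical.
all(mane %in% canonical)
```

In this toy table the canonical set contains every MANE_Select transcript plus one more, which mirrors the "superset" wording above.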

Value

Data frame with coordinates using the 0-based convention. Columns are:

  • chr

  • source

  • feature

  • start

  • end

  • score

  • strand

  • frame

  • attributes (optional; named according to listed attributes)

  • tags (named according to listed tags)
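To make the column layout concrete, here is a minimal base R sketch that turns one GTF record into the kind of row described above. It is not the package's implementation. It assumes that "0-based" means the half-open convention used by BED files, so only the start is shifted. The toy record and the get_attribute() helper are hypothetical.

```r
# One toy GTF record: 9 tab-separated fields, 1-based inclusive coordinates.
gtf_line <- paste(
  "chr1", "toy", "gene", "1000", "2000", ".", "+", ".",
  'gene_id "GENE0001.1"; gene_name "TOY1"; tag "example_tag";',
  sep = "\t"
)

fields <- strsplit(gtf_line, "\t", fixed = TRUE)[[1]]
names(fields) <- c("chr", "source", "feature", "start", "end",
                   "score", "strand", "frame", "attributes")

# Hypothetical helper: quoted value of one attribute, or NA if it is absent.
get_attribute <- function(attribute_text, name) {
  pattern <- sprintf('%s "([^"]*)"', name)
  m <- regmatches(attribute_text, regexec(pattern, attribute_text))[[1]]
  if (length(m) == 2) m[2] else NA_character_
}

record <- data.frame(
  chr = fields[["chr"]],
  feature = fields[["feature"]],
  # Assumption: 1-based inclusive start becomes 0-based; end stays as-is (half-open).
  start = as.integer(fields[["start"]]) - 1L,
  end = as.integer(fields[["end"]]),
  strand = fields[["strand"]],
  # One column per requested attribute.
  gene_id = get_attribute(fields[["attributes"]], "gene_id"),
  # One boolean column per requested tag.
  example_tag = grepl('tag "example_tag"', fields[["attributes"]], fixed = TRUE),
  stringsAsFactors = FALSE
)
print(record)
```

Under the half-open assumption, end - start equals the feature length in bases (here 1001).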

Details

read_gtf

Read a GTF from a file or URL.
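A hedged usage sketch: the code below writes a two-record toy GTF to a temporary file and, if the BPCells package is installed, reads it back with the arguments documented above. The toy records are invented for illustration, and the assumption that read_gtf() accepts an uncompressed .gtf file on disk is not confirmed by this page.

```r
# Toy GTF content: one comment line, one gene record, one transcript record.
gtf_lines <- c(
  "##description: toy annotation for illustration",
  paste("chr1", "toy", "gene", "1000", "2000", ".", "+", ".",
        'gene_id "GENE0001.1"; gene_name "TOY1";', sep = "\t"),
  paste("chr1", "toy", "transcript", "1000", "2000", ".", "+", ".",
        'gene_id "GENE0001.1"; transcript_id "TX0001.1";', sep = "\t")
)
gtf_path <- tempfile(fileext = ".gtf")
writeLines(gtf_lines, gtf_path)

# Only call into BPCells when it is available.
if (requireNamespace("BPCells", quietly = TRUE)) {
  genes <- BPCells::read_gtf(
    gtf_path,
    attributes = c("gene_id", "gene_name"),  # parsed out as columns
    features = c("gene")                     # the transcript record is dropped
  )
  print(genes)
}
```

With features = c("gene"), only the gene record should remain in the result; passing c("gene", "transcript") would keep both.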

read_gencode_genes

Read gene annotations directly from GENCODE. The file name will vary depending on the release and annotation set requested, but will be of the format gencode.v42.annotation.gtf.gz. GENCODE currently recommends the basic set: https://www.gencodegenes.org/human/. In release 42, both the comprehensive and basic sets had identical gene-level annotations, but the comprehensive set had additional transcript variants annotated.
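The default gene_type pattern can be tried out on a few GENCODE-style gene type labels with base R. This is only a sketch of what the regular expression accepts. Whether the function matches the pattern against the whole value or any part of it is an assumption (grepl() below matches anywhere in the string).

```r
# Default pattern from the Usage section.
gene_type_pattern <- "lncRNA|protein_coding|IG_.*_gene|TR_.*_gene"

# A few example gene type labels to test against the pattern.
gene_types <- c("protein_coding", "lncRNA", "IG_V_gene", "TR_C_gene",
                "IG_V_pseudogene", "miRNA", "snRNA")

# Keep the labels that the pattern matches.
keep <- grepl(gene_type_pattern, gene_types)
kept <- gene_types[keep]
print(kept)
```

Here the pseudogene and small RNA labels are dropped, because none of them contains "_gene" after an "IG_" or "TR_" prefix or one of the two literal names.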

read_gencode_transcripts

Read transcript models from GENCODE, for use with trackplot_gene().