[Tutorial] Human Genome Annotation

Tutorial for Tidyverse (Chapter 4)

1. Introduction

1.1. What is gene annotation?

Over the past years, we have learnt that there are a number of chromosomes and genes in our genome. Counting the number of chromosomes is fairly easy but students might find difficult to say how many genes we have in our genome. If you can get an answer for this, could you tell how many genes encode protein and how many do not?

To answer this question, we need to access the database for gene annotation. Gene annotation is the process of making nucleotide sequence meaningful - where genes are located? whether it is protein-coding or noncoding. If you would like to get an overview of gene annotation, please find this link.

One of well-known collaborative efforts in gene annotation is the GENCODE consortium. It is a part of the Encyclopedia of DNA Elements (The ENCODE project consortium) and aims to identify all gene features in the human genome using a combination of computational analysis, manual annotation, and experimental validation (Harrow et al. 2012). You might find another database for gene annotation, like RefSeq, CCDS, and need to understand differences between them.

Figure 1. Comparison of GENCODE and RefSeq gene annotation and the impact of reference geneset on variant effect prediction (Frankish et al. 2015). A) Mean number of alternatively spliced transcripts per multi-exon protein-coding locus B) Mean number of unique CDS per multi-exon protein-coding locus C) Mean number of unique (non-redundant) exons per multi-exon protein-coding locus D) Percentage genomic coverage of unique (non-redundant) exons at multi-exon protein-coding loci.

In this tutorial, we will access to gene annotation from the GENCODE consortium and explore genes and functional elements in our genome.

1.2. Aims

What we will do with this dataset:

  • Be familiar with gene annotation modality.

  • Tidy data and create a table for your analysis.

  • Apply tidyverse functions for data munging.

Please note that there is better solution for getting gene annotation in R if you use a biomart. Our tutorial is only designed to have a practice on tidyverse exercise.

2. Explore your data

2.1. Unboxing your dataset

This tutorial will use a gene annotation file from the GENCODE. You will need to download the file from the GENCODE. If you are using terminal, please download file using wget:

# Run from your terminal, not R console
# wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_31/gencode.v31.basic.annotation.gtf.gz

# Once you downloaded the file, you won't need to download it again. So please comment out the command above by adding #

Once you download the file, you can print out the first few lines using the following bash command (we will learn UNIX commands later):

# Run from your terminal, not R console
gzcat gencode.v31.basic.annotation.gtf.gz | head -7

The file is the GFT file format, which you will find most commonly in gene annotation. Please read the file format thoroughly in the link above.

For the tutorial, we need to load two packages. If the package is not installed in your system, please install it.

  • tidyverse, a package you have learnt from the chapter 5.

  • readr, a package provides a fast and friendly way to read. Since the file gencode.v31.basic.annotation.gtf.gz is pretty large, you will need some function to load data quickly into your workspace. readr in a part of tidyverse, so you can just load tidyverse to use readr functions.

Let's load the GTF file into your workspace. We will use read_delim function from the readr package. This is much faster loading than read.delim or read.csv from R base. However, please keep in mind that some parameters and output class for read_delim are slightly different from them.

 library(tidyverse)
 d = read_delim('gencode.v31.basic.annotation.gtf.gz', 
               delim='\t', skip = 5, progress = F, 
               col_names = F)

Can you find out what the parameters mean? Few things to note are:

  • The GTF file contains the first few lines for comments (#). In general, the file contains description, provider, date, format.

  • The GTF file does not have column names so you will need to assign `FALSE for col_names.

This is sort of canonical way to load your dataset into R. However, we are using a GTF format, which is specific to gene annotation so we can use a package to specifically handle a GTF file.

Here I introduce the package rtracklayer. Let's install the package first.

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("rtracklayer")

Then, now you can read the GTF file using this package. Then, you can check the class of the object d.

d = rtracklayer::import('gencode.v31.basic.annotation.gtf.gz')
class(d)

You will find out that this is GRanges class. This is from the package Genomic Range, specifically dealing with genomic datasets but we are not heading into this in this tutorial. So please find this information if you are serious on this.

We are converting d into a data frame as following:

d = d %>% as.data.frame()

Let's overview few lines from the data frame, and explore what you get in this object.

head(d)

One thing you can find is that there is no columns in the data frame. Let's match which information is provided in columns. You can find the instruction page in the website (link).

Based on this, you can assign a name for 9 columns. One thing to remember is you should not use space for the column name. Spacing in the column name is actually working but not a good habit for your code. So please replace a space with underscore in the column name.

# Assign column names according to the GENCODE instruction.
cols = c('chrom', 'source', 'feature_type', 'start', 'end', 'score', 'strand', 'phase', 'info')

Now you can set up the column names into the col_names parameter, and load the file into a data frame.

d = read_delim('gencode.v31.basic.annotation.gtf.gz', 
               delim='\t', skip = 5, 
               progress = F,
               col_names = cols)

You can find the column names are now all set.

head(d)

When you loaded the file, you see the message about the data class. You might want to overview this data.

summary(d)

2.2. How many feature types in the GENCODE dataset?

As instructed in the GENCODE website, the GENCODE dataset provides a range of annotations for the feature type. You can check feature types using ____ function.

d %>% group_by(feature_type) %>% count(feature_type)
# table(d$feature_type)

How many feature types provided in the GENCODE? And how many items stored for each feature type? Please write down the number of feature types from the dataset. Also, if you are not familiar with these types, it would be good to put one or two sentences that can describe each type).

2.3. How many genes we have?

Let's count the number of genes in our genome. Since we know that the column feature_type contains rows with gene, which contains obviously annotations for genes. We might want to subset those rows from the data frame.

d1 = filter(d, feature_type == 'gene')
# d1 = d[d$feature_type == 'gene', ]

2.4. Ensembl, Havana and CCDS.

Gene annotation for the human genome is provided by multiple organizations with different gene annotation methods and strategy. This means that information can be varying by resources, and users need to understand heterogeniety inherent in annotation databases.

The GENCODE project utlizes two sources of gene annotation.

  1. Havana: Manual gene annotation (detailed strategy in here)

  2. Ensembl: Automatic gene annotation (detailed strategy in here)

It provides the combination of Ensembl/HAVANA gene set as the default gene annotation for the human genome. In addition, they also guarantee that all transcripts from the Consensus Coding Sequence (CCDS) set are present in the GENCODE gene set. The CCDS project is a collaborative effort to identify a core set of protein coding regions that are consistently annotated and of high quality. Initial results from the Consensus CDS (CCDS) project are now available through the appropriate Ensembl gene pages and from the CCDS project page at NCBI. The CCDS set is built by consensus among Ensembl, the National Center for Biotechnology Information (NCBI), and the HUGO Gene Nomenclature Committee (HGNC) for human (link).

Figure 2. Comparison of CCDS and Gencode (Source).

Right. Then now we count how many genes annotated with HAVANA and ENSEMBL.

d %>% group_by(source) %>% count(source)

2.5. do.call

Since the last column info contains a long string for multiple annotations, we will need to split it to extract each annotation. For example, the first line for transcript annotation looks like this:

chr1    HAVANA    transcript    11869    14409    .    +    .    gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "lncRNA"; transcript_name "DDX11L1-202"; level 2; transcript_support_level "1"; hgnc_id "HGNC:37102"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";

If you would like to split transcript_support_level and create a new column, you can use strsplit function.

a = 'chr1    HAVANA    transcript    11869    14409    .    +    .    gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "lncRNA"; transcript_name "DDX11L1-202"; level 2; transcript_support_level "1"; hgnc_id "HGNC:37102"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";'

strsplit(a, 'transcript_support_level\\s+"')

After split the string, you can select the second item in the list ([[1]][2]).

strsplit(a, 'transcript_support_level\\s+"')[[1]][2]

You can find the 1 in the first position, which you will need to split again.

b = strsplit(a, 'transcript_support_level\\s+"')[[1]][2]
strsplit(b, '\\"')

From this, you will get the first item in the list ([[1]][1]).

Now you would like to apply strsplit function across vectors. For this, do.call function can be easily implemented to strsplit over the vectors from one column. Let's try this.

head(do.call(rbind.data.frame, strsplit(a, 'transcript_support_level\\s+"'))[[2]])

Now you can write two lines of codes to process two steps we discussed above.

# First filter transcripts and create a data frame.
d2 <- d %>% filter(feature_type == 'transcript')

# Now apply the functions. 
d2$transcript_support_level <- as.character(do.call(rbind.data.frame, 
                                                    strsplit(d2$info, 'transcript_support_level\\s+"'))[[2]])

d2$transcript_support_level <- as.character(do.call(rbind.data.frame, 
                                                    strsplit( d2$transcript_support_level, '\\"'))[[1]])

Now you can check the strsplit works.

head(d2$transcript_support_level)

You can use the same method to extract other annotations, like gene_id, gene_name etc.

3. Exercises

Here I list the questions for your activity. Please note that it is an exercise for tidyverse functions, which you will need to use in your code. In addition, you will need to write an one-line code for each question using pipe %>%.

For questions, you should read some information thoroughly, including:

3.1. Annotation of transcripts in our genome

  1. Computes the number of transcripts per gene. What is the mean number of transcripts per gene? What is the quantile (25%, 50%, 75%) for these numbers? Which gene has the greatest number of transcript?

  2. Compute the number of transcripts per gene among gene biotypes. For example, compare the number of transcript per gene between protein-coding genes, long noncoding genes, pseudogenes.

  3. Final task is to compute the number of transcripts per gene per chromosome.

3.2. Gene length in the GENCODE

  1. What is the average length of human genes?

  2. Is the distribution of gene length differed by autosomal and sex chromosomes? Please calculate the quantiles (0%, 25%, 50%, 75%, 100%) of the gene length for each group.

  3. Is the distribution of gene length differed by gene biotype? Please calculate the quantiles (0%, 25%, 50%, 75%, 100%) of the gene length for each group.

3.3. Transcript support levels (TSL)

The GENCODE TSL provides a consistent method of evaluating the level of support that a GENCODE transcript annotation is actually expressed in humans.

  1. With transcript, how many transcripts are categorized for each TSL?

  2. From the first question, please count the number of transcript for each TSL by gene biotype.

  3. From the first question, please count the number of transcript for each TSL by source.

3.4. CCDS in the GENCODE

  1. With gene, please create a data frame with the columns - gene_id, gene_name, hgnc_id, gene_type, chromosome, start, end, and strand. Then, please create new columns for presence of hgnc and ccds. For example, you can put 1 in the column isHgnc, if hgnc annotation is avaiable, or 0 if not. Then, you can put 1 in the column isCCDS, if ccds annotation is avaiable, or 0 if not.

  2. Please count the number of hgnc by gene biotypes.

  3. Please count the number of hgnc by level. Please note that level in this question is not TSL. Please find information in this link: 1 (verified loci), 2 (manually annotated loci), 3 (automatically annotated loci).

3.5. Transcripts in the GENCODE

  1. Which gene has the largest number of transcripts?

  2. Please calculate the quantiles (0%, 25%, 50%, 75%, 100%) of the gene length for protein coding genes and long noncoding genes.

  3. Please count the number of transcripts by chromosomes.

3.6. Autosomal vs. Sex chromosomes.

  1. Please calculate the number of genes per chromosome.

  2. Please compare the number of genes between autosomal and sex chromosome (Mean, Median).

  3. Please divide the genes into groups ‘protein coding’ and ‘long noncoding’, and then compare the number of genes in each chromosomes within groups.

Last updated