[Tutorial] Gene expression profiles in cancer patients
After ggplot2 part
1. Introduction
Lung cancer one of major cancers, accounting for 2.09 million deaths out of the 9.6 million total cancer deaths in 2018. There are two main types of lung cancer: small cell lung carcinoma (SCLC) and non-small cell lung carcinoma (NSCLC). NSCLC is responsible for 85–90% of lung cancer cases and its two largest subtypes are lung adenocarcinoma (LUAD) and lung squamous cell carcinoma (LUSC).
LUAD develops in the periphery of the lungs and may be associated with smoking, but is the most common lung cancer type among non-smokers. In contrast, LUSC accounts for 25–30% of all total lung cancer cases while LUAD accounts for 40% of all total lung cancer cases. LUSC are likely to be found in the middle of the lungs and is associated with smoking.
In this tutorial, we will explore the sample dataset of LUSC and LUAD from the The Cancer Genome Atlas (TCGA), a public resource for the genomic dataset. Here we use two types of lung cancers - LUAD and LUSC and will examine which information we can use for genomic analyses of lung cancers.
2. Obtain the gene expression profile dataset
To access the gene expression data, there are two ways
Formal way (but slow): you can download the data as described in 2.1
Simple ways: get the file from my dropbox link where I downloaded and save to R object as described in 2.2
2.1. Download the data from GDC portal
You will need to install the TCGAbiolinks package to obtain the lung cancer gene expression profiles from TCGA.
library(tidyverse)
# Install the TCGAbiolinks package if necessary
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("TCGAbiolinks")
library(TCGAbiolinks)
Then, Obtain the TCGA dataset from the GDC portal. You first download the gene expression dataset for Lung Adenocarcinoma (LUAD).
3.1. Which tissues or samples are available for your analysis?
Lung cancer samples in the dataset are provided in tumor or normal tissues. The column shortLetterCode contains Sample Type Codes to describe the type of tissues collected in the dataset.
TP: Primary solid Tumor
TR: Recurrent solid Tumor
NT: Solid Tissue Normal
Check which samples are included in the LUAD dataset.
d_luad %>% count(shortLetterCode)
shortLetterCode<chr>
n<int>
NT
59
TP
533
TR
2
Check which samples are included in the LUSC dataset.
d_lusc %>% count(shortLetterCode)
shortLetterCode<chr>
n<int>
NT
49
TP
502
Let’s make a simple bar plot to compare the number of tissues between two tissue types. We will look at the LUAD samples first.
We made two separate plots for the LUAD and LUSC dataset. It would be great if we can merge them into one plot. Here I put the code for this. It would be good practice if you comment or delete the line that you don’t understand fully, you will compare the difference between outcomes. Also, please figure out what Qs do in this plot.
bind_rows(d_luad %>%
mutate(type='luad') %>%
select(type, shortLetterCode, tumor_stage),
d_lusc %>%
mutate(type='lusc') %>%
select(type, shortLetterCode, tumor_stage)) %>%
count(shortLetterCode, type) %>%
complete(type, shortLetterCode, fill = list(n = 0)) %>% # Q1: What is this?
ggplot(., aes(shortLetterCode, n, fill=shortLetterCode)) +
geom_bar(stat="identity", position=position_dodge()) +
facet_wrap(~type, scales = 'free_x', ncol=5) # Q2: What is this?
3.2. Distribution of clinical variables
We will plot the distribution of age at diagnosis using histogram. Few things you can check: - Is the distribution continuous? - Is it fowlloing the normal distribution? - What is the scale on the x-axis? - If the distribution is stratified, which makes it?
More plot types on distribution. First we can try the plot for both histogram and density.
# Plot both histogram and density plot
ggplot(d_luad, aes(age_at_diagnosis/365)) +
geom_histogram(bins=100, colour="black", fill="white") +
geom_density(alpha=.2, fill="#FF6666")
What did you miss from this?
Let’s plot a bit different version. We will replace the y-axis of the histogram with the y-axis of the density plot.
# Histogram with density plot
ggplot(d_luad, aes(age_at_diagnosis/365)) +
geom_histogram(aes(y=..density..), colour="black", fill="white")+
geom_density(alpha=.2, fill="#FF6666")
Q: Do you think whether it is good visualization for the distribution?
From the plot above, we still found the bump around the center. What would contribute to this stratification? Let’s look at some variables from other columns. First we can try information from the gender column.
# Try different options from the manual
# http://www.sthda.com/english/wiki/ggplot2-violin-plot-quick-start-guide-r-software-and-data-visualization
Do you think gender is the reason? If not, how about tumor stages?
ggplot(d_luad, aes(age_at_diagnosis/365)) +
labs(x ='Age at Diagnosis (years)', y = 'Number of LUAD patients',
title = 'TCGA: Lung cancer adenocarcinoma') +
geom_histogram(bins=100)
More tasks for the class.
Q1. Add labels for the plot.
Q2. Change the color for categories.
Q3. Save to PDF file
Q4. Select the column with continuous information and plot the distribution by yourself.
# assign p for plot
p <- ggplot(d_luad, aes(age_at_diagnosis/365)) +
labs(x ='Age at Diagnosis (years)', y = 'Number of LUAD patients',
title = 'TCGA: Lung cancer adenocarcinoma') +
geom_histogram(bins=100)
# save to file
ggsave('plot.TCGA_LUAD.histogram_AgeOfDX.20191002.pdf', p, width = 16, height = 9)
3.3. Tumor stages of lung cancers
First information you might want to explore is staging cancers. Different types of staging systems are used for different types of cancer. You can read further information in general staging rules, or specifics in lung cancers (e.g. stage IA).
# Count the number of tumors in the LUAD dataset
counts_tumor <- d_luad %>% count(tumor_stage)
# Try bar plot
ggplot(counts_tumor, aes( tumor_stage, n)) + geom_bar(stat="identity")
Now you can combine both for visualization.
# Cancer type by tumor stages
bind_rows(d_luad %>%
filter(shortLetterCode == 'TP') %>%
mutate(type='luad') %>%
select(type, tumor_stage),
d_lusc %>%
filter(shortLetterCode == 'TP') %>%
mutate(type='lusc') %>%
select(type, tumor_stage)) %>%
count(type, tumor_stage) %>%
mutate(tumor_stage = factor(tumor_stage, levels=rev(unique(tumor_stage)))) %>%
ggplot(aes( tumor_stage, n, fill=type)) +
labs(title = 'Cancer type by tumor stages',
x = '', y='Number of samples for TCGA RNA-seq') +
theme_minimal() + geom_bar(stat="identity") +
facet_wrap(~type) + coord_flip()
3.4. Cancer type by gender
Let’s try another information - gender. We will describe this by a bar plot like above. After plotting, which information you can read from this?
NB: I am not pretty sure why TCGA put gender, not sex in the dataset, in addition to mixed use of female/male with gender.
# Cancer type by Sex
bind_rows(d_luad %>%
filter(shortLetterCode == 'TP') %>%
mutate(type='luad') %>%
select(type, gender),
d_lusc %>%
filter(shortLetterCode == 'TP') %>%
mutate(type='lusc') %>%
select(type, gender)) %>%
count(type, gender) %>%
ggplot(aes( gender, n, fill=type)) +
labs(title = 'Cancer type by Sex',
x = '', y='Number of samples for TCGA RNA-seq') +
theme_minimal() + geom_bar(stat="identity") +
facet_wrap(~type) + coord_flip()
Is there difference in tumor stages by gender? Let’s check with LUSC.