[Tutorial] Gene expression profiles in cancer patients

After ggplot2 part

1. Introduction

Lung cancer one of major cancers, accounting for 2.09 million deaths out of the 9.6 million total cancer deaths in 2018. There are two main types of lung cancer: small cell lung carcinoma (SCLC) and non-small cell lung carcinoma (NSCLC). NSCLC is responsible for 85–90% of lung cancer cases and its two largest subtypes are lung adenocarcinoma (LUAD) and lung squamous cell carcinoma (LUSC).

LUAD develops in the periphery of the lungs and may be associated with smoking, but is the most common lung cancer type among non-smokers. In contrast, LUSC accounts for 25–30% of all total lung cancer cases while LUAD accounts for 40% of all total lung cancer cases. LUSC are likely to be found in the middle of the lungs and is associated with smoking.

In this tutorial, we will explore the sample dataset of LUSC and LUAD from the The Cancer Genome Atlas (TCGA), a public resource for the genomic dataset. Here we use two types of lung cancers - LUAD and LUSC and will examine which information we can use for genomic analyses of lung cancers.

2. Obtain the gene expression profile dataset

To access the gene expression data, there are two ways

  1. Formal way (but slow): you can download the data as described in 2.1

  2. Simple ways: get the file from my dropbox link where I downloaded and save to R object as described in 2.2

2.1. Download the data from GDC portal

You will need to install the TCGAbiolinks package to obtain the lung cancer gene expression profiles from TCGA.

library(tidyverse)

# Install the TCGAbiolinks package if necessary
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("TCGAbiolinks")

library(TCGAbiolinks)

Then, Obtain the TCGA dataset from the GDC portal. You first download the gene expression dataset for Lung Adenocarcinoma (LUAD).

query <- GDCquery(project = "TCGA-LUAD",
                  data.category = "Transcriptome Profiling",
                  data.type = "Gene Expression Quantification", 
                  workflow.type = "HTSeq - FPKM-UQ")
GDCdownload(query)
d_luad0 <- GDCprepare(query)
d_luad = as.data.frame(d_luad0@colData)

Now we are downloading the dataset for Lung Squamous Cell Carcinoma (LUSC).

query <- GDCquery(project = "TCGA-LUSC",
                  data.category = "Transcriptome Profiling",
                  data.type = "Gene Expression Quantification", 
                  workflow.type = "HTSeq - FPKM-UQ")
GDCdownload(query)
d_lusc0 <- GDCprepare(query)
d_lusc = as.data.frame(d_lusc0@colData)

For simplicity, we are choosing only genes on the chromosome 1. Please save to the Rdata into your working folder.

library(SummarizedExperiment)
e_luad = assay(d_luad0)
e_lusc = assay(d_lusc0)

g_luad = d_luad0@rowRanges %>% as.data.frame() %>% filter(seqnames=='chr1') %>% pull(ensembl_gene_id)
g_lusc = d_lusc0@rowRanges %>% as.data.frame() %>% filter(seqnames=='chr1') %>% pull(ensembl_gene_id)
e_luad = e_luad[g_luad,]
e_lusc = e_lusc[g_lusc,]

save(d_luad, d_lusc, e_luad, e_lusc, file='data.TCGA_LUAD_LUSC.gene_expression.Rdata')

You can download the data from this link. Please save this file to the working folder. Then, load the objects to your workspace.

load('data.TCGA_LUAD_LUSC.gene_expression.Rdata')

3. Explore the input dataset

Let’s explore which columns are provided in the LUAD dataset.

colnames(d_luad)
 [1] "sample"                                            "patient"                                          
 [3] "barcode"                                           "shortLetterCode"                                  
 [5] "definition"                                        "year_of_diagnosis"                                
 [7] "classification_of_tumor"                           "last_known_disease_status"                        
 [9] "updated_datetime.x"                                "primary_diagnosis"                                
[11] "tumor_stage"                                       "age_at_diagnosis"                                 
[13] "morphology"                                        "prior_treatment"                                  
[15] "state.x"                                           "diagnosis_id"                                     
[17] "tumor_grade"                                       "treatments"                                       
[19] "icd_10_code"                                       "days_to_diagnosis"                                
[21] "tissue_or_organ_of_origin"                         "progression_or_recurrence"                        
[23] "prior_malignancy"                                  "synchronous_malignancy"                           
[25] "site_of_resection_or_biopsy"                       "days_to_last_follow_up"                           
[27] "cigarettes_per_day"                                "updated_datetime.y"                               
[29] "alcohol_history"                                   "years_smoked"                                     
[31] "state.y"                                           "exposure_id"                                      
[33] "pack_years_smoked"                                 "updated_datetime"                                 
[35] "gender"                                            "state"                                            
[37] "year_of_birth"                                     "race"                                             
[39] "days_to_birth"                                     "ethnicity"                                        
[41] "vital_status"                                      "days_to_death"                                    
[43] "demographic_id"                                    "age_at_index"                                     
[45] "year_of_death"                                     "bcr_patient_barcode"                              
[47] "disease_type"                                      "releasable"                                       
[49] "released"                                          "primary_site"                                     
[51] "project_id"                                        "name"                                             
[53] "is_ffpe"                                           "subtype_patient"                                  
[55] "subtype_Sex"                                       "subtype_Age.at.diagnosis"                         
[57] "subtype_T.stage"                                   "subtype_N.stage"                                  
[59] "subtype_Tumor.stage"                               "subtype_Smoking.Status"                           
[61] "subtype_Survival"                                  "subtype_Transversion.High.Low"                    
[63] "subtype_Nonsilent.Mutations"                       "subtype_Nonsilent.Mutations.per.Mb"               
[65] "subtype_Oncogene.Negative.or.Positive.Groups"      "subtype_Fusions"                                  
[67] "subtype_expression_subtype"                        "subtype_chromosome.affected.by.chromothripsis"    
[69] "subtype_iCluster.Group"                            "subtype_CIMP.methylation.signature."              
[71] "subtype_MTOR.mechanism.of.mTOR.pathway.activation" "subtype_Ploidy.ABSOLUTE.calls"                    
[73] "subtype_Purity.ABSOLUTE.calls"                    

Do we see the same columns for the LUSC dataset?

colnames(d_lusc)
 [1] "sample"                            "patient"                           "barcode"                          
 [4] "shortLetterCode"                   "definition"                        "classification_of_tumor"          
 [7] "last_known_disease_status"         "updated_datetime.x"                "primary_diagnosis"                
[10] "tumor_stage"                       "age_at_diagnosis"                  "vital_status"                     
[13] "morphology"                        "days_to_death"                     "state.x"                          
[16] "diagnosis_id"                      "tumor_grade"                       "treatments"                       
[19] "tissue_or_organ_of_origin"         "days_to_birth"                     "progression_or_recurrence"        
[22] "prior_malignancy"                  "site_of_resection_or_biopsy"       "days_to_last_follow_up"           
[25] "cigarettes_per_day"                "updated_datetime.y"                "years_smoked"                     
[28] "state.y"                           "exposure_id"                       "updated_datetime"                 
[31] "gender"                            "year_of_birth"                     "state"                            
[34] "state.1"                           "race"                              "demographic_id"                   
[37] "ethnicity"                         "year_of_death"                     "bcr_patient_barcode"              
[40] "disease_type"                      "releasable"                        "released"                         
[43] "primary_site"                      "project_id"                        "name"                             
[46] "is_ffpe"                           "subtype_patient"                   "subtype_Sex"                      
[49] "subtype_Age.at.diagnosis"          "subtype_T.stage"                   "subtype_N.stage"                  
[52] "subtype_M.stage"                   "subtype_Smoking.Status"            "subtype_Pack.years"               
[55] "subtype_Nonsilent.Mutatios"        "subtype_Nonsilent.Mutatios.per.Mb" "subtype_Selected.Mutation.Summary"
[58] "subtype_High.Level.Amplifications" "subtype_Homozygous.Deletions"      "subtype_Expression.Subtype"       

3.1. Which tissues or samples are available for your analysis?

Lung cancer samples in the dataset are provided in tumor or normal tissues. The column shortLetterCode contains Sample Type Codes to describe the type of tissues collected in the dataset.

  • TP: Primary solid Tumor

  • TR: Recurrent solid Tumor

  • NT: Solid Tissue Normal

Check which samples are included in the LUAD dataset.

d_luad %>% count(shortLetterCode)

shortLetterCode<chr>

n<int>

NT

59

TP

533

TR

2

Check which samples are included in the LUSC dataset.

d_lusc %>% count(shortLetterCode)

shortLetterCode<chr>

n<int>

NT

49

TP

502

Let’s make a simple bar plot to compare the number of tissues between two tissue types. We will look at the LUAD samples first.

d_luad %>%
  count(shortLetterCode) %>% 
  ggplot(., aes(shortLetterCode, n, fill=shortLetterCode)) + 
  geom_bar(stat="identity", position=position_dodge()) 

Now we can plot the LUSC samples.

d_lusc %>%
  count(shortLetterCode) %>% 
  ggplot(., aes(shortLetterCode, n, fill=shortLetterCode)) + 
  geom_bar(stat="identity", position=position_dodge()) 

We made two separate plots for the LUAD and LUSC dataset. It would be great if we can merge them into one plot. Here I put the code for this. It would be good practice if you comment or delete the line that you don’t understand fully, you will compare the difference between outcomes. Also, please figure out what Qs do in this plot.

bind_rows(d_luad %>% 
            mutate(type='luad') %>%
            select(type, shortLetterCode, tumor_stage), 
          d_lusc %>% 
            mutate(type='lusc') %>%
            select(type, shortLetterCode, tumor_stage)) %>%
  count(shortLetterCode, type) %>% 
  complete(type, shortLetterCode, fill = list(n = 0)) %>% # Q1: What is this?
  ggplot(., aes(shortLetterCode, n, fill=shortLetterCode)) + 
  geom_bar(stat="identity", position=position_dodge()) + 
  facet_wrap(~type, scales = 'free_x', ncol=5) # Q2: What is this?

3.2. Distribution of clinical variables

We will plot the distribution of age at diagnosis using histogram. Few things you can check: - Is the distribution continuous? - Is it fowlloing the normal distribution? - What is the scale on the x-axis? - If the distribution is stratified, which makes it?

# Plot histogram
ggplot(d_luad, aes(age_at_diagnosis)) + geom_histogram(bins=100)

The x-axis is a day scale. Let’s convert it to year.

# Change the axis
ggplot(d_luad, aes(age_at_diagnosis/365)) + geom_histogram(bins=100)

To create a smooth density, we use the geom_density.

ggplot(d_luad, aes(age_at_diagnosis)) + geom_density()

More plot types on distribution. First we can try the plot for both histogram and density.

# Plot both histogram and density plot
ggplot(d_luad, aes(age_at_diagnosis/365)) + 
  geom_histogram(bins=100, colour="black", fill="white") + 
  geom_density(alpha=.2, fill="#FF6666")

What did you miss from this?

Let’s plot a bit different version. We will replace the y-axis of the histogram with the y-axis of the density plot.

# Histogram with density plot
ggplot(d_luad, aes(age_at_diagnosis/365)) + 
 geom_histogram(aes(y=..density..), colour="black", fill="white")+
 geom_density(alpha=.2, fill="#FF6666") 

Q: Do you think whether it is good visualization for the distribution?

From the plot above, we still found the bump around the center. What would contribute to this stratification? Let’s look at some variables from other columns. First we can try information from the gender column.

ggplot(d_luad, aes(age_at_diagnosis/365, fill=gender)) + geom_histogram(bins=100)

Histogram would be okay but not best visualization. What else we can try?

ggplot(d_luad, aes(gender, age_at_diagnosis/365, fill=gender)) + 
  geom_boxplot()

Can you tell the difference between boxplot and density plot? We can try a violin plot.

ggplot(d_luad, aes(gender, age_at_diagnosis/365, fill=gender)) + 
  geom_violin()

Can you tell the difference between boxplot and violin plot?

# ggplot(d_luad, aes(gender, age_at_diagnosis/365, fill=gender)) +
#   geom_boxplot() + 
#   geom_violin()

# ggplot(d_luad, aes(gender, age_at_diagnosis/365, fill=gender)) +
#   geom_violin() + 
#   geom_boxplot()

# ggplot(d_luad, aes(gender, age_at_diagnosis/365)) +
#   geom_violin(aes(fill=gender)) + 
#   geom_boxplot(fill='white')

ggplot(d_luad, aes(gender, age_at_diagnosis/365)) +
  geom_violin(aes(fill=gender)) + 
  geom_boxplot(fill='white', width=0.25)
ggplot(d_luad, aes(gender, age_at_diagnosis/365)) + 
  geom_violin(aes(fill=gender))  + 
  geom_boxplot(fill='white') + 
  geom_dotplot(binaxis='y', stackdir='center', dotsize=1, alpha=0.5)

# Try different options from the manual
# http://www.sthda.com/english/wiki/ggplot2-violin-plot-quick-start-guide-r-software-and-data-visualization

Do you think gender is the reason? If not, how about tumor stages?

ggplot(d_luad, aes(age_at_diagnosis/365)) + 
  labs(x ='Age at Diagnosis (years)', y = 'Number of LUAD patients',
       title = 'TCGA: Lung cancer adenocarcinoma') +
  geom_histogram(bins=100) 

More tasks for the class.

  • Q1. Add labels for the plot.

  • Q2. Change the color for categories.

  • Q3. Save to PDF file

  • Q4. Select the column with continuous information and plot the distribution by yourself.

# assign p for plot
p <- ggplot(d_luad, aes(age_at_diagnosis/365)) + 
  labs(x ='Age at Diagnosis (years)', y = 'Number of LUAD patients',
       title = 'TCGA: Lung cancer adenocarcinoma') +
  geom_histogram(bins=100) 


# save to file
ggsave('plot.TCGA_LUAD.histogram_AgeOfDX.20191002.pdf', p, width = 16, height = 9)

3.3. Tumor stages of lung cancers

First information you might want to explore is staging cancers. Different types of staging systems are used for different types of cancer. You can read further information in general staging rules, or specifics in lung cancers (e.g. stage IA).

# Count the number of tumors in the LUAD dataset
counts_tumor <- d_luad %>% count(tumor_stage)

# Try bar plot
ggplot(counts_tumor, aes( tumor_stage, n)) + geom_bar(stat="identity")

Now you can combine both for visualization.

# Cancer type by tumor stages
bind_rows(d_luad %>% 
            filter(shortLetterCode == 'TP') %>% 
            mutate(type='luad') %>%
            select(type, tumor_stage), 
          d_lusc %>% 
            filter(shortLetterCode == 'TP') %>% 
            mutate(type='lusc') %>%
            select(type, tumor_stage)) %>% 
  count(type, tumor_stage) %>%
  mutate(tumor_stage = factor(tumor_stage, levels=rev(unique(tumor_stage)))) %>%
  ggplot(aes( tumor_stage, n, fill=type)) + 
  labs(title = 'Cancer type by tumor stages',
       x = '', y='Number of samples for TCGA RNA-seq') + 
  theme_minimal() + geom_bar(stat="identity") + 
  facet_wrap(~type) + coord_flip() 

3.4. Cancer type by gender

Let’s try another information - gender. We will describe this by a bar plot like above. After plotting, which information you can read from this?

NB: I am not pretty sure why TCGA put gender, not sex in the dataset, in addition to mixed use of female/male with gender.

# Cancer type by Sex
bind_rows(d_luad %>% 
            filter(shortLetterCode == 'TP') %>% 
            mutate(type='luad') %>%
            select(type, gender), 
          d_lusc %>% 
            filter(shortLetterCode == 'TP') %>% 
            mutate(type='lusc') %>%
            select(type, gender)) %>% 
  count(type, gender) %>%
  ggplot(aes( gender, n, fill=type)) + 
  labs(title = 'Cancer type by Sex',
       x = '', y='Number of samples for TCGA RNA-seq') + 
  theme_minimal() + geom_bar(stat="identity") + 
  facet_wrap(~type) + coord_flip() 

Is there difference in tumor stages by gender? Let’s check with LUSC.

# Add gender with stages
bind_rows(d_luad %>% 
            filter(shortLetterCode == 'TP') %>% 
            mutate(type='luad') %>%
            select(type, tumor_stage, gender), 
          d_lusc %>% 
            filter(shortLetterCode == 'TP') %>% 
            mutate(type='lusc') %>%
            select(type, tumor_stage, gender)) %>% 
  count(type, tumor_stage, gender) %>% 
  complete(gender, type, tumor_stage, fill = list(n = 0)) %>%
  mutate(tumor_stage = factor(tumor_stage, levels=rev(unique(tumor_stage))), 
         gender = factor(gender)) %>%
  ggplot(., aes(tumor_stage, n, fill=gender)) + 
  geom_bar(stat="identity", position=position_dodge()) + 
  facet_wrap(~type) + coord_flip() 

3.5. Site of biopsy

# Cancer type by biopsy
bind_rows(d_luad %>% 
            filter(shortLetterCode == 'TP') %>% 
            mutate(type='luad') %>%
            select(type, site_of_resection_or_biopsy), 
          d_lusc %>% 
            filter(shortLetterCode == 'TP') %>% 
            mutate(type='lusc') %>%
            select(type, site_of_resection_or_biopsy)) %>% 
  count(type, site_of_resection_or_biopsy) %>%
  ggplot(aes( site_of_resection_or_biopsy, n, fill=type)) + 
  labs(title = 'Cancer type by biopsy',
       x = '', y='Number of samples for TCGA RNA-seq') + 
  theme_minimal() + geom_bar(stat="identity") + 
  facet_wrap(~type) + coord_flip() 

Plot the years at diagnosis by tissue or organ of origin. We would also include difference by gender.

bind_rows(d_luad %>% 
            filter(shortLetterCode == 'TP') %>% 
            mutate(type='luad') %>%
            select(type, gender, age_at_diagnosis, tissue_or_organ_of_origin), 
          d_lusc %>% 
            filter(shortLetterCode == 'TP') %>% 
            mutate(type='lusc') %>%
            select(type, gender, age_at_diagnosis, tissue_or_organ_of_origin)) %>%
  ggplot(., aes(gender, age_at_diagnosis/365, fill=gender)) + 
  geom_boxplot() + labs(y='year at diagnosis') + 
  facet_wrap(~tissue_or_organ_of_origin)

Last updated