[Tutorial] A proteogenomic portrait of lung adenocarcinoma

Work in progress

In your portfolio, you are given the latest multi-omics study, Chen et al. (2020), published in Cell, which describes a unique molecular phenotype of lung adenocarcinoma in Taiwan. The study provides a range of omics datasets obtained from the patients, including genomic, gene expression, proteomic, and phosphoproteomic profiles.

1. Background

Chen et al. (2020) address an intriguing question about the characteristics of lung adenocarcinoma patients in East Asia: a high percentage of never-smokers, early onset, and predominant EGFR mutations. They prospectively collected a Taiwanese cohort representing early-stage, predominantly female, non-smoking lung adenocarcinoma, and generated multi-omics profiles to delineate the demographically distinct molecular attributes and hallmarks of tumor progression.

2. Dataset

Scientific articles (papers) usually provide their datasets, either in the main text or in the supplementary section. Chen et al. (2020) is a "Resource" paper, in which various sets of information are made available. Please go to the Supplemental Information and download all files into your working folder. These files are called "Supplementary files" (or "Supplementary Table" if it's a table, or "Supplementary Figure" if it's a figure). I will place these files in the folder "Chen2020". The Supplemental Information section also gives a description of each file.
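Before moving on, it can help to check that the downloads landed where you expect. The following is a minimal sketch that assumes the folder is called "Chen2020" as above and that the files keep their .xlsx extension; the helper name list_supplementary is made up for illustration.

```r
# Hypothetical helper: list the Excel files found in a folder.
list_supplementary <- function(dir = "Chen2020") {
  if (!dir.exists(dir)) {
    # The folder name is an assumption taken from this tutorial; adjust as needed.
    return(character(0))
  }
  list.files(dir, pattern = "\\.xlsx$", ignore.case = TRUE)
}

list_supplementary()
```

An empty result means either the folder is missing or no Excel files were found in it.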

Since these files are in MS Excel format, you can read them in R with the readxl package. The file used here contains multiple sheets and is very large. You can of course open it in MS Excel, but it might not load properly, so we will open it in R instead.

Let's get the list of sheet names first. Once you know where the information is placed, you can open the specific sheet of your interest.

library(readxl)
readxl::excel_sheets('Chen2020/1-s2.0-S0092867420307431-mmc1.xlsx')

# [1] "description"                    "Table S1A_clinical_103patient"  "Table S1B_ClinicalStatisics"   
# [4] "Table S1C_SNV"                  "Table S1D_transcriptome_log2TN" "Table S1E_ProteomeLog2TN"      
# [7] "Table S1F_PhosPepLog2TN"        "Table S1G_PhosSiteLog2TN"       "Table S1H_C>T"                 
# [10] "Table S1I_C>A"                  "Table S1J_Nonsmoker TW TCGA"  
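With this many sheets, it may be easier to search the names than to read them by eye. The sketch below copies the sheet names from the output above into a character vector (so it runs without the Excel file) and uses grep to pick out sheets by keyword. The variable names are only illustrative.

```r
# Sheet names copied from the excel_sheets() output above.
sheets <- c(
  "description", "Table S1A_clinical_103patient", "Table S1B_ClinicalStatisics",
  "Table S1C_SNV", "Table S1D_transcriptome_log2TN", "Table S1E_ProteomeLog2TN",
  "Table S1F_PhosPepLog2TN", "Table S1G_PhosSiteLog2TN", "Table S1H_C>T",
  "Table S1I_C>A", "Table S1J_Nonsmoker TW TCGA"
)

# Case-insensitive search for the proteome sheet.
proteome_sheet <- grep("proteome", sheets, ignore.case = TRUE, value = TRUE)
proteome_sheet

# All sheets whose names end in "log2TN" (any capitalization).
log2tn_sheets <- grep("log2TN$", sheets, ignore.case = TRUE, value = TRUE)
log2tn_sheets
```

In practice you would pass the result of excel_sheets() directly instead of typing the names by hand.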

It seems the first sheet is for description. Let's open it for detailed information.

d = read_excel('Chen2020/1-s2.0-S0092867420307431-mmc1.xlsx', sheet=1)
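By default, read_excel treats the first row of a sheet as column names. A description sheet is often free text rather than a table, so one possible variation is to read it without a header and print every row. This is only a sketch: it assumes the sheet is named "description" as listed above, and it skips the read if the file or the readxl package is not available.

```r
path <- "Chen2020/1-s2.0-S0092867420307431-mmc1.xlsx"

if (file.exists(path) && requireNamespace("readxl", quietly = TRUE)) {
  # col_names = FALSE keeps the first row as data instead of using it as a header.
  desc <- readxl::read_excel(path, sheet = "description", col_names = FALSE)
  print(desc, n = Inf)  # print every row of the tibble
} else {
  desc <- NULL  # nothing to show if the file or package is missing
}
```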

Table S1 is described as "Clinical, multi-omics data and proteogenomic characteristics of lung adenocarcinoma (LUAD) patients in TW cohort. Related to Figure 1." and contains the following sheets:

  • Table S1A. Characteristics and clinical data of TW lung cancer patients, Related to Figure 1.

  • Table S1B. Statistics of TW LUAD patients, Related to Figure 1.

  • Table S1C. Somatic mutation profile of TW LUAD patients, Related to Figure 1A.

  • Table S1D. mRNA expression results, Related to Figure 2, 3, 5, 7B, S1H, S3, S6.

  • Table S1E. Proteomic expression results normalized by column (patient) median, Related to Figure 2-3, 5-7, S1, S3, S5-7.

  • Table S1F. Quantitative phosphoproteomic data at phosphopeptide level without normalization, Related to Figure 3I, 5, S1, S5.

  • Table S1G. Quantitative phosphoproteomic data at phosphosite level normalized by column (patient) median, Related to Figure 2E, 3, S1, S3, S6.

  • Table S1H. Comparison of C>T distribution between smoker and non-smoker from TW and TCGA cohort, Related to Figure 1.

  • Table S1I. Comparison of C>A distribution between smoker and non-smoker from TW and TCGA cohort, Related to Figure 1.

  • Table S1J. Distribution of somatic mutation frequency of oncogenes and tumor suppressor genes in the non-smokers of TW LUAD and TCGA cohort, Related to Figure 1.
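Once you know which sheet holds the data you need, you can read that sheet by name rather than by position. The sketch below wraps read_excel in a small helper that returns NULL when the file or the readxl package is missing; the helper name read_supp_sheet is hypothetical, and the proteome sheet is used only as an example.

```r
# Hypothetical helper: read one sheet by name, returning NULL if the file
# or the readxl package is not available.
read_supp_sheet <- function(path, sheet) {
  if (!file.exists(path) || !requireNamespace("readxl", quietly = TRUE)) {
    return(NULL)
  }
  readxl::read_excel(path, sheet = sheet)
}

path <- "Chen2020/1-s2.0-S0092867420307431-mmc1.xlsx"
prot <- read_supp_sheet(path, "Table S1E_ProteomeLog2TN")

if (!is.null(prot)) {
  print(dim(prot))          # number of rows and columns
  print(head(names(prot)))  # first few column names
}
```

Reading by name keeps the code valid even if the sheet order differs between versions of the file.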
