Difficulty in finding DEG using TCGA data from FireBrowse
2
0
Entering edit mode
@lawardeankita1-14844
Last seen 18 months ago
Estonia

Hello sir/madam,

I am working on LUAD data obtained from FireBrowse. Its a RNSseqV2 RSEM normalized data of gene expression

(file name: LUAD.rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_genes_normalized__data.data, renamed this file to

LUAD_RNA_seq)

I want to find Differential gene expression between Stages of tumor samples, i.e. in between stage I and Stage IV samples. but i'm not getting any differentially expressed genes. 

My questions are,

1) How to select samples specific to stage of tumor? referring to the code below i could extract samples specific to stage I and stage IV but i'm not sure if that is the right method to do so.

2) referring to my R code what is the wrong step i'm following?

3) If my complete analysis to find the DEG is wrong, please tell me what steps I'm suppose to follow or the R packages for the FireBrowse data analysis? or any tutorial which i can follow to get the differentially expression genes.

(if there is no wrong in the analysis please let me know that too.)

Below is the R code i have used to find out the DEG,

rna <- read.table("RNA/LUAD_RNA_seq.txt",nrows = 20533, header = T, row.names = 1, sep = "\t")

rem <- function(x){

  x <- as.matrix(x)

  x<- t(apply(x,1,as.numeric))

  r <- as.numeric(apply(x,1,function(i)sum(i == 0)))

  remove <- which(r > dim(x)[2]*0.5) 

  return(remove)

}

remove <- rem(rna)
rna <- rna[-remove,]
dim(rna)

x <- t(apply(rna,1,as.numeric))

dim(x)

#removing probes without symbol

row.has.na <- apply(x, 1, function(x){any(is.na(x))})

sum(row.has.na)

x <- x[!row.has.na,]

dim(x)

# Filter lowly expressed genes

expr <- x[rowSums(x)>0,] # Remove genes with 0 RPKM in all samples

lowexp <- expr <= 0.1

nsamples <- ncol(expr)

nsamples

lowexpr <- rowSums(lowexp) > (0.95 * nsamples)

log2expr <- log2(expr[!lowexpr,]+1)

head(log2expr)

colnames(rna)

colnames(log2expr)<- colnames(rna)

head(log2expr)

colnames(log2expr) <- gsub('\\.','-',substr(colnames(log2expr),1,28))

#read files of tumor stage

library(xlsx)

Tumor_stage <- read.xlsx(file = "Clinical_merged_picked/Tumor_stage_data.xlsx", sheetIndex = 1)

colnames(Tumor_stage)

Tumor_stage$Stage_I_Barcode <- toupper(Tumor_stage$Stage_I_Barcode)

Tumor_stage$Stage_IV_Barcode <- toupper(Tumor_stage$Stage_IV_Barcode)

#select only tumor samples

Tumor_Samples <- subset(log2expr, select = substr(colnames(log2expr),14,14) == '0')

dim(Tumor_Samples)

#subset the stages (I and IV) of tumor samples from Tumor_samples

Stage_I <- subset(Tumor_Samples, select = substr(colnames(Tumor_Samples),1,12) %in% Tumor_stage$Stage_I_Barcode)

dim(Stage_I)

Stage_IV <- subset(Tumor_Samples, select = substr(colnames(Tumor_Samples),1,12) %in% Tumor_stage$Stage_IV_Barcode)

dim(Stage_IV)

#remove duplicate colnames

Stage_I <- Stage_I[, !duplicated(colnames(Stage_I))]

Stage_IV <- Stage_IV[, !duplicated(colnames(Stage_IV))]

stage_I_IV <- cbind(Stage_I, Stage_IV)

dim(stage_I_IV)

head(stage_I_IV)

#design matrix

groups <- c(rep("Stage_I",277), rep("STage_IV",26))

group <- factor(groups)

group

#design matrix

design <- model.matrix(~ 0 + group)

design

colnames(design)<- c("Stage_I","Stage_IV" )

tail(design)

dim(design)

library(limma)

library(edgeR)

y <- DGEList(stage_I_IV, remove.zeros=T)

v <- voom(y, design)

fit <- lmFit(v, design)

cont.matrix <- makeContrasts(I_IV = Stage_I - Stage_IV, levels = design)

cont.matrix

contrast.fit <- contrasts.fit(fit, cont.matrix)

fit2 <- eBayes(contrast.fit)

fit2$nsamples <- ncol(stage_I_IV)

fit2

fit2$nsamples

summa.fit <- decideTests(fit2)

summary(summa.fit)

#this is how i get when i check for the DEG
         I_IV
Down       0
NotSig 17762
Up         0

deg <- topTable(fit2, coef = "I_IV", p.value = 0.05, lfc = 2, number = nrow(v$E))

The topTable result is zero rows and zero columns

Your answers would really help me to clear my doubts,

 

Thank you all,

Ankita.

TCGA FireBrowse limma DEG • 3.3k views
ADD COMMENT
1
Entering edit mode
Check topTable without setting the limit for p.value and lfc. I am sure that you have no significant genes on p.value = 0.05 and lfc = 2. You can also check by increasing the limit of p.value.
ADD REPLY
0
Entering edit mode
Hello Agaz Hussain Want, Yes did try topTable without setting the p.value and lfc, but then it give all genes as DEG(around 17000 genes). As referring to my question is this the R code correct for this type of data?? Am I following it correctly or is there anything I'm doing wrong in the analysis workflow? Thank you so much for your reply, Ankita
ADD REPLY
3
Entering edit mode
Aaron Lun ★ 28k
@alun
Last seen 6 hours ago
The city by the bay

There are a number of obvious problems with respect to how you're using limma:

  • stage_I_IV seems to be defined from log2expr. This is wrong, voom() expects raw counts.
  • Your comments in the filtering section suggest that you're using RPKMs. This is also not appropriate, see my previous point.
  • Normalization for composition biases has not been performed via calcNormFactors.
  • I don't know how you chose 0.95 to define lowexpr, but if any group contains fewer than 5% of the samples, you'll never see genes that are upregulated in that group and silent in all other groups.
  • Do not use lfc= in topTable(), see ?topTable for details. Use treat() instead.

There are also a number of less obvious things:

  • Have you checked the quality of the samples? See ?voomWithQualityWeights.
  • Have you checked for hidden factors of variation? See the RUVSeq and sva packages.
  • Have you tried using robust=TRUE in eBayes(), to protect against outliers?
ADD COMMENT
0
Entering edit mode

Hi Aaron Lun,

Thank out for pointing out wrong steps in my R code.

  •  Referring to your first point, I don't know how to obtain raw counts from FireBrowse. There in the FireBrowse only normalized gene expression data i found. If you know from where i can get the raw counts for the LUAD data please tell me so.
  • And I did not perform any normalization steps, because i thought the data is already normalized so there is no need to do. But according to your third point i should check for for the normalized factors.
  • According to your fourth point,  stage I sample number is 277 and stage IV sample number is 26, so does that mean One group of sample should at least have 5% of another sample number? I did not got this point, could you please explain me?
  • And i did not check the quality of samples using voomWithQualityWeights. But for voomWithQualityWeights, it requires raw counts matrix. I don't have Raw counts data of RNA-seq. (am i right about voomWithQualityWeights function?)

  • The last three points i did not check for the LUAD data.

Can i ask you for one help,

If possible could you please provide me with a workflow (step by step) like what should be used when and how for such data obtained from FireBrowse to find out differential gene expression analysis.

 

Thank you so much for helping me out and clearing my doubts,

I look forward to hear from you very soon,

Thanks,

Ankita

ADD REPLY
1
Entering edit mode

If you want to use voom, edgeR or co., you will need counts. There are a number of ways to obtain count data from TCGA - the recount2 project comes to mind, as does the TCGAbiolinks package mentioned by svlachavas.

Regarding the sample numbers; look at your filtering step:

lowexp <- expr <= 0.1
nsamples <- ncol(expr)
nsamples
lowexpr <- rowSums(lowexp) > (0.95 * nsamples)

You are discarding all genes with expression below 0.1 in more than 95% of samples. Now, imagine that you had a group that was 4% of the total sample size. Further imagine that you have a set of genes that are upregulated in this group and silent in all other groups. These genes are probably interesting with respect to this group but would be discarded by your filtering strategy. In your case, it probably doesn't matter because your smaller group is still pretty large. Nonetheless, the use of an "at least N" filtering strategy is always something to be careful of, especially when dealing with large unbalanced designs.

Regarding vignettes; a quick Google around will give you a number of options, of which I've listed a few here:

ADD REPLY
0
Entering edit mode

Hi Aaron Lun,

Thank you for explaining me that point about gene filtering.

And today i did not use any filtering strategy and also i did not use voom function and yet there is zero DEG between stage I and stage IV.

And you have mentions in your first reply about voomWithQualityWeights and calcNormFactors but these two functions should be  used when the data is raw (is that right?) or can I use these functions when the data is normalized?

Thanks,

Ankita

ADD REPLY
1
Entering edit mode

As I have mentioned before, these functions are all designed for count data. Normalized data is not count data. So, the solution is simple; find the raw count data from TCGA.

Regarding your lack of DEGs; well, I'm not psychic. There could be a million things in the dataset - latent variables, outliers, insufficient sample sizes, highly variable genes - preventing you from detecting DE. And that's not even counting the problems with your code, and the fact that you're not using count data.

You're just going to have to explore the data further to try to understand why genes are non-DE. I would take a look at known markers that distinguish stage I and IV, and see how they behave; this might be informative regarding hidden factors of variation (e.g., patient sex, age, and so on). Also, do the analysis on the count data.

ADD REPLY
0
Entering edit mode
svlachavas ▴ 840
@svlachavas-7225
Last seen 14 months ago
Germany/Heidelberg/German Cancer Resear…

Hello,

just to add a small idea in conjuction to Aaron's already complete and detailed answer. If you really want raw counts/data for a specific TCGA project, have you tried the TCGAbiolinks R package ?

For example:

library(TCGAbiolinks)

query.exp.hg38 <- GDCquery(project = "TCGA-STAD", 
                           data.category = "Transcriptome Profiling", 
                           data.type = "Gene Expression Quantification", 
                           workflow.type = "HTSeq - Counts") # example query

GDCdownload(query.exp.hg38,files.per.chunk = 50)

exp.hg38 <- GDCprepare(query = query.exp.hg38)

 

In this way, you would have in the end a SummarizedExperiment object-including HTSeq Counts-and from this point, you would choose or follow any pipeline you wish to implement or test-

also a relative link for more searching:

http://bioconductor.org/packages/release/bioc/vignettes/TCGAbiolinks/inst/doc/query.html

Hope that helps,

Efstathios-Iason

 

 

ADD COMMENT
0
Entering edit mode

Hi Efstathios-lason,

Yes i tried, I have used TCGA data from GDC portal and i had asked some question in biostars as i wasn't getting any DEG. That time I was grouping samples according to days to death where, less than one year and more than three years was the grouping method. So i changed the data portal and shifted to FireBrowse data. Here in FireBrowse data i check for the sample grouping method according to days to death and for this no DEG. then now i'm trying to find DEG between stages of Tumor.

One more to add, I also used RTCGAToolbox package to find DEG in stages of tumor by following this tutorial http://genomicsclass.github.io/book/pages/tcga.html

But i made some changes to the workflow so as to make the data in the required format for RTCGAToolbox, because i have windows OS and RTCGAToolbox cannot download the data if windows OS is on the PC (beacase of some long path or something requierd for windows OS)

Below is the RTCGAToolbox code i used, there i have samples of all stages.

library(RTCGAToolbox)

# read expression and phenotype info

#  import expression
rnaseq <- read.table(file = "RNA/LUAD_RNA_seq.txt", header=T, row.names=1, sep='\t')
dim(x)
rnaseq <- rnaseq[-1,]
rnaseq[1:4,1:4]
rid = tolower(substr(colnames(rnaseq),1,12))
rid
rid = gsub("-", ".", rid)
rid
mean(rid %in% rownames(clin))
colnames(rnaseq) = rid
colnames(rnaseq)
which(duplicated(colnames(rnaseq)))
rnaseq = rnaseq[,-which(duplicated(colnames(rnaseq)))]
dim(rnaseq)

# clinical info merged picked
phen <- read.table(file = "Clinical_merged_picked/LUAD.clin.merged.picked.txt", header=F, row.names=1, sep="\t")
dim(phen)

phen_T <- read.table(file = "Clinical_merged_picked/LUAD.clin.merged.picked.txt", header=T, row.names=1, sep="\t")
length(phen_T[phen_T["pathology_T_stage",] == 't1',])
stage_t1 <- which(phen_T["pathology_T_stage",] == 't1')
stage_t1a <- which(phen_T["pathology_T_stage",] == 't1a')
stage_t1b <- which(phen_T["pathology_T_stage",] == 't1b')
length(stage_t1b)

names(phen_T)
dim(phen_T)
rownames(phen_T)

clin <- as.data.frame(t(phen_T))
clin
rownames(clin)
names(clin)

with(clin, table(pathology_T_stage, pathology_N_stage))
which(clin$pathology_T_stage == 't1')
clin$t_stage = factor(substr(clin$pathology_T_stage,1,2))
table(clin$t_stage)

rem <- function(x){
  x <- as.matrix(x)
  x<- t(apply(x,1,as.numeric))
  r <- as.numeric(apply(x,1,function(i)sum(i == 0)))
  remove <- which(r > dim(x)[2]*0.5) 
  return(remove)
}

remove <- rem(rnaseq)
rnaseq <- rnaseq[-remove,]
dim(rnaseq)

rna <- t(apply(rnaseq,1,as.numeric))
dim(rna)

#create an expressionset
library(Biobase)
readES = ExpressionSet(log2(rna+1))
dim(readES)
sampleNames(readES)<- colnames(rnaseq)
sampleNames(readES)
pData(readES) = clin[sampleNames(readES),]
readES
pData(readES)

t_stage
#model matrix
mm = model.matrix(~0+t_stage, data=pData(readES))
mm
dim(mm)
mm1 <- mm[,-5]
mm1
colnames(mm)<- c("t_stage1","t_stage2","t_stage3","t_stage4","t_stagex")
cont.matrix <- makeContrasts(I_IV = t_stage1 - t_stage4,
                             I_II = t_stage1 - t_stage2,
                             I_III = t_stage1 - t_stage3,
                             II_III = t_stage2 - t_stage3,
                             II_IV = t_stage2 - t_stage4,
                             III_IV = t_stage3 - t_stage4, levels = mm)
f1 = lmFit(readES, mm)
f1
ef1 = eBayes(contrasts.fit(f1, cont.matrix))

I also got this similar mesage as that tutorial
## Warning: Zero sample variances detected, have been offset away from zero

colnames(ef1)
summa.fit <- decideTests(ef1)
summary(summa.fit)

deg_i_iv <- topTable(ef1,coef ="I_IV",p.value = 0.05,lfc = 2, number = nrow(readES))
dim(deg_i_iv)
deg_i_iv

topTable(ef1, 1:4)

 But i'm more interested in finding the DEG separately by making separate expression set for two stages.

ADD REPLY
0
Entering edit mode

Dear Akita,

with all respect your answer is a bit confusing- why do you mention days to death with Tumor Stage ? these are totally irrelevant fields-it is very important to first define your biological question of interest-

moreover, did you find a specific problem with TCGAbiolinks-keep in mind that TCGAbiolinks stores both legend TCGA data, as also the harmonized data after moving to GDC portal. I still dont know why you did not get any DEgenes, because i can't also see the relative code you have used. Personally, i have used the TCGAbiolinks many times, and has many crusial options and possibilities of data to download.

The recount2 resource is also an excellent choise (https://jhubiostatistics.shinyapps.io/recount/).

Moreover, i have not used the RTCGAToolbox R package, so i only some extra thoughts:

1) The tutorial you have posted above from the PH525x series from edx, is mainly a tutorial for integrating and relating different molecular layers of data: for example gene expression with copy number alteration,etc.

2) As for your message from the tutorial: i personally think, possibly, at least in one or more genes, the log ratio of the defined contrasts is almost the same for the vast majority of samples-which in turn, would return a close to zero variance, which could finally create an infinite statistic, and that's why the offset is implemented (of course some more specialist answers would be more appropriate)

3) Secondly, you would definately need raw counts in order to move on with DE analysis, as Aaron already have mentioned. Alternatively, if you want to use legend data or pre-processed, check the following post for another approach:

A: Possible ways of performing differential gene expression and analysis of RNA-Seq

4) Keep in mind that in your downstream analysis you should perform some EDA plots, like PCA, MDS and hierarchical clustering to see, how your samples are grouped based on Tumor Stage that you would like to find DEgenes. Could be strong differences and patterns observed ? Alternatively, you could group the various Tumor Stages, for instance comparing T1 & T2 versus T3 and T4-but still this is data driven,and with various options.

5) Another idea-which hovever deviates from the topic of the forum-is due to the fact that tumor stage is a categorical variable, you could perform a multinomial logistic regression or an multinomial elastic net implementation...

ADD REPLY
0
Entering edit mode

Hi svlachavas, 

Thank you very much for the reply and the explanation.

But still i have one query about the link you have provided, that link is about analyzing colorectal cancer data. So can i use this R package for my data (for LUAD data) ?

I had earlier gone through this link, but thought that this link is specifically for colorectal cancer data, so i did not choose to use that R package.

I f can use this R package for the LUAD data then can you tell me how to get samples specific to Tumor stage I and IV, i'm bit confused there.

And to clear the confusion, I was using data of only days to death and there was no correlation with tumor stages. But i could not find the DEG so i then shifted to use the data according to tumor stage. 

Actually i want to use clinical parameters to find the DEG between those samples, So first i used days to death data and now i'm using tumor stage data.

Sure i will check all the plots you have mentioned above for LUAD data. but about hierarchical clustering of data, i want to group the samples into stage one and stage IV thus it will be supervised clustering, but if i follow the unsupervised clustering then it would surely form groups which will mix the samples from both the groups.

Thank so much for helping me out and your time,

Thanks,

Ankita

 

ADD REPLY
0
Entering edit mode

Hello svlachavas, Aaron Lun,

I'm very sorry to disturb you again, but i just wanted to ask this question about EBSeq package,

I'm referring to the EBSeq package, so one question i wanted to ask, 

1) Can i use RSEM normalized counts as input to the EBSeq pipeline?

As this package is especially for RSEM output data analysis, so does this means that this R package can work with the data obtained from FireBrowse (RNA-SeqV2 RSEM normalized level 3 gene expression data) ?

I was going through the questions and answers for EBSeq on Github, here is the link: https://github.com/lengning/EBSeq

There they i have mentioned that "In general, it is not appropriate to perform cross sample comparisons using TPM, FPKM or RPKM without further normalization. Instead, you may use normalized counts (It can be generated by GetNormalizedMat() function from raw count, note EBSeq testing functions takes raw counts and library size factors)"

2) so does this mean i can use LUAD data (RNS-SeqV2 RSEM normalized) from FireBrowse for this EBSeq pipeline?

If i'm wrong somewhere please guide me.

I look forward to hear from you,

Thanks,

Ankita

 

ADD REPLY
0
Entering edit mode

Dear Ankita,

again with all respect, prior posting any further questions or suggestions, you should firstly make an effort and read or search any tutorials or relative papers related !! thus, you don't trust the answers already above ? if not, why you mention this time the EBSeq R package ? there is a specific reason you believe that you will get any DEgenes, or you should not follow any of the above approaches ?

Nevertheless, from the EBSeq R package vignette in page 6 for DEgene analysis:

"Data: The object Data should be a G − by − S matrix containing the expression values for each gene and each sample, where G is the number of genes and S is the number of samples. These values should exhibit raw counts, without normalization across samples. Counts of this nature may be obtained from RSEM,Cufflinks, or a similar approach."

Again, you still need raw counts !! which is not in your case, as the RSEM level 3 data have already been tranformed by a "type of normalization": http://seqanswers.com/forums/showthread.php?t=42911

"....The values are divided by the 75-percentile and multiplied by 1000..."

Moreover, from where you have read that "As this package is especially for RSEM output data analysis" ?

From the relative publication https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3624807/

RSEM is mentioned, among other methods such as Cufflinks, as the EBSeq package is primarily based for isoform DE inference, as also for the identification of DE genes...

Overall, if you still willing to continue with the rsem normalized counts, why you dont implement the limma R package, use normalizeQuantiles (you could also first log2 transform your data) to take somehow into account the library depth and then use the eBayes(trend=TRUE) pipeline, similarly as you would handle microarray data analysed ?

ADD REPLY
0
Entering edit mode

Hello svlachavas,

I'm so sorry that i have misinterpreted the EBSeq usage and making it more confusing. I am very sorry for the inconvenience I caused.

Thank you so much for this explanation.  I will try the last part you mentioned in above comment.

 

Thanks again,

Ankita.

ADD REPLY

Login before adding your answer.

Traffic: 554 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6