Appropriate design matrix set up in limma for 2-factor approches with NA values for specific levels
1
0
Entering edit mode
svlachavas ▴ 840
@svlachavas-7225
Last seen 15 months ago
Germany/Heidelberg/German Cancer Resear…

Based on an initial analysis of a TCGA RNA-Seq dataset(processed like microarrays), i would like to formulate a specific design matrix with limma, in order to perform a 2-factor type analysis. In detail, a subset of my phenoData object:

head(pdat.sel)
                sample_type age_at_initial_pathologic_diagnosis  msi gender
TCGA.AA.3662 adjacentnormal                                  80 <NA>      f
TCGA.A6.4105          tumor                                  79  MSS      m
TCGA.F4.6463          tumor                                  51 <NA>      m
TCGA.F4.6806          tumor                                  59  MSS      f
TCGA.A6.6650          tumor                                  69 <NA>      f
TCGA.AZ.6600 adjacentnormal                                  64 <NA>      m

 

colon_type <- paste0(pData(TCGA.RNASeqV2_eset)$sample_type,"_",

pData(TCGA.RNASeqV2_eset)$msi)
table(colon_type)

colon_type
adjacentnormal_MSI adjacentnormal_MSS  adjacentnormal_NA          tumor_MSI 
                 1                  2                 11                  8 
         tumor_MSS          tumor_NA 
                30                143

My main contrast of interest, is the tumor_MSI versus tumor_MSS (30 vs 8):

however, as you can see from the above information, a lot of cancer samples are not available (NA marked) for this information, and also a very small number of normal samples has this information (which is anyway irrelevant as i would not like to utilize at all any normal samples).

So, any naive approach like the following, which is first to subset my initial expression set of 195 samples:

ind <- sampleNames(TCGA.RNASeqV2_eset)[TCGA.RNASeqV2_eset$msi%in%c("MSS","MSI")& TCGA.RNASeqV2_eset$sample_type%in%"tumor"]

eset.sel <- TCGA.RNASeqV2_eset[,ind]

dim(eset.sel)
Features  Samples 
   20502       38 

mult_con <- factor(paste0(pData(eset.sel)$sample_type,"_",
pData(eset.sel)$msi))

levels(mult_con)
[1] "tumor_MSI" "tumor_MSS"

and then proceed:

age <- pData(eset.sel)$age_at_initial_pathologic_diagnosis
sex <- factor(pData(eset.sel)$gender)

design.2 <- model.matrix(~mult_con+sex+age) # include also these covariates to adjust for age and gender

fit <- lmFit(eset.sel,design) 
fit2 <- eBayes(fit,robust=TRUE,trend=TRUE)...

 

Or subsetting the initial eSet, is inappropriate, and another approach is less erroneous ?

 

limma multiple factor design NA values design and contrast matrix • 1.7k views
ADD COMMENT
2
Entering edit mode
@gordon-smyth
Last seen 5 hours ago
WEHI, Melbourne, Australia

As you probably already know, you can't have NAs in the columns of the design matrix. So you can either subset the data to those samples for which MSI status is known, or you can code the NA values of MSI status to some new arbitrary value. Either approach could be ok and neither is "erroneous". Subsetting is probably the easiest. It's up to you really.

ADD COMMENT
0
Entering edit mode

Thank you Gordon for the confirmation. I will probably stick and subset the initial dataset, in order to formulate the above mentioned comparison, than perhaps mess up things with adding arbitary values. Finally, do you believe that any non-specific intensity procedure for removing "unexpressed" genes, should be better performed after i subset my dataset, based on the final number of samples remained ?

ADD REPLY
1
Entering edit mode

Yes, you should subset your features after you have removed the samples that you drop.

To put it another way, the removal of "unexpressed" genes should be performed on the same exact dataset (ie. same samples) that you will be pushing through your differential expression pipeline.

ADD REPLY
0
Entering edit mode

Thank you Steve for your answer !!

ADD REPLY

Login before adding your answer.

Traffic: 682 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6