Hi all,
I am using TCGA level 3 data (RSEM raw counts) for samples with matched normals to analyse differential expression using DESeq2. I was wondering whether the design I am using is correct.
coldata looks like this.
DataFrame with 118 rows and 2 columns
                                 condition          pid
                                  <factor>     <factor>
TCGA.BJ.A28R.11A.11R.A16R.07 Normal_Tissue TCGA.BJ.A28R
TCGA.BJ.A28R.01A.11R.A16R.07 Primary_Tumor TCGA.BJ.A28R
TCGA.BJ.A28W.11A.11R.A32Y.07 Normal_Tissue TCGA.BJ.A28W
TCGA.BJ.A28W.01A.11R.A32Y.07 Primary_Tumor TCGA.BJ.A28W
TCGA.BJ.A28X.11A.11R.A22L.07 Normal_Tissue TCGA.BJ.A28X
...                                    ...          ...
TCGA.KS.A41I.01A.11R.A23N.07 Primary_Tumor TCGA.KS.A41I
TCGA.KS.A41J.11A.12R.A23N.07 Normal_Tissue TCGA.KS.A41J
TCGA.KS.A41J.01A.11R.A23N.07 Primary_Tumor TCGA.KS.A41J
TCGA.KS.A41L.11A.11R.A23N.07 Normal_Tissue TCGA.KS.A41L
TCGA.KS.A41L.01A.11R.A23N.07 Primary_Tumor TCGA.KS.A41L
The design I am using is:
design = ~ condition + pid + pid:condition
Since each patient has a matched normal, I am putting an interaction term in the design. Is this the right way?
Thanks.
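One way to inspect a design like the one above before running DESeq2 is to build its model matrix in base R and count the residual degrees of freedom. The following is only a sketch on toy data: the three patient IDs are hypothetical, and the simpler formula `~ pid + condition` is included purely for comparison, not as something proposed in this thread.

```r
# Toy column data: 3 hypothetical patients, each with one normal and one tumor sample
coldata <- data.frame(
  pid = factor(rep(c("P1", "P2", "P3"), each = 2)),
  condition = factor(rep(c("Normal_Tissue", "Primary_Tumor"), times = 3))
)

# Model matrix for the design from the question
mm_interaction <- model.matrix(~ condition + pid + pid:condition, data = coldata)

# Model matrix for an additive patient + condition design (comparison only)
mm_paired <- model.matrix(~ pid + condition, data = coldata)

# Residual degrees of freedom = number of samples - rank of the model matrix
resid_df_interaction <- nrow(mm_interaction) - qr(mm_interaction)$rank
resid_df_paired <- nrow(mm_paired) - qr(mm_paired)$rank

print(c(interaction = resid_df_interaction, paired = resid_df_paired))
```

With exactly one sample per patient and condition, the interaction design has as many coefficients as samples, so nothing is left over for estimating variability; the additive design keeps some residual degrees of freedom.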
Hi Michael, I'm using an identical design formula with two samples per patient (normal and healthy tissue), and I want to test for the difference between normal and healthy while controlling for the patient effect, but it's telling me "factor levels were dropped which had no samples". The vignette says that 3 samples per unique combination are needed for controlling for count outliers, so in a situation like the one above, does the Cook's distance filtering need to be turned off? Would you expect that warning about factor levels for the above design formula? Thanks.
"factor levels were dropped which had no samples"
This is simply a message telling you that, for the factors in the design, there were levels which had no samples. You can continue. There is not a problem. (Unless you are surprised to find out that levels don't have samples, in which case you should figure out which samples might be missing and why.)
If you have your column data object, x, before constructing the DESeqDataSet, you can see what is happening:
Where you should substitute 'sample' and 'condition' with the names of the appropriate columns in x.
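A minimal sketch of that kind of check in base R, not necessarily the exact code originally posted. It assumes `x` is a data frame with factor columns named `sample` and `condition` (placeholders, as noted above); the toy values and the unused level `Metastatic` are hypothetical and only there to trigger the situation.

```r
# Toy column data; 'Metastatic' is declared as a level but has no samples
x <- data.frame(
  sample = factor(c("s1", "s2", "s3", "s4")),
  condition = factor(
    c("Normal_Tissue", "Primary_Tumor", "Normal_Tissue", "Primary_Tumor"),
    levels = c("Normal_Tissue", "Primary_Tumor", "Metastatic")
  )
)

# Counts per level: a level with count 0 is one that would be dropped
print(table(x$condition))

# Cross-tabulate samples against conditions to see which combinations exist
print(table(x$sample, x$condition))

# Compare declared levels with levels that actually have samples
declared <- levels(x$condition)
used <- levels(droplevels(x$condition))
unused <- setdiff(declared, used)
print(unused)

# Optionally drop the unused levels yourself before building the DESeqDataSet
x$condition <- droplevels(x$condition)
```

Empty levels like this often appear after subsetting a larger table, because R keeps a factor's original levels even when the rows carrying them are removed.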