Hello,
I am doing RNA-seq analysis of DE genes from mare's endometrial biopsies. It is a timecourse experiment with samples at 0h, 24h and 48h across three different stages (follicular, luteal and anoestrous). I am having a hard time trying to figure out the design I should use and also having trouble when using contrast to return the results.
My metadata looks like this:
sample,timepoint,horse,phase,phaseANDtimepoint sample_1.bam,0h,1A,follicular,follicular_0h sample_2.bam,24h,1A,follicular,follicular_24h sample_3.bam,48h,1A,follicular,follicular_48h sample_4.bam,0h,1B,follicular,follicular_0h sample_6.bam,48h,1B,follicular,follicular_48h sample_7.bam,0h,2C,follicular,follicular_0h sample_8.bam,24h,2C,follicular,follicular_24h sample_9.bam,48h,2C,follicular,follicular_48h sample_10.bam,0h,3B,follicular,follicular_0h sample_11.bam,24h,3B,follicular,follicular_24h sample_12.bam,48h,3B,follicular,follicular_48h sample_13.bam,0h,4B,follicular,follicular_0h sample_14.bam,24h,4B,follicular,follicular_24h sample_15.bam,48h,4B,follicular,follicular_48h sample_16.bam,0h,4C,follicular,follicular_0h sample_17.bam,24h,4C,follicular,follicular_24h sample_18.bam,48h,4C,follicular,follicular_48h sample_19.bam,0h,8B,luteal,luteal_0h sample_20.bam,24h,8B,luteal,luteal_24h sample_21.bam,48h,8B,luteal,luteal_48h sample_22.bam,0h,8C,luteal,luteal_0h sample_23.bam,24h,8C,luteal,luteal_24h sample_24.bam,48h,8C,luteal,luteal_48h sample_25.bam,0h,9B,luteal,luteal_0h sample_26.bam,24h,9B,luteal,luteal_24h sample_27.bam,48h,9B,luteal,luteal_48h sample_28.bam,0h,9C,luteal,luteal_0h sample_29.bam,24h,9C,luteal,luteal_24h sample_30.bam,48h,9C,luteal,luteal_48h sample_31.bam,0h,11B,luteal,luteal_0h sample_32.bam,24h,11B,luteal,luteal_24h sample_33.bam,48h,11B,luteal,luteal_48h sample_34.bam,0h,11D,luteal,luteal_0h sample_35.bam,24h,11D,luteal,luteal_24h sample_36.bam,48h,11D,luteal,luteal_48h sample_37.bam,0h,12A,anoestrous,anoestrous_0h sample_38.bam,24h,12A,anoestrous,anoestrous_24h sample_39.bam,48h,12A,anoestrous,anoestrous_48h sample_40.bam,0h,12B,anoestrous,anoestrous_0h sample_41.bam,24h,12B,anoestrous,anoestrous_24h sample_42.bam,48h,12B,anoestrous,anoestrous_48h sample_43.bam,0h,12D,anoestrous,anoestrous_0h sample_44.bam,24h,12D,anoestrous,anoestrous_24h sample_45.bam,48h,12D,anoestrous,anoestrous_48h sample_46.bam,0h,13A,anoestrous,anoestrous_0h sample_47.bam,24h,13A,anoestrous,anoestrous_24h sample_48.bam,48h,13A,anoestrous,anoestrous_48h sample_49.bam,0h,13B,anoestrous,anoestrous_0h sample_50.bam,24h,13B,anoestrous,anoestrous_24h sample_51.bam,48h,13B,anoestrous,anoestrous_48h sample_52.bam,0h,13C,anoestrous,anoestrous_0h sample_53.bam,24h,13C,anoestrous,anoestrous_24h sample_54.bam,48h,13C,anoestrous,anoestrous_48h
At first I did not have the last column (phaseANDtimepoint) but I had to add since I want to retrieve the following comparisons/results:
1) Follicular vs Luteal at 0h, at 24h and at 48h
2) Luteal vs Anoestrous at 0h, at 24h and at 48h
3) Follicular vs Anoestrous at 0h, at 24h and at 48h
My code looks like:
library(GenomicFeatures) txdb_exp2 <- makeTxDbFromGFF("Equus_caballus.EquCab2.89.gtf", format="gtf") ebg_exp2 <- exonsBy(txdb_exp2, by="gene") se_exp2 <- summarizeOverlaps(features=ebg_exp2, reads=bamfiles, mode="Union", singleEnd=FALSE, ignore.strand=TRUE, fragments=TRUE ) dim(se_exp2) assayNames(se_exp2) colSums(assay(se_exp2)) rowRanges(se_exp2) sampleTableExp2 <- read.csv("exp2_metadata.csv") colData(se_exp2) <- DataFrame(sampleTableExp2) library(DESeq2)
at first, I have done:
dds_exp2 <- DESeqDataSet(se_exp2, design = ~ horse + timepoint) dds_exp2 <- DESeq(dds_exp2) resultsNames(dds_exp2)
but when trying to get the results between phases at a certain time point:
> res_exp2_follicular0h_luteal0h <- results(dds_exp2, contrast=c("phaseANDtimepoint", "follicular_0h", "luteal_0h")) Error in cleanContrast(object, contrast, expanded = isExpanded, listValues = listValues, : follicular_0h and luteal_0h should be levels of phaseANDtimepoint such that phaseANDtimepoint_follicular_0h_vs_anoestrous_0h and phaseANDtimepoint_luteal_0h_vs_anoestrous_0h are contained in 'resultsNames(object)'
then I realised I had to include in the design the 'phaseANDtimepoint' info and I tried:
dds_exp2_phaseANDtimepoint <- DESeqDataSet(se_exp2, design = ~ horse + phaseANDtimepoint) dds_exp2_phaseANDtimepoint <- DESeq(dds_exp2_phaseANDtimepoint) resultsNames(dds_exp2_phaseANDtimepoint)
> dds_exp2_phaseANDtimepoint <- DESeqDataSet(se_exp2, design = ~ horse + phaseANDtimepoint) Error in checkFullRank(modelMatrix) : the model matrix is not full rank, so the model cannot be fit as specified. One or more variables or interaction terms in the design formula are linear combinations of the others and must be removed. Please read the vignette section 'Model matrix not full rank': vignette('DESeq2')
I do understand that my phaseANDtimepoint column repeats the info contained in the other columns, however I cannot get rid of them - the "phase" and "timepoint" columns from the metadata are needed to retrieve the PCA graph plotting my data the way I need.
Ps.: I also thought about using the LRT analysis
ddsLRT_exp2 <- DESeqDataSet(se_exp2, design = ~ horse + timepoint) ddsLRT_exp2 <- estimateSizeFactors(ddsLRT_exp2) ddsLRT_exp2 <- estimateDispersions(ddsLRT_exp2) ddsLRT_exp2 <- nbinomLRT(ddsLRT_exp2, reduced = ~ horse) resLRT_exp2 <- results(ddsLRT_exp2) summary(resLRT_exp2) write.csv(resLRT_exp2, "resLRT_exp2.csv")
OR
ddsLRT_phaseANDtimepoint_exp2 <- DESeqDataSet(se_exp2, design = ~ horse + phaseANDtimepoint) ddsLRT_phaseANDtimepoint_exp2 <- estimateSizeFactors(ddsLRT_phaseANDtimepoint_exp2) ddsLRT_phaseANDtimepoint_exp2 <- estimateDispersions(ddsLRT_phaseANDtimepoint_exp2) ddsLRT_phaseANDtimepoint_exp2 <- nbinomLRT(ddsLRT_phaseANDtimepoint_exp2, reduced = ~ horse) resLRT_phaseANDtimepoint_exp2 <- results(ddsLRT_phaseANDtimepoint_exp2) summary(resLRT_phaseANDtimepoint_exp2) write.csv(resLRT_phaseANDtimepoint_exp2, "resLRT_phaseANDtimepoint_exp2.csv")
but it did not work either and I would have only one result. Since I want to have different results comparing the phases at the different timepoints I do not think that is the best option, is it?
Many thanks!
I have changed my metadata to:
and was trying to do the following which is the same as you advised me to do when I posted a question the other day (https://support.bioconductor.org/p/113148/).
Then I would do contrasts like:
but I keep receiving the following warning:
I was reading more about it and reading other questions here on the website and I think I spotted the problem: I had to exclude sample 5 from my analysis because something went wrong during the RNA-seq and we could not afford to rerun it!
Therefore one of my biological replicates (horse 1B) has only samples at 0h and 48h (and does not have at 24h!). Do you think that can be the problem? If so, is there something I can do to run the analysis excluding only this samples? I am not willing to exclude this biological replicate from the analysis (we only have n=6 for each phase and we do not want to go down to 5 horses). If I exclude one horse from the "follicular" group I will have to exclude one from the "luteal" and one from the "anoestrous" phase too and that would be a problem for me!
Many thanks for your reply and for your time, Michael! You are always so helpful!
My answer above still stands. You are trying to control for horse and compare directly across groups. These are confounded.
Many thanks for your reply, Michael. I had to step back from this analysis, but now I am back to it.
If I perform the analysis without controlling for horses, would that be correct? Taking into account that each group (follicular, luteal and anoestrous) have different horses?!
If so, would the correct design be:
?
I'm not suggesting to not control for horse, I'm saying that you can't control for horse with fixed effects and compare across two groups with different horses.
There is a different approach entirely in the limma-voom framework, which models correlations among samples, called duplicateCorrelation(). You may want to try that approach instead if you need to compare across groups where the groups have entirely different subjects, and you want to control for the subject baseline.