Hi all,
I'm trying to normalize my Illumina MiSeq data using Phyloseq and DESeq2. I want to look at the interactions between my site location and by date. I attempted to use the following command:
Laurendds = phyloseq_to_deseq2(Lauren_data_merged, design = ~Location + Date + Location:Date)
But am getting this error:
converting counts to integer mode
Error in DESeqDataSet(se, design = design, ignoreRank) :
the model matrix is not full rank, so the model cannot be fit as specified.
one or more variables or interaction terms in the design formula
are linear combinations of the others and must be removed
Doing some research into this issue I have seen that others have used the DESeqDataSetFromMatrix command (as in this post https://support.bioconductor.org/p/63134/), but my metadata (mapping file) and my count table(.biom file) are already merged into my Lauren_data_merged object. I'm not sure where my variables are interacting to cause this error. I'll post an example of my metadata file below:
head(Thesis_mapping)
Sample Data: [6 samples by 11 sample variables]:
X.SampleID BarcodeSequence LinkerPrimerSequence InputFileName MonthSampleTaken Location Salinity WaterTemperature
PE12 NA NA PE12.fasta July M9 26.0 30
PE15 PE15 NA NA PE15.fasta July M9 14.5 30
PE17 PE17 NA NA PE17.fasta July BB 17.0 30
PE18 PE18 NA NA PE18.fasta July M9 17.0 30
PE20 PE20 NA NA PE20.fasta August BB 22.5 30
PE21 PE21 NA NA PE21.fasta August M9 22.5 30
Season Weather Date
Wet Rain 7/10/13
Wet Rain 7/19/13
Wet Rain 7/23/13
Wet Rain 7/23/13
Wet Rain 8/3/13
Wet Rain 8/3/13
I have a total of 82 samples (I only posted the first 6 as an example). I do have replicates as I took samples weekly for an entire year at the same two locations just on different dates, so essentially it is a time series data set but I don't know what my interaction term would be. Are you aware of how I could go about setting this up in DESeq2 so that I can get variance stabilized data for further downstream analysis?
Hi Michael,
This might be a very simple question, but since I'm very new to R and coding in general I don't quite understand everything yet, when you say just use a design of ~1 for normalization, do you mean that when I execute this command: Laurendds = phyloseq_to_deseq2(Lauren_data_merged, design = ~Location + Date + Location:Date) I could use just one of my variables: Location or Date like this: Laurendds = phyloseq_to_deseq2(Lauren_data_merged, design = ~Location) and it would variance stabilize my data?
I also was thinking about looking at my samples by month rather than by date, that way i would have multiple replicates of each site from the same month for further downstream analysis.
Thank you for your help in advance!
Read over the vignette section 2.1.1 "Blind dispersion estimation" to see if you want to use blind dispersion estimation.
If you use blind=FALSE, then the design will be used by the transformation functions. If you use blind=TRUE, then it doesn't matter what you put in design, because ~1 will be substituted within the transformation functions.
I'd recommend using
~Location + Date
, and blind=FALSE.(The details are described in the vignette, but summarized: this is still almost unsupervised, because the design is only used to estimate the dispersion, and only the global trend of dispersion~mean is passed on to the transformations.)