Hi everyone,
I have data from metagenomic experiments: pyrosequencing of the 16S gene (454 GS Junior Roche). This gene was sequenced for 44 samples divided in 2 treatment groups A and B. The number of patients studied is 22, and for each patient one sample received the treatment A and another sample received the treatment B. So, we have paired data.
The reads from 71 bacterial species were counted in each of the 44 samples, leading to a count matrix which looks like that (71 rows and 44 paired samples, here just the 6 first rows and columns):
P1.grpA P2.grpA P3.grpA P1.grpB P2.grpB P3.grpB
Species1 0 0 0 0 0 0
Species2 0 0 0 0 0 0
Species3 0 0 104 0 0 107
Species4 0 0 0 0 0 0
Species5 0 0 0 0 0 0
Species6 0 0 0 0 0 0
The objective of the study is to detect species differentially expressed between the two groups.
I used edgeR to analyze the data, here is my code and some outputs:
>dgeFull <- DGEList(counts=data,remove.zeros=TRUE)
# Removing 24 rows with all zero counts.
>summary(dgeFull$samples$lib.size)
#Min. 1st Qu. Median Mean 3rd Qu. Max.
#343 1096 3564 5957 7417 32600
>dgeFull <- calcNormFactors(dgeFull, method="TMM")
>summary(dgeFull$samples$norm.factors)
#Min. 1st Qu. Median Mean 3rd Qu. Max.
#0.1733 0.9240 1.1320 1.0800 1.2640 1.8060
>gp <-dgeFull$sampleInfo$gp
>patient <- dgeFull$sampleInfo$patient
>design <- model.matrix(~patient+gp)
>y <- estimateGLMCommonDisp(dgeFull,design)
>y$common.dispersion # 3.220813
>sqrt(y$common.dispersion) # 1.79
>y <- estimateGLMTrendedDisp(y,design)
>y <- estimateGLMTagwiseDisp(y,design)
>plotBCV(y)
>fit <- glmFit(y,design)
>lrt <- glmLRT(fit, coef=23)
>topTags(lrt,adjust.method="BH")
> sessionInfo()
R version 3.1.2 (2014-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=French_France.1252 LC_CTYPE=French_France.1252
[3] LC_MONETARY=French_France.1252 LC_NUMERIC=C
[5] LC_TIME=French_France.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] edgeR_3.8.6 limma_3.22.6 rj_2.0.3-1
loaded via a namespace (and not attached):
[1] rj.gd_2.0.0-1 tools_3.1.2
My questions are :
- I am seriously wondering if edgeR (TMM normalization, and computation of the dispersion parameter) is convenient for my data (or if my data met some problems) because of :
- In all the count data I have seen analyzed with edgeR, the number of features was high and never with less than 100 features like in my data. Moreover, my dataset contains a lot of zeros (22 samples only have more than 8 strictly positive values among the 71 species and half of the 71 species have <=1 strictly positive values over the 44 samples)
- The library sizes are very hetoregeneous (from 343 to 32600)
- The normalization factor are very extreme (with a min value of 0.1),
- The value of BCV (=1.79) is very high
- The plotBCV is very strange, with an increasing trend, which seems to be strongly influenced by outliers (without these outliers the trens would be "flat"). May be do I have to NOT perform the estimateGLMTrendedDisp function ?
- Do I have to add a filter for the species, for example with :
keep <- rowSums(cpm(dgeFull) > 1) >= 22 ? In this case, only 5 species will be kept…
- I also had (data not shown) a SAMPLE (NOT a species) with all zero counts, and so the “calcNormFactors” generates this error message:
Error in quantile.default(x, p = p) :
# missing values and NaN's not allowed if 'na.rm' is FALSE
I removed this sample from my dataset, but I am wondering if I could keep it and set artificially a normalization factor of 1 ? Do you think I have to remove it ?
- Finaly, how exactly is computed the “log2FC” of the TopTags outputs ?
I tried log2(rawCount(mean(gpB)))- log2(rawCount(mean(gpA))) (and also with cpm values rather than rawCount) but I do not found what is generated in the outputs.
Thanks a lot for your help,
Eleonore
You're using old versions of R and Bioconductor. It's good practice to use the most recent versions; this ensures that problems in old versions have been fixed, so we don't spend time talking about bugs that have already been dealt with.