Hello all,
I am analyzing an RNAseq experiment in cucumber comparing two genotypes at two ages: Vlaspik (V) and Gy (G) at 8 and 16 days. I used HTseq-count for my read counts and subsequently performed DE analysis using DEseq2.
Briefly:
#Create DESeq data file from sample tabel, HTSeq data and model with interaction ddsHTseq <- DESeqDataSetFromHTSeqCount(sampleTable = sampleTable, directory = "./HTSeq_Output_files/", design= ~ genotype + age + genotype:age) #Re-level the data set defining G and 8dpp as the base levels ddsHTseq$genotype <- relevel(ddsHTseq$genotype,"Gy 14") ddsHTseq$age <- relevel(ddsHTseq$age,"8dpp") #Run DESeq and check resultnames=comparisons dds <- DESeq(ddsHTseq) #Results for the contrast comparisons V16G16 <- results(dds,contrast=list(c("genotype_Vlaspik_vs_Gy.14","genotypeVlaspik.age16dpp"))) V8G8 <- results(dds, contrast=c("genotype","Vlaspik","Gy 14")) V16V8 <- results(dds, contrast=list(c("age_16dpp_vs_8dpp","genotypeVlaspik.age16dpp"))) G16G8 <- results(dds, contrast=c("age","16dpp","8dpp")) #subset results by adjusted pvalue < 0.05 V16G16filt <- subset(V16G16, padj < 0.05) V8G8filt <- subset(V8G8, padj < 0.05) V16V8filt <- subset(V16V8, padj < 0.05) G16G8filt <- subset(G16G8, padj < 0.05) #filter subset of results by lfc lfc <- 0.99 V16G16filt_lfc <- subset(V16G16filt, log2FoldChange > lfc | log2FoldChange < -(lfc)) V8G8filt_lfc <- subset(V8G8filt, log2FoldChange > lfc | log2FoldChange < -(lfc)) V16V8filt_lfc <- subset(V16V8filt, log2FoldChange > lfc | log2FoldChange < -(lfc)) G16G8filt_lfc <- subset(G16G8filt, log2FoldChange > lfc | log2FoldChange < -(lfc)) #of these are Upregulated genes V16G16filt_lfc_up <- subset(V16G16filt, log2FoldChange > lfc) V8G8filt_lfc_up <- subset(V8G8filt, log2FoldChange > lfc) V16V8filt_lfc_up <- subset(V16V8filt, log2FoldChange > lfc) G16G8filt_lfc_up <- subset(G16G8filt, log2FoldChange > lfc) #of these are Downregulated genes V16G16filt_lfc_dwn <- subset(V16G16filt, log2FoldChange < -(lfc)) V8G8filt_lfc_dwn <- subset(V8G8filt, log2FoldChange < -(lfc)) V16V8filt_lfc_dwn <- subset(V16V8filt, log2FoldChange < -(lfc)) G16G8filt_lfc_dwn <- subset(G16G8filt, log2FoldChange < -(lfc))
I am specifically interested in genes upregulated in V16G16 and V16V8. Ie:
#genes up in V16 vs V8 and G16 ARR_associated_filt_up<-V16G16filt_lfc_up[intersect(rownames(V16G16filt_lfc_up), rownames(V16V8filt_lfc_up)),]
I want to subsequently perform GOseq analysis on this set (65 genes).
Questions:
1.For GOseq, should my set of background assayed genes be all the genes assayed in both contrasts? or something else? Ie:
ARR_associated<-V16G16[union(rownames(V16G16),rownames(V16V8)),]
2. As far as I know HTseq can’t generate a gene length file. I thus created my gene length from the GFF3 I used for HTseq-count by summing exon lengths and calculating the median of transcripts per gene. Is this valid for GOseq?
txdb <- makeTxDbFromGFF("../../Chinese Long Genome/cucumber_v2.gff3",format="gff3") exBytx<-exonsBy(txdb, by="tx", use.names=T) txlengthData<-sum(width(reduce(exBytx))) #Some nomenclature changes from transcript to gene names names(txlengthData)<-gsub("M","G",names(txlengthData)) names(txlengthData)<-gsub("\\..*","",names(txlengthData)) #median transcript length per gene txlengthData<-ave(x=txlengthData, names(txlengthData),FUN = median)
3. Calculating PWF for each contrast without any filtering gives ok to good looking plots for all but one contrast – V8G8, in which proportion of DE goes down with length. What do I make of this?
4. Furthermore, when I apply a fold change cut off of 2, all plots decrease with length. Why?
5. The plot for the set of interest described above is also flipped compared to the example in the vignette .
6. Can I continue with the analysis? If I do proceed, the resulting GO terms make biological sense.
7. More of a statistical question: When performing FDR calculation on the resulting p-vals, should I include those p-vals that equal 1? In the case of my 65 genes, only few GO terms are significant out of the thousand tested, thus multiple testing here yields no significant terms.
Thank you to all and any who can help.
Ben Mansfeld
sessionInfo() R version 3.2.3 (2015-12-10) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows >= 8 x64 (build 9200) locale: [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 [4] LC_NUMERIC=C LC_TIME=English_United States.1252 attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets methods base other attached packages: [1] goseq_1.22.0 BiocInstaller_1.20.3 rtracklayer_1.28.10 GenomicFeatures_1.20.6 [5] AnnotationDbi_1.30.1 Biobase_2.28.0 RSQLite_1.0.0 DBI_0.4-1 [9] geneLenDataBase_1.4.0 BiasedUrn_1.07 pheatmap_1.0.8 ggplot2_2.1.0 [13] DESeq2_1.8.2 RcppArmadillo_0.6.700.6.0 Rcpp_0.12.4 GenomicRanges_1.20.8 [17] GenomeInfoDb_1.4.3 IRanges_2.2.9 S4Vectors_0.6.6 BiocGenerics_0.14.0 [21] NCmisc_1.1.4
Thanks!
Any idea about the weird PWF plots? Why they are affected by filtering by log fold change? I read some other posts with similar issues but no conclusive response.
-Ben
Without knowing for sure, my guess would be that longer genes have more counts and are therefore sensitive to DE discovery even in the fold change is small. By having a threshold and remove these, you might be reducing the proportion of longer genes which are DE.
Thanks for the reply.
That's what I guessed. Do you think this makes a difference to the analysis?
Does fitting to the PWF still work if DE goes down with size? Is GOseq still a valid approach for my purposes?
Thanks again