Question

GOSeq: analysis of unsupported genome after HTseq and DEseq2 - building gene lengths, comparing contrasts and understanding results.

0

Entering edit mode

Ben Mansfeld • 0

@ben-mansfeld-10812

Last seen 8.8 years ago

Michigan State University

Hello all,

I am analyzing an RNAseq experiment in cucumber comparing two genotypes at two ages: Vlaspik (V) and Gy (G) at 8 and 16 days. I used HTseq-count for my read counts and subsequently performed DE analysis using DEseq2.

Briefly:

#Create DESeq data file from sample tabel, HTSeq data and model with interaction
ddsHTseq <- DESeqDataSetFromHTSeqCount(sampleTable = sampleTable, directory = "./HTSeq_Output_files/", design= ~ genotype + age + genotype:age)
#Re-level the data set defining G and 8dpp as the base levels
ddsHTseq$genotype <- relevel(ddsHTseq$genotype,"Gy 14")
ddsHTseq$age <- relevel(ddsHTseq$age,"8dpp")
#Run DESeq and check resultnames=comparisons
dds <- DESeq(ddsHTseq)

#Results for the contrast comparisons
V16G16 <- results(dds,contrast=list(c("genotype_Vlaspik_vs_Gy.14","genotypeVlaspik.age16dpp")))
V8G8 <- results(dds, contrast=c("genotype","Vlaspik","Gy 14"))
V16V8 <- results(dds, contrast=list(c("age_16dpp_vs_8dpp","genotypeVlaspik.age16dpp")))
G16G8 <- results(dds, contrast=c("age","16dpp","8dpp"))

#subset results by adjusted pvalue < 0.05
V16G16filt <- subset(V16G16, padj < 0.05)
V8G8filt <- subset(V8G8, padj < 0.05)
V16V8filt <- subset(V16V8, padj < 0.05)
G16G8filt <- subset(G16G8, padj < 0.05)

#filter subset of results by lfc
lfc <- 0.99
V16G16filt_lfc <- subset(V16G16filt, log2FoldChange > lfc | log2FoldChange < -(lfc))
V8G8filt_lfc <- subset(V8G8filt, log2FoldChange > lfc | log2FoldChange < -(lfc))
V16V8filt_lfc <- subset(V16V8filt, log2FoldChange > lfc | log2FoldChange < -(lfc))
G16G8filt_lfc <- subset(G16G8filt, log2FoldChange > lfc | log2FoldChange < -(lfc))

#of these are Upregulated genes
V16G16filt_lfc_up <- subset(V16G16filt, log2FoldChange > lfc)
V8G8filt_lfc_up <- subset(V8G8filt, log2FoldChange > lfc)
V16V8filt_lfc_up <- subset(V16V8filt, log2FoldChange > lfc)
G16G8filt_lfc_up <- subset(G16G8filt, log2FoldChange > lfc)

#of these are Downregulated genes
V16G16filt_lfc_dwn <- subset(V16G16filt, log2FoldChange < -(lfc))
V8G8filt_lfc_dwn <- subset(V8G8filt, log2FoldChange < -(lfc))
V16V8filt_lfc_dwn <- subset(V16V8filt, log2FoldChange < -(lfc))
G16G8filt_lfc_dwn <- subset(G16G8filt, log2FoldChange < -(lfc))

I am specifically interested in genes upregulated in V16G16 and V16V8. Ie:

#genes up in V16 vs V8 and G16
ARR_associated_filt_up<-V16G16filt_lfc_up[intersect(rownames(V16G16filt_lfc_up), rownames(V16V8filt_lfc_up)),]

I want to subsequently perform GOseq analysis on this set (65 genes).

Questions:

1.For GOseq, should my set of background assayed genes be all the genes assayed in both contrasts? or something else? Ie:

ARR_associated<-V16G16[union(rownames(V16G16),rownames(V16V8)),]

2. As far as I know HTseq can’t generate a gene length file. I thus created my gene length from the GFF3 I used for HTseq-count by summing exon lengths and calculating the median of transcripts per gene. Is this valid for GOseq?

txdb <- makeTxDbFromGFF("../../Chinese Long Genome/cucumber_v2.gff3",format="gff3")
exBytx<-exonsBy(txdb, by="tx", use.names=T)
txlengthData<-sum(width(reduce(exBytx)))
#Some nomenclature changes from transcript to gene names
names(txlengthData)<-gsub("M","G",names(txlengthData))
names(txlengthData)<-gsub("\\..*","",names(txlengthData))
#median transcript length per gene
txlengthData<-ave(x=txlengthData, names(txlengthData),FUN = median)

3. Calculating PWF for each contrast without any filtering gives ok to good looking plots for all but one contrast – V8G8, in which proportion of DE goes down with length. What do I make of this?

4. Furthermore, when I apply a fold change cut off of 2, all plots decrease with length. Why?

5. The plot for the set of interest described above is also flipped compared to the example in the vignette .

6. Can I continue with the analysis? If I do proceed, the resulting GO terms make biological sense.

7. More of a statistical question: When performing FDR calculation on the resulting p-vals, should I include those p-vals that equal 1? In the case of my 65 genes, only few GO terms are significant out of the thousand tested, thus multiple testing here yields no significant terms.

Thank you to all and any who can help.

Ben Mansfeld

sessionInfo()

R version 3.2.3 (2015-12-10)

Platform: x86_64-w64-mingw32/x64 (64-bit)

Running under: Windows >= 8 x64 (build 9200)


locale:

[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252

[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252   


attached base packages:

[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base    


other attached packages:

 [1] goseq_1.22.0              BiocInstaller_1.20.3      rtracklayer_1.28.10       GenomicFeatures_1.20.6  

 [5] AnnotationDbi_1.30.1      Biobase_2.28.0            RSQLite_1.0.0             DBI_0.4-1               

 [9] geneLenDataBase_1.4.0     BiasedUrn_1.07            pheatmap_1.0.8            ggplot2_2.1.0           

[13] DESeq2_1.8.2              RcppArmadillo_0.6.700.6.0 Rcpp_0.12.4               GenomicRanges_1.20.8    

[17] GenomeInfoDb_1.4.3        IRanges_2.2.9             S4Vectors_0.6.6           BiocGenerics_0.14.0     

[21] NCmisc_1.1.4

deseq2 goseq htseqcounts rnaseq plant • 2.1k views

ADD COMMENT • link updated 8.9 years ago by Gordon Smyth 52k • written 8.9 years ago by Ben Mansfeld • 0

score 0 · Answer 1 · 2016-06-02

0

Entering edit mode

Gordon Smyth 52k

@gordon-smyth

Last seen 14 minutes ago

WEHI, Melbourne, Australia

An alternative would be to use Rsubread::featureCounts() to make the read counts for each gene. That is runnable from the R prompt, produces an R object directly, and returns appropriate gene lengths for you without any mucking around. See:

http://bioinformatics.oxfordjournals.org/content/30/7/923

However the median of exon length transcripts per gene, as you have computed, is probably good enough. Any reasonable measure of gene length will probably do the job.

ADD COMMENT • link 8.9 years ago Gordon Smyth 52k

0

Entering edit mode

Thanks!

Any idea about the weird PWF plots? Why they are affected by filtering by log fold change? I read some other posts with similar issues but no conclusive response.

-Ben

ADD REPLY • link 8.9 years ago Ben Mansfeld • 0

1

Entering edit mode

Without knowing for sure, my guess would be that longer genes have more counts and are therefore sensitive to DE discovery even in the fold change is small. By having a threshold and remove these, you might be reducing the proportion of longer genes which are DE.

ADD REPLY • link 8.9 years ago Nadia Davidson ▴ 320

0

Entering edit mode

Thanks for the reply.

That's what I guessed. Do you think this makes a difference to the analysis?

Does fitting to the PWF still work if DE goes down with size? Is GOseq still a valid approach for my purposes?

Thanks again

ADD REPLY • link 8.9 years ago Ben Mansfeld • 0