Question: GoSeq: opposite pwf plot for up- and downregulated genes
2
1
Entering edit mode
Jakub ▴ 50
@jakub-9073
Last seen 11 months ago
United Kingdom

Hi,

I have posted this on biostars but this is the more appropriate forum to ask this question (apologies if you have seen this already).

I am running an enrichment analysis on 3000 differentially expressed genes (mouse) following DESeq2.

The pwf plot for upregulated genes (log2FoldChange > 0) is similar to the one in the vignette, with long genes being more differentially expressed.

However, the plot for significantly downregulated genes is inverted. High proportion of short genes that are DE and low proportion of  long DE genes.

If I plot all DE expressed genes no sensible line can be dawn as the bins cancel each other out (high scatter, poor fit). I have seen quite a few of these graphs when searching for "pwf plot goseq" on google including some on the bioconductor forum - but as yet there is no explanation for this.

 

Any ideas why this might be?

When I manually bin the deseq2 results by baseMean I do not get a difference in for downregulated (=FALSE) vs upregulated (=TRUE) genes.

Jakub

 

PS:

If images are not displayed, here are some external links:

Image1: https://goo.gl/photos/M8pmZSeaEZPb9F7J7

Image2: https://goo.gl/photos/826NWbbjUEAoqpCn6

Image3: https://goo.gl/photos/D7jZJVZ8zWLRPXgR6

goseq • 3.4k views
ADD COMMENT
0
Entering edit mode

FYI (although you've already probably noticed), the last 2 images dind't come through.

Also, just curious: did you provide your own gene lengths, or are you using the ones that goseq fetches/provides for you?

ADD REPLY
0
Entering edit mode

Thanks,

yes, I've noticed that the images disappear on my mobile phone, but I can see them both on mac and windows computers. I will provide additional external links to the images to fix that.

I used the bioconductor package TxDb.Mmusculus.UCSC.mm10.ensGene.

ADD REPLY
0
Entering edit mode

Did you also use TxDb.Mmusculus.UCSC.mm10.ensGene as the gene models to quantify expression? What pipeline (featureCounts, HTSeq, summarizeOverlaps, something else?) did you use to do that?

ADD REPLY
0
Entering edit mode

I used genecounts, our local featurecounts-like package to count reads and then used DESeq2 to quantify expression. I didn't use the TxDb package for that but used mm10 annotations (ensembl81) with a minor modification (for one human gene; transgenic mouse model).

ADD REPLY
0
Entering edit mode

It would be natural and correct to get the gene lengths from your local "featurecounts" package. Otherwise the exon definitions used to compute the counts won't be the same as that used to compute the gene lengths.

ADD REPLY
0
Entering edit mode

Thanks. To avoid a discussion around the fidelity of the genecounts package, I have rerun the analysis with "featurecounts", and get identical results.

ADD REPLY
2
Entering edit mode
Jakub ▴ 50
@jakub-9073
Last seen 11 months ago
United Kingdom

I am just going through some old posts.

Just for completeness (since it is still being looked at by people who might look for an answer), the answer to this question was that the data was subject to 5'/3' bias in one of the experimental groups in particular, yielding to an overexpression of short genes in one group and long genes in the other experimental group.

Fragmentation bias, or 5'/3' bias is relatively common and if you are unlucky it will segregate between groups like in this experiment.

ADD COMMENT
0
Entering edit mode
@gordon-smyth
Last seen 38 minutes ago
WEHI, Melbourne, Australia

Well, if you have run goseq correctly and obtained these plots, then it suggests there was something seriously wrong with either the normalization or the DE analysis of your data.

You don't give any information about how you ran goseq however.

ADD COMMENT
0
Entering edit mode

I have thought about this very hard, and have run goseq in a variety of ways.

  • Similar to what goseq package suggests:
genes=as.integer(p.adjust(res$padj[(res$log2FoldChange!=0) & (!is.na(res$log2FoldChange))], method="BH")<.05)
names(genes)= row.names(res[(res$log2FoldChange!=0) & (!is.na(res$log2FoldChange)) , ])
genes[is.na(genes)] <- 0
pwf=nullp(genes,"mm10","ensGene")
  • My own version based on the DESeq2 results table (top code most important, apologies for the probably very clunky way of generating the 0-1 vector expected by goseq):
#established DEgenes and background
DEgenes <- rownames(subset(res, padj <0.05))
backM <- rownames(res)[!rownames(res) %in% rownames(subset(res, padj <0.05))]

# converts into 0-1 vector
genes <- vector(mode="numeric", length=length(DEgenes) )
genes[0:length(DEgenes)] <-1
names(genes) <- DEgenes

names(backM) <- backM
backM_new <- vector(mode="numeric", length=length(backM) )
backM_new[0:length(backM)] <-0
names(backM_new) <- backM

geneset <- c(genes,backM_new)

#run pwf
pwf=nullp(geneset,"mm10","ensGene")
  • also used a "matchit" based method suggested elswhere on this forum by Bernd Klaus.

Yes, you are also right that sanity checking the DESeq analysis and normalisation was already my next step and I am currently working on it and will update here with more info. I am will do my own gene-length plots.

The MA plot does looks absolutely normal though: https://goo.gl/photos/2Pnyucj7ftSpX2uk9

ADD REPLY
1
Entering edit mode

Having trouble following your second code block, but the first one is wrong. You have

as.integer(p.adjust(res$padj[(res$log2FoldChange!=0) & (!is.na(res$log2FoldChange))], method="BH")<.05)

You are calling p.adjust on already "padjusted" values (res$padj).

If you want to include a logFC threshold in your significance filter, consider using the lfcThreshold parametner in your call to results, so: going blind, here:

res <- results(dds, name='whatever', lfcThreshold=0.75)
res.nona <- subset(res, !is.na(padj))
de.genes <- as.integer(res.nona$padj <= 0.05)
names(de.genes) <- rownames(res.nona))
pwf <- nullp(de.genes, "mm10", "ensGene")
...

You might want to provide your own gene lengths though, since you've provided your own gene models? I'm not sure if featureCounts outputs this or not, but it would be nice if you can get the effective lengths of the genes used ... this might be different than the total gene length due to how exons from multiple isoforms were summarized, as well as what happens when the same genomic locus is shared by different "genes"

ADD REPLY

Login before adding your answer.

Traffic: 519 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6