Interspecies differential expression of orthologs with Edger

0

Entering edit mode

assaf www ▴ 140

@assaf-www-6709

Last seen 5.7 years ago

Thanks Tim , quickly going over the article, it looks like a kind of cross-species-WGCNA ? very intriguing. Does Edger DE analysis is built on the assumption that most genes are not differentially expressed, and that only a small portion of them do (say <20%) ? I mean, in cross-species studies, or when comparing different tissues of the same organism, if this assumption doesn't hold, should it be a serious concern ? Assaf On Thu, Aug 28, 2014 at 7:42 PM, Tim Triche, Jr. <tim.triche at="" gmail.com=""> wrote: > This is super helpful. Just to be clear, the most robust solution is to > use edgeR and offset for putative gene length, TMM & library size while > using raw counts (not effective counts) estimated by e.g. RSEM, eXpress, or > the like? > > Also re: cross-species comparisons, while my experience is that it is > indeed a can of worms, Mark Gerstein's group recently published a method > that might interest others working on non-model or incompletely annotated > organisms: > > http://genomebiology.com/2014/15/8/R100 > > Any thoughts on applicability of the method for kooky experiments such as > comparing Drosophila hemocytes, zebrafish vascular endothelial progenitors > and the same in mice? Or for that matter, alligator differentiation. > > I never realized how hard RNAseq in non-model organisms was until I tried > it. > > > Statistics is the grammar of science. > Karl Pearson <http: en.wikipedia.org="" wiki="" the_grammar_of_science=""> > > > On Thu, Aug 28, 2014 at 8:37 AM, assaf www <assafwww at="" gmail.com=""> wrote: > >> I checked, it's true, the results look the same. >> As for FPKM, of course you are right. >> >> Thanks a lot >> Assaf >> >> >> >> On Thu, Aug 28, 2014 at 2:47 AM, Gordon K Smyth <smyth at="" wehi.edu.au=""> >> wrote: >> >> > The code should have been: >> > >> > offset <- expandAsMatrix(getOffset(y),dim(y)) >> > offset <- offset + gl >> > >> > This should give same result as your code. >> > >> > rpkm() corrects for gene length as well as library size -- that's the >> > whole purpose of RPKMs: >> > >> > rpkm(y, gene.length=geneLength) >> > >> > should work for you without any modification. >> > >> > Gordon >> > >> > >> > On Wed, 27 Aug 2014, assaf www wrote: >> > >> > This is very helpful for me, thanks. >> >> >> >> A slight change that I made in the code you sent, to avoid some R >> erros, >> >> is >> >> >> >> # to replace: >> >> offset = offset + gl >> >> # with: >> >> offset = sweep(gl,2,offset,"+") >> >> >> >> In addition to differential expression tests, I wanted also to extract >> >> FPKMs values (and/or normalized CPM values), that would take into >> account >> >> all components of the offset (which if I'm not mistaken are: >> library_size >> >> + >> >> TMM + gene_size). >> >> I assume rpkm()/cpm() should correct only for library_size + TMM. >> >> Is there a possibly "decent" solution for that ? >> >> >> >> all the best, and thanks, >> >> Assaf >> >> >> >> >> >> >> >> On Wed, Aug 27, 2014 at 4:45 AM, Gordon K Smyth <smyth at="" wehi.edu.au=""> >> >> wrote: >> >> >> >> It works something like this: >> >>> >> >>> library(edgeR) >> >>> y <- DGEList(counts=counts) >> >>> y <- calcNormFactors(y) >> >>> >> >>> # Column correct log gene lengths >> >>> # Columns of gl should add to zero >> >>> >> >>> gl <- log(geneLength) >> >>> gl <- t(t(gl)-colMeans(gl)) >> >>> >> >>> # Combine library sizes, norm factors and gene lengths: >> >>> >> >>> offset <- expandAsMatrix(getOffset(y)) >> >>> offset <- offset + gl >> >>> >> >>> Then >> >>> >> >>> y$offset <- offset >> >>> y <- estimateGLMCommonDisp(y,design) >> >>> >> >>> etc. >> >>> >> >>> Note that I have not tried this myself. It should work in principle >> from >> >>> a differential expression point of view. >> >>> >> >>> On the other hand, there may be side effects regarding dispersion >> trend >> >>> estimation -- I do not have time to explore this. >> >>> >> >>> Gordon >> >>> >> >>> --------------------------------------------- >> >>> Professor Gordon K Smyth, >> >>> Bioinformatics Division, >> >>> Walter and Eliza Hall Institute of Medical Research, >> >>> 1G Royal Parade, Parkville, Vic 3052, Australia. >> >>> http://www.statsci.org/smyth >> >>> >> >>> On Wed, 27 Aug 2014, assaf www wrote: >> >>> >> >>> Probably wrong, but the reason I thought of using quantile >> normalization >> >>> >> >>>> is >> >>>> the need to correct both for the species-length, and library size. >> >>>> >> >>>> >> >>>> On Wed, Aug 27, 2014 at 2:40 AM, Gordon K Smyth <smyth at="" wehi.edu.au=""> >> >>>> wrote: >> >>>> >> >>>> That doesn't look helpful to me. I suggested that you incorporate >> gene >> >>>> >> >>>>> lengths into the offsets, not do quantile normalization of cpms. >> >>>>> >> >>>>> Sorry, I just don't have time to develop a code example for you. I >> >>>>> hope >> >>>>> someone else will help. >> >>>>> >> >>>>> The whole topic of interspecies differential expression is a can of >> >>>>> worms. >> >>>>> Even if you adjust for gene length, there will still be differences >> in >> >>>>> gc >> >>>>> content and mappability between the species. >> >>>>> >> >>>>> Gordon >> >>>>> >> >>>>> >> >>>>> On Wed, 27 Aug 2014, assaf www wrote: >> >>>>> >> >>>>> Dear Gordon thanks, >> >>>>> >> >>>>> >> >>>>>> Suppose I start with the following matrices: >> >>>>>> >> >>>>>> # 'counts' is the Rsem filtered counts >> >>>>>> >> >>>>>> counts[1:4,] >> >>>>>> >> >>>>>>> >> >>>>>>> h0 h1 h2 n0 n1 n2 >> >>>>>>> >> >>>>>> ENSRNOG00000000021 36 17 20 10 25 38 >> >>>>>> ENSRNOG00000000024 1283 615 731 644 807 991 >> >>>>>> ENSRNOG00000000028 26 12 11 18 23 28 >> >>>>>> ENSRNOG00000000029 22 13 12 16 17 15 >> >>>>>> >> >>>>>> # 'geneLength' is the species-specific gene lengths, for species >> 'h' >> >>>>>> and >> >>>>>> 'n': >> >>>>>> >> >>>>>> geneLength[1:3,] >> >>>>>> >> >>>>>>> >> >>>>>>> h0.length h1.length h2.length n0.length >> n1.length >> >>>>>>> >> >>>>>> n2.length >> >>>>>> ENSRNOG00000000021 1200 1200 1200 1303 >> 1303 >> >>>>>> 1303 >> >>>>>> ENSRNOG00000000024 1050 1050 1050 3080 >> 3080 >> >>>>>> 3080 >> >>>>>> ENSRNOG00000000028 1047 1047 1047 1121 >> 1121 >> >>>>>> 1121 >> >>>>>> >> >>>>>> >> >>>>>> does the following code look correct, and may allow allows across >> >>>>>> species >> >>>>>> analysis ?: >> >>>>>> (technically it works, I checked) >> >>>>>> >> >>>>>> # quantile normalization: (as in here: >> >>>>>> http://davetang.org/wiki/tiki-index.php?page=Analysing+ >> >>>>>> digital+gene+expression >> >>>>>> ) >> >>>>>> >> >>>>>> x1 = counts*1000/geneLength >> >>>>>> library(limma) >> >>>>>> x2 = normalizeBetweenArrays(data.matrix(x1),method="quantile") >> >>>>>> offset = log(counts+0.1)-log(x2+0.1) >> >>>>>> >> >>>>>> ... >> >>>>>> >> >>>>>> y <- estimateGLMCommonDisp(y,design,offset=offset) >> >>>>>> y <- estimateGLMTrendedDisp(y,design,offset=offset) >> >>>>>> y <- estimateGLMTagwiseDisp(y,design,offset=offset) >> >>>>>> fit <- glmFit(y,design,offset=offset) >> >>>>>> >> >>>>>> ... >> >>>>>> >> >>>>>> >> >>>>>> Thanks in advance for any help.., >> >>>>>> Assaf >> >>>>>> >> >>>>>> >> >>>>> ____________________________________________________________ >> >>> __________ >> >>> The information in this email is confidential and intended solely for >> the >> >>> addressee. >> >>> You must not disclose, forward, print or use it without the >> permission of >> >>> the sender. >> >>> ______________________________________________________________________ >> >>> >> >>> >> >> >> > ______________________________________________________________________ >> > The information in this email is confidential and inte...{{dropped:10}} >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> > > [[alternative HTML version deleted]]

RNASeq Normalization Organism zebrafish edgeR • 1.3k views

ADD COMMENT • link updated 10.7 years ago by Gordon Smyth 52k • written 10.7 years ago by assaf www ▴ 140

0

Entering edit mode

Gordon Smyth 52k

@gordon-smyth

Last seen 15 hours ago

WEHI, Melbourne, Australia

On Tue, 2 Sep 2014, assaf www wrote: > Does Edger DE analysis is built on the assumption that most genes are > not differentially expressed, and that only a small portion of them do > (say <20%) ? Only the calcNormFactors() step of edgeR makes any assumption of this sort. calcNormFactors assumes that either that most genes are not DE or that the DE is reasonably symmetric. > I mean, in cross-species studies, or when comparing different tissues of > the same organism, if this assumption doesn't hold, should it be a > serious concern ? In a cross-species comparison there will be many DE genes, but some will be up and some will be down. The DE will not be all in one direction, I would guess that normalization will not be a serious concern. Of all the concerns with cross-species comparisons, this seems to me to be far from the most serious. Best wishes Gordon ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:4}}

ADD COMMENT • link 10.7 years ago Gordon Smyth 52k

0

Entering edit mode

Thanks Gordon, To summarize the results I got on the cross-species data, after embedding the length-effect to the GLM offset matrix, as in the code you sent, please see the attached MA plot: 1) for >5 and <-5 log fold change, genes' logFC is positively correlated with mean log CPM, something I haven?t seen before in Edger standard runs. 2) most genes with fold change around > 1.3, or < -1.3, are significant, which looks to me too ?liberal?. Please note that each group contains 6 true biological replicates (variance within each group is large) . The first problem worries me most, any idea is very welcomed. Many thanks, Assaf On Wed, Sep 3, 2014 at 2:08 AM, Gordon K Smyth <smyth at="" wehi.edu.au=""> wrote: > > On Tue, 2 Sep 2014, assaf www wrote: > > Does Edger DE analysis is built on the assumption that most genes are not >> differentially expressed, and that only a small portion of them do (say >> <20%) ? >> > > Only the calcNormFactors() step of edgeR makes any assumption of this > sort. calcNormFactors assumes that either that most genes are not DE or > that the DE is reasonably symmetric. > > I mean, in cross-species studies, or when comparing different tissues of >> the same organism, if this assumption doesn't hold, should it be a serious >> concern ? >> > > In a cross-species comparison there will be many DE genes, but some will > be up and some will be down. The DE will not be all in one direction, I > would guess that normalization will not be a serious concern. > > Of all the concerns with cross-species comparisons, this seems to me to be > far from the most serious. > > Best wishes > Gordon > > ______________________________________________________________________ > The information in this email is confidential and intended solely for the > addressee. > You must not disclose, forward, print or use it without the permission of > the sender. > ______________________________________________________________________ > -------------- next part -------------- A non-text attachment was scrubbed... Name: crossspecies.png Type: image/png Size: 65085 bytes Desc: not available URL: <https: stat.ethz.ch="" pipermail="" bioconductor="" attachments="" 20140905="" c599392b="" attachment.png="">

ADD REPLY • link 10.7 years ago assaf www ▴ 140

Login before adding your answer.