Normalization of polyA RNA-seq?

Xiaohui Wu ▴ 280

@xiaohui-wu-4141

Last seen 10.6 years ago

Hi all,

I have two polyA-only RNA-seq libraries from the same tissue (leaf and leaf) and have mapped them to the genome. Most of the reads fall in the 3' UTR rather than being spread over the whole gene body. The two libraries differ greatly in size: about 25,000 reads versus 1,250,000 reads. About 40% of genes in the small library and 60% in the larger one have only 1 read, and most of the tags come from only a few genes.

I want to combine the two libraries into a larger one, but I think I should normalize the read counts before pooling them. With TPM normalization, read counts in the smaller library will be multiplied by 50, so a 1-tag gene in the small library becomes a 50-tag gene, even though that gene may also be a 1-tag gene in the larger library. I am not comfortable with this, because TPM may skew the read distribution. Do you have any ideas for normalizing the data other than TPM?

Thanks a lot in advance.

Regards,
Xiaohui
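The scaling described in the question can be checked with a few lines of arithmetic. This is only a sketch, assuming "TPM" here means tags per million (plain library-size scaling with no length correction); the library sizes are the ones quoted above, and the function and variable names are made up for illustration.

```python
# Library sizes quoted in the question (total mapped reads per library).
small_lib_size = 25_000
large_lib_size = 1_250_000


def per_million_factor(library_size):
    """Factor that rescales raw counts to tags per million.

    Assumption: TPM is plain library-size scaling, with no gene-length term.
    """
    return 1_000_000 / library_size


small_factor = per_million_factor(small_lib_size)
large_factor = per_million_factor(large_lib_size)

# A gene with exactly one read in each library.
small_tpm = 1 * small_factor
large_tpm = 1 * large_factor

# How much more one read from the small library weighs after scaling.
relative_weight = large_lib_size / small_lib_size

print(f"1 read in small library -> {small_tpm:g} per million")
print(f"1 read in large library -> {large_tpm:g} per million")
print(f"relative weight of a small-library read: {relative_weight:g}x")
```

Under that assumption, the 50-fold figure is the ratio of the two library sizes: after scaling, a single read from the small library counts as much as 50 reads from the large one.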

Normalization • 1.6k views

updated 14.7 years ago by Simon Anders ★ 3.8k • written 14.7 years ago by Xiaohui Wu ▴ 280

Simon Anders ★ 3.8k

@simon-anders-3855

Last seen 4.7 years ago

Zentrum für Molekularbiologie, Universi…

Hi Xiaohui

> I have two libraries of RNA-seq only with polyA of same tissue (leaf and
> leaf), and have mapped them to the genome. Most of these reads are in 3'UTR
> but not spread over the whole gene body. And the size of these two
> libraries are in great difference, like 25,000 reads versus 1250,000 reads.
> About 40% and 60% of genes only have 1 read in small lib and bigger one,
> respectively. Most of the tags are dominated by only a few genes. I want to
> combine these two libs for larger one, but I think I should normalize the
> read count before pooling them together.
>
> If use TPM normalization, read count in smaller library will be multiplied
> by 50 times, that means the 1-tag gene will become 50-tag gene, in the
> small lib, while maybe that gene is also 1-tag gene in bigger lib, I feel
> not comfortable that TPM may make skew the read distribution. Do you have
> any idea on normalizing the data instead of TPM?

So, you want to ensure that both libraries get the same weight in your downstream analysis. But why would you want that? The smaller library contains less information, so it should not get the same weight.

Actually, your description is not too clear. You want to combine the two libraries into a single one, i.e., give up the information about which sample each read came from. This would make sense only if these are replicates. If so, it seems very suspicious that a gene that has one count in the small library only gets one count in the bigger one. This might occur occasionally, but should not happen for many genes. You should really double-check whether you did the counting correctly. (Try, for example, my htseq-count script [http://www-huber.embl.de/users/anders/HTSeq/doc/count.html] to see whether its results are similar to yours.)

Apart from this issue: If you really just want to combine the reads into one large sample, just add up the numbers, without normalization. If, however, you want to compare the samples against each other, and normalize to make them comparable, you may want to look at the normalization functions of DESeq (function 'estimateSizeFactors') or edgeR (function 'calcNormFactors').

Simon
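The two options in the answer can be illustrated with a small sketch. It uses Python rather than R, it is not the DESeq or edgeR code, and the gene names and counts are invented. `pool_counts` adds raw counts without normalization. `size_factors` is a simplified version of the median-of-ratios idea commonly described for DESeq's 'estimateSizeFactors', and the real function may differ in details.

```python
import math
from statistics import median


def pool_counts(sample_a, sample_b):
    """Pool two replicates by adding raw counts gene by gene (no normalization)."""
    genes = set(sample_a) | set(sample_b)
    return {g: sample_a.get(g, 0) + sample_b.get(g, 0) for g in genes}


def size_factors(samples):
    """Simplified median-of-ratios size factors for a list of {gene: count} dicts.

    For each gene, take the geometric mean of its counts across samples; a
    sample's size factor is the median ratio of its counts to those means.
    Genes with a zero count in any sample are skipped.
    """
    shared = set.intersection(*(set(s) for s in samples))
    usable = [g for g in shared if all(s[g] > 0 for s in samples)]
    if not usable:
        raise ValueError("no gene has a nonzero count in every sample")
    log_geo_mean = {
        g: sum(math.log(s[g]) for s in samples) / len(samples) for g in usable
    }
    return [
        math.exp(median(math.log(s[g]) - log_geo_mean[g] for g in usable))
        for s in samples
    ]


# Hypothetical toy counts, invented for illustration only.
small = {"geneA": 2, "geneB": 10, "geneC": 1, "geneD": 0}
large = {"geneA": 100, "geneB": 500, "geneC": 50, "geneD": 3}

pooled = pool_counts(small, large)
factors = size_factors([small, large])

print("pooled counts:", dict(sorted(pooled.items())))
print("size factors:", [round(f, 4) for f in factors])
```

Dividing each sample's counts by its size factor puts the samples on a common scale for comparison. Pooling leaves the raw counts untouched, so the larger library naturally contributes more.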

written 14.7 years ago by Simon Anders ★ 3.8k

Hi,

I just now looked at the "counting reads with HTSeq" page (http://www-huber.embl.de/users/anders/HTSeq/doc/count.html).

Is it possible to use a 'bed' file (instead of GFF) to provide the gene models for counting?

Probably you already have access to plenty of 'bed' format files to try, but just in case not, here is the list of all gene models from the Arabidopsis thaliana genome:

http://www.bioviz.org/quickload/A_thaliana_Jun_2009/TAIR9.bed.gz

-Ann

On Sat, Aug 14, 2010 at 5:31 AM, Simon Anders <anders at embl.de> wrote:
> [...]
> (Try, for example, my htseq-count script
> [http://www-huber.embl.de/users/anders/HTSeq/doc/count.html] to see whether
> its results are similar to yours.)
> [...]

written 14.7 years ago by Ann Loraine ▴ 110
