Xiaohui Wu
Thank you, Simon. Yes, I want to pool them together for downstream
analyses, such as comparing with other tissues or analysing that
dataset on its own. At first, I thought that I should compare the two
libs to make sure they are similar enough to be combined. The
correlation of gene expression across all genes between these two libs
was 0.5, which is not very high, but the correlation between another
two, different libs was only 0.01, so I thought they could be combined.
I'm sure my counting is correct. You are right that I was not clear
enough: a gene with 1 read in the small lib does have more reads in
the big lib, but for 80% of these genes it is no more than 20 times as
many, and for 50% it is fewer than 10 times as many. When I used a
normalization method like DESeq or edgeR, as you suggested, to
estimate the size factors, the normalized lib sizes still differed by
about 20 times. For example, the big lib size is 1,200,000 and the
small one is 25,000; the size factor is 1:2.5, so after normalization
the lib sizes are 1,200,000:62,500, which is still about 20 times. I
mean that after normalization, the gene expression in the small lib
will become higher than in the big one.
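
The arithmetic in this example can be checked with a short sketch. It
assumes that the 1:2.5 size factor is applied as a multiplier of 2.5 on
the small library's size; the variable names are only illustrative.

```python
# Library sizes quoted in the message above.
big_lib_size = 1_200_000
small_lib_size = 25_000

# Assumption: the 1:2.5 size factor scales the small library's size by 2.5.
small_scale = 2.5

raw_ratio = big_lib_size / small_lib_size        # ratio before adjustment
adjusted_small = small_lib_size * small_scale    # adjusted small library size
adjusted_ratio = big_lib_size / adjusted_small   # ratio after adjustment

print(f"raw ratio: {raw_ratio:.1f}")
print(f"adjusted small library size: {adjusted_small:,.0f}")
print(f"adjusted ratio: {adjusted_ratio:.1f}")
```

Under that assumption the adjusted small library size is 62,500, and
the big library is still roughly 20 times larger.
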
Maybe I am just worrying too much. There will be some genes with
higher expression in lib1 and others with higher expression in lib2; I
can't expect them to be fully consistent even when the two libs come
from the same tissue under the same condition. As you said, I will add
up the counts without normalization for analyses of the single pooled
lib, and normalize the libs for sample comparison. Thank you again,
I'm not so confused now.
I have another question, about the size factor. The TPM normalization
is: newCount=(oldCount*1,000,000)/libsize. Does the normalization in
edgeR or DESeq only estimate a size factor to adjust the lib size, and
nothing else? That is, can I replace libsize with the adjusted libsize
in the TPM formula to do the normalization?
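
To make the question concrete, here is a minimal sketch of the TPM
formula above with the library size as a plain parameter, so that an
adjusted library size can be passed in its place. The function name
`tags_per_million` and the factor 2.5 are illustrative assumptions, not
part of DESeq or edgeR. As far as I understand, edgeR calls the product
of library size and normalization factor the "effective library size",
and that is how the adjusted value is modelled here.

```python
def tags_per_million(count, lib_size):
    """Scale a raw count to tags per million: count * 1,000,000 / lib_size."""
    if lib_size <= 0:
        raise ValueError("lib_size must be positive")
    return count * 1_000_000 / lib_size


small_lib_size = 25_000
norm_factor = 2.5  # hypothetical normalization factor for the small library

# Assumption: adjusted (effective) library size = raw size * factor.
effective_lib_size = small_lib_size * norm_factor

tpm_raw = tags_per_million(1, small_lib_size)            # raw library size
tpm_adjusted = tags_per_million(1, effective_lib_size)   # adjusted library size

print(tpm_raw, tpm_adjusted)
```

With these numbers a 1-read gene scales to 40 per million with the raw
library size and to 16 per million with the adjusted one.
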
Xiaohui
Simon Anders
2010-08-14 05:31:47
Wu, Xiaohui Ms.
bioconductor
Re: [BioC] Normalization of polyA RNA-seq?
Hi Xiaohui
> I have two libraries of RNA-seq only with polyA of same tissue (leaf and
> leaf), and have mapped them to the genome. Most of these reads are in 3'UTR
> but not spread over the whole gene body. And the size of these two
> libraries are in great difference, like 25,000 reads versus 1,250,000 reads.
> About 40% and 60% of genes only have 1 read in small lib and bigger one,
> respectively. Most of the tags are dominated by only a few genes. I want to
> combine these two libs for larger one, but I think I should normalize the
> read count before pooling them together.
>
> If use TPM normalization, read count in smaller library will be multiplied
> by 50 times, that means the 1-tag gene will become 50-tag gene, in the
> small lib, while maybe that gene is also 1-tag gene in bigger lib, I feel
> not comfortable that TPM may make skew the read distribution. Do you have
> any idea on normalizing the data instead of TPM?
So, you want to ensure that both libraries get the same weight in your
downstream analysis. But why would you want that? The smaller library
contains less information, so it should not get the same weight.
Actually, your description is not too clear. You want to combine the
two libraries into a single one, i.e., give up the information about
which sample each read came from. This would make sense only if these
are replicates. If so, it seems very suspicious that a gene that has
one count in the small library only gets one count in the bigger one.
This might occur occasionally, but should not happen for many genes.
You should really double-check whether you did the counting correctly.
(Try, for example, my htseq-count script
[http://www-huber.embl.de/users/anders/HTSeq/doc/count.html] to see
whether its results are similar to yours.)
Apart from this issue: If you really just want to combine the reads
into one large sample, just add up the counts, without normalization.
If, however, you want to compare the samples against each other, and
normalize to make them comparable, you may want to look at the
normalization functions of DESeq (function 'estimateSizeFactors') or
edgeR (function 'calcNormFactors').
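
The two options can be sketched in a few lines. This is only an
illustration on made-up counts, not the code of either package: the
function names are hypothetical, and the size factors follow the
median-of-ratios idea behind DESeq's 'estimateSizeFactors' (the median,
over genes, of each count divided by the gene's geometric mean across
libraries). edgeR's 'calcNormFactors' uses a different method (TMM by
default), which is not shown here.

```python
import math
from statistics import median


def pool_counts(lib_a, lib_b):
    """Option 1: add raw counts gene by gene, without normalization."""
    genes = set(lib_a) | set(lib_b)
    return {g: lib_a.get(g, 0) + lib_b.get(g, 0) for g in genes}


def median_of_ratios_size_factors(libs):
    """Option 2: one size factor per library (median-of-ratios sketch).

    Genes with a zero count in any library are skipped, because their
    geometric mean is zero.
    """
    genes = [g for g in libs[0] if all(lib.get(g, 0) > 0 for lib in libs)]
    if not genes:
        raise ValueError("no gene has a positive count in every library")
    # Log of the geometric mean of each gene across libraries.
    log_geo_mean = {
        g: sum(math.log(lib[g]) for lib in libs) / len(libs) for g in genes
    }
    return [
        math.exp(median([math.log(lib[g]) - log_geo_mean[g] for g in genes]))
        for lib in libs
    ]


# Made-up toy counts: the big library has exactly 20 times the counts.
small = {"g1": 1, "g2": 2, "g3": 10}
big = {"g1": 20, "g2": 40, "g3": 200}

pooled = pool_counts(small, big)
size_factors = median_of_ratios_size_factors([small, big])

# Dividing each library's counts by its size factor makes them comparable.
normalized_small = {g: c / size_factors[0] for g, c in small.items()}
normalized_big = {g: c / size_factors[1] for g, c in big.items()}
```

On this toy data the two size factors differ by a factor of 20, so the
normalized counts of the two libraries agree gene by gene.
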
Simon