Question

polyester: adding noise to RNASeq counts?

krishna312 • 0

@krishna312-6866

Finland

I have raw RNA-seq count data. I want to generate different versions of the count data with varying levels of random noise for a method evaluation. For example, the data with the highest level of noise should have the fewest differentially expressed genes, and vice versa.

I estimated the parameters of the original count data using the `get_params` function in the `polyester` package.
The `create_read_numbers` function then uses the estimated parameters to generate count data with a similar distribution, but without biological signal (no differentially expressed genes).

Is it possible to retain the biological signal of the original data in the artificial data, and then add varying levels of noise to the generated data?

I would appreciate your help!

Best wishes,

Krishna

rnaseq polyester

updated 8.8 years ago by Alyssa Frazee • written 8.8 years ago by krishna312

score 2 · Accepted Answer · 2016-07-05

Hey Krishna,

The `create_read_numbers` function has two arguments, `mod` and `beta`, that let you specify differential expression. `mod` is generally used to specify which group each sample belongs to, and `beta` then holds the differential expression coefficient for each gene. `beta` should have one entry per gene, and its effect on the counts is multiplicative, since the outcome value is on the log scale in this function. One way to retain the DE signal from the original data would be to estimate the differential expression coefficients directly from the original data and pass them in as `beta`.
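As a rough sketch of that first approach, the snippet below estimates a crude per-gene log fold change from a count matrix and shapes it for `mod` and `beta`. The toy `counts` matrix, the two-group `group` vector, and the pseudocount of 1 are all assumptions made for illustration; a proper DE model would likely give better coefficients. The polyester call is left commented out, and the exact shapes it expects should be checked against the package documentation.

```r
# Toy stand-in for the original genes x samples count matrix (assumption).
set.seed(1)
n_genes <- 200
group <- rep(c(0, 1), each = 4)   # assumed two-group design, 4 samples each
counts <- matrix(rnbinom(n_genes * length(group), mu = 100, size = 5),
                 nrow = n_genes)
# Plant a DE signal in the first 20 genes of the second group.
counts[1:20, group == 1] <- counts[1:20, group == 1] * 4

# Crude per-gene log fold change (natural log, pseudocount of 1).
log_mean <- function(x) log(rowMeans(x) + 1)
lfc <- log_mean(counts[, group == 1]) - log_mean(counts[, group == 0])

mod  <- cbind(group)   # samples x 1 matrix of group membership
beta <- cbind(lfc)     # genes x 1 matrix of log-scale coefficients

# Not run: requires polyester; argument names as described in the answer.
# params <- polyester::get_params(counts)
# sim <- polyester::create_read_numbers(params$mu, params$fit, params$p0,
#                                       mod = mod, beta = beta)
```

Because `beta` is on the log scale, a value of `log(4)` corresponds to a fourfold increase in expected counts for that gene in the second group.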

Another way to do this would be to add the differing levels of random noise to the original data yourself (using whatever underlying distribution you'd like), and then use the `simulate_experiment_countmat` function in polyester to generate the simulated reads. We designed `simulate_experiment_countmat` with use cases like this in mind (where you already have a transcript-by-sample matrix of counts).
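A minimal sketch of that second approach might look like the following. The noise model (log-normal multiplicative noise followed by a Poisson redraw), the helper name `add_noise`, the noise levels, and the file paths are all assumptions chosen for illustration, not something polyester prescribes; any distribution that keeps the counts as non-negative integers would do.

```r
# Toy stand-in for the original transcripts x samples count matrix (assumption).
set.seed(1)
counts <- matrix(rnbinom(200 * 8, mu = 100, size = 5), nrow = 200)

# Hypothetical helper: scale each count by log-normal noise, then redraw as
# Poisson so the result stays a matrix of non-negative integers.
add_noise <- function(counts, sd_log) {
  # mean = -sd_log^2 / 2 keeps the expected multiplier equal to 1.
  mult <- exp(rnorm(length(counts), mean = -sd_log^2 / 2, sd = sd_log))
  noisy_mean <- counts * mult
  matrix(rpois(length(counts), lambda = noisy_mean),
         nrow = nrow(counts), dimnames = dimnames(counts))
}

# One noisy version of the data per (arbitrary) noise level.
noise_levels <- c(low = 0.1, medium = 0.5, high = 1.5)
noisy_sets <- lapply(noise_levels, function(s) add_noise(counts, s))

# Not run: requires polyester and a transcript FASTA whose entries match the
# rows of the count matrix; the paths below are placeholders.
# polyester::simulate_experiment_countmat(fasta = "transcripts.fa",
#                                         readmat = noisy_sets$high,
#                                         outdir = "sim_high")
```

Larger values of `sd_log` blur the original signal more, so fewer genes should remain detectably differentially expressed in the noisier versions.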

Hope this helps!