Hi, I'm analysing sequencing data and comparing distinct samples. However, two of my conditions have very different read numbers, and that is causing me trouble during the analysis. I would like to remove reads from some of my samples, but the removed reads should be chosen at random so that I don't skew my data. Does anybody have an idea how I can do that?
I don't know what analysis you are conducting or what sort of sequencing you are doing, but I would be horrified to see anyone doing what you propose to do. It would be far better to improve your analysis methods so that the analysis can handle unequal sequencing depths without skewing the results. Generally speaking, such analysis methods do exist.
Having said that, if you have a matrix of read counts, and want to reduce the library size for one or more of the samples, it is easy and quick to do that using the thinCounts() function of the edgeR package. That is equivalent to randomly selecting rows of the raw FastQ file but very, very much more efficient.
For example, if `counts` is a matrix of read counts, then calling thinCounts() on it, with the target size set to the smallest library size, will create a new matrix for which all the columns have the same total count. The thinning is done in such a way as to simulate random selection of reads.
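The call described above can be sketched roughly as follows. This is a minimal illustration, not the exact code from the original answer: the toy matrix, its gene and sample names, and the variable name `thinned` are made up for the example, and `target.size` is written out explicitly even though, as far as I know, the smallest column total is also its default.

```r
library(edgeR)

set.seed(1)  # thinning is random, so fix the seed for reproducibility

# Hypothetical toy data: 5 genes x 3 samples with very unequal library sizes
counts <- matrix(c(100,  200,  50,  30,  20,
                   1000, 2100, 480, 310, 190,
                   10,   25,   4,   3,   2),
                 nrow = 5,
                 dimnames = list(paste0("gene", 1:5), c("A", "B", "C")))

# Library sizes before thinning
colSums(counts)

# Thin every column down to the smallest library size
thinned <- thinCounts(counts, target.size = min(colSums(counts)))

# Library sizes after thinning: every column should now have the same total
colSums(thinned)
```

Each thinned count can only stay the same or go down, because reads are removed and never added.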
Thank you, I think that what you answered is just what I need to do, but I have a problem. I'm working with a FASTQ file whose reads I read with readDNAStringSet and then pass to thinCounts. However, I get this: Error in colSums(x) : 'x' must be numeric. I know that maybe this is too basic, but I'm just starting with bioinformatics. Can you help me figure out how to solve it? Thanks!
No, I can't help because I have no idea what sort of analysis you are trying to do. Why would you run readDNAStringSet? I don't know.
Regarding thinCounts(), the error message seems pretty self-explanatory. thinCounts() operates on a numeric matrix of counts, but that's not what readDNAStringSet produces.
As you have suggested, this is indeed pretty basic. One always needs to pay a bit of attention to what sort of arguments functions accept and what output they produce.
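To make the type mismatch concrete, here is a small sketch of how one might check what kind of object is in hand before passing it to thinCounts(). The in-memory sequences and the variable names `reads` and `counts` are hypothetical; a DNAStringSet built directly stands in for what readDNAStringSet would return from a file.

```r
library(Biostrings)

# Hypothetical stand-in for the result of readDNAStringSet(): a set of sequences
reads <- DNAStringSet(c("ACGTACGT", "GGCATTAC", "TTGACCAA"))
class(reads)        # a DNAStringSet, i.e. sequences rather than counts
is.numeric(reads)   # FALSE, so it is not valid input for thinCounts()

# What thinCounts() expects instead: a numeric matrix of read counts,
# e.g. genes in rows and samples in columns (toy values)
counts <- matrix(c(10, 0, 5, 200, 30, 80), nrow = 3)
is.numeric(counts) && is.matrix(counts)   # TRUE
```

Getting from sequences to such a matrix requires a counting step (for example, aligning the reads and counting them per gene), which depends on the kind of analysis being done.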