question about DeSeq

0

Entering edit mode

vasu punj ▴ 80

@vasu-punj-4410

Last seen 10.5 years ago

What kind of input DEseq can take Can it take Cufflinks out put or we have to use FKPM conversion. Thanks [[alternative HTML version deleted]]

DESeq DESeq • 1.1k views

ADD COMMENT • link updated 14.2 years ago by Simon Anders ★ 3.8k • written 14.2 years ago by vasu punj ▴ 80

0

Entering edit mode

Simon Anders ★ 3.8k

@simon-anders-3855

Last seen 4.5 years ago

Zentrum für Molekularbiologie, Universi…

Hi On 12/22/2010 08:23 AM, vasu punj wrote: > What kind of input DEseq can take Can it take Cufflinks out put or we have to use FKPM conversion. DESeq needs integer counts, i.e., just the raw number of reads that align to a feature, without any normalization for sequencing depth or transcript length. Hence, the FPKM values reported by cufflinks are not suitable. the easiest is to use HTSeq-count to get a table of raw counts. http://www-huber.embl.de/users/anders/HTSeq/ http://www-huber.embl.de/users/anders/HTSeq/doc/count.html Simon

ADD COMMENT • link 14.2 years ago Simon Anders ★ 3.8k

0

Entering edit mode

On 12/22/2010 05:24 AM, Simon Anders wrote: > Hi > > On 12/22/2010 08:23 AM, vasu punj wrote: >> What kind of input DEseq can take Can it take Cufflinks out put or we >> have to use FKPM conversion. > > DESeq needs integer counts, i.e., just the raw number of reads that > align to a feature, without any normalization for sequencing depth or > transcript length. Hence, the FPKM values reported by cufflinks are not > suitable. the easiest is to use HTSeq-count to get a table of raw counts. > > http://www-huber.embl.de/users/anders/HTSeq/ > http://www-huber.embl.de/users/anders/HTSeq/doc/count.html It is as straight-forward and flexible to count overlaps with Bioconductor IRanges countOverlaps, with the added advantages of integration with annotation resources (e.g., GenomicFeatures TranscriptDb) and familiar (to those on the Bioconductor mailing list) programming environment. Martin > > Simon > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor -- Computational Biology Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: M1-B861 Telephone: 206 667-2793

ADD REPLY • link 14.2 years ago Martin Morgan 25k

0

Entering edit mode

Simon asked for me to put code behind my words, and to provide the equivalent commands to the counting operations at http://www-huber.embl.de/users/anders/HTSeq/doc/count.html There are many apporaches to this kind of task; the following is one. ## Data We'd like test cases representing Simon's figure. So create a `GRangesList` of regions of interest (`subj`) and reads (`query`), e.g., using a short helper function `rng` for brevity library(GenomicFeatures) rng <- function(s, w) ## ranges by 'start', 'width'; same chromosome, strand GRanges(seq="chr1", IRanges(s, width=w), strand="+") subj <- GRangesList(A=rng(1000, 900), B=rng(2000, 900), C=rng(c(3000, 3600), c(500, 300)), D=rng(c(4000, 4600), c(500, 300)), E1=rng(5000, 700), E2=rng(5500, 400), F1=rng(6000, 700), F2=rng(6500, 400), G1=rng(7000, 700), G2=rng(7500, 400)) query <- GRangesList(a=rng(1500, 100), b=rng(2850, 100), c=rng(3450, 200), d=rng(c(4400, 4600), 100), e=rng(5100, 100), f=rng(6450, 100), g=rng(7600, 100)) Normally one might create `subj` with `makeTranscriptDbFromUCSC`, `import` (for GFF-like) or other `rtracklayer` functionality, or from queries to `biomaRt` or using the `AnnotationDbi`, `*org`, and `BSgenome*` Bioconductor annotation packages. `query` might normally come from BAM files (e.g., `readGappedAlignements` in `GenomicRanges` or `scanBam` in `Rsamtools`) or other aligned reads (using base `R` functionality for pure text files, or perhaps `readAligned` from `ShortRead`, if the alignments do not contain gaps). These operations are described in the vignettes of the corresponding packages. ## union To implement this, we count the number of times each read (`query`) overlaps a region of interset (`subj`), then drop those reads with more than one hit and count the number of times each subject overlaps a (now unambiguous) read hitsPerQuery <- countOverlaps(query, subj) countOverlaps(subj, query[hitsPerQuery == 1]) ## strict intersection Simon's intersection requires a way to count queries that lie strictly within regions of interest. This is not available with `GRangesList` objects, so we create a helper function to coerce our data to a `RangesList`, and to create an index that maps in the reverse direction grl2rl <- function(x) ## coerce GRangesList to RangesList as(unlist(x, use.names=FALSE), "RangesList") maprl <- function(x) ## map to 'x' GRangesList indicies from RangesList rep(seq_len(length(x)), elementLengths(x)) We will also need a helper function to take overlaps found on the `RangesList` objects and map them to `GRangesList`, taking into account the query (`qmap`) and subject (`smap`) maps. We also use the opportunity to remove queries that hit multiple subjects recount <- function(olap, qmap, smap) ## re-map to GRangesList, removing multi-hit queries { ## re-map hits q_hits <- qmap[queryHits(olap)] s_hits <- smap[subjectHits(olap)] ## remove queries that align to multiple subjects hitmap0 <- lapply(split(s_hits, q_hits), unique) hitmap <- Filter(function(x) length(x) == 1, hitmap0) ## return as vector of counts tabulate(as.integer(hitmap), max(smap)) } We count with olaps <- findOverlaps(grl2rl(query), grl2rl(subj), type="within") lapply(olaps, recount, maprl(query), maprl(subj)) grl2rl loses strand information. This will be appropriate for some RNAseq protocols; if not, then the two lines above can be repeated for appropriate strand subsetes of query, subj. Interesting exercises would modify 'recount' to, say, use in the 'union' approach to divide queries amongst the regions they overlap (rather than discarding) or to return counts-per-exon rather than summarizing to the original gene level. ## Non-empty intersect The goal is to count reads that overlap, by any number of nucleotides, any single subject. To do this we write a helper to coerce a `GRangesList` to disjoint regions that are represented exactly once, i.e., discarding subject regions overlapping one another. grl2udrl <- function(x) ## GRangesList to unique disjoint RangesList { rl <- unlist(x) d <- disjoin(rl) ud <- d[countOverlaps(d, rl) == 1] as(ud, "RangesList") } mapudrl <- function(x) ## map to 'x' GRangesList from u. d. RangesList subjectHits(findOverlaps(grl2udrl(x), x)) and then we count olaps <- findOverlaps(grl2rl(query), grl2udrl(subj)) lapply(olaps, recount, maprl(query), mapudrl(subj)) Martin ----- Original Message ----- > On 12/22/2010 05:24 AM, Simon Anders wrote: > > Hi > > > > On 12/22/2010 08:23 AM, vasu punj wrote: > >> What kind of input DEseq can take Can it take Cufflinks out put or > >> we > >> have to use FKPM conversion. > > > > DESeq needs integer counts, i.e., just the raw number of reads that > > align to a feature, without any normalization for sequencing depth > > or > > transcript length. Hence, the FPKM values reported by cufflinks are > > not > > suitable. the easiest is to use HTSeq-count to get a table of raw > > counts. > > > > http://www-huber.embl.de/users/anders/HTSeq/ > > http://www-huber.embl.de/users/anders/HTSeq/doc/count.html > > It is as straight-forward and flexible to count overlaps with > Bioconductor IRanges countOverlaps, with the added advantages of > integration with annotation resources (e.g., GenomicFeatures > TranscriptDb) and familiar (to those on the Bioconductor mailing list) > programming environment. > > Martin > > > > > Simon > > > > _______________________________________________ > > Bioconductor mailing list > > Bioconductor at r-project.org > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > Search the archives: > > http://news.gmane.org/gmane.science.biology.informatics.conductor > > > -- > Computational Biology > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 > > Location: M1-B861 > Telephone: 206 667-2793 > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD REPLY • link 14.1 years ago Martin Morgan 25k

Login before adding your answer.