Reading Paired End Native Report Format in ShortRead
1
0
Entering edit mode
Guest User ★ 13k
@guest-user-4897
Last seen 10.2 years ago
Hi, I am trying to read the alignments generated using NovoAlign. The format I have the data is Paired End Native Report Format(http://computing.bio.cam.ac.uk/local/doc/NovoCraftV2.06.pdf). What is the most efficient way to read this data into ShortRead? Since it is paired end data I have two files corresponding to the two sides. I tried without success using the different formats using readAligned(). I also read an earlier posting about it which suggests to convert it to SAM format. I would appreciate your suggestions. Cheers../Murli -- output of sessionInfo(): R version 2.15.0 (2012-03-30) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] ShortRead_1.14.4 latticeExtra_0.6-19 RColorBrewer_1.0-5 [4] Rsamtools_1.8.5 lattice_0.20-6 Biostrings_2.24.1 [7] GenomicRanges_1.8.7 IRanges_1.14.4 BiocGenerics_0.2.0 loaded via a namespace (and not attached): [1] Biobase_2.16.0 bitops_1.0-4.1 grid_2.15.0 hwriter_1.3 stats4_2.15.0 [6] tools_2.15.0 zlibbioc_1.2.0 -- Sent via the guest posting facility at bioconductor.org.
convert convert • 1.2k views
ADD COMMENT
0
Entering edit mode
@martin-morgan-1513
Last seen 4 months ago
United States
On 07/12/2012 11:42 AM, Murli Nair [guest] wrote: > > Hi, > > I am trying to read the alignments generated using NovoAlign. The format I have the data is Paired End Native Report Format(http://computing.bio.cam.ac.uk/local/doc/NovoCraftV2.06.pdf). > What is the most efficient way to read this data into ShortRead? Since it is paired end data I have two files corresponding to the two sides. > I tried without success using the different formats using readAligned(). I also read an earlier posting about it which suggests to convert it to SAM format. > I would appreciate your suggestions. From the document you reference Three output formats are provided. 1. Native 2. Extended Native 3. Pairwise 4. SAM If Paired End Native Report Format is 1 or 2 with a single record per line then I believe the only support for input would be as tab- delimited files (read.table and friends; these are flexible and could easily be used to iterate through a large file in a memory efficient way); you would then use an appropriate constructor, e.g., GenomicRanges::GappedAlignmentPairs, to create an object that you could manipulate. Format 3 looks challenging to parse. Generally, for aligned reads aim for BAM files, which is output format 4 followed by using Rsamtools or other with asBam, sortBam, indexBam to create a sorted bam file and index. use GenomicRanges::readGappedAlignmentPairs for many paired-end tasks. It might help to think a little further ahead about what you want to do, e.g., GenomicRanges::summarizeOverlaps would be useful in RNAseq differential expression to count reads in regions of interest, and would need bam files but would manage data input for you. Martin > Cheers../Murli > > > -- output of sessionInfo(): > > R version 2.15.0 (2012-03-30) > Platform: x86_64-unknown-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 > [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 > [7] LC_PAPER=C LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] ShortRead_1.14.4 latticeExtra_0.6-19 RColorBrewer_1.0-5 > [4] Rsamtools_1.8.5 lattice_0.20-6 Biostrings_2.24.1 > [7] GenomicRanges_1.8.7 IRanges_1.14.4 BiocGenerics_0.2.0 > > loaded via a namespace (and not attached): > [1] Biobase_2.16.0 bitops_1.0-4.1 grid_2.15.0 hwriter_1.3 stats4_2.15.0 > [6] tools_2.15.0 zlibbioc_1.2.0 > > -- > Sent via the guest posting facility at bioconductor.org. > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > -- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793
ADD COMMENT
0
Entering edit mode
Hi Martin, Thanks for your suggestions. I have tried for some time to parse the novoAlign output using read.table, with no success yet. I have just started using shortRead and have not worked with GenomicRanges yet. I have created two files of the paired end data containing ~500 lines of data that may be downloaded from http://bioinformatics.iusb.edu/seqSubset.tar . Would it be possible for your take a look at the format and guide me a little here? I would greatly appreciate it. Cheers../Murli On Thu, Jul 12, 2012 at 2:58 PM, Martin Morgan <mtmorgan at="" fhcrc.org=""> wrote: > On 07/12/2012 11:42 AM, Murli Nair [guest] wrote: >> >> >> Hi, >> >> I am trying to read the alignments generated using NovoAlign. The format I >> have the data is Paired End Native Report >> Format(http://computing.bio.cam.ac.uk/local/doc/NovoCraftV2.06.pdf). >> What is the most efficient way to read this data into ShortRead? Since it >> is paired end data I have two files corresponding to the two sides. >> I tried without success using the different formats using readAligned(). I >> also read an earlier posting about it which suggests to convert it to SAM >> format. >> I would appreciate your suggestions. > > > From the document you reference > > Three output formats are provided. > > 1. Native > > 2. Extended Native > > 3. Pairwise > > 4. SAM > > If Paired End Native Report Format is 1 or 2 with a single record per line > then I believe the only support for input would be as tab-delimited files > (read.table and friends; these are flexible and could easily be used to > iterate through a large file in a memory efficient way); you would then use > an appropriate constructor, e.g., GenomicRanges::GappedAlignmentPairs, to > create an object that you could manipulate. Format 3 looks challenging to > parse. > > Generally, for aligned reads aim for BAM files, which is output format 4 > followed by using Rsamtools or other with asBam, sortBam, indexBam to create > a sorted bam file and index. use GenomicRanges::readGappedAlignmentPairs for > many paired-end tasks. > > It might help to think a little further ahead about what you want to do, > e.g., GenomicRanges::summarizeOverlaps would be useful in RNAseq > differential expression to count reads in regions of interest, and would > need bam files but would manage data input for you. > > Martin > >> Cheers../Murli >> >> >> -- output of sessionInfo(): >> >> R version 2.15.0 (2012-03-30) >> Platform: x86_64-unknown-linux-gnu (64-bit) >> >> locale: >> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 >> [7] LC_PAPER=C LC_NAME=C >> [9] LC_ADDRESS=C LC_TELEPHONE=C >> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >> >> attached base packages: >> [1] stats graphics grDevices utils datasets methods base >> >> other attached packages: >> [1] ShortRead_1.14.4 latticeExtra_0.6-19 RColorBrewer_1.0-5 >> [4] Rsamtools_1.8.5 lattice_0.20-6 Biostrings_2.24.1 >> [7] GenomicRanges_1.8.7 IRanges_1.14.4 BiocGenerics_0.2.0 >> >> loaded via a namespace (and not attached): >> [1] Biobase_2.16.0 bitops_1.0-4.1 grid_2.15.0 hwriter_1.3 >> stats4_2.15.0 >> [6] tools_2.15.0 zlibbioc_1.2.0 >> >> -- >> Sent via the guest posting facility at bioconductor.org. >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> > > > -- > Computational Biology / Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N. > PO Box 19024 Seattle, WA 98109 > > Location: Arnold Building M1 B861 > Phone: (206) 667-2793 > >
ADD REPLY
0
Entering edit mode
On 07/13/2012 11:17 AM, Murli wrote: > Hi Martin, > Thanks for your suggestions. I have tried for some time to parse the > novoAlign output using read.table, with no success yet. I have just > started using shortRead and have not worked with GenomicRanges yet. I > have created two files of the paired end data containing ~500 lines > of data that may be downloaded from > http://bioinformatics.iusb.edu/seqSubset.tar . Would it be possible > for your take a look at the format and guide me a little here? I > would greatly appreciate it. > Cheers../Murli Really I think the most straight-forward approach is to get novoAlign to generate SAM or BAM output; if SAM then use Rsamtools::asBam to convert to BAM. You can then read the data in with Rsamtools::scanBam, GenomicRanges::readGappedAlignments, and in devel GenomicRanges::readGappedAlignmentPairs. Otherwise you'll spend a lot of time working through the appropriate interpretation of the novoAlign fields. I'm attaching a script that might at least provide some inspiration if you really want to parse the current text format; there are some significant problems (see the FIXME's, for example) and some aspects of your files look weird, e.g., the second column contains 'S' indicating single-end alignments, but the intention seems to have been to align paired end data? Martin > > > > On Thu, Jul 12, 2012 at 2:58 PM, Martin Morgan <mtmorgan at="" fhcrc.org=""> wrote: >> On 07/12/2012 11:42 AM, Murli Nair [guest] wrote: >>> >>> >>> Hi, >>> >>> I am trying to read the alignments generated using NovoAlign. The format I >>> have the data is Paired End Native Report >>> Format(http://computing.bio.cam.ac.uk/local/doc/NovoCraftV2.06.pdf). >>> What is the most efficient way to read this data into ShortRead? Since it >>> is paired end data I have two files corresponding to the two sides. >>> I tried without success using the different formats using readAligned(). I >>> also read an earlier posting about it which suggests to convert it to SAM >>> format. >>> I would appreciate your suggestions. >> >> >> From the document you reference >> >> Three output formats are provided. >> >> 1. Native >> >> 2. Extended Native >> >> 3. Pairwise >> >> 4. SAM >> >> If Paired End Native Report Format is 1 or 2 with a single record per line >> then I believe the only support for input would be as tab-delimited files >> (read.table and friends; these are flexible and could easily be used to >> iterate through a large file in a memory efficient way); you would then use >> an appropriate constructor, e.g., GenomicRanges::GappedAlignmentPairs, to >> create an object that you could manipulate. Format 3 looks challenging to >> parse. >> >> Generally, for aligned reads aim for BAM files, which is output format 4 >> followed by using Rsamtools or other with asBam, sortBam, indexBam to create >> a sorted bam file and index. use GenomicRanges::readGappedAlignmentPairs for >> many paired-end tasks. >> >> It might help to think a little further ahead about what you want to do, >> e.g., GenomicRanges::summarizeOverlaps would be useful in RNAseq >> differential expression to count reads in regions of interest, and would >> need bam files but would manage data input for you. >> >> Martin >> >>> Cheers../Murli >>> >>> >>> -- output of sessionInfo(): >>> >>> R version 2.15.0 (2012-03-30) >>> Platform: x86_64-unknown-linux-gnu (64-bit) >>> >>> locale: >>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >>> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 >>> [7] LC_PAPER=C LC_NAME=C >>> [9] LC_ADDRESS=C LC_TELEPHONE=C >>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >>> >>> attached base packages: >>> [1] stats graphics grDevices utils datasets methods base >>> >>> other attached packages: >>> [1] ShortRead_1.14.4 latticeExtra_0.6-19 RColorBrewer_1.0-5 >>> [4] Rsamtools_1.8.5 lattice_0.20-6 Biostrings_2.24.1 >>> [7] GenomicRanges_1.8.7 IRanges_1.14.4 BiocGenerics_0.2.0 >>> >>> loaded via a namespace (and not attached): >>> [1] Biobase_2.16.0 bitops_1.0-4.1 grid_2.15.0 hwriter_1.3 >>> stats4_2.15.0 >>> [6] tools_2.15.0 zlibbioc_1.2.0 >>> >>> -- >>> Sent via the guest posting facility at bioconductor.org. >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >> >> >> -- >> Computational Biology / Fred Hutchinson Cancer Research Center >> 1100 Fairview Ave. N. >> PO Box 19024 Seattle, WA 98109 >> >> Location: Arnold Building M1 B861 >> Phone: (206) 667-2793 >> >> -- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793 -------------- next part -------------- library(GenomicRanges) .novoAlign_input <- function(fl, filt, gal, ...) { ## input col.names <- c("qname", "type", "qseq", "qual", "status", "maps", "mapq", "rname", "pos", "strand", "mrnm", "mpos", "mstrand", "mismatches") df <- read.table(fl, sep="\t", fill=TRUE, col.names=col.names, stringsAsFactors=FALSE, ...) ## filter and coerce to GappedAlignments object gal(filt(df)) } .novoAlign_filter <- function(df, ...) { ## keep only uniquely aligning pairs on the forward or reverse strand keep <- (df$status == "U") & (df$strand %in% c("F", "R")) df[keep,] } .novoAlign_GappedAlignments <- function(df, ...) { ## coerce some information to GappedAlignments object with(df, { GappedAlignments(seqnames=rname, pos=pos, ## FIXME: not correct for indels cigar=paste0(nchar(qseq), "M"), strand=strand(c(F="+", R="-")[strand]), names=qname, qseq=DNAStringSet(qseq), qual=PhredQuality(qual)) }) } .novoAlign_GappedAlignmentPairs <- function(first, last, ...) { ## create GappedAlignmentPairs nms <- union(as.character(seqnames(first)), as.character(seqnames(last))) seqlevels(first) <- seqlevels(last) <- nms isProperPair <- rep(TRUE, length(first)) GappedAlignmentPairs(first, last, isProperPair) } fls <- dir(pattern="novoOutput$") pairs <- lapply(fls, .novoAlign_input, .novoAlign_filter, .novoAlign_GappedAlignments) ## FIXME: is this matching correct? o <- match(names(pairs[[1]]), names(pairs[[2]])) galp <- .novoAlign_GappedAlignmentPairs(pairs[[1]][!is.na(o)], pairs[[2]][o[!is.na(o)]])
ADD REPLY
0
Entering edit mode
Martin Thanks a lot for your time. I contacted the developers of Novoalign and they have a perl script to convert the novo output to sam called novo2sam.pl. I was able to convert the data I have to sam format using that. I shall try to use Rsamtools to convert it to Bam. I tried using samtools and the reference index file to convert it to Bam format which I did achieve. But for some reason it did not have the chromosome information when I tried to read it using readAligned(). Let me play with the suggestions you have just send me and see how it goes. Also, the my data while it is paired end data, my alignments are single read alignments. This is because I am looking at the structure of DNA in chromatin so the contacts are from different chromosomes. I presume it probably is inefficient to align paired-end data if they involve different chromosomes. I am not sure of this but that is the only reason I can think of. You may comment if you have any thoughts on this. Thanks again for script, I greatly appreciate it. Cheers../Murli On Sun, Jul 15, 2012 at 6:31 PM, Martin Morgan <mtmorgan at="" fhcrc.org=""> wrote: > On 07/13/2012 11:17 AM, Murli wrote: >> >> Hi Martin, >> Thanks for your suggestions. I have tried for some time to parse the >> novoAlign output using read.table, with no success yet. I have just >> started using shortRead and have not worked with GenomicRanges yet. I >> have created two files of the paired end data containing ~500 lines >> of data that may be downloaded from >> http://bioinformatics.iusb.edu/seqSubset.tar . Would it be possible >> for your take a look at the format and guide me a little here? I >> would greatly appreciate it. >> Cheers../Murli > > > Really I think the most straight-forward approach is to get novoAlign to > generate SAM or BAM output; if SAM then use Rsamtools::asBam to convert to > BAM. You can then read the data in with Rsamtools::scanBam, > GenomicRanges::readGappedAlignments, and in devel > GenomicRanges::readGappedAlignmentPairs. Otherwise you'll spend a lot of > time working through the appropriate interpretation of the novoAlign fields. > > I'm attaching a script that might at least provide some inspiration if you > really want to parse the current text format; there are some significant > problems (see the FIXME's, for example) and some aspects of your files look > weird, e.g., the second column contains 'S' indicating single-end > alignments, but the intention seems to have been to align paired end data? > > Martin > > > >> >> >> >> On Thu, Jul 12, 2012 at 2:58 PM, Martin Morgan <mtmorgan at="" fhcrc.org=""> wrote: >>> >>> On 07/12/2012 11:42 AM, Murli Nair [guest] wrote: >>>> >>>> >>>> >>>> Hi, >>>> >>>> I am trying to read the alignments generated using NovoAlign. The format >>>> I >>>> have the data is Paired End Native Report >>>> Format(http://computing.bio.cam.ac.uk/local/doc/NovoCraftV2.06.pdf). >>>> What is the most efficient way to read this data into ShortRead? Since >>>> it >>>> is paired end data I have two files corresponding to the two sides. >>>> I tried without success using the different formats using readAligned(). >>>> I >>>> also read an earlier posting about it which suggests to convert it to >>>> SAM >>>> format. >>>> I would appreciate your suggestions. >>> >>> >>> >>> From the document you reference >>> >>> Three output formats are provided. >>> >>> 1. Native >>> >>> 2. Extended Native >>> >>> 3. Pairwise >>> >>> 4. SAM >>> >>> If Paired End Native Report Format is 1 or 2 with a single record per >>> line >>> then I believe the only support for input would be as tab- delimited files >>> (read.table and friends; these are flexible and could easily be used to >>> iterate through a large file in a memory efficient way); you would then >>> use >>> an appropriate constructor, e.g., GenomicRanges::GappedAlignmentPairs, to >>> create an object that you could manipulate. Format 3 looks challenging to >>> parse. >>> >>> Generally, for aligned reads aim for BAM files, which is output format 4 >>> followed by using Rsamtools or other with asBam, sortBam, indexBam to >>> create >>> a sorted bam file and index. use GenomicRanges::readGappedAlignmentPairs >>> for >>> many paired-end tasks. >>> >>> It might help to think a little further ahead about what you want to do, >>> e.g., GenomicRanges::summarizeOverlaps would be useful in RNAseq >>> differential expression to count reads in regions of interest, and would >>> need bam files but would manage data input for you. >>> >>> Martin >>> >>>> Cheers../Murli >>>> >>>> >>>> -- output of sessionInfo(): >>>> >>>> R version 2.15.0 (2012-03-30) >>>> Platform: x86_64-unknown-linux-gnu (64-bit) >>>> >>>> locale: >>>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >>>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >>>> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 >>>> [7] LC_PAPER=C LC_NAME=C >>>> [9] LC_ADDRESS=C LC_TELEPHONE=C >>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >>>> >>>> attached base packages: >>>> [1] stats graphics grDevices utils datasets methods base >>>> >>>> other attached packages: >>>> [1] ShortRead_1.14.4 latticeExtra_0.6-19 RColorBrewer_1.0-5 >>>> [4] Rsamtools_1.8.5 lattice_0.20-6 Biostrings_2.24.1 >>>> [7] GenomicRanges_1.8.7 IRanges_1.14.4 BiocGenerics_0.2.0 >>>> >>>> loaded via a namespace (and not attached): >>>> [1] Biobase_2.16.0 bitops_1.0-4.1 grid_2.15.0 hwriter_1.3 >>>> stats4_2.15.0 >>>> [6] tools_2.15.0 zlibbioc_1.2.0 >>>> >>>> -- >>>> Sent via the guest posting facility at bioconductor.org. >>>> >>>> _______________________________________________ >>>> Bioconductor mailing list >>>> Bioconductor at r-project.org >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> Search the archives: >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>> >>> >>> >>> -- >>> Computational Biology / Fred Hutchinson Cancer Research Center >>> 1100 Fairview Ave. N. >>> PO Box 19024 Seattle, WA 98109 >>> >>> Location: Arnold Building M1 B861 >>> Phone: (206) 667-2793 >>> >>> > > > -- > Computational Biology / Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N. > PO Box 19024 Seattle, WA 98109 > > Location: Arnold Building M1 B861 > Phone: (206) 667-2793 > >
ADD REPLY

Login before adding your answer.

Traffic: 527 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6