Question

support for reading direct RNA-Seq FASTQs

0

Entering edit mode

kent.riemondy ▴ 20

@kentriemondy-14219

Last seen 9 months ago

Denver, University of Colorado Anschutz…

I'm working with Nanopore direct RNA-seq data which generates FASTQs with U bases. Is there a Bioconductor package that supports reading these files? I've tried using methods from ShortRead, but get errors due to the U bases. Thanks in advance.

suppressPackageStartupMessages(library(ShortRead))

fq <- tempfile()
u_fq_txt <- paste(c("@readid", 
                     "UCGA",
                     "+",
                     "]]]]"), 
                   collapse = "\n")
writeLines(u_fq_txt, fq)

strm <- FastqStreamer(fq)
yield(strm)
#> Error in x$yield(...): _DNAencode(): invalid DNAString input character: 'U' (byte value 85)

readFastq(fq)
#> Error: Input/Output
#>   file(s):
#>     /var/folders/r9/g3c47jrj40gc14d8qsqx7src0000gn/T//RtmpwZ8EOi/file4dee4320c214
#>   message: invalid character '


t_fq_txt <- paste(c("@readid", 
                    "TCGA",
                    "+",
                    "]]]]"), 
                  collapse = "\n")
writeLines(t_fq_txt, fq)

strm <- FastqStreamer(fq)
yield(strm)
#> class: ShortReadQ
#> length: 1 reads; width: 4 cycles

readFastq(fq)
#> class: ShortReadQ
#> length: 1 reads; width: 4 cycles

unlink(fq)

ShortRead nanopore • 1.1k views

ADD COMMENT • link 3.0 years ago kent.riemondy ▴ 20

score 1 · Answer 1 · 2022-04-20

1

Entering edit mode

James W. MacDonald 68k

@james-w-macdonald-5106

Last seen 1 hour ago

United States

What's the use case? If you just need to read the sequences into R, you could use Biostrings, in particular readRNAStringSet

> fq <- tempfile()
> u_fq_txt <- paste(c("@readid", 
                     "UCGA",
                     "+",
                     "]]]]"), 
                   collapse = "\n")
> writeLines(u_fq_txt, fq)
> readRNAStringSet(fq, "fastq")
RNAStringSet object of length 1:
    width seq                                               names               
[1]     4 UCGA                                              readid

ADD COMMENT • link 3.0 years ago James W. MacDonald 68k

0

Entering edit mode

Or alternatively

> z <- readRNAStringSet(fq, "fastq", with.qualities = TRUE)

> zz <- QualityScaledRNAStringSet(z, PhredQuality(mcols(z)$qualities))

> zz
  A QualityScaledRNAStringSet instance containing:

RNAStringSet object of length 1:
    width seq                                               names               
[1]     4 UCGA                                              readid

PhredQuality object of length 1:
    width seq
[1]     4 ]]]]

ADD REPLY • link 3.0 years ago James W. MacDonald 68k

0

Entering edit mode

The original use case was to concatenate multiple fastqs into a single fastq, while also converting the U's to T's for compatibility with downstream tools. The streaming functionality of FastqStreamer seemed like a good approach to keep the memory usage low while converting each fastq. I could use unix tools ( e.g. cat and awk), but was curious to see how to do it using bioconductor tools.

Your response helped point me in the right direction. I used a BStringSet initially to allow for Us to be converted to Ts, and subsequently coerced to a DNAStringSet. I could then generate a ShortReadQ object and write records to disk with writeFastq. Probably not the most efficient but worked for my initial use case.

Here's my approach, that allows for limiting the # of lines read at a time, in case it is useful to anyone.

suppressPackageStartupMessages(library(ShortRead))

fq <- tempfile()
outfq <- tempfile()
u_fq_txt <- paste(c("@readid", 
                    "UCGA",
                    "+",
                    "]]]]"), 
                  collapse = "\n")
writeLines(u_fq_txt, fq)

convert_rna_fastq <- function(fn, outfn, n_per_chunk = 100000){
  if(file.exists(outfn)){
    unlink(outfn)
  }
  stopifnot(n_per_chunk %% 4 == 0)

  con <- file(description=fn, open="r")   
  data <- readLines(con, n = n_per_chunk)

  repeat {
    if (length(data) == 0)
      break

    if(length(data) %% 4 != 0){
      stop("something wrong with fastq file?")
    }

    n_lines <- length(data)
    id_lines <- seq(1, n_lines, by = 4)
    seq_lines <- seq(2, n_lines, by = 4)
    qual_lines <- seq(4, n_lines, by = 4)

    ids <- BStringSet(data[id_lines], start = 2)

    sread <- BStringSet(data[seq_lines])
    sread <- chartr("U", "T", sread)
    sread <- as(sread, "DNAStringSet")

    qual <- FastqQuality(data[qual_lines])

    fastq <- new("ShortReadQ", id = ids, sread = sread, quality = qual)

    writeFastq(fastq, outfn, mode='a')

    # last chunk was final chunk
    if (length(data) != n_per_chunk)  
      break
    # read next chunk
    data <- tryCatch(readLines(con, n = n_per_chunk), error=function(e) e)
  }

  close(con)
}

convert_rna_fastq(fq, outfq)

fq_records <- readFastq(outfq)
sread(fq_records)
#> DNAStringSet object of length 1:
#>     width seq
#> [1]     4 TCGA

^{Created on 2022-04-20 by the [reprex package](https://reprex.tidyverse.org) (v2.0.1)}

ADD REPLY • link 3.0 years ago kent.riemondy ▴ 20