Question

.fastq to .txt conversion for EdgeR package and merging two paired end sequence files

0

hamidrezarazzaghian • 0

@hamidrezarazzaghian-9208

Last seen 3.1 years ago

Canada

Dear all,

I am a post-doc at the University of British Columbia, Canada, and I'm pretty new to RNA-seq data analysis. I want to do TMM normalization on my RNA-seq data using the edgeR package in R. I have two questions:

1) How can I convert .fastq files to .txt files to be able to feed them into the EdgeR package?

2) My RNA-seq data are paired sequence .fastq files. What quality control should I do on them and how should I merge them together prior to analysis?

Thanks for the help,

Hamid

EdgeR normalization fastq txt TMM • 5.3k views

ADD COMMENT • link updated 2.6 years ago by Gordon Smyth 52k • written 9.0 years ago by hamidrezarazzaghian • 0

score 3 · Answer 1 · 2015-11-16

James W. MacDonald 67k

@james-w-macdonald-5106

Last seen 2 days ago

United States

You don't feed FASTQ files to edgeR. You first have to align against the genome of your species and then get counts per gene, which is what you then feed into edgeR. For that you could use something like the Rsubread package. It has a User's guide, so I would start there.
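As a rough illustration of the last step (counts into edgeR, then TMM), here is a minimal sketch. The count matrix is simulated and the two-group design is hypothetical; with real data, `counts` would be the gene-by-sample matrix produced by the alignment and counting step.

```r
library(edgeR)

# Simulated gene-by-sample counts standing in for real per-gene read counts
# (assumption: 1000 genes, 6 samples, two hypothetical groups of 3)
set.seed(1)
counts <- matrix(rnbinom(6000, mu = 50, size = 5), nrow = 1000,
                 dimnames = list(paste0("gene", 1:1000),
                                 paste0("sample", 1:6)))
group <- factor(rep(c("control", "treated"), each = 3))

# Wrap the counts in a DGEList, the container edgeR works on
y <- DGEList(counts = counts, group = group)

# Drop genes with too few reads to be informative
keep <- filterByExpr(y)
y <- y[keep, , keep.lib.sizes = FALSE]

# Compute TMM normalization factors (stored in y$samples$norm.factors)
y <- calcNormFactors(y, method = "TMM")

# Normalized log2 counts per million, e.g. for plots or clustering
logcpm <- cpm(y, log = TRUE)
```

The TMM factors are not applied to the counts themselves; edgeR uses them to adjust the effective library sizes in downstream functions such as `cpm()`.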

ADD COMMENT • link 9.0 years ago James W. MacDonald 67k

score 1 · Answer 2 · 2021-10-09

Gordon Smyth 52k

@gordon-smyth

Last seen 1 hour ago

WEHI, Melbourne, Australia

You can follow one of the example workflows, for example:

Chen Y, Lun ATL, Smyth GK (2016). From reads to genes to pathways: differential expression analysis of RNA-Seq experiments using Rsubread and the edgeR quasi-likelihood pipeline. F1000Research 5, 1438.
RnaSeqGeneEdgeRQL Workflow

or else follow the edgeR User's Guide.

Personally I use Rsubread::align followed by Rsubread::featureCounts to generate counts as input to edgeR, and that works just fine on a Windows laptop running R for Windows. Rsubread takes about 20 minutes on my Windows 10 laptop to align FASTQ files containing about 20 million paired-end reads.
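For readers who want to see what that looks like, here is a minimal sketch of the align-then-count step for paired-end data. All file names, the index name, and the helper function `fastq_to_counts` are hypothetical, and the annotation is assumed to be a GTF file. Note that the two files of a pair are not merged beforehand; they are passed together as `readfile1` and `readfile2`.

```r
# Hypothetical helper: align paired-end FASTQ files and count reads per gene.
# Assumes Rsubread is installed and that a reference genome FASTA and a
# matching GTF annotation are available locally.
fastq_to_counts <- function(r1, r2, reference, gtf,
                            index_base = "genome_index", nthreads = 4) {
  # One BAM file per sample, named after the first read file
  bams <- paste0(tools::file_path_sans_ext(basename(r1)), ".bam")

  # Build the index once per reference genome (slow, but reusable)
  Rsubread::buildindex(basename = index_base, reference = reference)

  # Align both mates of each pair together; no prior merging is needed
  Rsubread::align(index = index_base,
                  readfile1 = r1, readfile2 = r2,
                  type = "rna", output_format = "BAM",
                  output_file = bams, nthreads = nthreads)

  # Count read pairs (fragments) per gene
  fc <- Rsubread::featureCounts(files = bams,
                                annot.ext = gtf,
                                isGTFAnnotationFile = TRUE,
                                isPairedEnd = TRUE,
                                nthreads = nthreads)
  fc$counts
}

# Example usage (hypothetical file names):
# counts <- fastq_to_counts(r1 = c("A_R1.fastq", "B_R1.fastq"),
#                           r2 = c("A_R2.fastq", "B_R2.fastq"),
#                           reference = "genome.fa", gtf = "genes.gtf")
# write.table(counts, "counts.txt", sep = "\t", quote = FALSE)
```

The returned matrix has one row per gene and one column per sample, which is the form edgeR expects; writing it out with `write.table()` is only needed if a .txt file is wanted for other tools.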

See also

Liao, Y, Smyth, GK, Shi, W (2019). The R package Rsubread is easier, faster, cheaper and better for alignment and quantification of RNA sequencing reads. Nucleic Acids Research 47(8), e47.

or see

Law CW, Alhamdoosh M, Su S, Dong X, Tian L, Smyth GK, Ritchie ME (2016). RNA-seq analysis is easy as 1-2-3 with limma, Glimma and edgeR. F1000Research 5, 1408.

for a different workflow using limma and edgeR.

ADD COMMENT • link 3.1 years ago Gordon Smyth 52k

0

According to the Rsubread guide, we are supposed to feed the unmapped read files in txt format into the software. Is there a suggested way to convert them, or can I just input the FASTQ files directly? Many thanks!

ADD REPLY • link 2.6 years ago Yeji • 0

0

Rsubread::align takes FASTQ files as input directly. There is no suggestion in the Rsubread User's Guide or documentation that any sort of conversion is required.

ADD REPLY • link 2.6 years ago Gordon Smyth 52k

score 0 · Answer 3 · 2015-11-18

hamidrezarazzaghian • 0

@hamidrezarazzaghian-9208

Last seen 3.1 years ago

Canada

Thanks James for the fast reply. Unfortunately Rsubread is not available in Windows-based R. Do you know of any other package for this purpose?

Thanks

ADD COMMENT • link 9.0 years ago hamidrezarazzaghian • 0

2

As Martin noted, you can use Rbowtie, but that is for the original bowtie aligner, which doesn't do gapped alignments. If you are doing RNA-Seq you probably want bowtie2, which does do gapped alignments. You can run bowtie2 on Windows, so that is probably the best bet, but you have to run it from the command line, not from within R.

Most aligners assume you are using some sort of Linux variant, so you are sort of hamstrung by the fact that you are on Windows. But Linux is free after all, and it's relatively simple to set up a dual-boot Ubuntu/Windows OS on your comp, so if you are serious that might be something to consider.

One thing about kallisto and sleuth (and salmon or sailfish and sleuth while we are at it). These packages are intended to make comparisons at the transcript level, rather than the gene level. Since part of the alignment process is to infer which transcript a read came from, there is additional uncertainty in your count measurement that you have to account for when fitting a model. This has two downsides. First, that additional uncertainty has a cost, which is reduction in power to detect differences. Second, you shouldn't use something like edgeR or DESeq2 for transcript-level counts because the model they fit doesn't account for that uncertainty, so you have to use something like sleuth (either Lior Pachter's version or the patched version from Rob Patro's group) to fit the model. And sleuth is just a github package now, so you are pretty much on your own if you want to go that route.

As an (apparent) beginner, you are probably better off just getting bowtie2 and going from there.

ADD REPLY • link 9.0 years ago James W. MacDonald 67k

0

AFAIK "gapped alignments" in bowtie2 means indels, not junctions, so bowtie2 is not suited for RNA-seq. The original bowtie only supported mismatches, no indels.

H.

ADD REPLY • link 9.0 years ago Hervé Pagès 16k

0

Hi Hervé,

Thanks for pointing that out. I naively thought that 'gapped alignment' was more or less a consistently applied term, but obviously not so much.

ADD REPLY • link 9.0 years ago James W. MacDonald 67k

1

The Rbowtie package wraps an (older?) version of the Bowtie aligner, but probably most people use alignment tools outside R. The airway vignette and the differential expression workflow describe overall approaches that go from FASTQ to count matrices via whole-genome alignment. kallisto is a different and fast, though not cross-platform, approach; see SummarizedExperiment::readKallisto() in addition to the github sleuth package.

The poster has FASTQ files, but needs alignment (BAM) files before trying to count reads; b.nota's efforts would only be relevant after alignment. Ways to summarize aligned reads to counts across platforms and in R include bamsignals or perhaps GenomicAlignments::summarizeOverlaps().
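To make the summarizeOverlaps() route concrete, here is a minimal sketch of counting paired-end reads per gene from existing BAM files. The helper name `count_from_bams` is hypothetical, the gene models are assumed to come from a TxDb object the user has already built or loaded, and the library is assumed to be unstranded; summarizeOverlaps() is called from GenomicAlignments, which exports it in current Bioconductor releases.

```r
# Hypothetical helper: gene-level counts from BAM files, entirely within R.
# Assumes GenomicFeatures, GenomicAlignments, Rsamtools and
# SummarizedExperiment are installed, and that `txdb` is an existing TxDb
# object matching the genome the reads were aligned to.
count_from_bams <- function(bam_files, txdb) {
  # Exons grouped by gene define the features to count over
  genes <- GenomicFeatures::exonsBy(txdb, by = "gene")

  # yieldSize limits how many reads are held in memory at once
  bams <- Rsamtools::BamFileList(bam_files, yieldSize = 2e6)

  se <- GenomicAlignments::summarizeOverlaps(
    features = genes, reads = bams,
    mode = "Union",
    singleEnd = FALSE,     # paired-end data
    fragments = TRUE,      # also count mates whose partner did not align
    ignore.strand = TRUE   # assumption: unstranded library
  )

  # Gene-by-sample integer matrix, usable as edgeR input
  SummarizedExperiment::assay(se)
}

# Example usage (hypothetical file names and TxDb object):
# counts <- count_from_bams(c("sampleA.bam", "sampleB.bam"), txdb)
```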

ADD REPLY • link 9.0 years ago Martin Morgan 25k

0

I counted the reads once in R with a self-made script using the libraries IRanges, GenomicRanges, and Rsamtools. However, if you are pretty new to RNA-seq I would not recommend trying this yourself. It was pretty hard to do.

I think the easiest way for you to get your counts is to install a virtual machine with Ubuntu and try featureCounts in Rsubread there.

ADD REPLY • link 9.0 years ago b.nota ▴ 370

0

"Unfortunately Rsubread is not available in Windows-based R."

Rsubread has been available in R for Windows for a few years now.

ADD REPLY • link 3.1 years ago Gordon Smyth 52k

score 0 · Answer 4 · 2015-11-19

hamidrezarazzaghian • 0

@hamidrezarazzaghian-9208

Last seen 3.1 years ago

Canada

Thanks everyone for all the help.

ADD COMMENT • link 9.0 years ago hamidrezarazzaghian • 0