Error: Input Files Have Different Numbers of Reads
abano @abano-23787

When I try to run the Rsubread commands subjunc or align in RStudio, I get the error "Input files have different amounts of reads".

subjunc(index="~/Documents/HarvardLincs/myindedx", readfile1="~/Documents/HarvardLincs/SRR120607501.fastq.gz", readfile2="~/Documents/HarvardLincs/SRR120607502.fastq.gz", outputfile="BT549.bam", nthreads=8)

ERROR: two input files have different amounts of reads! The program has to terminate and no alignment results were generated!

Error in .load.delete.summary(output_file[i]) : Summary file BT549.bam.summary was not generated! The program terminated wrongly!

sessionInfo()

R version 4.0.1 (2020-06-06)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.5

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] Rsubread_2.2.2

loaded via a namespace (and not attached):
[1] compiler_4.0.1  Matrix_1.2-18   tools_4.0.1     tinytex_0.24    grid_4.0.1
[6] xfun_0.15       lattice_0.20-41

I get the same error message for several fastq files that I downloaded from GEO. Is it true that all these files have a data problem, or is this some other issue?

Thanks AB


Looking at the file names it appears that you might have used fastq-dump to get these data? Or did you download the original format files from Google or AWS and rename? If the former, did you ensure you did things correctly? Like, did you do

zcat ~/Documents/HarvardLincs/SRR120607501.fastq.gz | wc -l
zcat ~/Documents/HarvardLincs/SRR120607502.fastq.gz | wc -l

and ensure that you get the same number of reads?
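
For reference, here is one way to script that check (a minimal sketch, using the paths from the question; a FASTQ record occupies four lines, so reads = lines / 4):

# Count reads in each file; on macOS, zcat may expect .Z files, so gunzip -c is a safer choice.
for f in ~/Documents/HarvardLincs/SRR120607501.fastq.gz \
         ~/Documents/HarvardLincs/SRR120607502.fastq.gz
do
    lines=$(zcat "$f" | wc -l)
    echo "$f: $((lines / 4)) reads"
done

The two printed read counts should be identical for a properly paired run.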


Yes, I used the SRA Toolkit to download the data on a cluster. Then I downloaded it to my laptop to process it in RStudio.

I just ran the commands that you provided, and this is what I get:

[abano@sabine Harvard]$ zcat SRR120607501.fastq.gz | wc -l
101032420
[abano@sabine Harvard]$ zcat SRR120607502.fastq.gz | wc -l
101029168

So the two source files have different numbers of reads. Is there a way to fix this? What are my options?

James W. MacDonald @james-w-macdonald-5106

You don't need to post the same comment three times. And this question is now off-topic for this site, having to do with getting data and stuff rather than using Bioconductor tools. Howeva...

You could have known about this issue by paying attention to the messages you get from fastq-dump:

$ fastq-dump --split-files SRR12060750
2020-07-02T15:36:38 fastq-dump.2.9.6 sys: error unknown while reading file within network system module - mbedtls_ssl_read returned -76 ( NET - Reading information from the socket failed )
##<snip of lots more errors that don't matter since fastq-dump will just keep chugging>
Rejected 813 READS because READLEN < 1
Read 25258105 spots for SRR12060750

You see that last bit about Rejecting 813 reads? And do note that (101032420 - 101029168)/4 = 813.
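
Spelling out that arithmetic (wc -l reports line counts, and each FASTQ read occupies four lines):

# 101032420 - 101029168 = 3252 extra lines, i.e. 813 whole reads
echo $(( (101032420 - 101029168) / 4 ))   # prints 813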

You could have avoided this by using --split-3:

  --split-3                        Legacy 3-file splitting for mate-pairs:
                                   First biological reads satisfying dumping
                                   conditions are placed in files *_1.fastq and
                                   *_2.fastq If only one biological read is
                                   present it is placed in *.fastq Biological
                                   reads and above are ignored.

or you could just get the original FASTQ files from Google or AWS (see the run's entry in the SRA Run Browser, under the Data Access tab).
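
For what it's worth, a re-download along those lines might look like this (a sketch only; --split-3, --gzip and --outdir are standard fastq-dump options, and the output directory is just an example):

# Properly paired reads go to SRR12060750_1.fastq.gz / _2.fastq.gz;
# any read whose mate was rejected ends up in SRR12060750.fastq.gz instead.
fastq-dump --split-3 --gzip --outdir ~/Documents/HarvardLincs SRR12060750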
