Hey,
I am trying to count the number of reads against genomic features (genes) from an RNA-seq dataset. I downloaded ~100 GB of .bam files that correspond to n=12 samples. I generated index files with 'bamIndex()'.
I get an error when trying to make my index files, although the command deposits the .bai files in the working directory as I would expect.
Additionally, when I pass my bamfilelist object to 'summarizeOverlaps()' I get another error message that I am unable to interpret or find a resolution for.
library(TxDb.Drerio.UCSC.danRer10.refGene)
library(GenomicAlignments)
library(org.Dr.eg.db)
library(Rsamtools)
bams <- dir("D:\\NGS\\Danio.CODE\\12-Tissue.RNA-seq",pattern=".bam",full.names=TRUE)
bami<-indexBam(bams) #
error:
1: In FUN(X[[i]], ...) :
[bam_header_read] EOF marker is absent. The input is probably truncated.
2: In FUN(X[[i]], ...) :
[bam_index_core] truncated file? Continue anyway. (-4)
3: In FUN(X[[i]], ...) :
[bam_header_read] EOF marker is absent. The input is probably truncated.
4: In FUN(X[[i]], ...) :
[bam_index_core] truncated file? Continue anyway. (-4)
# I use the index files in spite of the above error message, continuing to the code below
bfl <- BamFileList(bams, asMates = TRUE, yieldSize=7500000)
genes <- genes(TxDb.Drerio.UCSC.danRer10.refGene) # GRanges for GRCz10/danRer10 reference genome
S<-summarizeOverlaps(genes,bfl,mode="Union",singleEnd=FALSE,ignore.strand=TRUE,inter.feature=FALSE,fragments=TRUE)
Warning messages:
1: In doTryCatch(return(expr), name, parentenv, handler) :
[bam_header_read] EOF marker is absent. The input is probably truncated.
2: In doTryCatch(return(expr), name, parentenv, handler) :
[bam_header_read] EOF marker is absent. The input is probably truncated.
3: In doTryCatch(return(expr), name, parentenv, handler) :
[bam_header_read] EOF marker is absent. The input is probably truncated.
4: In doTryCatch(return(expr), name, parentenv, handler) :
[bam_header_read] EOF marker is absent. The input is probably truncated.
count<-assays(gene.data)$counts # counts for all genes are equal to 0 in all 12 samples.... wtf?
These are the .bam files I downloaded from: https://danio-code.zfin.org/dataExport/daniocode/detailSeries/382/ RNA-seqBobelab0001AS.DCD002415SQ.USERmickael.dong.R.genome.bam (13.9 GB) # testis RNA-seqBobelab0001AS.DCD002414SQ.USERmickael.dong.R.genome.bam (3.88 GB) # ovary RNA-seqBobelab0001AS.DCD002413SQ.USERmickael.dong.R.genome.bam (3.76 GB) # unfertilized egg RNA-seqBobelab0001AS.DCD002412SQ.USERmickael.dong.R.genome.bam (9.81 GB) # early embryonic cell RNA-seqBobelab0001AS.DCD002411SQ.USERmickael.dong.R.genome.bam (7.27 GB) # intestine RNA-seqBobelab0001AS.DCD002410SQ.USERmickael.dong.R.genome.bam (12.7 GB) # bone tissue RNA-seqBobelab0001AS.DCD002409SQ.USERmickael.dong.R.genome.bam (8.34 GB) # kidney RNA-seqBobelab0001AS.DCD002408SQ.USERmickael.dong.R.genome.bam (8.31 GB) # liver RNA-seqBobelab0001AS.DCD002407SQ.USERmickael.dong.R.genome.bam (5.67 GB) # muscle RNA-seqBobelab0001AS.DCD002406SQ.USERmickael.dong.R.genome.bam (17.7 GB) # heart RNA-seqBobelab0001AS.DCD002405SQ.USERmickael.dong.R.genome.bam (9.68 GB) # gill RNA-seqBobelab0001AS.DCD002404SQ.USERmickael.dong.R.genome.bam (6.54 GB) # brain
Can anyone help explain how to fix this error or provide a workaround? Thanks, Matt
> sessionInfo()
R version 3.5.2 (2018-12-20)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 17763)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] org.Dr.eg.db_3.6.0
[2] GenomicAlignments_1.16.0
[3] SummarizedExperiment_1.10.1
[4] DelayedArray_0.6.6
[5] BiocParallel_1.14.2
[6] matrixStats_0.54.0
[7] Rsamtools_1.32.3
[8] Biostrings_2.48.0
[9] XVector_0.20.0
[10] TxDb.Drerio.UCSC.danRer10.refGene_3.4.3
[11] GenomicFeatures_1.32.3
[12] AnnotationDbi_1.42.1
[13] Biobase_2.40.0
[14] GenomicRanges_1.32.7
[15] GenomeInfoDb_1.16.0
[16] IRanges_2.14.12
[17] S4Vectors_0.18.3
[18] BiocGenerics_0.26.0
loaded via a namespace (and not attached):
[1] Rcpp_1.0.0 compiler_3.5.2 prettyunits_1.0.2
[4] bitops_1.0-6 tools_3.5.2 progress_1.2.0
[7] zlibbioc_1.26.0 biomaRt_2.36.1 digest_0.6.18
[10] bit_1.1-14 lattice_0.20-38 RSQLite_2.1.1
[13] memoise_1.1.0 pkgconfig_2.0.2 rlang_0.3.1
[16] Matrix_1.2-15 DBI_1.0.0 GenomeInfoDbData_1.1.0
[19] rtracklayer_1.40.6 stringr_1.4.0 httr_1.4.0
[22] hms_0.4.2 grid_3.5.2 bit64_0.9-7
[25] R6_2.4.0 XML_3.98-1.17 blob_1.1.1
[28] magrittr_1.5 assertthat_0.2.0 stringi_1.2.4
[31] RCurl_1.95-4.11 crayon_1.3.4
Thank you very much for looking into this and providing these recommendations.
To solve this problem, I downloaded the .bam files provided for the updated reference assembly (danRer11 vs. danRer10), and I did not run into any issues.
It is likely that the download went awry on the first go
Zebrafish genomics can be so difficult sometimes!
It might also be that you have unfortunately downloaded the file as 'text' rather than 'binary'.