indexBam(), summarizeOverlaps() error: RNA-seq
1
0
Entering edit mode
mat149 ▴ 80
@mat149-11450
Last seen 6 days ago
United States

Hey, I am trying to count the number of reads against genomic features (genes) from an RNA-seq dataset. I downloaded ~100 GB of .bam files that correspond to n=12 samples. I generated index files with 'bamIndex()'.
I get an error when trying to make my index files, although the command deposits the .bai files in the working directory as I would expect. Additionally, when I pass my bamfilelist object to 'summarizeOverlaps()' I get another error message that I am unable to interpret or find a resolution for.

library(TxDb.Drerio.UCSC.danRer10.refGene)
library(GenomicAlignments)
library(org.Dr.eg.db)
library(Rsamtools)

bams <- dir("D:\\NGS\\Danio.CODE\\12-Tissue.RNA-seq",pattern=".bam",full.names=TRUE)
bami<-indexBam(bams)  #
error:
1: In FUN(X[[i]], ...) :
  [bam_header_read] EOF marker is absent. The input is probably truncated.
2: In FUN(X[[i]], ...) :
  [bam_index_core] truncated file? Continue anyway. (-4)
3: In FUN(X[[i]], ...) :
  [bam_header_read] EOF marker is absent. The input is probably truncated.
4: In FUN(X[[i]], ...) :
  [bam_index_core] truncated file? Continue anyway. (-4)
# I use the index files in spite of the above error message, continuing to the code below

    bfl <- BamFileList(bams, asMates = TRUE, yieldSize=7500000)
    genes <- genes(TxDb.Drerio.UCSC.danRer10.refGene)       # GRanges for GRCz10/danRer10 reference genome
    S<-summarizeOverlaps(genes,bfl,mode="Union",singleEnd=FALSE,ignore.strand=TRUE,inter.feature=FALSE,fragments=TRUE)
Warning messages:
1: In doTryCatch(return(expr), name, parentenv, handler) :
  [bam_header_read] EOF marker is absent. The input is probably truncated.
2: In doTryCatch(return(expr), name, parentenv, handler) :
  [bam_header_read] EOF marker is absent. The input is probably truncated.
3: In doTryCatch(return(expr), name, parentenv, handler) :
  [bam_header_read] EOF marker is absent. The input is probably truncated.
4: In doTryCatch(return(expr), name, parentenv, handler) :
  [bam_header_read] EOF marker is absent. The input is probably truncated.    



count<-assays(gene.data)$counts     # counts for all genes are equal to 0 in all 12 samples.... wtf?

These are the .bam files I downloaded from: https://danio-code.zfin.org/dataExport/daniocode/detailSeries/382/ RNA-seqBobelab0001AS.DCD002415SQ.USERmickael.dong.R.genome.bam (13.9 GB) # testis RNA-seqBobelab0001AS.DCD002414SQ.USERmickael.dong.R.genome.bam (3.88 GB) # ovary RNA-seqBobelab0001AS.DCD002413SQ.USERmickael.dong.R.genome.bam (3.76 GB) # unfertilized egg RNA-seqBobelab0001AS.DCD002412SQ.USERmickael.dong.R.genome.bam (9.81 GB) # early embryonic cell RNA-seqBobelab0001AS.DCD002411SQ.USERmickael.dong.R.genome.bam (7.27 GB) # intestine RNA-seqBobelab0001AS.DCD002410SQ.USERmickael.dong.R.genome.bam (12.7 GB) # bone tissue RNA-seqBobelab0001AS.DCD002409SQ.USERmickael.dong.R.genome.bam (8.34 GB) # kidney RNA-seqBobelab0001AS.DCD002408SQ.USERmickael.dong.R.genome.bam (8.31 GB) # liver RNA-seqBobelab0001AS.DCD002407SQ.USERmickael.dong.R.genome.bam (5.67 GB) # muscle RNA-seqBobelab0001AS.DCD002406SQ.USERmickael.dong.R.genome.bam (17.7 GB) # heart RNA-seqBobelab0001AS.DCD002405SQ.USERmickael.dong.R.genome.bam (9.68 GB) # gill RNA-seqBobelab0001AS.DCD002404SQ.USERmickael.dong.R.genome.bam (6.54 GB) # brain

Can anyone help explain how to fix this error or provide a workaround? Thanks, Matt

> sessionInfo()
R version 3.5.2 (2018-12-20)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 17763)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] org.Dr.eg.db_3.6.0                     
 [2] GenomicAlignments_1.16.0               
 [3] SummarizedExperiment_1.10.1            
 [4] DelayedArray_0.6.6                     
 [5] BiocParallel_1.14.2                    
 [6] matrixStats_0.54.0                     
 [7] Rsamtools_1.32.3                       
 [8] Biostrings_2.48.0                      
 [9] XVector_0.20.0                         
[10] TxDb.Drerio.UCSC.danRer10.refGene_3.4.3
[11] GenomicFeatures_1.32.3                 
[12] AnnotationDbi_1.42.1                   
[13] Biobase_2.40.0                         
[14] GenomicRanges_1.32.7                   
[15] GenomeInfoDb_1.16.0                    
[16] IRanges_2.14.12                        
[17] S4Vectors_0.18.3                       
[18] BiocGenerics_0.26.0                    

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.0             compiler_3.5.2         prettyunits_1.0.2     
 [4] bitops_1.0-6           tools_3.5.2            progress_1.2.0        
 [7] zlibbioc_1.26.0        biomaRt_2.36.1         digest_0.6.18         
[10] bit_1.1-14             lattice_0.20-38        RSQLite_2.1.1         
[13] memoise_1.1.0          pkgconfig_2.0.2        rlang_0.3.1           
[16] Matrix_1.2-15          DBI_1.0.0              GenomeInfoDbData_1.1.0
[19] rtracklayer_1.40.6     stringr_1.4.0          httr_1.4.0            
[22] hms_0.4.2              grid_3.5.2             bit64_0.9-7           
[25] R6_2.4.0               XML_3.98-1.17          blob_1.1.1            
[28] magrittr_1.5           assertthat_0.2.0       stringi_1.2.4         
[31] RCurl_1.95-4.11        crayon_1.3.4
genomicAlignments RNA-seq Rsamtools • 1.5k views
ADD COMMENT
1
Entering edit mode
Mike Smith ★ 6.6k
@mike-smith
Last seen 1 day ago
EMBL Heidelberg

I think there's a few different issues here. I've download the testis bam and you code ran without errors, so it's not not an inherent issue with your code or the tools.

Give this, I would verify that the bam files you've obtained have downloaded correctly. The EOF marker is absent error certainly seems to indicate that a download may have finished prematurely and you've ended up with either two or four (I'm not sure which) truncated files. It's a shame the DANIO-CODE site doesn't provide a hash you can verify against easily, but perhaps you can use the number of reads in a file to identify the corrupted ones. I certainly wouldn't continue with processing until I've resolved an error like this, what's the point if your data are probably damaged?

As for why you get no counts for the samples that did work, I think that's because the bam files number the chromosomes "1", "2", etc and the gene annotation uses "chr1", "chr2", etc. Hence it doesn't find matches and so there are no overlapping hits. To verify, here's the first 10 seqence "levels":

> seqlevels(bfl[[1]])[1:10]
 [1] "1"  "10" "11" "12" "13" "14" "15" "16" "17" "18"
> seqlevels(genes)[1:10]
 [1] "chr1"  "chr2"  "chr3"  "chr4"  "chr5"  "chr6"  "chr7"  "chr8"  "chr9"  "chr10"

You could update the the level in the annotation using something like:

new_levels <- stringr::str_replace_all(pattern = c("chr(Un_)?" = "", 
                                                   "v1" = ".1", 
                                                   "M" = "MT"), 
                                       string = seqlevels(genes))
genes <- renameSeqlevels(genes, new_levels)
ADD COMMENT
1
Entering edit mode

Thank you very much for looking into this and providing these recommendations.

To solve this problem, I downloaded the .bam files provided for the updated reference assembly (danRer11 vs. danRer10), and I did not run into any issues.

It is likely that the download went awry on the first go

Zebrafish genomics can be so difficult sometimes!

ADD REPLY
0
Entering edit mode

It might also be that you have unfortunately downloaded the file as 'text' rather than 'binary'.

ADD REPLY

Login before adding your answer.

Traffic: 629 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6