Error when reading bam file with duplicated header
1
2
Entering edit mode
Rene ▴ 20
@rene-7733
Last seen 9.0 years ago
United States

Hello everyone,

I am trying to mass process approximately 1k bam files. I am reading each file by using the readGAlignmentsFromBam function from the GenomicAlignments package and considering a pre-defined set of regions as in :


bamfile = “ENCFF001KLP.bam” 

reads = readGAlignmentsFromBam(bamfile , param = NULL)

​ When the header of the bam file is duplicated as below, I am getting the following error when reading: Error in seqlevels(seqinfo(x)) : error in evaluating the argument ‘x’ in selecting a method for function ‘seqlevels’: Error in .normargSeqlevels(seqnames) : supplied ‘seqlevels’ cannot contain duplicated sequence names

Certainly is possible to fix the files by hand but it seems that this is kind of headers is common, since I have found several files with the same header (I am copying one example after the question) . The files publicly available to download from http://www.encodeproject.org.

My session info is:


R version 3.1.1 (2014-07-10)
Platform: x86_64-redhat-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
 [1] stats4    parallel  grid      stats     graphics  grDevices utils
 [8] datasets  methods   base

other attached packages:
 [1] Segvis_2.0              rbamtools_2.10.0        knitr_1.10.5
 [4] GenomicAlignments_1.2.2 Rsamtools_1.18.3        Biostrings_2.34.1
 [7] XVector_0.6.0           GenomicRanges_1.18.4    GenomeInfoDb_1.2.5
[10] IRanges_2.0.1           S4Vectors_0.4.0         BiocGenerics_0.12.1
[13] gridExtra_0.9.1         ggplot2_1.0.1           data.table_1.9.4

​

Thanks in advance, best

Rene

Header:

@SQ SN:chr1 LN:197195432 AS:mm9 SP:mouse
@SQ SN:chr2 LN:181748087 AS:mm9 SP:mouse
@SQ SN:chr3 LN:159599783 AS:mm9 SP:mouse
@SQ SN:chr4 LN:155630120 AS:mm9 SP:mouse
@SQ SN:chr5 LN:152537259 AS:mm9 SP:mouse
@SQ SN:chr6 LN:149517037 AS:mm9 SP:mouse
@SQ SN:chr7 LN:152524553 AS:mm9 SP:mouse
@SQ SN:chr8 LN:131738871 AS:mm9 SP:mouse
@SQ SN:chr9 LN:124076172 AS:mm9 SP:mouse
@SQ SN:chr10 LN:129993255 AS:mm9 SP:mouse
@SQ SN:chr11 LN:121843856 AS:mm9 SP:mouse
@SQ SN:chr12 LN:121257530 AS:mm9 SP:mouse
@SQ SN:chr13 LN:120284312 AS:mm9 SP:mouse
@SQ SN:chr14 LN:125194864 AS:mm9 SP:mouse
@SQ SN:chr15 LN:103494974 AS:mm9 SP:mouse
@SQ SN:chr16 LN:98319150 AS:mm9 SP:mouse
@SQ SN:chr17 LN:95272651 AS:mm9 SP:mouse
@SQ SN:chr18 LN:90772031 AS:mm9 SP:mouse
@SQ SN:chr19 LN:61342430 AS:mm9 SP:mouse
@SQ SN:chrX LN:166650296 AS:mm9 SP:mouse
@SQ SN:chrY LN:15902555 AS:mm9 SP:mouse
@HD VN:1.0 SO:unsorted
@SQ SN:chr1 LN:197195432
@SQ SN:chr2 LN:181748087
@SQ SN:chr3 LN:159599783
@SQ SN:chr4 LN:155630120
@SQ SN:chr5 LN:152537259
@SQ SN:chr6 LN:149517037
@SQ SN:chr7 LN:152524553
@SQ SN:chr8 LN:131738871
@SQ SN:chr9 LN:124076172
@SQ SN:chr10 LN:129993255
@SQ SN:chr11 LN:121843856
@SQ SN:chr12 LN:121257530
@SQ SN:chr13 LN:120284312
@SQ SN:chr14 LN:125194864
@SQ SN:chr15 LN:103494974
@SQ SN:chr16 LN:98319150
@SQ SN:chr17 LN:95272651
@SQ SN:chr18 LN:90772031
@SQ SN:chr19 LN:61342430
@SQ SN:chrX LN:166650296
@SQ SN:chrY LN:15902555

@PG ID:Bowtie VN:0.12.7 CL:”bowtie -t -a -q —solexa1.3-quals -p 64 -m 1 -n 2 -l 25 -e 3001 -S /mnt/thumper/solexa/genomes/bowtie_indexes/mm9 RenLab-MD16.fastq RenLab-MD16.\
sam”

rsamtools genomicalignments • 4.4k views
ADD COMMENT
1
Entering edit mode

This apparently hasn't been fixed yet, I still get:

reads = readGAlignmentsFromBam('test.fastq.gz.clipped_M.bam.sorted.bam' 

, param = NULL)
Error in GAlignments(seqnames = bamcols$rname, pos = bamcols$pos, 

cigar = bamcols$cigar,  :
  'seqlengths' must be an integer vector with unique names
In addition: Warning messages:
1: 'readGAlignmentsFromBam' is deprecated.
Use 'readGAlignments' instead.
See help("Deprecated")
2: In `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels) else paste0(labels,  :
  duplicated levels in factors are deprecated

For a file that has duplicated header entries. I am not sure if duplicated header entries should raise an error, but loading this file has worked fine with an older version of bioconductor. These files were generated using STAR aligner converted bam by samtools 0.1.18, and sorted and indexed by samtools 1.2.

In the meantime, I have used the following commands to generate a unique header:

samtools view -H test.bam | sort -u > unique.header.sam

samtools reheader unique.header.sam test.bam > testuniqueheader.bam

samtools sort  and index needs to be run as well, so I am not sure I should do this for all my files.

 


sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: CentOS release 6.7 (Final)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base     

other attached packages:
[1] GenomicAlignments_1.4.1 Rsamtools_1.20.4        Biostrings_2.36.4      
[4] XVector_0.8.0           GenomicRanges_1.20.6    GenomeInfoDb_1.4.2     
[7] IRanges_2.2.7           S4Vectors_0.6.5         BiocGenerics_0.14.0    

loaded via a namespace (and not attached):
[1] zlibbioc_1.14.0      futile.logger_1.4.1  lambda.r_1.1.7      
[4] futile.options_1.0.0 BiocParallel_1.2.21  bitops_1.0-6    

 

 


 

ADD REPLY
1
Entering edit mode
@martin-morgan-1513
Last seen 5 months ago
United States

Your version of R / Bioc is a bit old, but this

setMethod("seqinfo", "BamFile", function (x) {
    h <- scanBamHeader(x, what = "targets")[["targets"]]
    h <- h[!duplicated(h)]
    Seqinfo(names(h), unname(h))
})

might work; I'll introduce a fix into the development and release versions of Rsamtools later today / tomorrow.

Personally I think this is a problem with the Encode files, and should be fixed up-stream; to manually fix the files I guess the magic is to edit samtools view -H <file.bam> and then samtools reheader <edited.header.txt> <file.bam> > file.rehead.bam.

 

ADD COMMENT
0
Entering edit mode

The workaround doesn't work for me unfortunately.
 

ADD REPLY
0
Entering edit mode
I guess you meant:

h <- h[!duplicated(names(h))]

Seqinfo(names(h), h)

But this still raises the same error.

ADD REPLY
1
Entering edit mode

Sorry, the original fix addressed reading the headers using scanBamHeader(); there were additional problems in Rsamtools and GenomicAlignments; these should be fixed in  GenomicAlignments / Rsamtools 1.4.2 / 1.20.5 (release) and 1.5.18 / 1.21.19 (devel).

ADD REPLY

Login before adding your answer.

Traffic: 570 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6