Hello everyone,
I am trying to batch-process approximately 1,000 BAM files. I read each file with the readGAlignmentsFromBam function from the GenomicAlignments package, considering a pre-defined set of regions, as in:
bamfile = "ENCFF001KLP.bam"
reads = readGAlignmentsFromBam(bamfile, param = NULL)
When the header of the BAM file is duplicated, as in the example at the end of this post, I get the following error when reading:
Error in seqlevels(seqinfo(x)) :
  error in evaluating the argument ‘x’ in selecting a method for function ‘seqlevels’: Error in .normargSeqlevels(seqnames) :
  supplied ‘seqlevels’ cannot contain duplicated sequence names
It is certainly possible to fix the files by hand, but this kind of header seems to be common: I have found several files with the same duplicated header (I am copying one example after the question). The files are publicly available to download from http://www.encodeproject.org.
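With about 1,000 files, it may help to find the affected ones up front instead of waiting for the R error. The following is only a sketch: it assumes the header is tab-delimited, as the SAM format requires, and the function name `duplicated_sq_names` is made up for illustration. The samtools loop is left commented out because it depends on your file layout.

```shell
# Read a SAM header on stdin and print every sequence name (SN) that
# appears in more than one @SQ line. Assumes tab-delimited header fields.
duplicated_sq_names() {
    awk -F'\t' '
        $1 == "@SQ" {
            for (i = 2; i <= NF; i++)
                if ($i ~ /^SN:/) count[substr($i, 4)]++
        }
        END { for (name in count) if (count[name] > 1) print name }
    ' | sort
}

# Hypothetical usage over a directory of BAM files (requires samtools):
# for bam in *.bam; do
#     if [ -n "$(samtools view -H "$bam" | duplicated_sq_names)" ]; then
#         echo "$bam has duplicated @SQ entries"
#     fi
# done
```

Files that print nothing here should not trigger the duplicated seqlevels error, at least not for this reason.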
My session info is:
R version 3.1.1 (2014-07-10)
Platform: x86_64-redhat-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 parallel grid stats graphics grDevices utils
[8] datasets methods base
other attached packages:
[1] Segvis_2.0 rbamtools_2.10.0 knitr_1.10.5
[4] GenomicAlignments_1.2.2 Rsamtools_1.18.3 Biostrings_2.34.1
[7] XVector_0.6.0 GenomicRanges_1.18.4 GenomeInfoDb_1.2.5
[10] IRanges_2.0.1 S4Vectors_0.4.0 BiocGenerics_0.12.1
[13] gridExtra_0.9.1 ggplot2_1.0.1 data.table_1.9.4
Thanks in advance, best
Rene
Header:
@SQ SN:chr1 LN:197195432 AS:mm9 SP:mouse
@SQ SN:chr2 LN:181748087 AS:mm9 SP:mouse
@SQ SN:chr3 LN:159599783 AS:mm9 SP:mouse
@SQ SN:chr4 LN:155630120 AS:mm9 SP:mouse
@SQ SN:chr5 LN:152537259 AS:mm9 SP:mouse
@SQ SN:chr6 LN:149517037 AS:mm9 SP:mouse
@SQ SN:chr7 LN:152524553 AS:mm9 SP:mouse
@SQ SN:chr8 LN:131738871 AS:mm9 SP:mouse
@SQ SN:chr9 LN:124076172 AS:mm9 SP:mouse
@SQ SN:chr10 LN:129993255 AS:mm9 SP:mouse
@SQ SN:chr11 LN:121843856 AS:mm9 SP:mouse
@SQ SN:chr12 LN:121257530 AS:mm9 SP:mouse
@SQ SN:chr13 LN:120284312 AS:mm9 SP:mouse
@SQ SN:chr14 LN:125194864 AS:mm9 SP:mouse
@SQ SN:chr15 LN:103494974 AS:mm9 SP:mouse
@SQ SN:chr16 LN:98319150 AS:mm9 SP:mouse
@SQ SN:chr17 LN:95272651 AS:mm9 SP:mouse
@SQ SN:chr18 LN:90772031 AS:mm9 SP:mouse
@SQ SN:chr19 LN:61342430 AS:mm9 SP:mouse
@SQ SN:chrX LN:166650296 AS:mm9 SP:mouse
@SQ SN:chrY LN:15902555 AS:mm9 SP:mouse
@HD VN:1.0 SO:unsorted
@SQ SN:chr1 LN:197195432
@SQ SN:chr2 LN:181748087
@SQ SN:chr3 LN:159599783
@SQ SN:chr4 LN:155630120
@SQ SN:chr5 LN:152537259
@SQ SN:chr6 LN:149517037
@SQ SN:chr7 LN:152524553
@SQ SN:chr8 LN:131738871
@SQ SN:chr9 LN:124076172
@SQ SN:chr10 LN:129993255
@SQ SN:chr11 LN:121843856
@SQ SN:chr12 LN:121257530
@SQ SN:chr13 LN:120284312
@SQ SN:chr14 LN:125194864
@SQ SN:chr15 LN:103494974
@SQ SN:chr16 LN:98319150
@SQ SN:chr17 LN:95272651
@SQ SN:chr18 LN:90772031
@SQ SN:chr19 LN:61342430
@SQ SN:chrX LN:166650296
@SQ SN:chrY LN:15902555
@PG ID:Bowtie VN:0.12.7 CL:"bowtie -t -a -q --solexa1.3-quals -p 64 -m 1 -n 2 -l 25 -e 3001 -S /mnt/thumper/solexa/genomes/bowtie_indexes/mm9 RenLab-MD16.fastq RenLab-MD16.sam"
This apparently hasn't been fixed yet; I still get an error for a file that has duplicated header entries. I am not sure whether duplicated header entries should raise an error, but loading this file worked fine with an older version of Bioconductor. These files were generated with the STAR aligner, converted to BAM with samtools 0.1.18, and sorted and indexed with samtools 1.2.
In the meantime, I have used the following commands to generate a unique header:
samtools sort and index need to be run afterwards as well, so I am not sure I should do this for all my files.
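For anyone hitting the same problem, here is one possible way to produce a header with unique entries. It is a sketch under assumptions, not necessarily the commands used above: the function name `dedup_sq_header` and the file names `in.bam` and `fixed.bam` are hypothetical. It keeps the first @SQ line for each sequence name without comparing lengths, and moves @HD to the top, since the SAM format expects @HD as the first line when present. Alignment records pass through untouched. Going through SAM text rather than swapping only the header is deliberate, because text records refer to references by name, not by position in the header.

```shell
# Filter a SAM stream (header plus records) on stdin:
#  - keep only the first @SQ line for each sequence name (SN)
#  - emit the @HD line, if any, before the rest of the header
#  - pass alignment records through unchanged
dedup_sq_header() {
    awk -F'\t' '
        function emit_header(   i) {
            if (flushed) return
            if (hd != "") print hd
            for (i = 1; i <= n; i++) print header[i]
            flushed = 1
        }
        /^@/ && !flushed {
            if ($1 == "@HD") { if (hd == "") hd = $0; next }
            if ($1 == "@SQ") {
                name = ""
                for (i = 2; i <= NF; i++)
                    if ($i ~ /^SN:/) name = substr($i, 4)
                if (seen[name]++) next   # drop repeated @SQ entries
            }
            header[++n] = $0
            next
        }
        { emit_header(); print }
        END { emit_header() }
    '
}

# Hypothetical usage (requires samtools; file names are placeholders):
# samtools view -h in.bam | dedup_sq_header | samtools view -b -o fixed.bam -
# samtools index fixed.bam
```

The record order is not changed by this filter, so a file that was coordinate-sorted may only need re-indexing, but it is worth checking that on one file before running it on all of them.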