Question

problem with characters in chromosome names in DiffBind dba

0

Entering edit mode

Georg Otto ▴ 120

@georg-otto-6333

Last seen 6.4 years ago

United Kingdom

Dear all,

I have bed files with ChIP-seq data, with sequence (chromosome) names like this:

chr1, chr10, chr11, chr12, chr13, chr13_random, chr14, . . ., chrX

I import the bed file data into a DiffBind DBA object using the command

> dbaObj <- dba(sampleSheet = "samples.csv")

The problem is that the chromosome names in the DBA object apparently get stripped of their characters, so "chr1" becomes "1" and the chromosomes "X" and "Y" are missing altogether. More specifically, in

> dbaObj$chrmap
[1] "chr1"         "chr10"        "chr11"        "chr12"        "chr13"
[6] "chr14"        "chr15"        "chr16"        "chr17"        "chr18"
[11] "chr19"        "chr2"         "chr3"         "chr4"         "chr5"
[16] "chr6"         "chr7"         "chr8"         "chr9"         "chrUn_random"
[21] "chrX"         "chrY"

the chromosome names are intact, while in

> head(dbaObj$allvectors)
CHR   START     END     sample1     sample2     sample3     sample4
1   1 3435940 3436366 -1.00000000 0.01958029 0.04800027 0.07549102
2   1 3441807 3442009 -1.00000000 -1.00000000 0.01884463 -1.00000000
3   1 4408039 4408343 -1.00000000 -1.00000000 0.02658240 -1.00000000
4   1 4486341 4486691 -1.00000000 0.03415241 -1.00000000 -1.00000000
5   1 4561545 4562084 0.02217287 0.07207616 -1.00000000 0.02224536
6   1 4592472 4592914 0.03316482 0.06758506 -1.00000000 0.03507404

the chromosome names have changed, and chromosome X is missing.

Maybe I miss something here? I would be happy if someone could give me a hint how to fix this.
Best wishes,
Georg

> sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
[1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C
[3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8
[5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8
[7] LC_PAPER=en_GB.UTF-8       LC_NAME=C
[9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] grid stats4 parallel stats graphics grDevices utils
[8] datasets methods base

other attached packages:
[1] DiffBind_1.12.0         limma_3.22.1            ChIPpeakAnno_2.16.2
[4] AnnotationDbi_1.28.1    Biobase_2.26.0          RSQLite_1.0.0
[7] DBI_0.3.1               biomaRt_2.22.0          VennDiagram_1.6.9
[10] chipseq_1.16.0          ShortRead_1.24.0        GenomicAlignments_1.2.1
[13] Rsamtools_1.18.2        BiocParallel_1.0.0      BSgenome_1.34.0
[16] rtracklayer_1.26.2      Biostrings_2.34.0       XVector_0.6.0
[19] GenomicRanges_1.18.3    GenomeInfoDb_1.2.3      IRanges_2.0.0
[22] S4Vectors_0.4.0         BiocGenerics_0.12.1

loaded via a namespace (and not attached):
[1] amap_0.8-12            base64enc_0.1-2        BatchJobs_1.5
[4] BBmisc_1.8             bitops_1.0-6           brew_1.0-6
[7] caTools_1.17.1         checkmate_1.5.0        codetools_0.2-9
[10] compiler_3.1.1         digest_0.6.4           edgeR_3.8.3
[13] fail_1.2               foreach_1.4.2          gdata_2.13.3
[16] GenomicFeatures_1.18.2 GO.db_3.0.0            gplots_2.14.2
[19] gtools_3.4.1           hwriter_1.3.2          iterators_1.0.7
[22] KernSmooth_2.23-13     lattice_0.20-29        latticeExtra_0.6-26
[25] MASS_7.3-35            multtest_2.22.0        RColorBrewer_1.0-5
[28] RCurl_1.95-4.3         sendmailR_1.2-1        splines_3.1.1
[31] stringr_0.6.2          survival_2.37-7        tools_3.1.1
[34] XML_3.98-1.1           zlibbioc_1.12.0

DiffBind ChIP-Seq • 1.5k views

ADD COMMENT • link updated 10.4 years ago by Rory Stark ★ 5.2k • written 10.4 years ago by Georg Otto ▴ 120

score 2 · Accepted Answer · 2014-12-01

Hi Georg-

Directly accessing the internal fields, like dbaObj$allvectors, isn't really supported, but I can tell you what is going on and how to do what you want to do.

The officially supported way of getting the data in dbaObj$vectors is via the dba.peakset() function, with bRetrieve=TRUE. This will map the chromosome names back to what you expect to see.

The internal structures use factor numbers which make certain merging and sorting operations more efficient . As you have sees, the "real" chromosome labels are stored in the $chrmap field.You can map them as follows:

> chromosomes = dbaObj$chrmap[dbaObj$allvectors[,1]]

This is what dba.peakset() does internally when bRetrieve=TRUE.

Hope this helps-

Rory