Issues with BSgenome.Mmusculus.UCSC.mm39 and generating motif matrix in Signac ("trying to load regions beyond the boundaries of non-circular sequence "chr1")
1
0
Entering edit mode
Paul • 0
@b9ff1a03
Last seen 1 hour ago
United States

Hello!

I am trying to generate a motif matrix for my ATAC-Seq data using the mm39 genome for annotations in the Signac R package. My code is below, and using this code, I consistently run into this error message: trying to load regions beyond the boundaries of non-circular sequence "chr1"


annotations <- GetGRangesFromEnsDb(ensdb = EnsDb.Mmusculus.v79)
seqlevelsStyle(annotations) <- 'UCSC'
genome(annotations) <- 'mm39'

RNA_integrated <- AddMotifs(
  object = RNA_integrated,
  genome = BSgenome.Mmusculus.UCSC.mm39,
  pfm = pfm
)

The strange thing is that when I revert back to the mm10 mouse genome, I can generate a motif matrix without a single hitch, so I am wondering if there is something I need to change in my code to accommodate for the mm39 genome? Any and all help would be greatly appreciated!

genomes BSgenome.Mmusculus.UCSC.mm39 • 104 views
ADD COMMENT
1
Entering edit mode
@james-w-macdonald-5106
Last seen 5 minutes ago
United States

The EnsDb.Mmusculus.v79 package is mm38, not mm39. If you want a current EnsDb, use AnnotationHub

> library(AnnotationHub)
> hub <- AnnotationHub()
snapshotDate(): 2024-10-28
> z <- query(hub, c("mus musculus","ensdb"))
> z
AnnotationHub with 87 records
# snapshotDate(): 2024-10-28
# $dataprovider: Ensembl
# $species: Mus musculus, Mus musculus musculus, Mus musculus domesticus, Mus musculus castaneus
# $rdataclass: EnsDb
# additional mcols(): taxonomyid, genome, description, coordinate_1_based, maintainer, rdatadateadded,
#   preparerclass, tags, rdatapath, sourceurl, sourcetype 
# retrieve records with, e.g., 'object[["AH53222"]]' 

             title                                        
  AH53222  | Ensembl 87 EnsDb for Mus Musculus            
  AH53726  | Ensembl 88 EnsDb for Mus Musculus            
  AH56691  | Ensembl 89 EnsDb for Mus Musculus            
  AH57770  | Ensembl 90 EnsDb for Mus Musculus            
  AH60788  | Ensembl 91 EnsDb for Mus Musculus            
  ...        ...                                          
  AH116906 | Ensembl 112 EnsDb for Mus musculus           
  AH116907 | Ensembl 112 EnsDb for Mus musculus musculus  
  AH116908 | Ensembl 112 EnsDb for Mus musculus domesticus
  AH116909 | Ensembl 112 EnsDb for Mus musculus           
  AH119358 | Ensembl 113 EnsDb for Mus musculus           

> subset(as(mcols(z), "data.frame")[,c("title","genome")], genome %in% paste0("GRCm", 38:39))
                                      title genome
AH53222   Ensembl 87 EnsDb for Mus Musculus GRCm38
AH53726   Ensembl 88 EnsDb for Mus Musculus GRCm38
AH56691   Ensembl 89 EnsDb for Mus Musculus GRCm38
AH57770   Ensembl 90 EnsDb for Mus Musculus GRCm38
AH60788   Ensembl 91 EnsDb for Mus Musculus GRCm38
AH60992   Ensembl 92 EnsDb for Mus Musculus GRCm38
AH64461   Ensembl 93 EnsDb for Mus Musculus GRCm38
AH64944   Ensembl 94 EnsDb for Mus musculus GRCm38
AH67971   Ensembl 95 EnsDb for Mus musculus GRCm38
AH69210   Ensembl 96 EnsDb for Mus musculus GRCm38
AH73905   Ensembl 97 EnsDb for Mus musculus GRCm38
AH75036   Ensembl 98 EnsDb for Mus musculus GRCm38
AH78811   Ensembl 99 EnsDb for Mus musculus GRCm38
AH79718  Ensembl 100 EnsDb for Mus musculus GRCm38
AH83247  Ensembl 101 EnsDb for Mus musculus GRCm38
AH89211  Ensembl 102 EnsDb for Mus musculus GRCm38
AH89457  Ensembl 103 EnsDb for Mus musculus GRCm39
AH95775  Ensembl 104 EnsDb for Mus musculus GRCm39
AH98078  Ensembl 105 EnsDb for Mus musculus GRCm39
AH100674 Ensembl 106 EnsDb for Mus musculus GRCm39
AH104895 Ensembl 107 EnsDb for Mus musculus GRCm39
AH109367 Ensembl 108 EnsDb for Mus musculus GRCm39
AH109655 Ensembl 109 EnsDb for Mus musculus GRCm39
AH113713 Ensembl 110 EnsDb for Mus musculus GRCm39
AH116340 Ensembl 111 EnsDb for Mus musculus GRCm39
AH116909 Ensembl 112 EnsDb for Mus musculus GRCm39
AH119358 Ensembl 113 EnsDb for Mus musculus GRCm39

You can then use one of those GRCm39 versions by doing ensdb <- hub[["AH119358"]], if e.g., you want the most recent one. There are other strain specific ones as well, but for brevity I am just showing the 'regular' ones.

0
Entering edit mode

Unfortunately I've updated my code with the newest GRCm39 genome, and yet I am still getting the same error.... Is there anything else I can try to fix this error other than just trying all of the GRCm39 versions one by one?

ADD REPLY
1
Entering edit mode

You will probably have to ask the Signac people. I don't know exactly what is going on under the hood for AddMotif, but the EnsDb seems OK to me.

> ensdb <- hub[["AH119358"]]
> z <- GetGRangesFromEnsDb(ensdb)
> seqlevelsStyle(z) <- "UCSC"
> library(BSgenome.Mmusculus.UCSC.mm39)
> zz <- getSeq(Mmusculus, subset(z, seqnames(z) %in% "chr1" & strand(z) != "*"))
> zz
DNAStringSet object of length 92162:
        width seq                                                                              names               
    [1]  1417 AAACATTGGAACTTACCAAATTTTACTGGAGATGAATTG...TAACAAACTAGACACGTAAATTAATCTCTGCCACTCCA ENSMUSE00000866652
    [2]   795 AAACATTGGAACTTACCAAATTTTACTGGAGATGAATTG...ATTTCTATTTTTCAAATAAAGGTGAAAATGAAGTTGTC ENSMUSE00000867897
    [3]  2194 TTAGTTAAGAGCACTGACTGCTCTTGCAAAGGACCCAGG...GCCAAACAATCCAACAGGATCCTCCACTCTTCTTTCAT ENSMUSE00000863980
    [4]  2736 GCACACTACGGTCCATCTCCAACAACCGCAGTGTTGCCA...TGAAGAGAATCACAATAATTTGGGCAGATACTTTGCAG ENSMUSE00000858910
    [5]  2487 GTTTCACAGCAGCAGCCTCCCTTGTGTCCTTGGCTTGGG...GGTGCTTTTTGTTGCATTAAAAGTGCCGGTCAAACTTT ENSMUSE00000448840
    ...   ... ...
[92158]   162 AGTTACCTGAGTGGAACTTTGATGGCTCTAGTACCTTTC...GCTATGTGAAGTTTTCAAGTATAACCGGAAACCTGCAG ENSMUST00000381748
[92159]   147 AGACCAACTTGAGGCACATCTGTAAACGGATAATGGACA...ATTTGGTTGGCCTTCCAATGGCTTCCCTGGACCCCAAG ENSMUST00000381748
[92160]   128 GCCCGTATTACTGCGGTGTGGGAGCAGACAAGGCCTACG...AGATTACGGGGACAAATGCGGAGGTTATGCCTGCCCAG ENSMUST00000381748
[92161]   200 TGGGAATTCCAGATAGGACCCTGTGAGGGGATCCGAATG...TTCAGCACCAAGGCCATGCGGGAGGAGAATGGTCTGAA ENSMUST00000381748
[92162]   319 GTGCATTGAGGAGGCCATTGACAAACTGAGCAAGAGGCA...ACGAAACAGGCGACGAACCCTTCCAATACAAGAACTAA ENSMUST00000381748

Presumably what they are doing is extracting the sequences, which appears to work (there's nothing off the end of chr1), so I don't think it's an issue with any Bioconductor packages/functions.

ADD REPLY
0
Entering edit mode

This is the traceback I got after running AddMotifs if this is informative?

23: stop(wmsg("trying to load regions beyond the boundaries ", "of non-circular sequence \"", 
        seqname, "\""))
22: loadFUN(x, seqname, ranges)
21: loadFUN(x, seqname, ranges)
20: loadSubseqsFromStrandedSequence(x@single_sequences, seqname, 
        ranges(gr), strand(gr), is_circular)
19: FUN(X[[i]], ...)
18: lapply(seq_len(length(grl)), function(i) {
        gr <- grl[[i]]
        if (length(gr) == 0L) 
            return(DNAStringSet())
        seqlevel <- grl_seqlevels[i]
        is_circular <- isCircular(x)[[seqlevel]]
        idx <- match(seqlevel, x@user_seqnames)
        if (is.na(idx)) 
            stop("invalid sequence name: ", seqlevel)
        seqname <- names(x@user_seqnames)[[idx]]
        if (is.null(snplocs(x, seqname))) {
            subject <- try(get(seqname, envir = x@.seqs_cache, inherits = FALSE), 
                silent = TRUE)
            if (is(subject, "try-error")) {
                ans <- loadSubseqsFromStrandedSequence(x@single_sequences, 
                    seqname, ranges(gr), strand(gr), is_circular)
                return(ans)
            }
            .linkToCachedObject(subject) <- .newLinkToCachedObject(seqname, 
                x@.seqs_cache, x@.link_counts)
     ...
17: lapply(seq_len(length(grl)), function(i) {
        gr <- grl[[i]]
        if (length(gr) == 0L) 
            return(DNAStringSet())
        seqlevel <- grl_seqlevels[i]
        is_circular <- isCircular(x)[[seqlevel]]
        idx <- match(seqlevel, x@user_seqnames)
        if (is.na(idx)) 
            stop("invalid sequence name: ", seqlevel)
        seqname <- names(x@user_seqnames)[[idx]]
        if (is.null(snplocs(x, seqname))) {
            subject <- try(get(seqname, envir = x@.seqs_cache, inherits = FALSE), 
                silent = TRUE)
            if (is(subject, "try-error")) {
                ans <- loadSubseqsFromStrandedSequence(x@single_sequences, 
                    seqname, ranges(gr), strand(gr), is_circular)
                return(ans)
            }
            .linkToCachedObject(subject) <- .newLinkToCachedObject(seqname, 
                x@.seqs_cache, x@.link_counts)
     ...
16: .extractFromBSgenomeSingleSequences(x, sseq_args$names, sseq_args$start, 
        sseq_args$end, sseq_args$width, sseq_args$strand)
15: .local(x, ...)
14: getSeq(genome, subject)
13: getSeq(genome, subject)
12: .local(pwms, subject, ...)
11: matchMotifs(pwms_list, subject, ...)
10: matchMotifs(pwms_list, subject, ...)
9: motifmatchr::matchMotifs(pwms = pwm, subject = features, genome = genome, 
       out = "scores", ...)
8: motifmatchr::matchMotifs(pwms = pwm, subject = features, genome = genome, 
       out = "scores", ...)
7: CreateMotifMatrix(features = object, pwm = pfm, genome = genome, 
       use.counts = FALSE)
6: AddMotifs.default(object = granges(x = object), genome = genome, 
       pfm = pfm, verbose = verbose)
5: AddMotifs(object = granges(x = object), genome = genome, pfm = pfm, 
       verbose = verbose)
4: AddMotifs.ChromatinAssay(object = object[[assay]], genome = genome, 
       pfm = pfm, verbose = verbose)
3: AddMotifs(object = object[[assay]], genome = genome, pfm = pfm, 
       verbose = verbose)
2: AddMotifs.Seurat(object = RNA_integrated, genome = BSgenome.Mmusculus.UCSC.mm39, 
       pfm = pfm)
1: AddMotifs(object = RNA_integrated, genome = BSgenome.Mmusculus.UCSC.mm39, 
       pfm = pfm)
ADD REPLY

Login before adding your answer.

Traffic: 792 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6