fail to annotate chromosome data to single cell object
1
0
Entering edit mode
@fc71d99c
Last seen 15 months ago
United Kingdom

I am trying to annotate my genes with ensemble and

ah = AnnotationHub()
ens.hs.107<- query(ah, c("Homo sapiens", "EnsDb", 107))[[1]] 

genes <- rowData(sce)$ID
head(genes)

gene_annot <- AnnotationDbi::select(ens.hs.107, 
                                    keys = genes,
                                    keytype = "GENEID",
                                    columns = c("GENEID", "SEQNAME")) %>%
  set_names(c("ID", "Chromosome"))

head(gene_annot)

rowData(sce) <- merge(rowData(sce), gene_annot, by.= "ID", sort=FALSE)
Error in .local(x, ..., value = value) : 
  26554 rows in value to replace 26664rows
##  genes found in genne_annot  have fewer GENEIDs than genes in sce object
rownames(rowData(sce)) <- rowData(sce)$ID

I have tried to use by.x and all.x but not sure its correct as I don't know why I have fewer geneids

 by.x, all.x= T
#and
 by.x,

Thank you!

error SingleCell annotate • 1.5k views
ADD COMMENT
0
Entering edit mode
@james-w-macdonald-5106
Last seen 2 days ago
United States

I know nothing of the tidyverse, so can't help with how you are doing things with that magrittr pipe. But it's easy enough using base R functions. Also note that you don't have to ask for the GENEID to be returned as a column when using select, as that happens by default.

gene_annot <- select(ens.hs.107, rowData(sce)$ID, "SEQNAME", "GENEID")
rd <- rowData(sce)
rd$Chr <- gene_annot[match(rd$ID, gene_annot$GENEID),"SEQNAME"]
rowData(sce) <- rd
ADD COMMENT
0
Entering edit mode

Thank you!

I have retrieved chromosomes but still have the same problem with the length of ensbl ID and gene ID

> dim(gene_annot)
[1] 26634     2
> dim(sce)
[1] 26664 65897
ADD REPLY
0
Entering edit mode

What do you get from

all(rowData(sce)$ID %in% keys(ens.hs.107, "GENEID"))

You might have mismatched versions.

ADD REPLY
0
Entering edit mode

Or it might be PAR genes.

> gns <- genes(ens.hs.107)
Warning message:
In valid.GenomicRanges.seqinfo(x, suggest.trim = TRUE) :
  GRanges object contains 1 out-of-bound range located on sequence
  LRG_432. Note that ranges located on a sequence whose length is unknown
  (NA) or on a circular sequence are not considered out-of-bound (use
  seqlengths() and isCircular() to get the lengths and circularity flags
  of the underlying sequences). You can use trim() to trim these ranges.
  See ?`trim,GenomicRanges-method` for more information.

> ypar <- GRanges(c("Y","Y"), IRanges(c(1,56887903),c(10000,57217415)))
> subsetByOverlaps(gns, ypar)
GRanges object with 0 ranges and 9 metadata columns:
   seqnames    ranges strand |     gene_id   gene_name gene_biotype
      <Rle> <IRanges>  <Rle> | <character> <character>  <character>
   seq_coord_system description gene_id_version canonical_transcript
        <character> <character>     <character>          <character>
        symbol entrezid
   <character>   <list>
  -------
  seqinfo: 457 sequences (1 circular) from GRCh38 genome

I know that Ensembl hard masks the Y PAR region in their genome FASTA files, and maybe they do the same for the genes that occur in those regions? UCSC doesn't appear to.

> library(TxDb.Hsapiens.UCSC.hg38.refGene)
> gns2 <- genes(TxDb.Hsapiens.UCSC.hg38.refGene, single.strand.genes.only = FALSE)
> ypar2 <- ypar
> seqlevelsStyle(ypar2) <- "UCSC"
> subsetByOverlaps(gns2, ypar2)
GRangesList object of length 5:
$`100128260`
GRanges object with 2 ranges and 0 metadata columns:
      seqnames              ranges strand
         <Rle>           <IRanges>  <Rle>
  [1]     chrX 156014564-156016830      -
  [2]     chrY   57201084-57203350      -
  -------
  seqinfo: 640 sequences (1 circular) from hg38 genome

$`10251`
GRanges object with 2 ranges and 0 metadata columns:
      seqnames              ranges strand
         <Rle>           <IRanges>  <Rle>
  [1]     chrX 155612565-155782457      +
  [2]     chrY   56954255-56968977      +
  -------
  seqinfo: 640 sequences (1 circular) from hg38 genome

$`3581`
GRanges object with 2 ranges and 0 metadata columns:
      seqnames              ranges strand
         <Rle>           <IRanges>  <Rle>
  [1]     chrX 155997696-156010817      +
  [2]     chrY   57184216-57197337      +
  -------
  seqinfo: 640 sequences (1 circular) from hg38 genome

$`6845`
GRanges object with 2 ranges and 0 metadata columns:
      seqnames              ranges strand
         <Rle>           <IRanges>  <Rle>
  [1]     chrX 155881345-155943769      +
  [2]     chrY   57067865-57130289      +
  -------
  seqinfo: 640 sequences (1 circular) from hg38 genome

$`727856`
GRanges object with 2 ranges and 0 metadata columns:
      seqnames              ranges strand
         <Rle>           <IRanges>  <Rle>
  [1]     chrX 156025658-156028183      -
  [2]     chrY   57212178-57214703      -
  -------
  seqinfo: 640 sequences (1 circular) from hg38 genome
ADD REPLY
0
Entering edit mode

It gives

False

I have found cell ranger used GRCh38-2020-A for mapping so I am using ensbl 101 now. I have 30 GENEIDs that don't match, maybe is the Y PAR genes but just wonder why more people don't have the same issue. Not sure how to get around it and/or whether this will have an effect on the downstream analysis.

ADD REPLY
0
Entering edit mode

I think people do have the same issue. I often get data from the sequencing core that has already been aligned to whatever version of the Ensembl genome they are using, and I then have to iterate through the AnnotationHub, testing for consistency in available Ensembl Gene IDs from a given Ensembl version and the data I have in hand until I can find the version that matches.

ADD REPLY
0
Entering edit mode

Thank you!

I have been looking further into it and it looks like it's due to PAR as CellRanger uses GENCODE v32. The IDs are version controlled so I cannot map directly using annotationhub will try using UCS.

ADD REPLY
0
Entering edit mode

Hi, I have exactly the same problem here. May I ask how did you solve this at the end? I have around 200 genes that cannot be annotated by ens.hs.107 and I downloaded the reference (human (GRCh38)) from cellranger directly. Thank you!

ADD REPLY

Login before adding your answer.

Traffic: 482 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6