Getting CDS positioning for IlluminaHumanv4 annotations
3
0
Entering edit mode
@sylvain-foisy-5539
Last seen 5.3 years ago
Canada

Hi,

Anyone anyone have a suggestion on how to know if a given probe in a list is either 5' UTR, coding or 3' UTR using data from illuminaHumanv4.db? 

Thanks in advance

Sylvain

illumina annotation • 1.4k views
ADD COMMENT
0
Entering edit mode
Mark Dunning ★ 1.1k
@mark-dunning-3319
Last seen 21 months ago
Sheffield, Uk

Hi Sylvain,

Not quite; the best you can do is to tell whether it is inronic, intergenic or coding

 

library(illuminaHumanv4.db)
prbs <- mappedkeys(illuminaHumanv4CODINGZONE)
table(unlist(mget(prbs, illuminaHumanv4CODINGZONE)))

     Intergenic        Intronic  Transcriptomic Transcriptomic? 
           2019            1057           37611            6290 

Hope this helps,

Mark

ADD COMMENT
0
Entering edit mode
@herve-pages-1542
Last seen 2 days ago
Seattle, WA, United States

Hi Sylvain,

Here is one way to go that makes use of the gene model stored in the TxDb object for hg19. It requires a little bit of coding. But first we need to get the genomic ranges of your probes. We'll use the CHRLOC and CHRLOCEND maps from the illuminaHumanv4.db package for that. Note that a given probe can be mapped to more than 1 genomic location:

library(GenomicRanges)
probesToGRanges <- function(chipdb, probe_ids)
{
  pkgname <- map_prefix <- get("packageName", chipdb@.xData)
  nc <- nchar(map_prefix)
  if (substr(map_prefix, nc - 2L, nc) == ".db")
    map_prefix <- substr(map_prefix, 1, nc - 3L)
  CHRLOC_map_name <- paste0(map_prefix, "CHRLOC")
  CHRLOCEND_map_name <- paste0(map_prefix, "CHRLOCEND")
  CHRLOC_map <- get(CHRLOC_map_name, envir=asNamespace(pkgname))
  CHRLOCEND_map <- get(CHRLOCEND_map_name, envir=asNamespace(pkgname))
  chrloc <- IntegerList(mget(probe_ids, CHRLOC_map))
  chrlocend <- IntegerList(mget(probe_ids, CHRLOCEND_map))
  chrloc_eltlens <- elementLengths(chrloc)
  chrlocend_eltlens <- elementLengths(chrlocend)
  ## Sanity check.
  stopifnot(identical(chrloc_eltlens, chrlocend_eltlens))
  chrloc <- unlist(chrloc, use.names=FALSE)
  chrlocend <- unlist(chrlocend, use.names=FALSE)
  probe_id <- rep.int(names(chrloc_eltlens), chrloc_eltlens)
  is_na <- is.na(chrloc)
  ## Sanity check.
  stopifnot(identical(is_na, is.na(chrlocend)))
  if (any(is_na)) {
    keep_idx <- which(!is_na)
    chrloc <- chrloc[keep_idx]
    chrlocend <- chrlocend[keep_idx]
    probe_id <- probe_id[keep_idx]
  }
  is_neg <- chrloc < 0L
  ## Sanity check.
  stopifnot(identical(is_neg, chrlocend < 0L))
  ans_seqnames <- names(chrloc)
  ans_ranges <- IRanges(abs(chrloc), abs(chrlocend))
  ans_strand <- strand(is_neg)
  GRanges(ans_seqnames, ans_ranges, ans_strand, probe_id)
}

library(illuminaHumanv4.db)
my_probes <- sample(keys(illuminaHumanv4.db), 10)
my_probe_ranges <- probesToGRanges(illuminaHumanv4.db, my_probes)
my_probe_ranges
# GRanges object with 13 ranges and 1 metadata column:
#       seqnames                 ranges strand   |     probe_id
#          <Rle>              <IRanges>  <Rle>   |  <character>
#   [1]        9 [128509617, 128729655]      +   | ILMN_1810100
#   [2]        9 [128510478, 128729655]      +   | ILMN_1810100
#   [3]        7 [ 40172342,  40174251]      -   | ILMN_1745119
#   [4]       10 [ 63166401,  63213208]      -   | ILMN_1809639
#   [5]        X [ 10413350,  10645779]      -   | ILMN_2371433
#   ...      ...                    ...    ... ...          ...
#   [9]        X [ 10437305,  10535643]      -   | ILMN_2371433
#  [10]        X [ 10473381,  10535643]      -   | ILMN_2371433
#  [11]        7 [140396481, 140406446]      +   | ILMN_2117330
#  [12]        8 [  2792875,   4852328]      -   | ILMN_1746945
#  [13]        2 [110300374, 110371783]      -   | ILMN_2258958
#  -------
#  seqinfo: 6 sequences from an unspecified genome; no seqlengths

Then get the CDS, 5' and 3' UTRs for hg19:

library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
cds <- cdsBy(txdb, by="tx")
five_utrs <- fiveUTRsByTranscript(txdb)
three_utrs <- threeUTRsByTranscript(txdb)

Finally find where your probes are landing:

seqlevels(my_probe_ranges) <- paste0("chr", seqlevels(my_probe_ranges))
seqinfo(my_probe_ranges) <- merge(seqinfo(my_probe_ranges), seqinfo(txdb))
mcols(my_probe_ranges)$is_coding <- my_probe_ranges %within% cds
mcols(my_probe_ranges)$is_in_5utr <- my_probe_ranges %within% five_utrs
mcols(my_probe_ranges)$is_in_3utr <- my_probe_ranges %within% three_utrs
my_probe_ranges
# GRanges object with 13 ranges and 4 metadata columns:
#        seqnames                 ranges strand   |     probe_id is_coding
#           <Rle>              <IRanges>  <Rle>   |  <character> <logical>
#    [1]     chr9 [128509617, 128729655]      +   | ILMN_1810100     FALSE
#    [2]     chr9 [128510478, 128729655]      +   | ILMN_1810100     FALSE
#    [3]     chr7 [ 40172342,  40174251]      -   | ILMN_1745119     FALSE
#    [4]    chr10 [ 63166401,  63213208]      -   | ILMN_1809639     FALSE
#    [5]     chrX [ 10413350,  10645779]      -   | ILMN_2371433     FALSE
#    ...      ...                    ...    ... ...          ...       ...
#    [9]     chrX [ 10437305,  10535643]      -   | ILMN_2371433     FALSE
#   [10]     chrX [ 10473381,  10535643]      -   | ILMN_2371433     FALSE
#   [11]     chr7 [140396481, 140406446]      +   | ILMN_2117330     FALSE
#   [12]     chr8 [  2792875,   4852328]      -   | ILMN_1746945     FALSE
#   [13]     chr2 [110300374, 110371783]      -   | ILMN_2258958     FALSE
#        is_in_5utr is_in_3utr
#         <logical>  <logical>
#    [1]      FALSE      FALSE
#    [2]      FALSE      FALSE
#    [3]      FALSE      FALSE
#    [4]      FALSE      FALSE
#    [5]      FALSE      FALSE
#    ...        ...        ...
#    [9]      FALSE      FALSE
#   [10]      FALSE      FALSE
#   [11]      FALSE      FALSE
#   [12]      FALSE      FALSE
#   [13]      FALSE      FALSE
#   -------
#   seqinfo: 93 sequences (1 circular) from hg19 genome

You can use %over% instead of %within%, in which case a TRUE is returned when
a probe range has an overlap with one of the ranges in the GRangesList on the
right (with %within% the probe range has to be fully contained in one of the
ranges on the right).

Cheers,
H.

ADD COMMENT
0
Entering edit mode

But let's step back a little bit. I didn't realize immediately that the CHRLOC and CHRLOCEND maps provide the locations of the *gene* that a probe id is mapped to (via the ENSEMBL map), and not the location of the probe itself (which is not a well defined concept anyway since a probe id is actually a probe set id and generally consists of more than 1 probe).

So going from probe ids to genomic ranges could actually be performed in a different way: first go from probe ids to Ensembl gene ids (via the ENSEMBL map), then use the TxDb object for hg19 (but this time the one obtained from the Ensembl Genes track):

library(GenomicFeatures)
txdb <- makeTranscriptDbFromUCSC("hg19", "ensGene")

probesToGRanges2 <- function(chipdb, txdb, probe_ids)
{
  pkgname <- map_prefix <- get("packageName", chipdb@.xData)
  nc <- nchar(map_prefix)
  if (substr(map_prefix, nc - 2L, nc) == ".db")
    map_prefix <- substr(map_prefix, 1, nc - 3L)
  ENSEMBL_map_name <- paste0(map_prefix, "ENSEMBL")
  ENSEMBL_map <- get(ENSEMBL_map_name, envir=asNamespace(pkgname))
  ensembl_id <- CharacterList(mget(probe_ids, ENSEMBL_map))
  ensembl_id_eltlens <- elementLengths(ensembl_id)
  ensembl_id <- unlist(ensembl_id, use.names=FALSE)
  probe_id <- rep.int(names(ensembl_id_eltlens), ensembl_id_eltlens)
  is_na <- is.na(ensembl_id)
  if (any(is_na)) {
    keep_idx <- which(!is_na)
    ensembl_id <- ensembl_id[keep_idx]
    probe_id <- probe_id[keep_idx]
  }
  genes <- unname(genes(txdb))
  m <- match(ensembl_id, mcols(genes)$gene_id)
  ## 'm' can contain NAs. This happens when some probe ids are mapped to
  ## Ensembl ids that are not represented in 'genes'. We drop them.
  is_na <- is.na(m)
  if (any(is_na)) {
    keep_idx <- which(!is_na)
    m <- m[keep_idx]
    ensembl_id <- ensembl_id[keep_idx]
    probe_id <- probe_id[keep_idx]
  }
  ans <- genes[m]
  mcols(ans)$probe_id <- probe_id
  ans
}

my_probe_ranges2 <- probesToGRanges2(illuminaHumanv4.db, txdb, my_probes)

Note that my_probe_ranges2 is similar to my_probe_ranges but not exactly the same. This could be due to different versions of Ensembl that were used to make the illuminaHumanv4.db and the Ensembl Genes track.

Anyway, the point is that since these ranges are gene ranges, it doesn't really make sense to find out where they land with respect to their CDS, 5' or 3' UTR regions.

Cheers,

H.

ADD REPLY
0
Entering edit mode

I think your original approach would have worked if you used the GENOMICLOCATION map, which provides the actual genomic location of the probes, rather than the borders of gene they map within.  Again, probes can map to more than one location if there are several equally good matches.

Also, for Illumina arrays it probably is fair to consider individual probes separately.  I don't think they follow the Affy paradigm of probes and probesets.

ADD REPLY
0
Entering edit mode
@sylvain-foisy-5539
Last seen 5.3 years ago
Canada

 

SHi to all,

 

 

Thanks for the pointers ;-) Like Mike said, using GENOMICLOCATION is a better way since it gives actual position of the probe on the genome; every once in a while, a probe will span two axons. And as Mike said, Illumina does not use the concept of probesets: a probe is a unique 50bp oligo on the bead. A few genes have 2 or 3 probes but they are usually scattered across the exons. 

Hervé: I'll try your suggestion with Mike's inputs ASAP and I'll provide the results ;-)

Thanks for the help!

S

ADD COMMENT
0
Entering edit mode

You're welcome. Hopefully probesToGRanges() code is not too hard to adapt to work with the GENOMICLOCATION map.

H.

PS: Please use the ADD COMMENT link so your comment appears like a comment and not like an answer to your own question.

ADD REPLY

Login before adding your answer.

Traffic: 499 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6