Question

How to annotate gene names for genomic coordinates?

0

Entering edit mode

seeker • 0

@seeker-7773

Last seen 3.1 years ago

Netherlands

I want to annotate gene names for a segmented file by finding overlap with it with a reference file. Would like to know how can it be done using GenomicRanges Package or any other bioconductor packages.

Segmented_file

Chromosome	Start	End
1	5930000	11730000
1	16850000	18010000

reference_file

Chr	Start	End	Gene
1	5930500	6230500	SPSB1
1	6930500	7340500	SPSB2
1	16854500	16950000	TAS1R1
1	17810032	17910064	ENO1

Expected results

Chromosome	Start	End	Gene
1	5930000	11730000	SPSB1,SPSB2
1	16850000	18010000	TAS1R1,ENO1

annotation • 5.0k views

ADD COMMENT • link written 9.8 years ago by seeker • 0

score 1 · Answer 1 · 2015-06-24

1

Entering edit mode

James W. MacDonald 68k

@james-w-macdonald-5106

Last seen 4 days ago

United States

Here is one way:

> library(GenomicRanges)
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: ‘BiocGenerics’

The following objects are masked from ‘package:parallel’:

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following object is masked from ‘package:stats’:

    xtabs

The following objects are masked from ‘package:base’:

    anyDuplicated, append, as.data.frame, as.vector, cbind, colnames,
    do.call, duplicated, eval, evalq, Filter, Find, get, intersect,
    is.unsorted, lapply, Map, mapply, match, mget, order, paste, pmax,
    pmax.int, pmin, pmin.int, Position, rank, rbind, Reduce, rep.int,
    rownames, sapply, setdiff, sort, table, tapply, union, unique,
    unlist, unsplit

Loading required package: S4Vectors
Loading required package: stats4
Loading required package: IRanges
Loading required package: GenomeInfoDb
> segmented <- GRanges(c(1,1), IRanges(c(5930000, 16850000), c(11730000,18010000)))
> reference <- GRanges(rep(1,4), IRanges(c(5930500,6930500,16854500,17810032), c(6230500,7340500,16950000, 17910064)))
> mcols(reference)$Genes <- c("SPSB1","SPSB2","TAS1R1","ENO1")
> olap <- findOverlaps(segmented, reference)
> mcols(segmented)$Genes <- sapply(split(mcols(reference)$Genes, queryHits(olap)), paste, collapse = ",")
> segmented
GRanges object with 2 ranges and 1 metadata column:
      seqnames               ranges strand |       Genes
         <Rle>            <IRanges>  <Rle> | <character>
  [1]        1 [ 5930000, 11730000]      * | SPSB1,SPSB2
  [2]        1 [16850000, 18010000]      * | TAS1R1,ENO1
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

ADD COMMENT • link 9.8 years ago James W. MacDonald 68k

0

Entering edit mode

@James..when i tried this command on my dataset, getting an error message

Warning message:
In split.default(mcols(segment)$Gens, queryHits(olap)) :
data length is not a multiple of split variable

ADD REPLY • link 9.8 years ago seeker • 0

0

Entering edit mode

When you tried what command? You need to show your work if you expect anybody to help.

ADD REPLY • link 9.8 years ago James W. MacDonald 68k

0

Entering edit mode

@James..

library(GenomicRanges)

df1.Grange <- GRanges(seqnames= df1$Chromosome, IRanges(start=df1$Start, end =df1$End),ID= df1$Gene.symb) ref.Grange <- GRanges(seqnames= ref$Chromosome.Name, IRanges(start=ref$Gene.Start..bp., end = ref$Gene.End..bp.),ID= ref$Associated.Gene.Name.1)

olap <- findOverlaps(df1.Grange, ref.Grange)

mcols(df1.Grange)$M.Genes <- sapply(split(mcols(ref.Grange)$ID, queryHits(olap)), paste, collapse = ",")

Error message: 136 elements in value to replace 148 elements
In addition: Warning message:
In split.default(mcols(ref.Grange)$ID, queryHits(olap)) :
data length is not a multiple of split variable

ADD REPLY • link 9.8 years ago seeker • 0

1

Entering edit mode

Here is a slight modification of your original example:

> segmented <- GRanges(c(1,1,1), IRanges(c(5930000, 16850000, 17911000), c(11730000,18010000, 19010000)))
> reference <- GRanges(rep(1,4), IRanges(c(5930500,6930500,16854500,17810032), c(6230500,7340500,16950000, 17910064)))
> mcols(reference)$Genes <- c("SPSB1","SPSB2","TAS1R1","ENO1")
> olap <- findOverlaps(segmented, reference)
> olap
Hits object with 4 hits and 0 metadata columns:
      queryHits subjectHits
      <integer>   <integer>
  [1]         1           1
  [2]         1           2
  [3]         2           3
  [4]         2           4
  -------
  queryLength: 3
  subjectLength: 4
> mcols(segmented)$Genes <- sapply(split(mcols(reference)$Genes, queryHits(olap)), paste, collapse = ",")
Error in `[[<-`(`*tmp*`, name, value = c("SPSB1,SPSB2", "TAS1R1,ENO1")) :
  2 elements in value to replace 3 elements

So let's look a bit deeper to figure out what is going on. Let's run some of the code separately. The call to split() generates a list containing the genes that are within each range of segmented. Note here that there are only two list items, because there are no genes in the third range of segmented.

> split(mcols(reference)$Genes, queryHits(olap))
$`1`
[1] "SPSB1" "SPSB2"

$`2`
[1] "TAS1R1" "ENO1"

The naive code I gave you assumes that there will be one or more genes in each range of segmented, which fails because we collapse like this:

> sapply(split(mcols(reference)$Genes, queryHits(olap)), paste, collapse = ",")
            1             2
"SPSB1,SPSB2" "TAS1R1,ENO1"

And there are only two things here that we are trying to fit into a DataFrame that has three rows. So you either have to subset the segmented GRanges object to just those ranges that have genes, or you have to take the list of gene names, and 'bump it out' to have empty character vectors for those ranges in segmented that have no genes, and then collapse to a vector which will be the right length.

These are both basic data manipulation steps in R, so I will leave it to you to try to figure out how to do so. The best way to learn R is to have something concrete (and relatively simple) that you want to accomplish, and then figure out how to do it. So there you go.

ADD REPLY • link 9.8 years ago James W. MacDonald 68k

2

Entering edit mode

Hi Jim, Rajith,

Don't use split(). First turn Hits object olap into a list-like object parallel to segmented:

as(olap, "List")
# IntegerList of length 3
# [[1]] 1 2
# [[2]] 3 4
# [[3]] integer(0)

Then use extractList():

mcols(segmented)$Gene <- extractList(mcols(reference)$Gene,
                                     as(olap, "List"))
segmented
# GRanges object with 3 ranges and 1 metadata column:
#       seqnames               ranges strand |            Gene
#          <Rle>            <IRanges>  <Rle> | <CharacterList>
#   [1]        1 [ 5930000, 11730000]      * |     SPSB1,SPSB2
#   [2]        1 [16850000, 18010000]      * |     TAS1R1,ENO1
#   [3]        1 [17911000, 19010000]      * |                
#   -------
#   seqinfo: 1 sequence from an unspecified genome; no seqlengths

Hope this helps,

H.

ADD REPLY • link 9.8 years ago Hervé Pagès 16k

0

Entering edit mode

Hi Herve,

Yes indeed it was helpful. But the class of segmented$Gene in CharacterList, which makes it difficult to do the analysis. So i converted. Hope this is the correct way.

segmented$Gene <- unlist(lapply(segment$Gene , function(m) {paste(m,collapse = ",")}))

ADD REPLY • link 9.8 years ago seeker • 0

1

Entering edit mode

or use unstrsplit():

unstrsplit(mcols(segmented)$Gene, sep=",")

It's equivalent but much faster than sapply(..., paste, collapse=",") (the difference will be noticeable if the CharacterList object contains tens of thousands of character vectors).

Cheers,

H.

ADD REPLY • link 9.8 years ago Hervé Pagès 16k

0

Entering edit mode

Nice! Thanks Herve.

ADD REPLY • link 9.8 years ago James W. MacDonald 68k

score 0 · Answer 2 · 2015-06-24

Maybe you want have a try ChIPpeakAnno v3.2.2

library(ChIPpeakAnno)
segmented <- GRanges(c(1,1), IRanges(c(5930000, 16850000), c(11730000,18010000)))
reference <- GRanges(rep(1,4), IRanges(c(5930500,6930500,16854500,17810032), c(6230500,7340500,16950000, 17910064), names=c("SPSB1", "SPSB2", "TAS1R1", "ENO1")))
res <- annotatePeakInBatch(segmented, AnnotationData=reference, output="overlapping") 
res