Entering edit mode
Dear Guangchuang,
Thank you very much for the feedback! Any feedbacks and responses sent
to bioconductor@r-project.org are archived and web searchable. I
would appreciate if you could send email to Bioconductor list for
feedbacks forward.
If you type ?annotatePeakInBatch, you will notice that
FeatureLocForDistance is set to TSS by default, meaning that using
start of feature when feature is on plus strand and using end of
feature when feature is on minus strand. I think this is consistent
with your expectation.
The discrepancy described by you is due to the different annotation
source used. You used transcript coordinates instead of gene
coordinates stored in TSS.human.NCBI36. Another difference is that you
used end(peak) to calculate distance while annotatePeakInBatch uses
start(peak) by default. As a user, you can set PeakLocForDistance as
start, end or middle of the peak. However, the result will be
different. Please type ?annotatePeakInBatch to get the details to
customize your parameter setting to fit your needs. Also annotation
from different databases of different versions will be slightly
different as you already pointed out. You can use getAnnotation
function in ChIPpeakAnno to obtain the most recent Ensemble
annotation. Alternatively, you could download annotation from UCSC
genome browser and convert it to RangedData as annotation source. You
also can download transcript coordinates to feed into
annotatePeakInBatch to get the nearest transcript instead of gene.
To test whether there exists this bug in the annotatePeakInBatch , let
us use your example with a simple annotation consisting only two
genes with the exact coordinates provided by you instead of using the
built-in annotation. Looks like you are interested in peak end to TSS,
so I will set PeakLocForDistance = "end" here.
You can see that the results obtained, by customizing the parameters
in annoatePeakInBatch, is consistent with what you get.
require(ChIPpeakAnno)
packageVersion("ChIPpeakAnno")
peak <- RangedData(space="chr1", IRanges(24736757, 24737528))
TSS <- RangedData(space="chr1", IRanges(start=c(24742284
,24683489,24683489,24683489,24695211), end = c( 24799466,
24700300,24740262, 24741587,24718169), names=c("NIPAL3",
"STPG1-1","STPG1-2","STPG1-3","STPG1-4")), strand=c("+","-", "-",
"-","-"))
TSS
RangedData with 5 rows and 1 value column across 1 space
space ranges | strand
<factor> <iranges> | <character>
NIPAL3 chr1 [24742284, 24799466] | +
STPG1-1 chr1 [24683489, 24700300] | -
STPG1-2 chr1 [24683489, 24740262] | -
STPG1-3 chr1 [24683489, 24741587] | -
STPG1-4 chr1 [24695211, 24718169] | -
ap <- annotatePeakInBatch(peak, Annotation=TSS,
PeakLocForDistance="end")
ap
RangedData with 1 row and 9 value columns across 1 space
space ranges | peak strand
feature start_position end_position insideFeature distancetoFeature
shortestDistance
<factor> <iranges> | <character> <character>
<character> <numeric> <numeric> <character>
<numeric> <numeric>
1 STPG1-2 chr1 [24736757, 24737528] | 1 -
STPG1-2 24683489 24740262 inside 2734
2734
fromOverlappingOrNearest
<character>
1 STPG1-2 NearestStart
Hope this clears things for you. Please do not hesitate to send
feedbacks to Bioconductor list if you have additional questions or
suggestions.
Many thanks for the detailed examples!
Best regards,
Julie
On 1/14/14 4:01 AM, "Yu, Guangchuang" <gcyu@connect.hku.hk> wrote:
Dear Dr. Zhu,
I found a bug of your package, please refer to
http://ygc.name/2014/01/14/r-package-chippeakanno-sucks/ for details.
Hope that it will solve in next release.
Best Regards,
Guangchuang
I used R package ChIPpeakAnno for annotating peaks, and found that it
handle the DNA strand in the wrong way. Maybe the developers were from
the computer science but not biology background.
?
> require(ChIPpeakAnno)
> packageVersion("ChIPpeakAnno")
[1] '2.10.0'
> peak <- RangedData(space="chr1", IRanges(24736757, 24737528))
> data(TSS.human.GRCh37)
> ap <- annotatePeakInBatch(peak, Annotation=TSS.human.GRCh37)
> ap
RangedData with 1 row and 9 value columns across 1 space
space ranges | peak
strand
<factor> <iranges> | <character>
<character>
1 ENSG00000001461 1 [24736757, 24737528] | 1
+
feature start_position end_position
insideFeature
<character> <numeric> <numeric>
<character>
1 ENSG00000001461 ENSG00000001461 24742284 24799466
upstream
distancetoFeature shortestDistance
fromOverlappingOrNearest
<numeric> <numeric>
<character>
1 ENSG00000001461 -5527 4756
NearestStart
In this example, I defined a peak ranging from chr1:24736757 to
chr1:24737528 and annotated the peak using ChIPpeakAnno package.
It returns that the nearest gene is ENSG00000001461, whose gene symbol
is NIPAL3.
?
5
> require(org.Hs.eg.db)
> gene.ChIPpeakAnno <- select(org.Hs.eg.db, key=ap$feature,
keytype="ENSEMBL", columns=c("ENSEMBL", "ENTREZID", "SYMBOL"))
> gene.ChIPpeakAnno
ENSEMBL ENTREZID SYMBOL
1 ENSG00000001461 57185 NIPAL3
When looking at the peak in Genome Browser, I found the nearest gene
is STPG1.
Screenshot 2014-01-13 22.00.46
The gene symbol, STPG1, was converted to ENTREZID for future
processing.
?View Code RSPLUS
1
2
3
4
> gene.nearest <- select(org.Hs.eg.db, key="STPG1", keytype="SYMBOL",
columns=c("ENSEMBL", "ENTREZID", "SYMBOL"))
> gene.nearest
SYMBOL ENSEMBL ENTREZID
1 STPG1 ENSG00000001460 90529
We can query the coordination of these two genes, and compare their
distances to the peak.
?View Code RSPLUS
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
> require(TxDb.Hsapiens.UCSC.hg19.knownGene)
> knownGene <- transcriptsBy(TxDb.Hsapiens.UCSC.hg19.knownGene,
by="gene")
>
> gene.ChIPpeakAnno.info <- knownGene[[gene.ChIPpeakAnno$ENTREZID]]
> gene.ChIPpeakAnno.info
GRanges with 4 ranges and 2 metadata columns:
seqnames ranges strand | tx_id tx_name
<rle> <iranges> <rle> | <integer> <character>
[1] chr1 [24742245, 24781314] + | 618 uc010oek.2
[2] chr1 [24742245, 24799473] + | 619 uc001bjh.3
[3] chr1 [24742245, 24799473] + | 620 uc009vrc.3
[4] chr1 [24782628, 24792864] + | 621 uc001bji.3
---
seqlengths:
chr1 chr2 ...
chrUn_gl000249
249250621 243199373 ...
38502
After getting the information of the gene annotated by ChIPpeakAnno, I
also found that the ranges of the gene is slightly different from the
one returned by annotatePeakInBatch function. This may due to the
variation of different versions and it's not a big deal.
As the gene, NIPAL3, is encoded in + strand, the nearest distance is:
?View Code RSPLUS
1
2
> min(abs(start(peak) - startgene.ChIPpeakAnno.info)))
[1] 5488
While the gene, STPG1, is encoded in - strand, the end of the gene
coordination is actually the start position of the gene and the start
of the gene coordination is the end position. So the distance should
be calculated by the end coordination.
[[alternative HTML version deleted]]