Stefan,
FYI, Jianhong will modify the function with two new parameters, one is
NucleotideLevel (TRUE or FALSE) to allow both peak centric and
nucleotide
centric view. Another parameter will be precedence, which is only
applicable
to peak centric view (NucleotideLevel = FALSE). If no precedence
specified,
double count will be enabled, which means that if a peak overlap with
both
promoter and 5'UTR, then both promoter and 5'UTR will be incremented.
If a
precedence order is specified, for example, if promoter is specified
before
5'UTR, then only promoter will be incremented for the same example. In
addition, Jaccard index will be computed.
Please let us know your thoughts on this. Many thanks for your
brilliant
ideas!
Best regards,
Julie
On 3/18/13 9:11 AM, "Lihua Julie Zhu" <julie.zhu at="" umassmed.edu="">
wrote:
> Stefan,
>
> Many thanks for the detailed information on your testing procedure
and
> results!
>
> AssignChromosomeRegion is peak centric and it considers the the
precedence
> of each region. Since peaks are usually in the promoter region, it
got the
> highest precedence. The reason to have precedence is to ensure the
peak
> number assigned to all different regions adds up to total peak
number, which
> is different from nucleotide centric view implemented by bed tools.
>
> Thanks for bringing up a very good point! We probably need to add a
> parameter to have an option to choose between peak centric view and
> nucleotide centric view.
>
> Best regards,
>
> Julie
>
>
> On 3/17/13 3:47 PM, "Stefan Pinter" <pinter at="" molbio.mgh.harvard.edu=""> wrote:
>
>> Hi Julie,
>>
>> maybe I'm misunderstanding something about this function. Below is
what I've
>> done as a positive control (using refseq genes as "peaks" compared
to ensembl
>> annotation), can you tell me whether these numbers make sense? I
was
>> expecting
>> to see close to 100% overlap (according to bedtools closest on
refseq genes
>> against ensembl genes, 958/967 unique refseq X-linked genes are
overlapping
>> an
>> ensembl gene (distance = 0)).
>>
>> But with prox. promoter cutoff = 0, I get only get 42% prox.
promoter and
>> 45%
>> "enhancer", which I believe could also be called intergenic, right?
Even with
>> default of 1k cutoff, only 81% would be counted as "genic" (sum of
everything
>> - "Enhancer").
>> Thanks,
>> - Stefan.
>>
>>
>> First, I pulled in the ensembl gene set from archive, because my
data are in
>> mm9:
>>
>> ensembl67=useMart(host='may2012.archive.ensembl.org',
>> biomart='ENSEMBL_MART_ENSEMBL', dataset='mmusculus_gene_ensembl')
>>> TSS <- getAnnotation(ensembl67, featureType ="TSS")
>>> Exon <- getAnnotation(ensembl67, featureType ="Exon")
>>> utr5 <- getAnnotation(ensembl67, featureType ="5utr")
>>> utr3 <- getAnnotation(ensembl67, featureType ="3utr")
>>
>> Then, I used all X-linked genes from the conservative mm9 refseq
gene set
>> (merged to contain unique names only) as my "peak" list [as a
positive
>> control].
>>
>>> bed.rGm<-read.table("genes/chrX.rGm4.bed", header = FALSE)
>>> rd.rGm<-BED2RangedData(bed.rGm, header=FALSE)
>>> rd.rGm
>> RangedData with 967 rows and 2 value columns across 1 space
>>
>> Finally, I annotated these "peaks" relative to TSS and used
>> assignChromosomeRegion with 0k or 1k cutoff:
>>
>>> rda.rGm = annotatePeakInBatch(rd.rGm, AnnotationData=TSS,
PeakLocForDistance
>>> = "middle")
>>> l.feat0k.rGm <-ChIPpeakAnno:::assignChromosomeRegion(rda.rGm, TSS,
Exon,
>>> utr5, utr3, proximal.promoter.cutoff=0,
immediate.downstream.cutoff=0)
>>> l.feat0k.rGm
>> $Exon
>> [1] 3.205791
>>
>> $Intron
>> [1] 0
>>
>> $`5UTR`
>> [1] 3.516029
>>
>> $`3UTR`
>> [1] 6.30817
>>
>> $`Proximal Promoter%`
>> [1] 41.98552
>>
>> $`Immediate Downstream`
>> [1] 0
>>
>> $Enhancer
>> [1] 44.98449
>>
>>> l.feat1k.rGm <-ChIPpeakAnno:::assignChromosomeRegion(rda.rGm, TSS,
Exon,
>>> utr5, utr3)
>>> l.feat1k.rGm
>> $Exon
>> [1] 3.205791
>>
>> $Intron
>> [1] 0
>>
>> $`5UTR`
>> [1] 3.516029
>>
>> $`3UTR`
>> [1] 6.30817
>>
>> $`Proximal Promoter%`
>> [1] 80.97208
>>
>> $`Immediate Downstream`
>> [1] 0.1034126
>>
>> $Enhancer
>> [1] 5.894519
>>
>> ----- Original Message -----
>> From: "Lihua Zhu (Julie)" <julie.zhu at="" umassmed.edu="">
>> To: "Stefan Pinter" <pinter at="" molbio.mgh.harvard.edu="">,
>> "<bioconductor at="" r-project.org="">" <bioconductor at="" r-project.org="">
>> Cc: "Jianhong Ou" <jianhong.ou at="" umassmed.edu="">
>> Sent: Saturday, March 16, 2013 8:25:44 PM
>> Subject: Re: [BioC] assignChromosomeRegion in 2.6.0 ChIPpeakAnno
>>
>> Stefan,
>>
>> Could you please send me a simple dataset to see what went wrong?
Thanks for
>> letting know!
>>
>> Thanks for hashing out the chromosome region enrichment score
calculation!
>> Looks like you are looking for something very similar to GO
enrichment
>> analysis (getEnrichedGO function in ChIPpeakAnno).
>>
>> For option a, as you mentioned, we will need the genome level
estimation of
>> the distribution for each category, which could be tricky. Once we
have
>> these estimation, then Hypergeometric test can be applied to find
whether a
>> category is enriched/depleted.
>>
>> Jaccard index (option b) is an interesting alternative. We could
implement
>> Jaccard index at the peak level first and add options for computing
Jaccard
>> index at the nucleotide level later.
>>
>> Best regards,
>>
>> Julie
>>
>>
>> On 3/16/13 7:41 PM, "Stefan Pinter" <pinter at="" molbio.mgh.harvard.edu=""> wrote:
>>
>>> Julie, I compared my results from ChipPeakAnno with a simple
overlap in
>>> bedtools: the % of peaks inside genes (exon + intron + 3utr +
5utr) reported
>>> by ChIPpeakAnno is over an order of magnitude lower than what I
learned from
>>> bedtools. Browsing the data indicates that bedtools is right.
Something is
>>> off, do you have a test.data set you can check against?
>>> Thanks,
>>> -S.
>>>
>>> ----- Original Message -----
>>> From: "Stefan Pinter" <pinter at="" molbio.mgh.harvard.edu="">
>>> To: "Lihua Zhu (Julie)" <julie.zhu at="" umassmed.edu="">
>>> Cc: "bioconductor" <bioconductor at="" r-project.org="">
>>> Sent: Saturday, March 16, 2013 1:30:55 PM
>>> Subject: Re: [BioC] assignChromosomeRegion in 2.6.0 ChIPpeakAnno
>>>
>>> Hi Julie,
>>>
>>> well, I was more thinking of something along the lines of either:
>>>
>>> a.) Over/Under-representation:
>>> Percentage of exons in genome = 2%
>>> Percentage of peaks overlapping exons = 10%
>>> Exons are 5-fold over-represented
>>>
>>> But I realize now, that will only work with a full annotation, not
a custom
>>> annotation.
>>>
>>> b.) something like the jaccard index used in bedtools:
>>>
>>> total length of intersection / total length of union
>>>
>>>
http://en.wikipedia.org/wiki/Jaccard_index
>>>
>>> Just suggestions, there are probably many more possible ways to
calculate a
>>> score.
>>> Thank you,
>>> - Stefan.
>>>
>>> ----- Original Message -----
>>> From: "Lihua Zhu (Julie)" <julie.zhu at="" umassmed.edu="">
>>> To: "Stefan Pinter" <pinter at="" molbio.mgh.harvard.edu="">
>>> Cc: "<bioconductor at="" r-project.org="">" <bioconductor at="" r-project.org="">
>>> Sent: Friday, March 15, 2013 1:28:09 PM
>>> Subject: Re: [BioC] assignChromosomeRegion in 2.6.0 ChIPpeakAnno
>>>
>>> Stefan,
>>>
>>> Regarding the enrichment/depletion score, Does the following toy
example
>>> illustrate how to calculate such score? If not, could you please
give a toy
>>> example? Thanks!
>>>
>>> For example, if the total number of peaks = 100, number of peaks
assigned to
>>> promoter = 90, number of peaks assigned to enhancer = 10, then the
>>> enrichment score = 90% and 10% for promoter region and enhancer
region
>>> respectively.
>>>
>>> We will add closest to annotatePeakInBatch in the dev version.
Thanks for
>>> the great suggestion!
>>>
>>> Please cc bioconductor at r-project.org. Thanks!
>>>
>>> Best regards,
>>>
>>> Julie
>>>
>>>
>>> On 3/15/13 12:26 PM, "Stefan Pinter" <pinter at="" molbio.mgh.harvard.edu=""> wrote:
>>>
>>>> Sure, thank you Julie.
>>>>
>>>> Yes, that is what I meant. Since all the necessary information
(peak
>>>> locations, feature locations, total count, total length) are
already
>>>> available
>>>> in RangedData & Annotation, it would be very convenient to
calculate
>>>> enrichment/depletion and possibly even significance scores (by
permutation)
>>>> on
>>>> the fly. The significance score may be overkill, but if the
function at
>>>> least
>>>> reported enrichment/depletion scores, the user could always
supply a number
>>>> of
>>>> shuffled ranges to build a random model of enrichment scores and
calculate
>>>> significance after.
>>>>
>>>> In addition, would it be possible to add another definition for
>>>> PeakLocForDistance in annotatePeakInBatch?
>>>> PeakLocForDistance = "start means using start of the peak to
calculate the
>>>> distance to feature"
>>>>
>>>> It would be helpful to have "closest", meaning distance to
feature measured
>>>> from peak start or peak end, whichever is closer. That would help
with
>>>> broad
>>>> peaks, which using "middle" for isn't very helpful.
>>>>
>>>> Thank you and Best wishes,
>>>> - Stefan.
>>>>
>>>> ----- Original Message -----
>>>> From: "Lihua Zhu (Julie)" <julie.zhu at="" umassmed.edu="">
>>>> To: "Stefan Pinter" <pinter at="" molbio.mgh.harvard.edu="">
>>>> Cc: "<bioconductor at="" r-project.org="">" <bioconductor at="" r-project.org="">
>>>> Sent: Friday, March 15, 2013 8:06:03 AM
>>>> Subject: Re: [BioC] assignChromosomeRegion in 2.6.0 ChIPpeakAnno
>>>>
>>>> Stefan,
>>>>
>>>> You can download 2.6.1 to directly use assignChromosomeRegion
without the
>>>> package prefix (I updated last night and it might take sometime
to
>>>> propagate
>>>> to the installation site). To find out what parameters supported
by the
>>>> function, in R, please type help(
ChIPpeakAnno:::assignChromosomeRegion).
>>>> You will see that indeed PeakLocForDistance is not supported. If
needed, I
>>>> can add such parameter.
>>>>
>>>> Also regarding your previous suggestion of adding enrichment
status of
>>>> feature length, do you mean enrichment of peaks in certain
category of
>>>> chromosome region? For example, a significant enrichment score
with 90%
>>>> peaks in promoter region would be interesting.
>>>>
>>>> Could you please keep the thread in the Bioc for others to
>>>> contribute/benefit? Thanks!
>>>>
>>>> Best regards,
>>>>
>>>> Julie
>>>>
>>>>
>>>> On 3/14/13 10:10 PM, "Stefan Pinter" <pinter at="" molbio.mgh.harvard.edu=""> wrote:
>>>>
>>>>> PPS. I think PeakLocForDistance is not working in that function:
>>>>>
>>>>>> l.feat.dRF8.100k
<-ChIPpeakAnno:::assignChromosomeRegion(rda.dRF8, TSS,
>>>>>> Exon,
>>>>>> utr5, utr3, proximal.promoter.cutoff=100000,
>>>>>> immediate.downstream.cutoff=100000,
PeakLocForDistance="middle")
>>>>> Error in ChIPpeakAnno:::assignChromosomeRegion(rda.dRF8, TSS,
Exon, utr5,
>>>>> :
>>>>> unused argument(s) (PeakLocForDistance = "middle")
>>>>>> l.feat.dRF8.100k
<-ChIPpeakAnno:::assignChromosomeRegion(rda.dRF8, TSS,
>>>>>> Exon,
>>>>>> utr5, utr3, proximal.promoter.cutoff=100000,
>>>>>> immediate.downstream.cutoff=100000, PeakLocForDistance= middle)
>>>>> Error in ChIPpeakAnno:::assignChromosomeRegion(rda.dRF8, TSS,
Exon, utr5,
>>>>> :
>>>>> unused argument(s) (PeakLocForDistance = middle)
>>>>>> l.feat.dRF8.100k
<-ChIPpeakAnno:::assignChromosomeRegion(rda.dRF8, TSS,
>>>>>> Exon,
>>>>>> utr5, utr3, proximal.promoter.cutoff=100000,
>>>>>> immediate.downstream.cutoff=100000, PeakLocForDistance =
c("middle"))
>>>>> Error in ChIPpeakAnno:::assignChromosomeRegion(rda.dRF8, TSS,
Exon, utr5,
>>>>> :
>>>>> unused argument(s) (PeakLocForDistance = c("middle"))
>>>>>
>>>>> ----- Original Message -----
>>>>> From: "Stefan Pinter" <pinter at="" molbio.mgh.harvard.edu="">
>>>>> To: "Lihua Zhu (Julie)" <julie.zhu at="" umassmed.edu="">
>>>>> Sent: Thursday, March 14, 2013 9:39:10 PM
>>>>> Subject: Re: [BioC] assignChromosomeRegion in 2.6.0 ChIPpeakAnno
>>>>>
>>>>> PS. if I might add a simple feature addition to that function -
it would
>>>>> be
>>>>> a
>>>>> TRUE/FALSE trigger for whether to also report
enrichment/depletion stats
>>>>> based
>>>>> on feature lengths, i.e. 50% peaks labeled as intergenic (or
enhancers) is
>>>>> less interesting than 50% of peaks reported as exonic (much
greater
>>>>> enrichment
>>>>> as total exonic feature length is smaller). Thanks, Best...S
>>>>>
>>>>> ----- Original Message -----
>>>>> From: "Stefan Pinter" <pinter at="" molbio.mgh.harvard.edu="">
>>>>> To: "Lihua Zhu (Julie)" <julie.zhu at="" umassmed.edu="">
>>>>> Cc: "bioconductor" <bioconductor at="" r-project.org="">
>>>>> Sent: Thursday, March 14, 2013 9:32:33 PM
>>>>> Subject: Re: [BioC] assignChromosomeRegion in 2.6.0 ChIPpeakAnno
>>>>>
>>>>> Hi Julie,
>>>>>
>>>>> yes, that worked! Thank you for the quick help, I very much
appreciate it!
>>>>> Thank you and Best wishes,
>>>>> - Stefan.
>>>>>
>>>>> ----- Original Message -----
>>>>> From: "Lihua Zhu (Julie)" <julie.zhu at="" umassmed.edu="">
>>>>> To: "Lihua Zhu (Julie)" <julie.zhu at="" umassmed.edu="">, "Stefan
Pinter"
>>>>> <pinter at="" molbio.mgh.harvard.edu="">
>>>>> Cc: "<bioconductor at="" r-project.org="">" <bioconductor at="" r-project.org="">
>>>>> Sent: Thursday, March 14, 2013 8:50:51 PM
>>>>> Subject: Re: [BioC] assignChromosomeRegion in 2.6.0 ChIPpeakAnno
>>>>>
>>>>> Stefan,
>>>>>
>>>>> Thanks for reporting this! Actually the function
assignChromosomeRegion
>>>>> is
>>>>> not exported.
>>>>>
>>>>> Could you please try the following to see if you can use it?
Thanks!
>>>>>
>>>>> ChIPpeakAnno:::assignChromosomeRegion
>>>>>
>>>>> Best regards,
>>>>>
>>>>> Julie
>>>>>
>>>>>
>>>>> On 3/14/13 7:36 PM, "Lihua Julie Zhu" <julie.zhu at="" umassmed.edu=""> wrote:
>>>>>
>>>>>> Stefan,
>>>>>>
>>>>>> Could you please send me the sessionInfo? I just want to make
sure you
>>>>>> installed version 2.6.0. Thanks!
>>>>>>
>>>>>> I noticed that when I try to install ChIPpeakAnno with the
following
>>>>>> code,
>>>>>> I
>>>>>> get 2.4 version instead.
>>>>>>
>>>>>> source("
http://bioconductor.org/biocLite.R")
>>>>>> biocLite("ChIPpeakAnno")
>>>>>>
>>>>>> Best regards,
>>>>>>
>>>>>> Julie
>