Hi all,
Hoping someone could give me a bit of direction here.
I have a set of genes which are all members of the same pathway.
I want to identify whether there are any transcription factor binding
sites (TFBS) in the "promoters" (so far defined as 5kb upstream of the
TSS) that are more common to genes in the pathway.
I have managed to get the 5kb upstream sequences using biomaRt (although
the query throws an intermittent error complaining about the
upstream_flank filter; it doesn't happen every time, which is weird!).
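
The retrieval step described above might look like the following minimal sketch. It assumes the biomaRt package is installed and the Ensembl mart is reachable when the function is called; with_retry and fetch_upstream are hypothetical helper names, the use of HGNC symbols and the "gene_flank" sequence type are assumptions, and retrying is only a workaround for a transient failure, not a diagnosis of the upstream_flank error.

```r
# Hypothetical helper: call f() up to `tries` times, pausing between attempts.
with_retry <- function(f, tries = 3, wait = 1) {
  last_error <- NULL
  for (i in seq_len(tries)) {
    result <- tryCatch(f(), error = function(e) e)
    if (!inherits(result, "error")) return(result)
    last_error <- result
    if (i < tries) Sys.sleep(wait)
  }
  stop("all ", tries, " attempts failed: ", conditionMessage(last_error))
}

# Hypothetical wrapper around biomaRt::getSequence() for the upstream flank.
# Assumes the gene identifiers are HGNC symbols; nothing is downloaded
# until the function is actually called.
fetch_upstream <- function(symbols, upstream = 5000) {
  with_retry(function() {
    mart <- biomaRt::useMart("ensembl", dataset = "hsapiens_gene_ensembl")
    biomaRt::getSequence(id = symbols, type = "hgnc_symbol",
                         seqType = "gene_flank",
                         upstream = upstream, mart = mart)
  })
}
```

Wrapping the whole query (mart connection included) in the retry means a dropped connection is re-established on the next attempt.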
I also managed to download all the JASPAR matrices, parse the file for
only the human ones and convert them into position weight matrices.
Lastly, I have produced a table of counts of each human TFBS motif in
each of my genes using countPWM(pwm, seq, cutoff="90%").
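
To feed a table like this into an enrichment test, the counts are usually reduced to presence/absence per gene. The sketch below assumes a numeric matrix called counts with motifs as rows and genes as columns; the layout, the motif and gene names, and the numbers are all invented for illustration. (As far as the Biostrings documentation goes, the threshold argument of countPWM() is named min.score.)

```r
# Toy count table: rows are motifs, columns are genes (all values invented).
counts <- matrix(c(0, 2, 1, 0,
                   3, 0, 0, 0,
                   0, 0, 0, 0),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("motifA", "motifB", "motifC"),
                                 c("gene1", "gene2", "gene3", "gene4")))

# A gene "contains" a motif if it has at least one hit above the cutoff.
has_motif <- counts > 0

# Number of genes containing each motif: the per-motif tally that a
# hypergeometric or Fisher test works with.
genes_with_motif <- rowSums(has_motif)
print(genes_with_motif)
```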
This is as far as I have gotten, and I am simply wondering what to do
next. From some reading, the hypergeometric distribution is used in this
situation, but I am not sure what quantities to plug in as the white
balls drawn, total white balls, black balls etc., for those of you
familiar with the hypergeometric distribution.
I read that perhaps I should compare to a background set of genes; some
sources say all other genes. This seems like overkill.
Any help is appreciated.
Cheers,
Davy
--
David Kavanagh
Nephrology Research, Centre of Public Health, Queen's University Belfast,
A floor, Tower Block, City Hospital, Lisburn Road, BT9 7AB, Belfast,
United Kingdom
On 27.02.2012 11:17, Davy wrote:
> [...]
Hi Davy,
Your second paragraph is a little vague/confusing ('more common' than
what?). But if the question is whether a given motif appears more often
in your pathway genes than one would expect by chance from a random
sampling of genes from the genome, then the hypergeometric seems
appropriate. The nature of the white/black balls depends a little on how
you initially selected your genes and the precise question you wish to
ask, but essentially it will be:
White balls: All genes in the genome (or other background set) that
  contain your motif
Black balls: All genes in the genome (or other background set) that
  don't contain your motif
Balls drawn: All genes in your pathway
White balls drawn: Genes in your pathway that contain the motif
So if 1000 genes contain the motif, there are 30,000 genes in the
genome, 20 genes in the pathway and 10 genes in the pathway contain the
motif, then the call to phyper would be:
> phyper(10,1000,30000-1000,20,lower.tail=F)
[1] 6.820356e-12
--
Alex Gutteridge
Hi,
I think the first argument of phyper() is the quantile q, and with
lower.tail=F it returns P(X > q), so to include the observed count of 10
the call should be like this:
> phyper(9,1000,30000-1000,20,lower.tail=F)
[1] 2.209245e-10
Or, you can use 'fisher.test()':
> fisher.test(matrix(c(10,20-10,1000-10,30000-1000-20+10),
2,2),alternative="greater")$p.value
[1] 2.209245e-10
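
The off-by-one between the two phyper() calls comes from lower.tail=FALSE returning P(X > q) rather than P(X >= q). The sketch below checks this with the numbers from this thread and wraps the calculation in a helper; enrichment_p and its argument names are hypothetical, the background counts in the multi-motif example are invented, and the Benjamini-Hochberg adjustment at the end is an addition not discussed in the thread, shown only because many motifs are usually tested at once.

```r
# Hypothetical helper: P(X >= k) under the hypergeometric null.
#   k = pathway genes with the motif     n = genes in the pathway
#   K = background genes with the motif  N = genes in the background
# Vectorised over motifs (k and K may be vectors of equal length).
enrichment_p <- function(k, K, n, N) {
  # lower.tail = FALSE gives P(X > q), so pass k - 1 to include k itself.
  phyper(k - 1, K, N - K, n, lower.tail = FALSE)
}

p_tail <- enrichment_p(10, 1000, 20, 30000)

# Same quantity as an explicit sum of point probabilities for X = 10..20.
p_sum <- sum(dhyper(10:20, 1000, 30000 - 1000, 20))

# Same quantity from the one-sided Fisher test on the 2x2 table.
tab <- matrix(c(10, 20 - 10, 1000 - 10, 30000 - 1000 - 20 + 10), 2, 2)
p_fisher <- fisher.test(tab, alternative = "greater")$p.value

print(c(p_tail, p_sum, p_fisher))

# Several motifs at once (background counts K invented), then BH adjustment.
p_many <- enrichment_p(k = c(10, 3, 0), K = c(1000, 4000, 500),
                       n = 20, N = 30000)
p_adj <- p.adjust(p_many, method = "BH")
```

Note that this 2x2 layout treats the 20 pathway genes as part of the 30,000 background genes, which matches the "balls in an urn" description earlier in the thread.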
Cheers,
Yuan
On 27 Feb 2012, at 11:34, Alex Gutteridge wrote:
> [...]
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
http://news.gmane.org/gmane.science.biology.informatics.conductor
Thanks Alex and Yuan,
One last question though.
Is biomaRt capable of downloading the promoter sequences of every gene
in the genome, and would searching all those sequences not be very
expensive in terms of memory?
Cheers,
Davy
On 27.02.2012 14:14, Davy wrote:
> [...]
I don't know if memory would be prohibitive or not. If it is, then the
usual approach in these cases is to do it chromosome by chromosome (or
any other convenient division). All you need at the end is the number of
genes hit, so there's no need to have everything in memory at once.
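
A minimal sketch of that chunked approach, assuming a hypothetical hits_for_chunk function that returns a motifs-by-genes logical matrix for one chunk of genes (in practice it would fetch the sequences for that chunk and run the motif scan). Only the running per-motif gene counts are kept between chunks; count_genes_hit, fake_hits and the toy gene ids are invented for illustration.

```r
# Accumulate, per motif, the number of genes with at least one hit,
# processing genes in chunks so all sequences never sit in memory at once.
count_genes_hit <- function(gene_ids, motif_names, hits_for_chunk,
                            chunk_size = 500) {
  totals <- setNames(numeric(length(motif_names)), motif_names)
  chunks <- split(gene_ids, ceiling(seq_along(gene_ids) / chunk_size))
  for (chunk in chunks) {
    hit <- hits_for_chunk(chunk)          # motifs x genes logical matrix
    totals <- totals + rowSums(hit)[motif_names]
  }
  totals
}

# Stand-in for the real scan: derives fake hits from the gene id itself.
fake_hits <- function(chunk) {
  rbind(motifA = chunk %% 2 == 0,   # even-numbered genes contain motifA
        motifB = chunk %% 5 == 0)   # every fifth gene contains motifB
}

totals <- count_genes_hit(1:1000, c("motifA", "motifB"), fake_hits,
                          chunk_size = 128)
print(totals)
```

The chunks here are arbitrary blocks of gene ids; splitting by chromosome would work the same way, since only the totals are carried forward.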
--
Alex Gutteridge
Hi Davy,
This is a bit off the topic of the list, but I hope it could be of help
to some others.
In my experience, biomart is based on Ensembl gene annotations, which
won't be exactly the same as another annotation, such as UCSC, in terms
of gene names, transcripts and so on. Apart from differences that may
exist in the mitochondrial genome included by these two databases, I
believe the vast majority of the genes have the same coordinates on the
genome, so it should be OK to download the promoter sequences from
either, and biomart is quite efficient at this.
In terms of motif searching, it really depends on the programs/tools,
i.e. the algorithms, you choose to use.
Cheers,
Yuan
P.S. Promoters from UCSC can be retrieved from its Table Browser.
On 27 Feb 2012, at 14:14, Davy wrote:
> [...]