Help with promoter analysis


Davy ▴ 20

@davy-5133

Last seen 10.2 years ago

Hi all,

Hoping someone could give me a bit of direction here.

I have a set of genes which are all members of the same pathway. I want to identify whether any transcription factor binding sites (TFBS) in the "promoters" (so far defined as 5kb upstream of the TSS) are more common among the genes in the pathway.

I have managed to get the 5kb upstream using biomaRt (although the query throws an intermittent error, moaning about the upstream_flank filter; it doesn't happen all the time, it's weird!).

I also managed to download all the JASPAR matrices, parse the file for only human ones and convert them into position weight matrices.

Lastly, I have produced a table of counts of each human TFBS motif in each of my genes using countPWM(pwm, seq, cutoff="90%").

This is as far as I have gotten and I am simply wondering what to do next. From some reading, the hypergeometric distribution is used in this situation, but I am not sure what metrics to put in as the white balls drawn, total white balls, black balls etc., for those of you familiar with the hypergeometric distribution.

I read that perhaps I should compare to a background set of genes; some sources say all other genes. This seems like overkill.

Any help is appreciated.

Cheers,
Davy

--
David Kavanagh
Nephrology Research, Centre of Public Health, Queen's University Belfast, A floor, Tower Block, City Hospital, Lisburn Road, BT9 7AB, Belfast, United Kingdom

Transcription convert biomaRt • 1.9k views

12.7 years ago Davy ▴ 20

Alex Gutteridge ▴ 650

@alex-gutteridge-2935

Last seen 10.2 years ago

United States

On 27.02.2012 11:17, Davy wrote:
> I want to identify if there are any transcription factor binding sites
> (TFBS) in the "promoters" (so far defined as 5kb upstream of the TSS) that
> are more common to genes among the pathway.

Hi Davy,

Your second paragraph is a little vague/confusing ('more common' than what?). But if the question is whether a given motif appears more often in your pathway genes than one would expect by chance from a random sampling of genes from the genome, then the hypergeometric seems appropriate.

The nature of the white/black balls depends a little on how you initially selected your genes and the precise question you wish to ask, but essentially it will be:

White balls: All genes in the genome (or other background set) that contain your motif
Black balls: All genes in the genome (or other background set) that don't contain your motif
Balls drawn: All genes in your pathway
White balls drawn: Genes in your pathway that contain the motif

So if 1000 genes contain the motif, there are 30,000 genes in the genome, 20 genes in the pathway and 10 genes in the pathway contain the motif, then the call to phyper would be:

> phyper(10,1000,30000-1000,20,lower.tail=F)
[1] 6.820356e-12

--
Alex Gutteridge
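The urn mapping above extends to a whole table of motif counts such as the one countPWM() produces. The following is a minimal sketch in base R, not code from the thread: the toy counts matrix, the gene and motif names, and the helper urn_test() are all invented for illustration. Note that phyper(q, ..., lower.tail = FALSE) returns P(X > q), so the sketch passes white_drawn - 1 in order to include the observed count itself in the tail.

```r
# Toy table of motif hit counts: rows are genes, columns are motifs.
# All names and numbers are invented for illustration.
counts <- cbind(
  motifA = c(2, 1, 1, 3, 1, 0, 0, 0, 0, 0),
  motifB = c(1, 0, 0, 0, 0, 2, 1, 1, 4, 1)
)
rownames(counts) <- paste0("gene", 1:10)
pathway <- paste0("gene", 1:4)   # hypothetical pathway members

# Hypothetical helper: one hypergeometric test per motif (column).
urn_test <- function(counts, pathway) {
  has_motif   <- counts > 0                       # a gene "contains" a motif if it has >= 1 hit
  white       <- colSums(has_motif)               # background genes containing the motif
  black       <- nrow(has_motif) - white          # background genes without the motif
  drawn       <- length(pathway)                  # genes in the pathway
  white_drawn <- colSums(has_motif[pathway, , drop = FALSE])
  # P(X >= white_drawn); phyper's upper tail is strict, hence the "- 1"
  p <- phyper(white_drawn - 1, white, black, drawn, lower.tail = FALSE)
  data.frame(motif = colnames(counts),
             white = unname(white), black = unname(black),
             drawn = drawn, white_drawn = unname(white_drawn),
             p.value = unname(p))
}

res <- urn_test(counts, pathway)
print(res)
```

Here the background is just the 10 toy genes; in a real analysis it would be every gene in the genome (or whichever background set is chosen), and the pathway genes must be part of that background.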

12.7 years ago Alex Gutteridge ▴ 650

Hi,

I think the first argument of phyper() is the quantile q, and with lower.tail=F it returns P(X > q). To get the probability of 10 or more pathway genes containing the motif, the call should use q = 9:

> phyper(9,1000,30000-1000,20,lower.tail=F)
[1] 2.209245e-10

Or, you can use fisher.test():

> fisher.test(matrix(c(10,20-10,1000-10,30000-1000-20+10), 2,2),alternative="greater")$p.value
[1] 2.209245e-10

Cheers,
Yuan
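To see why the two calls agree, it may help to compute the same tail probability three ways. This is a small illustrative check using the numbers from the thread (30,000 genes, 1000 with the motif, 20 pathway genes, 10 of them with the motif); the variable names are made up for the sketch.

```r
N <- 30000   # genes in the background
K <- 1000    # background genes containing the motif (white balls)
n <- 20      # genes in the pathway (balls drawn)
k <- 10      # pathway genes containing the motif (white balls drawn)

# 1. Upper tail of the hypergeometric CDF: P(X > k - 1) = P(X >= k)
p_phyper <- phyper(k - 1, K, N - K, n, lower.tail = FALSE)

# 2. The same tail written out as a sum of point probabilities
p_dhyper <- sum(dhyper(k:n, K, N - K, n))

# 3. One-sided Fisher's exact test on the 2x2 table
tab <- matrix(c(k,     n - k,
                K - k, N - K - (n - k)),
              nrow = 2,
              dimnames = list(c("motif", "no motif"),
                              c("pathway", "rest")))
p_fisher <- fisher.test(tab, alternative = "greater")$p.value

print(c(phyper = p_phyper, dhyper = p_dhyper, fisher = p_fisher))
```

All three should agree with the 2.209245e-10 reported above, since the one-sided Fisher test on a 2x2 table is the hypergeometric upper tail.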

12.7 years ago Yuan Hao ▴ 240

Davy ▴ 20

@davy-5133

Last seen 10.2 years ago

Thanks Alex and Yuan,

One last question though. Is biomart capable of downloading the promoter sequences of every gene in the genome, and would searching all those sequences not be very expensive in terms of memory?

Cheers,
Davy

12.7 years ago Davy ▴ 20

I don't know if memory would be prohibitive or not. If it is, then the usual approach in these cases is to do it chromosome by chromosome (or any other convenient division). All you need at the end is the number of genes hit, so there's no need to have everything in memory at once.

--
Alex Gutteridge
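The chunk-by-chunk idea can be sketched as a loop that keeps only running tallies. This is an illustration under stated assumptions, not a tested pipeline: count_chunk() is a hypothetical stand-in for "fetch one chromosome's promoters and count motif hits" (in a real run that step would involve biomaRt and countPWM()), and the toy counts it returns are invented.

```r
# Hypothetical stand-in for fetching one chromosome's promoters and counting
# motif hits per gene. Returns a genes x motifs matrix of made-up counts.
count_chunk <- function(chrom) {
  toy <- list(
    chr1 = cbind(motifA = c(1, 0, 2), motifB = c(0, 0, 1)),
    chr2 = cbind(motifA = c(0, 0),    motifB = c(3, 1)),
    chrX = cbind(motifA = c(1),       motifB = c(0))
  )
  toy[[chrom]]
}

genes_hit  <- c(motifA = 0, motifB = 0)  # running number of genes containing each motif
genes_seen <- 0                          # running number of genes searched

for (chrom in c("chr1", "chr2", "chrX")) {
  counts     <- count_chunk(chrom)                  # only this chunk is held in memory
  genes_hit  <- genes_hit + colSums(counts > 0)     # keep the tallies
  genes_seen <- genes_seen + nrow(counts)
  rm(counts)                                        # drop the chunk before the next one
}

white <- genes_hit               # background genes with the motif
black <- genes_seen - genes_hit  # background genes without it
print(rbind(white, black))
```

The white and black totals are all that the hypergeometric test needs from the background, so the per-chromosome sequences and count tables can be discarded as soon as each chunk has been tallied.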

12.7 years ago Alex Gutteridge ▴ 650

Hi Davy,

This is a bit off the topic of the list, but I hope it could be of help to some others. In my experience, biomart is based on Ensembl gene annotations, which won't be exactly the same as another annotation, such as UCSC, in terms of gene names, transcripts and so on. Apart from differences that may exist in the mitochondrial genome included by the two databases, I believe the vast majority of genes have the same coordinates on the genome, so it should be OK to download the promoter sequences from either, and biomart is quite efficient at this.

In terms of motif searching, it really depends on the programs/tools, i.e. the algorithms, you choose to use.

Cheers,
Yuan

P.S. Promoters from UCSC can be retrieved from its Table Browser.

12.7 years ago Yuan Hao ▴ 240