How can I extract promoter regions from a list NCBI genbank acc?

0

Entering edit mode

Yisong Zhen ▴ 200

@yisong-zhen-2952

Last seen 7.4 years ago

Dear All, I read the GenomicFeatures manual, there is only a example to extract promoters from a list EntrezGeneID. So how can I extract a group promoters from a list NCBI genbank acc, like "NM_144551" "NM_019413""? I have already installed locally "TxDb.Mmusculus.UCSC.mm9.knownGene". Thanks. Yisong [[alternative HTML version deleted]]

GenomicFeatures GenomicFeatures • 3.3k views

ADD COMMENT • link updated 12.1 years ago by Marc Carlson ★ 7.2k • written 12.1 years ago by Yisong Zhen ▴ 200

0

Entering edit mode

Marc Carlson ★ 7.2k

@marc-carlson-2264

Last seen 8.3 years ago

United States

Hi Yisong, Are you using mm9 and not mm10? Can you please show us your sessionInfo()? Things have been improved with the latest release, but I can't tell from your post what version you are using? Marc On 11/05/2012 12:23 AM, Yisong Zhen wrote: > Dear All, > > I read the GenomicFeatures manual, there is only a example to extract > promoters from a list EntrezGeneID. So how can I extract a group promoters > from a list NCBI genbank acc, like "NM_144551" "NM_019413""? I have > already installed locally "TxDb.Mmusculus.UCSC.mm9.knownGene". Thanks. > > Yisong > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD COMMENT • link 12.1 years ago Marc Carlson ★ 7.2k

0

Entering edit mode

Marc Carlson ★ 7.2k

@marc-carlson-2264

Last seen 8.3 years ago

United States

Hi Yisong, Here is one way to solve this problem: ## 1st load up the org package for mouse and get the entrez gene IDs library(org.Mm.eg.db) cols(org.Mm.eg.db) ids = c("NM_144551", "NM_019413") res = select(org.Mm.eg.db, keys=ids, cols="ENTREZID", keytype="REFSEQ") res egs = res[,"ENTREZID"] ## Then load the knownGene based package that you mentioned earlier. (note that the gene IDs used by this package are entrez gene IDs! library("TxDb.Mmusculus.UCSC.mm9.knownGene") mm9 = TxDb.Mmusculus.UCSC.mm9.knownGene ## a shorter name, just for convenience cols(mm9) ## Now at this point you have not really told me what you wanted. ## So lets start with just how you can get information about your genes of interest. ## Lets start by just gettting out some of the transcript information for these genes: res2 = select(mm9, keys=egs, cols=c("TXNAME","TXSTRAND","TXCHROM","TXSTART","TXEND"), keytype="GENEID") res2 ## From your description it sounds like you wanted a GRanges object with the ranges for your promoters. ## In that case you could do it like this: proms = promoters(mm9) ## Then you can extract the txnames from before... txnms = unique(res2[,"TXNAME"]) ## And then subset to only the promoters that have the names that you want here myproms = proms[mcols(proms)[,'tx_name'] %in% txnms,] Now I feel that I should also mention that with the latest version of Bioconductor, we have added a new package called OrganismDbi. So with that version you can load a package like Homo.sapiens or Mus.musculus and use select() on an object that refers to several annotation packages at once. When that package is pointed to the correct resources, it can simplify some of the steps above. For details on this approach, please see the OrganismDbi package. Hope that helps, Marc On 11/05/2012 12:23 AM, Yisong Zhen wrote: > Dear All, > > I read the GenomicFeatures manual, there is only a example to extract > promoters from a list EntrezGeneID. So how can I extract a group promoters > from a list NCBI genbank acc, like "NM_144551" "NM_019413""? I have > already installed locally "TxDb.Mmusculus.UCSC.mm9.knownGene". Thanks. > > Yisong > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD COMMENT • link 12.1 years ago Marc Carlson ★ 7.2k

0

Entering edit mode

Dear Marc, I read your code, It looks like that there is a misunderstanding. I did not clarify my question. In gene annotation from NCBI, same EntrezGene ID probably will have several 5 UTR-differentent ReSeq IDs. Then they will not share same promoter. What I want to do is to use the unique RefSeqIDs to extract thier corresponding promoter regions. I do not want to extract those promoters which, however, share identical Entrez GeneID. I am not familiar with bioconductor and I only found that the similar function in package "GenomicFeatures". Please provide me some example code if convenient so I can learn it step by step. Bests, Yisong On Tue, Nov 6, 2012 at 7:29 AM, Marc Carlson <mcarlson@fhcrc.org> wrote: > Hi Yisong, > > Here is one way to solve this problem: > > ## 1st load up the org package for mouse and get the entrez gene IDs > library(org.Mm.eg.db) > cols(org.Mm.eg.db) > ids = c("NM_144551", "NM_019413") > res = select(org.Mm.eg.db, keys=ids, cols="ENTREZID", keytype="REFSEQ") > res > egs = res[,"ENTREZID"] > > ## Then load the knownGene based package that you mentioned earlier. (note > that the gene IDs used by this package are entrez gene IDs! > library("TxDb.Mmusculus.UCSC.**mm9.knownGene") > mm9 = TxDb.Mmusculus.UCSC.mm9.**knownGene ## a shorter name, just for > convenience > cols(mm9) > > ## Now at this point you have not really told me what you wanted. > ## So lets start with just how you can get information about your genes of > interest. > ## Lets start by just gettting out some of the transcript information for > these genes: > res2 = select(mm9, keys=egs, cols=c("TXNAME","TXSTRAND","**TXCHROM","TXSTART","TXEND"), > keytype="GENEID") > res2 > > ## From your description it sounds like you wanted a GRanges object with > the ranges for your promoters. > ## In that case you could do it like this: > proms = promoters(mm9) > ## Then you can extract the txnames from before... > txnms = unique(res2[,"TXNAME"]) > ## And then subset to only the promoters that have the names that you want > here > myproms = proms[mcols(proms)[,'tx_name'] %in% txnms,] > > Now I feel that I should also mention that with the latest version of > Bioconductor, we have added a new package called OrganismDbi. So with that > version you can load a package like Homo.sapiens or Mus.musculus and use > select() on an object that refers to several annotation packages at once. > When that package is pointed to the correct resources, it can simplify > some of the steps above. For details on this approach, please see the > OrganismDbi package. > > > Hope that helps, > > > > Marc > > > On 11/05/2012 12:23 AM, Yisong Zhen wrote: > >> Dear All, >> >> I read the GenomicFeatures manual, there is only a example to extract >> promoters from a list EntrezGeneID. So how can I extract a group promoters >> from a list NCBI genbank acc, like "NM_144551" "NM_019413""? I have >> already installed locally "TxDb.Mmusculus.UCSC.mm9.**knownGene". Thanks. >> >> Yisong >> >> [[alternative HTML version deleted]] >> >> ______________________________**_________________ >> Bioconductor mailing list >> Bioconductor@r-project.org >> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: stat.e="" thz.ch="" mailman="" listinfo="" bioconductor=""> >> Search the archives: http://news.gmane.org/gmane.** >> science.biology.informatics.**conductor<http: news.gmane.org="" gmane="" .science.biology.informatics.conductor=""> >> > > ______________________________**_________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: stat.et="" hz.ch="" mailman="" listinfo="" bioconductor=""> > Search the archives: http://news.gmane.org/gmane.** > science.biology.informatics.**conductor<http: news.gmane.org="" gmane.="" science.biology.informatics.conductor=""> > [[alternative HTML version deleted]]

ADD REPLY • link 12.1 years ago Yisong Zhen ▴ 200

0

Entering edit mode

Hi Yisong, If you really want the refseq IDs tied DIRECTLY to the sequences (IOW, you don't just want all the transcripts from the associated genes, you want exactly those transcripts and ONLY those transcripts) then you need a resource that contains that information. The "TxDb.Mmusculus.UCSC.mm9. knownGene" package does not do that, because it it is based upon the knownGene track at UCSC (which does not contain any refseq IDs). So if you want to do that you will need to process the UCSC refGene track into a package and install that. You can do that by calling the makeTxDbPackageFromUCSC() function like this. makeTxDbPackageFromUCSC(version="1.0.0", maintainer="Some One <so@someplace.org>", author="Some One <so@someplace.org>", genome="mm9", tablename="refGene") Then install the package, and load it up. Once installed you can do this: library("TxDb.Mmusculus.UCSC.mm9.refGene") mm9 = TxDb.Mmusculus.UCSC.mm9.refGene ## shortcut ## Then you could just get the promoters. But now notice that they have the refseq IDs attached as gene names. proms = promoters(mm9) Please let me know if you have further questions, Marc On 11/05/2012 07:09 PM, Yisong Zhen wrote: > Dear Marc, > > I read your code, It looks like that there is a misunderstanding. I > did not clarify my question. > > In gene annotation from NCBI, same EntrezGene ID probably will have > several 5 UTR-differentent ReSeq IDs. Then they will not share same > promoter. What I want to do is to use the unique RefSeqIDs to extract > thier corresponding promoter regions. I do not want to extract those > promoters which, however, share identical Entrez GeneID. > > I am not familiar with bioconductor and I only found that > the similar function in package "GenomicFeatures". > > Please provide me some example code if convenient so I can learn it > step by step. > > Bests, > > Yisong > > On Tue, Nov 6, 2012 at 7:29 AM, Marc Carlson <mcarlson@fhcrc.org> <mailto:mcarlson@fhcrc.org>> wrote: > > Hi Yisong, > > Here is one way to solve this problem: > > ## 1st load up the org package for mouse and get the entrez gene IDs > library(org.Mm.eg.db) > cols(org.Mm.eg.db) > ids = c("NM_144551", "NM_019413") > res = select(org.Mm.eg.db, keys=ids, cols="ENTREZID", > keytype="REFSEQ") > res > egs = res[,"ENTREZID"] > > ## Then load the knownGene based package that you mentioned > earlier. (note that the gene IDs used by this package are entrez > gene IDs! > library("TxDb.Mmusculus.UCSC. mm9.knownGene") > mm9 = TxDb.Mmusculus.UCSC.mm9. knownGene ## a shorter name, just > for convenience > cols(mm9) > > ## Now at this point you have not really told me what you wanted. > ## So lets start with just how you can get information about your > genes of interest. > ## Lets start by just gettting out some of the transcript > information for these genes: > res2 = select(mm9, keys=egs, cols=c("TXNAME","TXSTRAND"," > TXCHROM","TXSTART","TXEND"), keytype="GENEID") > res2 > > ## From your description it sounds like you wanted a GRanges > object with the ranges for your promoters. > ## In that case you could do it like this: > proms = promoters(mm9) > ## Then you can extract the txnames from before... > txnms = unique(res2[,"TXNAME"]) > ## And then subset to only the promoters that have the names that > you want here > myproms = proms[mcols(proms)[,'tx_name'] %in% txnms,] > > Now I feel that I should also mention that with the latest version > of Bioconductor, we have added a new package called OrganismDbi. > So with that version you can load a package like Homo.sapiens or > Mus.musculus and use select() on an object that refers to several > annotation packages at once. When that package is pointed to the > correct resources, it can simplify some of the steps above. For > details on this approach, please see the OrganismDbi package. > > > Hope that helps, > > > > Marc > > > On 11/05/2012 12:23 AM, Yisong Zhen wrote: > > Dear All, > > I read the GenomicFeatures manual, there is only a example to > extract > promoters from a list EntrezGeneID. So how can I extract a > group promoters > from a list NCBI genbank acc, like "NM_144551" > "NM_019413""? I have > already installed locally "TxDb.Mmusculus.UCSC.mm9. > knownGene". Thanks. > > Yisong > > [[alternative HTML version deleted]] > > ______________________________ _________________ > Bioconductor mailing list > Bioconductor@r-project.org <mailto:bioconductor@r-project.org> > https://stat.ethz.ch/mailman/ listinfo/bioconductor > <https: stat.ethz.ch="" mailman="" listinfo="" bioconductor=""> > Search the archives: http://news.gmane.org/gmane. > science.biology.informatics. conductor > <http: news.gmane.org="" gmane.science.biology.informatics.conductor=""> > > > ______________________________ _________________ > Bioconductor mailing list > Bioconductor@r-project.org <mailto:bioconductor@r-project.org> > https://stat.ethz.ch/mailman/ listinfo/bioconductor > <https: stat.ethz.ch="" mailman="" listinfo="" bioconductor=""> > Search the archives: http://news.gmane.org/gmane. > science.biology.informatics. conductor > <http: news.gmane.org="" gmane.science.biology.informatics.conductor=""> > > [[alternative HTML version deleted]]

ADD REPLY • link 12.1 years ago Marc Carlson ★ 7.2k

Login before adding your answer.