Question

How to get a unique line of annotation for each specific genomic position by using biomaRt package


Mao Jianfeng ▴ 290

@mao-jianfeng-3598

Last seen 10.6 years ago

Dear listers,

I am new to Bioconductor.

I have genomic variations (SNPs, indels, CNVs) with coordinates given as chromosome:start:end in GFF/BED/VCF format. Each variation is defined by a specific genomic position (in base pairs), for example:

    # SNPs,chr,start,end
    SNP_1,1,43,43
    SNP_2,2,56,56

I would like to annotate these variations with the various gene/protein/pathway-centric annotations listed in the BioMart databases. I tried the R/Bioconductor biomaRt package, but I failed to get a unique line of annotation for each genomic position. Could you please give me some directions on that?

Thanks in advance.

Code I used as an example:

    library(biomaRt)
    listMarts()
    plant = useMart("plant_mart_7")
    alyr = useDataset("alyrata_eg_gene", mart=plant)
    atha = useDataset("athaliana_eg_gene", mart=plant)
    listAttributes(alyr)
    listFilters(alyr)

    chr <- c(rep(1, 10))
    start <- c(33, 999, 3000, 7000, 9000, 10000, 12000, 19000, 80000, 100000)
    end <- c(33, 999, 3000, 7000, 9000, 10000, 12000, 19000, 80000, 100000)

    getBM(attributes = c("chromosome_name", "start_position", "ensembl_gene_id",
                         "go_biological_process_linkage_type"),
          filters = c("chromosome_name", "start", "end"),
          values = list(chr, start, end), mart=alyr, uniqueRows = TRUE)

--
Jian-Feng, Mao

Annotation biomaRt • 1.5k views

ADD COMMENT • link updated 14.2 years ago by Steve Lianoglou ★ 13k • written 14.2 years ago by Mao Jianfeng ▴ 290

score 0 · Answer 1 · 2011-02-08


Steve Lianoglou ★ 13k

@steve-lianoglou-2771

Last seen 20 days ago

United States

Hi,

On Tue, Feb 8, 2011 at 5:49 AM, Mao Jianfeng <jianfeng.mao at gmail.com> wrote:
> I tried R/bioconductor biomaRt package. But, I failed to
> get a unique line of annotation for a specific genomic position. Could
> you please give any directions on that?

Could you explain a bit more about what you mean when you say "get a unique line of annotation"?

The only informative things your `getBM` query returns are the gene ID for the location and the GO term evidence code (go_biological_process_linkage_type). If you add, say, "go_biological_process_id", you get the biological process GO terms associated with the position, i.e.:

    result <- getBM(attributes = c("chromosome_name", "start_position", "ensembl_gene_id",
                                   "go_biological_process_linkage_type", "go_biological_process_id"),
                    filters = c("chromosome_name", "start", "end"),
                    values = list(chr, start, end), mart=alyr, uniqueRows = TRUE)

If your problem is that some positions have more than one row, like so:

    chromosome_name start_position   ensembl_gene_id ... go_biological_process_id
                  1          33055 scaffold_100013.1 ...               GO:0006355
                  1          33055 scaffold_100013.1 ...               GO:0006886
                  1          33055 scaffold_100013.1 ...               GO:0006913
                  1          33055 scaffold_100013.1 ...               GO:0007165
                  1          33055 scaffold_100013.1 ...               GO:0007264

this happens because multiple GO terms are shared at that location. If you want to pick just one, you'll have to decide how you want to do that.

If you want to somehow summarize each chromosome/start_position combination into one row, you can iterate over the data by this combination easily with, say, the ddply function from the plyr package:

    library(plyr)
    summary <- ddply(result, .(chromosome_name, start_position), function(x) {
      # x will have all of the rows for a given chromosome_name / start_position
      # combo. We can arbitrarily just return the first row, but you'll likely
      # want to do something smarter:
      x[1,]
    })

If you look at `summary`, you'll have one row per position.

--
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact
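For readers without plyr, the "keep the first row per position" idea can be sketched in base R. This is only an illustration under stated assumptions: `result` here is a small hand-made stand-in for the `getBM` output (the second gene, `hypothetical_gene_2`, and its coordinates are invented), and `first_per_position` and `rows_per_position` are names introduced for this sketch.

```r
# Toy stand-in for the data frame returned by getBM(). Column names follow the
# query above; the last row (hypothetical_gene_2) is invented for illustration.
result <- data.frame(
  chromosome_name = c(1, 1, 1, 1),
  start_position = c(33055, 33055, 33055, 80210),
  ensembl_gene_id = c("scaffold_100013.1", "scaffold_100013.1",
                      "scaffold_100013.1", "hypothetical_gene_2"),
  go_biological_process_id = c("GO:0006355", "GO:0006886",
                               "GO:0006913", "GO:0007165"),
  stringsAsFactors = FALSE
)

# duplicated() flags every row whose chromosome/start_position pair has
# already appeared, so negating it keeps the first row of each position.
is_repeat <- duplicated(result[, c("chromosome_name", "start_position")])
first_per_position <- result[!is_repeat, ]

# Count the rows per position to see how much was discarded.
position_key <- paste(result$chromosome_name, result$start_position, sep = ":")
rows_per_position <- table(position_key)
```

Checking `rows_per_position` before dropping rows shows which positions carried several GO terms, which is where "just take the first row" loses information.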

ADD COMMENT • link 14.2 years ago Steve Lianoglou ★ 13k


Dear Steve,

Thanks for your kindness. Could you please give me more directions on this annotation problem?

(1)
I want each of my SNPs to have just one line of annotation, with the annotations in separate columns. If there are multiple terms for the same attribute (for example, multiple GO terms shared at that location), I would like to put them in the same column, separated by a symbol (such as ; : |).

For example, I have SNPs like this:

    # SNPs,chr,start,end
    SNP_1,1,43,43
    SNP_2,2,56,56

I would like annotations like this:

    # SNPs,chr,start,end,go_term
    SNP_1,1,43,43,go_1:go_3
    SNP_2,2,56,56,go_100:go_1000

(2)
Alternatively, I would like to have each SNP position combined with its annotation results, so that I know which SNP each annotation line corresponds to. I do not know how to do that using Bioconductor packages. See the following example.

I have SNPs like this:

    # SNPs,chr,start,end
    SNP_1,1,43,43
    SNP_2,2,56,56

I would like annotations like this:

    # SNPs,chr,start,end,go_term
    SNP_1,1,43,43,go_1
    SNP_1,1,43,43,go_3
    SNP_2,2,56,56,go_100
    SNP_2,2,56,56,go_1000

--
Jian-Feng, Mao
the Institute of Botany, Chinese Academy of Botany,

ADD REPLY • link 14.2 years ago Mao Jianfeng ▴ 290


Hi,

On Tue, Feb 8, 2011 at 9:24 AM, Mao Jianfeng <jianfeng.mao at gmail.com> wrote:
> I would have annotations like this:
> # SNPs,chr,start,end,go_term
> SNP_1,1,43,43,go_1:go_3
> SNP_2,2,56,56,go_100:go_1000

I'll give you this one. Continuing from my previous example, say the getBM call stores its return value in `result`:

    library(plyr)
    summary <- ddply(result, .(chromosome_name, start_position), function(x) {
      # keep the first row of the group as a template
      new.x <- x[1,]
      # join all GO IDs for this position into one "|"-separated string
      new.x$go_biological_process_id <- paste(x$go_biological_process_id, collapse="|")
      new.x
    })

I'll leave the rest as an exercise for you.

-steve

--
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact
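For the part left as an exercise, one possible approach is sketched below in base R: query one SNP at a time, tag every returned term with its SNP, then collapse. The `start_position` in the sample output above (33055) matches none of the queried coordinates, which suggests it is the gene's start rather than the SNP's, so carrying the SNP row along explicitly seems safer than grouping on that column. Everything specific here is an assumption: `query_go` is a hypothetical stand-in for a per-SNP `getBM` call, and the GO values are the placeholders from the question, not real annotations.

```r
# SNP table in the layout used in the question.
snps <- data.frame(
  snp = c("SNP_1", "SNP_2"),
  chr = c(1, 2),
  start = c(43, 56),
  end = c(43, 56),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for one getBM() call per SNP. A real version might run
#   getBM(attributes = "go_biological_process_id",
#         filters = c("chromosome_name", "start", "end"),
#         values = list(chr, start, end), mart = alyr)
# and return the resulting column. The terms below are placeholders.
query_go <- function(chr, start, end) {
  fake <- list("1:43" = c("go_1", "go_3"), "2:56" = c("go_100", "go_1000"))
  hits <- fake[[paste(chr, start, sep = ":")]]
  if (is.null(hits)) character(0) else hits
}

# Layout (2): one row per SNP/term pair, so every annotation line names its SNP.
per_snp <- lapply(seq_len(nrow(snps)), function(i) {
  go <- query_go(snps$chr[i], snps$start[i], snps$end[i])
  if (length(go) == 0) go <- NA_character_  # keep SNPs without any annotation
  data.frame(snps[rep(i, length(go)), ], go_term = go,
             row.names = NULL, stringsAsFactors = FALSE)
})
long <- do.call(rbind, per_snp)

# Layout (1): one row per SNP, with the unique terms joined by "|".
collapse_terms <- function(x) paste(unique(x[!is.na(x)]), collapse = "|")
terms_by_snp <- split(long$go_term, factor(long$snp, levels = snps$snp))
wide <- snps
wide$go_term <- unname(vapply(terms_by_snp, collapse_terms, character(1)))
```

`long` corresponds to the second layout in the question and `wide` to the first. Building `wide` from `long` keeps the two views consistent, and `unique()` guards against the same term appearing twice when other attributes (such as the evidence code) differ between rows.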

ADD REPLY • link 14.2 years ago Steve Lianoglou ★ 13k


Thanks a lot, Steve. You have given me a good guide.

--
Jian-Feng, Mao
the Institute of Botany, Chinese Academy of Botany,

ADD REPLY • link 14.2 years ago Mao Jianfeng ▴ 290