Question

gene2pathway retrain: which model is more complete?

Bogdan ▴ 50

@bogdan-3629

After converting my custom gene2Domains mapping into a list of vectors,

    > head(entrez2interpro_nested)
    $`679594`
    [1] "IPR019956" "IPR019954" "IPR019955" "IPR000626"

    $`682397`
    [1] "IPR019956" "IPR019954"

and feeding that into retrain(), I now have a 4th model (the most complete?), built using:

    genes: 5667 of 5667
    features: 4007
    level detectors: 78

This obsoletes Questions 3 and 4 from my previous email of 22 August 2010. However, Questions 1 and 2 are still not fully clear to me.

I would now rephrase Q2 as: Of all the retrain()-generated models I now have, which one is theoretically better to use? The one with the most genes, the most level detectors, or the most features (domains)? Or the one with the lowest average prediction error, disregarding all other factors?

--
Regards,
Bogdan Tokovenko
--
Laboratory of Systems Biology,
Department of Genetic Information Translation Mechanisms,
Institute of Molecular Biology and Genetics, Kyiv, Ukraine
http://SysBio.org.ua/
http://BioMed.org.ua/COTRASIF/
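The conversion from a flat list of Entrez-InterPro pairs to a list of vectors can be sketched in base R. This is only an illustration: the data frame `pairs` and its column names `entrezgene` and `interpro` are assumptions standing in for the BioMart result, and only the six pairs shown in the thread are used.

```r
# Hypothetical stand-in for the BioMart result: one row per Entrez-InterPro pair.
pairs <- data.frame(
  entrezgene = c("679594", "679594", "679594", "679594", "682397", "682397"),
  interpro   = c("IPR019956", "IPR019954", "IPR019955", "IPR000626",
                 "IPR019956", "IPR019954"),
  stringsAsFactors = FALSE
)

# Flat form: one list element per pair, so gene IDs repeat as names.
entrez2interpro_list <- setNames(as.list(pairs$interpro), pairs$entrezgene)

# Nested form: one element per gene, holding a character vector of its domains.
entrez2interpro_nested <- lapply(split(pairs$interpro, pairs$entrezgene), unique)

# Simple sanity counts to compare against what retrain() reports.
n_genes    <- length(entrez2interpro_nested)
n_features <- length(unique(unlist(entrez2interpro_nested)))
```

Comparing `n_genes` with `length(unique(names(entrez2interpro_list)))` is a quick check that no gene was lost in the conversion.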

Genetics Coverage biomaRt gene2pathway

score 0 · Answer 1 · 2010-08-22

Dear all,

I have 2 PCs: a server running Debian Lenny and R 2.7.1, and a home machine running Debian Testing and R 2.11.1. Both have gene2pathway 1.6.1 (and dependencies) installed.

When running `model.rno = retrain(organism = "rno")`, I got slightly different outputs describing the components used to build the model:

    (server)
    genes: 4055 of 5667
    features: 3553
    level detectors: 74

    (home)
    genes: 3987 of 5577
    features: 3488
    level detectors: 75

Question 1: The retrain() manual states that all the data for model training is fetched from KEGG and Ensembl. How, then, could these differences be possible? I ran each retrain twice, to be sure it was not a momentary glitch.

Seeing this, I decided to supply the gene2Domains mapping manually. Using BioMart, I asked for all entrez-interpro pairs:

    > head(entrez2interpro_list)
    $`679594`
    [1] "IPR019956"

    $`679594`
    [1] "IPR019954"

    $`679594`
    [1] "IPR019955"

    $`679594`
    [1] "IPR000626"

    $`682397`
    [1] "IPR019956"

    $`682397`
    [1] "IPR019954"

    > length(unique(names(entrez2interpro_list)))
    [1] 17666

    > model.rno = retrain(organism = "rno", gene2Domains = entrez2interpro_list)

Feeding this list to retrain(), I got these numbers:

    (manual gene2Domains)
    genes: 5677 of 5677
    features: 1852
    level detectors: 78

Question 2 (main question): Of these 3 models I now have, which one is theoretically better to use? The one with the most genes, the most level detectors, or the most features?

Question 3: Is the format of my entrez2interpro_list correct? There were no errors, but that list has duplicate names. I wonder if each EntrezID should be in the list only once, with all relevant IPRs packed into a nested list.

Question 4 (possibly related): How could it happen that there are only 1852 features for the most complete coverage of gene mappings in the "manual gene2Domains" case?
-- Regards, Bogdan Tokovenko
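One base R behaviour may bear on Questions 3 and 4: when a list has duplicate names, extracting by name returns only the first matching element. Whether retrain() looks genes up this way is an assumption, not something confirmed in this thread. The sketch below only demonstrates the list behaviour, using the values from the output above (the variable names are illustrative).

```r
# Flat list with repeated gene IDs as names, as in head(entrez2interpro_list).
flat <- list("679594" = "IPR019956", "679594" = "IPR019954",
             "679594" = "IPR019955", "679594" = "IPR000626",
             "682397" = "IPR019956", "682397" = "IPR019954")

# Extraction by name stops at the first match, so three domains are not seen.
first_hit <- flat[["679594"]]

# Collecting every entry for the gene needs an explicit comparison on names.
all_hits <- unlist(flat[names(flat) == "679594"], use.names = FALSE)

# Quick check for whether a mapping has repeated names at all.
any_dup <- anyDuplicated(names(flat)) > 0
```

If `any_dup` is TRUE for a mapping, packing each gene's domains into a single character vector avoids the ambiguity.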

score 0 · Answer 2 · 2010-08-23

I believe there is a plausible explanation for Question 1: quite a number of software packages have different versions at home and on the server, *including* gene2pathway, which is 1.6.0 on the server and 1.6.1 at home. Previously, I erroneously believed the gene2pathway versions were the same.

Now only Question 2 remains somewhat unanswered. As soon as the final model is retrained, I'll be able to compare average prediction errors and thus conclude which model is better.
-- Regards, Bogdan Tokovenko
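A version mismatch like this one can be spotted by printing installed package versions on each machine and comparing the output. The helper below is a minimal sketch: `report_versions` is a hypothetical name, and the package list passed to it is only an example. It uses `packageDescription()` from the utils package and returns NA for packages that are not installed.

```r
# Hypothetical helper: installed version of each package, NA if not installed.
report_versions <- function(pkgs) {
  vapply(pkgs, function(p) {
    # packageDescription() warns and returns NA when the package is missing.
    v <- suppressWarnings(utils::packageDescription(p, fields = "Version"))
    if (is.na(v)) NA_character_ else as.character(v)
  }, character(1))
}

# Run this on both machines and compare the results side by side.
versions <- report_versions(c("gene2pathway", "stats"))
print(versions)
```

Running `sessionInfo()` on both machines gives a broader picture, including the R version and attached packages.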

score 0 · Answer 3 · 2010-08-23

Replying to the last of my own questions:

> Of all the retrain()-generated models I now have, which one is
> theoretically better to use?
> The one with the most genes, most level detectors, or most features (domains)?
> Or the one with the lowest average prediction error, disregarding all
> other factors?

    version  #genes     #domains  #detectors  avg. error (ksvm) over 11 bags
    -- not using custom gene2Domains --
    1.6.1    3987/5577  3488      75          0.006523582
    1.6.0    4055/5667  3533      74          0.007034632
    -- using custom gene2Domains --
    1.6.0    5667/5667  4007      78          0.008454084

I've decided to use the last model.

-- Regards, Bogdan
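The trade-off between gene coverage and average error can be made explicit by putting the reported figures into a data frame. This is a sketch for comparison only: the numbers are copied from the table above, the column names are made up for illustration, and nothing here calls gene2pathway.

```r
# Figures as reported in the thread; column names are illustrative.
models <- data.frame(
  version    = c("1.6.1", "1.6.0", "1.6.0"),
  custom_g2d = c(FALSE, FALSE, TRUE),
  genes_used = c(3987, 4055, 5667),
  genes_all  = c(5577, 5667, 5667),
  domains    = c(3488, 3533, 4007),
  detectors  = c(75, 74, 78),
  avg_error  = c(0.006523582, 0.007034632, 0.008454084),
  stringsAsFactors = FALSE
)

# Fraction of the available genes that each model was actually trained on.
models$coverage <- models$genes_used / models$genes_all

lowest_error  <- models[which.min(models$avg_error), ]
best_coverage <- models[which.max(models$coverage), ]

# Error cost of choosing full coverage over the lowest-error model.
error_gap <- best_coverage$avg_error - lowest_error$avg_error

print(models[order(models$avg_error), ])
```

On these figures the model chosen in the post (custom gene2Domains, full gene coverage) has the highest average error of the three, about 0.0019 above the lowest-error model, so the choice favours coverage over error.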