dataset dim for siggenes

Frederico Moraes Ferreira ▴ 100

@frederico-moraes-ferreira-5929

Brazil

Hi list,

I have a qPCR data set (116 x 60) processed with limma. The results showed 30 DE miRNAs. My idea is to pick 10 of them for validation by running further statistical tests and taking the miRNAs that recur most often across all analyses (does that make sense?). I was thinking of using siggenes; however, its authors recommend it for high-dimensional data. Will siggenes be suitable for my data? If not, could someone suggest other packages, and perhaps tests, more appropriate for data of this size?

Best,
Fred

qPCR limma siggenes • 1.6k views

ADD COMMENT • link updated 10.2 years ago by James W. MacDonald 67k • written 10.2 years ago by Frederico Moraes Ferreira ▴ 100

James W. MacDonald 67k

@james-w-macdonald-5106

United States

Hi Fred,

I am assuming you have 116 miRNAs and 60 samples, in which case you could probably just use a conventional t-test or linear model, although using limma wouldn't be a controversial decision. I'm not too sure about siggenes, though. You have to estimate the proportion of true nulls, and I don't know if 116 comparisons are enough.

But the larger question is the issue of running further statistical tests for validation. I am not sure what you mean by that. Quantitative PCR is (for better or worse) assumed to be the 'gold standard' for quantification of nucleic acid sequences, so there doesn't seem to be much more to do. Certainly re-running the analyses using a slightly different method isn't useful. That's like weighing yourself on a bunch of different scales; it tells you way more about the scales than it does about your weight.

I think the next step (or really, the first step if you haven't already done so) is to ensure that your data meet all the underlying assumptions for linear modelling, so that you can have confidence in the conclusions you draw from the results.

Best,
Jim

--
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099

ADD COMMENT • link 10.2 years ago James W. MacDonald 67k
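
The doubt about estimating the proportion of true nulls from only 116 tests can be made concrete with a small sketch. This is a simple Storey-type estimator written in Python for illustration; it is an assumption here and not necessarily the estimator siggenes applies. The function name `estimate_pi0`, the threshold `lam`, and the p-values are all hypothetical.

```python
def estimate_pi0(p_values, lam=0.5):
    """Storey-type estimate of the proportion of true null hypotheses.

    Under the null, p-values are uniform on [0, 1], so the share of
    p-values above `lam` is roughly pi0 * (1 - lam).
    """
    m = len(p_values)
    if m == 0:
        raise ValueError("need at least one p-value")
    above = sum(1 for p in p_values if p > lam)
    # A proportion cannot exceed 1, so cap the estimate.
    return min(1.0, above / (m * (1.0 - lam)))


# Made-up example: 116 p-values, 30 very small ones plus 86 spread evenly.
p_values = [0.001] * 30 + [(i + 0.5) / 86 for i in range(86)]
print(f"estimated pi0 = {estimate_pi0(p_values):.3f}")
```

With 116 tests the estimate rests on only a few dozen p-values above `lam`, so it can shift noticeably from one data set to the next, which is the concern raised in the answer.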

Hi Jim,

Thanks for your message. By "validation" I meant selecting 10 of those 30 miRNAs to run another qPCR experiment on different samples, keeping the same number of groups (4) and biological replicates (15). My issue is how to select them. I was wondering which other packages and tests I could try in order to pick the best 10 miRNAs for validation. So I thought it would be useful to take the miRNAs that recur across different tests such as ANOVA, SAM, limma and others. I would also like to make sure I run those tests using packages appropriate for data of this size. From your answer, I understand that this doesn't make sense.

Best,
Fred

ADD REPLY • link 10.2 years ago Frederico Moraes Ferreira ▴ 100

Hi Jim,

Could you please tell me which tests I should perform to ensure that my data fulfil the linear model assumptions? Returning to my question about "performing several different tests to decide which mirs to take", could you explain a little more why such an approach doesn't make sense?

Best,
Fred

ADD REPLY • link 10.2 years ago Frederico Moraes Ferreira ▴ 100

Hi Fred,

I'll take the second question first. The methods that have been developed for analyzing microarray data are all just modifications of the existing linear modeling methods that people have used for years (t-test, ANOVA, linear modeling of continuous covariates, etc). The reason that people have developed these methods is that, in general, with microarray data you run into the problem of making tons of comparisons with very little replication. The problem with doing something like that is that a) you need to adjust the p-values to reflect that you are making (possibly thousands of) simultaneous comparisons, and b) you often have maybe 3 or 4 replicates for each group, so your power to detect differences is probably really low. So the goal was to figure out ways to improve the power for these comparisons in a statistically rigorous manner, and there were lots of ways that people developed to do that.

There was also some concern that the usual assumption of normally distributed data might not hold for all the genes being compared, so different groups developed ways to increase power and also generate permuted null distributions, so you wouldn't have to make an assumption that might not hold.

But in the end, all these methods (limma, siggenes, multtest, etc) are just fitting t-tests that are modified to help increase power. So they are all doing essentially the same thing, but in a slightly different manner. So if you run your samples through limma, and then siggenes, and then multtest, any changes in your results will simply reflect differences in the methods used, but won't give you any more information about your samples. And since you have 15 replicates for each group, you would probably get very similar results if you were to just use 'regular' methods, because you aren't measuring that many genes, and you have pretty good replication.

On the other hand, running a new set of samples will tell you a great deal. This has to do with the underlying hypothesis that you are (usually) testing. In general when you are doing a comparison, you are trying to estimate a population parameter using a sample from that population. In other words, you are trying to make a statement about all the members of a population, based on a sample from that population. There is always the possibility that you were unlucky and chose a set of subjects from the two populations you are comparing that are really different, but in truth there is no difference between the two populations. You then make your measurements, and say 'look, gene X appears to be expressed at a much higher level in population 1 as compared to population 2'. But remember, you were unlucky in your choice of subjects to represent the two populations, and there really aren't any differences. So repeating the experiment with new subjects will likely not have the same result, and you will be glad that you didn't try to publish your results.

Or alternatively, if you re-run your analysis for the 10 top genes, and they are all significant in the next set of samples, then you have pretty good evidence that there really is a difference between the two populations, because you got the same results with two separate sets of subjects. But of course that assumes you are doing a reasonable job of selecting subjects in an unbiased manner, which is a different topic altogether...

For the first question, there are any number of things you can and should test. I won't go into them here because a simple Google search like 'R testing anova assumptions' is likely to bring up all the results you need to answer that question.

Best,
Jim

ADD REPLY • link 10.2 years ago James W. MacDonald 67k
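
Point (a) in the answer above, adjusting p-values for many simultaneous comparisons, is commonly handled with the Benjamini-Hochberg procedure. The sketch below is a minimal Python version with made-up p-values; `bh_adjust` is a hypothetical helper name, and in R the same adjustment is available as `p.adjust(p, method = "BH")`.

```python
def bh_adjust(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Made-up raw p-values for eight hypothetical miRNAs.
raw = [0.001, 0.008, 0.012, 0.030, 0.041, 0.20, 0.45, 0.70]
adjusted = bh_adjust(raw)
print("raw p < 0.05:     ", sum(p < 0.05 for p in raw))
print("adjusted p < 0.05:", sum(p < 0.05 for p in adjusted))
```

The adjusted values depend only on the p-values and their ranks, so the same step applies whichever of the t-test variants produced them.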

Hi Jim,

Thank you very much for your really nice explanation. I'm going to study your answer and, if you don't mind, I would like to come back to it later. I thought that the Bayesian approach implemented in limma would have different assumptions from the t-test and ANOVA. Also, in fact, the normality condition doesn't hold for all miRNAs across patients. I'll go back to the ANOVA assumptions and run additional tests. What should I do if they fail?

About sampling: we are trying to gather patients as similar as possible to those from the first experiment, using several criteria such as age, sex, weight, heart flow and other factors commonly used for phenotyping. I hope we are lucky in the sense you pointed out.

Best,
Fred

ADD REPLY • link 10.2 years ago Frederico Moraes Ferreira ▴ 100
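
The permuted null distributions mentioned earlier in the thread avoid the normality assumption by reshuffling group labels. A minimal Python sketch for a two-group difference in means follows. The study itself has four groups, so this is a simplified illustration rather than the analysis anyone in the thread ran, and `permutation_p_value`, its defaults, and the example values are hypothetical.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_perm=2000, seed=0):
    """Two-sided permutation p-value for a difference in group means."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_perm):
        # Reassign group labels at random by shuffling the pooled values.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    # Add one to both counts so the p-value is never exactly zero.
    return (hits + 1) / (n_perm + 1)


# Made-up expression values for one miRNA in two groups.
group_1 = [5.1, 4.8, 5.6, 5.0, 4.9, 5.3]
group_2 = [6.2, 6.0, 5.9, 6.5, 6.1, 6.4]
print(f"permutation p-value: {permutation_p_value(group_1, group_2):.4f}")
```

The smallest p-value this can return is `1 / (n_perm + 1)`, so `n_perm` limits the resolution once a multiple-testing adjustment is applied across all miRNAs.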
