Hello all,
I have a tab-delimited text file from 100 biological samples, with
the sample names as the column names.
What is a memory-efficient way of extracting only some specific
columns (samples) and working on them?
Should I make a new file from them and work with the new file, like:
new <- read.table(myfile, header = TRUE)[, c(column names)]
and then write this to a new file?
Thank you in advance
On Wed, Aug 15, 2012 at 8:11 AM, Fatemehsadat Seyednasrollah
<fatsey@utu.fi> wrote:
> Hello all,
>
> I have a tab-delimited text file from 100 biological samples, with
> the sample names as the column names.
> What is a memory-efficient way of extracting only some specific
> columns (samples) and working on them?
> Should I make a new file from them and work with the new file, like:
> new <- read.table(myfile, header = TRUE)[, c(column names)]
>
Have a look at the colClasses argument to read.table(). You could, for
example, read the first few lines of the file to get the header (using
nrows), figure out which columns to read based on that, and then set the
colClasses accordingly to read the full table.
Sean
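A minimal sketch of the colClasses approach, assuming a tab-delimited
file with a header row. The demo file and the names gene, sampleB and
wanted are made up for illustration. Setting a column's class to "NULL"
tells read.table() to skip that column, while NA leaves type detection
to R.

```r
# Build a small demo file (hypothetical sample names, illustration only).
path <- tempfile(fileext = ".txt")
demo <- data.frame(gene    = c("g1", "g2", "g3"),
                   sampleA = c(1.2, 3.4, 5.6),
                   sampleB = c(7.8, 9.0, 1.1),
                   sampleC = c(2.2, 3.3, 4.4))
write.table(demo, path, sep = "\t", quote = FALSE, row.names = FALSE)

# Columns we actually want to load.
wanted <- c("gene", "sampleB")

# Read the header plus a single data row, just to learn the column names.
header <- names(read.table(path, header = TRUE, sep = "\t", nrows = 1,
                           check.names = FALSE))

# "NULL" (as a string) skips a column; NA lets read.table guess its type.
classes <- rep("NULL", length(header))
classes[header %in% wanted] <- NA

# Read the full table, keeping only the wanted columns.
subset_data <- read.table(path, header = TRUE, sep = "\t",
                          colClasses = classes, check.names = FALSE)
```

Skipped columns are not stored, so an intermediate file is only worth
writing if the same subset will be reused many times.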
> and then write this new to a new file?
> Thank you in advance
> _______________________________________________
> Bioconductor mailing list
> Bioconductor@r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
Dear listers,
Apologies if my question is not strictly related to Bioconductor, though
one never knows, maybe there's a package that does what I need.
I am analysing a list of differentially expressed genes from an Illumina
microarray. In particular I'm trying to compare the list of
differentially expressed genes to an existing list of genes
preferentially expressed in the stem cell population (stem cell
signature). When I do so, 10% of DE genes belong to the stem cell
signature. What I'd like to do now is to find out, how likely that would
happen by chance, i.e. put a p value on it.
At the moment I know:
There're 17119 unique genes in my dataset.
Of them 86 are differentially expressed.
The stem cell signature contains 510 genes.
It is combined from several platforms, which makes it hard to establish
the total number of unique genes, but it's at least 20819 (the size of
the largest platform).
There are 9 overlapping genes between DE genes and the stem cell
signature.
So I wonder:
1) If there's an accepted way to calculate a p value using these data.
For instance, could I run something like a chi-squared test? E.g. stem
cell specific genes represent 510/20819 = 2.4% of the total dataset. So
the expected number of stem cell genes among my DE genes would be
86 x 2.4% = 2. So my chi-squared test would be based on 9 observed vs 2
expected.
2) Or do I have to generate a gene set based on the stem cell signature
and go through GSEA algorithms to calculate enrichment and
significance?
Any pointers in the right direction would be much appreciated.
Many thanks for your time and help!
Aliaksei.
Howdy,
Disclaimer: I am not a statistician and am always reluctant to give such
advice since I'd never really claim authority in this arena, but ... here
goes ;-)
On Wednesday, August 15, 2012, Aliaksei Holik wrote:
> 1) If there's an accepted way to calculate a p value using these data.
> For instance, could I run something like a chi-squared test? E.g. stem
> cell specific genes represent 510/20819 = 2.4% of the total dataset. So
> the expected number of stem cell genes among my DE genes would be
> 86 x 2.4% = 2. So my chi-squared test would be based on 9 observed vs 2
> expected.
A Fisher's exact test would seem like the natural first choice. I'm also
pretty sure that (for large enough N) the chi-squared test is a good
approximation to the same, so your intuition is spot on!
Your choice of numbers (i.e. what the real size of "the urn" that you
sample from is) is crucial, so some more care is required there.
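As a rough sketch of the Fisher's test suggested here, using the numbers
from the question. It assumes the "urn" is the 17119 genes in the
dataset and that all 510 signature genes are among them, which may not
hold; the variable names are illustrative only.

```r
universe  <- 17119  # unique genes in the dataset (assumed urn size)
de        <- 86     # differentially expressed genes
signature <- 510    # signature genes, ASSUMED to all be in the universe
overlap   <- 9      # DE genes that are also in the signature

# 2x2 table: rows = DE yes/no, columns = in signature yes/no.
tab <- matrix(c(overlap,      signature - overlap,
                de - overlap, universe - de - signature + overlap),
              nrow = 2,
              dimnames = list(DE = c("yes", "no"),
                              signature = c("yes", "no")))

# One-sided test: is the overlap larger than expected by chance?
fisher_p <- fisher.test(tab, alternative = "greater")$p.value

# Expected counts under independence. The DE-and-signature cell is well
# below 5, so the chi-squared approximation is rough for these data
# (chisq.test() warns about this, hence suppressWarnings).
expected <- suppressWarnings(chisq.test(tab))$expected
```

With an expected overlap this small, the exact test is the safer of the
two. Changing universe or signature shows how sensitive the result is
to the choice of urn.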
> 2) Or do I have to generate a gene set based on the stem cell signature
> and go through GSEA algorithms to calculate enrichment and
> significance?
These aren't mutually exclusive, and sure -- if you have a "signature
set", why not add it to the pool you would compare against with GSEA and
let it rip.
The difference here is that you will need the expression values for your
genes, and not just a list of DE genes, for this to work (it wasn't
clear to me if you had that -- it's also not clear if your expression
data or the gene set come from different arrays: mixing expression from
different platforms is tricky).
HTH,
-steve
--
Steve Lianoglou
Graduate Student: Computational Systems Biology
| Memorial Sloan-Kettering Cancer Center
| Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact
On 15.08.2012 14:51, Aliaksei Holik wrote:
> 1) If there's an accepted way to calculate a p value using these data.
> For instance, could I run something like a chi-squared test? E.g. stem
> cell specific genes represent 510/20819 = 2.4% of the total dataset. So
> the expected number of stem cell genes among my DE genes would be
> 86 x 2.4% = 2. So my chi-squared test would be based on 9 observed vs 2
> expected.
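Hypergeometric test?
> # P(overlap >= 9): q = 9 - 1, m = 86 DE genes, n = 17119 - 86 other
> # genes, k = 510 signature genes drawn
> phyper(9 - 1, 86, 17119 - 86, 510, lower.tail = FALSE)
[1] 0.001035456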
For the total number of genes I used your lower estimate (17,119) to be
conservative. To be completely correct, I think you would need to remove
any of the 510 genes that are not in your 17,119-gene dataset. That will
only lower the P value, though (as they cannot be called DE if they are
not in your dataset), and it is already 'significant' by most journals'
standards.
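To illustrate the point about the signature size, a small sketch: the
same phyper() call evaluated for smaller numbers of signature genes
present in the dataset. The values 400 and 300 are hypothetical; the
real count would come from intersecting the signature with the 17119
genes in the dataset.

```r
universe <- 17119  # unique genes in the dataset
de       <- 86     # differentially expressed genes
overlap  <- 9      # DE genes found in the signature

# Signature genes assumed present in the dataset.
# 510 is the figure from the thread; 400 and 300 are hypothetical.
sig_in_dataset <- c(510, 400, 300)

# P(overlap >= 9) for each assumed signature size (phyper is vectorised).
p <- phyper(overlap - 1, de, universe - de, sig_in_dataset,
            lower.tail = FALSE)
names(p) <- sig_in_dataset
print(p)
```

The P value shrinks as fewer signature genes are assumed to be in the
dataset, which is why using all 510 is the conservative choice.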
--
Alex Gutteridge
Hi Aliaksei,
I will ask two questions before I give any suggestions, since I am
trying to work out which gene set testing methods suit your case.
First, based on the biology, would you treat the stem cell signature
genes or the differentially expressed genes as the gene set, if you had
to choose one?
Second, do you have the expression data for the two datasets? From your
latest email, it seems you may have the expression data.
As you may know, in your case we need a gene set from one dataset
and expression data from another study to do a gene set test.
If we take the stem cell signature genes as the gene set, we will need
the expression data from your study. With them, our gene set
testing methods "ROAST", "CAMERA" and "ROMER" in limma can work well.
Which one to use depends on your statistical hypothesis. I would
suggest starting with ROAST. The R code for ROAST
should be straightforward. I will be happy to help if you have
questions about using it.
"ROAST: rotation gene set tests for complex microarray experiments"
On the other hand, if you don't have (or don't want to use) any
expression data at the moment, do you have the genome-wide t statistics
(or log fold changes) from your analysis? If we still
take the stem cell signature genes as the gene set, you can use
"geneSetTest" (usually a rank-based p value) in limma to do the test
with the genome-wide t statistics or log fold changes from your own
study. We mentioned in our "CAMERA" paper that this test may be a bit
optimistic. You will still be able to draw a conclusion if you see a
very significant p value.
"Camera: a competitive gene set test accounting for inter-gene
correlation"
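A sketch of the geneSetTest route, assuming limma is installed and that
genome-wide t statistics are available as a named vector. Everything
here is simulated for illustration: stat, signature and the artificial
shift are not from the thread. If limma is missing, the sketch falls
back to base R's wilcox.test(), a related rank-sum test and not the
same procedure.

```r
set.seed(1)

# Simulated genome-wide statistics (stand-in for real t statistics).
stat <- rnorm(1000)
names(stat) <- paste0("gene", seq_along(stat))

# Hypothetical signature: the first 50 genes, shifted upwards to mimic
# a gene set that tends to be up-regulated.
signature <- paste0("gene", 1:50)
stat[signature] <- stat[signature] + 1

# Logical index marking which genes belong to the set.
in_set <- names(stat) %in% signature

if (requireNamespace("limma", quietly = TRUE)) {
  # Rank-based competitive test: do set genes rank higher than the rest?
  p <- limma::geneSetTest(index = in_set, statistics = stat,
                          alternative = "up")
} else {
  # Fallback (assumption: limma not installed): plain rank-sum test.
  p <- wilcox.test(stat[in_set], stat[!in_set],
                   alternative = "greater")$p.value
}
print(p)
```

No DE cutoff is needed here: the test uses the statistics for every
gene, and only the signature has to be defined as a list.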
The above gene set tests (like GSEA) only require one gene
set, and they don't need a cutoff to make the other gene list.
Otherwise, if you don't even have the genome-wide t statistics or
log fold changes, the hypergeometric test or Fisher's test might be the
only options, as others suggested. The R function "phyper" may help.
Di
----
Di Wu
Postdoctoral fellow
Harvard University, Statistics Department
Harvard Medical School
Science Center, 1 Oxford Street, Cambridge, MA 02138-2901 USA
________________________________________
From: bioconductor-bounces@r-project.org
[bioconductor-bounces@r-project.org] On Behalf Of Aliaksei Holik
[salvador@bio.bsu.by]
Sent: Wednesday, August 15, 2012 9:51 AM
Cc: bioconductor at r-project.org
Subject: [BioC] Gene enrichment question