Hello,
First off, I just want to say I've never used DESeq2 before and I'm new to R. I have a counts.htseq file that I created with none of the mentioned tools. I simply used bash to aggregate the gene counts of each of my samples into one file, which I've called counts.htseq.
Now, I thought it would be a breeze to run DESeq2, but the first thing I noticed, before even running the first line of code, is that I need a sample information table or "coldata". The documentation does not explain what that means or how I can generate one applicable to my counts file.
So, what is this "coldata" object, what kind of sample information is it supposed to contain, and how do I make it? The documentation assumes that this is clear, but it's not.
My counts file has 118 samples and read counts for thousands of genes. Please see the image of counts.htseq2 below. I'd appreciate any help in this regard.
Hi Michael,
Thanks for the reply. I've looked at the vignette, but it's still not clear to me. It puts a lot of emphasis on using SummarizedExperiment objects, which apparently work with the colData function. However, I'm using a counts file, which I'm reading with "read.table" in R. The documentation says it's possible to use a counts matrix or an htseq_counts_file, but it doesn't say how I'm supposed to generate a coldata file from that. When I try coldata <- colData(counts_file), I just get an error. Am I supposed to create this coldata file myself instead? If so, what do I need to provide? I'm trying to identify differential gene expression between samples that are sequenced from tumors and samples sequenced from culture.
Quoting from the link I sent:
“However, when you work with your own data, you will have to add the pertinent sample / phenotypic information for the experiment at this stage. We highly recommend keeping this information in a comma-separated value (CSV) or tab-separated value (TSV) file, which can be exported from an Excel spreadsheet, and then assign this to the colData slot, making sure that the rows correspond to the columns of the SummarizedExperiment.”
Thank you, Michael. I really appreciate your help. It sounds like for my case, I'd only have to include a second column in my 'coldata' listing my three conditions 'tumor', 'culture', 'pdx' so that it corresponds to the samples, correct? Also, just to be sure, does it matter if these conditions are not sorted, as long as they are listed in the order of the columns? I ask because my samples are listed alphabetically, so the conditions are interleaved (tumor, pdx, pdx, culture, culture, tumor...).
The only thing that matters is that each row of colData matches the corresponding column of the counts: the first row corresponds to the first column, the second row to the second column, and so on. The conditions themselves do not need to be sorted.
We say as much in the vignette text, quoted here:
“It is absolutely critical that the columns of the count matrix and the rows of the column data (information about samples) are in the same order. DESeq2 will not make guesses as to which column of the count matrix belongs to which row of the column data, these must be provided to DESeq2 already in consistent order.”
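To make the ordering requirement concrete, here is a minimal sketch in base R. Everything in it is made up for illustration: the sample names (s01 to s06), the toy counts, and the hand-written sample table are assumptions, not the poster's actual data. It builds a coldata data frame with a single condition column, reorders its rows to match the columns of the count matrix, and checks that the two agree before handing them to DESeq2.

```r
# Toy count matrix standing in for counts.htseq: genes in rows, samples in columns.
# Sample names and values are invented for this sketch.
set.seed(1)
samples <- c("s01", "s02", "s03", "s04", "s05", "s06")
counts <- matrix(rpois(60, lambda = 50), nrow = 10,
                 dimnames = list(paste0("gene", 1:10), samples))

# With a real file, something like the following might replace the toy matrix
# (assumes a header row of sample names and gene IDs in the first column):
# counts <- as.matrix(read.table("counts.htseq", header = TRUE, row.names = 1))

# Hand-written sample table, e.g. exported from a spreadsheet.
# Its rows are deliberately NOT in the same order as the count columns.
sample_info <- data.frame(
  sample    = c("s03", "s01", "s02", "s06", "s04", "s05"),
  condition = c("pdx", "tumor", "pdx", "tumor", "culture", "culture"),
  stringsAsFactors = FALSE
)

# coldata: one row per sample, row names are the sample names,
# and the condition is stored as a factor.
coldata <- data.frame(
  condition = factor(sample_info$condition,
                     levels = c("culture", "pdx", "tumor")),
  row.names = sample_info$sample
)

# Every count column must have exactly one row in coldata.
stopifnot(setequal(rownames(coldata), colnames(counts)))

# Reorder the rows of coldata to follow the columns of counts,
# then verify the order explicitly.
coldata <- coldata[colnames(counts), , drop = FALSE]
stopifnot(identical(rownames(coldata), colnames(counts)))

# Only attempt the DESeq2 step if the package is installed.
if (requireNamespace("DESeq2", quietly = TRUE)) {
  dds <- DESeq2::DESeqDataSetFromMatrix(countData = counts,
                                        colData   = coldata,
                                        design    = ~ condition)
}
```

The conditions end up interleaved in coldata (tumor, pdx, pdx, culture, ...), which is fine. What matters is that the final identical() check passes, so row i of coldata describes column i of counts.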