What is colData? How do I make one?
mjrarcher ▴ 10
@mjrarcher-18313

Hello, 

First off, I just want to say I've never used DESeq2 before and I'm new to R. I have a counts.htseq file, which I created without using any of the mentioned tools. I simply used bash to aggregate the gene counts of each of my samples into one file, which I've called counts.htseq.

Now, I thought it would be a breeze to run DESeq2, but the first thing I noticed, before even running the first line of code, is that I need a sample information table, or "coldata". The documentation does not explain what that means or how I can generate one applicable to my counts file.

So, what is this "coldata" object, what kind of sample information is it supposed to contain, and how do I make it? The documentation assumes that this is clear, but it's not.

My counts file has 118 samples and expression values (read counts) for thousands of genes. Please see the image of counts.htseq2 below. I'd appreciate any help in this regard.

[Image: screenshot of the counts file, hosted on Imgur]

@mikelove
United States

In the DESeq2 vignette we describe colData as a table of sample information.

The vignette has lots of information, but if you’re brand new to RNA-seq analysis we also recommend reading the workflow, which goes at a slower pace. See for example this section:

http://master.bioconductor.org/packages/release/workflows/vignettes/rnaseqGene/inst/doc/rnaseqGene.html#the-deseqdataset-object-sample-information-and-the-design-formula


Hi Michael,

Thanks for the reply. I've looked at the vignette, but it's still not clear to me. It puts a lot of emphasis on SummarizedExperiment objects, which apparently work with the colData function. However, I'm using a counts file, which I'm reading with "read.table" in R. The documentation says it's possible to use a counts matrix or an htseq_counts_file, but it doesn't say how I'm supposed to generate a coldata file from that. When I try coldata <- colData(counts_file), I just get an error. Am I supposed to create this coldata file myself instead? If so, what do I need to provide? I'm trying to identify differential gene expression between samples that are sequenced from tumors and samples sequenced from culture.


Quoting from the link I sent

“However, when you work with your own data, you will have to add the pertinent sample / phenotypic information for the experiment at this stage. We highly recommend keeping this information in a comma-separated value (CSV) or tab-separated value (TSV) file, which can be exported from an Excel spreadsheet, and then assign this to the colData slot, making sure that the rows correspond to the columns of the SummarizedExperiment.”
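As a rough illustration of the CSV approach in that quote, here is a minimal sketch in base R. The sample IDs, the column name `condition`, and the temporary file standing in for a spreadsheet export are all made up for the example; your own file would list your real samples.

```r
# Hypothetical sample sheet. In practice this would be a CSV exported from a
# spreadsheet; a temporary file is written here only to keep the example
# self-contained.
sample_csv <- tempfile(fileext = ".csv")
writeLines(c(
  "sample,condition",
  "S01,tumor",
  "S02,pdx",
  "S03,culture",
  "S04,tumor"
), sample_csv)

# Read the table, using the first column (sample IDs) as row names.
coldata <- read.csv(sample_csv, row.names = 1, stringsAsFactors = FALSE)

# Store the experimental condition as a factor.
coldata$condition <- factor(coldata$condition)

print(coldata)
```

The result is an ordinary data frame with one row per sample, which is the shape a sample information table needs to have.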


Thank you, Michael. I really appreciate your help. It sounds like, in my case, I'd only have to include a second column in my 'coldata' listing my three conditions ('tumor', 'culture', 'pdx') so that it corresponds to the samples, correct? Also, just to be sure, does it matter if these conditions are not sorted, as long as they are listed in the order of the columns? I ask because my samples are listed alphabetically, so the conditions are interspersed (tumor, pdx, pdx, culture, culture, tumor, ...).


The only thing that matters is that each row of colData matches the corresponding column of the counts: the first row corresponds to the first column, the second row to the second column, and so on.

We say as much in the vignette text, quoted here:

“It is absolutely critical that the columns of the count matrix and the rows of the column data (information about samples) are in the same order. DESeq2 will not make guesses as to which column of the count matrix belongs to which row of the column data, these must be provided to DESeq2 already in consistent order.”
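One way to check and enforce that ordering is sketched below in base R, assuming the sample table's row names and the count matrix's column names are the same sample IDs. The counts and sample names are toy values invented for the example.

```r
# Toy count matrix: 3 genes x 4 samples (made-up numbers).
counts <- matrix(
  c(10L, 0L, 5L, 3L, 8L, 1L, 7L, 2L, 9L, 4L, 6L, 0L),
  nrow = 3,
  dimnames = list(c("geneA", "geneB", "geneC"), c("S01", "S02", "S03", "S04"))
)

# Sample table deliberately listed in a different order than the count columns.
coldata <- data.frame(
  condition = factor(c("culture", "tumor", "pdx", "tumor")),
  row.names = c("S03", "S04", "S02", "S01")
)

# First confirm both objects describe the same set of samples.
stopifnot(setequal(rownames(coldata), colnames(counts)))

# Reorder the rows of the sample table to follow the count columns.
coldata <- coldata[colnames(counts), , drop = FALSE]

# Row i of coldata now describes column i of counts.
stopifnot(identical(rownames(coldata), colnames(counts)))
```

Matching by name rather than by position means the conditions do not need to be sorted; they only need to end up next to the right sample.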

 

@ryan-c-thompson-5618
Icahn School of Medicine at Mount Sinai…

The DESeqDataSet used by DESeq2 is a subclass of SummarizedExperiment, which is what provides rowData and colData. You should read more about SummarizedExperiment objects here: https://www.bioconductor.org/packages/devel/bioc/vignettes/SummarizedExperiment/inst/doc/SummarizedExperiment.html

Briefly, colData is a data frame containing metadata about each sample. It should contain a sample identifier as well as any relevant experimental factors (e.g. treatment/control, cell type, tissue, etc.).
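For a count table read with read.table, such a data frame can also be built by hand. The sketch below assumes the counts file has a header row of sample names and gene IDs in the first column, which may not match the actual file; the inline text stands in for that file, and the condition labels are invented.

```r
# Stand-in for something like
#   read.table("counts.htseq", header = TRUE, row.names = 1)
# The layout (header row, gene IDs in the first column) is an assumption.
counts_txt <- "gene S01 S02 S03 S04
geneA 10 3 7 4
geneB 0 8 2 6
geneC 5 1 9 0"
counts <- as.matrix(read.table(text = counts_txt, header = TRUE, row.names = 1))

# One row per sample, in the same order as the count columns. The condition
# labels must be supplied by the analyst; these are made up.
coldata <- data.frame(
  sample = colnames(counts),
  condition = factor(c("tumor", "pdx", "culture", "tumor")),
  row.names = colnames(counts)
)

print(coldata)
```

Because the sample identifiers are taken directly from colnames(counts), the rows of the table are in the same order as the count columns by construction.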


Hi Ryan,

But I'm not using SummarizedExperiment objects. I already have a counts file, and according to the manual you can use DESeq2 with any of a few different input options, but they all have colData in common.

swbarnes2 ★ 1.4k
@swbarnes2-14086
San Diego

Since you are new, I strongly recommend that you find a tutorial with example data and walk through it, stopping to examine what you've got at every step, so you understand what's going on. Walk through a few different tutorials, with their data and with yours.

But yes, you need colData. That's the part where you tell the software which samples are controls and which ones aren't, among other things.
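To show where that sample table ends up, here is a hedged sketch of passing a count matrix and a matching table to DESeq2's DESeqDataSetFromMatrix. The counts, sample names, and the `condition` column are toy values, and the call is guarded so the snippet still runs when DESeq2 is not installed.

```r
# Toy inputs (made-up numbers): 3 genes x 4 samples, plus a sample table
# whose rows are in the same order as the count columns.
counts <- matrix(
  c(10L, 0L, 5L, 3L, 8L, 1L, 7L, 2L, 9L, 4L, 6L, 0L),
  nrow = 3,
  dimnames = list(c("geneA", "geneB", "geneC"), c("S01", "S02", "S03", "S04"))
)
coldata <- data.frame(
  condition = factor(c("tumor", "culture", "tumor", "culture")),
  row.names = colnames(counts)
)

# Build the dataset only if DESeq2 is available.
dds <- NULL
if (requireNamespace("DESeq2", quietly = TRUE)) {
  dds <- DESeq2::DESeqDataSetFromMatrix(
    countData = counts,
    colData = coldata,
    design = ~ condition  # compare samples by the 'condition' column
  )
}
```

The design formula refers to a column of the sample table by name, which is why the experimental groups have to be recorded there before any analysis can start.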
