DESeq2: add annotations (from data frame) to DESeqDataSet
3
4
Entering edit mode
@darya-vanichkina-6050
Last seen 7.9 years ago
Australia/Centenary Institute Universit…

Hi! 

Sorry if this is a very basic question, but how do I add additional information from a data frame (i.e. gene symbols, chromosomal coordinates, etc) to a DESeqDataSet which I obtained from a count table with gene names as row names?

I.e. my count table looks like this: 

 head(counttable)
  A1 A2 A3 B1 B2 B3
ENSMUSG00000000001 2153 2549 2114 1142 1454 442
ENSMUSG00000000003 0 0 0 0 0 0
ENSMUSG00000000028 297 365 338 120 181 63
ENSMUSG00000000031 6628 5779 6986 10880 18988 9827
ENSMUSG00000000037 206 183 112 125 167 30
ENSMUSG00000000049 0 0 0 0 0 0

I'm using the following code to create the object:

groups <- c(rep("A", 3), rep("B", 3))
dds <- DESeqDataSetFromMatrix(countData = counttable, colData = as.data.frame(groups) ,design = ~ groups)

I've used biomaRt to query ensembl and obtain an annotation data frame which has MGI gene symbols and other information I'm interested in:

head(annotation)

     ensembl_gene_id mgi_symbol   gene_biotype
1 ENSMUSG00000064336      mt-Tf        Mt_tRNA
2 ENSMUSG00000064337    mt-Rnr1        Mt_rRNA
3 ENSMUSG00000064338      mt-Tv        Mt_tRNA
4 ENSMUSG00000064339    mt-Rnr2        Mt_rRNA
5 ENSMUSG00000064340     mt-Tl1        Mt_tRNA
6 ENSMUSG00000064341     mt-Nd1 protein_coding

What I am trying to do (in the end) is to add this information to the DESeqDataSet so that (for example)

- when I plot the heatmap my gene names are MGI symbols, not ensembl IDs

- when I use ReportingTools to generate the report, I have human - readable gene names and biotype classes in the output table as well

Thanks in advance !!!

 


R version 3.1.1 (2014-07-10)
Platform: x86_64-apple-darwin13.1.0 (64-bit)

locale:
[1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8

attached base packages:
[1] parallel  grid      stats     graphics  grDevices datasets  utils     methods  
[9] base     

other attached packages:
 [1] gplots_2.14.2             RColorBrewer_1.0-5        Biobase_2.24.0           
 [4] pasilla_0.4.0             BiocInstaller_1.14.3      DESeq2_1.4.5             
 [7] RcppArmadillo_0.4.450.1.0 Rcpp_0.11.3               GenomicRanges_1.16.4     
[10] GenomeInfoDb_1.0.2        IRanges_1.22.10           BiocGenerics_0.10.0      
[13] gridExtra_0.9.1           ggplot2_1.0.0             edgeR_3.6.8              
[16] limma_3.20.9              biomaRt_2.20.0           

loaded via a namespace (and not attached):
 [1] annotate_1.42.1      AnnotationDbi_1.26.1 bitops_1.0-6         caTools_1.17.1      
 [5] colorspace_1.2-4     DBI_0.3.1            DESeq_1.16.0         digest_0.6.4        
 [9] gdata_2.13.3         genefilter_1.46.1    geneplotter_1.42.0   gtable_0.1.2        
[13] gtools_3.4.1         KernSmooth_2.23-13   lattice_0.20-29      locfit_1.5-9.1      
[17] MASS_7.3-35          munsell_0.4.2        plyr_1.8.1           proto_0.3-10        
[21] RCurl_1.95-4.3       reshape2_1.4         RSQLite_1.0.0        scales_0.2.4        
[25] splines_3.1.1        stats4_3.1.1         stringr_0.6.2        survival_2.37-7     
[29] tools_3.1.1          XML_3.98-1.1         xtable_1.7-4         XVector_0.4.0       
deseq2 annotation granges • 21k views
ADD COMMENT
7
Entering edit mode
@mikelove
Last seen 18 hours ago
United States

hi Darya,

DESeqDataSet is a subclass of SummarizedExperiment, and you can add to the metadata columns of the rowData with code like

mcols(dds) <- cbind(mcols(dds), newMcols)

However, you have to make sure that these new columns line up row for row with the DESeqDataSet:

all(rownames(dds) == annotation$ensembl_gene_id)

If this is not the case you need to reorder the annotation data.frame using the match() function, so that the Ensembl IDs match up, then check again that everything is lined up:

annotation <- annotation[match(rownames(dds), annotation$ensembl_gene_id),]
all(rownames(dds) == annotation$ensembl_gene_id)

The heatmap.2 function looks for the rownames of the matrix you give it. So you could assign the mgi_symbols as the rownames of the DESeqDataSet, but you need to make sure you have unique mgi_symbols for every ensembl_gene_id.

sum(is.na(annotation$mgi_symbol))
sum(duplicated(annotation$mgi_symbol))

If these sums above are both 0, then:

rownames(dds) <- mcols(dds)$mgi_symbol

should work.

If these sums are not 0, you can manually supply the row labels to the heatmap.2() function using the "labRow" argument. See ?heatmap.2

You can add this mgi symbol column to the results table like so:

res <- results(dds)
res$mgi_symbol <- mcols(dds)$mgi_symbol

I don't know the details on how to tell ReportingTools which columns to output into the final report. The ReportingTools vignette might have some more information.

ADD COMMENT
1
Entering edit mode

Thanks for getting back to me so quickly!

Unfortunately, the annotation (gencode M3) is not as clean as one would like, so 

sum(duplicated(annotation$mgi_symbol)) # corrected your code here: did you mean this?
is non-zero.

To get around this I followed your advice (posting code here in case someone else has this question):

annotation <- annotation[match(rownames(dds), annotation$ensembl_gene_id),]
all(rownames(dds) == annotation$ensembl_gene_id)
TRUE

sum(is.na(annotation$mgi_symbol))
0

sum(duplicated(annotation$mgi_symbol)) 
2680

Many of these are have alternative transcripts annotated as separate genes:

ENSMUSG00000062078 and ENSMUSG00000044407 Qk

ENSMUSG00000067798, ENSMUSG00000040003, and ENSMUSG00000073174 - Magi2 PCG and ncRNA

So I shouldn't be ignoring them. For the heat map, therefor, I can use 

heatmap.2(assay(vsd)[select,], col = hmcol, Rowv = FALSE, Colv = FALSE, scale="none",dendrogram="none", trace="none", margin=c(10, 6), labRow = annotation[select,"mgi_symbol"])

For exporting to a report, I just merged my differential gene list (as a data frame) with the annotation data frame and exported the merged data frame instead of the DESeq2 object.

ADD REPLY
1
Entering edit mode

This is a pretty old post, but I came across this trying to do the same. And then I have found that the library scuttle, with the function uniquifyFeatureNames is perfect to solve the problem of non-unique/missing gene names. I though I would add this here for other people that comes across the same problem.

PS: Even though the library is for scRNAseq, it does not really matter, as the function takes as input vectors of gene names and could be used for anything.

ADD REPLY
0
Entering edit mode

it would be better to start a new question and link this post in it to get an answer.

ADD REPLY
2
Entering edit mode

This is not a question. It is a solution to the problem stated in the comment I was replying to.

ADD REPLY
0
Entering edit mode

Hi Michael,

I have the same question on the project I am involved in. I tried the line of code you mentioned: 

all(rownames(dds) == annotation$ensembl_gene_id)

But instead of'TRUE', I got 'NA ' as my answer. So I tried 

all(rownames(dds) == annotation$ensembl_gene_id,na.rm = TRUE)

which gave me TRUE as the answer. But when I tried

sum(is.na(rownames(dds))),

the answer is 0. I am not able to figure out what the difference is. 

Also, on a slightly tangential track, most of the genes in the biomaRt database are missing mgi_symbols. I am using the Rattus norvegicus annotation dataset from biomaRt. Is there an alternative to this?

ADD REPLY
0
Entering edit mode

Hi Michael,

I have a question about checking order of rows between the columns. My dds and lengthData objects both have the same number of rows (lengthData is a data frame with only one column named lengthData):

> nrow(dds)
[1] 16548
> nrow(lengthData)
[1] 16548

And I was pretty sure they had the same row order:

> head(rownames(dds))
[1] "ML000110a" "ML000111a" "ML000112a" "ML000113a" "ML000114a" "ML000115a"
> head(rownames(lengthData))
[1] "ML000110a" "ML000111a" "ML000112a" "ML000113a" "ML000114a" "ML000115a"
> tail(rownames(dds))
[1] "ML50541a" "ML50721a" "ML50771a" "ML50791a" "ML50792a" "ML50851a"
> tail(rownames(lengthData))
[1] "ML50541a" "ML50721a" "ML50771a" "ML50791a" "ML50792a" "ML50851a"

But this returns FALSE:

> all(rownames(dds) == lengthData$lengthData)
[1] FALSE

Still this works: mcols(dds) <- cbind(mcols(dds), lengthData) and I am able to use the fpkm() function after renaming lengthData column to basepairs.

How can I be sure that lengthData and dds do not have the same row order?

 

ADD REPLY
0
Entering edit mode

What about:

all.equal(rownames(dds), rownames(lengthData))
ADD REPLY
0
Entering edit mode

TRUE!
Thanks for showing me `all.equal()`

Jon
 

ADD REPLY
0
Entering edit mode

HI Michael,

If your time allows, Can you please look into this problem? Add new columns to DESeqResults from text file by matching rownames.

ADD REPLY
0
Entering edit mode

Hi Micheal

I think the gene Symbol when we use DESeqDataSetFromMatrix, it is somehow easy compare the unclear way how to deal with the SummarizedExperiment object

when we construct the count matrix from featcherCounts we add the argument (GTF.attrType="gene_name")

I would like you to answer our questions in clear way how to deal with TxDb object and the GRangeList and get the genes symbol as rownames in the assay of the RangedSummarizedExperiment object.

 

I will write now a separate question hoping to get an answer or clearer vignette concerning SummarizedExperiment object.

 

Thanks

ADD REPLY
0
Entering edit mode

Hi Michael, Is it possible to add column directly to metadata in SummarizedExperiment? I tried to add column named Other to metadata and later fill it with my additional data

experimentSummary$rnaseq = cbind(mcols(experimentSummary$rnaseq),"Other")

Which seem like removing metadata and only "Other" exists, original metadata is lost

I tried also

experimentSummary$rnaseq = cbind(colData(experimentSummary$rnaseq),"Other")

which seems to work inspecting the object with View(experimentSummary) but there is something wrong with format of the metadata. Also column name is "Other" with quotation marks and all rows in that column are filled with "Other" string.

ADD REPLY
0
Entering edit mode
bioprog • 0
@bioprog-7204
Last seen 9.0 years ago

I tried the above steps but i cant get it to work to add my features. Here is what I tried:

dds <- DESeqDataSetFromMatrix(countData = countData, colData = colData, design = ~ condition)
dds$condition <- relevel(dds$condition, "WT")
dds <- DESeq(dds)

#Add Feature data
featureData <-read.delim(file="FeatureData.txt", sep="\t", header=T)
all(rownames(dds) ==featureData$GeneID) #to confirm that they are in the same order
-->TRUE
mcols(dds) <- cbind(mcols(dds), featureData)
names(mcols(dds)) #I see the new columns added at the end

res <- results(dos) #When i do this, i get the following error:

Error in results(dds) : 
  cannot find results columns in object, first call DESeq, nbinomWaldTest, or nbinomLRT

If I do NOT add the feature, i dont get an error. hence, how can I run this with the added features, and how can i output a table with the added features include in it?

Further, how can I output a matrix with the normalized counts of each sample along with the log2fold change and p-value results, + added features? 

 

 

 

ADD COMMENT
0
Entering edit mode

hi, 

Can you confirm that you are using DESeq2 version 1.10?

packageVersion("DESeq2)
ADD REPLY
0
Entering edit mode

No, i am using 1.6.3 apparently when checking by:

packageVersion("DESeq2")
[1] ‘1.6.3’

To try to get 1.10, i first removed DESeq2 using remove.packages('DESeq2')

then

source("http://bioconductor.org/biocLite.R")
biocLite('DESeq2')

then reloaded R and the DESeq2, but i still got 1.6.3

Here's my R version:

"R version 3.1.2 (2014-10-31)"

I then downloaded the .tar file and i think i got it as 1.10.1 

packageVersion("DESeq2")
[1] ‘1.10.1’

However, DESEq2 wouldnt work then. I get:

Error: package ‘SummarizedExperiment’ required by ‘DESeq2’ could not be found << this is not available for 3.1 version of R
In addition: Warning message:
package ‘DESeq2’ was built under R version 3.2.3 

Unfortunately, once I upgrade to 3.2.3, I lose most of my working R packages so ih ad to downgrade to 3.1

Is that the right package for the above steps to work?

The manual states the following step for features:

mcols(dds) <- DataFrame(mcols(dds), featureData), which didn't work for me. 

Why do you prefer cbind? And does the mcols(dds) step automatically merge data based on correct ID irrespective if it is in same order?

I hope you can also advise whether there is a way to output a whole dataframe/matrix with the counts for each sample, as well as the added features, log2 fold, p value and the adjusted p value. It would be quite useful for looking into specific counts.

ADD REPLY
0
Entering edit mode
Check this page: http://bioconductor.org/install You will need to upgrade to the latest version of R, and update all Bioconductor packages. This is best practice to stay up to date because you want to benefit from the bugs fixed (including this bug about adding to mcols) and features added to all the various Bioconductor packages since October 2014.
ADD REPLY
0
Entering edit mode

thanks, i tried that and lost most of my packages unfortunately due to the fact it wont work with the upgraded version of R! Anyway, in my current version ,can i at least output a table iwht the normalized counts for each samples along with the log2 fold and p values?

ADD REPLY
1
Entering edit mode

I think this should work, anyway, you can get the normalized counts matrix with the code below and combine with whatever other vectors you like:

res.mat <- cbind(counts(dds, normalized=TRUE), res$log2FoldChange, res$padj)
ADD REPLY
0
Entering edit mode
bioprog • 0
@bioprog-7204
Last seen 9.0 years ago

 

 

.

 

ADD COMMENT

Login before adding your answer.

Traffic: 909 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6