Hi!
Sorry if this is a very basic question, but how do I add additional information from a data frame (i.e. gene symbols, chromosomal coordinates, etc) to a DESeqDataSet which I obtained from a count table with gene names as row names?
I.e. my count table looks like this:
head(counttable)
A1 | A2 | A3 | B1 | B2 | B3 | |
ENSMUSG00000000001 | 2153 | 2549 | 2114 | 1142 | 1454 | 442 |
ENSMUSG00000000003 | 0 | 0 | 0 | 0 | 0 | 0 |
ENSMUSG00000000028 | 297 | 365 | 338 | 120 | 181 | 63 |
ENSMUSG00000000031 | 6628 | 5779 | 6986 | 10880 | 18988 | 9827 |
ENSMUSG00000000037 | 206 | 183 | 112 | 125 | 167 | 30 |
ENSMUSG00000000049 | 0 | 0 | 0 | 0 | 0 | 0 |
I'm using the following code to create the object:
groups <- c(rep("A", 3), rep("B", 3)) dds <- DESeqDataSetFromMatrix(countData = counttable, colData = as.data.frame(groups) ,design = ~ groups)
I've used biomaRt to query ensembl and obtain an annotation data frame which has MGI gene symbols and other information I'm interested in:
head(annotation)
ensembl_gene_id mgi_symbol gene_biotype
1 ENSMUSG00000064336 mt-Tf Mt_tRNA
2 ENSMUSG00000064337 mt-Rnr1 Mt_rRNA
3 ENSMUSG00000064338 mt-Tv Mt_tRNA
4 ENSMUSG00000064339 mt-Rnr2 Mt_rRNA
5 ENSMUSG00000064340 mt-Tl1 Mt_tRNA
6 ENSMUSG00000064341 mt-Nd1 protein_coding
What I am trying to do (in the end) is to add this information to the DESeqDataSet so that (for example)
- when I plot the heatmap my gene names are MGI symbols, not ensembl IDs
- when I use ReportingTools to generate the report, I have human - readable gene names and biotype classes in the output table as well
Thanks in advance !!!
R version 3.1.1 (2014-07-10) Platform: x86_64-apple-darwin13.1.0 (64-bit) locale: [1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8 attached base packages: [1] parallel grid stats graphics grDevices datasets utils methods [9] base other attached packages: [1] gplots_2.14.2 RColorBrewer_1.0-5 Biobase_2.24.0 [4] pasilla_0.4.0 BiocInstaller_1.14.3 DESeq2_1.4.5 [7] RcppArmadillo_0.4.450.1.0 Rcpp_0.11.3 GenomicRanges_1.16.4 [10] GenomeInfoDb_1.0.2 IRanges_1.22.10 BiocGenerics_0.10.0 [13] gridExtra_0.9.1 ggplot2_1.0.0 edgeR_3.6.8 [16] limma_3.20.9 biomaRt_2.20.0 loaded via a namespace (and not attached): [1] annotate_1.42.1 AnnotationDbi_1.26.1 bitops_1.0-6 caTools_1.17.1 [5] colorspace_1.2-4 DBI_0.3.1 DESeq_1.16.0 digest_0.6.4 [9] gdata_2.13.3 genefilter_1.46.1 geneplotter_1.42.0 gtable_0.1.2 [13] gtools_3.4.1 KernSmooth_2.23-13 lattice_0.20-29 locfit_1.5-9.1 [17] MASS_7.3-35 munsell_0.4.2 plyr_1.8.1 proto_0.3-10 [21] RCurl_1.95-4.3 reshape2_1.4 RSQLite_1.0.0 scales_0.2.4 [25] splines_3.1.1 stats4_3.1.1 stringr_0.6.2 survival_2.37-7 [29] tools_3.1.1 XML_3.98-1.1 xtable_1.7-4 XVector_0.4.0
Thanks for getting back to me so quickly!
Unfortunately, the annotation (gencode M3) is not as clean as one would like, so
To get around this I followed your advice (posting code here in case someone else has this question):
Many of these are have alternative transcripts annotated as separate genes:
ENSMUSG00000062078 and ENSMUSG00000044407 Qk
ENSMUSG00000067798, ENSMUSG00000040003, and ENSMUSG00000073174 - Magi2 PCG and ncRNA
So I shouldn't be ignoring them. For the heat map, therefor, I can use
For exporting to a report, I just merged my differential gene list (as a data frame) with the annotation data frame and exported the merged data frame instead of the DESeq2 object.
This is a pretty old post, but I came across this trying to do the same. And then I have found that the library
scuttle
, with the function uniquifyFeatureNames is perfect to solve the problem of non-unique/missing gene names. I though I would add this here for other people that comes across the same problem.PS: Even though the library is for scRNAseq, it does not really matter, as the function takes as input vectors of gene names and could be used for anything.
it would be better to start a new question and link this post in it to get an answer.
This is not a question. It is a solution to the problem stated in the comment I was replying to.
Hi Michael,
I have the same question on the project I am involved in. I tried the line of code you mentioned:
But instead of'TRUE', I got 'NA ' as my answer. So I tried
which gave me TRUE as the answer. But when I tried
the answer is 0. I am not able to figure out what the difference is.
Also, on a slightly tangential track, most of the genes in the biomaRt database are missing mgi_symbols. I am using the Rattus norvegicus annotation dataset from biomaRt. Is there an alternative to this?
Hi Michael,
I have a question about checking order of rows between the columns. My
dds
and lengthData objects both have the same number of rows (lengthData is a data frame with only one column named lengthData):And I was pretty sure they had the same row order:
But this returns FALSE:
Still this works:
mcols(dds) <- cbind(mcols(dds), lengthData)
and I am able to use the fpkm() function after renaming lengthData column to basepairs.How can I be sure that lengthData and dds do not have the same row order?
What about:
TRUE!
Thanks for showing me `all.equal()`
Jon
HI Michael,
If your time allows, Can you please look into this problem? Add new columns to DESeqResults from text file by matching rownames.
Hi Micheal
I think the gene Symbol when we use DESeqDataSetFromMatrix, it is somehow easy compare the unclear way how to deal with the SummarizedExperiment object
when we construct the count matrix from featcherCounts we add the argument (GTF.attrType="gene_name")
I would like you to answer our questions in clear way how to deal with TxDb object and the GRangeList and get the genes symbol as rownames in the assay of the RangedSummarizedExperiment object.
I will write now a separate question hoping to get an answer or clearer vignette concerning SummarizedExperiment object.
Thanks
Hi Michael, Is it possible to add column directly to metadata in
SummarizedExperiment
? I tried to add column namedOther
to metadata and later fill it with my additional dataWhich seem like removing metadata and only "Other" exists, original metadata is lost
I tried also
which seems to work inspecting the object with
View(experimentSummary)
but there is something wrong with format of the metadata. Also column name is"Other"
with quotation marks and all rows in that column are filled with"Other"
string.