Hi Robin,
On Monday, October 21, 2013 9:23:34 AM, Robin [guest] wrote:
>
> My pipeline for adding annotations suddenly didn't work, maybe
because some packages are updated. Could you check if something is
wrong with my pipeline, I loose all the rows when I add annotations:
>
>> data <-
read.table("~/Documents/Jobb/mRNA_Ch12/TreatedControlCounts.csv",
row.names=1, sep="", header=T)
>> head(data)
> T0h T0.25h T0.5h T1h T2h T3h T6h T12h T24h T48h C0h
C0.25h C0.5h C1h C2h C3h
> NM_001001130 68 95 56 43 66 62 68 90 63 89 65
85 58 49 81 49
> NM_001001144 0 1 4 0 1 1 1 4 1 2 1
3 1 0 6 3
> NM_001001152 79 129 52 50 24 45 130 154 112 147 56
85 67 33 52 31
> NM_001001160 1 1 2 0 0 1 0 0 0 1 0
1 2 4 2 1
> NM_001001176 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0
> NM_001001177 1 3 2 3 0 1 1 0 2 0 1
6 4 1 3 0
> C6h C12h C24h C48h
> NM_001001130 76 73 48 77
> NM_001001144 0 1 2 1
> NM_001001152 93 77 65 109
> NM_001001160 0 2 1 0
> NM_001001176 0 0 0 0
> NM_001001177 0 3 0 2
>> y <- DGEList(counts=data[,1:20], genes=data[,0:1])
What are you expecting data[,0:1] to do?
Note that R is not Perl, so counting starts at 1, so R is ignoring the
zero. What you are in fact doing is putting the first column of your
'data' data.frame into the DGEList object, and since you are only
using
one column, R is dropping the dimension attribute, and thereby
reducing
to a vector. You later ask for the row.names() of this vector, and
since it has no rows, it has no row.names either.
As an example:
> x <- data.frame(letters=1:26)
> row.names(x) <- letters
> x[,0:1]
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
23 24 25
[26] 26
> row.names(x[,0:1])
>
Not that the call to row.names above didn't return anything, because
there wasn't anything to return.
In addition, you are going through a lot of unnecessary work below,
trying to get mappings to Entrez Gene from RefSeq IDs. This is
actually
a simple one-liner.
gns <- select(org.Hs.eg.db, row.names(data), c("ENTREZID","SYMBOL"),
"ACCNUM")
You may get a warning that you have some one-to-many mappings here.
How
you deal with that is up to you. I normally take the most naive
approach possible and just do
gns <- gns[!duplicated(gns[,1]),]
After which you an check that
all.equal(gns$ACCNUM, row.names(data))
and then you can do
y <- DGEList(data, genes = gns)
Best,
Jim
>> library(org.Mm.eg.db)
>> idfound <- rownames(y$genes) %in% mappedRkeys(org.Mm.egREFSEQ)
>> y <- y[idfound,]
>> dim(y)
> [1] 0 20
>> egREFSEQ <- toTable(org.Mm.egREFSEQ)
>> m <- match(rownames(y$genes), egREFSEQ$accession)
>> y$genes$EntrezGene <- egREFSEQ$gene_id[m]
>> y$genes$Symbol <- egSYMBOL$symbol[m]
>> m <- match(y$genes$EntrezGene, egSYMBOL$gene_id)
>> y$genes$Symbol <- egSYMBOL$symbol[m]
>> head(y$genes)
> [1] genes EntrezGene Symbol
> <0 rows> (or 0-length row.names)
>
> -- output of sessionInfo():
>
>> sessionInfo()
> R version 3.0.2 (2013-09-25)
> Platform: x86_64-pc-linux-gnu (64-bit)
>
> locale:
> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
LC_TIME=en_US.UTF-8
> [4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8
LC_MESSAGES=en_US.UTF-8
> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
LC_ADDRESS=C
> [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8
LC_IDENTIFICATION=C
>
> attached base packages:
> [1] parallel stats graphics grDevices utils datasets
methods base
>
> other attached packages:
> [1] edgeR_3.4.0 limma_3.18.0 GSEABase_1.24.0
> [4] annotate_1.40.0 org.Mm.eg.db_2.10.1
AnnotationForge_1.4.0
> [7] org.Hs.eg.db_2.10.1 GOstats_2.28.0 graph_1.40.0
> [10] Category_2.28.0 GO.db_2.10.1 RSQLite_0.11.4
> [13] DBI_0.2-7 Matrix_1.0-14 lattice_0.20-24
> [16] AnnotationDbi_1.24.0 Biobase_2.22.0 BiocGenerics_0.8.0
>
> loaded via a namespace (and not attached):
> [1] genefilter_1.44.0 grid_3.0.2 IRanges_1.20.0
RBGL_1.38.0
> [5] splines_3.0.2 stats4_3.0.2 survival_2.37-4
tools_3.0.2
> [9] XML_3.98-1.1 xtable_1.7-1
>
> --
> Sent via the guest posting facility at bioconductor.org.
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
http://news.gmane.org/gmane.science.biology.informatics.conductor
--
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099
Small comment on Jim's script
On Mon, Oct 21, 2013 at 10:50 AM, James W. MacDonald <jmacdon@uw.edu>
wrote:
> gns <- select(org.Hs.eg.db, row.names(data), c("ENTREZID","SYMBOL"),
> "ACCNUM")
>
> You may get a warning that you have some one-to-many mappings here.
How
> you deal with that is up to you. I normally take the most naive
approach
> possible and just do
>
> gns <- gns[!duplicated(gns[,1]),]
>
Because duplicated is FALSE only for the second occurance of an
object, ie
> duplicated(c("a", "a"))
[1] FALSE TRUE
you are effectively resolving your one-to-many mappings by including
only
the first mapping in the object. This (of course) depends on the
order of
the mappings. I may be wrong, but I don't think there is any implied
order
of the mappings, and even if there is, I am not sure that the database
interface respects this order in how it returns things.
Unfortunately, the
code above hides the fact that you are just picking a random mapping,
so I
would suggest something like
tapply(gns, gns[,1], function(xx) {
xx[ sample(1:nrow(xx), size = 1), ]
})
which makes the random nature obvious, or, in case you want to discard
stuff,
gns[ ! gns[,1] %in% gns[duplicated(gns[,1]),1],]
(code not fully tested)
Best,
Kasper
[[alternative HTML version deleted]]
Thanks guys, it worked.
My code also worked if I justed changed [,0:1] to [,1:1].
I will try your suggestion Kasper.
On Mon, Oct 21, 2013 at 5:24 PM, Kasper Daniel Hansen <
kasperdanielhansen@gmail.com> wrote:
> Small comment on Jim's script
>
>
> On Mon, Oct 21, 2013 at 10:50 AM, James W. MacDonald
<jmacdon@uw.edu>wrote:
>
>> gns <- select(org.Hs.eg.db, row.names(data),
c("ENTREZID","SYMBOL"),
>> "ACCNUM")
>>
>> You may get a warning that you have some one-to-many mappings here.
How
>> you deal with that is up to you. I normally take the most naive
approach
>> possible and just do
>>
>> gns <- gns[!duplicated(gns[,1]),]
>>
>
> Because duplicated is FALSE only for the second occurance of an
object, ie
> > duplicated(c("a", "a"))
> [1] FALSE TRUE
> you are effectively resolving your one-to-many mappings by including
only
> the first mapping in the object. This (of course) depends on the
order of
> the mappings. I may be wrong, but I don't think there is any implied
order
> of the mappings, and even if there is, I am not sure that the
database
> interface respects this order in how it returns things.
Unfortunately, the
> code above hides the fact that you are just picking a random
mapping, so I
> would suggest something like
>
> tapply(gns, gns[,1], function(xx) {
> xx[ sample(1:nrow(xx), size = 1), ]
> })
>
> which makes the random nature obvious, or, in case you want to
discard
> stuff,
> gns[ ! gns[,1] %in% gns[duplicated(gns[,1]),1],]
>
> (code not fully tested)
>
> Best,
> Kasper
>
>
[[alternative HTML version deleted]]