Hello,
I am new to analyzing array files. I am attempting to generate a CSV
file that contains a gene symbol and RMA-processed expression data for
a set of arrays for input into an online pathway ID tool (TNBCtype,
http://cbc.mc.vanderbilt.edu/tnbc/).
My problem/question (I am not sure whether it is a real problem or
whether I misunderstand the process):
When I export the CSV file, there are duplicate entries for some
gene names (e.g. ESR1). I am under the impression that RMA with
target = 'core' summarizes at the gene level, so I am not sure why I
get duplicate entries for certain (not all) genes after writing the
expression file. I have gone through this process with some mouse
array data (Mouse Gene 1.0 ST arrays) and did not run into this
problem of duplicate gene names.
Any insights on what I might be doing incorrectly, or in understanding
the output I should expect, would be greatly appreciated.
Is averaging the values of these instances of duplicate gene names a
valid thing to do?
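To make the averaging question concrete, here is a minimal base R sketch of what collapsing duplicate symbols by their mean could look like. It runs on a made-up toy matrix standing in for exprs(rmaCore); the values, the array names, the probeset names, and every symbol other than ESR1 are hypothetical. It only illustrates the mechanics and does not claim that averaging is the appropriate choice for these arrays.

```r
# Toy stand-in for exprs(rmaCore): 4 probesets x 2 arrays (values are made up)
expr <- matrix(c(8.0, 6.0, 5.0, 7.0,
                 9.0, 7.0, 5.5, 7.5),
               nrow = 4,
               dimnames = list(c("ps1", "ps2", "ps3", "ps4"),
                               c("array1", "array2")))

# Hypothetical symbol assignment: ESR1 appears twice, one probeset is unannotated
symbol <- c("ESR1", "ESR1", "GATA3", NA)

# Which symbols occur more than once?
dupSymbols <- unique(symbol[duplicated(symbol) & !is.na(symbol)])

# Drop unannotated rows, then average the rows that share a symbol
keep <- !is.na(symbol)
sums <- rowsum(expr[keep, , drop = FALSE], group = symbol[keep])
counts <- as.vector(table(symbol[keep])[rownames(sums)])
avgExpr <- sums / counts  # one row per unique symbol
```

Here avgExpr has one row per unique symbol, and dupSymbols lists the symbols that were mapped to more than one transcript cluster.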
Thank you!
-Ed O'Donnell
postdoctoral scholar
Oregon State University
My commands (Analysis.R), run as source("Analysis.R"):
---------------------
#install packages for analysis of the human array
source("http://bioconductor.org/biocLite.R")
biocLite("hugene10sttranscriptcluster.db")
biocLite("oligo")
biocLite("annotate")
#load required packages
library(oligo)
library(hugene10sttranscriptcluster.db)
library(annotate)
#set wd to myworkingdirectory
setwd("myworkingdirectory")
#read in the raw data from the CEL files
rawData <- read.celfiles(list.celfiles())
#rma normalization
rmaCore <- rma(rawData, target = 'core')
#annotation
ID <- featureNames(rmaCore)
Symbol <- getSYMBOL(ID, "hugene10sttranscriptcluster.db")
Name <- as.character(lookUp(ID, "hugene10sttranscriptcluster.db", "GENENAME"))
#make a temporary data frame with all the identifiers...
tmpframe <- data.frame(ID = ID, Symbol = Symbol, Name = Name,
                       stringsAsFactors = FALSE)
tmpframe[tmpframe=="NA"] <- NA
#assign data frame to rma-results
fData(rmaCore) <- tmpframe
#expression table with gene name and annotation info, processed with
#sed after export to get the quotations in the right spot and remove NA
#lines
write.table(cbind(pData(featureData(rmaCore))[, "Symbol"], exprs(rmaCore)),
            file = "better_annotation.csv", quote = FALSE, sep = ",")
----------
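The NA-removal step that is currently done with sed after export could probably also be done in R before writing. The following is a rough sketch under that assumption, using made-up toy objects in place of the real annotation and expression data (all values, probeset names, array names, and the GATA3 symbol are hypothetical) and a temporary file instead of better_annotation.csv.

```r
# Toy stand-ins for the annotation and expression objects (values are made up)
symbol <- c("ESR1", NA, "GATA3")
expr <- matrix(c(8.0, 4.0, 5.0,
                 9.0, 4.5, 5.5),
               nrow = 3,
               dimnames = list(c("ps1", "ps2", "ps3"),
                               c("array1", "array2")))

# Keep only the rows that have a gene symbol
keep <- !is.na(symbol)
out <- data.frame(Symbol = symbol[keep], expr[keep, , drop = FALSE],
                  check.names = FALSE)

# Write to a temporary file; row.names = FALSE omits the probeset IDs
outFile <- tempfile(fileext = ".csv")
write.csv(out, file = outFile, row.names = FALSE, quote = FALSE)
```

With row.names = FALSE the header has one entry per column, so the symbol column gets its own name in the first line of the file.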
-- output of sessionInfo():
R version 3.0.3 (2014-03-06)
Platform: x86_64-apple-darwin10.8.0 (64-bit)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods
[8] base
other attached packages:
[1] pd.hugene.1.0.st.v1_3.8.0 gplots_2.12.1
[3] annotate_1.40.1 hugene10sttranscriptcluster.db_8.0.1
[5] org.Hs.eg.db_2.10.1 RSQLite_0.11.4
[7] DBI_0.2-7 AnnotationDbi_1.24.0
[9] limma_3.18.13 oligo_1.26.6
[11] Biostrings_2.30.1 XVector_0.2.0
[13] IRanges_1.20.7 Biobase_2.22.0
[15] oligoClasses_1.24.0 BiocGenerics_0.8.0
[17] BiocInstaller_1.12.0
loaded via a namespace (and not attached):
[1] affxparser_1.34.2 affyio_1.30.0 bit_1.1-11
[4] bitops_1.0-6 caTools_1.16 codetools_0.2-8
[7] ff_2.2-12 foreach_1.4.1 gdata_2.13.2
[10] GenomicRanges_1.14.4 gtools_3.3.1 iterators_1.0.6
[13] KernSmooth_2.23-12 preprocessCore_1.24.0 splines_3.0.3
[16] stats4_3.0.3 tcltk_3.0.3 tools_3.0.3
[19] XML_3.95-0.2 xtable_1.7-3 zlibbioc_1.8.0
--
Sent via the guest posting facility at bioconductor.org.