RPKM normalization for several samples using a for loop
1
0
Entering edit mode
@humberto_munoz-10903
Last seen 8.4 years ago

I am using a for loop to compute the RPKM values for 11 samples, however, I am getting only the gene IDs from the last sample. I would like to know what commands I am missing in the lines below. If somebody knows a more efficient way to do this task, please tell me.

Thanks

Humberto

setwd("/Users/hmunozbarona/Documents/Normalization-R")
rm(list=ls(all=TRUE)) # remove all variables
files <- dir(pattern="*\.csv$")
RPKM = NULL
for (i in 1:11)
  {
  data<-read.csv(files[[i]])
  id<- data['GeneID']
  cnts<- data['ReadCount']
  lens<- data['Length']
  y <- DGEList(genes=data.frame(gene = id,Length=lens), counts=cnts)
  RPKM[i]= data.frame(gene =id, counts=cnts, rpkm(y))
print(RPKM[i])
}
edger forloop • 2.2k views
ADD COMMENT
0
Entering edit mode

I wonder why all your samples are in separate files in the first place. Tools like featureCounts or htseq-counts collate the read counts into a matrix for you, so reading separate files of counts is generally unnecessary.

ADD REPLY
1
Entering edit mode
@gordon-smyth
Last seen 1 hour ago
WEHI, Melbourne, Australia

Well, one obvious problem is that you are treating RPKM as a list, but you haven't made it a list. You need

RPKM <- list()

instead of

RPKM <- NULL

and

RPKM[[i]] <- data.frame(etc)

instead of

RPKM[i] <- data.frame(etc)

As for being more efficient, you have posted many questions on this support site about readDGE() and rpkm(), so you already know that there is no need for a for loop. I can't imagine the code as you have written it can possibly be useful to you.

ADD COMMENT
0
Entering edit mode

In my updated code I corrected these lines below, however, I still got some error with the list RPKM.

 setwd("/Users/hmunozbarona/Documents/Normalization-R")
rm(list=ls(all=TRUE)) # remove all variables
# files <- dir(pattern="*\.csv")
print(files)
my_data <- list()
RPKM <- list()
for (i in seq_along(files)) {
  my_data[[i]] <- read.csv(file = files[i])
  id<- my_data[[i]]['GeneID']
  cnts<- my_data[[i]]['ReadCount']
  lens<- my_data[[i]]['Length']
  y <- DGEList(genes=data.frame(gene = id,Length=lens), counts=cnts)
  RPKM[[i]] <- data.frame(gene =id, rpkm(y))
}
# countDf <- data.frame(gene = id, count = cnts, length = lens)
group<- c(1,1,1,1,1,1,1,1,1,1,1)
RG<-readDGE(RPKM, columns=c(1,2), group=NULL, labels=NULL)
RG$samples
keep <-rowSums(cpm(RG)>1) >=1
RG<- RG[keep, , keep.lib.sizes=FALSE] 

rownames(design)<-colnames(RG)
rownames(design)
design
RG <- estimateDisp(RG, design, robust=TRUE)
x<-RGtagwise.dispersion)
plotBCV(RG)
fit <- glmQLFit(RG, design, robust=TRUE)
plotQLDisp(fit)
qlf <-glmQLFTest(fit, coef=4)
topTags(qlf,n=50, sort.by="logFC")
is.de <- decideTestsDGE(qlf, p.value=0.05)
summaryis.de)
plot(qlfcoefficients[,5])

The error starts in these lines:

> # countDf <- data.frame(gene = id, count = cnts, length = lens)
> group<- c(1,1,1,1,1,1,1,1,1,1,1)
> RG<-readDGE(RPKM, columns=c(1,2), group=NULL, labels=NULL)
Error in file(file, "rt") : cannot open the connection
In addition: Warning message:
In file(file, "rt") :
  cannot open file 'list(GeneID = c(641612588, 641610273, 641612823, 641611428, 641610552, 641612589, 641612814, 641611971, 641612541, 641610260, 641610167, 641610059, 641611190, 641612220, 641611972, 641610056, 641611932, 641611769, 641610514, 641612072, 641612379, 641612089, 641611662, 641612486,  [... truncated]

My 11 CVS files all have more than 2800 rows and 4 columns as follows:

GeneID  Length  ReadCount   Normalized Coverage1 * 109
641612588   2604    13466   10730.812
641610273   582 11767   41954.421
641612823   1713    10837   13127.64
ADD REPLY
0
Entering edit mode

Why would you think that it makes sense to pass RPKM values to readDGE()? Obviously readDGE() needs a vector of 11 file names to read, not a list of thousands of numerical values.

Why would you compute RPKM at all, since edgeR requires counts and not RPKM?

I just have to give up at this point, as I can't make sense of the logic of your code. I have answered your original question, and this followup question doesn't seem a logical progression from the original post.

ADD REPLY
0
Entering edit mode

 Gordon thanks for your comments. Excuse me R-ignorance in this topic, I am trying to compare the effect on the False Discovery Rate (FDR) of the normalization methods (TMM, RLE, Upper-quartile and RPKM) using these set of 11 samples. In  edgeR, I calculated easily TMM, RLE and Upper-quartile normalizations for all samples, but not RPKM.  With the lines below, I am trying to calculate the RPKM  for the 11 samples individually using the for loop. My problem is when I am trying to collate all rpkm values to perform the other functions  for the dispersion, CBV, fitting with the glmQLFit, conduct the glmQLFTest and make the plots.  If there is a better way to do it, please let me know it. I will try your suggested  tools featureCounts or htseq-counts collate the read counts into a matrix.

Humberto 

 

ADD REPLY
0
Entering edit mode

The tools featureCounts and htseq-counts are in packages not available for the R version 3.3.0 that I have.

ADD REPLY
0
Entering edit mode

So you did biocLite("featureCounts") and biocLite("htseq-counts") and they both failed. Well that's because these are not R packages. featureCounts is a function defined in the Rsubread package, which is available for R 3.3.0. And htseq-counts is part of the HTSeq software which is a Python package.

H.

ADD REPLY
0
Entering edit mode

I tried to install the Rsubread package in my R 3.3.0 version and this is what I got.

 > install.packages("Rsubread")
Warning in install.packages :
  package ‘Rsubread’ is not available (for R version 3.3.0)
ADD REPLY
0
Entering edit mode

.. which is because the function to install Bioconductor packages is biocLite rather than install.packages. The Bioconductor website explains how to install Bioconductor packages.

ADD REPLY

Login before adding your answer.

Traffic: 552 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6