Question

Getting rawdata for TMM normalization

0

Entering edit mode

humberto_munoz • 0

@humberto_munoz-10903

Last seen 8.8 years ago

I have three samples of dataset, a reference and two replicates for 1 hour and 24 hours. In each sample the gene IDs are arranged differently and the number of genes per sample are: 3598 (Reference), 3617 (1 Hour), 3621 (24 Hours). How can I use edgeR to create a raw data that contains the read counts for the three samples similarly as is presented on page 42 in the edgeRUserGuide (Updated version 2016).

Any useful suggestion will be appreciated

normalization • 2.9k views

ADD COMMENT • link 8.8 years ago humberto_munoz • 0

score 0 · Answer 1 · 2016-06-28

0

Entering edit mode

Steve Lianoglou ★ 13k

@steve-lianoglou-2771

Last seen 11 days ago

United States

You need to use the same pipeline to do the mapping/quantitation of reads for all samples first. You can then combine these easily (since the output will be consistent (same number of rows, and same order)) into a count matrix which will then be straightforward to use in downstream analyses.

ADD COMMENT • link 8.8 years ago Steve Lianoglou ★ 13k

score 0 · Answer 2 · 2016-06-28

0

Entering edit mode

Yury Bukhman ▴ 20

@yury-bukhman-3186

Last seen 8.4 years ago

Hi Humberto,

You can format your 3 datasets as data frames with columns gene_id and count. You can then use R's merge function to combine them with parameter by = gene_id. You will need to use merge twice, merging first and second datasets first, then merging the result with the third dataset. This will create a single data frame containing genes that are in common in all 3 datasets. If you want to keep all of your genes, set merge parameter all = TRUE. If a gene is missing in one of your samples, you will get NA in the corresponding count column. You can then replace those NA values by 0. See help page for merge for further details. Does this help?

Yury

ADD COMMENT • link 8.8 years ago Yury Bukhman ▴ 20

0

Entering edit mode

Technically that is what you can do.

The problem is that we don't know why the counts/results are different for these different groups of data. You don't really want to combine quantitated results using one pipeline w/ quantitated results from another, then try and do differential expression across ...

ADD REPLY • link 8.8 years ago Steve Lianoglou ★ 13k

score 0 · Answer 3 · 2016-06-28

0

Entering edit mode

humberto_munoz • 0

@humberto_munoz-10903

Last seen 8.8 years ago

Hi Yury, Thanks for your comment. I am a beginner using edgeR. I am confused trying to format the three datasets as data frame. Actually, the samples have four columns, but only I am using the first three. These are my code lines and error when I try to merge s1 and s2:

> s1<-read.csv("~/Documents/Normalization-R/TMMNormalization/Reference6803.csv")
> s2<-read.csv("~/Documents/Normalization-R/TMMNormalization/CO21Hour.csv")
> s3<-read.csv("~/Documents/Normalization-R/TMMNormalization/CO224Hours.csv")
>
> x<- data.frame(1,s1)
> y<- data.frame(1,s2)
> ## y<- data.frame(1, s2['GeneID'], count = s2['ReadsCount'], length = s2['DNALength'])
> ## z<- data.frame(1, s3['GeneID'], count = s3['ReadsCount'], length = s3['DNALength'])
>
> merge(x, y, by = GeneID)
Error in fix.by(by.x, x) : object 'GeneID' not found

ADD COMMENT • link 8.8 years ago humberto_munoz • 0

0

Entering edit mode

Try merge(x, y, by = "GeneID"). Sorry I forgot the quotes in my previous answer. Without the quotes, R thinks you are referring to an object in an environment, rather than a data frame column name.

Here is a quick example:

> x  <- data.frame(gene_id = letters[1:3], count = (1:3)*100)
> x
  gene_id count
1       a   100
2       b   200
3       c   300
> y  <- data.frame(gene_id = letters[2:4], count = (2:4)*100)
> y
  gene_id count
1       b   200
2       c   300
3       d   400
> merge(x, y, by = "gene_id")
  gene_id count.x count.y
1       b     200     200
2       c     300     300
> merge(x, y, by = "gene_id", all = TRUE)
  gene_id count.x count.y
1       a     100      NA
2       b     200     200
3       c     300     300
4       d      NA     400

I don't know what your situation is, but Steve does have a good point about the danger of combining results from different workflows, so be careful.

ADD REPLY • link 8.8 years ago Yury Bukhman ▴ 20

score 0 · Answer 4 · 2016-06-29

Steve and Yury thanks for your comments.

The 3 dataset results are RNASeq expression data from the same Genome. The first sample s1, is a reference sample, that will be compared with the other two samples with CO2 added, s2 during 1 hour and s3 with 24 hours. I am still getting error with the column definition using the data.frame function with my code as seen below.

> s1<-read.csv("~/Documents/Normalization-R/TMMNormalization/Reference6803.csv")
> s2<-read.csv("~/Documents/Normalization-R/TMMNormalization/CO21Hour.csv")
> s3<-read.csv("~/Documents/Normalization-R/TMMNormalization/CO224Hours.csv")
>
> x<- data.frame(gene_id = s1['GeneID'], count = s1['ReadsCount'], length = s1['DNALength'])
Error in `[.data.frame`(s1, "GeneID") : undefined columns selected
> x
Error: object 'x' not found
> y<- data.frame(gene_id = s2['GeneID'], count = s2['ReadsCount'], length = s2['DNALength'])
Error in `[.data.frame`(s2, "GeneID") : undefined columns selected
> y
Error: object 'y' not found
> z<- data.frame(gene_id = s3['GeneID'], count = s3['ReadsCount'], length = s3['DNALength'])
Error in `[.data.frame`(s3, "GeneID") : undefined columns selected
> z
Error: object 'z' not found
> merge(x, y, by = "gene_id")
Error in merge(x, y, by = "gene_id") : object 'x' not found

score 0 · Answer 5 · 2016-07-01

Humberto, try head(s1). What do you see?

If you haven't already done so, read the appropriate section of An Introduction to R on the R Project's web site: https://cran.r-project.org/doc/manuals/r-release/R-intro.html#The-read_002etable_0028_0029-function. Also try ?read.csv and example(read.csv).

Also, you don't need to rename GeneID to gene_id. You can merge data frames with any column names, as long as there is a matching column that you can merge on. Read merge help page, e.g. ?merge. Also try example(merge).

score 0 · Answer 6 · 2016-07-01

Yury, Thanks a lot. I followed your comments with the code lines:

> x<- data.frame(gene_id = s1['GeneID'], count = s1['ReadsCount'], length = s1['DNALength'])
> y<- data.frame(gene_id = s2['GeneID'], count = s2['ReadsCount'], length = s2['DNALength'])
> z<- data.frame(gene_id = s3['GeneID'], count = s3['ReadsCount'], length = s3['DNALength'])
>
> w<-merge(x, y, by = "GeneID", all = TRUE)
>
> View(merge(w, z, by = "GeneID", all = TRUE))

It creates a file with the 7 columns: GeneID, ReadsCount.x, DNALength.x, ReadsCount.y, DNALength.y,ReadsCount, DNALength. Let me know if there is a way to have the last two columns with ending .z. Also, how I can change NA to 0.

Again Thanks

score 0 · Answer 7 · 2016-07-01

0

Entering edit mode

Yury Bukhman ▴ 20

@yury-bukhman-3186

Last seen 8.4 years ago

You can rename data frame columns using function names or colnames. See example("names").

You can replace NA with 0 like so:

> a <- c(1:3,NA)
> a
[1]  1  2  3 NA
> a[is.na(a)] <- 0
> a
[1] 1 2 3 0

ADD COMMENT • link 8.8 years ago Yury Bukhman ▴ 20

score 0 · Answer 8 · 2016-07-05

0

Entering edit mode