Question

Can I feed TCGA normalized count data to EdgeR for differential gene expression analysis

1

Entering edit mode

ycding ▴ 10

@ycding-7496

Last seen 4.0 years ago

United States

Dear R people,

I downloaded TCGA level3 RNAseq data for 63 prostate tumor samples with conditions (age difference) I am interested in. I want to identify differential expressed genes between two age groups. I am reading the user guide for the EdgeR program and found that it takes integer count data as input. however, the level 3 TCGA data are in the following format:

Hybridization REF	TCGA-HC-8260-01A-11R-2263-07	TCGA-HC-8260-11A-01R-2263-07
gene_id	normalized_count	normalized_count
EML3\|256364	973.2368	839.2435
EML4\|27436	1135.3904	727.7384
EML5\|161436	518.8917	543.7352
EML6\|400954	102.6448	24.4287
EMP1\|2012	2250.9446	1138.2979
EMP2\|2013	2926.6373	4150.5122
so, it is normalized count.		Can I use EdgeR? thank you !!! Yuanchun Ding

EdgeR TCGA normalized count • 14k views

ADD COMMENT • link updated 9.7 years ago by alakatos ▴ 130 • written 9.7 years ago by ycding ▴ 10

Steve Lianoglou · Answer 1 · 2015-03-20

4

Entering edit mode

Stephen Piccolo ▴ 600

@stephen-piccolo-6761

Last seen 4.2 years ago

United States

One option is to use this data set that we put together: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE62944

It has raw read counts (obtained using the Rsubread package), which you can use with edgeR.

We are putting together a publication based on it, so if you end up using it, please let me know, and I'll send you the citation info.

ADD COMMENT • link 9.7 years ago Stephen Piccolo ▴ 600

1

Entering edit mode

An ExpressionSet of this data is available in AnnotationHub

library(AnnotationHub)
hub = AnnotationHub()
gse = query(hub, "GSE62944")[[1]]

(a little expensive to download the first time, but then cached locally for subsequent use). It is based on the following script, which retrieves and processes the data to a 264.5Mb ExpressionSet, ready for use (expect that the pData are character vectors instead of types)...

## Imports: GEOquery, Biobase

suppl <- GEOquery::getGEOSuppFiles("GSE62944")
setwd("GSE62944")

clinvar <- local({
    clinvarFile <- "GSE62944_TCGA_20_420_Clinical_Variables_7706_Samples.txt.gz"
    data <- scan(clinvarFile, what=character(), sep="\t", quote="")
    m <- matrix(data, 7707)
    dimnames(m) <- list(m[,1], m[1,])
    as.data.frame(m[-1, -1])
})

CancerType <-
    read.delim("GSE62944_TCGA_20_CancerType_Samples.txt.gz",
               header=FALSE, colClasses=c("character", "factor"),
               col.names=c("sample", "type"))
idx <- match(rownames(clinvar), CancerType$sample)
stopifnot(!anyNA(idx))
clinvar$CancerType <- CancerType$type[idx]

countFile <- 
    "GSM1536837_TCGA_20.Illumina.tumor_Rsubread_FeatureCounts.txt.gz"
untar("GSE62944_RAW.tar", countFile)
counts <- local({
    data <- scan(countFile, what=character(), sep="\t", quote="")
    m <- matrix(data, 7707)
    dimnames(m) <- list(m[,1], m[1,])
    m <- t(m[-1, -1])
    mode(m) <- "integer"
    m
})

eset <- Biobase::ExpressionSet(counts, AnnotatedDataFrame(clinvar))

ADD REPLY • link 9.7 years ago • updated 9.6 years ago Martin Morgan 25k

0

Entering edit mode

Without actually having a close look, that seems like a very useful resource!

ADD REPLY • link 9.7 years ago Martin Morgan 25k

0

Entering edit mode

FYI: We released an updated version of the data, which now includes 9,264 tumor samples and 741 normal samples from TCGA. The previous files are still available, so the above code should still work. The new file names can be found here: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE62944. We plan to do at least 1-2 more releases as additional, publishable data become available via TCGA.

ADD REPLY • link 9.5 years ago Stephen Piccolo ▴ 600

0

Entering edit mode

That's a great resource, Stephen. Thanks! I do have a suggestion though. I wonder if it would be possible to use Entrez Gene IDs to annotate the count tables in future releases, rather than using Gene Symbols. After all, that's what Subread internal annotation is using. I suspect you might be using some custom annotation. Still, I feel using Entrez Gene IDs might be more cross-compatible.

ADD REPLY • link 9.3 years ago Aliaksei Holik ▴ 350

0

Entering edit mode

Hi Stephen!

Would you please send me the citation info?

Thank you.

Anita

ADD REPLY • link 9.3 years ago alakatos ▴ 130

1

Entering edit mode

Our paper describing this data set is now published in Bioinformatics (http://www.ncbi.nlm.nih.gov/pubmed/26209429). It would be fantastic if you (and others) could cite this article. Thanks!

ADD REPLY • link updated 9.3 years ago by Steve Lianoglou ★ 13k • written 9.3 years ago by Stephen Piccolo ▴ 600

0

Entering edit mode

Thanks Stephen. This seems like a very promising resource. Do you have any help material as to how to load data into R as expressionSet , combine cancer and normal patients in one table and do the anaysis of cancer vs. non cancer data?

ADD REPLY • link 8.6 years ago hamda.binte.ajmal • 0

score 2 · Answer 2 · 2015-03-21

2

Entering edit mode

Gordon Smyth 51k

@gordon-smyth

Last seen 23 minutes ago

WEHI, Melbourne, Australia

Aaron is right that edgeR needs counts rather than normalized values. And the subread/featureCounts count summaries from Stephen Piccolo sound extraordinarily useful, an excellent way to go.

However let me just point that edgeR will work with non-integer counts, provided that they add up to the correct library sizes for each sample. If the TCGA "normalized counts" are in fact expected counts from RSEM, then they will work quite well with either edgeR or voom.

ADD COMMENT • link 9.7 years ago Gordon Smyth 51k

2

Entering edit mode

I feel like this comment should be pinned somehow, or made into a separate post/news item. Since analysis of count/RNAseq data has crawled itself out of the water an onto land, everyone has said that non-count-based input was a non-starter for edgeR (and other methods of similar ilk), and I believe this is the first place (to my knowledge) that it has been authoritatively stated that RSEM output actually works well (as opposed to it being "passable") with these tools.

ADD REPLY • link 9.7 years ago Steve Lianoglou ★ 13k

score 1 · Answer 3 · 2015-03-20

Normalized counts are not appropriate for edgeR. The variance in the negative binomial model depends on the mean, so the significance of any differences will depend on the absolute size of the count. This information about the absolute size is lost upon normalization. In short, if you want to use edgeR, try to get the original counts rather than transformed values that are obtained after high-level processing.

score 1 · Answer 4 · 2015-03-25

1

Entering edit mode

alakatos ▴ 130

@alakatos-6983

Last seen 5.2 years ago

United States

Yes, it does. Thank you.

I am wondering if you have a tentative schedule for the new release including the normal samples.

Thank you.

Anita

ADD COMMENT • link 9.7 years ago alakatos ▴ 130

0

Entering edit mode

Sorry for the delayed response. I didn't see your question until now. We have processed the normal samples (and about 1500 additional tumor samples). We're in the process of uploading these data to GEO and will post a comment as soon as they are available. In the meantime, feel free to contact me directly, and I can send the data to you another way.

ADD REPLY • link 9.5 years ago Stephen Piccolo ▴ 600

0

Entering edit mode

We now include normal data from TCGA in this data set. Please see comment above.

ADD REPLY • link 9.5 years ago Stephen Piccolo ▴ 600

0

Entering edit mode

Great!

Thank you very much.

Anita

ADD REPLY • link 9.4 years ago alakatos ▴ 130

score 0 · Answer 5 · 2015-03-20

0

Entering edit mode

ycding ▴ 10

@ycding-7496

Last seen 4.0 years ago

United States

Hi Stephen and Aaron, thank you so much for your quick help. I briefly checked into the link provided by Stephen, it is a very good data source, I will definitely look into it.

ADD COMMENT • link 9.7 years ago ycding ▴ 10

Martin Morgan · Answer 6 · 2015-03-25

Hello Stephen and Martin,

Thank you for making the dataset and code available.

I would like to use a subset of GSM1536837_TCGA_20.Illumina.tumor_Rsubread_TPM.txt.gz file for my analysis.

Question 1.

Would you please confirm that this modified code below is correct?

countFile <- "GSM1536837_TCGA_20.Illumina.tumor_Rsubread_TPM.txt.gz"
untar("GSE62944_RAW.tar", countFile)
counts <- local({
    data <- scan(countFile, what=character(), sep="\t", quote="")
    m <- matrix(data, 7707)
    dimnames(m) <- list(m[,1], m[1,])
    m <- t(m[-1, -1])
    mode(m) <- "float"
    m
})

Question 2.

I am puzzled by the coding system and I would like to make sure that I am using the correct labeling “normal” and “tumor” for the subset of the dataset. I am interested in the 373 SKCM subjects available in GSE62944 clinvar. I am wondering if your dataset includes only the subjects with TN (Tumor, Matched Normal) or both TN and NT (Normal, Matched Tumor).

In case both classes are included, shall I try to match the bcr_patient_uuids in order to identify them?

TGCA website barcode:

bcr_patient_uuid	bcr_patient_barcode	bcr_sample_uuid
3DD5A206-D7F3-42F1-B9CC-4B31C76D495D	TCGA-BF-A1PU	A4D739D2-63A1-41FC-82C1-F1D027DB756D

clinvar barcode:

bcr_patient_barcode	patient_id	bcr_patient_uuid
TCGA-BF-A1PU-01A-11R-A18S-07	TCGA-BF-A1PU	3DD5A206-D7F3-42F1-B9CC-4B31C76D495D

Or what would be the most straightforward way for a novice with less opportunity for mistakes?

Thank you in advance,

Anita