Dear R people,
I downloaded TCGA level 3 RNA-seq data for 63 prostate tumor samples, annotated with the condition I am interested in (age difference). I want to identify differentially expressed genes between two age groups. I am reading the user guide for edgeR and found that it takes integer count data as input. However, the level 3 TCGA data are in the following format:
Hybridization REF | TCGA-HC-8260-01A-11R-2263-07 | TCGA-HC-8260-11A-01R-2263-07
gene_id           | normalized_count             | normalized_count
EML3|256364       | 973.2368                     | 839.2435
EML4|27436        | 1135.3904                    | 727.7384
EML5|161436       | 518.8917                     | 543.7352
EML6|400954       | 102.6448                     | 24.4287
EMP1|2012         | 2250.9446                    | 1138.2979
EMP2|2013         | 2926.6373                    | 4150.5122
So these are normalized counts, not raw integer counts.

Can I use edgeR? Thank you!
Yuanchun Ding
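The distinction the question turns on, whole-number counts versus normalized values, can be checked directly on a table like the one above. The following is only a minimal base R sketch: the values are copied from the first sample column of the post, and `is_raw_count` is a hypothetical helper name, not a function from edgeR or any other package.

```r
# First sample column from the table in the post, keyed by "SYMBOL|EntrezID".
normalized <- c("EML3|256364" = 973.2368,
                "EML4|27436"  = 1135.3904,
                "EML5|161436" = 518.8917)

# Hypothetical helper: TRUE only if every value is a non-negative whole number.
is_raw_count <- function(x, tol = 1e-8) {
  all(x >= 0) && all(abs(x - round(x)) < tol)
}

is_raw_count(normalized)         # fractional values, so not raw counts
is_raw_count(c(973, 1135, 519))  # whole numbers pass the check

# Split the "SYMBOL|EntrezID" identifiers into their two parts.
ids <- do.call(rbind, strsplit(names(normalized), "|", fixed = TRUE))
colnames(ids) <- c("symbol", "entrez_id")
```

A check like this only describes the file at hand; it does not by itself say how the normalized values were produced.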
An ExpressionSet of this data is available in AnnotationHub
(a little expensive to download the first time, but then cached locally for subsequent use). It is based on the following script, which retrieves and processes the data to a 264.5Mb ExpressionSet, ready for use (except that the pData columns are character vectors rather than appropriately typed values)...
Without actually having a close look, that seems like a very useful resource!
FYI: We released an updated version of the data, which now includes 9,264 tumor samples and 741 normal samples from TCGA. The previous files are still available, so the above code should still work. The new file names can be found here: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE62944. We plan to do at least 1-2 more releases as additional, publishable data become available via TCGA.
That's a great resource, Stephen. Thanks! I do have a suggestion, though. I wonder if it would be possible to annotate the count tables with Entrez Gene IDs in future releases, rather than with gene symbols. After all, that is what Subread's internal annotation uses. I suspect you might be using some custom annotation. Still, I feel Entrez Gene IDs would be more cross-compatible.
Hi Stephen!
Would you please send me the citation info?
Thank you.
Anita
Our paper describing this data set is now published in Bioinformatics (http://www.ncbi.nlm.nih.gov/pubmed/26209429). It would be fantastic if you (and others) could cite this article. Thanks!
Thanks, Stephen. This seems like a very promising resource. Do you have any help material on how to load the data into R as an ExpressionSet, combine cancer and normal patients in one table, and do the analysis of cancer vs. non-cancer data?
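For the "combine cancer and normal patients in one table" step, one possible approach in base R is sketched below. Everything in it is an assumption for illustration: the two small matrices, their gene and sample names, and the variable names are toy stand-ins, not the actual GSE62944 objects.

```r
# Toy stand-ins for the tumor and normal count tables (hypothetical values).
tumor  <- matrix(c(10L, 0L, 25L, 12L, 3L, 30L), nrow = 3,
                 dimnames = list(c("EMP1", "EMP2", "EML4"), c("T1", "T2")))
normal <- matrix(c(8L, 5L, 40L, 9L), nrow = 2,
                 dimnames = list(c("EMP2", "EMP1"), c("N1", "N2")))

# Keep only genes present in both tables, in the same row order,
# then bind the samples together column-wise.
shared   <- intersect(rownames(tumor), rownames(normal))
combined <- cbind(tumor[shared, , drop = FALSE],
                  normal[shared, , drop = FALSE])

# One group label per column, with "normal" as the reference level.
group <- factor(rep(c("tumor", "normal"), c(ncol(tumor), ncol(normal))),
                levels = c("normal", "tumor"))
```

Indexing both tables by the shared gene names guards against rows being in a different order in the two files. With an ExpressionSet, the count matrix and sample annotation would presumably come from Biobase's `exprs()` and `pData()` instead, and a combined matrix plus a group factor is the kind of input that edgeR's `DGEList(counts = , group = )` accepts.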