News: Experimental data package 'seqc'

Wei Shi ★ 3.6k

@wei-shi-2183

Australia/Melbourne

We have created a new experimental data package called 'seqc'. It includes gene-level read count data generated by the SEQC (SEquencing Quality Control) project, which is the third stage of the well-known MAQC project (a US FDA initiative). The SEQC/MAQC-III Consortium produced benchmark RNA-seq data for the assessment of RNA sequencing technologies and data analysis methods (recently published in Nature Biotechnology: http://www.ncbi.nlm.nih.gov/pubmed/25150838).

Sequence reads were aligned to the human reference genome hg19 using the Subread aligner and were then summarized to genes using the featureCounts program. This package includes the gene-level read count data for 2,758 libraries. It can be downloaded from the following link (188MB):

http://bioconductor.org/packages/release/data/experiment/html/seqc.html
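
For readers who want to process their own libraries in a comparable way, the two steps described above (Subread alignment, then featureCounts summarization) might look roughly like the sketch below, using the Rsubread package. The function name `count_library`, the file names, and the index name are placeholders for illustration, and all options are left at their defaults, which are not necessarily the settings used to build this package.

```r
# Sketch only: assumes the Rsubread package is installed and that a Subread
# index for hg19 has already been built, for example with
#   Rsubread::buildindex(basename = "hg19_index", reference = "hg19.fa")
# "hg19_index" and "hg19.fa" are placeholder names.
count_library <- function(fastq,
                          index = "hg19_index",
                          bam = sub("\\.fastq(\\.gz)?$", ".bam", fastq)) {
  # Step 1: map the reads to the reference genome with the Subread aligner.
  Rsubread::align(index = index, readfile1 = fastq, output_file = bam)

  # Step 2: summarize mapped reads to genes with featureCounts, here using
  # the in-built hg19 annotation shipped with Rsubread.
  fc <- Rsubread::featureCounts(files = bam, annot.inbuilt = "hg19")

  # Return the gene-by-library matrix of read counts.
  fc$counts
}
```

Calling `count_library("sample.fastq.gz")` (a hypothetical file) would write `sample.bam` and return a one-column count matrix.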

In addition to the read count data, this package also includes exon-exon junction data generated for human brain reference RNA and universal human reference RNA samples. Exon-exon junctions were detected by using the Subjunc aligner.

Moreover, TaqMan RT-PCR validation data for ~1000 genes and ERCC spike-in sequence data are included in this package as well.

We hope this package is a useful resource for the community.

Wei

seqc rsubread featurecounts subjunc ercc News • 3.3k views

score 4 · Answer 1 · 2014-11-18

Thanks a lot for processing and annotating the data in the way that you have. This will be a super useful resource ... especially since I already have a need for it ;-)

I've created some helper functions that allow you to create a (semi-decently) annotated ExpressionSet from the data given some user specified criteria and put it in the gist here. Perhaps something like this would be useful to include in the package?

You would use it like so:

## Fetch all of the RefSeq data from all centers and sequencing platforms:
R> e <- seqc.eSet('gene', 'refseq')
R> head(pData(e))
  platform sample replicate lane  flowcell center
1      ILM      A         1  L01 FlowCellA    AGR
2      ILM      A         1  L01 FlowCellB    AGR
3      ILM      A         1  L02 FlowCellA    AGR
4      ILM      A         1  L02 FlowCellB    AGR
5      ILM      A         1  L03 FlowCellA    AGR
6      ILM      A         1  L03 FlowCellB    AGR

R> with(pData(e), table(platform, center))
         center
 platform AGR BGI CNL COH LIV MAY MGP NVS NWU NYU PSU SQW
      ILM 256 384 360 128   0 384   0 320   0   0   0   0
      LIF   0   0   0   0  50   0   0   0 285   0 288 288
      ROC   0   0   0   0   0   0   4   0   0   4   0   4

## Fetch just the Illumina RefSeq data from all centers:
R> ilm <- seqc.eSet('gene', 'refseq', 'ILM')
R> with(pData(ilm), table(platform, center))
         center
 platform AGR BGI CNL COH MAY NVS
      ILM 256 384 360 128 384 320

Currently I've only implemented this parsing/aggregating for gene-level features (i.e. no junction or TaqMan data), but I can add those later if you think they would be helpful to include in the package.
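
To give a rough idea of what the parsing step involves, here is a minimal base-R sketch (not the gist code itself) that builds the sample annotation shown above from names alone. It assumes dataset names of the form `<platform>_<annotation>_<feature>_<center>` and count column names of the form `<sample>_<replicate>_<lane>_<flowcell>`; that naming scheme, the toy counts, and the helper name `parse_sample_info` are all illustrative assumptions.

```r
# Toy stand-in for one dataset; the name and column names follow the
# ASSUMED naming scheme described above, and the counts are made up.
dataset_name <- "ILM_refseq_gene_AGR"
counts <- matrix(
  c(10L, 0L, 25L, 12L, 1L, 30L, 8L, 0L, 22L, 11L, 2L, 27L),
  nrow = 3,
  dimnames = list(
    c("gene1", "gene2", "gene3"),
    c("A_1_L01_FlowCellA", "A_1_L01_FlowCellB",
      "A_1_L02_FlowCellA", "B_1_L01_FlowCellA")
  )
)

# Hypothetical helper: split the dataset name and the column names on "_"
# and assemble one row of sample annotation per count column.
parse_sample_info <- function(dataset_name, column_names) {
  ds <- strsplit(dataset_name, "_", fixed = TRUE)[[1]]
  parts <- do.call(rbind, strsplit(column_names, "_", fixed = TRUE))
  data.frame(
    platform  = ds[1],
    sample    = parts[, 1],
    replicate = as.integer(parts[, 2]),
    lane      = parts[, 3],
    flowcell  = parts[, 4],
    center    = ds[4],
    row.names = column_names,
    stringsAsFactors = FALSE
  )
}

pd <- parse_sample_info(dataset_name, colnames(counts))

# The annotation rows must line up with the count columns.
stopifnot(identical(rownames(pd), colnames(counts)))

print(pd)
print(with(pd, table(platform, center)))
```

With Biobase installed, a matrix and annotation like these could then be combined via `ExpressionSet(assayData = counts, phenoData = AnnotatedDataFrame(pd))`.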

score 0 · Answer 2 · 2014-11-18

Wei Shi ★ 3.6k

Thanks for the code, Steve. I have just added the functions to the package and committed them to the svn devel repository ...

That was quick! Thanks for incorporating that ... I of course now feel compelled to round off the functionality so that one could get ExpressionSets for all of the data. I'll let you know when the gist is updated with that ...

Reply · 10.4 years ago · Steve Lianoglou ★ 13k

Happy to incorporate them when your code is updated! It would be helpful if you could provide .Rd files as well ...

Reply · 10.4 years ago · Wei Shi ★ 3.6k