Question

What are the methods to get count data per cell from single cell fastq given only R1, R2, and I1 fastq files

Matthew Thornton ▴ 380

@matthew-thornton-5564

Last seen 22 days ago

USA, Los Angeles, USC

Hello,

I am starting to process single cell RNA sequencing data, and I noticed that all of the Bioconductor tutorials for single cell (https://f1000research.com/articles/5-2122/v2) start from well-groomed data that is already in a count matrix with cells for columns and genes for rows. This is pretty far from the output of the instrument, and more should be done to explain how to get count data from the main sequencing platforms used for single cell work (Illumina and PacBio). As the output of bcl2fastq, I was given three fastq files: R1 and R2 (the paired-end reads) and an index fastq containing the barcodes that were used to multiplex the samples. After googling extensively, I found that there are not a lot of options; what I see is that people use Cell Ranger (software from 10X Genomics) to do the analysis and then export count data from there. None of this is very satisfactorily explained, despite there being excellent Bioconductor tutorials for single cell data (that all start from well-groomed count data), like: https://www.bioconductor.org/help/course-materials/2017/BioC2017/Day2/Workshops/singleCell/doc/workshop.html.

Cell Ranger uses STAR and it seems like it does more than you would want, if you intend to use the R/Bioconductor software, or process the data in a method similar to what you would do with bulk RNA-seq.

R1, R2 regular paired-end fastqs

I1 (index reads)

@K00124:391:HWNTHBBXX:3:1101:4219:1309 1:N:0:TTCCCGAT
TTCCCGAT
+
A-A<FA--
@K00124:391:HWNTHBBXX:3:1101:7101:1309 1:N:0:TTCCCGAC
TTCCCGAC
+
AAA<FF--
@K00124:391:HWNTHBBXX:3:1101:7222:1309 1:N:0:GCAGTAGC
GCAGTAGC

What is your method for getting count data given R1, R2, and I1?

What is the best way to export this count data into R? HDF5Array?? Which hdf5 files do you use from the output of cellranger count? (or aggr)

Any comments or advice are greatly appreciated, and will most likely enrich the community as 10X Genomics increases in popularity. It is not as if people aren't already trying to get help; they are just not getting much (https://www.biostars.org/p/356000/).

Thank you

fastq single cell • 6.5k views

updated 6.1 years ago by Gordon Smyth 52k • written 6.1 years ago by Matthew Thornton ▴ 380

score 4 · Answer 1 · 2019-03-09

Cell Ranger uses STAR and it seems like it does more than you would want

I would say that CellRanger does the amount of work that is needed to get a count matrix. One should not underestimate the complexity of the 10X sequencing construct, which involves at least four pieces of 10X-specific information split across each read pair:

Cell barcode
UMI
Gene sequence
Sample barcode

... not including any Illumina-related pieces. (The sample barcode is technically 10X's design, I believe, so I'm counting that above.) Any pre-processing pipeline has to do a lot of work to get to a count table, e.g., demultiplexing on the sample barcode, matching the cell barcode to the whitelist, aligning the gene sequence, and removing PCR duplicates using the UMIs. Add in the data munging and you end up with something big like CellRanger.
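To give a feel for one of these steps, here is a minimal Python sketch of matching a cell barcode to a whitelist while tolerating a single sequencing error. This is an illustration only, not CellRanger's actual algorithm: the function name `correct_barcode`, the toy whitelist, and the rule of rejecting ambiguous corrections are all assumptions made for the example.

```python
from typing import Optional, Set

BASES = "ACGT"

# Toy whitelist (hypothetical barcodes); a real whitelist holds far more entries.
WHITELIST: Set[str] = {"AAACCCAAGAAACACT", "AAACCCAAGAAACCAT"}


def correct_barcode(barcode: str, whitelist: Set[str]) -> Optional[str]:
    """Return the whitelisted barcode for `barcode`, allowing one mismatch.

    Returns None when no whitelisted barcode is within one substitution,
    or when more than one is (the correction would be ambiguous).
    """
    if barcode in whitelist:
        return barcode

    candidates = set()
    for i, observed in enumerate(barcode):
        for base in BASES:
            if base == observed:
                continue
            # Substitute one position and check against the whitelist.
            candidate = barcode[:i] + base + barcode[i + 1:]
            if candidate in whitelist:
                candidates.add(candidate)

    # Accept only an unambiguous single-mismatch correction.
    return candidates.pop() if len(candidates) == 1 else None


if __name__ == "__main__":
    print(correct_barcode("TAACCCAAGAAACACT", WHITELIST))  # one mismatch
    print(correct_barcode("AAACCCAAGAAACAAT", WHITELIST))  # ambiguous
```

A production pipeline would resolve ambiguous cases with extra information such as base qualities rather than simply discarding them, which is part of why the real software is large.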

What is your method for getting count data given R1, R2, and I1?

I just use CellRanger. It sounds like you don't want to use it, but the safest bet for pre-processing such a complex data type is to use the software developed by the same company that designs the protocol! However, if you need an R/Bioconductor solution, scPipe is a good place to start.

What is the best way to export this count data into R?

Importing CellRanger outputs is the bread and butter of DropletUtils. Note that you'll need the BioC-devel version of this package to import count tables from CellRanger version 3.

score 1 · Answer 2 · 2019-03-08

If you want a pipeline that goes from fastqs to gene counts that is less of a black box than 10xGenomics Cellranger, you can use what the McCarroll lab cooked up for Drop-seq

https://github.com/broadinstitute/Drop-seq/releases

The principle is pretty much the same: get alignments, gene assignments, cell barcodes, and UMIs all together, filter away UMI duplicates, then total everything up for each cell barcode.
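As a rough Python sketch of that last part (it is not code from the Drop-seq tools), suppose each aligned read has already been reduced to a (cell barcode, UMI, gene) triple. The function name `count_umis` and the toy records are assumptions for illustration, and only exact UMI duplicates are collapsed here.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Record = Tuple[str, str, str]  # (cell_barcode, umi, gene)


def count_umis(records: Iterable[Record]) -> Dict[str, Dict[str, int]]:
    """Count distinct UMIs per gene within each cell barcode."""
    # (cell, gene) -> set of UMIs seen; the set drops PCR duplicates.
    seen = defaultdict(set)
    for cell, umi, gene in records:
        seen[(cell, gene)].add(umi)

    # Total everything up for each cell barcode.
    counts: Dict[str, Dict[str, int]] = {}
    for (cell, gene), umis in seen.items():
        counts.setdefault(cell, {})[gene] = len(umis)
    return counts


if __name__ == "__main__":
    toy_records = [
        ("CELL1", "AAAA", "GeneA"),
        ("CELL1", "AAAA", "GeneA"),  # PCR duplicate, counted once
        ("CELL1", "CCCC", "GeneA"),
        ("CELL1", "AAAA", "GeneB"),
        ("CELL2", "GGGG", "GeneA"),
    ]
    print(count_umis(toy_records))
```

Real pipelines typically also merge UMIs that differ only by a likely sequencing error, which this sketch leaves out.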

The major difference between 10XGenomics and Drop-Seq is that 10xGenomics cell barcodes all derive from a white list, and Drop-Seq ones do not, and Drop-seq cell barcodes are prone to being short, which has to be corrected for.

(Also note that newer versions of cellranger do not need the index file separately like that; they just want the R1 and R2 fastqs as produced by the Illumina software.)

score 1 · Answer 3 · 2019-03-10

Gordon Smyth 52k

@gordon-smyth

Last seen 8 hours ago

WEHI, Melbourne, Australia

You can export count data from Cell Ranger in a compact text format. As an example of Cell Ranger output, see the three supplementary files here:

https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM2759556

These files can be read quickly and simply into R using edgeR::read10X. There are analogous functions Read10X and read10XCounts in the Seurat and DropletUtils packages.
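For readers curious about what is inside those files, the sketch below (in Python, purely to illustrate the layout; in R you would use the functions named above) parses a Matrix Market coordinate file together with gene and barcode lists. It assumes the usual arrangement in which rows are genes and columns are cell barcodes; the function name `read_sparse_counts` and the toy data are made up for the example.

```python
from typing import Dict, List, Sequence, Tuple


def read_sparse_counts(
    mtx_lines: Sequence[str],
    genes: List[str],
    barcodes: List[str],
) -> Dict[Tuple[str, str], int]:
    """Parse Matrix Market coordinate lines into {(gene, barcode): count}."""
    # Skip blank lines and the '%' header/comment lines.
    entries = (ln for ln in mtx_lines if ln.strip() and not ln.startswith("%"))

    # First data line gives the dimensions: rows, columns, non-zero entries.
    n_rows, n_cols, _n_nonzero = map(int, next(entries).split())
    if n_rows != len(genes) or n_cols != len(barcodes):
        raise ValueError("matrix dimensions do not match gene/barcode lists")

    counts: Dict[Tuple[str, str], int] = {}
    for line in entries:
        row, col, value = line.split()
        # Matrix Market indices are 1-based.
        counts[(genes[int(row) - 1], barcodes[int(col) - 1])] = int(value)
    return counts


if __name__ == "__main__":
    toy_mtx = [
        "%%MatrixMarket matrix coordinate integer general",
        "%",
        "3 2 3",
        "1 1 5",
        "3 1 1",
        "2 2 7",
    ]
    toy_genes = ["GeneA", "GeneB", "GeneC"]
    toy_barcodes = ["AAAC-1", "TTTG-1"]
    print(read_sparse_counts(toy_mtx, toy_genes, toy_barcodes))
```

Only the non-zero counts are stored, which is what keeps the text format compact for sparse single cell data.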