What are the methods to get count data per cell from single cell fastq, given only R1, R2, and I1 fastq files?
@matthew-thornton-5564
USA, Los Angeles, USC

Hello,

I am starting to process single cell RNA sequencing data, and I noticed that all of the Bioconductor tutorials for single cell analysis (https://f1000research.com/articles/5-2122/v2) start from well-groomed data that is already in a count matrix, with cells as columns and genes as rows. This is pretty far from the output of the instrument, and more should be done to facilitate getting the count data needed from the main methods of single cell sequencing (Illumina and PacBio). As the output of bcl2fastq, I was given three fastq files: R1 and R2 (paired-end data) and an index fastq that holds the barcodes used to multiplex the samples. After googling extensively, I found that there are not a lot of options; what I see is that people use Cell Ranger (software from 10X Genomics) to do the analysis and then export count data from there. None of this is very satisfactorily explained, despite there being excellent Bioconductor tutorials for single cell data (which all start from well-groomed count data), like: https://www.bioconductor.org/help/course-materials/2017/BioC2017/Day2/Workshops/singleCell/doc/workshop.html.

Cell Ranger uses STAR and it seems like it does more than you would want if you intend to use the R/Bioconductor software, or to process the data in a manner similar to bulk RNA-seq.

R1, R2 regular paired-end fastqs

I1

@K00124:391:HWNTHBBXX:3:1101:4219:1309 1:N:0:TTCCCGAT
TTCCCGAT
+
A-A<FA--
@K00124:391:HWNTHBBXX:3:1101:7101:1309 1:N:0:TTCCCGAC
TTCCCGAC
+
AAA<FF--
@K00124:391:HWNTHBBXX:3:1101:7222:1309 1:N:0:GCAGTAGC
GCAGTAGC
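
For orientation, an index file like this can be inspected directly. The following is a minimal sketch (plain Python, no external packages) that tallies how often each index sequence occurs. The function name `tally_index_reads` is made up for illustration, and the sketch assumes the standard four-line FASTQ layout; an incomplete trailing record is skipped.

```python
from collections import Counter
from itertools import islice

def tally_index_reads(lines):
    """Count how often each index sequence occurs in FASTQ text lines."""
    counts = Counter()
    it = iter(lines)
    while True:
        record = list(islice(it, 4))  # header, sequence, '+', qualities
        if len(record) < 4:
            break  # end of input, or an incomplete trailing record
        header, seq = record[0].strip(), record[1].strip()
        if not header.startswith("@"):
            raise ValueError(f"unexpected FASTQ header: {header!r}")
        counts[seq] += 1
    return counts

# The two complete records shown above, used as toy input.
example = """@K00124:391:HWNTHBBXX:3:1101:4219:1309 1:N:0:TTCCCGAT
TTCCCGAT
+
A-A<FA--
@K00124:391:HWNTHBBXX:3:1101:7101:1309 1:N:0:TTCCCGAC
TTCCCGAC
+
AAA<FF--
""".splitlines()

print(tally_index_reads(example))
```

On a real file, the same function could be fed the lines of an opened (or gzip-opened) fastq.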

What is your method for getting count data given R1, R2, and I1?

What is the best way to export this count data into R? HDF5Array? Which HDF5 files do you use from the output of cellranger count (or aggr)?

Any comments or advice are greatly appreciated, and will most likely enrich the community as 10X Genomics increases in popularity. It is not as if people aren't already trying to get help; they are just not getting much (https://www.biostars.org/p/356000/).

Thank you

Aaron Lun ★ 28k
@alun
The city by the bay

Cell Ranger uses STAR and it seems like it does more than you would want

I would say that CellRanger does the amount of work that is needed to get a count matrix. One should not underestimate the complexity of the 10X sequencing construct, which involves at least four pieces of 10X-specific information split across each read pair:

  • Cell barcode
  • UMI
  • Gene sequence
  • Sample barcode

... not including any Illumina-related pieces. (The sample barcode is technically 10X's design, I believe, so I'm counting that above.) Any pre-processing pipeline has to do a lot of work to get to a count table, e.g., demultiplexing on the sample barcode, matching the cell barcode to the whitelist, aligning the gene sequence, and removing PCR duplicates using the UMIs. Add in the data munging and you end up with something big like CellRanger.
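
As a rough illustration of the last two steps only (whitelist filtering and UMI-based duplicate removal), here is a minimal Python sketch. It is not CellRanger's implementation: it assumes each read has already been aligned and assigned to a gene, treats reads with identical (cell barcode, UMI, gene) as PCR duplicates, and ignores sequencing errors in barcodes and UMIs. The function name `count_umis` and the toy data are made up for the example.

```python
from collections import defaultdict

def count_umis(records, whitelist):
    """Collapse reads to unique UMIs per (cell barcode, gene).

    records: iterable of (cell_barcode, umi, gene) tuples, one per read.
    whitelist: set of accepted cell barcodes.
    """
    seen = defaultdict(set)
    for cell, umi, gene in records:
        if cell not in whitelist:
            continue  # discard reads whose cell barcode is not whitelisted
        seen[(cell, gene)].add(umi)  # duplicate UMIs collapse in the set
    return {key: len(umis) for key, umis in seen.items()}

# Toy input: the first two reads are PCR duplicates of one molecule.
reads = [
    ("AAAC", "TTGA", "GeneA"),
    ("AAAC", "TTGA", "GeneA"),
    ("AAAC", "CCGT", "GeneA"),
    ("AAAC", "TTGA", "GeneB"),
    ("GGGG", "TTGA", "GeneA"),  # barcode not in the whitelist
]
counts = count_umis(reads, whitelist={"AAAC", "TTTG"})
print(counts)
```

The resulting dictionary is a sparse count table keyed by (cell, gene); a real pipeline would also have to handle multi-mapping reads and errors in the UMI sequences.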

What is your method for getting count data given R1, R2, and I1?

I just use CellRanger. It sounds like you don't want to use it, but the safest bet for pre-processing such a complex data type is to use the software developed by the same company that designs the protocol! However, if you need an R/Bioconductor solution, scPipe is a good place to start.

What is the best way to export this count data into R?

Importing CellRanger outputs is the bread and butter of DropletUtils. Note that you'll need the BioC-devel version of this package to import count tables from CellRanger version 3.


Thank you very much for your explanation. I would have to install CellRanger on a large multipurpose (academic) Linux cluster. I was hoping to avoid this, as I have STAR installed already. I will give scPipe a try. I will probably also use Cell Ranger locally.


Hello! I just asked a question related to importing the molecule_info.h5 file into DropletUtils. I will install the development package; that may fix it. Thank you!

swbarnes2 ★ 1.4k
@swbarnes2-14086
San Diego

If you want a pipeline that goes from fastqs to gene counts and is less of a black box than 10x Genomics Cell Ranger, you can use what the McCarroll lab cooked up for Drop-seq:

https://github.com/broadinstitute/Drop-seq/releases

The principle is pretty much the same: get alignments, gene assignments, cell barcodes, and UMIs all together, filter away UMI duplicates, then total everything up for each cell barcode.

The major difference between 10x Genomics and Drop-seq is that 10x Genomics cell barcodes all derive from a whitelist and Drop-seq ones do not. Drop-seq cell barcodes are also prone to being short, which has to be corrected for.
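
To make the whitelist point concrete, here is a minimal Python sketch of one simple way an observed barcode could be rescued against a whitelist: accept exact matches, otherwise accept a barcode that is exactly one substitution away from a single whitelisted barcode. This is a simplified illustration and should not be read as Cell Ranger's actual correction procedure; the function name `correct_barcode` and the toy barcodes are made up.

```python
def correct_barcode(barcode, whitelist):
    """Return the whitelisted barcode matching `barcode`, or None.

    Exact matches are returned as-is. Otherwise every single-base
    substitution is tried, and a correction is accepted only if it
    points to exactly one whitelisted barcode.
    """
    if barcode in whitelist:
        return barcode
    candidates = set()
    for i, base in enumerate(barcode):
        for alt in "ACGT":
            if alt == base:
                continue
            candidate = barcode[:i] + alt + barcode[i + 1:]
            if candidate in whitelist:
                candidates.add(candidate)
    if len(candidates) == 1:
        return candidates.pop()
    return None  # no match, or ambiguous between several barcodes

whitelist = {"AAAA", "CCCC", "AATA"}
print(correct_barcode("CCCG", whitelist))  # one mismatch from CCCC only
print(correct_barcode("AACA", whitelist))  # one mismatch from AAAA and AATA
```

Without a whitelist, as in Drop-seq, this kind of lookup is not available, which is why that pipeline needs its own barcode repair steps.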

(Also note that newer versions of cellranger do not need a separate index file like that; they just want the R1 and R2 fastqs as produced by the Illumina software.)

@gordon-smyth
WEHI, Melbourne, Australia

You can export count data from Cell Ranger in a compact text format. As an example of Cell Ranger output, see the three supplementary files here:

https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM2759556

These files can be read quickly and simply into R using edgeR::read10X. There are analogous functions Read10X and read10xCounts in the Seurat and DropletUtils packages, respectively.
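
To show what those three files contain, here is a minimal Python sketch of the layout: a sparse matrix in MatrixMarket coordinate format (matrix.mtx), with one file naming the rows (genes) and one naming the columns (cell barcodes). The parser below is a toy written for illustration, with made-up function names and toy data; in practice the R functions above, or an existing reader such as `scipy.io.mmread`, would be used.

```python
def read_mtx(lines):
    """Parse MatrixMarket coordinate text into (n_rows, n_cols, entries).

    entries maps zero-based (row, col) to an integer count.
    """
    # Drop blank lines and '%' lines (the banner and any comments).
    body = [ln for ln in lines if ln.strip() and not ln.startswith("%")]
    n_rows, n_cols, n_entries = (int(x) for x in body[0].split())
    entries = {}
    for ln in body[1:]:
        row, col, value = ln.split()
        entries[(int(row) - 1, int(col) - 1)] = int(value)  # file is 1-based
    if len(entries) != n_entries:
        raise ValueError("entry count does not match the size line")
    return n_rows, n_cols, entries

def label_counts(entries, genes, barcodes):
    """Replace numeric indices with gene and barcode names."""
    return {(genes[r], barcodes[c]): v for (r, c), v in entries.items()}

# Toy stand-ins for matrix.mtx, the gene file, and barcodes.tsv.
mtx_text = """%%MatrixMarket matrix coordinate integer general
%
3 2 3
1 1 5
3 1 2
2 2 7
"""
genes = ["GeneA", "GeneB", "GeneC"]
barcodes = ["AAAC-1", "TTTG-1"]

n_genes, n_cells, entries = read_mtx(mtx_text.splitlines())
counts = label_counts(entries, genes, barcodes)
print(n_genes, n_cells, counts)
```

Rows are genes and columns are cells, which matches the orientation of the count matrices used in the Bioconductor tutorials.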


I will do that. Thank you!!
