Hello,
I'm using DropletUtils 1.20.0 in Bioconductor 3.17. I used cellranger multi (version 7.1.0) to demultiplex my samples, then loaded one of the H5 files into R with read10xCounts(h5file). When I look at the rownames of the sce object (see below), the last 12 are CMO301, CMO302, and so on up to CMO312. These are the 10x Genomics CMO tags used to label cells for cell multiplexing, and they should not be added to the sce object as gene names. I could not find an option in read10xCounts to eliminate these rows, and Google searches were not productive.
Why are CMO tags added as genes? Is this a bug or expected?
> h5file
h2
"sample_filtered_feature_bc_matrix.h5"
> sce = read10xCounts(h5file)
> tail(rownames(sce),n=15)
[1] "ENSMUSG00000094855" "ENSMUSG00000095019" "ENSMUSG00000095041" "CMO301"
[5] "CMO302" "CMO303" "CMO304" "CMO305"
[9] "CMO306" "CMO307" "CMO308" "CMO309"
[13] "CMO310" "CMO311" "CMO312"
>
Thank you
Thank you. That should be the default behavior.
The problem is: how many people are going to know about this? The documentation doesn't describe this behavior and needs to be updated. Keeping the CMO tags as genes is going to affect downstream analyses.
This doesn't happen by default because it would create instability in the behaviour of the package when new types of feature are introduced - we don't especially want to privilege "CMO" type features in case of future changes that would break data interactions with older versions of DropletUtils.
Thanks to ATpoint for the very helpful answers you have provided here.
I understand that it can't be the default, but I believe it should be added as a parameter. How do I submit an issue for this? Is this the GitHub repo? https://github.com/MarioniLab/DropletUtils. I couldn't find it under the Bioconductor GitHub. Thank you
You can submit an issue there, but since I am the maintainer I will not promise you that I will implement it! I would welcome the suggestion there though, so please add it if you would like :)
Certainly I wouldn't implement the splitting by default, because many existing pieces of code will be set up to run as things stand, and we don't want to break them. We could add an argument to split the matrices off, but it's hard to understand why a user wouldn't just use splitAltExps themselves.
Everyone who reads ?read10xCounts and knows that their dataset has hashtags would naturally start checking where they end up, no? Thousands of people use the function, so I think it's not all bad.
I read this in the docs and assumed that only genes were in the rows:
"A SingleCellExperiment object containing count data for each gene (row) and cell (column) across all samples." Unfortunately, in this context "genes" is ambiguous. Honestly, I wasn't even thinking about where the CMO counts were going.
I come from a Seurat background and rarely use DropletUtils. When I use Seurat's Read10X_h5 function, it returns a matrix by default. If there are multiple modalities in the h5 file, it will return a list of matrices and give the user useful feedback. Maybe I expected this when using DropletUtils.
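For anyone else coming from Seurat, a per-modality split similar to that list of matrices can be sketched with splitAltExps from SingleCellExperiment. This is only a minimal sketch on a toy object rather than a real H5 file. It assumes the feature type is stored in rowData(sce)$Type and that the CMO rows are labelled "Multiplexing Capture"; both are worth confirming with table(rowData(sce)$Type) on your own object. The row names and counts below are made up for illustration.

```r
library(SingleCellExperiment)

# Toy stand-in for the output of read10xCounts(): 3 gene rows + 2 CMO rows, 4 cells.
counts <- matrix(
  1:20, nrow = 5,
  dimnames = list(
    c("ENSMUSG00000094855", "ENSMUSG00000095019", "ENSMUSG00000095041",
      "CMO301", "CMO302"),
    paste0("cell", 1:4)
  )
)
sce <- SingleCellExperiment(assays = list(counts = counts))

# Assumption: the feature type of each row lives in rowData(sce)$Type,
# with these labels for genes and CMO tags.
rowData(sce)$Type <- c(rep("Gene Expression", 3), rep("Multiplexing Capture", 2))

# Split rows by feature type. "Gene Expression" stays as the main experiment;
# every other type is moved into an alternative experiment.
sce_split <- splitAltExps(sce, rowData(sce)$Type, ref = "Gene Expression")

rownames(sce_split)     # only the three Ensembl gene IDs remain
altExpNames(sce_split)  # "Multiplexing Capture"

# The CMO counts are kept, just stored separately from the genes.
cmo <- altExp(sce_split, "Multiplexing Capture")
rownames(cmo)           # "CMO301" "CMO302"
```

On a real object the same splitAltExps call would be applied directly to the result of read10xCounts(h5file), so the CMO counts stay attached to the same cells without appearing among the gene rows.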
Thank you for your help