Question

Merging of Human and Mouse SingleCellExperiment Objects Error

Dario Strbenac ★ 1.6k

@dario-strbenac-5916

Australia

I have a human and a mouse SingleCellExperiment object and would like to concatenate them. I have converted the mouse IDs into human IDs and subset each object to the intersection of IDs. I get an error message because Symbol ends up with duplicate values in it.

head(rowData(SCEmouse))
DataFrame with 6 rows and 3 columns
                                ID      Symbol            Type
                       <character> <character>     <character>
ENSMUSG00000061195 ENSG00000186092       OR4F5 Gene Expression
ENSMUSG00000093804 ENSG00000284733      OR4F29 Gene Expression
ENSMUSG00000096351 ENSG00000187634      SAMD11 Gene Expression
ENSMUSG00000095567 ENSG00000188976       NOC2L Gene Expression
ENSMUSG00000078485 ENSG00000187583     PLEKHN1 Gene Expression
ENSMUSG00000078486 ENSG00000187642       PERM1 Gene Expression

head(rowData(SCEhuman))
DataFrame with 6 rows and 3 columns
                             ID      Symbol            Type
                    <character> <character>     <character>
ENSG00000186092 ENSG00000186092       OR4F5 Gene Expression
ENSG00000284733 ENSG00000284733      OR4F29 Gene Expression
ENSG00000187634 ENSG00000187634      SAMD11 Gene Expression
ENSG00000188976 ENSG00000188976       NOC2L Gene Expression
ENSG00000187583 ENSG00000187583     PLEKHN1 Gene Expression
ENSG00000187642 ENSG00000187642       PERM1 Gene Expression

cbind(SCEhuman, SCEmouse) # Error

range(table(mcols(SCEhuman)[, "Symbol"]))
[1] 1 2
range(table(mcols(SCEmouse)[, "Symbol"]))
[1]  1 19

Is there a better way to do this task? The matching I performed replaced the mouse ID and symbol in the rowData of the mouse object with the human values, based on an orthologs table from Ensembl.

SingleCellExperiment • 1.3k views

ADD COMMENT • link 15 months ago Dario Strbenac ★ 1.6k

What is the error message? My guess is that it is row name duplication, because one mouse gene might have more than one human ortholog.

ADD REPLY • link 15 months ago ATpoint ★ 4.8k

It is treating Symbol as some kind of primary key, bizarrely. I had to change the commas to get the support website to accept it. My tabulation of the values in the Symbol column shows that some symbols occur more than once.

> cbind(SCEhuman, SCEmouse)
Error in FUN(X[[i]], ...) : 
  column(s) 'Symbol' in 'mcols' are duplicated and the data do not match
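
The wording of the error suggests that a column name shared by both rowData tables holds different values in the two objects. A minimal sketch for finding which shared columns disagree is below; `mismatched_cols` is a hypothetical helper written for this illustration (not part of any package), and the two small tables are made-up stand-ins for `rowData(SCEhuman)` and `rowData(SCEmouse)`.

```r
library(S4Vectors)

# Hypothetical helper: names of columns present in both tables
# whose contents are not identical.
mismatched_cols <- function(x, y) {
  shared <- intersect(colnames(x), colnames(y))
  differs <- vapply(shared,
                    function(nm) !identical(x[[nm]], y[[nm]]),
                    logical(1))
  shared[differs]
}

# Toy stand-ins for the two rowData tables (made-up values).
rd_human <- DataFrame(ID = c("ENSG1", "ENSG2"), Symbol = c("A", "B"))
rd_mouse <- DataFrame(ID = c("ENSG1", "ENSG2"), Symbol = c("A", "C"))

mismatched_cols(rd_human, rd_mouse)
```

On the real objects the call would be `mismatched_cols(rowData(SCEhuman), rowData(SCEmouse))`, assuming both have already been subset to the same number of rows.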

ADD REPLY • link 15 months ago Dario Strbenac ★ 1.6k

Aaron Lun ★ 28k

@alun

The city by the bay

I don't recall SingleCellExperiment doing anything fancy with rowData during cbind. This works for me:

example(SingleCellExperiment, echo=FALSE)
rowData(sce)$Symbol <- as.character(sample(200))
cbind(sce, sce)

Session info:

R Under development (unstable) (2023-11-10 r85507)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 20.04.6 LTS

Matrix products: default
BLAS:   /home/luna/Software/R/trunk/lib/libRblas.so 
LAPACK: /home/luna/Software/R/trunk/lib/libRlapack.so;  LAPACK version 3.11.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: America/Los_Angeles
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] SingleCellExperiment_1.25.0 SummarizedExperiment_1.33.2
 [3] Biobase_2.63.0              GenomicRanges_1.55.1       
 [5] GenomeInfoDb_1.39.5         IRanges_2.37.0             
 [7] S4Vectors_0.41.3            BiocGenerics_0.49.1        
 [9] MatrixGenerics_1.15.0       matrixStats_1.2.0          

loaded via a namespace (and not attached):
 [1] SparseArray_1.3.3       zlibbioc_1.49.0         Matrix_1.6-5           
 [4] lattice_0.22-5          abind_1.4-5             GenomeInfoDbData_1.2.11
 [7] S4Arrays_1.3.2          XVector_0.43.1          RCurl_1.98-1.14        
[10] bitops_1.0-7            grid_4.4.0              DelayedArray_0.29.0    
[13] compiler_4.4.0          tools_4.4.0             crayon_1.5.2

ADD COMMENT • link 15 months ago Aaron Lun ★ 28k

It originates from SummarizedExperiment. I now know what to fix.

> sce2 <- sce
> rowData(sce2)$Symbol <- rev(rowData(sce)$Symbol)
> cbind(sce, sce2)
Error in FUN(X[[i]], ...) :
  column(s) 'Symbol' in 'mcols' are duplicated and the data do not match
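
One possible workaround, sketched here under the assumption that the IDs are unique within each object and that both objects cover the same set of IDs, is to put the rows in the same order and give both objects identical rowData before calling cbind. The objects `human` and `mouse` and the helper `make_sce` are toy constructions for illustration only, not the data from this thread.

```r
library(SingleCellExperiment)

# Toy objects sharing the same row IDs (made-up data).
set.seed(1)
ids <- paste0("ENSG", sprintf("%07d", 1:5))
make_sce <- function(symbols, ncells) {
  counts <- matrix(rpois(length(ids) * ncells, 5), nrow = length(ids),
                   dimnames = list(ids, NULL))
  SingleCellExperiment(list(counts = counts),
                       rowData = DataFrame(ID = ids, Symbol = symbols))
}
human <- make_sce(paste0("GENE", 1:5), ncells = 3)
# Same IDs, but the Symbol column does not match the human object.
mouse <- make_sce(rev(paste0("GENE", 1:5)), ncells = 4)

# Align the row order, then reuse the human rowData for the mouse object
# so that every shared column is identical.
mouse <- mouse[rownames(human), ]
rowData(mouse) <- rowData(human)

combined <- cbind(human, mouse)
dim(combined)
```

Note that this overwrites the mouse annotation with the human one, so any mouse-specific columns worth keeping would need to be renamed or stored elsewhere first.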

ADD REPLY • link 15 months ago Dario Strbenac ★ 1.6k

score 2 · Accepted Answer · 2024-01-19

You cannot cbind two things where one has duplicates. How would that work, exactly?

If the rowData for the first SingleCellExperiment has this:

> rowData(sce)
DataFrame with 200 rows and 2 columns
                     ID      Symbol
            <character> <character>
ENSG0000001 ENSG0000001           1
ENSG0000002 ENSG0000002           2
ENSG0000003 ENSG0000003           3

And the second has this:

DataFrame with 200 rows and 2 columns
                     ID      Symbol
            <character> <character>
ENSG0000001 ENSG0000001           1
ENSG0000002 ENSG0000002           2
ENSG0000002 ENSG0000002           2

Which row of the second SingleCellExperiment is concatenated with the second row of the first? Here's an example:

> example(SingleCellExperiment, echo=FALSE)
> ID <- paste0("ENSG", sprintf("%07d", 1:nrow(sce)))
> rd <- DataFrame(ID = ID, Symbol = as.character(1:nrow(sce)))
> rownames(sce) <- ID
> rowData(sce) <- rd
> rowData(sce)
DataFrame with 200 rows and 2 columns
                     ID      Symbol
            <character> <character>
ENSG0000001 ENSG0000001           1
ENSG0000002 ENSG0000002           2
ENSG0000003 ENSG0000003           3
ENSG0000004 ENSG0000004           4
ENSG0000005 ENSG0000005           5
...                 ...         ...
ENSG0000196 ENSG0000196         196
ENSG0000197 ENSG0000197         197
ENSG0000198 ENSG0000198         198
ENSG0000199 ENSG0000199         199
ENSG0000200 ENSG0000200         200
> sce2 <- sce
> set.seed(0xabeef)
> sce2 <- sce2[sample(1:nrow(sce2), nrow(sce2), TRUE),]
> cbind(sce, sce2)
Error in FUN(X[[i]], ...) : 
  column(s) 'ID' in 'mcols' are duplicated and the data do not match
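
If dropping ambiguous genes is acceptable, one way around this is to keep only the IDs that occur exactly once in each object before combining. The following is a minimal sketch on toy data; `unique_ids` is a hypothetical helper, and whether discarding duplicated IDs is appropriate (rather than, say, aggregating them) is an assumption that depends on the analysis.

```r
library(SingleCellExperiment)

# Toy object with six uniquely named rows (made-up data).
set.seed(42)
ids <- paste0("ENSG", sprintf("%07d", 1:6))
counts <- matrix(rpois(6 * 4, 5), nrow = 6, dimnames = list(ids, NULL))
sce <- SingleCellExperiment(list(counts = counts),
                            rowData = DataFrame(ID = ids))

# Second object in which ENSG0000002 appears twice and ENSG0000003 is absent.
sce2 <- sce[c(1, 2, 2, 4, 5, 6), ]

# Hypothetical helper: IDs that occur exactly once in an object.
unique_ids <- function(x) {
  id <- rowData(x)$ID
  id[!(id %in% id[duplicated(id)])]
}
keep <- intersect(unique_ids(sce), unique_ids(sce2))

# Subset by position of the ID rather than by row name,
# because row names may be duplicated.
sce_sub  <- sce[match(keep, rowData(sce)$ID), ]
sce2_sub <- sce2[match(keep, rowData(sce2)$ID), ]

combined <- cbind(sce_sub, sce2_sub)
dim(combined)
```

After the filter both objects have the same rows in the same order, so the shared ID column matches and cbind has an unambiguous pairing of rows.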

In this instance, cbind is dispatched to the method for SummarizedExperiment, and if you look at ?cbind,SummarizedExperiment-method you will see