Merging of Human and Mouse SingleCellExperiment Objects Error
Dario Strbenac ★ 1.5k
@dario-strbenac-5916
Australia

I have a human and mouse SingleCellExperiment object and would like to concatenate them. I have converted mouse IDs into human IDs and subset each object to the intersection of IDs. I get an error message because Symbol ends up with duplicate values in it.

head(rowData(SCEmouse))
DataFrame with 6 rows and 3 columns
                                ID      Symbol            Type
                       <character> <character>     <character>
ENSMUSG00000061195 ENSG00000186092       OR4F5 Gene Expression
ENSMUSG00000093804 ENSG00000284733      OR4F29 Gene Expression
ENSMUSG00000096351 ENSG00000187634      SAMD11 Gene Expression
ENSMUSG00000095567 ENSG00000188976       NOC2L Gene Expression
ENSMUSG00000078485 ENSG00000187583     PLEKHN1 Gene Expression
ENSMUSG00000078486 ENSG00000187642       PERM1 Gene Expression

head(rowData(SCEhuman))
DataFrame with 6 rows and 3 columns
                             ID      Symbol            Type
                    <character> <character>     <character>
ENSG00000186092 ENSG00000186092       OR4F5 Gene Expression
ENSG00000284733 ENSG00000284733      OR4F29 Gene Expression
ENSG00000187634 ENSG00000187634      SAMD11 Gene Expression
ENSG00000188976 ENSG00000188976       NOC2L Gene Expression
ENSG00000187583 ENSG00000187583     PLEKHN1 Gene Expression
ENSG00000187642 ENSG00000187642       PERM1 Gene Expression

cbind(SCEhuman, SCEmouse) # Error

range(table(mcols(SCEhuman)[, "Symbol"]))
  1 2
range(table(mcols(SCEmouse)[, "Symbol"]))
  1 19

Is there a better way to do this task? The matching which I performed was to replace the mouse ID and symbol in the rowData of the mouse object with the human value based on an orthologs table from ENSEMBL.

What is the error message? My guess is that it is row name duplication, because one mouse gene might have more than one human ortholog.
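
If ambiguous ortholog pairs are the cause, one option is to reduce the mapping table to one-to-one pairs before renaming anything. Below is a minimal base R sketch, assuming a hypothetical table `orthologs` with columns `mouse_id` and `human_id`. These names and values are illustrative and are not taken from the Ensembl export.

```r
# Hypothetical ortholog table; column names and IDs are made up for illustration.
orthologs <- data.frame(
  mouse_id = c("ENSMUSG01", "ENSMUSG02", "ENSMUSG02", "ENSMUSG03", "ENSMUSG04"),
  human_id = c("ENSG01",    "ENSG02",    "ENSG03",    "ENSG04",    "ENSG04"),
  stringsAsFactors = FALSE
)

# Flag every row whose mouse gene appears more than once (one-to-many).
dup_mouse <- orthologs$mouse_id %in%
  orthologs$mouse_id[duplicated(orthologs$mouse_id)]

# Flag every row whose human gene appears more than once (many-to-one).
dup_human <- orthologs$human_id %in%
  orthologs$human_id[duplicated(orthologs$human_id)]

# Keep only pairs that are unique in both directions.
one_to_one <- orthologs[!dup_mouse & !dup_human, ]
one_to_one
```

Whether dropping ambiguous pairs is acceptable depends on the analysis. Collapsing them, for example by summing counts, is another option that is not shown here.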

It is treating Symbol as some kind of primary key, bizarrely. I had to change the commas to get the support website to accept it. You can see my tabulation of the values in Symbol column shows that there are some multiples.

> cbind(SCEhuman, SCEmouse)
Error in FUN(X[[i]], ...) : 
  column(s) 'Symbol' in 'mcols' are duplicated and the data do not match
@james-w-macdonald-5106
United States

You cannot cbind two things where one has duplicates. How would that work, exactly?

If the rowData for the first SingleCellExperiment has this:

> rowData(sce)
DataFrame with 200 rows and 2 columns
                     ID      Symbol
            <character> <character>
ENSG0000001 ENSG0000001           1
ENSG0000002 ENSG0000002           2
ENSG0000003 ENSG0000003           3

And the second has this:

DataFrame with 200 rows and 2 columns
                     ID      Symbol
            <character> <character>
ENSG0000001 ENSG0000001           1
ENSG0000002 ENSG0000002           2
ENSG0000002 ENSG0000002           2

Which row of the second SingleCellExperiment is concatenated with the second row of the first? Here's an example:

> example(SingleCellExperiment, echo=FALSE)
> ID <- paste0("ENSG", sprintf("%07d", 1:nrow(sce)))
> rd <- DataFrame(ID = ID, Symbol = as.character(1:nrow(sce)))
> rownames(sce) <- ID
> rowData(sce) <- rd
> rowData(sce)
DataFrame with 200 rows and 2 columns
                     ID      Symbol
            <character> <character>
ENSG0000001 ENSG0000001           1
ENSG0000002 ENSG0000002           2
ENSG0000003 ENSG0000003           3
ENSG0000004 ENSG0000004           4
ENSG0000005 ENSG0000005           5
...                 ...         ...
ENSG0000196 ENSG0000196         196
ENSG0000197 ENSG0000197         197
ENSG0000198 ENSG0000198         198
ENSG0000199 ENSG0000199         199
ENSG0000200 ENSG0000200         200
> sce2 <- sce
> set.seed(0xabeef)
> sce2 <- sce2[sample(1:nrow(sce2), nrow(sce2), TRUE),]
> cbind(sce, sce2)
Error in FUN(X[[i]], ...) : 
  column(s) 'ID' in 'mcols' are duplicated and the data do not match

In this instance, cbind is dispatched to the method for SummarizedExperiment, and if you look at ?cbind,SummarizedExperiment-method you will see:

 'cbind(...)': 'cbind' combines objects with the same features of
          interest but different samples (columns in 'assays').  The
          colnames in 'colData(SummarizedExperiment)' must match or an
          error is thrown.  Duplicate columns of
          'rowData(SummarizedExperiment)' must contain the same data.

The error you are getting arises because both objects have a rowData column with the same name but different contents. The contents have to be identical, which apparently includes the row order:

> cbind(sce, sce[sample(1:nrow(sce), nrow(sce)),])
Error in FUN(X[[i]], ...) : 
  column(s) 'ID' in 'mcols' are duplicated and the data do not match

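From this, the solution follows: drop the duplicated IDs from the second object, then reorder the first object to match it before calling cbind.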

> scesmall <- sce2[!duplicated(rowData(sce2)$ID),]
> cbind(sce[match(rowData(scesmall)$ID, rowData(sce)$ID),], scesmall)
class: SingleCellExperiment 
dim: 130 200 
metadata(0):
assays(2): counts logcounts
rownames(130): ENSG0000109 ENSG0000130 ... ENSG0000059 ENSG0000143
rowData names(2): ID Symbol
colnames: NULL
colData names(0):
reducedDimNames(2): PCA tSNE
mainExpName: NULL
altExpNames(0):
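
To find out which columns would trip the check before calling cbind, something like the following may help. It is only a sketch: `conflictingRowData` is a hypothetical helper, not part of SummarizedExperiment, and the toy objects are made up. It compares column values only, so the real check inside cbind could be stricter.

```r
library(SingleCellExperiment)

# Hypothetical helper: names of the rowData columns shared by x and y
# whose values (including their order) are not identical.
conflictingRowData <- function(x, y) {
  shared <- intersect(colnames(rowData(x)), colnames(rowData(y)))
  differs <- vapply(shared, function(nm) {
    !identical(rowData(x)[[nm]], rowData(y)[[nm]])
  }, logical(1))
  shared[differs]
}

# Toy objects: same genes in the same order, but one Symbol differs.
counts <- matrix(1:6, nrow = 3, dimnames = list(c("g1", "g2", "g3"), NULL))
a <- SingleCellExperiment(
  assays = list(counts = counts),
  rowData = DataFrame(ID = rownames(counts), Symbol = c("A", "B", "C"))
)
b <- a
rowData(b)$Symbol <- c("A", "B", "X")

# Report the columns that would need fixing before cbind(a, b).
conflictingRowData(a, b)
```

An empty result means the shared columns agree value for value, which is the situation where cbind(sce, sce) works.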
Aaron Lun ★ 28k
@alun
The city by the bay

I don't recall SingleCellExperiment doing anything fancy with rowData during cbind. This works for me:

example(SingleCellExperiment, echo=FALSE)
rowData(sce)$Symbol <- as.character(sample(200))
cbind(sce, sce)

Session info:

R Under development (unstable) (2023-11-10 r85507)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 20.04.6 LTS

Matrix products: default
BLAS:   /home/luna/Software/R/trunk/lib/libRblas.so 
LAPACK: /home/luna/Software/R/trunk/lib/libRlapack.so;  LAPACK version 3.11.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: America/Los_Angeles
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] SingleCellExperiment_1.25.0 SummarizedExperiment_1.33.2
 [3] Biobase_2.63.0              GenomicRanges_1.55.1       
 [5] GenomeInfoDb_1.39.5         IRanges_2.37.0             
 [7] S4Vectors_0.41.3            BiocGenerics_0.49.1        
 [9] MatrixGenerics_1.15.0       matrixStats_1.2.0          

loaded via a namespace (and not attached):
 [1] SparseArray_1.3.3       zlibbioc_1.49.0         Matrix_1.6-5           
 [4] lattice_0.22-5          abind_1.4-5             GenomeInfoDbData_1.2.11
 [7] S4Arrays_1.3.2          XVector_0.43.1          RCurl_1.98-1.14        
[10] bitops_1.0-7            grid_4.4.0              DelayedArray_0.29.0    
[13] compiler_4.4.0          tools_4.4.0             crayon_1.5.2

It originates from SummarizedExperiment. I now know what to fix.

> sce2 <- sce
> rowData(sce2)$Symbol <- rev(rowData(sce)$Symbol)
> cbind(sce, sce2)
Error in FUN(X[[i]], ...) :
  column(s) 'Symbol' in 'mcols' are duplicated and the data do not match
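
One possible fix, sketched on toy data, is to align the rows of the two objects and then reuse a single annotation so that every shared rowData column is identical. The objects `human` and `mouse` below are hypothetical stand-ins for the real ones, and the sketch assumes the IDs are already unique and shared between the two.

```r
library(SingleCellExperiment)

# Toy stand-ins for the human and mouse objects (made-up data).
genes <- c("ENSG01", "ENSG02", "ENSG03")
human <- SingleCellExperiment(
  assays = list(counts = matrix(1:6, nrow = 3,
                                dimnames = list(genes, c("h1", "h2")))),
  rowData = DataFrame(ID = genes, Symbol = c("A", "B", "C"))
)
mouse <- SingleCellExperiment(
  assays = list(counts = matrix(7:12, nrow = 3,
                                dimnames = list(rev(genes), c("m1", "m2")))),
  rowData = DataFrame(ID = rev(genes), Symbol = c("c", "b", "a"))
)

# Put the mouse rows into the same order as the human rows.
mouse <- mouse[match(rowData(human)$ID, rowData(mouse)$ID), ]

# Use the same row names and annotation in both objects, so no shared
# rowData column can differ.
rownames(mouse) <- rownames(human)
rowData(mouse) <- rowData(human)

combined <- cbind(human, mouse)
dim(combined)
```

Overwriting the mouse annotation discards the original mouse symbols. If they are still needed, they could be copied into a differently named column of both objects first.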