Question

Filtering curatedMetagenomicData object by rownames

Davide • 0

@23029b76

I'm trying to use curatedMetagenomicData, but I have not found a way to filter the data by rownames (which should be the samples' unique IDs). For example:

alcoholStudy <-
  filter(sampleMetadata, age >= 18) %>%
  filter(!is.na(alcohol)) %>%
  filter(body_site == "stool") %>%
  select(where(~ !all(is.na(.x)))) %>%
  returnSamples("relative_abundance")

# I want to analyze only these samples
# (c() builds a character vector; as.list() would keep only the first argument)
my_rownames <- c("JAS_1", "JAS_10", "JAS_2", "JAS_3", "JAS_4", "JAS_5", "JAS_6")

How can I achieve this goal?

Thank you in advance, Davide

curatedMetagenomicData TreeSummarizedExperiment • 1.3k views

score 0 · Answer 1 · 2021-08-17

Lucas Schiffer ▴ 240

@schifferl

New York, NY

Hi Davide, to achieve this you can simply filter the sampleMetadata data.frame down to the samples you want and pass the result to the returnSamples function. I've rewritten your code to do so; the select line simply drops columns that contain only NA values. Regards, Lucas

library(curatedMetagenomicData)
library(dplyr)

interestingSamples <-
    c("JAS_1", "JAS_10", "JAS_2", "JAS_3", "JAS_4", "JAS_5", "JAS_6")

dataYouWanted <-
    filter(sampleMetadata, sample_id %in% interestingSamples) %>% 
    select(where(~ !all(is.na(.x)))) %>%
    returnSamples("relative_abundance")
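
If the object has already been returned, another option is to subset it by its column names, since samples are stored as columns. The following is only a minimal sketch: it uses a plain base R matrix with made-up values as a stand-in for the returned object, and the names toyAbundance, subsetAbundance and the sample JAS_99 are hypothetical.

```r
# Toy abundance matrix: taxa in rows, samples in columns (hypothetical values)
toyAbundance <- matrix(
    c(10, 0, 5, 3, 7, 1, 0, 2),
    nrow = 2,
    dimnames = list(
        c("taxon_A", "taxon_B"),
        c("JAS_1", "JAS_2", "JAS_3", "OTHER_1")
    )
)

# Sample IDs to keep; JAS_99 is deliberately absent from the matrix
interestingSamples <- c("JAS_1", "JAS_3", "JAS_99")

# %in% yields a logical vector, so IDs that are not present are skipped
keep <- colnames(toyAbundance) %in% interestingSamples

# drop = FALSE keeps the result a matrix even if one column remains
subsetAbundance <- toyAbundance[, keep, drop = FALSE]

colnames(subsetAbundance)
```

On the object from the question, the same bracket syntax would presumably read alcoholStudy[, colnames(alcoholStudy) %in% my_rownames]; that line is an untested assumption here, and it only keeps samples that survived the earlier metadata filters.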

Thank you, Lucas, for your suggestions! However, as you can verify, subject_id has duplicates while the rownames do not. This is why I would rather use rownames than subject_id for filtering the data. To illustrate the point, here is some code:

MyStudy <-
  filter(sampleMetadata, age >= 18) %>%
  filter(disease == "healthy") %>%
  filter(body_site == "stool") %>%
  select(where(~ !all(is.na(.x)))) %>%
  returnSamples("relative_abundance")

# Frequency table of subject_id (showing only the 10 most frequent values)
# colData() is the accessor for the sample metadata stored in the object
head(sort(table(colData(MyStudy)$subject_id), decreasing = TRUE), 10)

M2072 M2042 M2084 C3022 M2039 M2041 M2047 M2077 M2079 M2061 
   24    23    23    21    15    14    14    14    14    13

If I'm right, in this example M2072 appears 24 times, i.e. it has 23 duplicates.

# Frequency table of rownames (showing only the first 10 after sorting)
# The rownames of the sample metadata are the column names of the object
head(sort(table(colnames(MyStudy)), decreasing = TRUE), 10)

a00820d6-7ae6-11e9-a106-68b59976a384                           A01_02_1FE                           A02_01_1FE 
                                   1                                    1                                    1 
                          A03_01_1FE                           A04_01_1FE                           A04_04_1FE 
                                   1                                    1                                    1 
                          A05_01_1FE a053077c-7ae6-11e9-a106-68b59976a384                           A06_01_1FE 
                                   1                                    1                                    1 
                          A07_01_1FE 
                                   1

The rownames, on the other hand, seem to have no duplicates.

ADD REPLY • link 3.6 years ago Davide • 0

If you look closely, you will notice that I filtered by sample_id, not subject_id. The rownames correspond to sample_id and are therefore always unique. In the MyStudy code above, some of your samples come from HMP_2019_ibdmdb subjects, for whom there are multiple samples per subject because of longitudinal follow-up. You can verify this by removing the returnSamples("relative_abundance") line and looking at the metadata. Hope that helps.
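
The difference between the two identifiers can be checked directly on the metadata. The following is a minimal sketch on a toy data.frame with made-up values (toyMetadata, StudyA, StudyB, S1 to S4 and P1 to P3 are all hypothetical); the same calls should apply to the filtered sampleMetadata, whose sample_id and subject_id columns are the ones discussed above.

```r
# Toy stand-in for the filtered metadata (hypothetical values, not real data)
toyMetadata <- data.frame(
    study_name = c("StudyA", "StudyA", "StudyA", "StudyB"),
    sample_id  = c("S1", "S2", "S3", "S4"),
    subject_id = c("P1", "P1", "P2", "P3"),
    stringsAsFactors = FALSE
)

# anyDuplicated() returns 0 when every value is unique,
# otherwise the index of the first duplicated value
anyDuplicated(toyMetadata$sample_id)
anyDuplicated(toyMetadata$subject_id)

# Subjects that contributed more than one sample (e.g. longitudinal follow-up)
samplesPerSubject <- table(toyMetadata$subject_id)
repeatedSubjects <- names(samplesPerSubject[samplesPerSubject > 1])
repeatedSubjects
```

Here subject P1 contributes two samples, so subject_id repeats while sample_id stays unique, which mirrors the longitudinal case described above.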