ENCODExplorer doesn't return results which are on the website
@lizingsimmons-6935
Last seen 4.0 years ago
Germany

ENCODExplorer seems like a really neat way to query and download ENCODE data. However, I'm finding that it doesn't return data that exists according to the ENCODE website, e.g.

queryEncode(biosample = "ES-E14", target = "CTCF", file_format = "bed", assay = "ChIP-seq", organism = "Mus musculus", fixed = FALSE)

returns NULL, although here there are three bed files available for download.

encodexplorer • 1.9k views
ADD COMMENT

Sorry for the delay in the answer, I was out of town.

I'd like to add some points to Mike's answers.

About the usage of `|` in a queryEncode call: the function was not meant to deal with boolean operators, but I think it would be interesting to add support for them. I created an issue and will try to add this for the next release.

About the snapshots: we plan to update the snapshot before the next Bioconductor release, and with every subsequent release afterward. One of the problems we encountered is that there were some changes in the ENCODE database (for instance, the Roadmap Epigenomics datasets are now available). We will have to update some parts of the code to make sure we update all the metadata correctly. Right now, we do not plan on adding snapshots between releases.

ADD REPLY

Oh, that is interesting that it doesn't actually support the Boolean operators. Using `|` works quite well considering that! I'd strongly support adding official support for this: it's useful to be able to search for multiple cell lines and targets at once, and the alternative is nested for loops, which is not very nice :)

ADD REPLY

I agree that it would make a lot of sense, and it should not be such a big change. But I have to prepare a test suite before adding it officially to the documentation/vignettes. I will also have to solve the problem mentioned by Mike, where the ^ anchor is only applied to the first term.

ADD REPLY
Mike Smith ★ 6.6k
@mike-smith
Last seen 6 hours ago
EMBL Heidelberg

The queryEncode() function doesn't query the website directly, but rather looks at a list of experimental data you provide in the df argument.  If you don't provide anything here, it uses an internal snapshot of the data.  It looks like this may be too old to include the bed files, but does include the other files that are part of the experiment:

> tail(sort(unique(ENCODExplorer::encode_df$experiment[,"date_released"])))
[1] "2014-12-17" "2015-01-08" "2015-02-12" "2015-03-31" "2015-04-14" "2015-05-18"

 

ADD COMMENT

Sorry to un-accept your answer, but I've found that the example I gave was misleading. The final bed file on that page can't be found by queryEncode in some situations -- but does exist in the up-to-date encode_df I've managed to create. I think the problem may be that some bed files have NA values for some of the columns, including organism, technical_replicate_number, and biological_replicate_number.

Using dplyr to filter for the files I want, I get 14 results.

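library(dplyr)  # provides %>% and filter()

# Exact, case-sensitive matches on biosample, target and file format
z <- encode_df$experiment %>%
  filter(biosample_name %in% c("ES-E14", "ES-Bruce4")) %>%
  filter(target %in% c("CTCF-mouse", "Control-mouse")) %>%
  filter(file_format %in% c("bed", "bam"))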

Using queryEncode like this, I get 19: those 14, plus one with a different target (I'm not sure why it matches), and four bigBed files.

x <- queryEncode(df = encode_df, fixed = FALSE,
            biosample = "ES-E14|ES-Bruce4",
            target = "CTCF|Control",
            file_format = "bam|bed")

If I build the input to queryEncode like this, I get only 12 results. None of the rows with NA in e.g. biological_replicate_number are included, but two bigBed files are.

biosamples <- c("ES-E14", "ES-Bruce4")
targets <- c("CTCF","Control")
formats <- c("bam", "bed")

y <- queryEncode(df = encode_df, assay = "ChIP-seq", organism = "Mus musculus",
            fixed = FALSE,
            biosample = paste(biosamples, collapse = "|"),
            target = paste(targets, collapse = "|"),
            file_format = paste(formats, collapse = "|"))

I have no idea what is going on here.

Edit: wait, no I do know what it is... some have organism = NA. So this will work!

y2 <- queryEncode(df = encode_df,
            fixed = FALSE,
            biosample = paste(biosamples, collapse = "|"),
            target = paste(targets, collapse = "|"),
            file_format = paste(formats, collapse = "|"))
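
To illustrate why rows with organism = NA drop out, here is a minimal base R sketch. It assumes the organism filter behaves like a grepl() match on that column; the toy data frame and its values are made up for illustration and are not taken from the real encode_df.

```r
# Toy metadata; the values are invented for illustration, not taken from ENCODE.
toy <- data.frame(
  accession = c("file1", "file2", "file3"),
  organism = c("Mus musculus", NA, "Homo sapiens"),
  file_format = c("bed", "bed", "bed"),
  stringsAsFactors = FALSE
)

# grepl() returns FALSE (not NA) when the value being searched is NA,
# so a row with a missing organism can never pass an organism filter.
organism_hit <- grepl("Mus musculus", toy$organism, ignore.case = TRUE)
format_hit <- grepl("bed", toy$file_format, ignore.case = TRUE)

with_organism <- toy[organism_hit & format_hit, ]   # filter on organism and format
without_organism <- toy[format_hit, ]               # filter on format only

print(nrow(with_organism))
print(nrow(without_organism))
```

Dropping the organism filter keeps the NA row, at the cost of also keeping rows from other organisms, so the other arguments have to be specific enough on their own.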
ADD REPLY

Cool, I think I understand why you get differing results using your various queries.  

Internally queryEncode() uses grepl() to match your search terms, and it sets the argument ignore.case = TRUE. If you use this in your dplyr example, you'll also find the hit with the different target, since it contains the word 'control' and everything else matches.

encode_df$experiment %>%  
    filter(biosample_name %in% c("ES-E14", "ES-Bruce4")) %>%
    filter(grepl(pattern = "CTCF|Control", x = target, ignore.case = TRUE)) %>%
    filter(file_format %in% c("bed", "bam"))
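
The difference between exact matching and a case-insensitive grepl() can also be seen on a small vector. This is only a sketch in base R: the target names below are hypothetical stand-ins, not actual values from encode_df.

```r
# Hypothetical target names, chosen only to illustrate the matching behaviour.
targets <- c("CTCF-mouse", "Control-mouse", "IgG-control-mouse", "H3K4me3-mouse")

# Exact matching, as with %in% in the dplyr filter.
exact <- targets %in% c("CTCF-mouse", "Control-mouse")

# Regular expression matching, case-sensitive.
strict_case <- grepl("CTCF|Control", targets)

# Regular expression matching with ignore.case = TRUE also picks up
# the lower-case "control" inside the third name.
loose <- grepl("CTCF|Control", targets, ignore.case = TRUE)

print(data.frame(targets, exact, strict_case, loose))
```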

This works, but actually oversimplifies things a little, since the query is transformed to allow for possible spaces, commas and hyphens.

> ENCODExplorer:::query_transform("bam|bed")
[1] "^b[ ,-]?a[ ,-]?m[ ,-]?|[ ,-]?b[ ,-]?e[ ,-]?d"

This also places a '^' to anchor the first letter to the start of the string, but it is only added right at the start of the query, so it applies only to the first alternative. The second half of the query above will happily match 'bigBed' when you ignore the case. If you swap the order of the two formats in file_format, you'll lose the bigBed results.

queryEncode(df = encode_df, fixed = FALSE,
            biosample = "ES-E14|ES-Bruce4",
            target = "CTCF|Control",
            file_format = "bed|bam")
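
The anchoring behaviour can be checked directly with grepl() on a few format names. The first pattern is copied from the query_transform() output shown earlier in this reply. The second pattern is my own illustration of anchoring every alternative; it is meant for use with plain grepl() on a data frame column, not as something queryEncode() is known to accept.

```r
formats <- c("bam", "bed", "bigBed", "bigWig", "fastq")

# Pattern as returned by ENCODExplorer:::query_transform("bam|bed") above.
# The '^' only applies to the "bam" alternative.
transformed <- "^b[ ,-]?a[ ,-]?m[ ,-]?|[ ,-]?b[ ,-]?e[ ,-]?d"
hits_transformed <- grepl(transformed, formats, ignore.case = TRUE)

# Illustrative alternative: group the terms so that both are anchored
# at the start and the end of the string.
anchored <- "^(bam|bed)$"
hits_anchored <- grepl(anchored, formats, ignore.case = TRUE)

print(data.frame(formats, hits_transformed, hits_anchored))
```

With the transformed pattern, 'bigBed' matches through the unanchored "bed" alternative; with the fully anchored pattern only 'bam' and 'bed' match.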
ADD REPLY

Thanks, this does seem to be the problem. I wasn't expecting there to be such recent updates to the website, or for the snapshot to be more than six months old.

ADD REPLY
