Question

How to run the annotation tool for GARFIELD v2: garfield_annotate_uk10k.sh

0

Entering edit mode

rodrigo.duarte88 ▴ 40

@rodrigoduarte88-16306

Last seen 8 months ago

United States

Hi all,

I think I may have an impossible question here, but if by any chance any of you has used the EMBL-EBI tool 'Garfield' (https://www.ebi.ac.uk/birney-srv/GARFIELD/), could you please advise? Garfield is available as a bunch of shell scripts or as an R package, but I am having an issue specifically with the shell script to create a GARFIELD annotation. Thus, I am aware maybe this is not the best forum to ask for help, but since the authors don't really provide a support channel for this pretty amazing tool, I thought maybe some of you may have come across this issue?

To reiterate, I want to create a functional category for GARFIELD based on a list of SNPs I am interested. The pipeline comes with a script 'garfield_annotate_uk10k.sh' which should do the job, but this is not working. This script is a series of shell commands (and use of a binary file for which code is not available unfortunately). This script will convert an INPUT bed annotation file that informs chr and SNP coordinates, like:

head bed
chr start   end
1   833927  833927
1   1344686 1344686
1   17006932    17006932
1   20498952    20498952
1   21775805    21775805

... to OUTPUT files per chromosome, which look like:

head chr1
833528  0   
833927  1
1344686 0
1344892 0
17006932    1
20498952    0

This output files contains a basal list of SNPs (based on SNP data that comes with the pipeline, on garfield-data.tar.gz), with the second column containing 1/0 to inform if the SNP is present in my input/annotation/bed file, according to the genomic coordinates for that SNP (chr + position).

There are a few errors in the published script (e.g. it's missing "mkdir -p $OUTPUT_DIR" at the beginning; there's a lost "fi" at the end of the script), so the original script runs when I correct these. But it's still producing the wrong output: it's essentially just outputting the SNPs from the "basal list of SNPs" from garfield-data, and not attributing "1"s to the SNPs that are present in my input bed file. Current output looks like this:

head chr1
1 28590 ./garfield-data/tmp.chr1.bed
1 31719 0
1 52152 0
1 52167 0
1 52185 0
1 54392 0
1 54421 0

It's possible my input bed file is not in the right format (unfortunately the authors just say that they require a bed file, but don't say how it should be formatted, so I am using the typical bed file structure (chr start end). It's also possible that there's something wrong with the binary file that comes with the script "garfield_annotate_uk10k_helper". But given the presence of typos in the script, it's possible that there's a mistake in some shell commands which are causing the wrong output to be produced?

I would really appreciate if anyone could have a look at this!

Download files:
garfield-v2.tar.gz
garfield-data.tar.gz (83Gb when uncompressed, sorry!)
Example bed file (not sure this is exactly the format the software takes as input, because authors did not specify in the manual)

Thanks a lot!

garfield awk unix shell embl-ebi • 2.1k views

ADD COMMENT • link updated 3.9 years ago by bk11 • 0 • written 5.9 years ago by rodrigo.duarte88 ▴ 40

0

Entering edit mode

bk11 • 0

@bk11-19547

Last seen 3.9 years ago

I am too having not much luck with garfield_annotate_uk10k.sh and R script provided here. I am still seeing all SNPs in my annotation have 0. Have any one overcome this problem?

ADD COMMENT • link 3.9 years ago bk11 • 0

score 2 · Accepted Answer · 2019-03-25

Just in case someone stumbles upon this in the future, too.... I had help from a colleague and created the required annotation for GARFIELD in R using the short script below:

library(data.table)    
dbsnp<-fread("eQTL_positions.bed", skip = 1)

for(chr in 1:22){
  tmp<-fread(paste0("maftssd/chr", chr))
  allVar<-cbind(tmp$V1, 0)
  overlap<-intersect(allVar[,1], dbsnp$V2[which(dbsnp[,1] == chr)])
  allVar[match(overlap, allVar[,1]),2]<-1
  write.table(allVar,paste0("eQTL_annot/chr", chr), col.names = FALSE, row.names = FALSE, sep = "\t", quote = FALSE)
}

Essentially, this matches my list of SNP positions within my custom bed file with the list of SNPs within "garfield-data.tar.gz", based on files available in "maftssd". These are also split by chromosome number, with the first column consisting of SNP position. Final annotation file is generated per chromosome, containing all SNPs in "maftssd", and 0/1s for absence/presence of the SNP in the annotation bed file.