Distance calculation for variants in intergenic regions
1
0
Entering edit mode
Lna • 0
@lna-10651
Last seen 7.6 years ago

Hi,

I want to determine the three genes which are closest to my annotated intergenic variants in both directions. After annotating the variant list, I used commands which are part of the examples of locateVariants in the variantannotation package.

This is what I tried:

loc <- locateVariants(target, TxDb.Hsapiens.UCSC.hg19.knownGene, AllVariants(promoter=PromoterVariants(downstream=500)))
names(loc) <- NULL
out <- as.data.frame(loc)
out$names <- names(target)[ out$QUERYID ]
annotTable <- out[ , c("names", "seqnames","QUERYID", "start", "end", "LOCATION", "GENEID", "PRECEDEID", "FOLLOWID")]

for (i in 1:nrow(annotTable)) {
  annotTable$distance[[i]] <- integer(0)  
}
for (i in 1:length(loc)){
  p_ids <- unlist(loc$FOLLOWID[i,],use.names=FALSE)  
  if (length(p_ids) > 0){
    exp_ranges <- rep(loc[i,], elementLengths(loc$FOLLOWID[i,]))
    p_dist <- distance(exp_ranges, TxDb.Hsapiens.UCSC.hg19.knownGene, id=p_ids, type="gene")
    print(loc$FOLLOWID[i,])
    print(p_dist)
    dist.order <- order(p_dist)
    annotTable$distance[[i]] <- p_dist[dist.order[1:3]]
  }
}

Now I wonder why the output of the two print statements looks like this, having "NAs" in the list of distances:

FOLLOWIDs:

[[1]] 100130417 100132062 100133331 100133331 100288069 148398 26155 284593 339451 ... 643837 729737 729759 729759 79501 79854 84069 84808 9636

Distances:
[1] 209983 736219     NA     NA 737862 385064     NA 350732 184839 170121 252618 163701  73301  55113  13064 129248 129248 269974 924234 696203
[21] 442766 994792

> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 15.10

locale:
[1] LC_CTYPE=de_DE.UTF-8       LC_NUMERIC=C               LC_TIME=de_DE.UTF-8        LC_COLLATE=de_DE.UTF-8     LC_MONETARY=de_DE.UTF-8  
[6] LC_MESSAGES=de_DE.UTF-8    LC_PAPER=de_DE.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C           
[11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] BiocInstaller_1.20.3                    org.Hs.eg.db_3.2.3                      RSQLite_1.0.0                         
[4] DBI_0.3.1                               TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2 GenomicFeatures_1.22.13               
[7] AnnotationDbi_1.32.3                    VariantAnnotation_1.16.4                Rsamtools_1.22.0                      
[10] SummarizedExperiment_1.0.2              GenomicRanges_1.22.4                    GenomeInfoDb_1.6.3                    
[13] Biostrings_2.38.4                       XVector_0.10.0                          IRanges_2.4.8                         
[16] S4Vectors_0.8.11                        GWASTools_1.16.1                        Biobase_2.30.0                        
[19] BiocGenerics_0.16.1                    

loaded via a namespace (and not attached):
[1] Rcpp_0.12.4             futile.logger_1.4.1     plyr_1.8.3              futile.options_1.0.0    bitops_1.0-6          
[6] tools_3.2.2             zlibbioc_1.16.0         biomaRt_2.26.1          BSgenome_1.38.0         gtable_0.2.0          
[11] lattice_0.20-33         ncdf_1.6.9              logistf_1.21            Matrix_1.2-4            SparseM_1.7           
[16] rtracklayer_1.30.4      MatrixModels_0.4-1      lmtest_0.9-34           grid_3.2.2              GWASExactHW_1.01      
[21] quantsmooth_1.36.0      DNAcopy_1.44.0          XML_3.98-1.4            BiocParallel_1.4.3      survival_2.38-3       
[26] lambda.r_1.1.7          ggplot2_2.1.0           GenomicAlignments_1.6.3 scales_0.4.0            splines_3.2.2         
[31] gdsfmt_1.6.2            colorspace_1.2-6        quantreg_5.21           sandwich_2.3-4          RCurl_1.95-4.8        
[36] munsell_0.4.3           zoo_1.7-12            

How can the distance not be known if the "PRECEDEIDs" and "FOLLOWIDs" are determined based on distance parameters of locateVariants and the same package is used (TxDb.Hsapiens.UCSC.hg19.knownGene) for both commands (locateVariants and distance)? Did I do anything wrong? Is there another way to obtain the genes which are closest to each variant in my list?

Thank you for your help!

locatevariants variantannotation genomicfeatures • 2.9k views
ADD COMMENT
0
Entering edit mode
@michael-lawrence-3846
Last seen 3.1 years ago
United States

There should be no need to loop here, as we can do everything on top of the lists.

    p_ids <- unlist(loc_int$PRECEDEID, use.names=FALSE)
    exp_ranges <- rep(loc_int,  elementNROWS(loc_int$PRECEDEID))
    p_dist <- distance(exp_ranges, txdb, id=p_ids, type="gene")
    p_dist_l <- relist(p_dist, loc_int$PRECEDEID)
    loc_int$distance3 <- phead(sort(p_dist_l), 3L)

Just so you know, the NAs were coming from the out of bounds subscript when there were fewer than 3 elements. phead() takes care of that for you.

ADD COMMENT
0
Entering edit mode

I found it: GenomicRanges:::findKNN(). It's not exported, because I never tested it, and maybe it should be generalized to IRanges. It would be great if someone on the core team got it into working order. Then we could support it in locateVariants().

 

ADD REPLY
0
Entering edit mode

locateVariants can find the closest one? As far as I understood it, it just provides a seemingly unsorted list of all genes within the user defined region (?). Which is not much of a problem if I can obtain the respective distances with the "distance()" function and order by distance. The problem is that some of the distances are "NA", so I cannot tell whether these genes are among the closest 3 genes.
 

ADD REPLY
0
Entering edit mode

Yea, my bad, I made an assumption. Please see the edited answer.

ADD REPLY
0
Entering edit mode

You are right that I have to take care of the out of bounds case for results with fewer than three elements, but the NA entries are surely not coming from that, since my print statement comes before cutting the list to three elements. I am printing the complete list here, input list as well as output.

In the meanwhile I was checking the specific geneIDs in loc$FOLLOWID that lead to NAs on the UCSC webpage. The thing is: For all "following" genes of a specific gene in "target" on chr1 within a given distance, I obtain a list of geneID on chr1 with "locateVariants". That is ok, so far. But when I check on these IDs on the UCSC webpage I get several hits for some IDs, including hits on chr5. Could this be the problem? Maybe the distance function only checks on this hit on chr5 and obviously cannot determine the distance of a gene on chr1 to chr5?

If this is the problem... what can I do about it?

ADD REPLY
0
Entering edit mode

Maybe you could make available a subset of the "target" object that reproduces the problem.

ADD REPLY
0
Entering edit mode
GRanges object with 1 range and 5 metadata columns:
      seqnames             ranges strand | paramRangeID            REF                ALT      QUAL      FILTER
         <Rle>          <IRanges>  <Rle> |     <factor> <DNAStringSet> <DNAStringSetList> <numeric> <character>
  549     chr1 [1064802, 1064801]      * |         <NA>                                        <NA>          GT

Is this ok, or what did you have in mind?

ADD REPLY
0
Entering edit mode

I can't reproduce the NAs, and I just noticed that your Bioconductor version is over a year old. Please update.

ADD REPLY
0
Entering edit mode

Ok, thank you, I'll try that. So you get the same list of distances (printed above), but "normal" distances except of the NA entries?

ADD REPLY
0
Entering edit mode

Everything should be uptodate now. I deleted R and all Bioconductor packages but the problem remains.

> sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.1 LTS

locale:
 [1] LC_CTYPE=de_DE.UTF-8       LC_NUMERIC=C               LC_TIME=de_DE.UTF-8        LC_COLLATE=de_DE.UTF-8     LC_MONETARY=de_DE.UTF-8    LC_MESSAGES=de_DE.UTF-8   
 [7] LC_PAPER=de_DE.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] org.Hs.eg.db_3.4.0                      TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2 GenomicFeatures_1.26.0                  AnnotationDbi_1.36.0                   
 [5] VariantAnnotation_1.20.1                Rsamtools_1.26.1                        Biostrings_2.42.0                       XVector_0.14.0                         
 [9] SummarizedExperiment_1.4.0              GenomicRanges_1.26.1                    GenomeInfoDb_1.10.1                     IRanges_2.8.1                          
[13] S4Vectors_0.12.0                        GWASTools_1.20.0                        Biobase_2.34.0                          BiocGenerics_0.20.0                    
[17] BiocInstaller_1.24.0                   

loaded via a namespace (and not attached):
 [1] bitops_1.0-6             tools_3.3.2              zlibbioc_1.20.0          biomaRt_2.30.0           ncdf4_1.15               BSgenome_1.42.0         
 [7] RSQLite_1.0.0            lattice_0.20-34          logistf_1.21             Matrix_1.2-7.1           DBI_0.5-1                SparseM_1.74            
[13] rtracklayer_1.34.1       MatrixModels_0.4-1       lmtest_0.9-34            grid_3.3.2               GWASExactHW_1.01         quantsmooth_1.40.0      
[19] DNAcopy_1.48.0           XML_3.98-1.5             survival_2.40-1          BiocParallel_1.8.1       GenomicAlignments_1.10.0 splines_3.3.2           
[25] gdsfmt_1.10.0            quantreg_5.29            sandwich_2.3-4           RCurl_1.95-4.8           zoo_1.7-13   

When look for "following" genes and use "distance()" to obtain the respective distances, I get this:

[1] 209983 736219     NA     NA 737862 385064     NA 350732 184839 170121 252618 163701  73301  55113  13064 129248 129248 269974 924234 696203 442766 994792

I think this is some kind of bug. Using a distance range to obtain geneids and using these geneids to obtain distances leads to the correct distances in 19 of 22 cases und fails three times...

ADD REPLY
0
Entering edit mode

I removed R and Bioconductor completely and installed the new versions.

I was also thinking that there should be no loop, but somehow it doesn't work for me.

p_ids <- unlist(loc_int$FOLLOWID, use.names=FALSE)
exp_ranges <- rep(loc_int,  elementNROWS(loc_int$FOLLOWID))
p_dist <- distance(exp_ranges, TxDb.Hsapiens.UCSC.hg19.knownGene, id=p_ids, type="gene")
loc_int$FOLLOW_DIST <- relist(p_dist, loc_int$FOLLOWID)

which should work exactly like the code you posted, except I removed the last part with sorting and selecting three elements (for testing), leads to this (corresponds to the line of the query I posted above):

> unlist(loc_int$FOLLOW_DIST[1])
 [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA

When I check other queries, most of the values are like this and there are some numbers in between the NAs, which are far too big to be the distances I was looking for. I really don't know what this means, but when I made the loop and made a query for each line in my table, it worked fine (I checked the distances), except of the original problem with single NAs at some positions.

ADD REPLY
0
Entering edit mode

Ok, I see what is happening. There are 403 genes that have transcripts on both strands. Those genes are excluded from gene-level queries by default. The distance() method should probably retrieve all of the genes and reduce over the strands. But for this analysis, I think you might want the distance to the transcript with the promoter containing the variant. It would be nice if locateVariants() had the option to return transcript IDs, not just gene Ids.

ADD REPLY
0
Entering edit mode

Sorry, but I don't really get it... When I obtain all distances at once (like you proposed) OR when I make a loop over my variants (so I obtain a list of distances for one variant) OR when I make a loop over my variants and a nested loop over each geneid of one variant, shouldn't this give the same result in each of the three cases??? If not... I really don't understand why. I get three different results and it seems only the last option (two loops) leads to the correct distances (even without any NAs). It would really be helpful if you clarified this.

ADD REPLY
0
Entering edit mode

Not sure. When I run the code above, I get the correct results (or at least they appear to be correct), just with some NAs.

ADD REPLY
0
Entering edit mode

So you get NAs? Didn't you say you don't get them? Ok... let's make a simple test including the second and third option I described (Call distance() for all geneids close to my variant at once AND call distance() for each of the geneids independently):

gr <- GRanges("chr1", IRanges(c(1064801), width=1))
loc_int <- locateVariants(gr, TxDb.Hsapiens.UCSC.hg19.knownGene, AllVariants(promoter=PromoterVariants(downstream=500)))
exp_ranges <- rep(loc_int,  elementNROWS(loc_int$FOLLOWID))
p_ids <- unlist(loc_int$FOLLOWID, use.names=FALSE)
p_ids
p_dist <- distance(exp_ranges, TxDb.Hsapiens.UCSC.hg19.knownGene, id=p_ids, type="gene")
p_dist
p_dist2 <- c()
for (i in 1:length(p_ids)){
  p_dist2[i] <- distance(gr, TxDb.Hsapiens.UCSC.hg19.knownGene, id=p_ids[i], type="gene")
}
p_dist2

This leads to the following output:

> p_ids
 [1] "100130417" "100132062" "100133331" "100133331" "100288069" "148398"    "26155"     "284593"    "339451"    "375790"   
[11] "401934"    "54991"     "57801"     "643837"    "729737"    "729759"    "729759"    "79501"     "79854"     "84069"    
[21] "84808"     "9636"     
> p_dist
 [1] 209983 736219     NA     NA 737862 385064     NA 350732 184839 170121 252618 163701  73301  55113  13064 129248 129248
[18] 269974 924234 696203 442766 994792
> p_dist2
 [1] 209983 736219 737862 737862 350732 184839 170121 252618 163701  73301  55113  13064 129248 269974 924234 696203 696203
[18] 994792 301898 154316 147327 114881

The list p_dist2 seems to include all distances of the geneids in p_ids, no NAs, the distances I checked are correct. Everything fine. The first two entries of p_dist are the same as in p_dist2. The remaining distances which are the same in both lists are at different positions and there are three NAs.

I would really like to know:

1. Can anyone reproduce the fact that p_dist and p_dist2 differ?

2. Is there any other explanation for a difference between p_dist and p_dist2 than a bug in distance()?

If I have to get the distances using two loops I think it will take days...

ADD REPLY
0
Entering edit mode

The issue is as you initially suspected: there are genes with transcripts that are non-overlapping (even on different chromosomes), so there is no sane way to reduce those to a range for the entire gene. There was a bug that prevented an informative error message in this case, and I've fixed it. The real solution here is to query by transcript ID, and it looks like Val is working on that.

ADD REPLY
0
Entering edit mode

Ok, but does that explain why distance() gives different results depending on whether a single geneid is submitted or the geneid is part of a list of geneids? If it does not work for the genes you mentioned, shouldn't it work at all for these genes?

ADD REPLY
0
Entering edit mode

Yes, it does explain it, because everything got mixed up (a technical term) when the appropriate assertion was not in place.
 

ADD REPLY
0
Entering edit mode

I can add an option for locationVariants() to return transcript ids. I'll post back here when it's done.

Valerie

ADD REPLY
0
Entering edit mode

This is done in VariantAnnotation 1.21.6. I've added a 'distanceIDAsGeneID' argument to the IntergenicVariants() constructor:

IntergenicVariants(distanceIDAsGeneID=TRUE)

When TRUE (default) the values in PRECEDEID and FOLLOWID are geneids; when FALSE transcript ids are returned. I've modified the example to show that the call to distance() should specify type="tx" when distanceIDAsGeneID=TRUE.

Let me know if you run into problems or if this was not the implementation you were after.

Valerie

ADD REPLY
0
Entering edit mode

The name of the argument seems strange since the locateVariants() function does not (explicitly) compute any distances. Maybe something like idType=c("gene", "tx")? Or maybe it could just return both the gene and tx ID? Maybe the return value could somehow keep track of the type of ID in each column, so that distance() would not require the type argument?

ADD REPLY
1
Entering edit mode

I agree, idType is a better name. I've made this change in 1.21.7 and made it a character with default "gene".

Returning both id types and tracking id type in the columns would be more involved. I can't get to this right away - could put it on the TODO or accept a patch.

Val

ADD REPLY
0
Entering edit mode

I've re-worked distance,GenomicRanges,TxDb-method in GenomicFeatures so it's more well behaved. It now returns NA when (1) 'id' is not in 'y' or (2) when 'id' defines a range that can't be reduced to a single range (e.g., gene id spans multiple chromosomes). Previously the method would just throw an error when either of these conditions were met.

Changes are in GenomicFeatures 1.27.4.

Valerie

ADD REPLY

Login before adding your answer.

Traffic: 680 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6