Extract GRanges for gene data frame
2
1
Entering edit mode
mdeea123 ▴ 10
@mdeea123-11297
Last seen 8.4 years ago

I have a set of genes with associated data that I want to locate on the mouse genome

My approach is to

  • extract the genes in the genome
  • subset the genes to include only my list
  • add the extra data for my list into the GRanges as metadata
  • export the relevant parts of the GRanges object.
MouseGenesGR <- genes(Mus.musculus, columns="SYMBOL") 

head(means)
          Hmeans   Lmeans    Nmeans
Aspn    3.779835 7.795773 11.534907
Angpt1 10.251626 8.220874  4.713242
Gm773   2.867759 6.118540 10.058305
Lifr    6.828385 8.136443  5.265472
Il1rl1  5.832741 8.427293 11.550407
Ogn     7.865109 9.429068 12.435698

dim(means)
[1] 1000    3

MySet1 <- MouseGenesGR[any(mcols(MouseGenesGR)$SYMBOL %in% row.names(means))]  #subset my genes

length(MySet1)
[1] 911

What happened to the other 89?

And how do I query that?

Now I would like to add in the metadata with 

mcols(MySet1)$HMeans <- means$Hmeans 
mcols(MySet1)$LMeans <- means$Lmeans 
mcols(MySet1)$NMeans <- means$Nmeans 

but I can't because the two objects are now of different lengths

Error in `[[<-`(`*tmp*`, name, value = c(3.77983495075, 10.2516255123333,  : 
  1000 elements in value to replace 911 elements

Any help appreciated.

Mitch

 
GRange • 2.6k views
ADD COMMENT
0
Entering edit mode
phil.chapman ▴ 150
@philchapman-8324
Last seen 8.3 years ago
United Kingdom

I'm not sure if you're trying to end up with a data.frame or a GRanges object but either way you might be better off coercing to/from a data frame as it makes the join operation easier. For example:

library(EnsDb.Hsapiens.v75)
library(GenomicRanges)
library(dplyr)
library(tibble)

#info on genes
gene_info <- genes(EnsDb.Hsapiens.v75, filter=GenebiotypeFilter('protein_coding'))
gene_info <- as.data.frame(gene_info) %>% tbl_df()

#dummy mean info
mean_info <- matrix(nrow=7, rnorm(n=21), dimnames = list(c('BRAF','EGFR','NRAS','TP53','PTEN','KRAS', 'NOMATCHGENE'), c('Hmeans','Lmeans','Nmeans')))
mean_info <- mean_info %>% as.data.frame() %>% rownames_to_column('gene_name')

#join operation
merged_df <- gene_info %>% inner_join(mean_info, by='gene_name')
merged_gr <- merged_df %>% as.data.frame() %>% as('GRanges')
names(merged_gr) <-  merged_gr$gene_id

If you want to see where your missing symbols are coming from, you can just substitute the inner_join with a left_join to ensure that you receive a data frame the same length as the left hand data frame:

> lj_df <- mean_info %>% left_join(gene_info, by='gene_name')
> lj_df %>% dplyr::filter(is.na(gene_id)) %>% dplyr::select(1:10)
    gene_name   Hmeans    Lmeans     Nmeans seqnames start end width strand gene_id
1 NOMATCHGENE -0.23859 0.3558788 -0.9541681     <NA>    NA  NA    NA   <NA>    <NA>

 

 

 

 

ADD COMMENT
0
Entering edit mode
@martin-morgan-1513
Last seen 5 months ago
United States

Here's something reproducible

library(Mus.musculus)
gr = genes(Mus.musculus, columns="SYMBOL")
df = data.frame(1:5, row.names=unlist(gr$SYMBOL[1:5]))

Use merge() to merge the two objects

all = merge(gr, df, by.x="SYMBOL", by.y="row.names", sort=FALSE)

Coerce the result back to a genomic ranges for future  computation

as(all, "GRanges")

As for the missing symbols, you could ask about the (asymetric) setdiff(unlist(gr$SYMBOLS), rownames(df)) or similar; one might have more luck with column="ALIAS", though the merge would require more work -- each element of gr might match to 0 or more rows of df.

For SYMBOL, I'd probably tweak gr after checking that the 'CharacterList' representation of SYMBOL is not needed

> all(lengths(gr$SYMBOL) == 1)
[1] TRUE
> gr$SYMBOL = unlist(gr$SYMBOL)

 

 

ADD COMMENT

Login before adding your answer.

Traffic: 388 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6