select With Regular Expression
@dario-strbenac-5916
The protocadherin family of genes has gene symbols such as PCDHA1, PCDHA2, and PCDHB1. I'd like to get the chromosome, strand, start and end coordinates of every protocadherin gene. The select function has a keys parameter which requires a character vector. Instead of manually finding which elements have the PCDH prefix, like this:
> symbols <- keys(org.Hs.eg.db, "SYMBOL")
> pKeys <- grep("PCDH*", symbols, value = TRUE)
> head(select(org.Hs.eg.db, pKeys, "ENTREZID","SYMBOL"))
'select()' returned 1:1 mapping between keys and columns
SYMBOL ENTREZID
1 PCDH1 5097
2 PCDHGC3 5098
3 PCDH7 5099
4 PCDH8 5100
5 PCDH9 5101
6 PCDHGB4 8641
is there a way to use regular expressions with select? Once the gene symbols are converted into Entrez IDs, I'll query org.Hs.eg.db for the locations.
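A side note on the pattern used above: as a regular expression, "PCDH*" means "PCD" followed by zero or more "H" anywhere in the string, so it can match more than the protocadherin symbols. Here is a minimal base R sketch of the difference, using a made-up toy vector in place of keys(org.Hs.eg.db, "SYMBOL"):

```r
# Hypothetical toy vector standing in for keys(org.Hs.eg.db, "SYMBOL")
symbols <- c("PCDHA1", "PCDHB1", "PCD", "APCDD1", "CDH1", "PCDH9-AS2")

# "PCDH*" is "PCD" followed by zero or more "H", matched anywhere in the string
loose <- grep("PCDH*", symbols, value = TRUE)

# "^PCDH" anchors the literal prefix to the start of the symbol
anchored <- grep("^PCDH", symbols, value = TRUE)

print(loose)
print(anchored)
```

On this toy input the unanchored pattern also picks up "PCD" and "APCDD1", while the anchored one keeps only the symbols that start with PCDH.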
annotationdbi
Wildcard
@james-w-macdonald-5106
library(Homo.sapiens)
select(Homo.sapiens, keys(Homo.sapiens, "SYMBOL", pattern = "^PCDH"), c("CDSCHROM","CDSSTART","CDSEND"), "SYMBOL")
Or, maybe more usefully, depending on what you are after:
> tx <- transcriptsBy(Homo.sapiens, columns = "SYMBOL")
'select()' returned 1:1 mapping between keys and columns
> z <- mapIds(Homo.sapiens, keys(Homo.sapiens, "SYMBOL", pattern = "^PCDH"), "ENTREZID","SYMBOL")
'select()' returned 1:1 mapping between keys and columns
> tx[names(tx) %in% z]
GRangesList object of length 71:
$100874064
GRanges object with 1 range and 2 metadata columns:
seqnames ranges strand | tx_name SYMBOL
<Rle> <IRanges> <Rle> | <character> <CharacterList>
[1] chr13 [67399301, 67489163] + | uc031qmb.1 PCDH9-AS2
$100874086
GRanges object with 1 range and 2 metadata columns:
seqnames ranges strand | tx_name SYMBOL
[1] chr13 [67551521, 67559908] + | uc031qmc.1 PCDH9-AS3
$26025
GRanges object with 2 ranges and 2 metadata columns:
seqnames ranges strand | tx_name SYMBOL
[1] chr5 [140810158, 140812789] + | uc011dba.2 PCDHGA12
[2] chr5 [140810158, 140892548] + | uc003lkt.2 PCDHGA12
...
<68 more elements>
-------
seqinfo: 93 sequences (1 circular) from hg19 genome
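The last step above subsets by name: transcriptsBy returns a list-like object named by Entrez Gene ID, and mapIds returns a character vector of Entrez IDs named by symbol, so names(tx) %in% z keeps only the matching genes. Here is a rough base R sketch of the same idea, using a plain list and made-up stand-ins rather than real GRangesList objects (the "1" entry is hypothetical):

```r
# Toy stand-in for the GRangesList: a plain list named by Entrez Gene ID
tx <- list(
  "100874064" = "uc031qmb.1",
  "26025"     = c("uc011dba.2", "uc003lkt.2"),
  "1"         = "some_other_transcript"  # hypothetical non-PCDH entry
)

# Toy stand-in for the mapIds() result: Entrez IDs named by gene symbol
z <- c("PCDH9-AS2" = "100874064", "PCDHGA12" = "26025")

# Keep only the list elements whose names appear among the mapped IDs
hits <- tx[names(tx) %in% z]
print(names(hits))
```

Using %in% rather than indexing with tx[z] means that IDs in z with no transcripts in tx are silently skipped instead of causing trouble.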
@johannes-rainer-6987
You can do a pattern search using ensembldb filters on EnsDb objects:
## Load the human annotations for Ensembl 75
> library(EnsDb.Hsapiens.v75)
> edb <- EnsDb.Hsapiens.v75
## Use a GenenameFilter specifying the pattern (has to be a SQL pattern, so, % instead of *)
> Res <- select(edb, keys=GenenameFilter("PCDH%", condition="like"))
> unique(Res$GENENAME)
[1] "PCDHB4" "PCDHA6" "PCDHGA2" "PCDH11Y" "PCDH11X" "PCDHB2"
[7] "PCDHB3" "PCDHB5" "PCDHB6" "PCDHB7" "PCDHB15" "PCDH12"
[13] "PCDH17" "PCDHB8" "PCDHB10" "PCDHB14" "PCDHB12" "PCDH8"
[19] "PCDH10" "PCDHB18" "PCDH15" "PCDH1" "PCDH19" "PCDH7"
[25] "PCDHB1" "PCDHB9" "PCDH9" "PCDHB13" "PCDH18" "PCDHB16"
[31] "PCDHB11" "PCDH20" "PCDHGA1" "PCDHA9" "PCDHA8" "PCDHA7"
[37] "PCDHA5" "PCDHA4" "PCDHA2" "PCDHA1" "PCDH9-AS3" "PCDH8P1"
[43] "PCDH9-AS2" "PCDH9-AS4" "PCDH9-AS1" "PCDHA13" "PCDHGC3" "PCDHGC5"
[49] "PCDHGC4" "PCDHAC2" "PCDHAC1" "PCDHGB8P" "PCDHA11" "PCDHA14"
[55] "PCDHA10" "PCDHA12" "PCDHGA12" "PCDHGB6" "PCDHGA5" "PCDHGA7"
[61] "PCDHGA6" "PCDHGA8" "PCDHGA10" "PCDHGA11" "PCDHGB2" "PCDHGB4"
[67] "PCDHGB7" "PCDHGB1" "PCDHGA3" "PCDHA3" "PCDHB17" "PCDHGA9"
[73] "PCDHB19P" "PCDHGB3" "PCDHGA4"
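For anyone more used to regular expressions than SQL patterns: in SQL LIKE, % matches any run of characters and _ matches exactly one character. A small base R sketch of that correspondence follows. The helper like_to_regex is hypothetical (it is not part of ensembldb) and assumes the pattern contains only letters, digits, hyphens, % and _:

```r
# Hypothetical helper, not part of ensembldb: translate a simple SQL LIKE
# pattern into an anchored regular expression.
like_to_regex <- function(pattern) {
  # Assumption: only letters, digits, "-", "%" and "_" occur in the pattern
  stopifnot(grepl("^[A-Za-z0-9%_-]*$", pattern))
  rx <- gsub("%", ".*", pattern, fixed = TRUE)  # % = any run of characters
  rx <- gsub("_", ".", rx, fixed = TRUE)        # _ = exactly one character
  paste0("^", rx, "$")                          # LIKE matches the whole string
}

rx <- like_to_regex("PCDH%")
print(rx)
print(grepl(rx, c("PCDH1", "PCDHGA12", "APCDD1")))
```

So "PCDH%" in the filter above plays the same role as "^PCDH" in the grep and keys(..., pattern = ) approaches.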
## Alternatively, just use the genes method:
> genes(edb, filter=GenenameFilter("PCDH%", condition="like"))
GRanges object with 109 ranges and 5 metadata columns:
seqnames ranges strand | gene_id
<Rle> <IRanges> <Rle> | <character>
ENSG00000099715 Y [ 4868267, 5610265] + | ENSG00000099715
ENSG00000169851 4 [30722037, 31148422] + | ENSG00000169851
ENSG00000136099 13 [53418109, 53422775] - | ENSG00000136099
... ... ... ... . ...
ENSG00000240764 5 [140868808, 140892546] + | ENSG00000240764
ENSG00000156453 5 [141232938, 141258811] - | ENSG00000156453
ENSG00000113555 5 [141323150, 141349304] - | ENSG00000113555
gene_name entrezid gene_biotype seq_coord_system
<character> <character> <character> <character>
ENSG00000099715 PCDH11Y 83259;27328 protein_coding chromosome
ENSG00000169851 PCDH7 5099 protein_coding chromosome
ENSG00000136099 PCDH8 5100 protein_coding chromosome
... ... ... ... ...
ENSG00000240764 PCDHGC5 5098;56097 protein_coding chromosome
ENSG00000156453 PCDH1 5097 protein_coding chromosome
ENSG00000113555 PCDH12 51294 protein_coding chromosome
-------
seqinfo: 7 sequences from GRCh37 genome
As you can see above, Ensembl 75 is based on the "old" GRCh37 genome release. You might want to use a more recent annotation, which is easy to get, e.g. using AnnotationHub (check the ensembldb vignette).
cheers, jo