Pick coding vs non-coding from Ensembl data set
@jdmentze-18572
I have the ensembl and EnsDb.Hsapiens.v86 packages, as well as a data set with 60,000+ rows of genes. However, I want to focus on only the coding genes or only the non-coding genes at a time. I am unsure which function to use and how to call it. How would I do this in R, so that I can create a new data set for retrieval and analysis? Example:
R> data_coding <- "function"
R> data_noncoding <- "function"
R
ensembl
ensembldb
noncoding rna
protein coding
@johannes-rainer-6987
You can simply filter the complete EnsDb database to contain only protein coding genes or all other genes (note: this includes miRNA genes, lincRNAs, pseudogenes, snoRNA, scaRNA, sRNA, scRNA, rRNA ...).
> library(EnsDb.Hsapiens.v86)
> ## Filter the EnsDb database for protein coding genes
> edb_coding <- filter(EnsDb.Hsapiens.v86, filter = ~ gene_biotype == "protein_coding")
> ## Any query will now extract only information for protein coding genes
> genes(edb_coding)
GRanges object with 22285 ranges and 6 metadata columns:
                  seqnames            ranges strand |         gene_id
                     <Rle>         <IRanges>  <Rle> |     <character>
  ENSG00000186092        1       69091-70008      + | ENSG00000186092
  ENSG00000279928        1     182393-184158      + | ENSG00000279928
              ...      ...               ...    ... .             ...
  ENSG00000280301        Y 25463994-25473714      + | ENSG00000280301
  ENSG00000172288        Y 25622162-25624902      + | ENSG00000172288
                    gene_name   gene_biotype seq_coord_system      symbol
                  <character>    <character>      <character> <character>
  ENSG00000186092       OR4F5 protein_coding       chromosome       OR4F5
  ENSG00000279928  FO538757.2 protein_coding       chromosome  FO538757.2
              ...         ...            ...              ...         ...
  ENSG00000280301  AC006328.1 protein_coding       chromosome  AC006328.1
  ENSG00000172288        CDY1 protein_coding       chromosome        CDY1
                                 entrezid
                                   <list>
  ENSG00000186092                   79501
  ENSG00000279928 c(107984078, 102725121)
              ...                     ...
  ENSG00000280301                      NA
  ENSG00000172288                    9085
  -------
  seqinfo: 287 sequences from GRCh38 genome
> ## Or getting all transcripts
> transcripts(edb_coding)
GRanges object with 158444 ranges and 6 metadata columns:
                  seqnames            ranges strand |           tx_id
                     <Rle>         <IRanges>  <Rle> |     <character>
  ENST00000335137        1       69091-70008      + | ENST00000335137
  ENST00000624431        1     182393-184158      + | ENST00000624431
              ...      ...               ...    ... .             ...
  ENST00000361963        Y 25622162-25624338      + | ENST00000361963
  ENST00000306609        Y 25622162-25624902      + | ENST00000306609
                      tx_biotype tx_cds_seq_start tx_cds_seq_end
                     <character>        <integer>      <integer>
  ENST00000335137 protein_coding            69091          70008
  ENST00000624431 protein_coding           182709         184158
              ...            ...              ...            ...
  ENST00000361963 protein_coding         25622443       25624065
  ENST00000306609 protein_coding         25622443       25624527
                          gene_id         tx_name
                      <character>     <character>
  ENST00000335137 ENSG00000186092 ENST00000335137
  ENST00000624431 ENSG00000279928 ENST00000624431
              ...             ...             ...
  ENST00000361963 ENSG00000172288 ENST00000361963
  ENST00000306609 ENSG00000172288 ENST00000306609
  -------
  seqinfo: 287 sequences from GRCh38 genome
For the non-coding genes you can do it analogously:
> edb_noncoding <- filter(EnsDb.Hsapiens.v86, filter = ~ gene_biotype != "protein_coding")
Also note that you can use return.type = "DataFrame" in each function (e.g. genes, transcripts, ...) to extract the information as a DataFrame instead of the default GRanges.
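To get from the filtered database back to your own table, one option is to pull the gene IDs out as a plain vector and subset with them. The following is only a rough sketch: the helper split_by_coding, the toy table my_data and its gene_id column are hypothetical stand-ins for your 60,000+ row data set, and I assume that your table stores Ensembl gene IDs. The two coding IDs are hard-coded from the output above so that the sketch runs without the annotation package.

```r
# Hypothetical helper (not part of ensembldb): split a gene table into
# rows whose ID is in `coding_ids` and all remaining rows.
split_by_coding <- function(dat, coding_ids, id_col = "gene_id") {
    is_coding <- dat[[id_col]] %in% coding_ids
    list(coding = dat[is_coding, , drop = FALSE],
         noncoding = dat[!is_coding, , drop = FALSE])
}

# Toy stand-in for the real data set; the third ID is made up and only
# plays the role of a gene that is not protein coding.
my_data <- data.frame(
    gene_id = c("ENSG00000186092", "ENSG00000172288", "ENSG00000000000"),
    value = c(1.5, 2.0, 0.3),
    stringsAsFactors = FALSE
)

# With EnsDb.Hsapiens.v86 loaded, the IDs would instead come from:
#   coding_ids <- genes(edb_coding, return.type = "DataFrame")$gene_id
coding_ids <- c("ENSG00000186092", "ENSG00000172288")

res <- split_by_coding(my_data, coding_ids)
data_coding <- res$coding
data_noncoding <- res$noncoding
```

Note that in this sketch data_noncoding simply holds every row that is not in the coding set, so it also catches IDs that are absent from Ensembl release 86. If that matters, subset with the gene IDs from edb_noncoding instead of negating the coding set.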
cheers, jo