lncRNA Genes in a Dataset
@2fc02fda
I am trying to extract the lncRNA genes from a large dataset. It contains about 40,000 genes identified by Ensembl ID, so looking each one up on the website is not practical. Is there a way to extract them in R? Thank you in advance.
ncRNAtools
RNASeqData
@james-w-macdonald-5106
You don't say the species, so I will presume human. The easiest way to get these data is from an EnsDb package, and most of those are available through AnnotationHub. Here's how you would get one and filter to just the lncRNAs.
> library(AnnotationHub)
> hub <- AnnotationHub()
> query(hub, c("homo sapiens","ensdb"))
AnnotationHub with 20 records
# snapshotDate(): 2021-10-20
# $dataprovider: Ensembl
# $species: Homo sapiens
# $rdataclass: EnsDb
# additional mcols(): taxonomyid, genome, description,
# coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
# rdatapath, sourceurl, sourcetype
# retrieve records with, e.g., 'object[["AH53211"]]'
title
AH53211 | Ensembl 87 EnsDb for Homo Sapiens
AH53715 | Ensembl 88 EnsDb for Homo Sapiens
AH56681 | Ensembl 89 EnsDb for Homo Sapiens
AH57757 | Ensembl 90 EnsDb for Homo Sapiens
AH60773 | Ensembl 91 EnsDb for Homo Sapiens
... ...
AH83216 | Ensembl 101 EnsDb for Homo sapiens
AH89180 | Ensembl 102 EnsDb for Homo sapiens
AH89426 | Ensembl 103 EnsDb for Homo sapiens
AH95744 | Ensembl 104 EnsDb for Homo sapiens
AH98047 | Ensembl 105 EnsDb for Homo sapiens
## we'll use the latest version
> ensdb <- hub[["AH98047"]]
loading from cache
require("ensembldb")
> gns <- genes(ensdb)
> gns
GRanges object with 69329 ranges and 9 metadata columns:
seqnames ranges strand | gene_id
<Rle> <IRanges> <Rle> | <character>
ENSG00000223972 1 11869-14409 + | ENSG00000223972
ENSG00000227232 1 14404-29570 - | ENSG00000227232
ENSG00000278267 1 17369-17436 - | ENSG00000278267
ENSG00000243485 1 29554-31109 + | ENSG00000243485
ENSG00000284332 1 30366-30503 + | ENSG00000284332
... ... ... ... . ...
ENSG00000224240 Y 26549425-26549743 + | ENSG00000224240
ENSG00000227629 Y 26586642-26591601 - | ENSG00000227629
ENSG00000237917 Y 26594851-26634652 - | ENSG00000237917
ENSG00000231514 Y 26626520-26627159 - | ENSG00000231514
ENSG00000235857 Y 56855244-56855488 + | ENSG00000235857
gene_name gene_biotype seq_coord_system
<character> <character> <character>
ENSG00000223972 DDX11L1 transcribed_unproces.. chromosome
ENSG00000227232 WASH7P unprocessed_pseudogene chromosome
ENSG00000278267 MIR6859-1 miRNA chromosome
ENSG00000243485 MIR1302-2HG lncRNA chromosome
ENSG00000284332 MIR1302-2 miRNA chromosome
... ... ... ...
ENSG00000224240 CYCSP49 processed_pseudogene chromosome
ENSG00000227629 SLC25A15P1 unprocessed_pseudogene chromosome
ENSG00000237917 PARP4P1 unprocessed_pseudogene chromosome
ENSG00000231514 CCNQP2 processed_pseudogene chromosome
ENSG00000235857 CTBP2P1 processed_pseudogene chromosome
description gene_id_version canonical_transcript
<character> <character> <character>
ENSG00000223972 DEAD/H-box helicase .. ENSG00000223972.5 ENST00000450305
ENSG00000227232 WASP family homolog .. ENSG00000227232.5 ENST00000488147
ENSG00000278267 microRNA 6859-1 [Sou.. ENSG00000278267.1 ENST00000619216
ENSG00000243485 MIR1302-2 host gene .. ENSG00000243485.5 ENST00000473358
ENSG00000284332 microRNA 1302-2 [Sou.. ENSG00000284332.1 ENST00000607096
... ... ... ...
ENSG00000224240 CYCS pseudogene 49 [.. ENSG00000224240.1 ENST00000420810
ENSG00000227629 solute carrier famil.. ENSG00000227629.1 ENST00000456738
ENSG00000237917 poly(ADP-ribose) pol.. ENSG00000237917.1 ENST00000435945
ENSG00000231514 CCNQ pseudogene 2 [S.. ENSG00000231514.1 ENST00000435741
ENSG00000235857 CTBP2 pseudogene 1 [.. ENSG00000235857.1 ENST00000431853
symbol entrezid
<character> <list>
ENSG00000223972 DDX11L1 102725121,100287596,100287102,...
ENSG00000227232 WASH7P <NA>
ENSG00000278267 MIR6859-1 102466751
ENSG00000243485 MIR1302-2HG <NA>
ENSG00000284332 MIR1302-2 100302278
... ... ...
ENSG00000224240 CYCSP49 <NA>
ENSG00000227629 SLC25A15P1 <NA>
ENSG00000237917 PARP4P1 <NA>
ENSG00000231514 CCNQP2 <NA>
ENSG00000235857 CTBP2P1 <NA>
-------
seqinfo: 456 sequences from GRCh38 genome
## note the gene_biotype column above
> lncs <- gns[gns$gene_biotype %in% "lncRNA"]
> lncs
GRanges object with 18812 ranges and 9 metadata columns:
seqnames ranges strand | gene_id
<Rle> <IRanges> <Rle> | <character>
ENSG00000243485 1 29554-31109 + | ENSG00000243485
ENSG00000237613 1 34554-36081 - | ENSG00000237613
ENSG00000238009 1 89295-133723 - | ENSG00000238009
ENSG00000239945 1 89551-91105 - | ENSG00000239945
ENSG00000239906 1 139790-140339 - | ENSG00000239906
... ... ... ... . ...
ENSG00000228296 Y 25063083-25099892 - | ENSG00000228296
ENSG00000223641 Y 25182277-25213389 - | ENSG00000223641
ENSG00000228786 Y 25378300-25394719 - | ENSG00000228786
ENSG00000240450 Y 25482908-25486705 + | ENSG00000240450
ENSG00000231141 Y 25728490-25733388 + | ENSG00000231141
gene_name gene_biotype seq_coord_system
<character> <character> <character>
ENSG00000243485 MIR1302-2HG lncRNA chromosome
ENSG00000237613 FAM138A lncRNA chromosome
ENSG00000238009 lncRNA chromosome
ENSG00000239945 lncRNA chromosome
ENSG00000239906 lncRNA chromosome
... ... ... ...
ENSG00000228296 TTTY4C lncRNA chromosome
ENSG00000223641 TTTY17C lncRNA chromosome
ENSG00000228786 LINC00266-4P lncRNA chromosome
ENSG00000240450 CSPG4P1Y lncRNA chromosome
ENSG00000231141 TTTY3 lncRNA chromosome
description gene_id_version canonical_transcript
<character> <character> <character>
ENSG00000243485 MIR1302-2 host gene .. ENSG00000243485.5 ENST00000473358
ENSG00000237613 family with sequence.. ENSG00000237613.2 ENST00000417324
ENSG00000238009 novel transcript ENSG00000238009.6 ENST00000477740
ENSG00000239945 novel transcript ENSG00000239945.1 ENST00000495576
ENSG00000239906 novel transcript ENSG00000239906.1 ENST00000493797
... ... ... ...
ENSG00000228296 testis-specific tran.. ENSG00000228296.1 ENST00000456123
ENSG00000223641 testis-specific tran.. ENSG00000223641.2 ENST00000421387
ENSG00000228786 long intergenic non-.. ENSG00000228786.5 ENST00000427373
ENSG00000240450 CSPG4 pseudogene 1 Y.. ENSG00000240450.1 ENST00000306641
ENSG00000231141 testis-specific tran.. ENSG00000231141.1 ENST00000417334
symbol entrezid
<character> <list>
ENSG00000243485 MIR1302-2HG <NA>
ENSG00000237613 FAM138A 645520
ENSG00000238009 <NA>
ENSG00000239945 <NA>
ENSG00000239906 <NA>
... ... ...
ENSG00000228296 TTTY4C 474150
ENSG00000223641 TTTY17C <NA>
ENSG00000228786 LINC00266-4P <NA>
ENSG00000240450 CSPG4P1Y 114758
ENSG00000231141 TTTY3 114760
-------
seqinfo: 456 sequences from GRCh38 genome
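To get from there back to the original question, the remaining step is to check your own Ensembl IDs against the lncRNA gene IDs. Below is a minimal sketch of that lookup in base R. The `annot` data frame is a toy stand-in built from a few rows of the output above, and `my_ids` is a hypothetical vector standing in for your ~40,000 IDs (the last entry is a made-up ID that is not in the annotation). It also assumes some of your IDs may carry a version suffix such as `.5`, which is stripped before matching.

```r
# Toy stand-in for the annotation. With the real object from above you could
# build the same thing as:
#   annot <- data.frame(gene_id = gns$gene_id, gene_biotype = gns$gene_biotype)
annot <- data.frame(
  gene_id = c("ENSG00000227232", "ENSG00000278267", "ENSG00000243485",
              "ENSG00000284332", "ENSG00000237613"),
  gene_biotype = c("unprocessed_pseudogene", "miRNA", "lncRNA",
                   "miRNA", "lncRNA"),
  stringsAsFactors = FALSE
)

# Hypothetical input: the Ensembl IDs from your dataset, with or without versions
my_ids <- c("ENSG00000243485.5", "ENSG00000278267",
            "ENSG00000237613", "ENSG99999999999")

# Drop an optional trailing version (".5") so IDs match the gene_id column
clean_ids <- sub("\\.[0-9]+$", "", my_ids)

# Gene IDs annotated as lncRNA
lnc_ids <- annot$gene_id[annot$gene_biotype %in% "lncRNA"]

# Logical index into your dataset, and the matching IDs as originally written
is_lnc <- clean_ids %in% lnc_ids
my_lncs <- my_ids[is_lnc]
print(my_lncs)
```

The `is_lnc` vector has one entry per input ID, so it can also be used to subset the rows of a count matrix or results table that is in the same order as `my_ids`. IDs missing from the annotation (for example, genes retired in the Ensembl release you picked) simply come back as `FALSE`.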
Thanks a lot!