I recently had to deal with intronic reads from a bulk total RNA sequencing experiment (see Fig. 1a from this preprint) and derived an intronic annotation for that purpose using Bioconductor. Our aim was to count strictly intronic reads. If this is also your aim, then the script below may be useful for you.
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
library(rtracklayer)
txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene
introns <- unlist(intronsByTranscript(txdb))
strictintrons <- setdiff(introns, exons(txdb))
ov <- findOverlaps(strictintrons, introns, type="within")
l <- split(subjectHits(ov), queryHits(ov))
txnegs <- select(txdb, keys=names(introns)[unlist(l)],
columns=c("TXNAME", "GENEID"), keytype="TXID")
strictintrons$tx_name <- CharacterList(relist(txnegs$TXNAME, l))
strictintrons$tx_name <- sapply(strictintrons$tx_name, paste, collapse=":")
strictintrons$gene_id <- CharacterList(relist(txnegs$GENEID, l))
strictintrons$gene_id <- sapply(strictintrons$gene_id, paste, collapse=":")
names(strictintrons) <- paste(seqnames(strictintrons),
start(strictintrons),
end(strictintrons),
strand(strictintrons),
sep=":")
stopifnot(all(!duplicated(names(strictintrons)))) ## QC
strictintrons
GRanges object with 301900 ranges and 1 metadata column:
seqnames ranges strand |
<Rle> <IRanges> <Rle> |
chr1:12228:12612:+ chr1 12228-12612 + |
chr1:12722:12974:+ chr1 12722-12974 + |
chr1:13053:13220:+ chr1 13053-13220 + |
chr1:30040:30266:+ chr1 30040-30266 + |
chr1:30668:30975:+ chr1 30668-30975 + |
... ... ... ... .
chrUn_GL000213v1:134117:134275:- chrUn_GL000213v1 134117-134275 - |
chrUn_GL000213v1:134391:138766:- chrUn_GL000213v1 134391-138766 - |
chrUn_GL000219v1:54294:78841:- chrUn_GL000219v1 54294-78841 - |
chrUn_GL000219v1:78953:79936:- chrUn_GL000219v1 78953-79936 - |
chrUn_GL000219v1:80029:83212:- chrUn_GL000219v1 80029-83212 - |
tx_name
<character>
chr1:12228:12612:+ ENST00000456328.2:ENST00000450305.2
chr1:12722:12974:+ ENST00000456328.2:ENST00000450305.2
chr1:13053:13220:+ ENST00000456328.2:ENST00000450305.2
chr1:30040:30266:+ ENST00000473358.1
chr1:30668:30975:+ ENST00000473358.1:ENST00000469289.1
... ...
chrUn_GL000213v1:134117:134275:- ENST00000616049.4:ENST00000616157.1
chrUn_GL000213v1:134391:138766:- ENST00000616049.4:ENST00000616157.1
chrUn_GL000219v1:54294:78841:- ENST00000612919.1
chrUn_GL000219v1:78953:79936:- ENST00000612919.1
chrUn_GL000219v1:80029:83212:- ENST00000612919.1
gene_id
<character>
chr1:12228:12612:+ 100287102:100287102
chr1:12722:12974:+ 100287102:100287102
chr1:13053:13220:+ 100287102:100287102
chr1:30040:30266:+ NA
chr1:30668:30975:+ NA:NA
... ...
chrUn_GL000213v1:134117:134275:- 100288966:100288966
chrUn_GL000213v1:134391:138766:- 100288966:100288966
chrUn_GL000219v1:54294:78841:- 283788
chrUn_GL000219v1:78953:79936:- 283788
chrUn_GL000219v1:80029:83212:- 283788
-------
seqinfo: 595 sequences (1 circular) from hg38 genome
## if you want to export this annotation as a GTF file
export(strictintrons, "strictintrons.gtf")
## if you would use this annotation to count reads
## using 'summarizeOverlaps()' from the pkg 'GenomicAlignments'
## one may have to tune here the arguments when
## calling 'BamFileList()' and 'summarizeOverlaps()'
bfl <- BamFileList(bamfiles)
se <- summarizeOverlaps(strictintrons, bfl)
cheers,
robert.
This is perhaps a more general discussion point and not really specific to this bioconductor support forum. Anyways, I would start by just mapping the data to the complete reference genome using bwa. Following this I would evaluate how much of my reads that map to different areas of the genome and make my call on howto proceed based on that.
That's fair. I wasn't sure if this was something that the authors of
tximport
,tximeta
,ensembldb
have come across and thought I would ask on the Bioc support site. I'm happy to withdraw my question if it's not relevant enough to this community.I think implicit in the question is the desire to do this in or with R / Bioconductor, so the question is appropriate here.