Hello Bioconductor,
I'm working on RNA-seq and small RNA-seq, and I'm a bit confused about which genome and annotation versions I can safely combine between aligning the FASTQ files and using R packages. The trouble starts when I try to build my own GTF file for small RNA sequences that exist in other databases: I don't want to mix sequences from one version, genomic ranges from another, and transcripts from a third.
I make a FASTA file from a small RNA database whose coordinates are on the old genome version hg18. I align it against the GENCODE release_34 GRCh38 primary assembly FASTA and keep the sequences that have at least one alignment to the genome. Then, for easier manipulation, I convert the BAM to a BED file so that I can import the genomic ranges of my alignments into R and work with them. This is the first file, which I want to merge with the next one.
Next, I use another BED file that has small RNA sequences from multiple sources, and I run:
suppressPackageStartupMessages({
library('tidyverse')
library('plyranges')
library("BSgenome.Hsapiens.UCSC.hg38")
})
small_RNAs_bed <- read_bed("small_RNAs_DB.bed") %>%
  keepStandardChromosomes(pruning.mode = "coarse")
sInfo <- Seqinfo(genome = "hg38")
# match the seqlevels (and their order) to the hg38 Seqinfo before assigning it
seqlevels(small_RNAs_bed) <- seqlevels(sInfo)
seqinfo(small_RNAs_bed) <- sInfo
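To make clear what I expect the seqlevels/seqinfo step to do, here is a minimal toy sketch. The chromosome lengths and the "toy" genome label are made up for illustration and have nothing to do with hg38; the idea is only that the ranges take on the seqlevels of the target Seqinfo first, and then the Seqinfo itself:

```r
library(GenomicRanges)

# Toy ranges whose seqlevels are in a different order than the target
gr <- GRanges(c("chr2", "chr1"), IRanges(c(10, 20), width = 5))

# Hypothetical target Seqinfo (made-up lengths and genome label)
target <- Seqinfo(c("chr1", "chr2", "chrM"),
                  seqlengths = c(100, 200, 50),
                  genome = "toy")

# Step 1: reorder/extend the seqlevels to match the target
seqlevels(gr) <- seqlevels(target)
# Step 2: now that the seqlevels agree, the Seqinfo can be assigned
seqinfo(gr) <- target

seqinfo(gr)
```

After this, the ranges themselves are unchanged, but they carry the lengths and genome label of the target, so functions that check genome compatibility can tell the objects apart.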
As I also want to check the sequences of the small RNAs in that BED file, I extract them with:
transcripts_human <- Views(BSgenome.Hsapiens.UCSC.hg38, small_RNAs_bed)
So, are the sequences I get this way the same as the ones in GENCODE release_34?
What's more, are the transcripts in TxDb.Hsapiens.UCSC.hg38.knownGene the same as the ones in the GTF file of the GENCODE primary assembly?