Question

How to get longest isoform

0

Entering edit mode

Madza Farias-Virgens • 0

@mfarias-virgens

Last seen 2.9 years ago

United States

I need to extract sequences for the longest isoforms

I am able to extract CDS using

# load GTF
txdb <- makeTxDbFromGFF("BFgenomic.gff", format="gff3")
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK

# Get dna seq
dna <- readDNAStringSet("/users/mfariasv/data/mfariasv/newBF20/BFgenomic.fa")

# extract CDS
txdb.cds_by_transcript <- cdsBy(txdb, by="tx", use.names = TRUE) 

# Extract CDS sequences
cds_by_trnascript <- getSeq(dna, txdb.cds_by_transcript, use.names = TRUE) 

# use lapply function and unlist function and diretcly convert into a DNAStringSet
LargeDNAStringSet <- DNAStringSet(lapply(cds_by_trnascript, function(x) {unlist(x)})) 

# Write to fasta
writeXStringSet(LargeDNAStringSet, "my.fasta")

But I don't know how to go about retrieving the longest isoforms

I initially got a warning when running

txdb <- makeTxDbFromGFF("BFgenomic.gff", format="gff3")
...
1: In .find_exon_cds(exons, cds) :
  The following transcripts have exons that contain more than one CDS (only the first CDS
  was kept for each exon): rna-NM_001142785.2, rna-NM_001172764.1

but I manually deleted those entries from my gff as they where uninteresting.

I also can't find the info for if/how GenomicFeatures cdsBy() uses the frame info in the gff to get the CDS.

Thank you

GenomicFeatures Biostrings • 3.1k views

ADD COMMENT • link updated 3.1 years ago by Hervé Pagès 16k • written 3.1 years ago by Madza Farias-Virgens • 0

score 1 · Answer 1 · 2021-11-15

Here is one way to extract the longest CDS for each gene:

tx_lens <- transcriptLengths(txdb, with.cds_len=TRUE)

## tx_lens is a data.frame:
head(tx_lens)
#   tx_id     tx_name     gene_id nexon tx_len cds_len
# 1     1 FBtr0300689 FBgn0031208     2   1880     855
# 2     2 FBtr0300690 FBgn0031208     3   1802    1443
# 3     3 FBtr0330654 FBgn0031208     2   1844     819
# 4     4 FBtr0309810 FBgn0263584     2   2230       0
# 5     5 FBtr0306539 FBgn0067779     7   4051    3024
# 6     6 FBtr0306536 FBgn0067779     5   3731    3024

cds_len_per_gene <- splitAsList(tx_lens$cds_len, tx_lens$gene_id)
cds_len_per_gene
# IntegerList of length 15682
# [["FBgn0000003"]] 0
# [["FBgn0000008"]] 3990 3990 3990
# [["FBgn0000014"]] 993 1773 993 993
# [["FBgn0000015"]] 813 972 1482 813 813 813
# [["FBgn0000017"]] 4863 4917 5118 4824 4770 5172 4569 4515 5001
# [["FBgn0000018"]] 1620
# [["FBgn0000022"]] 606
# [["FBgn0000024"]] 1950 1950
# [["FBgn0000028"]] 1185 1101 1191 1107 1128 1095 ... 1080 1113 1131 1152 1137
# [["FBgn0000032"]] 1368 1317
# ...
# <15672 more elements>

tx_id_per_gene <- splitAsList(tx_lens$tx_id, tx_lens$gene_id)

which_max <- which.max(cds_len_per_gene)
length_of_longest_cds <- unlist(cds_len_per_gene[as.list(which_max)])
tx_id_of_longest_cds <- unlist(tx_id_per_gene[as.list(which_max)])

## 'length_of_longest_cds' is a named integer vector that maps each gene id to the length
## of the longest CDS for that gene:
head(length_of_longest_cds)
# FBgn0000003 FBgn0000008 FBgn0000014 FBgn0000015 FBgn0000017 FBgn0000018 
#           0        3990        1773        1482        5172        1620 

## 'tx_id_of_longest_cds' is a named integer vector that maps each gene id to the transcript id
## of the longest CDS for that gene:
head(tx_id_of_longest_cds)
# FBgn0000003 FBgn0000008 FBgn0000014 FBgn0000015 FBgn0000017 FBgn0000018 
#       17345        7681       21864       21876       16227        4279 

## 'length_of_longest_cds' and 'tx_id_of_longest_cds' are parallel vectors with identical names:
identical(names(length_of_longest_cds), names(tx_id_of_longest_cds))  # TRUE

## Let's summarize these results in a 3-column data.frame with 1 row per gene:
longest_cds <- data.frame(gene_id=names(length_of_longest_cds),
                          length_of_longest_cds=length_of_longest_cds,
                          tx_id_of_longest_cds=tx_id_of_longest_cds)

## Let's remove rows for which the longest CDS has length 0 (non-coding genes):
longest_cds <- subset(longest_cds, length_of_longest_cds != 0)

Now let's extract all the CDS fragments grouped by transcript and keep only the longest ones:

cds_by_tx <- cdsBy(txdb, by="tx")
cds_by_tx <- cds_by_tx[as.character(longest_cds$tx_id_of_longest_cds)]

cds_by_tx is a GRangesList object _parallel_ to longest_cds i.e. it has 1 list element per row in longest_cds.

The names on cds_by_tx are transcript ids. You can replace them with gene ids with:

names(cds_by_tx) <- longest_cds$gene_id

Use cds_by_tx as you've been using your txdb.cds_by_transcript object to extract all CDS sequences from your dna object. The difference being that now you're going to extract only the longest CDS for each gene. Once you have the longest CDS sequences in a DNAStringSet object (e.g. longest_cds_seqs), make sure that the names on the object are identical to longest_cds$gene_id and that the sequence lengths (which you can extract with width(longest_cds_seqs)) are identical to longest_cds$length_of_longest_cds.

Finally note that there can be more than one longest CDS per gene but the method described above will only pick-up one (the first one for each gene in the tx_lens data.frame).

Hope this helps,

H.