It all depends on what you mean by CDS. That's dependent on the transcript, and most genes have multiple transcripts. But anyway, here's a start.
> library(BSgenome.Hsapiens.UCSC.hg19)
> library(TxDb.Hsapiens.UCSC.hg19.knownGene)
> library(UniProt.ws)
> up <- UniProt.ws()
## a couple UniProt IDs
> protids <- c("P38398","P04637")
## convert to NCBI Gene IDs
> ncbid <- select(up, protids, "xref_geneid", "UniProtKB")
## get CDS by gene
> cds <- cdsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, "gene")
## subset to the two we want - note that the results from UniProt have trailing semicolons
> my.cds <- cds[gsub(";", "",ncbid$GeneID)]
## get the sequences
> seqs <- getSeq(Hsapiens, my.cds)
> seqs
DNAStringSetList of length 2
[["672"]] CAATTGGGCAGATGTGTGAGGCACCTGTGGTGACCCGAGAGTGGGTGTTGGACAGTGTAGCACTCTA...
[["7157"]] GCCAAGGCAGGTGGATCACCTGAGGTCAGGAGTTCAAGACAGCCTGGCCAACATAGCGAAATCCCA...
> seqs[[1]]
DNAStringSet object of length 36:
width seq
[1] 125 CAATTGGGCAGATGTGTGAGGCACCTGTGGTGA...CTGATACCCCAGATCCCCCACAGCCACTACTGA
[2] 19 CAATTGGGCAGATGTGTGA
[3] 61 GGTGTCCACCCAATTGTGGTTGTGCAGCCAGATGCCTGGACAGAGGACAATGGCTTCCATG
[4] 74 ATCAACTGGAATGGATGGTACAGCTGTGTGGTG...GTGAAGGAGCTTTCATCATTCACCCTTGGCACA
[5] 31 ATGAGTGACAGCAAGAAAACCTGGCTGCAAT
... ... ...
[32] 78 ATTTTGCATGCTGAAACTTCTCAACCAGAAGAA...GTGTCCTTTATGTAAGAATGATATAACCAAAAG
[33] 2 AT
[34] 54 TCTGGAGTTGATCAAGGAACCTGTCTCCACAAAGTGTGACCACATATTTTGCAA
[35] 80 ATGGATTTATCTGCTCTTCGCGTTGAAGAAGTA...TGCTATGCAGAAAATCTTAGAGTGTCCCATCTG
[36] 4 ATGG
Or maybe you want the CDS by transcript
## CDS by transcript, need the names
> cds2 <- cdsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, use.names = TRUE)
## map NCBI ID to UCSC transcript ID
> txids <- select(TxDb.Hsapiens.UCSC.hg19.knownGene, gsub(";", "",ncbid$GeneID), "TXNAME","GENEID")
'select()' returned 1:many mapping between keys and columns
## remove those not available
> txids <- txids$TXNAME[txids$TXNAME %in% names(cds2)]
> my.cds2 <- cds2[txids]
> seqs2 <- getSeq(Hsapiens, my.cds2)
> seqs2
DNAStringSetList of length 34
[["uc010whl.2"]] ATGGATTTATCTGCTCTTCGCGTTGAAGAAGTACAAAATGTCATTAATGCTATGCAGAAA...
[["uc002icp.4"]] AT ...
[["uc010whm.2"]] ATGG ...
[["uc002icu.3"]] ATGGATTTATCTGCTCTTCGCGTTGAAGAAGTACAAAATGTCATTAATGCTATGCAGAAA...
[["uc010cyx.3"]] ATGCTGAAACTTCTCAACCAGAAGAAAGGGCCTTCACAGTGTCCTTTATGTAAGAATGAT...
[["uc002icq.3"]] ATGGATTTATCTGCTCTTCGCGTTGAAGAAGTACAAAATGTCATTAATGCTATGCAGAAA...
[["uc002ict.3"]] ATGGATTTATCTGCTCTTCGCGTTGAAGAAGTACAAAATGTCATTAATGCTATGCAGAAA...
[["uc010whn.2"]] ATGCACAGTTGCTCTGGGAGTCTTCAGAATAGAAACTACCCATCTCAAGAGGAGCTCATT...
[["uc010who.3"]] ATGAGTGACAGCAAGAAAACCTGGCTGCAAT ...
[["uc010whp.2"]] ATGCTGAAACTTCTCAACCAGAAGAAAGGGCCTTCACAGTGTCCTTTATGTAAGAATGAT...
...
<24 more elements>
> seqs2[[1]]
DNAStringSet object of length 22:
width seq
[1] 80 ATGGATTTATCTGCTCTTCGCGTTGAAGAAGTA...TGCTATGCAGAAAATCTTAGAGTGTCCCATCTG
[2] 54 TCTGGAGTTGATCAAGGAACCTGTCTCCACAAAGTGTGACCACATATTTTGCAA
[3] 78 ATTTTGCATGCTGAAACTTCTCAACCAGAAGAA...GTGTCCTTTATGTAAGAATGATATAACCAAAAG
[4] 89 GAGCCTACAAGAAAGTACGAGATTTAGTCAACT...TTTGTGCTTTTCAGCTTGACACAGGTTTGGAGT
[5] 140 ATGCAAACAGCTATAATTTTGCAAAAAAGGAAA...CTTCTACAGAGTGAACCCGAAAATCCTTCCTTG
... ... ...
[18] 84 CATGATTTTGAAGTCAGAGGAGATGTGGTCAAT...CCAAAGCGAGCAAGAGAATCCCAGGACAGAAAG
[19] 55 ATCTTCAGGGGGCTAGAAATCTGTTGCTATGGGCCCTTCACCAACATGCCCACAG
[20] 74 ATCAACTGGAATGGATGGTACAGCTGTGTGGTG...GTGAAGGAGCTTTCATCATTCACCCTTGGCACA
[21] 61 GGTGTCCACCCAATTGTGGTTGTGCAGCCAGATGCCTGGACAGAGGACAATGGCTTCCATG
[22] 125 CAATTGGGCAGATGTGTGAGGCACCTGTGGTGA...CTGATACCCCAGATCCCCCACAGCCACTACTGA
>
Thanks! I'll look at these. It seems like this will at least get me going in the right direction.
I want the DNA that is the CDS for the protein. Not the transcript sequence or the gene sequence. I don't want the 5' UTRs, 3' UTRs, or introns.
The uniprot protein IDs are for a specific protein sequence. The human PKLR gene generates two isoforms. The uniprot IDs are P30613-1 and P30613-2.
That's what the
cdsBy
function provides (hence the name). You can get all the parts of the gene and see that none overlap.You can eyeball the UTR ranges and see that they don't overlap the CDS ranges. We can use
findOverlaps
to check the introns vs the exons, and there is one overlap of an exon with an intron between the two transcripts. If you look at the UCSC genome browser you will note that the two transcripts are identical except for a very small exon (7 nt long) in one transcript that is in the middle of an intron for the other transcript, coming right after the 5' UTR. That small exon technically overlaps the intron for the other transcript, but otherwise the CDS is disjoint from all the things you don't want.