Hi,
I am having a problem with AnnotationHub when trying to convert transcript coordinates to genome coordinates.
Some information first on the context of where the data comes from: I started with a GTF file assembled from RNA-Seq data using the Ensembl reference transcriptome (v. 99) as reference. I then translated these transcripts in 3 frames to generate a theoretical translatome. This protein database was then used to search mass spec data to identify proteins in my samples. I next mapped the identified proteins back to their transcript IDs and sequences and was able to get transcript coordinates for each protein sequence. Now, I would like to convert the transcript positions to genomic positions.
I used the Annotation Hub package with the following code:
ah = AnnotationHub()
ahDb <- query(ah, pattern = c("Homo sapiens", "EnsDb", 99))
ahEdb <- ahDb[[1]]
df_protein_coding['genome.start'] <- NA
df_protein_coding['genome.end'] <- NA
row.names(df_protein_coding) <- NULL
for (i in 1:nrow(df_protein_coding)){
start = df_protein_coding['tx_start'][[1]][i]
end = df_protein_coding['tx_end'][[1]][i]
tx_id=df_protein_coding['transcript.id'][[1]][i]
chr = df_protein_coding['Chromosome.scaffold.name'][[1]][i]
rng_tx <- IRanges(start = start, end = end, names = tx_id)
edbx <- filter(ahEdb, filter = ~ seq_name == chr)
rng_gnm <- transcriptToGenome(rng_tx, edbx)
a <- GRangesList(rng_gnm)
df_protein_coding['genome.start'][[1]][i] = start(a)[[1]]
df_protein_coding['genome.end'][[1]][i] = end(a)[[1]]
}
However, I run into this error:
Error in df_protein_coding["genome.start"][[1]][i] <- start(a)[[1]] :
replacement has length zero
In addition: There were 24 warnings (use warnings() to see them)
The code seems to have worked for the first few entires. Here is the truncated version for the entries that worked:
header | 1 | 2 | 3 |
---|---|---|---|
peptide | VINSNCTR | NGGGFGGR | NGGGFGGR |
transcript id | ENST00000295379 | ENST00000582693 | ENST00000582693 |
transcript biotype | protein_coding | protein_coding | protein_coding |
gene id | ENSG00000163217 | ENSG00000265491 | ENSG00000265491 |
gene symbol | BMP10 | RNF115 | RNF115 |
Chromosome/scaffold name | 2 | 1 | 1 |
protein match | ENST00000295379_(+1)_26 | ENST00000582693_(+3)_2 | ENST00000582693_(+3)_2 |
tx_start | 3019 | 51 | 51 |
tx_end | 3183 | 344 | 344 |
genome.start | 68863762 | 145823772 | 145823772 |
genome.end | 68863926 | 145824045 | 145824045 |
The table has been transposed for easier formatting.
The entry that AnnotationHub has an issue with is a sequence from position 8698 to 8760 of ENST00000515408. Not sure where I have gone wrong here as it seems that the transcript coordinates are out of bounds from the one that AnnotationHub has. Does anyone have any suggestions on how to solve this? Would appreciate any help with this issue.
It's hard to say what happened by just looking at the error message. Could you please also provide the last values of the variables
start
,end
,tx_id
andchr
that caused the error?As a side note, I'm not sure you need the
edbx <- filter...
call, since you're anyway mapping just the coordinates from a single transcript to genomic coordinates.Thanks for the reply Johannes. The last value that caused the error has the following attributes:
transcript start: 8698
transcript end: 8760
tx_id: ENST00000515408
Chr: 5
gene id: ENSG00000172795
gene symbol: DCP2
Let me know if you need more information.
Noted, I will remove this code.