Question

How to write a script for identifying associate gene id

0

Entering edit mode

abhisek001 • 0

@6d5973d2

Last seen 4 months ago

India

The following is a large text file I write few lines of it, I need to extract some genes from here according to my input please write a small script. I specified Input-output at the bottom of the post. This problem could seem very tough but in brief, I just want to recognize the associate gene id of one version from another, because online sites do not have the repository of this species.

SDRB02000004.1  Genbank gene    6018    10396   .   +   .   gene_id "TEA_012962"; transcript_id ""; gbkey "Gene"; gene_biotype "protein_coding"; locus_tag "TEA_012962"; 
SDRB02000004.1  Genbank transcript  6018    10396   .   +   .   gene_id "TEA_012962"; transcript_id "gnl|WGS:SDRB|TEA014503.1"; gbkey "mRNA"; locus_tag "TEA_012962"; orig_protein_id "gnl|WGS:SDRB|TEA014503.1:cds_7"; orig_transcript_id "gnl|WGS:SDRB|TEA014503.1"; product "hypothetical protein"; transcript_biotype "mRNA"; 
SDRB02000004.1  Genbank exon    6018    6864    .   +   .   gene_id "TEA_012962"; transcript_id "gnl|WGS:SDRB|TEA014503.1"; locus_tag "TEA_012962"; orig_protein_id "gnl|WGS:SDRB|TEA014503.1:cds_7"; orig_transcript_id "gnl|WGS:SDRB|TEA014503.1"; product "hypothetical protein"; transcript_biotype "mRNA"; exon_number "1"; 
SDRB02000232.1  Genbank stop_codon  994202  994204  .   +   0   gene_id "TEA_014895"; transcript_id "gnl|WGS:SDRB|TEA016705.1"; gbkey "CDS"; locus_tag "TEA_014895"; orig_transcript_id "gnl|WGS:SDRB|TEA016705.1"; product "hypothetical protein"; protein_id "THG23623.1"; exon_number "19";

My input and desire output like the following -

Input (common gene-name)                   Output (special gene name)
TEA_012962                                           TEA014503.1
TEA_014895                                           TEA016705.1

bash r bingo • 1.2k views

ADD COMMENT • link 3.7 years ago • updated 2.3 years ago abhisek001 • 0

score 0 · Answer 1 · 2021-06-28

You are being unnecessarily mysterious about what that is and where you got it. If you are asking people to help you with a problem, it's up to you to add enough context that people want to help, and have the ability to do so. Also, you apparently asked the same question on multiple help forums. That is in general considered bad form. You should restrict to one site.

Anyway, what you have is apparently this GTF. Which you could read in using import from the rtracklayer package, and then easily process however you want.

> download.file("https://ftp.ncbi.nlm.nih.gov/genomes/genbank/plant/Camellia_sinensis/latest_assembly_versions/GCA_004153795.2_AHAU_CSS_2/GCA_004153795.2_AHAU_CSS_2_genomic.gtf.gz", "GCA_004153795.2_AHAU_CSS_2_genomic.gtf.gz")
> z <- gzfile("GCA_004153795.2_AHAU_CSS_2_genomic.gtf.gz")
> library(rtracklayer)
> library(GenomicFeatures)
> gr <- import(z)
> txdb <- makeTxDbFromGRanges(gr)
> tx <- transcripts(txdb, columns = c("gene_id","tx_name"))
> whatYouWant <- data.frame(Input = unlist(mcols(tx)$gene_id), Output = sapply(strsplit(mcols(tx)$tx_name, "\\|"), "[", 3))
> head(whatYouWant)
       Input      Output
1 TEA_012962 TEA014503.1
2 TEA_012972 TEA014513.1
3 TEA_012982 TEA014523.1
4 TEA_012984 TEA014525.1
5 TEA_012976 TEA014517.1
6 TEA_012980 TEA014521.1