How to write a script for identifying associate gene id
1
0
Entering edit mode
abhisek001 • 0
@6d5973d2
Last seen 28 days ago
India

The following is a large text file I write few lines of it, I need to extract some genes from here according to my input please write a small script. I specified Input-output at the bottom of the post. This problem could seem very tough but in brief, I just want to recognize the associate gene id of one version from another, because online sites do not have the repository of this species.

SDRB02000004.1  Genbank gene    6018    10396   .   +   .   gene_id "TEA_012962"; transcript_id ""; gbkey "Gene"; gene_biotype "protein_coding"; locus_tag "TEA_012962"; 
SDRB02000004.1  Genbank transcript  6018    10396   .   +   .   gene_id "TEA_012962"; transcript_id "gnl|WGS:SDRB|TEA014503.1"; gbkey "mRNA"; locus_tag "TEA_012962"; orig_protein_id "gnl|WGS:SDRB|TEA014503.1:cds_7"; orig_transcript_id "gnl|WGS:SDRB|TEA014503.1"; product "hypothetical protein"; transcript_biotype "mRNA"; 
SDRB02000004.1  Genbank exon    6018    6864    .   +   .   gene_id "TEA_012962"; transcript_id "gnl|WGS:SDRB|TEA014503.1"; locus_tag "TEA_012962"; orig_protein_id "gnl|WGS:SDRB|TEA014503.1:cds_7"; orig_transcript_id "gnl|WGS:SDRB|TEA014503.1"; product "hypothetical protein"; transcript_biotype "mRNA"; exon_number "1"; 
SDRB02000232.1  Genbank stop_codon  994202  994204  .   +   0   gene_id "TEA_014895"; transcript_id "gnl|WGS:SDRB|TEA016705.1"; gbkey "CDS"; locus_tag "TEA_014895"; orig_transcript_id "gnl|WGS:SDRB|TEA016705.1"; product "hypothetical protein"; protein_id "THG23623.1"; exon_number "19";

My input and desire output like the following -

Input (common gene-name)                   Output (special gene name)
TEA_012962                                           TEA014503.1
TEA_014895                                           TEA016705.1
bash r bingo • 1.1k views
ADD COMMENT
0
Entering edit mode
@james-w-macdonald-5106
Last seen 26 minutes ago
United States

You are being unnecessarily mysterious about what that is and where you got it. If you are asking people to help you with a problem, it's up to you to add enough context that people want to help, and have the ability to do so. Also, you apparently asked the same question on multiple help forums. That is in general considered bad form. You should restrict to one site.

Anyway, what you have is apparently this GTF. Which you could read in using import from the rtracklayer package, and then easily process however you want.

> download.file("https://ftp.ncbi.nlm.nih.gov/genomes/genbank/plant/Camellia_sinensis/latest_assembly_versions/GCA_004153795.2_AHAU_CSS_2/GCA_004153795.2_AHAU_CSS_2_genomic.gtf.gz", "GCA_004153795.2_AHAU_CSS_2_genomic.gtf.gz")
> z <- gzfile("GCA_004153795.2_AHAU_CSS_2_genomic.gtf.gz")
> library(rtracklayer)
> library(GenomicFeatures)
> gr <- import(z)
> txdb <- makeTxDbFromGRanges(gr)
> tx <- transcripts(txdb, columns = c("gene_id","tx_name"))
> whatYouWant <- data.frame(Input = unlist(mcols(tx)$gene_id), Output = sapply(strsplit(mcols(tx)$tx_name, "\\|"), "[", 3))
> head(whatYouWant)
       Input      Output
1 TEA_012962 TEA014503.1
2 TEA_012972 TEA014513.1
3 TEA_012982 TEA014523.1
4 TEA_012984 TEA014525.1
5 TEA_012976 TEA014517.1
6 TEA_012980 TEA014521.1
ADD COMMENT
0
Entering edit mode

You don't need these lines:

> z <- gzfile("GCA_004153795.2_AHAU_CSS_2_genomic.gtf.gz")
> gr <- import(z)

Instead you can import directly. Seems to me that didn't used to work? But whatever. This does work

> gr <- import("GCA_004153795.2_AHAU_CSS_2_genomic.gtf.gz")
ADD REPLY
0
Entering edit mode

Thanks for helping me out but I am sorry because at that time I was in a hurry so I visited several platforms and finally I did it manually with excel. I am a beginner and hope you will understand my concern.

ADD REPLY

Login before adding your answer.

Traffic: 908 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6