Question

How to use ENCprime via coRdon with a custom background sequence set?

0

Entering edit mode

jol.espinoz ▴ 40

@jolespinoz-11290

Last seen 4.1 years ago

I want to use ENCprime and found an updated implementation in https://rdrr.io/bioc/coRdon/man/codonUsage.html

However, in the original implementation a background sequence set is required: https://github.com/jnovembre/ENCprime/blob/master/doc/ENCprimedoc.pdf

enter image description here

I would like to use my protein coding genes as the background sequences (i.e. ExBackgroundSeqs.fasta) and ribosomal proteins as my query sequences (e.g. ExSeqs.fasta).

I have my ribosomal proteins in a separate file. For example,

all_proteins.fasta ribosomal_proteins.fasta

How can I do this using coRdon's implementation?

enter image description here

Here's my attempt:

library(coRdon)
# Load in the sequences
> ribosomal_proteins = readSet(file="ribosomal.fasta")
> all_proteins = readSet(file="all_proteins.fasta")

# Create codonTable objects
> cT_ribosomal = codonTable(x=ribosomal_proteins)
> cT_all = codonTable(x=all_proteins)

# There are 10 ribosomal sequences
> length(cT_ribosomal@len)
[1] 10

# There are 500 background sequences
> length(cT_all@len)
[1] 500

# Use ENCprime using:
#   * ribosomal sequences as the query seqeuences
#   * all proteins as the background sequences
#   * Genetic code 11
# 1. Am I doing this correctly? 
> df_encprime = ENCprime(cTobject=cT_ribosomal, subsets=list(all_proteins=cT_all), self=TRUE, id_or_name2="11", ribosomal=FALSE)

# 2. What is the difference between self and all_proteins here? 
# 3. Are these values ENCprime values? The original program has a lot of outputs (https://github.com/jnovembre/ENCprime/blob/master/doc/ENCprimedoc.pdf)
> head(df_encprime)
         self all_proteins
[1,] 61.00000     61.00000
[2,] 50.35134     48.54671
[3,] 48.15243     38.09560
[4,] 61.00000     61.00000
[5,] 47.96728     46.81694
[6,] 61.00000     61.00000

# 4. There are 10 rows here which is the same number of ribosomal proteins? 
> dim(df_encprime)
[1] 10  2

# 5. If I want to have the Ncp values of the ribosomal proteins using the rest of the proteins as background sequences, am I doing this correctly? If so, would I use the `self` or the `all_proteins` column?

When I actually run ENCprime from https://github.com/jnovembre/ENCprime

ENCprime/bin/SeqCount -c ribosomal.fasta 10
ENCprime/bin/SeqCount -n all_proteins.fasta 500
ENCprime/bin/ENCprime ribosomal.fasta.codcnt all_proteins.fasta.acgtfreq 11 ExResults 1 -q

Here's the output:

Name Nc Ncp ScaledChi SumChi df p B_KM n_codons
contig_413247_104_232_-: 59.7500 61.0000 1.4217 58.2882 43 0.0599 0.9152 41
contig_413248_1_282_-: 33.0645 40.4597 0.8027 74.6483 43 0.0020 0.7128 93
contig_413249_1_378_-: 49.8205 28.5148 1.5820 197.7461 43 0.0000 1.1962 125
contig_413250_1_201_-: 44.6197 57.6868 0.7850 51.8128 43 0.1678 0.8350 66
contig_413251_1_338_-: 55.7272 45.8360 0.6035 66.9923 43 0.0110 0.5787 111
contig_413252_6_263_-: 47.1154 46.4176 0.6758 56.7674 43 0.0777 0.6558 84
contig_413253_1_212_+: 36.0447 49.1137 0.6980 48.1599 43 0.2720 0.7760 69
contig_413254_86_154_+: 61.0000 61.0000 1.6488 34.6249 43 0.8149 1.0468 21
contig_413255_1_209_-: 50.7438 54.6340 0.6736 46.4805 43 0.3310 0.6667 69
contig_413256_66_167_-: 61.0000 61.0000 1.1630 37.2152 43 0.7195 0.9912 32
Totals: 56.0227 52.0516 0.2115 150.3532 43 0.0000 0.4008 711

They are different so I don't think I'm doing this correctly.

software codon usage bias dna • 1.4k views

ADD COMMENT • link updated 5.4 years ago by anamaria ▴ 10 • written 5.4 years ago by jol.espinoz ▴ 40

score 1 · Accepted Answer · 2019-12-06

Hi,

in short, you are doing the calculation right.

I will try to explain structure of the output in more details, hopefully it makes sense to you. ENC' value is calculated for every sequence in the set of sequences given with cTobject argument - therefore, for ten input ribosomal sequences, you get ten rows in the output. In the columns of the output, you get the values for different calculations of ENC'. When you provide a set of reference genes to which CU will be compared, you get one column for each provided reference set. Setting self argument to TRUE (default), you also calculate ENC' with respect to the average CU bias of the entire set of input sequences. This does not affect the calculation of ENC' using any other subset of genes, it just adds another column to the output.

Regarding the differences between the values obtained with the two tools, they may come from several reasons, including different treatment of alternative start codons and stop codons (including them in calculations or not), as well as the simple rounding during the calculation. Here it is useful to look at the correlation between the obtained values, e.g. you could plot the values obtained with coRdon vs ENCprime and see if they align more-or-less on the diagonal.