Information

MIToS.Information — Module.

The Information module of MIToS defines types and functions useful to calculate information measures (e.g. Mutual Information (MI) and Entropy) over a Multiple Sequence Alignment (MSA). This module was designed to count Residues (defined in the MSA module) in special contingency tables (as fast as possible) and to derive probabilities from this counts. Also, includes methods for applying corrections to that tables, e.g. pseudocounts and pseudo frequencies. Finally, Information allows to use this probabilities and counts to estimate information measures and other frequency based values.

Features

Estimate multi dimensional frequencies and probabilities tables from sequences, MSAs, etc...
Correction for small number of observations
Correction for data redundancy on a MSA
Estimate information measures
Calculate corrected mutual information between residues

using MIToS.Information

Contents

Information

Types

MIToS.Information.AdditiveSmoothing — Type.

Additive Smoothing or fixed pseudocount λ for ResidueCount (in order to estimate probabilities when the number of samples is low).

Common values of λ are:

0 : No cell frequency prior, gives you the maximum likelihood estimator.
0.05 is the optimum value for λ found in Buslje et. al. 2009, similar results was obtained for λ in the range [0.025, 0.075].
1 / p : Perks prior (Perks, 1947) where p the number of parameters (i.e. residues, pairs of residues) to estimate. If p is the number of residues (20 without counting gaps), this gives you 0.05.
sqrt(n) / p : Minimax prior (Trybula, 1958) where n is the number of samples and p the number of parameters to estimate. If the number of samples n is 400 (minimum number of sequence clusters for achieve good performance in Buslje et. al. 2009) for estimating 400 parameters (pairs of residues without counting gaps) this gives you 0.05.
0.5 : Jeffreys prior (Jeffreys, 1946).
1 : Bayes-Laplace uniform prior, aka. Laplace smoothing.

MIToS.Information.BLOSUM_Pseudofrequencies — Type.

BLOSUM_Pseudofrequencies type. It takes to arguments/fields:

α : Usually the number of sequences or sequence clusters in the MSA.
β : The weight of the pseudofrequencies, a value close to 8.512 when α is the number of sequence clusters.

MIToS.Information.ContingencyTable — Type.

A ContingencyTable is a multidimensional array. It stores the contingency matrix, its marginal values and total. The type also has an internal and private temporal array and an alphabet object. It's a parametric type, taking three ordered parameters:

T : The element type of the multidimensional array.
N : It's the dimension of the array and should be an Int.
A : This should be a type, subtype of ResidueAlphabet, i.e.: UngappedAlphabet,

GappedAlphabet or ReducedAlphabet.

A ContingencyTable can be created from an alphabet if all the parameters are given. Otherwise, you need to give a type, a number (Val) and an alphabet. You can also create a ContingencyTable using a matrix and a alphabet. For example:

ContingencyTable{Float64, 2, UngappedAlphabet}(UngappedAlphabet())
ContingencyTable(Float64, Val{2}, UngappedAlphabet())
ContingencyTable(zeros(Float64,20,20), UngappedAlphabet())

MIToS.Information.Counts — Type.

A Counts object wraps a ContingencyTable storing counts/frequencies.

MIToS.Information.NoPseudocount — Type.

You can use NoPseudocount() to avoid pseudocount corrections where a Pseudocount type is needed.

MIToS.Information.NoPseudofrequencies — Type.

You can use NoPseudofrequencies() to avoid pseudocount corrections where a Pseudofrequencies type is needed.

MIToS.Information.Probabilities — Type.

A Probabilities object wraps a ContingencyTable storing probabilities. It doesn't perform any check. If the total isn't one, you must use normalize or normalize!on the ContingencyTable before wrapping it to make the sum of the probabilities equal to one.

MIToS.Information.Pseudocount — Type.

Parametric abstract type to define pseudocount types

MIToS.Information.Pseudofrequencies — Type.

Parametric abstract type to define pseudofrequencies types

Constants

MIToS.Information.BLOSUM62_Pi — Constant.

BLOSUM62 probabilities P(aa) for each residue on the UngappedAlphabet. SUM: 0.9987

MIToS.Information.BLOSUM62_Pij — Constant.

Table with conditional probabilities of residues based on BLOSUM62. The normalization is done row based. The firts row contains the P(aa|A) and so one.

Macros

Methods and functions

Base.LinAlg.normalize! — Method.

normalize! makes the sum of the frequencies to be one, in place.

Base.LinAlg.normalize — Method.

normalize returns another table where the sum of the frequencies is one.

Base.count — Method.

It returns a ContingencyTable wrapped in a Counts type with the frequencies of residues in the sequences that takes as arguments. The dimension of the table is equal to the number of sequences. You can use the keyword arguments alphabet, weights and pseudocounts to indicate the alphabet of the table (default to UngappedAlphabet()), a clustering result (default to NoClustering()) and the pseudocounts (default to NoPseudocount()) to be used during the estimation of the frequencies.

MIToS.Information.APC! — Method.

APC (Dunn et. al. 2008)

MIToS.Information.BLMI — Method.

BLMI takes a MSA or a file and a Format as first arguments. It calculates a Z score (ZBLMI) and a corrected MI/MIp as described on Busjle et. al. 2009 but using using BLOSUM62 pseudo frequencies instead of a fixed pseudocount.

Keyword argument, type, default value and descriptions:

  - beta        Float64   8.512   β for BLOSUM62 pseudo frequencies
  - lambda      Float64   0.0     Low count value
  - threshold             62      Percent identity threshold for sequence clustering (Hobohm I)
  - maxgap      Float64   0.5     Maximum fraction of gaps in positions included in calculation
  - apc         Bool      true    Use APC correction (MIp)
  - samples     Int       50      Number of samples for Z-score
  - fixedgaps   Bool      true    Fix gaps positions for the random samples

This function returns:

  - Z score (ZBLMI)
  - MI or MIp using BLOSUM62 pseudo frequencies (BLMI/BLMIp)

MIToS.Information.apply_pseudocount! — Method.

It adds the pseudocount value to the table cells.

MIToS.Information.apply_pseudofrequencies! — Method.

apply_pseudofrequencies!{T}(Pab::ContingencyTable{T,2,UngappedAlphabet}, pseudofrequencies::BLOSUM_Pseudofrequencies)

When a BLOSUM_Pseudofrequencies(α,β) is used, this function applies pseudofrequencies Gab over Pab, as a weighted mean of both. It uses the conditional probability matrix BLOSUM62_Pij and the real frequencies/probabilities Pab to estimate the pseudofrequencies Gab. α is the weight of the real frequencies Pab and β the weight of the pseudofrequencies.

Gab = Σcd Pcd ⋅ BLOSUM62( a | c ) ⋅ BLOSUM62( b | d ) Pab = (α ⋅ Pab + β ⋅ Gab )/(α + β)

MIToS.Information.buslje09 — Method.

buslje09 takes a MSA or a file and a Format as first arguments. It calculates a Z score and a corrected MI/MIp as described on Busjle et. al. 2009.

keyword argument, type, default value and descriptions:

  - lambda      Float64   0.05    Low count value
  - clustering  Bool      true    Sequence clustering (Hobohm I)
  - threshold             62      Percent identity threshold for clustering
  - maxgap      Float64   0.5     Maximum fraction of gaps in positions included in calculation
  - apc         Bool      true    Use APC correction (MIp)
  - samples     Int       100     Number of samples for Z-score
  - fixedgaps   Bool      true    Fix gaps positions for the random samples
  - alphabet    ResidueAlphabet UngappedAlphabet()  Residue alphabet to be used

This function returns:

  - Z score
  - MI or MIp

MIToS.Information.count! — Method.

It populates a ContingencyTable (first argument) using the frequencies in the sequences (last positional arguments). The dimension of the table must match the number of sequences and all the sequences must have the same length. You must indicate the used weights and pseudocounts as second and third positional arguments respectively. You can use NoPseudofrequencies() and NoClustering() to avoid the use of sequence weighting and pseudocounts, respectively.

MIToS.Information.cumulative — Method.

cumulative allows to calculate cumulative scores (i.e. cMI) as defined in Buslje et. al. 2010

"We calculated a cumulative mutual information score (cMI) for each residue as the sum of MI values above a certain threshold for every amino acid pair where the particular residue appears. This value defines to what degree a given amino acid takes part in a mutual information network." Buslje, Cristina Marino, Elin Teppa, Tomas Di Doménico, José María Delfino, and Morten Nielsen. Networks of high mutual information define the structural proximity of catalytic sites: implications for catalytic residue identification. PLoS Comput Biol 6, no. 11 (2010): e1000978.

MIToS.Information.delete_dimensions! — Method.

delete_dimensions!(out::ContingencyTable, in::ContingencyTable, dimensions::Int...)

This function fills a ContingencyTable with the counts/probabilities on in after the deletion of dimensions. i.e. This is useful for getting Pxy from Pxyz.

MIToS.Information.delete_dimensions — Method.

delete_dimensions(in::ContingencyTable, dimensions::Int...)

This function creates a ContingencyTable with the counts/probabilities on in after the deletion of dimensions. i.e. This is useful for getting Pxy from Pxyz.

MIToS.Information.gap_intersection_percentage — Method.

It calculates the gap intersection as percentage from a table of Counts.

MIToS.Information.gap_union_percentage — Method.

It calculates the gap union as percentage from a table of Counts.

MIToS.Information.gaussdca — Method.

Wrapper function to GaussDCA.gDCA. You need to install GaussDCA:

Pkg.clone("https://github.com/carlobaldassi/GaussDCA.jl")

Look into GaussDCA.jl README for further information. If you use this wrapper, please cite the GaussDCA publication and the package's doi.

It's possible to indicate the path to the julia binary where GaussDCA is installed. However, it's recommended to use the same version where MIToS is installed. That is because this function use serialize/deserialize to transfer data between the processes.

GaussDCA Publication: Baldassi, Carlo, Marco Zamparo, Christoph Feinauer, Andrea Procaccini, Riccardo Zecchina, Martin Weigt, and Andrea Pagnani. "Fast and accurate multivariate Gaussian modeling of protein families: predicting residue contacts and protein-interaction partners." PloS one 9, no. 3 (2014): e92721.

MIToS.Information.getalphabet — Method.

getalphabet allows to access the stored alphabet object.

MIToS.Information.getcontingencytable — Method.

getcontingencytable allows to access the wrapped ContingencyTable in a Probabilities or Counts object.

MIToS.Information.getmarginals — Method.

getmarginals allows to access the array with the marginal values (NamedArray).

MIToS.Information.getmarginalsarray — Method.

getmarginalsarray allows to access the array with the marginal values (Array without names).

MIToS.Information.gettable — Method.

gettable allows to access the table (NamedArray).

MIToS.Information.gettablearray — Method.

gettablearray allows to access the table (Array without names).

MIToS.Information.gettotal — Method.

gettotal allows to access the stored total value.

MIToS.Information.kullback_leibler — Method.

It calculates the Kullback-Leibler (KL) divergence from a table of Probabilities. The second positional argument is a Probabilities or ContingencyTable with the background distribution. It's optional, the default is the BLOSUM62_Pi table. Use last and optional positional argument to change the base of the log. The default base is e, so the result is in nats. You can use 2.0 as base to get the result in bits.

MIToS.Information.mapcolfreq! — Method.

It efficiently map a function (first argument) that takes a table of Counts or Probabilities (third argument). The table is filled in place with the counts or probabilities of each column from the msa (second argument).

weights (default: NoClustering()): Weights to be used for table counting.
pseudocounts (default: NoPseudocount()): Pseudocount object to be applied to table.
pseudofrequencies (default: NoPseudofrequencies()): Pseudofrequencies to be applied to the normalized (probabilities) table.

MIToS.Information.mapcolpairfreq! — Method.

It efficiently map a function (first argument) that takes a table of Counts or Probabilities (third argument). The table is filled in place with the counts or probabilities of each pair of columns from the msa (second argument). The fourth positional argument usediagonal indicates if the function should be applied to identical element pairs (default to Val{true}).

weights (default: NoClustering()): Weights to be used for table counting.
pseudocounts (default: NoPseudocount()): Pseudocount object to be applied to table.
pseudofrequencies (default: NoPseudofrequencies()): Pseudofrequencies to be applied to the normalized (probabilities) table.
diagonalvalue (default: 0): Value to fill diagonal elements if usediagonal is Val{false}.

MIToS.Information.mapseqfreq! — Method.

It efficiently map a function (first argument) that takes a table of Counts or Probabilities (third argument). The table is filled in place with the counts or probabilities of each sequence from the msa (second argument).

weights (default: NoClustering()): Weights to be used for table counting.
pseudocounts (default: NoPseudocount()): Pseudocount object to be applied to table.
pseudofrequencies (default: NoPseudofrequencies()): Pseudofrequencies to be applied to the normalized (probabilities) table.

MIToS.Information.mapseqpairfreq! — Method.

It efficiently map a function (first argument) that takes a table of Counts or Probabilities (third argument). The table is filled in place with the counts or probabilities of each pair of sequences from the msa (second argument). The fourth positional argument usediagonal indicates if the function should be applied to identical element pairs (default to Val{true}).

weights (default: NoClustering()): Weights to be used for table counting.
pseudocounts (default: NoPseudocount()): Pseudocount object to be applied to table.
pseudofrequencies (default: NoPseudofrequencies()): Pseudofrequencies to be applied to the normalized (probabilities) table.
diagonalvalue (default: 0): Value to fill diagonal elements if usediagonal is Val{false}.

MIToS.Information.marginal_entropy — Method.

It calculates marginal entropy (H) from a table of Counts or Probabilities. The second positional argument is used to indicate the magin used to calculate the entropy, e.g. it estimates the entropy H(X) if marginal is 1, H(Y) for 2, etc. Use last and optional positional argument to change the base of the log. The default base is e, so the result is in nats. You can use 2.0 as base to get the result in bits.

MIToS.Information.mutual_information — Method.

It calculates Mutual Information (MI) from a table of Counts or Probabilities. Use last and optional positional argument to change the base of the log. The default base is e, so the result is in nats. You can use 2.0 as base to get the result in bits. Calculation of MI from Counts is faster than from Probabilities.

MIToS.Information.normalized_mutual_information — Method.

It calculates a Normalized Mutual Information (nMI) by Entropy from a table of Counts or Probabilities.

nMI(X, Y) = MI(X, Y) / H(X, Y)

MIToS.Information.pairwisegapfraction — Method.

It takes a MSA or a file and a Format as first arguments. It calculates the percentage of gaps on columns pairs (union and intersection) using sequence clustering (Hobohm I).

Argument, type, default value and descriptions:

    - clustering  Bool      true    Sequence clustering (Hobohm I)
    - threshold             62      Percent identity threshold for sequence clustering (Hobohm I)

This function returns:

    - pairwise gap union as percentage
    - pairwise gap intersection as percentage

MIToS.Information.probabilities! — Method.

It populates a ContingencyTable (first argument) using the probabilities in the sequences (last positional arguments). The dimension of the table must match the number of sequences and all the sequences must have the same length. You must indicate the used weights, pseudocounts and pseudofrequencies as second, third and fourth positional arguments respectively. You can use NoClustering(), NoPseudocount() and NoPseudofrequencies() to avoid the use of sequence weighting, pseudocounts and pseudofrequencies, respectively.

MIToS.Information.probabilities — Method.

It returns a ContingencyTable wrapped in a Probabilities type with the frequencies of residues in the sequences that takes as arguments. The dimension of the table is equal to the number of sequences. You can use the keyword arguments alphabet, weights, pseudocounts and pseudofrequencies to indicate the alphabet of the table (default to UngappedAlphabet()), a clustering result (default to NoClustering()), the pseudocounts (default to NoPseudocount()) and the pseudofrequencies (default to NoPseudofrequencies()) to be used during the estimation of the probabilities.

StatsBase.entropy — Method.

It calculates the Shannon entropy (H) from a table of Counts or Probabilities. Use last and optional positional argument to change the base of the log. The default base is e, so the result is in nats. You can use 2.0 as base to get the result in bits.