MIToS' Scripts

The MIToS_Scripts.jl package offers a set of easy-to-use scripts that run from the command line without requiring any Julia coding. The scripts cover several bioinformatics tasks, such as estimating residue conservation and inter-residue coevolution, calculating distances between residues in a protein structure, and more.

Installation

To install MIToS_Scripts.jl, you only need Julia 1.9 or later installed on your system. Run julia in the terminal to open the Julia REPL, and then run the following commands:

using Pkg
Pkg.add(url = "https://github.com/MIToSOrg/MIToS_Scripts.jl")

Then, you can get the location of the installed scripts by running the following command:

using MIToS_Scripts
scripts_folder = joinpath(pkgdir(MIToS_Scripts), "scripts")

You can run them from that location. Alternatively, you can add the location to your PATH environment variable, or copy the scripts to a folder already in your PATH to run them from anywhere.
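
On a Unix-like shell, adding the folder to PATH could look like the sketch below. The helper name add_to_path is hypothetical, and Julia is only queried when a julia executable is found; to make the change permanent, the export would need to go into your shell's startup file.

```shell
# Append a directory to PATH only if it is not already listed.
# `add_to_path` is an illustrative helper, not part of MIToS_Scripts.jl.
add_to_path() {
    case ":$PATH:" in
        *":$1:"*) ;;                  # already present, nothing to do
        *) PATH="$PATH:$1" ;;
    esac
    export PATH
}

# Ask Julia for the scripts folder (same expression as in the REPL example)
# when Julia is available; otherwise leave PATH untouched.
if command -v julia >/dev/null 2>&1; then
    scripts_folder=$(julia -e 'using MIToS_Scripts; print(joinpath(pkgdir(MIToS_Scripts), "scripts"))' 2>/dev/null) || scripts_folder=""
    if [ -n "$scripts_folder" ]; then
        add_to_path "$scripts_folder"
    fi
fi
```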

Usage

You can execute each provided script from your command line. For example, to run the Buslje09.jl script from the folder where the scripts are located, use:

julia Buslje09.jl input_msa_file

Refer to the documentation of each script for specific usage instructions; you can access it by running the script with the --help or -h flag:

julia Buslje09.jl -h
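
Most of the scripts documented in this page also accept -l/--list, in which case the input file is a list of file names. A minimal sketch of preparing such a list follows. It assumes one file name per line, which the help text does not state explicitly, and uses a hypothetical helper (make_msa_list) and temporary demo files; the final command is only printed, not executed.

```shell
# Build a list file with one MSA file name per line (assumed list layout).
# `make_msa_list` is an illustrative helper, not part of MIToS_Scripts.jl.
make_msa_list() {
    dir=$1
    list=$2
    : > "$list"                          # start with an empty list file
    for f in "$dir"/*.sto; do
        [ -e "$f" ] || continue          # skip the raw pattern when nothing matches
        printf '%s\n' "$f" >> "$list"
    done
}

# Demo on a temporary folder so the snippet is self-contained.
demo_dir=$(mktemp -d)
touch "$demo_dir/PF00001.sto" "$demo_dir/PF00002.sto"
list_file="$demo_dir/msa_list.txt"
make_msa_list "$demo_dir" "$list_file"

# Command one would then run, here with two worker processes (printed only):
echo "julia Buslje09.jl --list -p 2 $list_file"
```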

Scripts

Buslje09.jl

usage: Buslje09.jl [-l] [-o OUTPUT] [-p PARALLEL] [-f FORMAT]
                   [-L LAMBDA] [-c] [-i THRESHOLD] [-g MAXGAP] [-a]
                   [-s SAMPLES] [-G] [-F] [--version] [-h] [FILE]

This takes a MSA file as input. It calculates and saves on
*.buslje09.csv a Z score and a corrected MI/MIp as described on:
Buslje, C. M., Santos, J., Delfino, J. M., & Nielsen, M. (2009).
Correction for phylogeny, small number of observations and data
redundancy improves the identification of coevolving amino acid pairs
using mutual information. Bioinformatics, 25(9), 1125-1131.

positional arguments:
  FILE                  File name. If it is not used, the script reads
                        from STDIN.

optional arguments:
  -l, --list            The input is a list of file names. If -p is
                        used, files will be processed in parallel.
  -o, --output OUTPUT   Name of the output file. Output will be gzip
                        if the extension is ".gz". If it starts with a
                        dot, the name is used as a suffix or extension
                        of the input filename. If it ends with a dot,
                        is used as a prefix. If the output name starts
                        and ends with dots, it's used as an interfix
                        before the extension.If a single file is used
                        and there is not a file name (STDIN), the
                        output will be print into STDOUT, unless a
                        output filename is used. You can use "STDOUT"
                        to force print into STDOUT. STDOUT can not be
                        use with --list. (default: ".buslje09.csv")
  -p, --parallel PARALLEL
                        Number of worker processes. (type: Int64,
                        default: 1)
  -f, --format FORMAT   Format of the MSA: Stockholm, Raw or FASTA
                        (default: "Stockholm")
  -L, --lambda LAMBDA   Low count value (type: Float64, default: 0.05)
  -c, --clustering      Sequence clustering (Hobohm I)
  -i, --threshold THRESHOLD
                        Percent identity threshold for sequence
                        clustering (Hobohm I) (type: Float64, default:
                        62.0)
  -g, --maxgap MAXGAP   Maximum fraction of gaps in positions included
                        in calculation (type: Float64, default: 0.5)
  -a, --apc             Use APC correction (MIp)
  -s, --samples SAMPLES
                        Number of samples for Z-score (type: Int64,
                        default: 100)
  -G, --usegap          Use gaps on statistics
  -F, --fixedgaps       Fix gaps positions for the random samples
  --version             show version information and exit
  -h, --help            show this help message and exit


MIToS 3.0.6

Bioinformatics Unit
Leloir Institute Foundation
Av. Patricias Argentinas 435, CP C1405BWE, Buenos Aires, Argentina
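
The -o naming rules described in the help text above can be sketched as a small shell function. This illustrates the rules as worded in the help and is not the scripts' actual code. In particular, it assumes that a suffix is appended to the full input name, that an interfix goes before the last extension, and that the input name has an extension. The function name output_name and the file name PF00085.sto are hypothetical.

```shell
# Derive an output file name from an input name and an -o value,
# following the rules in the help text (assumed reading, see lead-in).
output_name() {
    input=$1
    out=$2
    case "$out" in
        .*.)  # starts and ends with a dot: interfix before the last extension
            printf '%s%s%s\n' "${input%.*}" "$out" "${input##*.}" ;;
        .*)   # starts with a dot: suffix or extension of the input name
            printf '%s%s\n' "$input" "$out" ;;
        *.)   # ends with a dot: prefix of the input name
            printf '%s%s\n' "$out" "$input" ;;
        *)    # anything else is taken as the output name itself
            printf '%s\n' "$out" ;;
    esac
}

output_name PF00085.sto .buslje09.csv   # PF00085.sto.buslje09.csv
output_name PF00085.sto .aligned.       # PF00085.aligned.sto
output_name PF00085.sto mi.             # mi.PF00085.sto
output_name PF00085.sto results.csv     # results.csv
```

The same rules are repeated in the -o description of the other scripts, each with its own default value.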

BLMI.jl

usage: BLMI.jl [-l] [-o OUTPUT] [-p PARALLEL] [-f FORMAT] [-b BETA]
               [-i THRESHOLD] [-g MAXGAP] [-a] [-s SAMPLES] [-F]
               [--version] [-h] [FILE]

This takes a MSA file as input. Calculates and saves on *.BLMI.csv a Z
score and a corrected MI/MIp. The script uses BLOSUM62 based pseudo
frequencies and sequences clustering (Hobohm I).

positional arguments:
  FILE                  File name. If it is not used, the script reads
                        from STDIN.

optional arguments:
  -l, --list            The input is a list of file names. If -p is
                        used, files will be processed in parallel.
  -o, --output OUTPUT   Name of the output file. Output will be gzip
                        if the extension is ".gz". If it starts with a
                        dot, the name is used as a suffix or extension
                        of the input filename. If it ends with a dot,
                        is used as a prefix. If the output name starts
                        and ends with dots, it's used as an interfix
                        before the extension.If a single file is used
                        and there is not a file name (STDIN), the
                        output will be print into STDOUT, unless a
                        output filename is used. You can use "STDOUT"
                        to force print into STDOUT. STDOUT can not be
                        use with --list. (default: ".BLMI.csv")
  -p, --parallel PARALLEL
                        Number of worker processes. (type: Int64,
                        default: 1)
  -f, --format FORMAT   Format of the MSA: Stockholm, Raw or FASTA
                        (default: "Stockholm")
  -b, --beta BETA       β for BLOSUM62 pseudo frequencies (type:
                        Float64, default: 8.512)
  -i, --threshold THRESHOLD
                        Percent identity threshold for sequence
                        clustering (Hobohm I) (type: Float64, default:
                        62.0)
  -g, --maxgap MAXGAP   Maximum fraction of gaps in positions included
                        in calculation (type: Float64, default: 0.5)
  -a, --apc             Use APC correction (MIp)
  -s, --samples SAMPLES
                        Number of samples for Z-score (type: Int64,
                        default: 50)
  -F, --fixedgaps       Fix gaps positions for the random samples
  --version             show version information and exit
  -h, --help            show this help message and exit



Conservation.jl

usage: Conservation.jl [-l] [-o OUTPUT] [-p PARALLEL] [-f FORMAT] [-c]
                       [-i THRESHOLD] [--version] [-h] [FILE]

This takes a MSA file as input and it calculates and saves on
*.conservation.csv the Shannon entropy (H) and Kullback-Leibler
divergence (KL) values for each column (Johansson and Toh 2010).
It is possible to do a sequence clustering using the Hobohm I
algorithm to avoid the effect of sequence redundancy in the
conservation scores. Each sequence in a cluster is weighted using the
inverse of the number of elements in that cluster.
Shannon entropy is a common measure of the residue variability of a
particular MSA column. For each column, we consider the frequency of
the 20 natural protein residues. This uses the Euler's number e as the
base of the logarithm, so the entropy is measured in nats.
The Kullback-Leibler divergence, also called relative entropy, is a
measure of residue conservation. It measures how much a probability
distribution differs from a background distribution. In particular,
this implementation measures the divergence between the residue
distribution of an MSA column and the probabilities derived from the
BLOSUM62 substitution matrix.
Johansson, F., Toh, H., 2010. A comparative study of conservation and
variation scores. BMC Bioinformatics 11, 388.

positional arguments:
  FILE                  File name. If it is not used, the script reads
                        from STDIN.

optional arguments:
  -l, --list            The input is a list of file names. If -p is
                        used, files will be processed in parallel.
  -o, --output OUTPUT   Name of the output file. Output will be gzip
                        if the extension is ".gz". If it starts with a
                        dot, the name is used as a suffix or extension
                        of the input filename. If it ends with a dot,
                        is used as a prefix. If the output name starts
                        and ends with dots, it's used as an interfix
                        before the extension.If a single file is used
                        and there is not a file name (STDIN), the
                        output will be print into STDOUT, unless a
                        output filename is used. You can use "STDOUT"
                        to force print into STDOUT. STDOUT can not be
                        use with --list. (default:
                        ".conservation.csv")
  -p, --parallel PARALLEL
                        Number of worker processes. (type: Int64,
                        default: 1)
  -f, --format FORMAT   Format of the MSA: Stockholm, Raw or FASTA
                        (default: "Stockholm")
  -c, --clustering      Sequence clustering (Hobohm I)
  -i, --threshold THRESHOLD
                        Percent identity threshold for sequence
                        clustering (Hobohm I) (type: Float64, default:
                        62.0)
  --version             show version information and exit
  -h, --help            show this help message and exit
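
In the usual notation, the two scores described in the help text above can be written as follows. Here p_i is taken to be the (possibly cluster-weighted) frequency of residue i in a column and q_i its background probability derived from BLOSUM62. Using the natural logarithm for KL as well is an assumption, since the help text only states the base for the entropy.

```latex
H = -\sum_{i=1}^{20} p_i \ln p_i
\qquad
KL = \sum_{i=1}^{20} p_i \ln \frac{p_i}{q_i}
```

Terms with p_i = 0 contribute zero. As a quick check, a column with two residues at frequency 0.5 each gives H = ln 2 ≈ 0.693 nats, while a fully conserved column gives H = 0.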



DownloadPDB.jl

usage: DownloadPDB.jl [-c CODE] [-l LIST] [-f FORMAT] [--version] [-h]

Download gzipped files from PDB.

optional arguments:
  -c, --code CODE      PDB code
  -l, --list LIST      File with a list of PDB codes (one per line)
  -f, --format FORMAT  Format. It should be PDBFile (pdb) or PDBML
                       (xml) (default: "PDBML")
  --version            show version information and exit
  -h, --help           show this help message and exit
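
A list file for -l/--list can be prepared from the shell. The sketch below keeps only lines that look like classic four-character PDB codes; the helper clean_pdb_list and the temporary files are hypothetical, and the filter is an assumption of this example rather than something DownloadPDB.jl requires. The final command is only printed, not executed.

```shell
# Keep only lines that look like classic four-character PDB codes
# (a digit followed by three letters or digits); anything else is dropped.
# `clean_pdb_list` is an illustrative helper, not part of MIToS_Scripts.jl.
clean_pdb_list() {
    grep -E '^[0-9][A-Za-z0-9]{3}$' "$1"
}

# Demo: a raw list with a blank line and a stray note.
raw_list=$(mktemp)
printf '%s\n' 1CBN "" 4HHB "# my notes" 1UBQ > "$raw_list"

codes_file=$(mktemp)
clean_pdb_list "$raw_list" > "$codes_file"

# Command one would then run (printed here, not executed):
echo "julia DownloadPDB.jl --list $codes_file"
```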



Distances.jl

usage: Distances.jl [-l] [-o OUTPUT] [-p PARALLEL] [-d DISTANCE]
                    [-f FORMAT] [-m MODEL] [-c CHAIN] [-g GROUP] [-i]
                    [--version] [-h] [FILE]

Calculates residues distance and writes them into a *.distances.csv.gz
gzipped file.

positional arguments:
  FILE                  File name. If it is not used, the script reads
                        from STDIN.

optional arguments:
  -l, --list            The input is a list of file names. If -p is
                        used, files will be processed in parallel.
  -o, --output OUTPUT   Name of the output file. Output will be gzip
                        if the extension is ".gz". If it starts with a
                        dot, the name is used as a suffix or extension
                        of the input filename. If it ends with a dot,
                        is used as a prefix. If the output name starts
                        and ends with dots, it's used as an interfix
                        before the extension.If a single file is used
                        and there is not a file name (STDIN), the
                        output will be print into STDOUT, unless a
                        output filename is used. You can use "STDOUT"
                        to force print into STDOUT. STDOUT can not be
                        use with --list. (default:
                        ".distances.csv.gz")
  -p, --parallel PARALLEL
                        Number of worker processes. (type: Int64,
                        default: 1)
  -d, --distance DISTANCE
                        The distance to be calculated, options: All,
                        Heavy, CA, CB (default: "All")
  -f, --format FORMAT   Format of the PDB file: It should be PDBFile
                        or PDBML (default: "PDBFile")
  -m, --model MODEL     The model to be used, use All for all
                        (default: "1")
  -c, --chain CHAIN     The chain to be used, use All for all
                        (default: "All")
  -g, --group GROUP     Group of atoms to be used, should be ATOM,
                        HETATM or All for all (default: "All")
  -i, --inter           Calculate inter chain distances
  --version             show version information and exit
  -h, --help            show this help message and exit
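
Because the default output is gzip-compressed, a quick look at the result does not require unpacking it to disk. The sketch below uses a stand-in file, since the column layout of the real output is not described here; peek_gz is a hypothetical helper name.

```shell
# Print the first lines of a gzip-compressed text file without unpacking it to disk.
# `peek_gz` is an illustrative helper; the second argument is the number of lines (default 5).
peek_gz() {
    gzip -dc "$1" | head -n "${2:-5}"
}

# Stand-in data: the real columns of *.distances.csv.gz are not assumed here.
sample=$(mktemp)
printf 'col_a,col_b\n1,2\n3,4\n' | gzip -c > "$sample.gz"

peek_gz "$sample.gz" 2
```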



MSADescription.jl

usage: MSADescription.jl [-l] [-o OUTPUT] [-p PARALLEL] [-f FORMAT]
                        [-e] [--version] [-h] [FILE]

Creates an *.description.csv from a Stockholm file with: the number of
columns, sequences, clusters after Hobohm clustering at 62% identity
and mean percent identity. Also the mean, standard deviation and
quantiles of: sequence coverage of the MSA, gap percentage.

positional arguments:
  FILE                  File name. If it is not used, the script reads
                        from STDIN.

optional arguments:
  -l, --list            The input is a list of file names. If -p is
                        used, files will be processed in parallel.
  -o, --output OUTPUT   Name of the output file. Output will be gzip
                        if the extension is ".gz". If it starts with a
                        dot, the name is used as a suffix or extension
                        of the input filename. If it ends with a dot,
                        is used as a prefix. If the output name starts
                        and ends with dots, it's used as an interfix
                        before the extension.If a single file is used
                        and there is not a file name (STDIN), the
                        output will be print into STDOUT, unless a
                        output filename is used. You can use "STDOUT"
                        to force print into STDOUT. STDOUT can not be
                        use with --list. (default: ".description.csv")
  -p, --parallel PARALLEL
                        Number of worker processes. (type: Int64,
                        default: 1)
  -f, --format FORMAT   Format of the MSA: Stockholm, Raw or FASTA
                        (default: "Stockholm")
  -e, --exact           If it's true, the mean percent identity is
                        exact (using all the pairwise comparisons).
  --version             show version information and exit
  -h, --help            show this help message and exit



PercentIdentity.jl

usage: PercentIdentity.jl [-l] [-o OUTPUT] [-p PARALLEL] [-f FORMAT]
                        [-s] [--version] [-h] [FILE]

Calculates the percentage identity between all the sequences of an MSA
and creates an *.pidstats.csv file with: The number of columns and
sequences. The mean, standard deviation, median, minimum and maximum
values and first and third quantiles of the percentage identity. It
could also create and pidlist.csv file with the percentage identity
for each pairwise comparison.

positional arguments:
  FILE                  File name. If it is not used, the script reads
                        from STDIN.

optional arguments:
  -l, --list            The input is a list of file names. If -p is
                        used, files will be processed in parallel.
  -o, --output OUTPUT   Name of the output file. Output will be gzip
                        if the extension is ".gz". If it starts with a
                        dot, the name is used as a suffix or extension
                        of the input filename. If it ends with a dot,
                        is used as a prefix. If the output name starts
                        and ends with dots, it's used as an interfix
                        before the extension.If a single file is used
                        and there is not a file name (STDIN), the
                        output will be print into STDOUT, unless a
                        output filename is used. You can use "STDOUT"
                        to force print into STDOUT. STDOUT can not be
                        use with --list. (default: ".pidstats.csv")
  -p, --parallel PARALLEL
                        Number of worker processes. (type: Int64,
                        default: 1)
  -f, --format FORMAT   Format of the MSA: Stockholm, Raw or FASTA
                        (default: "Stockholm")
  -s, --savelist        Create and pidlist.csv file with the
                        percentage identity for each pairwise
                        comparison.
  --version             show version information and exit
  -h, --help            show this help message and exit



AlignedColumns.jl

usage: AlignedColumns.jl [-l] [-o OUTPUT] [-p PARALLEL] [--version]
                        [-h] [FILE]

Creates a file in Stockholm format with the aligned columns from a
Pfam Stockholm file. Insertions are deleted, as they are unaligned in
a profile HMM. The output file *.aligned.* contains UniProt residue
numbers and original column numbers in its annotations.

positional arguments:
  FILE                  File name. If it is not used, the script reads
                        from STDIN.

optional arguments:
  -l, --list            The input is a list of file names. If -p is
                        used, files will be processed in parallel.
  -o, --output OUTPUT   Name of the output file. Output will be gzip
                        if the extension is ".gz". If it starts with a
                        dot, the name is used as a suffix or extension
                        of the input filename. If it ends with a dot,
                        is used as a prefix. If the output name starts
                        and ends with dots, it's used as an interfix
                        before the extension.If a single file is used
                        and there is not a file name (STDIN), the
                        output will be print into STDOUT, unless a
                        output filename is used. You can use "STDOUT"
                        to force print into STDOUT. STDOUT can not be
                        use with --list. (default: ".aligned.")
  -p, --parallel PARALLEL
                        Number of worker processes. (type: Int64,
                        default: 1)
  --version             show version information and exit
  -h, --help            show this help message and exit



SplitStockholm.jl

usage: SplitStockholm.jl [-p PATH] [--hideprogress] [--version] [-h]
                        file

Splits a file with multiple sequence alignments in Stockholm format,
creating one compressed file per MSA in Stockholm format:
accessionumber.gz

positional arguments:
  file             Input file

optional arguments:
  -p, --path PATH  Path for the output files [default: execution
                   directory] (default: "")
  --hideprogress   Hide the progress bar
  --version        show version information and exit
  -h, --help       show this help message and exit
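
Before splitting a large file, it can help to know how many output files to expect. In the Stockholm format each alignment ends with a // line, so a rough count is possible from the shell. The helper count_msas and the tiny demo file (including its accession values) are hypothetical, and this check does not reproduce what SplitStockholm.jl does internally.

```shell
# Count the alignments in a multi-MSA Stockholm file:
# each alignment is terminated by a line starting with "//".
count_msas() {
    grep -c '^//' "$1"
}

# Demo file with two minimal alignments.
multi_msa=$(mktemp)
cat > "$multi_msa" <<'EOF'
# STOCKHOLM 1.0
#=GF AC PF00001.1
seq1 ACDE
//
# STOCKHOLM 1.0
#=GF AC PF00002.1
seq1 FGHI
//
EOF

count_msas "$multi_msa"   # prints 2
```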

