Pfam

MIToS defines methods and types useful for any MSA. The Pfam module uses other MIToS modules in the context of Pfam MSAs, where it’s possible to us determine how structure and sequence information should be mapped. This module defines functions that go from a Pfam MSA to the protein contact prediction performance of pairwise scores estimated from that MSA.

using MIToS.Pfam # to load the Pfam module

Features

Download and read Pfam MSAs.
Obtain PDB information from alignment annotations.
Map between sequence/alignment residues/columns and PDB structures.
Measure of AUC (ROC curve) for protein contact prediction of MI scores.

Pfam

Getting a Pfam MSA

The function downloadpfam takes a Pfam accession and downloads a Pfam MSA in Stockholm format. In that way, you can do

pfamfile = downloadpfam("PF18883")

to get the MSA. But, we are going to use an already downloaded file in this case:

using MIToS
pfamfile = joinpath(dirname(pathof(MIToS)), "..", "docs", "data", "PF18883.stockholm.gz");

"/home/runner/work/MIToS.jl/MIToS.jl/src/../docs/data/PF18883.stockholm.gz"

Use read_file function and the Stockholm FileFormat to get a AnnotatedMultipleSequenceAlignment object with the MSA and its Pfam annotations. You must set generatemapping and useidcoordinates to true the first time you read the downloaded MSA. This is necessary to some of the methods in the Pfam module.

msa = read_file(pfamfile, Stockholm, generatemapping = true, useidcoordinates = true)

AnnotatedMultipleSequenceAlignment with 1743 annotations : 859×116 Named Matrix{MIToS.MSA.Residue}
                 Seq ╲ Col │  45   46   47   48   49  …  476  477  478  479  480
───────────────────────────┼────────────────────────────────────────────────────
Q13FK3_PARXL/450-568       │   S    T    A    T    N  …    F    D    Y    L    L
A0A2C6C5P0_9GAMM/492-602   │   -    -    L    T    G       F    D    Y    F    L
A0A0A7A549_SHIDY/581-689   │   -    -    -    -    -       Y    D    Y    R    L
A0A1E3VNB9_9HYPH/439-551   │   -    -    -    -    -       F    A    Y    G    L
A0A4R0YPQ6_9GAMM/572-696   │   -    K    L    G    R       W    Q    Y    T    L
A0A1I0DBW0_9GAMM/2097-2206 │   -    -    -    -    -       Y    D    Y    Q    L
A7ZLW8_ECO24/839-959       │   -    -    -    -    -       Y    V    Y    T    L
B7LGS8_ECO55/291-403       │   S    T    V    S    N       Y    E    Y    S    L
⋮                              ⋮    ⋮    ⋮    ⋮    ⋮  ⋱    ⋮    ⋮    ⋮    ⋮    ⋮
Q0WGA9_YERPE/790-917       │   -    -    -    -    -       Y    A    Y    T    L
A0A1E3WDC8_9HYPH/497-591   │   -    -    -    -    -       F    D    Y    D    L
A0A7W6P2V6_9HYPH/510-621   │   -    -    -    -    -       Y    A    Y    R    L
A0A2T5HR93_9PSED/761-875   │   S    N    L    T    S       Y    Q    Y    T    L
A0A843YVK4_9BURK/1398-1516 │   S    M    L    S    S       F    Q    Y    L    L
M4RY98_9SPHN/2929-3041     │   -    -    -    -    -       Y    E    Y    K    L
A0A2Z3I4X0_9GAMM/539-665   │   -    -    -    -    -       Y    E    Y    T    L
A0A4R0YYZ7_9GAMM/486-609   │   -    N    L    T    N  …    Y    D    Y    G    L

Getting PDB information from an MSA

The function getseq2pdb parses the MSA annotations, if present, to return a Dict from the sequence identifier in the MSA to PDB and chain codes. In the case of recently downloaded MSAs, which usually lack PDB mapping annotations, the function will use SIFTS chain-level mappings for this task. In such cases, the required SIFTS summary CSV files are automatically downloaded and cached.

getseq2pdb(msa)

Dict{String, Vector{Tuple{String, String}}} with 1 entry:
  "ICSA_SHIFL/612-720" => [("3ML3", "A"), ("5KE1", "A"), ("5KE1", "B")]

Once you know the association between PDB chains and sequences, you can use that information together with the msacolumn2pdbresidue function to get the PDB residue number that correspond to each MSA column for given a determined sequence and PDB chain. That function downloads information from SIFTS to generate the mapping.

col2res = msacolumn2pdbresidue(msa, "ICSA_SHIFL/612-720", "3ML3", "A")

OrderedCollections.OrderedDict{Int64, String} with 102 entries:
  57  => "619"
  59  => "620"
  61  => "621"
  63  => "622"
  65  => "623"
  67  => "624"
  75  => "625"
  76  => "626"
  77  => "627"
  78  => "628"
  79  => "629"
  95  => "630"
  96  => "631"
  97  => "632"
  101 => "633"
  102 => "634"
  103 => "635"
  106 => "636"
  107 => "637"
  ⋮   => ⋮

The returned dictionary can be used to get the PDB residue associated to each column (using the msaresidues function)...

using MIToS.PDB
pdbfile = downloadpdb("3ML3")
pdb = read_file(pdbfile, MMCIFFile)
resdict = residuesdict(pdb, model = "1", chain = "A", group = "ATOM")

msaresidues(msa, resdict, col2res)

OrderedCollections.OrderedDict{Int64, MIToS.PDB.PDBResidue} with 102 entries:
  57  => PDBResidue(id=PDBResidueIdentifier(PDB_number="50", number="619", name…
  59  => PDBResidue(id=PDBResidueIdentifier(PDB_number="51", number="620", name…
  61  => PDBResidue(id=PDBResidueIdentifier(PDB_number="52", number="621", name…
  63  => PDBResidue(id=PDBResidueIdentifier(PDB_number="53", number="622", name…
  65  => PDBResidue(id=PDBResidueIdentifier(PDB_number="54", number="623", name…
  67  => PDBResidue(id=PDBResidueIdentifier(PDB_number="55", number="624", name…
  75  => PDBResidue(id=PDBResidueIdentifier(PDB_number="56", number="625", name…
  76  => PDBResidue(id=PDBResidueIdentifier(PDB_number="57", number="626", name…
  77  => PDBResidue(id=PDBResidueIdentifier(PDB_number="58", number="627", name…
  78  => PDBResidue(id=PDBResidueIdentifier(PDB_number="59", number="628", name…
  79  => PDBResidue(id=PDBResidueIdentifier(PDB_number="60", number="629", name…
  95  => PDBResidue(id=PDBResidueIdentifier(PDB_number="61", number="630", name…
  96  => PDBResidue(id=PDBResidueIdentifier(PDB_number="62", number="631", name…
  97  => PDBResidue(id=PDBResidueIdentifier(PDB_number="63", number="632", name…
  101 => PDBResidue(id=PDBResidueIdentifier(PDB_number="64", number="633", name…
  102 => PDBResidue(id=PDBResidueIdentifier(PDB_number="65", number="634", name…
  103 => PDBResidue(id=PDBResidueIdentifier(PDB_number="66", number="635", name…
  106 => PDBResidue(id=PDBResidueIdentifier(PDB_number="67", number="636", name…
  107 => PDBResidue(id=PDBResidueIdentifier(PDB_number="68", number="637", name…
  ⋮   => ⋮

...or to delete the columns without PDB residues (using the hasresidues function):

using MIToS.MSA
filtercolumns!(msa, hasresidues(msa, col2res))

AnnotatedMultipleSequenceAlignment with 1744 annotations : 859×102 Named Matrix{MIToS.MSA.Residue}
                 Seq ╲ Col │  57   59   61   63   65  …  476  477  478  479  480
───────────────────────────┼────────────────────────────────────────────────────
Q13FK3_PARXL/450-568       │   A    S    N    I    H  …    F    D    Y    L    L
A0A2C6C5P0_9GAMM/492-602   │   N    A    T    L    K       F    D    Y    F    L
A0A0A7A549_SHIDY/581-689   │   N    G    H    V    I       Y    D    Y    R    L
A0A1E3VNB9_9HYPH/439-551   │   -    -    -    -    -       F    A    Y    G    L
A0A4R0YPQ6_9GAMM/572-696   │   D    G    V    I    A       W    Q    Y    T    L
A0A1I0DBW0_9GAMM/2097-2206 │   -    -    -    -    -       Y    D    Y    Q    L
A7ZLW8_ECO24/839-959       │   G    G    T    V    Q       Y    V    Y    T    L
B7LGS8_ECO55/291-403       │   N    S    T    V    Y       Y    E    Y    S    L
⋮                              ⋮    ⋮    ⋮    ⋮    ⋮  ⋱    ⋮    ⋮    ⋮    ⋮    ⋮
Q0WGA9_YERPE/790-917       │   -    -    -    -    -       Y    A    Y    T    L
A0A1E3WDC8_9HYPH/497-591   │   -    -    -    -    -       F    D    Y    D    L
A0A7W6P2V6_9HYPH/510-621   │   -    -    -    -    -       Y    A    Y    R    L
A0A2T5HR93_9PSED/761-875   │   H    S    L    I    D       Y    Q    Y    T    L
A0A843YVK4_9BURK/1398-1516 │   -    G    T    V    A       F    Q    Y    L    L
M4RY98_9SPHN/2929-3041     │   -    -    -    -    -       Y    E    Y    K    L
A0A2Z3I4X0_9GAMM/539-665   │   -    -    -    -    -       Y    E    Y    T    L
A0A4R0YYZ7_9GAMM/486-609   │   D    S    A    V    D  …    Y    D    Y    G    L

PDB contacts and AUC

The Dict between MSA columns and PDB residue number also can be used to generate a protein contact map associated to the MSA.

cmap = msacontacts(msa, resdict, col2res)

102×102 Named PairwiseListMatrices.PairwiseListMatrix{Float64, false, Vector{Float64}}
Col1 ╲ Col2 │  57   59   61   63   65   67  …  472  476  477  478  479  480
────────────┼──────────────────────────────────────────────────────────────
57          │ NaN  1.0  1.0  0.0  0.0  0.0  …  0.0  0.0  0.0  0.0  0.0  0.0
59          │ 1.0  NaN  1.0  1.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0
61          │ 1.0  1.0  NaN  1.0  1.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0
63          │ 0.0  1.0  1.0  NaN  1.0  1.0     0.0  0.0  0.0  0.0  0.0  0.0
65          │ 0.0  0.0  1.0  1.0  NaN  1.0     0.0  0.0  0.0  0.0  0.0  0.0
67          │ 0.0  0.0  0.0  1.0  1.0  NaN     0.0  0.0  0.0  0.0  0.0  0.0
75          │ 0.0  0.0  0.0  0.0  1.0  1.0     0.0  0.0  0.0  0.0  0.0  0.0
76          │ 0.0  0.0  0.0  0.0  1.0  1.0     0.0  0.0  0.0  0.0  0.0  0.0
⋮               ⋮    ⋮    ⋮    ⋮    ⋮    ⋮  ⋱    ⋮    ⋮    ⋮    ⋮    ⋮    ⋮
466         │ 0.0  0.0  0.0  0.0  0.0  0.0     1.0  1.0  1.0  1.0  0.0  0.0
469         │ 0.0  0.0  0.0  0.0  0.0  0.0     1.0  1.0  0.0  0.0  0.0  0.0
472         │ 0.0  0.0  0.0  0.0  0.0  0.0     NaN  1.0  1.0  0.0  0.0  0.0
476         │ 0.0  0.0  0.0  0.0  0.0  0.0     1.0  NaN  1.0  1.0  0.0  0.0
477         │ 0.0  0.0  0.0  0.0  0.0  0.0     1.0  1.0  NaN  1.0  1.0  0.0
478         │ 0.0  0.0  0.0  0.0  0.0  0.0     0.0  1.0  1.0  NaN  1.0  1.0
479         │ 0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  1.0  1.0  NaN  1.0
480         │ 0.0  0.0  0.0  0.0  0.0  0.0  …  0.0  0.0  0.0  1.0  1.0  NaN

That protein contact map can be used to calculate the Area Under the ROC Curve for a given score with the AUC function.

using MIToS.Information
ZMIp, MIp = buslje09(msa)

using ROCAnalysis # You need to load ROCAnalysis to use the AUC function

AUC(ZMIp, cmap)

0.7343047894573265

Pfam

Features

Contents

Getting a Pfam MSA

Getting PDB information from an MSA

PDB contacts and AUC