Pfam

MIToS defines methods and types useful for any MSA. The Pfam module uses other MIToS modules in the context of Pfam MSAs, where it’s possible to us determine how structure and sequence information should be mapped. This module defines functions that go from a Pfam MSA to the protein contact prediction performance of pairwise scores estimated from that MSA.

using MIToS.Pfam # to load the Pfam module

Features

Contents

Getting a Pfam MSA

The function downloadpfam takes a Pfam accession and downloads a Pfam MSA in Stockholm format. In that way, you can do

pfamfile = downloadpfam("PF18883")

to get the MSA. But, we are going to use an already downloaded file in this case:

using MIToS
pfamfile = joinpath(dirname(pathof(MIToS)), "..", "docs", "data", "PF18883.stockholm.gz");
"/home/runner/work/MIToS.jl/MIToS.jl/src/../docs/data/PF18883.stockholm.gz"

Use read_file function and the Stockholm FileFormat to get a AnnotatedMultipleSequenceAlignment object with the MSA and its Pfam annotations. You must set generatemapping and useidcoordinates to true the first time you read the downloaded MSA. This is necessary to some of the methods in the Pfam module.

msa = read_file(pfamfile, Stockholm, generatemapping = true, useidcoordinates = true)
AnnotatedMultipleSequenceAlignment with 1649 annotations : 811×113 Named Matrix{MIToS.MSA.Residue}
                 Seq ╲ Col │  53   54   55   56   57  …  428  429  430  431  432
───────────────────────────┼────────────────────────────────────────────────────
A0A370X5B3_9GAMM/1736-1853 │   -    -    -    -    -  …    Y    A    Y    R    L
D1AVB4_STRM9/461-568       │   -    -    -    -    -       Y    R    Y    R    L
A0A2T5HR93_9PSED/761-875   │   S    N    L    T    S       Y    Q    Y    T    L
A0A427CBK0_9GAMM/418-540   │   -    -    -    -    -       F    R    Y    R    L
A0A1I1MWW5_9BURK/779-907   │   -    -    -    -    -       Y    E    Y    R    L
C6JIE8_FUSVA/367-481       │   -    I    I    S    N       Y    E    Y    F    L
A0A017H8U7_9FUSO/691-802   │   S    S    V    T    E       Y    E    Y    T    L
A0A085VLZ6_PSESX/333-446   │   -    -    -    -    -       Y    E    Y    Y    L
⋮                              ⋮    ⋮    ⋮    ⋮    ⋮  ⋱    ⋮    ⋮    ⋮    ⋮    ⋮
A0A318T6N3_9RHIZ/636-751   │   -    -    -    K    T       Y    V    Y    Q    L
A0A388SGN0_9BURK/625-739   │   S    E    L    T    T       Y    L    Y    E    L
A0A443IAD3_9GAMM/1863-1969 │   -    -    -    -    -       Y    D    Y    S    L
A0A2S8VQP6_9CAUL/182-287   │   -    -    -    -    R       V    R    Y    R    -
A0A1A9RWJ5_9NEIS/1004-1099 │   S    T    V    T    N       -    -    -    -    -
A6T0K2_JANMA/343-459       │   -    -    -    -    -       Y    A    Y    T    L
Q2KVL9_BORA1/480-598       │   S    R    I    T    R       Y    E    Y    R    L
A0A2W4CX06_9RHIZ/331-444   │   -    -    -    -    N  …    Y    E    Y    L    L

Getting PDB information from an MSA

The function getseq2pdb parses the MSA annotations to return a Dict from the sequence identifier in the MSA to PDB and chain codes.

getseq2pdb(msa)
Dict{String, Vector{Tuple{String, String}}} with 1 entry:
  "ICSA_SHIFL/611-720" => [("5KE1", "B"), ("3ML3", "A"), ("5KE1", "A")]

Once you know the association between PDB chains and sequences, you can use that information together with the msacolumn2pdbresidue function to get the PDB residue number that correspond to each MSA column for given a determined sequence and PDB chain. That function downloads information from SIFTS to generate the mapping.

col2res = msacolumn2pdbresidue(msa, "ICSA_SHIFL/611-720", "3ML3", "A")
OrderedCollections.OrderedDict{Int64, String} with 103 entries:
  61  => "618"
  62  => "619"
  64  => "620"
  65  => "621"
  66  => "622"
  67  => "623"
  68  => "624"
  76  => "625"
  77  => "626"
  78  => "627"
  79  => "628"
  80  => "629"
  96  => "630"
  97  => "631"
  98  => "632"
  101 => "633"
  102 => "634"
  103 => "635"
  106 => "636"
  ⋮   => ⋮

The returned dictionary can be used to get the PDB residue associated to each column (using the msaresidues function)...

using MIToS.PDB
pdbfile = downloadpdb("3ML3")
pdb = read_file(pdbfile, MMCIFFile)
resdict = residuesdict(pdb, model = "1", chain = "A", group = "ATOM")

msaresidues(msa, resdict, col2res)
OrderedCollections.OrderedDict{Int64, MIToS.PDB.PDBResidue} with 103 entries:
  61  => PDBResidue:…
  62  => PDBResidue:…
  64  => PDBResidue:…
  65  => PDBResidue:…
  66  => PDBResidue:…
  67  => PDBResidue:…
  68  => PDBResidue:…
  76  => PDBResidue:…
  77  => PDBResidue:…
  78  => PDBResidue:…
  79  => PDBResidue:…
  80  => PDBResidue:…
  96  => PDBResidue:…
  97  => PDBResidue:…
  98  => PDBResidue:…
  101 => PDBResidue:…
  102 => PDBResidue:…
  103 => PDBResidue:…
  106 => PDBResidue:…
  ⋮   => ⋮

...or to delete the columns without PDB residues (using the hasresidues function):

using MIToS.MSA
filtercolumns!(msa, hasresidues(msa, col2res))
AnnotatedMultipleSequenceAlignment with 1650 annotations : 811×103 Named Matrix{MIToS.MSA.Residue}
                 Seq ╲ Col │  61   62   64   65   66  …  428  429  430  431  432
───────────────────────────┼────────────────────────────────────────────────────
A0A370X5B3_9GAMM/1736-1853 │   S    -    G    T    I  …    Y    A    Y    R    L
D1AVB4_STRM9/461-568       │   N    K    G    L    I       Y    R    Y    R    L
A0A2T5HR93_9PSED/761-875   │   D    H    S    L    I       Y    Q    Y    T    L
A0A427CBK0_9GAMM/418-540   │   R    -    G    V    I       F    R    Y    R    L
A0A1I1MWW5_9BURK/779-907   │   -    -    -    -    -       Y    E    Y    R    L
C6JIE8_FUSVA/367-481       │   L    N    G    T    T       Y    E    Y    F    L
A0A017H8U7_9FUSO/691-802   │   E    D    S    T    L       Y    E    Y    T    L
A0A085VLZ6_PSESX/333-446   │   A    -    G    T    L       Y    E    Y    Y    L
⋮                              ⋮    ⋮    ⋮    ⋮    ⋮  ⋱    ⋮    ⋮    ⋮    ⋮    ⋮
A0A318T6N3_9RHIZ/636-751   │   A    -    G    N    V       Y    V    Y    Q    L
A0A388SGN0_9BURK/625-739   │   N    S    G    T    I       Y    L    Y    E    L
A0A443IAD3_9GAMM/1863-1969 │   S    -    G    N    I       Y    D    Y    S    L
A0A2S8VQP6_9CAUL/182-287   │   D    G    G    L    I       V    R    Y    R    -
A0A1A9RWJ5_9NEIS/1004-1099 │   A    -    G    T    V       -    -    -    -    -
A6T0K2_JANMA/343-459       │   D    N    G    T    I       Y    A    Y    T    L
Q2KVL9_BORA1/480-598       │   R    D    S    S    V       Y    E    Y    R    L
A0A2W4CX06_9RHIZ/331-444   │   S    -    G    T    I  …    Y    E    Y    L    L

PDB contacts and AUC

The Dict between MSA columns and PDB residue number also can be used to generate a protein contact map associated to the MSA.

cmap = msacontacts(msa, resdict, col2res)
103×103 Named PairwiseListMatrices.PairwiseListMatrix{Float64, false, Vector{Float64}}
Col1 ╲ Col2 │  61   62   64   65   66   67  …  427  428  429  430  431  432
────────────┼──────────────────────────────────────────────────────────────
61          │ NaN  1.0  1.0  1.0  0.0  0.0  …  0.0  0.0  0.0  0.0  0.0  0.0
62          │ 1.0  NaN  1.0  1.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0
64          │ 1.0  1.0  NaN  1.0  1.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0
65          │ 1.0  1.0  1.0  NaN  1.0  1.0     0.0  0.0  0.0  0.0  0.0  0.0
66          │ 0.0  0.0  1.0  1.0  NaN  1.0     0.0  0.0  0.0  0.0  0.0  0.0
67          │ 0.0  0.0  0.0  1.0  1.0  NaN     0.0  0.0  0.0  0.0  0.0  0.0
68          │ 0.0  0.0  0.0  0.0  1.0  1.0     0.0  0.0  0.0  0.0  0.0  0.0
76          │ 0.0  0.0  0.0  0.0  0.0  1.0     0.0  0.0  0.0  0.0  0.0  0.0
⋮               ⋮    ⋮    ⋮    ⋮    ⋮    ⋮  ⋱    ⋮    ⋮    ⋮    ⋮    ⋮    ⋮
423         │ 0.0  0.0  0.0  0.0  0.0  0.0     1.0  1.0  1.0  1.0  0.0  0.0
424         │ 0.0  0.0  0.0  0.0  0.0  0.0     1.0  1.0  0.0  0.0  0.0  0.0
427         │ 0.0  0.0  0.0  0.0  0.0  0.0     NaN  1.0  1.0  0.0  0.0  0.0
428         │ 0.0  0.0  0.0  0.0  0.0  0.0     1.0  NaN  1.0  1.0  0.0  0.0
429         │ 0.0  0.0  0.0  0.0  0.0  0.0     1.0  1.0  NaN  1.0  1.0  0.0
430         │ 0.0  0.0  0.0  0.0  0.0  0.0     0.0  1.0  1.0  NaN  1.0  1.0
431         │ 0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  1.0  1.0  NaN  1.0
432         │ 0.0  0.0  0.0  0.0  0.0  0.0  …  0.0  0.0  0.0  1.0  1.0  NaN

That protein contact map can be used to calculate the Area Under the ROC Curve for a given score with the AUC function.

using MIToS.Information
ZMIp, MIp = buslje09(msa)

using ROCAnalysis # You need to load ROCAnalysis to use the AUC function

AUC(ZMIp, cmap)
0.7514388233747185