Pfam
MIToS defines methods and types useful for any MSA. The Pfam module uses other MIToS modules in the context of Pfam MSAs, where it’s possible to us determine how structure and sequence information should be mapped. This module defines functions that go from a Pfam MSA to the protein contact prediction performance of pairwise scores estimated from that MSA.
using MIToS.Pfam # to load the Pfam moduleFeatures
- Download and read Pfam MSAs.
- Obtain PDB information from alignment annotations.
- Map between sequence/alignment residues/columns and PDB structures.
- Measure of AUC (ROC curve) for protein contact prediction of MI scores.
Contents
Getting a Pfam MSA
The function downloadpfam takes a Pfam accession and downloads a Pfam MSA in Stockholm format. In that way, you can do
pfamfile = downloadpfam("PF18883")to get the MSA. But, we are going to use an already downloaded file in this case:
using MIToS
pfamfile = joinpath(dirname(pathof(MIToS)), "..", "docs", "data", "PF18883.stockholm.gz");"/home/runner/work/MIToS.jl/MIToS.jl/src/../docs/data/PF18883.stockholm.gz"Use read_file function and the Stockholm FileFormat to get a AnnotatedMultipleSequenceAlignment object with the MSA and its Pfam annotations. You must set generatemapping and useidcoordinates to true the first time you read the downloaded MSA. This is necessary to some of the methods in the Pfam module.
msa = read_file(pfamfile, Stockholm, generatemapping = true, useidcoordinates = true)AnnotatedMultipleSequenceAlignment with 1743 annotations : 859×116 Named Matrix{MIToS.MSA.Residue}
Seq ╲ Col │ 45 46 47 48 49 … 476 477 478 479 480
───────────────────────────┼────────────────────────────────────────────────────
Q13FK3_PARXL/450-568 │ S T A T N … F D Y L L
A0A2C6C5P0_9GAMM/492-602 │ - - L T G F D Y F L
A0A0A7A549_SHIDY/581-689 │ - - - - - Y D Y R L
A0A1E3VNB9_9HYPH/439-551 │ - - - - - F A Y G L
A0A4R0YPQ6_9GAMM/572-696 │ - K L G R W Q Y T L
A0A1I0DBW0_9GAMM/2097-2206 │ - - - - - Y D Y Q L
A7ZLW8_ECO24/839-959 │ - - - - - Y V Y T L
B7LGS8_ECO55/291-403 │ S T V S N Y E Y S L
⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱ ⋮ ⋮ ⋮ ⋮ ⋮
Q0WGA9_YERPE/790-917 │ - - - - - Y A Y T L
A0A1E3WDC8_9HYPH/497-591 │ - - - - - F D Y D L
A0A7W6P2V6_9HYPH/510-621 │ - - - - - Y A Y R L
A0A2T5HR93_9PSED/761-875 │ S N L T S Y Q Y T L
A0A843YVK4_9BURK/1398-1516 │ S M L S S F Q Y L L
M4RY98_9SPHN/2929-3041 │ - - - - - Y E Y K L
A0A2Z3I4X0_9GAMM/539-665 │ - - - - - Y E Y T L
A0A4R0YYZ7_9GAMM/486-609 │ - N L T N … Y D Y G LGetting PDB information from an MSA
The function getseq2pdb parses the MSA annotations to return a Dict from the sequence identifier in the MSA to PDB and chain codes.
getseq2pdb(msa)Dict{String, Vector{Tuple{String, String}}} with 1 entry:
"ICSA_SHIFL/612-720" => [("3ML3", "A"), ("5KE1", "A"), ("5KE1", "B")]Once you know the association between PDB chains and sequences, you can use that information together with the msacolumn2pdbresidue function to get the PDB residue number that correspond to each MSA column for given a determined sequence and PDB chain. That function downloads information from SIFTS to generate the mapping.
col2res = msacolumn2pdbresidue(msa, "ICSA_SHIFL/612-720", "3ML3", "A")OrderedCollections.OrderedDict{Int64, String} with 102 entries:
57 => "619"
59 => "620"
61 => "621"
63 => "622"
65 => "623"
67 => "624"
75 => "625"
76 => "626"
77 => "627"
78 => "628"
79 => "629"
95 => "630"
96 => "631"
97 => "632"
101 => "633"
102 => "634"
103 => "635"
106 => "636"
107 => "637"
⋮ => ⋮The returned dictionary can be used to get the PDB residue associated to each column (using the msaresidues function)...
using MIToS.PDB
pdbfile = downloadpdb("3ML3")
pdb = read_file(pdbfile, MMCIFFile)
resdict = residuesdict(pdb, model = "1", chain = "A", group = "ATOM")
msaresidues(msa, resdict, col2res)OrderedCollections.OrderedDict{Int64, MIToS.PDB.PDBResidue} with 102 entries:
57 => PDBResidue(id=PDBResidueIdentifier(PDB_number="50", number="619", name…
59 => PDBResidue(id=PDBResidueIdentifier(PDB_number="51", number="620", name…
61 => PDBResidue(id=PDBResidueIdentifier(PDB_number="52", number="621", name…
63 => PDBResidue(id=PDBResidueIdentifier(PDB_number="53", number="622", name…
65 => PDBResidue(id=PDBResidueIdentifier(PDB_number="54", number="623", name…
67 => PDBResidue(id=PDBResidueIdentifier(PDB_number="55", number="624", name…
75 => PDBResidue(id=PDBResidueIdentifier(PDB_number="56", number="625", name…
76 => PDBResidue(id=PDBResidueIdentifier(PDB_number="57", number="626", name…
77 => PDBResidue(id=PDBResidueIdentifier(PDB_number="58", number="627", name…
78 => PDBResidue(id=PDBResidueIdentifier(PDB_number="59", number="628", name…
79 => PDBResidue(id=PDBResidueIdentifier(PDB_number="60", number="629", name…
95 => PDBResidue(id=PDBResidueIdentifier(PDB_number="61", number="630", name…
96 => PDBResidue(id=PDBResidueIdentifier(PDB_number="62", number="631", name…
97 => PDBResidue(id=PDBResidueIdentifier(PDB_number="63", number="632", name…
101 => PDBResidue(id=PDBResidueIdentifier(PDB_number="64", number="633", name…
102 => PDBResidue(id=PDBResidueIdentifier(PDB_number="65", number="634", name…
103 => PDBResidue(id=PDBResidueIdentifier(PDB_number="66", number="635", name…
106 => PDBResidue(id=PDBResidueIdentifier(PDB_number="67", number="636", name…
107 => PDBResidue(id=PDBResidueIdentifier(PDB_number="68", number="637", name…
⋮ => ⋮...or to delete the columns without PDB residues (using the hasresidues function):
using MIToS.MSA
filtercolumns!(msa, hasresidues(msa, col2res))AnnotatedMultipleSequenceAlignment with 1744 annotations : 859×102 Named Matrix{MIToS.MSA.Residue}
Seq ╲ Col │ 57 59 61 63 65 … 476 477 478 479 480
───────────────────────────┼────────────────────────────────────────────────────
Q13FK3_PARXL/450-568 │ A S N I H … F D Y L L
A0A2C6C5P0_9GAMM/492-602 │ N A T L K F D Y F L
A0A0A7A549_SHIDY/581-689 │ N G H V I Y D Y R L
A0A1E3VNB9_9HYPH/439-551 │ - - - - - F A Y G L
A0A4R0YPQ6_9GAMM/572-696 │ D G V I A W Q Y T L
A0A1I0DBW0_9GAMM/2097-2206 │ - - - - - Y D Y Q L
A7ZLW8_ECO24/839-959 │ G G T V Q Y V Y T L
B7LGS8_ECO55/291-403 │ N S T V Y Y E Y S L
⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱ ⋮ ⋮ ⋮ ⋮ ⋮
Q0WGA9_YERPE/790-917 │ - - - - - Y A Y T L
A0A1E3WDC8_9HYPH/497-591 │ - - - - - F D Y D L
A0A7W6P2V6_9HYPH/510-621 │ - - - - - Y A Y R L
A0A2T5HR93_9PSED/761-875 │ H S L I D Y Q Y T L
A0A843YVK4_9BURK/1398-1516 │ - G T V A F Q Y L L
M4RY98_9SPHN/2929-3041 │ - - - - - Y E Y K L
A0A2Z3I4X0_9GAMM/539-665 │ - - - - - Y E Y T L
A0A4R0YYZ7_9GAMM/486-609 │ D S A V D … Y D Y G LPDB contacts and AUC
The Dict between MSA columns and PDB residue number also can be used to generate a protein contact map associated to the MSA.
cmap = msacontacts(msa, resdict, col2res)102×102 Named PairwiseListMatrices.PairwiseListMatrix{Float64, false, Vector{Float64}}
Col1 ╲ Col2 │ 57 59 61 63 65 67 … 472 476 477 478 479 480
────────────┼──────────────────────────────────────────────────────────────
57 │ NaN 1.0 1.0 0.0 0.0 0.0 … 0.0 0.0 0.0 0.0 0.0 0.0
59 │ 1.0 NaN 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
61 │ 1.0 1.0 NaN 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
63 │ 0.0 1.0 1.0 NaN 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
65 │ 0.0 0.0 1.0 1.0 NaN 1.0 0.0 0.0 0.0 0.0 0.0 0.0
67 │ 0.0 0.0 0.0 1.0 1.0 NaN 0.0 0.0 0.0 0.0 0.0 0.0
75 │ 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
76 │ 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮
466 │ 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0 0.0 0.0
469 │ 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0
472 │ 0.0 0.0 0.0 0.0 0.0 0.0 NaN 1.0 1.0 0.0 0.0 0.0
476 │ 0.0 0.0 0.0 0.0 0.0 0.0 1.0 NaN 1.0 1.0 0.0 0.0
477 │ 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 NaN 1.0 1.0 0.0
478 │ 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 NaN 1.0 1.0
479 │ 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 NaN 1.0
480 │ 0.0 0.0 0.0 0.0 0.0 0.0 … 0.0 0.0 0.0 1.0 1.0 NaNThat protein contact map can be used to calculate the Area Under the ROC Curve for a given score with the AUC function.
using MIToS.Information
ZMIp, MIp = buslje09(msa)
using ROCAnalysis # You need to load ROCAnalysis to use the AUC function
AUC(ZMIp, cmap)0.7343047894573265