Linking structural and evolutionary information
Problem description
It is a very common task to map sequence to structure residue number. For example, to link structural information coming from PDB and evolutionary information calculated from multiple sequence alignments.
The naive way of mapping sequence and structure is to perform global pairwise alignment between the sequence and the PDB sequence (using the residues in ATOM). The problem with this approach is that the sequences can have missing regions and standard pairwise alignment algorithms often yield incorrect assignations around those regions (Velankar et.al. 2013). This is particularly important when aligning PDB sequences, that can have missing residues, and sequences coming from multiple sequence alignments, that can be incomplete or have unaligned regions (e.g. insert states).
The SIFTS (Structure Integration with Function, Taxonomy and Sequences) database solves this problem and provides residue level mapping between PDB and other databases (e.g. UniProt and Pfam).
The SIFTS
module of MIToS has functions to access this residue level mapping between PDB and other databases. Also, MIToS keeps track of the residue number of each residue in a multiple sequence alignment (MSA) using its annotations. Both things together, allow the correct mapping of sequence and structure without performing error-prone pairwise alignments.
Particular solutions depend on problem details, here we show some common ways to use MIToS and SIFTS to map evolutionary information calculated in an MSA (e.g. Shannon entropy) with structural information (e.g. B-factors).
PDB and Pfam alignment mapping
This is the easiest problem to solve with the MIToS Pfam
module because SIFTS already has a residue level mapping between PDB and Pfam.
For this example, we are going to map the columns in the multiple sequence alignment of the PF09645 Pfam family and the residues in the chain A from the 2VQC PDB file. The needed files are available in the MIToS test suite:
using MIToS
pdb_file = abspath(pathof(MIToS), "..", "..", "test", "data", "2VQC.pdb")
pfam_file = abspath(pathof(MIToS), "..", "..", "test", "data", "PF09645_full.stockholm")
sifts_file = abspath(pathof(MIToS), "..", "..", "test", "data", "2vqc.xml.gz")
You can also use downloadpdb
from MIToS.PDB
, downloadpfam
from MIToS.Pfam
and downloadsifts
from MIToS.SIFTS
to get the corresponding files from those databases.
It is important to read the Pfam MSA file using generatemapping=true
and useidcoordinates=true
because that allows keeping track of the residue number using the MSA annotations.
using MIToS.Pfam
msa = read_file(pfam_file, Stockholm, generatemapping = true, useidcoordinates = true)
AnnotatedMultipleSequenceAlignment with 16 annotations : 4×110 Named Matrix{MIToS.MSA.Residue}
Seq ╲ Col │ 6 7 8 9 10 … 111 112 113 114 115
───────────────────┼────────────────────────────────────────────────────
C3N734_SULIY/1-95 │ - - - N S … - - - - -
H2C869_9CREN/7-104 │ - - L N D - - - - -
Y070_ATV/2-70 │ - - - - - - - - - -
F112_SSV1/3-112 │ Q T L N S … I K A K Q
First, we need to know what is the sequence in the MSA that correspond to the PDB we want to link. Luckily, Pfam Stockholm files store the mapping between sequences and PDB chains. You can access that mapping using the getseq2pdb
function from MIToS.Pfam
seq2pdbs = getseq2pdb(msa)
Dict{String, Vector{Tuple{String, String}}} with 1 entry:
"F112_SSV1/3-112" => [("2VQC", "A")]
The returned dictionary gives you all the PDB chains associated with a determined sequence in the MSA. But, in this case, we want to go in the other direction to find all the sequences associated with a determined PDB chain. We are going to use a list comprehension because it is possible for a single chain to be associated with more than one sequence in the Pfam MSA (e.g. domain repeats).
pdb_code = "2VQC"
pdb_chain = "A"
seq_ids = [seq for (seq, pdbs) in seq2pdbs if (pdb_code, pdb_chain) in pdbs]
1-element Vector{String}:
"F112_SSV1/3-112"
In this example, we are going to use the only sequence we found for the A of 2VQC.
seq_id = seq_ids[1]
"F112_SSV1/3-112"
Finally, we can use the msacolumn2pdbresidue
function from the Pfam module to get a dictionary from the MSA column index to the PDB residue number:
pfam_id = "PF09645"
msacol2pdbres = msacolumn2pdbresidue(msa, seq_id, pdb_code, pdb_chain, pfam_id, sifts_file)
OrderedCollections.OrderedDict{Int64, String} with 110 entries:
6 => ""
7 => "4"
8 => "5"
9 => "6"
10 => "7"
11 => "8"
12 => "9"
13 => "10"
14 => "11"
15 => "12"
16 => "13"
17 => "14"
18 => "15"
19 => "16"
20 => "17"
21 => "18"
22 => "19"
23 => "20"
24 => "21"
⋮ => ⋮
This dictionary has the mapping between MSA column and PDB residue that allows the mapping between evolutionary and structural information. For example, to measure the correlation between entropy (related to residue variation in an MSA column) and the mean B factor of the residue:
using MIToS.Information
Hx = mapcolfreq!(
shannon_entropy,
msa,
Frequencies(ContingencyTable(Int, Val{1}, UngappedAlphabet())),
)
1×110 Named Matrix{Float64}
Function ╲ Col │ 6 7 8 … 113 114 115
────────────────┼──────────────────────────────────────────────────────────────
shannon_entropy │ 0.0 0.0 0.0 … 0.0 0.0 0.0
To get quick access to each PDB residue based on its residue number, we can read the PDB file into a dictionary using the read_file
and residuesdict
functions from the MIToS PDB
module:
using MIToS.PDB
res_dict = residuesdict(
read_file(pdb_file, PDBFile, occupancyfilter = true),
model = "1",
chain = "A",
)
OrderedCollections.OrderedDict{String, MIToS.PDB.PDBResidue} with 95 entries:
"4" => PDBResidue:…
"5" => PDBResidue:…
"6" => PDBResidue:…
"7" => PDBResidue:…
"8" => PDBResidue:…
"9" => PDBResidue:…
"10" => PDBResidue:…
"11" => PDBResidue:…
"12" => PDBResidue:…
"13" => PDBResidue:…
"14" => PDBResidue:…
"15" => PDBResidue:…
"16" => PDBResidue:…
"17" => PDBResidue:…
"18" => PDBResidue:…
"19" => PDBResidue:…
"20" => PDBResidue:…
"21" => PDBResidue:…
"22" => PDBResidue:…
⋮ => ⋮
Then, we can iterate the mapping dictionary to link the MSA and PDB based values:
using Statistics
x = Float64[]
y = Float64[]
for (col_index, res_number) in msacol2pdbres
if res_number != "" # i.e. MSA column has an associated PDB residue
push!(x, Hx[col_index])
push!(y, mean(parse(Float64, atom.B) for atom in res_dict[res_number].atoms))
end
end
cor(x, y)
0.059911489235294435
Unknown sequence coordinates
While Pfam alignments have the start and end of the aligned region indicated in the sequence name, other multiple sequence alignments don't give any hint about that. In those cases, we should use pairwise alignments. However, instead of aligning the sequence coming from the MSA and the PDB sequence, we can align the MSA sequence to the UniProt sequence to reduce the possibility of mapping errors. Once we have the mapping of the MSA sequence to the UniProt sequence, we can use SIFTS to map the PDB sequence to the MSA sequence using the UniProt numeration.
For this example, we are going to use the following files included in MIToS documentation:
using MIToS
pdb_file = abspath(pathof(MIToS), "..", "..", "docs", "data", "1dur.pdb")
msa_file = abspath(pathof(MIToS), "..", "..", "docs", "data", "blast_alignment.fa")
sifts_file = abspath(pathof(MIToS), "..", "..", "docs", "data", "1dur.xml.gz")
uniprot_file = abspath(pathof(MIToS), "..", "..", "docs", "data", "P00193.fasta")
First, we are going to read the MSA file. In this case, we can not use useidcoordinates=true
because the sequence names don't have the sequence coordinates in the Pfam format. However, we are going to use generatemapping=true
to get the default mapping for each sequence in the alignment (from 1
to the length of the aligned region):
using MIToS.MSA
msa = read_file(msa_file, FASTA, generatemapping = true)
AnnotatedMultipleSequenceAlignment with 102 annotations : 100×62 Named Matrix{MIToS.MSA.Residue}
Seq ╲ Col │ …
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼──────
Query_23408 PDB 1DUR 4Fe-4S ferredoxin-type 2 motif │ …
gi|120014|sp|P00193.1|FER_PEPAS RecName: Full=Ferredoxin │
gi|7546410|pdb|1DUR|A Chain A, Replacement for 1FDX 2(4FE4S) ferredoxin from (NOW) Peptostreptococcus asaccharolyticus │
gi|1180747586|ref|WP_084229968.1| 4Fe-4S dicluster domain-containing protein [Peptoniphilus asaccharolyticus] │
gi|1443100176|ref|WP_115312158.1| 4Fe-4S dicluster domain-containing protein [Peptoniphilus indolicus] │
gi|769945096|ref|WP_045078402.1| 4Fe-4S dicluster domain-containing protein [Peptoniphilus sp. ING2-D1G] │
gi|517967820|ref|WP_019138028.1| MULTISPECIES: 4Fe-4S dicluster domain-containing protein [Peptoniphilus] │
gi|496513533|ref|WP_009221816.1| MULTISPECIES: 4Fe-4S dicluster domain-containing protein [Peptoniphilus] │
⋮ ⋱
tpg|HBK85263.1||gnl|WGS:DNZI|DDZ53_04440 ferredoxin [Firmicutes bacterium] │
gi|1172242992|ref|WP_080632354.1| 4Fe-4S dicluster domain-containing protein [Desulfitobacterium hafniense] │
tpg|HBL81567.1||gnl|WGS:DOAW|DD391_03060 ferredoxin [Clostridiales bacterium] │
gi|1057632375|ref|WP_068914150.1| 4Fe-4S dicluster domain-containing protein [Criibacterium bergeronii] │
gi|737312160|ref|WP_035295014.1| MULTISPECIES: 4Fe-4S dicluster domain-containing protein [Clostridiaceae] │
gi|769212113|ref|WP_044975592.1| 4Fe-4S dicluster domain-containing protein [Ruminococcus sp. HUN007] │
tpg|HCW03068.1||gnl|WGS:DPMH|DGK91_00085 ferredoxin [Clostridium sp.] │
gi|1489662625|gb|RKW38741.1| 4Fe-4S dicluster domain-containing protein [Lachnospiraceae bacterium] │ …
After that, we get the first sequence of the MSA, the one we know that corresponds to the PDB of interest. We need the sequence as a String
without gaps (unaligned), so we use the MIToS.MSA
stringsequence
function together with replace
:
msa_seq = replace(stringsequence(msa, 1), '-' => "")
"GSIYAIDADSCIDCGSCASVCPVGAPNPED"
Also, we are going to read the UniProt sequence. You can easily download the sequence from UniProt by doing:
using MIToS.Utils
download_file("https://www.uniprot.org/uniprot/P00193.fasta", "P00193.fasta")
To read the FASTA file we are going to use the FastaIO
package:
using FastaIO
uniprot_sequences = readfasta(uniprot_file)
1-element Vector{Tuple{String, String}}:
("sp|P00193|FER_PEPAS Ferredoxin OS=Peptoniphilus asaccharolyticus OX=1258 PE=1 SV=1", "AYVINDSCIACGACKPECPVNIQQGSIYAIDADSCIDCGSCASVCPVGAPNPED")
And get the unique sequence:
uniprot_seq = uniprot_sequences[1][2]
"AYVINDSCIACGACKPECPVNIQQGSIYAIDADSCIDCGSCASVCPVGAPNPED"
We can perform a pairwise sequence alignment between both sequences by using the BioAlignments
package from the BioJulia suite. In this case, we use a semi-global alignment (no start/end gap penalty) because we know that the MSA sequence is a region of the UniProt sequence.
using BioAlignments
costmodel = AffineGapScoreModel(BLOSUM62, gap_open = -10, gap_extend = -1)
aln = pairalign(SemiGlobalAlignment(), msa_seq, uniprot_seq, costmodel)
BioAlignments.PairwiseAlignmentResult{Int64, String, String}:
score: 169
seq: 0 ------------------------GSIYAIDADSCIDCGSCASVCPVGAPNPED 30
||||||||||||||||||||||||||||||
ref: 1 AYVINDSCIACGACKPECPVNIQQGSIYAIDADSCIDCGSCASVCPVGAPNPED 54
Then, we only need to iterate the alignment to designate the positions and store the equivalences in a dictionary:
function seq2refnumber(aln)
seq_pos = 0
ref_pos = 0
last_seq_pos = 0
seq2ref = Dict{Int,Int}()
for (seq_res, ref_res) in alignment(aln)
if seq_res != '-'
seq_pos += 1
end
if ref_res != '-'
ref_pos += 1
end
if seq_pos != last_seq_pos
seq2ref[seq_pos] = ref_pos
last_seq_pos = seq_pos
end
end
seq2ref
end
seqnum2uniprotnum = seq2refnumber(aln)
Dict{Int64, Int64} with 30 entries:
5 => 29
16 => 40
20 => 44
12 => 36
24 => 48
28 => 52
8 => 32
17 => 41
30 => 54
1 => 25
19 => 43
22 => 46
23 => 47
6 => 30
11 => 35
9 => 33
14 => 38
3 => 27
29 => 53
⋮ => ⋮
Then, we can use getsequencemapping
to go from MSA column number to UniProt residue, and siftsmapping
to go from UniProt to PDB:
seqmap = getsequencemapping(msa, 1)
62-element Vector{Int64}:
0
0
0
0
0
0
0
0
0
0
⋮
22
23
24
25
26
27
28
29
30
colnum2uniprotnum = Dict{Int,Int}()
for (colnum, seqnum) in enumerate(seqmap)
if seqnum != 0 # getsequencemapping returns 0 where there is a gap
colnum2uniprotnum[colnum] = seqnum2uniprotnum[seqnum]
end
end
colnum2uniprotnum
Dict{Int64, Int64} with 30 entries:
56 => 48
35 => 27
55 => 47
58 => 50
52 => 44
60 => 52
37 => 29
53 => 45
32 => 25
47 => 39
41 => 33
43 => 35
45 => 37
36 => 28
44 => 36
49 => 41
39 => 31
51 => 43
61 => 53
⋮ => ⋮
using MIToS.SIFTS
uniprotnum2pdbnum = siftsmapping(
sifts_file,
dbUniProt,
"P00193",
dbPDB,
"1dur", # SIFTS stores PDB identifiers in lowercase
chain = "A",
missings = false,
) # residues without coordinates aren't used in the mapping
OrderedCollections.OrderedDict{String, String} with 54 entries:
"1" => "1"
"2" => "2"
"3" => "3"
"4" => "4"
"5" => "5"
"6" => "6"
"7" => "7"
"8" => "8"
"9" => "9"
"10" => "10"
"11" => "11"
"12" => "12"
"13" => "13"
"14" => "14"
"15" => "15"
"16" => "16"
"17" => "17"
"18" => "18"
"19" => "19"
⋮ => ⋮
To finally get the dictionary from MSA column index to PDB residue number
colnum2pdbnum = Dict{Int,String}()
for (colnum, uniprotnum) in colnum2uniprotnum
pdbresnum = get(uniprotnum2pdbnum, string(uniprotnum), "")
if pdbresnum != ""
colnum2pdbnum[colnum] = pdbresnum
end
end
colnum2pdbnum
Dict{Int64, String} with 30 entries:
56 => "49"
35 => "28"
55 => "48"
58 => "51"
52 => "45"
60 => "53"
37 => "30"
53 => "46"
32 => "26"
47 => "40"
41 => "34"
43 => "36"
45 => "38"
36 => "29"
44 => "37"
49 => "42"
39 => "32"
51 => "44"
61 => "54"
⋮ => ⋮
This page was generated using Literate.jl.