Linking structural and evolutionary information

Problem description

It is a very common task to map sequence to structure residue number. For example, to link structural information coming from PDB and evolutionary information calculated from multiple sequence alignments.

The naive way of mapping sequence and structure is to perform global pairwise alignment between the sequence and the PDB sequence (using the residues in ATOM). The problem with this approach is that the sequences can have missing regions and standard pairwise alignment algorithms often yield incorrect assignations around those regions (Velankar et.al. 2013). This is particularly important when aligning PDB sequences, that can have missing residues, and sequences coming from multiple sequence alignments, that can be incomplete or have unaligned regions (e.g. insert states).

The SIFTS (Structure Integration with Function, Taxonomy and Sequences) database solves this problem and provides residue level mapping between PDB and other databases (e.g. UniProt and Pfam).

The SIFTS module of MIToS has functions to access this residue level mapping between PDB and other databases. Also, MIToS keeps track of the residue number of each residue in a multiple sequence alignment (MSA) using its annotations. Both things together, allow the correct mapping of sequence and structure without performing error-prone pairwise alignments.

Particular solutions depend on problem details, here we show some common ways to use MIToS and SIFTS to map evolutionary information calculated in an MSA (e.g. entropy) with structural information (e.g. B-factors).

PDB and Pfam alignment mapping

This is the easiest problem to solve with the MIToS Pfam module because SIFTS already has a residue level mapping between PDB and Pfam.

For this example, we are going to map the columns in the multiple sequence alignment of the PF09645 Pfam family and the residues in the chain A from the 2VQC PDB file. The needed files are available in the MIToS test suite:

using MIToS
pdb_file   = abspath(pathof(MIToS), "..", "..", "test", "data", "2VQC.pdb")
pfam_file  = abspath(pathof(MIToS), "..", "..", "test", "data", "PF09645_full.stockholm")
sifts_file = abspath(pathof(MIToS), "..", "..", "test", "data", "2vqc.xml.gz")

You can also use downloadpdb from MIToS.PDB, downloadpfam from MIToS.Pfam and downloadsifts from MIToS.SIFTS to get the corresponding files from those databases.

It is important to read the Pfam MSA file using generatemapping=true and useidcoordinates=true because that allows keeping track of the residue number using the MSA annotations.

using MIToS.Pfam
msa = read(pfam_file, Stockholm, generatemapping=true, useidcoordinates=true)
AnnotatedMultipleSequenceAlignment with 16 annotations : 4×110 Named Matrix{MIToS.MSA.Residue}
         Seq ╲ Col │   6    7    8    9   10  …  111  112  113  114  115
───────────────────┼────────────────────────────────────────────────────
C3N734_SULIY/1-95  │   -    -    -    N    S  …    -    -    -    -    -
H2C869_9CREN/7-104 │   -    -    L    N    D       -    -    -    -    -
Y070_ATV/2-70      │   -    -    -    -    -       -    -    -    -    -
F112_SSV1/3-112    │   Q    T    L    N    S  …    I    K    A    K    Q

First, we need to know what is the sequence in the MSA that correspond to the PDB we want to link. Luckily, Pfam Stockholm files store the mapping between sequences and PDB chains. You can access that mapping using the getseq2pdb function from MIToS.Pfam

seq2pdbs = getseq2pdb(msa)
Dict{String, Vector{Tuple{String, String}}} with 1 entry:
  "F112_SSV1/3-112" => [("2VQC", "A")]

The returned dictionary gives you all the PDB chains associated with a determined sequence in the MSA. But, in this case, we want to go in the other direction to find all the sequences associated with a determined PDB chain. We are going to use a list comprehension because it is possible for a single chain to be associated with more than one sequence in the Pfam MSA (e.g. domain repeats).

pdb_code  = "2VQC"
pdb_chain = "A"
seq_ids = [ seq for (seq, pdbs) in seq2pdbs if (pdb_code, pdb_chain) in pdbs ]
1-element Vector{String}:
 "F112_SSV1/3-112"

In this example, we are going to use the only sequence we found for the A of 2VQC.

seq_id = seq_ids[1]
"F112_SSV1/3-112"

Finally, we can use the msacolumn2pdbresidue function from the Pfam module to get a dictionary from the MSA column index to the PDB residue number:

pfam_id = "PF09645"
msacol2pdbres = msacolumn2pdbresidue(msa, seq_id, pdb_code, pdb_chain, pfam_id, sifts_file)
OrderedCollections.OrderedDict{Int64, String} with 110 entries:
  6  => ""
  7  => "4"
  8  => "5"
  9  => "6"
  10 => "7"
  11 => "8"
  12 => "9"
  13 => "10"
  14 => "11"
  15 => "12"
  16 => "13"
  17 => "14"
  18 => "15"
  19 => "16"
  20 => "17"
  21 => "18"
  22 => "19"
  23 => "20"
  24 => "21"
  ⋮  => ⋮

This dictionary has the mapping between MSA column and PDB residue that allows the mapping between evolutionary and structural information. For example, to measure the correlation between entropy (related to residue variation in an MSA column) and the mean B factor of the residue:

using MIToS.Information
Hx = mapcolfreq!(entropy,
				 msa,
				 Counts(ContingencyTable(Int, Val{1}, UngappedAlphabet())))
1×110 Named Matrix{Float64}
Function ╲ Col │        6         7         8  …       113       114       115
───────────────┼──────────────────────────────────────────────────────────────
entropy        │     -0.0      -0.0      -0.0  …      -0.0      -0.0      -0.0

To get quick access to each PDB residue based on its residue number, we can read the PDB file into a dictionary using the read and residuesdict functions from the MIToS PDB module:

using MIToS.PDB
res_dict = residuesdict(read(pdb_file, PDBFile, occupancyfilter=true), "1", "A") # model 1 chain A
OrderedCollections.OrderedDict{String, MIToS.PDB.PDBResidue} with 95 entries:
  "4"  => PDBResidue:…
  "5"  => PDBResidue:…
  "6"  => PDBResidue:…
  "7"  => PDBResidue:…
  "8"  => PDBResidue:…
  "9"  => PDBResidue:…
  "10" => PDBResidue:…
  "11" => PDBResidue:…
  "12" => PDBResidue:…
  "13" => PDBResidue:…
  "14" => PDBResidue:…
  "15" => PDBResidue:…
  "16" => PDBResidue:…
  "17" => PDBResidue:…
  "18" => PDBResidue:…
  "19" => PDBResidue:…
  "20" => PDBResidue:…
  "21" => PDBResidue:…
  "22" => PDBResidue:…
  ⋮    => ⋮

Then, we can iterate the mapping dictionary to link the MSA and PDB based values:

using Statistics

x = Float64[]
y = Float64[]

for (col_index, res_number) in msacol2pdbres
	if res_number != "" # i.e. MSA column has an associated PDB residue
		push!(x, Hx[col_index])
		push!(y, mean(parse(Float64, atom.B) for atom in res_dict[res_number].atoms))
	end
end

cor(x, y)
0.059911489235294435

Unknown sequence coordinates

While Pfam alignments have the start and end of the aligned region indicated in the sequence name, other multiple sequence alignments don't give any hint about that. In those cases, we should use pairwise alignments. However, instead of aligning the sequence coming from the MSA and the PDB sequence, we can align the MSA sequence to the UniProt sequence to reduce the possibility of mapping errors. Once we have the mapping of the MSA sequence to the UniProt sequence, we can use SIFTS to map the PDB sequence to the MSA sequence using the UniProt numeration.

For this example, we are going to use the following files included in MIToS documentation:

using MIToS
pdb_file   = abspath(pathof(MIToS), "..", "..", "docs", "data", "1dur.pdb")
msa_file   = abspath(pathof(MIToS), "..", "..", "docs", "data", "blast_alignment.fa")
sifts_file = abspath(pathof(MIToS), "..", "..", "docs", "data", "1dur.xml.gz")
uniprot_file = abspath(pathof(MIToS), "..", "..", "docs", "data", "P00193.fasta")

First, we are going to read the MSA file. In this case, we can not use useidcoordinates=true because the sequence names don't have the sequence coordinates in the Pfam format. However, we are going to use generatemapping=true to get the default mapping for each sequence in the alignment (from 1 to the length of the aligned region):

using MIToS.MSA
msa = read(msa_file, FASTA, generatemapping=true)
AnnotatedMultipleSequenceAlignment with 102 annotations : 100×62 Named Matrix{MIToS.MSA.Residue}
                                                                                                                       Seq ╲ Col │   …  
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼──────
Query_23408 PDB 1DUR 4Fe-4S ferredoxin-type 2 motif                                                                              │   …  
gi|120014|sp|P00193.1|FER_PEPAS RecName: Full=Ferredoxin                                                                         │      
gi|7546410|pdb|1DUR|A Chain A, Replacement for 1FDX 2(4FE4S) ferredoxin from (NOW) Peptostreptococcus asaccharolyticus           │      
gi|1180747586|ref|WP_084229968.1| 4Fe-4S dicluster domain-containing protein [Peptoniphilus asaccharolyticus]                    │      
gi|1443100176|ref|WP_115312158.1| 4Fe-4S dicluster domain-containing protein [Peptoniphilus indolicus]                           │      
gi|769945096|ref|WP_045078402.1| 4Fe-4S dicluster domain-containing protein [Peptoniphilus sp. ING2-D1G]                         │      
gi|517967820|ref|WP_019138028.1| MULTISPECIES: 4Fe-4S dicluster domain-containing protein [Peptoniphilus]                        │      
gi|496513533|ref|WP_009221816.1| MULTISPECIES: 4Fe-4S dicluster domain-containing protein [Peptoniphilus]                        │      
⋮                                                                                                                                    ⋱  
tpg|HBK85263.1||gnl|WGS:DNZI|DDZ53_04440 ferredoxin [Firmicutes bacterium]                                                       │      
gi|1172242992|ref|WP_080632354.1| 4Fe-4S dicluster domain-containing protein [Desulfitobacterium hafniense]                      │      
tpg|HBL81567.1||gnl|WGS:DOAW|DD391_03060 ferredoxin [Clostridiales bacterium]                                                    │      
gi|1057632375|ref|WP_068914150.1| 4Fe-4S dicluster domain-containing protein [Criibacterium bergeronii]                          │      
gi|737312160|ref|WP_035295014.1| MULTISPECIES: 4Fe-4S dicluster domain-containing protein [Clostridiaceae]                       │      
gi|769212113|ref|WP_044975592.1| 4Fe-4S dicluster domain-containing protein [Ruminococcus sp. HUN007]                            │      
tpg|HCW03068.1||gnl|WGS:DPMH|DGK91_00085 ferredoxin [Clostridium sp.]                                                            │      
gi|1489662625|gb|RKW38741.1| 4Fe-4S dicluster domain-containing protein [Lachnospiraceae bacterium]                              │   …  

After that, we get the first sequence of the MSA, the one we know that corresponds to the PDB of interest. We need the sequence as a String without gaps (unaligned), so we use the MIToS.MSA stringsequence function together with replace:

msa_seq = replace(stringsequence(msa, 1), '-' => "")
"GSIYAIDADSCIDCGSCASVCPVGAPNPED"

Also, we are going to read the UniProt sequence. You can easily download the sequence from UniProt by doing:

using MIToS.Utils
download_file("https://www.uniprot.org/uniprot/P00193.fasta", "P00193.fasta")

To read the FASTA file we are going to use the FastaIO package:

using FastaIO
uniprot_sequences = readfasta(uniprot_file)
1-element Vector{Tuple{String, String}}:
 ("sp|P00193|FER_PEPAS Ferredoxin OS=Peptoniphilus asaccharolyticus OX=1258 PE=1 SV=1", "AYVINDSCIACGACKPECPVNIQQGSIYAIDADSCIDCGSCASVCPVGAPNPED")

And get the unique sequence:

uniprot_seq = uniprot_sequences[1][2]
"AYVINDSCIACGACKPECPVNIQQGSIYAIDADSCIDCGSCASVCPVGAPNPED"

We can perform a pairwise sequence alignment between both sequences by using the BioAlignments package from the BioJulia suite. In this case, we use a semi-global alignment (no start/end gap penalty) because we know that the MSA sequence is a region of the UniProt sequence.

using BioAlignments
costmodel = AffineGapScoreModel(BLOSUM62, gap_open=-10, gap_extend=-1)
aln = pairalign(SemiGlobalAlignment(), msa_seq, uniprot_seq, costmodel)
BioAlignments.PairwiseAlignmentResult{Int64, String, String}:
  score: 169
  seq:  0 ------------------------GSIYAIDADSCIDCGSCASVCPVGAPNPED 30
                                  ||||||||||||||||||||||||||||||
  ref:  1 AYVINDSCIACGACKPECPVNIQQGSIYAIDADSCIDCGSCASVCPVGAPNPED 54

Then, we only need to iterate the alignment to designate the positions and store the equivalences in a dictionary:

function seq2refnumber(aln)
    seq_pos = 0
	ref_pos = 0
	last_seq_pos = 0
	seq2ref = Dict{Int,Int}()
    for (seq_res, ref_res) in alignment(aln)
        if seq_res != '-'
            seq_pos += 1
		end
        if ref_res != '-'
            ref_pos += 1
		end
		if seq_pos != last_seq_pos
			seq2ref[seq_pos] = ref_pos
			last_seq_pos = seq_pos
    	end
	end
    seq2ref
end

seqnum2uniprotnum = seq2refnumber(aln)
Dict{Int64, Int64} with 30 entries:
  5  => 29
  16 => 40
  20 => 44
  12 => 36
  24 => 48
  28 => 52
  8  => 32
  17 => 41
  30 => 54
  1  => 25
  19 => 43
  22 => 46
  23 => 47
  6  => 30
  11 => 35
  9  => 33
  14 => 38
  3  => 27
  29 => 53
  ⋮  => ⋮

Then, we can use getsequencemapping to go from MSA column number to UniProt residue, and siftsmapping to go from UniProt to PDB:

seqmap = getsequencemapping(msa, 1)
62-element Vector{Int64}:
  0
  0
  0
  0
  0
  0
  0
  0
  0
  0
  ⋮
 22
 23
 24
 25
 26
 27
 28
 29
 30
colnum2uniprotnum = Dict{Int,Int}()
for (colnum, seqnum) in enumerate(seqmap)
	if seqnum != 0 # getsequencemapping returns 0 where there is a gap
		colnum2uniprotnum[colnum] = seqnum2uniprotnum[seqnum]
	end
end
colnum2uniprotnum
Dict{Int64, Int64} with 30 entries:
  56 => 48
  35 => 27
  55 => 47
  58 => 50
  52 => 44
  60 => 52
  37 => 29
  53 => 45
  32 => 25
  47 => 39
  41 => 33
  43 => 35
  45 => 37
  36 => 28
  44 => 36
  49 => 41
  39 => 31
  51 => 43
  61 => 53
  ⋮  => ⋮
using MIToS.SIFTS

uniprotnum2pdbnum = siftsmapping(sifts_file,
    dbUniProt,
    "P00193",
    dbPDB,
    "1dur", # SIFTS stores PDB identifiers in lowercase
    chain="A",
    missings=false) # residues without coordinates aren't used in the mapping
OrderedCollections.OrderedDict{String, String} with 54 entries:
  "1"  => "1"
  "2"  => "2"
  "3"  => "3"
  "4"  => "4"
  "5"  => "5"
  "6"  => "6"
  "7"  => "7"
  "8"  => "8"
  "9"  => "9"
  "10" => "10"
  "11" => "11"
  "12" => "12"
  "13" => "13"
  "14" => "14"
  "15" => "15"
  "16" => "16"
  "17" => "17"
  "18" => "18"
  "19" => "19"
  ⋮    => ⋮

To finally get the dictionary from MSA column index to PDB residue number

colnum2pdbnum = Dict{Int,String}()
for (colnum, uniprotnum) in colnum2uniprotnum
	pdbresnum = get(uniprotnum2pdbnum, string(uniprotnum), "")
	if pdbresnum != ""
		colnum2pdbnum[colnum] = pdbresnum
	end
end

colnum2pdbnum
Dict{Int64, String} with 30 entries:
  56 => "49"
  35 => "28"
  55 => "48"
  58 => "51"
  52 => "45"
  60 => "53"
  37 => "30"
  53 => "46"
  32 => "26"
  47 => "40"
  41 => "34"
  43 => "36"
  45 => "38"
  36 => "29"
  44 => "37"
  49 => "42"
  39 => "32"
  51 => "44"
  61 => "54"
  ⋮  => ⋮

This page was generated using Literate.jl.