PDB

The module PDB defines types and methods to work with protein structures inside Julia. It is useful to link structural and sequential information, and needed for measure the predictive performance at protein contact prediction of mutual information scores.

using MIToS.PDB # to load the PDB module

Features

Read and parse PDB and PDBML files.
Calculate distance and contacts between atoms or residues.
Determine interaction between residues.

Retrieve information from PDB database

This module exports the downloadpdb function, to retrieve a PDB file from PDB database. This function downloads a gzipped PDBML file, which could be easily read it with MIToS by default, but you are able to determine the format as PDBFile if you want it.

using MIToS.PDB

pdbfile = downloadpdb("1IVO", format=PDBFile)

"1IVO.pdb.gz"

PDB module also exports a getpdbdescription to access the header information of a PDB entry.

getpdbdescription("1IVO")

Dict{String, Any} with 4 entries:
  "entry"               => Dict{String, Any}("id"=>"1IVO")
  "rcsb_entry_info"     => Dict{String, Any}("assembly_count"=>1, "resolution_c…
  "polymer_entities"    => Any[Dict{String, Any}("rcsb_polymer_entity_container…
  "rcsb_accession_info" => Dict{String, Any}("initial_release_date"=>"2002-10-1…

Read and parse PDB files

This is easy using the read and parse functions, indicating the filename and the FileFormat: PDBML for PDB XML files or PDBFile for usual PDB files. These functions returns a Vector of PDBResidue objects with all the residues in the PDB. To return only a specific subset of residues/atoms you can use any of the following keyword arguments:

keyword arguments	default	returns only ...
`chain`	`All`	residues from a PDB chain, i.e. `"A"`
`model`	`All`	residues from a determined model, i.e. `"1"`
`group`	`All`	residues from a group: `"ATOM"`, `"HETATM"` or `All` for both
`atomname`	`All`	atoms with a specific name, i.e. `"CA"`
`onlyheavy`	`false`	heavy atoms (not hydrogens) if it's `true`
`occupancyfilter`	`false`	only the atoms with the best occupancy are returned if it's `true`

Note

For PDBML files it is possible to use the keyword argument label to false (default to true) to get the auth_ attributes instead of the label_ attributes for chain, atom and residue name fields. The auth_ attributes are alternatives provided by an author in order to match the identification/values used in the publication that describes the structure.

# Read α carbon of each residue from the 1ivo pdb file, in the model 1, chain A and in the ATOM group.
CA_1ivo = read(pdbfile, PDBFile, model="1", chain="A", group="ATOM", atomname="CA")

CA_1ivo[1] # First residue. It has only the α carbon.

PDBResidue:
	id::PDBResidueIdentifier
		    PDBe_number          number            name           group           model           chain
		             ""             "2"           "GLU"          "ATOM"             "1"             "A"
	atoms::Vector{PDBAtom}	length: 1
		                                                  coordinates            atom         element       occupancy               B
		1:                                   [92.793, 69.578, 31.657]            "CA"             "C"             1.0        "151.39"

Looking for particular residues

MIToS parse PDB files to vector of residues, instead of using a hierarchical structure like other packages. This approach makes the search and selection of residues or atoms a little different. To make it easy, this module exports a number of functions and macros to select particular residues or atoms. Given the fact that residue numbers from different chains, models, etc. can collide, it's mandatory to indicate the model, chain, group, residue number and atom name in a explicit way to these functions or macros. If you want to select all the residues in one of the categories, you are able to use the type All. You can also use regular expressions or functions to make the selections.

using MIToS.PDB
pdbfile = downloadpdb("1IVO", format=PDBFile)
residues_1ivo = read(pdbfile, PDBFile)
# Select residue number 9 from model 1 and chain B
residues(residues_1ivo, "1", "B", All, "9")

1-element Vector{MIToS.PDB.PDBResidue}:
 PDBResidue:
	id::PDBResidueIdentifier
		    PDBe_number          number            name           group           model           chain
		             ""             "9"           "GLY"          "ATOM"             "1"             "B"
	atoms::Vector{PDBAtom}	length: 4
		                                                  coordinates            atom         element       occupancy               B
		1:                                   [53.449, 49.375, 55.663]             "N"             "N"             1.0         "28.86"
		                                                  coordinates            atom         element       occupancy               B
		2:                                   [52.736, 49.613, 54.425]            "CA"             "C"             1.0         "65.57"
		                                                  coordinates            atom         element       occupancy               B
		3:                                   [52.382, 48.332, 53.719]             "C"             "C"             1.0         "21.73"
		                                                  coordinates            atom         element       occupancy               B
		4:                                   [53.261, 47.591, 53.294]             "O"             "O"             1.0         "79.88"

Getting a `Dict` of `PDBResidue`s

If you prefer a Dict of PDBResidue, indexed by their residue numbers, you can use the residuedict function or the @residuedict macro.

# Dict of residues from the model 1, chain A and from the ATOM group
chain_a = residuesdict(residues_1ivo, "1", "A", "ATOM", All)
chain_a["9"]

PDBResidue:
	id::PDBResidueIdentifier
		    PDBe_number          number            name           group           model           chain
		             ""             "9"           "GLY"          "ATOM"             "1"             "A"
	atoms::Vector{PDBAtom}	length: 4
		                                                  coordinates            atom         element       occupancy               B
		1:                                  [109.607, 71.943, 41.924]             "N"             "N"             1.0         "50.08"
		                                                  coordinates            atom         element       occupancy               B
		2:                                    [109.641, 73.162, 42.7]            "CA"             "C"             1.0         "62.78"
		                                                  coordinates            atom         element       occupancy               B
		3:                                  [110.753, 73.103, 43.722]             "C"             "C"             1.0         "23.60"
		                                                  coordinates            atom         element       occupancy               B
		4:                                   [110.778, 72.21, 44.568]             "O"             "O"             1.0         "83.70"

You can do the same with the macro @residuesdict to get a more readable code

chain_a = @residuesdict residues_1ivo model "1" chain "A" group "ATOM" residue All
chain_a["9"]

PDBResidue:
	id::PDBResidueIdentifier
		    PDBe_number          number            name           group           model           chain
		             ""             "9"           "GLY"          "ATOM"             "1"             "A"
	atoms::Vector{PDBAtom}	length: 4
		                                                  coordinates            atom         element       occupancy               B
		1:                                  [109.607, 71.943, 41.924]             "N"             "N"             1.0         "50.08"
		                                                  coordinates            atom         element       occupancy               B
		2:                                    [109.641, 73.162, 42.7]            "CA"             "C"             1.0         "62.78"
		                                                  coordinates            atom         element       occupancy               B
		3:                                  [110.753, 73.103, 43.722]             "C"             "C"             1.0         "23.60"
		                                                  coordinates            atom         element       occupancy               B
		4:                                   [110.778, 72.21, 44.568]             "O"             "O"             1.0         "83.70"

Select particular residues

Use the residues function to collect specific residues. It's possible to use a single residue number (i.e. "2") or even a function which should return true for the selected residue numbers. Also regular expressions can be used to select residues. Use All to select all the residues.

residue_list = map(string, 2:5)

# If the list is large, you can use a `Set` to gain performance
# residue_set = Set(map(string, 2:5))

4-element Vector{String}:
 "2"
 "3"
 "4"
 "5"

first_res = residues(residues_1ivo, "1", "A", "ATOM", resnum -> resnum in residue_list)

for res in first_res
    println(res.id.name, " ", res.id.number)
end

GLU 2
GLU 3
LYS 4
LYS 5

A more complex example using an anonymous function:

# Select all the residues of the model 1, chain A of the ATOM group with residue number less than 5

first_res = residues(residues_1ivo, "1", "A", "ATOM", x -> parse(Int, match(r"^(\d+)", x)[1]) <= 5 )
# The anonymous function takes the residue number (string) and use a regular expression
# to extract the number (without insertion code).
# It converts the number to `Int` to test if the it is `<= 5`.

for res in first_res
    println(res.id.name, " ", res.id.number)
end

GLU 2
GLU 3
LYS 4
LYS 5

Use the @residues macro for a cleaner syntax.

# You can use All, regular expressions or functions also for model, chain and group:

# i.e. Takes the residue 10 from chains A and B

for res in @residues residues_1ivo model "1" chain ch -> ch in ["A","B"] group "ATOM" residue "10"
    println(res.id.chain, " ", res.id.name, " ", res.id.number)
end

A THR 10
B THR 10

Select particular atoms

The atoms function or macro allow to select a particular set of atoms.

# Select all the atoms with name starting with "C" using a regular expression
# from all the residues of the model 1, chain A of the ATOM group

carbons = @atoms residues_1ivo model "1" chain "A" group "ATOM" residue All atom r"C.+"

carbons[1]

                                       coordinates            atom         element       occupancy               B
                          [92.793, 69.578, 31.657]            "CA"             "C"             1.0        "151.39"

You can also use the atoms function instead of the @atoms macro:

atoms(residues_1ivo, "1", "A", "ATOM", All, r"C.+")[1]

                                       coordinates            atom         element       occupancy               B
                          [92.793, 69.578, 31.657]            "CA"             "C"             1.0        "151.39"

Protein contact map

The PDB module offers a number of functions to measure distances between atoms or residues, to detect possible interactions or contacts. In particular the contact function calls the distance function using a threshold or limit in an optimized way. The measure can be done between alpha carbons ("CA"), beta carbons ("CB") (alpha carbon for glycine), any heavy atom ("Heavy") or any ("All") atom of the residues.

In the following example, whe are going to plot a contact map for the 1ivo chain A. Two residues will be considered in contact if their β carbons (α carbon for glycine) have a distance of 8Å or less.

using MIToS.PDB

pdbfile = downloadpdb("1IVO", format=PDBFile)

residues_1ivo = read(pdbfile, PDBFile)

pdb = @residues residues_1ivo model "1" chain "A" group "ATOM" residue All

dmap = distance(pdb, criteria="All") # Minimum distance between residues using all their atoms

511×511 Named PairwiseListMatrices.PairwiseListMatrix{Float64, false, Vector{Float64}}
 Res1 ╲ Res2 │   1_A_ATOM_2    1_A_ATOM_3  …  1_A_ATOM_511  1_A_ATOM_512
─────────────┼──────────────────────────────────────────────────────────
1_A_ATOM_2   │          0.0       1.32963  …        48.112       52.4859
1_A_ATOM_3   │      1.32963           0.0          42.5701       46.9381
1_A_ATOM_4   │      3.90659       1.32843          44.8597       49.0519
1_A_ATOM_5   │      7.48169       4.45375          45.0295       48.9502
1_A_ATOM_6   │      8.56892       7.00906          41.7069       45.8367
1_A_ATOM_7   │      12.7101        10.681          42.7762       46.6884
1_A_ATOM_8   │      15.8327       13.7432          38.2741       42.2592
1_A_ATOM_9   │      17.6834       16.0752          43.4541       47.2341
⋮                         ⋮             ⋮  ⋱             ⋮             ⋮
1_A_ATOM_505 │      48.2444       42.6696          5.52601       5.34234
1_A_ATOM_506 │      49.1067        43.611          3.27264       3.18892
1_A_ATOM_507 │       51.003       45.5613          4.57506       5.21453
1_A_ATOM_508 │      50.1635       44.8346          5.83251       8.07726
1_A_ATOM_509 │      47.1567       41.8096          4.63848        7.9661
1_A_ATOM_510 │      49.0161       43.6277           1.3263       3.96007
1_A_ATOM_511 │       48.112       42.5701              0.0       1.32624
1_A_ATOM_512 │      52.4859       46.9381  …       1.32624           0.0

Use the contact function to get a contact map:

cmap = contact(pdb, 8.0, criteria="CB") # Contact map

511×511 Named PairwiseListMatrices.PairwiseListMatrix{Bool, false, Vector{Bool}}
 Res1 ╲ Res2 │   1_A_ATOM_2    1_A_ATOM_3  …  1_A_ATOM_511  1_A_ATOM_512
─────────────┼──────────────────────────────────────────────────────────
1_A_ATOM_2   │         true          true  …         false         false
1_A_ATOM_3   │         true          true            false         false
1_A_ATOM_4   │         true          true            false         false
1_A_ATOM_5   │        false         false            false         false
1_A_ATOM_6   │        false         false            false         false
1_A_ATOM_7   │        false         false            false         false
1_A_ATOM_8   │        false         false            false         false
1_A_ATOM_9   │        false         false            false         false
⋮                         ⋮             ⋮  ⋱             ⋮             ⋮
1_A_ATOM_505 │        false         false            false         false
1_A_ATOM_506 │        false         false             true          true
1_A_ATOM_507 │        false         false            false         false
1_A_ATOM_508 │        false         false            false         false
1_A_ATOM_509 │        false         false            false         false
1_A_ATOM_510 │        false         false             true          true
1_A_ATOM_511 │        false         false             true          true
1_A_ATOM_512 │        false         false  …          true          true

using Plots
gr()

heatmap(dmap, grid=false, yflip=true, ratio=:equal)

heatmap(cmap, grid=false, yflip=true, ratio=:equal)

Structural superposition

using MIToS.PDB

pdbfile = downloadpdb("2HHB")

res_2hhb = read(pdbfile, PDBML)

chain_A = pdb = @residues res_2hhb model "1" chain "A" group "ATOM" residue All
chain_C = pdb = @residues res_2hhb model "1" chain "C" group "ATOM" residue All

using Plots
gr()

scatter3d(chain_A, label="A", alpha=0.5)
scatter3d!(chain_C, label="C", alpha=0.5)

superimposed_A, superimposed_C, RMSD = superimpose(chain_A, chain_C)

RMSD

0.23000472638958339

scatter3d(superimposed_A, label="A", alpha=0.5)
scatter3d!(superimposed_C, label="C", alpha=0.5)