SIFTS

The SIFTS module of MIToS lets you obtain the residue-level mapping between databases stored in the SIFTS XML files. It makes it easy to assign PDB residues to UniProt/Pfam positions. Because pairwise alignments can produce misleading associations between the residues of two sequences, SIFTS offers a more reliable association between sequence and structure residue numbers.

using MIToS.SIFTS # to load the SIFTS module

Features

  • Download and parse SIFTS XML files
  • Store residue-level mapping in Julia
  • Easy generation of Dicts between residue numbers

Simplest residue-level mapping

This module exports the function siftsmapping to generate a Dict between residue numbers. This function takes 5 positional arguments: 1) the name of the SIFTS XML file to parse, 2) the source database, 3) the source protein/structure identifier, 4) the destination database and 5) the destination protein/structure identifier. Optionally, the keyword arguments chain and missings select a particular PDB chain and indicate whether missing residues (residues without coordinates) are used in the mapping.

Databases should be indicated using an available sub-type of DataBase. The key and value types depend on the residue number type in that database.

Type db...   Database                                               Residue number type
dbPDBe       PDBe (Protein Data Bank in Europe)                     Int
dbInterPro   InterPro                                               String
dbUniProt    UniProt                                                Int
dbPfam       Pfam (Protein families database)                       Int
dbNCBI       NCBI (National Center for Biotechnology Information)   Int
dbPDB        PDB (Protein Data Bank)                                String
dbCATH       CATH                                                   String
dbSCOP       SCOP (Structural Classification of Proteins)           String
dbEnsembl    Ensembl                                                String

To download the SIFTS XML file of a given PDB entry, use the downloadsifts function.

using MIToS.SIFTS
siftsfile = downloadsifts("1IVO")

The following example shows the residue number mapping between Pfam and PDB. Pfam uses UniProt coordinates, while PDB uses its own residue numbers with insertion codes. Note that the siftsmapping function is case sensitive and that SIFTS stores PDB identifiers in lowercase characters.

siftsmap = siftsmapping(siftsfile,
                        dbPfam,
                        "PF00757",
                        dbPDB,
                        "1ivo", # SIFTS stores PDB identifiers in lowercase
                        chain="A", # In this example we are only using the chain A of the PDB
                        missings=false) # Residues without coordinates aren't used in the mapping
OrderedCollections.OrderedDict{String, String} with 162 entries:
  "177" => "153"
  "178" => "154"
  "179" => "155"
  "180" => "156"
  "181" => "157"
  "182" => "158"
  "183" => "159"
  "184" => "160"
  "185" => "161"
  "186" => "162"
  "187" => "163"
  "188" => "164"
  "189" => "165"
  "190" => "166"
  "191" => "167"
  "192" => "168"
  "193" => "169"
  "194" => "170"
  "195" => "171"
  ⋮     => ⋮
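
Once you have such a mapping, it behaves like any other Julia dictionary. The following is a minimal sketch that uses only Base Julia: the small Dict is a stand-in for siftsmap (its three pairs are copied from the output above), and the names pfam_positions, pdb_numbers and pdb_to_pfam are illustrative, not part of MIToS.

```julia
# Stand-in for the `siftsmap` shown above: a few Pfam => PDB pairs copied from that output.
siftsmap = Dict("177" => "153", "178" => "154", "179" => "155")

# Translate Pfam (UniProt) positions into PDB residue numbers.
# Positions that are absent from the mapping become `missing`.
pfam_positions = ["177", "179", "500"]
pdb_numbers = [get(siftsmap, pos, missing) for pos in pfam_positions]

# Invert the mapping to go from PDB residue numbers back to Pfam positions.
pdb_to_pfam = Dict(pdb => pfam for (pfam, pdb) in siftsmap)
```

Using get with a default avoids a KeyError for positions that have no counterpart in the structure, for example residues without coordinates when missings=false.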

Storing residue-level mapping

If you need more than the residue number mapping between two databases, you can access all the residue-level cross references by calling the read function on the file with the SIFTSXML file format. The parse function (and therefore the read function) for the SIFTSXML format also takes the keyword arguments chain and missings. The read/parse function returns a Vector of SIFTSResidue objects that store the cross references between residues in each database.

You are free to access the SIFTSResidue fields in order to get the desired information. SIFTSResidue objects contain db... objects (sub-types of DataBase), with the cross referenced information. You should note that, except for the PDBe and InterPro fields, the field values can be missing. The ismissing function is helpful to know if there is a db... object. For example, getting the UniProt residue name (one letter code of the amino acid) would be:

ismissing(residue_data.UniProt) ? "" : residue_data.UniProt.name
"C"

That line of code returns an empty string if the UniProt field is missing. Otherwise, it returns a string with the name of the residue in UniProt. Because that way of accessing values in a SIFTSResidue is too verbose, MIToS defines a more complex signature for get. Using the MIToS get, the previous line of code becomes:

#   SIFTSResidue  database   field  default
get(residue_data, dbUniProt, :name, "")
"C"

There is no need to use the full signature. Other signatures are possible depending on the value you want to access. In particular, a missing object is returned if a default value is not given at the end of the signature and the value to access is missing:

julia> get(residue_data, dbUniProt) # get takes the database type (`db...`)
MIToS.SIFTS.dbUniProt("P00533", "325", "K")
julia> get(residue_data, dbUniProt, :name) # and can also take a field name (Symbol)
"K"
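
To make the pattern behind these signatures concrete, here is a minimal sketch of an accessor that falls back to a default when a cross reference is missing. It is not the MIToS implementation: ToyUniProt, ToyResidue and toyget are hypothetical stand-ins defined only for this illustration.

```julia
# Hypothetical stand-ins for illustration; these are NOT the MIToS types.
struct ToyUniProt
    id::String
    number::String
    name::String
end

struct ToyResidue
    UniProt::Union{ToyUniProt,Missing}
end

# Return the requested field, or `default` when the cross reference is missing.
function toyget(res::ToyResidue, field::Symbol, default)
    ismissing(res.UniProt) ? default : getfield(res.UniProt, field)
end

mapped = ToyResidue(ToyUniProt("P00533", "325", "K"))
unmapped = ToyResidue(missing)

toyget(mapped, :name, "")    # returns "K"
toyget(unmapped, :name, "")  # returns ""
```

Wrapping the ismissing check in one function keeps the calling code short and makes the fallback value explicit at each call site.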

However, you don't need the get function to access the three-letter code of the residue in PDBe, because the PDBe field cannot be missing.

residue_data.PDBe.name
"CYS"

A SIFTSResidue also stores whether the residue is missing (i.e. not resolved) in the PDB structure, together with information about its secondary structure (sscode and ssname):

julia> residue_data.missing
false
julia> residue_data.sscode
"T"
julia> residue_data.ssname
"loop"

Accessing residue-level cross references

You can ask for particular values in a single SIFTSResidue using the get function.

julia> using MIToS.SIFTS
julia> residue_data = read(siftsfile, SIFTSXML)[301]
SIFTSResidue with secondary structure code (sscode): "T" and name (ssname): "loop"
  PDBe:
    number: 301
    name: LYS
  UniProt :
    id: P00533
    number: 325
    name: K
  Pfam :
    id: PF00757
    number: 325
    name: K
  NCBI :
    id: 9606
    number: 325
    name: K
  PDB :
    id: 1ivo
    number: 301
    name: LYS
    chain: A
  SCOP :
    id: 76847
    number: 301
    name: LYS
    chain: A
  CATH :
    id: 2.10.220.10
    number: 301
    name: LYS
    chain: A
  InterPro: MIToS.SIFTS.dbInterPro[MIToS.SIFTS.dbInterPro("IPR006211", "301", "LYS", "PF00757"), MIToS.SIFTS.dbInterPro("IPR009030", "301", "LYS", "SSF57184")]
  Ensembl: MIToS.SIFTS.dbEnsembl[MIToS.SIFTS.dbEnsembl("ENSG00000146648", "325", "K", "ENST00000275493", "ENSP00000275493", "ENSE00001751179")]

# Is the UniProt residue name in the list of basic amino acids ["H", "K", "R"]?
julia> get(residue_data, dbUniProt, :name, "") in ["H", "K", "R"]
true

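Use higher-order functions with lambda expressions (anonymous functions), or list comprehensions, to easily query a Vector{SIFTSResidue}. You can use get with the previous signature, or simple direct field access together with ismissing. The examples below assume that siftsresidues is a Vector{SIFTSResidue} returned by read, for example read(siftsfile, SIFTSXML, chain="A", missings=false).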

# Captures PDB residue numbers if the Pfam id is "PF00757"
resnums = [ res.PDB.number for res in siftsresidues if !ismissing(res.PDB) && get(res, dbPfam, :id, "") == "PF00757" ]
162-element Vector{String}:
 "153"
 "154"
 "155"
 "156"
 "157"
 "158"
 "159"
 "160"
 "161"
 "162"
 ⋮
 "306"
 "307"
 "308"
 "309"
 "310"
 "311"
 "312"
 "313"
 "314"

Useful higher order functions are:

findall

# Which of the residues have UniProt residue names in the list ["H", "K", "R"]? (basic residues)
indexes = findall(res -> get(res, dbUniProt, :name, "") in ["H", "K", "R"], siftsresidues)
69-element Vector{Int64}:
   3
   4
  12
  22
  28
  47
  55
  73
  83
 104
   ⋮
 462
 464
 469
 475
 482
 496
 502
 506
 508

map

map(i -> siftsresidues[i].UniProt, indexes) # UniProt data of the basic residues
69-element Vector{MIToS.SIFTS.dbUniProt}:
 MIToS.SIFTS.dbUniProt("P00533", "28", "K")
 MIToS.SIFTS.dbUniProt("P00533", "29", "K")
 MIToS.SIFTS.dbUniProt("P00533", "37", "K")
 MIToS.SIFTS.dbUniProt("P00533", "47", "H")
 MIToS.SIFTS.dbUniProt("P00533", "53", "R")
 MIToS.SIFTS.dbUniProt("P00533", "72", "R")
 MIToS.SIFTS.dbUniProt("P00533", "80", "K")
 MIToS.SIFTS.dbUniProt("P00533", "98", "R")
 MIToS.SIFTS.dbUniProt("P00533", "108", "R")
 MIToS.SIFTS.dbUniProt("P00533", "129", "K")
 ⋮
 MIToS.SIFTS.dbUniProt("P00533", "487", "K")
 MIToS.SIFTS.dbUniProt("P00533", "489", "K")
 MIToS.SIFTS.dbUniProt("P00533", "494", "R")
 MIToS.SIFTS.dbUniProt("P00533", "500", "K")
 MIToS.SIFTS.dbUniProt("P00533", "507", "H")
 MIToS.SIFTS.dbUniProt("P00533", "521", "R")
 MIToS.SIFTS.dbUniProt("P00533", "527", "R")
 MIToS.SIFTS.dbUniProt("P00533", "531", "R")
 MIToS.SIFTS.dbUniProt("P00533", "533", "R")

filter

# SIFTSResidues with UniProt names in ["H", "K", "R"]
basicresidues = filter(res -> get(res, dbUniProt, :name, "") in ["H", "K", "R"], siftsresidues)

basicresidues[1].UniProt # UniProt data of the first basic residue
MIToS.SIFTS.dbUniProt("P00533", "28", "K")

Example: Which residues are missing in the PDB structure

Given that SIFTSResidue objects store a missing residue flag, it’s easy to get a vector where there is a true value if the residue is missing in the structure.

julia> using MIToS.SIFTS
julia> sifts_1ivo = read(siftsfile, SIFTSXML, chain="A"); # SIFTSResidues of the 1IVO chain A
julia> [res.missing for res in sifts_1ivo]
622-element Vector{Bool}:
 1
 0
 0
 0
 0
 0
 0
 0
 0
 0
 ⋮
 1
 1
 1
 1
 1
 1
 1
 1
 1
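
A Bool vector like this one can be summarized with Base functions. The sketch below uses a short made-up vector in place of the real flags; missing_flags, n_missing, missing_idx and fraction are illustrative names.

```julia
# Stand-in for `[res.missing for res in sifts_1ivo]`; the values are made up.
missing_flags = Bool[1, 0, 0, 0, 1, 1, 0, 1]

n_missing = count(missing_flags)              # number of unresolved residues
missing_idx = findall(missing_flags)          # their positions in the vector
fraction = n_missing / length(missing_flags)  # fraction of unresolved residues
```

For this toy vector, n_missing is 4, missing_idx is [1, 5, 6, 8] and fraction is 0.5.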

However, if you need to filter using other conditions, you'll find the get function useful. In this example, we ask for the UniProt id (to avoid problems with fragments, tags or chimeric/fusion proteins). We also use get to select a specific PDB chain.

julia> using MIToS.SIFTS
julia> siftsfile = downloadsifts("1JQZ");
julia> sifts_1jqz = read(siftsfile, SIFTSXML); # It has an amino terminal his tag
julia> missings = [ ( get(res, dbUniProt, :id, "") == "P05230" ) && ( get(res, dbPDB, :chain, "") == "A" ) && res.missing for res in sifts_1jqz ];
julia> println("There are only ", sum(missings), " missing residues in the chain A, associated to UniProt P05230")
There are only 3 missing residues in the chain A, associated to UniProt P05230
julia> println("But there are ", sum([ res.missing for res in sifts_1jqz ]), " missing residues in the PDB file.")
But there are 10 missing residues in the PDB file.
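
The default value passed to get matters in conditions like the one above, because comparing a missing value in Julia yields missing rather than false. The following sketch shows this with Base Julia only. The records are made-up named tuples standing in for SIFTSResidue objects, their field names are hypothetical, and coalesce plays the role of the default argument of get.

```julia
# Made-up records standing in for SIFTSResidue objects (hypothetical field names).
records = [
    (uniprot_id = missing,  chain = "A", unresolved = true),   # e.g. a tag residue
    (uniprot_id = "P05230", chain = "A", unresolved = true),
    (uniprot_id = "P05230", chain = "A", unresolved = false),
    (uniprot_id = "P05230", chain = "B", unresolved = true),
]

# Without a default, comparing a missing id yields `missing`, not `false`.
no_default = [r.uniprot_id == "P05230" for r in records]

# With a default, missing ids become "" and the comparison is a plain Bool.
is_selected(r) = coalesce(r.uniprot_id, "") == "P05230" && r.chain == "A" && r.unresolved
selected = [is_selected(r) for r in records]

n_selected = sum(selected)  # number of records that satisfy all three conditions
```

Only the second record satisfies all three conditions, so n_selected is 1, while the first element of no_default is missing and could not be used directly in a Bool context.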