SIFTS
The SIFTS module of MIToS lets you obtain the residue-level mapping between databases stored in the SIFTS XML files. It makes it easy to assign PDB residues to UniProt/Pfam positions. Because pairwise alignments can lead to misleading associations between residues in the two sequences, SIFTS offers a more reliable association between sequence and structure residue numbers.
using MIToS.SIFTS # to load the SIFTS module
Features
- Download and parse SIFTS XML files
- Store residue-level mapping in Julia
- Easy generation of Dicts between residue numbers
Simplest residue-level mapping
This module exports the siftsmapping function to generate a Dict between residue numbers. This function takes 5 positional arguments:
- The name of the SIFTS XML file to parse,
- the source database,
- the source protein/structure identifier,
- the destination database and,
- the destination protein/structure identifier.
Optionally, the chain and missings keyword arguments select a particular PDB chain and whether missing residues (those without coordinates) are used.
Databases should be indicated using an available sub-type of DataBase. Key and value types depend on the residue number type in that database.
| Type db... | Database | Residue number type |
|---|---|---|
| dbPDBe | PDBe (Protein Data Bank in Europe) | Int |
| dbInterPro | InterPro | String |
| dbUniProt | UniProt | Int |
| dbPfam | Pfam (Protein families database) | Int |
| dbNCBI | NCBI (National Center for Biotechnology Information) | Int |
| dbPDB | PDB (Protein Data Bank) | String |
| dbCATH | CATH | String |
| dbSCOP | SCOP (Structural Classification of Proteins) | String |
| dbEnsembl | Ensembl | String |
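PDB residue numbers can carry insertion codes, which is why they are stored as Strings rather than Ints. The following is a minimal sketch of what that means when working with a residue-number Dict. It uses only Base Julia and a small hand-written Dict; the values, `toy_map`, and `has_insertion_code` are hypothetical names made up for illustration, not real SIFTS output or part of MIToS.

```julia
# Hypothetical mapping from source residue numbers (keys) to PDB residue
# numbers (values). The values are made up; "27A" carries an insertion code.
toy_map = Dict("50" => "27", "51" => "27A", "52" => "28")

# A residue number with an insertion code cannot be parsed as an Int,
# so such numbers have to be kept as Strings.
has_insertion_code(resnum) = tryparse(Int, resnum) === nothing

# Look up a residue, falling back to a default when it has no partner.
mapped = get(toy_map, "51", "")
unmapped = get(toy_map, "99", "")

# Invert the mapping to go from PDB numbers back to the source numbers.
inverse_map = Dict(v => k for (k, v) in toy_map)
```

Inverting a Dict like this is only safe when the mapping is one-to-one; otherwise later pairs silently overwrite earlier ones.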
To download the SIFTS XML file for a given PDB entry, use the downloadsifts function.
using MIToS.SIFTS
siftsfile = downloadsifts("1IVO")
The following example shows the residue number mapping between Pfam and PDB. Pfam uses UniProt coordinates, while PDB uses its own residue numbers with insertion codes. Note that the siftsmapping function is case sensitive and that SIFTS stores PDB identifiers in lowercase characters.
siftsmap = siftsmapping(
siftsfile,
dbPfam,
"PF00757",
dbPDB,
"1ivo", # SIFTS stores PDB identifiers in lowercase
chain = "A", # In this example we are only using the chain A of the PDB
missings = false,
) # Residues without coordinates aren't used in the mapping
OrderedCollections.OrderedDict{String, String} with 162 entries:
"177" => "153"
"178" => "154"
"179" => "155"
"180" => "156"
"181" => "157"
"182" => "158"
"183" => "159"
"184" => "160"
"185" => "161"
"186" => "162"
"187" => "163"
"188" => "164"
"189" => "165"
"190" => "166"
"191" => "167"
"192" => "168"
"193" => "169"
"194" => "170"
"195" => "171"
⋮ => ⋮
Storing residue-level mapping
If you need more than the residue number mapping between two databases, you can access all the residue-level cross references using the read_file function with the SIFTSXML file format. The parse_file function (and therefore the read_file function) for the SIFTSXML format also takes the keyword arguments chain and missings. The read_file/parse_file function returns a Vector of SIFTSResidue objects that store the cross references between residues in each database.
You are free to access the SIFTSResidue fields in order to get the desired information. SIFTSResidue objects contain db... objects (sub-types of DataBase), with the cross referenced information. You should note that, except for the PDBe and InterPro fields, the field values can be missing. The ismissing function is helpful to know if there is a db... object. For example, getting the UniProt residue name (one letter code of the amino acid) would be:
ismissing(residue_data.UniProt) ? "" : residue_data.UniProt.name
"C"
That line of code returns an empty string if the UniProt field is missing. Otherwise, it returns a string with the name of the residue in UniProt. Because that way of accessing values in a SIFTSResidue is too verbose, MIToS defines a more complex signature for get. Using the MIToS get, the previous line of code becomes:
#   SIFTSResidue  database   field  default
get(residue_data, dbUniProt, :name, "")
"C"
There is no need to use the full signature. Other signatures are possible depending on the value you want to access. In particular, a missing object is returned if a default value is not given at the end of the signature and the value to access is missing:
julia> get(residue_data, dbUniProt) # get takes the database type (`db...`)
MIToS.SIFTS.dbUniProt("P00533", "325", "K")
julia> get(residue_data, dbUniProt, :name) # and can also take a field name (Symbol)
"K"
But you don't need the get function to access the three-letter code of the residue in PDBe, because the PDBe field cannot be missing.
residue_data.PDBe.name
"CYS"
SIFTSResidue objects also store whether the residue is missing (i.e. not resolved) in the PDB structure, as well as information about the secondary structure (sscode and ssname):
julia> residue_data.missing
false
julia> residue_data.sscode
"T"
julia> residue_data.ssname
"loop"
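The access pattern described in this section (return the field when the cross reference exists, the default when it does not, and missing when no default is given) can be sketched in plain Julia. The types and the `toy_get` function below are hypothetical stand-ins written for illustration; they are not the MIToS types or the MIToS `get` method, and they only mimic the behaviour described above.

```julia
# Hypothetical stand-ins for illustration only; these are NOT the MIToS types.
struct ToyUniProt
    id::String
    number::String
    name::String
end

struct ToyResidue
    UniProt::Union{ToyUniProt,Missing}
end

# Return the requested field, or `default` when the cross reference is missing.
function toy_get(res::ToyResidue, field::Symbol, default = missing)
    return ismissing(res.UniProt) ? default : getfield(res.UniProt, field)
end

with_ref = ToyResidue(ToyUniProt("P00533", "325", "K"))
without_ref = ToyResidue(missing)

name_found = toy_get(with_ref, :name, "")       # the stored name
name_default = toy_get(without_ref, :name, "")  # the default value
name_missing = toy_get(without_ref, :name)      # `missing`, no default given
```

Passing an explicit default keeps later comparisons such as `== "K"` returning a Bool instead of propagating missing.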
Accessing residue-level cross references
You can ask for particular values in a single SIFTSResidue using the get function.
julia> using MIToS.SIFTS
julia> residue_data = read_file(siftsfile, SIFTSXML)[301]
SIFTSResidue with secondary structure code (sscode): "T" and name (ssname): "loop"
  PDBe:
    number: 301
    name: LYS
  UniProt :
    id: P00533
    number: 325
    name: K
  Pfam :
    id: PF00757
    number: 325
    name: K
  NCBI :
    id: 9606
    number: 325
    name: K
  PDB :
    id: 1ivo
    number: 301
    name: LYS
    chain: A
  SCOP :
    id: 76847
    number: 301
    name: LYS
    chain: A
  CATH :
    id: 2.10.220.10
    number: 301
    name: LYS
    chain: A
  InterPro: MIToS.SIFTS.dbInterPro[MIToS.SIFTS.dbInterPro("IPR006211", "301", "LYS", "PF00757"), MIToS.SIFTS.dbInterPro("IPR009030", "301", "LYS", "SSF57184")]
  Ensembl: MIToS.SIFTS.dbEnsembl[MIToS.SIFTS.dbEnsembl("ENSG00000146648", "325", "K", "ENST00000275493", "ENSP00000275493", "ENSE00001751179")]
julia> # Is the UniProt residue name in the list of basic amino acids ["H", "K", "R"]?
       get(residue_data, dbUniProt, :name, "") in ["H", "K", "R"]
true
Use higher-order functions with lambda expressions (anonymous functions), or list comprehensions, to easily query the Vector{SIFTSResidue}. You can use get with the previous signature, or simple direct field access together with ismissing.
# Captures PDB residue numbers if the Pfam id is "PF00757"
resnums = [
res.PDB.number for res in siftsresidues if
!ismissing(res.PDB) && get(res, dbPfam, :id, "") == "PF00757"
]
162-element Vector{String}:
"153"
"154"
"155"
"156"
"157"
"158"
"159"
"160"
"161"
"162"
⋮
"306"
"307"
"308"
"309"
"310"
"311"
"312"
"313"
"314"
Useful higher-order functions are:
findall
# Which of the residues have UniProt residue names in the list ["H", "K", "R"]? (basic residues)
indexes = findall(res -> get(res, dbUniProt, :name, "") in ["H", "K", "R"], siftsresidues)
69-element Vector{Int64}:
3
4
12
22
28
47
55
73
83
104
⋮
462
464
469
475
482
496
502
506
508
map
map(i -> siftsresidues[i].UniProt, indexes) # UniProt data of the basic residues
69-element Vector{MIToS.SIFTS.dbUniProt}:
MIToS.SIFTS.dbUniProt("P00533", "28", "K")
MIToS.SIFTS.dbUniProt("P00533", "29", "K")
MIToS.SIFTS.dbUniProt("P00533", "37", "K")
MIToS.SIFTS.dbUniProt("P00533", "47", "H")
MIToS.SIFTS.dbUniProt("P00533", "53", "R")
MIToS.SIFTS.dbUniProt("P00533", "72", "R")
MIToS.SIFTS.dbUniProt("P00533", "80", "K")
MIToS.SIFTS.dbUniProt("P00533", "98", "R")
MIToS.SIFTS.dbUniProt("P00533", "108", "R")
MIToS.SIFTS.dbUniProt("P00533", "129", "K")
⋮
MIToS.SIFTS.dbUniProt("P00533", "487", "K")
MIToS.SIFTS.dbUniProt("P00533", "489", "K")
MIToS.SIFTS.dbUniProt("P00533", "494", "R")
MIToS.SIFTS.dbUniProt("P00533", "500", "K")
MIToS.SIFTS.dbUniProt("P00533", "507", "H")
MIToS.SIFTS.dbUniProt("P00533", "521", "R")
MIToS.SIFTS.dbUniProt("P00533", "527", "R")
MIToS.SIFTS.dbUniProt("P00533", "531", "R")
MIToS.SIFTS.dbUniProt("P00533", "533", "R")
filter
# SIFTSResidues with UniProt names in ["H", "K", "R"]
basicresidues =
filter(res -> get(res, dbUniProt, :name, "") in ["H", "K", "R"], siftsresidues)
basicresidues[1].UniProt # UniProt data of the first basic residue
MIToS.SIFTS.dbUniProt("P00533", "28", "K")
Example: Which residues are missing in the PDB structure
Given that SIFTSResidue objects store a missing residue flag, it’s easy to get a vector where there is a true value if the residue is missing in the structure.
julia> using MIToS.SIFTS
julia> sifts_1ivo = read_file(siftsfile, SIFTSXML, chain = "A"); # SIFTSResidues of the 1IVO chain A
julia> [res.missing for res in sifts_1ivo]
622-element Vector{Bool}:
 1
 0
 0
 0
 0
 0
 0
 0
 0
 0
 ⋮
 1
 1
 1
 1
 1
 1
 1
 1
 1
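A Bool vector like the one above is easy to summarise with Base functions. The sketch below works on a made-up vector of flags rather than on real SIFTS data, and `missing_runs` is a hypothetical helper written for this illustration: it counts the missing residues, finds their positions, and groups consecutive positions into runs.

```julia
# Toy flags (made up): true means the residue is missing in the structure.
flags = [true, false, false, true, true, false, true]

n_missing = count(flags)            # number of missing residues
missing_positions = findall(flags)  # indices of the missing residues

# Group consecutive missing positions into (first, last) runs.
function missing_runs(flags)
    runs = Tuple{Int,Int}[]
    start = 0  # 0 means "not currently inside a run"
    for (i, flag) in enumerate(flags)
        if flag && start == 0
            start = i                    # a run begins here
        elseif !flag && start != 0
            push!(runs, (start, i - 1))  # the run ended at the previous index
            start = 0
        end
    end
    # Close a run that reaches the end of the vector.
    start != 0 && push!(runs, (start, length(flags)))
    return runs
end

runs = missing_runs(flags)
```

Runs at the start or end of a chain often correspond to unresolved termini, while internal runs point to disordered loops.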
However, if you need to filter using other conditions, you'll find the get function useful. In this example, we ask for the UniProt id (to avoid problems with fragments, tags or chimeric/fusion proteins). We also use get to select a specific PDB chain.
siftsfile = downloadsifts("1JQZ")
julia> using MIToS.SIFTS
julia> sifts_1jqz = read_file(siftsfile, SIFTSXML); # It has an amino terminal his tag
julia> missings = [
           (
               (get(res, dbUniProt, :id, "") == "P05230") &
               (get(res, dbPDB, :chain, "") == "A") &
               res.missing
           ) for res in sifts_1jqz
       ];
julia> println(
           "There are only ",
           sum(missings),
           " missing residues in the chain A, associated to UniProt P05230",
       )
There are only 3 missing residues in the chain A, associated to UniProt P05230
julia> println(
           "But there are ",
           sum([res.missing for res in sifts_1jqz]),
           " missing residues in the PDB file.",
       )
But there are 10 missing residues in the PDB file.
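The same filtering logic can be tried without downloading anything. The sketch below is an assumption-laden toy: the records are made-up NamedTuples rather than SIFTSResidue objects, and `coalesce` from Base stands in for the default value that get provides above. It shows why the count restricted to one UniProt id and chain can be smaller than the total count of missing residues.

```julia
# Toy records (made up); `missing` marks an absent UniProt cross reference.
toy_residues = [
    (uniprot_id = missing, chain = "A", is_missing = true),   # e.g. a tag residue
    (uniprot_id = "P05230", chain = "A", is_missing = true),
    (uniprot_id = "P05230", chain = "A", is_missing = false),
    (uniprot_id = "P05230", chain = "B", is_missing = true),
]

# coalesce replaces `missing` with "", so the comparison always gives a Bool.
is_selected(res) =
    coalesce(res.uniprot_id, "") == "P05230" && res.chain == "A" && res.is_missing

selected = map(is_selected, toy_residues)

n_selected = sum(selected)                             # missing, in chain A, with the UniProt id
n_total = sum(res.is_missing for res in toy_residues)  # all missing residues
```

Without the default, comparing a missing id with `==` would yield missing instead of false, and using it in `&&` would throw an error.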