SIFTS
The SIFTS module of MIToS lets you obtain the residue-level mapping between databases stored in the SIFTS XML files. It makes it easy to assign PDB residues to UniProt/Pfam positions. Because pairwise alignments can lead to misleading associations between the residues of two sequences, SIFTS offers a more reliable association between sequence and structure residue numbers.
using MIToS.SIFTS # to load the SIFTS module
Features
- Download and parse SIFTS XML files
- Store residue-level mapping in Julia
- Easy generation of Dicts between residue numbers
Simplest residue-level mapping
This module exports the function siftsmapping to generate a Dict between residue numbers. This function takes 5 positional arguments:
- The name of the SIFTS XML file to parse,
- the source database,
- the source protein/structure identifier,
- the destination database, and
- the destination protein/structure identifier.
Optionally, you can indicate a particular PDB chain and whether missings (residues without coordinates) will be used, through the chain and missings keyword arguments.
Databases should be indicated using an available sub-type of DataBase. The key and value types will depend on the residue number type in that database.
Type db... | Database | Residue number type |
---|---|---|
dbPDBe | PDBe (Protein Data Bank in Europe) | Int |
dbInterPro | InterPro | String |
dbUniProt | UniProt | Int |
dbPfam | Pfam (Protein families database) | Int |
dbNCBI | NCBI (National Center for Biotechnology Information) | Int |
dbPDB | PDB (Protein Data Bank) | String |
dbCATH | CATH | String |
dbSCOP | SCOP (Structural Classification of Proteins) | String |
dbEnsembl | Ensembl | String |
To download the SIFTS XML file for a given PDB entry, use the downloadsifts function.
using MIToS.SIFTS
siftsfile = downloadsifts("1IVO")
The following example shows the residue number mapping between Pfam and PDB. Pfam uses UniProt coordinates, and PDB uses its own residue numbers with insertion codes. Note that the siftsmapping function is case sensitive, and that SIFTS stores PDB identifiers using lowercase characters.
siftsmap = siftsmapping(
    siftsfile,
    dbPfam,
    "PF00757",
    dbPDB,
    "1ivo", # SIFTS stores PDB identifiers in lowercase
    chain = "A", # In this example we are only using the chain A of the PDB
    missings = false, # Residues without coordinates aren't used in the mapping
)
OrderedCollections.OrderedDict{String, String} with 162 entries:
"177" => "153"
"178" => "154"
"179" => "155"
"180" => "156"
"181" => "157"
"182" => "158"
"183" => "159"
"184" => "160"
"185" => "161"
"186" => "162"
"187" => "163"
"188" => "164"
"189" => "165"
"190" => "166"
"191" => "167"
"192" => "168"
"193" => "169"
"194" => "170"
"195" => "171"
⋮ => ⋮
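Once the mapping is available, it behaves like any other Julia dictionary. The following is a minimal sketch that uses only Base and a plain Dict holding three entries copied from the output above (the real result is an OrderedDict); the variable names pdb_resnum, unmapped and inverse_map are illustrative assumptions, not part of MIToS.

```julia
# Three entries copied from the siftsmapping output above
# (Pfam/UniProt residue number => PDB residue number).
siftsmap = Dict("177" => "153", "178" => "154", "179" => "155")

# Look up the PDB residue for a given Pfam/UniProt position,
# falling back to `missing` when the position is not in the mapping.
pdb_resnum = get(siftsmap, "178", missing)
unmapped = get(siftsmap, "1", missing)

# Invert the mapping to go from PDB residue numbers back to Pfam/UniProt positions.
inverse_map = Dict(v => k for (k, v) in siftsmap)
```

Inverting the dictionary this way is only safe when the mapping is one-to-one, which holds for the entries shown here.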
Storing residue-level mapping
If you need more than the residue number mapping between two databases, you can access all the residue-level cross references by calling the read_file function with the SIFTSXML File.Format. The parse_file function (and therefore the read_file function) for the SIFTSXML format also takes the keyword arguments chain and missings. The read_file/parse_file function returns a Vector of SIFTSResidue objects that store the cross references between residues in each database.
You are free to access the SIFTSResidue fields to get the desired information. SIFTSResidue objects contain db... objects (sub-types of DataBase) with the cross-referenced information. Note that, except for the PDBe and InterPro fields, the field values can be missing. The ismissing function is helpful for checking whether a db... object is present. For example, getting the UniProt residue name (the one-letter code of the amino acid) would be:
ismissing(residue_data.UniProt) ? "" : residue_data.UniProt.name
"C"
That line of code returns an empty string if the UniProt field is missing. Otherwise, it returns a string with the name of the residue in UniProt. Because that way of accessing values in a SIFTSResidue is too verbose, MIToS defines a more complex signature for get. Using the MIToS get, the previous line of code becomes:
# SIFTSResidue database field default
get(residue_data, dbUniProt, :name, "")
"C"
There is no need to use the full signature. Other signatures are possible depending on the value you want to access. In particular, a missing object is returned if no default value is given at the end of the signature and the value to access is missing:
julia> get(residue_data, dbUniProt) # get takes the database type (`db...`)
MIToS.SIFTS.dbUniProt("P00533", "325", "K")
julia> get(residue_data, dbUniProt, :name) # and can also take a field name (Symbol)
"K"
But you don't need the get function to access the three-letter code of the residue in PDBe, because the PDBe field cannot be missing.
residue_data.PDBe.name
"CYS"
SIFTSResidue also stores whether the residue is missing (i.e. not resolved) in the PDB structure, together with information about the secondary structure (sscode and ssname):
julia> residue_data.missing
false
julia> residue_data.sscode
"T"
julia> residue_data.ssname
"loop"
Accessing residue-level cross references
You can ask for particular values in a single SIFTSResidue using the get function.
julia> using MIToS.SIFTS
julia> residue_data = read_file(siftsfile, SIFTSXML)[301]
SIFTSResidue with secondary structure code (sscode): "T" and name (ssname): "loop"
  PDBe:
    number: 301
    name: LYS
  UniProt :
    id: P00533
    number: 325
    name: K
  Pfam :
    id: PF00757
    number: 325
    name: K
  NCBI :
    id: 9606
    number: 325
    name: K
  PDB :
    id: 1ivo
    number: 301
    name: LYS
    chain: A
  SCOP :
    id: 76847
    number: 301
    name: LYS
    chain: A
  CATH :
    id: 2.10.220.10
    number: 301
    name: LYS
    chain: A
  InterPro: MIToS.SIFTS.dbInterPro[MIToS.SIFTS.dbInterPro("IPR006211", "301", "LYS", "PF00757"), MIToS.SIFTS.dbInterPro("IPR009030", "301", "LYS", "SSF57184")]
  Ensembl: MIToS.SIFTS.dbEnsembl[MIToS.SIFTS.dbEnsembl("ENSG00000146648", "325", "K", "ENST00000275493", "ENSP00000275493", "ENSE00001751179")]
# Is the UniProt residue name in the list of basic amino acids ["H", "K", "R"]?
julia> get(residue_data, dbUniProt, :name, "") in ["H", "K", "R"]
true
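To make the role of the default value concrete, here is a minimal Base-only sketch of the same access pattern. ToyUniProt, ToyResidue and toyget are hypothetical stand-ins written for this illustration; they are not the MIToS types or the MIToS get method, and only mimic the behaviour described above (return the field, the default, or missing).

```julia
# Hypothetical stand-ins; these are NOT the MIToS types.
struct ToyUniProt
    id::String
    number::String
    name::String
end

struct ToyResidue
    UniProt::Union{ToyUniProt,Missing}
end

# Return the requested field of the UniProt entry, or `default`
# (which is `missing` unless given) when the entry itself is missing.
function toyget(res::ToyResidue, field::Symbol, default = missing)
    db = res.UniProt
    return ismissing(db) ? default : getfield(db, field)
end

with_uniprot = ToyResidue(ToyUniProt("P00533", "325", "K"))
without_uniprot = ToyResidue(missing)

name_present = toyget(with_uniprot, :name, "")      # field value is returned
name_default = toyget(without_uniprot, :name, "")   # default is returned
name_missing = toyget(without_uniprot, :name)       # no default: missing is returned
```

Passing a default of the same type as the field (here a String) keeps later comparisons such as `in ["H", "K", "R"]` returning a plain Bool instead of missing.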
Use higher-order functions with lambda expressions (anonymous functions), or list comprehensions, to easily ask for information on the Vector{SIFTSResidue}. You can use get with the previous signature, or simple direct field access together with ismissing.
# Captures PDB residue numbers if the Pfam id is "PF00757"
resnums = [
res.PDB.number for res in siftsresidues if
!ismissing(res.PDB) && get(res, dbPfam, :id, "") == "PF00757"
]
162-element Vector{String}:
"153"
"154"
"155"
"156"
"157"
"158"
"159"
"160"
"161"
"162"
⋮
"306"
"307"
"308"
"309"
"310"
"311"
"312"
"313"
"314"
Useful higher order functions are:
findall
# Which of the residues have UniProt residue names in the list ["H", "K", "R"]? (basic residues)
indexes = findall(res -> get(res, dbUniProt, :name, "") in ["H", "K", "R"], siftsresidues)
69-element Vector{Int64}:
3
4
12
22
28
47
55
73
83
104
⋮
462
464
469
475
482
496
502
506
508
map
map(i -> siftsresidues[i].UniProt, indexes) # UniProt data of the basic residues
69-element Vector{MIToS.SIFTS.dbUniProt}:
MIToS.SIFTS.dbUniProt("P00533", "28", "K")
MIToS.SIFTS.dbUniProt("P00533", "29", "K")
MIToS.SIFTS.dbUniProt("P00533", "37", "K")
MIToS.SIFTS.dbUniProt("P00533", "47", "H")
MIToS.SIFTS.dbUniProt("P00533", "53", "R")
MIToS.SIFTS.dbUniProt("P00533", "72", "R")
MIToS.SIFTS.dbUniProt("P00533", "80", "K")
MIToS.SIFTS.dbUniProt("P00533", "98", "R")
MIToS.SIFTS.dbUniProt("P00533", "108", "R")
MIToS.SIFTS.dbUniProt("P00533", "129", "K")
⋮
MIToS.SIFTS.dbUniProt("P00533", "487", "K")
MIToS.SIFTS.dbUniProt("P00533", "489", "K")
MIToS.SIFTS.dbUniProt("P00533", "494", "R")
MIToS.SIFTS.dbUniProt("P00533", "500", "K")
MIToS.SIFTS.dbUniProt("P00533", "507", "H")
MIToS.SIFTS.dbUniProt("P00533", "521", "R")
MIToS.SIFTS.dbUniProt("P00533", "527", "R")
MIToS.SIFTS.dbUniProt("P00533", "531", "R")
MIToS.SIFTS.dbUniProt("P00533", "533", "R")
filter
# SIFTSResidues with UniProt names in ["H", "K", "R"]
basicresidues =
filter(res -> get(res, dbUniProt, :name, "") in ["H", "K", "R"], siftsresidues)
basicresidues[1].UniProt # UniProt data of the first basic residue
MIToS.SIFTS.dbUniProt("P00533", "28", "K")
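The relationship between findall, map and filter can be checked on a small vector without downloading any file. The sketch below uses only Base; uniprot_names is hypothetical data (one-letter names with missing marking residues that lack a UniProt cross reference), and coalesce plays the role of the default value passed to get in the examples above.

```julia
# Hypothetical one-letter residue names; `missing` marks a residue
# without a UniProt cross reference. This is not real SIFTS data.
uniprot_names = ["K", missing, "A", "R", "H", missing, "G"]
basic = ["H", "K", "R"]

# Replace a missing name by "" before testing membership,
# so the predicate always returns a Bool.
isbasic(name) = coalesce(name, "") in basic

indexes = findall(isbasic, uniprot_names)            # positions of the basic residues
selected = filter(isbasic, uniprot_names)            # the matching elements themselves
from_indexes = map(i -> uniprot_names[i], indexes)   # same elements, recovered via the indexes
nbasic = count(isbasic, uniprot_names)               # how many residues match
```

findall is the better choice when the positions are needed afterwards (for example to index another vector of the same length), while filter is enough when only the matching elements matter.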
Example: Which residues are missing in the PDB structure
Given that SIFTSResidue objects store a missing residue flag, it’s easy to get a vector where there is a true value if the residue is missing in the structure.
julia> using MIToS.SIFTS
julia> sifts_1ivo = read_file(siftsfile, SIFTSXML, chain = "A"); # SIFTSResidues of the 1IVO chain A
julia> [res.missing for res in sifts_1ivo]
622-element Vector{Bool}:
 1
 0
 0
 0
 0
 0
 0
 0
 0
 0
 ⋮
 1
 1
 1
 1
 1
 1
 1
 1
 1
However, if you need to filter using other conditions, you’ll find the get function useful. In this example, we are going to ask for the UniProt id (to avoid problems with fragments, tags or chimeric/fusion proteins). We are also using get to select a specific PDB chain.
siftsfile = downloadsifts("1JQZ")
julia> using MIToS.SIFTS
julia> sifts_1jqz = read_file(siftsfile, SIFTSXML); # It has an amino terminal his tag
julia> missings = [
           (
               (get(res, dbUniProt, :id, "") == "P05230") &
               (get(res, dbPDB, :chain, "") == "A") &
               res.missing
           ) for res in sifts_1jqz
       ];
julia> println(
           "There are only ",
           sum(missings),
           " missing residues in the chain A, associated to UniProt P05230",
       )
There are only 3 missing residues in the chain A, associated to UniProt P05230
julia> println(
           "But there are ",
           sum([res.missing for res in sifts_1jqz]),
           " missing residues in the PDB file.",
       )
But there are 10 missing residues in the PDB file.
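The counting logic above can be reproduced on toy data to see why the default value matters. The following Base-only sketch uses hypothetical NamedTuples (the field names uniprot, chain and unresolved are assumptions made for this illustration, and the values are not real SIFTS data). In Julia, comparing missing with a String yields missing rather than false, so a default is needed before combining the conditions.

```julia
# Hypothetical residues; not real SIFTS data.
residues = [
    (uniprot = missing,  chain = "A", unresolved = true),   # e.g. a tag residue with no UniProt reference
    (uniprot = "P05230", chain = "A", unresolved = true),
    (uniprot = "P05230", chain = "A", unresolved = false),
    (uniprot = "P05230", chain = "B", unresolved = true),
]

# `missing == "P05230"` evaluates to `missing`, which cannot be used with `&&`;
# coalesce supplies a default first, as the default argument of get does above.
selected = [
    coalesce(res.uniprot, "") == "P05230" && res.chain == "A" && res.unresolved
    for res in residues
]

n_selected = sum(selected)                          # unresolved residues in chain A tied to P05230
n_all = sum(res.unresolved for res in residues)     # every unresolved residue, regardless of chain or protein
```

Summing a Bool vector counts its true values, which is why sum gives the number of missing residues directly.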