SIFTS

MIToS.SIFTS — Module

The SIFTS module of MIToS lets you obtain the residue-level mapping between databases stored in the SIFTS XML files. It makes it easy to assign PDB residues to UniProt/Pfam positions. Because pairwise alignments can produce misleading associations between the residues of two sequences, SIFTS offers a more reliable association between sequence and structure residue numbers.

Features

- Download and parse SIFTS XML files
- Store residue-level mapping in Julia
- Easy generation of OrderedDicts between residue numbers

using MIToS.SIFTS

Contents

SIFTS

Types

MIToS.SIFTS.SIFTSCSV — Type

SIFTSCSV

A FileFormat subtype for the chain-level CSV summary tables described in the PDBe SIFTS Quick Access guide. Use it with the read_file function to load gzipped CSV files downloaded from SIFTS via downloadsifts. For example, to download and read the summary file for SCOP2:

using MIToS.SIFTS
summary_path = downloadsifts(dbSCOP2)
summary = read_file(summary_path, SIFTSCSV)

MIToS.SIFTS.SIFTSResidue — Type

A SIFTSResidue object stores the SIFTS residue-level mapping for a single residue. It has the following fields, which you can access directly to query the mapping:

- `PDBe` : A `dbPDBe` object; it is present in every `SIFTSResidue`.
- `UniProt` : A `dbUniProt` object or `missing`.
- `Pfam` : A `dbPfam` object or `missing`.
- `NCBI` : A `dbNCBI` object or `missing`.
- `InterPro` : An array of `dbInterPro` objects.
- `PDB` : A `dbPDB` object or `missing`.
- `SCOP` : A `dbSCOP` object or `missing`.
- `SCOP2` : An array of `dbSCOP2` objects.
- `SCOP2B` : A `dbSCOP2B` object or `missing`.
- `CATH` : A `dbCATH` object or `missing`.
- `Ensembl` : An array of `dbEnsembl` objects.
- `missing` : It's `true` if the residue is missing, i.e. not observed, in the structure.
- `sscode` : A string with the secondary structure code of the residue.
- `ssname` : A string with the secondary structure name of the residue.
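
The fields listed above can be read directly from parsed residues. The following is a minimal sketch, not a definitive recipe. It assumes network access, uses the PDB entry 2VQC purely as an illustrative choice, and assumes that read_file accepts the SIFTSXML format together with the chain and missings keywords documented for parse_file. The subfield name number is inferred from the type descriptions on this page.

```julia
using MIToS.SIFTS

# Download the SIFTS XML file for an example entry (requires network access).
sifts_path = downloadsifts("2VQC")

# Parse chain A only, skipping residues that were not observed in the structure.
residues = read_file(sifts_path, SIFTSXML; chain = "A", missings = false)

for res in residues
    # PDBe is always present; other single-valued fields may be `missing`.
    pdb = ismissing(res.PDB) ? "-" : res.PDB.number
    uniprot = ismissing(res.UniProt) ? "-" : res.UniProt.number
    println(res.PDBe.number, '\t', pdb, '\t', uniprot, '\t', res.sscode)
end
```

Checking each optional field with ismissing before reading its subfields avoids errors on residues that lack a cross-reference to that database.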

MIToS.SIFTS.dbCATH — Type

dbCATH stores the residue id, number, name and chain in CATH as strings.

MIToS.SIFTS.dbEnsembl — Type

dbEnsembl stores the residue (gene) accession id, the transcript, translation and exon ids in Ensembl as strings, together with the residue number and name using the UniProt coordinates.

MIToS.SIFTS.dbInterPro — Type

dbInterPro stores the residue id, number, name and evidence in InterPro as strings.

MIToS.SIFTS.dbNCBI — Type

dbNCBI stores the residue id, number and name in NCBI as strings.

MIToS.SIFTS.dbPDB — Type

dbPDB stores the residue id, number, name and chain in PDB as strings.

MIToS.SIFTS.dbPDBe — Type

dbPDBe stores the residue number and name in PDBe as strings.

MIToS.SIFTS.dbPfam — Type

dbPfam stores the residue id, number and name in Pfam as strings.

MIToS.SIFTS.dbSCOP — Type

dbSCOP stores the residue id, number, name and chain in SCOP as strings.

MIToS.SIFTS.dbSCOP2 — Type

dbSCOP2 stores the residue id, number, name and chain in SCOP2 as strings.

MIToS.SIFTS.dbSCOP2B — Type

dbSCOP2B stores the residue id, number, name and chain in SCOP2B as strings. SCOP2B is an expansion of the SCOP2 domain annotations at the superfamily level to every PDB entry with the same UniProt accession that has at least 80% SCOP2 domain coverage.

MIToS.SIFTS.dbUniProt — Type

dbUniProt stores the residue id, number and name in UniProt as strings.

Constants

Macros

Methods and functions

MIToS.SIFTS.downloadsifts — Method

downloadsifts(pdbcode::AbstractString; filename::AbstractString, source::AbstractString="ftp")

Download the gzipped SIFTS XML file for the provided pdbcode. By default, the downloaded file has the .xml.gz extension. You can change the filename, but it must still end in .xml.gz. The source keyword argument defaults to "ftp", which downloads from the HTTPS mirror at https://ftp.ebi.ac.uk/pub/databases/msd/sifts/split_xml/. Alternatively, set source to "https" to download directly from the EBI PDBe server at https://www.ebi.ac.uk/pdbe/files/sifts/.
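
As a hedged illustration of the filename keyword, the sketch below saves the file into a temporary directory under a custom name. It assumes network access, uses 2VQC only as an example entry, and assumes that this method returns the path of the downloaded file, as the summary-file method is documented to do.

```julia
using MIToS.SIFTS

# Choose a custom destination; the name must still end in ".xml.gz".
custom_path = joinpath(mktempdir(), "2vqc_sifts.xml.gz")

# Download from the default source ("ftp") into the chosen path.
downloaded = downloadsifts("2VQC"; filename = custom_path)

println("Saved SIFTS XML to ", downloaded)
```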

MIToS.SIFTS.downloadsifts — Method

downloadsifts(database::Type{<:DataBase}; filename=nothing)

Download a SIFTS chain-level summary file (CSV, gzipped) from PDBe into the current working directory.

The database argument selects which summary file to fetch; see the PDBe SIFTS Quick Access guide for details on each file. This function always downloads the gzipped CSV variant.

For example, to download the "pdb_chain_scop2_uniprot.csv.gz" file with the chain-level SCOP2 mappings, use downloadsifts(dbSCOP2). Then, to read and parse that file, you can use read_file with the SIFTSCSV format.

If filename is not provided, the canonical PDBe filename is used; otherwise, the data are saved to the specified path. The function returns the path to the downloaded file.

MIToS.SIFTS.siftsmapping — Method

Parses a SIFTS XML file and returns an OrderedDict between the residue numbers of two DataBases with the given identifiers. A chain can be specified (All by default). If missings is true (default), all residues are used, even if they have no coordinates in the PDB file.
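
A typical call might look like the sketch below. It assumes network access and assumes the positional argument order: file path, source database type and identifier, then target database type and identifier. The entry 2VQC and the Pfam family PF09645 are illustrative choices.

```julia
using MIToS.SIFTS

# Download the SIFTS XML file for the example entry.
sifts_path = downloadsifts("2VQC")

# Map Pfam positions of PF09645 to PDB residue numbers of chain A in 2VQC.
# PDB identifiers are written in lowercase inside SIFTS files.
pfam_to_pdb = siftsmapping(
    sifts_path,
    dbPfam, "PF09645",
    dbPDB, "2vqc";
    chain = "A",
    missings = false,  # keep only residues with coordinates in the PDB file
)

# Iterate over the mapping in its stored order.
for (pfam_pos, pdb_resnum) in pfam_to_pdb
    println(pfam_pos, " => ", pdb_resnum)
end
```

Because the databases on this page store residue numbers as strings, convert them with parse(Int, ...) only when you know they carry no insertion codes.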

MIToS.Utils.parse_file — Method

Utils.parse_file(io::IO, ::Type{SIFTSCSV})

Parse a SIFTS summary CSV file from an already-open, decompressed io stream. This function expects io to yield the plain CSV text (i.e., any .gz decompression has already been performed). This is handled automatically if you call read_file with SIFTSCSV as the format, because read_file opens the file, handles decompression when necessary, and then calls parse_file.

Returns a NamedTuple with:

colnames — a Vector{Symbol} of column names
table — the raw Matrix{String} produced by DelimitedFiles.readdlm

This low-level representation is intended for downstream reshaping or conversion. For example, if you have the output of this function stored in the variable summary, you can convert it to a DataFrame as follows:

using DataFrames
summary_df = DataFrame(summary.table, summary.colnames)
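
The colnames and table fields can also be queried with plain Julia indexing, without DataFrames. The sketch below uses a hand-built stand-in named csv_summary. Its column names and rows are invented for illustration and are not the real PDBe columns. The helper rows_where is hypothetical and not part of MIToS.

```julia
# Hypothetical stand-in for the NamedTuple returned by parse_file;
# the column names and rows are invented for illustration only.
csv_summary = (
    colnames = [:PDB, :CHAIN, :ACCESSION],
    table = [
        "1abc" "A" "X00001"
        "1abc" "B" "X00002"
        "2xyz" "A" "X00001"
    ],
)

# Hypothetical helper: keep the rows whose `column` equals `value`.
function rows_where(data, column::Symbol, value::AbstractString)
    col = findfirst(==(column), data.colnames)
    col === nothing && throw(ArgumentError("column $column not found"))
    keep = data.table[:, col] .== value  # one Bool per row
    return data.table[keep, :]
end

chains_1abc = rows_where(csv_summary, :PDB, "1abc")
println(size(chains_1abc, 1), " matching rows")
```

Looking the column up by name rather than by position keeps the code valid if the column order differs between summary files.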

MIToS.Utils.parse_file — Method

parse_file(document::LightXML.XMLDocument, ::Type{SIFTSXML}; chain=All, missings::Bool=true)

Returns a Vector{SIFTSResidue} parsed from a SIFTSXML file. By default, it parses all chains and includes missing residues.