Multiple Sequence Alignments
For this example, we use the Multiple Sequence Alignment (MSA) of the catalytic domain from the Cytosol aminopeptidase family. We use the downloadpfam
function of the MIToS' Pfam module to get the annotated Stockholm file from the Pfam database:
using MIToS.Pfam
const pfam_file = downloadpfam("PF00883")
The read
function from MIToS' MSA allows parsing multiple sequence alignments. The file format is indicated by the second argument. There is no need for decompressing the file before reading it:
using MIToS.MSA
msa = read(pfam_file, Stockholm)
Pfam alignments are generated using HMMER3 Hidden Markov Model (HMM) searches, have insertion columns (indicated by dots and lowercase letters):
first_seq = sequencenames(msa)[1]
TIP: You can use TranscodingStreams to read compressed files. In particular, the CodecZlib package has the needed codecs for reading gzipped files. For example, the next cell is similar to using zgrep
in Linux:
zgrep ^A0A094YQ53_BACAO/182-489 PF00883.stockholm.gz
using CodecZlib
open(GzipDecompressorStream, pfam_file) do file
for line in eachline(file)
if startswith(line, first_seq)
println(line)
end
end
end
TIP: If you are on Linux and you have zgrep installed, you can run it as an external command and use interpolation:
run(`zgrep ^$first_seq $pfam_file`)
or using the shell mode of the Julia REPL:
;zgrep ^$first_seq $pfam_file
Insert columns are deleted by MIToS, therefore the first MSA column has the label "69"
(column number in the input file) instead of "1"
:
columnnames(msa)[1]
If we want to keep track of residue numbers in each sequence, we need to read the file using the keyword arguments generatemapping
and useidcoordinates
:
const mapped_msa = read(pfam_file,
Stockholm,
generatemapping=true,
useidcoordinates=true)
getsequencemapping(mapped_msa, "A0A094YQ53_BACAO/182-489")
The mapping is stored in the MSA annotations and it is automatically updated when columns are deleted.
The getannot...
functions return dictionaries with the MSA annotations
getannotresidue(mapped_msa)
Pfam has "AS" (active site) and "pAS" (predicted active site) annotations for some families. We use filter
to get a dictionary with the experimentally annotated active site residues (indicated by *
):
active_sites = filter(pair -> pair[1][2] == "AS", getannotresidue(mapped_msa))
Exercise I
Which are the UniProt residue numbers of residues at the active site of AMPL_BOVIN/197-508?
Hint: You can use the functions getsequencemapping
and zip
.
# ...your solution...
The answer is 294 and 368.
MSA statistics and plotting
Plots.jl is a nice library because it gives a single API for multiple plotting backends (packages):
- GR is fast
- PlotlyJS is interactive
- UnicodePlots doesn't need a graphics environment
- PGFPlots has LaTeX quality
- PyPlot for pythonists
using Plots
gr() # choose the plotting backend
Plots.jl allows to define plot recipes and MIToS uses them to define plots for common biological objects:
plot(mapped_msa)
Plots are also really useful to visualize MSA statistics:
plotlyjs() # change to an interactive backend
gap_percentage = 100.0 .* gapfraction(mapped_msa, 1) # i.e. reduce dimension 1
x = names(gap_percentage, 2)
y = vec(gap_percentage)
plot(x, y, legend=false)
xlabel!("MSA column (input file)")
ylabel!("Gap percentage")
Something you can easily calculate with MIToS is the percentage identity between all the sequences in a Multiple Sequence Alignment:
percent_id = percentidentity(mapped_msa)
Exercise 1
What is the sequence that has the highest identity percentage with AMPL_BOVIN/197-508?
Hint: Use NamedArrays Not
indexing and names
and Julia's findmax
:
using NamedArrays
# ...your solution...
This page was generated using Literate.jl.