MSA
MIToS.MSA
— ModuleThe MSA module of MIToS has utilities for working with Multiple Sequence Alignments of protein Sequences (MSA).
Features
- Read and write MSAs in
Stockholm
,FASTA
,A3M
,PIR
, orRaw
format - Handle MSA annotations
- Edit the MSA, e.g. delete columns or sequences, change sequence order, shuffling...
- Keep track of positions and annotations after modifications on the MSA
- Describe a MSA, e.g. mean percent identity, sequence coverage, gap percentage...
using MIToS.MSA
Contents
Types
MIToS.MSA.AbstractAlignedObject
— TypeMIToS MSA and aligned sequences (aligned objects) are subtypes of AbstractMatrix{Residue}
, because MSAs and sequences are stored as Matrix
of Residue
s.
MIToS.MSA.AbstractAlignedSequence
— TypeA MIToS aligned sequence is an AbstractMatrix{Residue}
with only 1 row/sequence.
MIToS.MSA.AbstractMultipleSequenceAlignment
— TypeMSAs are stored as Matrix{Residue}
. It's possible to use a NamedResidueMatrix{Array{Residue,2}}
as the most simple MSA with sequence identifiers and column names.
MIToS.MSA.AbstractSequence
— TypeA MIToS (unaligned) sequence is an AbstractMatrix{Residue}
with only 1 row/sequence.
MIToS.MSA.AlignedSequence
— TypeAn AlignedSequence
wraps a NamedResidueMatrix{Array{Residue,2}}
with only 1 row/sequence. The NamedArray
stores the sequence name and original column numbers as String
s.
MIToS.MSA.AnnotatedAlignedSequence
— TypeThis type represent an aligned sequence, similar to AlignedSequence
, but It also stores its Annotations
.
MIToS.MSA.AnnotatedMultipleSequenceAlignment
— TypeThis type represent an MSA, similar to MultipleSequenceAlignment
, but It also stores Annotations
. This annotations are used to store residue coordinates (i.e. mapping to UniProt residue numbers).
MIToS.MSA.AnnotatedSequence
— TypeAn AnnotationSequence
wraps a NamedResidueMatrix{Array{Residue,2}}
with only 1 row/sequence and its Annotations
. The NamedArray
stores the sequence name and original position numbers as String
s.
MIToS.MSA.Annotations
— TypeThe Annotations
type is basically a container for Dict
s with the annotations of a multiple sequence alignment. Annotations
was designed for storage of annotations of the Stockholm format.
MIToS also uses MSA annotations to keep track of:
- Modifications of the MSA (
MIToS_...
) as deletion of sequences or columns. - Positions numbers in the original MSA file (column mapping:
ColMap
) - Position of the residues in the sequence (sequence mapping:
SeqMap
)
MIToS.MSA.Clusters
— TypeData structure to represent sequence clusters. The sequence data itself is not included.
MIToS.MSA.GappedAlphabet
— TypeThis type defines the usual alphabet of the 20 natural residues and a gap character.
julia> using MIToS.MSA
julia> GappedAlphabet()
GappedAlphabet of length 21. Residues : res"ARNDCQEGHILKMFPSTWYV-"
MIToS.MSA.MultipleSequenceAlignment
— TypeThis MSA type include a NamedArray
wrapping a Matrix
of Residue
s. The use of NamedArray
allows to store sequence names and original column numbers as String
s, and fast indexing using them.
MIToS.MSA.NoClustering
— TypeUse NoClustering()
to avoid the use of clustering where a Clusters
type is needed.
MIToS.MSA.ReducedAlphabet
— TypeReducedAlphabet
allows the construction of reduced residue alphabets, where residues inside parenthesis belong to the same group.
julia> using MIToS.MSA
julia> ab = ReducedAlphabet("(AILMV)(RHK)(NQST)(DE)(FWY)CGP")
ReducedAlphabet of length 8 : "(AILMV)(RHK)(NQST)(DE)(FWY)CGP"
julia> ab[Residue('K')]
2
MIToS.MSA.Residue
— TypeMost of the MIToS design is created around the Residue
bitstype. It has representations for the 20 natural amino acids, a value representing insertions and deletions (GAP
, '-'
) and one representing unknown, ambiguous and non standard residues (XAA
, 'X'
). Each Residue
is encoded as an integer number, with the same bit representation and size than a Int
. This allows fast indexing operation of probability or frequency matrices.
Residue creation and conversion
Creation and conversion of Residue
s should be treated carefully. Residue
is encoded as a 32 or 64 bits type similar to Int
, to get fast indexing using Int(x::Residue)
. Int
simply calls reinterpret
without checking if the residue is valid. Valid residues have integer values in the closed interval [1,22]. convert
from Int
and Char
always returns valid residues, however it's possible to find invalid residues (they are shown using the character '�'
) after the creation of uninitialized Residue
arrays (i.e. using Array
). You can use zeros
, ones
or rand
to get initialized Residue
arrays with valid residues. Conversions to and from Char
s changes the bit representation and allows the use of the usual character representation of residues and amino acids. This conversions are used in IO operations and always return valid residues. In conversions from Char
, lowercase letters, '*'
, '-'
and '.'
are translated to GAP
, letters representing the 20 natural amino (ARNDCQEGHILKMFPSTWYV) acids are translated to their corresponding Residue
and any other character is translated to XAA
. Since lowercase letters and dots are translated to gaps, Pfam MSA insert columns are converted to columns full of gaps.
julia> using MIToS.MSA
julia> alanine = Residue('A')
A
julia> Char(alanine)
'A': ASCII/Unicode U+0041 (category Lu: Letter, uppercase)
julia> for residue in res"ARNDCQEGHILKMFPSTWYV-X"
println(residue, " ", Int(residue))
end
A 1
R 2
N 3
D 4
C 5
Q 6
E 7
G 8
H 9
I 10
L 11
K 12
M 13
F 14
P 15
S 16
T 17
W 18
Y 19
V 20
- 21
X 22
MIToS.MSA.ResidueAlphabet
— TypeAbstract type to define residue alphabet types.
MIToS.MSA.UngappedAlphabet
— TypeThis type defines the usual alphabet of the 20 natural residues, without the gap character.
julia> using MIToS.MSA
julia> UngappedAlphabet()
UngappedAlphabet of length 20. Residues : res"ARNDCQEGHILKMFPSTWYV"
Constants
MIToS.MSA.GAP
— ConstantGAP
is the Residue
representation on MIToS for gaps ('-'
, insertions and deletions). Lowercase residue characters, dots and '*'
are encoded as GAP
in conversion from String
s and Char
s. This Residue
constant is encoded as Residue(21)
.
MIToS.MSA.WeightTypes
— TypeThe WeightTypes
type is the same as Union{Weights,NoClustering,Clusters}
. This type is used to represent weights. Most of the functions taking the weights
kerword argument in the Information
module accept instances of WeightTypes
.
MIToS.MSA.XAA
— ConstantXAA
is the Residue
representation for unknown, ambiguous and non standard residues. This Residue
constant is encoded as Residue(22)
.
Macros
MIToS.MSA.@res_str
— MacroThe MIToS macro @res_str
takes a string and returns a Vector
of Residues
(sequence).
julia> using MIToS.MSA
julia> res"MIToS"
5-element Vector{Residue}:
M
I
T
-
S
Methods and functions
Base.isvalid
— Methodisvalid(res::Residue)
It returns true
if the encoded integer is in the closed interval [1,22].
Base.names
— MethodIt returns the name of each group. The name is a string with the one letter code of each residue that belong to the group.
julia> using MIToS.MSA
julia> ab = ReducedAlphabet("(AILMV)(RHK)(NQST)(DE)(FWY)CGP")
ReducedAlphabet of length 8 : "(AILMV)(RHK)(NQST)(DE)(FWY)CGP"
julia> names(ab)
8-element Vector{String}:
"AILMV"
"RHK"
"NQST"
"DE"
"FWY"
"C"
"G"
"P"
Base.rand
— MethodIt chooses from the 20 natural residues (it doesn't generate gaps).
julia> using MIToS.MSA
julia> using Random
julia> Random.seed!(1); # Reseed the random number generator.
julia> rand(Residue)
R
julia> rand(Residue, 4, 4)
4×4 Matrix{Residue}:
E D D A
F S K K
M S I M
Y F E D
MIToS.MSA.adjustreference
— FunctionCreates a new matrix of residues. This function deletes positions/columns of the MSA with gaps in the reference (first) sequence.
MIToS.MSA.adjustreference!
— FunctionIt removes positions/columns of the MSA with gaps in the reference (first) sequence.
MIToS.MSA.annotate_modification!
— MethodAnnotates on file annotations the modifications realized by MIToS on the MSA. It always returns true
, so It can be used in a boolean context.
MIToS.MSA.annotations
— MethodThe annotations
function returns the Annotations
of an annotated MSA or aligned sequence. If the object is not annotated, it returns an empty Annotations
object.
MIToS.MSA.column_index
— Methodcolumn_index(msa, col_name)
Return the index (integer position) of the column with name col_name
in the MSA msa
. A KeyError
is thrown if the column name does not exist. If col_name
is an integer, the same integer is returned without checking if it is a valid index.
MIToS.MSA.columngapfraction
— MethodFraction of gaps per column/position on the MSA
MIToS.MSA.columnname_iterator
— Methodcolumnname_iterator(msa)
It returns an iterator that returns the column names of the msa
. If the msa
is a Matrix{Residue}
this function returns the actual column numbers as strings. Otherwise it returns the column number of the original MSA through the wrapped NamedArray
column names.
MIToS.MSA.columnnames
— Methodcolumnnames(msa)
It returns a Vector{String}
with the sequence names/identifiers. If the msa
is a Matrix{Residue}
this function returns the actual column numbers as strings. Otherwise it returns the column number of the original MSA through the wrapped NamedArray
column names.
MIToS.MSA.columnpairsmatrix
— MethodInitialize an empty PairwiseListMatrix
for a pairwise measure in sequence pairs. It uses the sequence names if they are available, otherwise it uses the actual sequence numbers. You can use the positional argument to indicate the number Type
(default: Float64
), if the PairwiseListMatrix
should store the diagonal values on the list (default: false
) and a default value for the diagonal (default: NaN
).
MIToS.MSA.coverage
— MethodCoverage of the sequences with respect of the number of positions on the MSA
MIToS.MSA.delete_annotated_modifications!
— MethodDeletes all the MIToS annotated modifications
MIToS.MSA.deletefullgapcolumns!
— FunctionDeletes columns with 100% gaps, this columns are generated by inserts.
MIToS.MSA.filtercolumns!
— Functionfiltercolumns!(msa, mask[, annotate::Bool=true])
It allows to filter MSA or aligned sequence columns/positions using a AbstractVector{Bool}
mask
. Annotations are updated if annotate
is true
(default).
MIToS.MSA.filtercolumns!
— Methodfiltercolumns!(data::Annotations, mask)
It is useful for deleting column annotations (creating a subset in place).
MIToS.MSA.filtercolumns
— MethodIt's similar to filtercolumns!
but for an AbstractMatrix{Residue}
MIToS.MSA.filtersequences!
— Functionfiltersequences!(msa, mask[, annotate::Bool=true])
It allows to filter msa
sequences using a AbstractVector{Bool}
mask
(It removes sequences with false
values). AnnotatedMultipleSequenceAlignment
annotations are updated if annotate
is true
(default).
MIToS.MSA.filtersequences!
— Methodfiltersequences!(data::Annotations, ids::Vector{String}, mask::AbstractArray{Bool,1})
It is useful for deleting sequence annotations. ids
should be a list of the sequence names and mask
should be a logical vector.
MIToS.MSA.filtersequences
— MethodIt's similar to filtersequences!
but for an AbstractMatrix{Residue}
MIToS.MSA.gapfraction
— MethodIt calculates the fraction of gaps on the Array
(alignment, sequence, column, etc.). This function can take an extra dimension
argument for calculation of the gap fraction over the given dimension.
MIToS.MSA.gapstrip!
— FunctionThis functions deletes/filters sequences and columns/positions on the MSA on the following order:
- Removes all the columns/position on the MSA with gaps on the reference (first) sequence.
- Removes all the sequences with a coverage with respect to the number of columns/positions on the MSA less than a
coveragelimit
(default to0.75
: sequences with 25% of gaps). - Removes all the columns/position on the MSA with more than a
gaplimit
(default to0.5
: 50% of gaps).
MIToS.MSA.gapstrip
— MethodCreates a new matrix of Residue
s (MSA) with deleted sequences and columns/positions. The MSA is edited in the following way:
- Removes all the columns/position on the MSA with gaps on the reference (first) sequence
- Removes all the sequences with a coverage with respect to the number of columns/positions on the MSA less than a
coveragelimit
(default to0.75
: sequences with 25% of gaps) - Removes all the columns/position on the MSA with more than a
gaplimit
(default to0.5
: 50% of gaps)
MIToS.MSA.getannotcolumn
— Functiongetannotcolumn(ann[, feature[,default]])
It returns per column annotation for feature
MIToS.MSA.getannotfile
— Functiongetannotfile(ann[, feature[,default]])
It returns per file annotation for feature
MIToS.MSA.getannotresidue
— Functiongetannotresidue(ann[, seqname, feature[,default]])
It returns per residue annotation for (seqname, feature)
MIToS.MSA.getannotsequence
— Functiongetannotsequence(ann[, seqname, feature[,default]])
It returns per sequence annotation for (seqname, feature)
MIToS.MSA.getcolumnmapping
— MethodIt returns a Vector{Int}
with the original column number of each column on the actual MSA. The mapping is annotated in the ColMap
file annotation of an AnnotatedMultipleSequenceAlignment
or in the column names of an NamedArray
or MultipleSequenceAlignment
.
NOTE: When the MSA results from vertically concatenating MSAs using vcat
, the column map annotations from the constituent MSAs (such as 1_ColMap
, 2_ColMap
, etc.) are not returned. Instead, the column numbers referenced in the column names are provided. To access the original annotations, utilize the getannotfile
function.
MIToS.MSA.gethcatmapping
— MethodIt returns a vector of numbers from 1
to N for each column that indicates the source MSA. The mapping is annotated in the "HCat"
file annotation of an AnnotatedMultipleSequenceAlignment
or in the column names of an NamedArray
or MultipleSequenceAlignment
.
NOTE: When the MSA results from vertically concatenating MSAs using vcat
, the "HCat"
annotations from the constituent MSAs are renamed as "1_HCat"
, "2_HCat"
, etc. In that case, the MSA numbers referenced in the column names are provided. To access the original annotations, utilize the getannotfile
function.
MIToS.MSA.getnamedict
— MethodIt takes a ResidueAlphabet
and returns a dictionary from group name to group position.
julia> using MIToS.MSA
julia> ab = ReducedAlphabet("(AILMV)(RHK)(NQST)(DE)(FWY)CGP")
ReducedAlphabet of length 8 : "(AILMV)(RHK)(NQST)(DE)(FWY)CGP"
julia> getnamedict(ab)
OrderedCollections.OrderedDict{String, Int64} with 8 entries:
"AILMV" => 1
"RHK" => 2
"NQST" => 3
"DE" => 4
"FWY" => 5
"C" => 6
"G" => 7
"P" => 8
MIToS.MSA.getresidues
— Methodgetresidues
allows you to access the residues stored inside an MSA or aligned sequence as a Matrix{Residue}
without annotations nor column/row names.
MIToS.MSA.getresiduesequences
— Methodgetresiduesequences
returns a Vector{Vector{Residue}}
with all the MSA sequences without annotations nor column/sequence names.
MIToS.MSA.getsequence
— Functiongetsequence
takes an MSA and a sequence number or identifier and returns an aligned sequence object. If the MSA is an AnnotatedMultipleSequenceAlignment
, it returns an AnnotatedAlignedSequence
with the sequence annotations. From a MultipleSequenceAlignment
, It returns an AlignedSequence
object. If an Annotations
object and a sequence identifier are used, this function returns the annotations related to the sequence.
MIToS.MSA.getsequencemapping
— MethodIt returns the sequence coordinates as a Vector{Int}
for an MSA sequence. That vector has one element for each MSA column. If the number if 0
in the mapping, there is a gap in that column for that sequence.
MIToS.MSA.getweight
— Methodgetweight(c[, i::Int])
This function returns the weight of the sequence number i
. getweight should be defined for any type used for frequencies!
/frequencies
in order to use his weigths. If i
isn't used, this function returns a vector with the weight of each sequence.
MIToS.MSA.hobohmI
— MethodSequence clustering using the Hobohm I method from Hobohm et. al.
References
MIToS.MSA.join_msas
— Methodjoin_msas(msa_a::AnnotatedMultipleSequenceAlignment,
msa_b::AnnotatedMultipleSequenceAlignment,
pairing;
kind::Symbol=:outer,
axis::Int=1)::AnnotatedMultipleSequenceAlignment
join_msas(msa_a::AnnotatedMultipleSequenceAlignment,
msa_b::AnnotatedMultipleSequenceAlignment,
positions_a,
positions_b;
kind::Symbol=:outer,
axis::Int=1)::AnnotatedMultipleSequenceAlignment
Join two Multiple Sequence Alignments (MSAs), msa_a
and msa_b
, based on specified matching positions or names. The function supports two formats: one takes a pairing
argument as a list of correspondences, and the other takes positions_a
and positions_b
as separate lists indicating matching positions or names in each MSA. This function allows for various types of join operations (:inner
, :outer
, :left
, :right
) and can merge MSAs by sequences (axis
1
) or by columns (axis
2
).
Parameters:
msa_a::AnnotatedMultipleSequenceAlignment
: The first MSA.msa_b::AnnotatedMultipleSequenceAlignment
: The second MSA.pairing
: An iterable where each element is a pair of sequence or column positions (Int
s) or names (String
s) to match betweenmsa_a
andmsa_b
. For example, it can be a list of two-element tuples or pairs, or andOrderedDict
.positions_a
,positions_b
: Separate lists of positions or names inmsa_a
andmsa_b
, respectively.kind::Symbol
: Type of join operation. Default is:outer
.axis::Int
: The axis along which to join (1
to match sequences,2
to match columns).
Returns:
AnnotatedMultipleSequenceAlignment
: A new MSA resulting from the join operation.
Behavior and Sequence Ordering:
The order of sequences or columns in the resulting MSA depends on the kind
of join operation and the order of elements in the pairing
or positions_a
and positions_b
lists.
- For
:inner
joins, the function returns an MSA containing only those sequences/columns that are paired in bothmsa_a
andmsa_b
. The order of elements in the output MSA follows the order in thepairing
or position lists. - For
:outer
joins, the output MSA includes all sequences/columns from bothmsa_a
andmsa_b
. Unpaired sequences/columns are filled with gaps as needed. The sequences/columns frommsa_a
are placed first. If thepairing
or position lists are sorted, the output MSA columns and sequences will keep the same order as in the inputs. That's nice for situations such as profile alignments where the order of columns is important. If thepairing
or position lists are not sorted, then the order of sequences/columns in the output MSA is not guaranteed to be the same as in the inputs. In particular, the matched sequences or columns will be placed first, followed by the unmatched ones. - For
:left
joins, all sequences/columns frommsa_a
are included in the output MSA keeping the same order as inmsa_a
. Sequences/columns frommsa_b
are added where matches are found, with gaps filling the unmatched positions. - For
:right
joins, the output MSA behaves like:left
joins but with roles ofmsa_a
andmsa_b
reversed.
Warning: When using Dict
for pairing, the order of elements might not be preserved as expected. Dict
in Julia does not maintain the order of its elements, which might lead to unpredictable order of sequences/columns in the output MSA. To preserve order, it is recommended to use an OrderedDict
or a list of Pair
s objects.
MIToS.MSA.meanpercentidentity
— FunctionReturns the mean of the percent identity between the sequences of a MSA. If the MSA has 300 sequences or less, the mean is exact. If the MSA has more sequences and the exact
keyword is false
(defualt), 44850 random pairs of sequences are used for the estimation. The number of samples can be changed using the second argument. Use exact=true
to perform all the pairwise comparison (the calculation could be slow).
MIToS.MSA.namedmatrix
— MethodThe namedmatrix
function returns the NamedResidueMatrix{Array{Residue,2}}
stored in an MSA or aligned sequence.
MIToS.MSA.ncolumns
— Methodncolumns
returns the number of MSA columns or positions.
MIToS.MSA.ncolumns
— Methodncolumns(ann::Annotations)
returns the number of columns/residues with annotations. This function returns -1
if there is not annotations per column/residue.
MIToS.MSA.nsequences
— Methodnsequences
returns the number of sequences on the MSA.
MIToS.MSA.percentidentity
— Methodpercentidentity(seq1, seq2, threshold)
Computes quickly if two aligned sequences have a identity value greater than a given threshold
value. Returns a boolean value. Positions with gaps in both sequences doesn't count to the length of the sequences. Positions with a XAA
in at least one sequence aren't counted.
MIToS.MSA.percentidentity
— Methodpercentidentity(seq1, seq2)
Calculates the fraction of identities between two aligned sequences. The identity value is calculated as the number of identical characters in the i-th position of both sequences divided by the length of both sequences. Positions with gaps in both sequences doesn't count to the length of the sequences. Positions with a XAA
in at least one sequence aren't counted.
MIToS.MSA.percentidentity
— Methodpercentidentity(msa[, out::Type=Float64])
Calculates the identity between all the sequences on a MSA. You can indicate the output element type with the last optional parameter (Float64
by default). For a MSA with a lot of sequences, you can use Float32
or Flot16
in order to avoid the OutOfMemoryError()
.
MIToS.MSA.percentsimilarity
— FunctionCalculates the similarity percent between two aligned sequences. The 100% is the length of the aligned sequences minus the number of columns with gaps in both sequences and the number of columns with at least one residue outside the alphabet. So, columns with residues outside the alphabet (other than the specially treated GAP
) aren't counted to the protein length. Two residues are considered similar if they below to the same group in a ReducedAlphabet
. The alphabet
(third positional argument) by default is:
ReducedAlphabet("(AILMV)(NQST)(RHK)(DE)(FWY)CGP")
The first group is composed of the non polar residues (AILMV)
, the second group is composed of polar residues, the third group are positive residues, the fourth group are negative residues, the fifth group is composed by the aromatic residues (FWY)
. C
, G
and P
are considered unique residues.
Other residue groups/alphabets:
SMS (Sequence Manipulation Suite) Ident and Sim (Stothard Paul. 2000):
ReducedAlphabet("(GAVLI)(FYW)(ST)(KRH)(DENQ)P(CM)")
Stothard P (2000) The Sequence Manipulation Suite: JavaScript programs for analyzing and formatting protein and DNA sequences. Biotechniques 28:1102-1104.
Bio3D 2.2 seqidentity (Grant, Barry J., et al. 2006):
ReducedAlphabet("(GA)(MVLI)(FYW)(ST)(KRH)(DE)(NQ)PC")
References
- Stothard, Paul. "The sequence manipulation suite: JavaScript programs for analyzing and formatting protein and DNA sequences." Biotechniques 28.6 (2000): 1102-1104.
- Grant, Barry J., et al. "Bio3d: an R package for the comparative analysis of protein structures." Bioinformatics 22.21 (2006): 2695-2696.
MIToS.MSA.percentsimilarity
— MethodCalculates the similarity percent between all the sequences on a MSA. You can indicate the output element type with the out
keyword argument (Float64
by default). For an MSA with a lot of sequences, you can use out=Float32
or out=Flot16
in order to avoid the OutOfMemoryError()
.
MIToS.MSA.printmodifications
— MethodPrints MIToS annotated modifications
MIToS.MSA.rename_sequences!
— Methodrename_sequences!(msa, newnames::Vector{T}) where {T<:AbstractString}
rename_sequences!(msa, old2new::AbstractDict)
rename_sequences!(msa, old2new::Pair...)
Rename the sequences of an MSA given a vector of new names, a dictionary mapping old names to new names, or one or more pairs going from old to new names. If the msa
is an AnnotatedMultipleSequenceAlignment
, the annotations are also updated. The function modifies the msa
in place and returns it.
MIToS.MSA.rename_sequences
— Methodrename_sequences(msa, newnames::Vector{T}) where {T<:AbstractString}
rename_sequences(msa, old2new::AbstractDict)
rename_sequences(msa, old2new::Pair...)
Rename the sequences of an MSA given a vector of new names, a dictionary mapping old names to new names, or one or more pairs going from old to new names. If the msa
is an AnnotatedMultipleSequenceAlignment
, the annotations are also updated. The function returns a new MSA with the sequences renamed without modifying the original MSA.
MIToS.MSA.residue2three
— MethodThis function returns the three letter name of the Residue
.
julia> using MIToS.MSA
julia> residue2three(Residue('G'))
"GLY"
MIToS.MSA.residuefraction
— MethodIt calculates the fraction of residues (no gaps) on the Array
(alignment, sequence, column, etc.). This function can take an extra dimension
argument for calculation of the residue fraction over the given dimension
MIToS.MSA.sequence_id
— Methodsequence_id(seq::Union{AbstractSequence,AbstractAlignedSequence})
It returns the sequence identifier of a sequence object.
MIToS.MSA.sequence_index
— Methodsequence_index(msa, seq_name)
Return the index (integer position) of the sequence with name seq_name
in the MSA msa
. A KeyError
is thrown if the sequence name does not exist. If seq_name
is an integer, the same integer is returned without checking if it is a valid index.
MIToS.MSA.sequencename_iterator
— Methodsequencename_iterator(msa)
It returns an iterator that returns the sequence names/identifiers of the msa
.
MIToS.MSA.sequencenames
— Methodsequencenames(msa)
It returns a Vector{String}
with the sequence names/identifiers.
MIToS.MSA.sequencepairsmatrix
— MethodInitialize an empty PairwiseListMatrix
for a pairwise measure in column pairs. It uses the column mapping (column number in the input MSA file) if it’s available, otherwise it uses the actual column numbers. You can use the positional argument to indicate the number Type
(default: Float64
), if the PairwiseListMatrix
should store the diagonal values on the list (default: false
) and a default value for the diagonal (default: NaN
).
MIToS.MSA.setannotcolumn!
— Functionsetannotcolumn!(ann, feature, annotation)
It stores per column annotation
(1 char per column) for feature
MIToS.MSA.setannotfile!
— Functionsetannotfile!(ann, feature, annotation)
It stores per file annotation
for feature
MIToS.MSA.setannotresidue!
— Functionsetannotresidue!(ann, seqname, feature, annotation)
It stores per residue annotation
(1 char per residue) for (seqname, feature)
MIToS.MSA.setannotsequence!
— Functionsetannotsequence!(ann, seqname, feature, annotation)
It stores per sequence annotation
for (seqname, feature)
MIToS.MSA.setreference!
— FunctionIt puts the sequence i
(name or position) as reference (first sequence) of the MSA. This function swaps the sequences 1 and i
.
MIToS.MSA.shuffle_msa!
— Methodshuffle_msa!([rng=default_rng(),] msa::AbstractMatrix{Residue}, subset=Colon(); dims=2, fixedgaps=true, fixed_reference=false)
In-place version of shuffle_msa
. It randomly permute residues in the MSA msa
along sequences (dims=1
) or columns (dims=2
, the default). The optional positional argument subset
allows to shuffle only a subset of the sequences or columns. The optional keyword argument fixedgaps
indicates if the gaps should remain their positions (true
by default). The optional keyword argument fixed_reference
indicates if the residues in the first sequence should remain in their positions (false
by default).
MIToS.MSA.shuffle_msa
— Methodshuffle_msa([rng=default_rng(),] msa::AbstractMatrix{Residue}, subset=Colon(); dims=2, fixedgaps=true, fixed_reference=false)
It randomly permute residues in the MSA msa
along sequences (dims=1
) or columns (dims=2
, the default). The optional positional argument subset
allows to shuffle only a subset of the sequences or columns. The optional keyword argument fixedgaps
indicates if the gaps should remain their positions (true
by default). The optional keyword argument fixed_reference
indicates if the residues in the first sequence should remain in their positions (false
by default). To shuffle in-place, see shuffle_msa!
.
julia> using MIToS.MSA
julia> using Random
julia> msa = hcat(res"RRE",res"DDK", res"G--")
3×3 Matrix{Residue}:
R D G
R D -
E K -
julia> Random.seed!(42);
julia> shuffle_msa(msa, dims=1, fixedgaps=true)
3×3 Matrix{Residue}:
G D R
R D -
E K -
julia> Random.seed!(42);
julia> shuffle_msa(msa, dims=1, fixedgaps=false)
3×3 Matrix{Residue}:
G D R
R - D
E K -
MIToS.MSA.stringsequence
— Methodstringsequence(seq)
stringsequence(msa, i::Int)
stringsequence(msa, id::String)
It returns the selected sequence as a String
.
MIToS.MSA.swapsequences!
— MethodIt swaps the sequences on the positions i
and j
of an MSA. Also it's possible to swap sequences using their sequence names/identifiers when the MSA object as names.
MIToS.MSA.three2residue
— MethodIt takes a three letter residue name and returns the corresponding Residue
. If the name isn't in the MIToS dictionary, a XAA
is returned.
julia> using MIToS.MSA
julia> three2residue("ALA")
A
MIToS.Utils.parse_file
— Functionparse_file(io, format[, output; generatemapping, useidcoordinates, deletefullgaps])
The keyword argument generatemapping
(false
by default) indicates if the mapping of the sequences ("SeqMap") and columns ("ColMap") and the number of columns in the original MSA ("NCol") should be generated and saved in the annotations. If useidcoordinates
is true
(default: false
) the sequence IDs of the form "ID/start-end" are parsed and used for determining the start and end positions when the mappings are generated. deletefullgaps
(true
by default) indicates if columns 100% gaps (generally inserts from a HMM) must be removed from the MSA.
Random.shuffle
— FunctionIt's like shuffle
but in-place. When a Matrix{Residue}
or a AbstractAlignedObject
(sequence or MSA) is used, you can indicate if the gaps should remain their positions using the last boolean argument.
DEPRECATED: This method is deprecated. Use shuffle_msa
instead.
Random.shuffle!
— FunctionIt's like Random.shuffle
. When a Matrix{Residue}
is used, you can indicate if the gaps should remain their positions using the last boolean argument. The previous argument should be the dimension to shuffle, 1 for shuffling residues in a sequence (row) or 2 for shuffling residues in a column.
DEPRECATED: This method is deprecated. Use shuffle_msa!
instead.