CompariMotif.jl
Clean-room, unofficial Julia implementation of the motif–motif comparison strategy described by Edwards, Davey and Shields (Bioinformatics 24(10):1307–1309, 2008). It supports the comparison of protein, DNA and RNA motifs, represented as regular expressions.
API
- ComparisonOptions(; kwargs...)
- compare(a::AbstractString, b::AbstractString, options::ComparisonOptions)::ComparisonResult
- compare(motifs::AbstractVector{<:AbstractString}, db::AbstractVector{<:AbstractString}, options::ComparisonOptions)::Matrix{ComparisonResult}
- compare(motifs::AbstractVector{<:AbstractString}, options::ComparisonOptions)::Matrix{ComparisonResult}
- normalize_motif(motif::AbstractString; alphabet = :protein)::String
- to_column_table(result_or_results)::NamedTuple
Minimal example
using CompariMotif
using DataFrames
motifs = ["RKLI", "R[KR]L[IV]", "[KR]xLx[FYLIMVP]", "RxLE"]
options = ComparisonOptions(; min_shared_positions = 1, normalized_ic_cutoff = 0.0)
results = compare(motifs, options)
results[3, 4] # single pair summary
table = to_column_table(results)
df = DataFrame(table)
The to_column_table output can also be written to disk, for example with CSV.write("comparimotif_results.tsv", table; delim = '\t').
Allowed regex symbols and syntax
Motif parsing supports a controlled regex-like subset.
- Fixed residues from the selected alphabet:
  - protein (alphabet=:protein, default): ARNDCQEGHILKMFPSTWYV
  - DNA (alphabet=:dna): ACGT
  - RNA (alphabet=:rna): ACGU
- Wildcards: x, X, and . are equivalent and mean "any residue in the selected alphabet".
- Character classes: [KR] includes the listed residues. [^P] is negation within the selected alphabet only.
- Anchors: ^ and $ indicate the N- and C-terminus for protein motifs, or the 5' and 3' ends for nucleic acid motifs.
- Repeat quantifiers: {n}, {m,n}.
- Grouping and alternation: (...) for grouping and | for alternatives (for example A(K|Q)LI).
- Whitespace: ignored inside motifs.
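To make the subset above concrete, here is a minimal sketch that translates a motif into a plain Julia Regex for scanning a protein sequence. The helper name motif_to_regex is hypothetical and is not part of CompariMotif.jl, which parses motifs into its own internal representation. One deliberate simplification: [^P] becomes an ordinary regex negation here, so it is not restricted to the selected alphabet.

```julia
# Protein alphabet in the order listed above.
const PROTEIN_ALPHABET = "ARNDCQEGHILKMFPSTWYV"

# Hypothetical helper (not part of CompariMotif.jl): translate the motif
# subset into a Julia Regex so a motif can be scanned against a sequence.
function motif_to_regex(motif::AbstractString; alphabet::AbstractString = PROTEIN_ALPHABET)
    cleaned = uppercase(replace(motif, r"\s+" => ""))  # whitespace is ignored
    io = IOBuffer()
    in_class = false
    for c in cleaned
        if c == '['
            in_class = true
        elseif c == ']'
            in_class = false
        end
        if !in_class && (c == 'X' || c == '.')
            # x, X and . all mean "any residue in the selected alphabet"
            print(io, '[', alphabet, ']')
        else
            print(io, c)
        end
    end
    return Regex(String(take!(io)))
end

occursin(motif_to_regex("R x L [IV]"), "AARKLIAA")   # true: RKLI occurs
occursin(motif_to_regex("^RxL"), "ARKL")             # false: ^ anchors the N-terminus
occursin(motif_to_regex("A(K|Q)LI"), "GAQLI")        # true: Q alternative
```

This only shows how the symbols read as a pattern; it says nothing about how motifs are compared with each other.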
Official implementation
The official CompariMotif implementation is distributed as part of SLiMSuite: https://github.com/slimsuite/SLiMSuite (tool path: tools/comparimotif_V3.py).
Scope differences compared to the original CompariMotif
This package implements the paper-defined motif comparison core, but it does not aim to replicate the full SLiMSuite application surface. In particular:
- no standalone CLI interface or SLiMSuite pipeline integration;
- no raw .tdt compatibility/output mode (use to_column_table for tabular outputs);
- no Name*/Desc* metadata fields in API results or fixtures (regex motifs only);
- no XGMML/network export outputs.
Fixtures and oracle regeneration
Oracle fixtures, i.e. expected results for black-box tests, are committed under data/fixtures/, so the tests do not call the CompariMotif code directly. Only normalized TSV fixtures are committed, not the raw .tdt output. To regenerate the fixtures, see the README.md in data/fixtures/.
Default options parity
Compared against the upstream CompariMotif oracle as a black-box executable (without reading source code), package defaults match:
- min_shared_positions = 2 (minshare=2)
- normalized_ic_cutoff = 0.5 (normcut=0.5)
- matchfix = MatchFixNone (matchfix=0)
- mismatches = 0
- allow_ambiguous_overlap = true (overlaps=T)
License hygiene
This repository is MIT-licensed. Implementation is derived from the paper and black-box oracle observations only. GPL CompariMotif source code is not used.
Citation
If you use this Julia pipeline in scientific work, please cite the original algorithm paper:
- Edwards RJ, Davey NE, Shields DC. CompariMotif: quick and easy comparisons of sequence motifs. Bioinformatics 24(10):1307-1309 (2008). https://doi.org/10.1093/bioinformatics/btn105
Public API
CompariMotif.ComparisonOptions — Type
ComparisonOptions
Reusable configuration object for CompariMotif comparisons.
Construct once with ComparisonOptions(; kwargs...) and reuse across many compare calls.
Keywords
- alphabet::Symbol = :protein: comparison alphabet (:protein, :dna, or :rna).
- min_shared_positions::Int = 2: minimum number of matched, non-wildcard positions required for a hit.
- normalized_ic_cutoff::Real = 0.5: minimum normalized information content.
- matchfix::Union{MatchFixMode, Symbol, AbstractString} = MatchFixNone: fixed-position matching mode. Accepted symbol/string aliases are: none, query_fixed (query), search_fixed (search), both_fixed (both).
- mismatches::Int = 0: tolerated count of defined-position mismatches.
- allow_ambiguous_overlap::Bool = true: whether partial class overlaps are allowed as complex matches.
- max_variants::Int = 10_000: maximum expanded variants per motif.
Examples
julia> using CompariMotif
julia> opts = ComparisonOptions(; alphabet = :rna);
julia> String(opts.alphabet)
"ACGU"
See also MatchFixMode, compare, ComparisonResult.
CompariMotif.ComparisonOptions — Method
ComparisonOptions(; kwargs...) -> ComparisonOptions
Construct a reusable options object for motif comparisons.
julia> using CompariMotif
julia> opts = ComparisonOptions(; alphabet = :dna, min_shared_positions = 1);
julia> String(opts.alphabet)
"ACGT"
CompariMotif.ComparisonResult — Type
ComparisonResult
Result record produced by compare for one query/search motif pair.
Fields:
- query, search: original input motifs.
- normalized_query, normalized_search: canonicalized motifs used internally.
- matched: whether the best-scoring valid alignment passed all thresholds.
- query_relationship, search_relationship: human-readable relationship labels.
- matched_pattern: consensus/overlap pattern for the selected alignment.
- matched_positions: count of matched non-wildcard positions.
- match_ic: total information content for matched positions.
- normalized_ic: match_ic normalized by the lower motif information content.
- core_ic: information content normalized by core overlap length.
- score: derived summary score (normalized_ic * matched_positions).
- query_information, search_information: total information content per motif.
See also ComparisonOptions, normalize_motif, to_column_table.
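The two derived fields, normalized_ic and score, follow directly from the other fields. The sketch below restates that arithmetic with made-up numbers; the helper names are illustrative and are not part of CompariMotif.jl.

```julia
# Illustrative helpers only; these names are not part of CompariMotif.jl.

# normalized_ic: match_ic divided by the lower of the two motif totals.
normalized_ic(match_ic, query_information, search_information) =
    match_ic / min(query_information, search_information)

# score: normalized_ic scaled by the number of matched positions.
summary_score(norm_ic, matched_positions) = norm_ic * matched_positions

# Made-up numbers: 3 matched positions carrying 2.5 units of information,
# against motifs with 4.0 and 2.5 units in total.
norm = normalized_ic(2.5, 4.0, 2.5)   # 1.0
summary_score(norm, 3)                # 3.0
```

Normalizing by the lower total means a short motif fully contained in a longer one can still reach a normalized_ic of 1.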
CompariMotif.MatchFixMode — Type
MatchFixMode
Fixed-position matching behavior used by CompariMotif:
- MatchFixNone: no fixed-position requirement.
- MatchFixQueryFixed: fixed query positions must have exact fixed matches.
- MatchFixSearchFixed: fixed search positions must have exact fixed matches.
- MatchFixBothFixed: enforce fixed-position matching on both motifs.
Used by the matchfix keyword in ComparisonOptions.
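As a rough illustration of how four modes and their aliases can be funnelled into one enum, here is a self-contained sketch. All names prefixed with Toy or toy_ are hypothetical stand-ins rather than the package's types, and the lowercasing and trimming of string input is an assumption.

```julia
# Toy enum mirroring the four documented modes (hypothetical names).
@enum ToyMatchFix ToyNone ToyQueryFixed ToySearchFixed ToyBothFixed

# Alias table built from the aliases listed for the matchfix keyword.
const TOY_ALIASES = Dict(
    :none => ToyNone,
    :query_fixed => ToyQueryFixed, :query => ToyQueryFixed,
    :search_fixed => ToySearchFixed, :search => ToySearchFixed,
    :both_fixed => ToyBothFixed, :both => ToyBothFixed,
)

# One method per accepted input type, all funnelling into the enum.
toy_coerce(mode::ToyMatchFix) = mode
function toy_coerce(mode::Symbol)
    haskey(TOY_ALIASES, mode) || throw(ArgumentError("unknown matchfix mode: $mode"))
    return TOY_ALIASES[mode]
end
# Assumption: strings are trimmed and lowercased before lookup.
toy_coerce(mode::AbstractString) = toy_coerce(Symbol(lowercase(strip(mode))))

# Which side must have exact fixed-position matches under each mode.
query_fixed_required(mode::ToyMatchFix) = mode in (ToyQueryFixed, ToyBothFixed)
search_fixed_required(mode::ToyMatchFix) = mode in (ToySearchFixed, ToyBothFixed)

toy_coerce("both")                          # ToyBothFixed
query_fixed_required(toy_coerce(:search))   # false
```

Dispatching on the input type keeps the comparison code itself working with a single concrete mode value.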
CompariMotif.compare — Function
compare(a::AbstractString, b::AbstractString, options::ComparisonOptions) -> ComparisonResult
compare(motifs::AbstractVector{<:AbstractString},
db::AbstractVector{<:AbstractString},
options::ComparisonOptions) -> Matrix{ComparisonResult}
compare(motifs::AbstractVector{<:AbstractString},
options::ComparisonOptions) -> Matrix{ComparisonResult}
Compare motifs according to the CompariMotif scoring scheme described in Edwards et al. (2008).
- Pairwise mode compares one query motif against one search motif.
- Matrix mode computes all pairwise query-vs-database comparisons.
- All-vs-all mode is a convenience alias for compare(motifs, motifs, options).
Examples
julia> using CompariMotif
julia> options = ComparisonOptions(; min_shared_positions = 1, normalized_ic_cutoff = 0.0);
julia> result = compare("RKLI", "R[KR]L[IV]", options);
julia> result.matched
true
Configure thresholds and matching semantics with ComparisonOptions. Pairwise mode returns a ComparisonResult; in matrix mode the result matrix has size (length(motifs), length(db)). Use normalize_motif for deterministic motif canonicalization. Convert results to column tables with to_column_table.
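The relationship between the three modes can be sketched without the package itself. In the snippet below, pairwise_matrix and same_length are hypothetical names, and same_length is only a toy stand-in for a real pairwise comparison.

```julia
# Generic sketch: `f` stands in for a pairwise comparison function.
pairwise_matrix(f, queries, db) = [f(q, s) for q in queries, s in db]

# All-vs-all is the same call with the query set reused as the database.
pairwise_matrix(f, motifs) = pairwise_matrix(f, motifs, motifs)

same_length(q, s) = length(q) == length(s)   # toy stand-in for a comparison
queries = ["RKLI", "RxLE"]
db = ["RKLI", "R[KR]L[IV]", "RxL"]

m = pairwise_matrix(same_length, queries, db)
size(m)    # (2, 3)
m[2, 1]    # true: "RxLE" and "RKLI" both have 4 characters
```

Rows index the query motifs and columns index the database motifs, which matches the (length(motifs), length(db)) shape.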
CompariMotif.compare — Method
compare(a::AbstractString, b::AbstractString, options::ComparisonOptions) -> ComparisonResult
Pairwise motif comparison.
CompariMotif.compare — Method
compare(motifs, db, options) -> Matrix{ComparisonResult}
Compare all query motifs against all search-database motifs.
CompariMotif.compare — Method
compare(motifs, options) -> Matrix{ComparisonResult}
Convenience all-vs-all matrix mode.
CompariMotif.normalize_motif — Method
normalize_motif(motif::AbstractString; alphabet::Symbol = :protein) -> String
Parse and canonicalize a motif expression into a deterministic representation. Supported syntax includes fixed residues from the selected alphabet, bracket classes (including negation), x/X/. wildcards, ^/$ termini, and {n}/{m,n} repeat quantifiers. Grouping with (...) and alternation with | are also supported.
Wildcard tokens x, X, and . are equivalent and each means "any residue" in the selected alphabet (:protein, :dna, or :rna).
Examples
julia> using CompariMotif
julia> normalize_motif("r[kR].{0,1}l")
"R[RK]x{0,1}L"
Configure thresholds and matching semantics with ComparisonOptions. Compute similarities with compare.
CompariMotif.to_column_table — Method
to_column_table(results::AbstractMatrix{<:ComparisonResult}) -> NamedTuple
Convert a result matrix to a column table with query_index and search_index.
CompariMotif.to_column_table — Method
to_column_table(results::AbstractVector{<:ComparisonResult}) -> NamedTuple
Convert a result vector to a column table with result_index.
CompariMotif.to_column_table — Method
to_column_table(results) -> NamedTuple
Convert comparison results into a column-oriented NamedTuple where each key is a column name and each value is a vector column.
- to_column_table(::ComparisonResult) returns a one-row table.
- to_column_table(::AbstractVector{<:ComparisonResult}) adds result_index.
- to_column_table(::AbstractMatrix{<:ComparisonResult}) adds query_index and search_index with one row per matrix cell in deterministic row-major order.
The returned object can be converted to a DataFrame or written using CSV.write without requiring either dependency in the package itself.
Examples
julia> using CompariMotif, DataFrames
julia> motifs = ["RKLI", "R[KR]L[IV]"];
julia> options = ComparisonOptions(; min_shared_positions = 1, normalized_ic_cutoff = 0.0);
julia> table = to_column_table(compare(motifs, options));
julia> df = DataFrame(table);
julia> show(select(df, [:query_index, :search_index, :query, :search, :query_relationship]), allrows = true, allcols = true, truncate = 0)
4×5 DataFrame
 Row │ query_index  search_index  query       search      query_relationship
     │ Int64        Int64         String      String      String
─────┼───────────────────────────────────────────────────────────────────────
   1 │           1             1  RKLI        RKLI        Exact Match
   2 │           1             2  RKLI        R[KR]L[IV]  Variant Match
   3 │           2             1  R[KR]L[IV]  RKLI        Degenerate Match
   4 │           2             2  R[KR]L[IV]  R[KR]L[IV]  Exact Match
Compute similarities with compare, which returns ComparisonResult records.
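The row-major flattening used for matrices can be sketched independently of the package. The helper toy_column_table below is hypothetical and carries a single value column instead of the full set of result fields.

```julia
# Hypothetical helper: flatten a matrix into named columns, one row per cell.
function toy_column_table(values::AbstractMatrix)
    nq, ns = size(values)
    query_index = Int[]
    search_index = Int[]
    value = eltype(values)[]
    # Row-major order: the search (column) index varies fastest.
    for i in 1:nq, j in 1:ns
        push!(query_index, i)
        push!(search_index, j)
        push!(value, values[i, j])
    end
    return (; query_index, search_index, value)
end

t = toy_column_table(["a" "b"; "c" "d"])
t.query_index    # [1, 1, 2, 2]
t.search_index   # [1, 2, 1, 2]
t.value          # ["a", "b", "c", "d"]
```

A NamedTuple of equal-length vectors is already a valid Tables.jl column table, which is why no table dependency is needed to produce it.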
Internal API
CompariMotif.ResidueClass — Type
ResidueClass
Residue set encoded as a ResidueMask.
CompariMotif.ResidueMask — Type
ResidueMask
Bit-mask representation used for residue-set operations.
CompariMotif._canonical_token — Method
_canonical_token(position::_Position, options::ComparisonOptions) -> String
Render one parsed position into deterministic canonical motif syntax.
CompariMotif._class_mask — Method
_class_mask(raw::AbstractString, options::ComparisonOptions) -> ResidueMask
Parse a bracket class body into a residue mask.
CompariMotif._coerce_matchfix — Method
_coerce_matchfix(mode::AbstractString) -> MatchFixMode
Normalize string aliases into a concrete MatchFixMode.
CompariMotif._coerce_matchfix — Method
_coerce_matchfix(mode::MatchFixMode) -> MatchFixMode
Return the match-fix mode unchanged.
CompariMotif._coerce_matchfix — Method
_coerce_matchfix(mode::Symbol) -> MatchFixMode
Normalize symbol aliases into a concrete MatchFixMode.
CompariMotif._compare_parsed — Method
_compare_parsed(parsed_query, parsed_search, options) -> ComparisonResult
Compare two already-parsed motifs.
CompariMotif._compare_positions — Method
_compare_positions(qpos, spos, options)
Compare one query/search position pair and return matching diagnostics.
CompariMotif._empty_result_columns — Method
_empty_result_columns(nrows::Int) -> NamedTuple
Allocate typed result columns for nrows rows.
CompariMotif._evaluate_alignment — Method
_evaluate_alignment(query_variant, search_variant, shift, options)
Evaluate one concrete shift between two expanded motif variants. Returns _Candidate when all thresholds pass, otherwise nothing.
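To picture what "one concrete shift" means, the sketch below lists which positions of two motifs face each other at a given offset. The helper overlap_pairs is hypothetical, and the sign convention for shift is an assumption, not something taken from the package.

```julia
# Hypothetical helper: index pairs (query_pos, search_pos) that overlap when
# the search motif is offset by `shift` relative to the query motif.
# Assumed convention: query position i is aligned with search position i + shift.
function overlap_pairs(query_len::Int, search_len::Int, shift::Int)
    first_q = max(1, 1 - shift)
    last_q = min(query_len, search_len - shift)
    return [(i, i + shift) for i in first_q:last_q]
end

overlap_pairs(4, 4, 0)    # [(1, 1), (2, 2), (3, 3), (4, 4)]
overlap_pairs(4, 4, 1)    # [(1, 2), (2, 3), (3, 4)]
overlap_pairs(4, 4, -1)   # [(2, 1), (3, 2), (4, 3)]
overlap_pairs(4, 4, 4)    # empty: no overlap at this shift
```

An evaluation step would then score each aligned pair and discard the shift if any threshold is missed.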
CompariMotif._expand_variants — Method
_expand_variants(parsed::_ParsedMotif, options::ComparisonOptions) -> Vector{_MotifVariant}
Expand ranged-repeat motifs into concrete variant sequences.
CompariMotif._is_better — Method
_is_better(candidate::_Candidate, best::Union{Nothing, _Candidate}) -> Bool
Apply deterministic candidate ordering:
1) higher match_ic, 2) more matched positions, 3) more exact fixed matches.
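A three-level ordering like this maps naturally onto lexicographic tuple comparison. The sketch below uses plain NamedTuples as candidates; exact_fixed is a hypothetical field name for the third criterion, and the package's own _Candidate layout is not shown here.

```julia
# Tuples compare lexicographically, which gives the three-level tie-break.
# `exact_fixed` is a hypothetical field name for the third criterion.
ranking_key(c) = (c.match_ic, c.matched_positions, c.exact_fixed)

# Any candidate beats "no candidate yet".
toy_is_better(candidate, best::Nothing) = true
toy_is_better(candidate, best) = ranking_key(candidate) > ranking_key(best)

a = (match_ic = 2.0, matched_positions = 3, exact_fixed = 1)
b = (match_ic = 2.0, matched_positions = 3, exact_fixed = 2)
toy_is_better(b, a)         # true: tie on the first two keys, more fixed matches
toy_is_better(a, b)         # false
toy_is_better(a, nothing)   # true
```

Because the comparison is strict, an exact tie keeps the earlier candidate, which is one way to keep results deterministic.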
CompariMotif._is_fixed — Method
_is_fixed(pos::_Position) -> Bool
Return true when pos encodes exactly one residue.
CompariMotif._is_terminus — Method
_is_terminus(pos::_Position) -> Bool
Return true when pos is a terminus anchor (^ or $).
CompariMotif._is_wildcard — Method
_is_wildcard(pos::_Position, options::ComparisonOptions) -> Bool
Return true when pos matches all residues in the selected alphabet.
CompariMotif._mask_from_char — Method
_mask_from_char(char::Char, options::ComparisonOptions) -> ResidueMask
Return the residue mask for one alphabet character.
CompariMotif._mask_to_chars — Method
_mask_to_chars(mask::ResidueMask, options::ComparisonOptions; as_lowercase = false) -> Vector{Char}
Materialize residues represented by a mask in canonical alphabet order.
CompariMotif._mask_to_symbol — Method
_mask_to_symbol(mask::ResidueMask, options::ComparisonOptions; as_lowercase = false, wildcard_symbol = "x") -> String
Render one residue mask as canonical motif syntax.
CompariMotif._match_symbol — Method
_match_symbol(qpos, spos, intersection, relation, mismatch, options) -> String
Render one output symbol for the overlap pattern.
CompariMotif._parse_motif — Method
_parse_motif(motif::AbstractString, options::ComparisonOptions) -> _ParsedMotif
Parse one motif string into canonical internal representation.
CompariMotif._parse_repeat_quantifier — Method
_parse_repeat_quantifier(text::AbstractString, i::Int) -> (Int, Int, Int)
Parse optional repeat quantifier at index i, returning (min, max, next_index).
CompariMotif._position_ic — Method
_position_ic(pos::_Position, options::ComparisonOptions) -> Float64
Compute information content for one parsed position.
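For intuition about per-position information content, the sketch below uses a log-scaled definition in which a fixed residue scores 1 and a full wildcard scores 0. This formula is an assumption based on one reading of the scheme in Edwards et al. (2008); it has not been checked against the package, and position_ic is a hypothetical name.

```julia
# Sketch under an assumed definition (not confirmed against the package):
# a position allowing k of the n alphabet residues scores 1 - log_n(k),
# so a fixed residue scores 1 and a full wildcard scores 0.
position_ic(k::Integer, n::Integer) = 1 - log(n, k)

position_ic(1, 20)    # 1.0   (fixed protein residue, e.g. R)
position_ic(2, 20)    # ≈ 0.77 (two-residue class, e.g. [KR])
position_ic(20, 20)   # 0.0   (wildcard x)
position_ic(2, 4)     # ≈ 0.5 (two-base DNA class, e.g. [AG])
```

Using the alphabet size as the logarithm base keeps scores comparable between protein and nucleic acid motifs.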
CompariMotif._query_fixed_required — Method
_query_fixed_required(mode::MatchFixMode) -> Bool
Return true when query fixed residues must match exactly.
CompariMotif._search_fixed_required — Method
_search_fixed_required(mode::MatchFixMode) -> Bool
Return true when search fixed residues must match exactly.
CompariMotif._set_result_row! — Method
_set_result_row!(columns, row::Int, result::ComparisonResult)
Write one ComparisonResult into preallocated column vectors.
CompariMotif._variant_count — Method
_variant_count(tokens::Vector{_Token}) -> BigInt
Return the number of expanded variants implied by repeat ranges.
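Counting variants before expanding them is a cheap way to enforce a cap such as max_variants. The sketch below uses a hypothetical token layout of (symbol, min_repeat, max_repeat); raising an error when the cap is exceeded is an assumption about the behavior, not a description of the package.

```julia
# Each token is (symbol, min_repeat, max_repeat); a hypothetical layout.
const ToyToken = Tuple{String, Int, Int}

# Number of concrete variants: product of the repeat-range sizes.
# BigInt avoids overflow for motifs with many wide ranges.
toy_variant_count(tokens::Vector{ToyToken}) =
    prod(BigInt(hi - lo + 1) for (_, lo, hi) in tokens; init = BigInt(1))

# Expand ranged repeats into concrete strings, refusing to exceed a cap.
# Assumption: exceeding the cap is treated as an error.
function toy_expand(tokens::Vector{ToyToken}; max_variants::Int = 10_000)
    n = toy_variant_count(tokens)
    n <= max_variants || throw(ArgumentError("motif expands to $n variants"))
    variants = [""]
    for (sym, lo, hi) in tokens
        variants = [v * sym^k for v in variants for k in lo:hi]
    end
    return variants
end

tokens = ToyToken[("R", 1, 1), ("x", 0, 1), ("L", 1, 1)]
toy_variant_count(tokens)   # 2
toy_expand(tokens)          # ["RL", "RxL"]
```

The count is computed from the ranges alone, so an oversized motif is rejected before any strings are allocated.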
CompariMotif.is_fixed — Method
is_fixed(a::ResidueClass) -> Bool
Return true when the residue class contains exactly one residue.
CompariMotif.is_subset — Method
is_subset(a::ResidueClass, b::ResidueClass) -> Bool
Return true when every residue in a is also in b.
CompariMotif.is_wildcard — Method
is_wildcard(a::ResidueClass, opts::ComparisonOptions) -> Bool
Return true when the residue class spans the full selected alphabet.
CompariMotif.overlaps — Method
overlaps(a::ResidueClass, b::ResidueClass) -> Bool
Return true when two residue classes share at least one residue.
CompariMotif.unionclass — Method
unionclass(a::ResidueClass, b::ResidueClass) -> ResidueClass
Return the set-union of two residue classes.
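The set operations above reduce to a few bitwise instructions once a residue class is stored as a mask. The following self-contained sketch shows the idea for the protein alphabet; ToyClass and the toy_ functions are hypothetical stand-ins, and the UInt32 width and bit layout are assumptions rather than the package's actual encoding.

```julia
const TOY_ALPHABET = "ARNDCQEGHILKMFPSTWYV"   # protein order from the README
const TOY_FULL = (UInt32(1) << length(TOY_ALPHABET)) - UInt32(1)

# Toy residue class: bit i-1 is set when the i-th alphabet residue is allowed.
struct ToyClass
    mask::UInt32
end

function toy_class(residues::AbstractString)
    mask = UInt32(0)
    for c in residues
        idx = findfirst(==(c), TOY_ALPHABET)
        idx === nothing && throw(ArgumentError("residue $c not in alphabet"))
        mask |= UInt32(1) << (idx - 1)
    end
    return ToyClass(mask)
end

toy_is_fixed(a::ToyClass) = count_ones(a.mask) == 1
toy_is_wildcard(a::ToyClass) = a.mask == TOY_FULL
toy_is_subset(a::ToyClass, b::ToyClass) = (a.mask & ~b.mask) == 0
toy_overlaps(a::ToyClass, b::ToyClass) = (a.mask & b.mask) != 0
toy_union(a::ToyClass, b::ToyClass) = ToyClass(a.mask | b.mask)

# Residues come back in alphabet order regardless of input order.
toy_chars(a::ToyClass) =
    [c for (i, c) in enumerate(TOY_ALPHABET) if (a.mask >> (i - 1)) & 1 == 1]

toy_is_subset(toy_class("K"), toy_class("KR"))         # true
toy_overlaps(toy_class("KR"), toy_class("IV"))         # false
toy_chars(toy_union(toy_class("K"), toy_class("R")))   # ['R', 'K']
```

Reading the bits back in alphabet order gives R before K, the same ordering as the [RK] class in the normalize_motif example.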