Internal API & Pipeline

This page documents the current implementation pipeline. Everything here is private and may change between releases; the stable package contract is the External API.

The running example below mirrors the worked comparison in Figure 1 of Edwards et al. (2008).

Pipeline Overview

The current code path for a pairwise comparison is:

_parse_motif strips whitespace, expands grouping and alternation, parses tokens, and produces a canonical normalized motif string.
_expand_variants resolves bounded repeat ranges into concrete motif variants with precomputed information content.
_find_precise_match checks exact full-length and exact subsequence relationships first to seed the best candidate before the full overlap search.
_evaluate_alignment scores each overlap candidate considered across the expanded variant pairs and relative shifts.
_compare_parsed keeps the best candidate and materializes the public ComparisonResult.

Figure 1 Worked Example

1. Parse and normalize the motifs

Parsing is the syntax-level normalization step. It canonicalizes residue classes and wildcard notation, records any alternation branches, and preserves bounded repeats in normalized form so later stages can expand them deliberately. Wildcard aliases are canonicalized to "." even where the current oracle has a grouped-alternation quirk, because the package intentionally treats x, X, and "." as equivalent syntax. Positive character classes are treated as sets, so duplicate residues are discarded even though the oracle can score them differently.

julia> options = ComparisonOptions(; min_shared_positions = 1, normalized_ic_cutoff = 0.0);
julia> parsed_query = CompariMotif._parse_motif("[KR].L.{0,1}[FYLIVMP]", options);
julia> parsed_search = CompariMotif._parse_motif("R.LE", options);
julia> parsed_query.normalized
"[RK].L.{0,1}[ILMFPYV]"

julia> parsed_search.normalized
"R.LE"
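
The normalization shown above can be approximated with a few string rewrites. The following is only a minimal sketch, not the package's parser: `canonical_class`, `sketch_normalize`, and `AA_ORDER` are hypothetical names, the residue ordering is inferred from the normalized output above rather than taken from the source, and the sketch assumes that x and X never appear inside a character class. Grouping, alternation, and negated classes are not handled.

```julia
# Hypothetical helpers for illustration; not part of CompariMotif.
# Assumed residue order, inferred from the normalized output shown above.
const AA_ORDER = "ARNDCQEGHILKMFPSTWYV"

function canonical_class(residues::AbstractString)
    # Treat the class as a set: uppercase, drop duplicates, sort by alphabet position.
    members = unique(collect(uppercase(residues)))
    for r in members
        occursin(r, AA_ORDER) || throw(ArgumentError("unknown residue: $r"))
    end
    sort!(members; by = r -> findfirst(==(r), AA_ORDER))
    return "[" * join(members) * "]"
end

function sketch_normalize(motif::AbstractString)
    text = replace(motif, r"\s+" => "")    # strip whitespace
    text = replace(text, r"[xX]" => ".")   # wildcard aliases become "."
    # Rewrite every positive class such as [KR] in canonical form.
    return replace(text, r"\[([A-Za-z]+)\]" => m -> canonical_class(m[2:end-1]))
end
```

Under these assumptions, `sketch_normalize("[KR].L.{0,1}[FYLIVMP]")` reproduces the normalized query string from the transcript, and a class with repeated residues such as `[KKR]` collapses to `[RK]`.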

2. Expand concrete variants

Variant expansion converts each parsed branch into one or more concrete motif variants with explicit positions and precomputed information content. This is the stage where repeat ranges become enumerated sequences, so all downstream alignment and scoring logic works with concrete variant objects rather than quantified syntax.

julia> spec = CompariMotif._alphabet_spec(options.alphabet);
julia> query_variants = CompariMotif._expand_variants(parsed_query, options, spec);
julia> search_variants = CompariMotif._expand_variants(parsed_search, options, spec);
julia> [variant.normalized for variant in query_variants]
2-element Vector{String}:
 "[RK].L[ILMFPYV]"
 "[RK].L.[ILMFPYV]"

julia> round.([variant.information for variant in query_variants], digits = 3)
2-element Vector{Float64}:
 2.119
 2.119

julia> only(search_variants).normalized
"R.LE"

julia> round(only(search_variants).information, digits = 3)
3.0
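
To make the enumeration step concrete, here is a rough string-level sketch of how a bounded wildcard range can be unrolled into concrete variants. `sketch_expand` is a hypothetical helper, and it assumes that ranges only ever follow a wildcard, as in `.{0,1}`; the real `_expand_variants` works on parsed tokens and also precomputes information content, which this sketch does not attempt.

```julia
# Hypothetical helper for illustration; not part of CompariMotif.
# Expands every wildcard range `.{m,n}` into runs of m..n literal dots.
function sketch_expand(motif::AbstractString)
    m = match(r"\.\{(\d+),(\d+)\}", motif)
    m === nothing && return [String(motif)]   # nothing left to expand
    lo, hi = parse(Int, m[1]), parse(Int, m[2])
    prefix = motif[1:m.offset-1]
    suffix = motif[m.offset+ncodeunits(m.match):end]
    variants = String[]
    for n in lo:hi
        # Recurse on the suffix so several ranges in one motif are all enumerated.
        for rest in sketch_expand(suffix)
            push!(variants, prefix * "."^n * rest)
        end
    end
    return variants
end
```

Applied to the normalized query, `sketch_expand("[RK].L.{0,1}[ILMFPYV]")` yields the same two variant strings, in the same order, as the transcript above.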

3. Check precise matches before overlap scoring

The precise-match pass looks for exact full-length and exact subsequence relationships among the expanded variants before the broader overlap search runs. The current implementation does not add a separate check for whether two motifs share enough amino acids at any position to merit further comparison; after the exact-match pass, it proceeds directly to the sliding-window overlap search. An exact hit seeds the current best candidate, but it does not short-circuit evaluation of the remaining overlaps. We tried implementing that, as suggested in the paper, but it did not improve performance in practice.

julia> found_precise, best_precise = CompariMotif._find_precise_match(query_variants, search_variants, options, spec);
julia> found_precise
false
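
As a loose illustration of what an exact check might look like, the sketch below compares variants as token lists and tests for token-for-token identity or contiguous containment. This is an assumption about the meaning of "exact subsequence", not the package's logic: `occurs_exactly`, `exact_relationship`, and the returned symbols are all hypothetical.

```julia
# Hypothetical helpers for illustration; not part of CompariMotif.
# Does `short` occur as a contiguous run of tokens inside `long`?
function occurs_exactly(short::Vector{String}, long::Vector{String})
    n, m = length(short), length(long)
    n <= m || return false
    return any(long[i:i+n-1] == short for i in 1:(m - n + 1))
end

# Classify one variant pair by exact token identity only.
function exact_relationship(query::Vector{String}, search::Vector{String})
    query == search && return :exact_match
    occurs_exactly(search, query) && return :search_within_query
    occurs_exactly(query, search) && return :query_within_search
    return :none
end
```

With the running example, neither query variant contains the tokens of R.LE verbatim, so this sketch also reports no exact relationship, in line with `found_precise` being `false`.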

4. Score the best overlap

Alignment scoring evaluates one query variant against one search variant at a specific relative shift. Each candidate carries the matched pattern, matched positions, relationship labels, and the information-content-derived metrics used for ranking. The current implementation ranks candidates by higher match_ic, then by higher matched_positions, then by higher score. If all of those still tie, the first candidate encountered in the shift scan is kept; this tie-break rule was inferred from black-box oracle tie cases. normalized_ic, core_ic, and score are still materialized on the candidate for inspection and output.

julia> query_variant = query_variants[2];
julia> search_variant = only(search_variants);
julia> candidate = CompariMotif._evaluate_alignment(query_variant, search_variant, 0, options, spec);
julia> candidate.matched_pattern
"[rk].Le"

julia> candidate.matched_positions
2

julia> round(candidate.normalized_ic, digits = 3)
0.835
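
The ranking rule described above maps naturally onto lexicographic tuple comparison. The sketch below assumes a simplified stand-in type; `SketchCandidate`, `rank_key`, `pick_best`, and the `label` field are hypothetical and exist only to show how a strict comparison keeps the first candidate on a full tie.

```julia
# Hypothetical stand-in for the internal candidate type; the three ranking
# field names follow the text above, `label` is added for illustration only.
struct SketchCandidate
    match_ic::Float64
    matched_positions::Int
    score::Float64
    label::String
end

# Julia compares tuples lexicographically, which gives the stated priority order.
rank_key(c::SketchCandidate) = (c.match_ic, c.matched_positions, c.score)

function pick_best(candidates)
    best = nothing
    for c in candidates
        # Strict `>` keeps the earlier candidate when every key component ties.
        if best === nothing || rank_key(c) > rank_key(best)
            best = c
        end
    end
    return best
end
```

Because replacement happens only on a strictly greater key, scan order alone decides between fully tied candidates, which is the behavior the paragraph above attributes to the current implementation.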

5. Materialize the public result

After all precise matches and overlap candidates have been considered, _compare_parsed keeps the strongest candidate and materializes it as the public ComparisonResult. This final step copies the winning alignment's relationships, matched pattern, and information-content summary into the stable API object returned by compare.

julia> result = compare("[KR].L.{0,1}[FYLIVMP]", "R.LE", options);
julia> (result.query_relationship, result.search_relationship)
("Degenerate Parent", "Variant Subsequence")

julia> round(result.match_ic, digits = 3)
1.769

julia> round(result.score, digits = 3)
1.669
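
The reported numbers can be reproduced by hand under a few assumptions, which are not confirmed by this page: a position allowing k of 20 residues carries -log_20(k/20) of information, each aligned pair contributes the information of its more degenerate member, normalized_ic divides match_ic by the lower total information of the two variants, and score is normalized_ic times matched_positions. The sketch works only with class sizes and does not check that the aligned residues are actually compatible; all names in it are hypothetical.

```julia
# Assumed per-position information content for a class of k residues out of 20.
position_ic(k::Integer) = -log(20, k / 20)

# Aligned variants at shift 0, written as class sizes (20 means wildcard).
query_sizes  = [2, 20, 1, 20, 7]   # [RK] . L . [ILMFPYV]
search_sizes = [1, 20, 1, 1]       # R . L E

overlap = min(length(query_sizes), length(search_sizes))

# Each aligned pair contributes the IC of its more degenerate member.
match_ic = sum(position_ic(max(query_sizes[i], search_sizes[i])) for i in 1:overlap)

# A position counts as matched only when neither side is a wildcard.
matched_positions = count(i -> query_sizes[i] < 20 && search_sizes[i] < 20, 1:overlap)

# Normalize by the lower-information variant, then weight by matched positions.
lower_ic = min(sum(position_ic, query_sizes), sum(position_ic, search_sizes))
normalized_ic = match_ic / lower_ic
score = normalized_ic * matched_positions
```

Rounded to three digits, these assumptions give a match_ic of 1.769, a normalized_ic of 0.835, and a score of 1.669, matching the transcript values.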

Internal Reference

CompariMotif._ParsedMotif — Type

_ParsedMotif

Internal parsed representation of one user-supplied motif.

Fields:

original: motif text exactly as supplied by the caller.
normalized: canonical motif text used for deterministic comparisons.
tokens: token sequence for the first parsed branch.
alternatives: token sequences, one for each expanded top-level alternation branch.


CompariMotif._MotifVariant — Type

_MotifVariant

Concrete motif variant obtained after expanding bounded repeat ranges.

Fields:

positions: fixed sequence of parsed positions used during alignment.
normalized: canonical motif text for this expanded variant.
information: total information content of the variant.
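
As a rough guide to how a variant's total information content could be computed, the sketch below sums a per-position term over the tokens of a variant. It assumes a 20-letter alphabet and the definition that a position allowing k residues carries -log_20(k/20); `position_ic`, `class_size`, and `variant_ic` are hypothetical names, and negated classes are not handled.

```julia
# Assumed definition: a position allowing k of N residues carries -log_N(k/N).
position_ic(k::Integer; alphabet_size::Integer = 20) =
    -log(alphabet_size, k / alphabet_size)

# Hypothetical helper: number of residues allowed by one normalized token.
function class_size(token::AbstractString; alphabet_size::Integer = 20)
    token == "." && return alphabet_size                  # wildcard allows everything
    startswith(token, "[") && return length(token) - 2   # positive class like [RK]
    return 1                                              # fixed residue
end

# Total information content of a variant given as a list of tokens.
variant_ic(tokens) = sum(position_ic(class_size(t)) for t in tokens)
```

Under these assumptions, `variant_ic(["[RK]", ".", "L", "[ILMFPYV]"])` rounds to 2.119 and `variant_ic(["R", ".", "L", "E"])` rounds to 3.0. An extra wildcard position adds nothing, which is consistent with both expanded query variants in the worked example reporting the same value.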


CompariMotif._parse_motif — Function

_parse_motif(motif::AbstractString, options::ComparisonOptions)::_ParsedMotif

Parse one motif string into its canonical internal representation.


CompariMotif._expand_variants — Function

_expand_variants(parsed::_ParsedMotif, options::ComparisonOptions, spec::_AlphabetSpec)::Vector{_MotifVariant}

Expand ranged-repeat motifs into concrete variant sequences.


CompariMotif._find_precise_match — Function

_find_precise_match(query_variants, search_variants, options, spec)

Search only for exact-match and exact-subsequence relationships. Returns (found_precise, best_candidate).


CompariMotif._evaluate_alignment — Function

_evaluate_alignment(query_variant, search_variant, shift, options, spec)

Evaluate one concrete shift between two expanded motif variants. Returns a _Candidate when all thresholds pass; otherwise returns nothing.
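
Because the return value may be `nothing`, a caller has to filter before ranking. The sketch below shows one way a shift scan could consume such a function. `scan_shifts` is hypothetical, the `evaluate` argument stands in for a call to `_evaluate_alignment` with the other arguments fixed, and the shift range (every offset with at least one overlapping position) is an assumption rather than the documented behavior.

```julia
# Hypothetical driver for illustration; not part of CompariMotif.
# `evaluate(shift)` is expected to return a candidate or `nothing`.
function scan_shifts(evaluate, query_len::Integer, search_len::Integer)
    kept = Any[]
    # Assumed range: every shift that leaves at least one overlapping position.
    for shift in -(search_len - 1):(query_len - 1)
        candidate = evaluate(shift)
        candidate === nothing && continue   # thresholds not met at this shift
        push!(kept, (shift, candidate))
    end
    return kept
end

# Usage with a toy evaluator that only accepts shift 0.
kept = scan_shifts(s -> s == 0 ? "hit" : nothing, 5, 4)
```

Candidates are kept in scan order, so a first-encountered tie-break can be applied to the result afterwards.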


CompariMotif._compare_parsed — Function

_compare_parsed(parsed_query, parsed_search, options)::ComparisonResult

Compare two already-parsed motifs.

