FAQ / How-To

How do I use a Julia `Regex` as input?

The package accepts motif strings, not compiled Regex objects. When you already have a Julia regex, pass its pattern text with rx.pattern (please note that this is not part of the public API, so it may change in future versions of Julia). You should also ensure that the regex is compatible with the selected alphabet, otherwise an error will be thrown.

julia> options = ComparisonOptions(; min_shared_positions = 1, normalized_ic_cutoff = 0.0);

julia> rx = r"A[CG]T";

julia> result = compare(rx.pattern, "ACT", options);

julia> result.matched
true

julia> result.query_relationship
"Degenerate Match"

How do I handle IUPAC ambiguity codes such as `N`, `R`, `B` or `Z`?

CompariMotif accepts motifs written with the selected core alphabet, bracket classes, and wildcard syntax (., x, or X). The canonical normalized representation uses ., but x and X are still accepted on input. If your input uses IUPAC/IUBMB ambiguity letters, rewrite them into bracket classes or wildcards before calling the API.

For nucleotide alphabets, use the NC-IUB ambiguity codes as follows:

Code	`DNAAlphabet()`	`RNAAlphabet()`
`R`	`[AG]`	`[AG]`
`Y`	`[CT]`	`[CU]`
`W`	`[AT]`	`[AU]`
`S`	`[CG]`	`[CG]`
`M`	`[AC]`	`[AC]`
`K`	`[GT]`	`[GU]`
`H`	`[ACT]`	`[ACU]`
`B`	`[CGT]`	`[CGU]`
`V`	`[ACG]`	`[ACG]`
`D`	`[AGT]`	`[AGU]`
`N`	`.` or `x`	`.` or `x`

For the current 20-residue protein alphabet, the relevant IUPAC-IUB ambiguity codes are:

Code	`ProteinAlphabet()`
`B`	`[DN]`
`Z`	`[EQ]`

These mappings come from the NC-IUB recommendation Nomenclature for incompletely specified bases in nucleic acid sequences and the IUPAC-IUB/JCBN amino-acid nomenclature section AA-21.2 The one-letter system.

You can use Julia's replace function to rewrite motifs with IUPAC codes into the appropriate format for CompariMotif. For example:

julia> clean_dna_motif = replace("ARYN", "R" => "[AG]", "Y" => "[CT]", "N" => ".")
"A[AG][CT]."

julia> clean_rna_motif = replace("AUHB", "H" => "[ACU]", "B" => "[CGU]")
"AU[ACU][CGU]"

julia> clean_protein_motif = replace("ABZX", "B" => "[DN]", "Z" => "[EQ]", "X" => ".")
"A[DN][EQ]."

How do I switch alphabets?

Choose the alphabet when constructing ComparisonOptions. The alphabet options are DNAAlphabet(), RNAAlphabet(), and ProteinAlphabet() (the default).

julia> options = ComparisonOptions(; alphabet = RNAAlphabet());

julia> compare("AxG", "AUG", options).matched
true

How do I turn results into a table?

Use to_column_table when you want to convert a matrix of comparison results into a column-oriented NamedTuple that can be easily converted to a DataFrame or written to a CSV file. The CSV.write function can consume the NamedTuple directly, which is convenient for exporting results (you need to do using CSV to use CSV.write). When interactively exploring results in a Julia session, converting the table to a DataFrame is often easier, as DataFrame operations provide convenient tools for data manipulation, such as sorting and filtering. In order to use DataFrame, you need to do using DataFrames in your Julia session.

julia> motifs = ["RKLI", "R[KR]L[IV]", "[ST]P"];

julia> options = ComparisonOptions(; min_shared_positions = 1, normalized_ic_cutoff = 0.0);

julia> matrix = compare(motifs, options);

julia> table = to_column_table(matrix);

julia> df = DataFrame(table);

julia> show(df, allrows = true, allcols = true)
9×17 DataFrame
 Row │ query_index  search_index  query       search      normalized_query  normalized_search  matched  query_relationship  search_relationship  matched_pattern  matched_positions  match_ic  normalized_ic  core_ic   score    query_information  search_information
     │ Int64        Int64         String      String      String            String             Bool     String              String               String           Int64              Float64   Float64        Float64   Float64  Float64            Float64
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │           1             1  RKLI        RKLI        RKLI              RKLI                  true  Exact Match         Exact Match          RKLI                             4   4.0                1.0  1.0           4.0            4.0                 4.0
   2 │           1             2  RKLI        R[KR]L[IV]  RKLI              R[RK]L[IV]            true  Variant Match       Degenerate Match     R[rk]L[iv]                       4   3.53724            1.0  0.884311      4.0            4.0                 3.53724
   3 │           1             3  RKLI        [ST]P       RKLI              [ST]P                false  No Match            No Match                                              0   0.0                0.0  0.0           0.0            0.0                 0.0
   4 │           2             1  R[KR]L[IV]  RKLI        R[RK]L[IV]        RKLI                  true  Degenerate Match    Variant Match        R[rk]L[iv]                       4   3.53724            1.0  0.884311      4.0            3.53724             4.0
   5 │           2             2  R[KR]L[IV]  R[KR]L[IV]  R[RK]L[IV]        R[RK]L[IV]            true  Exact Match         Exact Match          R[RK]L[IV]                       4   3.53724            1.0  1.0           4.0            3.53724             3.53724
   6 │           2             3  R[KR]L[IV]  [ST]P       R[RK]L[IV]        [ST]P                false  No Match            No Match                                              0   0.0                0.0  0.0           0.0            0.0                 0.0
   7 │           3             1  [ST]P       RKLI        [ST]P             RKLI                 false  No Match            No Match                                              0   0.0                0.0  0.0           0.0            0.0                 0.0
   8 │           3             2  [ST]P       R[KR]L[IV]  [ST]P             R[RK]L[IV]           false  No Match            No Match                                              0   0.0                0.0  0.0           0.0            0.0                 0.0
   9 │           3             3  [ST]P       [ST]P       [ST]P             [ST]P                 true  Exact Match         Exact Match          [ST]P                            2   1.76862            1.0  1.0           2.0            1.76862             1.76862

How do I persist comparison results and load them in another Julia session?

For a matrix of results, write to_column_table(results) to CSV. The matrix form includes query_index and search_index, which lets you reconstruct the original Matrix{ComparisonResult} after loading the file again.

Let's see an example step by step. First, we create some motifs and compare them. Then we write the results to a CSV file to persist them. For that, we first need to do using CSV in our Julia session.

julia> motifs = ["RKLI", "R[KR]L[IV]", "[ST]P"];

julia> options = ComparisonOptions();

julia> results = compare(motifs, options);

julia> path = tempname();  # you can choose any filename; here we create a temporary file

julia> CSV.write(path, to_column_table(results));

Now we can load the file again in another Julia session and reconstruct the original matrix of ComparisonResult objects. For that, we need to do using CSV to read the CSV file and using DataFrames to convert the table into a DataFrame, which is easier to work with:

julia> loaded = DataFrame(CSV.File(path));

Once we have the DataFrame, we can create an empty matrix of ComparisonResult objects with the correct dimensions. The query_index and search_index columns indicate the required size. We then fill the matrix by iterating over the rows of the DataFrame and constructing ComparisonResult objects from each row:

julia> restored = Matrix{ComparisonResult}(undef, maximum(loaded.query_index), maximum(loaded.search_index));

julia> for row in eachrow(loaded)
           restored[row.query_index, row.search_index] = ComparisonResult(;
               query = row.query,
               search = row.search,
               normalized_query = row.normalized_query,
               normalized_search = row.normalized_search,
               matched = row.matched,
               query_relationship = row.query_relationship,
               search_relationship = row.search_relationship,
               matched_pattern = coalesce(row.matched_pattern, ""),
               matched_positions = row.matched_positions,
               match_ic = row.match_ic,
               normalized_ic = row.normalized_ic,
               core_ic = row.core_ic,
               score = row.score,
               query_information = row.query_information,
               search_information = row.search_information,
           )
       end

julia> restored[1, 2]
ComparisonResult(
  query               = RKLI
  search              = R[KR]L[IV]
  normalized_query    = RKLI
  normalized_search   = R[RK]L[IV]
  matched             = true
  query_relationship  = Variant Match
  search_relationship = Degenerate Match
  matched_pattern     = R[rk]L[iv]
  matched_positions   = 4
  match_ic            = 3.537243573680481
  normalized_ic       = 1.0
  core_ic             = 0.8843108934201203
  score               = 4.0
  query_information   = 4.0
  search_information  = 3.537243573680481
)

How do I cluster ELM motifs?

This package was developed to cluster ELM motifs for the purpose of partitioning them into training and test sets in a Julia-based machine learning project, without introducing a dependency on an external Python script. A straightforward approach is to perform all-against-all motif comparisons to construct a graph, and then define clusters as the connected components of that graph. In practice, this requires comparing motifs, selecting a threshold to determine which matches are strong enough to be represented as edges, and then identifying the weakly connected components of the resulting directed graph.

Following Palopoli et al. (2015), we treat two motifs as connected when they match in at least two positions, have MatchIC >= 1.5, and have normalized_ic >= 0.5. When you use ComparisonOptions(), the default thresholds already enforce the minimum of two matched positions and the normalized_ic >= 0.5 cut-off, so the clustering rule only needs to add the MatchIC threshold.

The graph is directed because compare(a, b) and compare(b, a) can differ depending on the options you choose. Using Graphs.weakly_connected_components on a directed graph keeps the clustering rule simple without assuming that the pairwise comparisons are symmetric.

We can make this concrete in three steps:

Step 1 is to compare all motifs against all motifs. The call compare(motifs, options) returns a square matrix, where results[i, j] contains the comparison of motif motifs[i] against motif motifs[j]. We also define a small helper, is_cluster_match, that says when one comparison is strong enough to become an edge in the clustering graph.

julia> using CompariMotif
julia> using Graphs
julia> motifs = ["RKLI", "R[KR]L[IV]", "[ST]P", "[ST]Px[KR]"]4-element Vector{String}:
 "RKLI"
 "R[KR]L[IV]"
 "[ST]P"
 "[ST]Px[KR]"
julia> options = ComparisonOptions()ComparisonOptions(
  alphabet                = CompariMotif.ProteinAlphabet()
  residue_frequencies     = nothing
  min_shared_positions    = 2
  normalized_ic_cutoff    = 0.5
  matchfix                = :none
  mismatches              = 0
  allow_ambiguous_overlap = true
  max_variants            = 10000
)
julia> results = compare(motifs, options);
julia> is_cluster_match(result) = result.matched && result.match_ic >= 1.5;

Step 2 is to convert that matrix into a graph and then extract clusters from the graph. In this representation, each motif is one vertex. Whenever results[i, j] passes the clustering rule, we add a directed edge from vertex i to vertex j. Once all edges have been added, weakly_connected_components returns groups of motif indices that belong to the same cluster.

julia> function connected_component_assignments(results)
           # `results` is the square matrix returned by `compare(motifs, options)`.
           # Each row/column corresponds to one motif in the original input order.
           n = size(results, 1)
           size(results, 2) == n || throw(ArgumentError("expected a square all-vs-all comparison matrix"))
       
           # Build a directed graph with one vertex per motif.
           graph = SimpleDiGraph(n)
           for i in 1:n, j in 1:n
               # Ignore self-comparisons: each motif is already in its own cluster.
               i == j && continue
               # Add an edge `i -> j` when motif `i` matches motif `j`
               # strongly enough for clustering.
               is_cluster_match(results[i, j]) || continue
               add_edge!(graph, i, j)
           end
       
           # Weakly connected components group motifs that are connected by the
           # directed edges, even if the arrows do not point both ways.
           components = weakly_connected_components(graph)
           # Sort components so cluster numbering is deterministic and easy to read.
           sort!(components, by = minimum)
       
           # `clusters[k]` will store the cluster number for motif `k`.
           clusters = zeros(Int, n)
           for (cluster_id, component) in enumerate(components)
               # `component` is the list of motif indices that belong to one cluster.
               # The `.=` assigns the same cluster number to all of them at once.
               clusters[component] .= cluster_id
           end
       
           return clusters
       end;

Step 3 is to run the helper and inspect the result. The output vector has one integer per motif, in the same order as the original motifs vector. Motifs with the same integer belong to the same cluster. Taking maximum(clusters) gives the total number of clusters.

julia> clusters = connected_component_assignments(results)4-element Vector{Int64}:
 1
 1
 2
 2
julia> maximum(clusters)2

Here, clusters[i] is the cluster assigned to motifs[i]. In this toy example, the first two motifs fallb into one cluster and the last two motifs fall into another one, so the assignment vector is [1, 1, 2, 2].