Regex Syntax
Motif parsing supports a controlled regex-like subset.
- Fixed residues from the selected alphabet:
  - protein (`alphabet=ProteinAlphabet()`, default): `ARNDCQEGHILKMFPSTWYV`
  - DNA (`alphabet=DNAAlphabet()`): `ACGT`
  - RNA (`alphabet=RNAAlphabet()`): `ACGU`
- Wildcards:
  - `.` is the canonical wildcard in normalized and output motifs.
  - `x` and `X` are accepted input aliases and mean the same thing as `.` in this package.
- Character classes:
  - `[KR]` includes the listed residues.
  - `[^P]` is negation within the selected alphabet only.
  - repeated residues inside positive classes are ignored, so `[KKAQ]`, `[AQK]`, and `[QQAAK]` all normalize to the same residue set.
- Anchors:
  - `^` and `$` indicate the N- and C-terminus for protein motifs, or the 5' and 3' ends for nucleic acid motifs.
- Repeat quantifiers:
  - `{n}` and `{m,n}` apply to residues, classes, anchors, and oracle-style grouped alternatives (see below).
- Grouping and alternation:
  - `(...)` for grouping and `|` for alternatives, for example `A(K|Q)LI`.
- Whitespace:
  - leading and trailing whitespace is trimmed;
  - the first internal whitespace ends motif parsing, matching the upstream oracle.
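The whitespace, wildcard, and character-class rules above can be illustrated with a small standalone sketch. It does not call this package: `trim_motif` and `normalize_token` are hypothetical helpers written only for illustration, and rendering a class in alphabet order is an assumption, since the text above only says that equivalent classes normalize to the same residue set.

```python
PROTEIN = "ARNDCQEGHILKMFPSTWYV"  # protein alphabet as listed above


def trim_motif(text: str) -> str:
    """Trim outer whitespace, then stop at the first internal whitespace."""
    parts = text.split()
    return parts[0] if parts else ""


def normalize_token(token: str, alphabet: str = PROTEIN) -> str:
    """Normalize one wildcard or character-class token (illustrative only)."""
    if token in {".", "x", "X"}:
        return "."  # x and X are input aliases for the canonical wildcard
    if not (token.startswith("[") and token.endswith("]")):
        raise ValueError(f"not a wildcard or class token: {token!r}")
    body = token[1:-1]
    negated = body.startswith("^")
    listed = set(body[1:] if negated else body)  # a set drops repeated residues
    if not listed or not listed <= set(alphabet):
        raise ValueError(f"class has residues outside the alphabet: {token!r}")
    # Negation is taken within the selected alphabet only.
    members = set(alphabet) - listed if negated else listed
    if not members:
        raise ValueError(f"class matches nothing: {token!r}")
    if len(members) == len(alphabet):
        return "."  # an explicit full-alphabet class is the wildcard
    # Assumption: render members in alphabet order to get one canonical spelling.
    return "[" + "".join(r for r in alphabet if r in members) + "]"


print(trim_motif("  A C "))        # A
print(normalize_token("[KKAQ]"))   # [AQK]
print(normalize_token("[QQAAK]"))  # [AQK]
print(normalize_token("X"))        # .
```

Passing a different alphabet string, for example `normalize_token("[^A]", "ACGT")`, shows negation staying inside the nucleotide alphabet.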
Edge Cases and Oracle Behavior
For a handful of parser edge cases, this package intentionally matches the observed black-box behavior of the upstream CompariMotif oracle rather than enforcing a stricter regex grammar.
- Internal whitespace truncates parsing at the first space, so `A C` behaves as `A`.
- Some permissive exact quantifiers are accepted and normalized the same way as the oracle, for example `A{0}` and `A{-1}` both behave as `A`.
- Grouped exact quantifiers are supported with the oracle's branch-splitting behavior rather than standard regex repetition semantics. For example, `(A|C){2}` expands to `AA` and `CC`, while `(AC|GT){2}` expands to `ACC` and `GTT` (equivalently normalized as `(AC{2})|(GT{2})`).
- The upstream black-box oracle has a grouped-alternation quirk where top-level wildcard-only branches such as `(Q|.)` and `(Q|.){2}` are dropped entirely, while top-level `(Q|x)` and `(Q|X)` retain the `Q` branch and only lose the wildcard-alias branch; embedded forms like `A(Q|x)L` still expand. This package intentionally does not reproduce that behavior. Instead, canonical normalization always treats `x`, `X`, `.`, and explicit full-alphabet classes as the same wildcard syntax. For example, `(Q|[ARNDCQEGHILKMFPSTWYV])` normalizes as `(Q)|(.)`.
- The upstream oracle also scores duplicate residues inside positive classes, but this package intentionally treats character classes as sets. For example, `[AA]` and `[A]` are equivalent here even though the oracle can distinguish them.
- Some malformed constructs are treated as non-retained motifs rather than hard parse failures, such as `A{2,1}` producing zero retained variants.
- Truly malformed syntax is still rejected when the oracle also rejects it, for example an unclosed character class like `A[Q`.
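The branch-splitting behavior for grouped exact quantifiers differs enough from ordinary regex repetition that a small sketch may help. The function below is a hypothetical illustration, not this package's parser: it handles only a single top-level group of plain uppercase residue branches with a positive `{n}`, and it reproduces only the two examples given above. How other inputs behave is not specified here.

```python
import re

# One group of plain-residue branches followed by a positive exact quantifier.
GROUP_REPEAT = re.compile(r"\(([A-Z]+(?:\|[A-Z]+)*)\)\{([1-9]\d*)\}")


def split_grouped_repeat(motif: str) -> list[str]:
    """Expand `(b1|b2|...){n}` one branch at a time (illustrative only).

    The quantifier is applied to the last residue of each branch,
    so `(AC|GT){2}` is read as `AC{2}` or `GT{2}`.
    """
    match = GROUP_REPEAT.fullmatch(motif)
    if match is None:
        raise ValueError(f"unsupported motif for this sketch: {motif!r}")
    branches = match.group(1).split("|")
    count = int(match.group(2))
    # Keep the branch prefix and repeat only its final residue.
    return [branch[:-1] + branch[-1] * count for branch in branches]


print(split_grouped_repeat("(A|C){2}"))    # ['AA', 'CC']
print(split_grouped_repeat("(AC|GT){2}"))  # ['ACC', 'GTT']
```

Under standard regex semantics, `(AC|GT){2}` would instead match `ACAC`, `ACGT`, `GTAC`, and `GTGT`, which is why the oracle-compatible reading is called out explicitly.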