Iduna

Iduna builds one ThorAxe-based multiple sequence alignment (MSA) and, by default, expands it with MMseqs2/HMMER. The public API is centered on the iduna function, which accepts one UniProt accession or Ensembl transcript ID and writes a reproducible work directory.

The package is file-first. External tools write logs and intermediate files under the chosen workdir, and the returned result stores paths plus the resolved identifiers, selected seed, expansion outputs, validation statistics, warnings, and status. result.workdir is absolute; artifact paths under it are reported relative to workdir.

using Iduna

result = iduna(
    "P20963";
    mmseqs_db="/path/to/mmseqs/uniref_db",
    workdir="P20963",
    overwrite=false,
    centroids=false,
    transcript_query_timeout_seconds=180,
)

expanded = load_expanded_msa(result)
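
Because artifact paths are reported relative to workdir, joining one onto result.workdir gives a location that stays valid from any current directory. The following is a minimal sketch of that step using plain stand-in values rather than a real result; the relative file name and the artifact_path helper are hypothetical and not part of Iduna:

```julia
# Stand-in values; a real run would supply result.workdir and a reported artifact path.
workdir = abspath("P20963")                              # result.workdir is absolute
relative_artifact = joinpath("thoraxe_msa", "seed.sto")  # hypothetical relative artifact path

# Hypothetical helper: resolve a workdir-relative artifact path to an absolute one.
artifact_path(workdir::AbstractString, relpath::AbstractString) =
    normpath(joinpath(workdir, relpath))

full_path = artifact_path(workdir, relative_artifact)
println(full_path)
```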

The same entry point is available as a Julia 1.12 app:

julia --project=. -m Iduna P20963 --mmseqs-db /path/to/mmseqs/uniref_db

Add --centroids (or centroids=true in Julia) to also save a centroid-level MSA. This is a side output built from MMseqs2 centroid or consensus hits before cluster expansion; the regular expanded MSA remains the main result used by validation.

Use no_expansion=true in Julia, or --no-expansion in the app, to stop after the ThorAxe MSA stage. In that mode mmseqs_db is not required, result.expansion === nothing, and load_seed_msa(result) loads the selected ThorAxe PID seed.

thoraxe_only = iduna(
    "ENST00000362089.10";
    no_expansion=true,
    workdir="ENST00000362089_thoraxe",
)

seed = load_seed_msa(thoraxe_only)
thoraxe_only.thoraxe_msa.baseline_stockholm
thoraxe_only.thoraxe_msa.best_seed.stockholm_path

julia --project=. -m Iduna ENST00000362089.10 --no-expansion

For an Ensembl transcript input, Iduna resolves the parent Ensembl gene ID and species needed by ThorAxe. It does not require UniProt mapping on that path.

Iduna filters the ThorAxe species list with Ensembl homology by default using orthology="1:1", then applies biomart_datasets_filter=true as a second preflight against the current BioMart Ensembl Gene dataset list. BioMart dataset names are used only internally; species names are still passed to ThorAxe. Use orthology="1:n" or "m:n" for broader ortholog relationships, set specieslist_filter=false to skip the Ensembl homology step, or set biomart_datasets_filter=false to skip the BioMart dataset preflight.

The BioMart dataset list is cached in package scratch space and refreshed when used on a later calendar date. Iduna also reports species recorded in transcript_query BioMart failure outputs when a run completes with partial BioMart failures.

transcript_query_timeout_seconds defaults to 180 seconds, with a bounded retry that can drop the species list after a timeout. thoraxe_timeout_seconds is unset by default because ThorAxe runtime depends on gene complexity and the selected PID thresholds.
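
The species-filter and timeout keywords can be combined in a single call. The sketch below uses only keyword names mentioned above; the workdir names and the 3600-second ThorAxe timeout are illustrative assumptions, not recommended values:

```julia
using Iduna

# Broader one-to-many orthology; the BioMart dataset preflight stays on by default.
broader = iduna(
    "P20963";
    mmseqs_db="/path/to/mmseqs/uniref_db",
    workdir="P20963_one_to_many",      # illustrative directory name
    orthology="1:n",
)

# Skip both species preflights and cap ThorAxe runtime.
unfiltered = iduna(
    "P20963";
    mmseqs_db="/path/to/mmseqs/uniref_db",
    workdir="P20963_unfiltered",       # illustrative directory name
    specieslist_filter=false,          # skip the Ensembl homology step
    biomart_datasets_filter=false,     # skip the BioMart dataset preflight
    thoraxe_timeout_seconds=3600,      # illustrative value; unset by default
)
```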

If the ThorAxe transcript_query bundle has already been created, pass it with thoraxe_input_dir. Iduna copies that bundle into workdir/thoraxe_input and continues with the same ThorAxe MSA and PID seed stages. Unless no_expansion=true, it also runs expansion and expanded-MSA validation.
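
A run that reuses an existing bundle might look like the sketch below. It assumes the identifier is still passed as the first argument when thoraxe_input_dir is given; the bundle path and workdir name are placeholders:

```julia
using Iduna

# Reuse an existing transcript_query bundle and stop after the ThorAxe MSA stage.
from_bundle = iduna(
    "ENST00000362089.10";
    thoraxe_input_dir="/path/to/transcript_query_bundle",  # placeholder path
    no_expansion=true,
    workdir="ENST00000362089_from_bundle",                 # placeholder directory name
)

# With no_expansion=true, the selected ThorAxe PID seed is the MSA to load.
seed = load_seed_msa(from_bundle)
```

Dropping no_expansion=true (and supplying mmseqs_db) would let the same call continue through expansion and expanded-MSA validation.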