Helix

helix.api

Research workflows often start in notebooks or lightweight scripts where shelling out to the CLI is inconvenient. The helix.api module mirrors the CLI surface area but returns plain Python dictionaries and lists, making it straightforward to serialize to JSON, pass through pandas, or feed into downstream visualization components.

Simulation only: helix.api helpers work exclusively on digital sequences, spectra, or graphs. They never prescribe wet-lab procedures or interface with instruments—use them strictly for in-silico analysis and visualization.

The helpers prefer explicit inputs: wherever DNA/protein IO is relevant, a function accepts either an inline sequence="ACGU..." or an input_path=Path("sample.fasta"), but not both. Validation happens up front: invalid bases, conflicting arguments, and missing files raise informative exceptions (ValueError, or ImportError when an optional dependency is absent), which keeps provenance audit-friendly.
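The up-front validation described above can be sketched roughly as follows. This is an illustration only, not the Helix implementation: the helper name resolve_sequence, the accepted alphabet, and the choice of ValueError for a missing file are all assumptions.

```python
from pathlib import Path
from typing import Optional, Union

# Assumed alphabet for illustration; Helix's accepted set may differ.
VALID_BASES = set("ACGTUN")


def resolve_sequence(
    sequence: Optional[str] = None,
    input_path: Optional[Union[str, Path]] = None,
) -> str:
    """Return an upper-cased sequence from exactly one of the two inputs."""
    # Reject both "neither given" and "both given" in a single check.
    if (sequence is None) == (input_path is None):
        raise ValueError("Provide exactly one of sequence or input_path.")
    if input_path is not None:
        path = Path(input_path)
        if not path.is_file():
            raise ValueError(f"Input file not found: {path}")
        # Skip FASTA header lines and join the remaining lines.
        lines = path.read_text().splitlines()
        sequence = "".join(
            line.strip() for line in lines if not line.startswith(">")
        )
    cleaned = sequence.strip().upper()
    invalid = sorted(set(cleaned) - VALID_BASES)
    if invalid:
        raise ValueError(f"Invalid bases: {', '.join(invalid)}")
    return cleaned
```

Checking the arguments before any file IO means a conflicting call fails the same way whether or not the file exists.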

Function reference

dna_summary(sequence=None, *, input_path=None, window=200, step=50, k=5, max_diff=1)

Normalize a DNA string, compute GC statistics, and discover k-mer hotspots tolerant of SNP-scale variation.

Parameters

| Name | Type | Details |
| --- | --- | --- |
| sequence | str \| None | Inline DNA input. Cannot be combined with input_path. |
| input_path | str \| Path \| None | File to read (FASTA/plain). Cannot be combined with sequence. |
| window | int | Sliding-window size for GC summaries. Set to 0 to skip. |
| step | int | Advance between windows; smaller steps increase resolution. |
| k | int | k-mer size used for clustering recurrent motifs. |
| max_diff | int | Maximum Hamming distance when grouping similar k-mers. |

Returns

dict with:

The GC window calculation and k-mer clustering match the logic behind helix dna summarize, ensuring CLI and notebook reports stay interchangeable.
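To make the window, step, k, and max_diff parameters concrete, here is a minimal sketch of a sliding-window GC scan and a greedy Hamming-distance grouping of k-mers. The function names, the output keys, and the greedy seed-based clustering strategy are assumptions for illustration; the actual Helix logic may differ.

```python
from collections import Counter
from typing import Dict, List


def gc_windows(seq: str, window: int = 200, step: int = 50) -> List[Dict[str, float]]:
    """GC fraction for each full-length window; window=0 skips the scan."""
    if window <= 0:
        return []
    if step < 1:
        raise ValueError("step must be a positive integer")
    seq = seq.upper()
    out = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        gc = sum(base in "GC" for base in chunk) / window
        out.append({"start": start, "end": start + window, "gc": gc})
    return out


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def cluster_kmers(seq: str, k: int = 5, max_diff: int = 1) -> Dict[str, int]:
    """Greedy grouping: each k-mer joins the first seed within max_diff."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    clusters: Dict[str, int] = {}
    # Visit the most frequent k-mers first so they become cluster seeds.
    for kmer, n in counts.most_common():
        for seed in clusters:
            if hamming(seed, kmer) <= max_diff:
                clusters[seed] += n
                break
        else:
            clusters[kmer] = n
    return clusters
```

With max_diff=1, a k-mer that differs from a seed by a single substitution is counted toward that seed, which is what makes the hotspot counts tolerant of SNP-scale variation.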


triage_report(sequence=None, *, input_path=None, k=5, max_diff=1, min_orf_length=90)

Produce the complete “triage” bundle (GC skew, motif clusters, ORF calls) that backs helix viz triage.

Parameters

| Name | Type | Details |
| --- | --- | --- |
| sequence, input_path | see above | Mutually exclusive DNA inputs. |
| k, max_diff | int | Passed through to the underlying k-mer clustering. |
| min_orf_length | int | Filter ORFs shorter than the threshold (nt). |

Returns

dict with:

All payloads are schema-compatible with helix schema show triage-report.
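As a rough illustration of how a min_orf_length filter behaves, the sketch below scans the three forward reading frames for ATG...stop stretches. It is a simplified stand-in, not the Helix ORF caller: the function name, the output keys, forward-strand-only scanning, and counting the stop codon toward the length are all assumptions.

```python
from typing import Dict, List

STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str, min_orf_length: int = 90) -> List[Dict[str, int]]:
    """Forward-strand ORFs (ATG through stop) at least min_orf_length nt long."""
    seq = seq.upper().replace("U", "T")
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                end = i + 3  # assumption: the stop codon counts toward length
                if end - start >= min_orf_length:
                    orfs.append({"start": start, "end": end, "frame": frame})
                start = None
    return sorted(orfs, key=lambda orf: orf["start"])
```

Under these assumptions a 12-nt input such as AUGGCCUUUUAA yields no calls at the default threshold of 90, and one call when min_orf_length is lowered to 12.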


fold_rna(sequence, *, min_loop_length=3, allow_wobble_pairs=True)

Convenience wrapper around the annotated Nussinov dynamic program for RNA folding.

Parameters

| Name | Type | Details |
| --- | --- | --- |
| sequence | str | RNA/DNA string; T is normalized to U internally. |
| min_loop_length | int | Nussinov “hairpin” constraint: minimum number of unpaired nt separating paired bases. |
| allow_wobble_pairs | bool | When True, GU wobble pairs are permitted in addition to AU/GC. |

Returns

dict with:

Because the helper exposes the same knobs as helix rna fold, results are reproducible across interfaces.
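For readers unfamiliar with the algorithm, the following is a compact, generic Nussinov sketch showing where min_loop_length and allow_wobble_pairs enter the recurrence. It is a textbook version written for illustration; the function names and the (score, dot-bracket) return shape are assumptions and do not reflect the Helix return payload.

```python
from typing import Tuple


def can_pair(a: str, b: str, allow_wobble_pairs: bool = True) -> bool:
    pairs = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
    if allow_wobble_pairs:
        pairs |= {("G", "U"), ("U", "G")}
    return (a, b) in pairs


def nussinov(
    seq: str, min_loop_length: int = 3, allow_wobble_pairs: bool = True
) -> Tuple[int, str]:
    """Maximize base pairs; return (pair count, dot-bracket structure)."""
    seq = seq.upper().replace("T", "U")
    n = len(seq)
    if n == 0:
        return 0, ""
    dp = [[0] * n for _ in range(n)]
    # Spans too short to satisfy the loop constraint keep a score of 0.
    for span in range(min_loop_length + 1, n):
        for i in range(n - span):
            j = i + span
            best = dp[i][j - 1]  # case 1: position j stays unpaired
            for k in range(i, j - min_loop_length):
                if can_pair(seq[k], seq[j], allow_wobble_pairs):
                    left = dp[i][k - 1] if k > i else 0
                    # case 2: j pairs with k, splitting the interval in two
                    best = max(best, left + 1 + dp[k + 1][j - 1])
            dp[i][j] = best

    # Traceback: replay the decisions that produced each optimal score.
    structure = ["."] * n
    stack = [(0, n - 1)]
    while stack:
        i, j = stack.pop()
        if j - i <= min_loop_length:
            continue
        if dp[i][j] == dp[i][j - 1]:
            stack.append((i, j - 1))
            continue
        for k in range(i, j - min_loop_length):
            if can_pair(seq[k], seq[j], allow_wobble_pairs):
                left = dp[i][k - 1] if k > i else 0
                if dp[i][j] == left + 1 + dp[k + 1][j - 1]:
                    structure[k], structure[j] = "(", ")"
                    if k > i:
                        stack.append((i, k - 1))
                    stack.append((k + 1, j - 1))
                    break
    return dp[0][n - 1], "".join(structure)
```

Nussinov maximizes the number of pairs rather than minimizing free energy, so the structure it returns is one of possibly several optimal pairings.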


spectrum_leaderboard(peptide=None, *, experimental_spectrum=None, cyclic=True, leaderboard_size=5)

Run the leaderboard cyclopeptide sequencing algorithm and return notebook-friendly results.

Parameters

| Name | Type | Details |
| --- | --- | --- |
| peptide | str \| None | Optional candidate peptide for generating a theoretical spectrum. |
| experimental_spectrum | Sequence[int] \| None | Observed masses used for leaderboard selection. |
| cyclic | bool | Whether the theoretical spectrum should be cyclic (True) or linear (False). |
| leaderboard_size | int | Maximum number of peptides kept per iteration (ties are preserved). |

Returns

dict with:

Pass only peptide to preview its spectrum, only experimental_spectrum to search for best-scoring sequences, or both to compare expectations vs. observed data.
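The three building blocks behind leaderboard sequencing (theoretical spectrum, multiset score, tie-preserving trim) can be sketched as below. This uses the standard integer-mass table and textbook definitions; the function names are hypothetical and the code is not taken from Helix.

```python
from collections import Counter
from typing import List, Sequence

# Standard integer residue masses (Da) for the 20 amino acids.
MASS = {
    "G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "T": 101, "C": 103,
    "I": 113, "L": 113, "N": 114, "D": 115, "K": 128, "Q": 128, "E": 129,
    "M": 131, "H": 137, "F": 147, "R": 156, "Y": 163, "W": 186,
}


def theoretical_spectrum(peptide: str, cyclic: bool = True) -> List[int]:
    """Sorted masses of all subpeptides, plus 0 and the full peptide."""
    prefix = [0]
    for aa in peptide:
        prefix.append(prefix[-1] + MASS[aa])
    n, total = len(peptide), prefix[-1]
    spectrum = [0]
    for i in range(n):
        for j in range(i + 1, n + 1):
            sub = prefix[j] - prefix[i]
            spectrum.append(sub)
            # Wrap-around fragments are complements of interior subpeptides.
            if cyclic and i > 0 and j < n:
                spectrum.append(total - sub)
    return sorted(spectrum)


def score(peptide: str, experimental_spectrum: Sequence[int], cyclic: bool = True) -> int:
    """Count masses shared between theory and experiment (with multiplicity)."""
    theo = Counter(theoretical_spectrum(peptide, cyclic))
    return sum((theo & Counter(experimental_spectrum)).values())


def trim(
    leaderboard: Sequence[str],
    experimental_spectrum: Sequence[int],
    leaderboard_size: int,
    cyclic: bool = True,
) -> List[str]:
    """Keep the top leaderboard_size peptides, plus any tied with the last."""
    scored = sorted(
        ((score(p, experimental_spectrum, cyclic), p) for p in leaderboard),
        key=lambda pair: pair[0],
        reverse=True,
    )
    if len(scored) <= leaderboard_size:
        return [p for _, p in scored]
    cutoff = scored[leaderboard_size - 1][0]
    return [p for s, p in scored if s >= cutoff]
```

Because ties at the cutoff are kept, the trimmed leaderboard can hold more than leaderboard_size peptides, which matches the "ties are preserved" note in the parameter table.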


protein_summary(sequence=None, *, input_path=None, window=9, step=1, scale="kd")

Summarize amino-acid sequences using Biopython’s ProtParam utilities along with Helix hydropathy profiles.

Parameters

| Name | Type | Details |
| --- | --- | --- |
| sequence, input_path | see above | FASTA headers are handled transparently. |
| window | int | Sliding window for hydropathy averaging. |
| step | int | Offset between successive hydropathy windows. |
| scale | str | Hydropathy scale identifier (e.g., "kd" for Kyte-Doolittle). |

Returns

dict with canonical protein metrics: sequence, length, molecular_weight, aromaticity, instability_index, gravy, charge_at_pH7, plus hydropathy_profile entries (start, end, score). Raises ImportError if Biopython is unavailable so that workflows can fail fast with a clear dependency message.
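The hydropathy_profile entries can be pictured with a small stand-alone sketch using the published Kyte-Doolittle scale. The function names, the 0-based half-open start/end convention, and the rounding are assumptions made for illustration; check the actual payload before relying on them.

```python
from typing import Dict, List, Union

# Kyte-Doolittle hydropathy values (Kyte & Doolittle, 1982).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5,
    "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9,
    "M": 1.9, "F": 2.8, "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9,
    "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(
    seq: str, window: int = 9, step: int = 1
) -> List[Dict[str, Union[int, float]]]:
    """Mean KD score per window; start/end are 0-based, end exclusive."""
    if window < 1 or step < 1:
        raise ValueError("window and step must be positive integers")
    seq = seq.upper()
    profile = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        mean = sum(KD[aa] for aa in chunk) / window
        profile.append(
            {"start": start, "end": start + window, "score": round(mean, 3)}
        )
    return profile


def gravy(seq: str) -> float:
    """Grand average of hydropathy: mean KD score over the whole sequence."""
    seq = seq.upper()
    return sum(KD[aa] for aa in seq) / len(seq)
```

A larger window smooths the profile and is commonly used when looking for long hydrophobic stretches, while a smaller one highlights local variation.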


run_workflow(config_path, *, output_dir, name=None)

Execute a YAML workflow definition (the same format consumed by helix workflows run) from Python. The helper wraps helix_workflows.run_workflow_config, returning the list of materialized artifacts. name can restrict execution to a single workflow from a manifest, mirroring the CLI --name flag.

Usage patterns

```python
from pathlib import Path
from helix import api as hx

report = hx.triage_report(sequence="AUGGCCUUUUAA", k=3)
gc_bins = hx.dna_summary(input_path=Path("samples/ecoli.fna"), window=500, step=50)
rna = hx.fold_rna("GGGAAACCC", min_loop_length=0)
peptides = hx.spectrum_leaderboard(
    experimental_spectrum=[0, 113, 128, 227, 242, 355, 370, 484],
    leaderboard_size=10,
)
```

Each function returns plain JSON-serializable structures, so results can go straight into json.dumps(...), be shipped to helix viz ..., or be handed to the scientific Python stack. Exceptions surface early and with actionable messages, which supports reproducibility and provenance tracking.
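As a small illustration of the JSON round trip, the snippet below serializes a hand-written dictionary shaped like part of a protein_summary result. The dictionary is a hypothetical stand-in built from field names listed earlier, not real Helix output.

```python
import json

# Hypothetical stand-in for a helper's return value (not real Helix output).
summary = {
    "sequence": "MKT",
    "length": 3,
    "hydropathy_profile": [{"start": 0, "end": 3, "score": -0.9}],
}

# Plain dicts, lists, strings, and numbers serialize without custom encoders.
text = json.dumps(summary, indent=2, sort_keys=True)

# Reading the text back yields an equal structure, so reports can be diffed.
restored = json.loads(text)
```

Using sort_keys=True keeps the serialized form stable between runs, which makes stored reports easier to compare.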

Need performance baselines? From the repo root, run python -m benchmarks.api_benchmarks --repeat 5 --limit 0 --sort mean --out bench/api.json to capture timing data for every helper, or pass --scenario to focus on specific functions. Set HELIX_BENCH_DNA_FASTA / HELIX_BENCH_PROTEIN_FASTA to swap in larger genomes or proteomes once available, and use --limit 10000 to mimic CI’s quicker sampling.

Each run emits a schema-tagged payload (bench_result v1.0) that logs the git SHA, BLAS vendor, CPU/threads, RNG seed, and per-case RSS stats, so notebook and CI runs can be compared like for like. To flag slowdowns greater than 5% automatically, compare two runs with scripts/bench_check.py baseline.json current.json --threshold 5; the rolling history lives at docs/benchmarks.md.
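The idea behind a percentage threshold can be sketched in a few lines. This is not the logic of scripts/bench_check.py and it does not read the bench_result payload: the function name and the simplified input (a mapping from case name to mean seconds) are assumptions for illustration.

```python
from typing import Dict, List


def find_slowdowns(
    baseline: Dict[str, float],
    current: Dict[str, float],
    threshold: float = 5.0,
) -> List[str]:
    """List cases whose mean time grew by more than threshold percent."""
    flagged = []
    for name, base_mean in baseline.items():
        cur_mean = current.get(name)
        # Skip cases missing from the new run or with an unusable baseline.
        if cur_mean is None or base_mean <= 0:
            continue
        change = (cur_mean - base_mean) / base_mean * 100.0
        if change > threshold:
            flagged.append(f"{name}: +{change:.1f}%")
    return flagged
```

For example, a case that moves from 1.00 s to 1.10 s is a 10% slowdown and would be flagged at a threshold of 5, while a move from 2.00 s to 2.04 s (2%) would not.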