API Reference

FlashANSR

Flash Amortized Neural Symbolic Regressor.

PARAMETER	DESCRIPTION
`simplipy_engine`	Engine responsible for manipulating and evaluating symbolic expressions. TYPE: `SimpliPyEngine`
`flash_ansr_model`	Trained transformer backbone that proposes expression programs. TYPE: `FlashANSRModel`
`tokenizer`	Tokenizer mapping model outputs to expression tokens. TYPE: `Tokenizer`
`generation_config`	Configuration that controls candidate generation. If `None` a default `SoftmaxSamplingConfig` is created. TYPE: `GenerationConfig` DEFAULT: `None`
`n_restarts`	Number of optimizer restarts used by the refiner when fitting constants. TYPE: `int` DEFAULT: `8`
`refiner_method`	Optimization routine employed by the refiner. TYPE: `(curve_fit_lm, minimize_bfgs, minimize_lbfgsb, minimize_neldermead, minimize_powell, least_squares_trf, least_squares_dogbox)` DEFAULT: `'curve_fit_lm'`
`refiner_p0_noise`	Distribution applied to perturb initial constant guesses. `None` disables perturbations. TYPE: `(uniform, normal, cauchy, magspan) or None` DEFAULT: `'normal'`
`refiner_p0_noise_kwargs`	Keyword arguments forwarded to the noise sampler. `'default'` yields `{'loc': 0.0, 'scale': 5.0}` for the normal distribution. TYPE: `dict or {default} or None` DEFAULT: `'default'`
`numpy_errors`	Desired NumPy error handling strategy applied during constant refinement. TYPE: `(ignore, warn, raise, call, print, log) or None` DEFAULT: `'ignore'`
`length_penalty`	Penalty coefficient that discourages overly long expressions. TYPE: `float` DEFAULT: `0.05`
`constants_penalty`	Penalty coefficient applied to the number of constants present in an expression. TYPE: `float` DEFAULT: `0.0`
`likelihood_penalty`	Penalty coefficient applied to the negative log likelihood of the generated beam. TYPE: `float` DEFAULT: `0.0`
`refiner_workers`	Number of worker processes to run during constant refinement. `None` (the default) uses all available CPU cores, while explicit integers select a fixed pool size. Set `0` to disable multiprocessing. TYPE: `int or None` DEFAULT: `None`
`prune_constant_budget`	Apply constant-pruning refinement to the best beams after the initial refinement (ranked by FVU). If `> 1`, the value is treated as an absolute count; if `0 < value <= 1`, it is treated as a fraction of beams (rounded up). Set to `0` to disable (the default). TYPE: `int or float` DEFAULT: `0`
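
The `prune_constant_budget` rule described above can be illustrated with a small standalone sketch. `resolve_prune_count` is a hypothetical helper, not a function of the library, and capping the result at the number of available beams is an assumption made for the example.

```python
import math


def resolve_prune_count(budget: float, n_beams: int) -> int:
    """Hypothetical helper: translate a prune budget into a number of beams."""
    budget = max(0.0, float(budget))  # negative budgets are clamped to 0, as in the constructor
    if budget == 0 or n_beams <= 0:
        return 0  # pruning disabled
    if budget <= 1:
        count = math.ceil(budget * n_beams)  # fraction of beams, rounded up
    else:
        count = int(budget)  # absolute count
    return min(count, n_beams)  # assumption: never more than the beams available


print(resolve_prune_count(0, 32))     # 0
print(resolve_prune_count(0.25, 10))  # 3
print(resolve_prune_count(16, 32))    # 16
```

Note that a value of exactly `1` falls in the fractional branch, so under this reading it selects all beams rather than a single one.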

Source code in src/flash_ansr/flash_ansr.py

def __init__(
        self,
        simplipy_engine: SimpliPyEngine,
        flash_ansr_model: FlashANSRModel,
        tokenizer: Tokenizer,
        generation_config: GenerationConfig | None = None,
        n_restarts: int = 8,
        refiner_method: Literal[
            'curve_fit_lm',
            'minimize_bfgs',
            'minimize_lbfgsb',
            'minimize_neldermead',
            'minimize_powell',
            'least_squares_trf',
            'least_squares_dogbox',
        ] = 'curve_fit_lm',
        refiner_p0_noise: Literal['uniform', 'normal', 'cauchy', 'magspan'] | None = 'normal',
        refiner_p0_noise_kwargs: dict | None | Literal['default'] = 'default',
        numpy_errors: Literal['ignore', 'warn', 'raise', 'call', 'print', 'log'] | None = 'ignore',
        length_penalty: float = 0.05,
        constants_penalty: float = 0.0,
        likelihood_penalty: float = 0.0,
        refiner_workers: int | None = None,
        prune_constant_budget: float | int = 0):
    self.simplipy_engine = simplipy_engine
    self.flash_ansr_model = flash_ansr_model.eval()
    self.tokenizer = tokenizer

    if refiner_p0_noise_kwargs == 'default':
        refiner_p0_noise_kwargs = {'loc': 0.0, 'scale': 5.0}

    if generation_config is None:
        generation_config = SoftmaxSamplingConfig()

    self.generation_config = generation_config
    self.n_restarts = n_restarts
    self.refiner_method = refiner_method
    self.refiner_p0_noise = refiner_p0_noise
    self.refiner_p0_noise_kwargs = copy.deepcopy(refiner_p0_noise_kwargs) if refiner_p0_noise_kwargs is not None else None
    self.numpy_errors = numpy_errors
    self.length_penalty = length_penalty
    self.constants_penalty = float(constants_penalty)
    self.likelihood_penalty = float(likelihood_penalty)
    self.prune_constant_budget = max(0.0, float(prune_constant_budget))

    cpu_count = os.cpu_count() or 1

    if refiner_workers is None:
        resolved_workers = max(1, cpu_count)
    elif isinstance(refiner_workers, numbers.Integral):
        resolved_workers = max(0, int(refiner_workers))
    else:
        raise TypeError("refiner_workers must be an integer or None.")

    self.refiner_workers = resolved_workers
    # Parallelize the post-generation simplify across `refiner_workers` (gated to choices >=
    # _SIMPLIFY_PARALLEL_THRESHOLD; byte-identical to serial). Set False to force serial.
    self.parallel_simplify = True

    self._results: list[Result] = []
    self.results: pd.DataFrame = pd.DataFrame()
    self._mcts_cache: dict[Tuple[int, ...], dict[str, Any]] = {}

    self._input_dim: int | None = None

    self.variable_mapping: dict[str, str] = {}
    self._prompt_prefix: PromptPrefix | None = None
    self._prompt_metadata: dict[str, list[list[str]]] | None = None

    # Optional persistent pre-CUDA fork pool (inference-speed Step 3). When set (via
    # ``load(persistent_refine_pool=True)``) refinement + simplify route onto it instead of
    # forking a fresh pool per call; ``None`` keeps the legacy per-call-fork path (the default).
    self._refine_pool: RecoverableForkPool | None = None

    # Set True by ``OverlappedEvaluationEngine`` for the duration of an overlapped run (inference-
    # speed Step 4). While True a GPU-owner thread runs generation concurrently with refinement, so
    # ``_fit_refine`` MUST keep all per-candidate work in forked pool workers (whose global-RNG
    # reseed is process-isolated) and MUST NOT fork a fresh pool on the calling thread (that would
    # be a fork-after-CUDA-while-another-thread-is-live hazard). See ``_run_refinement_jobs``.
    self._overlap_mode: bool = False
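
The worker-count resolution in the constructor can be tried in isolation. The sketch below copies that branch into a free function; the name `resolve_refiner_workers` is introduced here for illustration only.

```python
import numbers
import os


def resolve_refiner_workers(refiner_workers):
    """Mirror of the constructor's worker resolution (illustrative name)."""
    cpu_count = os.cpu_count() or 1  # os.cpu_count() may return None
    if refiner_workers is None:
        return max(1, cpu_count)  # None: use every available core
    if isinstance(refiner_workers, numbers.Integral):
        return max(0, int(refiner_workers))  # negative values are clamped to 0
    raise TypeError("refiner_workers must be an integer or None.")


print(resolve_refiner_workers(4))   # 4
print(resolve_refiner_workers(-2))  # 0
```

A consequence worth knowing is that a float such as `2.0` is rejected with `TypeError` instead of being truncated, because only `numbers.Integral` values are accepted.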

load `classmethod`

load(directory: str, generation_config: GenerationConfig | None = None, n_restarts: int = 8, refiner_method: Literal['curve_fit_lm', 'minimize_bfgs', 'minimize_lbfgsb', 'minimize_neldermead', 'minimize_powell', 'least_squares_trf', 'least_squares_dogbox'] = 'curve_fit_lm', refiner_p0_noise: Literal['uniform', 'normal', 'cauchy', 'magspan'] | None = 'normal', refiner_p0_noise_kwargs: dict | None | Literal['default'] = 'default', numpy_errors: Literal['ignore', 'warn', 'raise', 'call', 'print', 'log'] | None = 'ignore', length_penalty: float = 0.05, constants_penalty: float = 0.0, likelihood_penalty: float = 0.0, device: str = 'cpu', refiner_workers: int | None = None, prune_constant_budget: float | int = 0, persistent_refine_pool: bool = False) -> FlashANSR

Instantiate a FlashANSR model from a configuration directory.

PARAMETER	DESCRIPTION
`directory`	Directory that contains `model.yaml`, `tokenizer.yaml` and `state_dict.pt` artifacts. TYPE: `str`
`generation_config`	Generation parameters to override defaults during candidate search. TYPE: `GenerationConfig` DEFAULT: `None`
`n_restarts`	Number of restarts passed to the refiner. TYPE: `int` DEFAULT: `8`
`refiner_method`	Optimization routine for constant fitting. TYPE: `(curve_fit_lm, minimize_bfgs, minimize_lbfgsb, minimize_neldermead, minimize_powell, least_squares_trf, least_squares_dogbox)` DEFAULT: `'curve_fit_lm'`
`refiner_p0_noise`	Distribution used to perturb initial constant guesses. TYPE: `(uniform, normal, cauchy, magspan) or None` DEFAULT: `'normal'`
`refiner_p0_noise_kwargs`	Additional keyword arguments for the noise sampler. `'default'` resolves to `{'loc': 0.0, 'scale': 5.0}`. TYPE: `dict or {default} or None` DEFAULT: `'default'`
`numpy_errors`	NumPy floating-point error policy applied during refinement. TYPE: `(ignore, warn, raise, call, print, log) or None` DEFAULT: `'ignore'`
`length_penalty`	Length penalty used when compiling results. TYPE: `float` DEFAULT: `0.05`
`constants_penalty`	Penalty applied per constant present in the expression during scoring. TYPE: `float` DEFAULT: `0.0`
`likelihood_penalty`	Penalty applied to the negative log likelihood of each beam. TYPE: `float` DEFAULT: `0.0`
`device`	Torch device where the model weights will be loaded. TYPE: `str` DEFAULT: `'cpu'`
`refiner_workers`	Desired worker-pool size for constant refinement. `None` uses the number of available CPU cores, integers select an explicit pool size, and `0` disables multiprocessing. Mirrors the constructor parameter. TYPE: `int or None` DEFAULT: `None`
`prune_constant_budget`	Number of top beams (by FVU) or fraction of beams (if `0 < value <= 1`) to prune after initial refinement when pruning is enabled. TYPE: `int or float` DEFAULT: `0`
`persistent_refine_pool`	When `True` a single persistent `fork` worker pool is forked BEFORE any CUDA init and reused across `fit()` calls for both refinement and the parallel post-generation simplify (instead of forking a fresh pool per call). For a CUDA `device` the weights are loaded on CPU first, the engine is warmed, the pool is forked, and only then is the model moved to the device -- the structural mitigation for the fork-after-CUDA deadlock family. Requires the `fork` start method and `refiner_workers > 1`; otherwise it is a no-op and the legacy per-call-fork path is used. Opt-in; the default preserves today's behaviour byte-for-byte. TYPE: `bool` DEFAULT: `False`

RETURNS	DESCRIPTION
`model`	Fully initialized regressor ready for inference. TYPE: `FlashANSR`

Source code in src/flash_ansr/flash_ansr.py

@classmethod
def load(
        cls,
        directory: str,
        generation_config: GenerationConfig | None = None,
        n_restarts: int = 8,
        refiner_method: Literal[
            'curve_fit_lm',
            'minimize_bfgs',
            'minimize_lbfgsb',
            'minimize_neldermead',
            'minimize_powell',
            'least_squares_trf',
            'least_squares_dogbox',
        ] = 'curve_fit_lm',
        refiner_p0_noise: Literal['uniform', 'normal', 'cauchy', 'magspan'] | None = 'normal',
        refiner_p0_noise_kwargs: dict | None | Literal['default'] = 'default',
        numpy_errors: Literal['ignore', 'warn', 'raise', 'call', 'print', 'log'] | None = 'ignore',
        length_penalty: float = 0.05,
        constants_penalty: float = 0.0,
        likelihood_penalty: float = 0.0,
        device: str = 'cpu',
        refiner_workers: int | None = None,
        prune_constant_budget: float | int = 0,
        persistent_refine_pool: bool = False) -> "FlashANSR":
    """Instantiate a `FlashANSR` model from a configuration directory.

    Parameters
    ----------
    directory : str
        Directory that contains ``model.yaml``, ``tokenizer.yaml`` and
        ``state_dict.pt`` artifacts.
    generation_config : GenerationConfig, optional
        Generation parameters to override defaults during candidate search.
    n_restarts : int, optional
        Number of restarts passed to the refiner.
    refiner_method : {'curve_fit_lm', 'minimize_bfgs', 'minimize_lbfgsb', 'minimize_neldermead', 'minimize_powell', 'least_squares_trf', 'least_squares_dogbox'}
        Optimization routine for constant fitting.
    refiner_p0_noise : {'uniform', 'normal', 'cauchy', 'magspan'}, optional
        Distribution used to perturb initial constant guesses.
    refiner_p0_noise_kwargs : dict or {'default'} or None, optional
        Additional keyword arguments for the noise sampler. ``'default'``
        resolves to ``{'loc': 0.0, 'scale': 5.0}``.
    numpy_errors : {'ignore', 'warn', 'raise', 'call', 'print', 'log'} or None, optional
        NumPy floating-point error policy applied during refinement.
    length_penalty : float, optional
        Length penalty used when compiling results.
    constants_penalty : float, optional
        Penalty applied per constant present in the expression during
        scoring.
    likelihood_penalty : float, optional
        Penalty applied to the negative log likelihood of each beam.
    device : str, optional
        Torch device where the model weights will be loaded.
    refiner_workers : int or None, optional
        Desired worker-pool size for constant refinement. ``None`` uses the
        number of available CPU cores, integers select an explicit pool size,
        and ``0`` disables multiprocessing. Mirrors the constructor parameter.
    prune_constant_budget : int or float, optional
        Number of top beams (by FVU) or fraction of beams (if 0<value<=1)
        to prune after initial refinement when pruning is enabled.
    persistent_refine_pool : bool, optional
        When ``True`` a single persistent ``fork`` worker pool is forked BEFORE any CUDA init and
        reused across ``fit()`` calls for both refinement and the parallel post-generation
        simplify (instead of forking a fresh pool per call). For a CUDA ``device`` the weights are
        loaded on CPU first, the engine is warmed, the pool is forked, and only then is the model
        moved to the device -- the structural mitigation for the fork-after-CUDA deadlock family.
        Requires the ``fork`` start method and ``refiner_workers > 1``; otherwise it is a no-op and
        the legacy per-call-fork path is used. Opt-in; the default preserves today's behaviour
        byte-for-byte.

    Returns
    -------
    model : FlashANSR
        Fully initialized regressor ready for inference.
    """
    directory = substitute_root_path(directory)

    flash_ansr_model_path = os.path.join(directory, 'model.yaml')
    tokenizer_path = os.path.join(directory, 'tokenizer.yaml')

    # When a persistent pre-CUDA pool is requested for a non-CPU device, defer the device move:
    # load weights on CPU (incl. map_location, which also touches CUDA otherwise) so the pool can
    # be forked while CUDA is still uninitialized. The flag-off path is unchanged.
    defer_cuda = persistent_refine_pool and str(device) != 'cpu'
    load_device = 'cpu' if defer_cuda else device

    model = FlashANSRModel.from_config(flash_ansr_model_path)
    model.load_state_dict(torch.load(os.path.join(directory, "state_dict.pt"), weights_only=True, map_location=load_device))
    model.eval().to(load_device)

    tokenizer = Tokenizer.from_config(tokenizer_path)

    nsr = cls(
        simplipy_engine=model.simplipy_engine,
        flash_ansr_model=model,
        tokenizer=tokenizer,
        generation_config=generation_config,
        n_restarts=n_restarts,
        refiner_method=refiner_method,
        refiner_p0_noise=refiner_p0_noise,
        refiner_p0_noise_kwargs=refiner_p0_noise_kwargs,
        numpy_errors=numpy_errors,
        length_penalty=length_penalty,
        constants_penalty=constants_penalty,
        likelihood_penalty=likelihood_penalty,
        refiner_workers=refiner_workers,
        prune_constant_budget=prune_constant_budget)

    if persistent_refine_pool:
        # Warm the engine + fork the pool (pre-CUDA), then move the model to the target device.
        nsr._enable_persistent_refine_pool(target_device=device)

    return nsr
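
The device deferral at the top of `load` is the part most easily misread, so here is a minimal sketch of just that decision. `plan_load_device` is a hypothetical name; the two expressions inside it are taken from the listing above.

```python
from typing import Tuple


def plan_load_device(device: str, persistent_refine_pool: bool) -> Tuple[str, bool]:
    """Return (device used while loading weights, whether the CUDA move is deferred)."""
    # Defer only when a persistent pool is requested AND the target is not the CPU.
    defer_cuda = bool(persistent_refine_pool) and str(device) != 'cpu'
    # With deferral, weights are loaded on CPU so the pool can be forked before CUDA init.
    load_device = 'cpu' if defer_cuda else device
    return load_device, defer_cuda


print(plan_load_device('cuda', True))   # ('cpu', True)
print(plan_load_device('cuda', False))  # ('cuda', False)
print(plan_load_device('cpu', True))    # ('cpu', False)
```

With the flag off, the weights are loaded directly onto the target device, which matches the statement that the flag-off path is unchanged.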

fit

fit(X: ndarray | Tensor | DataFrame, y: ndarray | Tensor | DataFrame | Series, variable_names: list[str] | dict[str, str] | Literal['auto'] | None = 'auto', converge_error: Literal['raise', 'ignore', 'print'] = 'ignore', verbose: bool = False, *, complexity: int | float | None = None, allowed_terms: Iterable[Sequence[Any]] | None = None, include_terms: Iterable[Sequence[Any]] | None = None, exclude_terms: Iterable[Sequence[Any]] | None = None, refine_seed: int | None = None) -> None

Perform symbolic regression on (X, y) and refine candidate expressions.

PARAMETER	DESCRIPTION
`X`	Feature matrix where rows index observations and columns variables. TYPE: `ndarray or Tensor or DataFrame`
`y`	Target values. Multi-output targets are unsupported. TYPE: `ndarray or Tensor or DataFrame or Series`
`variable_names`	Mapping from internal variable tokens to descriptive names. TYPE: `list[str] or dict[str, str] or {auto} or None` DEFAULT: `'auto'`
`converge_error`	Handling strategy when the refiner fails to converge. TYPE: `(raise, ignore, print)` DEFAULT: `'ignore'`
`verbose`	If `True` progress bars and diagnostic output are displayed. TYPE: `bool` DEFAULT: `False`
`complexity`	Keyword-only complexity value forwarded to the candidate generation step. TYPE: `int or float or None` DEFAULT: `None`
`allowed_terms`	Keyword-only list of term token sequences that may appear in the generated expression. TYPE: `Iterable[Sequence[str]] or None` DEFAULT: `None`
`include_terms`	Keyword-only subset of allowed terms that the expression should prioritise using. TYPE: `Iterable[Sequence[str]] or None` DEFAULT: `None`
`exclude_terms`	Keyword-only list of term token sequences that should be discouraged during generation. TYPE: `Iterable[Sequence[str]] or None` DEFAULT: `None`
`refine_seed`	Keyword-only seed for the constant-refinement `p0` noise. When provided, the per-candidate refiner seeds are derived deterministically from it (via `np.random.SeedSequence(refine_seed)`), so refinement is reproducible and independent of completion order. When `None` (the default) fresh OS entropy is used, preserving the legacy behaviour. TYPE: `int or None` DEFAULT: `None`

RAISES	DESCRIPTION
`ValueError`	If `y` has more than one output dimension or cannot be reshaped.
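
To make the `refine_seed` behaviour concrete, the following sketch derives one seed per candidate from a root `np.random.SeedSequence`. How the library actually turns the sequence into per-candidate seeds is not shown in this reference, so the use of `spawn` and `generate_state`, and the helper name `derive_candidate_seeds`, are assumptions.

```python
import numpy as np


def derive_candidate_seeds(refine_seed, n_candidates):
    """Hypothetical helper: one integer seed per candidate, fixed by candidate index."""
    root = np.random.SeedSequence(refine_seed)  # None draws fresh OS entropy
    children = root.spawn(n_candidates)         # independent child sequences
    # Each seed depends only on (refine_seed, index), not on the order in which
    # worker processes happen to finish.
    return [int(child.generate_state(1)[0]) for child in children]


first = derive_candidate_seeds(42, 4)
second = derive_candidate_seeds(42, 4)
print(first == second)  # True
```

Because each seed is tied to the candidate's index, handing candidates to a worker pool in any order leaves the refinement noise unchanged for a given `refine_seed`.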

Source code in src/flash_ansr/flash_ansr.py

def fit(
        self,
        X: np.ndarray | torch.Tensor | pd.DataFrame,
        y: np.ndarray | torch.Tensor | pd.DataFrame | pd.Series,
        variable_names: list[str] | dict[str, str] | Literal['auto'] | None = 'auto',
        converge_error: Literal['raise', 'ignore', 'print'] = 'ignore',
        verbose: bool = False,
        *,
        complexity: int | float | None = None,
        allowed_terms: Iterable[Sequence[Any]] | None = None,
        include_terms: Iterable[Sequence[Any]] | None = None,
        exclude_terms: Iterable[Sequence[Any]] | None = None,
        refine_seed: int | None = None) -> None:
    """Perform symbolic regression on ``(X, y)`` and refine candidate expressions.

    Parameters
    ----------
    X : ndarray or Tensor or DataFrame
        Feature matrix where rows index observations and columns variables.
    y : ndarray or Tensor or DataFrame or Series
        Target values. Multi-output targets are unsupported.
    variable_names : list[str] or dict[str, str] or {'auto'} or None, optional
        Mapping from internal variable tokens to descriptive names.
    converge_error : {'raise', 'ignore', 'print'}, optional
        Handling strategy when the refiner fails to converge.
    verbose : bool, optional
        If ``True`` progress bars and diagnostic output are displayed.
    allowed_terms : Iterable[Sequence[str]] or None, optional
        Keyword-only list of term token sequences that may appear in the
        generated expression.
    include_terms : Iterable[Sequence[str]] or None, optional
        Keyword-only subset of allowed terms that the expression should
        prioritise using.
    exclude_terms : Iterable[Sequence[str]] or None, optional
        Keyword-only list of term token sequences that should be discouraged
        during generation.
    refine_seed : int or None, optional
        Keyword-only seed for the constant-refinement ``p0`` noise. When
        provided, the per-candidate refiner seeds are derived deterministically
        from it (via ``np.random.SeedSequence(refine_seed)``), so refinement is
        reproducible and independent of completion order. When ``None`` (the
        default) fresh OS entropy is used, preserving the legacy behaviour.

    Raises
    ------
    ValueError
        If ``y`` has more than one output dimension or cannot be reshaped.
    """
    # TODO: Support lists
    # TODO: Support 0-d and 1-d tensors

    # Reset per-fit instance state up front so a ConvergenceError mid-fit leaves a clean (not
    # stale) view for library callers; the eval adapter ignores self on the error path.
    self._results = []
    self.variable_mapping = {}
    self._generation_time = 0.0
    self._refinement_time = 0.0

    # Adopt the configured floating-point error policy for refinement; try/finally restores it
    # even when refinement raises ConvergenceError (which must still propagate to the caller).
    numpy_errors_before = np.geterr()
    np.seterr(all=self.numpy_errors)
    try:
        gen_state = self._fit_generate(
            X, y, variable_names,
            complexity=complexity,
            allowed_terms=allowed_terms,
            include_terms=include_terms,
            exclude_terms=exclude_terms,
            verbose=verbose,
        )
        # Generation-phase state, applied so callers see it even if refinement raises.
        self.variable_mapping = gen_state.variable_mapping
        self._generation_time = gen_state.generation_time
        self._prompt_metadata = copy.deepcopy(gen_state.metadata_snapshot) if gen_state.metadata_snapshot is not None else None

        fit_result = self._fit_refine(
            gen_state,
            converge_error=converge_error,
            refine_seed=refine_seed,
            verbose=verbose,
        )
        self._apply_fit_result(fit_result)
    finally:
        np.seterr(**numpy_errors_before)
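
The `geterr` / `seterr` / `finally` pattern used in `fit` guarantees that the caller's NumPy error policy is restored even if refinement raises. For user code that wants the same guarantee, NumPy's `np.errstate` context manager is an alternative; the sketch below shows it and is not how the library itself is written.

```python
import numpy as np


def divide_with_policy(policy: str):
    """Run a division by zero under `policy`; report the outcome and whether state was restored."""
    before = np.geterr()
    try:
        with np.errstate(all=policy):  # policy applies only inside this block
            _ = np.array([1.0]) / np.array([0.0])
        outcome = "completed"
    except FloatingPointError:
        outcome = "raised"
    return outcome, np.geterr() == before


print(divide_with_policy("ignore"))  # ('completed', True)
print(divide_with_policy("raise"))   # ('raised', True)
```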

infer

infer(X: ndarray | Tensor | DataFrame, y: ndarray | Tensor | DataFrame | Series, variable_names: list[str] | dict[str, str] | Literal['auto'] | None = 'auto', *, X_val: ndarray | Tensor | DataFrame | None = None, complexity: int | float | None = None, converge_error: Literal['raise', 'ignore', 'print'] = 'ignore', refine_seed: int | None = None, predict_val: bool = True, top_k: int | None = None, verbose: bool = False) -> InferenceResult

Run symbolic regression on (X, y) and return ALL candidates directly.

Unlike `fit` (which commits to `self._results` for later `predict` / `get_expression` read-back), `infer` returns a `flash_ansr.inference.InferenceResult`: the score-sorted refined `Candidate` objects plus the full `CandidateLedger` (the generation pool joined with the refined survivors, classified `FIT_OK` / `FIT_FAILED` / `INVALID`). It writes nothing to instance state, so it neither disturbs nor depends on `self._results`.

`y_pred` / `y_pred_val` are computed only for the top `top_k` candidates (`top_k=None` selects the best candidate only): evaluating every candidate is O(candidates x n_support) and would exhaust memory at high candidate counts. `predict_val` toggles the validation-set prediction.
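
The `top_k` rule can be checked with a tiny standalone function. `ranks_with_predictions` is a name made up for this sketch; the expression for `n_pred` follows the `infer` source.

```python
from typing import List, Optional


def ranks_with_predictions(n_results: int, top_k: Optional[int]) -> List[int]:
    """Return the ranks (0 = best) for which predictions would be computed."""
    # No results means no predictions; otherwise None selects the single best candidate.
    n_pred = (1 if top_k is None else int(top_k)) if n_results else 0
    return [rank for rank in range(n_results) if rank < n_pred]


print(ranks_with_predictions(5, None))  # [0]
print(ranks_with_predictions(5, 3))     # [0, 1, 2]
print(ranks_with_predictions(2, 10))    # [0, 1]
```

Candidates outside these ranks are still returned; only their `y_pred` / `y_pred_val` fields stay `None`.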

PARAMETER	DESCRIPTION
`X`	Support feature matrix (the data to fit). TYPE: `array-like`
`y`	Support targets (the data to fit). TYPE: `array-like`
`variable_names`	Variable-name mapping (as in `fit`). TYPE: `list[str] or dict[str, str] or {auto} or None` DEFAULT: `'auto'`
`X_val`	Out-of-sample features for `y_pred_val` (validation predictions). TYPE: `array-like or None` DEFAULT: `None`
`complexity`	As in `fit`. TYPE: `int or float or None` DEFAULT: `None`
`converge_error`	As in `fit`. TYPE: `(raise, ignore, print)` DEFAULT: `'ignore'`
`refine_seed`	As in `fit`. TYPE: `int or None` DEFAULT: `None`
`verbose`	As in `fit`. TYPE: `bool` DEFAULT: `False`
`predict_val`	Whether to compute validation predictions for the top candidates. TYPE: `bool` DEFAULT: `True`
`top_k`	Compute `y_pred` / `y_pred_val` for the top `top_k` candidates; `None` -> best only. TYPE: `int or None` DEFAULT: `None`

RETURNS	DESCRIPTION
`InferenceResult`	Score-sorted candidates + the full candidate ledger + generation / refinement times. If NO beam converges, `candidates` is empty and the ledger classifies every generated beam FIT_FAILED / INVALID -- `infer` returns it rather than raising (unlike `fit`).

Source code in src/flash_ansr/flash_ansr.py

def infer(
        self,
        X: np.ndarray | torch.Tensor | pd.DataFrame,
        y: np.ndarray | torch.Tensor | pd.DataFrame | pd.Series,
        variable_names: list[str] | dict[str, str] | Literal['auto'] | None = 'auto',
        *,
        X_val: np.ndarray | torch.Tensor | pd.DataFrame | None = None,
        complexity: int | float | None = None,
        converge_error: Literal['raise', 'ignore', 'print'] = 'ignore',
        refine_seed: int | None = None,
        predict_val: bool = True,
        top_k: int | None = None,
        verbose: bool = False) -> InferenceResult:
    """Run symbolic regression on ``(X, y)`` and return ALL candidates directly.

    Unlike :meth:`fit` (which commits to ``self._results`` for later ``predict`` /
    ``get_expression`` read-back), ``infer`` returns an :class:`~flash_ansr.inference.InferenceResult`:
    the score-sorted refined :class:`~flash_ansr.inference.Candidate`s PLUS the full
    :class:`~flash_ansr.inference.CandidateLedger` (the generation pool joined with the refined
    survivors, classified FIT_OK / FIT_FAILED / INVALID). It writes NOTHING to instance state, so
    it neither disturbs nor depends on ``self._results``.

    ``y_pred`` / ``y_pred_val`` are computed only for the top ``top_k`` candidates (``top_k=None``
    -> the best only): evaluating every candidate is O(candidates x n_support) and would blow up
    RAM at high candidate counts. ``predict_val`` toggles the validation-set prediction.

    Parameters
    ----------
    X, y : array-like
        Support feature matrix and targets (the data to fit).
    variable_names : list[str] or dict[str, str] or {'auto'} or None, optional
        Variable-name mapping (as in :meth:`fit`).
    X_val : array-like, optional
        Out-of-sample features for ``y_pred_val`` (validation predictions).
    complexity, converge_error, refine_seed, verbose : optional
        As in :meth:`fit`.
    predict_val : bool, optional
        Whether to compute validation predictions for the top candidates.
    top_k : int or None, optional
        Compute ``y_pred`` / ``y_pred_val`` for the top ``top_k`` candidates; ``None`` -> best only.

    Returns
    -------
    InferenceResult
        Score-sorted candidates + the full candidate ledger + generation / refinement times.
        If NO beam converges, ``candidates`` is empty and the ledger classifies every generated
        beam FIT_FAILED / INVALID -- ``infer`` returns it rather than raising (unlike ``fit``).
    """
    numpy_errors_before = np.geterr()
    np.seterr(all=self.numpy_errors)
    try:
        gen_state = self._fit_generate(X, y, variable_names, complexity=complexity, verbose=verbose)
        # allow_empty=True: infer() returns the FULL candidate ledger (all FIT_FAILED/INVALID)
        # even when no beam converged, per its contract, instead of raising ConvergenceError.
        fit_result = self._fit_refine(gen_state, converge_error=converge_error, refine_seed=refine_seed, verbose=verbose, allow_empty=True)
    finally:
        np.seterr(**numpy_errors_before)

    results = fit_result.results  # already score-sorted (best first) by _compile_results_pure

    def _decode_expr(raw_beam: list[int]) -> list[str] | None:
        expr_ids = self.flash_ansr_model.tokenizer.extract_expression_from_beam(raw_beam)[0]
        return self.tokenizer.decode(expr_ids, special_tokens='<constant>')

    ledger = build_candidate_ledger(
        gen_state.raw_beams, gen_state.log_probs, results,
        decode_expr=_decode_expr, is_valid=self.simplipy_engine.is_valid,
    )

    n_pred = (1 if top_k is None else int(top_k)) if results else 0
    X_support_p = pad_input_set(self._truncate_input(X), self.n_variables) if n_pred else None
    X_val_p = (pad_input_set(self._truncate_input(X_val), self.n_variables)
               if (n_pred and predict_val and X_val is not None) else None)

    variable_mapping = fit_result.variable_mapping
    candidates: list[Candidate] = []
    for rank, r in enumerate(results):
        refiner = r['refiner']
        want_pred = rank < n_pred
        expression_prefix = refiner.transform(expression=r['expression'], return_prefix=True, variable_mapping=None)
        # The variable-MAPPED INFIX string -- exactly get_expression(map_variables=True), but built from
        # the local refiner (no self._results read). Engine-bound prefix->infix lives in the refiner, so
        # a consumer cannot reproduce this without reaching into the model; expose it on the candidate.
        expression_infix = refiner.transform(expression=r['expression'], return_prefix=False, variable_mapping=variable_mapping)
        skeleton_prefix = normalize_skeleton(r['expression'])
        candidates.append(Candidate(
            raw_beam=list(r['raw_beam']),
            expression=list(r['expression']),
            expression_prefix=list(expression_prefix) if expression_prefix is not None else [],
            expression_infix=str(expression_infix),
            skeleton_prefix=list(skeleton_prefix) if skeleton_prefix is not None else [],
            constants=_best_constants(r),
            log_prob=float(r.get('log_prob', float('nan'))),
            score=float(r.get('score', float('nan'))),
            fvu=float(r.get('fvu', float('nan'))),
            complexity=int(r.get('complexity', len(r['expression']))),
            constant_count=int(r.get('constant_count', 0)),
            pruned_variant=bool(r.get('pruned_variant', False)),
            y_pred=(refiner.predict(X_support_p) if want_pred and X_support_p is not None else None),
            y_pred_val=(refiner.predict(X_val_p) if want_pred and X_val_p is not None else None),
        ))

    return InferenceResult(
        candidates=candidates,
        ledger=ledger,
        generation_time=gen_state.generation_time,
        refinement_time=fit_result.refinement_time,
        variable_mapping=fit_result.variable_mapping,
    )

predict

predict(X: ndarray | Tensor | DataFrame, nth_best_beam: int = 0, nth_best_constants: int = 0) -> np.ndarray

Evaluate a fitted expression on new data.

PARAMETER	DESCRIPTION
`X`	Feature matrix to evaluate. TYPE: `ndarray or Tensor or DataFrame`
`nth_best_beam`	Beam index to select from the ranked results. TYPE: `int` DEFAULT: `0`
`nth_best_constants`	Index of the constant fit to choose for the selected beam. TYPE: `int` DEFAULT: `0`

RETURNS	DESCRIPTION
`y_pred`	Predicted targets with the same leading dimension as `X`. TYPE: `ndarray`

RAISES	DESCRIPTION
`ValueError`	If the model has not been fitted before prediction.

Source code in src/flash_ansr/flash_ansr.py

def predict(self, X: np.ndarray | torch.Tensor | pd.DataFrame, nth_best_beam: int = 0, nth_best_constants: int = 0) -> np.ndarray:
    """Evaluate a fitted expression on new data.

    Parameters
    ----------
    X : ndarray or Tensor or DataFrame
        Feature matrix to evaluate.
    nth_best_beam : int, optional
        Beam index to select from the ranked results.
    nth_best_constants : int, optional
        Index of the constant fit to choose for the selected beam.

    Returns
    -------
    y_pred : ndarray
        Predicted targets with the same leading dimension as ``X``.

    Raises
    ------
    ValueError
        If the model has not been fitted before prediction.
    """
    # TODO: Support lists
    # TODO: Support 0-d and 1-d tensors

    X = self._truncate_input(X)

    if isinstance(X, pd.DataFrame):
        X = X.values

    X = pad_input_set(X, self.n_variables)

    if len(self._results) == 0:
        raise ValueError("The model has not been fitted yet. Please call the fit method first.")

    return self._results[nth_best_beam]['refiner'].predict(X, nth_best_constants=nth_best_constants)

get_expression

get_expression(nth_best_beam: int = 0, nth_best_constants: int = 0, return_prefix: bool = False, precision: int = 2, map_variables: bool = True, **kwargs: Any) -> list[str] | str

Retrieve a formatted expression from the compiled results.

PARAMETER	DESCRIPTION
`nth_best_beam`	Beam index to extract from `self._results`. TYPE: `int` DEFAULT: `0`
`nth_best_constants`	Constant fit index for the selected beam. TYPE: `int` DEFAULT: `0`
`return_prefix`	If `True` return the prefix notation instead of infix string. TYPE: `bool` DEFAULT: `False`
`precision`	Number of decimal places used when rendering constants. TYPE: `int` DEFAULT: `2`
`map_variables`	When `True` apply `self.variable_mapping` to humanise variables. TYPE: `bool` DEFAULT: `True`
`**kwargs`	Extra keyword arguments forwarded to `Refiner.transform`. TYPE: `Any` DEFAULT: `{}`

RETURNS	DESCRIPTION
`expression`	Expression either as a token list or human-readable string. TYPE: `list[str] or str`
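
To make the difference between the two return forms concrete, here is a toy converter from a prefix token list to an infix string, with constants rounded to `precision` decimals and an optional variable mapping. It is only a sketch under stated assumptions: `prefix_to_infix`, the operator set, and the token spellings are hypothetical and do not reproduce `Refiner.transform`.

```python
BINARY = {"+", "-", "*", "/"}  # assumed operator set for this toy example


def prefix_to_infix(tokens, precision=2, variable_mapping=None):
    """Render a prefix token list as a fully parenthesised infix string."""
    mapping = variable_mapping or {}
    pos = 0

    def parse():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        if token in BINARY:
            left = parse()
            right = parse()
            return f"({left} {token} {right})"
        try:
            # Numeric constants are rendered with the requested precision.
            return f"{float(token):.{precision}f}"
        except ValueError:
            # Anything else is treated as a variable and optionally renamed.
            return mapping.get(token, token)

    return parse()


prefix = ["+", "*", "3.14159", "x1", "x2"]
infix = prefix_to_infix(prefix, precision=2, variable_mapping={"x1": "mass", "x2": "offset"})
# infix == "((3.14 * mass) + offset)"
```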

Source code in src/flash_ansr/flash_ansr.py

def get_expression(self, nth_best_beam: int = 0, nth_best_constants: int = 0, return_prefix: bool = False, precision: int = 2, map_variables: bool = True, **kwargs: Any) -> list[str] | str:
    """Retrieve a formatted expression from the compiled results.

    Parameters
    ----------
    nth_best_beam : int, optional
        Beam index to extract from ``self._results``.
    nth_best_constants : int, optional
        Constant fit index for the selected beam.
    return_prefix : bool, optional
        If ``True`` return the prefix notation instead of infix string.
    precision : int, optional
        Number of decimal places used when rendering constants.
    map_variables : bool, optional
        When ``True`` apply ``self.variable_mapping`` to humanise variables.
    **kwargs : Any
        Extra keyword arguments forwarded to :meth:`Refiner.transform`.

    Returns
    -------
    expression : list[str] or str
        Expression either as a token list or human-readable string.
    """
    if len(self._results) == 0:
        raise ValueError("The model has not been fitted yet. Please call the fit method first.")

    return self._results[nth_best_beam]['refiner'].transform(
        expression=self._results[nth_best_beam]['expression'],
        nth_best_constants=nth_best_constants,
        return_prefix=return_prefix,
        precision=precision,
        variable_mapping=self.variable_mapping if map_variables else None,
        **kwargs)

save_results ¶

save_results(path: str) -> None

Persist fitted results (minus lambdas) for later reuse.

Source code in src/flash_ansr/flash_ansr.py

def save_results(self, path: str) -> None:
    """Persist fitted results (minus lambdas) for later reuse."""

    if not self._results:
        raise ValueError("No results available to save. Run `fit` first.")

    input_dim = self._input_dim if self._input_dim is not None else self.n_variables
    metadata = {
        "format_version": RESULTS_FORMAT_VERSION,
        "length_penalty": self.length_penalty,
        "constants_penalty": self.constants_penalty,
        "likelihood_penalty": self.likelihood_penalty,
        "n_variables": self.n_variables,
        "input_dim": input_dim,
        "variable_mapping": copy.deepcopy(self.variable_mapping),
    }

    payload = serialize_results_payload(self._results, metadata=metadata)
    save_results_payload(payload, path)

load_results ¶

load_results(path: str, *, rebuild_refiners: bool = True) -> None

Load previously saved results and rebuild refiners if requested.

Source code in src/flash_ansr/flash_ansr.py

def load_results(self, path: str, *, rebuild_refiners: bool = True) -> None:
    """Load previously saved results and rebuild refiners if requested."""

    payload = load_results_payload(path)
    metadata = payload.get("metadata", {})

    version = int(payload.get("version", 0))
    if version != RESULTS_FORMAT_VERSION:
        warnings.warn(
            f"Results payload version {version} does not match expected {RESULTS_FORMAT_VERSION}; attempting to proceed anyway."
        )

    length_penalty = float(metadata.get("length_penalty", getattr(self, "length_penalty", 0.0)))
    constants_penalty = float(metadata.get("constants_penalty", getattr(self, "constants_penalty", 0.0)))
    likelihood_penalty = float(metadata.get("likelihood_penalty", getattr(self, "likelihood_penalty", 0.0)))
    n_variables = int(metadata.get("n_variables", self.n_variables))
    input_dim = int(metadata.get("input_dim", n_variables))

    self._input_dim = input_dim
    self.length_penalty = length_penalty
    self.constants_penalty = constants_penalty
    self.likelihood_penalty = likelihood_penalty
    self.variable_mapping = metadata.get("variable_mapping", self.variable_mapping)

    restored = deserialize_results_payload(
        payload,
        simplipy_engine=self.simplipy_engine,
        n_variables=n_variables,
        input_dim=input_dim,
        rebuild_refiners=rebuild_refiners,
    )

    self._results = restored
    self.compile_results(
        length_penalty=length_penalty,
        constants_penalty=constants_penalty,
        likelihood_penalty=likelihood_penalty,
    )
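
The save/load pair above follows a common pattern: write a versioned payload with metadata, warn on a version mismatch instead of failing, and fall back to the current in-memory value when a metadata key is missing. A minimal sketch of that pattern follows. The JSON file format, `FORMAT_VERSION`, `save_payload`, and `load_payload` are assumptions made for illustration and are not the library's serialization helpers.

```python
import json
import os
import tempfile
import warnings

FORMAT_VERSION = 1  # hypothetical; stands in for RESULTS_FORMAT_VERSION


def save_payload(path, results, metadata):
    with open(path, "w", encoding="utf-8") as fh:
        json.dump({"version": FORMAT_VERSION, "metadata": metadata, "results": results}, fh)


def load_payload(path, current_length_penalty=0.0):
    with open(path, encoding="utf-8") as fh:
        payload = json.load(fh)
    version = int(payload.get("version", 0))
    if version != FORMAT_VERSION:
        # Mismatch is reported but loading still proceeds.
        warnings.warn(f"Payload version {version} does not match expected {FORMAT_VERSION}; proceeding anyway.")
    metadata = payload.get("metadata", {})
    # A missing metadata key falls back to the value currently set on the model.
    length_penalty = float(metadata.get("length_penalty", current_length_penalty))
    return payload["results"], length_penalty


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "results.json")
    save_payload(path, results=[{"expression": ["+", "x1", "x2"]}], metadata={})
    results, length_penalty = load_payload(path, current_length_penalty=0.05)
```

Because the saved metadata is empty in this example, `length_penalty` comes back as the caller's current value, `0.05`.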

compile_results ¶

compile_results(length_penalty: float | None = None, constants_penalty: float | None = None, likelihood_penalty: float | None = None) -> None

Aggregate refiner outputs into a tidy pandas.DataFrame.

PARAMETER	DESCRIPTION
`length_penalty`	Length penalty applied during score recomputation. Defaults to the current `length_penalty` value on the model. TYPE: `float` DEFAULT: `None`
`constants_penalty`	Constant-count penalty applied during score recomputation. Defaults to the current `constants_penalty` value on the model. TYPE: `float` DEFAULT: `None`
`likelihood_penalty`	Negative log-likelihood penalty applied during score recomputation. Defaults to the current `likelihood_penalty` value on the model. TYPE: `float` DEFAULT: `None`

RAISES	DESCRIPTION
`ConvergenceError`	If no beams converged during refinement.
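
The sketch below illustrates the override semantics: passing `None` keeps the stored coefficient, while any other value overwrites it before candidates are re-ranked. The additive score used here is an assumption for illustration only, since the actual scoring formula is not shown on this page. `ToyRanker` and the local `ConvergenceError` are hypothetical stand-ins.

```python
class ConvergenceError(Exception):
    """Local stand-in for the library's exception of the same name."""


class ToyRanker:
    def __init__(self, length_penalty=0.05, constants_penalty=0.0):
        self.length_penalty = length_penalty
        self.constants_penalty = constants_penalty

    def compile_results(self, candidates, length_penalty=None, constants_penalty=None):
        if not candidates:
            raise ConvergenceError("The optimization did not converge for any beam")
        # None keeps the stored coefficient; anything else overwrites it.
        if length_penalty is not None:
            self.length_penalty = float(length_penalty)
        if constants_penalty is not None:
            self.constants_penalty = float(constants_penalty)

        def score(c):
            # ASSUMED additive score (lower is better); not the library's formula.
            return (c["error"]
                    + self.length_penalty * c["length"]
                    + self.constants_penalty * c["n_constants"])

        return sorted(candidates, key=score)


candidates = [
    {"name": "long", "error": 0.10, "length": 9, "n_constants": 3},
    {"name": "short", "error": 0.30, "length": 3, "n_constants": 1},
]
ranker = ToyRanker(length_penalty=0.0)
by_error = [c["name"] for c in ranker.compile_results(candidates)]
by_length = [c["name"] for c in ranker.compile_results(candidates, length_penalty=0.05)]
```

With no length penalty the lower-error candidate ranks first; raising the penalty to `0.05` flips the order, and the new coefficient stays stored on the ranker for later calls.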

Source code in src/flash_ansr/flash_ansr.py

def compile_results(
        self,
        length_penalty: float | None = None,
        constants_penalty: float | None = None,
        likelihood_penalty: float | None = None) -> None:
    """Aggregate refiner outputs into a tidy `pandas.DataFrame`.

    Parameters
    ----------
    length_penalty : float, optional
        Length penalty applied during score recomputation. Defaults to the
        current ``length_penalty`` value on the model.
    constants_penalty : float, optional
        Constant-count penalty applied during score recomputation. Defaults
        to the current ``constants_penalty`` value on the model.
    likelihood_penalty : float, optional
        Negative log-likelihood penalty applied during score recomputation.
        Defaults to the current ``likelihood_penalty`` value on the model.

    Raises
    ------
    ConvergenceError
        If no beams converged during refinement.
    """
    if not self._results:
        raise ConvergenceError("The optimization did not converge for any beam")

    self.initial_length_penalty = getattr(self, 'length_penalty', 0.0)
    self.initial_constants_penalty = getattr(self, 'constants_penalty', 0.0)
    self.initial_likelihood_penalty = getattr(self, 'likelihood_penalty', 0.0)

    if length_penalty is not None:
        self.length_penalty = float(length_penalty)
    if constants_penalty is not None:
        self.constants_penalty = float(constants_penalty)
    if likelihood_penalty is not None:
        self.likelihood_penalty = float(likelihood_penalty)

    self._results, self.results = self._compile_results_pure(
        self._results, self.length_penalty, self.constants_penalty, self.likelihood_penalty)

Inference results¶

The objects returned by FlashANSR.infer: the score-sorted refined candidates plus the full classified candidate ledger.

InferenceResult.to_dataframe() returns the refined survivors (the FIT_OK candidates in result.candidates) as a pandas DataFrame, one row per candidate; it does not include the full ledger.

InferenceResult¶

The result of one FlashANSR.infer call: the score-sorted refined candidates plus the full ledger.

Candidate¶

One refined survivor (a generated expression that fitted). Rich, for interactive use.

CandidateLedger¶

The union of the full generation pool and the refined survivors, with every candidate classified: the lean columnar "all candidates" object (tokens, fvu, log_prob, valid, fit_status, and best constants). Holds no model objects.

FlashANSRDataset¶

Dataset wrapper for amortized neural symbolic regression training.

Manages skeleton sampling, support point generation, optional prompt preprocessing, and collation into model-ready batches. Can also compile streaming output into an on-disk datasets.Dataset for deterministic iteration.

PARAMETER	DESCRIPTION
`source`	Symbolic-data problem source that streams ready-to-use Problems (skeleton plus support points) from its underlying generative catalog. TYPE: `ProblemSource`
`tokenizer`	Tokenizer used for expression serialization and padding. TYPE: `Tokenizer`
`padding`	Strategy for padding numeric support points. Required; the constructor signature defines no default. TYPE: `(random, zero)`
`preprocessor`	Prompt-aware preprocessor; when provided, prompt metadata can be injected during sampling or in worker processes. TYPE: `FlashANSRPreprocessor` DEFAULT: `None`
`unconditional_prob`	Fraction of generated examples emitted without a condition. `0.0` conditions every example. TYPE: `float` DEFAULT: `0.0`

Notes

This object owns a multiprocessing worker pool. Call dataset.shutdown() when done, or use it as a context manager (with FlashANSRDataset(...) as dataset:) so the pool is shut down automatically. If neither is done, a warning is emitted at garbage collection.
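
The ownership rule in the note can be sketched with a toy class that releases its resource on `shutdown()`, supports `with`, and warns if it is garbage collected while still active. `ToyPoolOwner` is hypothetical and only mirrors the pattern described above; it does not start real worker processes.

```python
import warnings


class ToyPoolOwner:
    """Hypothetical resource owner; not FlashANSRDataset."""

    def __init__(self):
        self.active = True  # pretend a worker pool was started

    def shutdown(self):
        self.active = False  # safe to call more than once

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.shutdown()
        return False  # do not swallow exceptions raised inside the block

    def __del__(self):
        # Last-resort cleanup with a warning, as the note describes.
        if getattr(self, "active", False):
            warnings.warn("ToyPoolOwner was garbage collected without shutdown().", ResourceWarning)
            self.shutdown()


with ToyPoolOwner() as owner:
    inside = owner.active  # True while the block runs
after = owner.active       # False once the block exits
```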

Source code in src/flash_ansr/data/data.py

def __init__(
    self,
    source: ProblemSource,
    tokenizer: Tokenizer,
    padding: Literal["random", "zero"],
    preprocessor: FlashANSRPreprocessor | None = None,
    unconditional_prob: float = 0.0,
) -> None:
    self.source = source
    self.tokenizer = tokenizer
    self.padding = padding
    self.preprocessor = preprocessor
    # Fraction of generated examples emitted UNCONDITIONED (no condition) -> first-class optional
    # condition (CFG). 0.0 = every example conditioned (original behavior). Set only on the TRAIN
    # dataset; keep 0.0 on val so validation CE stays a pure conditioned metric.
    self.unconditional_prob = float(unconditional_prob)
    self.data = None

    self._collator = BatchFormatter(tokenizer=tokenizer)
    self._stream = SharedMemoryWorkerPool(
        source=source,
        tokenizer=tokenizer,
        padding=padding,
    )
    self._preprocessor_prompt_config = (
        copy.deepcopy(preprocessor.prompt_config) if preprocessor is not None else None
    )

from_config `classmethod` ¶

from_config(config: dict[str, Any] | str) -> FlashANSRDataset

Instantiate from a YAML/dict config.

Paths are normalized via load_config and substitute_root_path. The config carries a source: block: {catalog: <path-to-catalog-yaml OR inline dict>, sampling: {...}}. The catalog (a generative lample_charton catalog) is loaded into a dict and handed to a ProblemSource.

PARAMETER	DESCRIPTION
`config`	Dataset config or path to a YAML file. TYPE: `dict or str`

RETURNS	DESCRIPTION
`FlashANSRDataset`	Dataset wrapper with tokenizer and optional preprocessor wired.
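
A config accepted by this method needs at least `source` (with a `catalog` entry), `tokenizer`, and `padding`, optionally wrapped in a top-level `dataset` key. The sketch below mirrors only those structural checks. `validate_dataset_config` is a hypothetical helper, and the catalog path and empty tokenizer mapping are placeholders rather than a working configuration.

```python
REQUIRED_KEYS = ("source", "tokenizer", "padding")


def validate_dataset_config(config):
    # A top-level "dataset" key is unwrapped first.
    if "dataset" in config:
        config = config["dataset"]
    for key in REQUIRED_KEYS:
        if key not in config:
            raise ValueError(f"Dataset config is missing required key {key!r}.")
    if "catalog" not in config["source"]:
        raise ValueError("Dataset config `source` block is missing required key 'catalog'.")
    return config


example = {
    "dataset": {
        "source": {"catalog": "./catalog.yaml", "sampling": {}},  # placeholder path
        "tokenizer": {},             # placeholder; tokenizer schema is not shown here
        "padding": "zero",
        "unconditional_prob": 0.1,   # optional, defaults to 0.0
    }
}
validated = validate_dataset_config(example)

try:
    validate_dataset_config({"source": {"catalog": "x"}, "tokenizer": {}})
except ValueError as exc:
    missing_message = str(exc)  # names the missing 'padding' key
```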

Source code in src/flash_ansr/data/data.py

@classmethod
def from_config(cls, config: dict[str, Any] | str) -> "FlashANSRDataset":
    """Instantiate from a YAML/dict config.

    Paths are normalized via `load_config` and `substitute_root_path`. The
    config carries a `source:` block: `{catalog: <path-to-catalog-yaml OR
    inline dict>, sampling: {...}}`. The catalog (a generative
    `lample_charton` catalog) is loaded into a dict and handed to a
    `ProblemSource`.

    Parameters
    ----------
    config : dict or str
        Dataset config or path to a YAML file.

    Returns
    -------
    FlashANSRDataset
        Dataset wrapper with tokenizer and optional preprocessor wired.
    """
    config_ = load_config(config)

    if "dataset" in config_.keys():
        config_ = config_["dataset"]

    for key in ("source", "tokenizer", "padding"):
        if key not in config_:
            raise ValueError(f"Dataset config is missing required key {key!r}.")

    source_cfg = config_["source"]
    if "catalog" not in source_cfg:
        raise ValueError("Dataset config `source` block is missing required key 'catalog'.")
    catalog_cfg = source_cfg["catalog"]

    if isinstance(config, str) and isinstance(catalog_cfg, str) and catalog_cfg.startswith('.'):
        catalog_cfg = os.path.join(os.path.dirname(config), catalog_cfg)  # pragma: no cover - config guard
    if isinstance(catalog_cfg, str):
        catalog_cfg = substitute_root_path(catalog_cfg)

    # `source.catalog` may be: a curated NAME[@version] (resolved from HF), a catalog config path,
    # an inline generative-catalog dict, or a DIRECTORY holding a saved generative catalog (a fixed
    # validation pool). ProblemSource resolves names / paths / inline configs via build_catalog; only
    # the saved-directory form is loaded into an instance first (build_catalog has no saved-dir loader).
    catalog_spec: Any
    if isinstance(catalog_cfg, str) and os.path.isdir(catalog_cfg):
        catalog_spec = LampleChartonCatalog.load(catalog_cfg)
    else:
        catalog_spec = catalog_cfg

    source_obj = ProblemSource({"catalog": catalog_spec, "sampling": source_cfg.get("sampling", {})})

    tokenizer = Tokenizer.from_config(config_["tokenizer"])

    preprocessor_cfg = config_.get("preprocessor") if isinstance(config_, dict) else None
    preprocessor: FlashANSRPreprocessor | None = None
    if preprocessor_cfg is not None:
        preprocessor = FlashANSRPreprocessor.from_config(
            preprocessor_cfg,
            simplipy_engine=source_obj.catalog.simplipy_engine,
            tokenizer=tokenizer,
            catalog=source_obj.catalog,
        )

    return cls(
        source=source_obj,
        tokenizer=tokenizer,
        padding=config_["padding"],
        preprocessor=preprocessor,
        unconditional_prob=config_.get("unconditional_prob", 0.0),
    )

iterate ¶

iterate(size: int | None = None, steps: int | None = None, batch_size: int | None = None, n_support: int | None = None, max_seq_len: int = 512, max_n_support: int | None = None, n_per_equation: int = 1, preprocess: bool = False, preprocess_in_worker: bool | None = None, include_metrics: Sequence[str] | str | None = None, tokenizer_oov: Literal['unk', 'raise'] = 'raise', num_workers: int | None = None, prefetch_factor: int = 2, persistent: bool = False, unconditional_prob: float | None = None, tqdm_kwargs: dict[str, Any] | None = None, verbose: bool = False) -> Generator[dict[str, Any], None, None]

Stream batches of synthetic data.

PARAMETER	DESCRIPTION
`size`	Total number of samples to generate (used if `steps` is None). TYPE: `int` DEFAULT: `None`
`steps`	Number of generation steps; overrides `size` when set. TYPE: `int` DEFAULT: `None`
`batch_size`	Samples per step; defaults to 1. TYPE: `int` DEFAULT: `None`
`n_support`	Support points per equation; pool default when None. TYPE: `int` DEFAULT: `None`
`max_seq_len`	Maximum prefix length for generated expressions. TYPE: `int` DEFAULT: `512`
`max_n_support`	Upper bound for support points; used for padding. TYPE: `int` DEFAULT: `None`
`n_per_equation`	Number of datasets to draw per skeleton before moving on. TYPE: `int` DEFAULT: `1`
`preprocess`	Whether to run the preprocessor on generated batches. TYPE: `bool` DEFAULT: `False`
`preprocess_in_worker`	Force preprocessing inside workers (True), main process (False), or auto-select (None). TYPE: `bool` DEFAULT: `None`
`include_metrics`	Metrics to compute for each sampled expression. Supported values: "fisher", "hessian". TYPE: `Sequence[str] or str or None` DEFAULT: `None`
`tokenizer_oov`	How to handle tokens missing from the tokenizer. TYPE: `(unk, 'raise')` DEFAULT: `"raise"`
`num_workers`	Worker count for multiprocessing; defaults to CPU count when None. TYPE: `int` DEFAULT: `None`
`prefetch_factor`	Jobs per worker to pre-schedule. TYPE: `int` DEFAULT: `2`
`persistent`	Clone tensors to detach from shared memory buffers. TYPE: `bool` DEFAULT: `False`
`unconditional_prob`	Fraction of generated examples emitted without a condition. When `None`, the dataset-level `unconditional_prob` is used. TYPE: `float` DEFAULT: `None`
`tqdm_kwargs`	Additional arguments forwarded to tqdm progress bars. TYPE: `dict` DEFAULT: `None`
`verbose`	Enable progress reporting. TYPE: `bool` DEFAULT: `False`

YIELDS	DESCRIPTION
`dict`	Model-ready batch with tensors and optional prompt metadata.
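
How `size`, `steps`, and `batch_size` interact can be shown in a few lines: `steps` wins when given, otherwise `size` is rounded up to whole batches. `resolve_steps` is a hypothetical helper written for this illustration.

```python
def resolve_steps(size=None, steps=None, batch_size=None):
    if batch_size is None:
        batch_size = 1
    if steps is None and size is None:
        raise ValueError("Either size or steps must be specified.")
    if steps is None:
        # Ceiling division: a partial final batch still costs one step.
        steps = (size + batch_size - 1) // batch_size
    return steps


a = resolve_steps(size=10, batch_size=4)           # 3
b = resolve_steps(size=10, steps=2, batch_size=4)  # 2, steps overrides size
c = resolve_steps(size=7)                          # 7, batch_size defaults to 1
```

Assuming every yielded batch is full, `size=10` with `batch_size=4` therefore produces 12 samples rather than 10.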

Source code in src/flash_ansr/data/data.py

def iterate(
    self,
    size: int | None = None,
    steps: int | None = None,
    batch_size: int | None = None,
    n_support: int | None = None,
    max_seq_len: int = 512,
    max_n_support: int | None = None,
    n_per_equation: int = 1,
    preprocess: bool = False,
    preprocess_in_worker: bool | None = None,
    include_metrics: Sequence[str] | str | None = None,
    tokenizer_oov: Literal["unk", "raise"] = "raise",
    num_workers: int | None = None,
    prefetch_factor: int = 2,
    persistent: bool = False,
    unconditional_prob: float | None = None,
    tqdm_kwargs: dict[str, Any] | None = None,
    verbose: bool = False,
) -> Generator[dict[str, Any], None, None]:
    """Stream batches of synthetic data.

    Parameters
    ----------
    size : int, optional
        Total number of samples to generate (used if `steps` is None).
    steps : int, optional
        Number of generation steps; overrides `size` when set.
    batch_size : int, optional
        Samples per step; defaults to 1.
    n_support : int, optional
        Support points per equation; pool default when None.
    max_seq_len : int, default 512
        Maximum prefix length for generated expressions.
    max_n_support : int, optional
        Upper bound for support points; used for padding.
    n_per_equation : int, default 1
        Number of datasets to draw per skeleton before moving on.
    preprocess : bool, default False
        Whether to run the preprocessor on generated batches.
    preprocess_in_worker : bool, optional
        Force preprocessing inside workers (True), main process (False), or auto-select (None).
    include_metrics : Sequence[str] or str or None, default None
        Metrics to compute for each sampled expression. Supported values: "fisher", "hessian".
    tokenizer_oov : {"unk", "raise"}, default "raise"
        How to handle tokens missing from the tokenizer.
    num_workers : int, optional
        Worker count for multiprocessing; defaults to CPU count when None.
    prefetch_factor : int, default 2
        Jobs per worker to pre-schedule.
    persistent : bool, default False
        Clone tensors to detach from shared memory buffers.
    tqdm_kwargs : dict, optional
        Additional arguments forwarded to tqdm progress bars.
    verbose : bool, default False
        Enable progress reporting.

    Yields
    ------
    dict
        Model-ready batch with tensors and optional prompt metadata.
    """
    if batch_size is None:
        batch_size = 1

    tqdm_kwargs = dict(tqdm_kwargs) if tqdm_kwargs else {}

    use_worker_preprocess = False
    if preprocess:
        if self.preprocessor is None:
            if preprocess_in_worker:
                warnings.warn(
                    "worker preprocessing requested but no preprocessor configured; falling back to main process.",
                    RuntimeWarning,
                    stacklevel=2,
                )
        else:
            if preprocess_in_worker is None:
                use_worker_preprocess = True
            else:
                use_worker_preprocess = bool(preprocess_in_worker)

    if self._stream.is_initialized and self._stream.worker_preprocess_enabled != use_worker_preprocess:
        raise RuntimeError(
            "Cannot switch worker preprocessing mode while workers are active. "
            "Call `dataset.shutdown()` before iterating with a new mode."
        )

    if self.data is not None:
        if include_metrics:
            warnings.warn(
                "Metric computation is only supported for streaming datasets; ignoring include_metrics.",
                RuntimeWarning,
                stacklevel=2,
            )
        precompiled_kwargs = tqdm_kwargs.copy()
        precompiled_kwargs.setdefault("desc", "Iterating over pre-compiled dataset")
        precompiled_kwargs.setdefault("disable", not verbose)
        precompiled_kwargs.setdefault("smoothing", 0.0)
        yield from tqdm(self.data, **precompiled_kwargs)
        return

    if steps is None and size is None:
        raise ValueError("Either size or steps must be specified.")

    if steps is None:
        assert size is not None
        steps = (size + batch_size - 1) // batch_size

    effective_unconditional_prob = self.unconditional_prob if unconditional_prob is None else float(unconditional_prob)
    self._initialize_stream(
        prefetch_factor=prefetch_factor,
        batch_size=batch_size,
        n_per_equation=n_per_equation,
        max_seq_len=max_seq_len,
        max_n_support=max_n_support,
        num_workers=num_workers,
        tokenizer_oov=tokenizer_oov,
        worker_preprocess=use_worker_preprocess,
        unconditional_prob=effective_unconditional_prob,
    )

    if self._stream.metadata_pool is None or not self._stream.buffers:
        raise RuntimeError("Multiprocessing resources are not properly initialized.")

    pool_size = self._stream.pool_size

    progress_kwargs = tqdm_kwargs.copy()
    progress_kwargs.setdefault("total", steps)
    progress_kwargs.setdefault("desc", "Generating Batches")
    progress_kwargs.setdefault("disable", not verbose)
    progress_kwargs.setdefault("smoothing", 0.0)
    pbar = tqdm(**progress_kwargs)

    try:
        for _ in range(min(pool_size, steps)):
            slot_idx = self._stream.acquire_slot()
            self._stream.submit_job(slot_idx, n_support)

        for step_id in range(steps):
            completed_slot_idx = self._stream.get_completed_slot()
            metadata_and_constants = self._stream.metadata_pool[completed_slot_idx]
            if metadata_and_constants is None:
                raise RuntimeError("Worker returned empty payload.")

            metadata_batch = metadata_and_constants["metadata"]
            metadata_fields: dict[str, list[Any]] = {}
            if metadata_batch:
                for key in metadata_batch[0]:
                    metadata_fields[key] = [entry[key] for entry in metadata_batch]

            batch_dict = {
                "x_tensors": torch.from_numpy(self._stream.buffers["x_tensors"][completed_slot_idx]),
                "y_tensors": torch.from_numpy(self._stream.buffers["y_tensors"][completed_slot_idx]),
                "data_attn_mask": torch.from_numpy(self._stream.buffers["data_attn_mask"][completed_slot_idx]).to(torch.bool),
                "input_ids": torch.from_numpy(self._stream.buffers["input_ids"][completed_slot_idx]),
                "constants": [
                    torch.from_numpy(c)
                    for c in metadata_and_constants["constants"]
                ],
            }
            batch_dict.update(metadata_fields)

            preprocessed_batch = metadata_and_constants.get("preprocessed")
            if preprocess:
                if use_worker_preprocess:
                    if preprocessed_batch is not None:
                        self._inject_preprocessed_fields(batch_dict, preprocessed_batch)
                    elif self.preprocessor:
                        batch_dict = self.preprocessor.format(batch_dict)
                elif self.preprocessor:
                    batch_dict = self.preprocessor.format(batch_dict)

            self._collator.ensure_numeric_channel(batch_dict)

            if include_metrics:
                self._compute_expression_metrics(batch_dict, include_metrics)

            if persistent:
                cloned_batch: dict[str, Any] = {}
                for key, value in batch_dict.items():
                    if isinstance(value, torch.Tensor):
                        cloned_batch[key] = value.clone()
                    elif key == "constants" and isinstance(value, list):
                        cloned_batch[key] = [tensor.clone() for tensor in value]
                    elif key == "constants":
                        cloned_batch[key] = value
                    else:
                        cloned_batch[key] = value
                batch_dict = cloned_batch

            yield batch_dict

            pbar.update(1)

            self._stream.release_slot(completed_slot_idx)
            if step_id + pool_size < steps:
                slot_to_refill = self._stream.acquire_slot()
                self._stream.submit_job(slot_to_refill, n_support)
    finally:
        pbar.close()
        self.shutdown()
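
The scheduling loop above keeps at most `pool_size` jobs in flight: it primes the pool, then for each step consumes one completed slot, yields it, releases it, and refills only while jobs remain. The sketch below simulates that bookkeeping in a single process. `stream_with_slots` is hypothetical, and a FIFO queue stands in for the workers' completion order, which is an assumption.

```python
from collections import deque


def stream_with_slots(steps, pool_size):
    free_slots = deque(range(pool_size))
    in_flight = deque()  # FIFO stands in for the workers' completion queue
    submitted = 0

    def submit():
        nonlocal submitted
        slot = free_slots.popleft()          # acquire a free slot
        in_flight.append((slot, submitted))  # submit a job writing into that slot
        submitted += 1

    # Prime the pipeline with as many jobs as there are slots (or steps).
    for _ in range(min(pool_size, steps)):
        submit()

    for step_id in range(steps):
        slot, job_id = in_flight.popleft()   # wait for a completed slot
        yield slot, job_id
        free_slots.append(slot)              # release the slot for reuse
        if step_id + pool_size < steps:
            submit()                         # refill only while jobs remain


trace = list(stream_with_slots(steps=5, pool_size=2))
# trace == [(0, 0), (1, 1), (0, 2), (1, 3), (0, 4)]
```

The trace shows each slot being reused as soon as it is released. This is why a yielded batch that still references a slot's buffer should be cloned (`persistent=True`) if it must outlive the next refill.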

compile ¶

compile(size: int | None = None, steps: int | None = None, batch_size: int | None = None, n_support: int | None = None, verbose: bool = False) -> None

Materialize a streaming iterator into an on-disk dataset.

PARAMETER	DESCRIPTION
`size`	Total number of samples to generate (used if `steps` is None). TYPE: `int` DEFAULT: `None`
`steps`	Number of iteration steps (overrides `size` when provided). TYPE: `int` DEFAULT: `None`
`batch_size`	Per-step generation batch size; defaults to 1. TYPE: `int` DEFAULT: `None`
`n_support`	Number of support points per equation; falls back to pool defaults. TYPE: `int` DEFAULT: `None`
`verbose`	Enable progress reporting. TYPE: `bool` DEFAULT: `False`

Source code in src/flash_ansr/data/data.py

def compile(
    self,
    size: int | None = None,
    steps: int | None = None,
    batch_size: int | None = None,
    n_support: int | None = None,
    verbose: bool = False,
) -> None:
    """Materialize a streaming iterator into an on-disk dataset.

    Parameters
    ----------
    size : int, optional
        Total number of samples to generate (used if `steps` is None).
    steps : int, optional
        Number of iteration steps (overrides `size` when provided).
    batch_size : int, optional
        Per-step generation batch size; defaults to 1.
    n_support : int, optional
        Number of support points per equation; falls back to pool defaults.
    verbose : bool, default False
        Enable progress reporting.
    """
    disable_progress_bars()
    if size is None and steps is None:
        size = self.source.size_hint()
        if size is None:
            raise ValueError(
                "Cannot infer a dataset size from an unbounded ProblemSource. "
                "Pass an explicit `size` or `steps` to `compile()`."
            )

    self.data = Dataset.from_list(
        list(
            self.iterate(
                size=size,
                steps=steps,
                batch_size=batch_size,
                n_support=n_support,
                verbose=verbose,
                persistent=True,  # clone tensors out of worker shared memory before shutdown frees it (avoids use-after-free)
            )
        )
    )

save ¶

save(directory: str, *args: Any, config: dict[str, Any] | str | None = None, reference: str = 'relative', recursive: bool = True, **kwargs: Any) -> None

Persist the compiled dataset and its config.

PARAMETER	DESCRIPTION
`directory`	Target directory for `dataset/` artifacts and `dataset.yaml`. TYPE: `str`
`config`	Config to save alongside the dataset. When omitted a warning is raised and only the data is stored. TYPE: `dict or str` DEFAULT: `None`
`reference`	How to normalize paths when writing the config. TYPE: `str` DEFAULT: `"relative"`
`recursive`	Whether to recursively resolve nested configs. TYPE: `bool` DEFAULT: `True`
`*args`	Passed to `datasets.Dataset.save_to_disk`. TYPE: `Any` DEFAULT: `()`
`**kwargs`	Passed to `datasets.Dataset.save_to_disk`. TYPE: `Any` DEFAULT: `{}`

Source code in src/flash_ansr/data/data.py

def save(
    self,
    directory: str,
    *args: Any,
    config: dict[str, Any] | str | None = None,
    reference: str = "relative",
    recursive: bool = True,
    **kwargs: Any,
) -> None:
    """Persist the compiled dataset and its config.

    Parameters
    ----------
    directory : str
        Target directory for `dataset/` artifacts and `dataset.yaml`.
    config : dict or str, optional
        Config to save alongside the dataset. When omitted a warning is
        raised and only the data is stored.
    reference : str, default "relative"
        How to normalize paths when writing the config.
    recursive : bool, default True
        Whether to recursively resolve nested configs.
    *args, **kwargs : Any
        Passed to `datasets.Dataset.save_to_disk`.
    """
    if self.data is None:
        raise ValueError("No dataset to save. Please generate or load a dataset first.")

    directory = substitute_root_path(directory)
    os.makedirs(directory, exist_ok=True)

    self.data.save_to_disk(os.path.join(directory, "dataset"), *args, **kwargs)

    if config is None:
        warnings.warn(
            "No config specified, saving the dataset without a config file. "
            "Loading the dataset will require manual configuration.",
        )
    else:
        save_config(
            load_config(config, resolve_paths=True),
            directory=directory,
            filename="dataset.yaml",
            reference=reference,
            recursive=recursive,
            resolve_paths=True,
        )

shutdown ¶

shutdown() -> None

Release multiprocessing workers and shared buffers.

Source code in src/flash_ansr/data/data.py

def shutdown(self) -> None:
    """Release multiprocessing workers and shared buffers."""
    self._stream.shutdown()

FlashANSRPreprocessor¶

Format batch inputs and optionally enrich them with prompt metadata.

Source code in src/flash_ansr/preprocessing/pipeline.py

def __init__(
    self,
    simplipy_engine: SimpliPyEngine,
    tokenizer: Tokenizer,
    catalog: LampleChartonCatalog | None = None,
    *,
    prompt_config: FlashANSRPreprocessorConfig | dict[str, Any] | None = None,
    rng: np.random.Generator | None = None,
) -> None:
    self.simplipy_engine = simplipy_engine
    self.tokenizer = tokenizer
    self.catalog = catalog
    self._rng = rng if rng is not None else np.random.default_rng()

    self.prompt_config = FlashANSRPreprocessorConfig.from_dict(prompt_config)
    self._prompt_enabled = (
        catalog is not None
        and self.prompt_config.prompt_feature.prompt_probability > 0
    )

    self._feature_extractor: PromptFeatureExtractor | None = None
    if self._prompt_enabled:
        self._feature_extractor = PromptFeatureExtractor(
            simplipy_engine=simplipy_engine,
            tokenizer=tokenizer,
            config=self.prompt_config.prompt_feature,
            catalog=catalog,
            rng=self._rng,
        )

    self._serializer = PromptSerializer(tokenizer)

from_config `classmethod` ¶

from_config(config: dict[str, Any] | str | None, *, simplipy_engine: SimpliPyEngine, tokenizer: Tokenizer, catalog: LampleChartonCatalog | None = None, rng: Generator | None = None) -> 'FlashANSRPreprocessor'

Construct a preprocessor from a config plus the required runtime dependencies.

PARAMETER	DESCRIPTION
`config`	Config mapping or path to a config file. A top-level `"preprocessor"` key is unwrapped, and its `"prompt"` section configures prompt enrichment. `None` or a non-mapping config yields default (prompt-disabled) settings. TYPE: `dict[str, Any] or str or None`
`simplipy_engine`	Engine used to manipulate and evaluate symbolic expressions. TYPE: `SimpliPyEngine`
`tokenizer`	Tokenizer used to serialize prompts and expressions. TYPE: `Tokenizer`
`catalog`	Catalog enabling prompt-feature extraction; prompts are only emitted when a catalog is supplied and the configured prompt probability is positive. TYPE: `LampleChartonCatalog` DEFAULT: `None`
`rng`	Random generator driving stochastic prompt inclusion. Defaults to a fresh generator. TYPE: `Generator` DEFAULT: `None`

RETURNS	DESCRIPTION
`FlashANSRPreprocessor`	The configured preprocessor.
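
The unwrapping rules for `config` can be sketched as follows, assuming the config has already been loaded into a Python object. `extract_prompt_config` is a hypothetical helper, and the contents of the `"prompt"` section are placeholders because its schema is not shown here.

```python
def extract_prompt_config(config):
    # A top-level "preprocessor" key is unwrapped first.
    if isinstance(config, dict) and "preprocessor" in config:
        config = config["preprocessor"]
    # None or any non-mapping config falls back to defaults (no prompt section).
    if not isinstance(config, dict):
        config = {}
    return config.get("prompt")


prompt_section = {"placeholder_option": 0.5}  # hypothetical contents
wrapped = extract_prompt_config({"preprocessor": {"prompt": prompt_section}})
bare = extract_prompt_config({"prompt": prompt_section})
absent = extract_prompt_config(None)  # None: prompt enrichment stays at its defaults
```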

Source code in src/flash_ansr/preprocessing/pipeline.py

@classmethod
def from_config(
    cls,
    config: dict[str, Any] | str | None,
    *,
    simplipy_engine: SimpliPyEngine,
    tokenizer: Tokenizer,
    catalog: LampleChartonCatalog | None = None,
    rng: np.random.Generator | None = None,
) -> "FlashANSRPreprocessor":
    """Construct a preprocessor from a config plus the required runtime dependencies.

    Parameters
    ----------
    config : dict[str, Any] or str or None
        Config mapping or path to a config file. A top-level ``"preprocessor"`` key is
        unwrapped, and its ``"prompt"`` section configures prompt enrichment. ``None`` or a
        non-mapping config yields default (prompt-disabled) settings.
    simplipy_engine : SimpliPyEngine
        Engine used to manipulate and evaluate symbolic expressions.
    tokenizer : Tokenizer
        Tokenizer used to serialize prompts and expressions.
    catalog : LampleChartonCatalog, optional
        Catalog enabling prompt-feature extraction; prompts are only emitted when a catalog
        is supplied and the configured prompt probability is positive.
    rng : numpy.random.Generator, optional
        Random generator driving stochastic prompt inclusion. Defaults to a fresh generator.

    Returns
    -------
    FlashANSRPreprocessor
        The configured preprocessor.
    """
    config_ = load_config(config)

    if isinstance(config_, dict) and "preprocessor" in config_:
        config_ = config_["preprocessor"]

    if not isinstance(config_, dict):
        config_ = {}

    prompt_cfg = config_.get("prompt")

    return cls(
        simplipy_engine=simplipy_engine,
        tokenizer=tokenizer,
        catalog=catalog,
        prompt_config=prompt_cfg,
        rng=rng,
    )
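
The unwrapping rules above can be sketched in isolation. This is a minimal illustration, not library code: `extract_prompt_cfg` is a hypothetical helper that mirrors the three steps in the listing, and the `"enabled"` key is invented purely to have something to print.

```python
from typing import Any


def extract_prompt_cfg(config: Any) -> Any:
    """Mirror the unwrapping steps of from_config (hypothetical helper)."""
    # Step 1: unwrap a top-level "preprocessor" section when present.
    if isinstance(config, dict) and "preprocessor" in config:
        config = config["preprocessor"]
    # Step 2: anything that is not a mapping (e.g. None) becomes an empty config.
    if not isinstance(config, dict):
        config = {}
    # Step 3: a missing "prompt" key yields None, i.e. default prompt settings.
    return config.get("prompt")


# "enabled" is an illustrative key, not a documented option.
nested = {"preprocessor": {"prompt": {"enabled": True}}}
flat = {"prompt": {"enabled": True}}

print(extract_prompt_cfg(nested))
print(extract_prompt_cfg(flat))
print(extract_prompt_cfg(None))
```

Both the nested and the flat layout lead to the same prompt section, while `None` falls through to the defaults.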

format ¶

format(batch: dict[str, Any]) -> dict[str, Any]

Format a batch instance-by-instance, optionally enriching it with prompt metadata.

Each instance in `batch` is formatted (adding `input_num` / `prompt_mask` / `prompt_metadata` and, when enabled, a sampled prompt prefix), then the results are re-stacked back into per-key lists.

PARAMETER	DESCRIPTION
`batch`	A batch mapping keys to per-instance sequences; expected to contain `"input_ids"`. TYPE: `dict[str, Any]`

RETURNS	DESCRIPTION
`dict[str, Any]`	The same batch object, updated in place with the formatted fields. Returned unchanged if `"input_ids"` is absent or the batch is empty.

Source code in src/flash_ansr/preprocessing/pipeline.py

def format(self, batch: dict[str, Any]) -> dict[str, Any]:
    """Format a batch instance-by-instance, optionally enriching it with prompt metadata.

    Each instance in ``batch`` is formatted (adding ``input_num`` / ``prompt_mask`` /
    ``prompt_metadata`` and, when enabled, a sampled prompt prefix), then the results are
    re-stacked back into per-key lists.

    Parameters
    ----------
    batch : dict[str, Any]
        A batch mapping keys to per-instance sequences; expected to contain ``"input_ids"``.

    Returns
    -------
    dict[str, Any]
        The batch with formatted fields. Returned unchanged if ``"input_ids"`` is absent or
        the batch is empty.
    """
    input_ids = batch.get("input_ids")
    if input_ids is None:
        return batch

    batch_size = len(input_ids)

    formatted_instances: list[dict[str, Any]] = []
    for idx in range(batch_size):
        instance = {key: self._select_batch_item(value, idx) for key, value in batch.items()}
        formatted_instances.append(self._format_single(instance))

    if not formatted_instances:
        return batch

    for key in formatted_instances[0].keys():
        batch[key] = [instance[key] for instance in formatted_instances]

    return batch
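
The unstack, format, re-stack flow can be tried without the library. The sketch below is an assumption-laden stand-in: `select_item` and `format_single` are simplified placeholders for the private `_select_batch_item` and `_format_single` helpers, whose real behaviour is not shown here.

```python
from typing import Any


def select_item(value: Any, idx: int) -> Any:
    """Placeholder for `_select_batch_item`: assumes plain indexing is enough."""
    return value[idx]


def format_single(instance: dict[str, Any]) -> dict[str, Any]:
    """Placeholder for `_format_single`: adds one derived field per instance."""
    formatted = dict(instance)
    formatted["prompt_mask"] = [False] * len(instance["input_ids"])
    return formatted


def format_batch(batch: dict[str, Any]) -> dict[str, Any]:
    input_ids = batch.get("input_ids")
    if input_ids is None:
        return batch  # nothing to format

    # Unstack: build one dict per instance, then format each on its own.
    instances = [
        format_single({key: select_item(value, idx) for key, value in batch.items()})
        for idx in range(len(input_ids))
    ]
    if not instances:
        return batch  # empty batch

    # Re-stack: every key of the formatted instances becomes a per-key list.
    for key in instances[0]:
        batch[key] = [instance[key] for instance in instances]
    return batch


batch = {"input_ids": [[1, 2, 3], [4, 5]]}
out = format_batch(batch)
print(out["prompt_mask"])
print(out is batch)
```

As in the listing above, the input mapping is mutated and returned rather than copied, so callers that need the original batch should copy it first.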

serialize_prompt_prefix ¶

serialize_prompt_prefix(*, complexity: float | int | None = None, allowed_terms: Iterable[Sequence[Any]] | None = None, include_terms: Iterable[Sequence[Any]] | None = None, exclude_terms: Iterable[Sequence[Any]] | None = None) -> dict[str, Any]

Serialize an explicit prompt prefix constraining generation.

Builds the token prefix (starting from `<bos>`) that encodes the requested constraints, emitting the `<prompt>` block only when the tokenizer defines the needed special tokens.

PARAMETER	DESCRIPTION
`complexity`	Target expression complexity to encode in the prompt. TYPE: `float or int` DEFAULT: `None`
`allowed_terms`	Terms the generated expression is restricted to. TYPE: `iterable of sequences` DEFAULT: `None`
`include_terms`	Terms that must appear in the generated expression. TYPE: `iterable of sequences` DEFAULT: `None`
`exclude_terms`	Terms that must not appear in the generated expression. TYPE: `iterable of sequences` DEFAULT: `None`

RETURNS	DESCRIPTION
`dict[str, Any]`	The serialized prefix with `input_ids`, `input_num`, `prompt_mask` and `prompt_metadata` entries.

Source code in src/flash_ansr/preprocessing/pipeline.py

def serialize_prompt_prefix(
    self,
    *,
    complexity: float | int | None = None,
    allowed_terms: Iterable[Sequence[Any]] | None = None,
    include_terms: Iterable[Sequence[Any]] | None = None,
    exclude_terms: Iterable[Sequence[Any]] | None = None,
) -> dict[str, Any]:
    """Serialize an explicit prompt prefix constraining generation.

    Builds the token prefix (starting from ``<bos>``) that encodes the requested constraints,
    emitting the ``<prompt>`` block only when the tokenizer defines the needed special tokens.

    Parameters
    ----------
    complexity : float or int, optional
        Target expression complexity to encode in the prompt.
    allowed_terms : iterable of sequences, optional
        Terms the generated expression is restricted to.
    include_terms : iterable of sequences, optional
        Terms that must appear in the generated expression.
    exclude_terms : iterable of sequences, optional
        Terms that must not appear in the generated expression.

    Returns
    -------
    dict[str, Any]
        The serialized prefix with ``input_ids``, ``input_num``, ``prompt_mask`` and
        ``prompt_metadata`` entries.
    """
    return self._serializer.serialize_prompt_prefix(
        complexity=complexity,
        allowed_terms=allowed_terms,
        include_terms=include_terms,
        exclude_terms=exclude_terms,
    )

Generation configurations¶

BeamSearchConfig¶

Configuration for beam-search based generation.

Source code in src/flash_ansr/utils/generation.py

def __init__(
    self,
    *,
    beam_width: int = 32,
    max_len: int = 32,
    batch_size: int = 128,
    unique: bool = True,
    limit_expansions: bool = True,
    use_cache: bool = True,   # KV cache ON by default (quality-equivalent; the inference speed win)
) -> None:
    self.method = 'beam_search'
    self.beam_width = beam_width
    self.max_len = max_len
    self.batch_size = batch_size
    self.unique = unique
    self.limit_expansions = limit_expansions
    self.use_cache = use_cache

to_kwargs ¶

to_kwargs() -> dict[str, Any]

Return the beam-search keyword arguments (`beam_width`, `max_len`, ...).

Source code in src/flash_ansr/utils/generation.py

def to_kwargs(self) -> dict[str, Any]:
    """Return the beam-search keyword arguments (``beam_width``, ``max_len``, ...)."""
    return {
        'beam_width': self.beam_width,
        'max_len': self.max_len,
        'batch_size': self.batch_size,
        'unique': self.unique,
        'limit_expansions': self.limit_expansions,
        'use_cache': self.use_cache,
    }
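
A short sketch of how such a config object is typically consumed: the dictionary from `to_kwargs` is unpacked into a decoding function. `BeamSearchConfigSketch` is a local copy of the listing above so the example runs without the package, and `run_beam_search` is a hypothetical consumer, not a function of the library.

```python
from typing import Any


class BeamSearchConfigSketch:
    """Local stand-in mirroring the constructor and to_kwargs listed above."""

    def __init__(self, *, beam_width: int = 32, max_len: int = 32,
                 batch_size: int = 128, unique: bool = True,
                 limit_expansions: bool = True, use_cache: bool = True) -> None:
        self.method = 'beam_search'
        self.beam_width = beam_width
        self.max_len = max_len
        self.batch_size = batch_size
        self.unique = unique
        self.limit_expansions = limit_expansions
        self.use_cache = use_cache

    def to_kwargs(self) -> dict[str, Any]:
        return {
            'beam_width': self.beam_width,
            'max_len': self.max_len,
            'batch_size': self.batch_size,
            'unique': self.unique,
            'limit_expansions': self.limit_expansions,
            'use_cache': self.use_cache,
        }


def run_beam_search(*, beam_width: int, max_len: int, batch_size: int,
                    unique: bool, limit_expansions: bool, use_cache: bool) -> str:
    """Hypothetical consumer: it only reports part of what it received."""
    return f"beam_width={beam_width}, max_len={max_len}"


config = BeamSearchConfigSketch(beam_width=8)
kwargs = config.to_kwargs()
summary = run_beam_search(**kwargs)
print(summary)
```

Note that `method` is stored on the instance but is not part of the returned dictionary, so it can select a decoding routine without being forwarded to it.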

SoftmaxSamplingConfig¶

Configuration for softmax sampling generation.

Source code in src/flash_ansr/utils/generation.py

def __init__(
    self,
    *,
    choices: int = 1024,
    top_k: int = 0,
    top_p: float = 1.0,
    max_len: int = 64,
    batch_size: int | str = 'auto',   # c-adaptive chunk size (suggest_batch_size); int overrides
    temperature: float = 1.0,
    valid_only: bool = True,
    simplify: bool | str = True,
    unique: bool = True,
    use_cache: bool = True,   # KV cache ON by default (quality-equivalent; the inference speed win)
    static_decode: bool | None = None,   # tri-state: None=deployed default for capable models, True/False explicit
    guidance_weight: float | None = None,   # classifier-free guidance (optcond models only); None=plain conditioned decode
) -> None:
    self.method = 'softmax_sampling'
    self.choices = choices
    self.top_k = top_k
    self.top_p = top_p
    self.max_len = max_len
    self.batch_size = batch_size
    self.temperature = temperature
    self.valid_only = valid_only
    self.simplify = simplify
    self.unique = unique
    self.use_cache = use_cache
    self.static_decode = static_decode
    self.guidance_weight = guidance_weight
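
For readers unfamiliar with the `temperature`, `top_k` and `top_p` knobs, the following pure-Python sketch shows the textbook filtering they usually denote. It assumes the common convention that `top_k=0` and `top_p=1.0` disable the respective filter; the listing above does not confirm that the library implements sampling this way, and `filter_probs` is a hypothetical name.

```python
import math


def filter_probs(logits, *, temperature=1.0, top_k=0, top_p=1.0):
    """Return renormalised probabilities after temperature, top-k and top-p filtering."""
    # Temperature scaling followed by a numerically stable softmax.
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)
    weights = [math.exp(value - peak) for value in scaled]
    total = sum(weights)
    probs = [weight / total for weight in weights]

    # Rank token indices from most to least probable.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k > 0:
        order = order[:top_k]  # keep only the k most probable tokens
    if top_p < 1.0:
        # Keep the smallest prefix whose cumulative probability reaches top_p.
        kept, cumulative = [], 0.0
        for index in order:
            kept.append(index)
            cumulative += probs[index]
            if cumulative >= top_p:
                break
        order = kept

    # Zero out everything else and renormalise the survivors.
    keep = set(order)
    mass = sum(probs[i] for i in order)
    return [probs[i] / mass if i in keep else 0.0 for i in range(len(probs))]


logits = [2.0, 1.0, 0.0, -1.0]
print(filter_probs(logits))
print(filter_probs(logits, top_k=2))
print(filter_probs(logits, top_p=0.5))
```

With the default values the distribution is left untouched apart from the softmax, which matches the defaults `top_k=0` and `top_p=1.0` in the constructor under the stated convention.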

to_kwargs ¶

to_kwargs() -> dict[str, Any]

Return the softmax-sampling keyword arguments (`choices`, `top_k`, `top_p`, ...).

Source code in src/flash_ansr/utils/generation.py

def to_kwargs(self) -> dict[str, Any]:
    """Return the softmax-sampling keyword arguments (``choices``, ``top_k``, ``top_p``, ...)."""
    return {
        'choices': self.choices,
        'top_k': self.top_k,
        'top_p': self.top_p,
        'max_len': self.max_len,
        'batch_size': self.batch_size,
        'temperature': self.temperature,
        'valid_only': self.valid_only,
        'simplify': self.simplify,
        'unique': self.unique,
        'use_cache': self.use_cache,
        'static_decode': self.static_decode,
        'guidance_weight': self.guidance_weight,
    }

MCTSGenerationConfig¶

Configuration for Monte Carlo tree search generation.

Source code in src/flash_ansr/utils/generation.py

def __init__(
    self,
    *,
    beam_width: int = 16,
    simulations: int = 256,
    max_rollouts: int | None = None,
    refine_budget: int | None = None,
    batch_width: int = 32,
    async_search: bool = False,
    inflight: int = 128,
    gpu_batch: int | None = None,
    uct_c: float = 1.4,
    expansion_top_k: int = 32,
    max_depth: int = 64,
    rollout_max_len: int | None = None,
    rollout_policy: str = 'sample',
    temperature: float = 1.0,
    rollout_resample_retries: int = 8,
    dirichlet_alpha: float | None = None,
    dirichlet_epsilon: float = 0.25,
    backup: str = 'max',
    fpu_reduction: float = 0.0,
    renormalize_prior: bool = True,
    reward_log_fvu_hi: float = 0.0,
    reward_log_fvu_lo: float = -8.0,
    value_objective: str = 'score',
    invalid_penalty: float = 1.0,
    min_visits_before_expansion: int = 1,
    reward_transform: Callable[[float], float] | None = None,
    completion_sort: str = 'reward',
) -> None:
    self.method = 'mcts'
    self.beam_width = beam_width
    self.simulations = simulations
    self.max_rollouts = max_rollouts
    self.refine_budget = refine_budget
    self.batch_width = batch_width
    self.async_search = async_search
    self.inflight = inflight
    self.gpu_batch = gpu_batch
    self.uct_c = uct_c
    self.expansion_top_k = expansion_top_k
    self.max_depth = max_depth
    self.rollout_max_len = rollout_max_len
    self.rollout_policy = rollout_policy
    self.temperature = temperature
    self.rollout_resample_retries = rollout_resample_retries
    self.dirichlet_alpha = dirichlet_alpha
    self.dirichlet_epsilon = dirichlet_epsilon
    self.backup = backup
    self.fpu_reduction = fpu_reduction
    self.renormalize_prior = renormalize_prior
    self.reward_log_fvu_hi = reward_log_fvu_hi
    self.reward_log_fvu_lo = reward_log_fvu_lo
    self.value_objective = value_objective
    self.invalid_penalty = invalid_penalty
    self.min_visits_before_expansion = min_visits_before_expansion
    self.reward_transform = reward_transform
    self.completion_sort = completion_sort

    if completion_sort not in ('reward', 'log_prob'):
        raise ValueError("completion_sort must be either 'reward' or 'log_prob'")
    if backup not in ('max', 'mean'):
        raise ValueError("backup must be either 'max' or 'mean'")
    if rollout_policy not in ('sample', 'greedy'):
        raise ValueError("rollout_policy must be either 'sample' or 'greedy'")

to_kwargs ¶

to_kwargs() -> dict[str, Any]

Return the MCTS keyword arguments (`simulations`, `uct_c`, `max_depth`, ...).

Source code in src/flash_ansr/utils/generation.py

def to_kwargs(self) -> dict[str, Any]:
    """Return the MCTS keyword arguments (``simulations``, ``uct_c``, ``max_depth``, ...)."""
    return {
        'beam_width': self.beam_width,
        'simulations': self.simulations,
        'max_rollouts': self.max_rollouts,
        'refine_budget': self.refine_budget,
        'batch_width': self.batch_width,
        'async_search': self.async_search,
        'inflight': self.inflight,
        'gpu_batch': self.gpu_batch,
        'uct_c': self.uct_c,
        'expansion_top_k': self.expansion_top_k,
        'max_depth': self.max_depth,
        'rollout_max_len': self.rollout_max_len,
        'rollout_policy': self.rollout_policy,
        'temperature': self.temperature,
        'rollout_resample_retries': self.rollout_resample_retries,
        'dirichlet_alpha': self.dirichlet_alpha,
        'dirichlet_epsilon': self.dirichlet_epsilon,
        'backup': self.backup,
        'fpu_reduction': self.fpu_reduction,
        'renormalize_prior': self.renormalize_prior,
        'reward_log_fvu_hi': self.reward_log_fvu_hi,
        'reward_log_fvu_lo': self.reward_log_fvu_lo,
        'value_objective': self.value_objective,
        'invalid_penalty': self.invalid_penalty,
        'min_visits_before_expansion': self.min_visits_before_expansion,
        'reward_transform': self.reward_transform,
        'completion_sort': self.completion_sort,
    }

Utilities¶

`get_path` resolves a path relative to the project root (see `get_root`).

Optionally creates the directories leading to the resolved path when `create` is set.

Source code in src/flash_ansr/utils/paths.py

def get_path(*args: str, filename: str | None = None, create: bool = False) -> str:
    """Resolve a path relative to the project root (see :func:`get_root`).

    Optionally creates the directories leading to the resolved path when ``create`` is set.
    """
    if any(not isinstance(arg, str) for arg in args):
        raise TypeError("All arguments must be strings.")

    path = normalize_path_preserve_leading_dot(
        os.path.join(get_root(), *args, filename or '')
    )

    if create:
        if filename is not None:
            os.makedirs(os.path.dirname(path), exist_ok=True)
        else:
            os.makedirs(path, exist_ok=True)

    return os.path.abspath(path)
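
The `create` semantics are easiest to see with a throwaway root. The sketch below is a simplified stand-in, not the library function: `get_path_sketch` takes the root as an explicit argument instead of calling `get_root`, and uses `os.path.normpath` in place of `normalize_path_preserve_leading_dot`. The directory and file names are made up.

```python
import os
import tempfile
from typing import Optional


def get_path_sketch(root: str, *args: str, filename: Optional[str] = None,
                    create: bool = False) -> str:
    """Simplified stand-in for get_path with an explicit root (assumption)."""
    if any(not isinstance(arg, str) for arg in args):
        raise TypeError("All arguments must be strings.")

    path = os.path.normpath(os.path.join(root, *args, filename or ''))

    if create:
        # With a filename only the parent directories are created;
        # without one the whole path is treated as a directory.
        target = os.path.dirname(path) if filename is not None else path
        os.makedirs(target, exist_ok=True)

    return os.path.abspath(path)


with tempfile.TemporaryDirectory() as root:
    file_path = get_path_sketch(root, "data", "runs", filename="results.csv", create=True)
    parent_exists = os.path.isdir(os.path.dirname(file_path))
    file_exists = os.path.exists(file_path)

    dir_path = get_path_sketch(root, "cache", create=True)
    dir_exists = os.path.isdir(dir_path)

print(parent_exists, file_exists, dir_exists)
```

The file itself is never created, only the directories that lead to it, so the caller remains responsible for writing the file.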

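`load_config` loads a YAML config from a file path, or passes an already-loaded config through unchanged. When `resolve_paths` is set and the config was read from a file, string values that start with `.` and end in `.yaml` or `.json` are resolved relative to the config file's directory.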

Source code in src/flash_ansr/utils/config_io.py

def load_config(config: dict[str, Any] | str, resolve_paths: bool = True) -> dict[str, Any]:
    """Load a YAML config (optionally resolving nested relative paths)."""
    if isinstance(config, str):
        config_path = substitute_root_path(config)
        config_base_path = os.path.dirname(config_path)

        if not os.path.exists(config_path):
            raise FileNotFoundError(f'Config file {config_path} not found.')
        if os.path.isfile(config_path):
            with open(config_path, 'r') as config_file:
                config_ = yaml.safe_load(config_file)
        else:
            raise ValueError(f'Config file {config_path} is not a valid file.')

        def resolve_path(value: Any) -> Any:
            if (
                isinstance(value, str)
                and (value.endswith('.yaml') or value.endswith('.json'))
                and value.startswith('.')
            ):
                return normalize_path_preserve_leading_dot(os.path.join(config_base_path, value))
            return value

        if resolve_paths:
            config_ = apply_on_nested(config_, resolve_path)
    else:
        config_ = config

    return config_
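
The path-resolution rule can be exercised on an in-memory config. This is a sketch under assumptions: `apply_on_nested_sketch` guesses that `apply_on_nested` applies a function to every leaf of nested dicts and lists, `os.path.normpath` stands in for `normalize_path_preserve_leading_dot` (which presumably treats a leading `.` differently), and all config keys and paths are invented.

```python
import os
from typing import Any, Callable


def apply_on_nested_sketch(obj: Any, fn: Callable[[Any], Any]) -> Any:
    """Assumed behaviour of apply_on_nested: map fn over every leaf value."""
    if isinstance(obj, dict):
        return {key: apply_on_nested_sketch(value, fn) for key, value in obj.items()}
    if isinstance(obj, list):
        return [apply_on_nested_sketch(value, fn) for value in obj]
    return fn(obj)


def make_resolver(config_base_path: str) -> Callable[[Any], Any]:
    def resolve_path(value: Any) -> Any:
        # Only relative references to other config files are rewritten.
        if (
            isinstance(value, str)
            and value.endswith(('.yaml', '.json'))
            and value.startswith('.')
        ):
            return os.path.normpath(os.path.join(config_base_path, value))
        return value

    return resolve_path


# Hypothetical config as it might look right after yaml.safe_load.
config = {
    "tokenizer": "./tokenizer.yaml",
    "dataset": {"pool": "../pools/train.json"},
    "name": "run.yaml",
    "max_len": 32,
}
base = os.path.join("configs", "v1")
resolved = apply_on_nested_sketch(config, make_resolver(base))
print(resolved)
```

Values such as `"run.yaml"` stay untouched because they do not start with `.`, so only explicitly relative references are anchored at the config file's directory.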
