API reference

The public surface, re-exported from the top-level agentsynth package.

Generation

agentsynth.AgentTrajectoryGenerator

Generate synthetic agent trajectories, offline-mock by default.

model is the model id for LLMClient (None auto-detects a provider from the environment); temperature and max_tokens are forwarded to it. max_steps caps the number of tool calls / reasoning steps. use_mock is "auto" (use the LLM if available, else mock), True (always mock), or False (require the LLM, but fall back to mock and set warning if no client is available). seed is mixed into every mock decision. Pass llm_client to reuse an existing client. tools sets a default catalog (anything parse_tool_catalog accepts, or a list of ToolSpec) used when a call doesn't pass its own; None means the built-in default.

generate

generate(query, tools=None, mode='single_agent', domain=None, index=0)

Generate a single trajectory for query in mode.

generate_batch

generate_batch(queries, tools=None, mode='single_agent', num_trajectories=None, domains=None, progress=None, vary_modes=False)

Generate many trajectories from one or more queries.

A list of queries gives one trajectory each (cycled or truncated to num_trajectories when set). A single query string gives num_trajectories deterministic variations.
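The cycle-or-truncate rule can be sketched on its own. This is a minimal illustration, not the library's code: expand_queries is a hypothetical helper, and for a single query string it simply repeats the string with a running index, which stands in for whatever variation logic generate_batch actually applies.

```python
from itertools import cycle, islice
from typing import List, Optional, Tuple, Union


def expand_queries(
    queries: Union[str, List[str]],
    num_trajectories: Optional[int] = None,
) -> List[Tuple[int, str]]:
    """Hypothetical helper: return one (index, query) pair per trajectory."""
    if isinstance(queries, str):
        # Single query: repeat it; the index is what would seed each variation.
        # Defaulting to 1 when num_trajectories is None is an assumption.
        count = num_trajectories if num_trajectories is not None else 1
        return [(i, queries) for i in range(count)]
    if not queries:
        return []
    # List of queries: one each, cycled or truncated to num_trajectories.
    count = num_trajectories if num_trajectories is not None else len(queries)
    return list(enumerate(islice(cycle(queries), count)))
```

With two queries and num_trajectories=3, the first query is reused for the third trajectory; with three queries and num_trajectories=2, the last query is dropped.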

Evaluation & verification

agentsynth.TrajectoryEvaluator

Score agent trajectories against a six-dimension rubric.

model is an optional model id for LLMClient; when None, the client auto-detects a provider from the environment. use_mock is "auto" (use the LLM judge whenever a client is available and not forced off), True (force the structural judge), or False (ask for the LLM judge, but still fall back to the structural judge if no client is available). weights overrides the overall-score rubric weights. llm_client lets tests inject a pre-built client. temperature and max_tokens are the LLM judge's decoding parameters; temperature defaults to a deterministic 0.0. pass_threshold is the minimum weighted overall score for passed.
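How weights and pass_threshold interact can be sketched as a weighted average compared against a cutoff. The dimension names, the normalization by total weight, and the 0.7 default below are all assumptions for illustration; the library's actual rubric and defaults may differ.

```python
from typing import Mapping


def weighted_overall(scores: Mapping[str, float], weights: Mapping[str, float]) -> float:
    """Weighted mean of per-dimension scores (normalization is an assumption)."""
    total_weight = sum(weights.get(name, 0.0) for name in scores)
    if total_weight <= 0:
        raise ValueError("weights must give positive total weight to the scored dimensions")
    weighted_sum = sum(score * weights.get(name, 0.0) for name, score in scores.items())
    return weighted_sum / total_weight


def is_passed(
    scores: Mapping[str, float],
    weights: Mapping[str, float],
    pass_threshold: float = 0.7,  # assumed default, not taken from the library
) -> bool:
    """A trajectory passes when its weighted overall reaches the threshold."""
    return weighted_overall(scores, weights) >= pass_threshold
```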

evaluate_batch

evaluate_batch(trajectories, progress=None)

Evaluate a list of trajectories, reporting progress if given.

When callable, progress is invoked as progress(i / total, desc) before each item. Errors from it are swallowed so a bad callback can't break a run. This loosely matches Gradio's gr.Progress calling convention.
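The calling convention described above can be sketched as a small loop. run_with_progress is a hypothetical helper and the description string is made up; only the progress(i / total, desc) shape and the swallowed errors come from the text.

```python
from typing import Callable, List, Optional, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def run_with_progress(
    items: List[T],
    work: Callable[[T], R],
    progress: Optional[Callable[[float, str], None]] = None,
) -> List[R]:
    """Apply work to each item, reporting progress before each one."""
    results: List[R] = []
    total = len(items)
    for i, item in enumerate(items):
        if callable(progress):
            try:
                # Fraction completed so far, then a human-readable description.
                progress(i / total, f"Evaluating {i + 1}/{total}")
            except Exception:
                # A misbehaving callback must not break the run.
                pass
        results.append(work(item))
    return results
```

Because the callback fires before each item, the reported fraction starts at 0.0 and never reaches 1.0 inside the loop.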

agentsynth.verify_trajectory

verify_trajectory(trajectory, verifiers=None)

Run a set of verifiers and combine their results.

Defaults to the standard verifiers (tool args, execution grounding, safety). verified is True only if every required verifier passes.
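The combination rule can be illustrated with a stand-in result type. VerifierResult and combine_verified are hypothetical names used only for this sketch; the one behavior taken from the text is that advisory (non-required) verifiers cannot flip verified to False.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class VerifierResult:
    """Hypothetical stand-in for one verifier's outcome."""
    name: str
    passed: bool
    required: bool = True


def combine_verified(results: List[VerifierResult]) -> bool:
    """verified is True only if every required verifier passed."""
    return all(r.passed for r in results if r.required)
```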

agentsynth.EnsembleEvaluator

Pipelines

agentsynth.Recipe

Bases: BaseModel

agentsynth.run_recipe

run_recipe(recipe, progress=None)

Environments

agentsynth.SQLEnvironment

Bases: Environment

agentsynth.PythonSandbox

Bases: Environment

agentsynth.MCPEnvironment

Bases: Environment

agentsynth.BrowserEnvironment

Bases: Environment

agentsynth.RestEnvironment

Bases: Environment

Learned verifier

agentsynth.train_learned_verifier

train_learned_verifier(trajectories, eval_results, threshold=None, test_size=0.25, seed=7)

Fit a LearnedVerifier on judge labels and report held-out agreement.

Labels come from each eval result's passed flag, or overall >= threshold when threshold is given. Returns (verifier, report) where the report has agreement (held-out accuracy vs the judge), precision/recall for the pass class, and the split sizes. Raises ValueError when the labels are all one class — vary the rubric or threshold so there is something to learn.
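The labeling rule and the single-class check can be sketched as follows. This is an illustration under assumptions: eval results are represented here as plain dicts with passed and overall keys, and judge_labels is a hypothetical helper, not the library's internal function.

```python
from typing import Any, List, Mapping, Optional, Sequence


def judge_labels(
    eval_results: Sequence[Mapping[str, Any]],
    threshold: Optional[float] = None,
) -> List[int]:
    """Derive 0/1 training labels from judge results."""
    labels = []
    for result in eval_results:
        if threshold is None:
            # No threshold: trust the judge's own pass/fail flag.
            labels.append(int(bool(result["passed"])))
        else:
            # Threshold given: re-derive the label from the overall score.
            labels.append(int(result["overall"] >= threshold))
    if len(set(labels)) < 2:
        raise ValueError("labels are all one class; vary the rubric or threshold")
    return labels
```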

agentsynth.LearnedVerifier

Bases: Verifier

A judge-distilled classifier behind the standard Verifier interface.

Advisory by default (required=False): it contributes to the verification score without hard-failing a trajectory, since it's a screen, not a proof.

predict_proba

predict_proba(trajectory)

Return the estimated probability that the LLM judge would pass this trajectory.

Trace import

agentsynth.trajectory_from_messages

trajectory_from_messages(messages, tools=None, query=None, domain=None, source='openai')

Convert OpenAI-style chat messages into a Trajectory (roughly the inverse of to_messages).

The first user message becomes the query. Assistant text becomes thoughts, except the last assistant text, which becomes the final answer. tool_calls become tool_call steps (JSON-string arguments are parsed), and tool/function-role messages become observations.
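The mapping above can be sketched with plain dicts instead of the real Trajectory schema. messages_to_steps and the step dict layout are hypothetical; the sketch only mirrors the rules stated in the text and assumes the standard OpenAI tool_calls message shape.

```python
import json
from typing import Any, Dict, List, Optional, Tuple


def messages_to_steps(
    messages: List[Dict[str, Any]],
) -> Tuple[Optional[str], List[Dict[str, Any]], Optional[str]]:
    """Return (query, steps, final_answer) from OpenAI-style messages."""
    query: Optional[str] = None
    steps: List[Dict[str, Any]] = []
    for msg in messages:
        role = msg.get("role")
        if role == "user" and query is None:
            query = msg.get("content")  # first user message is the query
        elif role == "assistant":
            if msg.get("content"):
                steps.append({"type": "thought", "content": msg["content"]})
            for call in msg.get("tool_calls") or []:
                fn = call["function"]
                args = fn.get("arguments", {})
                if isinstance(args, str):
                    # OpenAI sends arguments as a JSON string; parse it.
                    args = json.loads(args) if args else {}
                steps.append({"type": "tool_call", "tool": fn["name"], "args": args})
        elif role in ("tool", "function"):
            steps.append({"type": "observation", "content": msg.get("content")})
    # The last assistant text is promoted from a thought to the final answer.
    answer: Optional[str] = None
    for i in range(len(steps) - 1, -1, -1):
        if steps[i]["type"] == "thought":
            answer = steps.pop(i)["content"]
            break
    return query, steps, answer
```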

agentsynth.import_traces

import_traces(records, tools=None, format='auto')

Convert a batch of traces. Each record is either a message list or a dict holding one under a messages key. format is 'auto', 'openai', or 'anthropic'.

agentsynth.load_traces_jsonl

load_traces_jsonl(path, tools=None, format='auto')

Read one trace per line (a message list, or an object with messages).

Flywheel

agentsynth.mine_failures

mine_failures(report, cases)

Categorize every benchmark miss so the next run can target it.

agentsynth.mine_judge_failures

mine_judge_failures(trajectories, eval_results, threshold=0.7)

Flag every rubric dimension scoring below threshold, per trajectory.
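A minimal sketch of the flagging rule, assuming each eval result can be reduced to a mapping from dimension name to score. flag_weak_dimensions and the dimension names in the usage are hypothetical.

```python
from typing import Dict, List


def flag_weak_dimensions(
    scores_per_trajectory: List[Dict[str, float]],
    threshold: float = 0.7,
) -> List[List[str]]:
    """For each trajectory, list the dimensions scoring below threshold."""
    return [
        sorted(name for name, score in scores.items() if score < threshold)
        for scores in scores_per_trajectory
    ]
```

A score exactly equal to the threshold is not flagged, since the rule is strictly "below".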

agentsynth.recipe_from_failures

recipe_from_failures(report, k=20, seed=7, **recipe_kwargs)

A ready-to-run Recipe whose queries chase the report's failures.

Defaults to verify=True (the whole point is trustworthy patches); any Recipe field can be overridden through recipe_kwargs.

RL

agentsynth.AgentGym

Step-by-step episodes whose terminal reward comes from verification + the judge.

Per-step shaping is deliberately small and transparent: an action that names no real tool is penalized by invalid_action_penalty, an observation that comes back as an error is penalized by error_penalty (both applied as negative rewards), and everything else is 0 until the episode ends. The terminal reward is verify_weight * verification.score + judge_weight * judge_overall, both in [0, 1]. With require_grounding (the default), the verification credit is paid only when the trajectory actually executed something, so a policy can't farm reward by skipping the tools and emitting a plausible answer.

An action carrying answer always ends the episode, even if it also names a tool. state() is a method here; the OpenEnv bridge exposes it as a property, as that spec requires. A gym is not thread-shareable (one live episode at a time), so use one gym per worker.
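The reward arithmetic described for AgentGym can be written out as two small functions. This is a sketch under assumptions: the default weights and penalties are invented for illustration, and zeroing the verification term is one plausible reading of how require_grounding withholds credit.

```python
from typing import Set


def step_reward(
    action_tool: str,
    known_tools: Set[str],
    observation_is_error: bool,
    invalid_action_penalty: float = 0.1,  # assumed value
    error_penalty: float = 0.05,          # assumed value
) -> float:
    """Per-step shaping: small negative rewards, otherwise 0."""
    if action_tool not in known_tools:
        return -invalid_action_penalty
    if observation_is_error:
        return -error_penalty
    return 0.0


def terminal_reward(
    verification_score: float,
    judge_overall: float,
    executed_any: bool,
    verify_weight: float = 0.5,  # assumed value
    judge_weight: float = 0.5,   # assumed value
    require_grounding: bool = True,
) -> float:
    """verify_weight * verification.score + judge_weight * judge_overall."""
    verify_credit = verification_score
    if require_grounding and not executed_any:
        # No tool was actually executed, so no verification credit is paid.
        verify_credit = 0.0
    return verify_weight * verify_credit + judge_weight * judge_overall
```

Under these assumed weights, a trajectory that answers without executing anything earns only the judge's share of the reward, however well it would otherwise verify.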

rollout

rollout(policy, task=None, seed=None)

Run one full episode driven by policy(observation, gym) -> action.

agentsynth.make_reward_fn

make_reward_fn(environment=None, tools=None, execute=True, weights=None)

A TRL-style reward function: fn(prompts=..., completions=..., **kw) -> List[float].
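The call shape of such a function can be shown with a toy stand-in. toy_reward_fn is not what make_reward_fn returns; its scoring rule (1.0 when the completion is a JSON object naming a tool) is invented purely to illustrate the keyword-argument signature, and it assumes completions arrive as plain strings.

```python
import json
from typing import Any, List


def toy_reward_fn(prompts: List[str], completions: List[str], **kwargs: Any) -> List[float]:
    """Toy reward: one float per completion, extra keyword arguments ignored."""
    rewards: List[float] = []
    for completion in completions:
        try:
            parsed = json.loads(completion)
        except (TypeError, ValueError):
            # Not valid JSON (or not a string): no reward.
            rewards.append(0.0)
            continue
        has_tool = isinstance(parsed, dict) and "tool" in parsed
        rewards.append(1.0 if has_tool else 0.0)
    return rewards
```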

agentsynth.rl.to_openenv

to_openenv(gym)

Wrap an AgentGym in an openenv.core.Environment (lazy import).

Benchmark

agentsynth.run_benchmark

run_benchmark(model_fn, cases=None)

Score a model on the function-calling cases.

model_fn(query, tools) returns the (tool_name, tool_args) the model would call.
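A model_fn only needs to honor the (query, tools) -> (tool_name, tool_args) contract. The sketch below pairs a trivial model_fn with a hypothetical exact-match scorer; the case layout (query, tools, expected_tool, expected_args) and the dict-shaped tools are assumptions, not the library's benchmark format.

```python
from typing import Any, Callable, Dict, List, Tuple

ModelFn = Callable[[str, List[Dict[str, Any]]], Tuple[str, Dict[str, Any]]]


def first_tool_model(query: str, tools: List[Dict[str, Any]]) -> Tuple[str, Dict[str, Any]]:
    """Toy model_fn: always call the first tool, passing the query as 'city'."""
    return tools[0]["name"], {"city": query}


def exact_match_accuracy(model_fn: ModelFn, cases: List[Dict[str, Any]]) -> float:
    """Fraction of cases where both tool name and arguments match exactly."""
    if not cases:
        return 0.0
    hits = 0
    for case in cases:
        name, args = model_fn(case["query"], case["tools"])
        if name == case["expected_tool"] and args == case["expected_args"]:
            hits += 1
    return hits / len(cases)
```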

agentsynth.compare_models

compare_models(before, after, cases=None)

Run two models on the same cases and return a before/after comparison.

Preference data & dedup

agentsynth.build_preference_pairs

build_preference_pairs(generator, evaluator, queries, k=4, mode='single_agent', tools=None, min_margin=0.0)

Generate k trajectories per query and pair best vs worst by judge score.

A pair is emitted when the two trajectories differ and their score gap is at least min_margin.
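The pairing rule for one query can be sketched over (text, score) tuples. best_vs_worst is a hypothetical helper, and comparing trajectories by their text is an assumption about what "differ" means here.

```python
from typing import Dict, List, Optional, Tuple


def best_vs_worst(
    candidates: List[Tuple[str, float]],
    min_margin: float = 0.0,
) -> Optional[Dict[str, str]]:
    """Pair the highest- and lowest-scored candidates, or return None."""
    if len(candidates) < 2:
        return None
    best = max(candidates, key=lambda c: c[1])
    worst = min(candidates, key=lambda c: c[1])
    if best[0] == worst[0]:
        return None  # identical trajectories carry no preference signal
    if best[1] - worst[1] < min_margin:
        return None  # score gap too small to trust
    return {"chosen": best[0], "rejected": worst[0]}
```

Raising min_margin trades dataset size for cleaner preference signal, since pairs with near-equal judge scores are dropped.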

agentsynth.dedup_trajectories

dedup_trajectories(trajectories, threshold=0.85, shingle_k=3, key=None)

Drop trajectories that are near-identical to one already kept.

Similarity is Jaccard over shingles of key(trajectory), which defaults to the full example (query + tool sequence + answer). Pass key=lambda t: t.query for prompt-level dedup instead.
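Greedy near-duplicate removal with Jaccard similarity over shingles can be sketched on plain strings. This version uses word-level shingles, which is an assumption (the library may shingle characters or tokens differently), and dedup_texts is a hypothetical stand-in that takes the already-rendered key text.

```python
from typing import List, Set, Tuple


def shingles(text: str, k: int = 3) -> Set[Tuple[str, ...]]:
    """Set of k-word windows; short texts yield a single shorter shingle."""
    tokens = text.lower().split()
    if len(tokens) < k:
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a: Set[Tuple[str, ...]], b: Set[Tuple[str, ...]]) -> float:
    """|a & b| / |a | b|, treating two empty sets as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def dedup_texts(texts: List[str], threshold: float = 0.85, shingle_k: int = 3) -> List[str]:
    """Keep a text only if it is not too similar to any text already kept."""
    kept: List[str] = []
    kept_shingles: List[Set[Tuple[str, ...]]] = []
    for text in texts:
        current = shingles(text, shingle_k)
        if any(jaccard(current, other) >= threshold for other in kept_shingles):
            continue
        kept.append(text)
        kept_shingles.append(current)
    return kept
```

The result depends on input order: the first member of each near-duplicate cluster is the one that survives.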

Schemas

agentsynth.Trajectory

Bases: BaseModel

tool_signature

tool_signature()

Stable signature of the tool sequence, for diversity metrics.

to_messages

to_messages()

Render as an OpenAI-style messages list.

thought/plan/critique collapse into assistant content, a tool_call becomes an assistant message carrying tool_calls, and each observation becomes a role="tool" message.

agentsynth.ToolSpec

Bases: BaseModel

A single tool the agent may call.

parameters is a JSON-Schema object, matching the OpenAI/Anthropic function-calling convention:

{
    "type": "object",
    "properties": {"city": {"type": "string", "description": "..."}},
    "required": ["city"],
}
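Given a parameters object of this shape, checking a call's arguments against the required list takes a few lines. missing_required is a hypothetical helper for illustration; it checks presence only and does not validate types the way a full JSON Schema validator would.

```python
from typing import Any, Dict, List


def missing_required(parameters: Dict[str, Any], args: Dict[str, Any]) -> List[str]:
    """Names listed under 'required' that are absent from the call's args."""
    return [name for name in parameters.get("required", []) if name not in args]
```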

agentsynth.EvalResult

Bases: BaseModel

flat

flat()

Flattened dict, handy for dataframes and metrics.