API reference¶
The public surface, re-exported from the top-level agentsynth package.
Generation¶
agentsynth.AgentTrajectoryGenerator ¶
Generate synthetic agent trajectories, offline-mock by default.
model is the model id for LLMClient (None auto-detects a provider from the
environment); temperature and max_tokens are forwarded to it. max_steps
caps the number of tool calls / reasoning steps. use_mock is "auto" (use the
LLM if available, else mock), True (always mock), or False (require the LLM,
but fall back to mock and set warning if no client is available). seed is
mixed into every mock decision. Pass llm_client to reuse an existing client.
tools sets a default catalog (anything parse_tool_catalog accepts, or a
list of ToolSpec) used when a call doesn't pass its own; None means the
built-in default.
generate ¶
Generate a single trajectory for query in mode.
generate_batch ¶
generate_batch(queries, tools=None, mode='single_agent', num_trajectories=None, domains=None, progress=None, vary_modes=False)
Generate many trajectories from one or more queries.
A list of queries gives one trajectory each (cycled or truncated to
num_trajectories when set). A single query string gives
num_trajectories deterministic variations.
Evaluation & verification¶
agentsynth.TrajectoryEvaluator ¶
Score agent trajectories against a six-dimension rubric.
model is an optional model id for LLMClient; when None the client
auto-detects a provider from the environment. use_mock of "auto" uses the
LLM judge whenever a client is available and not forced off, True forces the
structural judge, and False asks for the LLM judge but still falls back to
structural if no client is available. weights overrides the overall-score
rubric weights. llm_client lets tests inject a pre-built client. temperature
and max_tokens are the LLM judge's decoding params (default to a deterministic
0.0). pass_threshold is the minimum weighted overall for passed.
evaluate_batch ¶
Evaluate a list of trajectories, reporting progress if given.
When callable, progress is invoked as progress(i / total, desc) before
each item. Errors from it are swallowed so a bad callback can't break a run.
This loosely matches Gradio's gr.Progress calling convention.
agentsynth.verify_trajectory ¶
Run a set of verifiers and combine their results.
Defaults to the standard verifiers (tool args, execution grounding, safety).
verified is True only if every required verifier passes.
agentsynth.EnsembleEvaluator ¶
Pipelines¶
agentsynth.Recipe ¶
Bases: BaseModel
Environments¶
agentsynth.SQLEnvironment ¶
Bases: Environment
agentsynth.PythonSandbox ¶
Bases: Environment
agentsynth.MCPEnvironment ¶
Bases: Environment
agentsynth.BrowserEnvironment ¶
Bases: Environment
agentsynth.RestEnvironment ¶
Bases: Environment
Learned verifier¶
agentsynth.train_learned_verifier ¶
Fit a LearnedVerifier on judge labels and report held-out agreement.
Labels come from each eval result's passed flag, or overall >= threshold
when threshold is given. Returns (verifier, report) where the report has
agreement (held-out accuracy vs the judge), precision/recall for the
pass class, and the split sizes. Raises ValueError when the labels are all
one class — vary the rubric or threshold so there is something to learn.
agentsynth.LearnedVerifier ¶
Bases: Verifier
A judge-distilled classifier behind the standard Verifier interface.
Advisory by default (required=False): it contributes to the verification
score without hard-failing a trajectory, since it's a screen, not a proof.
Trace import¶
agentsynth.trajectory_from_messages ¶
OpenAI-style chat messages → a Trajectory (roughly to_messages inverted).
The first user message becomes the query; assistant text becomes thoughts —
except the last one, which becomes the final answer; tool_calls become
tool_call steps (JSON-string arguments are parsed); tool/function-role
messages become observations.
agentsynth.import_traces ¶
Convert a batch of traces. Each record is a message list or a dict holding one
under a messages key. format is auto, openai, or anthropic.
agentsynth.load_traces_jsonl ¶
Read one trace per line (a message list, or an object with messages).
Flywheel¶
agentsynth.mine_failures ¶
Categorize every benchmark miss so the next run can target it.
agentsynth.mine_judge_failures ¶
Flag every rubric dimension scoring below threshold, per trajectory.
agentsynth.recipe_from_failures ¶
A ready-to-run Recipe whose queries chase the report's failures.
Defaults to verify=True (the whole point is trustworthy patches); any Recipe
field can be overridden through recipe_kwargs.
RL¶
agentsynth.AgentGym ¶
Step-by-step episodes whose terminal reward comes from verification + the judge.
Per-step shaping is deliberately small and transparent: an action that names no
real tool is penalized by invalid_action_penalty, an observation that comes back
as an error is penalized by error_penalty (both applied as negative rewards),
and everything else is 0 until the episode ends. The terminal reward is
verify_weight * verification.score + judge_weight * judge_overall, both in
[0, 1] — and with require_grounding (the default) the verification credit is
only paid when the trajectory actually executed something, so a policy can't farm
reward by skipping the tools and emitting a plausible answer.
An action carrying answer always ends the episode, even if it also names a tool.
state() is a method here; the OpenEnv bridge exposes it as a property, as that
spec requires. Not thread-shareable (one live episode at a time) — use one gym
per worker.
rollout ¶
Run one full episode driven by policy(observation, gym) -> action.
agentsynth.make_reward_fn ¶
A TRL-style reward function: fn(prompts=..., completions=..., **kw) -> List[float].
agentsynth.rl.to_openenv ¶
Wrap an AgentGym in an openenv.core.Environment (lazy import).
Benchmark¶
agentsynth.run_benchmark ¶
Score a model on the function-calling cases.
model_fn(query, tools) returns the (tool_name, tool_args) the model would call.
agentsynth.compare_models ¶
Run two models on the same cases and return a before/after comparison.
Preference data & dedup¶
agentsynth.build_preference_pairs ¶
build_preference_pairs(generator, evaluator, queries, k=4, mode='single_agent', tools=None, min_margin=0.0)
Generate k trajectories per query and pair best vs worst by judge score.
A pair is emitted when the two trajectories differ and their score gap is at
least min_margin.
agentsynth.dedup_trajectories ¶
Drop trajectories that are near-identical to one already kept.
Similarity is Jaccard over shingles of key(trajectory), which defaults to the
full example (query + tool sequence + answer). Pass key=lambda t: t.query for
prompt-level dedup instead.
Schemas¶
agentsynth.Trajectory ¶
Bases: BaseModel
to_messages ¶
Render as an OpenAI-style messages list.
thought/plan/critique collapse into assistant content, a tool_call
becomes an assistant message carrying tool_calls, and each observation
becomes a role="tool" message.
agentsynth.ToolSpec ¶
Bases: BaseModel
A single tool the agent may call.
parameters is a JSON-Schema object, matching the OpenAI/Anthropic
function-calling convention:
{
"type": "object",
"properties": {"city": {"type": "string", "description": "..."}},
"required": ["city"],
}