API reference¶

The public surface, re-exported from the top-level agentsynth package.

Generation¶

agentsynth.AgentTrajectoryGenerator ¶

Generate synthetic agent trajectories, offline-mock by default.

model is the model id for LLMClient (None auto-detects a provider from the environment); temperature and max_tokens are forwarded to it. max_steps caps the number of tool calls / reasoning steps. use_mock is "auto" (use the LLM if available, else mock), True (always mock), or False (require the LLM, but fall back to mock and set warning if no client is available). seed is mixed into every mock decision. Pass llm_client to reuse an existing client. tools sets a default catalog (anything parse_tool_catalog accepts, or a list of ToolSpec) used when a call doesn't pass its own; None means the built-in default.

generate ¶

generate(query, tools=None, mode='single_agent', domain=None, index=0)

Generate a single trajectory for query in mode.

generate_batch ¶

generate_batch(queries, tools=None, mode='single_agent', num_trajectories=None, domains=None, progress=None, vary_modes=False)

Generate many trajectories from one or more queries.

A list of queries gives one trajectory each (cycled or truncated to num_trajectories when set). A single query string gives num_trajectories deterministic variations.

Evaluation & verification¶

agentsynth.TrajectoryEvaluator ¶

Score agent trajectories against a six-dimension rubric.

model is an optional model id for LLMClient; when None the client auto-detects a provider from the environment. use_mock of "auto" uses the LLM judge whenever a client is available and not forced off, True forces the structural judge, and False asks for the LLM judge but still falls back to structural if no client is available. weights overrides the overall-score rubric weights. llm_client lets tests inject a pre-built client. temperature and max_tokens are the LLM judge's decoding params (default to a deterministic 0.0). pass_threshold is the minimum weighted overall for passed.

evaluate_batch ¶

evaluate_batch(trajectories, progress=None)

Evaluate a list of trajectories, reporting progress if given.

When callable, progress is invoked as progress(i / total, desc) before each item. Errors from it are swallowed so a bad callback can't break a run. This loosely matches Gradio's gr.Progress calling convention.

agentsynth.verify_trajectory ¶

verify_trajectory(trajectory, verifiers=None)

Run a set of verifiers and combine their results.

Defaults to the standard verifiers (tool args, execution grounding, safety). verified is True only if every required verifier passes.

agentsynth.batch_verify ¶

batch_verify(trajectories, verifiers=None, progress=None)

agentsynth.EnsembleEvaluator ¶

agentsynth.Verifier ¶

Bases: ABC

check `abstractmethod` ¶

check(trajectory)

Run the check against one trajectory.

agentsynth.VerificationResult ¶

Bases: BaseModel

agentsynth.ExecutionVerifier ¶

Bases: Verifier

Re-run code_execution steps and confirm the recorded output reproduces.

agentsynth.ToolArgVerifier ¶

Bases: Verifier

Every tool call names a real tool and supplies its required args.

agentsynth.SafetyVerifier ¶

Bases: Verifier

No obviously dangerous content in args, code, or the final answer.

agentsynth.ExpectedAnswerVerifier ¶

Bases: Verifier

The final answer contains an expected value. Opt-in, for ground-truth tasks.

agentsynth.get_rubric ¶

get_rubric(name)

Return a fresh copy of a preset's weights + pass_threshold.

agentsynth.rubric_names ¶

rubric_names()

agentsynth.RUBRIC_PRESETS `module-attribute` ¶

RUBRIC_PRESETS = {'balanced': {'weights': dict(DEFAULT_RUBRIC_WEIGHTS), 'pass_threshold': 0.6}, 'strict': {'weights': {'task_completion': 0.3, 'tool_correctness': 0.25, 'faithfulness': 0.2, 'reasoning_coherence': 0.1, 'efficiency': 0.05, 'safety': 0.1}, 'pass_threshold': 0.75}, 'lenient': {'weights': dict(DEFAULT_RUBRIC_WEIGHTS), 'pass_threshold': 0.45}, 'safety_first': {'weights': {'task_completion': 0.2, 'tool_correctness': 0.15, 'faithfulness': 0.15, 'reasoning_coherence': 0.05, 'efficiency': 0.05, 'safety': 0.4}, 'pass_threshold': 0.7}}

Learned verifier¶

agentsynth.train_learned_verifier ¶

train_learned_verifier(trajectories, eval_results, threshold=None, test_size=0.25, calibrate=False, seed=7)

Fit a LearnedVerifier on judge labels and report held-out agreement.

Labels come from each eval result's passed flag, or overall >= threshold when threshold is given. Returns (verifier, report) where the report has agreement (held-out accuracy vs the judge), precision/recall for the pass class, brier (mean squared error of the probabilities — lower means the confidence is trustworthy, which matters when you route on it), and the split sizes. With calibrate=True the probabilities go through sigmoid calibration (cross-validated), which usually buys a better brier at the same agreement. Raises ValueError when the labels are all one class — vary the rubric or threshold so there is something to learn.

agentsynth.LearnedVerifier ¶

Bases: Verifier

A judge-distilled classifier behind the standard Verifier interface.

Advisory by default (required=False): it contributes to the verification score without hard-failing a trajectory.

predict_proba ¶

predict_proba(trajectory)

P(the LLM judge would pass this trajectory).

Scenarios¶

agentsynth.Scenario ¶

Bases: BaseModel

A serializable bundle: environment config, task, and outcome checkers.

build_environment ¶

build_environment()

A fresh environment carrying this scenario's seed state.

run_checks ¶

run_checks(environment, trajectory)

Outcome score in [0, 1] = fraction of checkers that pass on the end state.

agentsynth.SqlCheck ¶

Bases: BaseModel

Assert over the database's final state (needs a SQL environment).

agentsynth.HttpCheck ¶

Bases: BaseModel

GET a path on the environment's API and assert on the body (REST scenarios).

agentsynth.CalledTool ¶

Bases: BaseModel

Assert the trajectory actually used a tool (optionally with given args).

agentsynth.AnswerContains ¶

Bases: BaseModel

Assert the final answer mentions at least one of the expected strings.

agentsynth.CodeCheck ¶

Bases: BaseModel

Run the agent's Python against hidden tests (needs a python environment).

Gathers the code from every python tool call, appends the test, and runs the lot in the sandbox. Passes only when the tests run clean — the outcome is whether the code works, not whether the transcript claims it does.

agentsynth.run_scenario_suite ¶

run_scenario_suite(policy, scenarios, seed=7, **gym_kwargs)

Run a policy through every scenario. A scenario passes when every checker does.

policy(observation, gym) -> action, the same shape AgentGym.rollout takes.

agentsynth.load_scenarios ¶

load_scenarios(path)

Read a scenario pack (YAML or JSON).

agentsynth.save_scenarios ¶

save_scenarios(scenarios, path)

Write a scenario pack — YAML for .yaml/.yml, JSON otherwise.

Robustness (reward-hacking audit)¶

agentsynth.audit_pack ¶

audit_pack(scenarios, seed=7, adversaries=None)

Run the trivial adversaries across a pack and report what they passed.

agentsynth.RobustnessReport ¶

Bases: BaseModel

agentsynth.perturb_scenario ¶

perturb_scenario(scenario, seed=0)

An isomorphic sibling: rename string labels, keep every number and the structure.

Renaming labels (names, emails, SKUs) preserves every relational truth and numeric threshold while changing the surface tokens — so a policy that truly solves the task still passes, but one echoing a memorized answer fails. Single-table scenarios only; multi-table schemas (raw SQL with INSERTs) are returned unchanged.

agentsynth.ipt_report ¶

ipt_report(scenario, policy, seed=7)

Isomorphic perturbation test for a (claimed) generalizing policy.

Two properties a trustworthy outcome check should have:

the policy still passes the perturbed sibling (it solved the task, not the instance)
replaying the policy's original actions on the sibling now fails (the check rewards the state change, not a memorized transcript)

Synthesize verifiers from a demonstration¶

agentsynth.scenario_from_demonstration ¶

scenario_from_demonstration(task, schema, actions, rows=None, table=None, answer=None, scenario_id='demo', max_steps=None, seed=7)

Build a scenario from a worked example, deriving state checks from the diff.

rows=None means a multi-table world that seeds itself from INSERTs in the schema (matching the pack convention). Returns the scenario and the oracle's actions.

agentsynth.pack_from_demonstrations ¶

pack_from_demonstrations(demos, pack_id)

Turn a list of demonstrations into a pack + a matching oracle, ready to validate.

Mirrors the --from-schema output: returns (pack_yaml, oracle_py).

Export a pack (OpenEnv / verifiers)¶

agentsynth.scenario_reward ¶

scenario_reward(scenario, actions, answer='', seed=7)

Outcome score in [0, 1] for running actions (then answer) on a scenario.

This is the portable, verifiable reward: replay the agent's actions on the seeded world and report the fraction of end-state checks that hold. Wrap it in a verifiers Rubric, a TRL reward, or your own loop — the number means the same thing.

agentsynth.reward_from_messages ¶

reward_from_messages(scenario, messages, seed=7)

Score an OpenAI-style completion against a scenario's world-state checks.

agentsynth.export_pack ¶

export_pack(pack_path, fmt, out_dir)

Write a Hub-ready environment folder for a pack. Returns the files written.

Reliability (beyond pass@1)¶

agentsynth.reliability_report ¶

reliability_report(per_scenario_passes, trials)

Turn per-trial pass booleans into the full reliability picture.

agentsynth.ReliabilityReport ¶

Bases: BaseModel

agentsynth.wilson_interval ¶

wilson_interval(passes, n, z=1.96)

Wilson score interval for a binomial pass rate — sane near 0%, 100%, and small n.

The plain p ± z·sqrt(p(1-p)/n) interval collapses to zero width at 0/n and n/n, which is exactly where a benchmark lands; Wilson doesn't.

Contamination audit¶

agentsynth.contamination_report ¶

contamination_report(scenarios, corpus=None, threshold=0.8)

Score each scenario for contamination risk and mint its canary.

agentsynth.ContaminationReport ¶

Bases: BaseModel

agentsynth.canary_for ¶

canary_for(scenario_id, salt='agentsynth')

A stable, unguessable token unique to a scenario.

Embed it in the pack (or the task text) and search a model's outputs or a training corpus for it; a hit means the pack was memorized, not solved.

agentsynth.held_out_pack ¶

held_out_pack(scenarios, seed=0)

Isomorphic siblings of every scenario — a contamination-resistant variant.

Single-table worlds are relabelled; multi-table ones (data in the schema's INSERTs) come back unchanged, so check the ids if you need to know which were transformed.

Reproducible submissions¶

agentsynth.run_manifest ¶

run_manifest(pack_id, scenarios, report, model, seed, trials=1, version=None, cost=None)

Everything needed to reproduce and check a bench run.

cost (calls/tokens/usd from a CostMeter, when the policy is a metered LLM client) rides along as telemetry but is deliberately NOT a run_hash input — reproducing a run means matching its outcomes, not its exact spend.

agentsynth.verify_run ¶

verify_run(manifest, scenarios, policy, tolerance=0.0)

Re-run a manifest's bench and report whether it reproduced.

pack_intact catches a pack edited after the fact (fingerprint mismatch). reproduced is an exact run_hash match — what you get from a deterministic policy. For a stochastic model, allow a tolerance on the pass-rate and read pass_rate_delta instead.

agentsynth.pack_fingerprint ¶

pack_fingerprint(scenarios)

A content hash of the pack itself — its tasks, worlds, and checkers.

Two packs with the same fingerprint are the same benchmark; a changed checker or seed row changes it, so a score can't be claimed against a pack that was edited afterwards.

Multi-turn (user simulator)¶

agentsynth.run_conversation ¶

run_conversation(policy, scenario, seed=7, max_steps_per_turn=None)

Run a policy through a scenario's user turns against one persistent world.

agentsynth.run_conversation_suite ¶

run_conversation_suite(policy, scenarios, seed=7)

Run a policy through every conversation scenario; a scenario passes on its end state.

agentsynth.ConversationResult ¶

Bases: BaseModel

Plugins (custom environments)¶

agentsynth.register_environment ¶

register_environment(name, factory)

Register an environment factory under name (callable taking the scenario config).

agentsynth.available_environments ¶

available_environments()

Every plugin environment name known so far (registered + advertised).

RL¶

agentsynth.AgentGym ¶

One live episode at a time; use one gym per worker.

Step rewards: -invalid_action_penalty for a tool that doesn't exist, -error_penalty when the observation is an error, otherwise 0. Terminal reward: verify_weight * verification.score + judge_weight * judge overall, plus outcome_weight * scenario score when a scenario is attached. With require_grounding (default), verification credit is only paid if the episode executed at least one tool or code step.

An action with answer set ends the episode even if it also names a tool. state() is a method here; the OpenEnv bridge exposes it as a property.

from_scenario `classmethod` ¶

from_scenario(scenario, **kwargs)

Build a gym from a Scenario; default weights become 0.6 outcome / 0.2 verification / 0.2 judge.

transcript ¶

transcript(max_chars=4000)

The episode so far as compact text, newest steps kept when truncating.

rollout ¶

rollout(policy, task=None, seed=None)

Run one full episode driven by policy(observation, gym) -> action.

agentsynth.make_reward_fn ¶

make_reward_fn(environment=None, tools=None, execute=True, weights=None)

A TRL-style reward function: fn(prompts=..., completions=..., **kw) -> List[float].

agentsynth.rl.to_openenv ¶

to_openenv(gym)

Wrap an AgentGym in an openenv.core.Environment (lazy import).

Bring your own loop (adapters)¶

agentsynth.to_openai_tools ¶

to_openai_tools(source)

The world's tools as OpenAI function-calling schemas.

agentsynth.action_from_openai_tool_call ¶

action_from_openai_tool_call(tool_call)

An OpenAI-style tool call (dict or SDK object) as a gym action.

Malformed argument JSON becomes {} rather than an exception — the gym turns that into a recoverable error observation, the way a real run should.

Environments¶

agentsynth.Environment ¶

Bases: ABC

tools `abstractmethod` ¶

tools()

The tools this environment can run.

execute `abstractmethod` ¶

execute(tool_name, args)

Run a tool call and return the observation text.

Failures should come back as observation strings ("SQLError: ...") so the agent can read them; raising for an unknown tool is also fine — callers in the RL and generation layers tolerate both.

sample_args ¶

sample_args(tool_name, query, seed)

A valid example call for tool_name, so generated calls actually run.

Returns an empty dict by default; callers then synthesize their own args.

reset ¶

reset()

Restore initial state. No-op for stateless environments.

close ¶

close()

Release any resources.

agentsynth.SQLEnvironment ¶

Bases: Environment

rows ¶

rows(sql)

Raw result rows, for fixtures and outcome checks — not exposed as a tool.

agentsynth.PythonSandbox ¶

Bases: Environment

agentsynth.MCPEnvironment ¶

Bases: Environment

agentsynth.BrowserEnvironment ¶

Bases: Environment

agentsynth.RestEnvironment ¶

Bases: Environment

agentsynth.CompositeEnvironment ¶

Bases: Environment

Several environments behind one interface, routed by tool name.

Pipelines¶

agentsynth.Recipe ¶

Bases: BaseModel

agentsynth.run_recipe ¶

run_recipe(recipe, progress=None)

agentsynth.load_recipe ¶

load_recipe(path)

agentsynth.make_environment ¶

make_environment(spec)

Benchmark¶

agentsynth.run_benchmark ¶

run_benchmark(model_fn, cases=None)

Score a model on the function-calling cases.

model_fn(query, tools) returns the (tool_name, tool_args) the model would call.

agentsynth.compare_models ¶

compare_models(before, after, cases=None)

Run two models on the same cases and return a before/after comparison.

agentsynth.BenchmarkCase ¶

Bases: BaseModel

agentsynth.BenchmarkReport ¶

Bases: BaseModel

agentsynth.BUILTIN_CASES `module-attribute` ¶

BUILTIN_CASES = [BenchmarkCase(id='weather_paris', query="What's the weather in Paris right now?", expected_tool='get_weather', expected_args={'city': 'Paris'}), BenchmarkCase(id='weather_tokyo', query='Is it raining in Tokyo today?', expected_tool='get_weather', expected_args={'city': 'Tokyo'}), BenchmarkCase(id='math_mult', query='What is 23 times 7 plus 4?', expected_tool='calculator', expected_args={'expression': None}), BenchmarkCase(id='math_tip', query='Calculate an 18% tip on a $54 bill.', expected_tool='calculator', expected_args={'expression': None}), BenchmarkCase(id='search_news', query='Find recent news about open-source AI agents.', expected_tool='web_search', expected_args={'query': None}), BenchmarkCase(id='search_fact', query='Search the web for the population of Vietnam.', expected_tool='web_search', expected_args={'query': None}), BenchmarkCase(id='file_csv', query='Read the file data/report.csv and summarize it.', expected_tool='read_file', expected_args={'path': None}), BenchmarkCase(id='file_notes', query='Open notes.md and list the action items.', expected_tool='read_file', expected_args={'path': None}), BenchmarkCase(id='sql_revenue', query='Query the database for total revenue by region.', expected_tool='sql_query', expected_args={'query': None}), BenchmarkCase(id='sql_count', query='Run a SQL query to count rows in the sales table.', expected_tool='sql_query', expected_args={'query': None}), BenchmarkCase(id='email_launch', query='Send an email to team@example.com about the launch.', expected_tool='send_email', expected_args={'to': None}), BenchmarkCase(id='email_summary', query='Email a summary of the report to alex@example.com.', expected_tool='send_email', expected_args={'to': None})]

agentsynth.agentsynth_model ¶

agentsynth_model(generator, mode='single_agent')

Adapt an AgentTrajectoryGenerator into a benchmark model: it takes the first tool call the generated trajectory makes.

agentsynth.prompted_model ¶

prompted_model(complete_fn)

Turn a text-completion function (prompt) -> text into a benchmark model.

It asks the model for a single JSON tool call and parses the reply, so it works with any instruction-following model (a base or fine-tuned HF model, etc.).

agentsynth.report_table_md ¶

report_table_md(comparison)

Render a before/after comparison as a markdown table.

Trace import & redaction¶

agentsynth.trajectory_from_messages ¶

trajectory_from_messages(messages, tools=None, query=None, domain=None, source='openai')

OpenAI-style chat messages to a Trajectory (roughly to_messages inverted).

First user message becomes the query. Assistant text becomes thoughts, except the last one, which becomes the final answer. tool_calls become tool_call steps and tool/function-role messages become observations.

agentsynth.import_traces ¶

import_traces(records, tools=None, format='auto')

Convert a batch of traces. Each record is a message list, a dict with a messages key, or a dict with a spans key (OTel). format is auto, openai, anthropic, or otel.

agentsynth.load_traces_jsonl ¶

load_traces_jsonl(path, tools=None, format='auto')

Read one trace per line (a message list, or an object with messages).

agentsynth.importers.trajectory_from_otel_spans ¶

trajectory_from_otel_spans(spans, tools=None, query=None)

OpenTelemetry GenAI spans to a Trajectory.

The GenAI semconv is still incubating, so two common encodings are accepted: gen_ai.input.messages / gen_ai.output.messages JSON attributes on chat spans, and flattened gen_ai.prompt.{i}.* / gen_ai.completion.{i}.* keys. Tool spans (gen_ai.operation.name == "execute_tool") become a tool call plus its observation. Spans are ordered by start_time_unix_nano when present.

agentsynth.redact_text ¶

redact_text(text)

Strip emails, keys, tokens, long hex ids, and phone-shaped numbers.

agentsynth.redact_trajectory ¶

redact_trajectory(traj)

Redact every text surface of a trajectory in place, then return it.

Run this before sharing or donating imported production traces.

Flywheel¶

agentsynth.mine_failures ¶

mine_failures(report, cases)

Categorize every benchmark miss so the next run can target it.

agentsynth.mine_judge_failures ¶

mine_judge_failures(trajectories, eval_results, threshold=0.7)

Flag every rubric dimension scoring below threshold, per trajectory.

agentsynth.recipe_from_failures ¶

recipe_from_failures(report, k=20, seed=7, **recipe_kwargs)

A Recipe whose queries target the report's failures.

Defaults to verify=True; any Recipe field can be overridden through recipe_kwargs.

agentsynth.evolve_queries ¶

evolve_queries(queries, k=20, seed=7, llm_client=None)

k variations over queries, visiting the sources round-robin.

Scale¶

agentsynth.CachingLLMClient ¶

Bases: LLMClient

An LLMClient with a disk cache, retries, a cost meter, and a budget cap.

The cache key is the full request (model, messages, sampling params). Costs come from LiteLLM's pricing table when it knows the model, otherwise from price_per_1k_tokens; without either, usd stays 0, so set a price before relying on budget_usd.

agentsynth.CostMeter ¶

Thread-safe usage counter shared across clients and runs.

agentsynth.BudgetExceeded ¶

Bases: RuntimeError

Raised before a call that would start past the configured budget.

agentsynth.run_resumable ¶

run_resumable(recipe, out_dir, llm_client=None, max_items=None, progress=None)

Generate a recipe with incremental output and a resume file.

Trajectories append to <out_dir>/trajectories.jsonl one line at a time; <out_dir>/state.json records progress, and re-invoking with the same out_dir (and the same recipe) continues from there. max_items caps how many this invocation adds, for chunked or cron-driven runs. Returns {total, done, added, path}. Run evaluation/verification/dedup as a post-pass over load_jsonl(path) once done == total.

Preference data & dedup¶

agentsynth.build_preference_pairs ¶

build_preference_pairs(generator, evaluator, queries, k=4, mode='single_agent', tools=None, min_margin=0.0)

Generate k trajectories per query and pair best vs worst by judge score.

A pair is emitted when the two trajectories differ and their score gap is at least min_margin.

agentsynth.PreferencePair ¶

Bases: BaseModel

agentsynth.to_dpo_jsonl ¶

to_dpo_jsonl(pairs, path=None)

Serialize pairs as prompt/chosen/rejected JSONL (TRL DPO compatible).

agentsynth.load_dpo_jsonl ¶

load_dpo_jsonl(path)

agentsynth.dedup_trajectories ¶

dedup_trajectories(trajectories, threshold=0.85, shingle_k=3, key=None, method='pairwise', num_perm=64, bands=16)

Drop trajectories that are near-identical to one already kept.

Similarity is Jaccard over shingles of key(trajectory), which defaults to the full example (query + tool sequence + answer). Pass key=lambda t: t.query for prompt-level dedup instead.

method="pairwise" (default) compares against everything kept — exact but O(n²). method="minhash" buckets by MinHash/LSH bands and only Jaccard-verifies collisions, which stays linear at the 100k scale; pairs below the band sensitivity can slip through, so keep threshold >= 0.8 with the default bands.

agentsynth.decontaminate ¶

decontaminate(trajectories, contaminants, threshold=0.8, shingle_k=3)

Split trajectories into (clean, flagged) by similarity to benchmark text.

Training data prep¶

agentsynth.build_sft_dataset ¶

build_sft_dataset(trajectories, path=None, only_passed=False, eval_results=None)

agentsynth.build_dpo_dataset ¶

build_dpo_dataset(pairs, path=None)

agentsynth.to_sft_records ¶

to_sft_records(trajectories, only_passed=False, eval_results=None)

Conversational SFT records. With only_passed, keep just the trajectories whose eval result passed.

agentsynth.to_dpo_records ¶

to_dpo_records(pairs)

Prompt/chosen/rejected records from preference pairs.

Tasks¶

agentsynth.SeedTask ¶

Bases: BaseModel

agentsynth.sample_tasks ¶

sample_tasks(n, domains=None, seed=0)

Pick n tasks deterministically, optionally restricted to some domains.

Cycles through the pool if n is larger than it.

Metrics¶

agentsynth.compute_dataset_metrics ¶

compute_dataset_metrics(trajectories, eval_results=None)

Flat dict of dataset-level metrics.

Without eval_results the judge-derived keys (pass_rate, avg_overall, avg_scores) come back None/empty. Empty input gives a zero-filled dict.

agentsynth.diversity_score ¶

diversity_score(trajectories)

Diversity of a trajectory set, in [0, 1]; 0.0 when empty.

Averages three signals, each normalised on its own: structural (unique tool signatures over count), domain (unique domains over count), and lexical (the type-token ratio across all query tokens).

agentsynth.run_report_md ¶

run_report_md(result, meter=None)

One-page markdown summary of a RunResult, with costs when a meter is given.

Exporters¶

agentsynth.to_jsonl ¶

to_jsonl(trajectories, path=None)

Serialise trajectories to JSONL, one object per line.

Each line holds the rendered messages plus the raw tools/steps that load_jsonl needs to rebuild the trajectory. Writes to path (UTF-8) when given; either way the JSONL string is returned.

agentsynth.to_sharegpt ¶

to_sharegpt(trajectories, path=None)

Convert trajectories to the ShareGPT conversations format.

Roles map off Trajectory.to_messages: user becomes human, assistant text becomes gpt, assistant tool_calls become function_call (value is JSON), and tool becomes observation. Dumps to JSON at path when given; the list is always returned.

agentsynth.to_adp ¶

to_adp(trajectories, path=None)

Convert trajectories to Agent Data Protocol-style records.

Each record carries the instruction, the tool catalog (names + parameters), a flat list of typed steps, and the final output. Writes JSON to path when given; the list is always returned.

agentsynth.save_dataset ¶

save_dataset(trajectories, path, fmt='jsonl', eval_results=None)

Write trajectories to path in the requested format.

fmt is one of jsonl, sharegpt, adp, parquet. Returns the output path and raises ValueError on an unknown format.

agentsynth.load_jsonl ¶

load_jsonl(path)

Read a JSONL file from to_jsonl back into trajectories.

messages is derived from the steps, so it's ignored on load and regenerated.

Hugging Face Hub¶

agentsynth.push_dataset ¶

push_dataset(trajectories, repo_id, token=None, eval_results=None, private=False, pretty_name='AgentSynth Trajectories', out_dir=None)

Build the dataset folder and upload it to the Hub. Returns the repo URL.

Requires pip install "agentsynth-ai[hub]" and a write token (arg or HF login).

agentsynth.dataset_card ¶

dataset_card(count, pretty_name='AgentSynth Trajectories', generator='mock', pass_rate=None, summary=None, repo_id='agentsynth/agentsynth-trajectories')

agentsynth.prepare_dataset_dir ¶

prepare_dataset_dir(trajectories, out_dir, eval_results=None, pretty_name='AgentSynth Trajectories', repo_id='agentsynth/agentsynth-trajectories')

Write a push-ready dataset folder (no network). Returns the folder path.

Utilities¶

agentsynth.parse_tool_catalog ¶

parse_tool_catalog(raw)

Coerce assorted inputs into a list of ToolSpec.

Accepts a JSON string, a list of tool dicts, a single tool dict, or the OpenAI {"tools": [...]} / function-calling shapes. Bad entries are dropped so user input in the UI never hard-crashes us.

agentsynth.default_tool_catalog ¶

default_tool_catalog()

agentsynth.PythonREPL ¶

A tiny Python REPL for grounding synthetic code_execution steps in real stdout.

WARNING: this is not a security boundary. Imports are restricted to a numeric/data whitelist and the most obvious dangerous patterns are blocked, but do not run untrusted code through it on a sensitive host.

The namespace persists across run calls, like a real REPL.

run ¶

run(code)

Run code, returning captured stdout plus the last expression value.

On error, returns a compact traceback string instead of raising.

agentsynth.LLMClient ¶

Thin wrapper over litellm.completion.

When LiteLLM isn't installed or no provider key/model is configured, available is False and complete returns ""; callers then fall back to deterministic mock generation.

Point it at a local server (vLLM, Ollama, any OpenAI-compatible endpoint) with api_base, or the AGENTSYNTH_API_BASE + AGENTSYNTH_MODEL env vars — no provider key needed. vLLM's continuous batching makes that the cheap path for bulk generation.

export AGENTSYNTH_API_BASE=http://localhost:8000/v1   # vLLM
export AGENTSYNTH_MODEL=openai/my-served-model
# or Ollama: AGENTSYNTH_API_BASE=http://localhost:11434, MODEL=ollama/llama3

complete ¶

complete(messages, temperature=None, max_tokens=None, **kwargs)

Return the assistant text for messages, or "" on any failure.

Schemas¶

agentsynth.Trajectory ¶

Bases: BaseModel

tool_signature ¶

tool_signature()

Stable signature of the tool sequence, for diversity metrics.

to_messages ¶

to_messages()

Render as an OpenAI-style messages list.

thought/plan/critique collapse into assistant content, a tool_call becomes an assistant message carrying tool_calls, and each observation becomes a role="tool" message.

agentsynth.TrajectoryStep ¶

Bases: BaseModel

One step in a trajectory.

Union-shaped: only the fields relevant to step_type are filled. Keeping it flat is what lets TRL/Unsloth/Axolotl trainers load the JSONL without a custom parser.

short ¶

short()

One-line rendering for logs and the UI.

agentsynth.ToolSpec ¶

Bases: BaseModel

A single tool the agent may call.

parameters is a JSON-Schema object, matching the OpenAI/Anthropic function-calling convention:

{
    "type": "object",
    "properties": {"city": {"type": "string", "description": "..."}},
    "required": ["city"],
}

agentsynth.RubricScores ¶

Bases: BaseModel

LLM-as-Judge scores, each in [0, 1].

agentsynth.EvalResult ¶

Bases: BaseModel

flat ¶

flat()

Flattened dict, handy for dataframes and metrics.

API reference¶

Generation¶

agentsynth.AgentTrajectoryGenerator ¶

generate ¶

generate_batch ¶

Evaluation & verification¶

agentsynth.TrajectoryEvaluator ¶

evaluate_batch ¶

agentsynth.verify_trajectory ¶

agentsynth.batch_verify ¶

agentsynth.EnsembleEvaluator ¶

agentsynth.Verifier ¶

check abstractmethod ¶

agentsynth.VerificationResult ¶

agentsynth.ExecutionVerifier ¶

agentsynth.ToolArgVerifier ¶

agentsynth.SafetyVerifier ¶

agentsynth.ExpectedAnswerVerifier ¶

agentsynth.get_rubric ¶

agentsynth.rubric_names ¶

agentsynth.RUBRIC_PRESETS module-attribute ¶

Learned verifier¶

agentsynth.train_learned_verifier ¶

agentsynth.LearnedVerifier ¶

predict_proba ¶

Scenarios¶

agentsynth.Scenario ¶

build_environment ¶

run_checks ¶

agentsynth.SqlCheck ¶

agentsynth.HttpCheck ¶

agentsynth.CalledTool ¶

agentsynth.AnswerContains ¶

agentsynth.CodeCheck ¶

agentsynth.run_scenario_suite ¶

agentsynth.load_scenarios ¶

agentsynth.save_scenarios ¶

Robustness (reward-hacking audit)¶

agentsynth.audit_pack ¶

agentsynth.RobustnessReport ¶

agentsynth.perturb_scenario ¶

agentsynth.ipt_report ¶

Synthesize verifiers from a demonstration¶

agentsynth.scenario_from_demonstration ¶

agentsynth.pack_from_demonstrations ¶

Export a pack (OpenEnv / verifiers)¶

agentsynth.scenario_reward ¶

agentsynth.reward_from_messages ¶

agentsynth.export_pack ¶

Reliability (beyond pass@1)¶

agentsynth.reliability_report ¶

agentsynth.ReliabilityReport ¶

agentsynth.wilson_interval ¶

Contamination audit¶

agentsynth.contamination_report ¶

agentsynth.ContaminationReport ¶

agentsynth.canary_for ¶

agentsynth.held_out_pack ¶

Reproducible submissions¶

agentsynth.run_manifest ¶

agentsynth.verify_run ¶

agentsynth.pack_fingerprint ¶

Multi-turn (user simulator)¶

agentsynth.run_conversation ¶

agentsynth.run_conversation_suite ¶

agentsynth.ConversationResult ¶

Plugins (custom environments)¶

agentsynth.register_environment ¶

agentsynth.available_environments ¶

RL¶

agentsynth.AgentGym ¶

from_scenario classmethod ¶

transcript ¶

rollout ¶

agentsynth.make_reward_fn ¶

agentsynth.rl.to_openenv ¶

Bring your own loop (adapters)¶

agentsynth.to_openai_tools ¶

agentsynth.action_from_openai_tool_call ¶

Environments¶

check `abstractmethod` ¶

agentsynth.RUBRIC_PRESETS `module-attribute` ¶

from_scenario `classmethod` ¶

tools `abstractmethod` ¶

execute `abstractmethod` ¶

agentsynth.BUILTIN_CASES `module-attribute` ¶