May 18, 2026·9 min read

Dynamic action discovery isn't novel. Doing it safely is.

Anthropic shipped Tool Search in November 2025. mcpproxy-go has BM25. Stacklok publishes 94% tool-selection accuracy. The pattern converged across the MCP-gateway market. mcpgate just shipped its own, and the choices that matter aren't the pattern: they are retrieval quality, the risk model, and how you roll it out without breaking working setups.

Six months ago, every MCP gateway shipped the same thing: a fixed list of tools, loaded into the model's context window on every turn. With 20 connected services and a read/write split, that is about 40 tools at 800 to 1,500 tokens each: somewhere between 32,000 and 60,000 tokens, call it 50,000, of tool definitions before the user has even typed a question. (We later measured this on live servers, and one alone hit ~140,000 tokens; see "The token cost of MCP tool definitions.")
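A quick sketch of the arithmetic behind that estimate. Two tools per service is one reading of the read/write split, and the per-tool token range is the one quoted above; nothing here is measured.

```python
# Back-of-envelope overhead of shipping every tool definition on every turn.
SERVICES = 20
TOOLS_PER_SERVICE = 2          # assumption: one read tool and one write tool
TOKENS_PER_TOOL = (800, 1500)  # low and high estimate per tool definition

tools = SERVICES * TOOLS_PER_SERVICE
low, high = (tools * t for t in TOKENS_PER_TOOL)
midpoint = (low + high) // 2

print(f"{tools} tools: {low:,} to {high:,} tokens per turn (midpoint {midpoint:,})")
```

The midpoint lands a little under the 50,000 figure, which is why it is best read as a round number rather than a measurement.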

Two things broke at that point. The first is accuracy. Anthropic's own advanced tool-use guidance recommends deferred loading whenever "tool definitions consume more than 10K tokens" or "10+ tools are available", and their internal evals show measurable accuracy gains with Tool Search enabled: Opus 4 from 49% to 74%, Opus 4.5 from 79.5% to 88.1%. The other half is plain economics: every user turn re-ships every tool definition. A team running 10,000 conversations a day was paying for a lot of "Gmail send_email" prose nobody read.

The fix is obvious in hindsight. Don't ship every tool. Ship a search tool, and let the model pull definitions on demand.

Anthropic's Tool Search Tool shipped November 24, 2025. mcpproxy-go ships BM25 filtering. agentic-community/mcp-gateway-registry ships FAISS with sentence-transformer embeddings. Stacklok ToolHive ships hybrid BM25-plus-semantic and publishes 94% tool-selection accuracy on a 2,792-tool benchmark. Eight or nine other gateways shipped some version of it. The pattern is not the moat anymore.

Key takeaway

Dynamic action discovery — BM25 over a deferred tool catalog, activate-on-demand — is now a market-wide pattern, not a differentiator. The decisions that actually matter when shipping it: how good is the retrieval, what is the risk model for a write or destructive action, and how do you turn it on without breaking the setups that already work. This post is mcpgate's choices and the tradeoffs we are still living with.

What converged across the market

For an honest comparison, here is what the awesome-mcp-gateways list and Anthropic's own first-party API both look like as of May 2026:

Implementer	Mechanism	Surface
Anthropic (first-party API)	`tool_search_tool` + `defer_loading: true` per tool, BM25 or regex	Any tool definition you ship to the API
Claude Code	Built-in tools deferred behind `ToolSearch` since early 2026	Default system-tool context cut substantially; mechanism is the same as the API-level Tool Search
mcpproxy-go	BM25 over indexed tool descriptions	Whatever MCP servers you point it at
agentic-community/mcp-gateway-registry	FAISS + sentence-transformers	Registry of registered MCP servers
Stacklok ToolHive Optimizer	Hybrid BM25 + semantic, published 94% selection accuracy on 2,792 tools	Indexed MCP-tool surface
Composio Tool Router	Intent-analysis filter, 20,000 tools / 1,000 apps	Composio's catalog
Smithery Toolbox	Registry-driven `smithery tool find`	6,000+ registered servers
Docker MCP Toolkit	`mcp-find` + `mcp-add` meta-tools	Server-level enablement
IBM ContextForge	Plugin-driven admin enable/disable	No model-facing search; admin curation
OBot	Federates `tools/list` across upstream servers	No own catalog, no search-then-activate

If "everyone shipped it," the interesting question is: does it work? Two published numbers, from two different harnesses:

Implementation	Score	Harness
Anthropic regex mode	56% top-K retrieval	Arcade.dev, 4,027 tools, 25 tasks
Anthropic BM25 mode	64% top-K retrieval	Arcade.dev, 4,027 tools, 25 tasks
Stacklok ToolHive (hybrid)	94% selection / 98% retrieval	Stacklok's own, 2,792 tools
Anthropic Tool Search (on Stacklok's harness)	34% selection / 48% retrieval	Stacklok's own, 2,792 tools

Two different harnesses, two different methodologies, two ways of saying the same thing: there is a wide gap between the first-party Anthropic implementation and the best independent retriever. Retrieval quality is the real moat, not the pattern. Anyone telling you "we ship dynamic discovery" without numbers is making a claim about the easy part. The hard part is the indexing, the ranking, and the off-the-happy-path behavior — typo tolerance, synonym handling, action-vs-resource disambiguation, write/read disambiguation. Stacklok publishes its harness; Arcade publishes its harness; we should too.

What mcpgate just shipped

We rolled out dynamic action discovery in May 2026. The mechanism is BM25 over a long-tail action catalog generated from OpenAPI specs. mcpgate has shipped an admin-UI OpenAPI importer for a while — point it at a Microsoft Graph or BigQuery spec and the YAML appears. The catch, until now, was that every imported endpoint went into the default tool list. Importing the full Microsoft Graph spec meant a tools/list payload no MCP client could reason about. So in practice, admins imported the handful of endpoints they knew they needed and stopped — leaving most of the surface unreachable not because the YAML did not exist, but because surfacing it cost more context than it bought.

Dynamic discovery is what makes the full-spec import practical. The long-tail goes into a search index instead of the default tool list, and the model pulls definitions on demand. The catalog currently sits at more than 17,000 long-tail actions across the connected services — almost all of which would have been unusable without the search-and-call layer on top.

The model accesses them via one meta-tool: gateway_search_actions(query, service?, limit). It returns BM25-ranked candidates with their service, HTTP method, endpoint, and risk annotation. A candidate is callable directly by name on the next call — there is no separate activation step.
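To make the shape of that meta-tool concrete, here is a minimal BM25 sketch in plain Python. It is not mcpgate's implementation: the catalog entries, the `search_actions` name, the tokenizer, the default `limit`, and the `k1`/`b` values are all assumptions for illustration. Only the parameters (query, optional service, limit) and the returned fields (service, method, endpoint, risk) follow the description above.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    name: str
    service: str
    method: str
    endpoint: str
    risk: str
    description: str


# Hypothetical catalog entries, for illustration only.
CATALOG = [
    Action("graph_list_calendar_events", "microsoft_graph", "GET", "/me/events",
           "read", "List calendar events for the signed-in user"),
    Action("graph_delete_user", "microsoft_graph", "DELETE", "/users/{id}",
           "destructive", "Delete a user account from the directory"),
    Action("bigquery_list_datasets", "bigquery", "GET",
           "/projects/{projectId}/datasets", "read", "List datasets in a project"),
]


def tokenize(text):
    """Lowercase and split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())


def search_actions(query, service=None, limit=5, k1=1.5, b=0.75):
    """Rank catalog actions against the query with Okapi BM25."""
    docs = [a for a in CATALOG if service is None or a.service == service]
    if not docs:
        return []
    tokens = [tokenize(f"{a.name} {a.description}") for a in docs]
    avg_len = sum(len(t) for t in tokens) / len(tokens)
    doc_freq = Counter(term for t in tokens for term in set(t))

    scored = []
    for action, toks in zip(docs, tokens):
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (len(docs) - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        if score > 0:
            scored.append((score, action))

    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [action for _, action in scored[:limit]]
```

Calling `search_actions("delete user")` ranks the hypothetical `graph_delete_user` entry first, risk annotation included, which is the case the risk model further down exists to handle.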

That part is unremarkable. The choices below are not.

Per-service toggle, default ON

Long-tail discovery is controlled by an admin toggle per service, on each service's page in the dashboard. We considered three rollout strategies:

Global on, single global switch. Every service exposes its long-tail in one flip. Simple, but it forces a team that wants Microsoft Graph long-tail to also turn on BigQuery long-tail. Rejected.
Global off, single global switch. Same coupling problem, inverted. Rejected.
Per-service toggle. Each service flips independently. The team that wants Graph access turns it on for one service; the team that does not leaves it off. This is what shipped.

The remaining question was the default. We launched the toggle off per service, so the initial rollout was byte-identical to the prior behavior and could not surprise anyone. After it ran cleanly, we flipped the default to on (29 May 2026): a connected service exposes its long-tail to search unless an admin opts it out. Search visibility never implies an action can fire unattended — the read/write tool split and the confirmed=true gate on destructive actions are the execution guardrails, and every call is audited. The same setting is exposed as an environment variable (DYNAMIC_DISCOVERY_DISABLED_SERVICES, a comma-separated opt-out denylist) for IaC-managed deployments.
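A minimal sketch of how a comma-separated opt-out denylist like this could be read. The variable name comes from the text; the helper names, the whitespace trimming, and exact-match comparison are assumptions, and the sketch does not model how the dashboard toggle and the variable interact.

```python
import os

ENV_VAR = "DYNAMIC_DISCOVERY_DISABLED_SERVICES"


def disabled_services(environ=None):
    """Parse the denylist; an unset or empty variable disables nothing."""
    source = os.environ if environ is None else environ
    raw = source.get(ENV_VAR, "")
    return {name.strip() for name in raw.split(",") if name.strip()}


def long_tail_enabled(service, environ=None):
    """Default is on: a service is searchable unless it is on the denylist."""
    return service not in disabled_services(environ)
```

With `DYNAMIC_DISCOVERY_DISABLED_SERVICES="bigquery, salesforce"`, `long_tail_enabled("bigquery")` is false and every other service stays searchable.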

Risk annotation lives at the action layer, not the tool layer

Anthropic's defer_loading is binary — a tool is either in the context or it isn't. There is no built-in concept of "this tool reads, this tool writes, this tool deletes." If you want differentiated handling, you build it on top.

mcpgate's YAML actions already carry method-level metadata: every action declares its HTTP method, its endpoint, whether it accepts a body. We annotate each long-tail action with a risk tier derived from that metadata plus a path-keyword scan:

read — GET endpoints, no body. Lowest risk. The model can call these autonomously.
write — POST/PUT/PATCH. Carries a body, mutates state. Routed only through the *_write_actions tool — an agent granted just the read tool can't reach it — and recorded in the audit log.
destructive — DELETE endpoints, or POSTs whose path matches a keyword list (/delete, /revoke, /terminate, /wipe). These require an explicit confirmed=true before they execute, and the model is expected to surface the consequence in plain language before it asks.
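The three tiers can be sketched as a small classifier. The keyword list and the method rules are taken from the description above; the function name, the case-insensitive path match, and raising on methods the text does not cover are assumptions of this sketch.

```python
DESTRUCTIVE_KEYWORDS = ("/delete", "/revoke", "/terminate", "/wipe")


def risk_tier(method, path):
    """Derive a risk tier from the HTTP method plus a path-keyword scan."""
    method = method.upper()
    if method == "DELETE":
        return "destructive"
    if method == "POST" and any(k in path.lower() for k in DESTRUCTIVE_KEYWORDS):
        return "destructive"
    if method in {"POST", "PUT", "PATCH"}:
        return "write"
    if method == "GET":
        return "read"
    # The text does not say how other methods are tiered, so refuse to guess.
    raise ValueError(f"no risk tier defined for method {method}")
```

A POST to a path such as `/users/{id}/revokeSignInSessions` lands in the destructive tier because of the `/revoke` match, even though the method alone would have made it a plain write.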

The shape echoes SEP-1888, the MCP spec draft for progressive disclosure of typed library operations, which includes risk-level annotations as part of the proposed operation index. SEP-1888 has no sponsor yet and may never land — but if it does, we are already structurally compatible.

This matters more than it first looks. Stacklok's numbers (94% selection, 98% retrieval) are impressive, but "retrieve correctly" is not the same as "execute safely." A model that retrieves delete_user with high relevance is still a model that just retrieved delete_user. The risk tier lets us treat retrieval and execution as separate problems.
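One way to picture execution as a separate problem is a gate that runs after retrieval and before dispatch. This is a sketch under assumptions: the `authorize` function, the `ActionDenied` exception, and the example tool names are hypothetical, and routing destructive actions through the write tool is our reading of the tiers rather than something the text states outright.

```python
class ActionDenied(Exception):
    """Raised when a call fails one of the execution gates."""


def authorize(risk, tool, granted_tools, confirmed=False):
    """Return True if the call may be dispatched, otherwise raise ActionDenied."""
    if tool not in granted_tools:
        raise ActionDenied(f"agent has no grant for {tool}")
    # Writes (and, by assumption, destructive calls) only go through a write tool.
    if risk in {"write", "destructive"} and not tool.endswith("_write_actions"):
        raise ActionDenied(f"{risk} actions must go through a *_write_actions tool")
    # Destructive calls additionally need an explicit confirmation flag.
    if risk == "destructive" and not confirmed:
        raise ActionDenied("destructive actions require confirmed=true")
    return True
```

An agent granted only a read tool can still find delete_user in search results, but `authorize` rejects the call before anything is dispatched.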

Pseudonymization still applies

This is the part we cannot bolt onto someone else's defer_loading. The two pillars mcpgate built first — PII pseudonymization with rehydration, and two-layer policy hooks — apply to long-tail actions the same way they apply to hand-written ones. An auto-imported Microsoft Graph endpoint goes through the same pseudonymize-on-the-way-out, rehydrate-on-the-way-back pipeline. Hooks see the call and can deny it before the gateway dispatches.
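For readers unfamiliar with the pattern, a minimal sketch of pseudonymization with rehydration follows. It is illustrative only: it covers just email addresses, the placeholder format and class name are made up, and it says nothing about which side of the gateway sees which form in mcpgate's pipeline.

```python
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


class Pseudonymizer:
    """Swap email addresses for stable placeholders and restore them later."""

    def __init__(self):
        self._forward = {}  # real value -> placeholder
        self._reverse = {}  # placeholder -> real value

    def pseudonymize(self, text):
        def swap(match):
            value = match.group(0)
            if value not in self._forward:
                token = f"<EMAIL_{len(self._forward) + 1}>"
                self._forward[value] = token
                self._reverse[token] = value
            return self._forward[value]

        return EMAIL.sub(swap, text)

    def rehydrate(self, text):
        for token, value in self._reverse.items():
            text = text.replace(token, value)
        return text
```

The same address always maps to the same placeholder within one `Pseudonymizer`, so references stay consistent across a conversation and the round trip restores the original text.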

None of this is exclusive to mcpgate — any gateway can choose to build dynamic discovery on top of an existing PII/policy substrate. But several gateways shipped discovery before the substrate, and retrofitting it after the fact tends to leak: every new endpoint becomes a new place where PII might escape unredacted. We shipped it the other way around, and we are glad we did.

What is still open

Three things we have not solved yet.

We have not published our own retrieval benchmark. Stacklok did, on its own harness, and Arcade.dev did on theirs. We should. The reason we have not is that those harnesses assume a flat tool surface; our catalog has service scoping, which changes the recall denominator. The right answer is to publish numbers in both modes, scoped and unscoped, and we owe that work.
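A toy illustration of why the two modes can report different numbers. Everything here is hypothetical: the catalog, the test cases, and the word-overlap retriever, which stands in for BM25 only to keep the example short.

```python
# Hypothetical catalog: action name -> service.
CATALOG = {
    "graph_list_users": "microsoft_graph",
    "graph_list_groups": "microsoft_graph",
    "okta_list_users": "okta",
}


def retrieve(query, service=None, limit=5):
    """Stand-in retriever: rank by how many query words appear in the name."""
    words = set(query.lower().split())
    names = [n for n, s in CATALOG.items() if service is None or s == service]
    scored = [(len(words & set(n.split("_"))), n) for n in names]
    ranked = sorted((pair for pair in scored if pair[0] > 0),
                    key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in ranked[:limit]]


def recall_at_k(cases, k=1, scoped=False):
    """Fraction of cases whose expected action appears in the top k results."""
    hits = 0
    for case in cases:
        service = case["service"] if scoped else None
        hits += case["expected"] in retrieve(case["query"], service=service, limit=k)
    return hits / len(cases)


CASES = [
    {"query": "list users", "service": "okta", "expected": "okta_list_users"},
    {"query": "list groups", "service": "microsoft_graph", "expected": "graph_list_groups"},
]
```

On this toy data, unscoped recall at k=1 is 0.5, because the two `list_users` actions tie and the wrong service wins, while scoped recall is 1.0. Reporting only one of the two would tell half the story.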

BM25 alone has a known ceiling. The published numbers put hybrid BM25-plus-semantic at 94% on Stacklok's harness and pure BM25 around 64% on Arcade.dev's; the harnesses differ, so treat the gap as indicative rather than exact. There is a clean path to a semantic-augmented retriever, and we have not taken it. The reason is simple: at this scale BM25 is fast and predictable, and a semantic-embedding miss is harder to debug than a keyword miss. We will revisit when the catalog grows past ~50,000 entries or when we have ground-truth labels to train against.

Swarm promotion of long-tail to first-class. If 200 customers all use microsoft_graph.list_calendar_events, that action is no longer long-tail. It should be in the default tool set, with hand-curated description and parameter prose. We have the telemetry to detect this; we have not built the promotion pipeline. This is the next thing.
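The detection half could be as small as counting distinct customers per action. This is a sketch, not the missing pipeline: the event shape (customer id, action name) and the function name are assumptions, and the 200-customer threshold is borrowed from the example above rather than a tuned value.

```python
from collections import defaultdict


def promotion_candidates(events, min_customers=200):
    """Return actions used by at least min_customers distinct customers."""
    customers_by_action = defaultdict(set)
    for customer_id, action in events:
        customers_by_action[action].add(customer_id)
    return sorted(
        action
        for action, customers in customers_by_action.items()
        if len(customers) >= min_customers
    )
```

Counting distinct customers rather than raw calls keeps one heavy user from promoting an action nobody else needs.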

Honest scorecard

Six things we got right, two things we did not, ranked in the order we would defend if pressed:

Per-service toggle, launched default-off then flipped on. If we had shipped global-on from day one, we would have spent the first week of May rolling it back for at least one customer. Launching opt-in, proving it, then flipping the default to opt-out got us to the same capability without the rollback risk.
Risk tier at the action layer. Cheaper than per-call ACLs; more honest than "trust the model."
Pairing the OpenAPI importer with dynamic discovery. The importer was already in the admin UI; what was missing was a way to use a full imported spec without overflowing tools/list. The combination is what unlocks the long-tail — neither half by itself would have been enough.
BM25 over semantic, for now. We are giving up retrieval quality at the upper end in exchange for predictability. We think the trade is right at 17k actions; we are not sure it stays right at 50k.
Reusing the existing pseudonymization + hooks substrate. One less place for PII to leak; one less policy surface to audit.
Long-tail YAML lives in the same directory as curated YAML (actions.imported_longtail.yaml next to actions.yaml). The auto-import is a regular YAML file, reviewable by humans, diff-able in MRs.
Did not ship a published retrieval benchmark. We should have. Mea culpa.
Did not ship swarm-driven promotion. The data is there; the pipeline is not.

Where to try it

On a self-hosted instance, the per-service toggle on the connections dashboard is the authoritative control, and since the 29 May 2026 change it defaults to ON for connected services: long-tail is visible to search for every service unless you exclude it. To opt a service out (or for IaC-managed deployments), set DYNAMIC_DISCOVERY_DISABLED_SERVICES in your environment to a comma-separated list of service names; leave it empty to keep the default. Either way, search visibility is only half the story. The read/write split and the confirmed=true gate on destructive actions govern execution, so the opt-out default widens what is discoverable, not what fires unattended. The two-minute install still applies.

If you want to compare side-by-side against Anthropic's first-party tool_search_tool — go ahead. They are not mutually exclusive: a Claude conversation can use Anthropic's deferred-loading for its own tools while talking to mcpgate's gateway_search_actions for everything else. The two complement each other; they do not compete.

If you find a query that retrieves badly, tell us. That is the part we are not done with yet.