Dynamic action discovery isn't novel. Doing it safely is.
Anthropic shipped Tool Search in November 2025. mcpproxy-go has BM25. Stacklok publishes 94% tool-selection accuracy. The pattern converged across the MCP-gateway market. mcpgate just shipped its own, and the choices that matter are not the pattern itself but retrieval quality, the risk model, and how you roll it out without breaking working setups.
Six months ago, every MCP gateway shipped the same thing: a fixed list of tools, loaded into the model's context window on every turn. With 20 connected services and read/write split, that is about 40 tools at 800 to 1500 tokens each. Roughly 50,000 tokens of tool definitions before the user has even typed a question.
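The arithmetic behind that estimate, as a quick sketch. The inputs are the figures from the paragraph above; nothing here is measured.

```python
# Back-of-the-envelope context cost of a fixed tool list.
SERVICES = 20
TOOLS_PER_SERVICE = 2            # one read tool and one write tool per service
TOKENS_PER_TOOL = (800, 1500)    # low and high estimate per tool definition

tools = SERVICES * TOOLS_PER_SERVICE
low, high = (tools * t for t in TOKENS_PER_TOOL)
print(f"{tools} tools -> {low:,} to {high:,} tokens per turn")
```

The 50,000 figure sits inside that 32,000 to 60,000 range.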
Two things broke at that point. The first is accuracy: Anthropic's own advanced tool-use guidance recommends deferred loading whenever "tool definitions consume more than 10K tokens" or "10+ tools are available", and its internal evals show measurable gains once Tool Search is enabled: Opus 4 rises from 49% to 74%, Opus 4.5 from 79.5% to 88.1%. The second is plain economics: every user turn re-ships every tool definition. A team running 10,000 conversations a day was paying for a lot of "Gmail send_email" prose nobody read.
The fix is obvious in hindsight. Don't ship every tool. Ship a search tool, and let the model pull definitions on demand.
Anthropic's Tool Search Tool shipped November 24, 2025. mcpproxy-go ships BM25 filtering. agentic-community/mcp-gateway-registry ships FAISS with sentence-transformer embeddings. Stacklok ToolHive ships hybrid BM25-plus-semantic and publishes 94% tool-selection accuracy on a 2,792-tool benchmark. Eight or nine other gateways shipped some version of it. The pattern is not the moat anymore.
Key takeaway
Dynamic action discovery — BM25 over a deferred tool catalog, activate-on-demand — is now a market-wide pattern, not a differentiator. The decisions that actually matter when shipping it: how good is the retrieval, what is the risk model for activating a write or destructive action, and how do you turn it on without breaking the setups that already work. This post is mcpgate's choices and the tradeoffs we are still living with.
What converged across the market
For an honest comparison, here is what the awesome-mcp-gateways list and Anthropic's own first-party API both look like as of May 2026:
| Implementer | Mechanism | Surface |
|---|---|---|
| Anthropic (first-party API) | tool_search_tool + defer_loading: true per tool, BM25 or regex | Any tool definition you ship to the API |
| Claude Code | Built-in tools deferred behind ToolSearch since early 2026 | Default system-tool context cut substantially; mechanism is the same as the API-level Tool Search |
| mcpproxy-go | BM25 over indexed tool descriptions | Whatever MCP servers you point it at |
| agentic-community/mcp-gateway-registry | FAISS + sentence-transformers | Registry of registered MCP servers |
| Stacklok ToolHive Optimizer | Hybrid BM25 + semantic, published 94% selection accuracy @ 2,792 tools | Indexed MCP-tool surface |
| Composio Tool Router | Intent-analysis filter, 20,000 tools / 1,000 apps | Composio's catalog |
| Smithery Toolbox | Registry-driven smithery tool find | 6,000+ registered servers |
| Docker MCP Toolkit | mcp-find + mcp-add meta-tools | Server-level enablement |
| IBM ContextForge | Plugin-driven admin enable/disable | No model-facing search; admin curation |
| OBot | Federates tools/list across upstream servers | No own catalog, no search-then-activate |
If "everyone shipped it," the interesting question is: does it work? Two published numbers, from two different harnesses:
| Implementation | Score | Harness |
|---|---|---|
| Anthropic regex mode | 56% top-K retrieval | Arcade.dev, 4,027 tools, 25 tasks |
| Anthropic BM25 mode | 64% top-K retrieval | Arcade.dev, 4,027 tools, 25 tasks |
| Stacklok ToolHive (hybrid) | 94% selection / 98% retrieval | Stacklok's own, 2,792 tools |
| Anthropic Tool Search (on Stacklok's harness) | 34% selection / 48% retrieval | Stacklok's own, 2,792 tools |
Two different harnesses, two different methodologies, two ways of saying the same thing: there is a wide gap between the first-party Anthropic implementation and the best independent retriever. Retrieval quality is the real moat, not the pattern. Anyone telling you "we ship dynamic discovery" without numbers is making a claim about the easy part. The hard part is the indexing, the ranking, and the off-the-happy-path behavior — typo tolerance, synonym handling, action-vs-resource disambiguation, write/read disambiguation. Stacklok publishes its harness; Arcade publishes its harness; we should too.
What mcpgate just shipped
We rolled out dynamic action discovery in May 2026. The mechanism is BM25 over a long-tail action catalog generated from OpenAPI specs. mcpgate has shipped an admin-UI OpenAPI importer for a while — point it at a Microsoft Graph or BigQuery spec and the YAML appears. The catch, until now, was that every imported endpoint went into the default tool list. Importing the full Microsoft Graph spec meant a tools/list payload no MCP client could reason about. So in practice, admins imported the handful of endpoints they knew they needed and stopped — leaving most of the surface unreachable not because the YAML did not exist, but because surfacing it cost more context than it bought.
Dynamic discovery is what makes the full-spec import practical. The long-tail goes into a search index instead of the default tool list, and the model pulls definitions on demand. The catalog currently sits at more than 16,000 long-tail actions across the connected services — almost all of which would have been unusable without the search-then-activate layer on top.
The model reaches them through one meta-tool: gateway_search_actions(query, service?, limit). It returns BM25-ranked candidates with their service, HTTP method, endpoint, risk annotation, and whether they are currently active. Activation happens implicitly on the next call, subject to the risk tiers described below.
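To make the mechanism concrete, here is a minimal, self-contained sketch of BM25 ranking over a small action catalog with an optional service filter. It is an illustration under assumptions, not the shipped code: the Action fields, the ActionIndex class, the tokenizer, and the k1/b defaults are all hypothetical, and the sample catalog entries are made up.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Action:
    name: str
    service: str
    method: str
    endpoint: str
    description: str


def tokenize(text: str) -> list[str]:
    # Lowercase and split on anything that is not a letter or digit,
    # so "list_calendar_events" becomes ["list", "calendar", "events"].
    return re.findall(r"[a-z0-9]+", text.lower())


class ActionIndex:
    """Hypothetical in-memory BM25 index over action metadata."""

    def __init__(self, actions, k1: float = 1.5, b: float = 0.75):
        self.actions = list(actions)
        self.k1, self.b = k1, b
        self.docs = [Counter(tokenize(f"{a.name} {a.endpoint} {a.description}"))
                     for a in self.actions]
        self.lengths = [sum(d.values()) for d in self.docs]
        self.avg_len = sum(self.lengths) / len(self.docs) if self.docs else 1.0
        # Number of documents that contain each term.
        self.doc_freq = Counter(term for d in self.docs for term in d)

    def _score(self, i: int, terms: list[str]) -> float:
        doc, length, n = self.docs[i], self.lengths[i], len(self.docs)
        score = 0.0
        for term in terms:
            tf = doc.get(term, 0)
            if tf == 0:
                continue
            df = self.doc_freq[term]
            idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
            norm = tf + self.k1 * (1 - self.b + self.b * length / self.avg_len)
            score += idf * tf * (self.k1 + 1) / norm
        return score

    def search(self, query: str, service: Optional[str] = None, limit: int = 5):
        terms = tokenize(query)
        hits = []
        for i, action in enumerate(self.actions):
            if service is not None and action.service != service:
                continue  # service scoping shrinks the candidate set
            score = self._score(i, terms)
            if score > 0:
                hits.append((score, action))
        hits.sort(key=lambda h: h[0], reverse=True)
        return [action for _, action in hits[:limit]]


# Made-up sample catalog, for illustration only.
CATALOG = [
    Action("list_calendar_events", "microsoft_graph", "GET", "/me/events",
           "List events in the signed-in user's calendar"),
    Action("delete_user", "microsoft_graph", "DELETE", "/users/{id}",
           "Delete a user account"),
    Action("run_query", "bigquery", "POST", "/projects/{projectId}/queries",
           "Run a SQL query"),
]
index = ActionIndex(CATALOG)
print([a.name for a in index.search("calendar events")])
```

A real index would also carry the risk annotation and active flag in each result; they are left out here to keep the ranking logic visible.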
That part is unremarkable. The choices below are not.
Default OFF, per-service Beta toggle
Long-tail discovery is opt-in per service, off by default, controlled by an admin toggle on the connections page. We considered three rollout strategies:
- Global on. Every service exposes its long-tail catalog. Maximum capability, maximum risk of an unexpected action firing on a setup that has been working for months. Rejected.
- Global off, single global switch. Easy to reason about, but it forces a team that wants Microsoft Graph long-tail to also turn on BigQuery long-tail. Rejected.
- Per-service toggle, default OFF. The team that wants Graph access can flip it for one service. The team that does not is byte-identical to last week's behavior. This is what shipped.
The cost is one more switch on the connections page. The benefit is that the rollout cannot break anyone — and admins can A/B their own setup by flipping it back. The same setting is exposed as an environment variable (DYNAMIC_DISCOVERY_ENABLED_SERVICES) for IaC-managed deployments.
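A sketch of how a gateway might read that variable. The variable name comes from this post and the comma-separated format is the one described for self-hosted installs. The function names, the whitespace trimming, and the case-sensitive matching are assumptions made for illustration.

```python
import os
from typing import Mapping, Optional

ENV_VAR = "DYNAMIC_DISCOVERY_ENABLED_SERVICES"


def enabled_services(environ: Optional[Mapping[str, str]] = None) -> frozenset:
    # An unset or empty variable means no service is opted in (default OFF).
    environ = os.environ if environ is None else environ
    raw = environ.get(ENV_VAR, "")
    return frozenset(name.strip() for name in raw.split(",") if name.strip())


def discovery_enabled(service: str,
                      environ: Optional[Mapping[str, str]] = None) -> bool:
    # Assumption: service names are matched exactly, case-sensitively.
    return service in enabled_services(environ)


print(discovery_enabled("microsoft_graph", {ENV_VAR: "microsoft_graph, bigquery"}))
```

Passing the environment as a parameter keeps the check testable without mutating os.environ.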
Risk annotation lives at the action layer, not the tool layer
Anthropic's defer_loading is binary — a tool is either in the context or it isn't. There is no built-in concept of "this tool reads, this tool writes, this tool deletes." If you want differentiated handling, you build it on top.
mcpgate's YAML actions already carry method-level metadata: every action declares its HTTP method, its endpoint, whether it accepts a body. We annotate each long-tail action with a risk tier derived from that metadata plus a path-keyword scan:
- read — GET endpoints, no body. Lowest risk. The model can activate and call autonomously.
- write — POST/PUT/PATCH. Carries a body, mutates state. Activation triggers an audit-log line.
- destructive — DELETE endpoints, or POSTs whose path matches a keyword list (/delete, /revoke, /terminate, /wipe). Activation requires the model to pass confirmed=true.
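The three tiers can be sketched as a small classifier. The keyword list and the method rules come from the list above. The function name, the has_body parameter, substring matching on the lowercased path, and the conservative fallback for other methods are assumptions.

```python
DESTRUCTIVE_KEYWORDS = ("/delete", "/revoke", "/terminate", "/wipe")


def risk_tier(method: str, path: str, has_body: bool = False) -> str:
    method = method.upper()
    path = path.lower()
    if method == "DELETE":
        return "destructive"
    # Assumption: keywords are matched as substrings of the lowercased path.
    if method == "POST" and any(k in path for k in DESTRUCTIVE_KEYWORDS):
        return "destructive"
    if method in {"POST", "PUT", "PATCH"}:
        return "write"
    if method == "GET" and not has_body:
        return "read"
    # Assumption: anything else (a GET with a body, unknown methods)
    # is treated conservatively as a write.
    return "write"


print(risk_tier("POST", "/users/{id}/revokeSignInSessions"))
```

Substring matching is deliberately blunt: it would also flag a path such as /deletedItems, which errs on the side of asking for confirmation.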
The shape echoes SEP-1888, the MCP spec draft for progressive disclosure of typed library operations, which includes risk-level annotations as part of the proposed operation index. SEP-1888 has no sponsor yet and may never land — but if it does, we are already structurally compatible.
This matters more than it first looks. Stacklok's 94% selection number is impressive, but "retrieve correctly" is not the same as "execute safely." A model that retrieves delete_user with high relevance is still a model that just retrieved delete_user. The risk tier lets us treat retrieval and execution as separate problems.
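One way to picture the separation is an activation gate that runs after retrieval and looks only at the tier. This is a sketch under assumptions: the function, the exception, and the logger name are hypothetical, and logging destructive activations as well as writes is our assumption rather than documented behavior.

```python
import logging

audit_log = logging.getLogger("gateway.audit")  # hypothetical logger name


class ConfirmationRequired(Exception):
    """Raised when a destructive action is activated without confirmed=true."""


def activate(action_name: str, tier: str, confirmed: bool = False) -> bool:
    if tier not in {"read", "write", "destructive"}:
        raise ValueError(f"unknown risk tier: {tier}")
    if tier == "destructive" and not confirmed:
        raise ConfirmationRequired(
            f"{action_name} is destructive; retry with confirmed=true")
    if tier in {"write", "destructive"}:
        # Writes produce an audit-log line; doing the same for destructive
        # activations is an assumption of this sketch.
        audit_log.info("activated %s (tier=%s)", action_name, tier)
    return True


print(activate("list_calendar_events", "read"))
```

Retrieval quality never enters this function, which is the point: a perfectly ranked delete_user still stops at the gate.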
Pseudonymization still applies
This is the part we cannot bolt onto someone else's defer_loading. The two pillars mcpgate built first — PII pseudonymization with rehydration, and two-layer policy hooks — apply to long-tail actions the same way they apply to hand-written ones. An auto-imported Microsoft Graph endpoint goes through the same pseudonymize-on-the-way-out, rehydrate-on-the-way-back pipeline. Hooks see the call and can deny it before the gateway dispatches.
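A toy version of that pipeline, to show the shape rather than the implementation. Everything here is an assumption: the email-only detector, the placeholder format, the class and function names, and the direction of flow (placeholders toward the model, real values toward the upstream service). Whether hooks see placeholders or real values is also assumed.

```python
import re
from typing import Callable, Iterable

EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


class Pseudonymizer:
    """Toy reversible mapping between email addresses and placeholders."""

    def __init__(self) -> None:
        self._forward: dict = {}
        self._reverse: dict = {}

    def pseudonymize(self, text: str) -> str:
        def swap(match: re.Match) -> str:
            real = match.group(0)
            if real not in self._forward:
                token = f"<EMAIL_{len(self._forward) + 1}>"
                self._forward[real] = token
                self._reverse[token] = real
            return self._forward[real]
        return EMAIL.sub(swap, text)

    def rehydrate(self, text: str) -> str:
        for token, real in self._reverse.items():
            text = text.replace(token, real)
        return text


def dispatch(args: str,
             hooks: Iterable[Callable[[str], bool]],
             upstream: Callable[[str], str],
             pseudo: Pseudonymizer) -> str:
    # Hooks run first and can deny the call before anything is dispatched.
    # Assumption: hooks see the placeholder form of the arguments.
    for hook in hooks:
        if not hook(args):
            raise PermissionError("call denied by policy hook")
    response = upstream(pseudo.rehydrate(args))   # real values go upstream
    return pseudo.pseudonymize(response)          # placeholders go to the model


pseudo = Pseudonymizer()
print(pseudo.pseudonymize("Contact alice@example.com"))
```

The same dispatch path serves hand-written and auto-imported actions, so a newly imported endpoint needs no PII handling of its own.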
None of this is exclusive to mcpgate — any gateway can choose to build dynamic discovery on top of an existing PII/policy substrate. But several gateways shipped discovery before the substrate, and retrofitting it after the fact tends to leak: every new endpoint becomes a new place where PII might escape unredacted. We shipped it the other way around, and we are glad we did.
What is still open
Three things we have not solved yet.
We have not published our own retrieval benchmark. Stacklok did, on its own harness. We should. The reason we have not is that the harness assumes a flat tool surface; our catalog has service scoping, which changes the recall denominator. The right answer is to publish numbers in both modes, scoped and unscoped, and we owe that work.
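To show why scoping changes the number, here is a small sketch that computes a top-K hit rate in both modes. The metric function, the task format, the toy word-overlap ranker, and the google_calendar service are all hypothetical; this is not our benchmark and the resulting figures mean nothing beyond the illustration.

```python
from typing import Callable, Optional, Sequence

# Each task: (query, service the user is working in, expected action).
Task = tuple
Search = Callable[[str, Optional[str], int], Sequence[str]]


def hit_rate_at_k(tasks: Sequence[Task], search: Search, k: int,
                  scoped: bool) -> float:
    hits = 0
    for query, service, expected in tasks:
        results = search(query, service if scoped else None, k)
        hits += expected in results[:k]
    return hits / len(tasks)


# Toy catalog with the same action name in two services.
CATALOG = [
    ("microsoft_graph", "microsoft_graph.list_events"),
    ("google_calendar", "google_calendar.list_events"),
]


def toy_search(query: str, service: Optional[str], limit: int) -> list:
    # Stand-in ranker: counts shared words between query and action name.
    terms = set(query.lower().split())
    scored = []
    for svc, name in CATALOG:
        if service is not None and svc != service:
            continue
        overlap = len(terms & set(name.split(".")[1].split("_")))
        if overlap:
            scored.append((overlap, name))
    scored.sort(key=lambda s: s[0], reverse=True)  # stable, ties keep order
    return [name for _, name in scored[:limit]]


TASKS = [
    ("list events", "microsoft_graph", "microsoft_graph.list_events"),
    ("list events", "google_calendar", "google_calendar.list_events"),
]
print(hit_rate_at_k(TASKS, toy_search, k=1, scoped=True))
print(hit_rate_at_k(TASKS, toy_search, k=1, scoped=False))
```

With scoping the ranker only competes within one service, so identical queries resolve cleanly. Without it the two actions tie and one task misses at k=1.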
BM25 alone has a known ceiling. The published benchmarks show hybrid BM25-plus-semantic at 94% on Stacklok's harness and pure BM25 around 64% on Arcade.dev's. The harnesses differ, so the gap is indicative rather than exact. There is a clean path to a semantic-augmented retriever, and we have not taken it. The reason is honest: at 16,000 actions, BM25 is fast and predictable, and the cost of a semantic-embedding miss is harder to debug than a keyword miss. We will revisit when the catalog grows past ~50,000 entries or when we have ground-truth labels to train against.
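For readers wondering what a semantic-augmented retriever could look like, one common way to combine a keyword ranking with an embedding ranking is reciprocal rank fusion. We have not built this, and nothing here says how Stacklok or anyone else fuses results. The function below is a generic sketch with the conventional k=60 constant and made-up rankings.

```python
from typing import Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]],
                           k: int = 60, limit: int = 5) -> list:
    # Each ranking contributes 1 / (k + rank) for every item it contains.
    scores: dict = {}
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by name for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))[:limit]


bm25_ranking = ["a", "b", "c"]        # hypothetical keyword results
semantic_ranking = ["c", "a", "d"]    # hypothetical embedding results
print(reciprocal_rank_fusion([bm25_ranking, semantic_ranking]))
```

Fusion works on ranks rather than raw scores, so the two retrievers do not need comparable score scales.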
Swarm promotion of long-tail to first-class. If 200 customers all activate microsoft_graph.list_calendar_events, that action is no longer long-tail. It should be in the default tool set, with hand-curated description and parameter prose. We have the telemetry to detect this; we have not built the promotion pipeline. This is the next thing.
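The detection half is simple enough to sketch. The event shape (customer id, action name), the function name, and the use of 200 as a threshold are assumptions drawn from the example above, not a description of our telemetry schema.

```python
from collections import defaultdict
from typing import Iterable


def promotion_candidates(events: Iterable[tuple],
                         min_customers: int = 200) -> list:
    # Count distinct customers per action, not raw activations, so one
    # heavy user cannot promote an action on their own.
    customers = defaultdict(set)
    for customer_id, action in events:
        customers[action].add(customer_id)
    return sorted(action for action, ids in customers.items()
                  if len(ids) >= min_customers)


sample = [
    ("c1", "microsoft_graph.list_calendar_events"),
    ("c2", "microsoft_graph.list_calendar_events"),
    ("c1", "microsoft_graph.list_calendar_events"),
    ("c1", "bigquery.run_query"),
]
print(promotion_candidates(sample, min_customers=2))
```

The missing piece is everything after this list: review, hand-written descriptions, and moving the action into the default tool set.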
Honest scorecard
Six things we got right, two things we did not, ranked in the order we would defend if pressed:
- Per-service Beta toggle, default OFF. If we had shipped global-on, we would have spent the first week of May rolling it back for at least one customer.
- Risk tier at the action layer. Cheaper than per-call ACLs; more honest than "trust the model."
- Pairing the OpenAPI importer with dynamic discovery. The importer was already in the admin UI; what was missing was a way to use a full imported spec without overflowing tools/list. The combination is what unlocks the long-tail; neither half by itself would have been enough.
- BM25 over semantic, for now. We are giving up retrieval quality at the upper end in exchange for predictability. We think the trade is right at 16k actions; we are not sure it stays right at 50k.
- Reusing the existing pseudonymization + hooks substrate. One less place for PII to leak; one less policy surface to audit.
- Long-tail YAML lives in the same directory as curated YAML (actions.imported_longtail.yaml next to actions.yaml). The auto-import is a regular YAML file, reviewable by humans, diff-able in MRs.
- Did not ship a published retrieval benchmark. We should have. Mea culpa.
- Did not ship swarm-driven promotion. The data is there; the pipeline is not.
Where to try it
On a self-hosted instance: set DYNAMIC_DISCOVERY_ENABLED_SERVICES in your environment to a comma-separated list of service names. The default remains empty, so behavior does not change unless you opt in. Once a service is on, an admin can also flip the toggle per service from the connections dashboard. The two-minute install still applies.
If you want to compare side-by-side against Anthropic's first-party tool_search_tool — go ahead. They are not mutually exclusive: a Claude conversation can use Anthropic's deferred-loading for its own tools while talking to mcpgate's gateway_search_actions for everything else. The two complement each other; they do not compete.
If you find a query that retrieves badly, tell us. That is the part we are not done with yet.