Dynamic action discovery isn't novel. Doing it safely is.

Anthropic shipped Tool Search in November 2025. mcpproxy-go has BM25. Stacklok publishes 94% tool-selection accuracy. The pattern has converged across the MCP-gateway market. mcpgate just shipped its own, and the choices that matter are not the pattern itself: they are retrieval quality, the risk model, and how you roll it out without breaking working setups.

Six months ago, every MCP gateway shipped the same thing: a fixed list of tools, loaded into the model's context window on every turn. With 20 connected services and a read/write split, that is about 40 tools at 800 to 1,500 tokens each: somewhere between 32,000 and 60,000 tokens of tool definitions before the user has even typed a question.

Two things broke at that point. The first is accuracy. Anthropic's own advanced tool-use guidance recommends deferred loading whenever "tool definitions consume more than 10K tokens" or "10+ tools are available", and its internal evals show measurable gains with Tool Search enabled: Opus 4 from 49% to 74%, Opus 4.5 from 79.5% to 88.1%. The second is plain economics: every user turn re-ships every tool definition. A team running 10,000 conversations a day was paying for a lot of "Gmail send_email" prose nobody read.
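To make the economics concrete, here is a back-of-the-envelope sketch using the figures above. The turns-per-conversation value is an assumption made purely for illustration (the post does not give one), and the calculation ignores prompt caching, which would lower the real bill.

```python
# Figures from the text: 20 services, read/write split, 800 to 1,500 tokens per tool.
SERVICES = 20
TOOLS_PER_SERVICE = 2            # one read tool and one write tool per service
TOKENS_PER_TOOL = (800, 1500)    # low and high estimate
CONVERSATIONS_PER_DAY = 10_000

# ASSUMPTION (not from the post): average number of user turns per conversation.
ASSUMED_TURNS_PER_CONVERSATION = 5


def tokens_per_turn(tokens_per_tool):
    """Tool-definition tokens re-sent on every single turn."""
    return SERVICES * TOOLS_PER_SERVICE * tokens_per_tool


low, high = (tokens_per_turn(t) for t in TOKENS_PER_TOOL)
turns_per_day = CONVERSATIONS_PER_DAY * ASSUMED_TURNS_PER_CONVERSATION
daily_low, daily_high = low * turns_per_day, high * turns_per_day

print(f"per turn: {low:,} to {high:,} tokens")
print(f"per day:  {daily_low:,} to {daily_high:,} tokens")
```

Under these assumptions the tool definitions alone account for 1.6 to 3 billion input tokens a day, before any user content.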

The fix is obvious in hindsight. Don't ship every tool. Ship a search tool, and let the model pull definitions on demand.

Anthropic's Tool Search Tool shipped November 24, 2025. mcpproxy-go ships BM25 filtering. agentic-community/mcp-gateway-registry ships FAISS with sentence-transformer embeddings. Stacklok ToolHive ships hybrid BM25-plus-semantic and publishes 94% tool-selection accuracy on a 2,792-tool benchmark. Eight or nine other gateways shipped some version of it. The pattern is not the moat anymore.

Key takeaway

Dynamic action discovery — BM25 over a deferred tool catalog, activate-on-demand — is now a market-wide pattern, not a differentiator. The decisions that actually matter when shipping it: how good is the retrieval, what is the risk model for activating a write or destructive action, and how do you turn it on without breaking the setups that already work. This post is mcpgate's choices and the tradeoffs we are still living with.

What converged across the market

For an honest comparison, here is what the awesome-mcp-gateways list and Anthropic's own first-party API both look like as of May 2026:

| Implementer | Mechanism | Surface |
| --- | --- | --- |
| Anthropic (first-party API) | tool_search_tool + defer_loading: true per tool, BM25 or regex | Any tool definition you ship to the API |
| Claude Code | Built-in tools deferred behind ToolSearch since early 2026 | Default system-tool context cut substantially; mechanism is the same as the API-level Tool Search |
| mcpproxy-go | BM25 over indexed tool descriptions | Whatever MCP servers you point it at |
| agentic-community/mcp-gateway-registry | FAISS + sentence-transformers | Registry of registered MCP servers |
| Stacklok ToolHive Optimizer | Hybrid BM25 + semantic, published 94% top-K @2,800 tools | Indexed MCP-tool surface |
| Composio Tool Router | Intent-analysis filter, 20,000 tools / 1,000 apps | Composio's catalog |
| Smithery Toolbox | Registry-driven smithery tool find | 6,000+ registered servers |
| Docker MCP Toolkit | mcp-find + mcp-add meta-tools | Server-level enablement |
| IBM ContextForge | Plugin-driven admin enable/disable | No model-facing search; admin curation |
| OBot | Federates tools/list across upstream servers | No own catalog, no search-then-activate |

If "everyone shipped it," the interesting question is: does it work? Two sets of published numbers, from two different harnesses:

| Implementation | Score | Harness |
| --- | --- | --- |
| Anthropic regex mode | 56% top-K retrieval | Arcade.dev, 4,027 tools, 25 tasks |
| Anthropic BM25 mode | 64% top-K retrieval | Arcade.dev, 4,027 tools, 25 tasks |
| Stacklok ToolHive (hybrid) | 94% selection / 98% retrieval | Stacklok's own, 2,792 tools |
| Anthropic Tool Search (on Stacklok's harness) | 34% selection / 48% retrieval | Stacklok's own, 2,792 tools |

Two different harnesses, two different methodologies, two ways of saying the same thing: there is a wide gap between the first-party Anthropic implementation and the best independent retriever. Retrieval quality is the real moat, not the pattern. Anyone telling you "we ship dynamic discovery" without numbers is making a claim about the easy part. The hard part is the indexing, the ranking, and the off-the-happy-path behavior — typo tolerance, synonym handling, action-vs-resource disambiguation, write/read disambiguation. Stacklok publishes its harness; Arcade publishes its harness; we should too.

What mcpgate just shipped

We rolled out dynamic action discovery in May 2026. The mechanism is BM25 over a long-tail action catalog generated from OpenAPI specs. mcpgate has shipped an admin-UI OpenAPI importer for a while — point it at a Microsoft Graph or BigQuery spec and the YAML appears. The catch, until now, was that every imported endpoint went into the default tool list. Importing the full Microsoft Graph spec meant a tools/list payload no MCP client could reason about. So in practice, admins imported the handful of endpoints they knew they needed and stopped — leaving most of the surface unreachable not because the YAML did not exist, but because surfacing it cost more context than it bought.

Dynamic discovery is what makes the full-spec import practical. The long-tail goes into a search index instead of the default tool list, and the model pulls definitions on demand. The catalog currently sits at more than 16,000 long-tail actions across the connected services — almost all of which would have been unusable without the search-then-activate layer on top.

The model accesses them via one meta-tool: gateway_search_actions(query, service?, limit). It returns BM25-ranked candidates with their service, HTTP method, endpoint, risk annotation, and whether they are currently active. Activation happens implicitly on the next call.
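As a rough illustration of what a search call like this does, here is a minimal, self-contained sketch of BM25 ranking over a tiny catalog. It is not mcpgate's implementation: the `Action` fields mirror the result shape described above, but the class names, the endpoints, the descriptions, and the `k1`/`b` defaults are all assumptions chosen for the example.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    name: str
    service: str
    method: str
    endpoint: str
    description: str
    risk: str
    active: bool = False


def tokenize(text):
    """Lowercase and split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())


class ActionIndex:
    """In-memory Okapi BM25 index over action names, endpoints and descriptions."""

    def __init__(self, actions, k1=1.5, b=0.75):
        self.actions = list(actions)
        self.k1, self.b = k1, b
        self.docs = [
            Counter(tokenize(f"{a.name} {a.endpoint} {a.description}"))
            for a in self.actions
        ]
        self.lengths = [sum(doc.values()) for doc in self.docs]
        self.avg_len = sum(self.lengths) / max(len(self.lengths), 1)
        doc_freq = Counter()
        for doc in self.docs:
            doc_freq.update(doc.keys())
        n = len(self.docs)
        # The +1 inside the log keeps IDF positive even for very common terms.
        self.idf = {
            term: math.log(1 + (n - df + 0.5) / (df + 0.5))
            for term, df in doc_freq.items()
        }

    def _score(self, i, terms):
        doc, length = self.docs[i], self.lengths[i]
        score = 0.0
        for term in terms:
            tf = doc.get(term, 0)
            if tf == 0:
                continue
            norm = tf + self.k1 * (1 - self.b + self.b * length / self.avg_len)
            score += self.idf[term] * tf * (self.k1 + 1) / norm
        return score

    def search(self, query, service=None, limit=5):
        """Return BM25-ranked candidates, optionally restricted to one service."""
        terms = tokenize(query)
        hits = []
        for i, action in enumerate(self.actions):
            if service is not None and action.service != service:
                continue
            score = self._score(i, terms)
            if score > 0:
                hits.append((score, action))
        hits.sort(key=lambda hit: hit[0], reverse=True)
        return [
            {"name": a.name, "service": a.service, "method": a.method,
             "endpoint": a.endpoint, "risk": a.risk, "active": a.active,
             "score": round(s, 3)}
            for s, a in hits[:limit]
        ]


# Hypothetical catalog entries; endpoints and descriptions are made up.
CATALOG = [
    Action("slack.users_info", "slack", "GET", "/users.info",
           "Get information about a user", "read"),
    Action("slack.channels_archive", "slack", "POST", "/conversations.archive",
           "Archive a channel", "write"),
    Action("gitlab.issues_post", "gitlab", "POST", "/projects/{id}/issues",
           "Create a new issue in a project", "write"),
    Action("notion.pages_delete", "notion", "DELETE", "/pages/{id}",
           "Delete a page", "destructive"),
]

index = ActionIndex(CATALOG)
results = index.search("archive channel", limit=3)
```

The optional `service` argument filters candidates before ranking, which is one plausible reading of the `service?` parameter. A production index would precompute postings rather than scoring every document per query.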

That part is unremarkable. The choices below are not.

[Figure: two-tier tool surface. Default tools/list, shipped to the model every turn: baseline action tools per connected service plus the meta-tool (about 10 entries, about 10K tokens), for example jira_read_actions, gitlab_write_actions, slack_read_actions, and gateway_search_actions, which the model calls when it needs more than the baseline can cover. Long-tail catalog, indexed and kept out of context until searched: OpenAPI-imported endpoints across all connected services, searched via BM25, ranked, and risk-tagged, for example bigquery.tables_get, slack.users_info, gitlab.issues_post, notion.pages_delete, microsoft.events_list, grafana.alerts_post, wordpress.posts_get, slack.channels_archive, plus 16,000 more one BM25 query away. Legend: read = autonomous; write = audit-logged; destructive = requires confirmed=true.]

Default OFF, per-service Beta toggle

Long-tail discovery is opt-in per service, off by default, controlled by an admin toggle on the connections page. We considered three rollout strategies:

  1. Global on. Every service exposes its long-tail catalog. Maximum capability, maximum risk of an unexpected action firing on a setup that has been working for months. Rejected.
  2. Global off, single global switch. Easy to reason about, but it forces a team that wants Microsoft Graph long-tail to also turn on BigQuery long-tail. Rejected.
  3. Per-service toggle, default OFF. The team that wants Graph access can flip it for one service. The team that does not is byte-identical to last week's behavior. This is what shipped.

The cost is one more switch on the connections page. The benefit is that the rollout cannot break anyone — and admins can A/B their own setup by flipping it back. The same setting is exposed as an environment variable (DYNAMIC_DISCOVERY_ENABLED_SERVICES) for IaC-managed deployments.
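For IaC-managed deployments, reading that variable might look roughly like the sketch below. Only the variable name and the empty default come from this post; the comma-separated format is the one described in the install notes further down, and the whitespace trimming, lowercasing, and example service names are assumptions.

```python
import os

ENV_VAR = "DYNAMIC_DISCOVERY_ENABLED_SERVICES"


def enabled_services(environ=None):
    """Parse the comma-separated opt-in list; an unset or empty value enables nothing."""
    environ = os.environ if environ is None else environ
    raw = environ.get(ENV_VAR, "")
    # ASSUMPTION: names are trimmed and compared case-insensitively.
    return frozenset(part.strip().lower() for part in raw.split(",") if part.strip())


def long_tail_enabled(service, environ=None):
    """True only for services an admin has explicitly opted in."""
    return service.lower() in enabled_services(environ)


# Hypothetical service names, for illustration only.
example_env = {ENV_VAR: "microsoft_graph, bigquery"}
print(long_tail_enabled("microsoft_graph", example_env))
print(long_tail_enabled("slack", example_env))
```

Treating a missing variable the same as an empty one is what keeps the default path identical to the pre-rollout behavior.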

Risk annotation lives at the action layer, not the tool layer

Anthropic's defer_loading is binary — a tool is either in the context or it isn't. There is no built-in concept of "this tool reads, this tool writes, this tool deletes." If you want differentiated handling, you build it on top.

mcpgate's YAML actions already carry method-level metadata: every action declares its HTTP method, its endpoint, whether it accepts a body. We annotate each long-tail action with a risk tier derived from that metadata plus a path-keyword scan:

  • read — GET endpoints, no body. Lowest risk. The model can activate and call autonomously.
  • write — POST/PUT/PATCH. Carries a body, mutates state. Activation triggers an audit-log line.
  • destructive — DELETE endpoints, or POSTs whose path matches a keyword list (/delete, /revoke, /terminate, /wipe). Activation requires the model to pass confirmed=true.
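The three tiers above can be sketched as a small classifier. The keyword list and the tier rules are taken from the list; the function name, the `has_body` parameter, and the conservative fallback for anything the rules do not cover are assumptions.

```python
# Keyword list from the text; matched as substrings of the lowercased path.
DESTRUCTIVE_PATH_KEYWORDS = ("/delete", "/revoke", "/terminate", "/wipe")


def risk_tier(method, endpoint, has_body=False):
    """Derive a risk tier from the HTTP method plus a path-keyword scan."""
    method = method.upper()
    path = endpoint.lower()
    if method == "DELETE":
        return "destructive"
    if method == "POST" and any(k in path for k in DESTRUCTIVE_PATH_KEYWORDS):
        return "destructive"
    if method in {"POST", "PUT", "PATCH"}:
        return "write"
    if method == "GET" and not has_body:
        return "read"
    # ASSUMPTION: anything else (GET with a body, unknown verbs) is treated
    # as a write rather than silently granted the lowest tier.
    return "write"


print(risk_tier("GET", "/users/{id}"))
print(risk_tier("POST", "/projects/{id}/issues", has_body=True))
print(risk_tier("POST", "/tokens/revoke", has_body=True))
print(risk_tier("DELETE", "/pages/{id}"))
```

Plain substring matching is deliberately blunt: a path such as `/deleted-items` would also trip the `/delete` keyword, which errs on the side of asking for confirmation.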

The shape echoes SEP-1888, the MCP spec draft for progressive disclosure of typed library operations, which includes risk-level annotations as part of the proposed operation index. SEP-1888 has no sponsor yet and may never land — but if it does, we are already structurally compatible.

This matters more than it first looks. Stacklok's 94% tool-selection number is impressive, but "retrieve correctly" is not the same as "execute safely." A model that retrieves delete_user with high relevance is still a model that just retrieved delete_user. The risk tier lets us treat retrieval and execution as separate problems.
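A minimal sketch of that separation: retrieval hands back a name and a tier, and a separate gate decides whether activation goes through. The exception type, logger name, and function signature are hypothetical, and writing an audit line for destructive actions as well as writes is an assumption.

```python
import logging

audit_log = logging.getLogger("gateway.audit")  # hypothetical logger name


class ConfirmationRequired(Exception):
    """Raised when a destructive action is activated without confirmed=True."""


def activate(action_name, risk, active_actions, confirmed=False):
    """Gate activation on the risk tier, independently of how the action was retrieved."""
    if risk == "destructive" and not confirmed:
        raise ConfirmationRequired(
            f"{action_name} is destructive; call again with confirmed=true"
        )
    if risk in ("write", "destructive"):
        # ASSUMPTION: destructive activations are audited the same way as writes.
        audit_log.info("activated action=%s risk=%s", action_name, risk)
    active_actions.add(action_name)
    return True


active = set()
activate("slack.users_info", "read", active)
try:
    activate("notion.pages_delete", "destructive", active)
except ConfirmationRequired as exc:
    print(exc)
activate("notion.pages_delete", "destructive", active, confirmed=True)
```

The retriever can be arbitrarily good or bad at ranking `notion.pages_delete`; the gate behaves the same either way.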

Pseudonymization still applies

This is the part we cannot bolt onto someone else's defer_loading. The two pillars mcpgate built first — PII pseudonymization with rehydration, and two-layer policy hooks — apply to long-tail actions the same way they apply to hand-written ones. An auto-imported Microsoft Graph endpoint goes through the same pseudonymize-on-the-way-out, rehydrate-on-the-way-back pipeline. Hooks see the call and can deny it before the gateway dispatches.
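The sketch below shows one way such a pipeline could wrap a long-tail call. It makes several assumptions that are not mcpgate's documented behavior: only email addresses are treated as PII, placeholders look like `<EMAIL_1>`, hooks see the arguments as the model sent them, upstream responses are pseudonymized before they reach the model, and placeholders in the model's arguments are rehydrated just before dispatch. Every name here is hypothetical.

```python
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


class Pseudonymizer:
    """Swap email addresses for stable placeholders and back again."""

    def __init__(self):
        self._token_by_value = {}
        self._value_by_token = {}

    def pseudonymize(self, text):
        def swap(match):
            value = match.group(0)
            if value not in self._token_by_value:
                token = f"<EMAIL_{len(self._token_by_value) + 1}>"
                self._token_by_value[value] = token
                self._value_by_token[token] = value
            return self._token_by_value[value]

        return EMAIL.sub(swap, text)

    def rehydrate(self, text):
        for token, value in self._value_by_token.items():
            text = text.replace(token, value)
        return text


def dispatch(action, args, hooks, pseudo, upstream):
    """Run policy hooks, rehydrate arguments, call upstream, pseudonymize the reply."""
    for hook in hooks:
        if not hook(action, args):
            raise PermissionError(f"{action} denied by policy hook")
    real_args = {
        key: pseudo.rehydrate(value) if isinstance(value, str) else value
        for key, value in args.items()
    }
    return pseudo.pseudonymize(upstream(action, real_args))


def fake_upstream(action, args):
    # Stand-in for the real HTTP call to the upstream service.
    return f"{action}: delivered to {args['to']}"


pseudo = Pseudonymizer()
masked = pseudo.pseudonymize("Reply to alice@example.com today")
reply = dispatch("graph.send_mail", {"to": "<EMAIL_1>"}, [], pseudo, fake_upstream)
```

Because `dispatch` does not care whether the action was hand-written or auto-imported, a new long-tail endpoint does not add a new code path where PII handling could be skipped.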

None of this is exclusive to mcpgate — any gateway can choose to build dynamic discovery on top of an existing PII/policy substrate. But several gateways shipped discovery before the substrate, and retrofitting it after the fact tends to leak: every new endpoint becomes a new place where PII might escape unredacted. We shipped it the other way around, and we are glad we did.

What is still open

Three things we have not solved yet.

We have not published our own retrieval benchmark. Stacklok did, on its own harness, and Arcade.dev did on theirs. We should. The reason we have not is that those harnesses assume a flat tool surface; our catalog has service scoping, which changes the recall denominator. The right answer is to publish numbers in both modes, scoped and unscoped, and we owe that work.
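To show why the two modes can diverge, here is a toy recall@k computation. It takes one possible reading of "scoped" (the search is restricted to the task's expected service) and uses a deliberately naive token-overlap ranker in place of BM25; the catalog, tasks, and function names are invented for the example and say nothing about real benchmark results.

```python
import re

# Hypothetical catalog with two near-identical actions in different services.
TOY_CATALOG = ["gitlab.issues_list", "jira.issues_list", "jira.issues_delete"]


def _tokens(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def toy_search(query, service, limit):
    """Rank by token overlap; ties keep catalog order. Stand-in for a real retriever."""
    scored = []
    for name in TOY_CATALOG:
        if service is not None and name.split(".", 1)[0] != service:
            continue
        overlap = len(_tokens(query) & _tokens(name))
        if overlap:
            scored.append((overlap, name))
    scored.sort(key=lambda pair: pair[0], reverse=True)  # stable sort
    return [name for _, name in scored[:limit]]


def recall_at_k(tasks, search, k, scoped):
    """Fraction of tasks whose expected action appears in the top k results."""
    if not tasks:
        return 0.0
    hits = 0
    for task in tasks:
        service = task["service"] if scoped else None
        if task["expected"] in search(task["query"], service, k):
            hits += 1
    return hits / len(tasks)


TASKS = [
    {"query": "list issues", "service": "gitlab", "expected": "gitlab.issues_list"},
    {"query": "list issues", "service": "jira", "expected": "jira.issues_list"},
]

unscoped = recall_at_k(TASKS, toy_search, k=1, scoped=False)
scoped = recall_at_k(TASKS, toy_search, k=1, scoped=True)
print(f"recall@1 unscoped={unscoped:.2f} scoped={scoped:.2f}")
```

In this toy setup the same retriever scores 0.50 unscoped and 1.00 scoped at k=1, which is why a single headline number would be misleading for a service-scoped catalog.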

BM25 alone has a known ceiling. The published numbers point that way: Stacklok's hybrid BM25-plus-semantic retriever scores 94% on its own harness, while Anthropic's pure BM25 mode scores around 64% on Arcade.dev's. The harnesses differ, so the gap is indicative rather than exact. There is a clean path to a semantic-augmented retriever, and we have not taken it. The honest reason: at 16,000 actions, BM25 is fast and predictable, and the cost of a semantic-embedding miss is harder to debug than a keyword miss. We will revisit when the catalog grows past ~50,000 entries or when we have ground-truth labels to train against.

Swarm promotion of long-tail to first-class. If 200 customers all activate microsoft_graph.list_calendar_events, that action is no longer long-tail. It should be in the default tool set, with hand-curated description and parameter prose. We have the telemetry to detect this; we have not built the promotion pipeline. This is the next thing.
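The detection half is the easy part, and might look something like the sketch below. The event shape (customer id, action name) and the function name are assumptions, and the threshold of 200 simply echoes the example in the paragraph above rather than any tuned value.

```python
from collections import defaultdict


def promotion_candidates(activation_events, min_customers=200):
    """Return actions activated by at least min_customers distinct customers.

    activation_events is an iterable of (customer_id, action_name) pairs.
    Results are ordered by distinct-customer count, highest first.
    """
    customers_by_action = defaultdict(set)
    for customer_id, action_name in activation_events:
        customers_by_action[action_name].add(customer_id)
    ranked = sorted(
        (
            (len(customers), action)
            for action, customers in customers_by_action.items()
            if len(customers) >= min_customers
        ),
        reverse=True,
    )
    return [action for _, action in ranked]


# Tiny illustrative log; a low threshold is used so the example is readable.
events = [
    ("cust-1", "microsoft_graph.list_calendar_events"),
    ("cust-2", "microsoft_graph.list_calendar_events"),
    ("cust-1", "microsoft_graph.list_calendar_events"),  # repeat, counted once
    ("cust-1", "notion.pages_delete"),
]
print(promotion_candidates(events, min_customers=2))
```

Counting distinct customers rather than raw activations keeps one heavy user from promoting an action on their own. The missing pipeline is everything after this list: curated descriptions, parameter prose, and review.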

Honest scorecard

Six things we got right, two things we did not, ranked in the order we would defend if pressed:

  1. Per-service Beta toggle, default OFF. If we had shipped global-on, we would have spent the first week of May rolling it back for at least one customer.
  2. Risk tier at the action layer. Cheaper than per-call ACLs; more honest than "trust the model."
  3. Pairing the OpenAPI importer with dynamic discovery. The importer was already in the admin UI; what was missing was a way to use a full imported spec without overflowing tools/list. The combination is what unlocks the long-tail — neither half by itself would have been enough.
  4. BM25 over semantic, for now. We are giving up retrieval quality at the upper end in exchange for predictability. We think the trade is right at 16k actions; we are not sure it stays right at 50k.
  5. Reusing the existing pseudonymization + hooks substrate. One less place for PII to leak; one less policy surface to audit.
  6. Long-tail YAML lives in the same directory as curated YAML (actions.imported_longtail.yaml next to actions.yaml). The auto-import is a regular YAML file, reviewable by humans, diff-able in MRs.
  7. Did not ship a published retrieval benchmark. We should have. Mea culpa.
  8. Did not ship swarm-driven promotion. The data is there; the pipeline is not.

Where to try it

On a self-hosted instance: set DYNAMIC_DISCOVERY_ENABLED_SERVICES in your environment to a comma-separated list of service names. The default remains empty — no behaviour change unless you opt in. Once a service is on, an admin can also flip the toggle per-service from the connections dashboard. The two-minute install still applies.

If you want to compare side-by-side against Anthropic's first-party tool_search_tool — go ahead. They are not mutually exclusive: a Claude conversation can use Anthropic's deferred-loading for its own tools while talking to mcpgate's gateway_search_actions for everything else. The two complement each other; they do not compete.

If you find a query that retrieves badly, tell us. That is the part we are not done with yet.