The shift: for years we built open data for humans to read. Increasingly the consumer is an AI agent. So we need open-data systems designed for agents, not just people.
Four ways in: CLIs, APIs, Skills, and MCP. Today: how to actually implement MCP servers over open data.
Same goal, different contracts. Each layer answers "who handles auth, routing, and reasoning?"
CLIs: the agent's native surface. Run a command, pipe it, chain it. In 2026 the terminal became the place where agents retrieve and act. Examples: gh, aws, ckanapi.
APIs: raw reach. Maximum surface, zero opinion. The agent must know the endpoints.
Skills: packaged know-how, meaning instructions + scripts the client runs. A great front door over messy systems.
MCP: a running server exposing tools over a protocol. Where you put shared, agentic logic.
Today's focus: MCP servers — and when to lean on CLIs & Skills instead.
They live on different sides of the wire.
Skill: a folder of instructions + code the model loads on demand. Zero infra to run. Perfect for wrapping a CLI or a few API calls behind plain-language steps. Lives with the agent.
MCP server: a process that advertises typed tools, resources, and prompts over a standard protocol. Any compliant client can connect. Put shared, stateful, or heavy logic here, once, for everyone.
Rule of thumb: personal & lightweight → Skill. Shared service or agentic backend → MCP.
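skill → shells out to an authed CLI
# The CLI already did the OAuth dance,
# stores & refreshes the token for us.
import json
import subprocess

PORTAL = "https://demo.ckan.org"  # assumption: swap in your portal's base URL

def search_open_data(query):
    out = subprocess.run(
        ["ckanapi", "action", "package_search",
         f"q={query}", "--remote", PORTAL],
        capture_output=True, text=True,
        check=True,  # raise if the CLI exits non-zero
    )
    # ckanapi prints the action result itself, so the hits sit under "results"
    return json.loads(out.stdout)["results"]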
CLIs are the cheapest auth boundary you'll ever get. Wrap, don't re-implement.
FastMCP turns a typed Python function into a discoverable tool. The signature is the schema.
server.py
from fastmcp import FastMCP

mcp = FastMCP(name="OpenDataServer")

@mcp.tool
def find_datasets(query: str, limit: int = 5) -> list[dict]:
    """Search the open-data catalogue for datasets."""
    # `catalogue` is the portal's search backend, defined elsewhere
    hits = catalogue.search(query, k=limit)
    return [d.summary() for d in hits]

if __name__ == "__main__":
    mcp.run()
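To make "the signature is the schema" concrete, here is a minimal stdlib-only sketch of the idea. It is not FastMCP's implementation; `schema_from_signature` and the small type mapping are hypothetical, and unknown annotations simply fall back to "string".

```python
import inspect
from typing import get_type_hints

# Minimal mapping from Python annotations to JSON Schema type names.
_JSON_TYPES = {str: "string", int: "integer", float: "number", bool: "boolean"}

def schema_from_signature(fn):
    """Derive a JSON-Schema-style tool description from a typed function."""
    hints = get_type_hints(fn)
    properties, required = {}, []
    for name, param in inspect.signature(fn).parameters.items():
        # Unknown annotations fall back to "string" in this sketch.
        prop = {"type": _JSON_TYPES.get(hints.get(name), "string")}
        if param.default is inspect.Parameter.empty:
            required.append(name)  # no default, so the caller must supply it
        else:
            prop["default"] = param.default
        properties[name] = prop
    return {
        "name": fn.__name__,
        "description": inspect.getdoc(fn) or "",
        "inputSchema": {
            "type": "object",
            "properties": properties,
            "required": required,
        },
    }

def find_datasets(query: str, limit: int = 5) -> list[dict]:
    """Search the open-data catalogue for datasets."""
    return []  # body is irrelevant here; only the signature is inspected

schema = schema_from_signature(find_datasets)
```

The type hints, defaults, and docstring are all the information a client needs to call the tool, which is why a well-typed function needs no separate schema file.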
WASH AI wraps its existing knowledge-retrieval tools and serves the Sanihub tenant over one MCP endpoint.
app/api/[transport]/route.ts
import { createMcpHandler } from 'mcp-handler';
import { knowledgeSearch } from '@/lib/ai/tools/knowledge-search';

// existing AI SDK tool, scoped + opinionated
const sanihub = knowledgeSearch({
  tenantId: 'sanihub',
  allowedTenants: ['sanihub'], // tenant gating
  enableReranking: true,       // server-side rerank
});

const handler = createMcpHandler((server) => {
  server.registerTool(
    'knowledgeSearch',
    {
      description: sanihub.description,
      inputSchema: sanihub.inputSchema.shape,
    },
    async (args) => ({
      content: [{ type: 'text', text: await sanihub.execute(args) }],
    }),
  );
});

export { handler as GET, handler as POST };
any AI SDK client connects
import { experimental_createMCPClient as createMCPClient, generateText } from 'ai';

const mcp = await createMCPClient({
  transport: {
    type: 'sse',
    url: 'https://washai.org/api/mcp',
    headers: { Authorization: `Bearer ${tok}` },
  },
});
const tools = await mcp.tools(); // MCP → AI SDK
await generateText({ model, tools, prompt });
Auto-generate 50–200 tools and the JSON schemas eat the window before the model reasons.
one tool, expanded into context
{
  "name": "get_dataset_resource_view_v3",
  "description": "Return a resource view ...",
  "inputSchema": {
    "type": "object",
    "properties": {
      "dataset_id": {"type": "string", "description": "..."},
      "resource_id": {"type": "string", "description": "..."},
      "view_id": {"type": "string", "description": "..."},
      "include_draft": {"type": "boolean"},
      "format": {"enum": ["json", "csv", "xml"]}
    },
    "required": ["dataset_id", "resource_id"]
  }
}
# × 200 tools …
Context window used by tool schemas:
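A rough way to size the problem, as a sketch: serialize one tool schema and multiply. The 4-characters-per-token ratio is only a common rule of thumb, not a real tokenizer, and `estimate_tokens` is a hypothetical helper.

```python
import json

# One auto-generated tool, shaped like the example above
# (descriptions left as "..." as in the original).
tool = {
    "name": "get_dataset_resource_view_v3",
    "description": "Return a resource view ...",
    "inputSchema": {
        "type": "object",
        "properties": {
            "dataset_id": {"type": "string", "description": "..."},
            "resource_id": {"type": "string", "description": "..."},
            "view_id": {"type": "string", "description": "..."},
            "include_draft": {"type": "boolean"},
            "format": {"enum": ["json", "csv", "xml"]},
        },
        "required": ["dataset_id", "resource_id"],
    },
}

def estimate_tokens(obj, chars_per_token=4):
    """Very rough token estimate: serialized length / assumed chars per token."""
    return len(json.dumps(obj)) // chars_per_token

per_tool = estimate_tokens(tool)
total_for_200 = per_tool * 200  # cost paid before the model reads the user's question
print(f"~{per_tool} tokens per tool, ~{total_for_200} tokens for 200 tools")
```

Real descriptions are much longer than "...", so treat the result as a lower bound.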
The agent searches for capability on demand; only matched schemas enter context.
the only tool the client sees up front
@mcp.tool
def search_tools(need: str) -> list[ToolCard]:
    """Find the right tool for a task, by intent."""
    return registry.rank(need)  # keyword + embeddings

# agent: search_tools("query a CSV resource")
# → returns 2–3 candidates, full schema on demand
Context window used by tool schemas:
FastMCP's tool search / code mode ships this pattern out of the box.
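For intuition, a toy version of the registry behind such a search tool might look like the sketch below. It ranks by keyword overlap only (no embeddings), and `ToolCard`, `Registry`, and the three example tools are hypothetical stand-ins rather than FastMCP's built-in feature.

```python
from dataclasses import dataclass

@dataclass
class ToolCard:
    name: str
    description: str

class Registry:
    """Keyword-overlap tool search; a stand-in for keyword + embedding ranking."""

    def __init__(self, cards):
        self.cards = list(cards)

    def rank(self, need, k=3):
        want = set(need.lower().split())
        scored = []
        for card in self.cards:
            text = f"{card.name} {card.description}".lower().replace("_", " ")
            overlap = len(want & set(text.split()))
            if overlap:  # drop tools with no shared words at all
                scored.append((overlap, card))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [card for _, card in scored[:k]]

registry = Registry([
    ToolCard("query_csv_resource", "Run a filtered query against a CSV resource"),
    ToolCard("list_organizations", "List publishing organizations"),
    ToolCard("get_dataset", "Fetch metadata for one dataset"),
])

matches = registry.rank("query a CSV resource")
```

Only the cards in `matches` would have their full schemas sent to the client; the rest never enter the context window.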
any REST API → MCP server
import httpx
from fastmcp import FastMCP

client = httpx.AsyncClient(base_url="https://api.example.com")
spec = httpx.get("https://api.example.com/openapi.json").json()

mcp = FastMCP.from_openapi(
    openapi_spec=spec,
    client=client,  # auth lives on the client
    name="CatalogueAPI",
)

mcp.run()
CKAN powers data.gov, open.canada.ca, dati.gov.it. Socrata powers data.calgary.ca and many city portals. Wrap the standard once, every portal gets the same agent interface.
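Because every CKAN portal exposes the same Action API path, one helper can serve them all. A minimal sketch, assuming the standard `/api/3/action/package_search` route; the portal base URLs below are placeholders, since real portals may mount CKAN under a different host or prefix.

```python
from urllib.parse import urlencode

def package_search_url(portal, query, rows=5):
    """Build a CKAN package_search URL for any portal base URL."""
    params = urlencode({"q": query, "rows": rows})
    return f"{portal.rstrip('/')}/api/3/action/package_search?{params}"

# Hypothetical base URLs; substitute the CKAN root of each real portal.
PORTALS = ["https://portal-a.example.org", "https://portal-b.example.org/data"]

urls = [package_search_url(p, "water quality", rows=2) for p in PORTALS]
```

The same function, unchanged, produces a valid search request for each portal, which is the point of wrapping the standard rather than the site.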
A whole portal behind a handful of intent tools. Search to find, inspect the schema, then query.
SoQL over the live portal. The agent never sees 200 schemas, it discovers them.
Humanitarian & development funding, queryable in plain language. Hosted, with OAuth.
Calgary leans generic (SoQL passes through); IATI leans opinionated (aid concepts as tools).
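The generic end of that spectrum can be sketched as a thin pass-through that only assembles a SoQL request. This assumes Socrata's usual `/resource/<id>.json` endpoint with `$select`, `$where`, and `$limit` parameters; the dataset id `abcd-1234` and the `MAX_ROWS` cap are made up for illustration.

```python
from urllib.parse import urlencode

MAX_ROWS = 1000  # assumed server-side cap, not a Socrata default

def soql_url(domain, dataset_id, select=None, where=None, limit=100):
    """Assemble a SoQL query URL; the agent supplies the clauses."""
    params = {}
    if select:
        params["$select"] = select
    if where:
        params["$where"] = where
    params["$limit"] = min(limit, MAX_ROWS)  # never trust the requested size
    # safe="$" keeps the SoQL parameter names readable in the URL
    return f"https://{domain}/resource/{dataset_id}.json?" + urlencode(params, safe="$")

url = soql_url("data.calgary.ca", "abcd-1234", where="year = 2024", limit=10)
```

An opinionated server would hide `where` entirely and expose domain concepts instead, at the cost of writing that mapping yourself.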
Mirror an API 1:1 and the agent must know which of 200 endpoints to call, in what order, with what params.
Fine for prototypes. Fragile when the client is small or the API is large & noisy.
one intent-shaped tool, heavy lifting hidden
@mcp.tool
def find_knowledge(question: str) -> list[Passage]:
    """Answer-ready evidence for a question."""
    hits = hybrid_search(question)  # BM25 + vectors
    ranked = cross_encoder_rerank(question, hits)
    ranked = dedupe(ranked)
    return [trim(p) for p in ranked[:5]]  # small, clean
The agentic work moves server-side once — every client inherits it.
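The retrieval and reranking steps depend on your search stack, but the clean-up at the end is easy to sketch. Below is one possible shape for `dedupe` and `trim`, assuming passages are plain strings already in ranked order; `postprocess` is a hypothetical wrapper.

```python
def dedupe(passages):
    """Drop passages that match an earlier one after case/whitespace normalisation."""
    seen, unique = set(), []
    for text in passages:
        key = " ".join(text.lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(text)  # keep the first (highest-ranked) copy
    return unique

def trim(text, max_chars=200):
    """Cut long passages at a word boundary and mark the cut."""
    if len(text) <= max_chars:
        return text
    return text[:max_chars].rsplit(" ", 1)[0] + " ..."

def postprocess(ranked, k=5):
    """Deduplicate first, then keep the top k, so duplicates do not waste slots."""
    return [trim(p) for p in dedupe(ranked)[:k]]

result = postprocess(["Boil  water first.", "boil water first.", "Use a filter."])
```

Deduplicating before slicing matters: slicing first could return fewer than `k` distinct passages even when more are available.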
Generic (1:1 mirror). Choose when: broad coverage fast, capable client, internal tools, exploration.
Cost: needs a smart client, schema bloat, brittle chaining.
Opinionated (intent tools). Choose when: shared product surface, thin or many clients, quality & safety matter.
Cost: you design & maintain the logic yourself.
Most real servers mix both: tool search for breadth + a few intent tools for the hot paths.
An open endpoint is still your endpoint. Decide who gets in, how often, and who pays for the compute.
IATI: user's API key encrypted into a stateless JWT (OAuth 2.1), no server storage. WASH AI: validateTenantAccess blocks private tenants, public data flows freely.
Open data invites scraping. Rate-limit per token, paginate with hard caps, and separate public from private scopes so a noisy client can't drain the portal.
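One common way to rate-limit per token is a token bucket keyed by the caller's API token. The sketch below is in-memory and single-process only; `TokenBucket`, `clamp_page_size`, and the chosen limits are illustrative assumptions, not taken from any of the servers mentioned here.

```python
import time

class TokenBucket:
    """Per-caller token bucket: `burst` requests at once, refilled at `rate_per_sec`."""

    def __init__(self, rate_per_sec, burst, clock=time.monotonic):
        self.rate, self.burst, self.clock = rate_per_sec, burst, clock
        self.buckets = {}  # api_token -> (available, last_seen)

    def allow(self, api_token):
        now = self.clock()
        available, last = self.buckets.get(api_token, (self.burst, now))
        # Refill for the time elapsed since the last call, up to the burst size.
        available = min(self.burst, available + (now - last) * self.rate)
        if available < 1:
            self.buckets[api_token] = (available, now)
            return False
        self.buckets[api_token] = (available - 1, now)
        return True

def clamp_page_size(requested, hard_cap=100):
    """Hard pagination cap: ignore oversized page requests instead of failing."""
    return max(1, min(requested, hard_cap))

limiter = TokenBucket(rate_per_sec=1, burst=2)
if limiter.allow("token-abc"):
    page_size = clamp_page_size(5000)
```

Because buckets are keyed by token, one noisy client exhausts only its own allowance. A multi-instance deployment would need shared state instead of the in-process dict.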
Reranking, embeddings, research search: every call burns tokens + latency server-side. Budget & meter per call / per tenant, or one agent loop bills you all day.
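Metering can start as a simple per-tenant ledger checked before each expensive step. A minimal sketch, assuming abstract cost units; `Meter`, `BudgetExceeded`, and the `COSTS` table are hypothetical, and a real deployment would persist the ledger and reset it per billing period.

```python
from collections import defaultdict

class BudgetExceeded(Exception):
    """Raised when a tenant would go over its allowance."""

class Meter:
    def __init__(self, budget_per_tenant):
        self.budget = budget_per_tenant
        self.spent = defaultdict(int)  # tenant -> units used so far

    def charge(self, tenant, cost):
        """Record the cost, or refuse before the expensive work happens."""
        if self.spent[tenant] + cost > self.budget:
            raise BudgetExceeded(f"{tenant} is over its budget of {self.budget} units")
        self.spent[tenant] += cost

# Hypothetical cost units per server-side operation.
COSTS = {"embed": 1, "rerank": 5}

meter = Meter(budget_per_tenant=10)
meter.charge("sanihub", COSTS["rerank"])
meter.charge("sanihub", COSTS["rerank"])
try:
    meter.charge("sanihub", COSTS["embed"])  # would exceed the budget
    blocked = False
except BudgetExceeded:
    blocked = True
```

Charging before the call, rather than after, is what stops a runaway agent loop instead of merely recording it.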
Push intelligence server-side on purpose, then put a price tag and a gate on it.
Open data already speaks CKAN / Socrata / OpenAPI. Meet it where it is, then add opinion where it pays.
fastmcp · gofastmcp.com · ckan-mcp-server
A small team building open-source AI for social impact. We help organizations tap into the richness of their data and find new ways of working, now for agents as much as people.