open-data · agents

$ ./connect --to open-data

Plugging agents into
open data

The shift: for years we built open data for humans to read. Increasingly the consumer is an AI agent. So we need open-data systems designed for agents, not just people.

Four ways in: CLIs, APIs, Skills, and MCP. Today: how to actually implement MCP servers over open data.

01 · the map

four ways in

How an agent reaches your data

Same goal, different contracts. Each layer answers "who handles auth, routing, and reasoning?"

CLI

The agent's native surface. Run a command, pipe it, chain it. In 2026 the terminal became where agents retrieve and act. gh, aws, ckanapi.

API

Raw reach. Maximum surface, zero opinion. The agent must know the endpoints.

Skill

Packaged know-how: instructions + scripts the client runs. Great front door over messy systems.

MCP

A running server exposing tools over a protocol. Where you put shared, agentic logic.

Today's focus: MCP servers — and when to lean on CLIs & Skills instead.

02 · skill vs mcp

runs where?

Skill vs MCP

They live on different sides of the wire.

client-side

Skill

A folder of instructions + code the model loads on demand. Zero infra to run. Perfect for wrapping a CLI or a few API calls behind plain-language steps. Lives with the agent.

server-side

MCP server

A process that advertises typed tools, resources, and prompts over a standard protocol. Any compliant client can connect. Put shared, stateful, or heavy logic here, once, for everyone.

Rule of thumb: personal & lightweight → Skill. Shared service or agentic backend → MCP.

03 · cli for auth

let the binary hold the keys

Skill + CLI = auth solved for free

skill → shells out to an authed CLI
# The CLI already did the OAuth dance,
# stores & refreshes the token for us.
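import json, subprocess

PORTAL = "https://portal.example.org"  # placeholder: your CKAN portal URL

def search_open_data(query):
    out = subprocess.run(
        ["ckanapi", "action", "package_search",
         f"q={query}", "--remote", PORTAL],
        capture_output=True, text=True,
        check=True,  # fail loudly if the CLI errors
    )
    # ckanapi prints the action's result object as JSON
    return json.loads(out.stdout)["results"]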

No secrets in the agent or prompt
Token refresh handled by the tool
Same path your team uses by hand
The Skill is just orchestration over a trusted binary

CLIs are the cheapest auth boundary you'll ever get. Wrap, don't re-implement.

04 · minimal mcp

fastmcp · python

An MCP server is a decorated function

FastMCP turns a typed Python function into a discoverable tool. The signature is the schema.

server.py
from fastmcp import FastMCP

mcp = FastMCP(name="OpenDataServer")

@mcp.tool
def find_datasets(query: str, limit: int = 5) -> list[dict]:
    """Search the open-data catalogue for datasets."""
    hits = catalogue.search(query, k=limit)  # catalogue: your search index, defined elsewhere
    return [d.summary() for d in hits]

if __name__ == "__main__":
    mcp.run()

Type hints → JSON schema, automatically
Docstring → the tool's description
Return value → handed back to the model
FastMCP: a widely used Python MCP framework (Prefect / J. Lowin).
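
To make "the signature is the schema" concrete, here is a minimal stdlib sketch of how type hints and a docstring could be turned into a tool description. It is not FastMCP's actual implementation: `describe_tool` and the `JSON_TYPES` mapping are illustrative assumptions, and only a few primitive types are handled.

```python
import inspect
from typing import get_type_hints

# Assumed, simplified mapping from Python types to JSON Schema type names.
JSON_TYPES = {str: "string", int: "integer", float: "number", bool: "boolean"}


def describe_tool(fn):
    """Build a tool description dict from a function's signature and docstring."""
    hints = get_type_hints(fn)
    properties, required = {}, []
    for name, param in inspect.signature(fn).parameters.items():
        properties[name] = {"type": JSON_TYPES.get(hints.get(name), "string")}
        if param.default is inspect.Parameter.empty:
            required.append(name)  # no default means the caller must supply it
        else:
            properties[name]["default"] = param.default
    return {
        "name": fn.__name__,
        "description": inspect.getdoc(fn) or "",
        "inputSchema": {
            "type": "object",
            "properties": properties,
            "required": required,
        },
    }


def find_datasets(query: str, limit: int = 5) -> list:
    """Search the open-data catalogue for datasets."""
    return []  # body is irrelevant here; only the signature is inspected


schema = describe_tool(find_datasets)
```

The resulting `schema` marks `query` as required and `limit` as an optional integer with a default, which is the information a client needs to call the tool.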

04b · typescript

vercel ai sdk · mcp-handler

Same idea in TS: re-expose tools you already have

WASH AI wraps its existing knowledge-retrieval tools and serves the Sanihub tenant over one MCP endpoint.

app/api/[transport]/route.ts
import { createMcpHandler } from 'mcp-handler';
import { knowledgeSearch } from '@/lib/ai/tools/knowledge-search';

// existing AI SDK tool, scoped + opinionated
const sanihub = knowledgeSearch({
  tenantId: 'sanihub',
  allowedTenants: ['sanihub'],   // tenant gating
  enableReranking: true,         // server-side rerank
});

const handler = createMcpHandler((server) => {
  server.registerTool('knowledgeSearch',
    { description: sanihub.description,
      inputSchema: sanihub.inputSchema.shape },
    async (args) => ({ content: [{ type: 'text',
      text: await sanihub.execute(args) }] }));
});
export { handler as GET, handler as POST };

any AI SDK client connects
import { experimental_createMCPClient
  as createMCPClient, generateText } from 'ai';

const mcp = await createMCPClient({
  transport: { type: 'sse',
    url: 'https://washai.org/api/mcp',
    headers: { Authorization: `Bearer ${tok}` }}});

const tools = await mcp.tools();   // MCP → AI SDK
await generateText({ model, tools, prompt });

Reuse the app's tools, zero rewrite
One endpoint extends Sanihub to any client

05 · the trap

⚠ context bloat

Every tool's schema is paid for in tokens

Auto-generate 50–200 tools and the JSON schemas eat the window before the model reasons.

one tool, expanded into context
{
  "name": "get_dataset_resource_view_v3",
  "description": "Return a resource view ...",
  "inputSchema": {
    "type": "object",
    "properties": {
      "dataset_id": {"type":"string","description":"..."},
      "resource_id":{"type":"string","description":"..."},
      "view_id":   {"type":"string","description":"..."},
      "include_draft":{"type":"boolean"},
      "format":   {"enum":["json","csv","xml"]}
    },
    "required":["dataset_id","resource_id"]
  }
}   # × 200 tools …

Context window used by tool schemas:

~15k+ tokens · before reasoning

Higher cost & latency, every call
Model confusion → wrong tool picks
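
A rough way to see the bill before you pay it: serialise the schemas and estimate their size. This sketch assumes about 4 characters per token, which is a crude heuristic and not a real tokenizer; `estimate_tokens` is an illustrative helper, and the `"..."` descriptions are kept as the placeholders shown on the slide.

```python
import json

# The example tool from the slide; "..." descriptions are kept as placeholders.
TOOL = {
    "name": "get_dataset_resource_view_v3",
    "description": "Return a resource view ...",
    "inputSchema": {
        "type": "object",
        "properties": {
            "dataset_id": {"type": "string", "description": "..."},
            "resource_id": {"type": "string", "description": "..."},
            "view_id": {"type": "string", "description": "..."},
            "include_draft": {"type": "boolean"},
            "format": {"enum": ["json", "csv", "xml"]},
        },
        "required": ["dataset_id", "resource_id"],
    },
}

CHARS_PER_TOKEN = 4  # assumption: crude heuristic, real tokenizers differ


def estimate_tokens(tools):
    """Approximate the tokens consumed by serialising tool schemas into context."""
    text = json.dumps(tools, separators=(",", ":"))
    return len(text) // CHARS_PER_TOKEN


one = estimate_tokens([TOOL])
many = estimate_tokens([TOOL] * 200)
print(f"1 tool ≈ {one} tokens, 200 tools ≈ {many} tokens")
```

Real descriptions are much longer than `"..."`, so treat the printed figure as a floor rather than a measurement.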

06 · the fix

discovery, not dumping

Expose one search tool, not two hundred

The agent searches for capability on demand; only matched schemas enter context.

the only tool the client sees up front
@mcp.tool
def search_tools(need: str) -> list[ToolCard]:
    """Find the right tool for a task, by intent."""
    return registry.rank(need)        # keyword + embeddings

# agent: search_tools("query a CSV resource")
#  → returns 2–3 candidates, full schema on demand

Context window used by tool schemas:

~2–3k tokens

200 tools become 1 entry point
Schemas load just-in-time
~50k → ~2–3k tokens of overhead

FastMCP's tool search / code mode ships this pattern out of the box.
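
What could `registry.rank` look like? A minimal sketch using keyword overlap only. `ToolCard`, `ToolRegistry`, and the scoring are hypothetical stand-ins; the embeddings half mentioned in the slide's comment is left out, and no stopword filtering is done.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCard:
    name: str
    description: str


def _words(text):
    """Lowercase word set used for naive keyword matching."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


class ToolRegistry:
    def __init__(self, cards):
        self.cards = list(cards)

    def rank(self, need, k=3):
        """Return up to k cards sharing the most words with the stated need."""
        wanted = _words(need)
        scored = [
            (len(wanted & _words(f"{c.name} {c.description}")), c)
            for c in self.cards
        ]
        hits = [(score, c) for score, c in scored if score > 0]
        hits.sort(key=lambda pair: pair[0], reverse=True)  # stable for ties
        return [c for _, c in hits[:k]]


registry = ToolRegistry([
    ToolCard("query_csv_resource", "Run a filter query against a CSV resource"),
    ToolCard("list_organizations", "List publishing organizations"),
    ToolCard("get_dataset_metadata", "Describe a dataset and its resource list"),
])

matches = registry.rank("query a CSV resource")
```

Only the matched cards are returned to the agent; tools with no overlap never enter the context.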

07 · free leverage

don't hand-write what a spec already describes

OpenAPI & CKAN → MCP in a few lines

any REST API → MCP server
import httpx
from fastmcp import FastMCP

client = httpx.AsyncClient(base_url="https://api.example.com")
spec   = httpx.get("https://api.example.com/openapi.json").json()

mcp = FastMCP.from_openapi(
    openapi_spec=spec,
    client=client,          # auth lives on the client
    name="CatalogueAPI",
)
mcp.run()

Open data already has standard APIs

CKAN powers data.gov, open.canada.ca, dati.gov.it. Socrata powers data.calgary.ca and many city portals. Wrap the standard once and every portal gets the same agent interface.

Every endpoint becomes a tool by default
RouteMap to include / exclude / retag
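
"Wrap the standard once" can be sketched with the CKAN Action API, which serves `package_search` under `/api/3/action/` on CKAN portals. The helper names (`package_search_url`, `summarise`, `find_datasets`) and the portal URL are illustrative assumptions, not part of any library.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def package_search_url(portal, query, rows=5):
    """Build a CKAN Action API URL for package_search on the given portal."""
    params = urlencode({"q": query, "rows": rows})
    return f"{portal.rstrip('/')}/api/3/action/package_search?{params}"


def summarise(payload):
    """Reduce a package_search response to small, agent-friendly summaries."""
    if not payload.get("success"):
        raise RuntimeError(f"CKAN error: {payload.get('error')}")
    summaries = []
    for d in payload["result"]["results"]:
        formats = {r.get("format", "") for r in d.get("resources", [])}
        summaries.append({
            "id": d["name"],
            "title": d.get("title", ""),
            "formats": sorted(formats - {""}),
        })
    return summaries


def find_datasets(portal, query, rows=5):
    """Same function, any CKAN portal: only the base URL changes."""
    with urlopen(package_search_url(portal, query, rows), timeout=10) as resp:
        return summarise(json.load(resp))


url = package_search_url("https://portal.example.org/", "water quality")
```

Returning a trimmed summary instead of the raw CKAN payload also keeps responses small, which matters once they land in an agent's context.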

07b · shipped

open-data MCPs, already live

The discovery pattern

A whole portal behind a handful of intent tools. Search to find, inspect the schema, then query.

Calgary · Socrata

3 tools, any dataset

search_datasets(query)

↳ get_dataset_metadata(id) → schema

↳ query_dataset(id, select, where, order)

SoQL over the live portal. The agent never sees 200 schemas up front; it discovers them.
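
A sketch of what sits behind a `query_dataset`-style tool: SoQL clauses mapped onto a Socrata SODA resource URL. The function name `soql_url`, the `MAX_ROWS` cap, and the dataset id `abcd-1234` are hypothetical; only the `/resource/<id>.json` path and the `$select` / `$where` / `$order` / `$limit` parameters come from the SODA API.

```python
from urllib.parse import quote, urlencode

MAX_ROWS = 1000  # assumed server-side cap, so an agent cannot pull a whole table


def soql_url(domain, dataset_id, select="*", where=None, order=None, limit=100):
    """Build a Socrata SODA request URL from SoQL clauses."""
    params = {"$select": select}
    if where:
        params["$where"] = where
    if order:
        params["$order"] = order
    params["$limit"] = min(limit, MAX_ROWS)
    # Keep "$" and "*" literal; percent-encode everything else.
    query = urlencode(params, safe="$*", quote_via=quote)
    return f"https://{domain}/resource/{dataset_id}.json?{query}"


url = soql_url(
    "data.example.org", "abcd-1234",
    select="name", where="year = 2024", order="name", limit=5000,
)
```

The generic design shows here: the tool passes SoQL through almost untouched, so the agent has to write valid clauses after reading the schema from the metadata call.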

IATI · aid data

domain-shaped tools

query_activities · search_text

search_by_country · _by_sector · _by_organization

get_facets · get_activity

Humanitarian & development funding, queryable in plain language. Hosted, with OAuth.

Calgary leans generic (SoQL passes through); IATI leans opinionated (aid concepts as tools).

08 · the catch

auto-generated ≠ done

A thin wrapper pushes the work onto the client

Mirror an API 1:1 and the agent must know which of 200 endpoints to call, in what order, with what params.

client agent (must be smart): picks tools · chains calls · re-ranks · cleans

⟶

generic mcp (thin pass-through): 1 tool per endpoint

⟶

api: raw responses

Fine for prototypes. Fragile when the client is small or the API is large & noisy.

09 · the other end

put the brains behind the wire

Opinionated MCP → a dumb client can win

one intent-shaped tool, heavy lifting hidden
@mcp.tool
def find_knowledge(question: str) -> list[Passage]:
    """Answer-ready evidence for a question."""
    hits   = hybrid_search(question)      # BM25 + vectors
    ranked = cross_encoder_rerank(question, hits)
    ranked = dedupe(ranked)
    return [trim(p) for p in ranked[:5]]  # small, clean

client agent: can be simple

⟱

opinionated mcp: search · rerank · dedupe · trim

⟱

stores: index + vectors

The agentic work moves server-side once — every client inherits it.
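
The last two server-side steps, `dedupe` and `trim`, are small enough to sketch in full. The `Passage` fields and the 300-character default are assumptions; search and reranking are omitted because they depend on your index.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Passage:
    source: str
    text: str
    score: float


def dedupe(passages):
    """Drop passages whose normalised text was already seen, keeping rank order."""
    seen, unique = set(), []
    for p in passages:
        key = " ".join(p.text.lower().split())  # ignore case and extra whitespace
        if key not in seen:
            seen.add(key)
            unique.append(p)
    return unique


def trim(passage, max_chars=300):
    """Cut a passage at a word boundary so the client receives small payloads."""
    if len(passage.text) <= max_chars:
        return passage
    cut = passage.text[:max_chars].rsplit(" ", 1)[0]
    return replace(passage, text=cut + " ...")


ranked = [
    Passage("a", "Clean water access", 0.9),
    Passage("b", "clean  water access", 0.8),  # near-duplicate of "a"
    Passage("c", "Sanitation coverage by district", 0.7),
]
out = [trim(p, max_chars=12) for p in dedupe(ranked)[:5]]
```

Because this runs once on the server, a small client model receives a short, clean list without having to do any cleanup itself.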

10 · the dial

it's a spectrum, choose on purpose

Generic reach ↔ opinionated depth

generic · API→MCP

Choose when: broad coverage fast, capable client, internal tools, exploration.

Cost: smart client, schema bloat, brittle chaining.

opinionated MCP

Choose when: shared product surface, thin or many clients, quality & safety matter.

Cost: design & maintain the logic yourself.

Most real servers mix both: tool search for breadth + a few intent tools for the hot paths.

10b · guardrails

open ≠ unguarded

Auth, gating & the cost of AI on the server

An open endpoint is still your endpoint. Decide who gets in, how often, and who pays for the compute.

Auth & data access

IATI: the user's API key is encrypted into a stateless JWT (OAuth 2.1), so nothing is stored server-side. WASH AI: validateTenantAccess blocks private tenants while public data flows freely.

Gating & rate limits

Open data invites scraping. Rate-limit per token, paginate with hard caps, and separate public from private scopes so a noisy client can't drain the portal.
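
One common way to rate-limit per token is a token bucket. The sketch below is a minimal in-memory version under assumed numbers; `TokenBucket`, `clamp_page`, and `MAX_PAGE_SIZE` are illustrative names, and a real deployment would keep the state in a shared store.

```python
import time


class TokenBucket:
    """Per-token rate limiter: `rate` requests per second, bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate, self.capacity, self.clock = rate, capacity, clock
        self.buckets = {}  # api token -> (tokens left, last refill time)

    def allow(self, api_token):
        now = self.clock()
        tokens, last = self.buckets.get(api_token, (self.capacity, now))
        tokens = min(self.capacity, tokens + (now - last) * self.rate)
        if tokens < 1:
            self.buckets[api_token] = (tokens, now)
            return False
        self.buckets[api_token] = (tokens - 1, now)
        return True


MAX_PAGE_SIZE = 100  # assumed hard cap on rows per response


def clamp_page(limit, offset=0):
    """Never let a client request more than MAX_PAGE_SIZE rows at once."""
    return max(1, min(limit, MAX_PAGE_SIZE)), max(0, offset)
```

Each API token gets its own bucket, so one noisy client exhausts only its own allowance.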

AI runs cost money

Reranking, embeddings, research search: every call burns tokens + latency server-side. Budget & meter per call / per tenant, or one agent loop bills you all day.
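
Metering can start very simply: charge a tenant before doing the expensive work. In this sketch the `Meter` class, the `COSTS` table, and the budget of 10 units are all assumptions, and `find_knowledge` returns a placeholder string instead of running a real pipeline.

```python
from collections import defaultdict


class BudgetExceeded(Exception):
    pass


class Meter:
    """Tracks spend per tenant against a fixed budget (units are up to you)."""

    def __init__(self, budget_per_tenant):
        self.budget = budget_per_tenant
        self.spent = defaultdict(float)

    def charge(self, tenant, cost):
        if self.spent[tenant] + cost > self.budget:
            raise BudgetExceeded(f"{tenant} would exceed budget of {self.budget}")
        self.spent[tenant] += cost


# Assumed relative costs per operation; replace with measured figures.
COSTS = {"search": 1.0, "rerank": 5.0}

meter = Meter(budget_per_tenant=10.0)


def find_knowledge(tenant, question, rerank=True):
    """Charge before doing the expensive work, so a runaway loop fails fast."""
    meter.charge(tenant, COSTS["search"] + (COSTS["rerank"] if rerank else 0.0))
    return f"results for {question!r}"  # placeholder for the real pipeline
```

With these numbers a tenant can afford one reranked call plus a few plain searches before `BudgetExceeded` stops the loop.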

Push intelligence server-side on purpose, then put a price tag and a gate on it.

recap

$ summary --open-data

Six things to walk out with

Skill = client, MCP = server. Lightweight & personal vs shared & agentic.
Let the agent drive a CLI. A composable surface, auth included.
Watch the token bill. 200 schemas can eat half your window.
Expose one search_tools. Discovery beats dumping.
Re-expose what you built. mcp-handler wraps existing app tools, like Sanihub.
Gate & price it. Auth, rate limits, and a budget on server-side AI.

Open data already speaks CKAN / Socrata / OpenAPI. Meet it where it is, then add opinion where it pays.

fastmcp · gofastmcp.com · ckan-mcp-server

whoami

ai tools for social good

Baobab Tech

A small team building open-source AI for social impact. We help organizations tap into the richness of their data and find new ways of working, now for agents as much as people.

github.com/baobab-tech · linkedin.com/company/baobabtech · thanks ✦
