pagedigest.json

One file that tells crawlers what changed.

A site publishes a JSON manifest at /.well-known/pagedigest.json. Each URL gets a monotonic integer that moves only when the content moves. A crawler fetches the manifest, compares integers, and fetches only the pages that changed — one request instead of thousands.

“If you would stop hammering my server for pages that haven’t changed, I wouldn’t have to rate-limit or ban you.”
The publisher
“If you would stop hiding behind bot detection, I wouldn’t have to hammer your server — I just need to know what’s new.”
The crawler

They are describing the same waste. pagedigest is the coordination.

The manifest

For a 10,000-page site that changes 20 pages a week, a consumer makes one manifest request plus twenty page fetches per cycle — instead of ten thousand per-URL checks. site_rev is the one-request fast path: if it hasn’t moved, nothing has.

{
  "version": 1,
  "generated": "2026-07-02T10:00:00Z",
  "site_rev": 18294,
  "entries": {
    "/": { "rev": 47 },
    "/about": { "rev": 12 },
    "/blog/hello-world": {
      "rev": 4,   // was 3 — this is the only page to re-fetch
      "digest": "sha256:2cf24dba5fb0a30e26e83b2ac5b9e29e…"
    }
  }
}

The optional digest lets consumers spot-audit publisher claims: publishers who lie get caught, publishers who cooperate accumulate trust.

How a consumer uses it

  1. Fetch the manifest

    One GET to /.well-known/pagedigest.json. If it’s missing or malformed, fall back to whatever you did before — the protocol never makes things worse.

  2. Compare site_rev

    Same integer as last visit? Nothing on the site changed. The crawl cycle is over after a single request.

  3. Fetch only what moved

    For each URL whose rev incremented, re-fetch. Everything else is provably unchanged, on the publisher’s own word.

  4. Audit occasionally

    Hash a sampled page and compare it to the manifest’s digest. A mismatch means the manifest can’t be trusted — downgrade and fall back.

Live in production

First production publisher

dotrepo.org — a public index of repository metadata for AI agents — publishes a pagedigest manifest covering its full corpus: thousands of JSON records that refresh continuously. Mirrors and agent caches sync the whole index with one manifest request plus only the files whose revision moved.

curl -s https://dotrepo.org/.well-known/pagedigest.json | jq .site_rev
This site, too

pagedigest.org publishes its own manifest at /.well-known/pagedigest.json, generated by the reference generator at build time.

The bargain

pagedigest is a coordination protocol, not a defense mechanism. Each side takes on an obligation that is already in its own interest.

Publishers commit

  • The integers move when content moves — and only then.
  • Digests are accurate hashes of what the server actually sends.
  • A publisher with an honest manifest has earned the right to rate-limit consumers that ignore it.

Consumers commit

  • Fetch the manifest first; skip what hasn’t moved.
  • Audit digests occasionally to keep publishers honest.
  • A consumer that respects the manifest is not the problem — that’s the intended use.

Status

Version 1, release candidate

The wire format is stable: field names, semantics, and the /.well-known/pagedigest.json location will not break before 1.0.

Discovery

Publishers advertise the manifest with Link: </.well-known/pagedigest.json>; rel="https://pagedigest.org/rel" on ordinary responses.

Reference implementations

A Rust generator and a Python consumer library exist today, with a conformance vector suite. Public source release accompanies 1.0.

Get involved

Crawler operators and publishers who want early integration: hello@pagedigest.org.