Reference
How we score sites.
Crawloria scores a site from 0 to 100 based on six weighted categories. This page explains exactly what each category measures, why it's weighted the way it is, and how the overall score and letter grade are computed.
The formula
Every individual check returns a sub-score from 0 to 10. Each category's score is the average of its checks. The overall score is a weighted sum across categories, normalized by the total weight of the categories that ran and scaled to 0-100.
| Category | Weight |
|---|---|
| AI Bot Access | 25% |
| Content Accessibility | 20% |
| Structured Data | 15% |
| Navigation Friction | 15% |
| Agent-Specific Signals | 15% |
| Semantic Markup | 10% |
| Total | 100% |
When a category can't run (for example, the headless browser couldn't render the page), we exclude its weight from the denominator rather than score it as zero. The overall score reflects only the categories we successfully measured, with a “pending” indicator on the audit page for everything else.
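The weighting and pending-category logic described above can be sketched as follows. The category keys, the `None`-for-pending convention, and the rounding are illustrative assumptions, not Crawloria's actual implementation:

```python
# Category weights from the table above. A category whose checks could not
# run is passed as None and its weight is dropped from the denominator.
WEIGHTS = {
    "ai_bot_access": 0.25,
    "content_accessibility": 0.20,
    "structured_data": 0.15,
    "navigation_friction": 0.15,
    "agent_specific_signals": 0.15,
    "semantic_markup": 0.10,
}

def overall_score(category_scores: dict) -> float:
    """category_scores maps category -> average sub-score (0-10), or None
    when the category could not be measured."""
    ran = {c: s for c, s in category_scores.items() if s is not None}
    total_weight = sum(WEIGHTS[c] for c in ran)
    if total_weight == 0:
        return 0.0  # nothing was measured
    weighted = sum(WEIGHTS[c] * s for c, s in ran.items())
    # Sub-scores are 0-10, so scale by 10 to land on 0-100.
    return round(weighted / total_weight * 10, 1)
```

Note the effect of the normalization: a site acing five categories while the sixth is pending still scores 100, rather than being dragged down as if the pending category had scored zero.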
Letter grade bands
AI Bot Access
25% weight. Whether the seven major AI crawlers can actually fetch the site. We send real GET requests with each agent's documented User-Agent string and record what comes back.
GPTBot
OpenAI's general crawler. Powers ChatGPT search and feeds future model training.
ClaudeBot
Anthropic's crawler powering Claude's web access and search results.
OAI-SearchBot
OpenAI's search-specific crawler — separate from GPTBot, often blocked by sites that thought they only needed to allow one.
PerplexityBot
Perplexity's crawler that surfaces pages in their AI search results.
CCBot
Common Crawl. Feeds many open AI training datasets and downstream agents.
anthropic-ai
Used by Anthropic agents (including Claude Computer Use) when fetching pages on a user's behalf.
Google-Extended
Google's separate crawler for AI Overviews and Gemini training. Distinct from Googlebot.
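One way to picture how the responses to those probes might be scored. The crawler tokens come from the list above; real requests would use each vendor's full documented User-Agent string, and the status-to-score mapping here is a hypothetical rubric, not Crawloria's actual one:

```python
# Crawler tokens from the list above (full UA strings abbreviated).
AI_CRAWLERS = [
    "GPTBot", "ClaudeBot", "OAI-SearchBot", "PerplexityBot",
    "CCBot", "anthropic-ai", "Google-Extended",
]

def access_subscore(status_code: int) -> int:
    """Map an HTTP response status to a 0-10 sub-score (illustrative only)."""
    if 200 <= status_code < 300:
        return 10   # content served
    if status_code in (301, 302, 307, 308):
        return 7    # reachable, but every redirect is an extra hop
    if status_code == 429:
        return 3    # rate limited: the crawler may retry, or may not
    if status_code in (401, 403):
        return 0    # explicitly blocked
    return 0        # server errors, timeouts, everything else
```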
Content Accessibility
20% weight. Whether content actually loads in a form agents can use: HTTPS, response time, JavaScript dependency, and the presence of bot-protection layers that may silently block traffic.
HTTPS
Plain HTTP is heavily penalized. Most modern agents refuse to submit forms or follow links over insecure connections.
Time to First Byte
Agents have shorter timeouts than humans. Slow first byte often causes them to abort before reading anything.
Bot protection layer
Cloudflare's "Block AI Bots" toggle is on by default for many plans. We detect Cloudflare via response headers and warn even if scoring otherwise looks fine.
Content available without JavaScript
We fetch the initial HTML, render the page in a real browser, and compare the two. Pages where most content arrives through client-side rendering score low: many AI agents don't run JavaScript, or support it only partially. A guard rail also flags pages with under 200 characters of rendered text (login walls, error pages, auth-gated SPAs).
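The comparison can be sketched as a ratio of visible text before and after rendering. The regex-based tag stripping, the word-count ratio, and the thresholds are all illustrative assumptions; the real audit compares actual browser output:

```python
import re

def strip_tags(html: str) -> str:
    """Crude visible-text extraction: drop script/style blocks, then tags."""
    html = re.sub(r"(?is)<(script|style)[^>]*>.*?</\1>", " ", html)
    return re.sub(r"<[^>]+>", " ", html)

def js_dependency_subscore(raw_html: str, rendered_html: str) -> int:
    """Compare pre- and post-render text volume. Thresholds are hypothetical."""
    rendered_text = " ".join(strip_tags(rendered_html).split())
    if len(rendered_text) < 200:
        return 0            # guard rail: login wall, error page, empty shell
    raw_words = len(strip_tags(raw_html).split())
    rendered_words = len(rendered_text.split())
    ratio = raw_words / rendered_words
    if ratio >= 0.8:
        return 10           # content is already in the initial HTML
    if ratio >= 0.4:
        return 5            # partially client-rendered
    return 2                # most content arrives via JavaScript
```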
Structured Data
15% weight. Schema.org JSON-LD, Open Graph, canonical URLs, title and meta description. The signals that let agents understand what a page is, not just read it.
JSON-LD structured data
Schema.org JSON-LD blocks tell agents the page is an Organization, Article, Product, FAQPage, etc. Heavily weighted because it's the most direct way to feed agents structured information.
Open Graph metadata
OG tags drive previews when URLs are shared in chat clients, agents, and social platforms. og:title, og:description, og:image, og:url at minimum.
Canonical URL
Without a canonical link, agents may treat URL variants (with/without trailing slash, query params) as different pages and fragment authority.
Title and meta description
Still the primary signal for what a page is about. Empty or default titles cost points.
LocalBusiness / Product schema
When relevant, we also check the type-specific schema. LocalBusiness must include name, address, telephone, opening hours, geo, priceRange. Product must include name, image, offers (with price), availability, brand, aggregateRating, sku.
Navigation Friction
15% weight. Things that block agents from reaching real content even after they've fetched the page. Cookie banners, modals, login walls.
Modal or banner blocking content
We render the page in a real browser at 1568×1024 viewport and check for known cookie consent overlays (OneTrust, Cookiebot, Iubenda, Termly) plus a fallback heuristic that scans fixed/sticky elements with high z-index whose text matches cookie/consent/sign-up/subscribe/register/newsletter. Banners covering more than 50% of the viewport get a fail score.
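The fallback heuristic can be pictured as a pass over element snapshots from the headless browser. The element shape, the z-index threshold, and the partial-penalty value are assumptions for illustration:

```python
import re

CONSENT_PATTERN = re.compile(
    r"cookie|consent|sign[- ]?up|subscribe|register|newsletter", re.I)

def overlay_subscore(elements: list) -> int:
    """elements: dicts with 'position', 'z_index', 'text', and 'coverage'
    (fraction of the 1568x1024 viewport the element covers). Hypothetical
    shape; real data would come from the rendered DOM."""
    worst = 10
    for el in elements:
        if el["position"] not in ("fixed", "sticky"):
            continue
        # z-index cutoff of 1000 is an assumed "overlay-like" threshold.
        if el["z_index"] < 1000 or not CONSENT_PATTERN.search(el["text"]):
            continue
        if el["coverage"] > 0.5:
            return 0           # banner covers most of the viewport: fail
        worst = min(worst, 5)  # present but not dominant: partial penalty
    return worst
```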
Agent-Specific Signals
15% weight. Files specifically created to help AI agents understand and navigate the site, plus real-time search visibility for the brand's category.
robots.txt allows AI bots
We fetch /robots.txt and parse it for User-agent declarations that disallow our seven major AI crawlers. A site can serve content with HTTP 200 but still be hostile in robots.txt.
Sitemap declared in robots.txt
Agents and crawlers use the Sitemap directive to find your full URL list. Missing this means agents must guess what pages exist.
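The two robots.txt checks above can be sketched together as a single pass over the file. This is a deliberately minimal parser, handling only grouped `User-agent` lines, blanket `Disallow: /` rules, and the `Sitemap` directive, not the full robots.txt grammar:

```python
def parse_robots(robots_txt: str, crawlers: list):
    """Return (blocked_crawlers, sitemap_declared). Minimal sketch: ignores
    path-specific Disallow rules, Allow rules, and case-folding of tokens."""
    blocked, sitemap_declared = set(), False
    current_agents = []
    saw_rule = True  # whether the current group already received rules
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            if saw_rule:
                current_agents = []   # a new group starts
                saw_rule = False
            current_agents.append(value)
        elif field == "disallow":
            saw_rule = True
            if value == "/":          # blanket block for this group
                for c in crawlers:
                    if c in current_agents or "*" in current_agents:
                        blocked.add(c)
        elif field == "sitemap":
            sitemap_declared = True
    return blocked, sitemap_declared
```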
llms.txt present and well-formed
llms.txt is an emerging standard (proposed by Jeremy Howard at llmstxt.org) that gives LLMs a curated, structured introduction to a site in markdown. We check both /llms.txt and /.well-known/llms.txt and validate the structure.
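A loose structural check in the spirit of the llmstxt.org proposal might look like this: require an opening H1 title, and require that any list items are markdown links. This is a sketch of one plausible validation, not Crawloria's actual validator:

```python
import re

def validate_llms_txt(text: str) -> bool:
    """Loose llms.txt structure check: H1 title first, and list items
    (if any) must be markdown links like '- [name](url): description'."""
    lines = [l.rstrip() for l in text.splitlines() if l.strip()]
    if not lines or not lines[0].startswith("# "):
        return False  # the proposal requires an opening H1 title
    for l in lines:
        if l.startswith("- ") and not re.match(r"- \[[^\]]+\]\([^)]+\)", l):
            return False
    return True
```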
Real-time search visibility
When a Brave Search API key is configured, we run a live search query for the brand's narrow category and check whether the domain appears in the top 20 results. This is the closest proxy we have for what ChatGPT Search will surface.
Semantic Markup
10% weight. Basic HTML semantics that help agents (and screen readers, and search engines) interpret page structure.
Heading hierarchy
One <h1> per page, then nested <h2>s, then <h3>s. Multiple H1s or skipped levels (H1 → H3) make the page outline ambiguous.
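Given the heading levels in document order, the check reduces to two rules: exactly one H1, and no adjacent pair that skips a level. The point values below are hypothetical, for illustration only:

```python
def heading_subscore(levels: list) -> int:
    """levels: heading levels in document order, e.g. [1, 2, 3, 2].
    Hypothetical scoring: zero or multiple H1s fail outright; each
    skipped level between adjacent headings costs points."""
    if levels.count(1) != 1:
        return 2            # the page outline is ambiguous
    score = 10
    for prev, cur in zip(levels, levels[1:]):
        if cur > prev + 1:  # e.g. an H1 followed directly by an H3
            score -= 3
    return max(score, 0)
```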
HTML lang attribute
Without a lang attribute on the <html> element, agents and translation tools have to guess the language from content.
Viewport meta tag
Mobile agents and Computer Use models render at specific viewports. Without this they fall back to desktop assumptions and may misread layout.
Image alt text coverage
Visual agents downsample heavily; alt text is often the only signal for what an image conveys. We measure the percentage of <img> tags with non-empty alt attributes.
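The coverage measurement is a simple ratio. The regex-based extraction here is a sketch (it only handles double-quoted attributes); the real audit walks the rendered DOM:

```python
import re

def alt_coverage(html: str) -> float:
    """Percentage of <img> tags carrying a non-empty alt attribute."""
    imgs = re.findall(r"<img\b[^>]*>", html, re.I)
    if not imgs:
        return 100.0        # nothing to describe, nothing to penalize
    with_alt = [i for i in imgs
                if re.search(r'alt\s*=\s*"[^"]+"', i, re.I)]
    return round(100 * len(with_alt) / len(imgs), 1)
```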
What we don't measure
A few things you might expect to see but won't in V0:
- Multi-page audits. We only scan the URL you give us. Auditing the full site is a planned Pro feature.
- Form analysis (label association, autocomplete attributes, input types). Critical for agents that fill forms; on the roadmap.
- Real Claude Computer Use replay against your site. Expensive and slow for a free tier; this is a paid-tier feature on the roadmap.
- Industry benchmarks. We don't yet compare your score against similar sites in your category — that requires a much larger dataset of audits.
Disclaimer
Crawloria is an automated audit. Scores are computed from measurements taken at scan time and reflect what a fresh, US-based, unauthenticated request sees. Real users, agents in other regions, authenticated sessions, and crawlers operating at scale may experience the site differently. We don't represent that any score predicts business outcomes — it's a structural diagnostic, not a ranking.