# [How this blog is built: static site, edge analytics, and LLM-friendly artifacts](https://blog.hirnschall.net/same-domain-analytics/)

author: [Sebastian Hirnschall](https://blog.hirnschall.net/about/)

meta description: Hosting on GitHub Pages with Cloudflare Worker + D1 for same-domain analytics, llms.txt and per-page markdown for LLMs. No cookies, no third parties.

meta title: How This Blog is Built — Static Site, Edge Analytics, LLM-Friendly

date published: 22.04.2026 (DD.MM.YYYY format)
date last modified: 22.04.2026 (DD.MM.YYYY format)

---

Motivation
----------

Renting a server from e.g. 1and1 (Ionos) to host a website using WordPress and PHP feels dated in 2026. We will therefore switch to a modern static-site approach hosted on GitHub Pages, with custom same-domain analytics living on Cloudflare's edge. This way, we can

* deploy via CI jobs using git,
* avoid third-party tracking and data sharing,
* avoid cookie consent banners,
* have full control over an extremely performant and SEO-optimized site,
* optimize for LLM crawling by generating e.g. llms.txt and per-page markdown,
* and save money on hosting and maintenance costs.

Hard Requirements
-----------------

Let's now formalize and discuss the must-have features mentioned above in detail before we go over the implementation:

* **Continuous deployment and integration:** Hosting directly on GitHub Pages gives us automatic deployment with every push to the repository. This way we can use git instead of some web UI or FTP client.
* **No PHP:** Aside from the fact that PHP is no longer contemporary, it is also not supported by GitHub Pages. We will have to implement everything in JavaScript.
* **Static site generation:** I want each page to be a static HTML file with all JS and CSS inlined. This has several benefits: fast page loads, no incompatible files like locally cached JS/CSS that do not match an updated HTML structure, and no cross-domain tracking by third parties.
* **No third-party tracking and no cookies:** I do not like cookie consent banners. They train users to consent to anything just to make the banner go away. We will avoid using cookies or third-party tracking scripts.
* **Analytics:** As we still want to see how articles perform and how users interact with the site, we will implement custom same-domain analytics. This is probably the most interesting part of the stack, as it has several real benefits compared to third-party solutions like Google Analytics.
* **Data ownership:** If we collect analytics data, I think sharing it with third parties, especially Google, is bad. Furthermore, I want access to the full dataset for processing in e.g. Python, not just what a dashboard provides.
* **Zero runtime costs:** Aside from buying a domain, both GitHub Pages and Cloudflare's edge have zero runtime costs. Cloudflare's Workers and D1 databases have limits in the free tier, but by the time we reach 100k daily requests, we can surely afford the paid Cloudflare plan. Seeing that hosting is the only running cost a blog has, let's cut it.

Now that the requirements are clear, we can discuss how to implement each of them in a modern, scalable way.

Static Site Generation
----------------------

I use a custom Python generator. It does what most static generators do: markdown to HTML with templates for figures, footers, listings. However, a few things in it are non-standard and worth discussing.

Templates make site-wide updates easy and efficient. Each article is split into a content file and a JSON frontmatter file containing its metadata.
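
The exact frontmatter schema is not spelled out in this article; a minimal example, with illustrative field names mirroring the metadata at the top of this page, might look like this:

```
{
    "title": "How This Blog is Built",
    "meta_description": "Static site on GitHub Pages with same-domain analytics.",
    "author": "Sebastian Hirnschall",
    "date_published": "22.04.2026",
    "date_modified": "22.04.2026"
}
```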

### LLM crawling

While some try to block LLM crawlers, we go the opposite route and make the site as LLM-friendly as possible. For us, an LLM linking to and citing the page is the natural progression from search engines like Google. For this, the site generator outputs the following files:

* **llms.txt:** A markdown file that lists all pages on the site and their metadata for LLMs to crawl, similar to a sitemap.
* **llms-full.txt:** A single long markdown file containing all articles on the site. This reduces the number of requests an LLM needs to crawl the entire site.
* **Per-page markdown:** Each page is also generated and hosted as a markdown file.
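
The llms.txt proposal specifies a simple structure: an H1 title, a short blockquote summary, and H2 sections containing link lists. A sketch for this site (entries illustrative) could look like:

```
# blog.hirnschall.net

> Articles on engineering and software by Sebastian Hirnschall.

## Articles

- [How this blog is built](https://blog.hirnschall.net/same-domain-analytics/): static site, edge analytics, and LLM-friendly artifacts
```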

As e.g. Claude does not fetch pages that are disallowed in robots.txt, we allow all pages and instead add `x-robots-tag: noindex` headers on Cloudflare's edge. This way the files are not indexed by Google but remain accessible to LLMs.
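
One way to attach the header is a Cloudflare Transform Rule (Modify Response Header) scoped to the LLM artifacts. The rule expression below is a sketch and assumes the per-page markdown files keep a `.md` extension:

```
# When incoming requests match (Cloudflare rule expression):
ends_with(http.request.uri.path, ".md")
  or http.request.uri.path in {"/llms.txt" "/llms-full.txt"}

# Then: set static response header
#   x-robots-tag: noindex
```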

### Resource validation and conversion

The generator converts all included images to WebP at build time. It also keeps track of which files are actually linked on each page. When building, it only copies the linked and converted files to the build directory to save bandwidth when deploying. During this process it also checks whether any linked file is missing from the build directory; if one is, it raises an error and fails the pipeline.

### References and citations

Citations and references are implemented similarly to LaTeX. Each figure, reference, listing, etc. gets a name that we can use in the text to reference it. This way we avoid wrong-numbering issues.
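
The generator's actual markup is not shown here, but a LaTeX-like scheme could look as follows, with names resolved to numbers at build time (syntax hypothetical):

```
![architecture diagram](resources/img/architecture.svg){#fig:architecture}

As shown in [@fig:architecture], the worker sits between the client and D1.
```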

Same-domain analytics
---------------------

The big thing to implement is analytics. The available options are all either paid, share data with third parties, or require a cookie consent banner. Other issues include that they are not live or that they do not count every individual visitor.

### Architecture

As we cannot use PHP, we will rely on client-side JS and Cloudflare edge Workers plus D1 storage.

We set up an endpoint on our domain to handle analytics requests from the client-side JS. The worker itself is also written in JS but runs on the edge. On page load, we send an init request to the endpoint to register the user's session. We then send heartbeat requests to update the time on page, plus custom events for interactions. What we collect and why is the topic of the next sections.

One important benefit of this approach is that the analytics endpoint is hosted on the same domain as the page. It is therefore not subject to cross-origin restrictions and is practically indistinguishable from other requests. This means that ad blockers will not block these requests the way they block e.g. Google Analytics.

Before we dive into the client-side and server-side implementation details, here is a high-level architecture diagram (fig. 1):

![architicture of modern same domain analytics using git, cloudflare and edge workers with D1 storage](https://blog.hirnschall.net/same-domain-analytics/resources/img/architecture.svg)


Figure 1: Architecture Diagram for blog.hirnschall.net

### Client Side JavaScript

As mentioned above, we split the requests into init, heartbeat, and events. Let's take a closer look at what this means in practice and why we do it:

#### Init

On page load, we send an init request to the worker. This creates a new entry in the D1 database. We track sessions and returning visitors using the anonymized IP address. Cloudflare already sees the IP as the CDN, so the analytics worker isn't a new data flow to them. The client-side JS also generates a random UUID and provides it to the server. Furthermore, the `document.referrer` is recorded.

The client-side JS is shown below:

```
function sendInit() {
    fetch(ANALYTICS_ENDPOINT + '/init', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({
            uuid: pageUUID,
            page: pagePath,
            referrer: document.referrer || ''
        })
    });
}
```

#### Heartbeat

To keep track of time spent on each page, the client-side JS sends heartbeat requests at regular intervals. Here the UUID identifies the user's session so the correct DB entry is updated.

```
function sendHeartbeat(delta) {
    fetch(ANALYTICS_ENDPOINT + '/heartbeat', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ uuid: pageUUID, delta })
    });
}
```
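
The interval schedule itself is not shown in the article. One sketch that matches the delta whitelist enforced by the worker (1, 5, and 30 seconds) is to heartbeat frequently at first and back off for long reads; the thresholds here are illustrative:

```
// Sketch (assumption): thresholds are illustrative; only deltas the worker
// whitelists (1, 5, 30) are ever produced.
function nextHeartbeatDelta(elapsedSeconds) {
    if (elapsedSeconds < 10) return 1;  // fine-grained for quick bounces
    if (elapsedSeconds < 60) return 5;
    return 30;                          // long reads: few requests
}

// self-rescheduling loop driving sendHeartbeat() from the listing above
function startHeartbeats(elapsed = 0) {
    const delta = nextHeartbeatDelta(elapsed);
    setTimeout(() => {
        sendHeartbeat(delta);
        startHeartbeats(elapsed + delta);
    }, delta * 1000);
}
```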

#### Events

As there are several other things we want to track, we can also send 'events' to the endpoint. For this site, these are split into

* **Clicks:** The client-side JS attaches an onclick event listener to `a` tags on load. Clicks are then split into internal, outbound, referral, newsletter signups, and downloads.  
  **Interpretation:** How are users navigating the page? Which recommendations are useless and never clicked?
* **Scroll depth:** An onscroll event listener checks in 5% increments how far down the page the user currently is. Every 5%, an event is triggered.  
  **Interpretation:** Are (long) articles read all the way through, or do users leave after the first half of the page?
* **Time per section:** Again using an onscroll event listener together with the current section id, the time per section is recorded.  
  **Interpretation:** How long does the user spend reading each section? Which sections are generally skipped? Which sections may be too complicated?
* **Code copies:** Using the copy event, the client sends a copy event to the endpoint.  
  **Interpretation:** What code is copied and used by readers? If no code is copied, the article is probably bad, as readers did not find what they were looking for.

Adding other (new) events is straightforward: the client-side JS just needs to send the event data to the endpoint. The listing below shows the client-side event handler.

```
function sendEvent(type, url, anchorText) {
    navigator.sendBeacon(
        ANALYTICS_ENDPOINT + '/event',
        new Blob([JSON.stringify({
            uuid: pageUUID,
            type: type,
            url: url || '',
            anchor_text: anchorText || '',
            page: pagePath
        })], { type: 'application/json' })
    );
}
```
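
As a concrete example of producing one of these events, a scroll-depth handler might look like the sketch below. The bucket logic and the `'scroll_depth'` type name are illustrative; it is wired to the `sendEvent` helper above:

```
const reportedBuckets = new Set();

// map a scroll position to its 5% bucket (0, 5, ..., 100)
function scrollDepthBucket(scrollTop, viewportHeight, pageHeight) {
    const scrollable = pageHeight - viewportHeight;
    if (scrollable <= 0) return 100; // page fits on one screen
    const pct = Math.min(100, Math.max(0, (scrollTop / scrollable) * 100));
    return Math.floor(pct / 5) * 5;
}

// fire each bucket at most once per page view
function onScroll() {
    const bucket = scrollDepthBucket(
        window.scrollY,
        window.innerHeight,
        document.documentElement.scrollHeight
    );
    if (!reportedBuckets.has(bucket)) {
        reportedBuckets.add(bucket);
        sendEvent('scroll_depth', '', String(bucket));
    }
}

// document.addEventListener('scroll', onScroll, { passive: true });
```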

#### Bot detection

To avoid inflated numbers from crawlers triggering the JS, we have several options to detect and ignore bots.

* **User agent:** The client-side JS checks the user agent. If it matches a known bot user agent, the site works as normal but no analytics requests are sent. This works well for bots that tell us they are bots, e.g. Googlebot, ClaudeBot, etc.
* **Time on page:** If a bot sends requests anyway, we can ignore them based on the time spent on the page.
* **Google reCAPTCHA v3:** Google's reCAPTCHA v3 is invisible to the user, free, and can be used to produce a bot probability score. However, it uses cookies and shares data with Google. It is therefore something we will not use, but it's an option.
* **Cloudflare bot management:** The best option would be Cloudflare's bot management features. Similar to reCAPTCHA v3, a bot probability score is provided. However, it runs on the Cloudflare network and does not require client-side code. It is not included in the free tier.
* **LLM opt-out:** To avoid tracking LLMs that do not want to identify as bots, the `knownBot=1` GET argument also disables analytics tracking. LLMs are instructed to use it in llms.txt and robots.txt. So, if I want to audit the site using LLMs, I can instruct them to use this argument. This is especially useful for xAI's Grok, as it does not disclose its bot status.

The client-side user agent handling is shown below. It is a simple regex match against known bot user agents. One notable detail is that we explicitly enable tracking for LLM user agents: these are requests made on behalf of actual users and should be treated as such. The code also shows the `knownBot` flag:

```
function botDetected() {
    if (new URLSearchParams(window.location.search).get('knownBot') === '1') return true;

    const ua = navigator.userAgent || '';

    if (/chatgpt-user|claude-user|perplexity-user/i.test(ua)) return false;

    const isAutomated = !!navigator.webdriver;
    const bots = /bot|crawl|spider|google|bing|baidu|yandex|duckduckgo|facebook|slurp|exabot|facebot|scraper|headless|puppeteer|playwright|selenium|phantomjs|prerender|rendertron|screenshot|preview|facebookexternalhit|twitterbot|linkedinbot|slackbot|discordbot|are\.na|arena|microlink|diffbot|iframely|PTST|lighthouse|gptbot|chatgpt|claudebot|claude-searchbot|oai-searchbot|perplexitybot|anthropic-ai|anthropic|claude/i;

    return isAutomated || bots.test(ua);
}
```

### Edge Worker & D1 Storage

The worker implementation itself is minimal and straightforward. All it needs to do is listen for the requests discussed in the client-side section above and write to or update the D1 database.

After adding a binding to the D1 database in the Cloudflare dashboard, we can handle requests as shown below using `env.DB`.

Let's look at the implementation of each endpoint. For readability's sake, CORS headers, server-side data extraction, and POST-restriction checks have been removed.
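
The article does not list the D1 schema, but it can be reconstructed from the queries below; the column types and constraints are assumptions on top of the column names the inserts use (D1 is SQLite-backed):

```
CREATE TABLE IF NOT EXISTS sessions (
    uuid      TEXT PRIMARY KEY,
    ipan      TEXT,               -- anonymized IP
    bot_score INTEGER,
    referrer  TEXT,
    page      TEXT,
    timestamp TEXT,
    time      INTEGER DEFAULT 0   -- seconds on page, updated by heartbeats
);

CREATE TABLE IF NOT EXISTS events (
    id           INTEGER PRIMARY KEY AUTOINCREMENT,
    session_uuid TEXT REFERENCES sessions(uuid),
    type         TEXT,
    url          TEXT,
    anchor_text  TEXT,
    page         TEXT,
    timestamp    TEXT
);
```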

#### Init

Once an init request is received by the worker, it checks that the UUID and page are set. If they are, a new DB entry is created with `datetime('now')` as the timestamp.

```
if (path === '/init') {
    const { uuid, page, referrer } = body;

    if (!uuid || !page) {
        return new Response('Bad Request', { status: 400, headers: corsHeaders });
    }

    await env.DB.prepare(`
        INSERT INTO sessions (uuid, ipan, bot_score, referrer, page, timestamp, time)
        VALUES (?, ?, ?, ?, ?, datetime('now'), 0)
    `).bind(uuid, ipan, botScore, referrer ?? '', page).run();

    return new Response('OK', { status: 200, headers: corsHeaders });
}
```

#### Heartbeat

The heartbeat is more interesting: we not only check that the UUID is set, but also that the delta sent by the client is whitelisted. If either check fails, we return a 400 Bad Request response.

```
if (path === '/heartbeat') {
    const { uuid, delta } = body;

    if (!uuid || ![1, 5, 30].includes(delta)) {
        return new Response('Bad Request', { status: 400, headers: corsHeaders });
    }

    await env.DB.prepare(`
        UPDATE sessions SET time = time + ?
        WHERE uuid = ?
    `).bind(delta, uuid).run();

    return new Response('OK', { status: 200, headers: corsHeaders });
}
```

#### Events

Lastly, the event endpoint is the most complex to implement. It checks the event type sent by the client against a whitelist of valid types. We also check that a DB entry for the UUID exists; if not, we return a 401 Unauthorized. If everything is valid, we insert the event into the database.

```
if (path === '/event') {
    const { uuid, type, url: eventUrl, anchor_text, page } = body;

    // one type per event category described above (names illustrative)
    const validTypes = [
        'internal_click', 'outbound_click', 'referral_click',
        'newsletter_signup', 'download',
        'scroll_depth', 'section_time', 'code_copy'
    ];

    if (!uuid || !type || !validTypes.includes(type)) {
        return new Response('Bad Request', { status: 400, headers: corsHeaders });
    }

    // Only insert if session exists - prevents spoofed UUIDs
    const session = await env.DB.prepare(`
        SELECT uuid FROM sessions WHERE uuid = ?
    `).bind(uuid).first();

    if (!session) {
        return new Response('Unauthorized', { status: 401, headers: corsHeaders });
    }

    await env.DB.prepare(`
        INSERT INTO events (session_uuid, type, url, anchor_text, page, timestamp)
        VALUES (?, ?, ?, ?, ?, datetime('now'))
    `).bind(uuid, type, eventUrl ?? '', anchor_text ?? '', page ?? '').run();

    return new Response('OK', { status: 200, headers: corsHeaders });
}
```

### Security and Abuse

To prevent abuse of the endpoints, we have to enable rate limiting in Cloudflare. We do not do this inside the worker, as a blocked request would still count as a worker invocation. Instead, we use WAF rules to block clients based on the number of requests in the last n seconds. This way, unwanted requests never reach the worker in the first place.   
**Note:** This is very important for anything other than the free tier, as other plans include pay-as-you-go billing!
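
Cloudflare's rate limiting rules are configured in the dashboard; a sketch of such a rule follows, where the endpoint path and the thresholds are assumptions:

```
# When incoming requests match:
starts_with(http.request.uri.path, "/api/analytics")

# With the same characteristics: IP address
# Rate: more than 30 requests per 10 seconds
# Then: block for 10 seconds
```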

Furthermore, whenever user data is inserted into the database, we validate it against a whitelist where possible and use prepared statements to prevent SQL injection.

### Limitations

The main limitation of our analytics solution is session tracking. Multiple users behind the same NAT, e.g. at universities or companies, are collapsed into a single user. If a user's IP changes, they count as a new user, not a returning one. Lastly, bot detection works quite well, but improving it further would require either Google reCAPTCHA v3 or paying for Cloudflare's bot management.

Conclusion
----------

Overall, I am very happy with how the site works now. Updating is just a git push, writing new articles no longer requires copying whole HTML pages, and updating the style of the whole blog is easy. No more FTP, no more manually uploading files and hoping I did not miss one, no more breaking the site's old posts with JS or CSS updates, and of course proper version tracking with git.

The new analytics solution works very well; in other words, "it just works". While hosting the blog at 1and1 (Ionos) with PHP, I already used a similar custom solution. During the move to GitHub Pages I tried Google Analytics for a short time but had major issues with it: no real referrer tracking, no accurate data due to ad blockers, data sharing with Google, consent redirects, and so on.

Writing new posts is finally just that: writing new posts. No more fiddling around with HTML. It is not only more fun but also much faster.

Not using server-side PHP also makes the site more performant, as the HTML, JS, and CSS are efficiently cached by Cloudflare, something the PHP site failed to do.

Last but not least, the only real running cost of the blog is also gone.