Migrating the API to Cloudflare Workers
This is an old blog post that I typed up circa 2021 but never got around to publishing. It goes into a little detail about how the API was operating before, the architecture behind it, and how it operates in Cloudflare now. Read this document as if it were 2020.
StopForumSpam started in 2007, and within 6 months we were serving 5,000 requests directly out of MySQL. As demand grew over the next 5 years to levels well beyond what was ever expected, the API underwent several redesigns, ending up where it is now: serving about 400-500 requests per second.
As StopForumSpam grew, so did the need to reduce the request latency and increase its fault tolerance. This is where Cloudflare Workers shines.
The most recent major change at StopForumSpam has seen the API migrate to running in Cloudflare Workers… and here is the story of how it happened, told from the perspective of someone who is anything but a professional developer. I cover how the API has operated for the last several years, how it stored and processed data, how some of the architecture had to change to operate in Workers, and the lessons learned.
Workers is a “serverless” platform for executing lightweight JavaScript on the edge of Cloudflare’s network, in over 140 locations worldwide.
This allows API queries to execute in the closest Cloudflare datacenter to your server instead of requests having to travel the globe as they would with a traditional web server application.
Legacy Architecture
Redis, Redis, and more Redis
At the heart of the API for the last several years is our custom version of Redis. Redis is an extendable, screaming-fast in-memory NoSQL DB with a lightweight data protocol and a robust mechanism for real time replication. As records are pushed into Redis, they’re distributed to all the API nodes through that replication mechanism.
This puts the data on the API node closest to your server, either in Europe or on each coast of the USA, with requests being routed one of two ways depending on the domain used for queries. If you used stopforumspam.org, you hit Sucuri's geographic Anycast routing system directly. If you used stopforumspam.com, API requests were routed via a Cloudflare Worker that rewrote the host headers before forwarding the request to Sucuri. The final hop was the Sucuri network sending the request to your closest API server.
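As an illustration, the stopforumspam.com rewrite worker would have looked something like this (a sketch only; the upstream hostname is a placeholder, not the real one):

// Rewrite the hostname and proxy the request on to the Sucuri-fronted API.
addEventListener('fetch', event => {
  const url = new URL(event.request.url);
  url.hostname = 'api.stopforumspam.org'; // placeholder upstream
  event.respondWith(fetch(new Request(url.toString(), event.request)));
});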
This is where design changes were needed to support the transition to new data structures and new methods of accessing them. Before discussing the data structure migration, here is how the two data types are structured in Redis.
- API data is inserted into Redis using hash sets, with a key derived from the first two bytes of the record hash and a field derived from the remainder of the hash, up to 64 bits (with some longer values for older legacy data). The record itself (last seen date, count total, country etc) is built and stored using the PHP pack() format. This use of binary packed hash sets and binary packed data reduces memory usage from the 4GB of a full primary key strategy to 250MB, which is small enough to run API nodes on cheap VPS servers.
- Extra non-spam related IP data, such as IP to country and IP to ASN mappings, is stored in Redis interval sets, which allow you to “stab” at ranges of IP addresses, pulling data for the ranges your IP address falls in.
A small example, adding an email address
md5("spam@spam.com") = ea5036994d8e3d00fe4c9ede36c2d05a
which gives us the key value and field
hash key = "ea50"
hash field = "36994d8e3d00"
and throw in some example record data
packedRecord = "\x83\xa1c\x02\xa1t\xce^\x14|\x18\xa1n\xa2id"
gives the following interaction on insert
> hset "\xea\x50" "\x36\x99\x4d\x8e\x3d\x00" "\x83\xa1c\x02\xa1t\xce^\x14|\x18\xa1n\xa2id" ↲
This uniformly distributes all API data into 65,536 sets of almost equal size.
> hgetall "\xea\x50" ↲
1) "'A\xca\xae\xc4\x0c"
2) "\x82\xa1c\x02\xa1t\xce\]\xba\xf8\xf8"
3) "x\xa8\xb9\x9d\xa5\xcc"
...
156) "\x82\xa1\x01\xa1t\xce\_u\x8e\xe7"
157) "\x15\x1bu\*\x1f\xdb"
158) "\x82\xa1c\x01\xa1t\xce\_wDD"
To check an API request for any particular data, a call to Redis with the hash key and field pair will return a record if one exists, e.g.
> hget "\xea\x50" "\x36\x99\x4d\x8e\x3d\x00" ↲
1) "\x83\xa1c\x02\xa1t\xce^\x14|\x18\xa1n\xa2id"
The second Redis data structure is used for IP to ASN and IP to country lookups. For example, for IP to ASN/country (with 202968593 being the integer value of 12.25.14.17):
> istab ip_data 202968593 ↲
1) "7018:us"
This is a very powerful data structure in Redis for this use case, and one that is very complex to maintain properly using native Redis commands, especially when IP ranges overlap. A note: interval sets are not in the main Redis repository, as the pull request adding them was rejected.
Migrating to Workers and Workers KV
The first data model change meant taking the hash sets from Redis and transforming them into a data structure that works within WorkersKV, with records stored as json. Before any strategy was decided on, it was important to understand the datastore itself.
WorkersKV is an eventually consistent distributed primary key datastore. The simplest way to describe it, within the scope of this bit of work, is that it's a large datastore located somewhere in California; when you request data from it, it's served from a local cache if it's there and retrieved from the main datastore if it's not. When you write to it, the data is committed to a local cache before eventually reaching the main datastore. This takes a second or ten, but there is no guaranteed upper bound on the delay.
The way in which WorkersKV transports data meant that the data had to be sharded in order to operate the API at the speeds clients require. Minimizing the record size reduces latency and the amount of wasted bandwidth; however, shrinking the shards too far results in a very large number of them. That increases the chance of orphaning data and dramatically increases the time a WorkersKV data import takes, because of the limit on the number of buckets that can be processed in each Cloudflare API write request.
After slicing the data sets into different key lengths and shard sizes, a mirror of the existing strategy was chosen: the existing 65,536 hash sets, each stored in WorkersKV as a json object under a key derived from the first two bytes (four hexadecimal characters) of the record hash, mimicking the Redis hash set strategy.
Cloudflare has a lot of detailed information about WorkersKV available at https://blog.cloudflare.com/workers-kv-is-ga/
As memory limitations are no longer the issue in WorkersKV that they were in Redis, full hashes are now used as the record key. Each stored value is about 12KB in size, or about 3KB max when compressed as it hits the network.
API data now looks like this in WorkersKV - json formatted data:
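Something along these lines, with field names following the packed records above (the hashes and values here are illustrative only):

{
  "ea5036994d8e3d00fe4c9ede36c2d05a": { "c": 2, "t": 1578662936, "n": "id" },
  "ea50a1b2c3d4e5f60718293a4b5c6d7e": { "c": 14, "t": 1578401010, "n": "us" }
}

With that layout, an API lookup becomes a single KV read followed by a property check. A sketch, assuming the DATASTORE binding that appears later in this post:

// Fetch the shard for a record hash and return its record, if any.
async function lookupRecord(recordHash) {
  const shard = await DATASTORE.get(recordHash.slice(0, 4), 'json');
  return (shard && shard[recordHash]) || null;
}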
With the data structured, the next problem was avoiding key collisions when submitting data to a non-atomic database where the same keys can be updated in quick succession. While there is no guarantee that old data doesn’t overwrite newer data in WorkersKV, a 30 second processing window should theoretically mitigate any issue with delivery delay to the eventually-consistent primary datastore. Don't get me wrong, the data model of WorkersKV is a solid one; you just have to be aware of it. A 30 second window is probably overkill, but it's easily run on a cron schedule with enough time to process each batch before the next is due to start.
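As a rough sketch of what each window implies (the helper and field names are mine, not the production code): group the pending updates by shard key, so each shard is written at most once per batch.

function groupByShard(updates) {
  const shards = {};
  for (const update of updates) {
    const shardKey = update.hash.slice(0, 4);       // same 2 byte sharding as above
    shards[shardKey] = shards[shardKey] || {};
    shards[shardKey][update.hash] = update.record;  // newest record wins
  }
  return shards; // one WorkersKV write per shard key
}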
The second structure to migrate was the interval sets used for ASN and country lookups. While I would have loved a fast abstraction layer to do this, the datastore-in-KV layers I tried ended up being unacceptably slow and costly, so the solution was to put the ranges into arrays (sharded at the top /8 network) for each IPv4 and IPv6 subnet and run a binary search over them. An average search took about 5-6 integer comparisons, which executed quickly, returning the ASN and country for any given IP address.
Enter the most basic of searches: a binary search, with range support and support for the enormous number space of IPv6.
// Binary search over an array of {s, e, a, c} range objects, where s and e
// are the range start/end addresses as numeric strings, a is the ASN and
// c is the country. BigInt lets the same code cover the 128 bit IPv6 space.
function binarySearch(arrayOfAddresses, value) {
  if (arrayOfAddresses === null || value === null) {
    return false;
  }
  let left = 0;
  let right = arrayOfAddresses.length - 1;
  value = BigInt(value);
  while (left <= right) {
    const mid = Math.floor((left + right) / 2);
    if (BigInt(arrayOfAddresses[mid].s) <= value &&
        value <= BigInt(arrayOfAddresses[mid].e)) {
      // the address falls inside this range, return its metadata
      return {
        'asn': arrayOfAddresses[mid].a,
        'country': arrayOfAddresses[mid].c
      };
    } else if (value < BigInt(arrayOfAddresses[mid].s)) {
      right = mid - 1;
    } else {
      left = mid + 1;
    }
  }
  return false; // not found
}
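For example, using the 12.25.14.17 lookup from earlier against a hypothetical /24 range (the range values are made up for illustration):

const shard = [
  { s: "202968576", e: "202968831", a: 7018, c: "us" } // 12.25.14.0 - 12.25.14.255
];
binarySearch(shard, 202968593); // => { asn: 7018, country: "us" }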
Once the data structures were done, it was time to learn some proper JavaScript. I'd only ever written a bit of JavaScript before, and then only to tinker around on the website front end.
After a couple of weeks of on-and-off late evening coding, I was happy enough to let someone else look at it, and they didn’t laugh (or they didn’t tell me that they had). A crudely running API was working with the KV backend. Win!
Performance
Each iteration of code was focused on increasing API performance, such as removing large loops or adding caching.
Some of the early testing showed unacceptable levels of performance that put the query time beyond the permitted 50ms.
The real gotcha was how I was storing the API configuration. The configuration contains lists of blacklisted IP addresses, domains, and other settings required to process requests. It was pulled from the database and then stored in a global variable so that it would persist between requests on each worker, for the life of the worker… or so I thought.
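The pattern was essentially this (a reconstruction, not the actual code):

// Module scope: survives between requests for the life of the worker
// isolate... in theory.
let globalSettings = null;

async function getSettings() {
  if (globalSettings === null) {
    globalSettings = await DATASTORE.get('configuration', 'json');
  }
  return globalSettings;
}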
Testing during development showed this to be the case, so that’s how it was initially deployed. Reality ended up being a very different story that became apparent once Cloudflare deployed WorkersKV analytics.
The metrics showed a huge amount of traffic, numbers that just didn’t match the number of API requests being made. It all pointed to the configuration variable persistence not working as well as intended. Sure, the value was cached, but I was unhappy with all the unnecessary latency and network/process traffic that could be removed.
The solution here was obvious: avoid having the configuration in WorkersKV at all. Easy: you don't push the configuration into the database; instead, you develop a system for updating the configuration in as near real time as possible.
I changed the code to pull the configuration from a json file which is built with the project on deploy:
let globalSettings = await DATASTORE.get('configuration', 'json');
was replaced with
let globalSettings = require('./config.json');
The reduction in WorkersKV traffic was significant, almost embarrassingly so. As the configuration is included in the code and is available at initialisation, there is no database or cache overhead.
This change, along with the introduction of an LRU cache, resulted in acceptable latency.
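The cache itself isn't shown here, but a minimal sketch of the kind of in-memory LRU that can sit in front of KV lookups might look like this (the class and size limit are my own illustration, not the production code):

// Minimal LRU cache built on Map, which preserves insertion order.
class LRUCache {
  constructor(maxEntries = 1000) {
    this.maxEntries = maxEntries;
    this.map = new Map();
  }
  get(key) {
    if (!this.map.has(key)) return undefined;
    const value = this.map.get(key);
    this.map.delete(key); // re-insert to mark as most recently used
    this.map.set(key, value);
    return value;
  }
  set(key, value) {
    this.map.delete(key);
    this.map.set(key, value);
    if (this.map.size > this.maxEntries) {
      // evict the least recently used entry (first key in the Map)
      this.map.delete(this.map.keys().next().value);
    }
  }
}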
Over a week, we saw a steady flow of traffic without errors.
It wasn’t until the API had gone live on Workers that a lot of the metrics started to show the larger picture. If I had to change one thing, it would be to reshard the buckets using 18, or even 20, bits instead of 16, so as to reduce the WorkersKV traffic. The code change is simple enough, but it would require MySQL schema and trigger changes. It’s on the list of things to do in the next API version.
Continuous Deployment
This brings me to the next issue, how to continually deploy configuration changes to Workers.
The json configuration file is built by a script when it detects changes to the main configuration stored on the MySQL server. This new file is then signed and pushed to a validation worker that provides an interface to GitHub.
The code accepts a POST containing the configuration, validates that the json matches the signature in the HTTP header, and then uploads the configuration to GitHub using their API. To update a file through the GitHub API, you need the SHA of the existing destination file, so this worker connects, gets that hash, and then includes it with the commit.
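A sketch of that upload step, assuming the GitHub contents API; the repository path, token handling and function name here are placeholders:

// Commit a new config.json to GitHub from inside a worker.
async function pushConfig(configJson, token) {
  const url = 'https://api.github.com/repos/OWNER/REPO/contents/config.json';
  const headers = {
    'Authorization': `token ${token}`,
    'User-Agent': 'sfs-config-worker',
    'Accept': 'application/vnd.github.v3+json'
  };
  // The contents API requires the blob SHA of the file being replaced.
  const current = await fetch(url, { headers }).then(r => r.json());
  return fetch(url, {
    method: 'PUT',
    headers,
    body: JSON.stringify({
      message: 'Update API configuration',
      content: btoa(configJson), // the API expects base64 content
      sha: current.sha
    })
  });
}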
Once the configuration is committed to the source tree, a GitHub Action is triggered. Wrangler Actions provide the means to control a workflow; in this case, a workflow that checks out the worker source code, including the new configuration, builds it, and deploys it via the Cloudflare API.
This trigger is controlled by the .github/workflows/main.yaml file, which uses GitHub Secrets to secure the Cloudflare API key required to deploy the new code. Whenever an update to config.json is pushed to GitHub, the worker code is rebuilt and deployed to Workers.
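A main.yaml along these lines would do the job; this is a hedged reconstruction based on the wrangler-action documentation, with a placeholder secret name:

# Rebuild and deploy the worker whenever the configuration changes.
name: Deploy API worker
on:
  push:
    branches:
      - master
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Publish to Cloudflare Workers
        uses: cloudflare/wrangler-action@1.3.0
        with:
          apiToken: ${{ secrets.CF_API_TOKEN }}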
You can read more about Wrangler GitHub Actions at https://github.com/marketplace/actions/deploy-to-cloudflare-workers-with-wrangler
So once it's all put together, we have a system where non-API requests, such as searching and submitting spam data, are served by the main site, and all API queries are served by Workers in the closest Cloudflare data center to your server.