Thursday, January 13, 2011

HTTP chunks and onreadystatechange

One of the features of HTTP 1.1 is "chunked transfer encoding". Rather than send a Content-Length header followed by the entire document, it is possible to transmit the body as a series of chunks, each with its own length declaration. This lets you start sending the beginning of the document before you know how long it's going to be.

It also makes Comet "streaming" possible, letting you trickle down data without the overhead of a full HTTP request for each message. This depends on your browser telling you when new chunks arrive. As you might guess, this isn't supported by Internet Explorer. But all other major browsers that I've tried (Firefox, Chrome, Safari) will fire multiple XMLHttpRequest onreadystatechange events (readyState == 3) as additional parts of the document are received.

Here's MochiWeb's implementation of chunked transfer encoding, which is pretty straightforward:

%% @spec write_chunk(iodata()) -> ok
%% @doc Write a chunk of a HTTP chunked response. If Data is zero length,
%% then the chunked response will be finished.
write_chunk(Data) ->
    case Request:get(version) of
        Version when Version >= {1, 1} ->
            Length = iolist_size(Data),
            send([io_lib:format("~.16b\r\n", [Length]), Data, <<"\r\n">>]);
        _ ->
            send(Data)
    end.

For each chunk, you send the size as a hexadecimal integer, followed by a CRLF ("\r\n"), followed by that number of bytes of data, followed by another CRLF. A zero-length chunk marks the end of the response. On the client, the web browser stitches the segments together, appending the data to responseText.
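The framing described above can be sketched in JavaScript as follows. This is only an illustration of the wire format, not MochiWeb's code; the name encodeChunk is made up for this example, and it assumes the data is a string sent as UTF-8.

```javascript
// Hypothetical helper (not part of MochiWeb): frame one chunk as
// <size in hex>\r\n<data>\r\n, assuming the data is sent as UTF-8.
function encodeChunk(data) {
  // The size counts bytes on the wire, not JavaScript characters.
  const byteLength = new TextEncoder().encode(data).length;
  return byteLength.toString(16) + "\r\n" + data + "\r\n";
}
```

With this sketch, encodeChunk("") produces "0\r\n\r\n", which is the zero-length chunk that finishes the response.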

When designing Kanaloa's streaming protocol, I initially took it for granted that each chunk would have its own onreadystatechange event. This made parsing the chunks simple; in my case, I just sent down a valid JSON array in each chunk, kept track of how much responseText I'd already seen on the client, and called JSON.parse on the difference.
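A minimal sketch of that first, naive approach might look like the following. It is a reconstruction for illustration, not Kanaloa's actual code: makeNaiveHandler and onMessage are hypothetical names, and the handler assumes that everything appended since the last event is exactly one JSON array.

```javascript
// Hypothetical reconstruction of the naive approach (not Kanaloa's code).
// Returns a function suitable for assigning to xhr.onreadystatechange.
function makeNaiveHandler(xhr, onMessage) {
  let seen = 0; // how many characters of responseText were already handled

  return function () {
    // readyState 3 = receiving data, 4 = response complete.
    if (xhr.readyState !== 3 && xhr.readyState !== 4) return;

    const text = xhr.responseText;
    if (text.length === seen) return; // nothing new since the last event

    const delta = text.slice(seen);
    seen = text.length;

    // Assumption: delta is exactly one JSON array. JSON.parse throws a
    // SyntaxError when a chunk was split or several were concatenated.
    onMessage(JSON.parse(delta));
  };
}
```

It would be hooked up with something like xhr.onreadystatechange = makeNaiveHandler(xhr, handleMessage), where handleMessage is whatever the application does with each array.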

The first thing I noticed was that sometimes single chunks would be split across multiple events. I theorized that this resulted from them being put into multiple TCP packets, and indeed limiting the chunk size to the typical TCP segment size seemed to fix this problem.

The next thing I noticed was that sometimes multiple chunks would be concatenated into the same event.  This was also a problem, as JSON.parse needs a valid expression, and '["foo"]["bar"]' wasn't cutting it.

You can see both of these cases demonstrated here:
http://schwink.net/blog/chunker/client/
The small chunks are often concatenated, and the large chunks are split.

I took a look at the TCP packets in Wireshark, and was struck by two things. First, the small messages do in fact arrive as their own separate TCP packets, at roughly equal intervals in the case I examined. So in some cases it is the browser that stitches them together into the same event.

Secondly, in the cases where a chunk is split across multiple events, the event boundaries do correspond with the packet boundaries.

So I think we can conclude that the browser simply reads incoming packets into its responseText buffer, and fires onreadystatechange for each. If your script is still running from the previous event, it just makes the new text available to you when you read responseText, rather than waiting to fire another event later.

This sort of raises the question of whether we could construct a scenario where additional text gets appended to responseText without you being notified, or where it changes between multiple reads of that field by the same JavaScript thread. But I've just about had my fill of this topic for now : )

In the end I had to do what I'd been hoping to avoid from the start, and write my own logic to split the response, rather than trust the events to delineate them. The result may be the world's simplest and least featureful JSON parser, whose only job is to split a string into substrings that encode JSON arrays, which can in turn be properly deserialized. But it seems to work, and because it leaves unterminated arrays untouched, I can now also receive messages of arbitrary size that span multiple chunks.
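A splitter along these lines could be sketched as below. This is not the code I ended up shipping, just a minimal illustration of the idea: splitJsonArrays is a hypothetical name, and it assumes the stream contains nothing but top-level JSON arrays. It tracks bracket depth, skips over brackets inside string literals, and leaves an unterminated array alone so that it can be picked up again once more text has arrived.

```javascript
// Hypothetical splitter (illustrative only). Scans text from index `start`
// and returns each complete top-level JSON array as a substring, plus the
// index just past the last complete array.
function splitJsonArrays(text, start) {
  const arrays = [];
  let depth = 0;        // current nesting depth of [ ]
  let inString = false; // inside a JSON string literal?
  let escaped = false;  // previous character was a backslash in a string
  let begin = -1;       // where the current top-level array started
  let consumed = start; // resume point for the next call

  for (let i = start; i < text.length; i++) {
    const c = text[i];
    if (inString) {
      if (escaped) escaped = false;
      else if (c === "\\") escaped = true;
      else if (c === '"') inString = false;
    } else if (c === '"') {
      inString = true;
    } else if (c === "[") {
      if (depth === 0) begin = i;
      depth++;
    } else if (c === "]" && depth > 0) {
      depth--;
      if (depth === 0) {
        arrays.push(text.slice(begin, i + 1));
        consumed = i + 1;
      }
    }
  }
  // Anything after `consumed` is an unterminated array; it is left untouched.
  return { arrays: arrays, consumed: consumed };
}
```

On each event, the caller would pass responseText and the consumed value from the previous call, run JSON.parse on each returned substring, and remember the new consumed value. An array that spans several chunks simply stays pending until its closing bracket shows up.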