Moohar Archive

Detecting updates

2nd May 2024

I want to update the server so it better detects changes on the file system. At the moment it doesn't really do that. This came from my paranoia about the server accidentally serving files I didn't want it to. My strategy was to scan the content folder, build a list of valid paths and validate incoming requests against that list.

As I stated at the time, this gave me confidence in what was being served, but I would need to restart the server every time I added content. Most of the time this is fine, but when I'm writing and proofreading new posts it's a bit of an annoyance. More so with blog posts, because they are generated pages, and they are only generated when the routes list is built as the server starts.

It's interesting how you can normalise annoying things. I got used to saving the source file, killing the server, restarting it and then refreshing the page to continue my review. I was recently doing some development at work and not having to continually restart the server was really refreshing. It shouldn't have been; that should be the norm, and it highlighted that it was time to fix this in Tinn.

I've had some time to think about this and my new strategy is to ditch the pre-built list of routes. Instead I shall first validate the incoming URL to make sure it's not doing anything sneaky and then pass that URL to the various…err…content generators. I wanted to say "app", but we're not there yet. For now I have two content generators: the blog and static files. The routing code will ask each generator in turn if it wants to deal with the URL; if it declines, it will ask the next, and so on. If they all decline, I'll return a "404 - Content not found" error. In the future, if I add any other content generating code, it can be added to the list of content generators to query.

Validating the request target

I'm thinking that I should break the requested path up by directory. Then look to see if any of those directories are . (aka the current directory) in which case I could ignore them, or .. (aka the parent directory) in which case I should jump back a level. I should keep track of how many levels deep I am and if a .. tries to access the content directory’s parent, throw an error. The result would be a canonical path without any jumping back and forth and I think it would be safe from escaping the content directory.

I thought I should double check the spec and make sure I understood the format of the incoming URI. This actually means looking at four RFCs: HTTP/1.1 (9112), HTTP Semantics (9110), Uniform Resource Identifier (3986) and Augmented BNF for Syntax Specifications (5234). It's exciting stuff. Anyway, the most relevant information starts in section 3.2 of HTTP/1.1, which describes the request-target. There are actually four formats for request-targets; however, only one applies to GET requests sent directly to an origin server, so I only have to worry about the origin-form format.

This is basically what I expected, a path which consists of a sequence of segments separated by a forward slash (/). I’ve just learnt that the bits in a path between the slashes are called segments. The . and .. segments are called dot segments and it seems there is a well established algorithm for resolving these relative parts into an absolute path. Reassuringly it matches my thoughts above.
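The standard algorithm (RFC 3986 section 5.2.4, "Remove Dot Segments") boils down to keeping a stack of segments. Here is a minimal standalone sketch of the idea, not the Tinn code (and it ignores the trailing-slash subtleties for now):

```c
#include <assert.h>
#include <string.h>

// Resolve "." and ".." segments in an absolute path, writing the canonical
// result to out. Returns 0 on success, -1 if ".." tries to escape the root.
// A standalone sketch of the idea, not the Tinn implementation.
static int resolve_path(const char* path, char* out, size_t out_size) {
	const char* starts[64]; // start of each kept segment
	size_t lens[64];
	size_t count = 0;

	const char* p = path;
	while (*p == '/') {
		p++; // step over the slash, p now points at a segment
		size_t len = strcspn(p, "/");
		if (len == 1 && p[0] == '.') {
			// "." means current directory: ignore it
		} else if (len == 2 && p[0] == '.' && p[1] == '.') {
			if (count == 0) return -1; // ".." would escape the root
			count--; // ".." means parent directory: drop the last segment
		} else if (count < 64) {
			starts[count] = p;
			lens[count] = len;
			count++;
		}
		p += len;
	}

	// reassemble "/seg/seg" (or just "/" if nothing is left)
	size_t pos = 0;
	for (size_t i = 0; i < count && pos + 1 + lens[i] < out_size; i++) {
		out[pos++] = '/';
		memcpy(out + pos, starts[i], lens[i]);
		pos += lens[i];
	}
	if (pos == 0) out[pos++] = '/';
	out[pos] = '\0';
	return 0;
}
```

A "." segment is dropped, a ".." segment pops the previous segment, and popping past the root is treated as an error, which is exactly the escape attempt we want to reject.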

Otherwise the segments can have quite a large range of characters, really most things except a forward slash or a question mark (?). No slash because this delimits the segments and no question mark because this denotes the end of the path and the start of the query. In other contexts, web browsers for example, we would also be looking for the hash (#) to end the path and start the fragment, but on the server side this is invalid, in fact the hash is simply not allowed anywhere in a request-target.

I was surprised by the range of other characters segments could have. There were several I thought would be invalid, for example colon (:), which is fine apparently. This is because I was thinking about what would be acceptable characters for file names, and I was thinking about Windows. I made two mistakes here. First, paths in HTTP don't have to relate to file systems; they often do, but they can be interpreted by the server in any way the implementer (in this case me) wishes, and the spec is open enough to allow that. Second, it turns out that colon is a valid character for file names on Linux. Linux supports any character other than null and forward slash; some are problematic and probably best avoided, but they are supported.

If you collect together all the relevant Augmented Backus-Naur Form (a language used to describe formal languages) for the definition of the origin-form type of request-target you get the following:

request-target	= origin-form / absolute-form / authority-form / asterisk-form

origin-form		= absolute-path [ "?" query ]

absolute-path	= 1*( "/" segment )
segment			= *pchar
query			= *( pchar / "/" / "?" )

pchar			= unreserved / pct-encoded / sub-delims / ":" / "@"
unreserved		= ALPHA / DIGIT / "-" / "." / "_" / "~"
pct-encoded		= "%" HEXDIG HEXDIG
sub-delims		= "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="

ALPHA			=  %x41-5A / %x61-7A	; A-Z / a-z
DIGIT			=  %x30-39				; 0-9
HEXDIG			=  DIGIT / "A" / "B" / "C" / "D" / "E" / "F"

If you can read it, this tells you a lot of what you need to know about request targets. One of my questions it answered: you can't have a blank target, you have to have at least one forward slash. As it happens, all origin-form request targets start with a forward slash. It also clarifies exactly which characters are allowed in a request-target. It's very helpful when doing what I'm about to do next, which is write code to validate an incoming request.

sometime later

I created a new data structure URI to store the various parts of the target request.

typedef struct {
	bool valid;
	char* data;
	char* path;
	size_t path_len;
	char* query;
	size_t query_len;
	char** segments;
	size_t segments_count;
} URI;

I coupled this with a new uri_new function to populate the structure and, as is my habit, a uri_free to clean up. The uri_new function takes a Token, this being the target I already extracted from the request's start line. I felt the urge to build a parser that did both jobs in one but resisted for now. After allocating space for the actual structure, it allocates space for a copy of the token, the address of which is stored in the data property. This copy will be chopped up into the various parts later, but for now I just append a null terminator.

URI* uri_new(Token token) {
	URI* uri = allocate(NULL, sizeof(*uri));

	uri->data = allocate(NULL, token.length + 1);
	memcpy(uri->data, token.start, token.length);
	uri->data[token.length] = '\0';

The function then populates the other properties with some default values. The path gets set to NULL for now. query is set to point at the last character in the data string, which is a null terminator, making query an empty string. Space is then allocated for segments which is an array of character pointers and will be used to point to each segment in the path.

	uri->path = NULL;
	uri->path_len = 0;
	uri->query = uri->data + token.length;
	uri->query_len = 0;

	int max_segments = 8;
	uri->segments = allocate(NULL, max_segments * sizeof(*uri->segments));
	uri->segments_count = 0;

The last of the setup code sets the valid property to true. If I hit any problems with the URI later I’ll change this. Which brings me to my first check, ensuring the URI starts with a forward slash.

	uri->valid = true;

	// validate URI starts with a forward slash
	if (token.length==0 || uri->data[0]!='/') {
		uri->valid = false;
		return uri;
	}

Now for the real work. Checking each character in the URI in turn. If it’s a forward slash I’ve found a segment boundary. If it's a question mark I’ve found the start of the query. Otherwise I need to check if the character is in the list of valid characters.

When finding a question mark, I change the question mark character (in the copy of the string pointed to by data) into a null terminator. This is so any string functions reading the path before this point will now stop here. I then update the query property to point to the next character so I can access the query string as needed. Reading the spec closely shows that forward slashes and question marks are valid characters in queries. Any special meaning they may have is up to whatever is parsing the query to determine. Therefore I need to keep track of whether I'm processing the path or the query and act appropriately when I find a forward slash or question mark. To do this I use a simple in_path flag, set to true initially, and change it to false when I find the first question mark.

	// scan the URI looking for segments (directories), the start of the query
	// and invalid characters
	bool in_path = true;
	for (int i=0; i<token.length; i++) {
		switch (uri->data[i]) {
			case '/':
				if (in_path) {
					// create a segment...
				}
				break;
			case '?':
				if (in_path) {
					uri->data[i] = '\0';
					uri->query = uri->data + i + 1;
				uri->query_len = token.length - i - 1;
					in_path = false;
				}
				break;
			default:
				if (strchr(valid_chars, uri->data[i])==NULL) {
					uri->valid = false;
					return uri;
				}
		}
	}

To validate other characters I use the library function strchr to get the character's position in a string that contains all the valid characters. If this returns NULL the character is not in the string and therefore not valid.

static const char* valid_chars = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~%!$&'()*+,;=:@";

The code to process segments is a bit more complex. First I need to make sure I have enough space in the array and expand it if required. Then, as with the query, I replace the forward slash with a null terminator and point the next segment in the array at the next character.

			case '/':
				if (in_path) {
					if (uri->segments_count == max_segments) {
						max_segments *= 2;
						uri->segments = allocate(uri->segments, max_segments * sizeof(*uri->segments));
					}
					uri->data[i] = '\0';
					// deal with dot segments?
					uri->segments[uri->segments_count++] = uri->data + i + 1;
				}
				break;

The complexity comes from dealing with dot segments. This is a fence post problem. When I find a forward slash, the fence post in this analogy, I need to consider the segment before and after the slash, the fence panels either side of the post. I need to review the previous segment, which I can now do because it’s complete, and check if it was a dot segment and then remove it. This is where the fence post analogy falls apart. Pretend I didn’t mention fences. Basically I have to look forwards and backwards. And, this is a common pitfall in this kind of problem, remember to process the very last segment.

I wrote a helper function to review the previous segment and manipulate the segment count if it's a dot segment. If it's a single dot, reduce the count by one because I don't need that segment; if it's two dots, reduce the count by two because I need neither it nor the segment before it. Otherwise leave the count alone. I added some validation to ensure I don't go negative, which should also ensure I don't leave the content directory.

static bool remove_dot_segment(URI* uri) {
	if (uri->segments_count>0) {
		if (strcmp(uri->segments[uri->segments_count-1], ".")==0) {
			uri->segments_count--;
		} else if (strcmp(uri->segments[uri->segments_count-1], "..")==0) {
			if (uri->segments_count>1) {
				uri->segments_count -= 2;
			} else {
				uri->valid = false;
				return false;
			}
		}
	}
	return true;
}

The final task performed by the uri_new function is to re-assemble a path without any of the relative dot segments and save this to a new string pointed to by the path property. This has to be its own memory because the data string has been well sliced up with null terminators by this point.

	// build complete and rationalised path
	size_t lens[uri->segments_count];
	for (size_t i=0; i<uri->segments_count; i++) {
		lens[i] = strlen(uri->segments[i]);
		uri->path_len += 1 + lens[i];
	}
	uri->path = allocate(NULL, uri->path_len + 1);

	size_t pos = 0;
	for (size_t i=0; i<uri->segments_count; i++) {
		uri->path[pos] = '/';
		strcpy(uri->path + pos + 1, uri->segments[i]);
		pos += 1 + lens[i];
	}
	uri->path[pos] = '\0';

Testing this code required more than just pointing my browser at the server with various paths, because the browser does some relative path resolution of its own. Type in http://www.moohar.com/a/b/../c and the browser will remove the /../ before it sends the request. Chrome won't even let you copy it out of the address bar before it updates it. You might ask: if browsers already handle relative paths, why did I just code a solution for them? Well, I haven't tested every browser, there are other ways of making HTTP requests, and evil hackers don't play by the rules. So I used telnet to test.

GET /a/b/../c HTTP/1.1

HTTP/1.1 200 OK
Date: Sun, 21 Apr 2024 20:30:27 GMT
Server: Tinn
Content-Type: text/html; charset=utf-8
Content-Length: 54

<html><body><h1>200 - OK</h1><p>/a/c</p></body></html>

I modified the server while I was testing to return a simple page with just the resolved path so I could check it was working, so this test was a success. I also tested trying to escape the content directory, which thankfully was detected and the server returned an error.

GET /../secret.txt HTTP/1.1

HTTP/1.1 400 Bad Request
Date: Sun, 21 Apr 2024 20:18:33 GMT
Server: Tinn
Content-Type: text/html; charset=utf-8
Content-Length: 99

<html><body><h1>400 - Bad Request</h1><p>Oops, that request target seems invalid.</p></body></html>

I did discover an edge case when the very last segment is a dot segment. The path /a/b/c/. resolved to /a/b/c, which is incorrect; it should resolve to /a/b/c/. The path /. resolved to nothing when it should be /, which is how I spotted the error. Similarly /a/b/c/.. resolved to /a/b when it should resolve to /a/b/.

I corrected the problem by updating the remove_dot_segment function. It now takes an extra parameter to flag when it's processing the last segment, and instead of removing the last segment if it's a dot segment, it changes it to an empty segment.

static bool remove_dot_segment(URI* uri, bool last) {
	if (uri->segments_count>0) {
		if (strcmp(uri->segments[uri->segments_count-1], ".")==0) {
			if (last) {
				uri->segments[uri->segments_count-1][0] = '\0';
			} else {
				uri->segments_count--;
			}
		} else if (strcmp(uri->segments[uri->segments_count-1], "..")==0) {
			if (uri->segments_count>1) {
				if (last) {
					uri->segments_count--;
					uri->segments[uri->segments_count-1][0] = '\0';
				} else {
					uri->segments_count -= 2;
				}
			} else {
				uri->valid = false;
				return false;
			}
		}
	}
	return true;
}

Before I move on I also had to decide what to do with empty segments. For example /a//b is a valid path with three segments: a, an empty segment and b. There is no guidance in the spec on how to treat these. Chrome leaves them alone, backing up my interpretation that they are valid segments. File systems ignore them: /a//b is equivalent to /a/b and /a////////b. The rest of the internet (yes, I checked all of it) is undecided what to do. They are valid, but they mess up search engines probably and caching maybe and should just go away. I made sure they didn't break my code but otherwise left them as empty segments...for now.
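The slicing technique that produces these segments (and why /a//b naturally yields an empty one) can be shown in miniature. This is a standalone sketch with a made-up helper name, not the actual uri_new code:

```c
#include <assert.h>
#include <string.h>

// Split a path in place by turning each '/' into a null terminator and
// recording where the following segment starts. "/a//b" yields three
// segments: "a", "" (an empty segment) and "b". A sketch of the slicing
// technique only; split_segments is a hypothetical name, not Tinn code.
static size_t split_segments(char* path, char** segments, size_t max) {
	size_t count = 0;
	for (char* p = path; *p != '\0'; p++) {
		if (*p == '/') {
			*p = '\0'; // terminate the previous segment
			if (count < max) segments[count++] = p + 1;
		}
	}
	return count;
}
```

Two adjacent slashes simply produce a segment pointer that lands directly on a null terminator, so empty segments fall out of the design for free.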

You can see the full code for the URI module here (uri.h) and here (uri.c).

Content generators

I need to update the blog code to stick around and deal with requests on demand. Currently, as previously explained, it generates all the blog pages when the server first starts and then it’s gone leaving the current routes list code to deal with serving the content. If it’s going to detect changes it needs to stay active and be responsible for serving its own content.

The code already generates a list of posts, so that is not a problem; I just need to not dispose of it and keep a pointer to it…somewhere…let's ignore that detail for a moment. I can check the request target against this list to work out if the target is a valid post. But before I do that, when processing a request I can first compare the modified date of the posts file with when I last parsed it, to work out if a new post has been added and I need to re-generate the list. Then I can check the list. If the target is a valid post, I can check the associated source file's last modified date to see if I need to re-generate the page content. There are a few checks and lists and buffers to manage, but it all seems to make sense and I can just about visualise the code I need to update/write.
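The date comparison itself is cheap thanks to stat. A minimal sketch of the check, using a hypothetical needs_regen helper rather than the actual blog module code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <sys/stat.h>
#include <time.h>

// Return true if path has been modified since last_parsed. If the file
// can't be stat'ed, report it as changed so the caller notices the problem.
// A sketch of the idea; needs_regen is a hypothetical name, not Tinn code.
static bool needs_regen(const char* path, time_t last_parsed) {
	struct stat st;
	if (stat(path, &st) != 0) return true;
	return st.st_mtime > last_parsed;
}
```

Running this on every request is one stat call per check, which is cheap enough that no extra change-notification machinery (inotify and friends) is needed for a site this size.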

I will also need to write a static files module. This will check a request target against the file system to see if the path matches a file and if so, serve it. There are opportunities for caching here, and if I do I will need to check the cached date vs the file modification date to work out if I need to refresh the cached version. I can borrow code from the current routes list and elsewhere, I don’t see this as particularly problematic.

Next question, how does the code that has processed the incoming request (the code in client.c) talk to the content generators (the blog and static file modules)? Via a function call of course. I think I know what the signature of that function will be, the client code passes the content generator the parsed request target and the content generator returns a buffer with its response or NULL if it has none. Except that won’t work. What if the content generator wants to return a redirect or set response headers? The response needs to be more flexible than just a buffer with the response text. Also the content generator needs more than just the request target, it needs to read other headers like If-Modified-Since to work out if it can return a simple 304 - Not modified instead of the full content.

At the moment the request and response data is being held in the ClientState structure with the functionality being hidden inside client.c. I feel I need to break this into request and response structures and open up the functionality so I can build responses from the content generators.

Which brings me back to how to keep a pointer to the blog list and how does the client code access it? I could create a global variable. It would work, but this won’t scale well as I add more content generators. Besides, globals will come back to haunt you. I reviewed how the current routes list works. It is created in the main function all the way back at the start. Then as part of the main network loop, the socket listener gets passed the routes list. It’s part of socket_listener typedef in net.h:

typedef void (*socket_listener)(Sockets* sockets, int index, Routes* routes);

This will have to change. Instead of routes, the main function will build a list of content generators and pass that to the socket listeners instead. I'm just debating in my head if that is the best way to do it; it seems very specific to one particular type of socket listener (HTTP client requests), while the rest of the code in net.c is generic and could be used for any socket listening. The socket list already has a generic state pointer for storing data specific to that socket. I should use that mechanism instead.

Which completes my plan. To summarise, to update the server so it can detect changes I plan to:

1. Ditch the pre-built routes list and validate each incoming request target as it arrives.
2. Turn the blog and static file code into content generators that stay resident, serve requests on demand and check file modification dates to detect changes.
3. Split the request and response handling out of client.c into their own request and response modules.
4. Build the list of content generators in the main function and pass it to the socket listeners via the generic state pointer.

A small warning alarm is going off in my head. The scale of change is an indicator I’m over engineering this. I will sleep on it.

some sleeping later

I've slept. I showered, went to work (it was a manager hat day), went to the gym and jumped around, showered again, ate, then had a quick nap. I still think this is the correct path. So excuse me, I have code to write.

some typing later

I did this in phases.

1. Scaffolding

I started with the scaffolding for defining a list of content generators, populating that list, passing it around and calling the content generators to generate content. This started by defining what that function would look like. After some refinement I landed on the following function signature:

typedef bool (*content_generator)(void* state, Request* request, Response* response);

As I described before, this will be the interface between the network code and the content generating code. All the details of the request will be in the Request structure. The function can use the Response structure to return the response. I initially created some placeholders for these structures, each containing just a buffer.

typedef struct {
	Buffer* buf;
} Request;

typedef struct {
	Buffer* buf;
} Response;

The function also takes a void pointer I’ve labelled state. The intention is this will point to a structure containing all the data that the content generator needs. For example the blog structure will contain the list of parsed blog posts.

To close off the function, it returns a boolean to indicate if it generated a response or not.

Next I needed a list of content generators. This code follows a very familiar pattern to list code I’ve used elsewhere. A structure to store all the variables required, and some functions to create a new list and add entries to it.

typedef struct {
	size_t size;
	size_t count;
	content_generator* generators;
	void** states;
} ContentGenerators;

ContentGenerators* content_generators_new(size_t size);
void content_generators_free(ContentGenerators* content);

void content_generators_add(ContentGenerators* content, content_generator generator, void* state);

I updated the code in the main function of tinn.c to remove the routes code and instead create a list of content generators.

	// create content generators
	TRACE("creating list of content generators");
	ContentGenerators* content = content_generators_new(2);
	content_generators_add(content, blog_content, blog_new());
	content_generators_add(content, static_content, NULL);

I updated the socket_listener typedef so it no longer took a routes list as its third parameter. This required me to update the server_listener function (which it used to ignore the routes list anyway) and the client_listener function which would now need the list of content generators instead. As planned I used the state pointer in the socket list to store the list of content generators.

The main function now calls the server_new function to create the server socket and passes it the content generators list. This function saves the content generators list in a ServerState structure and stores this in the socket list.

void server_new(Sockets* sockets, int socket, ContentGenerators* content) {
	int index = sockets_add(sockets, socket, server_listener);

	ServerState* state = server_state_new();
	state->content = content;
	sockets->states[index] = state;	
}

When the server accepts a connection and creates a new client socket/listener, it passes the pointer for the content generators list to the client.

static void server_listener(Sockets* sockets, int index) {
	// … other code …
		client_state = client_state_new();
		client_state->content = server_state->content;
	// … more code …
}

The client listener can now loop through this list to call each content generator in turn.

			bool ready = false;
			for (size_t i=0; !ready && i<state->content->count; i++) {
				ready = state->content->generators[i](state->content->states[i], request, response);
			}

			if (!ready) {
				response_error(response, 404);
			}

To assist in testing I created some placeholder content functions for the blog and static files.

bool blog_content(void* state, Request* request, Response* response) {
	TRACE("checking blog content");
	return false;
}

bool static_content(void* state, Request* request, Response* response) {
	TRACE("checking static content");
	return false;
}

This is a lot of change to a lot of parts of the program. It's mostly list management and passing that list to the parts of the program that need it. With the placeholder code it was possible to get everything to compile so I could test that the new code worked as expected.

2. Request

Next I moved the responsibility for reading requests to the request module and expanded the Request structure to include all the fields content generators might need. Primarily this means the request target, method and headers. Currently I'm only extracting one header (If-Modified-Since), but as more are required they will go here.

typedef struct {
	bool complete;

	Buffer* buf;
	int content_start;

	Token method;
	URI* target;
	Token version;

	time_t if_modified_since;
} Request;

Most of the associated code to read the request was already written and I simply moved it from client.c to request.c.
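Turning the If-Modified-Since value into the time_t above means parsing an IMF-fixdate. One way to sketch that is with strptime plus the widely available (but non-standard) timegm; this illustrates the approach under those assumptions rather than reproducing the actual request module:

```c
#define _XOPEN_SOURCE 700 // for strptime
#define _DEFAULT_SOURCE   // for timegm on glibc
#include <assert.h>
#include <string.h>
#include <time.h>

// Parse an IMF-fixdate such as "Sun, 21 Apr 2024 20:30:27 GMT" into a
// time_t, returning (time_t)-1 on failure. A sketch of one approach, not
// the actual Tinn request module. timegm interprets the broken-down time
// as UTC, which is what HTTP dates always are.
static time_t parse_imf_date(const char* value) {
	struct tm tm;
	memset(&tm, 0, sizeof(tm));
	if (strptime(value, "%a, %d %b %Y %H:%M:%S GMT", &tm) == NULL) {
		return (time_t)-1;
	}
	return timegm(&tm);
}
```

Using mktime here instead of timegm would silently apply the server's local time zone, a classic source of off-by-hours caching bugs.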

3. Response

The response module was a little more complicated. This is because I wanted to add some flexibility and resilience to how the response is composed. For example I didn't want to have to set the status code before setting any other headers. As the status code is on the start line of the response, if I only use a buffer to compose the response I have to set and write it first. Equally I wanted to be able to set the same header twice but only output it once. This would allow me to set the header to a default value and then later change it should I need to.

To achieve this I added a field to the Response structure to store the status code until I needed it. I also added two arrays for header names and values (and some fields to track the number of headers set).

typedef struct {
	int status_code;

	size_t headers_size;
	size_t headers_count;
	char** header_names;
	char** header_values;

	// err, is this still right? no.
	Buffer* buf;
} Response;

I then created a response_header function to set a header. This checks the current arrays to see if the header is already set; if it is, it overwrites the value, if not, it creates a new entry for the new header. It also deals with all the memory management for the strings, using a utility function copy_string I wrote to copy the incoming header and value strings. This seems wasteful of memory and I need to think about it some more, but I was aiming for safety first.

static char* copy_string(const char* value) {
	size_t len = strlen(value);
	char* new = allocate(NULL, len+1);
	return strcpy(new, value);
}

void response_header(Response* response, const char* name, const char* value) {
	for (size_t i=0; i<response->headers_count; i++) {
		if (strcmp(response->header_names[i], name)==0) {
			free(response->header_values[i]);
			response->header_values[i] = copy_string(value);
			return;
		}
	}

	if (response->headers_count == response->headers_size) {
		response->headers_size *= 2;

		response->header_names = allocate(response->header_names, sizeof(*response->header_names) * response->headers_size);
		response->header_values = allocate(response->header_values, sizeof(*response->header_values) * response->headers_size);
	}

	response->header_names[response->headers_count] = copy_string(name);
	response->header_values[response->headers_count] = copy_string(value);
	response->headers_count++;
}

I also wanted some flexibility with regards to the response content. I needed to support responses with no content, for example a 304 response telling the client to use its cached version doesn't need any content. When there is content, I imagined a scenario where I want the response module to manage the buffer for the response content and one where some other part of the program would manage the buffer. So three sources for content in total: none; internally managed; externally managed.

I updated the Response structure again, this time with three new fields. content_source is used (with some handy macro constants) to indicate which of the three sources of content is in use. content is a pointer to a buffer holding the actual content. Finally, type is a string with the HTTP content type for the Content-Type header.

typedef struct {
	int status_code;

	size_t headers_size;
	size_t headers_count;
	char** header_names;
	char** header_values;

	unsigned short content_source;
	const char* type;
	Buffer* content;

	// seriously now, this placeholder is no longer needed?
	Buffer* buf;
} Response;

I created three functions for each of the three sources of content. response_no_content is the simplest, setting the source to NONE. It does however first check if the content source was previously INTERNAL and if so, frees up that buffer as it is no longer needed. response_content sets the content source to INTERNAL and creates the buffer (if it’s not already created). response_link_content sets the content source to EXTERNAL and saves a pointer to it. The latter two functions both take a type parameter to set the HTTP content type.

void response_no_content(Response* response) {
	if (response->content_source == RC_INTERNAL) {
		buf_free(response->content);
	}
	response->content_source = RC_NONE;
}

Buffer* response_content(Response* response, char* type) {
	if (response->content_source != RC_INTERNAL) {
		response->content = buf_new(1024);
	}
	response->content_source = RC_INTERNAL;
	response->type = content_type(type);
	return response->content;
}

void response_link_content(Response* response, Buffer* buf, char* type) {
	if (response->content_source == RC_INTERNAL) {
		buf_free(response->content);
	}
	response->content_source = RC_EXTERNAL;
	response->content = buf;
	response->type = content_type(type);
}

With all the data it needs to compose the response, I moved the responsibility for actually sending the response to the new response module. First it would need to convert all the stored headers to a single string, then send the headers and then send the content if there was any. I needed a state variable to track where it was in the process, which for some reason I called stage, and a buffer for the stringified headers. The Response structure evolves into its final form:

typedef struct {
	int status_code;

	size_t headers_size;
	size_t headers_count;
	char** header_names;
	char** header_values;

	unsigned short content_source;
	const char* type;
	Buffer* content;

	Buffer* headers;
	unsigned short stage;
} Response;

The stage starts in PREP while content generators do their thing. On a call to the response_send function if the stage is still PREP the headers get built and the stage is incremented to HEADERS.

static void build_headers(Response* response) {
	TRACE("build response headers");

	// status line
	buf_append_format(response->headers, "HTTP/1.1 %d %s\r\n", response->status_code, status_text(response->status_code));

	// date header
	buf_append_str(response->headers, "Date: ");
	to_imf_date(buf_reserve(response->headers, IMF_DATE_LEN), IMF_DATE_LEN, time(NULL));
	buf_advance_write(response->headers, -1);
	buf_append_str(response->headers, "\r\n");

	// server header
	buf_append_str(response->headers, "Server: Tinn\r\n");

	// content headers
	if (response->content_source != RC_NONE) {
		buf_append_format(response->headers, "Content-Type: %s\r\n", response->type);
		buf_append_format(response->headers, "Content-Length: %ld\r\n", response->content->length);
	}

	// other headers
	for (size_t i=0; i<response->headers_count; i++) {
		buf_append_format(response->headers, "%s: %s\r\n", response->header_names[i], response->header_values[i]);
	}

	// close with empty line
	buf_append_str(response->headers, "\r\n");

	response->stage++;
}

The send code then works through the headers buffer, sending the data to the client in a non-blocking way as I described in the Biggish files post. The difference here is once it's done with the headers buffer it will increment the stage to CONTENT and on subsequent calls to the function work through the content buffer (assuming there is content).

ssize_t response_send(Response* response, int socket) {
	if (response->stage == RESPONSE_PREP) {
		build_headers(response);
	}

	if (response->stage == RESPONSE_DONE) { // just in case?
		WARN("trying to send a response that is finished");
		return 0;
	}

	Buffer* buf = (response->stage == RESPONSE_HEADERS) ? response->headers : response->content;
	
	size_t len = buf_read_max(buf);
	ssize_t sent = send(socket, buf_read_ptr(buf), len, MSG_DONTWAIT);
	if (sent >= 0) {
		TRACE("sent %d: %ld/%ld", response->stage, sent, len);
		if ((size_t)sent < len) {
			buf_advance_read(buf, sent);
		} else {
			response->stage++;
			if (response->stage == RESPONSE_CONTENT && response->content_source == RC_NONE) {
				response->stage++;
			}
		}
	}
	return sent;	
}

The calling client code will keep calling this send function (and deal with all the socket stuff) until the response reaches the DONE stage.
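A sketch of what that calling loop might look like, assuming the stages end at a RESPONSE_DONE value and a negative return with EAGAIN/EWOULDBLOCK means the socket is full. The stub response_send stands in for the real module so the sketch is self-contained, and client_flush is my name, not Tinn’s:

```c
#include <errno.h>
#include <stdbool.h>
#include <sys/types.h>

enum { RESPONSE_PREP, RESPONSE_HEADERS, RESPONSE_CONTENT, RESPONSE_DONE };

/* Stand-in for the real Response: just enough state for the sketch. */
typedef struct { int stage; int chunks_left; } Response;

/* Stub: pretend each call sends one chunk, reaching DONE at the end. */
static ssize_t response_send(Response* r, int socket) {
	(void)socket;
	r->chunks_left--;
	r->stage = (r->chunks_left == 0) ? RESPONSE_DONE : RESPONSE_CONTENT;
	return 1;
}

/* Keep calling response_send until the response is DONE, backing off
 * when the socket would block. */
static bool client_flush(Response* response, int socket) {
	while (response->stage != RESPONSE_DONE) {
		ssize_t sent = response_send(response, socket);
		if (sent < 0) {
			if (errno == EAGAIN || errno == EWOULDBLOCK)
				return false; /* wait for the next writable event */
			return false;     /* hard error: caller closes the connection */
		}
	}
	return true; /* fully sent */
}
```

In Tinn the real loop also has to cooperate with the event loop, re-arming the socket for writability when the send would block.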

4. Static content

Next I built the static files content generator. This basically involved expanding the placeholder function static_content into a working function. Most of the logic for this came from the old routes list code, specifically the routes_add_static function but updated to work with the Request and Response structures.

As the function is going to manipulate the request target a little bit to translate it for local use, I start by creating a copy of the path and extracting the last segment.

	// build a local path
	char local_path[request->target->path_len + 1 + 11 + 1]; // 1 for leading dot, 11 for possible /index.html, 1 for null terminator
	local_path[0] = '.';
	strcpy(local_path + 1, request->target->path);

	char* last_segment = request->target->segments[request->target->segments_count-1];

Then it’s time for some of that manipulation. I check whether the last segment is empty; if so, the request is for a directory, so I should instead look for index.html in that directory, and I update the path and last segment accordingly. At some point I will probably need to expand this to look for other index files, but for now index.html is all that Tinn supports.

	if (strlen(last_segment)==0) {
		last_segment = "index.html";
		strcpy(local_path + 1 + request->target->path_len, last_segment);
	}
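When that expansion happens, the single strcpy could become a probe loop over candidate names. A hypothetical sketch (Tinn only knows about index.html today; find_index and the extra name are purely illustrative):

```c
#include <stdbool.h>
#include <string.h>
#include <sys/stat.h>

/* Hypothetical candidate index files, tried in order. */
static const char* index_names[] = { "index.html", "index.htm" };

/* Append each candidate to a directory path (which must end in '/'
 * and have room for the longest name) and keep the first one that
 * exists as a regular file. Restores the path if nothing matches. */
static bool find_index(char* dir_path, size_t dir_len) {
	struct stat attrib;
	for (size_t i = 0; i < sizeof(index_names)/sizeof(index_names[0]); i++) {
		strcpy(dir_path + dir_len, index_names[i]);
		if (stat(dir_path, &attrib) == 0 && S_ISREG(attrib.st_mode))
			return true;
	}
	dir_path[dir_len] = '\0';
	return false;
}
```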

I don’t want to serve any dot files, so the next check is whether the last segment starts with a .; if so, I exit with false, aka no content generated.

	// ignore dot files
	if (last_segment[0]=='.') {
		TRACE("ignoring dot file \"%s\"", local_path);
		return false;
	}

I then try to read the file information. If this fails, there is no file at the requested path and again I exit with false.

	// get file information
	struct stat attrib;
	if (stat(local_path, &attrib) != 0) {
		TRACE("could not find \"%s\"", local_path);
		return false;
	}

If stat succeeds, I check whether I really found a file or a directory (or something else). For files I check the modification date and either return a 304, so the client uses its cached version, or otherwise the file contents. This code was moved from the prep_file function in client.c and updated to use the new response module.

	if (S_ISREG(attrib.st_mode)) {
		TRACE("found \"%s\"", local_path);

		// check this is a GET request
		// TODO: what about HEAD requests?
		if (!token_is(request->method, "GET")) {
			response_error(response, 405);
			return true;
		}

		// check modified date
		if (request->if_modified_since>0 && request->if_modified_since>=attrib.st_mtime) {
			TRACE("not modified, use cached version");
			response_status(response, 304);
			return true;
		}

		// open file and get content length
		long length;
		FILE *file = fopen(local_path, "rb");
		
		if (file == NULL) {
			ERROR("unable to open file \"%s\"", local_path);
			return false;
		}

		fseek(file, 0, SEEK_END);
		length = ftell(file);
		fseek(file, 0, SEEK_SET);

		// respond
		response_status(response, 200);
		response_header(response, "Cache-Control", "no-cache");
		response_date(response, "Last-Modified", attrib.st_mtime);

		char* body = buf_reserve(response_content(response, strrchr(last_segment, '.')), length);
		fread(body, 1, length, file);
		fclose(file);

		return true;

	}

If it’s a directory, I check whether it has an index.html file and, if so, return a 301 redirect to the path with the correct trailing forward slash, as described in the Routing post. If there is no index, I don’t redirect: redirecting only to then return a 404 would be pointless, so I’d rather return the 404 straight away.

	if (S_ISDIR(attrib.st_mode)) {
		TRACE("found a directory \"%s\"", local_path);

		// check for index
		strcpy(local_path + 1 + request->target->path_len, "/index.html");
		if (stat(local_path, &attrib) == 0) {
			if (S_ISREG(attrib.st_mode)) {
				TRACE("found index, redirecting");

				char new_path[request->target->path_len+2];
				strcpy(new_path, request->target->path);
				strcpy(new_path + request->target->path_len, "/");

				response_redirect(response, new_path);
				return true;
			}
		}
		TRACE("no index");
		return false;
	}

If the file type is something else, I log an error and move on, generating no content. I guess this will come back to hurt me if and when I serve content that includes symbolic links… look forward to a post with me fixing that one.

5. Blog content

Almost there.

As a reminder, the plan is to update the blog module so it persists. This means creating a structure to store all the various data it needs to keep track of. Mainly this is the list of posts, but also the various HTML fragments that are used to build the blog pages.

typedef struct {
	Buffer* header1;
	Buffer* header2;
	Buffer* footer;
	size_t size;
	size_t count;
	struct post* posts;
} Blog;

Blog* blog_new();
void blog_free(Blog* blog);

You’ve already had a sneak peek of the blog_new function. It gets called in the main function when defining the content generators. As you might expect, it allocates memory for the structure and the posts list, creates the buffers for the HTML fragments, and then calls read_posts to populate the list of posts.

Blog* blog_new() {
	Blog* blog = allocate(NULL, sizeof(*blog));
	blog->size = 32;
	blog->count = 0;
	blog->posts = allocate(NULL, sizeof(*blog->posts) * blog->size);

	// load html fragments
	TRACE("loading html fragments");
	blog->header1 = buf_new_file(".header1.html");
	blog->header2 = buf_new_file(".header2.html");
	blog->footer = buf_new_file(".footer.html");

	if (blog->header1 == NULL || blog->header2 == NULL || blog->footer == NULL) {
		buf_free(blog->header1);
		buf_free(blog->header2);
		buf_free(blog->footer);
		return NULL;
	}

	// read posts
	read_posts(blog);

	return blog;
}

The read_posts function is largely unchanged from how it was before. It reads the posts file, validates each entry and adds it to the posts list. The important change is where the post list is stored: before it was a local structure, now it lives in the new Blog structure. Otherwise the only real change is that if any particular post fails validation, it now skips just that entry and not the entire file.
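The shape of that skip-on-failure change might look something like this; entry_valid and load_entries are stand-ins for whatever read_posts actually does, just to illustrate skipping a bad entry instead of abandoning the whole file:

```c
#include <stdbool.h>
#include <stddef.h>

struct post { const char* title; };

/* Stand-in validation: the real code has more fields to check. */
static bool entry_valid(const char* title) {
	return title != NULL && title[0] != '\0';
}

/* Add every valid entry; a bad one is skipped with continue where the
 * old code would have given up on the whole file. */
static size_t load_entries(struct post* posts, const char** titles, size_t n) {
	size_t count = 0;
	for (size_t i = 0; i < n; i++) {
		if (!entry_valid(titles[i]))
			continue;
		posts[count++].title = titles[i];
	}
	return count;
}
```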

The old blog_build function has been replaced by the new blog_content function. Where before all the pages were generated upfront and stored in the routes lists, now each page is built on demand using the new response module.

bool blog_content(void* state, Request* request, Response* response) {
	TRACE("checking blog content");

	Blog* blog = (Blog*)state;

	// check home page
	if (strcmp(request->target->path, "/")==0) {
		TRACE("generate home page");

		response_status(response, 200);
		
		Buffer* content = response_content(response, "html");
		buf_append_buf(content, blog->header1);
		buf_append_buf(content, blog->header2);

		for (size_t i=0; i<blog->count; i++) {
			TRACE_DETAIL("post %d \"%s\"", i, blog->posts[i].title);
			if (i > 0) {
				buf_append_str(content, "<hr>\n");
			}
			compose_article(content, blog->posts[i]);
		}

		buf_append_buf(content, blog->footer);
		return true;
	}

	// ... check other pages ...

6. Detecting updates

I’m now, eventually, in a position to do what I set out to do. I can update the blog module to detect updates to the file system. Let's start with some helper functions: one to get the modified date (and time) of a file and another to compare two dates (and times) and return the later one.

static time_t get_mod_date(const char* path) {
	struct stat attrib;
	if (stat(path, &attrib) == 0) {
		return attrib.st_mtime;
	}
	return 0;
}

static time_t max_time_t(time_t a, time_t b) {
	return a>=b ? a : b;
}

I updated the Blog structure to include a mod_date field which will store the modified date of the posts file when I read it to generate the list of posts. Then in the blog_content function I added the check to see if the file had been updated since that date, and if so re-read the file.

	// check for changes
	if (get_mod_date(POSTS_PATH) > blog->mod_date) {
		reread_posts(blog);
	}

I did the same to the posts themselves, adding a mod_date field to the post structure and checking that date against the file’s modified date and re-reading the content if required.

static void check_post_date(struct post* post) {
	time_t mod_date = get_mod_date(post->source);
	if (mod_date > post->mod_date) {
		buf_reset(post->content);
		buf_append_file(post->content, post->source);
		post->mod_date = mod_date;
	}
}

I also had to add checks to the HTML fragments, checking each of those before generating any content. This finally pushed me to move the fragments into an array so I could use a loop to check and update them as required. This required a new structure for each fragment and an update to the Blog structure.

struct html_fragment {
	const char* path;
	time_t mod_date;
	Buffer* buf;
};
#define HF_HEADER_1	0
#define HF_HEADER_2	1
#define HF_FOOTER	2
#define HF_COUNT	3

typedef struct {
	time_t mod_date;
	struct html_fragment fragments[HF_COUNT];
	size_t size;
	size_t count;
	struct post* posts;
} Blog;

Then, for example, this is the loop I added to the blog_content function to check for any updates:

	for (size_t i=0; i<HF_COUNT; i++) {
		time_t mod_date = get_mod_date(blog->fragments[i].path);
		if (mod_date > blog->fragments[i].mod_date) {
			buf_reset(blog->fragments[i].buf);
			buf_append_file(blog->fragments[i].buf, blog->fragments[i].path);
			blog->fragments[i].mod_date = mod_date;
		}
	}

To access the individual fragments when building the response, I used a macro to select the correct index in the array, for example:

buf_append_buf(content, blog->fragments[HF_FOOTER].buf);

I needed to apply some thought to when each check should be performed and at what point each page should be considered stale.
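For the home page, for instance, the page is stale if any of its inputs changed, so its effective Last-Modified is presumably the latest of all of them. A sketch of how max_time_t might be folded over the structures above (home_page_mod_date is my name for it, and the structs are reduced to just the fields used here):

```c
#include <stddef.h>
#include <time.h>

/* Reduced versions of the structures above: only the dates. */
struct html_fragment { time_t mod_date; };
struct post { time_t mod_date; };
#define HF_COUNT 3

typedef struct {
	time_t mod_date; /* posts file */
	struct html_fragment fragments[HF_COUNT];
	size_t count;
	struct post* posts;
} Blog;

static time_t max_time_t(time_t a, time_t b) {
	return a >= b ? a : b;
}

/* The home page shows every post between the fragments, so it is only
 * as fresh as its newest input. */
static time_t home_page_mod_date(const Blog* blog) {
	time_t latest = blog->mod_date;
	for (size_t i = 0; i < HF_COUNT; i++)
		latest = max_time_t(latest, blog->fragments[i].mod_date);
	for (size_t i = 0; i < blog->count; i++)
		latest = max_time_t(latest, blog->posts[i].mod_date);
	return latest;
}
```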

With all this in place, I have achieved my objective. I can now update the posts file, the individual posts or any of the fragments and just need to refresh the page in the browser to see the updates. No longer do I need to restart the server. Yay!

The other reason I did this

I now have a last modified date for each blog page. This means I can also enable client caching on these pages. Excellent.

		// check modified date
		if (request->if_modified_since>0 && request->if_modified_since>=mod_date) {
			TRACE("not modified, use cached version");
			response_status(response, 304);
			return true;
		}
		
		// generate page
		response_status(response, 200);
		response_header(response, "Cache-Control", "no-cache");
		response_date(response, "Last-Modified", mod_date);

		// ... content ...

Conclusion

This all seems like a lot of effort just to make the server detect file system updates. The reality is I’ve done much more than that. I’ve built a more robust and expandable system for reading the incoming request and generating a response. This sets me up for adding app support later, which is a key goal of this project. So it was well worth the effort.

Besides, I really like not having to restart the server when doing web development stuff.

TC