Blog module mark II

6th March 2024

Reviewing my first version of the blog module, the blog_build function is 173 lines long, which bothers me. It definitely smells of a “make it work” version where I’ve just thrown all the code in one function to just get it working. You can take a look if want here. The list of things that bother me are:

The way I’m handling paths seems cumbersome.
There is some code repetition.
The call to fscanf, it’s cryptic and has some hard coded lengths in it.
The way I build the “prev” and “next” navigation at the bottom of each post is overly complicated.
I’m actually reading the post content from the file twice.
In general the function is trying to do too many things.

Building paths

As explained in the blog module post, this really bogged me down. I know what I need to do, but the memory management for strings in C keeps tripping me up. Let’s break it down.

The problem

I need to work out the path for four permanent files, and an additional two for each post. I have to build these paths from two variables and a bunch of constants/literals. The first variable is the location on my machine where all the content files are stored, this is provided as a command line argument and in a variable called base_path. The second is the directory name for each individual post, read from the posts file.

File	How the path is computed
Header fragment 1	`base_path + blog_dir + ".header1.html"`
Header fragment 2	`base_path + blog_dir + ".header2.html"`
Footer fragment	`base_path + blog_dir + ".footer.html"`
Posts file	`base_path + blog_dir + ".posts.txt"`
Post content	`base_path + blog_dir + post_dir + “/.post.html”`
server path	`blog_dir + post_dir`

The first problem is I need to check if the incoming content root has a trailing forward slash or not. As this is specified on the command line, either is possible. When adding this to the rest of the path I don’t want to end up with a double forward slash or no forward slash.

Second problem is that I moved the header and footer files to the blogs directory so 5 of the 6 files all started the same, to make the code a bit easier. I would prefer the HTML fragments were in the root or a common directory so I could use them elsewhere on the site.

Third, I need to make sure I have enough space to build the whole path, but because I don’t know the length of the post directory names yet, I don’t know what this is.

My first version of code started like this:

01:	void blog_build(char* base_path, struct routes* routes) {
02:		const size_t max_len = 256;
03:		const char blog_dir[] = "/blog/";
04:		const char blog_file[] = "/.post.html";
05:
06:		// create arrays for paths
07:		size_t bpl = strlen(base_path);
08:		size_t spl = sizeof(blog_dir)-1;
09:		if (base_path[bpl-1] == '/') {
10:			bpl--;
11:		}	
12:		char path[bpl + spl + max_len + sizeof(blog_file)];
13:		strcpy(path, base_path);
14:		strcpy(path + bpl, blog_dir);
15:		bpl += spl;
16:
17:		// … more code …

I decided that 256 characters for a directory name was probably more than enough, but to be safe I set this as the constant max_len, intending to use that later to validate the incoming directory name lengths. I also created two constants, one for the blog directory name and one for the content file name so I could change them easily if required.

Then it gets weird. On line 7 I create a variable called bpl which right now I know means “base path length” but in about six months when I’m looking back at the code I won’t have a clue, so I should probably change it. Then on line 8 I create spl aka “server path length”, which is effectively the length of the blog directory name and some forward slashes. Then on lines 9 to 11, I check if the last character of the content root is a forward slash and if it is I reduce the bpl by one because I’ve included the forward slash in the blog directory name.

On line 12 I create a character array called path with enough space for the content root, the blog directory name, 256 characters for the post directory name and then content file name (/.post.html).

Next I copy the content root into my new path. Then add the blog directory name to the path using pointer arithmetic to place it in the correct position. Using the bpl variable I manipulated earlier as an offset means that if the content root had a trailing forward slash, it will get overwritten by the one in the blog directory name, if it didn’t the blog directory name is appended. Either way I won’t end up with just the one forward slash.

Finally, on line 15, I add the length of the blog directory name (and all its forward slashes) to bpl. From this point on path + bpl points to the first character in the path that would be inside the blog directory, and for at least 5 of the paths, that’s the part I want to manipulate. For example to build the path for the the footer fragment I use:

strcpy(path + bpl, ".footer.html");

The path variable will now contain the complete path to the footer fragment file that I can use in a call to fopen (file open). Later when I want to build the path to a post’s content file I would use:

strcpy(path + bpl, post_path);
strcpy(path + bpl + strlen(post_path), blog_file);

It works. It might even be efficient. But it irks me.

I’ve had a few days to think about this. What bugs me is I’m trying to do too many things all at the same time. Notably, validate the incoming content root and make sure I don’t cause an overflow. I did enter a bit of a loop as some of my ideas to address this would make the code less efficient in terms of memory management. Which touches on a flaw in my simplified software engineering aphorism, sometimes to make it neat you make it less fast, and sometimes to make it fast you have to make it less neat. So to make progress I’m going to focus on making it neat. I'm not completely ignoring speed and efficiency, it's just a secondary concern right now.

Validating the content root

I decided to move the responsibility for validating the content root out of the blog_build function and into its own function that would be called by main when validating the command line arguments in general. That's where it should be really.

I then remembered I can change the current working directory. The library function chdir would accept a path with or without the trailing forward slash and then switch the working directory to that directory. From that point on any other file related functions like fopen using relative paths would start in that directory. More than that, I could test the response of chdir to ensure the path was a valid directory. It’s such an obvious solution I can’t believe it took two days for me to think of it.

int main(int argc, char* argv[]) {
	// validate arguments
	if (argc != 3) {
		fprintf(stderr, "Usage: %s port content_directory\n", argv[0]);
		return EXIT_FAILURE;
	}

	// change working directory to content directory
	if (chdir(argv[2]) != 0) {
		perror("content_directory");
		return EXIT_FAILURE;
	}

	// build routes
	struct routes* routes = routes_new();
	blog_build(routes);

	// … more code …

	return EXIT_SUCCESS;
}

The blog_build function doesn't need to know or care about where the content root is. It now just assumes it’s the current directory. Arguably this is less flexible but I have to remind myself I’m not building a library, this function is only used in this program and only used once. So I can remove flexibility to increase simplicity and readability.

I did think about using chdir within the blog_build function as I load the files from different places, however I decided against it as I wanted the error messages (should a file be missing for example) to have some context, i.e. at least be a path starting at the content root.

Avoiding overflows

I already solved this problem. I created the buffer structure and functions to deal with reading requests and composing responses. I could use the same functions to build a path and let them worry about allocating enough space. But that does seem overkill, so I wanted to explore other options first.

Experiment one was a function called make_path. I was really aiming for ease of use. It is a variadic function that can take a variable number of string arguments. It will then loop through each of them adding them to a single string up to a maximum length and return the result.

#define MAX_PATH_LEN 256

char* make_path(int n, ...) {
	static char path[MAX_PATH_LEN];
	size_t len = 0;

	va_list args;
	va_start(args, n);
	for (int a=0; len<MAX_PATH_LEN-1 && a<n; a++) {
		char* fragment = va_arg(args, char*);
		for (size_t i=0; len<MAX_PATH_LEN-1 && fragment[i]!='\0'; i++) {
			path[len++] = fragment[i];
		}
	}
	va_end(args);

	path[len] = '\0';
	return path;
}

void blog_build(struct routes* routes) {
	const char blog_dir[] = "blog/";
	const char blog_file[] = ".post.html";

	// … other code …	
	
	char* header_path = make_path(2,  blog_dir, ".header1.html");
	char* post_path = make_path(4, blog_dir, post_path, "/", blog_file);

	// … more code …
}

I have concerns. While it is more convenient to call than other solutions, the first parameter specifying the number of strings being passed detracts from its convenience. This seems like something the compiler or the va_* variadic library functions should be able to deal with.

More of a concern is the static character array being used to store the resulting path. That is a can of worms right there. Being static, it is a chunk of memory that is going to persist for the entire program run, and considering I’m only using this during the program's start up, it seems like a waste. I know the focus is neatness and not efficiency, but this seems just too inefficient. Also, not that it matters yet, but I think the static variable means it is not thread safe. I get the impression that static variables like globals should be avoided, at least until I understand them better.

Another potential problem is this function will silently truncate the input if the output will exceed the maximum size. The result will be correctly terminated and will not overflow, but there is no way outside the function to detect if the input was too long.

I decided make_path was not the solution. It did however give me an idea, I could use snprintf to build the path.

#define MAX_PATH_LEN 256

void blog_build(struct routes* routes) {
	const char blog_dir[] = "blog";
	char path[MAX_PATH_LEN];

	// … 
	
	// build path for post file
	snprintf(path, MAX_PATH_LEN, "%s/.posts.txt", blog_dir);
	
	// …

	// build path for post content
	snprintf(path, MAX_PATH_LEN, "%s/%s/.post.html", blog_dir, post_path);

	// … more code …
}

I quite like this. Overflow safe. Reasonably flexible. Not overly complicated, certainly no pointer arithmetic to remember/decode. It’s probably less efficient than my first version, but adequate for what I need and not needlessly wasteful. It is however silently discarding any input that would overflow, but that is because I’m not inspecting the return value.

snprintf returns the number of characters that should have been written to the buffer excluding the null terminator (or a negative number if an error occurs). The second parameter indicates the maximum buffer size including the null terminator, for which I pass MAX_PATH_LEN. So, if the value snprintf returns is equal to or greater than MAX_PATH_LEN, the path got truncated.

	int len = snprintf(path, MAX_PATH_LEN, "%s/%s/.post.html", blog_dir, post_path);
if (len < 0 || len >= MAX_PATH_LEN) {
	fprintf(stderr, "Error, unable to create path for \"%s\"\n", post_path);
	continue;
}

I added this check when building the path for the post content. This is being built based on a value from the posts file which I can't guarantee the length of, so needs the check. Elsewhere, like for the posts file itself, where the length is predictable and I therefore know fits in the buffer, I don’t need to worry about the return from snprintf.

This is not the end of my journey with string manipulation in C. Right now I’m comfortable with it again. It seems to me snprintf is definitely the way to go if you need to do anything more complicated than the most basic copy or concatenation. I'm certain I’ll go to implement something in the coming days or weeks that will cause me to question everything again. For now however, it's time to move on.

Code repetition

Lines 832 to 854 load three files into buffers, these are the three html fragments. It’s clear I wrote this once, copied it, pasted it twice and modified it slightly.

	// load header part 1
	strcpy(path+bpl, ".header1.html");
	struct buffer* header1 = buf_new(256);
	if (buf_append_file(header1, path) < 0) {
		fprintf(stderr, "Error reading %s\n", path);
		return;
	}

	// load header part 2
	strcpy(path+bpl, ".header2.html");
	struct buffer* header2 = buf_new(256);
	if (buf_append_file(header2, path) < 0) {
		fprintf(stderr, "Error reading %s\n", path);
		return;
	}

	// load footer
	strcpy(path+bpl, ".footer.html");
	struct buffer* footer = buf_new(256);
	if (buf_append_file(footer, path) < 0) {
		fprintf(stderr, "Error reading %s\n", path);
		return;
	}

To give myself some credit I did create the buf_append_file function the first time round so a good chunk of the code to load a file is already in an appropriate function. Still, it seems there is still an opportunity to make this neater. First, I moved the fragment files back to the content root, with the other changes I already made, this removed the need to construct a path for each file, I could just use a string literal. I then created a buf_new_file function that creates a new buffer, loads it up with a file, reports with any errors and returns a pointer to the new buffer.

struct buffer* buf_new_file(char* path) {
	struct buffer* buf = buf_new(256);
	if (buf != NULL) {
		if (buf_append_file(buf, path) < 0) {
			fprintf(stderr, "Error reading %s\n", path);
			return NULL;
		}
	}
	return buf;
}

void blog_build(char* base_path, struct routes* routes) {
	// … other code …

	// load html fragments
	struct buffer* header1 = buf_new_file(".header1.html");
	struct buffer* header2 = buf_new_file(".header2.html");
	struct buffer* footer = buf_new_file(".footer.html");

	// … more code …
}

Much neater. BUT, I’ve created a problem. What happens if one of the files fails to load? Before I would print an error and escape out of the blog_build, not great but at least the code would not crash. Now the error still gets printed but the buf_new_file function will return a NULL pointer and the blog_build function will carry on regardless until it tries to use that buffer, and then crashes.

Error reading .footer.html
fish: Job 1, 'build/tinn 8080 ../moohar/' terminated by signal SIGSEGV (Address boundary error)

I need to check if the returning buffer is not NULL. If I do it for each file in turn, the resulting code would not be much neater than the original. I could instead check all three after loading them.

bool blog_build(struct routes* routes) {
	// … other code …
			
	// load html fragments
	struct buffer* header1 = buf_new_file(".header1.html");
	struct buffer* header2 = buf_new_file(".header2.html");
	struct buffer* footer = buf_new_file(".footer.html");

	if (header1 == NULL || header2 == NULL || footer == NULL) {
		buf_free(header1);
		buf_free(header2);
		buf_free(footer);
		return false;
	}

	// … more code …
}

Something I forgot to do in the previous code if there was an error, was to tidy-up and free any buffers that did get created. For this to work I needed to make a small change to the buf_free function to check for and ignore NULL pointers. I also changed the blog_build function to return a boolean indicating if it was successful. This way the main function can check the result and handle it appropriately.

If there were more than three files, I would be looking to use an array and a loop, for now though using three separate variables is probably less work than setting up an array.

Reading the posts list

The original version of the blog module would read the posts file one line at a time, interpret that line, try to do everything with that post in one go and then move on to the next post. One issue here is to build the navigation at the bottom of individual posts, I at least need to know the path of the next and previous posts. The current code deals with this by not computing the navigation until it’s read the next post. As a result each iteration of the loop is actually dealing with two posts, the navigation for one and the content for another. This is why there are variables like next_post_path and next_next_post_path. Some of the code is also duplicated outside the loop to close off the last post. Dealing with edge cases, like the first and last post, also requires nested checks which is duplicated in and outside the loop. It’s not clean.

The plan is therefore to break this down into at least two phases. First read the post file and convert it into an array that can be accessed randomly (as opposed to sequentially). Second, use the array to build each post (and the archive page (and the home page)) in a slightly clearer and sane way.

This started by defining a new struct to hold the information about each post.

struct post {
	char path[MAX_PATH_LEN];
	char server_path[MAX_PATH_LEN];
	char title[MAX_PATH_LEN];
	char date[20];
	struct buffer* content;
};

Then because I want the array to resize as required I copied the list code I’ve used a few times and modified it to work with the new post structure. My brain is screaming at me that there should be a generic way to handle these dynamic arrays. I know it’s not straightforward in C so I’m filling that in the “later” pile.

I created the function blog_read_posts which reads the posts file and for each one creates a posts entry using the above structure. This includes reading the content from the .post.html file for each post into the buffer.

This code is still using an fscanf call to parse each line in the posts file.

fscanf(data, "%255[^\t]%*[\t]%255[^\t]%*[\t]%19[^\n]%*[\n]", post->path, post->title, post->date)

I’m going to leave this line alone for now. It still bothers me that there are hardcoded limits for each field. I did this to prevent overflow, but the value (255) is not automatically linked to the MAX_PATH_LEN defined elsewhere. If I change one, I will have to remember to change the other. A bad idea. Also while I understand the format string today, it’s really hard to read and would be a pain to update at a later date. I will revisit this code when I work on parsing the incoming HTTP request headers, it’s a similar problem.

The way I maintain the list is slightly different from other list examples. I start with a call to posts_list_draft which ensures there is enough space for a new entry in the list and then returns a pointer to that new entry. The key difference is that it does not update its internal count. In effect, the new entry is not yet “committed” to the list. If you were to loop through entries using the count to determine the upper bound, it would not include this new entry. I then use this new entry to capture the results of the fscanf call and load the contents buffer. If everything works I call posts_list_commit which simply increments the count by 1, effectively committing the new entry. If anything goes wrong I don’t commit and a subsequent call to posts_list_draft will return a pointer to the same entry as before that can be overwritten.

Back in the blog_build function I can now loop through the array of posts to create the required pages. To keep things clear I actually loop through the posts three times. Once to create the individual pages, once to build the archive page with a link to each post, and one last time to build the home page with all the posts in a single page. Because I can now look forwards and backwards in the array, building the navigation is far simpler, there is no longer a need to overlap the creation of the pages and have variables like next_next_post_path.

// build blog pages
for (int i=0; i<posts->count; i++) {
	struct buffer* post = buf_new(10240);
	buf_append_buf(post, header1);
	buf_append_str(post, " - ");
	buf_append_str(post, posts->data[i].title);
	buf_append_buf(post, header2);
	buf_append_str(post, "<article><h1>");
	buf_append_str(post, posts->data[i].title);
	buf_append_str(post, "</h1><h2>");
	buf_append_str(post, posts->data[i].date);
	buf_append_str(post, "</h2>\n");
	buf_append_buf(post, posts->data[i].content);
	buf_append_str(post, "<nav>");
	if (i < posts->count-1) {
		buf_append_str(post, "<a href=\"");
		buf_append_str(post, posts->data[i+1].path);
		buf_append_str(post, "\">prev</a>");
	} else {
		buf_append_str(post, "<span>&nbsp;</span>");
	}
	if (i > 0) {
		buf_append_str(post, "<a href=\"");
		buf_append_str(post, posts->data[i-1].path);
		buf_append_str(post, "\">next</a>");
	}
	buf_append_str(post, "</nav></article>\n");
	buf_append_buf(post, footer);
			
	routes_add(routes, RT_BUFFER, posts->data[i].server_path, post);
}

The code is still verbose, but it is at least now easy to follow.

The verboseness comes from having to add the strings to the buffer one at a time. My recent experience with snprintf caused me to consider if there was a better way. I already have a buf_append_format function that uses vsnprintf (the variadic version of snprintf), but I’d not quite got it’s implementation correct as I required the caller to pass a maximum size of the formatted string. This is inconvenient and not in the spirit of the buffer which should just expand as required. Now I understand how to use the return value from snprintf I could update the code to not require that parameter.

long buf_append_format(struct buffer* buf, char* format, ...) {
	long max = buf->size - buf->length;

	va_list args;
	va_start(args, format);
	long len = vsnprintf(buf->data + buf->length, max, format, args);
	va_end(args);

	if (len >= max) {
		if (!buf_ensure(buf, len+1)) {
			return -1;
		}

		va_start(args, format);
		len = vsnprintf(buf->data + buf->length, len+1, format, args);
		va_end(args);
	}

	if (len < 0) {
		return -1;
	}
	buf->length += len;
	
	return buf->length;
}

Essentially the code now attempts to append the formatted string to the buffer and if the required space is more than was available it expands the buffer and tries again. This means I can tidy up the code to create a post page a little more.

// build blog pages
for (int i=0; i<posts->count; i++) {
	struct buffer* post = buf_new(10240);
	buf_append_buf(post, header1);
	buf_append_format(post, " - %s", posts->data[i].title);
	buf_append_buf(post, header2);
	buf_append_format(post, "<article><h1>%s</h1><h2>%s</h2>\n", posts->data[i].title, posts->data[i].date);
	buf_append_buf(post, posts->data[i].content);
	buf_append_str(post, "<nav>");
	if (i < posts->count-1) {
		buf_append_format(post, "<a href=\"%s\">prev</a>", posts->data[i+1].path);
	} else {
		buf_append_str(post, "<span> </span>");
	}
	if (i > 0) {
		buf_append_format(post, "<a href=\"%s\">next</a>", posts->data[i-1].path);
	}
	buf_append_str(post, "</nav></article>\n");
	buf_append_buf(post, footer);
			
	routes_add(routes, RT_BUFFER, posts->data[i].server_path, post);
}

Not a huge saving in the number of lines, but it does help with clarity. For example, the code to add the “next” link is now a single line with an obvious intent instead of spread over three.

Review

Let’s review my gripe list.

The way I’m handling paths seems cumbersome. -> This has been mostly replaced with calls to standard library functions, specifically chrdir and snprintf. It's no longer cumbersome, it's actually quite neat.
There is some code repetition. -> The obvious examples have been replaced with functions. I think there is potential for some code reuse with how the page content is being generated, but it's not as simple as it first looks. There are enough differences between the page types that a quick function is not going to cut it. I'll think about this some more.
The call to fscanf, it’s cryptic and has some hard coded lengths in it. -> This has yet to be addressed.
The way I build the “prev” and “next” navigation at the bottom of each post is overly complicated. -> Breaking the code into two phases, reading the data into an array and then using the array to build the pages has made building the navigation trivial.
I’m actually reading the post content from the file twice. -> The same array used for the previous point also solved this problem.
In general the function is trying to do too many things. -> The blog_build function is now 94 lines long and most of that is the actual composing of content. The rest has been moved to appropriate functions.

At this point the code is better organised, more readable and easier to understand. Therefore it is more maintainable. Goal achieved. This has for example enabled me to spot some bugs, especially around handling potential errors which means I now have a more robust solution. In addition my change in approach let me add the following new feature with ease.

Blogs are backwards

I’ve always felt that blogs are backwards. I know why they are the way they are. You want your newest content at the top for lots of reasons. But I’ve always felt there should be an option to read the blog in chronological order. Sometimes when you discover a new blog and want to read it in its entirety, it would be helpful to do this from start to finish without any unnecessary navigating and/or scrolling back and forth. So I added a way to do that on my blog. I added the log page.

To generate this page I simply loop through the posts array in reverse order. Job done.

Outstanding issues

I need to review the code around the routes list. In my recent testing I encountered a memory management problem with how the paths are stored in the routing list and what bit of code should be responsible for that memory. I’ll think I’ll address that next but this post is getting too long already. For now I’m not freeing up the posts array after I’ve built the post pages, which ideally I should be able to do.

I also need to review my error handling strategy, it’s inconsistent at the moment with how I return an error, where the responsibility for reporting the error is, and how to free up resources already allocated.

My code is now over a thousand lines. It’s past time to move to multiple source files. I’ve been avoiding that because that also means I should probably learn about makefiles. Or is it build files? I’m sure I’m about to find out.

Finally, I know the home page and my new log page are going to grow to a size I shouldn’t load them in one go. I need to implement some infinite scrolling mechanism that loads posts as needed, and that means JavaScript. Which is exciting.