Bug hunting
8th May 2024
I knew I'd done something wrong the morning after I updated the server, because it crashed. The terminal showed a segmentation fault right after reporting a bad request and right before trying to output the request target. So my gut said I'd messed up the debug/logging code and was trying to output a null pointer or something. I had my day job to attend and plans for the evening, so it would have to wait. I was tempted to roll back the deployed version, but let's be real: not that many people are reading my site and no one is relying on it. Besides, I have an SSH client on my phone, so I could check from time to time and restart the server if I needed to. Which I did, about three more times.
Evening plans complete, it was time to go bug hunting. Let me walk you through my process.
It starts with me dry running the program. That is, I imagine the program running while I read through the code one line at a time. Sometimes I use a bit of paper to keep track of variable values, but mostly I do the whole thing in my head. I was lucky enough to study computer science for A-levels, and this is a skill I picked up then: we spent a lot of time dry running programs on paper in classrooms. The idea is that this slow, deliberate approach lets me spot flaws in logic so I can correct them. It solves a lot of problems.
Based on the terminal output, I started by looking at the new code for parsing and storing URIs. What confused me was that I thought I had made this code safe with regard to bad requests. In the case of an invalid request target it should still create a safe URI structure, with the valid flag set to false, a path length of zero and the path set to an empty string. After carefully reviewing the code, I couldn't see why it wouldn't do exactly that. It certainly shouldn't be causing a segmentation fault.
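For illustration, the structure looks something like this. The field names and the size limit here are guesses for the sake of the sketch, not my exact code:

#include <stdbool.h>
#include <stddef.h>

#define URI_PATH_MAX 1024 // illustrative limit, not the real one

typedef struct {
    bool valid;              // false when the request target failed to parse
    size_t path_len;         // zero for an invalid target
    char path[URI_PATH_MAX]; // empty string for an invalid target
} Uri;

// A bad request target should still leave the structure in a safe,
// well-defined state rather than anything undefined.
static void uri_set_invalid(Uri* uri) {
    uri->valid = false;
    uri->path_len = 0;
    uri->path[0] = '\0';
}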
I also looked at the code that called it, the new request module. Likewise, that code seemed correct; I couldn't see what was wrong.
The next step of my debug process is to try to reproduce the error. This was tricky: requests made from my browser all seemed to work, and I couldn't make the server crash. The server had also been running for hours, responding to requests from all kinds of sources without crashing, which is why I'd not spotted the problem in my original testing. I fired up a telnet client and started experimenting with bad requests. The first test was to send an empty request, and straight away the server crashed in the same way. This was a huge step forward. Once I can reproduce an error at will, I can start to experiment to find out exactly what is going on.
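Telnet did the job, but the same probe is only a few lines of C. This is a sketch rather than anything I actually ran; it assumes the server is listening locally on port 8080:

#include <stdio.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

int main(void) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(8080); // assumed port for this sketch
    inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);

    if (connect(fd, (struct sockaddr*)&addr, sizeof addr) < 0) {
        perror("connect");
        return 1;
    }

    // A lone newline sequence: the same thing telnet sends when
    // you hit enter once. The request header is still incomplete.
    send(fd, "\r\n", 2, 0);
    sleep(5); // hold the connection open while the server reacts
    close(fd);
    return 0;
}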
It didn’t take long. With HTTP requests, the header is not considered complete until the server receives a blank line. I’d written code to look for this blank line by searching the input buffer for consecutive newline sequences (\r\n\r\n). Upon finding the blank line, everything before it is the header, which would then get parsed, including the request target URI. If after receiving data I did not find the blank line, the request was incomplete, so I would leave it alone and return to the main network loop to wait for more data.
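In outline, the end-of-header check amounts to something like this. It's a sketch rather than my exact code (the real buffer is length-tracked, not NUL-terminated), but the logic is the same:

#include <stdbool.h>
#include <string.h>

// The header is only complete once the buffer contains a blank line,
// i.e. two consecutive newline sequences.
static bool header_complete(const char* buffer) {
    return strstr(buffer, "\r\n\r\n") != NULL;
}

With only a single \r\n in the buffer, strstr finds no match, so the request is reported incomplete.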
With a telnet client, data is sent to the server every time you hit enter and go to a new line. Hitting enter straight away sends a blank line; technically, it sends a single newline sequence (\r\n). My code looking for the end of the header would be triggered, but because the entire input buffer was a single newline sequence, it wouldn’t find consecutive newlines and would consider the request incomplete, returning to the network loop to wait for more data. However, before I could send a second newline the server would crash. This was because the socket listener code for clients was not checking whether the request was complete before trying to read it. I was checking for errors, but the code to check whether the request was complete had been missed in the refactor. As a result, the code tried to process an incomplete request and crashed when reading the request target, because that structure had yet to be populated.
The fix was simple enough. I just added a check to see if the request is complete and, if not, return to the main network loop and wait for more data, like I thought it was already doing.
static bool read_request(struct pollfd* pfd, ClientState* state) {
    Request* request = state->request;
    Response* response = state->response;

    int received = request_recv(request, pfd->fd);
    if (received <= 0) {
        if (received < 0) {
            ERROR("recv error from %s (%d)", state->address, pfd->fd);
        } else {
            LOG("connection from %s (%d) closed", state->address, pfd->fd);
        }
        return false; // error or closed connection: drop the socket
    } else if (!request->complete) { // new check
        return true; // true because we want to keep the socket open
    } else {
        // process request
        // …
        // send
        return send_response(pfd, state);
    }
}
I didn’t spot this error when first investigating because I was too deep in the stack, focusing on the new URI and request code. I didn’t consider the (relatively) older socket listener, even though I had made quite a few changes to that code too. So many times, finding the cause of a bug results in a “d’oh” moment: how did I miss that, and why didn’t I look there first?
Also, as is often the case, once I understood what was wrong, the fix was comparatively easy. Bug hunting is mostly about the hunt. I often have to reassure my team as they try to justify why it took them several hours to find and understand a problem before changing a few characters of code. “I get it,” I tell them. “I’m a programmer too.”
TC