Building a Concurrent HTTP Client and Parallel File Encryptor From Scratch
Over the past month I built two systems programming projects back to back - a concurrent HTTP fetch library in C, and a parallel file encryptor in C++. Neither was built by following tutorials step by step. Both were built through trial and error, looking things up only when a specific problem demanded it.
This is a record of what I learned and why each piece exists.
Part 1: HTTP Fetch Library in C
Where It Started
I came across a YouTube video by Dr. Jonas Birch walking through a basic HTTP fetch implementation in C without external libraries. I coded along but wasn't satisfied, so I decided to extend it into a proper concurrent implementation. Most of the learning happened in conversation with Claude, which I used not as a code generator but as something closer to a patient senior engineer - one I could ask "why does this work?" rather than "what do I write?" That distinction mattered. The code is mine. The understanding came from the back-and-forth.
How I Prepared Before Writing Any Code
Before touching the implementation, I spent time deliberately preparing - and the order mattered.
Claude helped me map out which LeetCode problems to solve before getting to any threading code, and why. The sequencing was: Design Circular Queue first, then learn about mutex, then implement task_queue.c. The reason is simple - if you learn circular buffer mechanics and mutex synchronization simultaneously and something breaks, you don't know which part is wrong. Isolate the data structure first, add the synchronization layer second.
Design Circular Queue was the most directly useful. The (tail + 1) % capacity wraparound logic from that problem is almost exactly what ended up in task_queue.c. I solved it as a pure data structure problem, then the mutex video showed me what needed protecting, and the connection was immediate. Implement Queue using Stacks built FIFO intuition before any threading complexity. Valid Parentheses seems unrelated but isn't - it built the instinct that every pthread_mutex_lock needs a matching unlock, including on early returns. Every ( needs a ), and returning from a function mid-critical-section is an unclosed parenthesis.
The problems didn't teach concurrency. They reduced cognitive load so that when mutexes and condition variables arrived, I was learning one new thing instead of three at once.
The same back-and-forth also shaped how I understood what I was building. Before writing a single line, Claude walked me through the full four-layer architecture - Application → HTTP → Network → Kernel - not writing code, just naming what each layer does and why the separation exists. That conversation made the implementation feel like filling in a structure rather than guessing at one.
One concrete example of how this worked: Claude explained that string literals live in a read-only section of memory called .rodata. I ran objdump -s -j .rodata grab and saw 47455400 504f5354 in hex - GET and POST - sitting in my own binary. Being told something and seeing it are different experiences. That's what the back-and-forth was for.
What Actually Happens When You Fetch a URL
Before writing any code, I had to understand the actual sequence:
- Create a socket - a file descriptor the kernel gives you to represent a network endpoint
- Connect to the server - the kernel performs a TCP three-way handshake (SYN, SYN-ACK, ACK)
- Send an HTTP request - formatted text with very specific rules
- Wait for a response - the kernel reassembles incoming TCP segments
- Parse the response - extract status, headers, body
- Close the connection
The entire internet runs on this loop.
The Architecture
The architecture wasn't designed upfront. It emerged from asking a series of questions:
- How do I send bytes over a network? → network.c
- What bytes do I send? → http.c
- Who coordinates everything? → fetch.c
- How does the user interact with it? → grab.c
Each layer has one job and doesn't know about the layers above it. network.c knows nothing about HTTP. http.c knows nothing about sockets. fetch.c connects them.
The Network Layer
Five functions, each wrapping one kernel syscall:
socket_create() // SYS_socket - create the endpoint
socket_connect() // SYS_connect - TCP handshake
socket_send() // SYS_write - send bytes
socket_recv() // SYS_read - receive bytes
socket_close() // SYS_close - terminate connection
One thing that surprised me: socket_send uses SYS_write and socket_recv uses SYS_read - the same syscalls used for files. In Unix, a socket is just a file descriptor; the kernel routes the call differently based on the descriptor's type. In a fresh process, socket_create() returns 3 because 0, 1, and 2 are already taken by stdin, stdout, and stderr.
The other thing that took time to understand was byte order. The network expects big-endian. x86 CPUs use little-endian. Port 80 stored in memory as [50][00] needs to be sent as [00][50]. That's what htons() does. The same applies to addresses: packing four octets into a 32-bit integer via bit shifting and OR operations turns 192.168.1.1 into 0xC0A80101, and htonl() then reverses the bytes for transmission. (inet_addr("142.250.182.100") does both steps in one call - it already returns network byte order.) IP addresses aren't strings at all. They're 32-bit integers formatted for human readability.
The sockaddr_in struct carries 8 bytes of padding (sin_zero) because the kernel expects exactly 16 bytes in that layout - the size of the generic sockaddr it gets cast to. One byte wrong means connection failure.
The HTTP Layer
HTTP is just text with strict formatting rules. A GET request looks like this:
GET / HTTP/1.1\r\n
Host: www.google.com\r\n
Connection: close\r\n
\r\n
Every line ends with \r\n. A blank line separates headers from body. \n alone won't work on a real server.
Building the request meant walking a pointer through a buffer, appending strings one by one. The double pointer char **ptr in append_str is necessary because the function needs to modify where ptr points, not just what it points to. append_int builds digits in reverse into a temp array - % gives you the last digit first - then writes them out in correct order.
Parsing the response meant walking a pointer forward through raw text, stopping at \r\n landmarks to extract status code, headers, and body. find_str does naive string matching: outer loop tries each starting position, inner loop compares character by character, *n == '\0' confirms a full match.
The method_to_string() function returns a pointer into .rodata - the read-only section of the binary where string literals live. That pointer can't be free()d or modified. Running objdump -s -j .rodata grab shows the method strings sitting there in hex: 47455400 504f5354 - GET, POST. Once you've seen it in your own binary it stops being abstract.
The Memory Problem
After parse_http_response(), resp->data points into response_buffer - which lives on the stack inside fetch_sync(). When fetch_sync() returns, that stack frame is gone. resp->data is now a dangling pointer pointing at whatever memory gets reused next.
The fix: malloc(resp->size + 1), copy the body byte by byte, update resp->data to point at the heap copy instead.
free_response() requires two free() calls because malloc was called twice - once for the Response struct, once for the body copy. Free the child before the parent: free the struct first and you lose resp->data, the only pointer to the body.
Every error path in fetch_sync() calls socket_close(sock) and free(resp) before returning NULL. Resources released in reverse order of acquisition. Miss one and you have a leak.
It Worked
First successful run:
Status: 200
Content-Type: text/html; charset=ISO-8859-1
Body size: 5373 bytes
<!doctype html><html itemscope=""...
Real HTML from Google's servers, fetched with code written line by line. There was also a 14f5 at the start of the body - a chunk size indicator from HTTP chunked transfer encoding that my parser didn't handle. A reminder that real servers don't always behave like the spec examples.
The Concurrency Problem
Ten requests. Six seconds. Each socket_recv() blocks - the thread sits doing nothing while waiting for the network. The CPU isn't busy. It's just waiting.
The fix is to run requests in parallel. But parallel in C means threads, and threads sharing memory means thinking carefully about what can go wrong.
The Task Queue
Before building threads, I needed somewhere to put work. The task queue is a thread-safe circular buffer. Each item is a Task - a function pointer and a void* argument:
typedef struct {
task_fn function;
void* arg;
} Task;
Generic by design. The queue doesn't know anything about HTTP.
The (tail + 1) % capacity wraparound logic here is the same thing I'd written solving Design Circular Queue. The difference is what surrounds it.
A mutex protects the few lines where head, tail, and size are modified together. These three variables have to stay consistent with each other. A thread switch between any two of those updates without a lock causes corruption. The mutex is held for microseconds during this. The actual work - the HTTP request - happens completely outside the lock. That's what makes real concurrency possible. Locking the bookkeeping is not the same as locking the work.
A condition variable solves the idle problem. Without it, workers spin in a loop burning CPU. With pthread_cond_wait, a worker releases the lock and goes to sleep atomically until signaled. The while loop around it isn't defensive programming - spurious wakeups are real and the condition needs to be rechecked every time. The lock/unlock balance from Valid Parentheses holds here too: pthread_cond_wait releases the mutex on entry and reacquires it on exit.
The Thread Pool
A fixed set of worker threads, all sharing one task queue. Each worker runs the same loop:
while (not shutdown or queue not empty) {
task = dequeue() // sleep here if nothing to do
task->function(task->arg) // do the work
}
Workers are created once at startup and reused for every request.
Shutdown required one non-obvious step: when thread_pool_destroy sets shutdown = 1, workers are sleeping inside pthread_cond_wait. They'll never wake up to check the flag. The fix is pthread_cond_broadcast - wake every sleeping worker so they can each see the flag and exit.
I caught a race empirically before understanding it analytically: 3 out of 5 tasks printing, not 5. Intermittent, hard to reproduce. That's what a race condition looks like in the wild.
The Async API
With the queue and pool in place, adding async was connecting the pieces:
fetch_async(request, my_callback, context);
This function creates a task where the work is fetch_sync, drops it in the queue, and returns immediately. A worker picks it up, runs the HTTP request, then calls the callback with the response.
One ownership problem worth noting: the callback receives a Response* but also needs to free the Request* that was allocated for it. The solution was a void* context parameter in the callback signature - the caller attaches whatever they need, and it travels through the library untouched, arriving at the callback alongside the response.
The Numbers
Sequential: 10 requests → 6.95 seconds
Async (4 workers): 10 requests → 1.83 seconds
3.8x speedup. The callback order was different every run - tasks submitted 1 through 10, responses arriving in a different sequence each time. That's not a bug. That's what concurrent execution looks like.
The thread pool works because it runs the slow parts in parallel, not because it speeds up any individual operation. Mutex operations take microseconds. Network roundtrips take hundreds of milliseconds. The lock exists to protect bookkeeping, not to protect I/O.
What Actually Clicked
The mutex scope rule makes sense when you measure what's inside it. head, tail, size - microseconds. socket_recv() - hundreds of milliseconds. You can state this as a rule in one sentence, but it only becomes instinct when you've built the thing it applies to.
Producer-consumer isn't a pattern you memorize. It's a name for something you built: one thread submitting tasks, multiple threads pulling from a queue, a condition variable signaling between them. The LeetCode problem gave mechanical intuition. The implementation gave it a name.
Dangling pointers aren't abstract warnings. resp->data pointing at a dead stack frame is a real bug that looks like garbage output or a segfault depending on when the memory gets reused. You hit it, you fix it, you don't forget it.
The .rodata discovery was quite satisfying. Running objdump and seeing your own strings in hex makes the compiler feel less like a black box. That kind of visibility is one of the reasons to work without libraries.
Part 2: Parallel File Encryptor in C++
Starting Point
I'd originally planned to follow along with Lovepreet Singh's file encryptor project videos on YouTube. About five minutes in, I stopped. Watching someone build something and actually understanding it are different things. So I used the videos only as a loose reference and worked through the rest in conversation with Claude - same approach as the HTTP library. Each concept came up only when a specific problem demanded it. I didn't learn about unique_lock in the abstract; I learned about it because I'd used lock_guard wrong and needed to understand why.
Coming off the HTTP library, I had the hard concepts already - thread pools, mutexes, condition variables. What I didn't know was C++ specifically. That gap turned out to be easier to close than expected. The ideas transferred directly; mostly I was learning new syntax for familiar patterns.
The Foundation
The core idea: read a file, XOR every byte with a key, write it back. XOR has a useful property - applying it twice with the same key restores the original, so encrypt and decrypt are the same operation.
The architecture emerged the same way as the HTTP library - from questions. How do I read a file? Write one? Describe a unit of work? The answers became FileReader, FileWriter, and a Task struct:
struct Task {
std::string file_path;
Action action;
Task(std::string path, Action act) : file_path(path), action(act) {}
};
A ProcessManager class holds a std::queue<Task>. Sequential first, then parallel.
Stage 1: Thread Pool in C++
Making ProcessManager concurrent meant learning where C++ diverges from C primitives.
The race condition is concrete: Thread 1 calls empty() - returns false. Thread 2 calls empty() - also returns false. Both enter the loop. Both call front() on the same task. This is a TOCTOU problem. The check and the pop need to be atomic together, which a mutex provides.
One thing I got wrong early: I used lock_guard and tried to manually call unlock() on it. lock_guard doesn't have an unlock() method - it only unlocks on scope exit. The fix was unique_lock, which allows early manual unlock. Why does early unlock matter? Because file I/O takes 200-500 microseconds. Holding the mutex during that time would serialize every thread. The mutex should be held for microseconds, not milliseconds - the same principle from the HTTP library's task queue.
void ProcessManager::process() {
while(true) {
unique_lock<mutex> lock(mtx);
cv.wait(lock, [this]{ return !t.empty() || stop; });
if(t.empty() && stop) break;
Task task = t.front();
t.pop();
lock.unlock();
// file I/O happens here, outside the lock
}
}
Benchmarks on 1000 files: performance scaled from ~105ms at 2 threads to ~47ms at 8-10 threads, then flatlined. Past your core count, context-switching overhead dominates.
Stage 2: Multiprocessing
Threads share memory. Processes don't. This one difference changes everything architecturally.
A simple experiment confirmed it: set a variable before fork(), modify it in the child, check it in the parent. The parent sees the original value. Each process gets its own private copy of memory at the moment of forking.
So the thread pool's std::queue<Task> can't work across processes - its internal heap pointers point to one process's private memory, meaningless to any other. The solution is shared memory via mmap.
But std::queue can't live in shared memory either. The fix is a circular buffer - a fixed-size array with head and tail indices. Everything lives contiguously in one block, no external pointers, no heap allocations. This is the same structure from the HTTP library's task queue, now put to a different use.
struct SharedQueue {
SharedTask tasks[100];
int head, tail, size, capacity;
pthread_mutex_t mtx; // NOT std::mutex - must work across processes
sem_t items;
sem_t spaces;
bool stop;
};
std::mutex only works within a single process. For cross-process synchronization, pthread_mutex_t initialized with PTHREAD_PROCESS_SHARED is stored directly in shared memory. POSIX semaphores replace condition variables - two of them handle producer-consumer coordination. items starts at 0 and workers block on it when the queue is empty. spaces starts at capacity and the main process blocks on it when the queue is full.
Benchmarks were nearly identical to threading at low worker counts, with multiprocessing pulling ahead slightly at higher counts. Each process has its own CPU cache. Threads sharing memory can invalidate each other's cache lines even when working on different files. Processes don't. At 10+ workers this cache isolation starts mattering.
Stage 3: Hybrid
The hybrid combined both: N processes each running M threads internally, all pulling from the same shared memory queue. Straightforward to build given the existing pieces.
The benchmarks were nearly identical to pure multiprocessing and threading at equivalent total worker counts. I'd suspected this might happen, so I added instrumentation.
Measuring queue access time versus file I/O time per task:
queue: 0-3 microseconds
file I/O: 200-500 microseconds
300x difference. The queue was never the bottleneck. Running perf stat confirmed it further: with 4 threads, CPU utilization was only 1.5 - threads spent most of their time idle, waiting on kernel I/O. More parallelism strategies couldn't fix a blocking I/O problem.
Stage 4: Async I/O with io_uring
The problem wasn't how work was distributed. The problem was that every thread blocked on each read() and write() syscall, sitting idle until the kernel returned. The solution isn't more workers - it's eliminating the blocking.
io_uring lets you submit I/O requests to a ring buffer and collect completions later, without blocking. One thread can keep dozens of operations in flight simultaneously, keeping the disk saturated.
The design needed one new concept: tracking the lifecycle of each file. A file isn't done after a read - it still needs a write. An IOContext struct carries the file descriptor, buffer, size, and an is_read flag. When a read completion arrives, the buffer gets XOR-transformed and a write is submitted. When the write completion arrives, cleanup happens.
while(completions_received < total_files * 2) {   // two completions per file: read, then write
    // Keep the ring full without exceeding queue_depth.
    while(in_flight < queue_depth && files_submitted < total_files) {
        IOContext* ctx = new IOContext();
        ctx->fd = open(path.c_str(), O_RDWR);
        // fstat, allocate buffer, prep_read, set_data
        in_flight++; files_submitted++;
    }
    io_uring_submit(&ring);
    io_uring_wait_cqe(&ring, &cqe);
    IOContext* ctx = (IOContext*)io_uring_cqe_get_data(cqe);
    completions_received++;                       // count read and write completions alike
    if(ctx->is_read) {
        // transform buffer, flip is_read, submit write - the slot stays in flight
    } else {
        close(ctx->fd); delete[] ctx->buffer; delete ctx;
        in_flight--;                              // slot freed for the next file
    }
    io_uring_cqe_seen(&ring, cqe);
}
The results against multiprocessing, confirmed via perf stat:
| Metric | Multiprocess (12) | Async I/O (depth 8) |
|---|---|---|
| CPUs utilized | 9.1 | 1.7 |
| Total CPU time | 0.53s | 0.08s |
| Wall clock time | ~26ms | ~34ms |
Similar wall clock. Seven times less CPU consumed. The async user time was effectively zero - the application thread was barely running. The kernel handled everything via io_uring.
What Both Projects Taught Me
The HTTP library taught me what every network client does at the bottom - syscalls, byte ordering, protocol formatting, memory ownership. The concurrency work taught me the producer-consumer pattern and why mutex scope matters more than people think.
The file encryptor started as a way to learn C++ idioms. It ended as a practical demonstration of why production systems are built the way they are. The deeper lesson wasn't about any specific API - it was about correctly identifying what the actual bottleneck is before reaching for more complexity.
Adding a hybrid architecture on top of an already I/O-saturated disk doesn't help. Measuring before concluding is the habit worth keeping.
The thread pool and task queue pattern transferred directly from the HTTP library to the file encryptor. The same circular buffer that handled async HTTP requests became the shared memory queue across processes. Different problem domain, same underlying machinery. That transferability was the real payoff.
Both projects were built the same way - wrong attempts, specific questions, corrections, and trying again. Having something that could explain the why behind a fix rather than just hand over working code made the difference between finishing with a project and finishing with understanding.