Coding Challenges

Coding Challenge #123 - Database Driven LLM Wiki

John Crickett — Sat, 06 Jun 2026 08:01:47 GMT

Hi, this is John with this week’s Coding Challenge.

🙏 Thank you for being a subscriber, I’m honoured to have you as a reader. 🎉

If there is a Coding Challenge you’d like to see, please let me know by replying to this email📧

Coding Challenge #123 - Database Driven LLM Wiki

This challenge is to build your own personal knowledge base tool - a system where an LLM agent reads your curated sources, extracts the key information, and builds you a living, interlinked wiki that grows smarter with everything you add.

Most people’s experience with LLMs and documents is RAG: upload files, ask questions, get answers stitched together from retrieved chunks. It works, but the LLM rediscovers knowledge from scratch every time. Ask a subtle question that spans five documents, and the system has to find and piece together fragments it’s seen before. Nothing accumulates.

LLM Wiki takes a different approach. Instead of just retrieving from raw documents at query time, an LLM agent incrementally builds and maintains a persistent wiki - a structured collection of markdown files that sits between you and your sources. When you add a new source, the agent reads it, extracts key information, and integrates it into the existing wiki: updating entity pages, revising topic summaries, noting where new data contradicts old claims, strengthening or challenging the evolving synthesis. The knowledge is compiled once and then kept current, not re-derived on every query.

You’re in charge of sourcing, exploration, and asking good questions. The LLM does all the grunt work: summarising, cross-referencing, filing, and the bookkeeping that makes a knowledge base actually useful over time.

This challenge is inspired by Andrej Karpathy’s LLM Wiki concept - a pattern for building personal knowledge bases using LLMs. Karpathy describes the idea in the abstract; our job is to build a working implementation.

Under the hood, the system stores vector embeddings and full-text indexes for every wiki page in Oracle AI Database. When you ask a question, it finds relevant pages using hybrid search, and an LLM synthesises an answer with citations. The agentic behaviour - ingestion workflows, multi-step querying, lint passes - is orchestrated using LangGraph’s state machine model, while LangChain handles the LLM integration. It’s a practical introduction to agent-based knowledge management, vector search, full-text search, and building tools that genuinely compound in value over time.

The Challenge - Building LLM Wiki

You’re going to build a personal knowledge base that an LLM agent writes and maintains for you. It starts by ingesting source documents into a wiki of markdown files, then lets you query it through a web interface or CLI chat. Step by step you’ll add wiki scaffolding, source ingestion, index and log management, vector storage, semantic retrieval, project management, linting, and a user interface. By the end, you’ll have a tool that genuinely helps you build and navigate a growing body of knowledge.

Step Zero

In this introductory step you’re going to set your environment up ready to begin developing and testing your solution.

You’ll need to make a few decisions and get some infrastructure running:

Set up your vector and full-text database. You’ll need Oracle Database 26ai running in a local Docker container. Pull the container-registry.oracle.com/database/free:latest image, start the container, and set a password for the admin account. You can find full setup instructions in the Oracle Database Free Get Started guide. Once the container is running, connect using a SQL client and verify you can create a table. Store all credentials in an environment file, not hardcoded anywhere.

docker pull container-registry.oracle.com/database/free:latest
docker run -d -p 1521:1521 -e ORACLE_PWD= container-registry.oracle.com/database/free:latest

Choose your embedding model. You need a text embedding model that captures semantic meaning. Nomic’s nomic-embed-text is open source (Apache 2.0) and runs locally on CPU. It produces 768-dimensional vectors. Install it via Hugging Face: pip install sentence-transformers and load it as nomic-ai/nomic-embed-text-v2-moe. Any general-purpose embedding model with reasonable semantic quality will work - the key requirement is that it can capture the meaning of wiki pages well enough to find relevant ones for a given query.
Set up your LLM provider. You’ll need a language model for writing wiki pages, summarising sources, and answering questions. Any provider with a chat API will work - Anthropic, OpenAI, Google, Mistral, or a local model via Ollama. The model needs to be capable enough to write coherent markdown and extract structured information from source documents.
Set up LangChain and LangGraph. You’ll use LangChain for LLM integration and LangGraph for orchestrating the agent’s workflows. Install both: pip install langchain langgraph. LangChain handles the plumbing of prompting and response parsing. LangGraph handles the agentic flows - the multi-step processes of ingesting a source (read, extract, write summaries, update pages, update index, log), answering queries (search, read pages, synthesise), and running lint passes.
Pick a topic for your first wiki. Choose a domain you’re genuinely curious about and gather 50-100 source documents - articles, papers, blog posts - about it. Save them as markdown or plain text files. This will be your test corpus. Pick something where the articles naturally reference the same entities and concepts, so there are connections for your wiki to surface. A technical topic works well (e.g. database internals, a specific ML technique, a programming language), but any domain with depth will do - history, cooking, fitness, whatever interests you.

Testing: Verify your Oracle Database container is running and you can connect to it. Load your embedding model and generate a test embedding to confirm it returns a vector of 768 dimensions. Make a test call to your LLM API to confirm it returns a valid response. Verify your environment file is being read correctly and no credentials are in your source code.

Step 1

In this step your goal is to build the wiki scaffolding - the system that creates and manages the directory of markdown files that will become your knowledge base.

The wiki lives on disk as a directory of markdown files. Before you can ingest sources or answer questions, you need the infrastructure to create pages, write content to them, and link them together. Think of this as the file system layer of your knowledge base.

Start by defining what a new wiki looks like on disk. When you create a wiki, the system should scaffold a directory structure with subdirectories for different page types: summaries for source summaries, entities for pages about people/companies/concepts, topics for overview pages, and a raw directory for the original source files. Alongside the directories, create a schema file (call it SCHEMA.md) that defines the conventions for this wiki - what the directories are for, what naming conventions to use, what frontmatter fields pages should have, and how cross-references should be formatted.

Pages should carry YAML frontmatter at the top with metadata: title, type (entity, concept, summary, overview), date created, date updated, tags, and a list of sources the page draws from. The frontmatter makes pages queryable later and lets tools like Obsidian’s Dataview plugin generate dynamic views.

Cross-references between pages should use standard markdown links or wikilinks ([[Page Name]]). Which style you use is up to you, but the schema file should record the convention so the agent can be consistent. When one page references another, the agent should be able to follow that link, read the target page, and update both sides of the relationship.

Build a simple CLI that lets you create a new wiki, list existing wikis, and inspect a wiki’s structure - how many pages it has, what directories exist, and what the schema says. This CLI is just for development and testing; you’ll replace it with a proper interface later.

Testing:

Create a wiki. Verify the directory structure and SCHEMA.md file were created on disk.
Manually add a few test pages with frontmatter and cross-references. Verify the frontmatter parses correctly and links resolve to the expected paths.
Create a second wiki with a different schema configuration (different frontmatter fields, different link style). Verify both wikis coexist and respect their own conventions.
List your wikis and verify both appear.
Delete a wiki directory manually and verify the listing correctly reflects its absence.

Step 2

In this step your goal is to build the source ingestion pipeline - the agent workflow that reads a source document and integrates its knowledge into the wiki.

This is the heart of the system. When you drop a new source into the raw directory and tell the agent to ingest it, a multi-step workflow begins. The agent reads the source, extracts key information, discusses the takeaways with you (in interactive mode), and then updates the wiki across multiple pages.

Model the ingestion workflow as a graph in LangGraph. Each node handles one concern: read the source, extract entities and concepts, identify claims and key information, write a summary page in the summaries directory, update or create entity pages for each entity found, update or create concept pages for each concept, revise topic overview pages, flag contradictions with existing content, update the index, and append an entry to the log. A single source might touch 10-15 wiki pages.

The agent should be able to detect contradictions. When a new source makes a claim that conflicts with something already in the wiki, the agent should note the discrepancy on the relevant page rather than silently overwriting or ignoring it. The user should be able to see where sources disagree and make their own judgement.

The agent should also identify gaps - entities or concepts referenced in the source that don’t yet have pages - and create stub pages or flag them for later attention.

Think carefully about how you prompt the LLM for each of these tasks. The quality of the wiki depends entirely on the quality of the extraction and synthesis. You’ll likely need different prompts for different page types: a summary page prompt, an entity page prompt, a concept page prompt, and so on. The schema file you built in Step 1 should guide these prompts.

The original source file should go into the raw directory and never be modified. The agent reads from it but never writes to it. This is your source of truth.

Testing:

Ingest a single source document (a short article, 500-1000 words) into a test wiki. Verify the agent creates a summary page that captures the key points without hallucinating facts not in the source.
Verify the agent creates or updates entity pages for the key people, companies, or concepts mentioned in the source.
Verify the agent creates or updates topic overview pages that connect this source to existing knowledge (if the wiki already has content).
Ingest a second source on the same topic that contradicts something in the first source. Verify the agent flags the contradiction on the relevant page.
Check the wiki directory after ingestion. It should contain new or updated files in the summaries, entities, and topics directories. The raw directory should contain the original source unchanged.
Ingest a source that mentions an entity not yet in the wiki. Verify the agent creates a stub page or flags the gap.

Step 3

In this step your goal is to build the index and log - two special files that help the agent (and you) navigate the wiki as it grows.

The index (index.md) is content-oriented. It’s a catalogue of every page in the wiki, organised by category: entities, concepts, sources, overviews. Each entry includes a link to the page, a one-line summary, and optionally metadata like creation date and the number of sources that feed into it. When the agent needs to answer a query, it reads the index first to find candidate pages, then drills into the most relevant ones. This approach works well at moderate scale (hundreds of pages) and avoids the need for embedding-based RAG infrastructure at the browsing level.

The log (log.md) is chronological. It’s an append-only record of everything that happened: ingests, queries, lint passes, schema changes. Each entry starts with a consistent prefix format: ## [YYYY-MM-DD] type | Description. This makes the log parseable with simple command-line tools - grep "^## \\\\[" log.md | tail -5 gives you the last five entries.

The key design decision is that the agent owns both files. Every ingestion should update the index with new pages and revised summaries. Every operation should append to the log. The agent should read the index at the start of every query to know what’s available. The agent should read the log at the start of every session to know what’s been done recently.

Build the index and log maintenance into your LangGraph workflows from Step 2. After the agent finishes writing wiki pages for an ingestion, it should update the index and append to the log as the final nodes in the graph. If an ingestion fails partway through, the log should record the failure.

Testing:

Ingest a source and verify the index is updated with entries for the new summary page, entity pages, and concept pages. Each entry should have a link and a one-line description.
Ingest a second source and verify the index reflects both sources, with shared entity pages showing updated descriptions.
Check the log after several ingestions. Verify each entry has the correct format (## [YYYY-MM-DD] ingest | Title) and appears in chronological order.
Run grep "^## \\\\[" log.md and verify you get a clean chronological listing.
Manually delete a wiki page. Ingest a new source and verify the index accounts for the missing page (removes the stale entry rather than leaving dead links).

Step 4

In this step your goal is to add semantic search over your wiki pages using vector embeddings and full-text search stored in Oracle AI Database.

So far the agent navigates the wiki by reading the index and following links. That works at moderate scale, but as your wiki grows to hundreds of pages, you’ll want semantic search - finding pages by meaning, not just by browsing the catalogue.

Take every wiki page (excluding the index and log themselves, and excluding raw source files), generate a vector embedding for it using your embedding model, and store the embedding alongside the page’s path, title, type, tags, and a snippet or summary in Oracle Database. The metadata fields should all be stored and indexed so you can filter by type (”only entity pages”) or by tag.

Create a vector index on the embedding column for fast cosine similarity search. Also create an Oracle Text full-text index on the page content (or at minimum on the title and summary fields). Vector search finds semantically related pages even when the words don’t match. Full-text search catches exact names, technical terms, and phrases that vector search might rank lower. Together they give you robust retrieval.

Think about when embeddings should be generated. Every time the agent creates or updates a page during ingestion, the new or revised page needs to be re-embedded and stored. Pages that weren’t touched by an ingestion should keep their existing embeddings. You’ll need to track which pages changed so you only re-embed those.

Also think about what you embed. You could embed the full page text, but long pages might dilute the semantic signal. You could embed a summary or the first N paragraphs. You could embed both the title and the body separately and combine scores. Experiment and find what works for your test corpus - different approaches suit different types of content.

Testing:

Run the full pipeline against your test wiki: parse all pages, generate embeddings, store in Oracle Database. Verify that the number of stored embeddings matches the number of wiki pages.
Query Oracle Database directly to inspect a few stored entries. Verify each contains the embedding vector, page path, title, type, and tags.
Verify that both the vector index and the full-text index have been created.
Update a page (by ingesting a new source that modifies an existing entity page). Verify only the changed page is re-embedded; unchanged pages keep their existing embeddings.
Search for a page by a concept it discusses (not by its exact title). Verify vector search returns it even though the words don’t match.
Search for an exact technical term or person’s name. Verify full-text search catches it with high confidence.

Step 5

In this step your goal is to build the query system - the agent workflow that answers your questions by searching the wiki and synthesising a response.

Now that you have a searchable wiki, you need the agent to put it to use. When you ask a question, the agent should follow a multi-step process: read the index to identify candidate pages, search for semantically relevant pages using the hybrid search you built in Step 4, read the most relevant pages in full, and synthesise an answer with citations to specific pages and sections.

Model the query workflow as a LangGraph graph. The nodes might include: read index, hybrid search, read candidate pages, and synthesise answer. If the agent finds gaps - the question touches on something the wiki doesn’t cover well - it should say so honestly rather than speculating.

The system should support different answer formats depending on the question. A comparison between two concepts might be best as a table. A timeline of events might be best as a chronological list. A straightforward explanation might be best as prose. Give the LLM the flexibility to choose the format, and provide guidance in your prompts.

An important capability: the system should offer to file good answers back into the wiki as new pages. When you ask a question that generates a useful analysis, comparison, or synthesis, the agent should ask if you want to save it. If you do, it writes the answer as a new wiki page, adds it to the index, logs it, and embeds it. This way your explorations compound in the knowledge base just like ingested sources do.

Support follow-up questions within a session. If you ask “tell me about X” and then “what about Y?”, the agent should understand from context that Y relates to the broader topic X is part of. LangGraph’s state carries conversation context forward between queries.

Testing:

Ask a question about something well-covered in your wiki. The answer should be accurate, cite specific pages, and not hallucinate facts not in the wiki.
Ask the same question with different phrasing. Verify you get a similar answer - semantic search should match by meaning, not exact wording.
Ask a question that spans two or more wiki pages. Verify the agent reads multiple pages and synthesises an answer that connects them.
Ask a follow-up question without re-stating the topic. Verify the agent maintains context from the previous exchange.
Ask a question about something not covered in your wiki. Verify the agent honestly reports the gap rather than making things up.
Ask the agent to save an answer as a wiki page. Verify the page is created on disk, added to the index, logged, and embedded in Oracle Database.
Ask a comparison question (”what’s the difference between X and Y?”). Verify the answer uses an appropriate format (table, side-by-side, etc.).

Step 6

In this step your goal is to add wiki project management so you can maintain separate knowledge bases for different topics.

A single wiki is useful, but you’ll likely want separate knowledge bases for different areas of your life - one for your research topic, one for book notes, one for health and fitness, one for career learning. Each should be isolated, with its own set of pages, its own embeddings, and its own agent memory.

Add support for named wiki projects. Store project metadata - name, creation date, last ingestion timestamp, page count, source count - in Oracle AI Database. Each project’s embeddings and metadata should be isolated so that searches against one wiki never return pages from another.

The user should be able to create a new wiki project, list existing projects, and select which project to work with. When the user starts the system, they should be able to specify a project name and immediately pick up where they left off.

All wiki data should persist between runs. The markdown files live on disk in their project directories. The embeddings and metadata live in Oracle AI Database. When the user comes back and selects a project, everything should be exactly as they left it - the same pages, the same index, the same log, the same search capability.

Testing:

Create two wikis on different topics, each with a small set of source documents. Ingest sources into both.
Query one wiki and verify results come only from that wiki, not the other.
List your wikis and verify both appear with correct names and metadata (page count, last ingestion time).
Query the project metadata directly in Oracle Database and verify it matches what the system reports.
Stop and restart your system. Verify all wiki data is intact and queryable for both projects.
Add a new source to an existing wiki and re-run ingestion. Verify only new or changed pages are re-embedded; unchanged pages preserve their existing embeddings.

Step 7

In this step your goal is to build a linting system that health-checks your wiki and helps it stay consistent as it grows.

As your wiki accumulates pages and sources, inconsistencies creep in. A page makes a claim that a newer source contradicts, but the older page was never updated. A concept is discussed across five pages but never got its own dedicated page. A page references another that was renamed or deleted. Links only go one way. Gaps appear where you have half the story. Humans abandon wikis because this maintenance burden grows faster than the value. The agent can handle it.

Build a lint operation as a LangGraph workflow. The agent should walk the wiki systematically and check for: contradictions between pages (two pages make incompatible claims), stale claims (a page asserts something that a newer source has revised or disproven), orphan pages (pages with no inbound links from other wiki pages), missing pages (entities or concepts referenced but lacking their own page), broken cross-references (links that point to non-existent pages), and data gaps (areas where the wiki is thin and could benefit from additional sources).

The agent should report its findings as a prioritised list: critical issues first (contradictions, stale claims), then warnings (orphans, missing pages), then suggestions (gaps, possible new sources to look for). Each issue should include the specific pages involved and a suggested action.

The lint pass should also suggest new questions to investigate and new sources to look for - what’s missing from the wiki that would fill important gaps? This turns the lint operation from a bug-finding exercise into a research planning tool.

Make the lint operation interactive by default. The agent presents its findings, and you accept, reject, or modify each suggestion before any changes are made. The agent should not modify pages without confirmation unless you explicitly run in auto-fix mode.

Testing:

Create a deliberate contradiction: write two entity pages that make incompatible claims about the same thing. Run lint and verify the agent detects and reports the contradiction, citing both pages.
Create an orphan page: a page with no other pages linking to it. Run lint and verify it’s flagged.
Reference a page that doesn’t exist (a broken wikilink). Run lint and verify the broken link is reported.
Have a concept discussed across multiple pages but without its own dedicated page. Run lint and verify the agent suggests creating one.
Accept a lint suggestion and verify the agent applies the fix correctly.
Reject a lint suggestion and verify no changes are made.

Step 8

In this step your goal is to build a user interface for your LLM Wiki, providing a chat-based interface for querying and managing your knowledge base.

Your wiki agent currently works through a development CLI. Now give it a proper interface. You have a choice: build a web interface (similar to the Code Sherpa challenge), a CLI chat interface, or both. The core requirements are the same regardless.

The interface should provide a chat panel for querying the wiki. The user types questions in natural language, and the agent responds with synthesised answers that cite specific wiki pages. Responses should render markdown so that tables, lists, and formatted text display clearly. Citations should be clickable links that open the referenced page.

The interface should support the full query workflow you built in Step 5: semantic search, multi-page synthesis, follow-up questions with context, different answer formats, and saving answers as new wiki pages.

Provide a way to browse the wiki structure: a page tree or list showing categories (entities, concepts, summaries, overviews) and the pages within each. This helps the user understand the shape of their knowledge base at a glance.

When the agent is processing a query, show a loading state so the user knows something is happening. LangGraph workflows can take several seconds as the agent reads the index, searches for pages, reads them, and synthesises an answer.

Build a separate CLI tool for source ingestion that doesn’t require launching the full interface. A user should be able to run something like llm-wiki ingest my-research ./new-article.md and have the agent process the source in the background, updating the wiki, index, log, and embeddings. This makes adding sources a quick, low-friction operation.

Testing:

Launch the interface and verify you can access it.
Select a wiki project and ask a question. Verify the response appears with markdown rendering and citations to wiki pages.
Click a citation link. Verify it opens the referenced page.
Ask a follow-up question and verify the system maintains context from the previous exchange.
Browse the wiki structure through the interface. Verify it accurately reflects the pages on disk.
Ask the agent to save an answer as a wiki page. Verify the page appears in the wiki structure immediately.
Use the CLI ingestion tool to add a new source. Verify the wiki updates without needing to launch the full interface.
Submit a query and verify a loading indicator appears while the agent processes it.

Going Further

You’ve built a working personal knowledge base with an LLM agent that ingests sources, writes wiki pages, answers questions, and keeps everything consistent. Here are some ways to take it further:

Cloud database support: Add an option to connect to a cloud-hosted Oracle AI Database instance instead of the local Docker container. Read the connection string from configuration.
Batch ingestion: Add a batch mode that ingests multiple sources at once with less supervision. The agent processes each source, generates updates, and presents a summary of changes for your review rather than discussing each source individually.
Marp slide generation: Add the ability to generate slide decks (using Marp format) from wiki content. This turns a collection of pages on a topic into a presentation with a single request.
Obsidian integration: Build tighter integration with Obsidian. Watch the wiki directory for changes made through Obsidian and update embeddings automatically. Use Obsidian’s Dataview plugin with the frontmatter your agent already writes.
Multi-format sources: Extend source ingestion to handle PDFs, web URLs (with scraping), and audio transcripts. The more formats you support, the more knowledge you can capture.

Share Your Solutions!

If you think your solution is an example other developers can learn from please share it, put it on GitHub, GitLab or elsewhere. Then let me know via Bluesky or LinkedIn or just post about it there and tag me. Alternately please add a link to it in the Coding Challenges Shared Solutions Github repo

Request for Feedback

I’m writing these challenges to help you develop your skills as a software engineer based on how I’ve approached my own personal learning and development. What works for me, might not be the best way for you - so if you have suggestions for how I can make these challenges more useful to you and others, please get in touch and let me know. All feedback is greatly appreciated.

You can reach me on Bluesky, LinkedIn or through SubStack

Thanks and happy coding!

John

Coding Challenge #122 - AI-Powered Contract Review Agent

John Crickett — Sat, 30 May 2026 08:01:36 GMT

Hi, this is John with this week’s Coding Challenge.

🙏 Thank you for being a subscriber, I’m honoured to have you as a reader. 🎉

If there is a Coding Challenge you’d like to see, please let me know by replying to this email📧

Coding Challenge #122 - AI-Powered Contract Review Agent

This challenge is to build your own AI-powered contract review agent using Trigger.dev - an application that takes a PDF contract, breaks it into clauses, analyses each one for risk in parallel using LLMs, pauses for human review, and streams back a final summary. Whilst the example of contract review, the workflow is applicable to many other domains.

This challenge was created in collaboration with Trigger.dev, whose platform provides durable background tasks with no timeouts, built-in retries, concurrency controls, human-in-the-loop pause points, real-time streaming, and full observability - all in TypeScript.

Contract review is one of the most time-consuming parts of legal work. Lawyers spend hours poring over dense documents looking for risky clauses, ambiguous language, and missing terms. An AI agent that can do the first pass, flag issues, and then incorporate human feedback before producing a final report would save enormous amounts of time.

Building this from scratch means you’d need to solve several hard infrastructure problems: job queuing with retries, parallel execution with concurrency control, durable pause-and-resume for human review, real-time streaming to the frontend, and execution tracing. That’s exactly what Trigger.dev handles for you. You define your workflow as a set of tasks - functions that can run for as long as needed - and Trigger.dev takes care of the rest. Your focus stays on the application logic: extracting clauses, analysing risk, and generating summaries.

By the end you’ll have a deep understanding of durable workflow orchestration and how a platform like Trigger.dev makes building AI-powered applications dramatically simpler.

The Challenge - Building Your Own AI-Powered Contract Review Agent

You’re going to build a web application that lets users upload PDF contracts and then kicks off a series of durable Trigger.dev tasks. Those tasks will extract text, split it into clauses, analyse each clause in parallel with automatic retries, pause for human approval, and stream a final summary back to the frontend in real time. The system will handle contracts with 50+ clauses reliably, support multiple LLM providers, and give you complete visibility into every step through Trigger.dev’s built-in dashboard.

Step Zero

In this introductory step you’re going to set your environment up ready to begin developing and testing your solution.

You’ll need to make a few decisions:

Set up your Trigger.dev project. Trigger.dev is a TypeScript-first platform, so you’ll be working in that ecosystem. Scaffold a new project using npx create-trigger@latest and follow the quickstart to get a task running. Choose a framework for your web app - Next.js is a natural fit since it pairs well with Trigger.dev, but you can also use Express, Remix, or any Node.js framework. Get your Trigger.dev API key from the dashboard and configure your environment.
Choose your LLM providers. Sign up for API keys. You’ll build an abstraction layer so you can swap providers without changing the rest of your code.
Choose your database. You’ll need to persist users, contracts, clause analyses, reviewer decisions, and final summaries. PostgreSQL works well and integrates naturally with Prisma. Pick what you’re most comfortable with.
Understand the Trigger.dev project structure. Your tasks live in the trigger/ folder. Each file defines one or more tasks using the task() function. These are functions that can run indefinitely with no timeouts, automatic retries on failure, and built-in logging. You trigger tasks from your web app, and you can chain tasks together using triggerAndWait() or batchTriggerAndWait(). Take a few minutes to read through the tasks documentation and get comfortable with the concepts.

Testing: Run the example task that create-trigger generates (usually called hello-world). Trigger it from your web app and verify it appears in the Trigger.dev dashboard. Make a simple API call to each of your chosen LLM providers with a basic prompt and verify you get a coherent response. Set up your database, create a test table, and verify you can read and write data. Once all three are working independently, you’re ready to start building.

Step 1

In this step your goal is to build user authentication and a lightweight homepage.

Create a sign-up and login system using email and password. You’ll need a user model in your database, registration and login forms, and session management. Keep the auth simple - you don’t need OAuth or social login, just email/password with hashed passwords.

Build a lightweight homepage that explains what the product does. It doesn’t need to be a full marketing site - just a clear explanation that this is an AI-powered contract review tool, what it does, and a call-to-action to sign up or log in.

Testing:

Visit the homepage and verify it renders correctly with the product explanation.
Register a new account with an email and password. Verify you’re redirected and logged in.
Log out and log back in with the same credentials.
Try registering with an email that already exists. You should get an appropriate error.
Try logging in with the wrong password. You should get an appropriate error, not a crash.

Step 2

In this step your goal is to build the PDF upload and text extraction pipeline.

Create a web UI where logged-in users can drag and drop or select a PDF contract file for upload. Once uploaded, store the file and extract the raw text from the PDF. Use a PDF parsing library - pdf-parse for simpler PDFs or pdf.js for more complex documents.

This is a great place to introduce your first custom task. Create a processContractUpload task in your trigger/ folder. When a user uploads a PDF, your web app should store the file, create a contract record in your database, and then trigger the task with the contract ID. The task should:

Extract raw text from the PDF
Store the extracted text back to the database, linked to the contract
Update the contract status

You don’t need to worry about timeouts - Trigger.dev tasks can run as long as needed, which is important for large PDFs. You also don’t need to worry about what happens if the server restarts mid-processing. The task will resume where it left off.

Testing:

Upload a multi-page PDF document. Verify the text is extracted and stored in the database.
Upload a PDF with unusual formatting (headers, footers, columns). Check how well your extraction library handles it. Some garbled text is expected.
Try uploading a non-PDF file. Your application should reject it with a clear error message.
Try uploading without being logged in. The application should redirect to the login page.
Open the Trigger.dev dashboard. You should see your processContractUpload task run with a status, duration, and any logs you emitted.

Step 3

In this step your goal is to split the extracted text into individual clauses using an LLM.

Raw extracted PDF text is rarely clean. You’ll have page numbers, headers, footers, and sometimes text in the wrong order. Your job now is to take that extracted text and use an LLM to identify and split it into individual, well-formed clauses.

Extend your processContractUpload task or create a new child task that takes the extracted text, sends it to an LLM with a prompt instructing the LLM to return the text split into clauses, and stores the results. Each clause should be a distinct logical unit - a paragraph, a condition, a definition, a warranty.

Store the identified clauses in your database, linked to the contract. Each clause should have a reference number (1, 2, 3...) so you can refer to it later.

Testing:

Upload a contract and verify it gets split into multiple clauses. For a typical multi-page contract, you should get at least 10-15 clauses.
Inspect the clauses in your database. Each one should be a coherent, self-contained piece of text, not a fragment mid-sentence.
Upload a very short document (a single paragraph). It should still work, returning one clause.
Check the Trigger.dev dashboard and verify the clause-splitting step appears in the run timeline.

Step 4

In this step your goal is to analyse each clause in parallel using LLMs to flag risk levels and ambiguous language.

This is the core of the application - and where Trigger.dev’s features really shine. For each clause, you need to send it to an LLM for analysis. The analysis should identify:

Risk level - high, medium, or low
Risk explanation - a short explanation of why the clause is risky
Ambiguous language - any vague terms like “reasonable efforts”, “as soon as practical”, “material adverse change” that could be interpreted differently
Recommendations - suggested changes to reduce risk or clarify ambiguity

Create an analyseClause task that takes a clause ID and text, calls your LLM, and stores the analysis result in the database. Configure it with retry settings so transient LLM API failures are handled automatically - Trigger.dev will retry with exponential backoff by default.

Now for the parallelism. From your parent task, use analyseClause.batchTriggerAndWait() to trigger all clause analyses in a single batch call. Trigger.dev will execute them in parallel (up to your environment’s concurrency limit), collect all the results, and return them to your parent task. A 50-clause contract is no problem - you get fan-out parallelism without writing any queue infrastructure.

Set a concurrencyLimit on the analyseClause task’s queue if you need to respect LLM API rate limits. For example, if your OpenAI tier allows 10 concurrent requests, set queue: { concurrencyLimit: 10 }.

Testing:

Upload a contract and verify all clauses are analysed, each with a risk level, explanation, ambiguity flags, and recommendations.
Check the Trigger.dev dashboard run view. You should see the parent task with all the child analyseClause runs, their individual statuses, durations, and any retries.
Temporarily use a rate-limited API key and verify that failed analyses are automatically retried and eventually succeed. Watch the retries in the dashboard.
Upload a 50+ clause contract. Verify it completes reliably. All 50+ analyses should be in the database.
Inspect a few analyses. A clause that says “The Provider shall not be liable for any damages” should be flagged as high risk. A clause that mentions “reasonable efforts” should be flagged for ambiguous language.

Step 5

In this step your goal is to aggregate the clause analyses into a structured review report and pause for human review.

Once all clause analyses are complete, your parent task should aggregate them into a structured review report. The report should show each clause number, the clause text, the risk level, the analysis explanation, any ambiguous language found, and recommendations. Group clauses by risk level (high first) so the reviewer can tackle the most important issues first.

Now for the human-in-the-loop part - this is where Trigger.dev’s waitpoint system comes in. After aggregating the results, use wait.createToken() to create a pause point. Store the token ID alongside the contract in your database so your review dashboard can reference it later. Then call wait.forToken() - your task will suspend at this point. Trigger.dev checkpoints the task state and releases compute resources. You’re not paying for idle time, and there’s no timeout.

Send an email notification to the user with a link to the review dashboard. Trigger.dev has hooks for this - you can use the onSuccess hook of the analysis task, or send the email before the waitpoint. Use a transactional email service like Resend, SendGrid, or Mailgun.

Testing:

Complete an analysis on a contract. Verify the aggregated report is stored in the database, ordered by risk level.
Verify the email notification is sent containing the correct summary statistics and a working link to the dashboard.
Check the Trigger.dev dashboard. The run should show as WAITING - it’s suspended at the waitpoint, waiting for the review token to be completed.
Verify that the task does not proceed until someone completes the token.

Step 6

In this step your goal is to build the review dashboard where a reviewer can approve, reject, or annotate each flagged clause.

Build a web dashboard that displays the aggregated review report. For each clause, the reviewer should be able to:

Approve the clause as is (no changes needed)
Reject the clause (it needs revision)
Annotate it with a free-text note explaining their reasoning or providing instructions

The dashboard should show the original clause text alongside the AI’s analysis so the reviewer has full context to make a decision. Make it easy to navigate between clauses and see at a glance which ones have been reviewed and which still need attention.

When the reviewer is done and submits their review, your application should save all decisions and annotations to the database, then complete the waitpoint token to resume the suspended task. Use wait.completeToken() from your web app backend (or send a POST to the token’s URL from the frontend using the public access token). The task will resume exactly where it left off, with all the reviewer’s decisions available from the token’s output.

You can also use Trigger.dev’s Realtime hooks in your dashboard. useRealtimeRun() lets you subscribe to run status changes without polling - so your dashboard can show the live status of the contract review workflow.

Testing:

Navigate to the review dashboard for a contract. Verify all clauses are displayed with their AI analysis.
Approve a few clauses, reject a few, and add annotations to some. Verify the decisions are saved to the database.
Submit the review. Verify the token is completed and the Trigger.dev task resumes.
Watch the Trigger.dev dashboard during review submission. The run should transition from WAITING to running again.
Before submitting, check that you can see visually which clauses have been reviewed and which haven’t.

Step 7

In this step your goal is to make the system LLM provider-agnostic with a configurable abstraction layer.

Your clause analysis and summary generation tasks currently call one or two specific LLM providers. Build an abstraction layer so you can swap providers without changing your task code.

Define a common interface for LLM interactions: a function that takes a prompt (or messages), configuration (temperature, max tokens, etc.), and returns a standardised response with the generated text and metadata (tokens used, finish reason, etc.).

Configuration should be externalised. Provider selection, model choice, and API keys should come from environment variables. Your Trigger.dev tasks should interact with the abstraction layer, not with any specific provider’s SDK directly.

Testing:

Run a full contract analysis using one LLM provider. Verify it works end to end.
Switch the configuration to use another provider instead. Rerun the same contract. The analysis should complete with comparable results.
Swap providers without changing a single line of task code (only environment variables).
Verify your abstraction layer captures provider-agnostic metadata regardless of which provider is underneath.

Step 8

In this step your goal is to generate the final summary report using an LLM, incorporating the clause analyses and reviewer feedback.

Now that the human review is complete, the waitpoint token has been completed, and your task has resumed, it’s time to generate the final summary.

Create a generateSummary task that loads all clause analyses, reviewer decisions, and annotations from the database, sends everything to an LLM, and asks it to synthesise a final report. The final summary should include:

Executive summary - a high-level overview of the contract’s risk profile
Key findings - the most important issues identified, incorporating reviewer feedback
Risk breakdown - a summary of risk levels across the contract
Clause-by-clause detail - for each clause, the original risk assessment and the reviewer’s decision, combined into a final recommendation

This should be a well-written, professional document that could be shared with a client or colleague. Your parent task should trigger the summary generation using triggerAndWait() so it gets the result back.

Testing:

Generate a final summary for a contract that has been fully reviewed. Verify it includes all sections and incorporates both the AI analysis and the reviewer feedback.
Check that clauses the reviewer approved show as “accepted” in the final report, while rejected clauses include the reviewer’s annotations and reasoning.
Verify the summary is stored in the database and linked to the contract.
Read the summary from start to finish. It should read as a coherent, professional document, not a jumble of disconnected analyses.

Step 9

In this step your goal is to stream the final summary to the frontend in real time as it is being generated.

Final summaries can be long, and waiting for a complete document before showing anything is a poor user experience. Trigger.dev has first-class support for streaming data from tasks to your frontend. Use Realtime streams to pipe LLM tokens directly to the browser as they’re generated.

Define a stream using streams.define() - give it a clear ID like "summary-output" and a type for the stream chunks. In your generateSummary task, configure your LLM call to stream tokens, and pipe the stream to your defined stream using .pipe().

In your frontend, use the useRealtimeStream() React hook to subscribe to the stream. As tokens arrive, your component renders them incrementally. No polling, no WebSocket management, no SSE wiring - the hook handles the connection automatically.

The user should also be able to receive the final summary via email as an alternative. Once the stream is complete, send the full summary as an email.

Testing:

Generate a final summary and watch the frontend. Tokens should appear incrementally, not all at once at the end.
Verify that the stream works across page refreshes - existing chunks should be replayed.
Check that the streaming handles slow generation gracefully. Partial content should render without freezing.
Verify the email delivery option works. Trigger an email with the completed summary and check your inbox.
Generate summaries from different providers. Streaming should work regardless of which provider is configured.

Step 10

In this step your goal is to explore Trigger.dev’s built-in observability - run tracing, logging, and monitoring.

You’ve already been using Trigger.dev’s dashboard throughout this challenge to see your tasks run. Now let’s make the most of its observability features. Unlike building your own tracing system from scratch, Trigger.dev gives you this out of the box.

Add tags to your tasks and runs so you can filter them in the dashboard. For example, tag runs with the contract ID, the user ID, the LLM provider used, and the workflow stage.

Use runs.metadata to attach structured data to your runs that updates as the workflow progresses. For example, set metadata for the number of clauses found, the count of high/medium/low risk clauses, the review status, and any error counts. This metadata appears in the dashboard and is available via the SDK.

Use Trigger.dev’s built-in logger throughout your tasks. It automatically captures log entries with timestamps and attaches them to the run - no log aggregation infrastructure needed.

Finally, explore the dashboard’s run view. You can see the full timeline of your contract review workflow: when each analyseClause child task started and completed, which clauses triggered which retries, how long each LLM call took, and any errors that occurred. The batch trigger view shows all parallel clause analyses at a glance, with individual run statuses and durations.

Testing:

Run a complete contract review. Open the Trigger.dev dashboard and find your run.
Verify you can see every step in the timeline: PDF extraction, clause splitting, batch clause analysis (with all child runs), waitpoint pause, waitpoint completion, and summary generation.
Click into individual analyseClause runs. You should see logs, duration, and whether any retries occurred.
Apply filters in the dashboard using your tags. Filter by status (failed runs only), by user, or by date range.
Add metadata to your runs and verify it appears in the dashboard.

Going Further

Want to take this further? Here are some ideas:

Use wait.for() to schedule follow-ups. Trigger a task that waits 7 days, then sends a reminder to review a contract that hasn’t been actioned.
Use input streams for cancellation. Add a cancel button to the frontend that uses Trigger.dev’s input streams to abort a running summary generation mid-stream.
Add support for more file formats. DOCX is even more common than PDF for contracts. Add support for Word documents and other formats.
Add role-based access control. Different users might need different permissions - uploaders, reviewers, and administrators.
Add comparison mode. Upload two versions of the same contract and have the LLM identify what changed and whether the changes alter the risk profile.
Add custom risk categories. Let users define their own risk categories and rules, then use Trigger.dev’s wait.forToken() to collect approval for each category.
Add a clause library. Build a library of standard, low-risk clause templates that the LLM can suggest as replacements for high-risk clauses.
Use concurrency keys for multi-tenancy. Leverage Trigger.dev’s concurrencyKey to give each organisation or user their own isolated queue.

Share Your Solutions!

Request for Feedback

You can reach me on Bluesky, LinkedIn or through SubStack

Thanks and happy coding!

John

Coding Challenge #121 - Dd

John Crickett — Sat, 23 May 2026 08:00:59 GMT

Hi, this is John with this week’s Coding Challenge.

🙏 Thank you for being a subscriber, I’m honoured to have you as a reader. 🎉

If there is a Coding Challenge you’d like to see, please let me know by replying to this email📧

Coding Challenge #121 - Dd

This challenge is to build your own version of dd, the low-level data copying and conversion utility found on every Unix-like system.

dd has been around since the early days of Unix - it first appeared in Version 5 Unix in the mid-1970s. The name is a nod to IBM’s JCL (Job Control Language) DD statement, which was used to describe data sets on mainframes. Unlike most Unix tools, dd uses a key=value syntax for its arguments rather than the usual flags, another inheritance from its mainframe roots.

At its heart, dd reads data in fixed-size blocks, optionally transforms it, and writes it out. That simplicity makes it surprisingly powerful: it’s used to copy disk images, create files of a specific size, benchmark storage throughput, convert character encodings, and recover data from failing drives. If you’ve ever created a bootable USB stick with dd if=image.iso of=/dev/sdb, you’ve used it.

If You Enjoy Coding Challenges Here Are Four Ways You Can Help Support It

Refer a friend or colleague to the newsletter. 🙏
Sign up for a paid subscription - think of it as buying me a coffee ☕️, with the bonus that you get 20% off any of my courses.
Buy one of my self-paced courses that walk you through a Coding Challenge.
Join one of my live courses where I personally teach you Go by building five of the coding challenges or systems software development by building a Redis clone.

The Challenge - Building Dd

In this challenge you’re going to build your own version of dd. You’ll start with the core block-copy loop and progressively add the operands and conversion options that make dd so versatile.

Step Zero

In this introductory step your goal is to set your environment up ready to begin developing and testing your solution.

Choose your target platform and programming language. dd is a low-level tool that benefits from a language with good support for binary I/O and byte-level manipulation all work well.

Before you start coding, spend a few minutes playing with the system dd to get a feel for how it behaves:

# Copy a file
dd if=/etc/hosts of=/tmp/hosts-copy

# Copy from stdin to stdout
echo "Hello, dd!" | dd

# Create a 1 MB file of zeroes
dd if=/dev/zero of=/tmp/zeros bs=1M count=1

Notice the summary that dd prints to stderr when it finishes - something like:

2+0 records in
2+0 records out
1024 bytes (1.0 kB, 1.0 KiB) copied, 0.000123 s, 8.3 MB/s

That 2+0 notation means “2 full records and 0 partial records”. You’ll be implementing that too.

Also note that dd uses key=value operands rather than the usual -flag style. if=, of=, bs=, and so on are all positional arguments, not flags.

Step 1

In this step your goal is to implement the core block-copy loop.

Your ccdd should read data from stdin and write it to stdout in fixed-size blocks, defaulting to 512 bytes. It should support the if=FILE operand to read from a file instead of stdin, and the of=FILE operand to write to a file instead of stdout.

When it finishes, it should print a summary to stderr in the same format as the real dd:

+ records in
+ records out
 bytes copied,  s,  MB/s

Where n is the number of full blocks read and m is the number of partial blocks (blocks where fewer bytes were available than the block size).

Testing: Copy a file and verify the output is identical:

ccdd if=/etc/hosts of=/tmp/hosts-copy
diff /etc/hosts /tmp/hosts-copy

The diff should produce no output. Check the summary printed to stderr matches what the real dd reports. Also test reading from stdin and writing to stdout:

echo "Hello, dd!" | ccdd | cat

Step 2

In this step your goal is to add block size control with the bs=, ibs=, and obs= operands, along with size suffixes.

The bs=BYTES operand sets both the input and output block size simultaneously. The ibs=BYTES and obs=BYTES operands set them independently - useful when you want to read in small chunks but write in large ones, or vice versa.

You should support the following size suffixes on any byte count:

Suffix Multiplier c 1 w 2 b 512 k or K 1024 M 1,048,576 G 1,073,741,824

So bs=4k means a 4096-byte block size, and bs=2M means 2,097,152 bytes.

Testing: Verify that different block sizes produce the same output:

ccdd if=/etc/hosts of=/tmp/out1 bs=1
ccdd if=/etc/hosts of=/tmp/out2 bs=512
ccdd if=/etc/hosts of=/tmp/out3 bs=4k
diff /tmp/out1 /tmp/out2
diff /tmp/out1 /tmp/out3

All three should be identical. Check the records in/out summary changes appropriately - a 1-byte block size will show many more records than a 4k block size for the same file.

Step 3

In this step your goal is to add the count=N, skip=N, and seek=N operands.

count=N limits the copy to N input blocks. skip=N skips N input blocks before starting to copy (seeking forward in the input). seek=N skips N output blocks before starting to write (seeking forward in the output, leaving the beginning of the output file untouched).

These three operands are what make dd useful for working with disk images and binary file formats where you need to operate on a specific region of a file.

Testing: Extract the middle portion of a file:

# Create a test file with known content
printf 'AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA' > /tmp/test.bin  # 64 bytes of A
printf 'BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB' >> /tmp/test.bin # 64 bytes of B
printf 'CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC' >> /tmp/test.bin # 64 bytes of C

# Extract just the B section (skip 1 block of 64 bytes, copy 1 block)
ccdd if=/tmp/test.bin of=/tmp/out.bin bs=64 skip=1 count=1
xxd /tmp/out.bin

The output should be 64 bytes of B. Test seek=N by writing into the middle of an existing file:

dd if=/dev/zero of=/tmp/sparse.bin bs=64 count=3
ccdd if=/tmp/test.bin of=/tmp/sparse.bin bs=64 skip=1 count=1 seek=1
xxd /tmp/sparse.bin

The middle 64 bytes should now be B, with zeroes before and after.

Step 4

In this step your goal is to implement the conv= operand with the text conversion options: ucase, lcase, and swab.

conv=ucase converts all lowercase ASCII letters to uppercase as the data passes through. conv=lcase does the reverse. conv=swab swaps adjacent bytes - byte 0 with byte 1, byte 2 with byte 3, and so on. If an odd number of bytes is read, the last byte is held over and swapped with the first byte of the next block.

Multiple conversions can be combined with commas: conv=ucase,swab.

Testing:

echo "Hello, World!" | ccdd conv=ucase

Should output HELLO, WORLD!. Test lcase:

echo "Hello, World!" | ccdd conv=lcase

Should output hello, world!. Test swab with a known byte sequence:

printf '\\x01\\x02\\x03\\x04' | ccdd conv=swab | xxd

Should show 02 01 04 03 - each pair of bytes swapped.

Step 5

In this step your goal is to implement the remaining conv= options: notrunc, noerror, sync, and sparse.

By default, dd truncates the output file before writing. conv=notrunc disables this, leaving any existing content beyond what dd writes intact. This is essential when patching a specific region of a binary file.

conv=noerror tells dd to continue after a read error rather than stopping. It’s used when recovering data from a failing drive - you’d rather get most of the data than none of it.

conv=sync pads each input block with null bytes (\\x00) to the full input block size when a short read occurs. Combined with noerror, this is the standard recipe for imaging a failing drive: dd if=/dev/sda of=image.img conv=noerror,sync.

conv=sparse is an optimisation: instead of writing blocks that are entirely null bytes, dd seeks past them in the output file. The filesystem records these as “holes”, creating a sparse file that takes up less actual disk space than its apparent size.

Testing: Test notrunc by writing a short string into the middle of a longer file:

echo "Hello, World!" > /tmp/original.txt
echo "Hi" | ccdd of=/tmp/original.txt conv=notrunc
cat /tmp/original.txt

The output should start with Hi but retain the rest of the original content (, World! and the newline). Without notrunc, the file would be truncated to just Hi\\n.

Test sparse by creating a file with large null regions and checking its actual disk usage:

ccdd if=/dev/zero of=/tmp/sparse.img bs=1M count=100 conv=sparse
ls -lh /tmp/sparse.img    # apparent size: 100 MB
du -sh /tmp/sparse.img    # actual disk usage: near 0

Step 6

In this step your goal is to implement the status= operand and SIGUSR1 signal handling.

status=none suppresses all output, including the final summary. status=noxfer suppresses the transfer statistics (bytes, time, rate) but still prints the records in/out counts. status=progress prints periodic transfer statistics to stderr while the copy is running, so you can see progress on long operations.

You should also handle the SIGUSR1 signal: when your process receives it, print the current transfer statistics to stderr without interrupting the copy. This is how you check on a long-running dd without stopping it.

Testing: Verify status=none produces no output at all:

echo "test" | ccdd status=none 2>/tmp/stderr.txt
cat /tmp/stderr.txt  # should be empty

Test status=progress with a slow copy:

ccdd if=/dev/zero of=/tmp/progress-test bs=1M count=500 status=progress

You should see the statistics updating as the copy runs. Test SIGUSR1:

ccdd if=/dev/zero of=/tmp/signal-test bs=1M count=1000 &
PID=$!
sleep 1
kill -USR1 $PID
wait $PID

Sending SIGUSR1 should print the current statistics without stopping the copy.

Step 7

In this step your goal is to add iflag= and oflag= for I/O flags.

iflag=direct opens the input file with O_DIRECT (or the platform equivalent), bypassing the OS page cache. This is useful for benchmarking raw storage throughput without cache effects. oflag=direct does the same for the output.

oflag=dsync opens the output with O_DSYNC, which causes each write to block until the data is physically written to storage. This is slower but guarantees durability.

iflag=fullblock changes how dd handles short reads on the input. Normally, a short read (fewer bytes than the block size) counts as a partial record. With fullblock, dd keeps reading until it has accumulated a full block or reaches end-of-file. This is important when reading from pipes or network sockets, where a single read() call may return less than the requested amount even when more data is coming.

Multiple flags can be combined with commas: iflag=direct,fullblock.

Testing: Test iflag=fullblock with a pipe that delivers data in small chunks:

# Without fullblock, each small write becomes a partial record
yes | head -c 4096 | ccdd bs=512 > /dev/null

# With fullblock, partial reads are accumulated into full blocks
yes | head -c 4096 | ccdd bs=512 iflag=fullblock > /dev/null

Compare the records in/out summary - fullblock should show fewer partial records. For oflag=dsync, copy a file and verify it completes successfully (the main observable effect is that it’s slower, as each write is synchronised to disk).

Going Further

Add count_bytes=N and skip_bytes=N (GNU extensions) to operate in bytes rather than blocks - useful when the data doesn’t divide evenly into your block size
Implement conv=ascii and conv=ebcdic to convert between ASCII and EBCDIC character encodings, the original purpose of dd‘s conversion mode
Add iflag=count_bytes so that count=N counts bytes rather than blocks
Build a progress bar using ANSI escape codes for status=progress, showing a visual indicator alongside the transfer rate
Benchmark your implementation against the system dd on a large file and see how close you can get - try different block sizes and see how throughput changes
Use your ccdd to create and restore a disk image of a USB drive (carefully!) and verify the image is byte-for-byte identical using md5sum

P.S. If You Enjoy Coding Challenges Here Are Four Ways You Can Help Support It

Refer a friend or colleague to the newsletter. 🙏
Sign up for a paid subscription - think of it as buying me a coffee ☕️ twice a month, with the bonus that you also get 20% off any of my courses.
Buy one of my courses that walk you through a Coding Challenge.
Subscribe to the Coding Challenges YouTube channel!

Share Your Solutions!

Request for Feedback

You can reach me on Bluesky, LinkedIn or through SubStack

Thanks and happy coding!

John

Coding Challenge #120 - md5sum

John Crickett — Sat, 16 May 2026 08:16:11 GMT

Hi, this is John with this week’s Coding Challenge.

🙏 Thank you for being a subscriber, I’m honoured to have you as a reader. 🎉

If there is a Coding Challenge you’d like to see, please let me know by replying to this email📧

Coding Challenge #120 - md5sum

This challenge is to build your own version of md5sum, the command-line utility that computes and verifies MD5 message digests.

MD5 (Message-Digest Algorithm 5) was designed by Ronald Rivest in 1991 and published as RFC 1321. For decades it was the go-to hash function for verifying file integrity - you’d download a file, run md5sum on it, and compare the output to the hash published alongside it to make sure nothing got corrupted or tampered with.

These days MD5 is considered cryptographically broken (collisions can be generated in seconds on a modern laptop), but it’s still widely used for checksums, cache keys, and non-security purposes. More importantly for Coding Challenges, MD5 is a simple hash function to implement from scratch. Building it yourself is a wonderful way to demystify how hash functions work.

If You Enjoy Coding Challenges Here Are Four Ways You Can Help Support It

Refer a friend or colleague to the newsletter. 🙏
Sign up for a paid subscription - think of it as buying me a coffee ☕️, with the bonus that you get 20% off any of my courses.
Buy one of my self-paced courses that walk you through a Coding Challenge.
Join one of my live courses where I personally teach you Go by building five of the coding challenges or systems software development by building a Redis clone.

The Challenge - Building Your Own MD5sum

In this challenge you’re going to build your own version of md5sum. There are two tracks through this challenge. Pick the one that suits you, or do both.

Track 1 (Steps 1 - 3) gets you to a fully working md5sum clone using your language’s standard library or a third-party library for the hash computation. You’ll focus on the command-line interface, file handling, and the check mode.

Track 2 (Steps 4 - 6) takes you deeper. You’ll implement the MD5 algorithm itself from scratch by following RFC 1321, replacing the library you used in Track 1 with your own code. This is where you’ll learn how hash functions actually work - message padding, Merkle-Damgård construction, and the compression function that sits at the heart of MD5 (and SHA-1, and SHA-2).

Both tracks produce the same tool. The only difference is whether the hashing happens inside a library or inside your own code.

Step Zero

In this introductory step you’re going to set your environment up ready to begin developing and testing your solution.

Choose your target platform and programming language. Any language will work for Track 1. For Track 2, pick a language that gives you easy access to 32-bit unsigned integer arithmetic and bitwise operations.

Before you start coding, have a play with the system md5sum so you get a feel for how it behaves:

echo -n "" | md5sum
echo -n "Hello, World!" | md5sum

The output format is the 32-character hex digest, two spaces, and the filename (or - for stdin). Note the -n flag on echo, without it, echo appends a newline, which changes the hash. This is a common gotcha when testing.

If you’re planning to do Track 2, open RFC 1321. It’s short, readable, and includes reference C code and a full test suite in the appendix. You won’t need it until Step 4, but it’s worth a skim now.

Step 1

In this step your goal is to hash the contents of a file and print the result in the standard md5sum output format.

Your tool should accept one or more filenames as command-line arguments, compute the MD5 hash of each file’s contents, and print one line per file in the format:

<32-char hex digest>

Note the two spaces between the digest and the filename, this is the md5sum convention for text mode. You can use your language’s built-in MD5 library or any third-party package for the hash computation. The focus in this step is on reading files, formatting the output correctly, and handling errors (for example, printing a message to stderr and continuing if a file doesn’t exist).

Testing: Create a test file and compare your output to the system md5sum:

echo -n "Coding Challenges" > test.txt
md5sum test.txt
ccmd5 test.txt

Both should produce the same hash. Test with multiple files:

echo -n "File one" > a.txt
echo -n "File two" > b.txt
ccmd5 a.txt b.txt
md5sum a.txt b.txt

The output should match line for line. Also test what happens when you pass a file that doesn’t exist - your tool should print an error message for that file and still process the remaining files.

Step 2

In this step your goal is to support reading from standard input and to handle binary files correctly.

When no filenames are given, your tool should read from stdin, compute the hash, and print the result with - as the filename. When - is given explicitly as a filename argument, it should also read from stdin.

Your tool should also support the -b flag for binary mode. In binary mode the output uses * (space-asterisk) between the digest and the filename instead of two spaces. On modern systems the hash is computed the same way regardless of mode, the flag only affects the output format and is a holdover from systems where text and binary file reads differed. But md5sum supports it, so yours should too.

Testing:

echo -n "Hello" | ccmd5
echo -n "Hello" | md5sum

Both should output the same hash followed by -. Test binary mode:

ccmd5 -b test.txt
md5sum -b test.txt

The output should show * before the filename instead of two spaces. Test with a binary file too -- an image, a compiled executable, or /bin/ls -- and verify your hash matches the system md5sum.

Step 3

In this step your goal is to implement check mode with the -c flag.

When called with -c, your tool should read a file containing previously generated checksums (one per line, in the same format your tool produces) and verify each one. For each line, it should read the named file, compute its hash, and compare it to the stored hash. It should print the filename followed by OK or FAILED for each entry.

At the end, if any checksums failed, your tool should print a summary line to stderr saying how many didn’t match, and exit with a non-zero status code. If all checksums match, it should exit with status 0.

Implement the --quiet flag, which suppresses the OK lines and only shows failures. And implement the --status flag, which suppresses all output and only sets the exit code -- useful in scripts.

Testing: Generate a checksum file, then verify it:

ccmd5 a.txt b.txt > checksums.md5
ccmd5 -c checksums.md5

You should see:

a.txt: OK
b.txt: OK

Now tamper with one of the files and re-check:

echo -n "Changed" > a.txt
ccmd5 -c checksums.md5

You should see:

a.txt: FAILED
b.txt: OK
ccmd5: WARNING: 1 computed checksum did NOT match

Test --quiet (only the FAILED line should appear) and --status (no output, but echo $? should show a non-zero exit code).

If you’re happy with a working md5sum clone and aren’t interested in implementing the hash algorithm itself, skip ahead to Going Further. Otherwise, read on.

Step 4

In this step your goal is to implement MD5 message padding and preprocessing, the first stage of the algorithm.

From here on you’re replacing the library hash with your own implementation. By the end of Step 6, your tool should produce identical output using code you wrote yourself.

MD5 operates on the input message in 512-bit (64-byte) blocks. Before processing, the message must be padded so its length is a multiple of 512 bits. The padding works like this:

Append a single 1 bit to the message (in practice, append the byte 0x80).
Append zero bytes until the message length is 56 bytes short of a multiple of 64 (i.e., length mod 64 equals 56).
Append the original message length in bits as a 64-bit little-endian integer.

This padding scheme means the final block always has room for the length field, and the 0x80 byte ensures the padding is unambiguous -- you can always tell where the original message ended.

Implement the padding, then split the padded message into 512-bit blocks. For now, just verify your padding is correct by checking it against the test vectors in RFC 1321 (Appendix A.5). You should also initialise the four 32-bit state variables (A, B, C, D) to the values specified in the RFC:

A = 0x67452301
B = 0xefcdab89
C = 0x98badcfe
D = 0x10325476

These are the starting values of the hash state, sometimes called the initialisation vector. They’re arbitrary constants chosen by Rivest.

Testing: The empty string "" has a length of 0 bits. After padding, you should have exactly one 64-byte block: 0x80, followed by 55 zero bytes, followed by the 64-bit length (0) in little-endian. Print your padded block as hex and verify it looks right.

The string "a" has a length of 8 bits. After padding: 0x61 0x80, 53 zero bytes, then 0x08 0x00 0x00 0x00 0x00 0x00 0x00 0x00.

Step 5

In this step your goal is to implement the MD5 compression function, the core of the algorithm.

The compression function processes one 512-bit block at a time and updates the four state variables. For each block:

Break the 64-byte block into sixteen 32-bit words (little-endian).
Initialise working variables a, b, c, d to the current state A, B, C, D.
Run 64 rounds, divided into four groups of 16. Each round applies a different auxiliary function to three of the four working variables, adds in one of the message words and a round constant, then rotates the result.

The four auxiliary functions are:

F (rounds 0-15): F(B, C, D) = (B AND C) OR (NOT B AND D) - a bitwise conditional: “if B then C else D”
G (rounds 16-31): G(B, C, D) = (B AND D) OR (C AND NOT D) - same idea, different arrangement
H (rounds 32-47): H(B, C, D) = B XOR C XOR D - a parity function
I (rounds 48-63): I(B, C, D) = C XOR (B OR NOT D) - another nonlinear mixing function

Each round computes:

a = b + left_rotate((a + func(b,c,d) + message_word + round_constant), shift_amount)

Then the variables are rotated: the old d becomes the new c, the old c becomes the new b, and so on. The shift amounts and round constants are specified in the RFC -- there are 64 of each, and they’re fixed values you can hard-code as a table.

After all 64 rounds, add the working variables back into the state: A += a, B += b, C += c, D += d. This addition step is what makes the construction iterative, each block’s output becomes the next block’s input.

The final hash is the state variables A, B, C, D concatenated in little-endian byte order to produce the 128-bit (16-byte) digest, which is then printed as 32 hex characters.

Testing: RFC 1321 provides test vectors in Appendix A.5. Your implementation should produce these exact hashes:

MD5("") = d41d8cd98f00b204e9800998ecf8427e
MD5("a") = 0cc175b9c0f1b6a831c399e269772661
MD5("abc") = 900150983cd24fb0d6963f7d28e17f72
MD5("message digest") = f96b697d7cb7938d525a2f31aaf161d0
MD5("abcdefghijklmnopqrstuvwxyz") = c3fcd3d76192e4007dfb496cca67e13b

Work through these one at a time. If a hash doesn’t match, check your byte ordering, little-endian is the most common source of bugs. Once all five test vectors pass, swap your library hash for your own implementation and verify that your ccmd5 tool still produces the same output as the system md5sum for every file you test.

Step 6

In this step your goal is to add support for SHA-256, so your tool can operate as both md5sum and sha256sum.

SHA-256 follows the same Merkle-Damgård structure as MD5, pad the message, split into blocks, process each block through a compression function - but with different parameters. The blocks are still 512 bits, but the state is eight 32-bit words instead of four, the compression function runs 64 rounds with different operations (Ch, Maj, and two sigma functions instead of F/G/H/I), and the output is 256 bits instead of 128. The message length in the padding is big-endian rather than little-endian.

Add a --algorithm flag (or -a) that accepts md5 or sha256, defaulting to md5. When SHA-256 is selected, the output format should match the system sha256sum, same layout, just a longer digest.

If you’ve structured your code well, the padding, block processing loop, and I/O code should be shared between both algorithms, with only the compression function and initialisation differing. This is a good test of how cleanly you’ve separated concerns.

Testing: Verify against the system sha256sum and well-known test vectors:

echo -n "" | ccmd5 -a sha256
echo -n "" | sha256sum

Both should produce e3b0c44298fc1c149afbf4c898fbf90a... - (the SHA-256 of the empty string).

echo -n "Hello, World!" | ccmd5 -a sha256
echo -n "Hello, World!" | sha256sum

Test check mode with SHA-256 too -- generate a checksum file with -a sha256 and verify it with -c.

Going Further

Here are some ideas to take your implementation further:

Add support for SHA-1, SHA-384, and SHA-512 to build a complete family of hash tools
Add the -tag output format (BSD-style: MD5 (filename) = digest) which is what macOS md5 uses by default
Implement HMAC-MD5 and HMAC-SHA256 using your hash functions - HMAC is how hash functions are used for message authentication in protocols like TLS
Benchmark your implementation against the system md5sum on a large file and see how close you can get - then try optimising with SIMD intrinsics or by processing multiple blocks in parallel
Read about the MD5 collision attacks (Wang et al., 2004) and try to understand how they exploit weaknesses in the auxiliary functions - it’s a fascinating bit of cryptographic history
Implement SHA-3 (Keccak), which uses a completely different construction (a sponge function) rather than Merkle-Damgård - comparing the two designs is a great way to understand why the cryptographic community moved on from MD5’s family of designs

P.S. If You Enjoy Coding Challenges Here Are Four Ways You Can Help Support It

Refer a friend or colleague to the newsletter. 🙏
Sign up for a paid subscription - think of it as buying me a coffee ☕️ twice a month, with the bonus that you also get 20% off any of my courses.
Buy one of my courses that walk you through a Coding Challenge.
Subscribe to the Coding Challenges YouTube channel!

Share Your Solutions!

Request for Feedback

You can reach me on Bluesky, LinkedIn or through SubStack

Thanks and happy coding!

John

Coding Challenge #119 - AI Pong Player

John Crickett — Sat, 09 May 2026 08:01:53 GMT

Hi, this is John with this week’s Coding Challenge.

🙏 Thank you for being a subscriber, I’m honoured to have you as a reader. 🎉

If there is a Coding Challenge you’d like to see, please let me know by replying to this email📧

Coding Challenge #119 - AI Pong Player

This challenge is to build your own reinforcement learning agent that learns to play Atari Pong directly from the pixels on the screen.

Pong is one of the oldest video games ever made, and it has a special place in the history of artificial intelligence. In 2013, DeepMind used Pong (and a handful of other Atari games) to show that a single algorithm could learn to play games at a human level, just by watching the screen and being told the score. That work kicked off the modern era of deep reinforcement learning. Pong is the friendliest of the Atari games to start with, the rules are simple, the screen is mostly empty, and the agent only needs to choose between moving the paddle up or down. That makes it the perfect first project for going from “I’ve read about reinforcement learning” to “I’ve actually trained an agent from raw pixels and watched it learn to win.” Building this project will introduce you to ideas you’ll come across again and again throughout your career: turning observations into features, sampling from a stochastic policy, computing returns, reducing variance, and the policy gradient itself.

If You Enjoy Coding Challenges Here Are Four Ways You Can Help Support It

Refer a friend or colleague to the newsletter. 🙏
Sign up for a paid subscription - think of it as buying me a coffee ☕️, with the bonus that you get 20% off any of my courses.
Buy one of my self-paced courses that walk you through a Coding Challenge.
Join one of my live courses where I personally teach you Go by building five of the coding challenges or systems software development by building a Redis clone.

The Challenge - Building Your Own AI Pong Player

In this challenge you’re going to build a policy gradient agent that learns to play Pong from raw pixels using the REINFORCE algorithm. Your agent will start out playing randomly, lose 21-0 over and over, and then, if you’ve wired everything up correctly, gradually start scoring points, then winning rallies, and eventually beating the built-in opponent more often than it loses.

This challenge is a good fit if you’ve written some Python before, are comfortable with NumPy, and have at least a passing acquaintance with neural networks. You don’t need to be an reinforcement learning expert. REINFORCE is one of the simplest deep reinforcement learning algorithms there is, and the version we’ll build here is famously the one Andrej Karpathy described in his “Pong from Pixels“ blog post. A small policy network, no value function, no replay buffer, no target network. Just a policy, some episodes, and a gradient.

A word of warning before you start: training from pixels is slow. Even on a sensible setup, you should expect a few hours of CPU training before the agent really starts to win, and you may want to leave it running overnight. That’s part of the experience, watching the score curve crawl upwards over many thousands of episodes is genuinely exciting once you’ve built the thing yourself.

Step Zero

In this introductory step you’re going to set your environment up ready to begin developing and testing your solution.

Python is the natural choice for this challenge because the reinforcement learning ecosystem lives there, but the ideas transfer cleanly to any language with a deep learning framework.

You’ll need three things installed: Gymnasium (the maintained successor to OpenAI Gym), the Atari environments via ALE-py, and a deep learning framework, PyTorch, TensorFlow, or JAX are all fine, pick whichever you’d like to practise with. You’ll also want NumPy, Matplotlib, and probably opencv-python or Pillow for image work. Have a quick read of the Gymnasium docs and the Atari environment list so you know what’s available.

Before you write any code, spend a few minutes playing Pong yourself if you’ve never seen it. Notice that the only thing that matters is your paddle’s vertical position, the ball’s position, and the ball’s direction of travel. Your agent will need to work this out from the screen, with no idea what any of those concepts mean.

Step 1

In this step your goal is to get a Pong environment running and have a “random agent” play a full game so you can see the data flowing.

Create the ALE/Pong-v5 environment from Gymnasium and run a single episode where, at every step, you pick an action uniformly at random and step the environment with it. For each step, print or log the reward. You should see mostly zeros, with the occasional -1 (the built-in opponent has scored against you) and very rarely a +1 (you got lucky). The episode should end after twenty-one points have been scored on one side.

Have a look at the action space (env.action_space) and the observation space (env.observation_space). The action space has six entries, but for Pong you really only ever need two of them: the action that moves the paddle up and the action that moves it down. Constraining your agent’s choices to just those two actions makes learning much faster, because there are fewer wrong things it can do. Pick the two action indices you’ll use throughout the rest of the challenge and write them down somewhere obvious in your code.

The observation is a 210 x 160 x 3 RGB image - the raw screen. Have a look at one with Matplotlib so you know what your agent is seeing. There’s a lot of pixels there that have nothing to do with playing Pong: the score at the top, the borders down the sides, the colours. We’ll fix all of that in the next step.

Testing: Run your random agent for one episode and confirm that:

The episode terminates of its own accord (you don’t have to cap the step count)
The total reward is somewhere between roughly 21 and 15 (random play loses badly)
The observation shape is (210, 160, 3) with uint8 values

Step 2

In this step your goal is to turn the raw 210 x 160 x 3 screen into a much smaller representation that contains just the information your agent needs.

There are four things to do here, and they should all happen inside a single function that takes a raw frame and returns the preprocessed observation:

Crop away the score area at the top of the screen and the borders on each side, leaving just the playing area.
Convert the result to greyscale - colour adds nothing useful in Pong.
Resize down to 80 x 80 pixels. The image was already mostly empty space; at this resolution you can still clearly see the paddles and the ball.
Flatten the 80 x 80 grid into a single 1D vector of length 6400. This is the input format your policy network will expect.

A static frame doesn’t tell your agent anything about which way the ball is moving, and direction is the most important thing in Pong. The classic trick - and the one used in the original Karpathy write-up - is to feed in the difference between the current preprocessed frame and the previous one. Pixels that didn’t change become zero, and pixels that did change show up as positive or negative values. The ball appears as a little bright streak pointing the way it’s travelling. Add this difference computation on top of your preprocessing function.

Testing: Save a few raw frames and their preprocessed versions to disk and look at them with an image viewer. The preprocessed frame should clearly show the two paddles and the ball as bright pixels on a dark background, with nothing else. Display a frame difference - it should be almost entirely black except for the ball and the moving paddle.

A good sanity check: the output of your preprocessing function should be a 1D NumPy array of length 6400 (or whatever shape you’ve chosen) with float32 values, not raw pixel bytes.

Step 3

In this step your goal is to build the neural network that maps a preprocessed observation to a probability distribution over actions, and use it to pick actions.

The policy network Karparthy describes is a tiny network - a single hidden layer with about 200 ReLU units, then an output layer that produces one number per action. Pass that output through a softmax (or a sigmoid if you’ve reduced things to a single output for “probability of moving up”) and you have a probability distribution. To pick an action, sample from that distribution rather than taking the most likely one.

Wire up an “act” function that takes a preprocessed frame, runs it through the network, and returns a sampled action plus whatever extra information you’ll need later for training (typically the log-probability of the action that was taken, or the network output itself).

Once that’s working, run another full episode - this time with your untrained network choosing the actions instead of random.choice. The agent will still lose badly (its weights are random), but the score should be in roughly the same ballpark as the random agent from Step 1. If you see something dramatically different, something is wrong with your preprocessing or your sampling.

Testing: Run a single episode with the untrained policy. The total reward should be in the same -21 to -15 range as the random agent. The action distribution - if you log it - should be close to 50/50 at the start of training. Print the shape of the network output and the sampled action index for the first few steps to make sure everything lines up.

Step 4

In this step your goal is to collect a complete episode of experience and turn the rewards into the returns that will drive learning.

For each step in an episode, store three things: the observation that was fed in, the action that was taken (or its log-probability), and the reward that came back from the environment. At the end of the episode you’ll have three lists, all the same length.

Now compute the discounted return for each step. The return at step t is the sum of all the rewards from step t onwards, with rewards further in the future weighted by a discount factor gamma (use 0.99). You should compute this as a single backwards pass over the reward list - much faster and cleaner than the obvious double loop. There’s one Pong-specific subtlety: every time someone scores a point, the rally ends and a new one begins inside the same episode. You probably want to reset the running sum when a non-zero reward appears, so credit for a point only flows back to the actions in that rally rather than all the way to the start of the game. This makes a big difference to learning speed.

Once you have the per-step returns, normalise them across the whole episode by subtracting the mean and dividing by the standard deviation. Normalised returns put roughly half the actions on the “this was better than average” side and half on the “this was worse” side, which gives the policy gradient a much more stable signal.

Testing: Run an episode, compute the returns, and have a look:

The length of your returns array matches the number of steps in the episode.
After normalisation, the mean should be close to zero and the standard deviation close to one.
For an action that was followed by a +1 reward soon after, the return should be positive; for one followed by a 1, it should be negative.

A nice sanity print is to show, for the last twenty steps of an episode, the reward at that step and the discounted return - you’ll see the return building up smoothly and then jumping when a point is scored.

Step 5

In this step your goal is to actually update the policy in the direction that makes good actions more likely and bad actions less likely. This is the heart of the whole challenge.

The REINFORCE update is delightfully simple. For each step in your collected rollout, compute the loss as -log(probability of the action taken) * normalised return for that step, then sum (or average) across all the steps. Run that through your framework’s autograd, take a gradient step with an optimiser (Adam or RMSProp at a learning rate around 1e-3 to 1e-4 works well), and that’s it. Actions that led to better-than-average returns get pushed up; actions that led to worse-than-average returns get pushed down. You’re doing gradient ascent on expected return, even though you’re calling loss.backward().

A single episode’s worth of gradient is very noisy. Batch up the gradients over multiple episodes - ten is a sensible starting point - before you actually call the optimiser. You can either accumulate gradients across episodes or concatenate the per-step data and do one bigger update; both work.

Now wrap the whole thing in a training loop that runs for thousands of episodes, prints a running average of the score after each one, and just leaves it going. Be patient. For the first few hundred episodes the score will hover around -21 -- the policy is still essentially random and learning very slowly. After that, you should see the running average start to creep upwards. By the time you’ve trained for several thousand episodes (this can be many hours of wall-clock time on CPU), the running average should cross zero, meaning your agent is winning more rallies than it loses.

Testing: This is the step where things either work or they very visibly don’t. A few things to check as training progresses:

The running average reward should be trending upwards over time, not just bouncing around
After ~500 episodes, the agent should reliably score some points (running average above 21)
After a few thousand episodes, the running average should be approaching zero or going positive
If the loss explodes or the score gets stuck at 21 forever, the most common culprits are: forgetting to reset the discounted return between rallies, an unnormalised return signal, the wrong sign on the loss, or feeding raw frames instead of frame differences

If you’d like a stronger signal that things are alive, log the average length of an episode (in steps). Random play produces short episodes; an agent that’s learning to actually rally produces longer ones, well before the score itself starts to go up.

Step 6

In this step your goal is to make your training run something you can show off, not just a console of numbers scrolling by.

There are four things to add:

Save the model weights -- both periodically (every N episodes) and whenever a new best running average reward is achieved. You don’t want to leave a long training run going only to lose the weights.
An evaluation mode that loads a saved set of weights, plays a fixed number of episodes with the policy fixed (no learning, ideally with greedy action selection rather than sampling), and reports the average score. This is what you’d use to honestly compare two different training runs.
Video recording of the agent playing. Gymnasium has a RecordVideo wrapper that writes MP4s. Record a video of an early-training agent (it’ll be hilariously bad), a mid-training agent (starting to get the idea), and a late-training agent (winning, hopefully). Stitching these together is the single most satisfying artefact of the whole project.
A training reward plot -- a simple Matplotlib chart showing the per-episode reward and a rolling average over the run. The shape of this curve, going from a flat line at 21 through random play and up into positive territory, is the picture of an agent learning.

Testing: Once you have all four bits in place:

Kill your training script and restart it from a saved checkpoint. The running average should pick up roughly where it left off, not crash back to 21.
Run evaluation on a checkpoint with sampling vs. greedy action selection; greedy should be at least as good as sampled.
Open one of your recorded videos and watch the agent play. It is unbelievably satisfying to see the paddle that you trained track the ball and put it past the opponent.

Going Further

Here are some ideas to take your Pong agent further:

Add a baseline to reduce variance. REINFORCE has notoriously noisy gradients. Subtract a baseline - the simplest one is the running average reward, the next-simplest is a learned value function - from the returns before you scale the policy gradient. This is the first step from REINFORCE towards Actor-Critic.
Replace the MLP with a small CNN. Convolutional layers are a much more natural fit for image input than a flattened MLP. You’ll lose the trick of feeding the frame difference and instead stack the last few frames as channels. Compare training time and final score against the MLP version.
Try a different algorithm. Once you have all the scaffolding - environment, preprocessing, training loop, logging - you can swap the algorithm out without rewriting the rest. Implement A2C, PPO, or DQN against the same Pong setup and see how they compare on sample efficiency and final score.
Run multiple environments in parallel. A single CPU core stepping through one game at a time is the bottleneck on most training runs. Use Gymnasium’s AsyncVectorEnv to step several Pong games at once and gather rollouts much faster.
Train an opponent. The built-in Pong AI is fixed and not very good. Once your agent beats it consistently, you’ve topped out the score. A natural next step is self-play: have two copies of your agent play each other and improve together.

P.S. If You Enjoy Coding Challenges Here Are Four Ways You Can Help Support It

Refer a friend or colleague to the newsletter. 🙏
Sign up for a paid subscription - think of it as buying me a coffee ☕️ twice a month, with the bonus that you also get 20% off any of my courses.
Buy one of my courses that walk you through a Coding Challenge.
Subscribe to the Coding Challenges YouTube channel!

Share Your Solutions!

Request for Feedback

You can reach me on Bluesky, LinkedIn or through SubStack

Thanks and happy coding!

John

Coding Challenge #118 - Comm

John Crickett — Sat, 02 May 2026 08:01:16 GMT

Hi, this is John with this week’s Coding Challenge.

🙏 Thank you for being a subscriber, I’m honoured to have you as a reader. 🎉

If there is a Coding Challenge you’d like to see, please let me know by replying to this email📧

Coding Challenge #118 - Comm

This challenge is to build your own version of comm, the classic Unix utility that compares two sorted files line by line.

comm is one of those small tools in the Unix toolbox that solves a single problem really well. Given two sorted files, it tells you three things at once: which lines are unique to the first file, which lines are unique to the second file, and which lines appear in both. It does this in a single streaming pass, taking advantage of the fact that the inputs are already sorted, so it never has to load whole files into memory and never has to do an O(n²) comparison.

You’ll find comm used to diff lists of users, find files that exist in one directory tree but not another, work out the intersection of two datasets, and as a building block in countless shell pipelines. Building your own version is a lovely exercise in stream processing, careful state management, and the merge step that sits at the heart of merge sort, ideas you’ll reach for again and again throughout your career.

If You Enjoy Coding Challenges Here Are Four Ways You Can Help Support It

Refer a friend or colleague to the newsletter. 🙏
Sign up for a paid subscription - think of it as buying me a coffee ☕️, with the bonus that you get 20% off any of my courses.
Buy one of my self-paced courses that walk you through a Coding Challenge.
Join one of my live courses where I personally teach you Go by building five of the coding challenges or systems software development by building a Redis clone.

The Challenge - Building Comm

In this challenge you’re going to build your own version of comm, a streaming, sorted file comparison tool. Your tool will read two sorted files, compare them line by line, and print three columns of output: lines unique to the first file, lines unique to the second file, and lines common to both. It will be compatible with the standard POSIX comm utility, which means you’ll be able to test your work directly against the system comm and use it as a drop-in replacement in shell pipelines.

The clever bit about comm is that it doesn’t sort the files for you, it relies on the fact that they are already sorted to do its work in a single pass with constant memory. Two read pointers, one per file, and a small handful of comparison rules are all you need.

Step Zero

In this introductory step you’re going to set your environment up ready to begin developing and testing your solution.

Choose your target platform and programming language. I’d encourage you to pick a language you’re comfortable with for reading files line by line and parsing command-line arguments. Pretty much any general-purpose language is a good fit for this challenge. The focus is on the algorithm, not on the language.

Before you start coding, have a read through the POSIX comm specification and the man page on your own machine (man comm). Spend some time playing with the system comm so you get a feel for how it behaves, especially around the column indentation and the suppression flags.

Create a couple of small sorted test files to use throughout the challenge:

printf "apple\nbanana\ncherry\ndate\nelderberry\n" > file1.txt
printf "banana\ncherry\nfig\ngrape\n" > file2.txt

file1.txt and file2.txt are both sorted. The two files share banana and cherry, while apple, date, and elderberry are unique to the first and fig and grape are unique to the second. We’ll use these throughout the challenge.

A quick note on locale. Both sort and the system comm only do byte-wise comparison when the locale is C or POSIX. With a default locale (especially on macOS), ordering becomes locale-aware -- Apple can sort between apple and banana rather than before them, which will confuse your test files and any comparisons you run against the system comm. Either set LC_ALL=C in your shell while working on this challenge, or prefix the relevant commands with it (LC_ALL=C sort ..., LC_ALL=C comm ...).

Step 1

In this step your goal is to read two sorted files and produce the three-column output that is the heart of comm.

Your tool should accept exactly two filenames as command-line arguments, open both files, and walk through them in sorted order using a single pass. For each line you should decide whether it belongs to column 1 (lines only in the first file), column 2 (lines only in the second file), or column 3 (lines in both). Column 1 has no leading tab, column 2 has one leading tab, and column 3 has two leading tabs. Lines should be compared byte by byte, the same way LC_ALL=C sort orders them.

You’ll also need to handle files of different lengths -- when one file runs out, the remaining lines from the other file should be emitted in the appropriate column. Empty files are a useful edge case to think about: if one file is empty, every line of the other file goes into its own column. Don’t worry about any flags or options yet; just get the basic three-column comparison working.

Testing: Run your tool against the test files and check your output:

ccomm file1.txt file2.txt
apple
		banana
		cherry
date
elderberry
	fig
	grape

Try it with files of different lengths, with one empty file, and with two completely disjoint files. Lines unique to the first file should appear with no leading tab, lines unique to the second file with one leading tab, and shared lines with two leading tabs.

Step 2

In this step your goal is to support the single-column suppression flags -1, -2, and -3.

comm lets you hide any of the three columns from the output. -1 hides lines unique to the first file, -2 hides lines unique to the second file, and -3 hides lines common to both. When a column is suppressed, the indentation for the remaining columns shifts left so the leading tabs go away. For example, with -1 the second column should no longer have its leading tab, because there is no first column for it to sit next to.

Have a play with the system comm to see exactly how it lays things out. This is one of those details that is easier to copy than to describe.

Testing:

ccomm -1 file1.txt file2.txt
	banana
	cherry
fig
grape

ccomm -2 file1.txt file2.txt
apple
	banana
	cherry
date
elderberry

ccomm -3 file1.txt file2.txt
apple
date
elderberry
	fig
	grape

Step 3

In this step your goal is to support combining the suppression flags.

The flags from Step 2 can be combined to suppress more than one column at a time. -12 shows only the common lines, -23 shows only lines unique to the first file, -13 shows only lines unique to the second, and -123 produces no output at all. The combinations can be supplied as a single argument (-12) or as separate arguments (-1 -2); both should behave identically.

If you wrote Step 2 in a flexible way -- tracking which columns are active rather than special-casing each flag -- this step should be a very small change.

Testing:

ccomm -12 file1.txt file2.txt
banana
cherry

ccomm -23 file1.txt file2.txt
apple
date
elderberry

ccomm -13 file1.txt file2.txt
fig
grape

ccomm -123 file1.txt file2.txt

ccomm -1 -2 file1.txt file2.txt
banana
cherry

The fifth invocation should produce the same output as -12.

Step 4

In this step your goal is to support reading from standard input.

When - is given as one of the two filename arguments, your tool should read that input from stdin instead of opening a file. This is what lets comm slot into shell pipelines. Either of the two arguments can be -, but only one of them at a time -- if both are - your tool should report an error.

Testing:

printf "apple\nbanana\ncherry\n" | ccomm - file2.txt
apple
		banana
		cherry
	fig
	grape

cat file1.txt | ccomm file2.txt -
	apple
		banana
		cherry
	date
	elderberry
fig
grape

The first command pipes a sorted list into your tool as the first input. The second pipes file1.txt in as the second input. Each should produce the same three-column comparison as the equivalent file-based invocation.

Step 5

In this step your goal is to support case-insensitive comparison with the -i flag.

When -i is set, lines that differ only in case (for example Apple and apple) should be treated as equal. The original case of the lines should still be preserved in the output -- only the comparison itself is case-insensitive.

This is the only step where the comparison rule changes, so it’s worth thinking about how you’ve structured your comparison code. If you’ve kept the comparison behind a single function, this should be a small change.

Testing: Build a couple of files that differ only in case:

printf "Apple\nBanana\nCherry\n" > upper.txt
printf "apple\nbanana\ndate\n" > lower.txt

Without -i, every line is unique to one file or the other (because Apple and apple compare differently):

ccomm upper.txt lower.txt
Apple
Banana
Cherry
	apple
	banana
	date

With -i, the matching pairs should appear in column 3:

ccomm -i upper.txt lower.txt
		Apple
		Banana
Cherry
	date

The third invocation combines -i with column suppression:

ccomm -i -1 upper.txt lower.txt
	Apple
	Banana
date

Going Further

Here are some ideas to take your comm implementation further:

Add the GNU -check-order and -nocheck-order flags so your tool can warn when its inputs aren’t really sorted
Add the GNU -output-delimiter=STRING flag for choosing a custom string between columns instead of tabs
Add the GNU -total flag to print a summary line with the count for each column at the end of the output
Add the GNU z / -zero-terminated flag so records are separated by NUL bytes instead of newlines, which is useful for working with filenames containing newlines or spaces
Add a -header flag that prints column headings before the output to make it easier to read interactively
Support comparing more than two files at once (you’ll need to think about what the output even looks like for three or more inputs)
Add a flag to compare on a specific field rather than the whole line, the way join and sort -k do
Support files compressed with gzip or zstd transparently, so your tool can read .gz files without an explicit zcat
Build a streaming library version of your tool so other programs in your chosen language can use the comparison logic without shelling out
Benchmark your implementation against the system comm on very large files (think hundreds of millions of lines) and see if you can match or beat it
Try running your comm as part of a real pipeline in your own work -- finding files in one directory tree but not another, or comparing two snapshots of a database export, are good real-world tests

P.S. If You Enjoy Coding Challenges Here Are Four Ways You Can Help Support It

Refer a friend or colleague to the newsletter. 🙏
Sign up for a paid subscription - think of it as buying me a coffee ☕️ twice a month, with the bonus that you also get 20% off any of my courses.
Buy one of my courses that walk you through a Coding Challenge.
Subscribe to the Coding Challenges YouTube channel!

Share Your Solutions!

Request for Feedback

You can reach me on Bluesky, LinkedIn or through SubStack

Thanks and happy coding!

John

Coding Challenge #117 - AI Powered Support Bot

John Crickett — Sat, 25 Apr 2026 08:01:45 GMT

Hi, this is John with this week’s Coding Challenge.

🙏 Thank you for being a subscriber, I’m honoured to have you as a reader. 🎉

If there is a Coding Challenge you’d like to see, please let me know by replying to this email📧

Coding Challenge #117 - AI Powered Support Bot

This challenge is to build your own AI-powered customer support bot - and then discover, the hard way, why production AI applications need more than just an API key and a system prompt.

This challenge was created in collaboration with Orq.ai, whose Router provides a single API across 400+ models from 20+ providers - with built-in fallbacks, cost routing, and observability. Free to start, no markup on token costs.

Every developer has used a support bot. Most have opinions about them. In this challenge you’ll build one for a fictional version of Coding Challenges, giving it context about the available projects - Build Your Own Redis, Docker, Git and the rest - so it can answer questions about which challenge to tackle, what skills you’ll learn, how to get started, and general troubleshooting.

It starts simple. But step by step you’ll layer on the production concerns that real AI applications face: resilience when a provider goes down, observability so you know what’s happening, and cost routing so you’re not burning money on simple questions. By the time you’ve built all of that yourself, you’ll have a deep appreciation for what an AI gateway does - and you’ll see just how much code disappears when you use one.

If You Enjoy Coding Challenges Here Are Four Ways You Can Help Support It

Refer a friend or colleague to the newsletter. 🙏
Sign up for a paid subscription - think of it as buying me a coffee ☕️, with the bonus that you get 20% off any of my courses.
Buy one of my self-paced courses that walk you through a Coding Challenge.
Join one of my live courses where I personally teach you Go by building five of the coding challenges or systems software development by building a Redis clone.

The Challenge - Building Your Own AI Powered Support Bot

You’re going to build an AI customer support bot that answers questions about Coding Challenges projects. Along the way you’ll experience the real production pain points of working with LLMs - provider lock-in, reliability, observability, and cost - and then see what happens when you replace your hand-rolled infrastructure with a single gateway endpoint.

Step Zero

In this introductory step you’re going to set your environment up ready to begin developing and testing your solution.

You’ll need to make a few decisions:

Choose your programming language. Pick something you’re comfortable building with. You’ll be making HTTP requests, handling JSON, and building a simple interactive loop. Python, TypeScript, Go, Java, and Rust all work well.
Choose your first LLM provider. You need an API key from one of the major providers - OpenAI, Anthropic, Google (Gemini), or Mistral. You’ll be using their chat completions API. You can build using the REST API, an AI framework or install their SDK for your chosen language.
Grab the context data from here.

Testing: Make a simple API call to your chosen provider with a basic prompt like “Hello, who are you?” and verify you get a coherent response back. If that works, you’re ready to move on.

Step 1

In this step your goal is to build a working support bot using a single LLM provider.

Build an interactive command-line application that takes user questions and responds using your chosen LLM. The bot should have a system prompt that includes the Coding Challenges context you downloaded in Step Zero, instructing it to act as a helpful customer support agent that answers questions based on that context.

Your bot should maintain a conversation history so follow-up questions work naturally. If a user asks “Which challenge should I start with?” and then follows up with “What will I learn from that one?”, the bot should understand what “that one” refers to.

Keep it simple. One provider, one model, one API key hardcoded (or read from an environment variable). No fallbacks, no logging, no clever routing. Just a bot that works.

Testing:

Ask the bot a factual question about Coding Challenges based on the context data, like “What projects are available?” or “What are the frontend focused projects?” The answer should be accurate and drawn from your context.
Ask a follow-up question that references the previous answer. The bot should maintain context across the conversation.
Ask something the context doesn’t cover, like “Can you help me with my tax return?” The bot should politely explain that it can only help with Coding Challenges questions.
Ask for a personalised recommendation, like “I know Python and want to learn about networking - which challenge should I do?” The bot should give a thoughtful answer based on the available projects.

Step 2

In this step your goal is to add resilience by introducing a fallback to a second LLM provider.

Imagine your primary provider goes down. Right now your bot is completely broken. To simulate this, temporarily use an invalid API key for your primary provider so every request fails.

Now fix it. Sign up for a second LLM provider and integrate their SDK alongside the first. When a request to the primary provider fails, your bot should automatically retry with the fallback provider. The user should get an answer either way.

This sounds straightforward, but pay attention to the friction. You now have two SDKs with different interfaces, two API keys to manage, two different authentication mechanisms, and subtly different request and response shapes. Your code needs to handle the differences, normalise the responses, and manage the error handling for both.

Once your fallback is working, remove the invalid API key and restore normal operation. Keep the fallback logic in place - you’ll want it for reliability.

Testing:

With both providers working, verify the bot responds normally using the primary provider.
Invalidate your primary provider’s API key and send a question. The bot should seamlessly fall back to the second provider and still give a good answer. The user shouldn’t see an error - just a response.
Invalidate both API keys and verify the bot handles the failure gracefully with a clear error message rather than crashing.
Restore the primary key and verify the bot goes back to using the primary provider.
Take note of how much code you’ve added to handle two providers. Count the lines if you like. You’ll want that number later.

Step 3

In this step your goal is to add observability by tracking token usage, latency, and cost for every request.

In production, you need to know what’s happening. How many tokens are you using? How much is each request costing? How long are responses taking? Without this information you’re flying blind.

Build a logging layer that captures the following for every LLM request:

Request ID - a unique identifier for each request, so you can correlate a user-visible response with its log entry.
Timestamp of the request.
Provider and model used (especially important now you have two providers).
Token counts - prompt tokens and completion tokens. Most providers return these in the API response.
Latency - how long the request took from start to finish.
Estimated cost - calculate this from the token counts and the provider’s published pricing.
Whether it was a primary or fallback request.
Error details - when a request fails, capture the error type, status code, and message. This is what tells you why the primary failed and the fallback kicked in.
Finish reason - was it stop, length, content-filter, or a tool call? This matters when you’re investigating quality issues.

Store these logs however you like - a local file, an in-memory list, a database. Add a way to view a summary: total requests, total tokens, total cost, average latency, and the breakdown between primary and fallback usage.

You might also consider capturing the request and response payloads (watch out for PII), a conversation or session ID so you can trace a whole support session, retry counts, and time-to-first-token once you’re streaming. These aren’t required, but they’re the kinds of things a production observability stack typically includes.

This is the kind of infrastructure that every production AI application needs, and building it yourself gives you an appreciation for how much work it is to get right. You need to handle it consistently across both providers despite their different response formats.

Testing:

Send several questions to the bot and then check your logs. Every request should have a complete record with all the fields listed above.
Force a fallback by invalidating the primary key, send a question, then check the log. The entry should show the fallback provider was used.
View your summary statistics. They should accurately reflect the requests you’ve made - total cost, average latency, token counts.
Verify that token counts and cost calculations are consistent with what the providers report in their dashboards.

Step 4

In this step your goal is to add cost-aware routing so that simple questions go to cheaper models and complex questions go to more capable (and more expensive) ones.

Not all questions are equal. “What’s the pricing?” is a simple lookup that any small model can handle. “I’m a backend developer who knows Python but wants to learn systems programming - which challenges should I do and in what order?” needs genuine reasoning ability.

Build a routing layer that classifies incoming questions and directs them to the appropriate model. You’ll need at least two tiers:

Simple queries - FAQs, greetings, straightforward factual lookups. Route these to a cheaper, faster model.
Complex queries - personalised recommendations, multi-step reasoning, comparisons across projects, questions that require synthesising information. Route these to a more capable model.

How you classify queries is up to you. You could use keyword matching, a separate lightweight LLM call to classify the question, message length heuristics, or some combination. The point is to reduce cost without noticeably reducing quality.

A note on what’s actually changing: the system prompt and context data stay the same for every request - the cheap model still needs to see the Coding Challenges context to answer “What’s the pricing?” The saving comes from running fewer parameters per token, not from sending less context. Don’t be tempted to trim the context for simple queries; that quickly leads to wrong answers.

This is where your codebase starts to feel the weight. You now have multiple providers, fallback logic, per-request logging across all of them, and routing logic that needs to work with all your models. Take a moment to look at your code. Count the lines dedicated to infrastructure versus the lines dedicated to the actual support bot logic.

Testing:

Ask a simple question like “What’s the pricing?” and check your logs. It should be routed to the cheaper model.
Ask a complex question like “I want to learn distributed systems but I’ve never done any systems programming - what’s the best learning path through your challenges?” and check your logs. It should be routed to the more capable model.
Compare the costs in your logs between simple and complex queries. The simple queries should be noticeably cheaper.
Verify the quality of responses. Simple questions routed to the cheaper model should still be answered well. If the quality is poor, adjust your classification logic.
Send a mix of ten questions - some simple, some complex - and review the routing decisions in your logs. Most should be classified correctly.

Step 5

In this step your goal is to replace all of the infrastructure you built in Steps 2 through 4 with the Orq.ai Router.

Sign up for a free Orq.ai account and get your API key. The Router provides an OpenAI-compatible endpoint, which means you can point any OpenAI SDK at it by changing the base URL and API key. That’s it.

Replace your multi-provider setup, your fallback logic, your logging infrastructure, and your routing layer with a single API call to the Router endpoint. The Router handles:

Fallbacks and retries - if a provider fails, the Router automatically retries with another. You only pay for successful completions.
Observability - per-request traces with token counts, latency, cost, and error logs are available in the Orq.ai dashboard. No custom logging code needed.
Cost routing - the Auto Router directs each prompt to the most cost-effective model that meets quality requirements, with a typical 50% cost reduction while retaining 98% quality.
Access to 400+ models from 20+ providers through a single API key and a single interface.

Now look at your code. The fallback handling from Step 2, the logging layer from Step 3, and the routing logic from Step 4 can all be removed. Your bot should be back to something close to the simplicity of Step 1, but with all the production capabilities you spent three steps building by hand.

Testing:

Send the same mix of questions you used throughout the challenge. The responses should be at least as good as before.
Check the Orq.ai dashboard for your request logs. You should see token counts, costs, latency, and which model was used for each request - all without any custom logging code.
Simulate a provider failure (the Router handles this transparently). Send requests and verify they succeed even when individual providers have issues.
Compare the total lines of code in your solution now versus at the end of Step 4. The infrastructure code should have largely disappeared.
Compare the cost of your requests through the Router versus your manual routing from Step 4. The Router’s Auto Router should achieve similar or better cost optimisation.

Going Further

You’ve built a support bot and experienced the full arc from simple prototype to production-ready AI application. Here are some ways to push further:

Add a web interface: Build a simple chat UI in front of your bot instead of using the command line. The backend stays the same - just add an HTTP API layer.
Streaming responses: If you haven’t already, add streaming so responses appear token by token. This makes a big difference to the user experience, especially for longer answers.
Conversation summarisation: For long support sessions, summarise older messages to keep the context window manageable while preserving important information.
Multi-language support: Add the ability to detect the user’s language and respond in kind. This is a good test of your routing logic - you might want a different model for different languages.
Evaluation harness: Build a test suite of questions with expected answers and measure your bot’s accuracy, response time, and cost across different configurations. Orq.ai can help you with this.
RAG integration: Instead of putting all the context in the system prompt, build a retrieval pipeline that fetches relevant documentation chunks based on the user’s question. This scales much better as your knowledge base grows. Orq.ai can help you with this too.

P.S. If You Enjoy Coding Challenges Here Are Four Ways You Can Help Support It

Refer a friend or colleague to the newsletter. 🙏
Sign up for a paid subscription - think of it as buying me a coffee ☕️ twice a month, with the bonus that you also get 20% off any of my courses.
Buy one of my courses that walk you through a Coding Challenge.
Subscribe to the Coding Challenges YouTube channel!

Share Your Solutions!

Request for Feedback

You can reach me on Bluesky, LinkedIn or through SubStack

Thanks and happy coding!

John

Coding Challenge #116 - Awk

John Crickett — Sat, 18 Apr 2026 08:00:51 GMT

Hi, this is John with this week’s Coding Challenge.

🙏 Thank you for being a subscriber, I’m honoured to have you as a reader. 🎉

If there is a Coding Challenge you’d like to see, please let me know by replying to this email📧

Coding Challenge #116 - Awk

This challenge is to build your own version of awk, the classic text processing language.

Awk was created in 1977 by Alfred Aho, Peter Weinberger, and Brian Kernighan (the name comes from their initials). It’s a small but remarkably powerful language designed for processing structured text data.

Awk reads input line by line, splits each line into fields, and applies pattern-action rules to produce output. It sits in a sweet spot between sed (which is great for simple substitutions) and a full programming language like Perl or Python. Despite being nearly 50 years old, awk remains one of the most useful tools in a developer’s toolkit. You’ll find it in shell scripts, data pipelines, and one-liners across every Unix system in the world. Building your own awk will teach you about lexing, parsing, interpreters, and the design of small domain-specific languages.

If You Enjoy Coding Challenges Here Are Four Ways You Can Help Support It

Refer a friend or colleague to the newsletter. 🙏
Sign up for a paid subscription - think of it as buying me a coffee ☕️, with the bonus that you get 20% off any of my courses.
Buy one of my self-paced courses that walk you through a Coding Challenge.
Join one of my live courses where I personally teach you Go by building five of the coding challenges or systems software development by building a Redis clone.

The Challenge - Building Your Own Awk

In this challenge you’re going to build your own version of the awk text processing tool. Your tool will read input line by line, split each line into fields, match lines against patterns, and execute actions -- producing output that is compatible with the standard POSIX awk utility.

Awk programs are built from pattern-action rules that look like this: condition { action }. For each line of input, awk checks every rule. If the condition matches, the action is executed. It’s a simple model that turns out to be surprisingly expressive.

Step Zero

In this introductory step you’re going to set your environment up ready to begin developing and testing your solution.

Choose your target platform and programming language. I’d encourage you to pick a language you’re comfortable with for building interpreters. You’ll be writing a lexer, a parser, and a tree-walking interpreter, so a language with good string handling and data structures will make your life easier.

Before you start coding, have a read through the POSIX awk specification to get a feel for the language. Don’t worry about understanding every detail -- we’ll work through the features step by step. It’s also worth playing with the system awk on your machine to get a sense of how it behaves.

Create a test file to use throughout the challenge. Here’s one you can use:

echo "John 25 London
Jane 30 New York
Bob 22 Paris
Alice 35 Tokyo
Charlie 28 Berlin" > test.txt

Step 1

In this step your goal is to support the basic print action with field splitting.

Your tool should read input from a file (or stdin if no file is given), split each line into fields on whitespace, and support the print statement. The special variable $0 refers to the whole line, $1 to the first field, $2 to the second, and so on.

At this point, you only need to handle a bare action block with no pattern -- meaning the action runs for every line of input. Focus on getting the core loop right: read a line, split it into fields, execute the action, move to the next line.

Testing: Run these commands and compare against the system awk:

ccawk '{ print }' test.txt
ccawk '{ print $0 }' test.txt
ccawk '{ print $1 }' test.txt
ccawk '{ print $1, $3 }' test.txt
echo -e "hello\\nworld" | ccawk '{ print $0 }'

The first two should print every line. The third should print just the first name from each line. The fourth should print the name and city, separated by a space (the default output field separator). The fifth should read from standard input and print each line. Your output should match awk exactly.

Step 2

In this step your goal is to support the -F flag for custom field separators and the built-in variables NR, NF, and FS.

The -F flag lets the user specify a custom field separator. For example, -F: splits on colons, which is useful for parsing files like /etc/passwd.

Implement the built-in variables NR (the current record number, starting at 1), NF (the number of fields in the current record), and FS (the field separator).

Testing: Create a CSV-like test file and test with custom separators:

echo "john:25:london
jane:30:new york
bob:22:paris" > test2.txt

ccawk -F: '{ print $1 }' test2.txt
ccawk '{ print NR, $1 }' test.txt
ccawk '{ print NF }' test.txt

The first command should print just the names from the colon-separated file. The second should print line numbers alongside names. The third should print the number of fields on each line. Compare all output against the system awk.

Step 3

In this step your goal is to support patterns, comparison operators, and regular expression matching.

Awk’s power comes from its pattern-action model. A pattern can be a comparison expression (like $2 > 25), a regular expression (like /London/), or the special patterns BEGIN and END. If a line matches the pattern, the action is executed. If there’s no action, the default is { print }.

Implement comparison operators (==, !=, <, >, <=, >=), regular expression matching with /regex/ patterns and the ~ and !~ operators, logical operators (&&, ||, !), and the BEGIN and END special patterns. BEGIN runs before any input is read, and END runs after all input has been processed. With BEGIN available, you should also support setting FS within the program (e.g. BEGIN { FS = ":" }) as an alternative to the -F flag from Step 2.

Your program should support multiple pattern-action rules. Awk checks every rule for every line, so a single line can trigger multiple actions.

Testing:

ccawk '$2 > 25 { print $1 }' test.txt
ccawk '/London/ { print $1 }' test.txt
ccawk '$1 ~ /^[AJ]/ { print }' test.txt
ccawk 'BEGIN { print "Name Age" } { print $1, $2 } END { print "Done" }' test.txt
ccawk 'BEGIN { FS = ":" } { print $1 }' test2.txt
ccawk '$2 > 25 && $2 < 35 { print $1, "mid-range" }' test.txt
ccawk '/London/ { print "City:", $3 } /^J/ { print "J-name:", $1 }' test.txt

The first should print names of people older than 25. The second should print “John”. The third should print lines where the first field starts with A or J. The fourth should print a header, all names with ages, then “Done”. The fifth should set the field separator to colon inside BEGIN and print names from the colon-separated file. The sixth should print people whose age is between 25 and 35 exclusive. The seventh demonstrates multiple rules -- John’s line matches both patterns. Compare against awk.

Step 4

In this step your goal is to support variables, arithmetic operators, and assignment operators.

Awk variables are dynamically typed -- they can hold strings or numbers and convert between the two as needed. Uninitialised variables default to 0 when used as numbers and "" when used as strings.

Implement arithmetic operators (+, -, *, /, %, ^), assignment operators (=, +=, -=, *=, /=, %=), and string concatenation (which in awk is done by placing values next to each other with no operator).

Testing:

ccawk '{ total += $2 } END { print "Total age:", total }' test.txt
ccawk '{ print $1, $2 * 2 }' test.txt
ccawk '{ name = $1 " from " $3; print name }' test.txt
ccawk 'BEGIN { x = 2; print x ^ 10 }'

The first should print the sum of all ages. The second should print names with doubled ages. The third should concatenate fields with text. The fourth should print 1024. Compare against awk.

Step 5

In this step your goal is to support control flow: if/else, while, for, do-while, and C-style for loops.

Also implement break and continue for loops, next to skip to the next input record, exit to stop processing entirely, and the ternary conditional operator (condition ? value_if_true : value_if_false).

Testing:

ccawk '{ if ($2 > 25) print $1, "senior"; else print $1, "junior" }' test.txt
ccawk '{ for (i = 1; i <= NF; i++) print $i }' test.txt
ccawk '$1 == "Bob" { next } { print }' test.txt
ccawk '{ print; if (NR == 3) exit }' test.txt
ccawk '{ print ($2 > 25) ? $1 " is senior" : $1 " is junior" }' test.txt

The first should label people as senior or junior based on age. The second should print every field on its own line. The third should skip Bob’s line and print everything else. The fourth should print only the first three lines. The fifth uses the ternary operator to produce the same senior/junior labelling in a different style. Compare against awk.

Step 6

In this step your goal is to support associative arrays and the for (key in array) construct.

Associative arrays are one of awk’s most powerful features. They’re indexed by strings (not just integers) and can be used to count, group, and aggregate data. Implement the in operator for testing membership, for (key in array) for iterating over keys, and the delete statement for removing elements.

Testing:

ccawk '{ count[$3]++ } END { for (city in count) print city, count[city] }' test.txt
ccawk '{ ages[$1] = $2 } END { if ("Bob" in ages) print "Bob is", ages["Bob"] }' test.txt
ccawk '{ a[$1] = $2 } END { delete a["Bob"]; for (k in a) print k, a[k] }' test.txt

The first should count how many people live in each city. The second should check if Bob exists and print his age. The third should delete Bob and print the rest. Note that for (key in array) iterates in an unspecified order, so don’t worry about the ordering of output lines -- just make sure the content matches.

Step 7

In this step your goal is to support the printf statement and built-in string functions.

Implement printf with C-style format strings (supporting at least %d, %f, %s, %c, and %x with width and precision specifiers).

Implement these built-in string functions: length, substr, index, split, sub, gsub, match, sprintf, tolower, and toupper.

Testing:

ccawk '{ printf "%-10s %3d %s\\n", $1, $2, $3 }' test.txt
ccawk '{ print length($1) }' test.txt
ccawk '{ print substr($1, 1, 3) }' test.txt
ccawk '{ gsub(/o/, "0", $1); print }' test.txt
ccawk '{ print toupper($1) }' test.txt

The first should print a neatly formatted table. The second should print the length of each name. The third should print the first three characters of each name. The fourth should replace all “o” characters with “0” in the first field. The fifth should print names in uppercase. Compare against awk.

Step 8

In this step your goal is to support user-defined functions and built-in arithmetic functions.

Implement user-defined functions with the syntax function name(params) { body }. Functions should support local variables (declared as extra parameters in the function signature, which is how awk handles local scope) and return values with return.

Implement the built-in arithmetic functions: int, sqrt, sin, cos, atan2, exp, log, rand, and srand.

Testing:

ccawk 'function max(a, b) { return a > b ? a : b } { print $1, max($2, 30) }' test.txt
ccawk 'BEGIN { srand(42); for (i = 0; i < 5; i++) printf "%.4f\\n", rand() }'
ccawk '{ print $1, int(sqrt($2)) }' test.txt

The first should print each name alongside the greater of their age or 30. The second should print 5 random numbers. The third should print names with the integer square root of their age. Compare against awk (except for random numbers where the seed behaviour may differ).

Step 9

In this step your goal is to support the remaining output and input features.

Implement the output built-in variables: OFS (output field separator), ORS (output record separator), and RS (record separator). When print outputs multiple fields separated by commas, it uses OFS between them. Each print statement ends with ORS.

Implement the -f flag to read the awk program from a file instead of the command line, -v var=value for setting variables before execution, and support for reading from multiple input files. Implement the FILENAME, ARGC, and ARGV built-in variables.

Testing:

ccawk 'BEGIN { OFS="-" } { print $1, $2, $3 }' test.txt
echo 'BEGIN { print "Start" } { print FILENAME, $0 }' > prog.awk
ccawk -f prog.awk test.txt
ccawk -v threshold=25 '$2 > threshold { print $1 }' test.txt
ccawk '{ print FILENAME, $0 }' test.txt test2.txt

The first should print fields separated by dashes. The second should read the program from a file and print each line with its filename. The third should use the command-line variable. The fourth should process both files and show which file each line came from. Compare against awk.

Step 10

In this step your goal is to support piping output to shell commands and the getline function.

Implement the pipe operator for print, which lets you send output to an external command: print "hello" | "sort". Awk keeps the pipe open across multiple print statements to the same command, so all the output goes to a single invocation of the command. Implement the close() function to close a pipe or file, which is needed when you want to reopen a pipe or when the external command needs to receive EOF to produce output.

Implement getline in its various forms: getline to read the next line from the current input, getline var to read into a specific variable, getline < "file" to read from a file, and "command" | getline to read from a command.

Testing:

ccawk '{ print $1 | "sort" }' test.txt
ccawk '{ while (("date" | getline line) > 0) print line; close("date") }'
ccawk 'BEGIN { while ((getline line < "test.txt") > 0) print line }'

The first should print names in sorted order. The second should print the current date. The third should read and print the test file from within a BEGIN block. The fourth is a classic awk one-liner that sums file sizes from ls output. Compare against awk.

ls -l | ccawk 'NR > 1 { total += $5 } END { print "Total bytes:", total }'

Going Further

Here are some ideas to take your awk implementation further:

Add support for multi-character record separators (which gawk supports but POSIX awk does not)
Implement the OFMT and CONVFMT variables for controlling numeric-to-string conversion
Add support for range patterns (/start/,/stop/) which match all lines between two patterns
Implement coprocess communication with |& (a gawk extension)
Add support for @include to include other awk source files
Build a bytecode compiler and virtual machine instead of a tree-walking interpreter for better performance
Add support for the ENVIRON array for accessing environment variables
Implement nextfile to skip to the next input file
Try running your awk against real-world awk scripts (there are many collected online) and see how compatible your implementation is

P.S. If You Enjoy Coding Challenges Here Are Four Ways You Can Help Support It

Refer a friend or colleague to the newsletter. 🙏
Sign up for a paid subscription - think of it as buying me a coffee ☕️ twice a month, with the bonus that you also get 20% off any of my courses.
Buy one of my courses that walk you through a Coding Challenge.
Subscribe to the Coding Challenges YouTube channel!

Share Your Solutions!

Request for Feedback

You can reach me on Bluesky, LinkedIn or through SubStack

Thanks and happy coding!

John

Coding Challenge #115 - Code Sherpa

John Crickett — Sat, 11 Apr 2026 08:01:23 GMT

Hi, this is John with this week’s Coding Challenge.

🙏 Thank you for being a subscriber, I’m honoured to have you as a reader. 🎉

If there is a Coding Challenge you’d like to see, please let me know by replying to this email📧

Coding Challenge #115 - Code Sherpa

This challenge is to build your own semantic code exploration tool - a system that helps developers make sense of large, unfamiliar codebases using natural language questions instead of reading files top to bottom.

We’ve all been there. You join a new team, or pick up a legacy project, and you’re staring at thousands of files with no idea where the interesting bits live. You grep for keywords, open file after file, and slowly piece together how things fit. It works, but it’s slow and frustrating.

Code Sherpa takes a different approach. You point it at a codebase and ask questions in plain English: “where do we handle payment failures?” or “how does the authentication flow work?” The system finds the relevant code by meaning, not by keyword matching, explains what it does, and remembers what you’ve already explored so each session builds on the last.

Under the hood, the system parses code into semantic chunks, embeds them into vectors, and stores them in a vector database such as Oracle Database 26ai. When you ask a question, it retrieves the most relevant code using both vector similarity and full-text search, and uses an LLM to explain it in context. The agentic behaviour - memory, follow-up questions, multi-step exploration - is orchestrated using LangGraph’s state machine model, while LangChain handles the retrieval chains and LLM integration. It’s a practical introduction to vector search, full-text search, embeddings, code parsing, agent orchestration, and building a web interface to tie it all together.

If You Enjoy Coding Challenges Here Are Four Ways You Can Help Support It

Refer a friend or colleague to the newsletter. 🙏
Sign up for a paid subscription - think of it as buying me a coffee ☕️, with the bonus that you get 20% off any of my courses.
Buy one of my self-paced courses that walk you through a Coding Challenge.
Join one of my live courses where I personally teach you Go by building five of the coding challenges or systems software development by building a Redis clone.

The Challenge - Building Code Sherpa

You’re going to build a semantic code exploration tool. It starts by ingesting a codebase into a vector database, then lets you query it through a web interface using natural language. Step by step you’ll add code parsing, vector storage, semantic retrieval, LLM-powered explanations, project management, a web interface, agent memory, and intelligent navigation. By the end, you’ll have a tool that genuinely helps you understand unfamiliar code.

Step Zero

In this introductory step you’re going to set your environment up ready to begin developing and testing your solution.

You’ll need to make a few decisions and get some infrastructure running:

Set up your vector database. I suggest Oracle Database 26ai running in a local Docker container. Pull the container-registry.oracle.com/database/free:latest image, start the container, and set a password for the admin account. You can find full setup instructions in the Oracle Database Free Get Started guide. Once the container is running, connect using a SQL client and verify you can create a table. Store all credentials in an environment file, not hardcoded anywhere.

docker pull container-registry.oracle.com/database/free:latest
docker run -d -p 1521:1521 -e ORACLE_PWD= container-registry.oracle.com/database/free:latest

Choose your embedding model. You need a code-aware embedding model - one that understands programming constructs, not just prose. Nomic’s CodeRankEmbed is open source (MIT), purpose-built for code retrieval, and lightweight enough to run locally on CPU without a GPU or API key. It produces 768-dimensional vectors. Install it via Hugging Face: pip install sentence-transformers and load it as nomic-ai/CodeRankEmbed. If you have a GPU available, Nomic’s larger nomic-embed-code (7B parameters, requires ~14GB VRAM) offers a few points better retrieval quality.
Set up your LLM provider. You’ll need a language model for generating explanations later. Any provider with a chat API will work - Anthropic, OpenAI, Google, Mistral, or a local model.
Set up LangChain and LangGraph. You’ll be using LangChain for retrieval chains and LLM integration, and LangGraph for orchestrating the agent’s behaviour as a state machine. Install both: pip install langchain langgraph. LangChain handles the plumbing of embedding, retrieval, and prompting. LangGraph handles the agentic flow - deciding when to search memory, when to retrieve code, when to ask follow-up questions, and how to route between these steps.

Prepare a test codebase to work with throughout the challenge. Pick an open source project you’re curious about but haven’t explored in depth - something with a few thousand lines across multiple files and directories. A project with clear structure (like a web framework, CLI tool, or library) works well.

Testing: Verify your Oracle Database container is running and you can connect to it. Load your embedding model and generate a test embedding to confirm it returns a vector of the expected dimensions. Make a test call to your LLM API to confirm it returns a valid response. Verify your environment file is being read correctly and no credentials are in your source code.

Step 1

In this step your goal is to build a code parsing pipeline that breaks a codebase into meaningful chunks.

The foundation of semantic code search is good chunking. Rather than splitting files at arbitrary character boundaries, you want to split at logical boundaries: functions, classes, and modules. A chunk should represent one coherent unit of code that makes sense on its own.

Point your parser at a local directory and have it walk the file tree, identify source files, and split each one into chunks. Each chunk should carry metadata: the file path it came from, what type of chunk it is (function, class, module), the programming language, and the character range within the original file.

Not every file will parse cleanly. Some might have syntax errors, use unusual language features, or be in a format your parser doesn’t support. When that happens, log the failure and keep going. A partial index is far more useful than no index at all.

Display progress in the terminal as ingestion runs: how many files have been processed, how many chunks have been created, and any failures encountered.

Testing:

Run your parser against your test codebase and verify it produces chunks at function and class boundaries, not arbitrary splits.
Inspect several chunks and confirm each one contains a complete, coherent unit of code.
Check that the metadata on each chunk is correct - file path, chunk type, language, and character range should all match the source.
Introduce a file with a deliberate syntax error and verify the parser logs the failure and continues processing the rest.
Verify the terminal output shows meaningful progress: files processed, chunks created, and any errors.

Step 2

In this step your goal is to embed the code chunks and store them in your vector database.

Take each chunk from your parser, generate a vector embedding for it using your chosen embedding model, and store the embedding alongside the chunk’s text and metadata in Oracle Database. The metadata fields - file path, chunk type, language, and character range - should all be stored and indexed so you can filter on them later.

Think about how you structure your storage. You’ll want to be able to search by vector similarity, but also filter by metadata (e.g. “only show me Python files” or “only functions, not classes”). Set up your vector indexes accordingly. Also create an Oracle Text full-text index on the code text column - this will let you fall back to keyword search when vector similarity alone doesn’t find good matches, and is particularly useful when developers search for exact identifier names, error messages, or string literals.

Testing:

Run the full pipeline - parse then embed and store - against your test codebase.
Query Oracle Database directly to verify the data is there: check the total number of stored chunks matches what your parser reported.
Inspect a few stored entries and confirm they contain the embedding vector, the original code text, and all metadata fields.
Verify that both the vector index and the full-text index have been created on the appropriate columns.
Run the pipeline again against the same codebase and verify it handles the re-run sensibly (either updating existing entries or skipping duplicates).

Step 3

In this step your goal is to implement semantic retrieval so you can ask natural language questions and get back the most relevant code.

This is where the tool starts to feel useful. Take a natural language question from the user, embed it using the same model you used for the code chunks, and search your vector database for the closest matches using cosine similarity. Also run the query through Oracle’s full-text search on the same table. Combine the results - vector search finds semantically related code even when the words don’t match, while full-text search catches exact identifier names and string literals that vector search might rank lower. Return the top results along with their file paths and line references.

Not every query will have good matches. Set a minimum cosine similarity threshold - start at around 0.3 for CodeRankEmbed embeddings, which is a reasonable baseline for filtering out unrelated results. You may need to adjust this based on your embedding model: if you’re getting too many irrelevant results, raise it; if you’re missing relevant code, lower it. When nothing exceeds the threshold and full-text search also returns no matches, the system should tell the user honestly rather than returning low-confidence results that waste their time.

Build this as a simple CLI interface for now - you’ll add the web interface later. The user types a question, and the system returns the matching code chunks with their locations. This is database search only so far.

Testing:

Ask a question about something you know exists in your test codebase (e.g. “where is the main entry point?” or “how are errors handled?”). Verify the returned chunks are genuinely relevant.
Ask the same question using different phrasing and verify you get similar results. This is the whole point of semantic search - it matches by meaning, not keywords.
Search for an exact function or variable name. Verify the full-text search catches it even if the vector similarity score would be low.
Ask a question about something that definitely isn’t in the codebase. Verify the system tells you no relevant code was found rather than returning irrelevant results.
Check that every returned chunk includes its file path and line reference.

Step 4

In this step your goal is to add LLM-powered explanations so the system doesn’t just find code - it explains what the code does.

Raw code chunks are useful, but an explanation in plain language is far more helpful when you’re trying to understand an unfamiliar codebase. Wire up your LLM through LangChain to take the retrieved chunks and generate a clear explanation. Use LangChain’s retrieval chain to handle the prompt construction - passing the retrieved code as context alongside the user’s question.

The explanation should cite the specific functions and files involved. When the user asks how two parts of the codebase relate to each other, the system should retrieve both and explain the connection. Where multiple implementations of the same concept exist, it should surface all of them and explain the differences.

Rather than hardcoding the retrieval orchestration, bind tool definitions to the LLM so it can invoke tools during response generation. Define three tools: search_code(query) for hybrid vector and full-text search, read_file(file_path) for full file contents, and list_files(pattern) for glob-based file listing. The agent loop sends the user’s question, executes any tool calls the LLM requests, returns the results, and repeats until the LLM produces a final text response or hits a configurable iteration limit (default 10). Each tool invocation should be emitted as a progress event so the frontend can show what the agent is doing.

Crucially, the system should not speculate beyond what the retrieved code supports. If a question can’t be fully answered from what’s been ingested, it should say so explicitly rather than making things up.

Testing:

Ask what a specific function does. The explanation should be accurate, in plain language, and cite the file and function name.
Ask how two parts of the codebase relate (e.g. “how does the router connect to the request handlers?”). The system should retrieve relevant code from both areas and explain the relationship.
Verify the agent uses tool calls to gather information iteratively - it should invoke search_code, read_file, or list_files as needed and you should see progress events for each tool invocation.
Ask a question that requires multiple tool calls to answer fully (e.g. “trace the request lifecycle from entry point to response”). Verify the agent makes several tool calls before producing a final answer.
Ask a question that the codebase only partially answers. Verify the system explains what it can and explicitly flags what it can’t determine from the code.
Ask about a concept that has multiple implementations in the codebase. Verify the system surfaces all of them and explains how they differ.

Step 5

In this step your goal is to add project management so a developer can maintain separate knowledge bases for different codebases.

A developer working across multiple projects needs each one indexed and searchable independently. Add support for named projects. Store project metadata - name, source path, creation date, last ingestion timestamp, file count, and chunk count - in Oracle Database alongside your embeddings. Each project should store its embeddings, metadata, and any agent memory in isolation, so queries against one project never return results from another.

The user should be able to create a new project, list existing projects, and select which project to query. When a codebase is re-ingested into an existing project, only the changed files should be re-embedded - unchanged files should keep their existing embeddings. This makes re-ingestion fast even for large projects.

All project data should persist between runs in Oracle Database. When the user comes back tomorrow and selects a project, everything should be exactly as they left it.

Testing:

Create two projects from two different codebases. Query each one and verify the results come only from the correct project.
List your projects and verify both appear with the correct names and metadata (source path, file count, last ingestion time).
Query the project metadata directly in Oracle Database and verify it matches what the system reports.
Modify a single file in one of your test codebases, re-ingest, and verify that only the changed file’s chunks are re-embedded. Unchanged files should not be re-processed.
Stop and restart your system. Verify all project data is still intact and queryable.

Step 6

In this step your goal is to build a web interface for browsing and querying your indexed codebases.

The web interface should launch automatically when the system starts, opening in the default browser or displaying the local URL clearly in the terminal. Build it as a three-panel layout: a file tree on the left, a chat panel in the centre, and a code viewer on the right.

The chat panel should present a chat-style conversation interface. User and assistant messages appear in a scrollable history as distinct message bubbles. When follow-up mode is enabled (the default), conversation history is sent with each request so the LLM can reference prior exchanges. A “New Chat” button resets the conversation. Assistant responses should render markdown and include source citations with expandable code snippet previews. Clicking a file reference in a citation or the file tree opens the code in the viewer panel alongside the chat.

The web interface should also include a dedicated Projects page, separate from the code exploration view, for managing projects. From this page, users can create new projects by specifying a name and source (local path or GitHub URL), trigger ingestion or re-ingestion, and monitor progress in real time. Progress updates should stream via Server-Sent Events (SSE), showing the current phase and batch progress (e.g. “Embedding chunks: batch 3 of 10”). On completion, display a summary of chunks stored and files skipped, updated, or removed. Concurrent ingestion on the same project should be prevented.

Add a browsable file tree of the ingested codebase structure, built from the stored metadata. This gives the user a visual overview of the project layout without needing to look at the actual file system.

While the agent is processing a query, show a loading state so the user knows something is happening.

Testing:

Start the system and verify the web interface launches and is accessible in your browser.
Select a project and ask a question through the chat panel. Verify the response appears as a message bubble with markdown rendering, source citations, and expandable code previews.
Ask a follow-up question and verify the system uses conversation history to maintain context.
Click “New Chat” and verify the conversation resets.
Click a file reference in a citation and verify it opens in the code viewer panel.
Switch between projects and verify the results update to reflect the selected project.
Browse the file tree and verify it accurately reflects the structure of the ingested codebase.
Create a new project from the Projects page by providing a name and source path. Trigger ingestion and verify progress streams in real time, showing the current phase and batch progress.
Attempt to start a second ingestion on the same project while one is running. Verify it is prevented.
Submit a query and verify a loading indicator appears while the response is being generated.

Step 7

In this step your goal is to add agent memory so the system remembers what’s been explored and what the developer has told it about the project.

Without memory, every session starts from scratch. The developer re-explains the same context, re-asks the same orientation questions, and the system re-explains things it’s already covered. Memory changes that.

Implement two types of memory, stored in Oracle Database alongside your code embeddings. Episodic memory tracks which areas of the codebase the developer has already explored, so the system can avoid re-explaining concepts that have already been covered. Semantic memory stores project-level context that the developer provides - things like “this service owns all payment logic” or “the legacy auth module is being deprecated” - and applies it to future responses within that project.

Use LangGraph to build a memory-aware query graph. When a question arrives, the graph should first check memory for relevant prior context, then decide how to handle the query: if the user has already explored this area, route to a node that builds on prior understanding rather than explaining from scratch; if it’s new territory, route to a full retrieval and explanation. This routing logic is where LangGraph’s state machine model pays off - each node in the graph handles one concern (check memory, retrieve code, generate explanation, update memory) and the edges encode the decision logic.

Since you already have Oracle Database storing your code embeddings, it’s a natural home for memory too. Store memory entries as vectors so they can be retrieved by semantic similarity - when a developer asks a question, the system can search its memory for relevant prior context the same way it searches the codebase for relevant code. Keep memory isolated per project, just like your code embeddings.

Both types of memory should persist across sessions in Oracle Database. When the user returns to a project, the system should pick up where it left off. When asked, the system should be able to provide a summary of what’s been explored so far and what remains unvisited.

Provide a dedicated Memory page in the web interface, accessible from the main navigation alongside the Explorer and Projects pages. The page should show two sections: episodic memory (exploration history with queries, files explored, summaries, and timestamps) and semantic memory (developer-provided context with content and timestamps). Users can search across both memory types using a unified search that combines text matching and vector similarity. Semantic memory entries can be added, edited, and deleted individually. Both memory types support individual deletion and bulk clear operations with confirmation dialogs.

Testing:

Explore several areas of a codebase across a session. End the session, start a new one, and ask the system what you’ve explored so far. It should accurately summarise the areas you’ve already covered.
Tell the system something about the project (e.g. “the payments module is the most critical part of this service”). In subsequent queries, verify the system uses this context to inform its responses.
Ask about something you’ve already explored. The system should recognise this and build on prior understanding rather than explaining from scratch.
Ask for a summary of what’s been explored versus what remains unvisited. Verify it gives a reasonable breakdown.
Open the Memory page and verify it shows episodic and semantic memory entries with timestamps.
Add a new semantic memory entry from the Memory page. Return to the Explorer and verify the system uses it in subsequent responses.
Edit and delete individual memory entries. Verify changes take effect immediately.
Use the search function on the Memory page and verify it finds entries by both keyword and meaning.
Use the bulk clear operation on episodic memory and verify all entries are removed after confirmation.

Step 8

In this step your goal is to add intelligent navigation so developers can drill deeper into code and follow connections naturally.

A good exploration tool doesn’t just answer isolated questions - it helps you follow threads. Start by classifying query intent using a single LLM call rather than regex patterns. The classifier should determine whether a query is a map request, a follow-up to a previous exchange, a broad exploration, or a specific question. Include recent conversation history in the classification prompt so the LLM can detect follow-ups contextually rather than relying on keyword matching.

Use LangGraph to model multi-step retrieval as a state graph. When a user asks “what calls this?”, the graph should: retrieve the current function’s code, identify references to it across the codebase, retrieve those callers, and generate an explanation that ties them together. Each step is a node in the graph, with state passed between them.

Add support for follow-up questions that drill deeper into a previous result without the user needing to re-state context. If the system just explained a function, the user should be able to ask “what calls this?” or “where is the return value used?” and get a meaningful answer. LangGraph’s state carries the conversation context forward, so the agent knows which function is being discussed.

Dependency extraction should be language-aware. Use the language field stored on each code chunk to select appropriate import and require patterns for the chunk’s language. Support at minimum Python (import X, from X import Y), JavaScript/TypeScript (import ... from, require(), import()), Go (import "pkg", multi-line import (...)), and Java (import com.example.Foo) import styles, plus class inheritance patterns for each language. For unsupported languages, fall back to a generic regex that detects common import keywords.

When the system identifies a dependency or reference in retrieved code, it should offer to retrieve and explain the linked code. In the web interface, render these as clickable elements that trigger a follow-up retrieval.

Add an exploration planning capability using LangGraph. When a user asks a broad question like “how does the authentication system work?”, the agent should plan a multi-step exploration: find entry points, trace the authentication flow through the codebase, retrieve each step, and produce a coherent walkthrough. Model this as a graph where each retrieval step feeds into the next, building up a complete picture rather than returning a single set of search results.

Finally, add a “map” query that returns a high-level summary of the codebase structure: a breakdown of languages used, top-level modules, and entry points where identifiable. The map query should identify key files that signal project structure: build configuration (pyproject.toml, package.json, Cargo.toml, go.mod, pom.xml, build.gradle, Makefile, CMakeLists.txt), container definitions (Dockerfile, docker-compose.yml), documentation (readme, changelog, contributing, license), and entry points (main., app., index., server., cli.*). This gives the developer a bird’s-eye view before diving into specifics.

Testing:

Ask about a function, then ask a follow-up like “what calls this?” without re-stating which function you mean. The system should understand from context and return relevant results.
Look for linked dependencies in a response. If the system identifies an import or function call, verify it offers to explain the linked code.
In the web interface, click a dependency link and verify it triggers a follow-up retrieval and explanation.
Verify that query classification correctly identifies a map request, a follow-up, a broad exploration, and a specific question.
Run the “map” query and verify you get a useful high-level summary: languages, modules, key files (build config, Dockerfiles, documentation), and entry points should all be represented.
Verify dependency extraction works across languages: test with Python imports, JavaScript/TypeScript requires, Go imports, and Java imports if your test codebase includes them.

Going Further

You’ve built a working semantic code exploration tool. Here are some ways to take it further:

Cloud database support: Add an option to connect to a cloud-hosted Oracle Database instance instead of the local Docker container. Read the connection string from configuration so teams can share a single index.
Multi-language parsing: Extend your parser to handle a wider range of programming languages. Tree-sitter is a good foundation for this - it provides consistent parsing across dozens of languages.
Collaboration features: Allow multiple developers to share a project index and see each other’s exploration history. This is particularly useful during onboarding, where a senior engineer’s exploration trail becomes a guided tour for newcomers. Think about how to share the insights they provide and create a shared memory.
IDE integration: Build an extension for VS Code or another editor that lets developers query Code Sherpa directly from their IDE, with results that link back to the exact line in the editor.
Export and share: Let users export their exploration session - the questions asked, the code found, and the explanations generated - as a shareable document. This turns an exploration session into reusable documentation.

This coding challenge was sponsored by Oracle.

P.S. If You Enjoy Coding Challenges Here Are Four Ways You Can Help Support It

Refer a friend or colleague to the newsletter. 🙏
Sign up for a paid subscription - think of it as buying me a coffee ☕️ twice a month, with the bonus that you also get 20% off any of my courses.
Buy one of my courses that walk you through a Coding Challenge.
Subscribe to the Coding Challenges YouTube channel!

Share Your Solutions!

Request for Feedback

You can reach me on Bluesky, LinkedIn or through SubStack

Thanks and happy coding!

John

Coding Challenge #114 - Gzip

John Crickett — Sat, 04 Apr 2026 08:00:47 GMT

Hi, this is John with this week’s Coding Challenge.

🙏 Thank you for being a subscriber, I’m honoured to have you as a reader. 🎉

If there is a Coding Challenge you’d like to see, please let me know by replying to this email📧

Coding Challenge #114 - Gzip

This challenge is to build your own version of gzip, the widely used file compression utility.

Gzip has been a cornerstone of computing since 1992. It’s used everywhere, compressing files on the command line, serving web content over HTTP, packaging up tarballs for distribution, and much more. Under the hood, gzip uses the DEFLATE compression algorithm (a combination of LZ77 and Huffman coding) wrapped in a simple file format defined by RFC 1952. Building your own gzip will give you a deep understanding of how data compression works, how file formats are structured, and how command-line tools handle the many options users expect.

If You Enjoy Coding Challenges Here Are Four Ways You Can Help Support It

Refer a friend or colleague to the newsletter. 🙏
Sign up for a paid subscription - think of it as buying me a coffee ☕️, with the bonus that you get access to a monthly AMA and 20% off any of my courses.
Buy one of my self-paced courses that walk you through a Coding Challenge.
Join one of my live courses where I personally teach you Go by building five of the coding challenges or systems software development by building a Redis clone.

The Challenge - Building Gzip

In this challenge you’re going to build your own version of the gzip compression tool. Your tool will compress and decompress files using the DEFLATE algorithm, producing output that is fully compatible with the standard gzip and gunzip utilities.

You can use a library for the DEFLATE algorithm if you want, but I’d really encourage you to implement it yourself by reading the spec. Data compression is one of those topics that sounds intimidating but becomes surprisingly approachable once you break it down. You’ll learn far more about how compression actually works by building LZ77 and Huffman coding from scratch than by calling a library function. The RFCs are well written and this challenge is structured to walk you through it incrementally.

Step Zero

In this introductory step you’re going to set your environment up ready to begin developing and testing your solution.

Choose your target platform and programming language. I’d encourage you to pick a language that gives you access to low-level byte manipulation, as you’ll be working with binary data, checksums, and bit-level operations.

Before you start coding, read through RFC 1952 (the gzip file format) and RFC 1951 (the DEFLATE compressed data format). Don’t worry about understanding every detail right now, we’ll work through the key parts step by step. The important thing is to get a feel for how the pieces fit together: gzip is a wrapper format, and DEFLATE is the compression algorithm inside it.

Step 1

In this step your goal is to produce a valid .gz file using DEFLATE stored (uncompressed) blocks.

Before you tackle actual compression, get the gzip file format right first. Your tool should take a filename as an argument, wrap its contents in a valid .gz file with the correct header and trailer, and write it out with the .gz extension appended. After writing the compressed file, remove the original (this is the default gzip behaviour).

The gzip format (RFC 1952) requires a 10-byte header containing the magic number (1f 8b), the compression method (08 for DEFLATE), and flags. It also requires a trailer containing the CRC-32 checksum of the original data and the original file size (modulo 2^32).

For the DEFLATE payload, use stored blocks (block type 00 in RFC 1951, section 3.2.4). A stored block simply contains the raw data with a small header — no actual compression. This lets you get the gzip wrapper, CRC-32 calculation, and file handling working correctly before you add real compression. Your output will be valid gzip, just larger than the input.

Testing: Create a test file and compress it with your tool, then decompress it with the system gunzip to verify your format is correct:

echo "Hello, World!" > test.txt
ccgzip test.txt
gunzip test.txt.gz
cat test.txt

You should see Hello, World! and the file should be identical to the original. The .gz file will be slightly larger than the original since you’re not compressing yet — that’s fine, the important thing is that gunzip accepts it.

Step 2

In this step your goal is to implement LZ77, the first half of the DEFLATE algorithm.

LZ77 works by sliding a window over the input data and looking for sequences that have already appeared. When it finds a match, instead of storing the bytes again, it stores a (length, distance) pair — “copy 5 bytes from 12 positions back”. This is how DEFLATE eliminates repeated patterns.

Implement a sliding window (up to 32,768 bytes as per the spec) and a match-finding algorithm. For each position in the input, search the window for the longest match. If you find a match of 3 bytes or more, emit a (length, distance) pair. Otherwise, emit the literal byte.

At this point, encode your LZ77 output using DEFLATE fixed Huffman codes (RFC 1951, section 3.2.6). Fixed codes use a predefined Huffman table built into the spec, so you don’t need to build your own trees yet — you just need to emit the right bit sequences for literals, lengths, and distances.

Testing: Compress a file with your tool and decompress with gunzip:

echo "abcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabc" > test.txt
ccgzip test.txt
gunzip test.txt.gz
cat test.txt

The .gz file should now be smaller than the original for repetitive data. Compare the compressed size to what the system gzip produces — yours won’t be as good yet (we’ll get there), but it should be noticeably smaller than the stored blocks from Step 1. Try it on a larger text file to see a more dramatic difference.

Step 3

In this step your goal is to implement Huffman coding, the second half of DEFLATE.

Fixed Huffman codes work, but they’re not optimal for every input. Dynamic Huffman codes (RFC 1951, section 3.2.7) let you build custom Huffman trees tailored to the actual frequency of symbols in your data, which gives much better compression.

Build a Huffman tree from the frequencies of the literal/length and distance symbols in your LZ77 output. Encode the tree itself into the DEFLATE block header (the spec describes exactly how to do this using code length codes), then encode the data using your custom tree.

This is the trickiest part of the challenge, take it slow and test frequently. The encoding of the Huffman tree in the block header is fiddly, with its own mini-Huffman encoding for the code lengths. The RFC walks through it methodically; follow it closely.

Testing: Compress files with your tool and compare the sizes to the system gzip:

echo "abcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabc" > test.txt
cp test.txt test-original.txt
ccgzip test.txt
ls -la test.txt.gz
gunzip test.txt.gz
diff test.txt test-original.txt

Rename one of them to compare both. Your compressed output should now be close to what the system gzip produces. Test with a variety of files — text, source code, binary — and decompress each with gunzip to make sure everything round-trips correctly.

Step 4

In this step your goal is to decompress a .gz file using the -d flag.

When your tool is called with -d, it should read a .gz file, validate the gzip header, decompress the DEFLATE data, and write the original content to a new file with the .gz extension removed. After decompression, the .gz file should be removed.

You’ll need to reverse everything you built in the previous steps: parse the DEFLATE block headers, reconstruct the Huffman trees (for dynamic blocks), decode the symbols, and replay the LZ77 back-references to reconstruct the original data.

Your tool should validate the CRC-32 checksum and original file size stored in the gzip trailer. If the checksum doesn’t match the decompressed data, your tool should report an error and exit with a non-zero status code — this is how gzip detects corrupt files.

Testing: Compress a file with the system gzip, then decompress it with your tool:

echo "Testing decompression" > test.txt
gzip test.txt
ccgzip -d test.txt.gz
cat test.txt

You should see Testing decompression. Also test with a corrupted file — use a hex editor to change a byte in a .gz file and verify your tool reports a CRC error.

Step 5

In this step your goal is to support reading from standard input and writing to standard output, along with the -c, -k, and -f flags.

When no filename is provided, your tool should read data from stdin, compress it, and write the compressed output to stdout. When decompressing, it should read compressed data from stdin and write the decompressed output to stdout.

Implement the -c flag, which writes compressed or decompressed output to stdout instead of to a file, leaving the original file unchanged. Implement the -k flag, which keeps (does not delete) the original file after compression or decompression. And implement the -f flag, which forces overwriting of existing output files without prompting.

Testing: Test piping data through your tool:

echo "Piped data" | ccgzip | gunzip
echo "Piped data" | gzip | ccgzip -d

Both should output Piped data. Test -k and -f:

echo "Keep me" > test.txt
ccgzip -k test.txt
ls test.txt test.txt.gz
ccgzip -f test.txt

The -k flag should leave both files in place. The second compression should succeed because -f forces the overwrite of the existing .gz file.

Step 6

In this step your goal is to support compressing and decompressing multiple files and to implement compression levels.

Your tool should accept multiple filenames as arguments and process them one at a time. If any file fails, your tool should report the error and continue processing the remaining files.

Implement compression levels from -1 (fastest, least compression) to -9 (slowest, best compression), with -6 as the default. These levels control how aggressively the LZ77 stage searches for matches — at level 1, you might limit the search to short look-aheads and a smaller window, whilst at level 9, you search more thoroughly for the longest possible matches.

Testing: Test multiple files:

echo "File one" > a.txt
echo "File two" > b.txt
echo "File three" > c.txt
ccgzip a.txt b.txt c.txt
ls *.gz

You should see a.txt.gz, b.txt.gz, and c.txt.gz. Test compression levels on a larger file:

ccgzip -1 -k test.txt && mv test.txt.gz test-fast.gz
ccgzip -9 -k test.txt && mv test.txt.gz test-best.gz
ls -la test-fast.gz test-best.gz

For a sufficiently large file, the -9 version should be smaller than the -1 version.

Step 7

In this step your goal is to implement the -l, -t, and -v flags.

Implement the -l flag, which displays compression statistics for a .gz file without decompressing it. The output should include the compressed size, uncompressed size, compression ratio, and the original filename. The format should match the standard gzip output:

         compressed        uncompressed  ratio uncompressed_name
                 73                  26  42.3% test.txt

Implement the -t flag, which tests the integrity of a .gz file by decompressing it and validating the CRC-32 checksum without writing the output to disk. If the file is valid, it exits silently with status 0. If it’s corrupt, it reports an error.

Implement the -v flag, which displays the name and compression ratio for each file as it is processed. This is useful when compressing multiple files so you can see progress.

Testing: Test the -l flag:

gzip -k test.txt
ccgzip -l test.txt.gz
gzip -l test.txt.gz

Compare the output of both commands — they should show the same statistics. Test -t on a valid and a corrupted file. Test -v by compressing multiple files and checking that each one shows its name and ratio.

Step 8

In this step your goal is to implement recursive directory compression, preserve file metadata, and ensure full compatibility with the standard gzip tools.

Implement the -r flag, which recursively traverses directories and compresses (or decompresses) all files found within them.

The gzip header has optional fields for the original filename and the modification timestamp of the source file. Your tool should store these when compressing and restore the modification timestamp when decompressing.

Your output should be fully compatible with the system gzip and gunzip , any file compressed by your tool should decompress correctly with gunzip, and any file compressed by gzip should decompress correctly with your tool. The .gz extension should be added on compression and removed on decompression.

Testing: Test recursive compression:

mkdir -p testdir/subdir
echo "Root file" > testdir/file1.txt
echo "Sub file" > testdir/subdir/file2.txt
ccgzip -r testdir
find testdir -name "*.gz"

You should find testdir/file1.txt.gz and testdir/subdir/file2.txt.gz. Test timestamp preservation:

touch -t 202301151200.00 test.txt
ccgzip -k test.txt
ccgzip -d test.txt.gz
stat test.txt

The modification timestamp should match the original. Test full compatibility by compressing a variety of files (text, binary, empty) with both your tool and the system gzip, and cross-decompressing them to verify they produce identical output.

Going Further

Here are some ideas to take your gzip implementation further:

Add support for concatenated gzip streams (multiple gzip members in a single file)
Implement the -rsyncable option which makes the compressed output more friendly to rsync’s delta transfer algorithm
Add support for the -suffix option to use a custom file extension
Build a parallel compression mode (like pigz) that uses multiple CPU cores to compress data faster
Add support for decompressing other formats that gzip can handle, such as compress (.Z) files
Experiment with different match-finding strategies (hash chains, binary trees, optimal parsing) and measure the compression ratio and speed trade-offs

P.S. If You Enjoy Coding Challenges Here Are Four Ways You Can Help Support It

Refer a friend or colleague to the newsletter. 🙏
Sign up for a paid subscription - think of it as buying me a coffee ☕️ twice a month, with the bonus that you also get 20% off any of my courses.
Buy one of my courses that walk you through a Coding Challenge.
Subscribe to the Coding Challenges YouTube channel!

Share Your Solutions!

Request for Feedback

You can reach me on Bluesky, LinkedIn or through SubStack

Thanks and happy coding!

John

Coding Challenge #113 - AI Writing Detector

John Crickett — Sat, 28 Mar 2026 09:02:19 GMT

Hi, this is John with this week’s Coding Challenge.

🙏 Thank you for being a subscriber, I’m honoured to have you as a reader. 🎉

If there is a Coding Challenge you’d like to see, please let me know by replying to this email📧

Coding Challenge #113 - AI Writing Detector

This challenge is to build your own AI writing detector that analyses text and determines the likelihood it was written by an AI rather than a human.

Some people believe and argue that AI-generated text has recognisable patterns and characteristics. They argue that language models tend to favour certain vocabulary, use particular phrasing structures, and employ specific rhetorical techniques. To a certain extent they’re right, LLMs can indeed sometimes produce text that is often unnaturally uniform in sentence length, excessively formal in tone, and littered with vague attributions that sound authoritative but say nothing specific. By identifying these patterns, you can build a system that scores text and provides a detailed analysis of what makes it look AI-generated or human-written.

By building this project you’ll learn how to implement some complex text analysis and just how hard it truly is to detect AI written text. Once you have a working solution try it on some famous text that pre-dates LLMs (i.e. written before 2017). The reality is it is hard to detect AI writing, but a great learning exercise.

If You Enjoy Coding Challenges Here Are Four Ways You Can Help Support It

Refer a friend or colleague to the newsletter. 🙏
Sign up for a paid subscription - think of it as buying me a coffee ☕️, with the bonus that you get access to a monthly AMA and 20% off any of my courses.
Buy one of my self-paced courses that walk you through a Coding Challenge.
Join one of my live courses where I personally teach you Go by building five of the coding challenges or systems software development by building a Redis clone.

The Challenge - Building an AI Writing Detector

You’ll build a system that accepts text input and analyses it using a rule-based approach to detect linguistic patterns and characteristics commonly found in AI-generated writing. Your detector will evaluate text across multiple dimensions — vocabulary, sentence structure, rhetorical patterns, and statistical properties — and produce a detailed report showing the likelihood the text was AI-generated. And no, the em-dash ‘—‘ is not one of them!

This is a coding challenge that focuses on text analysis, pattern detection, and building a scoring system. You can implement it in any programming language you’re comfortable with.

Step Zero

In this introductory step you’re going to set your environment up ready to begin developing and testing your solution.

Choose a programming language you’re comfortable with for text analysis and string manipulation. You’ll be doing a fair amount of tokenisation, regular expression matching, and statistical calculation. Consider whether you’d like to build a command-line tool, a web-based interface, or both.

For testing, you’ll evaluate your detector against sample texts. Some AI-generated and some human-written. Prepare a small collection of texts to test against as you build. You can generate AI-written samples by asking any large language model to write an essay on a topic, and use your own writing or published articles for the human-written samples.

Step 1

In this step your goal is to accept text input for analysis and display basic statistics about it.

Your system should provide a way for users to submit text. Whether that’s reading from a file, accepting command-line input, or providing a text area in a web interface. As text is entered, display a character count and word count. If the input is empty, the analysis should not proceed.

Testing guidance: Load a sample text and verify that your character count and word count are correct. Try submitting empty input and confirm your system handles it gracefully by preventing the analysis from running.

Step 2

In this step your goal is to identify vocabulary and phrases that are commonly overused in AI-generated text.

Language models have characteristic vocabulary patterns, they favour words like “delve into”, “navigate”, “robust”, “innovative solutions”, “transformative”, “leverage”, “streamline”, and “ecosystem”. Build a detector that scans for these LLM-signature words and phrases and tracks how many distinct ones appear. You can find examples in Wikipedia’s Signs of AI writing article.

Score this detector based on the number of distinct AI vocabulary terms found, with each contributing a few points up to a maximum cap. This gives you your first pattern detection category.

Testing guidance: Write or generate a short AI-written essay and run your detector on it. You should find multiple hits. Then run it on a piece of your own writing. You should see far fewer. Verify that the score increases with more AI vocabulary terms but doesn’t exceed your cap.

Step 3

In this step your goal is to identify several common structural patterns in AI writing. Language models frequently use:

The “rule of three” — describing things in groups of three adjectives, three nouns, or three parallel phrases. AI text is full of these.
Negative parallelism — rigid constructions like “not only... but also” and similar overly formal parallel structures.
Outline-style conclusions — formulaic endings that follow the pattern: “Despite [challenges], [subject] offers [benefits/opportunities].”
False ranges — “from X to Y” constructions where the two endpoints don’t form a coherent or logical scale.

Build detectors for each of these patterns. When you find them, record where they occur in the text and contribute to the score. Each pattern category should have its own maximum cap so that no single category can dominate the overall score. Again you can find examples in Wikipedia’s Signs of AI writing article.

Testing guidance: Craft test sentences for each pattern type and verify your detectors find them. For the rule of three, try: “This approach is efficient, scalable, and maintainable.” For negative parallelism, try: “It not only improves performance but also enhances reliability.” Test that your scoring caps are working, text saturated with one pattern type should hit the cap and stop accumulating.

Step 4

In this step your goal is to identify patterns where text makes broad claims without concrete evidence. AI writing often lacks specificity, hiding behind phrases that sound authoritative but name no actual sources.

Build detectors for three categories:

Vague attributions — phrases like “experts agree”, “studies show”, “research indicates”, and “industry insiders report” that appeal to unnamed sources.
Superficial analysis — hedging and filler like “it is worth noting”, “significant developments”, “one could argue”, and “various sources indicate”.
Overgeneralisation — patterns that frame limited information as universal, such as “everyone knows”, “it is well established”, and “universal consensus”.

Each category should contribute to the score independently, with its own cap.

Testing guidance: Write test sentences containing each type of vague language and verify your detectors find them. Then test with text that contains legitimate attributions (”According to a 2024 study by Smith et al. in Nature...”) and confirm your detector doesn’t flag specific, concrete citations.

Step 5

In this step your goal is to identify excessive emphasis and promotional language patterns. AI writing tends to oversell, using superlatives and marketing-style phrasing that reads more like advertising copy than natural writing.

Build detectors for three categories:

Undue emphasis — excessive use of superlatives, intensifiers (”tremendous”, “remarkable”, “groundbreaking”), and emphatic punctuation.
Promotional language — marketing phrasing like “game-changer”, “revolutionary”, “impressive features”, and “transformative potential”.
Elegant variation — where the same concept or entity is repeatedly referred to by different but equivalent terms across sentences. AI text often cycles through synonyms for the same thing rather than simply repeating the word as a human would.

Testing guidance: Run your detectors on marketing copy or AI-generated product descriptions — these should score highly. Compare with straightforward human writing about the same topic. For elegant variation, look for text where “the company” becomes “the organisation” becomes “the firm” becomes “the enterprise” within a few sentences.

Step 6

In this step your goal is to move beyond pattern matching and analyse the statistical properties of the writing itself. AI text has measurable differences from human writing in several dimensions.

Build analysers for some or all of these linguistic factors:

Lexical diversity — calculate the type-token ratio (unique words divided by total words). AI text often falls outside the normal range, either too uniform or artificially varied.
Sentence length variation — measure the standard deviation and coefficient of variation of sentence lengths. Human writing naturally varies; AI text tends to be unnaturally uniform, with a coefficient of variation below 0.35.
Passive voice frequency — identify passive voice constructions and calculate what percentage of sentences use them. Human writers typically use passive voice in 5-10% of sentences; AI text often exceeds 15%.
Transition word density — detect formal discourse markers (”furthermore”, “moreover”, “consequently”, “additionally”) and calculate the percentage of sentences containing them. More than 20% suggests AI generation.
Reading grade level — calculate the Flesch-Kincaid Grade Level. AI text often scores at an artificially high grade level (above 14), suggesting unnecessary complexity.
Punctuation patterns — analyse the density of semicolons, em-dashes, colons, and ellipses. AI text tends to overuse semicolons and em-dashes whilst rarely using ellipses.
Rare word usage — identify uncommon words and calculate their frequency. Human writers typically use rare words at 3-8% frequency; AI text often exceeds 12%.

Display each factor as a labelled indicator with a percentage value and a brief explanation of what the result means. Beware that some of these fail on short text.

Testing guidance: Run your analysers on both AI-written and human-written samples of similar length and topic. Compare the results — you should see measurable differences. Pay particular attention to sentence length variation and transition word density, which tend to be strong signals. Verify your Flesch-Kincaid calculation against an online readability calculator.

Step 7

In this step your goal is to aggregate all your pattern detections and linguistic analyses into an overall AI probability score from 0 to 100.

Combine the contributions from each detector. If the raw combined score exceeds 100, normalise the individual contributions proportionally so the final score is clamped to 100.

Then classify the text based on the score:

Below 30: “Likely Human-Written”
30 to 59: “Possibly AI-Generated”
60 or above: “Likely AI-Generated”

Display the score with a colour-coded indicator, green below 30, yellow for 30-59, red for 60 and above.

Testing guidance: Run your complete detector on several sample texts. Verify that AI-written samples score above 60 and human-written samples score below 30. If your scores don’t separate well, experiment with the weighting of different detectors. Check that the individual detector contributions add up correctly to the total score, and that normalisation works when the raw total would exceed 100.

Step 8

In this step your goal is to produce a comprehensive report that shows what was detected and why the text received its classification.

Your report should include:

The overall AI probability score and classification
Text statistics (word count, character count, average word length)
A breakdown of linguistic factors, each with its score and an explanation
A breakdown of pattern detections, each showing the category, occurrence count, score contribution, and explanatory text
The timestamp of when the analysis was performed

Present the linguistic factors and pattern detections in separate labelled sections so the report is easy to scan.

Testing guidance: Generate reports for both AI-written and human-written samples. The reports should tell a coherent story about why each text received its classification. Verify that the pattern breakdowns add up to the overall score. Check that the report is readable and the explanations make sense to someone who doesn’t know the internals of your system.

Going Further

Once you’ve built the core detector, here are ways to extend it:

Text highlighting — mark detected patterns directly in the original text with category-specific colours. When highlights from different detectors overlap, keep the first one and discard subsequent overlaps. Display category badges showing which patterns were found.
Advanced linguistic analysis — implement Zipf’s Law comparison (comparing word frequency distribution against the expected power-law distribution), named entity density analysis, and paragraph coherence measurement through inter-sentence similarity.
Copy and share — add a “Copy Results” button that copies the score and analysis to the clipboard, and an “Analyse Another Text” option to return to the input.
Accuracy measurement — collect a larger corpus of human-written and AI-written samples and measure your detector’s precision and recall.
Weighting experiments — try different weightings for each detector. Some patterns are stronger signals than others — which ones matter most?

P.S. If You Enjoy Coding Challenges Here Are Four Ways You Can Help Support It

Refer a friend or colleague to the newsletter. 🙏
Sign up for a paid subscription - think of it as buying me a coffee ☕️ twice a month, with the bonus that you also get 20% off any of my courses.
Buy one of my courses that walk you through a Coding Challenge.
Subscribe to the Coding Challenges YouTube channel!

Share Your Solutions!

Request for Feedback

You can reach me on Bluesky, LinkedIn or through SubStack

Thanks and happy coding!

John

Coding Challenge #112 - AI Coding Agent

John Crickett — Sat, 21 Mar 2026 09:01:17 GMT

Hi, this is John with this week’s Coding Challenge.

🙏 Thank you for being a subscriber, I’m honoured to have you as a reader. 🎉

If there is a Coding Challenge you’d like to see, please let me know by replying to this email📧

Coding Challenge #112 - AI Coding Agent

This challenge is to build your own AI coding agent - a command-line tool that can read, understand, and modify code on your behalf by combining a large language model with practical software engineering tools.

AI coding agents have rapidly become part of the modern developer’s toolkit. Tools like Claude Code, Codex, Cursor, Kiro, and AmpCode let you describe what you want in natural language and the agent figures out which files to read, what changes to make, and how to verify its work. Under the hood, they’re surprisingly approachable: a loop that talks to an LLM, a set of tools the model can call, and some orchestration to keep everything on track.

By building your own, you’ll gain a deep understanding of how these tools actually work - the agentic loop, tool use, context management, and all the engineering that turns a chat API into a coding assistant. You’ll also end up with something you can actually use on your own projects.

If You Enjoy Coding Challenges Here Are Four Ways You Can Help Support It

Refer a friend or colleague to the newsletter. 🙏
Sign up for a paid subscription - think of it as buying me a coffee ☕️, with the bonus that you get access to a monthly AMA and 20% off any of my courses.
Buy one of my self-paced courses that walk you through a Coding Challenge.
Join one of my live courses where I personally teach you Go by building five of the coding challenges or systems software development by building a Redis clone.

The Challenge - Building Your Own AI Coding Agent

You’re going to build a command-line AI coding agent, a simplified version of tools like Claude Code, Codex and AmpCode. It starts as a simple chat interface, and step by step you’ll add the ability to read files, edit code, run shell commands, search a codebase and manage context. By the end, you’ll have a working agent that can navigate a real project and make meaningful changes to it.

To really get the most from this challenge I suggest you call the LLM provider’s REST API directly and manage all the data yourself, this will give you the best understanding of AI agents and how they work.

Step Zero

In this introductory step you’re going to set your environment up ready to begin developing and testing your solution.

You’ll need to make a few decisions:

Choose your LLM provider. You need a model that supports tool calling (also called function calling). Most major providers support this: Anthropic, OpenAI, Google (Gemini), Mistral, or local models via Ollama. Check your chosen provider’s documentation for their tool-calling API - you’ll be using it heavily throughout this challenge.
Choose your programming language. Pick something you’re comfortable building CLI tools in. You’ll be doing a fair amount of file I/O, process spawning, and JSON handling. Python, TypeScript, Go, and Rust all work well. The language doesn’t matter nearly as much as your comfort with it.
Get your API key set up. Make sure you can make a basic chat completion request to your chosen provider and get a response back before moving on.

Prepare a small test project to use as a playground throughout the challenge - a simple application with a few files in a couple of directories. You’ll be pointing your agent at this project to test reading, editing, and searching.

Testing: Make a simple API call to your LLM provider with a basic prompt like “Hello, who are you?” and verify you get a coherent response. If you’re using a local model, confirm it’s running and accessible. I suggest using curl to do this so you know how to call the REST API for your provider.

Step 1

In this step your goal is to build the core agentic loop with streaming responses.

The heart of any coding agent is the loop: read user input, send it to the LLM, display the response, repeat. Build a REPL (read-eval-print loop) that takes input from the terminal, sends it to your LLM as a chat message, and streams the response back to the terminal as it arrives.

Streaming matters here. LLM responses can take several seconds to generate in full, and watching text appear token by token is a much better experience than staring at a blank screen. Your provider’s API will have a streaming option - use it.

Your loop should maintain a conversation history so the model has context from earlier in the session. Each time you send a request, include the full conversation so far: all previous user messages and assistant responses.

Handle the basics gracefully: let the user exit the session cleanly, and don’t crash if the API returns an error.

Testing:

Start your agent and have a multi-turn conversation. Ask a question, then ask a follow-up that references the previous answer. The model should understand the context.
Verify responses stream to the terminal incrementally rather than appearing all at once.
Check that you can exit the session cleanly (e.g. with Ctrl+C or typing “quit” or “exit”).
Disconnect from the network and send a message - verify the agent handles the error without crashing.

Step 2

In this step your goal is to add tool calling and implement file reading as your first tool.

Tool calling is what turns a chatbot into an agent. Instead of just generating text, the model can request to call a function - read a file, run a command, search for something - and your agent executes it and feeds the result back. The model then uses that result to continue its response.

Define a tool interface that your LLM can call. The exact format depends on your provider, but typically you describe each tool with a name, a description, and a JSON schema for its parameters. Start with a single tool: read_file, which takes a file path and returns the file’s contents.

The agentic loop now becomes: send the conversation to the LLM. If the response includes a tool call, execute it, append the result to the conversation, and send it back to the LLM. Keep looping until the model responds with text instead of a tool call. The model might chain several tool calls before giving a final answer - your loop should handle that naturally.

Testing:

Ask your agent “What’s in the file README.md?” (or any file in your test project). It should call the read_file tool, receive the contents, and summarise or discuss the file.
Ask it about a file that doesn’t exist. The tool should return an error, and the model should explain that the file wasn’t found rather than crashing.
Ask a question that requires reading multiple files. The model should make multiple tool calls in sequence to gather the information it needs.
Ask a question that doesn’t need any file reading (e.g. “What is a binary tree?”). The model should answer directly without calling any tools.

Step 3

In this step your goal is to add file editing and codebase search tools.

A coding agent that can only read files isn’t much use - it needs to be able to make changes too. Add an edit_file tool that applies a targeted edit to a file. A good approach is to have the tool take the file path, the text to find, and the text to replace it with. This is safer than having the model rewrite entire files, which is both slow and error-prone.

Also add a write_file tool for creating new files. This takes a file path and the full content to write.

Next, add two search tools so the agent can navigate unfamiliar code. A glob tool that finds files matching a pattern (e.g. **/*.py, src/**/*.ts) and a grep tool that searches file contents for a pattern and returns matching lines with file paths and line numbers.

With these five tools - read, edit, write, glob, and grep - your agent can explore and modify a codebase in a meaningful way.

Testing:

Ask your agent to add a comment to a specific function in your test project. It should read the file, make a targeted edit, and confirm the change. Open the file and verify the edit is correct.
Ask it to create a new file with some content. Verify the file is created with the correct contents.
Ask it to find all Python (or whatever language your test project uses) files in the project. It should use the glob tool and return the list.
Ask it to find where a specific function or variable is used across the codebase. It should use the grep tool to search.
Ask it to refactor something - rename a variable or extract a function. This should require multiple tool calls: search to find usages, then edit each one.

Step 4

In this step your goal is to add shell command execution and a permission system.

Shell access makes your agent dramatically more capable. It can run tests, install dependencies, check build output, and interact with any command-line tool. Add a bash tool that takes a command string, executes it in a shell, and returns the stdout, stderr, and exit code.

But with great power comes the need for guardrails. You don’t want your agent silently running rm -rf / because the model hallucinated a cleanup step. Implement a permission system that controls which actions the agent can take without asking.

Your permission system should support at least three modes for each tool: allow (execute without asking), prompt (ask the user for confirmation before executing), and deny (never execute). A sensible default is to allow read-only operations (file reading, glob, grep) automatically, prompt for mutations (file edits, shell commands), and let the user configure overrides.

When a tool call requires confirmation, display the tool name and its arguments clearly and wait for the user to approve or reject before proceeding.

Testing:

Ask your agent to run the test suite for your test project. It should use the bash tool to execute the appropriate test command and report the results.
Verify that shell commands require your approval before running (assuming you’ve set bash to “prompt” mode).
Deny a shell command when prompted and verify the agent adapts gracefully - it should acknowledge that you declined and try an alternative approach or explain what it was trying to do.
Configure file reading to “allow” and verify those calls execute without prompting.
Ask the agent to do something that involves both allowed and prompted tools in sequence. Verify the allowed tools execute silently and the prompted ones ask for confirmation.

Step 5

In this step your goal is to add context window management so your agent can handle long sessions without breaking.

Every LLM has a context window limit, and coding sessions can generate a lot of content. Reading a few large files, running some commands, and having a back-and-forth conversation can fill up the context quickly. When you hit the limit, your API calls will fail.

Implement a strategy to manage this. A practical approach is conversation compression (aka compaction): when the conversation history approaches the context limit, summarise the older messages into a condensed form and keep only the recent messages intact. The summary preserves the key decisions, findings, and context from earlier in the conversation without using as many tokens. Most agents use a call to the LLM to generate the summary. Make that call in the background and don’t show the user.

You’ll need to track token usage. Most providers return token counts in their API responses. Keep a running total and trigger compression when you’re approaching the limit - leaving enough headroom for the model’s response.

After compression, the conversation should continue to work naturally. The model should still understand what it was doing and what decisions were made earlier, even if it can’t see the exact messages from the beginning of the session.

Testing:

Have a long session with your agent where you read several large files and have an extended conversation. Verify it doesn’t crash when the context gets large.
After compression has occurred, ask the agent to recall something from earlier in the conversation. It should still have the key information from the summary.
Check your token tracking by asking the agent how much of the context window has been used (you might expose this in a status command or similar).
Verify that tool calls still work correctly after compression - the model should still know which tools are available and how to use them.

Step 6

In this step your goal is to add project context loading and a configuration file hierarchy.

A good coding agent should understand the project it’s working in without being told everything from scratch. Add support for a project instruction file - a markdown file in the project root (e.g. AGENTS.md or CLAUDE.md I suggest you use AGENTS.md) that contains project-specific context. When the agent starts, it should look for this file and include its contents in the system prompt.

This file might contain information like the project’s architecture, coding conventions, how to run tests, which directories contain what, or anything else that would help the agent be more effective. The contents of this file are added to the context sent to the LLM as one of the first user messages.

Next, implement a configuration file hierarchy. Settings should cascade from three levels: global (user-wide defaults, e.g. in a home directory dotfile), project-level (in the project root), and local (for personal overrides that aren’t committed to source control). More specific settings override more general ones.

The configuration should cover at least: the default LLM provider and model, permission defaults for each tool, and any custom system prompt additions.

Testing:

Create an AGENTS.md file in your test project with some specific instructions (e.g. “Always use snake_case for variable names” or “Run tests with pytest“). Ask the agent to make a change and verify it follows the project instructions.
Verify the agent works fine when no project instruction file exists - it should carry on without error.
Set up global and project-level configuration files with different values for the same setting. Verify the project-level setting takes precedence.
Add a local configuration override and verify it takes precedence over both project and global settings.

Step 7

In this step your goal is to add persistent memory so your agent remembers context across sessions and conversation history so you can resume previous sessions.

Without persistence, every session starts from zero. If you told the agent about your project’s architecture yesterday, it’s forgotten it today. Implement a memory system that lets the agent store and retrieve information across sessions.

A file-based approach works well: the agent writes memories to a designated directory as individual files, with an index that tracks what’s stored. Memories might include things the user has asked the agent to remember, project decisions, or user preferences. When a new session starts, the agent loads relevant memories to inform its behaviour.

Also add conversation history persistence. Save completed sessions so the user can resume a previous conversation with its full context intact, or start a new session that has access to a summary of past work. Again use the LLM to generate summaries.

Testing:

Tell your agent to remember something specific (e.g. “Remember that our API uses JWT authentication”). End the session, start a new one, and ask a question where that context is relevant. The agent should use the stored memory in its response.
Ask the agent what it remembers. It should be able to list or describe its stored memories.
Tell the agent to forget something it previously stored. Verify it’s removed.
End a session, then resume it. The conversation context should be intact.
Start a fresh session and verify it doesn’t carry over the conversation history from the previous one (though memories should still be accessible).

Step 8

In this step your goal is to add subagent support and plan mode.

Some tasks benefit from being broken down and worked on in parallel, or from being planned before implementation begins. Add the ability for your agent to spawn subagents - separate agent instances that work on a specific subtask and report back.

A subagent should have its own conversation with the LLM, its own context, and access to the same tools as the main agent. The main agent describes a task, the subagent works on it independently, and returns a result. This is useful for things like “search the codebase for all usages of this pattern” or “read these five files and summarise what they do” - tasks that would clutter the main conversation with tool calls.

Also add a plan mode. When activated, the agent switches to an architect role: it reads code, asks questions, and produces a plan, but doesn’t make any changes. Once the user approves the plan, the agent switches back to implementation mode and follows the plan. This is valuable for larger tasks where you want to review the approach before any code is modified. Plan mode will often benefit from a reduced toolset and a customised system prompt.

Testing:

Ask your agent to do something that benefits from subagents, like “summarise all the files in the src directory”. Verify it spawns subagents and combines their results.
Verify that subagent work doesn’t pollute the main conversation - the main agent should present a clean summary.
Activate plan mode and ask the agent to implement a feature. Verify it produces a plan without making any changes.
Approve the plan and verify the agent implements it.
Reject or modify the plan and verify the agent adapts.

Going Further

You’ve built a working AI coding agent. Here are some ways to push it further:

Model Context Protocol (MCP): Add support for MCP, which lets your agent connect to external tool servers. This means anyone can extend your agent’s capabilities by writing an MCP server, without modifying the agent itself.
Skills System: Add the ability to define reusable skills - pre-written prompts and tool configurations that can be invoked by name. For example, a commit skill that knows how to stage changes and create a well-formatted commit, or a review skill that analyses code for issues.
Hooks: Let users define shell commands that trigger on agent events - before a tool executes, after a file is edited, when a session starts. This enables custom workflows like running a linter automatically after every file edit.
Model Selection: Support switching between different models mid-session. Some tasks need the most capable model available, while others can use a faster, cheaper one.
Headless Mode: Add a non-interactive mode where the agent receives a prompt, executes it, and exits. This enables CI/CD integration and scripted automation.

P.S. If You Enjoy Coding Challenges Here Are Four Ways You Can Help Support It

Refer a friend or colleague to the newsletter. 🙏
Sign up for a paid subscription - think of it as buying me a coffee ☕️ twice a month, with the bonus that you also get 20% off any of my courses.
Buy one of my courses that walk you through a Coding Challenge.
Subscribe to the Coding Challenges YouTube channel!

Share Your Solutions!

Request for Feedback

You can reach me on Bluesky, LinkedIn or through SubStack

Thanks and happy coding!

John

Coding Challenge #111 - AI Agent Scheduling System

John Crickett — Sat, 14 Mar 2026 09:01:57 GMT

Hi, this is John with this week’s Coding Challenge.

🙏 Thank you for being a subscriber, I’m honoured to have you as a reader. 🎉

If there is a Coding Challenge you’d like to see, please let me know by replying to this email📧

Coding Challenge #111 - AI Agent Scheduling System

This challenge is to build your own AI agent scheduling system - a system that runs AI-powered tasks automatically on a cron schedule and delivers the results to you.

If you’ve ever wished you could have an assistant that checks your project’s dependencies for security advisories every Monday, summarises your inbox before you wake up, or keeps an eye on the status pages of the services your stack depends on throughout the day - that’s exactly what you’re building here. The idea is simple: you define a task, tell the agent scheduling system when to run it, and it takes care of the rest, delivering a summary straight to your email.

Scheduled agents are becoming a core pattern in AI-powered automation. By building your own, you’ll learn how to combine LLM orchestration with job scheduling, error handling, and notification delivery. Skills that apply well beyond this single project.

If You Enjoy Coding Challenges Here Are Four Ways You Can Help Support It

Refer a friend or colleague to the newsletter. 🙏
Sign up for a paid subscription - think of it as buying me a coffee ☕️, with the bonus that you get access to a monthly AMA and 20% off any of my courses.
Buy one of my self-paced courses that walk you through a Coding Challenge.
Join one of my live courses where I personally teach you Go by building five of the coding challenges or systems software development by building a Redis clone.

The Challenge - Building Your Own Scheduled Agent

You’ll build a system that lets you define named AI agents, each with a task description and a cron schedule, that run automatically and deliver their results by email. Along the way you’ll work with LLM integration, cron-based job scheduling, natural language parsing, retry logic, and email delivery.

The challenge starts with a basic AI agent that can execute a single task, then progressively adds scheduling, resilience, notifications, and management capabilities. Here are some use cases the full solution could handle:

Morning Briefing - An agent runs every day at 7am, pulls your email, calendar, and relevant news, and drops a structured brief into your inbox.
Weekly Dependency Watch - Every Monday at 8am, an agent checks your project’s key dependencies for new releases, security advisories, and deprecation notices, and delivers a structured summary.
Daily Outage & Incident Digest - Every few hours, an agent checks the status pages of services your stack depends on (AWS, GitHub, npm, your CI provider) and flags any ongoing or recent incidents that could affect your work.
Weekly Tech Radar - Every Monday, an agent scans Hacker News, tech blogs, and release notes for developments in your chosen technology areas and delivers a curated summary.
PR Review Reminder - Twice a day, an agent checks your team’s open pull requests and sends you a summary of what’s awaiting review, what’s gone stale, and what’s been merged.

Step Zero

In this introductory step you’re going to set your environment up ready to begin developing and testing your solution.

You’ll need to make a few decisions:

Choose your LLM provider. You’ll need access to a language model for your agents to use when processing tasks. Options include OpenAI, Anthropic, Google Gemini, Mistral, or running models locally with Ollama. Any provider with a chat completions API will work.
Choose your programming language. Pick something you’re comfortable with for both HTTP requests and background job processing. You’ll be building a long-running service that needs to make API calls, run tasks on a schedule, and send emails.
Choose your persistence layer. You’ll need somewhere to store agent definitions, schedules, and execution history. A lightweight database like SQLite works well to start with, or Redis if you’d prefer something in-memory.

Step 1

In this step your goal is to build a basic AI agent that can execute a single task.

An agent at this stage is straightforward: it takes a task description (a natural language prompt), sends it to your LLM, and returns the result. Think of it as a thin wrapper around an LLM call that adds structure.

Define an agent with a name, a task description, and optionally a system prompt that shapes how the agent behaves. For example, an agent called “Hacker News Summariser” might have the task “Summarise the top five technology news stories today’s hacker news posts which are {posts}”, where you substitute in the posts from Hacker News and a system prompt instructing it to be concise and use bullet points.

The agent should return a structured result containing the agent’s name, the time the task ran, whether it succeeded or failed, and the LLM’s response.

Testing: Create a few agents with different task descriptions and run them manually:

A simple summarisation agent - give it a prompt like “Summarise the key benefits of test-driven development” and verify you get a coherent response.
An agent with a specific system prompt - verify the response style matches the instructions in the system prompt.
An agent with a deliberately impossible task - verify it completes without crashing and returns something sensible.
Check that the structured result includes the agent name, timestamp, status, and response content.

Step 2

In this step your goal is to add cron-based scheduling so agents run automatically at defined intervals.

It should be possible to configure the invocation of each agent based on a cron expression that defines when it runs. Support the full five-field cron format: minute, hour, day of month, month, and day of week. This gives you the flexibility to express schedules like “every hour” (0 * * * *), “weekdays at 9am” (0 9 * * 1-5), or “first Monday of every month at 8am” (0 8 1-7 * 1).

Your scheduler should evaluate all registered agents, determine which ones are due to run, execute them, and then wait for the next tick. When multiple agents are due at the same time, they should all run - don’t let one agent’s execution block another.

Agents should also have an enabled/disabled flag so you can pause an agent’s schedule without deleting it.

Testing:

Create an agent with a schedule of * * * * (every minute) and verify it fires once per minute.
Create two agents scheduled for the same time and verify both run.
Disable an agent and verify it stops running on schedule. Re-enable it and verify it resumes.
Set up an agent with a more complex expression like /5 * * * * (every five minutes) and verify the timing is correct.
Check that the scheduler continues running reliably over a period of at least an hour without drifting or missing executions.

Step 3

In this step your goal is to add natural language schedule parsing so users can describe when they want an agent to run in plain English instead of writing cron expressions.

Expressions like “every weekday at 9am”, “every Monday at 8am”, “twice a day”, “every 3 hours”, or “the first of every month” should be parsed into the corresponding cron expression. You can use your LLM to handle this translation, or a dedicated natural language parsing library - either approach works.

When a user provides a natural language schedule, show them the interpreted cron expression and a human-readable description of what it means (e.g. “Runs at 09:00 on Monday through Friday”) so they can confirm it’s correct before saving.

If the input is ambiguous or can’t be parsed, the system should say so clearly and ask the user to rephrase rather than guessing incorrectly.

Testing:

Try a range of natural language inputs and verify each produces the correct cron expression:
- “every day at 7am” should produce 0 7 * * *
- “every weekday at 9am” should produce 0 9 * * 1-5
- “every Monday at 8am” should produce 0 8 * * 1
- “every 3 hours” should produce 0 */3 * * *
- “twice a day” should produce something reasonable like 0 9,18 * * * (the exact times may vary)
Try an ambiguous input like “sometimes in the morning” and verify you get a clear error or clarification request rather than a bad cron expression.
Verify the human-readable confirmation message accurately describes the interpreted schedule.

Step 4

In this step your goal is to add timeout handling and retry logic so your agents are resilient to transient failures.

LLM API calls can be slow, rate-limited, or simply fail. Your agents need to handle this gracefully. Each agent should have a configurable execution timeout - if the task hasn’t completed within that time, it should be terminated cleanly. A sensible default is 60 seconds, but agents that do heavier processing might need longer.

When an execution fails (whether from a timeout, an API error, or any other exception), the agent should retry automatically. Each agent should have a configurable maximum number of retries and a backoff strategy. Exponential backoff with jitter is a solid default - it avoids hammering a struggling API with rapid retries.

After all retries are exhausted, the agent should record a permanent failure for that execution with the error details. Failed executions should never block future scheduled runs of the same agent.

Testing:

Set an agent’s timeout to something very short (e.g. 2 seconds) with a task that takes longer than that. Verify the execution times out and a retry is attempted.
Configure an agent with 3 maximum retries and simulate a failing LLM call (you could temporarily use an invalid API key or point to a non-existent endpoint). Verify it retries exactly 3 times before recording a permanent failure.
Check that the backoff delay increases between retries rather than retrying immediately each time.
After a failed execution with all retries exhausted, verify the agent still runs on its next scheduled time as normal.
Run an agent that succeeds on the first attempt and verify no retries are triggered.

Step 5

In this step your goal is to add email delivery so agent results are sent to you automatically.

When an agent completes its scheduled run, it should send the results to a configured email address. The email should include the agent’s name in the subject line, the timestamp of the execution, and the full response from the LLM formatted for readability.

Use an email sending service such as Resend or SendGrid or use SMTP directly to deliver the messages. For development and testing, a local SMTP tool like Mailpit or MailHog lets you capture emails without actually sending them.

If the email delivery fails, it should be retried independently of the agent execution itself - the agent’s task already succeeded, so you don’t want to re-run the whole thing just because the notification didn’t go through. Log delivery failures for visibility.

For failed agent executions (after all retries are exhausted), send a failure notification email instead, including the error details so you know something went wrong without having to check logs.

Testing:

Run an agent and verify an email arrives with the correct subject line, agent name, timestamp, and response content.
Verify the email content is well-formatted and readable, not a raw text dump.
Check that a failed agent execution sends a failure notification email with error details.
Temporarily break your email configuration and verify the agent execution still completes successfully - only the delivery should fail, not the whole run.
Verify email delivery failures are logged and retried.

Step 6

In this step your goal is to build a management interface for creating, editing, listing, and deleting agents.

Up to now you’ve probably been configuring agents directly in code or a configuration file. In this step, add an API (or CLI, if you prefer) that lets you manage agents without touching the code.

You should be able to create a new agent by providing a name, task description, system prompt, schedule (cron expression or natural language), email recipient, timeout, and retry settings. You should also be able to list all registered agents with their current status (enabled, disabled, last run time, last run result), edit any agent’s configuration, and delete agents you no longer need.

Each agent’s configuration and execution history should be persisted so everything survives a restart of the service. The execution history for each agent should include the timestamp, status (success or failure), duration, and a summary of the result or error.

Testing:

Create a new agent through the management interface and verify it starts running on its defined schedule.
List all agents and check the output shows correct status information.
Edit an agent’s schedule and verify the new schedule takes effect without needing to restart the service.
Delete an agent and verify it stops running and is removed from the listing.
Restart the service and verify all agent configurations and execution history are preserved.
View the execution history for an agent and verify it includes timestamps, statuses, and durations for recent runs.

Step 7

In this step your goal is to add execution logging and a monitoring endpoint so you can observe your system’s health at a glance.

Build a status endpoint (or command) that reports the overall system health: how many agents are registered, how many are enabled, upcoming scheduled runs, and aggregate statistics like total executions, success rate, and average execution time.

Each execution should be logged with enough detail to diagnose problems: the agent name, start time, duration, whether it succeeded or failed, the number of retries attempted, and a truncated version of the response or error. Keep a rolling window of execution history - the last 100 runs per agent is a reasonable default.

If any agent has failed its last three consecutive runs, flag it in the status output as unhealthy so you know to investigate.

Testing:

Query the status endpoint and verify it reports the correct number of registered and enabled agents.
Run several agents and check the aggregate statistics update correctly (total executions, success rate, average duration).
Verify upcoming scheduled runs are listed with the correct next-run times.
Simulate three consecutive failures for an agent and verify it appears as unhealthy in the status output.
Check the execution log for an agent and verify it contains the expected detail for each run.

Going Further

You’ve built a scheduled AI agent system with cron scheduling, retry logic, and email delivery. Here are some ways to push it further:

Skill or MCP support: Add the ability for agents to call skills, use tools and MCP.
Slack Delivery: Add the option to send agent results to a Slack channel using the Slack Web API. Let each agent choose between email, Slack, or both as its delivery method.
Discord Delivery: Add Discord webhook support as another delivery option. Discord’s webhook API makes this straightforward - you post a formatted message to a webhook URL and it appears in the configured channel.
Web Dashboard: Build a web-based dashboard that visualises agent status, execution history, and upcoming schedules. A timeline view showing when each agent last ran and when it will run next is particularly useful.
Agent Chaining: Let one agent’s output become another agent’s input. For example, a data-gathering agent runs first, and its results are passed to a summarisation agent that formats and delivers the final report.
External Data Sources: Give your agents the ability to fetch live data as part of their tasks - pulling from RSS feeds, APIs, or web searches before asking the LLM to process the results. This turns your agents from simple prompt runners into genuine automation tools.
Rate Limit Management: Add awareness of your LLM provider’s rate limits. If multiple agents are scheduled close together, stagger their execution to stay within your API quota rather than having them all fail from rate limiting.

P.S. If You Enjoy Coding Challenges Here Are Four Ways You Can Help Support It

Refer a friend or colleague to the newsletter. 🙏
Sign up for a paid subscription - think of it as buying me a coffee ☕️ twice a month, with the bonus that you also get 20% off any of my courses.
Buy one of my courses that walk you through a Coding Challenge.
Subscribe to the Coding Challenges YouTube channel!

Share Your Solutions!

Request for Feedback

You can reach me on Bluesky, LinkedIn or through SubStack

Thanks and happy coding!

John

Coding Challenge #110 - RTFM For Me Agent

John Crickett — Sat, 07 Mar 2026 09:01:38 GMT

Hi this is John with this week’s Coding Challenge.

🙏 Thank you for being a subscriber, I’m honoured to have you as a reader. 🎉

If there is a Coding Challenge you’d like to see, please let me know by replying to this email📧

Coding Challenge #110 - RTFM For Me Agent

This challenge is to build your own AI-powered documentation assistant - a tool that can ingest technical documentation, answer questions about it using AI, and remember context across conversations.

If you’ve ever used an AI chatbot and wished it could answer questions specifically about your own documentation, that’s exactly what you’re building here. The technique behind it is called Retrieval-Augmented Generation (RAG). Instead of relying solely on what an AI model was trained on, you retrieve the specific documents relevant to a question and feed them to the model as context. The result is grounded, accurate answers with source citations rather than hallucinated guesses.

Redis is the backbone of this project. It handles vector search for finding relevant documents, semantic caching for avoiding redundant AI calls, session storage for conversation history, and long-term memory for remembering user context across sessions. Everything else - the AI model, the embedding provider, your programming language, and your framework - is entirely your choice. You can read all about Redis’ AI offerings here.

If You Enjoy Coding Challenges Here Are Four Ways You Can Help Support It

Refer a friend or colleague to the newsletter. 🙏
Sign up for a paid subscription - think of it as buying me a coffee ☕️, with the bonus that you get access to a monthly AMA and 20% off any of my courses.
Buy one of my self-paced courses that walk you through a Coding Challenge.
Join one of my live courses where I personally teach you Go by building five of the coding challenges or systems software development by building a Redis clone.

The Challenge - Building The RTFM For Me Agent

In this challenge you’re going to build a RTFM For Me Agent, a full-stack AI assistant that ingests documentation files, answers questions using retrieval-augmented generation, caches semantically similar queries to reduce costs, and maintains memory across sessions. By the end, you’ll have a system that gets more useful the more you interact with it and can search and read documentation for you.

Step Zero

In this introductory step you’re going to set your environment up ready to begin developing and testing your solution.

You’ll need Docker and Docker Compose to run Redis Stack, which provides Redis with built-in vector search capabilities. Set up a docker-compose.yml that runs Redis Stack.

Next, make three decisions that will shape the rest of your build:

Choose your LLM and embedding provider. You’ll need a language model for generating answers and an embedding model for converting text into vectors. Options include OpenAI, Anthropic, Google Gemini, Mistral, Cohere, or running models locally with Ollama. Whatever embedding model you choose, note its output dimensions - you’ll need this when creating your Redis vector index. If you’re new to this, an embedding model turns your text into a list of numbers (called a vector) that captures its meaning. This is what lets the system find relevant chunks of text later, by comparing how similar those number lists are.
Choose your programming language and framework. Redis has client libraries for Python, TypeScript, Java, Go, Rust, C#, and more. You can use an AI agent framework like PydanticAI, LangChain, or LlamaIndex - or skip the framework entirely and call your LLM’s API directly.
Choose your Redis client library. Python developers might want to look at The Python version of RedisVL, Java developer can grab the new Java version of RedisVL, both of which provides high-level abstractions for vector search, caching, and sessions.

Prepare some sample documentation files (markdown, text, or HTML) that you’ll use to test your system throughout the challenge. Technical documentation with clear sections works well - API references, getting started guides, or architecture documents. A great example would be the Pro Git book: https://github.com/progit/progit2 allowing you to create an agent to help with git commands.

Testing: Verify Redis is running by connecting with redis-cli and running PING. You should receive PONG in response. Verify you can call your chosen LLM and embedding APIs successfully.

By the way, there is also a coding challenge that has you build your own Redis.

Step 1

In this step your goal is to build a document ingestion pipeline that loads documentation files, splits them into chunks, generates vector embeddings for each chunk, and stores everything in a Redis vector index.

Start by loading your sample documentation files. Then split the text into smaller chunks - roughly 500 tokens each with some overlap between consecutive chunks so you don’t lose context at the boundaries. Try to split on natural boundaries like paragraphs rather than cutting mid-sentence.

For each chunk, generate a vector embedding using your chosen embedding provider. Then store the chunk in Redis along with its embedding and metadata: the source file name, the section heading, and the chunk’s position in the document. This metadata will become important later when you add filtering.

You’ll need to create a Redis vector index that supports similarity search over these embeddings. The index should make the chunk text full-text searchable, the metadata fields filterable, and the embedding vectors searchable by similarity. Refer to the Redis vector search documentation for details on creating indexes with the FT.CREATE command.

Testing: Ingest your sample documentation files and verify the data is in Redis:

Run FT.INFO on your index to confirm it exists and shows the correct number of documents.
Run HGETALL on one of your stored document keys to verify it contains the chunk text, metadata fields, and embedding vector.
Try ingesting the same files again and verify your pipeline handles duplicates sensibly.

Step 2

In this step your goal is to implement vector search and RAG-based answer generation. When a user asks a question, your system should find the most relevant document chunks and use them to generate a grounded answer.

The flow works like this: take the user’s question, convert it to an embedding using the same model you used for your documents, then search your Redis vector index for the most similar chunks. Take the top results and pass them to your LLM as context alongside the question.

Your system prompt should instruct the LLM to answer using only the provided context and to cite which source file each piece of information comes from. If the context doesn’t contain enough information to answer the question, the LLM should say so honestly rather than making something up.

Wrap this in a REST API with at least two endpoints: one for ingesting documents and one for asking questions. A streaming endpoint for the chat response is a nice addition if your framework supports server-sent events.

Testing: Ask questions that you know the answers to based on your sample documentation:

Ask a question that’s directly covered in your docs. The response should be accurate, cite the correct source file, and not include information that isn’t in the docs.
Ask a question that isn’t covered at all. The system should tell you it doesn’t have enough information rather than hallucinating an answer.
Ask a question that spans multiple documents. The system should pull context from several sources.
Test with curl to verify your API endpoints work correctly.

Step 3

In this step your goal is to add semantic caching so that repeated or similar questions get instant answers without an LLM call.

Traditional caching uses exact string matches, which means “how do I authenticate?” and “what’s the authentication process?” would be treated as completely different queries. Semantic caching embeds the question and checks whether any previously cached question is close enough in vector space. If it is, the cached answer is served without touching the LLM at all.

You’ll need a separate Redis vector index for your cache entries. Each entry stores the original question, its embedding, and the generated response. When a new question comes in, search this cache index first. If the closest match is within your similarity threshold, return the cached response. Otherwise, proceed with the full RAG pipeline and cache the result afterwards.

Start with a similarity threshold of around 0.15 (cosine distance) and tune from there. Too strict and you’ll rarely get cache hits. Too loose and you’ll serve wrong answers for questions that are only loosely related.

Python developers can use RedisVL’s SemanticCache or LangCache which handle much of this for you. In other languages, it’s straightforward to build yourself - it’s just a vector index with a similarity check.

Track your cache metrics: hit rate, average latency for cached versus uncached responses, and estimated cost savings. Store these counters in Redis using INCR so they persist across restarts. Expose them through a /metrics endpoint.

Testing:

Ask the same question twice. The second time should be noticeably faster and your metrics should show a cache hit.
Rephrase the question slightly (e.g. “how does auth work?” then “what’s the authentication process?”). If your threshold is tuned correctly, the second should also be a cache hit.
Ask a completely different question and verify it’s a cache miss.
Check your /metrics endpoint to see hit rate and latency comparisons.
Add a cache flush endpoint and verify that clearing the cache causes previously cached queries to miss again.

Step 4

In this step your goal is to add session memory so your assistant can handle follow-up questions within a conversation.

Without session memory, each question is treated in isolation. If a user asks “what’s the authentication flow?” and then follows up with “how do I refresh the token?”, the system has no idea what “the token” refers to. Session memory fixes this by maintaining conversation history.

Store conversation messages in Redis, keyed by session ID. Each time a user sends a message, append it to the session’s history. When building the prompt for the LLM, include the recent conversation messages so the model has context for follow-up questions. Redis lists or streams both work well for this.

Set a time-to-live on your sessions so they clean up automatically after a period of inactivity - 24 hours is a reasonable default.

Testing:

Start a new session and ask a question about a specific topic in your docs.
Ask a follow-up question that relies on context from the first answer (e.g. use “it”, “that”, or “the same endpoint” to refer back). The assistant should understand what you’re referring to.
Start a different session and verify it has no memory of the first conversation.
Wait for the session TTL to expire (or set a short TTL for testing) and verify the session data is cleaned up from Redis.

Step 5

In this step your goal is to add long-term agent memory so your assistant remembers user context across sessions and uses it to personalise answers.

Session memory disappears when a session ends. Long-term memory persists. If a user tells the assistant “I’m working on the payments microservice in Go” in one session, the assistant should remember that context in future sessions and tailor its answers accordingly.

Set up the Redis Agent Memory Server as a Docker container alongside your Redis instance. The memory server provides a REST API for storing and searching memories, with built-in support for topic extraction, entity recognition, and semantic search over stored memories. It supports over 100 LLM providers via LiteLLM, so whatever model you’re using for your main application will work here too.

Integrate the memory server into your chat flow. After each conversation, extract any important context - user preferences, project details, technical decisions - and store it as a long-term memory. Before generating answers, search for relevant memories and include them in the prompt.

Your LLM prompt should now assemble context from three sources: document chunks from vector search, recent messages from the session, and relevant long-term memories. The memories help the assistant give more relevant answers - if the user has previously mentioned they use Python, documentation examples should lean towards Python where possible.

Testing:

In one session, tell the assistant about your project context (e.g. “I’m building a payment service in Go”).
End the session and start a new one. Ask a general question. The assistant’s answer should reflect your project context even though it’s a new session.
Search the memory server’s REST API directly to verify memories were stored with the correct topics and entities.
Ask the assistant what it knows about your project - it should surface relevant stored memories.

Step 6

In this step your goal is to add hybrid search and production hardening to make your system more robust and precise.

Pure vector search works well for general questions, but sometimes users want answers from a specific document or section. Hybrid search combines vector similarity with metadata filtering. For example, a user might ask “how does authentication work in the API reference?” - the vector search finds semantically relevant chunks, and the metadata filter narrows results to only the API reference document.

You can also use context from long-term memory to apply filters automatically. If the assistant knows the user is working on authentication, it can prioritise chunks from authentication-related sections without being asked.

Add conversation summarisation to handle long sessions gracefully. When the conversation history grows beyond a token threshold, summarise the older messages and keep only the recent ones intact. This prevents your context window from overflowing while preserving important information from earlier in the conversation.

Finally, make your system degrade gracefully when non-critical components fail. If the semantic cache is unavailable, skip it and call the LLM directly. If the memory server is down, answer without long-term context. Only the vector search and LLM are truly essential - everything else should fail silently with appropriate logging.

Testing:

Ask a question scoped to a specific document (e.g. “based on the getting started guide, how do I...”). Verify the results come only from that document.
Have a long conversation (15+ messages) and verify the system still responds correctly as older messages get summarised.
Stop the memory server container and verify the chat still works, just without personalisation.
Stop and restart the semantic cache and verify the system recovers gracefully.
Check your observability metrics: response latency, cache hit rate, token usage, and estimated cost.

Going Further

You’ve built a documentation assistant with RAG, semantic caching, and persistent memory. Here are some ways to push further:

Semantic routing: Classify incoming queries before processing them. Is it a documentation question, an off-topic chat, or a request for an action? Route each type differently.
Multi-tenant support: Scope all indexes, caches, and memories by organisation or team using Redis key prefixes, so multiple teams can share one deployment.
Document versioning: Track document versions and warn users when answers are based on outdated documentation.
MCP integration: Expose your assistant as an MCP server so other AI agents can use it as a tool. The Agent Memory Server already supports MCP natively.
Evaluation suite: Build a test harness that measures retrieval precision, answer accuracy, and cache effectiveness across a standard set of questions.
Real-time updates: Use Redis Pub/Sub or Streams to notify a frontend when document ingestion completes or new memories are created.
Multi-model strategy: Use a cheaper model for memory extraction and caching, and a more capable model for final answers. Redis doesn’t care which model generates the content it stores.
Web crawl: find and ingest documentation from the web.

This coding challenge was sponsored by Redis.

P.S. If You Enjoy Coding Challenges Here Are Four Ways You Can Help Support It

Refer a friend or colleague to the newsletter. 🙏
Sign up for a paid subscription - think of it as buying me a coffee ☕️ twice a month, with the bonus that you also get 20% off any of my courses.
Buy one of my courses that walk you through a Coding Challenge.
Subscribe to the Coding Challenges YouTube channel!

Share Your Solutions!

Request for Feedback

You can reach me on Bluesky, LinkedIn or through SubStack

Thanks and happy coding!

John

Coding Challenge #109 - Ebook Reader

John Crickett — Sat, 28 Feb 2026 09:01:20 GMT

Hi this is John with this week’s Coding Challenge.

🙏 Thank you for being a subscriber, I’m honoured to have you as a reader. 🎉

If there is a Coding Challenge you’d like to see, please let me know by replying to this email📧

Coding Challenge #19 - Ebook Reader

This challenge is to build your own ebook reader application.

EPUB is the most widely used open standard for digital books. Unlike proprietary formats, EPUB is built on web technologies you probably already know; XHTML for content, CSS for styling, and a ZIP container to package it all together. Almost every ebook outside the Amazon Kindle ecosystem uses it. By building an EPUB reader, you’ll work with file format parsing, content rendering, text processing, and UI design all in one project.

But first, many thanks to this week’s sponsor, Unblocked.

Give your agents the understanding they need to generate reliable code, reviews, and answers. Unblocked builds context from your team’s code, PR history, conversations, documentation, planning tools, and runtime signals. It surfaces the insights that matter so AI outputs reflect how your system actually works.

“Unblocked has reversed my AI fatigue completely. The level of precision is wild.” - Senior developer, Clio

See how it works.

The Challenge - Building Your Own EBook Reader

In this coding challenge you’ll build an ebook reader that can open and display EPUB files, manage a library of books, and provide a comfortable reading experience with features like annotations, dictionary lookups, and customisable themes. By the end, you’ll have a fully functional reader that works with any standard EPUB file.

This is an advanced challenge that covers file format parsing, content rendering, persistent storage, and building a polished user interface. Choose a platform and tech stack that gives you good tools for displaying rich text content.

Step Zero

Set up your development environment and get some EPUB files to work with.

Pick a platform for your reader. It could be desktop, web, or mobile. Then a tech stack you’re comfortable building user interfaces with. You’ll need to handle ZIP archives, parse documents, render styled text, and build interactive UI elements, so make sure your chosen stack has good support for these.

For test content, grab a few free EPUB files from Project Gutenberg or Standard Ebooks. Standard Ebooks is particularly good because their files are well-formatted EPUB 3 files with proper styling. Get at least three or four books of different lengths so you can test properly throughout the challenge.

Step 1

In this step your goal is to open an EPUB file and display its content for reading. The EPUB file format is documented on the W3 website.

An EPUB file is a ZIP archive containing XHTML content files, a CSS stylesheet, metadata, and a manifest that describes how everything fits together. Your reader needs to unpack this structure, understand the reading order from the manifest, and render each chapter as readable, styled text that you can scroll or page through.

You should be able to open an EPUB, see the book’s title and author, navigate between chapters using a table of contents, and read the content with the book’s own styling applied. The text should reflow cleanly when you resize the window.

Testing guidance: Open one of your Project Gutenberg or Standard Ebooks files and verify you can read the full book from start to finish. Check that the table of contents works and that you can jump to any chapter. Try resizing the window and confirm the text reflows without breaking. Open a second, different book to make sure you’re not accidentally hardcoding anything specific to one file.

Step 2

In this step your goal is to manage a collection of ebooks and search across them.

Rather than opening one file at a time, your reader should maintain a library. Users should be able to import books, see them organised with their cover images and metadata (title, author, language), and sort or filter the collection. Creating a full-text search is where this gets interesting, index the actual content of every book in the library so users can search for a quote or topic and find it across all their books.

Testing guidance: Import your collection of test EPUBs into the library. Verify that metadata displays correctly for each book. Search for a word or phrase you know appears in one of the books and confirm the search returns the right results with the book title and location in the text. Search for something that appears in multiple books and check that all matches are found. Try sorting and filtering the library by different criteria.

Step 3

In this step your goal is to give readers control over how their books look.

A good ebook reader lets you adjust fonts, font sizes, colours, margins, line spacing, and background colour. You’ll need a theming system that overrides the book’s built-in stylesheets without breaking them, applying user preferences on top of the author’s intended styling. Support at least light and dark themes, and let users adjust the core typography settings.

Testing guidance: Open a book and switch between your light and dark themes. Verify the text remains readable in each. Change the font, font size, and line spacing and confirm the content reflows properly. Close the book and reopen it, your settings should persist. Open a different book with its own stylesheet and check that your theme still applies cleanly without breaking the book’s formatting (bold text, italics, headings should all still look right).

Step 4

In this step your goal is to build a layer on top of the rendered text that supports highlights, bookmarks, and notes.

Users should be able to select text and highlight it in different colours, attach written notes to any highlighted passage, and set bookmarks to return to later. All annotations need to persist — when you close the book and reopen it, everything should still be there in the right place. The tricky part is that EPUB text is reflowable, so your annotations need to be anchored to the text itself, not to a position on the screen.

Testing guidance: Open a book, highlight several passages in different colours, and add notes to a couple of them. Set a few bookmarks. Close the book completely and reopen it — all your annotations should be exactly where you left them. Change the font size or window width to reflow the text and verify annotations stay attached to the correct passages. Check that you can view a list of all your bookmarks and annotations for the book.

Step 5

In this step your goal is to let readers look up any word instantly without leaving the book.

When a user selects a word, they should see a definition. For dictionary lookups, you’ll want this to work offline so it doesn’t break the reading flow. For additional context, provide a way to look up the selected word or phrase on Wikipedia using their public API. The lookup should appear as a clean overlay or popup that doesn’t lose the reader’s place.

Testing guidance: Open a book and select a word — verify you get a dictionary definition quickly. Try selecting an uncommon word and check that it’s handled (either a definition or a clear “not found” message). Select a proper noun or topic and try the Wikipedia lookup, verify you get a relevant summary. Check that the lookup popup doesn’t disrupt your reading position, when you dismiss it, you should be right where you were.

Step 6

In this step your goal is to make your reader fully usable through keyboard navigation and screen readers.

Every feature in your reader should be reachable and operable without a mouse. The reading view, table of contents, library, annotations, and all controls should be keyboard-navigable with a logical tab order. For screen readers, all interactive elements need proper labels and roles so assistive technology can describe what’s on screen and what actions are available. This is one of those features that seems straightforward but requires careful attention to how your UI communicates with the accessibility layer of your platform.

Testing guidance: Put your mouse away and try to use every feature of your reader using only the keyboard. Navigate the library, open a book, move between chapters, create a bookmark, and switch themes, all without clicking. Then test with a screen reader (VoiceOver on macOS, NVDA on Windows, Orca on Linux). Verify that the screen reader announces book titles, chapter names, and button labels correctly. Check that the reading content is announced in the proper order.

Going Further

Once you’ve built the core ebook reader, here are some ways to extend it:

Text-to-Speech with Sentence Highlighting: Add narration that highlights the current sentence as it reads aloud. The interesting part is synchronising the audio timing with text positions so the highlight tracks smoothly. Handle pausing, resuming, and adjusting playback speed.
OPDS and Calibre Integration: OPDS is an open catalog protocol for discovering and downloading ebooks. Build a catalog browser that connects to OPDS feeds and local Calibre libraries on your network. It’s a good exercise in protocol parsing and network discovery without needing cloud infrastructure.
PDF Support: Extend your rendering engine to handle PDF files alongside EPUB. PDF is a fixed-layout format, so the rendering approach is quite different from reflowable EPUB — a worthwhile challenge in its own right.
Reading Position Sync: If you’ve built a web-based reader, add the ability to sync reading positions across devices. This involves persisting state to a server and handling conflicts when the same book is read on multiple devices.

P.S. If You Enjoy Coding Challenges Here Are Four Ways You Can Help Support It

Refer a friend or colleague to the newsletter. 🙏
Sign up for a paid subscription - think of it as buying me a coffee ☕️ twice a month, with the bonus that you also get 20% off any of my courses.
Buy one of my courses that walk you through a Coding Challenge.
Subscribe to the Coding Challenges YouTube channel!

Share Your Solutions!

Request for Feedback

You can reach me on Bluesky, LinkedIn or through SubStack

Thanks and happy coding!

John

Coding Challenge #108 - Online Coding Playground

John Crickett — Sat, 21 Feb 2026 09:01:33 GMT

Hi this is John with this week’s Coding Challenge.

🙏 Thank you for being a subscriber, I’m honoured to have you as a reader. 🎉

If there is a Coding Challenge you’d like to see, please let me know by replying to this email📧

Coding Challenge #108 - Online Python Playground

This challenge is to build your own online code playground where users can write and run Python code directly in their web browser.

An online code playground is a web-based environment where developers can experiment with code without needing to install anything on their computer. Think of it like a digital sandbox, you can write code, click a button, and see the results instantly. It’s incredibly useful for learning, testing ideas quickly, or sharing code snippets with others.

If You Enjoy Coding Challenges Here Are Four Ways You Can Help Support It

Refer a friend or colleague to the newsletter. 🙏
Sign up for a paid subscription - think of it as buying me a coffee ☕️, with the bonus that you get access to a monthly AMA and 20% off any of my courses.
Buy one of my self-paced courses that walk you through a Coding Challenge.
Join one of my live courses where I personally teach you Go by building five of the coding challenges or systems software development by building a Redis clone.

The Challenge - Building Your Own Online Code Playground

You’re going to build an online code playground for Python (or another programming language of your choice). Users will be able to write code in an editor, run it with the click of a button, and see the output or errors displayed immediately. Over several steps, you’ll add features like saving code snippets, syntax highlighting, error messages with line numbers, and performance safeguards to keep the browser responsive.

Step Zero

Set up your development environment and get familiar with the technologies you’ll be using.

You’ll need to choose a tech stack that works well for building web applications with a rich code editor. You’ll be using a Python WebAssembly runtime (like Pyodide or PyScript) to execute Python code in the browser, so make sure you understand how it works before you begin.

Step 1

In this step, your goal is to execute Python code in the browser and display the output to the user.

You’ll create a simple HTML page with a code editor (a basic textarea is fine for now) and a “Run” button. When the user clicks the button, take the code from the editor, send it to the Python WASM runtime, and display whatever the code prints to standard output in an output panel below.

Testing: Write a simple Python script that prints “Hello, World!” and verify it displays correctly. Try a few more examples: print multiple lines, do some math (print(2 + 2)), and test with variables.

Step 2

In this step, your goal is to catch and display Python errors in a user-friendly way.

When code fails to run (syntax errors, runtime errors, etc.), the user should see a clear error message that tells them what went wrong and where. Include the line number and the error traceback so they can debug their code.

Testing: Deliberately write broken code, missing colons, undefined variables, division by zero, and verify that each error is clearly shown with line numbers and helpful messages.

Step 3

In this step, your goal is to replace the basic textarea with a proper code editor that highlights Python syntax.

Your users will write code more efficiently and catch mistakes faster if the syntax is color-coded. You can use an existing code editor library (like CodeMirror or Monaco Editor). The editor should support undo/redo, copy/paste, and basic keyboard shortcuts.

Testing: Type Python code into the editor and verify that keywords (like def, if, for) appear in different colours. Test undo/redo, and make sure basic shortcuts like Ctrl+Z or Cmd+Z work.

Step 4

In this step, your goal is to let users save their code snippets to the browser’s local storage and load them again later.

Users should be able to save their current code with a name, see a list of their saved snippets, and load any snippet back into the editor. They should also be able to delete saved snippets they no longer need.

Testing: Write some code, save it with a descriptive name, close the browser tab (or reload the page), and verify the code is still there when you come back. Create multiple snippets and switch between them.

Step 5

In this step, your goal is to prevent long-running code from freezing the browser and provide users with feedback about what’s happening.

Add a timeout for code execution—if code takes too long to run, stop it and show a message. Display a loading indicator while code is executing. Optionally, show memory usage warnings if code is consuming too much resources.

Testing: Write code that runs for a long time (like an infinite loop or a deep recursion). Verify it stops after a reasonable timeout and the user sees a clear message. Test that the loading indicator appears and disappears at the right times.

Step 6

In this step, your goal is to handle edge cases gracefully so your playground doesn’t break.

Handle situations where the Python WASM runtime fails to initialise, the browser doesn’t support WebAssembly, or the user’s browser runs out of localStorage space. Show helpful messages in each case so users know what’s happening.

Testing: Try disabling WebAssembly in your browser to test the fallback message. Fill up localStorage and try to save a large code snippet, and verify a clear error message appears.

Going Further

Once you’ve built the basic playground, consider adding more advanced features:

Autocomplete and code hints - Help users discover Python functions and methods as they type
Multiple files - Let users create projects with multiple Python files
Import libraries - Allow users to import common Python libraries (many are available in Pyodide)
Share snippets - Generate a link that someone else can click to load and run your code
Dark mode - Add a theme toggle so users can write code comfortably at night
Keyboard shortcuts guide - Show users what shortcuts are available

P.S. If You Enjoy Coding Challenges Here Are Four Ways You Can Help Support It

Refer a friend or colleague to the newsletter. 🙏
Sign up for a paid subscription - think of it as buying me a coffee ☕️ twice a month, with the bonus that you also get 20% off any of my courses.
Buy one of my courses that walk you through a Coding Challenge.
Subscribe to the Coding Challenges YouTube channel!

Share Your Solutions!

Request for Feedback

You can reach me on Bluesky, LinkedIn or through SubStack

Thanks and happy coding!

John

Coding Challenges - Does AI Write Good Code? Let's Find Out.

John Crickett — Sat, 14 Feb 2026 09:01:44 GMT

Hi, this is John with this week’s Coding Challenge and it’s going to be a bit different.

🙏 Thank you for being a subscriber, I’m honoured to have you as a reader. 🎉

If there is a Coding Challenge you’d like to see, please let me know by replying to this email📧

Coding Challenge - Does AI Write Good Code?

AI is changing software engineering. AI can write code faster than you or I can. That's exciting, but it creates a new problem: just because code works doesn't mean it's good. How do you know if what your LLM generated is secure, maintainable, and ready for production?

There are two things you can do. Firstly, follow industry research, like Sonar’s LLM Leaderboard, which looks at the quality, security, complexity, and maintainability of the code created using the leading LLMs. It’s well worth a read to understand the strengths and weaknesses of the models. I found it particularly eye-opening to see that GPT 5.2 High generates 50% more code than Opus 4.5 for the same tasks and Opus 4.5 was still generating around 200% more code than Gemini 3 Pro! I know which codebase I’d rather be responsible for!

Secondly, there are many tools we can leverage to evaluate aspects of code quality, maintainability and security. They include compilers, type checkers, linters, and automated code review tools like SonarQube. In today’s Coding Challenge we’re going to look at how we can leverage them to guide and evaluate AI when building software.

Step Zero

In this step your goal is to pick a Coding Challenge, technology stack and AI coding agent of your choice. If you primarily use Copilot at work, consider trying Amp Code, if you mainly use Claude, try Copilot. In short, try a different coding agent and learn something new.

Step 1

In this step your goal is to build a solution to one of the Coding Challenges using your favourite agent / LLM. I’ll go into more detail on how to leverage AI agents in a future newsletter, but for now I suggest prompting the agent to tackle one step of the Coding Challenge at a time. Between steps, or if the context window starts to fill up or it hallucinates, clear the context window.

Once your solution is complete, head to step 2 to start leveraging tools to assess the quality and security of the code produced by your AI.

Step 2

In this step your goal is to prompt your agent to review the code quality using the compiler, code formatter and linter appropriate to your programming language and stack.

For example, if you’re using Python run checks with ruff, ty, pyrefly or pyright. If you’re using JavaScript, switch to TypeScript 😇. If you’re using Rust, use clippy, for Go check out Golangci-lint. You get the idea.

Step 3

In this step your goal is to install the SonarQube MCP Server and use it to evaluate the code created by your agent.

SonarQube is an automate code review solution that performs deep code analysis to detect bugs, security vulnerabilities, and quality issues that simple linters miss. It can also be used to enforce rules for maintainability, reliability, and security before merging. It’s been widely used in the organisations I’ve worked for and I’ve often been grateful for the feedback it provided my teams. So for this step you’re going to use it, via the MCP server to analyse the code your agent has produced.

To get started, see Sonar’s instructions on how to install their MCP server. I found the quickstart guide particularly useful and chose to use the container running in Docker. Then configured my agent to use SonarQube via the MCP server.

Once you have it installed, check with your agent to verify the MCP server is configured. For example, with Copilot run:

/mcp show

You should then see some output that includes something like this:

Configured MCP servers: sonarqube

MCP Server Configuration:
  sonarqube (local): Command: docker

Once you have that done, ask your agent to review the project with SonarQube, for example:

Review this code with SonarQube

I tried it on the Monkey compiler project I did last year and got the following output:

Total Issues Found: 40

  Issue Breakdown by Severity:

   - CRITICAL: 37 issues
   - MAJOR: 1 issue
   - MINOR: 2 issues

  Key Issues Identified:

  1. High Cognitive Complexity (6 issues):

   - compiler/compiler.go:72 - Complexity: 96 (allowed: 15)
   - vm/vm.go:72 - Complexity: 69 (allowed: 15)
   - vm/vm_test.go:66 - Complexity: 34 (allowed: 15)
   - evaluator/evaluator.go:15 - Complexity: 25 (allowed: 15)
   - compiler/symbol_table_test.go:166 - Complexity: 16 (allowed: 15)
   - compiler/compiler_test.go:649 - Complexity: 16 (allowed: 15)

  2. Duplicated String Literals (14 issues): Various test files contain repeated error message strings that should be defined as
  constants.

  3. Empty Functions (13 issues): Multiple empty functions in ast/ast.go lack explanatory comments about why they're intentionally
  empty.

  4. Other Issues:

   - Empty code block in compiler/compiler_test.go:661
   - Interface naming convention in object/object.go:72
   - Parameter grouping in compiler/compiler.go:353

  The most critical issues are the high cognitive complexity in the compiler and VM modules, which could impact maintainability.

You can then use the agent to explore the issues in more depth.

Step 4

In this step your goal is to pick out three of the points that SonarQube highlights and investigate them. Review the code yourself and either fix it, or prompt your agent to fix the code, including the output from SonarQube in your prompt.

I did this by prompting the agent for the highest priority issues:

Show me the three highest priority issues

Which gave me:

Top 3 Highest Priority Issues

  1. compiler/compiler.go:72 - Cognitive Complexity: 96 ⚠️ CRITICAL

  Function: Compile(node ast.Node)

  Problem: This massive switch statement handles all AST node compilation with deeply nested conditionals and error handling. The complexity is 6.4x over the allowed limit (96
  vs 15).

  Why it matters: This is the core compiler function - extremely difficult to understand, test, and modify. Each new language feature adds more nesting.

  Recommendation: Refactor using the strategy pattern - extract each case into separate compilation methods:

   func (c *Compiler) compileProgram(node *ast.Program) error { ... }
   func (c *Compiler) compileInfixExpression(node *ast.InfixExpression) error { ... }
   func (c *Compiler) compileIfExpression(node *ast.IfExpression) error { ... }

  ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------

  2. vm/vm.go:72 - Cognitive Complexity: 69 ⚠️ CRITICAL

  Function: Run()

  Problem: Main VM execution loop with a giant switch statement for all opcodes. Complexity is 4.6x over the limit (69 vs 15).

  Why it matters: This is the VM's heart - any bug here affects runtime execution. The nested switch cases, error checks, and instruction pointer manipulation make it hard to
  reason about.

  Recommendation: Extract opcode handlers into separate methods:

   func (vm *VM) handleOpConstant(ins code.Instructions, ip int) error { ... }
   func (vm *VM) handleOpJump(ins code.Instructions, ip int) error { ... }

  ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------

  3. vm/vm_test.go:66 - Cognitive Complexity: 34 ⚠️ CRITICAL

  Function: Test helper function

  Problem: Test code with complexity 2.3x over the limit (34 vs 15), likely due to nested loops validating complex test cases.

  Why it matters: Complex test code is harder to maintain and can hide bugs in the tests themselves.

  Recommendation: Break into smaller helper functions for specific validation types (integers, booleans, arrays, etc.).

Impact: These three functions represent the core of your interpreter - the compiler, VM, and tests. High complexity makes maintenance risky and onboarding difficult.

Look at the issues you find and either fix them by hand or have the agent fix them. Then use the agent and SonarQube to review the fixes, it’s a great way to improve your code quality.

Step 5

In this step your goal is to reflect on how to use LLMs and how tools like compilers, linters and SonarQube can help you improve the quality of the software you build.

I recommend adding them to your normal agentic coding process, for me that means incorporating them into the prompt for each task, something like this in my prompts:

After reading the specification: 
1. Create a set of tests to verify the implementation behaves correctly. 
2. Then create the code required to meet the specification. 
3. Verify the functionality is correct using the tests.
4. Verify the code lints and passes quality checks with no warnings or errors.

My AGENTS.md usually defines how to run the linter and quality checks for the project.

Going Further

Review the LLM Leaderboard that Sonar created to provide transparency into how models build code, not just what they build. By running thousands of AI-generated solutions through SonarQube, they evaluated the models on the metrics that matter to engineering leaders: security, reliability, maintainability, and complexity.

To generate the leaderboard, Sonar analysed code quality from leading AI models (GPT-5.2 High, GPT-5.1 High, Gemini 3 Pro, Opus 4.5 Thinking, and Claude Sonnet 4.5).

It was interesting to see that while these models pass functional benchmarks well, they have significant differences in code quality, security, and maintainability.

Higher performing models tend to generate more verbose and complex code for example:

Opus 4.5 Thinking leads with 83.62% pass rate but generates 639,465 lines of code (more than double the less verbose models).
Gemini 3 Pro achieves similar performance (81.72%) with much lower complexity and verbosity.
GPT-5.2 High hits 80.66% pass rate but produces the most code (974,379 lines) and shows worse maintainability than GPT-5.1.

I found it particularly interesting to see that Gemini produced only 289k lines. That’s a lot less code to review and maintain!

Many thanks to Sonar for sponsoring this issue of Coding Challenges.

P.S. If You Enjoy Coding Challenges Here Are Four Ways You Can Help Support It

Refer a friend or colleague to the newsletter. 🙏
Sign up for a paid subscription - think of it as buying me a coffee ☕️ twice a month, with the bonus that you also get 20% off any of my courses.
Buy one of my courses that walk you through a Coding Challenge.
Subscribe to the Coding Challenges YouTube channel!

Share Your Solutions!

Request for Feedback

You can reach me on Bluesky, LinkedIn or through SubStack

Thanks and happy coding!

John

Coding Challenge #107 - Loom Clone

John Crickett — Sat, 07 Feb 2026 09:02:14 GMT

Hi this is John with this week’s Coding Challenge.

🙏 Thank you for being a subscriber, I’m honoured to have you as a reader. 🎉

If there is a Coding Challenge you’d like to see, please let me know by replying to this email📧

Coding Challenge #107 - Loom Clone

This challenge is to build your own version of Loom, a screen recording and video messaging tool.

Loom is a popular tool for creating quick video messages, tutorials, and screen recordings. It’s simple to use - you select what you want to record (screen, camera, microphone), click record, and it saves your video. This challenge will have you building the core features of Loom, giving you hands-on experience with media capture APIs, file handling, and building a user-friendly recording interface.

If You Enjoy Coding Challenges Here Are Four Ways You Can Help Support It

Refer a friend or colleague to the newsletter. 🙏
Sign up for a paid subscription - think of it as buying me a coffee ☕️, with the bonus that you get access to a monthly AMA and 20% off any of my courses.
Buy one of my self-paced courses that walk you through a Coding Challenge.
Join one of my live courses where I personally teach you Go by building five of the coding challenges or systems software development by building a Redis clone.

The Challenge - Building Loom

You’ll be building a simplified version of Loom that lets users record their screen, camera, and microphone, then save and playback those recordings. The key is to create an intuitive interface for setting up the recording, managing devices, and reviewing what you’ve captured.

You can build it as a web based application like Loom itself, or a desktop application.

Step Zero

In this introductory step you’re going to set your environment up ready to begin developing and testing your solution.

Choose your target platform and programming language. Loom is cross-platform, so you could build this as a web application, a desktop app using Electron or a similar framework, or a native application. Pick a tech stack that you’re comfortable with and that has good support for media capture and file handling.

Step 1

In this step your goal is to create the home page of your application.

Your home page should have a simple, clean interface that serves as the starting point for users. The main feature here is the ability to select and confirm a folder where recordings will be saved. Users should be able to browse and select a destination folder before they start recording.

Test your implementation by starting your application and verifying that you can select a folder on your system and see that selection reflected in the UI.

Step 2

In this step your goal is to let users select which microphone and camera devices they want to use for recording.

Before recording begins, users need to choose their audio input device (microphone) and video input device (camera). Your application should detect available devices on the system and present them in a dropdown or list. Users might have multiple microphones or cameras, so make sure they can see all available options and select the ones they want.

Test your implementation by checking that your application detects the devices available on your system. Verify that you can list them and select different ones.

Step 3

In this step your goal is to add the ability to mute or unmute the microphone and enable or disable the camera before and during recording.

Users often want to toggle their microphone or camera on and off - maybe they want to record their screen but not their voice, or show their screen but not their face. Add toggle buttons or checkboxes so users can control whether the microphone and camera are active. These controls should be available both before recording starts and while recording is in progress.

Test your implementation by toggling the microphone and camera on and off, verifying that the UI reflects the current state.

Step 4

In this step your goal is to implement the core recording functionality - starting and stopping a recording, and saving it to the folder the user selected.

Add a prominent button to start recording. Once recording begins, capture the audio from the selected microphone and video from the selected camera. Users should be able to stop the recording at any time. When they stop, the application should save the complete recording to the folder they selected in Step 1. The file should have a name or timestamp so users can identify it later.

Test your implementation by starting a recording, speaking or moving in front of the camera, stopping the recording, and verifying that a file has been created in your selected folder.

Step 5

In this step your goal is to add screen sharing and screen recording capabilities.

Extend your application so users can choose to record their screen instead of (or in addition to) the camera. Add a toggle or option to enable screen sharing mode. When enabled, users should be able to select which screen or window to record. This is in addition to the microphone and camera options - they should still be able to record audio and camera video alongside their screen. Ensure the shared screen is the main focus of the video.

Test your implementation by enabling screen recording, selecting a screen or window to record, and verifying that the screen content is captured when you record.

Step 6

In this step your goal is to display a list of previous recordings on the home page and allow users to play them back.

After users have made at least one recording, the home page should display a list of all recordings saved in the selected folder. Users should be able to click on a recording to play it back. The playback should handle both audio and video content, and work seamlessly with whatever format you chose to save recordings in.

Test your implementation by making a few recordings, returning to the home page, and verifying that all recordings appear in the list. Click on each one and verify that playback works correctly.

Going Further

Here are some ways you could extend this project:

Recording preview: Add a preview window so users can see themselves or their screen before they start recording.
Trimming and editing: Allow users to trim recordings to remove unwanted sections at the beginning or end.
Multiple simultaneous captures: Let users record screen + camera + microphone all at once and composite them into a single video.
Recording metadata: Store additional information with each recording like creation date, duration, and custom notes.
Share functionality: Add the ability to export recordings in different formats or upload them to cloud storage.
Keyboard shortcuts: Add keyboard shortcuts for common actions like start/stop recording, mute microphone, etc.

P.S. If You Enjoy Coding Challenges Here Are Four Ways You Can Help Support It

Refer a friend or colleague to the newsletter. 🙏
Sign up for a paid subscription - think of it as buying me a coffee ☕️ twice a month, with the bonus that you also get 20% off any of my courses.
Buy one of my courses that walk you through a Coding Challenge.
Subscribe to the Coding Challenges YouTube channel!

Share Your Solutions!

Request for Feedback

You can reach me on Bluesky, LinkedIn or through SubStack

Thanks and happy coding!

John

Will AI Kill Coding?

John Crickett — Sat, 31 Jan 2026 09:01:17 GMT

Some people claim that AI is already writing 90%+ of our code. Others tell us it will be writing 100% of it with six months.

They’re both wrong. But software engineering is changing. AI and AI-assisted coding is behind the change.

That means Coding Challenges has to change too.

I’d love your feedback on what you would like to see in this newsletter. Is it more coding challenges, more coding tutorials or more focus on how to change with the industry and leverage AI?

Hit reply, or comment on Substack and let me know. Thanks!

If you’re skeptical about AI and don’t think it can complete a specific coding task, I am on the hunt for challenges to put to AI. You can find full details and submit your challenge on the AI Coding Challenges GitHub repo.

Regards, John

Coding Challenge #106 - JSON Validator And Prettier

John Crickett — Sat, 24 Jan 2026 09:01:33 GMT

Hi this is John with this week’s Coding Challenge.

🙏 Thank you for being a subscriber, I’m honoured to have you as a reader. 🎉

If there is a Coding Challenge you’d like to see, please let me know by replying to this email📧

Coding Challenge #106 - JSON Validator And Prettier

This challenge is to build a JSON validator and formatter. JSON is everywhere, in APIs, configuration files, and data exchange. If you’re a software engineer it’s hard to avoid.

Sometimes that also means you have to cope with broken or unreadable large globs of JSON, as a result I’ve often found myself using JSON linters or prettiers. They’re simple tools but incredibly useful for anyone working with JSON data. This Coding Challenge is to build one.

If You Enjoy Coding Challenges Here Are Four Ways You Can Help Support It

Refer a friend or colleague to the newsletter. 🙏
Sign up for a paid subscription - think of it as buying me a coffee ☕️, with the bonus that you get access to a monthly AMA and 20% off any of my courses.
Buy one of my self-paced courses that walk you through a Coding Challenge.
Join one of my live courses where I personally teach you Go by building five of the coding challenges or systems software development by building a Redis clone.

The Challenge - Building a JSON Validator and Formatter

You’re going to build an application that lets users paste in some JSON, check if it’s valid, and format it in a useful way. The tool will parse JSON, validate its structure, and provide useful transformations.

You can build it as a web application, desktop application or following the trend in CLI tools that is seeing a resurgence because of CLI based AI agents, a CLI tool. It’s your project, your choice!

Step Zero

In this introductory step you’re going to set up your development environment and create the basic project structure.

Choose your target platform and programming language.

Step 1

In this step your goal is to build the initial user interface for the JSON tool.

Create a UI with a large text input box where users can paste JSON into. The interface should be clean and responsive. Think about how to display error messages when validation fails - users need to know what went wrong and where.

At the end of this step you should have a UI that the user can enter JSON into, by either typing or pasting the JSON in.

Step 2

In this step your goal is to build the JSON validation and formatting.

Write the code to parse the input string as JSON (if you haven’t before, now might be a good time to do the JSON parser project) and detect whether it is valid. When the user clicks the validate button, your tool should check the JSON syntax. If the JSON is valid, display it in a nicely formatted way with consistent indentation so the structure is easy to read.

If the JSON is invalid, highlight the error with a clear message about what went wrong and where (line number, character position, or a descriptive message). Seeing well-formatted JSON when validation succeeds helps users understand their data structure and spot issues visually.

Test it with valid JSON:, for example

{"name": "John", "job": "software engineer", "country": "United Kingdom"}

When validated, it should display it nicely formatted:

{
  "name": "John",
  "job": "software engineer",
  "country": "United Kingdom"
}

Test with invalid JSON (for example a missing comma):

{"name": "John" "job": "software engineer"}

Verify that the tool shows an appropriate error message rather than formatted output.

Test with invalid JSON (trailing comma):

{"name": "John",  "job": "software engineer",}

Again verify the error is clearly identified. Test that minified valid JSON gets properly formatted when validated.

Step 3

In this step your goal is to implement a sort feature. Write code that takes valid JSON and reorganises it so that keys are sorted alphabetically at each level of the object.

For example, if the input has keys in the order [city, age, name], the sorted output should have them as [age, city, name]. The sorting should apply independently at each nesting level - if you have nested objects, each one gets sorted by its own keys.

Testing: Test with an unsorted object:

{"zebra": 1, "apple": 2, "banana": 3}

The output should be:

{"apple": 2, "banana": 3, "zebra": 1}

Test with nested objects:

{"z": {"b": 1, "a": 2}, "a": {"y": 3, "x": 4}}

The output should sort both the top level and each nested object:

{"a": {"x": 4, "y": 3}, "z": {"a": 2, "b": 1}}

Step 4

In this step your goal is to implement the compress feature.

Write code that removes all non-essential whitespace from the JSON while preserving any whitespace that appears inside string values. This means removing spaces, newlines, and tabs between structural elements like braces, brackets, and commas, but leaving the content of strings unchanged.

Testing: Test with formatted JSON:

{
  "name": "John",
  "message": "Hello World"
}

The compressed output should be:

{"name":"John","message":"Hello World"}

Note that the space in the string “Hello World” is preserved. Don’t forget to test with newlines in strings too.

Ensure that it is possible to return to the prettier version with the validate button.

Step 5

In this step your goal is to implement a JSON to YAML converter.

Write code that takes valid JSON and converts it to YAML format. Your converter should handle objects, arrays, strings, numbers, booleans, and null values, translating the JSON structure into proper YAML syntax.

Testing: Test with a simple object:

{"name": "John", "age": 30}

The output should be:

name: John
age: 30

Test with nested objects and arrays:

{"person": {"name": "John", "hobbies": ["reading", "coding"]}}

The output should be:

person:
  name: John
  hobbies:
    - reading
    - coding

Test with various data types:

{"active": true, "count": 0, "message": null, "score": 9.5}

The output should be:

active: true
count: 0
message: null
score: 9.5

Verify that your converter handles indentation correctly and produces valid YAML.

Going Further

You can take this further by:

Add copy-to-clipboard functionality so users can quickly copy the result.
Making the formatting configurable.
Create a dark mode for the interface.

P.S. If You Enjoy Coding Challenges Here Are Four Ways You Can Help Support It

Refer a friend or colleague to the newsletter. 🙏
Sign up for a paid subscription - think of it as buying me a coffee ☕️ twice a month, with the bonus that you also get 20% off any of my courses.
Buy one of my courses that walk you through a Coding Challenge.
Subscribe to the Coding Challenges YouTube channel!

Share Your Solutions!

Request for Feedback

You can reach me on Bluesky, LinkedIn or through SubStack

Thanks and happy coding!

John