API Docs Overwrite Mechanism (Part 2): Sanitize code-first OpenAPIs

This is a continuation of an initial idea discussed in Part 1: An overwrite mechanism for machine- and human-friendly API documentation

Intro #

When it comes to a code-first, design-later approach, a specification is autogenerated from a source language like C#, Java, etc. OpenAPI is one such standard for defining REST APIs.

Exposure and visibility and abstraction are the name of the game. Even when APIs are classified as “internal to the company,” there are even more internal APIs, where only a specific application and a specific team should use them. *cough* this is a weakness of microservice architecture *cough*

Machine-generated OpenAPI schema (OAS) files have internal and external APIs jumbled together. The packages and pipelines which create our Swagger UIs pick up everything, unless the developer explicitly marks a controller as “Do not show,” which, as you can guess, is just as rare as a developer writing good comments.

Then, when I take the massive JSON and try to open it in Stoplight Studio, it crashes (Stoplight Studio is an Electron app after all 🤦‍). Software development is pretty disorganized. Actually, web development is disorganized, and ✨real✨ software engineers hate web development for being a free-for-all festival.

Maybe I can convince the developers to correctly mark their APIs as “public” and “private.” Unfortunately, the internal APIs far outnumber the external ones, so it’s a lot easier for a dev to tell me which ones are usable.

The “last mile” gap #

Partial implementation of an “information highway” looks like:

  • ✅ Auto-deployment of OAS files into the cloud that hosts a Swagger UI webpage
  • 👎 None of the webpages are collated, so managers are still cobbling together a list of APIs with ol’ fashioned Excel sheets and phone calls.

Companies simply do not invest resources into their info and docs toolchains. I call this the “last mile” gap. It comes from Edward Humes’s Door to Door, where a controversy over California Interstate 710 has left it 4 miles short of being a highway meant to directly connect a seaport to distribution centers.

When delivering a package to its final destination, the truck-to-door part is always manual. Nearly everything before that process, from outbound warehouse shipments to unloading an airplane, is highly automated.

In the case of API docs, the manual processes will always be: authoring of original content, reviewing, and signoff before publication. However, the delivery of information between teams and systems can be automated. Ditto for statistical and “purely logical” info (like how many servers are set up on a particular domain).


  • No more “manual” fetching, no sending Swagger/OAS files over email or chat
  • Deposit sanitized OAS files directly into a central Git repository
  • ∴ In a central repo, the augmentation of the OAS is where we should focus our efforts

OK! I cannot feasibly close the gap in a single article. There are many potential solutions that could’ve prevented disaster, but if the disaster has already happened, what are you going to do? My current concern is to extract “external” APIs so they’re editable for the individual.

What comes in must come out #

Before I can start documenting the APIs, I need to separate them into “public” and “private.” My main concern is the “public” APIs; I can safely disregard the “private” ones. The responsibilities and actions of each actor would look like:

Developer:

  1. Provide the endpoints which are intended to be “public” (devs outside of the team will use)
  2. Provide the server URI to call the API from
  3. Create an include list of these endpoints

Sanitizing pipeline:

  1. Filter the OAS to only keep what’s listed in the include list
  2. Delete unreferenced models
  3. Output a slimmer, sanitized OpenAPI schema

Technical writer:

  1. Edit the sanitized OAS by adding descriptions and examples
  2. Ask SMEs the right questions to get this info
  3. Create any additional pages, such as how-tos & scenario examples
  4. Analyze differences between incoming and existing schemas and follow up with devs
  5. Push changes to a documentation Git repo on a “review” branch

After all that, a docs-as-code pipeline takes care of the publication.

  1. One branch should hook to a “review” site
  2. SMEs and writers meet up and review docs together
  3. When changes are finalized, merge to the live branch/site

This article is concerned with the first six steps above, up to and including the sanitizing pipeline: how to get developers to provide the endpoints and the server URL painlessly, and how to set up a sanitizing pipeline to get a raw OAS ready for API writers. In the future I might write up how to implement the publication pipeline.

Existing tools #

Am I making my life harder than it needs to be? There are a slew of tools available for OAS processing already:

Tom Johnson’s Redocly Tutorial for splitting and combining OAS is awesome and comprehensive.

Why am I not satisfied if they’re specifically designed for OAS?

  • They use Node and npm. I love JavaScript, but npm project dependencies are massive and they don’t fare well in a DevOps (aka. automate-the-things) pipeline.
  • For code-first APIs, we can’t edit the OAS design spec. We can fetch the raw OAS from a pipeline, then augment it.
  • OAS is only a dominant specification. One day, a new standard will de-throne OAS, so I want something that’s “evergreen.” If we base our work on a generic format like JSON, then we know the techniques will be adaptable.

What is DevOps? Managing cloud resources.
Cloud computing is renting compute power from a big company like Amazon or Microsoft.
A pipeline is a series of compute jobs, and we (the renter/customer) provide a config that tells the remote server what to install and what scripts to run. Current providers include GitHub Actions, Azure DevOps, & Amazon Web Services.

Every tool and download delays the runtime and costs money. Adding packages on your own computer is a one-time affair, but on rented computers that are wiped frequently? Even if you cache the dependencies, they are still redownloaded from somewhere.

Normally, optimizing docs isn’t important enough to warrant attention. It’s not like the savings on a plain text file are gamechanging compared to fixing memory leaks on an enterprise application. Still, if we can use the simplest tools that integrate into pipelines easily, we should choose that.

Choosing jq #

GitHub Actions shows the tools that come with their images, including jq, a JSON processing command line tool. jq is cross platform, so it’s perfect for pipelines.

With jq, we can do almost everything that the aforementioned OpenAPI Tools do. However, it works with JSON and not YAML. There is a tool called yq which is a similar YAML processing utility.

Working with OAS locally #

The rest of this guide assumes you have git, jq, and cURL installed and know how to invoke them on the command line.
I’ve written a tutorial on How to install a CLI manually on Windows that covers jq, just in case.

I use “raw” OAS to mean autogenerated OAS. They are raw in the sense that they need processing before they’re ready to use. Assume that raw OAS always comes in JSON, so we won’t be dealing with YAML.

We want to accomplish basic sanitization and prepare the OAS for editing in any editor, whether it’s Stoplight Studio or Notepad++.

  1. Only show public endpoints.
  2. Add description fields if they’re missing

OpenAPI utility scripts #

Use the Sample Petstore OAS to check out what we can do with jq.

I’ve created a set of jq scripts, so the hard work has been done. They are available at my OpenAPI Utils repository, which has more than what gets covered here.

Fetch the scripts on your machine and change to the openapi-utils directory:

git clone https://gitlab.com/kibblab/openapi-utils.git
cd openapi-utils

I will not go in-depth on how to actually program with jq. Each script has comments.

📃 Include List Format (ILF) #

Create a file that contains:

post /pet
get /pet/{petId}
post /pet/{petId}
delete /pet/{petId}

This list is basic and hopefully self explanatory, with a <method> <endpoint> on each line.

  • The format should be intuitive and follow REST API display conventions.
  • It should be easy to write manually and/or generate automatically.
  • The endpoints must match what is listed in .paths of the OAS.

Save it as .include so it seems like a config file (it can be called anything, like public_apis.txt, but for the purposes of this article I will stick to .include).
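Since the filter relies on exact matches against .paths, a quick format check can catch typos before the list reaches a pipeline. The following is a minimal sketch, not part of openapi-utils: the function name check_include_list is hypothetical, and it assumes lower-case methods (as they appear as keys in an OAS), a single space, and a path starting with a slash.

```shell
# check_include_list FILE
# Prints every line that is neither blank nor "<method> <path>", prefixed
# with its line number, and returns 1 if any such line exists.
# Hypothetical helper, not part of openapi-utils.
check_include_list() {
  # -n adds line numbers, -v inverts the match, -E enables extended regex.
  # The second alternative lets blank lines through.
  if grep -nvE '^(get|put|post|delete|options|head|patch|trace) /[^[:space:]]*$|^[[:space:]]*$' "$1"; then
    return 1   # grep found (and printed) at least one malformed line
  fi
  return 0
}

# Demo on a throwaway file: line 3 has an upper-case method and line 4 is
# missing its leading slash.
demo_list=$(mktemp)
printf '%s\n' 'post /pet' 'get /pet/{petId}' 'GET /pet/{petId}' 'delete pet/{petId}' > "$demo_list"

bad_lines=$(check_include_list "$demo_list")
check_status=$?
printf '%s\n' "$bad_lines"
rm -f "$demo_list"
```

The demo prints the two offending lines with their line numbers. Whether the real filter tolerates upper-case methods is something to confirm against filter.jq itself; the check here is deliberately strict.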

Filter desired operations and endpoints #

This command takes an ILF and an OAS, and outputs the filtered OAS that only contains the paths specified in ILF.

jq -f filter.jq --rawfile list <includelist> <oasPiped>
  • <includelist> : path to an includelist
  • <oasPiped> : path to OAS file

(filter.jq source)
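To give a feel for what happens inside such a filter, here is a minimal sketch. It is not the filter.jq from the repository, only an illustration of how an include list passed with --rawfile could drive the filtering. It assumes jq 1.6 or later (for --rawfile), one space between method and path, and a tiny made-up OAS. Unlike a production filter, it also drops path-level keys such as parameters.

```shell
# Work in a scratch directory so nothing in the repo is touched.
workdir=$(mktemp -d)

# A tiny stand-in for a raw OAS (only .paths matters here).
cat > "$workdir/raw.json" <<'EOF'
{
  "openapi": "3.0.0",
  "paths": {
    "/pet": { "post": {}, "put": {} },
    "/pet/findByStatus": { "get": {} },
    "/pet/{petId}": { "get": {}, "delete": {} }
  }
}
EOF

printf '%s\n' 'post /pet' 'get /pet/{petId}' > "$workdir/.include"

# $list holds the whole include file as one string.
filtered=$(jq -c --rawfile list "$workdir/.include" '
  # Parse "method path" lines into objects, skipping blank lines.
  ( $list | split("\n")
    | map(select(length > 0) | split(" ") | {method: .[0], path: .[1]})
  ) as $keep
  | .paths |= with_entries(
      .key as $path
      # Keep only the methods listed for this path...
      | .value |= with_entries(
          .key as $method
          | select(any($keep[]; .path == $path and .method == $method))
        )
      # ...and drop the path entirely if nothing is left.
      | select(.value | length > 0)
    )
' "$workdir/raw.json")

printf '%s\n' "$filtered"
rm -rf "$workdir"
```

Only post /pet and get /pet/{petId} survive; everything outside .paths passes through untouched.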

Example #

Before: petstore.json contains several methods. A snippet of the .paths object looks like:

"/pet": {
  "post": { ... },
  "put": { ... }
},
"/pet/findByStatus": {
  "get": { ... }
},
"/pet/findByTags": {
  "get": { ... }
},
"/pet/{petId}": {
  "get": { ... },
  "post": { ... },
  "delete": { ... }
},
"/pet/{petId}/uploadImage": {
  "post": { ... }
},
// ... and so on

Run: Apply the filter script using .include as our list:

jq -f filter.jq --rawfile list ".include" "schemas/petstore.json" > petstore.filtered.json 

After: In petstore.filtered.json only these methods remain in .paths:

"/pet": {
  "post": { ... }
},
"/pet/{petId}": {
  "get": { ... },
  "post": { ... },
  "delete": { ... }
}

Those are the exact 4 operations we wanted to keep!
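Eyeballing four operations is fine, but for a longer list a mechanical check may be more comfortable. The sketch below prints every operation of a filtered OAS in include-list format and compares it with the list itself. The file contents are small stand-ins, and the variable name filter_check is made up for this example.

```shell
workdir=$(mktemp -d)

# Stand-in for petstore.filtered.json.
cat > "$workdir/filtered.json" <<'EOF'
{ "paths": {
    "/pet": { "post": {} },
    "/pet/{petId}": { "get": {}, "post": {}, "delete": {} }
} }
EOF
printf '%s\n' 'post /pet' 'get /pet/{petId}' 'post /pet/{petId}' 'delete /pet/{petId}' > "$workdir/.include"

# Print every operation as "<method> <path>", one per line.
jq -r '
  .paths | to_entries[]
  | .key as $path
  | .value | to_entries[]
  # Path items can also hold non-operation keys such as "parameters".
  | select(.key as $m
      | ["get","put","post","delete","options","head","patch","trace"]
      | index($m))
  | "\(.key) \($path)"
' "$workdir/filtered.json" | sort > "$workdir/actual.txt"

sort "$workdir/.include" > "$workdir/expected.txt"

# diff exits 0 only when both sorted lists are identical.
if diff "$workdir/expected.txt" "$workdir/actual.txt" > "$workdir/diff.txt"; then
  filter_check="match"
else
  filter_check="mismatch"
fi
echo "$filter_check"
rm -rf "$workdir"
```

Run against a raw OAS instead of a filtered one, the same jq expression lists every operation, which could serve as a starting point for a developer to trim down into an include list.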

Prune unused models #

Any .components.schemas that aren’t referenced anymore should be removed. API Handyman made a script for this. We’re borrowing it.

jq -f prune.jq petstore.filtered.json > petstore.pruned.json

How do we know if the pruning works? You can traverse the resulting JSON, or check the file sizes:

schemas/petstore.json    33 KB
petstore.pruned.json     13 KB

It works because I trust in the API Handyman 😛

(prune.jq source)
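If trust alone feels thin, one way to check the result is to list every schema that is still defined but never the target of a $ref. This is only a sketch on a made-up document in which a schema named Tag was left behind on purpose. It does not follow chains, so a schema referenced only by another orphan would still count as referenced.

```shell
workdir=$(mktemp -d)

# Stand-in for a pruned OAS in which "Tag" was left behind by mistake.
cat > "$workdir/pruned.json" <<'EOF'
{
  "paths": {
    "/pet": { "post": { "requestBody": { "content": { "application/json": {
      "schema": { "$ref": "#/components/schemas/Pet" } } } } } }
  },
  "components": { "schemas": {
    "Pet": { "type": "object" },
    "Tag": { "type": "object" }
  } }
}
EOF

# List schemas that are defined but never the target of a $ref.
orphans=$(jq -r '
  # Collect every $ref string found anywhere in the document.
  ([.. | objects | .["$ref"]? | strings] | unique) as $refs
  | (.components.schemas // {}) | keys[]
  | select(. as $name | any($refs[]; . == "#/components/schemas/" + $name) | not)
' "$workdir/pruned.json")

printf 'unreferenced schemas: %s\n' "${orphans:-none}"
rm -rf "$workdir"
```

An empty result on a real pruned file suggests the prune step did its job.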

Add “description” properties #

This script adds a description property to each path method and component that doesn’t already have one. This step is optional, but it’s for the future “overwrite mechanism” that I still have to get to.

jq -S -f add-desc.jq petstore.pruned.json > petstore.editable.json

(add-desc.jq source)
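For illustration, the core idea can be sketched in a few lines of jq. This is not the add-desc.jq from the repository: it assumes an empty string as the placeholder value and only touches operations and component schemas in a small made-up document.

```shell
workdir=$(mktemp -d)

cat > "$workdir/pruned.json" <<'EOF'
{
  "paths": {
    "/pet": {
      "post": { "summary": "Add a pet" },
      "put": { "description": "Update an existing pet" }
    }
  },
  "components": { "schemas": { "Pet": { "type": "object" } } }
}
EOF

# -S sorts keys so the output is stable between runs; -c keeps it on one line.
with_desc=$(jq -c -S '
  # Operations: every object nested two levels under .paths.
  .paths[][] |= (if type == "object" then .description //= "" else . end)
  # Component schemas, if the document has any.
  | .components.schemas[]? |= (.description //= "")
' "$workdir/pruned.json")

printf '%s\n' "$with_desc"
rm -rf "$workdir"
```

The //= operator only assigns when the property is missing or null, so the existing description on put is left alone.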

Piping all of them together #

To run steps #1-3 all together:

jq -f filter.jq --rawfile list ".include" "schemas/petstore.json" |
jq -f prune.jq |
jq -f add-desc.jq > petstore.editable.json

Fetch a remote OAS with curl:

curl "https://raw.githubusercontent.com/OAI/OpenAPI-Specification/main/examples/v3.0/petstore-expanded.json" |
jq -f filter.jq --rawfile list .include |
jq -f prune.jq |
jq -f add-desc.jq > petstore.editable.json

Now we have a chunk of code that can be thrown into a pipeline, processed in a Bash/Batch script, etc.
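As one possible shape for that script, here is a sketch of a POSIX shell wrapper. The function name sanitize_oas and its argument order are made up for this example, and it assumes it runs from inside the openapi-utils checkout. It uses temporary files instead of one long pipe so that a failure in an early jq step is not hidden by the exit status of the last one.

```shell
# sanitize_oas RAW_OAS INCLUDE_LIST OUTPUT
# Runs filter -> prune -> add-desc and stops at the first failure.
# Hypothetical wrapper; assumes the current directory is the openapi-utils
# checkout so that the three .jq scripts can be found.
sanitize_oas() {
  raw=$1 list=$2 out=$3

  # Fail early with a clear message if any input is missing.
  for f in "$raw" "$list" filter.jq prune.jq add-desc.jq; do
    if [ ! -f "$f" ]; then
      echo "sanitize_oas: missing file: $f" >&2
      return 2
    fi
  done

  tmp=$(mktemp -d) || return 1

  # Each step only runs if the previous one succeeded.
  jq -f filter.jq --rawfile list "$list" "$raw" > "$tmp/filtered.json" &&
    jq -f prune.jq "$tmp/filtered.json" > "$tmp/pruned.json" &&
    jq -f add-desc.jq "$tmp/pruned.json" > "$out"
  rc=$?

  rm -rf "$tmp"
  return "$rc"
}

# Example call:
# sanitize_oas schemas/petstore.json .include petstore.editable.json
```

A non-zero return value lets a pipeline job fail visibly instead of publishing a half-processed file.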

The End #

Well, this article only covered a small part of that checklist I posted in the first section. I need to organize these topics better…

💡 Future Topics #

After sanitization, development on the Overwrite Mechanism can commence. The idea is to store human-added annotations separately from OAS so that our hard work doesn’t get overwritten by an automated pipeline. At the moment of writing, I don’t have the “annotator” as polished as I’d like.

Still, there are also tons of other topics. Would you be interested in seeing A, B, C, or something else? Let me know in the comments.

A. Comparison of OpenAPI GUIs:

B. Docs-as-code publishing pipeline setup
At my place we have the basics down. There are some advanced features I want to implement. When I get the chance, I hope to explore the basic necessities as well as the “extras” that you may want to consider: programmatic style guide enforcement, hyperlink fixing, duplicate content detection.

C. Detect and log changes in the OAS
Explore the implications of diffing old and new OAS files. If we get an automated delivery of OAS files, we should at least log the changes before overwriting them.

Photo by cottonbro