Pull docs from multiple GitHub repositories

Oct 11 2021

Here’s a rather common question: should you store documentation in a BIG, ¡CENTRAL! repository, or keep docs embedded close to their respective product?

Well, to be honest, nobody can answer that question for you. It depends on the product’s distribution model, the size of the technical writing team, whether the docs are public facing, and whether the product team wants to own their docs or hand them off to someone else.

For example, a SaaS that emphasizes continuous service without strict or obvious versioning may want docs stored with the product. A traditional desktop application might want a separate repo for writing so the codebase history won’t be mixed with doc changes.

Anyway, I’m gonna assume you made the decision to have several doc repositories, and you need to aggregate them for publishing. Antora, a docs-as-code toolchain, advertises multi-repository docs as a headline feature. If you’d rather not adopt a whole new toolchain for that, here’s a simple setup.

Dependencies #

  • PowerShell 7.x or higher (Core): Most people at my place develop on Windows, so PowerShell is a natural choice.
  • The utilities git, jq, and curl

Assumptions #

  • All repositories we are tracking have a docs/ folder in the root.
  • The repositories are hosted on GitHub.

If you use a different Git provider, the API and jq filter will be slightly different, but you can use the same concept.

Procedure #

The goal is to:

  1. Schedule a cron job that runs every X hours.
  2. For each repository that we track, ping the API and get the latest commit.
  3. Go through each repository in the list and check if the hash has changed.
  4. If it’s different or nonexistent, fetch the docs from the repository.
  5. Update the list of hashes.
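The loop above can be sketched in a few lines; this is a rough Python illustration where `latest_sha` and `clone_docs` are hypothetical stand-ins for the PowerShell scripts built in the rest of this post:

```python
# Illustrative sketch of steps 2-5; latest_sha and clone_docs are
# stubs standing in for the API query and sparse clone scripts.
def latest_sha(repo, branch):
    # Stub: a real version would query the GitHub contents API (step 2)
    return {"squidfunk/mkdocs-material": "d8e8deb"}.get(repo, "")

def clone_docs(repo, branch):
    # Stub: a real version would run the sparse git clone (step 4)
    return f"cloned {repo}@{branch}"

tracked = [("squidfunk/mkdocs-material", "master")]  # from myRepos.txt
known = {"squidfunk/mkdocs-material": "68b6cb9"}     # previously recorded hashes

for repo, branch in tracked:
    new = latest_sha(repo, branch)
    if known.get(repo) != new:   # hash changed or repo never seen (step 3)
        clone_docs(repo, branch)
    known[repo] = new            # step 5: update the record
```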

Create 2 files:

base/
├─ myRepos.txt
└─ pull_multi_repos.ps1

Repository List #

You will need a headerless CSV of all the repos to track. The format is repoName,branchName. For example:

myRepos.txt

squidfunk/mkdocs-material,master
jupyter/jupyter,master
salesforce-ux/design-system,main

Get the latest commit hash from a GitHub repo #

We only want to clone a repository if commits actually touched its docs/ folder.

Relevant endpoint:

https://api.github.com/repos/{ORG}/{REPO_NAME}/contents/?ref={BRANCH_NAME}

Example: Get the commit hash of docs/ from squidfunk/mkdocs-material

curl -H 'Accept: application/vnd.github.v3+json' \
'https://api.github.com/repos/squidfunk/mkdocs-material/contents/?ref=master' |
jq -r '.[] | select((.path|test("docs"; "i" )) and .type=="dir") | .sha'

GitHub’s API returns an array of all the files in the repository root. The jq filter:

  • Selects an object whose .path property contains the value "docs", where "i" is for case-insensitive matching (matches “DOCS” and “Docs”).
  • Checks .type is “dir”, or else we’ll get other files with “docs” in the name.
  • Pipes out .sha, the latest commit hash of the directory.
The matching object in the response looks like this:

{
  "name": "docs",
  "path": "docs",
  "sha": "d8e8deb6baaeda4e8b1577002eb64091dda4f146",
  "size": 0,
  "url": "https://api.github.com/repos/squidfunk/mkdocs-material/contents/docs?ref=master",
  "html_url": "https://github.com/squidfunk/mkdocs-material/tree/master/docs",
  "git_url": "https://api.github.com/repos/squidfunk/mkdocs-material/git/trees/d8e8deb6baaeda4e8b1577002eb64091dda4f146",
  "download_url": null,
  "type": "dir",
  "_links": {
    "self": "https://api.github.com/repos/squidfunk/mkdocs-material/contents/docs?ref=master",
    "git": "https://api.github.com/repos/squidfunk/mkdocs-material/git/trees/d8e8deb6baaeda4e8b1577002eb64091dda4f146",
    "html": "https://github.com/squidfunk/mkdocs-material/tree/master/docs"
  }
}
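If jq syntax is unfamiliar, the same selection is easy to express in Python; this sketch applies the identical filter to the decoded JSON array (a trimmed sample response is inlined here instead of a live API call):

```python
import json

# Trimmed sample of what the /contents/ endpoint returns: an array of
# objects describing each entry in the repository root.
response = json.loads("""[
  {"path": "README.md", "sha": "aaa111",  "type": "file"},
  {"path": "Docs",      "sha": "d8e8deb", "type": "dir"},
  {"path": "docs.txt",  "sha": "bbb222",  "type": "file"}
]""")

# Same logic as the jq filter: path contains "docs" (case-insensitive)
# and the entry is a directory.
shas = [e["sha"] for e in response
        if "docs" in e["path"].lower() and e["type"] == "dir"]

print(shas[0])  # the commit hash of the docs/ directory
```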

Authentication #

In the real world, you’re probably going to be working on private projects, and authentication is necessary. Let’s create a script, get_sha.ps1, which fetches each repository’s latest docs/ SHA.

$header="Authorization: token $($args[0])"
$in=$args[1]
$out=$args[2]
# Clear any previous output; ignore the error if the file doesn't exist yet
Clear-Content $out -ErrorAction 'SilentlyContinue'
Import-Csv $in -Header repo,branch | Foreach-Object {
    $repo = $_.repo
    $branch = $_.branch
    $url="https://api.github.com/repos/$($repo)/contents/?ref=$($branch)"

    $sha = curl -H $header -H 'Accept: application/vnd.github.v3+json' $url |
    jq -r ".[] | select((.path|test(\""docs\""; \""i\"" )) and .type==\""dir\"") | .sha"

    $repo +','+$sha >> $out
}

Usage #

.\get_sha.ps1 <GITHUB_SECRET> <inputFile> <output>
  • <GITHUB_SECRET>: GitHub token with read access
  • <inputFile>: a list which follows myRepos.txt format
  • <output>: specify an output destination, such as tempShas.txt

Output

It creates another CSV with the format repoName,latestCommit

squidfunk/mkdocs-material,68b6cb9fc3082d222d8ef02e0b7d4983d867c115
jupyter/jupyter,82b5e87bc3f70d3d70d147bde588b081eb296f2c
salesforce-ux/design-system,38b84ab9e0a119d6bd0b7375652606223f0cb516

Clone docs #

We want a script that only clones a particular folder and branch. The special sauce is the git sparse-checkout feature, which lets you check out only the folders you specify.

$TOKEN=$args[2]
$baseUrl="https://$($TOKEN)@github.com/"
$dir=($args[0] -split '/')[-1] # "user_org/reponame" to just reponame

git init $dir &&
cd $dir
git remote add -t $args[1] -f origin $baseUrl$($args[0]).git
git config core.sparseCheckout true
git config core.ignorecase true
echo "docs/" >> .git/info/sparse-checkout
git pull --depth 1 origin $args[1] &&
cd ..
if ($LASTEXITCODE -eq 0) { # success
    echo ("Successfully updated $($dir)")
} else {
    echo ("Exited with code $($LASTEXITCODE)")
}

Example usage #

.\clone_docs.ps1 <REPO> <BRANCH> <GITHUB_SECRET>

.\clone_docs.ps1 "squidfunk/mkdocs-material" master $(GITHUB_SECRET)

This script will clone nested docs/ folders. For example, ui/docs/ would be included.

Compare SHA #

Compare each line of the new SHA record against the old one. If a hash differs, trigger the clone script.

$repos = Import-Csv $args[0] -Header repo,branch

# First run: no sha.txt exists yet, so clone everything
if (!(Test-Path $args[1])) {
    $repos | Foreach-Object {
        .\clone_docs.ps1 $_.repo $_.branch
    }
} else {
    $oSha = Import-Csv $args[1] -Header repo,hash
    $nSha = Import-Csv $args[2] -Header repo,hash
    Compare-Object -ReferenceObject $oSha -DifferenceObject $nSha -Property repo,hash |
    # Compare-Object reports differences from both sides, so group them per repo
    Group-Object -Property repo |
    Foreach-Object {
        $reponame = $_.Name
        # Look up the branch for this repo in the CSV list
        $info = $repos | Where-Object -Property repo -eq -Value $reponame
        # Clone only the repos whose docs/ hash changed
        .\clone_docs.ps1 $reponame $info.branch
    }
}

Usage #

.\compare_sha.ps1 <REPO_LIST> <OLD_SHA_RECORD> <NEW_SHA_RECORD>
  • <REPO_LIST>: CSV list like myRepos.txt
  • <OLD_SHA_RECORD>: The sha.txt file that was created by get_sha
  • <NEW_SHA_RECORD>: The freshly generated record from get_sha, such as tempSha.txt
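To make the comparison concrete, here is a small Python sketch of the same diff: read both records into dicts, then pick out every repo whose hash changed or is missing from one side. The file contents are inlined here for illustration:

```python
import csv
import io

# Inlined stand-ins for the old and new SHA records (repoName,latestCommit)
old = "squidfunk/mkdocs-material,68b6cb9\njupyter/jupyter,82b5e87\n"
new = "squidfunk/mkdocs-material,d8e8deb\njupyter/jupyter,82b5e87\n"

def to_dict(text):
    # Parse a headerless repo,sha CSV into {repo: sha}
    return {repo: sha for repo, sha in csv.reader(io.StringIO(text))}

old_shas, new_shas = to_dict(old), to_dict(new)

# Equivalent of Compare-Object + Group-Object: repos whose hash differs
# between the two records, or that exist in only one of them.
changed = sorted(r for r in old_shas.keys() | new_shas.keys()
                 if old_shas.get(r) != new_shas.get(r))

print(changed)  # only these repos need a fresh clone
```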

Putting it together #

When put together, the process of checking SHAs and cloning updated repos looks like this:

$sha='sha.txt'
$temp='tempSha.txt'
$list='myRepos.txt'
.\get_sha.ps1 $(GITHUB_SECRET) $list $temp
.\compare_sha.ps1 $list $sha $temp
# We're done with these files; replace old shas with new
cp $temp $sha
rm $temp

However, we can combine them into a single script for maintainability.

Pull from multiple repos #

The all-in-one script contains the functions explained above. It checks each repo and clones any whose docs/ hash has changed.

.\pull_multi_repos.ps1 <GITHUB_SECRET> <REPO_LIST> <SHA_RECORD>

pull_multi_repos.ps1

function Get-Sha {
    param( [string]$auth, [string]$repos, [string]$out )
    $header="Authorization: token $($auth)"
    # Clear any previous output; ignore the error if the file doesn't exist yet
    Clear-Content $out -ErrorAction 'SilentlyContinue'
    Import-Csv $repos -Header repo,branch | Foreach-Object {
        $repo = $_.repo
        $branch = $_.branch
        $url="https://api.github.com/repos/$($repo)/contents/?ref=$($branch)"

        $sha = curl -H $header -H 'Accept: application/vnd.github.v3+json' $url |
        jq -r ".[] | select((.path|test(\""docs\""; \""i\"" )) and .type==\""dir\"") | .sha"

        $repo +','+$sha >> $out
    }
}

function Clone-Docs{
    param( [string]$auth, [string]$repository, [string]$b )

    if ($auth) {
        $baseUrl="https://$($auth)@github.com/"
    } else {
        $baseUrl="https://github.com/"
    }
    # user_org/reponame to just reponame
    $dir=($repository -split '/')[-1]

    git init $dir &&
    cd $dir
    git config core.sparseCheckout true
    git config core.ignorecase true
    git remote add -t $b -f origin "$baseUrl$($repository).git" &&
    echo "docs/" >> .git/info/sparse-checkout
    git config pull.rebase true
    git pull --depth 1 origin $b &&
    cd ..

    if ($LASTEXITCODE -eq 0) { # success
        echo ("Successfully updated $($dir)")
    } else {
        echo ("Exited with code $($LASTEXITCODE)")
    }
}

function Compare-Clone{
    param( [string]$repos, [string]$old, [string]$new, [string]$auth )

    $r = Import-Csv $repos -Header repo,branch

    try {
        $oSha = Import-Csv $old -Header repo,hash -ErrorAction Stop
        $nSha = Import-Csv $new -Header repo,hash -ErrorAction Stop
    }
    catch {
        # No sha.txt exists yet: clone everything. Use return (not exit)
        # so the caller still copies the new record into place afterwards.
        $r | Foreach-Object {
            Clone-Docs -auth $auth -repository $_.repo -b $_.branch
        }
        return
    }

    Compare-Object -ReferenceObject $oSha -DifferenceObject $nSha -Property repo,hash |
    # Compare-Object reports differences from both sides, so group them per repo
    Group-Object -Property repo |
    Foreach-Object {
        $reponame = $_.Name
        # Look up the branch for this repo in the CSV list
        $info = $r | Where-Object -Property repo -eq -Value $reponame
        # Clone only the repos whose docs/ hash changed
        Clone-Docs -auth $auth -repository $reponame -b $info.branch
    }
}

$_a=$args[0]
$_r=$args[1]
$_s=$args[2]
$_t='tempSha.txt'

Get-Sha -auth $_a -repos $_r -out $_t
Compare-Clone -auth $_a -repos $_r -old $_s -new $_t
cp $_t $_s
rm $_t

Pipeline config #

Wherever your CI/CD server lives, you can set a cron schedule and store the auth token GITHUB_SECRET as a secret variable. This config is for Azure DevOps:

schedules:
- cron: "0 0 * * *"
  displayName: Daily midnight build
  branches:
    include:
    - main

steps:
- pwsh: | # pwsh is PowerShell 7.x
    $sha='sha.txt'
    $list='myRepos.txt'
    .\pull_multi_repos.ps1 $(GITHUB_SECRET) $list $sha
  displayName: Only clone repos with updated docs

After running the scripts, you’ll get these files:

baserepo/
├─ myRepos.txt
├─ pull_multi_repos.ps1
├─ azure-pipelines.yml
├─ sha.txt
│
├─ mkdocs-material/
│  └┬ docs/
│   └─ getting-started.md
└─ jupyter/
   └┬ docs/
    └─ doc-requirements.txt

Cache the repos #

A future consideration is to cache the git repos. Currently we copy all the docs/ into the main repo, so we have duplication.

  1. The first time you run a pipeline, sha.txt won’t exist, so it’ll clone all repos by default. It’s easiest to commit the current sha.txt into the same repo.
  2. Otherwise, restore the cache with a prebuilt task.
  3. Adjust the working directories and clone into the cache.
  4. Cache the .cache folder. The resulting layout:

baserepo/
├─ myRepos.txt
├─ azure-pipelines.yml
├─ pull_multi_repos.ps1
└─ sha.txt

.cache/
├─ mkdocs-material/
│  └─ docs/
└─ jupyter/
   └─  docs/
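As a sketch of step 4, Azure DevOps has a built-in Cache task; the key and path values below are illustrative, not a tested config:

```yaml
# Illustrative only: cache the cloned repos between pipeline runs.
- task: Cache@2
  inputs:
    key: 'docs-repos | sha.txt'
    path: '.cache'
  displayName: Cache cloned doc repos
```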

Conclusion #

I wanted to try a multi-doc repo with submodules, but I couldn’t figure out a way to do it without cloning the entirety of every repository. Since I only need a subset of files, most of the recommendations involve subtrees or using extra scripts like git-filter-repo, but I couldn’t figure those out either. To reduce cloud computing costs, I’d like to avoid downloading whole repos in the first place.

Ideally, each engineering team would have a CICD pipeline that not only deploys their application, but also sends their docs to us when they click on the shiny “Release” button. Still, waiting for people to give you docs takes longer than grabbing the files yourself.

