URL Checker for broken links and statuses

A common task in documentation maintenance is to scour the pages for broken links.

There are many advanced link checkers, but they are built for the worst-case scenario: websites with dynamic client-side rendering, where JavaScript injects the HTML elements and the only way to get the URLs is to wait for the page to load in a headless browser. To handle that, they scrape pages with UI automation. Some of these link checkers also recursively crawl the URLs of every connected page, and some can authenticate so that the crawler can load pages behind a VPN or firewall.

However, that’s a lot of overhead for documentation sites, which tend to be simpler and whose pages are already known.

The blog post Bulk URL Checker with cURL discusses how to use a shell script to check many links. You don’t need to install Python or npm packages if you already have a Bash shell.

If I have a local copy of the documentation, and I’m the only person who cares, then I want to run this on my work computer with little to no setup. In my case, I don’t have a Bash shell on Windows, but I do have PowerShell 7.1 installed. Although I could use Bash on Windows by downloading Cygwin or logging into a WSL instance, I just want to get my link checking done on my native system without installing any more software.

One-liners

Assuming you have a list of links you want to check, such as ./links.txt with a link on each line:

https://en.wikipedia.org/
http://h2020.myspecies.info/
http://weevil.info/
http://cbdkhlfmnrtvwsxz.neverssl.com/online
https://8.8.8.8/
http://8.8.8.8/
https://www.amazon.com/error
http://www.example.com/stuff

In Linux, you can use a 1-liner to get the results in the console:

xargs -P 100 -n 1 curl --connect-timeout 10 -Iso /dev/null -w "%{url_effective}\t%{http_code}\t%{redirect_url}\n" < ./links.txt

The PowerShell 1-liner equivalent would be:

Get-Content ./links.txt | ForEach-Object -Parallel{ curl --connect-timeout 10 -Iso /dev/null -w "%{url_effective}\t%{http_code}\t%{redirect_url}\n" $_ } -ThrottleLimit 100

xargs -P 100 runs up to 100 commands in parallel, and -n 1 passes one argument (here, one line of the input file) to each command. The PowerShell equivalent is ForEach-Object -Parallel { } -ThrottleLimit 100.
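To see what -n 1 and -P do without touching the network, here is a small sketch that swaps curl for echo. The two URLs are placeholders and are not taken from links.txt.

```shell
# Placeholder input standing in for ./links.txt (these URLs are illustrative only)
links='https://example.com/
https://example.org/'

# -n 1 hands each echo exactly one URL; -P 2 lets up to two run at once.
# Parallel commands finish in no fixed order, so sort the lines for a stable view.
result=$(printf '%s\n' "$links" | xargs -P 2 -n 1 echo "would check:" | sort)
printf '%s\n' "$result"
```

Each output line carries a single URL, which is what lets curl report one status per link.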

The curl -I flag is short for curl --head, which asks the server for the response headers only instead of the page’s entire content. Some servers don’t handle the HEAD method properly, but most modern servers do, so it’s the better default.
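For the servers that do reject HEAD, one possible approach is to retry with a normal GET. The sketch below is not part of the original script: the function names head_rejected and check_url are hypothetical, and treating only 405 and 501 as "HEAD rejected" is an assumption.

```shell
# Hypothetical helper: does this status suggest the server rejected HEAD itself?
# Assumption: only 405 (Method Not Allowed) and 501 (Not Implemented) count.
head_rejected() {
  case "$1" in
    405|501) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical helper: try HEAD first, then repeat with GET only if HEAD was rejected.
check_url() {
  url=$1
  status=$(curl --connect-timeout 10 -Iso /dev/null -w '%{http_code}' "$url")
  if head_rejected "$status"; then
    # Without -I, curl sends a normal GET; -o /dev/null still discards the body
    status=$(curl --connect-timeout 10 -so /dev/null -w '%{http_code}' "$url")
  fi
  printf '%s\t%s\n' "$url" "$status"
}
```

Calling check_url with a single URL prints the URL and its final status separated by a tab. The GET retry downloads the whole page, so it is only worth paying for when HEAD fails.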

A Bigger Script

We can make a utility that accepts not just a *.txt file, but other extensions and entire directories. Then, we can write the results to a tab-separated values (TSV) file for people who want to view them in a spreadsheet.

Requires: curl CLI and PowerShell 7+
Source repository: Scripts and examples on GitLab

This script is called linkchecker.ps1

Use .\linkchecker.ps1 -help or .\linkchecker.ps1 -h for a list of commands.

param(
    [Alias('p')]
    [string]$path,
    [Alias('o')]
    [string]$output = '.\linkstatus.tsv',
    [switch]$dryrun,
    [Alias('h')]
    [switch]$help
)

# Extensions that are scanned for URLs; any other single file is read as a plain list of links
$accept = @('html', 'htm', 'markdown', 'md', 'yaml', 'yml', 'json', 'xml', 'svg')

if ($help) {
@"

SYNOPSIS
    Check broken links from a static html page or a text list. Creates a TSV file with their status.
    File extensions checked: $accept

USAGE
    linkchecker.ps1 -path PATH [-output OUTPUT]

    -p, -path      The input path. Can be a directory and will recurse through children
    -o, -output    The output TSV file. Default is '$output'
    -dryrun        Show links to be checked but don't make any HTTP calls
    -h, -help      Guidance menu
"@
    Exit 0
}

# Match http(s) URLs, stopping at whitespace, quotes, angle brackets, braces, parentheses and commas
$rgx = 'https?://[^\s"<>{}(),]+'

# Check if the path is a folder
if ((Get-Item $path).PSIsContainer) {
    # Recurse through $path (not the current directory) and keep only the accepted extensions
    $files = (Get-ChildItem $path -Recurse -File |
        Where-Object { $_.Extension.TrimStart('.') -in $accept }).FullName
    $links = ($files | ForEach-Object -Parallel {
        Get-Content $_ | Select-String $($using:rgx) -AllMatches | ForEach-Object { $_.Matches.Value }
    } -ThrottleLimit 100)
} else {
    # Use the last extension so that a name like 'notes.v2.md' resolves to 'md'
    $ext = (Split-Path -Path $path -Extension).TrimStart('.')
    if ($ext -in $accept) {
        $links = (Get-Content $path | Select-String $rgx -AllMatches | ForEach-Object { $_.Matches.Value })
    } else {
        # Any other file is treated as a list with one link per line
        $links = @(Get-Content $path)
    }
}

# Wrap in @() so that a single link is still counted as one item, not as a string
$links = @($links | Sort-Object | Get-Unique)

if ($dryrun) { Write-Output $links; Exit 0 }

Write-Output "Checking $($links.Length) URLs..."

$callTime = Measure-Command {
    # Header row first, then one row per link appended as the parallel calls finish
    Set-Content -Path $output "url`tstatus`tredirect"
    $links | ForEach-Object -Parallel {
        curl --connect-timeout 10 -Iso /dev/null -w "%{url_effective}\t%{http_code}\t%{redirect_url}\n" $_
    } -ThrottleLimit 100 | Add-Content -Path $output
}

Write-Output "Done. Time to complete was $callTime"

Basic usage: Outputs a file called linkstatus.tsv.

pwsh linkchecker.ps1 README.md

Explicit arguments: specify the input path and output filename.

pwsh linkchecker.ps1 -path ./examples/test.json -o 2022-07-29.tsv

Dry run: show links to be checked without making network calls.

pwsh linkchecker.ps1 . -dryrun

This lists all the links found in the current directory. To actually run the checker: pwsh linkchecker.ps1 .
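For readers on Linux, the extraction step behind the dry run can be approximated with grep. This is a rough sketch under assumptions: extract_links is a hypothetical name, the pattern is a POSIX-style adaptation of the script’s regex rather than an exact copy, and it relies on grep supporting -o (GNU and BSD grep do).

```shell
# Hypothetical helper: read text on stdin, print each distinct http(s) URL once.
# The bracket expression stops a match at whitespace, quotes, angle brackets,
# braces, parentheses and commas.
extract_links() {
  grep -oE 'https?://[^[:space:]"<>{}(),]+' | sort -u
}

# Illustrative input mixing Markdown and HTML; the URLs are placeholders
sample='See [docs](https://example.com/a) and <a href="http://example.org/b?x=1">b</a>, plus https://example.com/a again.'
found=$(printf '%s\n' "$sample" | extract_links)
printf '%s\n' "$found"
```

The duplicate https://example.com/a is reported once, mirroring the de-duplication the script does before it makes any HTTP calls.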

Interpreting TSV results

A TSV or CSV file can be opened in Excel, Google Sheets, and LibreOffice.

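  • A status between 400 and 499 means the link is broken.
  • A status of 500 or higher means the server might be temporarily unavailable, so the link is not necessarily broken.
  • 000 means curl received no HTTP response, for example because the connection timed out after 10 seconds.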
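The rules above can also be applied mechanically. The following is a minimal sketch, assuming a TSV with the url, status and redirect columns that the script writes; classify_status is a hypothetical name, and the verdict labels are my own wording.

```shell
# Hypothetical helper: print url, status and a verdict for every row of the TSV in $1.
classify_status() {
  awk -F '\t' '
    NR == 1 { print "url\tstatus\tverdict"; next }   # replace the header row
    {
      code = $2 + 0                                  # "000" becomes 0
      if (code == 0)        verdict = "no response"
      else if (code >= 500) verdict = "server error, retry later"
      else if (code >= 400) verdict = "broken"
      else                  verdict = "ok"
      print $1 "\t" $2 "\t" verdict
    }' "$1"
}
```

Running classify_status linkstatus.tsv prints one labeled row per link. Redirects (300 to 399) are labeled "ok" here, which is a simplification; the redirect column in the TSV shows where they point.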

Complete Example

This PowerShell script creates a CSV named after today’s date that contains only the failed links.

# $input is an automatic variable in PowerShell, so use a different name
$inputPath='./examples/links.txt'

# Check links
$TempFile = New-TemporaryFile;
.\linkchecker.ps1 -path $inputPath -o $TempFile;

# Filter for bad links
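# Import-Csv returns strings, so cast status to [int] to sort and compare numerically
$errored=( Import-Csv -Path $TempFile -Delimiter "`t" | Sort-Object { [int]$_.status } | Where-Object { [int]$_.status -gt 399 -or [int]$_.status -lt 100 } );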

# Name a CSV report after today's date.
$date=( Get-Date -Format 'yyyy-MM-dd' );
$errored | Export-Csv "$date.csv"

Remove-Item -Recurse -Force $TempFile

If I run this script on August 1st, 2022, I should get a file called 2022-08-01.csv

References