How to Use the Wayback Machine to Recover Lost Content
I lost 53 images during a site migration. The Wayback Machine CDX API got every single one back. Here is the exact process, the commands, and a free Claude skill that automates the whole thing.
53 images. Gone.
I migrated my personal site from WordPress to Astro a while back. New stack, new hosting, new everything. The content came across fine. The text was all there. But one of my older posts, a 6,000-word guide on logfile analysis for SEO, had every single image broken.
The post has 19 referring domains. It is the most linked page on my site by a distance. And it was sitting there with 53 broken image placeholders where screenshots of JetOctopus, Screaming Frog, Data Studio templates, and setup guides used to be.
The original images were hosted on Contentful’s CDN. I had long since moved away from Contentful. The WordPress site was gone. My local backups did not include the media library properly because of course they didn’t.
I needed those images back.
The obvious approach, and why it wasn't enough
Most guides about recovering content from the Wayback Machine tell you to do this:
- Go to archive.org
- Paste in the URL of the page you want to recover
- Pick a date from the calendar
- Browse the archived snapshot
That works. Sort of. I found my logfile analysis post archived from 2022. I could see the page. But the images in the archived version were still loading from Contentful’s CDN, not from the Wayback Machine itself. The archive had captured the HTML but the image src attributes still pointed to the original CDN URLs.
So the question became: are those CDN URLs still alive?
I checked a few manually. Some were. Contentful doesn’t aggressively purge CDN assets even after you close your account. But some were returning 404s. And I had 53 of them to check.
This is where clicking around in a browser stops being practical.
The CDX API
The Wayback Machine has an API called the CDX API. It is not hidden exactly, but you will not find it mentioned in any of those “How to Use the Wayback Machine” articles that rank on the first page of Google. They are all written for people who want to look up what a website looked like in 2015. Which is fine.
But it is not what I needed.
The CDX API lets you search the entire Wayback Machine index programmatically. You can find every snapshot of a URL, filter by status code, filter by file type, deduplicate by content hash, and get back structured data you can actually work with.
The base URL is:
https://web.archive.org/cdx/search/cdx
Here is a real example. This is the command I ran to find every unique page archived under my blog path:
curl -s "https://web.archive.org/cdx/search/cdx?url=suganthan.com/blog/*&output=json&fl=timestamp,original,statuscode,mimetype&filter=statuscode:200&collapse=urlkey&matchType=prefix"
That returns JSON. The first row is headers, every subsequent row is a snapshot. Each one has a timestamp, the original URL, the HTTP status code, and the content type.
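If you want to turn that response into something scriptable, jq handles the header-row quirk in one expression. A minimal sketch, assuming jq is installed and using a canned response rather than a live query (the URLs and timestamps below are made up):

```shell
# Canned CDX JSON response (illustrative data, not a real query result):
# the first row is the header, each later row is one snapshot.
cat > cdx_sample.json <<'EOF'
[["timestamp","original","statuscode","mimetype"],
["20220815093012","https://example.com/blog/post-a/","200","text/html"],
["20230102141530","https://example.com/blog/post-b/","200","text/html"]]
EOF

# Skip the header row and print "timestamp url" pairs, one per line.
jq -r '.[1:][] | "\(.[0]) \(.[1])"' cdx_sample.json
```

Piping the real curl output into the same jq filter gives you a clean list you can feed straight into a download loop.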
The parameters that matter:
| Parameter | What it does |
|---|---|
| url | The URL to search. Supports wildcards with * |
| output | Response format. Use json |
| fl | Which fields to return |
| filter | Filter results, e.g. statuscode:200 for successful pages only |
| collapse | Deduplicate. digest gives unique content, urlkey gives unique URLs |
| matchType | exact, prefix, host, or domain |
| limit | Cap the number of results |
| from / to | Date range in YYYYMMDD format |
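These parameters compose freely. As an illustration (example.com is a placeholder, no request is actually made here), a date-bounded, deduplicated prefix query assembles like this:

```shell
# Assemble a CDX query URL piece by piece (no request is sent).
base="https://web.archive.org/cdx/search/cdx"
query="url=example.com/blog/*"
query="${query}&output=json&fl=timestamp,original,statuscode"
query="${query}&filter=statuscode:200&collapse=urlkey&matchType=prefix"
query="${query}&from=20200101&to=20231231&limit=100"

# The finished URL, ready to hand to curl.
echo "${base}?${query}"
```

Building the query string in stages like this makes it easy to comment parameters in and out while you narrow down what you need.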
Finding the images
For my logfile analysis post, I needed to find what images the Wayback Machine had captured. The images were on Contentful’s CDN, not on my domain. So I had two places to look.
First, I grabbed the archived HTML of my post:
curl -s "https://web.archive.org/web/20220815id_/https://suganthan.com/blog/logfile-analysis-for-seo/" > archived_page.html
See that id_ after the timestamp? That is critical. Without it, the Wayback Machine injects its own toolbar and JavaScript into the response. With id_, you get the raw original HTML. This matters enormously when you are parsing the output or downloading binary files like images. Every “how to download from wayback machine” guide that skips this detail is setting you up for corrupted files.
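If you already have a standard snapshot URL, converting it to the raw form is just a matter of inserting id_ after the 14-digit timestamp. A small sketch (the URL is a placeholder):

```shell
# A standard snapshot URL: /web/, a 14-digit timestamp, then the original URL.
std="https://web.archive.org/web/20220815093012/https://example.com/page/"

# Insert id_ immediately after the timestamp to get the raw-content form.
raw=$(printf '%s\n' "$std" | sed -E 's#(/web/[0-9]{14})/#\1id_/#')
echo "$raw"
```

The regex only matches a slash preceded by exactly 14 digits after /web/, so the slashes inside the original URL are left alone.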
Then I extracted all the image URLs from the HTML:
grep -oP 'src="[^"]*\.(png|jpg|jpeg|gif|webp|svg)"' archived_page.html | sed 's/src="//;s/"//' | sort -u
That gave me a list of 53 image URLs, all pointing to Contentful’s CDN.
The recovery
Here is the part that surprised me. Most of those Contentful CDN URLs were still live. The images were still being served even though my Contentful account was long gone. CDN providers generally do not purge assets aggressively. If the URL is not actively deleted, it tends to stick around.
I wrote a quick script to check each one:
while IFS= read -r url; do
  # HEAD request; grab the status code from the first response line.
  # tr strips the trailing carriage return curl leaves on header lines.
  status=$(curl -sI "$url" 2>/dev/null | head -1 | awk '{print $2}' | tr -d '\r')
  echo "$status $url"
done < image_urls.txt
48 of the 53 returned HTTP 200. I downloaded those directly from the CDN. Faster, better quality, no Wayback Machine rate limiting to worry about.
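Splitting the status report into a live list and a dead list makes the two download paths mechanical. A sketch using made-up URLs in place of the real report:

```shell
# Sample output from the status-check loop above (illustrative URLs).
cat > status_report.txt <<'EOF'
200 https://images.ctfassets.net/abc/one.png
404 https://images.ctfassets.net/abc/two.png
200 https://images.ctfassets.net/abc/three.png
EOF

# Live URLs get downloaded straight from the CDN; everything else
# goes to a fallback list for the Wayback Machine.
awk '$1 == 200 {print $2 > "live_urls.txt"; next} {print $2 > "dead_urls.txt"}' status_report.txt
```

After this, live_urls.txt feeds a plain curl loop against the CDN and dead_urls.txt feeds the CDX fallback described next.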
For the remaining 5, I fell back to the Wayback Machine. The CDX API found archived versions of each one:
curl -s "https://web.archive.org/cdx/search/cdx?url=images.ctfassets.net/path/to/image.png&output=json&fl=timestamp,original&filter=statuscode:200&limit=1"
Then downloaded them using the raw URL with id_:
curl -s "https://web.archive.org/web/20220501120000id_/https://images.ctfassets.net/path/to/image.png" -o recovered_image.png
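Those two steps chain together naturally: take the timestamp from the CDX response and splice it into the id_ URL. A sketch using a canned CDX response instead of a live request, assuming jq is available:

```shell
# Canned CDX response for one dead image URL (illustrative data).
cdx='[["timestamp","original"],["20220501120000","https://images.ctfassets.net/path/to/image.png"]]'

# First data row: snapshot timestamp and the original URL.
ts=$(printf '%s' "$cdx" | jq -r '.[1][0]')
orig=$(printf '%s' "$cdx" | jq -r '.[1][1]')

# Build the raw id_ download URL; fetching it is the curl step shown earlier.
echo "https://web.archive.org/web/${ts}id_/${orig}"
```

In the real pipeline you would loop over dead URLs, run the CDX query with limit=1 for each, and pipe its output through the same jq extraction.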
All 53 images recovered. Every single one. The post is back to its full state with all screenshots, Data Studio template slides, setup guides, and tool comparisons intact.
When this matters for SEO
This is not just about recovering images from one blog post. The same approach works for any content recovery scenario:
Reclaiming backlink equity. If you have pages that earned backlinks and those pages broke during a migration, the links are still pointing at your domain. The authority is still flowing. But it is hitting a 404. Recover the content from the Wayback Machine, republish it at the same URL, and that link equity starts working for you again instead of leaking into nothing.
Competitor research. Competitors delete pages for all sorts of reasons. Product pivots, rebrands, legal issues. The Wayback Machine often has copies. If a competitor’s guide that used to rank well suddenly disappears, the CDX API can tell you when the last good snapshot was captured.
Content audits after CMS migrations. Every CMS migration loses something. URLs change, images break, internal links rot. Running the CDX API against your own domain before and after a migration gives you a complete inventory of what existed and what survived.
Recovering content from domains you acquire. If you buy an expired domain for its backlink profile, the Wayback Machine is how you figure out what content earned those links in the first place. Republish something genuinely useful at those URLs and the inbound links become valuable again.
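For the migration-audit case in particular, once you have a sorted inventory of archived URLs and a sorted list of what is live now, comm shows exactly what went missing. A sketch with made-up URL lists:

```shell
# URLs the CDX API reports existed pre-migration (made-up, sorted).
cat > archived_urls.txt <<'EOF'
https://example.com/blog/a/
https://example.com/blog/b/
https://example.com/blog/c/
EOF

# URLs still live after the migration, e.g. from the new sitemap (made-up, sorted).
cat > live_now.txt <<'EOF'
https://example.com/blog/a/
https://example.com/blog/c/
EOF

# comm -23: lines only in the first file, i.e. content lost in the move.
comm -23 archived_urls.txt live_now.txt
```

Both inputs must be sorted for comm to work; a `sort -u` on each list before the comparison is cheap insurance.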
Automating the whole process
After doing this manually for my logfile analysis post, I realised this is exactly the kind of repeatable workflow that should be automated. So of course I built a Claude Code skill for it.
If you are not familiar with Claude skills, they are instruction files that give Claude specialised knowledge and methodology for specific tasks. You drop the file in the right directory, and whenever you ask Claude to do something that matches the skill’s description, it automatically uses the right approach.
The Wayback Machine skill I built covers:
- CDX API discovery with all the parameters, filters, and deduplication options
- Raw content retrieval using the id_ suffix (so you never get corrupted downloads)
- Image recovery workflow that checks original CDN URLs first before falling back to Wayback
- Bulk download scripts with rate limiting to avoid getting throttled
Here is how you install it:
~/.claude/skills/wayback-machine/
└── SKILL.md
Create that directory and drop the SKILL.md file in it. Then the next time you ask Claude to recover content from an old site, find archived versions of a URL, or download images from the Wayback Machine, it knows exactly what to do.
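The setup itself is two commands (the cp line assumes you have already saved the skill file locally):

```shell
# Create the skill directory if it does not exist yet.
mkdir -p "$HOME/.claude/skills/wayback-machine"

# Then copy in the skill file you saved:
# cp SKILL.md "$HOME/.claude/skills/wayback-machine/SKILL.md"
ls -d "$HOME/.claude/skills/wayback-machine"
```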
Get the skill
The full skill file is below. Save it as SKILL.md in the directory path above.
---
name: wayback-machine
description: "When the user wants to recover, retrieve, or view archived web
  pages, images, or content from the Wayback Machine or Internet Archive. Also
  use when the user mentions 'Wayback Machine,' 'web archive,' 'archived
  version,' 'old version of a page,' 'recover deleted content,' 'CDX API,'
  'Internet Archive,' 'web.archive.org,' or 'recover images from old site.'
  This skill covers the CDX API for finding snapshots, downloading archived
  pages and assets, and bulk recovery of content and images from archived URLs."
metadata:
  version: 1.0.0
---
# Wayback Machine Content Recovery
You are an expert at using the Internet Archive's Wayback Machine APIs to find,
retrieve, and recover archived web content.
## Core APIs
### CDX API (Finding Snapshots)
Base URL: https://web.archive.org/cdx/search/cdx
Key parameters:
- url: Target URL (supports wildcards, e.g. example.com/blog/*)
- output: json
- fl: Fields to return (timestamp, original, statuscode, mimetype)
- filter: e.g. statuscode:200 or mimetype:image/.*
- collapse: digest (unique content) or urlkey (unique URLs)
- limit: Max results
- from/to: Date range in YYYYMMDD format
- matchType: exact, prefix, host, or domain
### Raw Content Retrieval
Standard URL (includes Wayback toolbar):
https://web.archive.org/web/{timestamp}/{url}
Raw URL (no toolbar, essential for programmatic use):
https://web.archive.org/web/{timestamp}id_/{url}
Always use id_ when fetching content programmatically. Without it, binary files
get corrupted and HTML gets polluted with Wayback JavaScript.
### Image Recovery Workflow
1. Fetch archived page HTML using id_ URL
2. Extract image src URLs from the HTML
3. Check if original CDN URLs are still live (they often are)
4. Download live ones directly from CDN
5. For dead URLs, query CDX API for archived versions
6. Download from Wayback using id_ URL
### Rate Limiting
Add sleep 1 between bulk downloads. The Wayback Machine will throttle
aggressive requests.
Once installed, you can say things like “recover the images from this archived page” or “find all snapshots of example.com/blog/” and Claude will use the CDX API correctly, including the id_ suffix that most people miss.
What I learned
The Wayback Machine is one of those tools where the surface level use (paste a URL, pick a date, look at a snapshot) is about 10% of what it can actually do. The CDX API turns it into a proper data source for bulk operations.
Three things worth remembering:
CDN assets persist longer than you think
Contentful, Cloudinary, imgix, and similar CDN providers rarely purge assets immediately after account closure. Always check the original URL before resorting to the Wayback Machine. It is faster and the quality is better.
The id_ suffix is non-negotiable
Every corrupted image download and every HTML file full of injected Wayback Machine JavaScript traces back to forgetting those three characters. Treat it as the only correct way to fetch archived content programmatically.
Wildcard searches are powerful
Instead of checking URLs one by one, example.com/blog/* with collapse=urlkey gives you a complete inventory of every unique URL the Wayback Machine ever captured under that path. It is the fastest way to understand what content existed on a domain at any point in history.
53 images. All recovered. The post that earns the most backlinks on my site is whole again.
The skill is yours if you want it. Download it here.