Sources with No Snapshots — Recovery Report

Overview (All Status Codes)

Source: no_snapshots_by_codes_02_12.json. 15,379 total sources with no snapshots, broken down by HTTP status code:

| Status Code | Count | Description |
|---|---|---|
| 404 | 7,380 | Not Found (pages removed/never existed) |
| 500 | 5,519 | Internal Server Error / scraping failures |
| 599 | 983 | Custom/timeout error |
| 415 | 386 | Unsupported Media Type |
| 403 | 297 | Forbidden (access blocked) |
| 402 | 238 | Payment Required (paywall) |
| 503 | 102 | Service Unavailable |
| 412 | 81 | Precondition Failed |
| 410 | 80 | Gone (explicitly removed) |
| 429 | 64 | Too Many Requests (rate limited) |
| 418 | 63 | I'm a Teapot (bot detection) |
| 413 | 42 | Payload Too Large |
| 401 | 36 | Unauthorized |
| 423 | 22 | Locked |
| 406 | 19 | Not Acceptable |
| 521 | 17 | Cloudflare: Web Server Down |
| 502 | 13 | Bad Gateway |
| 400 | 9 | Bad Request |
| 511 | 5 | Network Authentication Required |
| 530 | 4 | Cloudflare 1xxx Error |
| 525 | 4 | SSL Handshake Failed (Cloudflare) |
| 405 | 3 | Method Not Allowed |
| 409 | 3 | Conflict |
| 444 | 3 | Nginx No Response |
| 498 | 2 | Invalid Token |
| 522 | 2 | Cloudflare: Connection Timed Out |
| 407 | 1 | Proxy Authentication Required |
| 451 | 1 | Unavailable for Legal Reasons |
| 523 | 1 | Cloudflare: Origin Unreachable |
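
A breakdown like the one above could be produced with a short script. The following is a minimal sketch, not the actual tooling: it assumes that no_snapshots_by_codes_02_12.json maps each status code (as a string key) to a list of source entries, which this report does not confirm. The helper names `load_by_code` and `count_by_status` and the sample URLs are hypothetical.

```python
import json
from collections import Counter


def load_by_code(path: str) -> dict[str, list]:
    """Load the JSON file; assumed shape: {"404": [source, ...], "500": [...]}."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def count_by_status(by_code: dict[str, list]) -> list[tuple[int, int]]:
    """Return (status_code, count) pairs, largest count first."""
    counts = Counter({int(code): len(sources) for code, sources in by_code.items()})
    return counts.most_common()


# Toy input in the assumed shape (not real data from the report).
sample = {
    "500": ["https://example.org/x"],
    "404": ["https://example.com/a", "https://example.com/b"],
}
rows = count_by_status(sample)
for code, n in rows:
    print(f"{code}\t{n}")
```

For the real file, `count_by_status(load_by_code("no_snapshots_by_codes_02_12.json"))` would be the equivalent call, provided the file actually has the assumed shape.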

Deep Dive: Status 500 Sources (5,494 of 5,519 analyzed)

| Category | Count | % of analyzed | Recoverable? |
|---|---|---|---|
| Snapshot dir not created | 1,539 | 28.0% | Yes: scraping ran but directory issue |
| DNS temporary failure (EAI_AGAIN) | 573 | 10.4% | Yes: transient DNS, retry likely works |
| SSL/TLS errors | 240 | 4.4% | Yes: transient handshake failures |
| Connection reset/refused | 234 | 4.3% | Yes: server temporarily down |
| "client is not defined" | 219 | 4.0% | Yes: internal bug, fixable |
| Timeouts | 88 | 1.6% | Yes: retry with longer timeout |
| Server errors (HTTP 500) | 73 | 1.3% | Yes: server-side, may resolve |
| Unexpected end of file | 62 | 1.1% | Yes: partial response, retry |
| Redirect loops | 41 | 0.7% | Yes: may need config adjustment |
| DNS not found (ENOTFOUND) | 2,207 | 40.2% | No: domains are dead/gone |
| Other (detached frames, host unreachable, etc.) | 218 | 4.0% | Mixed |
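
The categories in the table above suggest a rule-based pass over the error text recorded for each status 500 source. A rough sketch of such a classifier follows; it is not the code that produced this report. Only EAI_AGAIN, ENOTFOUND, "client is not defined", and "Unexpected end of file" come from the report itself. The other substrings (ECONNRESET, ETIMEDOUT, and so on) are assumptions, and categories whose message text the report does not show (for example "Snapshot dir not created") are left out and fall through to "Other".

```python
from typing import Optional

# (category, lowercase substrings to look for, recoverable?) in priority order.
# Substrings beyond those named in the report are assumed for illustration.
RULES: list[tuple[str, tuple[str, ...], bool]] = [
    ("DNS not found (ENOTFOUND)", ("enotfound",), False),
    ("DNS temporary failure (EAI_AGAIN)", ("eai_again",), True),
    ('"client is not defined"', ("client is not defined",), True),
    ("Redirect loops", ("too many redirects",), True),
    ("SSL/TLS errors", ("ssl", "tls"), True),
    ("Connection reset/refused", ("econnreset", "econnrefused"), True),
    ("Timeouts", ("timeout", "timed out", "etimedout"), True),
    ("Unexpected end of file", ("unexpected end of file",), True),
]


def classify_error(message: str) -> tuple[str, Optional[bool]]:
    """Return (category, recoverable); recoverable is None when unknown/mixed."""
    text = message.lower()
    for category, needles, recoverable in RULES:
        if any(needle in text for needle in needles):
            return category, recoverable
    return "Other", None


# Hypothetical error messages, for illustration only.
print(classify_error("getaddrinfo ENOTFOUND example.com"))
print(classify_error("getaddrinfo EAI_AGAIN example.com"))
print(classify_error("Navigating frame was detached"))
```

Rule order matters in this sketch: the first matching rule wins, so the more specific patterns are listed ahead of the broad "ssl" and "timeout" substrings.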

All Status Codes — Recovery Classification

| Category | Status Codes | Count |
|---|---|---|
| Recoverable | | |
| Rate limited | 429 | 64 |
| Bad Gateway (transient) | 502 | 13 |
| Service Unavailable (transient) | 503 | 102 |
| Cloudflare transient | 521, 522, 523, 525, 530 | 28 |
| Timeout / custom | 599 | 983 |
| Status 500 (recoverable portion) | 500 | 3,069 |
| Recoverable subtotal | | 4,259 |
| Non-recoverable | | |
| Not Found | 404 | 7,380 |
| Gone | 410 | 80 |
| Legal takedown | 451 | 1 |
| Unauthorized | 401 | 36 |
| Payment required | 402 | 238 |
| Forbidden | 403 | 297 |
| Status 500 (dead domains) | 500 | 2,207 |
| Non-recoverable subtotal | | 10,239 |
| Unclear / config issues | | |
| Bot detection | 412, 418 | 144 |
| Unsupported media type | 415 | 386 |
| Other client errors | 400, 405, 406, 407, 409, 413, 423, 444, 498, 511 | 109 |
| Status 500 uncategorized | 500 | 243 |
| Unclear subtotal | | 882 |
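
The subtotals above can be re-derived from the per-code counts in the overview table. The sketch below is one way to run that check, assuming the grouping shown in the classification table; the set names and the `subtotals` function are illustrative. Status 500 is split using the figures from the classification table rather than by code.

```python
# Per-code counts copied from the overview table (status 500 handled separately).
COUNTS = {
    404: 7380, 599: 983, 415: 386, 403: 297, 402: 238, 503: 102, 412: 81,
    410: 80, 429: 64, 418: 63, 413: 42, 401: 36, 423: 22, 406: 19, 521: 17,
    502: 13, 400: 9, 511: 5, 530: 4, 525: 4, 405: 3, 409: 3, 444: 3, 498: 2,
    522: 2, 407: 1, 451: 1, 523: 1,
}

RECOVERABLE = {429, 502, 503, 521, 522, 523, 525, 530, 599}
NON_RECOVERABLE = {401, 402, 403, 404, 410, 451}

# Status 500 is divided by error category, not by code.
STATUS_500_SPLIT = {"recoverable": 3069, "non_recoverable": 2207, "unclear": 243}


def subtotals(counts: dict[int, int]) -> dict[str, int]:
    """Sum counts per recovery class; codes in neither set count as unclear."""
    totals = dict(STATUS_500_SPLIT)
    for code, n in counts.items():
        if code in RECOVERABLE:
            totals["recoverable"] += n
        elif code in NON_RECOVERABLE:
            totals["non_recoverable"] += n
        else:
            totals["unclear"] += n
    return totals


print(subtotals(COUNTS))
# {'recoverable': 4259, 'non_recoverable': 10239, 'unclear': 882}
```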

Summary

| Classification | Count | % of all 15,379 |
|---|---|---|
| Potentially recoverable | 4,259 | 27.7% |
| Non-recoverable | 10,239 | 66.6% |
| Unclear | 882 | 5.7% |

About 28% of all sources without snapshots are potentially recoverable. Roughly 67% are non-recoverable (domains gone, pages removed, access blocked), and the remaining ~6% are ambiguous.
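
Each percentage in the summary is the classification count divided by the reported total of 15,379, rounded to one decimal place. As a quick check (variable names are illustrative):

```python
TOTAL_SOURCES = 15_379  # total reported in the overview

classification_counts = {
    "Potentially recoverable": 4_259,
    "Non-recoverable": 10_239,
    "Unclear": 882,
}

# Share of the reported total, as a percentage rounded to one decimal place.
shares = {
    name: round(100 * count / TOTAL_SOURCES, 1)
    for name, count in classification_counts.items()
}

for name, pct in shares.items():
    print(f"{name}: {pct}%")
# Potentially recoverable: 27.7%
# Non-recoverable: 66.6%
# Unclear: 5.7%
```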

Report generated: 2026-02-12 — Based on data from no_snapshots_by_codes_02_12.json and sources_status_500.json