Speedrunning, for the uninitiated, is the act of playing a game as fast as possible. It's part competitive, with individuals vying for the best time, and part cooperative, with a community of runners trying to push the envelope and develop strategies together. For a more in-depth discussion of speedrunning, see this Gist by 0xabad1dea.

A major site used by speedrunners is speedrun.com, a site that acts partly as a leaderboard, and partly as a community hub. It tracks 14,236 games, boasts 67,573 users, and has 748,676 submitted runs altogether. That's a lot. NB: most of the information in this post was obtained on or around Jan. 1, 2019; specific numbers will obviously change.

This post gets a bit info-dumpy. I'll try to avoid it becoming a spreadsheet.

So a simple question is raised: how much of this history has been lost? And how much can we expect to lose? Most of the time, submitting a run goes like this:

  1. Play the game, and record it
  2. Upload recording to the internet
  3. Submit run information to speedrun.com, and link to recording as proof (speedrun.com does not host video)
  4. A moderator of the community will verify the run, and mark it as valid or invalid
  5. All the data stays intact forever

Step 5 is where we haven't quite figured things out. Sometimes, when you click on the link to the recording for a run, you get...nothing. I decided to look into how many video links have been lost to link rot over time, and how many links still work.

Before we go any further, let me stress something: I do not believe any of the people involved here should be blamed for link rot. Speedrun.com doesn't have the hosting or bandwidth to keep that much video. Speedrunners may not control whether their videos stay up or not. And few runners have enough local storage to keep backups of everything they've done (and may not realise that it's an issue). This is a systemic problem all over the internet; I've just decided to focus on one part of it.

How bad is it?

First, some numbers. And then, some context.

Out of the 748,676 runs, 643,485 (85% of total) seem to still be alive and kicking. 67,726 runs (9%) did not have any link to video proof. 35,498 runs (5%) presumably used to have a working link to video, but now get a "404 Page Not Found" response if you try to access it. The remainder have various other errors: malformed URLs, non-existent sites, or server errors.

Status       Count     Percentage
Alive        643,485   85%
No Proof     67,726    9%
Dead (404)   35,498    5%

For each run, I sent a HEAD request to the proof URL and marked the HTTP response code. If the run had no proof, or the server could not be reached (timeout, DNS error, connection refused), I called that status "0" for the sake of completeness. 3XX redirects were followed, and I also tried to automatically correct many common mistakes in URLs (misspellings of YouTube, extra characters, etc.).
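A minimal sketch of that check, using only the Python standard library, might look like the following. It is not the script used for these numbers: the function name `head_status` and the 10-second timeout are assumptions made for illustration, and the URL typo correction is left out.

```python
import http.client
import urllib.error
import urllib.request


def head_status(url, timeout=10):
    """Return the HTTP status of a HEAD request to url, or 0 if there is no answer."""
    if not url:
        return 0  # the run has no proof link at all
    try:
        request = urllib.request.Request(url, method="HEAD")
        # urlopen follows 3XX redirects by default.
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered with an error code such as 404
    except (urllib.error.URLError, http.client.HTTPException, ValueError, OSError):
        return 0  # malformed URL, DNS failure, timeout, refused connection, ...
```

HTTPError has to be caught before URLError because it is a subclass of it; otherwise every 404 would be recorded as a 0.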

For a refresher: HTTP Response Codes let a website tell you if something went wrong with the webpage: 200 means 'OK', 404 means 'Page Not Found', etc.

There were 407,761 runs submitted to YouTube, the most common site with 54% of all submissions linking there. 378,035 of these runs (92%) are available today. 29,741 return a 404 error (7%). For Twitch (the second most common site), there were a total of 239,863 runs submitted (32% of all runs). 238,980 of these runs (>99%) still work. Just 855 returned a 404 code (less than half a percent). Overall, Twitch seems to be a much more reliable site for storing speedrun video, which I didn't expect.

Site      Total Runs   Dead Links   Error Rate
YouTube   407,761      29,741       7%
Twitch    239,863      855          0.3%

It's a bit harder to figure out why videos get removed. In general, there are 4 main entities who can have a video taken down:

  1. The original uploader
  2. The hosting site
  3. Someone with a copyright claim
  4. Someone pretending to have a copyright claim

It could also be someone pretending to be the original uploader, but that seems very rare.

Twitch mostly avoids the latter 2, but doesn't (to my knowledge) automatically save videos from streams unless the streamer chooses to. YouTube unfortunately hits all 4 heavily. While YouTube does sometimes specify if a copyright holder has requested a takedown of a video, this only accounts for 539 of the 29,598 404'd videos (about 2%). The others may be copyright strikes, channel deletions, or anything really.

Why should I care?

So a further question: how "important" are these dead links? After all, there's spam and other junk submissions on the site, which are usually just rejected by a moderator and ignored. So how many of our 404'd links were verified? 28,497 out of 35,157 were verified, or 81%. Compare this to a sample of 10,000 live links, where 96% were verified. So runs with dead links are less likely to be verified, which makes sense: if a link is dead, no moderator will accept it. And if a mod rejects a submission, there's no sense in keeping the video around.

But there are still a number of verified submissions that now have dead links. Maybe the runner got a better time, and didn't care to keep old "obsolete" runs. So, let's check which verified runs with a dead link are currently the runner's PB (personal best) in a category. This was done by using the following query:

I realise that my queries are not optimal; that's not the point. Also, the data was split across 2 databases because I was lazy, so I couldn't do this in a single query like the one you just thought of.

SELECT runID
FROM Runs r1
WHERE r1.runID = ?
  AND r1.verifiedBy IS NOT NULL
  AND NOT EXISTS (
    SELECT *
    FROM Runs r2
    WHERE r2.user = r1.user
      AND r1.category = r2.category
      AND ((r1.level IS NULL AND r2.level IS NULL) OR r1.level = r2.level)
      AND r2.primaryTime < r1.primaryTime
      AND r2.verifiedBy IS NOT NULL
  );

for each 404'd run. The query returns the run only if the same user has no faster verified run in that category and level. The results: 10,375 out of the 28,497 verified 404'd runs are PBs, or about 36%.

Even more interesting is this: 1,470 dead links are for WRs, or world records (the same query as above, but without the same-user constraint). For example, the WR for the game Speedrunners on Easy links to a non-existent YouTube video.
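To make the PB check and its WR variant concrete, here is a small sketch that runs the query against an in-memory SQLite table. The column names come from the query; the helper `is_best`, the column types, and the four sample rows are made up for illustration and are not taken from the real databases.

```python
import sqlite3

PB_QUERY = """
SELECT runID FROM Runs r1
WHERE r1.runID = ?
  AND r1.verifiedBy IS NOT NULL
  AND NOT EXISTS (
    SELECT * FROM Runs r2
    WHERE r2.category = r1.category
      AND ((r1.level IS NULL AND r2.level IS NULL) OR r1.level = r2.level)
      AND r2.primaryTime < r1.primaryTime
      AND r2.verifiedBy IS NOT NULL
      {same_user}
  );
"""


def is_best(conn, run_id, same_user=True):
    """True if no faster verified run exists (same user for a PB, anyone for a WR)."""
    clause = "AND r2.user = r1.user" if same_user else ""
    row = conn.execute(PB_QUERY.format(same_user=clause), (run_id,)).fetchone()
    return row is not None


# Toy data: the schema mirrors the query, the rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Runs (runID TEXT, user TEXT, category TEXT, level TEXT,"
    " primaryTime REAL, verifiedBy TEXT)"
)
conn.executemany(
    "INSERT INTO Runs VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("a1", "alice", "any%", None, 120.0, "mod"),  # alice's older, slower run
        ("a2", "alice", "any%", None, 100.0, "mod"),  # alice's PB
        ("b1", "bob", "any%", None, 90.0, "mod"),     # bob's PB, fastest verified run
        ("c1", "carol", "any%", None, 80.0, None),    # faster, but never verified
    ],
)
dead_runs = ["a1", "a2", "b1"]  # pretend these proof links returned 404
pbs = [r for r in dead_runs if is_best(conn, r)]
wrs = [r for r in dead_runs if is_best(conn, r, same_user=False)]
print(pbs, wrs)
```

With this toy data, a2 and b1 count as PBs and only b1 counts as a WR, since carol's faster run was never verified. Note that two runs with identical times would both pass the check.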

How do I know you're not lying?

If you'd like to play around with this data, I've uploaded my 2 SQLite databases with the information, as well as some relevant Python files, to the Internet Archive. I have tried to work with the data to some extent, finding common classes of typos, etc. But there will be things I did wrong, or things I could've done differently.

I wasn't really expecting to release the data and code, so my focus was on getting the info I needed, not making it intuitive.

There is also a JSON API for speedrun.com, but it comes with rate-limiting, and won't let you do complicated post-processing.

Okay okay, but what do we actually *do* about it?

Well, if there are any runs you feel are important, you can:

Before I started crunching the numbers, I tried backing up speedruns that were on Twitch, my thinking being that Twitch VODs are more likely to go down. But it looks like runs uploaded to YouTube are the real endangered species. Even so, my 4 TB drive was nearly full after just going through 10K runs, so someone else may need to take up the reins.

As I said before, this is not the fault of anyone (well, except maybe egregious copyright-strikers on YouTube). I can't say how these numbers stack up against other parts of the web, but here's some comparison: about 50% of links cited in SCOTUS decisions are dead.

Speedrunning history is important, and the runs people submit to speedrun.com are runs that they feel are worth remembering. So hopefully, we'll figure out indefinite data storage at some point.

Miscellanea

I also got the chance to come across some other random interesting data, if that's your thing.

Someone claims to have beaten Donkey Kong Country in 1905. This has since been corrected.

About 36 runs linked to YouTube videos with a single character missing from the video URL: this run should link to this video, for instance. By brute-forcing the 64 possibilities, I recovered 24 of these 36 video URLs. The other 12 may be lost, or may never have existed.

Run wzpqwrry Run 1zqdvlxm
Run 6yj9djdz Run mr82054y
Run 0y6pe6jm Run m7p7w39m
Run yvgkjl8y Run mekk9v8m
Run y8q8q3dy Run y45rlwkm
Run 1zx7l3gy Run y67e09qm
Run z0e1grjm Run z51eopgm
Run zn8ox89z Run m3o3vpgm
Run zgn5jeny Run zxvvke8y
Run zq3gnqxy Run yvg41n8y
Run y456exkm Run yvg3pr4y
Run z113kqgz Run 8y85kn1y
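The brute-force step can be sketched roughly as follows. YouTube video IDs are 11 characters long and use a 64-character alphabet, so a known gap has 64 possible fillers. The post doesn't say where the character was missing, so this sketch assumes the final character by default; the function names and the example ID are hypothetical.

```python
import string

# The 64 characters that can appear in a YouTube video ID.
ALPHABET = string.ascii_letters + string.digits + "-_"


def candidate_ids(short_id, position=None):
    """Return every ID formed by inserting one character into short_id at position."""
    if position is None:
        position = len(short_id)  # assumption: the last character was cut off
    return [short_id[:position] + ch + short_id[position:] for ch in ALPHABET]


def candidate_urls(short_id, position=None):
    """Turn the candidate IDs into watch URLs that can each be checked in turn."""
    base = "https://www.youtube.com/watch?v="
    return [base + video_id for video_id in candidate_ids(short_id, position)]


# "abcdefghij" is a made-up 10-character ID, one character short.
urls = candidate_urls("abcdefghij")
print(len(urls))
```

Each candidate URL then needs to be requested to see which one, if any, points at a real video. If the position of the missing character were unknown, the same loop over all 11 positions would produce up to 704 candidates instead.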

Some proof URLs were obviously invalid, but still amused me. Some examples:

Here is the full breakdown of HTTP status codes received for the proof links:

Status   0        200       204   400   403     404      405   410
Count    68,299   643,485   2     18    1,038   35,498   74    2

Status   418   429   440   451   500   501   502   503   520   521
Count    70    9     5     1     61    1     26    44    11    32

And before you ask, here is the run that links to the 451 video...The Illegal Speedrun O.O

451 being the HTTP Status Code for 'Not Available for Legal Reasons'

One site that surprised me with its usage is Imgur: 16,773 runs have their evidence posted there (about 2%). But these aren't gifs of speedruns (thankfully). Instead, they're often screenshots of in-game timers. Whether this counts as proof is up to each community to decide.

The table below shows the most common origins used as evidence of a speedrun (see the full table).

Origin             Total links   Live links   Success Rate
www.youtube.com    251,007       231,534      92.2%
www.twitch.tv      235,673       234,918      99.7%
youtu.be           153,951       143,624      93.3%
imgur.com          14,253        10,817       75.9%
i.imgur.com        2,520         2,516        99.8%
m.youtube.com      2,213         1,878        84.9%
secure.twitch.tv   1,850         1,847        99.8%
drive.google.com   1,437         1,287        89.6%
twitter.com        1,385         1,184        85.5%
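A per-origin breakdown like the one above can be computed from (URL, status) pairs along these lines. This is a sketch under assumptions: it groups by host name, treats a status of 200 as "live", and the function name `origin_table` and the sample URLs are made up for illustration.

```python
from collections import Counter
from urllib.parse import urlsplit


def origin_table(results):
    """Take (url, status) pairs and return rows of (origin, total, live, success %)."""
    total, live = Counter(), Counter()
    for url, status in results:
        try:
            host = urlsplit(url).hostname
        except ValueError:
            host = None  # urlsplit rejects some malformed URLs outright
        if host is None:
            continue  # nothing usable to group by
        total[host] += 1
        if status == 200:
            live[host] += 1
    # Most common origins first, with the success rate rounded to one decimal.
    return [
        (host, count, live[host], round(100 * live[host] / count, 1))
        for host, count in total.most_common()
    ]


# Invented sample input, not real proof links.
rows = origin_table([
    ("https://www.youtube.com/watch?v=x", 200),
    ("https://www.youtube.com/watch?v=y", 404),
    ("https://youtu.be/z", 200),
    ("not a url", 0),
])
print(rows)
```

Grouping by host name is why www.youtube.com, youtu.be and m.youtube.com show up as separate rows even though they all point at the same site.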

Finally: all 22 of the runs submitted to PornHub are still up...a 100% success rate even better than Twitch 🙄