Archiving A Yahoo Group

Verizon pulling the plug on Yahoo Groups will leave a massive memory hole in the Internet. Here's how to save your history from being lost.


I’ve been on the Internet a long time, since the early to mid 1990s. And when you are on the Internet that long, you tend to leave a pretty long trail behind you. But over the years that trail gets overgrown as sites close, lists vanish, and machines crash. There is precious little left from those early years.

One thing that has persisted to this time, despite being pretty heavily neglected over the years, is Yahoo Groups. Those who remember the first dot-com boom may remember that Yahoo Groups was not originally Yahoo Groups. It was eGroups, which Yahoo bought and merged into their own sprawling empire. eGroups basically made it possible for anyone to set up a mailing list without needing access to a listserv service.

Well, it looks like the end has finally come for Yahoo Groups. Verizon, the new owner of the rotting corpse of Yahoo, has announced that all groups will disappear on December 14th. I was on tons of mailing lists during my early Internet years, and I would really like to archive and preserve those messages if I could. But how could I get them out of Yahoo?

As it turns out, there is already an option out there for downloading content from Yahoo. Someone has kindly written a backup script to take care of the hard part of getting messages out of Yahoo.

The problem? It only appears to be able to store them in a MongoDB database. :( Not that I have anything against MongoDB (it is webscale!), but I really wanted to preserve the raw messages themselves as text data rather than storing them in a database.

Why? Well, the biggest reason is long-term stability. Will you be able to read that data out of MongoDB in 20 years? I have some files younger than that which I can no longer read. Either the program doesn’t work anymore, or isn’t supported, or can’t even be found. But text files? I can read 40-year-old text files just fine. It’s a pretty good bet that text files will be readable in another 40 years as well.

So the solution I came up with was to spin up a Docker image of MongoDB and let the script do its thing, then write another script to pull the data out of MongoDB and write it out as raw files. I decided to write both JSON of the full entry and text of the original raw email. That way I have all the Yahoo metadata if I ever need it, in an open format that should stay relatively easy to read, as well as the original raw format.
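To make the "JSON plus raw text" idea concrete, here is a minimal sketch of writing one message to disk. It is not the actual dump script. The `write_message` helper and the `_id` and `rawEmail` field names are assumptions made for illustration, and the real documents stored by the backup script may use different keys.

```python
import json
from pathlib import Path
from typing import Tuple


def write_message(message: dict, output_dir: str) -> Tuple[Path, Path]:
    """Write one message as <id>.json (full record) and <id>.txt (raw email)."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)

    # "_id" and "rawEmail" are hypothetical field names; adjust to the real schema.
    msg_id = message["_id"]
    json_path = out / f"{msg_id}.json"
    text_path = out / f"{msg_id}.txt"

    # default=str lets non-JSON types (dates, database ids) serialize as strings.
    json_path.write_text(
        json.dumps(message, indent=2, default=str, ensure_ascii=False),
        encoding="utf-8",
    )
    text_path.write_text(message.get("rawEmail", ""), encoding="utf-8")
    return json_path, text_path
```

In a real dump you would call something like this once per document while iterating over the MongoDB collection for the list.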

Setting Up The Fetch Script

The fetching script is a tad finicky, especially on macOS. Your best bet is to install Python 3.7 from Homebrew. You’re also better off doing this in a Python virtual environment, as I found out the hard way.

brew install python
python3 -m venv /tmp/yahoo-backup
cd /tmp/yahoo-backup
source bin/activate

You will also need to install ChromeDriver.

Now that you are inside a “pristine” Python environment, you can follow the instructions in the readme for the fetch script.

Before I was able to get pip to install the dependencies from requirements.txt, I also had to bump the version of pyyaml to 3.13. The pinned version would not compile on macOS; there is a bug about this that is fixed in 3.13. If you hit the same problem, edit requirements.txt after cloning and before running pip install. Doing this does not seem to impact the script.

cd /tmp
git clone git@github.com:hrenfroe/yahoo-groups-backup.git
cd yahoo-groups-backup
pip install -r requirements.txt
cp settings.yaml.template settings.yaml

Be sure to fill in your username and password in the YAML file.

Next, we need to spin up a MongoDB instance in Docker:

docker run -p 127.0.0.1:27017:27017 --name mongo -d mongo
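Before kicking off a long scrape, it can be worth checking that the container is actually accepting connections on the port published above. The following is a small sketch using only the standard library; `mongo_port_open` is a hypothetical helper name, and it only confirms that something is listening on the TCP port, not that it is a healthy MongoDB.

```python
import socket


def mongo_port_open(host: str = "127.0.0.1", port: int = 27017, timeout: float = 2.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        # create_connection raises OSError (refused, timed out, ...) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling `mongo_port_open()` with the defaults matches the `127.0.0.1:27017` mapping in the docker command.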

Once you’re ready to go, you can just run the script like so:

./yahoo-groups-backup.py scrape_messages --driver=chrome <group name>

Setting Up The Dump Script

So after you’ve let the script run for a while (and it may take a while depending on the quantity of messages, as this script seems to process them at a rate of about 40 per hour), you can dump the data to local files.

cd /tmp
git clone git@github.com:peckrob/yahoo-mongo-dump.git
cd yahoo-mongo-dump
pip install pymongo

And now to dump the files out of Mongo:

python3 dump.py --list <list name> --output <output_dir>

And it will create a directory structure of raw text and JSON files, one for each message. From there, you can zip them up for more efficient storage.
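If you would rather script the zipping step than do it by hand, a minimal sketch with the standard library looks like this. The `zip_archive` helper and the directory names in the usage note are assumptions for illustration, not part of the dump script.

```python
import shutil
from pathlib import Path


def zip_archive(output_dir, archive_base) -> str:
    """Zip output_dir into <archive_base>.zip and return the archive path."""
    src = Path(output_dir).resolve()
    # root_dir/base_dir keep the top-level folder name inside the archive.
    return shutil.make_archive(
        str(archive_base), "zip", root_dir=str(src.parent), base_dir=src.name
    )
```

For example, `zip_archive("mylist", "mylist-archive")` would produce `mylist-archive.zip` with every file stored under a `mylist/` prefix.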
