After I finished the previously lamented server and code migration, there remained a couple of applications that I didn’t bother to transfer, because I’m the only one who uses them. Plus, it would have delayed “going live” with the new server, and prolonged my co-workers’ collective suffering.
One of these apps was a news feed generator for work’s press releases. If you’re not familiar with RSS or XML or news feeds, I’ve written about them extensively in the past; maybe those will help?
Anyway, that app was a quick-and-dirty effort that served only to cut in half the amount of copying/pasting I had to do. Instead of copying/pasting the individual pieces of text (the article’s title, its URL on the web, etc.) into the proper locations between the XML tags in the text file, I only had to copy/paste the relevant chunks of text into a web form. The app would spit out a little snippet of XML text and display it in my browser. I’d then copy/paste that snippet into the text file and upload the file to our web server.
It was convenient for a while… but it really didn’t automate anything. So it quickly earned the reviled “This Really Sucks!” badge.
I previously wrote some code that would build news feeds from the content of various websites. The Feeds of Fury—which list upcoming concerts in the D.C. area—are just the latest examples. The process is simple enough:
- Go get the HTML text used to build a web page.
- Analyze the structure of that text.
- Pick out the relevant bits.
- Drop them in the appropriate places inside an XML text document.
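Those four steps can be sketched in a few lines of Python. This is a rough illustration, not the actual Feeds of Fury code: it uses only the standard library (`html.parser` and `xml.etree.ElementTree`) so it runs without extra installs, it reads from a canned string instead of fetching a live page, and the names `LinkCollector`, `build_feed`, and `SAMPLE_PAGE`, along with the example.com URLs, are made up for the example.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class LinkCollector(HTMLParser):
    """Collect (title, url) pairs from every <a href="..."> in a page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the link we are currently inside, if any
        self._text = []     # text fragments seen inside that link

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


def build_feed(page_html, feed_title, feed_url):
    """Turn the links found in page_html into a bare-bones RSS 2.0 string."""
    # Analyze the structure and pick out the relevant bits.
    collector = LinkCollector()
    collector.feed(page_html)
    collector.close()

    # Drop them in the appropriate places inside an XML document.
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = feed_title
    ET.SubElement(channel, "link").text = feed_url
    ET.SubElement(channel, "description").text = feed_title
    for title, url in collector.links:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = title
        ET.SubElement(item, "link").text = url
    return ET.tostring(rss, encoding="unicode")


# Hypothetical page content standing in for "go get the HTML text".
SAMPLE_PAGE = """
<ul>
  <li><a href="http://example.com/shows/1">Band A at Venue X</a></li>
  <li><a href="http://example.com/shows/2">Band B &amp; Friends</a></li>
</ul>
"""

feed_xml = build_feed(SAMPLE_PAGE, "Example shows", "http://example.com/")
print(feed_xml)
```

A real page needs pickier rules than "every link on the page", which is where most of the actual work goes.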
So, I wondered, why wasn’t I doing the same thing for the work press release news feeds? All the necessary information already exists on the web, so why couldn’t I just build a couple of tidy XML text files from it? Sick of myself whining and bitching about it, I sac’d up and finally wrote the fuckin’ code.
Using Leonard Richardson’s spiffy “Beautiful Soup” Python module to help with the HTML parsing, here’s what my new shit does:
- It snags the HTML text of the web page where I list all of work’s press releases.
- From that, it gleans the dates, titles, and web URLs of the latest 12 press releases.
- Then, it snags the HTML text of each of the 12 web pages which contain the full content of the press releases.
- From those, it gets the first paragraph of each article and strips out all the HTML markup code; this will serve as the “summary” for the news feed.
- Then, it extracts the full text of the press releases—complete with HTML—which it then modifies slightly to make sure everything is valid XML.
- Next, it constructs each story’s entry using all the previously extracted chunks of text.
- Finally, it spits out the two XML news feeds (both of which validate against the standard) to the proper place on my machine, which I then synchronize with our public web server.
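The two fiddly steps in that list are the markup-free summary and getting the full HTML into the feed without breaking the XML. Here is a minimal sketch of one way to handle both. It is not the real script: it swaps Beautiful Soup for the standard library's `html.parser`, and instead of modifying the HTML it lets ElementTree escape the markup so the article rides along as plain text. The `content:encoded` element, the helper names `FirstParagraph`, `summarize`, and `build_entry`, and the sample article are all assumptions for illustration.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Assumption: full article HTML goes in the RSS "content" module's <content:encoded>.
CONTENT_NS = "http://purl.org/rss/1.0/modules/content/"
ET.register_namespace("content", CONTENT_NS)


class FirstParagraph(HTMLParser):
    """Grab the text of the first <p>...</p>, dropping any markup inside it."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._inside = False
        self.done = False

    def handle_starttag(self, tag, attrs):
        if tag == "p" and not self.done:
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "p" and self._inside:
            self._inside = False
            self.done = True

    def handle_data(self, data):
        if self._inside:
            self.chunks.append(data)


def summarize(article_html):
    """Return the first paragraph as plain text with whitespace collapsed."""
    parser = FirstParagraph()
    parser.feed(article_html)
    parser.close()
    return " ".join("".join(parser.chunks).split())


def build_entry(date, title, url, article_html):
    """Assemble one feed entry from the previously extracted pieces."""
    item = ET.Element("item")
    ET.SubElement(item, "title").text = title
    ET.SubElement(item, "link").text = url
    ET.SubElement(item, "pubDate").text = date
    ET.SubElement(item, "description").text = summarize(article_html)
    # Stored as text: ElementTree escapes <, > and & when it serializes,
    # so the embedded HTML cannot make the feed ill-formed.
    ET.SubElement(item, "{%s}encoded" % CONTENT_NS).text = article_html
    return item


# Hypothetical press release body.
SAMPLE_ARTICLE = (
    "<p>Acme today <b>announced</b> a new\n  widget &amp; gadget line.</p>"
    "<p>More details follow.</p>"
)

entry = build_entry(
    "Mon, 01 Jan 2007 09:00:00 GMT",
    "Acme announces widgets",
    "http://example.com/press/1",
    SAMPLE_ARTICLE,
)
entry_xml = ET.tostring(entry, encoding="unicode")
print(entry_xml)
```

Looping that over the 12 most recent releases and appending each entry to a channel element gives a complete feed.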
Bickity-bam; fully automated without a single copy/paste operation.