## 2014-05-18

### From blog posts to "html" to editable document

I received the following request: having a blog, I want to put every post altogether into a single document (a word-processor document), keeping just the title, the published date and the content. Can you help me?

I did as follow.
• Retrieve the full atom feed of the blog… Since the blog was hosted at blogspot, this link was helpful. But I had to add  “?max-results=500” to the url, since otherwise it stops at 50 posts.
• Now, it is nothing but an xml, so a proper XSLT should be enough. And in fact… I have built upon this, removing everything I didn't need and adding the published date — the date, then, was the only reason for a post processing, since I had no idea how to transform it as I wanted, therefore I have put it raw (almost raw, indeed) into the output html (generated by xsltproc), and then…
• I wrote few lines of Perl to transform every date from YYYYMMDD to “Weekday name, DD Month Name YYYY” in the generated html;
• Loaded the html into LibreOffice Writer, then exported to odt.
The result is not perfect, but mainly it's the content's fault, for it is sometimes from a Microsoft Office text (i.e. the entry was written in Microsoft Word, then copy-pasted in the blogger text editor area).

Just to keep this post longer than it could be, here the few lines of Perl code I wrote to reinterpret the dates.
#! /usr/bin/perl
use strict;
while (<>) {
if (/##(\d{8})##/) {
my $r = LC_TIME="it_IT.utf8" date -d$1 +"%A %d %B %Y";
s/##\d{8}##/$r/; } print$_;
}

In the generated html, the sequence ## was used to mark the date, extracted as YYYYMMDD (using properly substring). I had to set LC_TIME since I am used to set my locale to en_GB.utf8 (I try to keep my system consistent about the language and avoid the mixture that happens when you use locale-aware and locale-unaware softwares), but I needed italian names for week days and months.

Simply silly, but now this post can come to an end. (No, not yet: why do you ignore the export feature? since I have no access to the blog indeed, but I was able to ask for the necessary blogID).