2014-05-18

From blog posts to "html" to editable document

I received the following request: “I have a blog and I would like to put every post together into a single word-processor document, keeping just the title, the published date and the content. Can you help me?”

I did as follows.
  • Retrieve the full Atom feed of the blog… Since the blog was hosted at Blogspot, this link was helpful, but I had to append “?max-results=500” to the URL, since otherwise the feed stops at 50 posts.
  • Now, it is nothing but XML, so a proper XSLT stylesheet should be enough. And in fact… I built upon this, removing everything I didn't need and adding the published date. The date was then the only reason for a post-processing step: I had no idea how to format it as I wanted from within the stylesheet, so I put it (almost) raw into the output html generated by xsltproc (a sketch of such a stylesheet follows this list), and then…
  • I wrote a few lines of Perl to transform every date from YYYYMMDD to “weekday name, DD month name YYYY” in the generated html;
  • Loaded the html into LibreOffice Writer, then exported to odt.
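
Here is a minimal sketch of the kind of stylesheet involved; this is a reconstruction, not the one actually used. It assumes the standard Atom namespace, and the ## markers are the ones consumed by the Perl script below:

    <?xml version="1.0"?>
    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:atom="http://www.w3.org/2005/Atom">
      <xsl:output method="html"/>
      <xsl:template match="/atom:feed">
        <html><body>
          <xsl:for-each select="atom:entry">
            <h1><xsl:value-of select="atom:title"/></h1>
            <!-- mark the date as ##YYYYMMDD## for the Perl post-processing -->
            <p>##<xsl:value-of select="concat(
                  substring(atom:published, 1, 4),
                  substring(atom:published, 6, 2),
                  substring(atom:published, 9, 2))"/>##</p>
            <!-- the entry content is already html -->
            <xsl:value-of select="atom:content" disable-output-escaping="yes"/>
          </xsl:for-each>
        </body></html>
      </xsl:template>
    </xsl:stylesheet>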
The result is not perfect, but that is mostly the content's fault: some entries were written in Microsoft Word and then copy-pasted into the Blogger editor area.

Just to keep this post longer than it could be, here are the few lines of Perl I wrote to reinterpret the dates.
#! /usr/bin/perl
use strict;
use warnings;

while (<>) {
    # dates are marked as ##YYYYMMDD## in the generated html
    if (/##(\d{8})##/) {
        # let GNU date do the formatting, with Italian names
        my $r = `LC_TIME="it_IT.utf8" date -d$1 +"%A %d %B %Y"`;
        chomp $r;    # drop the command's trailing newline, or it breaks the line
        s/##\d{8}##/$r/;
    }
    print;
}
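(Being a plain filter, it can be run as something like perl fixdates.pl generated.html > final.html; the script and file names here are made up, of course.)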
In the generated html, the sequence ## was used to mark the date, extracted as YYYYMMDD (using substring properly in the stylesheet). I had to set LC_TIME because I usually set my locale to en_GB.utf8 (I try to keep my system consistent about the language, avoiding the mixture you get when locale-aware and locale-unaware software run together), but here I needed Italian names for weekdays and months.

Simply silly, but now this post can come to an end. (No, not yet: why did I ignore the export feature? Because I had no access to the blog itself; I was, however, able to ask for the necessary blogID.)


2014-05-10

Particles of coroutines

Following the very same idea of the previous post, I've implemented the same stuff for x86. Don't worry about the details (I am not an x86 fan and lover), except for a few checks that the m68k version lacked: namely, it assumed the compressed stream was not corrupted. But it's just noise, not worth considering.

Intel x86 assembly instructions suck, though I admit I don't know the architecture very well; likely there is some cool feature I haven't used and don't know about which, once I learn it, will make me change my mind. Rants end.

Since x86 has not too many registers, since I've used the C library (the code is assembled with nasm and linked with gcc/ld, so the x86 C calling conventions apply), and since I wanted to avoid using special-purpose registers (ECX, ESI, EDI…) as “global” variable storage, there are extra pushes and pops to keep values alive across calls to library functions, while each coroutine assumes that the registers it is interested in are not trashed.

The register EBP can be used for “shared” (or global) storage; in fact, I've used it to store the pointer to the token buffer, and the “continuation” address.

The yield/resume feature is done with this code (kept in a macro):

    mov ebx,[ebp]
    mov dword [ebp],$+9
    jmp ebx

First the other coroutine's continuation address is loaded into EBX from [ebp]; then that word is replaced with the address of the instruction following the jmp (the mov takes 7 bytes and the jmp 2, hence the $+9); finally the jump to the address in EBX is performed.
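
Spelled out as a nasm macro it could look like this (the name YIELD is my choice; the post only says the code is kept in a macro):

    %macro YIELD 0
        mov   ebx, [ebp]        ; fetch the other coroutine's continuation
        mov   dword [ebp], $+9  ; store our own resume point (7 + 2 bytes ahead)
        jmp   ebx               ; transfer control (EBX is trashed by design)
    %endmacro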

That's all. Readers interested in the whole code can find it at this gist, but I doubt it's worth it. It'd be far more interesting to study an implementation that could be used for real, e.g. as the result of compiling high-level language code.

Different calling conventions could make it easier, but then you would need extra code to call external functions; sticking to the common C calling convention of a system is the key to accessing a lot of code without (almost) any kind of glue. Rants end, again; guess when they began.

This works fine for two coroutines. Let's reason about a third one. Does it still work? No: there is a single continuation slot. If you need to create another cooperation, e.g. between the parser which extracts tokens (a “lexical scanner”, indeed) and a grammar parser (i.e. a parser proper), you'll be stuck.

E.g. at some point our parser, instead of giving control back to got_token, needs to give control to another routine, namely the one which understands the grammar. Thus, for each coroutine pair we need a “slot” similar to the one in [ebp]. A theoretical JCONT macro would be more complex and would need to know, at the very least, which coroutine we want to give control to. E.g.

 parse:
     JCONT   parse,getc
     test    eax,eax
     ...
 .wend:
     mov     eax,TWORD
     ; the grammar_parser'd like to have ptr to buffer too
     ; ... but this could be a global, as it is
     JCONT   parse,grammar_parser
     ...

If there's a slot for each coroutine (a hashmap of slots, say), then we need to initialize it first, and then somewhere we would likely need a reset too. The macro could look, in pseudocode, something like

    get_slot_of    %2
    mov            ebx,[ebp]
    mov            dword [ebp],$+9
    store_slot_for %1
    jmp            ebx

Just an idea, at a very late hour.
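
To make the late-hour idea a bit more concrete, here is one way it could be done in nasm, with one continuation slot per coroutine rather than per pair. This is a sketch under my own assumptions: the names, the integer IDs and the table layout are all made up.

    NCOROS  equ 4
    PARSE   equ 0
    GETC    equ 1

    section .data
    ; one continuation per coroutine; each slot must be initialized
    ; with the entry point of its coroutine before the first JCONT
    slots:  times NCOROS dd 0

    section .text
    ; JCONT me, target: record where "me" resumes, jump where "target" left off
    %macro JCONT 2
        mov   ebx, [slots + %2*4]            ; fetch the target's continuation
        mov   dword [slots + %1*4], %%resume ; store our own resume point
        jmp   ebx
    %%resume:
    %endmacro

As a bonus, the nasm macro-local label (%%resume) spares the byte counting that the $+9 trick required.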

Crumbs of coroutines

Playing with handmade lexical scanners and parsers you soon discover how cool it would be if you could use coroutines; unfortunately languages like C and C++ have no such feature, nor do they have any general machinery to manage continuations, even though setjmp/longjmp can be thought of as what you need to begin with. They may not bring you to the end, though, not always at least.
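
To give a taste of the setjmp/longjmp route, here is a toy ping-pong in C (my own sketch, not code from these posts). It runs fine on common implementations, but jumping back into a function you have already jumped out of is not guaranteed by the standard, which is precisely the sense in which these functions do not bring you to the end:

    #include <setjmp.h>
    #include <stdio.h>

    static jmp_buf main_ctx, coro_ctx;

    /* a "coroutine" that yields three values back to main */
    static void coro(void)
    {
        for (volatile int i = 0; i < 3; i++) {
            printf("coro yields %d\n", i);
            if (setjmp(coro_ctx) == 0)   /* remember where to resume */
                longjmp(main_ctx, 1);    /* yield to main */
        }
        longjmp(main_ctx, 2);            /* signal completion */
    }

    int main(void)
    {
        switch (setjmp(main_ctx)) {
        case 0: coro(); break;           /* first activation */
        case 1: longjmp(coro_ctx, 1);    /* resume the coroutine */
        case 2: break;                   /* coroutine finished */
        }
        puts("main: done");
        return 0;
    }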