
2014-05-18

From blog posts to "html" to editable document

I received the following request: "I have a blog, and I want to put all the posts together into a single word-processor document, keeping just the title, the publication date and the content. Can you help me?"

I did as follows (a sketch tying the whole pipeline together appears after the list).
  • Retrieve the full Atom feed of the blog… Since the blog was hosted on Blogspot, this link was helpful, but I had to add "?max-results=500" to the URL, since otherwise it stops at 50 posts.
  • The feed is nothing but XML, so a proper XSLT stylesheet should be enough. And in fact… I built upon this, removing everything I didn't need and adding the published date. The date was the only reason for a post-processing step: I had no idea how to transform it as I wanted within the stylesheet, so I put it (almost) raw into the output HTML generated by xsltproc, and then…
  • I wrote a few lines of Perl to transform every date from YYYYMMDD to "Weekday Name, DD Month Name YYYY" in the generated HTML;
  • Loaded the HTML into LibreOffice Writer, then exported it to ODT.
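Put together, the whole pipeline could look roughly like the minimal Perl driver below; blogname, feed2html.xsl and fixdates.pl are hypothetical placeholder names, standing for the blog, the stylesheet I derived, and the date-fixing filter shown later in this post.

#! /usr/bin/perl
# Minimal sketch of the pipeline; blogname, feed2html.xsl and
# fixdates.pl are placeholder names, not the actual files.
use strict;
use warnings;

my $feed = "http://blogname.blogspot.com/feeds/posts/default?max-results=500";
system("wget", "-O", "feed.xml", $feed) == 0
    or die "cannot fetch the feed";
system("xsltproc -o posts.html feed2html.xsl feed.xml") == 0
    or die "xsltproc failed";
system("perl fixdates.pl < posts.html > posts.final.html") == 0
    or die "date fixing failed";
# last step is manual: load posts.final.html into LibreOffice Writer,
# then export to ODT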
The result is not perfect, but that is mainly the content's fault: some entries came from Microsoft Office (i.e. they were written in Microsoft Word, then copy-pasted into the Blogger editor area).

Just to keep this post longer than it could be, here are the few lines of Perl code I wrote to reinterpret the dates.
#! /usr/bin/perl
use strict;
use warnings;

while (<>) {
    if (/##(\d{8})##/) {
        # ask date(1) for the Italian long form of the marked date
        my $r = `LC_TIME="it_IT.utf8" date -d$1 +"%A %d %B %Y"`;
        chomp $r;    # backticks keep the trailing newline
        s/##\d{8}##/$r/;
    }
    print;
}
In the generated HTML, the sequence ## was used to mark the date, extracted as YYYYMMDD (using substring() appropriately in the stylesheet). I had to set LC_TIME since I usually keep my locale at en_GB.utf8 (I try to keep my system consistent about language and avoid the mixture you get when locale-aware and locale-unaware software coexist), but here I needed Italian names for weekdays and months.
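As an aside, shelling out to date(1) is not strictly necessary: Perl's own POSIX module can produce the same locale-aware names. A minimal sketch of this alternative, assuming the it_IT.utf8 locale is installed:

#! /usr/bin/perl
# Alternative sketch: format the date with POSIX strftime/setlocale
# instead of calling the external date(1) command.
use strict;
use warnings;
use POSIX qw(setlocale strftime LC_TIME);

setlocale(LC_TIME, "it_IT.utf8") or die "locale it_IT.utf8 not available";
while (<>) {
    s{##(\d{4})(\d{2})(\d{2})##}{
        # mday, 0-based month, years since 1900; strftime fills in the weekday
        strftime("%A %d %B %Y", 0, 0, 0, $3, $2 - 1, $1 - 1900)
    }ge;
    print;
}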

Simply silly, but now this post can come to an end. (No, not yet: why ignore the export feature, you ask? Because I had no access to the blog itself, but I was able to ask for the necessary blogID.)


2010-05-24

The YT case (part one)

It is a problem that periodically goes away and comes back: how can I download YouTube videos? No solution lasts forever, since YouTube keeps changing things. Maybe it does so partly to make downloading harder: we must "pass" through them, so they can better control the content (DRM!) and earn from the traffic and data we produce using their service... Even supposing we are on the dark side of believing their shameless lies (rather than on the bright side of thinking they are just taming us, turning our bodies into batteries that produce electro-money only for them), we could wish to download a video whose author released it under a permissive license.

Searching for ready-made mass solutions just gets us caught in the ad-worlds. So we search a little differently, not too much, and we find interesting sites giving real solutions. I had started my own research but dropped it once I found these working solutions (this is one of the cases where you are happy to find people smarter and more efficient than you). Nonetheless I will write down some of my research here; it could be of interest to someone, now or in the future.

But first let's see the working solutions I've tried (my target video was always the same, but I have no reason to think they should not work with other videos). Not everybody is able to use these solutions, and because of this I am working on a C# port of one of them. I hope I will finish it (I am not a C# programmer, but it's time I start to taste it a bit...).


  • GAWK solution. This was the second piece of code I tried and the first that worked. Unluckily, the Perl scripts (one-liners) by the same author (see his other post) failed. I suspect it could be just a matter of User-Agent, but I have not tested that yet. If that is so, there's an easy fix (see the sketch after this list).

  • Youtube-dl.py; this one looks cool: it seems to support several sites (despite its name), it looks well written, and the -g option can be given if one wants to use a custom downloader: I tried wget (without spoofing) and it worked!
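For the record, this is how I would check the User-Agent hypothesis; an untested sketch with LWP::UserAgent, where VIDEO_ID is a placeholder and the UA string is just an example:

#! /usr/bin/perl
# Untested sketch: does the same request behave differently with a
# browser-like User-Agent? VIDEO_ID is a placeholder.
use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;
$ua->agent("Mozilla/5.0 (X11; U; Linux i686; en-GB; rv:1.9.2) Firefox/3.6");
my $res = $ua->get("http://www.youtube.com/watch?v=VIDEO_ID");
die $res->status_line, "\n" unless $res->is_success;
print $res->decoded_content;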



And now, the part no one is interested in: my research. Read at your own risk (if you can waste your time, I suggest studying the code of the gawk or python solutions rather than reading what follows). If you want to read it, consider it a muddle of scattered thoughts; the intended audience should perhaps be a little bit computer literate.

Discontinued analysis/study of the YT case


The request URL is simply /watch?v=ID, where ID is a video_id identifying the video. This brings us to /v/ID, which through the browser sends back a compressed swf file. Disassembling the file with flasm, we see it defining a set of variables; one seems to hold the URL of the skin of the player (another swf file, http://s.ytimg.com/yt/swf/cps-vfl165272.swf at least in this case); video_id is set to the same value as ID; a variable sk holds a key, which changes every time I download the file. Other variables "mirroring" URL "parameters" may appear. Searching around, it seems the work this swf does was once done by a plain html page, in ancient ages...

At some point this swf contains code that seems to construct a URL, so I followed it a bit and wrote down the following pieces. Not so interesting after all :( A slightly more readable form of the flasm-flavoured disassembled code is

main = function ('clip') ( ... )
{
loadClip = createEmptyMovieClip ...........
r:2 = new MovieClipLoader .................
clip.addCallback .......

. (nothing interesting here...)
.
.

r:3 = clip.swf
i.e. "http://s.ytimg.com/yt/swf/cps-vfl165272.swf"
1 2 3 4
0123456789012345678901234567890123456789012

r:4 = clip.swf.split("/")[2]
gets the domain ("s.ytimg.com")

r:4 == "s.ytimg.com"
branchIfTrue label5

i.e. is the domain s.ytimg.com? Yes, in this case, so
branch

.
. (see CASE B code)
.
label5:
loadClip.loadClip(r:3, r:2)

where r:3 is clip.swf, see above
and r:2 is an instance of MovieClipLoader

}


In case the domain is not s.ytimg.com, it builds a URL; this was not the case here, but it could be interesting.

* CASE B *

r:5 = clip.swf.indexOf("-vfl")
r:5 is 29 in my case
r:6 = clip.swf.indexOf(".swf")
r:6 is 39 in my case
r:7 = clip.swf.indexOf("/swf/") + 5
r:7 is 21 + 5 (index pointing to cps-, or whatever
comes after the /swf/ part)

r:8 = "cps"
if not (r:5 > -1) then
// -vfl not found in the string clip.swf
r:8 = clip.swf.substring(r:7, r:6)
i.e. everything after /swf/, less the three letters ext
end if

r:9 = loadClip._url.split("/")[2]
r:3 = "http://" + r:9 + "/swf/" + r:8 + ".swf"
i.e. builds
http://domain of _url/swf/thing.swf
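Transcribed into Perl, my reading of the CASE B branch is the sketch below; $swf and $url stand for clip.swf and loadClip._url, and both values are made-up examples, not what the player really receives.

#! /usr/bin/perl
# My reading of the CASE B branch, transcribed into Perl.
use strict;
use warnings;

my $swf = "http://example.invalid/swf/cps-vfl165272.swf";  # clip.swf
my $url = "http://www.youtube.com/watch?v=VIDEO_ID";       # loadClip._url

my $vfl  = index($swf, "-vfl");             # r:5
my $ext  = index($swf, ".swf");             # r:6
my $name = index($swf, "/swf/") + 5;        # r:7
my $thing = "cps";                          # r:8, the default
# when there is no "-vfl" marker, keep everything between /swf/ and .swf
$thing = substr($swf, $name, $ext - $name) if $vfl == -1;

my $domain = (split "/", $url)[2];          # r:9, the domain of _url
print "http://$domain/swf/$thing.swf\n";    # the rebuilt r:3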


The interesting part seems to be when the domain is not s.ytimg.com. But there appears the _url variable of clip, which is not set anywhere here... in this case at least. The domain matches, so there is no need to have _url set, but I wonder what happens when it does not match. I suspect this is simply the wrong swf to look at. Maybe URL parameters change things, and this very same code serves other "purposes" too. It is interesting to note that this flash says

System.security.allowDomain("*")


So theoretically this swf is usable externally too; this is obvious if you think about embedding. The set of variables the code assigns is:

// the "cover" of the video
iurl = 'http://i4.ytimg.com/vi/ID/hqdefault.jpg'
el = 'embedded' // embedded where? in the default flash player?
fs = '1' // full screen
title = '...'
avg_rating = '4.7547...' // wow how many digits for the rating!
video_id = 'ID'
length_seconds = '..' // number for its length
allow_embed = '1' // interesting...
swf = 'http://s.ytimg.com/yt/swf/cps-vfl165272.swf'
// Security Key? ?
sk = 'TK755pvEYU-oGqmzRTwz7fq1dipYreRnC' // or alike
rel = '1'
cr = 'US'
eurl = ''


I am wondering what happens if allow_embed is 0 and I change it to 1 and use this as an "embedding" trampoline.

In the html page of the video (the one we get with /watch?v=ID) there are alternate addresses, serving information as JSON+oEmbed or XML+oEmbed; flying around these we can find an address to use with the RTSP protocol. I tried this road: mplayer understands the protocol, but it seems the Google RTSP server does not like it too much and drops the connection (SDP was tried too).

Once upon a time there existed a so-called get_video API; it seems to still work, but the way we can get the needed parameters is different (see youtube-dl.py with the -g option), and the parameters are different too. In the URL given by youtube-dl.py appear video_id (which is ID), t (token... could it be sk? They are not the same, it seems), eurl (null...), el (detailpage), ps (default), gl (US), hl (en); some appear in the analysed swf too. But the most important is surely the token (t).
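Judging from youtube-dl.py's -g output, the get_video URL looks more or less like the fragment below; TOKEN stands for whatever value t carries (I do not know how it is derived), and the parameter order is my guess.

# rough shape of the URL youtube-dl.py -g prints; TOKEN is unknown
my $get_video = "http://www.youtube.com/get_video"
              . "?video_id=VIDEO_ID&t=TOKEN"
              . "&eurl=&el=detailpage&ps=default&gl=US&hl=en";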

Discontinued. Youtube-dl.py works; I will look at how it acts and write something runnable on a Windows machine by people not interested in installing Python on their systems (bad, very bad).

2010-03-30

One-liners, and why to make it harder than it could be

Since I updated the kernel and the X server, my computer has become slower... new software requires new hardware to do the same things. This is a fact that has always upset me. Anyway, I abandoned KDE (unusable... and a half-mixed KDE3-KDE4!) and installed WindowMaker. I also abandoned full graphical file managers (to avoid loading the Qt libs and KDE services; being used to Konqueror, Dolphin is no comparison), tried lightweight Gtk file managers, but none satisfied me. I used Midnight Commander for a while and also tried gnu-git... But now I mainly use simply the command line, in a plain xterm of course...

So it happens that something easy and fast with a file manager becomes a little bit harder. But still, the command line, with all the GNU tools around and interpreters for languages like Perl, is powerful.

When I download pictures from the camera, I need to rename them according to their EXIF date; so I created a script (exifrename.sh, shown at the end of this post), which uses exif to extract the date and rename the file the way I want. I use it like this

exifrename.sh *.jpg |bash


The pipe is there because the script outputs the commands rather than executing them (useful to check that everything is fine before running them).

My camera creates a directory for each day, like NUMcanon, where NUM is a number not related to the day. What I want is to copy just some days, not all, onto the local hard disk, and at the same time I want to strip the "canon" part. It would maybe be easier to copy the dirs and then rename them. But less fun for me. So what I do is

for el in 101 102 105 ; do
    mkdir $el
    pushd $el
    cp -r /media/usbpen/dcim/${el}canon/* .
    popd
done


OK, not so funny but useful too... Recently I had to convert several scattered .doc files (alas, Microsoft stuff) into HTML; I used this

find . -iname "*.doc" |
(
while read -r line ;
do wvHtml "$line" "${line%.doc}.html" ; done
)


As a last example, a "one-liner" to read FASTA headers/sequences and sort them (just the headers). Don't ask why; I found the request on Yahoo Answers.

perl -e '%p = ();
while (<>) {
    if (/^>/) {
        s/^>//; chomp;
        foreach $v (split("\x01", $_)) {
            $p{$v} = 1 if !exists $p{$v};
        }
    }
}
print join(" ", sort keys %p)' <fasta.txt


A Perl guru can surely do better (and what about Perl 6?); anyway, the solution worked.
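For instance, a slightly more compact version of the same idea (same splitting on \x01, same sorted output) could be:

perl -ne 'if (s/^>//) { chomp; $p{$_} = 1 for split /\x01/ }
    END { print join(" ", sort keys %p), "\n" }' <fasta.txt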

Now we discover that there are things that can't be done simply with a graphical interface, but... These examples are taken directly from the shell history; there are two other similar examples. The history has 1045 entries. Of these, 1039 are mostly simple commands like cd, rm, cp, ls, mkdir (and sometimes mount and umount): actions that would have been a lot easier with a graphical file manager!

So the problem stated at the beginning is back: I need to keep only a few processes/programs running. ... I'm going to hate the computer business, because I can't see a reason why, to do the same things I did before, I now need more memory or a more powerful processor... or have to downgrade the software (meaning I would have to compile a lot of code myself, since the oldest packaged pre-compiled programs are already too recent, or there are too many dependency issues I don't want to cope with!)

I hope things will get better, but looking around I see things are indeed getting worse: we can say Berlusconi is the winner, and this means there's no hope of change for Italy (it seems OT, but it is not...).



Ah, the exifrename.sh code:
for el in "$@"; do
    pathpart=$( dirname "$el" )
    namepart=$( basename "$el" )
    # grab the "Date and Time (original)" row of exif's table output and
    # turn "2010:03:30 12:34:56" into "20100330-1234"
    data=$( exif "$el" |egrep "Date and Time \(orig" \
        |sed -e 's/^.*|\([0-9]\{4\}\):\([0-9][0-9]\):\([0-9][0-9]\) \([0-9][0-9]\):\([0-9][0-9]\):.*$/\1\2\3-\4\5/' )
    echo "mv \"$el\" \"$pathpart/$data-$namepart\""
done
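For example, with a hypothetical picture img_1234.jpg taken on 2010-03-30 at 12:34, the script would print the command

mv "img_1234.jpg" "./20100330-1234-img_1234.jpg"

which the final |bash then executes.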