ShinTakezou's Blog: 2010-05

2010-05-27

About a gfortran bug

Few weeks ago I have marked as incorrect an implementation of a task on RosettaCode in Fortran. It has struck me since it appeared elegant and short, so I wanted to run the code and... alas obtained a wrong result (AAAA). So without further analysing the code I have marked it as incorrect.

Yesterday I was looking for recent activities on RosettaCode and task to work on, when noticed that my "mark" was removed and in the talk page they've shown a gfortran bug... Unbelievable, I use gfortran to ru my computation, such a bug, as interpreted there, was horrorful!
(Even though recent gfortran versions fixed it).

Further investigation has shown that the bug is more subtle but less horrorful; first let us see how the bug manifests itself with the guilty fragment of code.


  character(2), dimension(2), parameter :: list = (/ 'ab', 'ba' /)
  integer :: i

  i = 1
  write(*, *) list(:)(1:1) == list(1)(1:1)  ! ref. (A)
  write(*, *) list(:)(1:1) == list(1)(i:i)  ! ref. (B)

We expect the same output, but we get something different! So at a glance the bug is in how gfortran compute expression with variables in slices, like if it does not grab the right value of i.

The problem is instead another. When constant slices of a known (at compile time) parameter array are requested, the slices are indeed computed at compile time. So that where there's no expression to be evaluated (even the simple i), the result is computed at compile time and no bug is shown.

Compiling with -S option allows us to study assembly output (on x86 machine):


 .section .rodata
 .align 4
 
 .type A.2.545, @object
 .size A.2.545, 8
A.2.545:
 .long 1
 .long 0
 .align 4

This is a piece of the readonly data section produced compiling a program containing only the line (A). What are those .long 1 and .long 0. Without breaking my head into debugging for real the code, I've suspected they are the result computed at compile time (True, False) and to verify it I simply changed 1 to 0, assembled/linked and run the code to obtain F F as output; if I change it into 1 1, I obtain T T, quod erat demonstrandum.

So the bug does not manifest itself since it is a runtime bug (maybe; we shall see it is not exactly so). At compile time, slicing is done correcly and so the comparing.

Let us see the code produced when only line (B) is in use:


.Ltext0:
 .section .rodata
.LC0:
 .ascii "a"
.LC1:
 .ascii "b"
 .align 4
 
 .type A.2.545, @object
 .size A.2.545, 8
A.2.545:
 .long .LC0
 .long .LC1
 .align 4

Now the result (True and False values, i.e. 1 and 0) is missing since list(1)(i:i) must be computed at runtime. Nonetheless, list(:)(1:1) can still be computed at compile time and in fact it is: label A.2.545 point to pointers to two strings, and the "object" represents the array (/ "a", "b" /) which is the result of evaluation of list(:)(1:1).

Now list(1)(i:i) must be computed at runtime and the comparison must be carried at runtime, in fact later we find the fragment:

 
 movl A.2.545(,%eax,4), %eax
 movl %edx, 12(%esp)
 movl $1, 8(%esp)
 movl %eax, 4(%esp)
 movl $2, (%esp)
 call _gfortran_compare_string

which is missing from the case (A) (since de facto the comparison is done all at compiletime, and only the output routine appears). The problem is somewhere here or there... Again, since I am lazy, before going deep to debug the code let me see if there's a simpler way to grab a spark of enlightening.

Analysis I've done before and that you can find on talk page, made me think that the problem could be related someway with how Fortran handles "strings". I say, it is not so good at:D! They are indeed characters of given length and this is fixed. Let's see some code to understand better:

  character(2)  :: a
  character(3)  :: b
  character     :: c

  c = 'a'
  write(*,*) len(a)
  b = 'cd'
  write(*,*) len(b)
  a = 'b'
  write(*,*) len(c)
  write(*, "(A,'|',A,'|',A,'|')") a, b, c

The output of this code (reformatted) is

2
3
1
b |cd |a|

And it shows that in Fortran the length of the "string" is not determined by the real "string" we put in but by the size of the container. Moreover, unused room is padded with spaces (we need trim to remove the padding spaces). Nonetheless, comparing functions seem to consider this oddity and

'a' == 'a    '

is True! And also


a = 'a '
b = 'a  '
write(*,*) a == b

gives of course True, which is odd (indeed, considered wrong!) in any other language I know. So comparing function seems to ignore trailing spaces. But it sometimes could make some confusion and how trailing spaces are really ignored? With this awareness/doubt in mind, I modified


.LC0:
 .ascii "a"
.LC1:
 .ascii "b"
 .align 4

into


.LC0:
 .ascii "a "
.LC1:
 .ascii "b "
 .align 4

Then I "assembled and linked" (sounds cool, indeed I've just written gfortran modifiedcode.s!), run the code and BINGO the right output appears!! So now we think that the problem is at compiletime, since a trailing space is missing... But wait we've said trailing spaces are ignored so the comparing function should not consider them for real! Counterexample: let us add two spaces instead of one, to have "a ", or more, to have "a "... The right result does not change!¹

So, it seems at least one space must be present. Is this a bug of the comparing function, or a bug in the code producing the static strings? Remember that those strings are the representation of the array (/ "a", "b" /) obtained with list(:)(1:1). Since list is an array of character(2), we could expect that the slice inherits the char(2) size. Maybe. But how it should be implemented in code? Maybe padding with spaces for real... But if this a requirement, then it means the comparing function is not buggy. If it is not a requirement and the rank of the "content" of a volatile object like list(:)(1:1) needs no to be inherited, then the string comparing function seems buggy... But why then it works when used "directly", in code as 'a' == 'a '?

Something is missing still.

Let us do another test. Let us "copy" the object list(:)(1:1) in another array, i.e. add this code


  character(1), dimension(2) :: t

  t = list(:)(1:1)
  print *, t == list(1)(i:i)

Now the result is correct!! So the problem as stated before is not in the usage of an expression (simply i:i in this case) but it is subtly bound to how object are evaluated at compiletime and stored in the produced code. Copying in a non "static" array fix the problem!

So another way to fix the problem should be to avoid that list(:)(1:1) is computed at compiletime and (we're supposing currently) stored "badly" in a way that confund the comparing function... I.e. we use list(:)(i:i) too... And in fact, doing so, fix the problem too! We're nearer.

The "static" word used before made me think about the parameter attribute... What if we remove it... Let's try... and wow, without it, everything is fine, always!! So the bug is in how parameter object are handled. At this point I am curious to know how the bug was fixed in recent version of gfortran!

But first I take a look at what is changing when copying list(:)(1:1) in a "dimension" made of character(N). (Changing N from 1 to whatever, does not change the output, which is always right).
When doing so, in the readonly data section we find only the representation of the initial array.


 .type list.544, @object
 .size list.544, 4
list.544:
 .ascii "ab"
 .ascii "ba"

The rest is done at runtime (slicing, copying and so on). And so, everything works.

Now let us see what changes when the parameter attribute is removed. What can be noticed is that now everything is done at runtime, always. It is like that: parameter object are not supposed to change, so things done with them, when possible, can be computed at compile time (if there's an expression that must be computed, then the operations on a "parameter" array are done "deferred" at runtime). If no parameter attribute is given, the object may change at runtime, so the compiler does not try to do anything at compiletime.

So now we are more convinced that the bug is in the part of gfortran dealing with compile time "optimization" (for "immutable" objects). To let things work, we can remove the parameter attribute, or force the compiler to operate at runtime, e.g. using an expression that must be evaluated at runtime... Or we must be sure everything is computed at compiletime... Basically, if we mix things evaluated at compile time with things that must be evaluated at runtime, the bug shows up.

Note 1: look better at the code calling the comparing function... we see it appears a 2... hmm, what if this is the length of the string to be compared? Let's change


 movl $2, (%esp)
 call _gfortran_compare_string

into


 movl $1, (%esp)
 call _gfortran_compare_string

keeping "a" and "b" without spaces... Let's assembly and run and... bingo again. The fortran compare string function gets the length of the string to be compared as argument... So the bug is in here, since it should be $1 instead of $2... or in the static data... or ... hmmm

2010-05-24

The YT case (part one)

It is a problem that goes and returns back periodically: how can I download YouTube videos? It's not a problem that has a forever solution since YouTube changes things. Maybe it does so also to make it harder to download videos: we must "pass" through them, so they can control contents better (DRM!) and earn through traffic and data we produce using their service... Even supposing we're on the dark side of believing their shameless lies (rather than on the bright side of thinking they are just taming us turning our bodies in batteries that produce electro-money only for them), we could wish to download a video the author released after some permissive license.

Searching for already-made mass solutions we are just caught in the ads-worlds. So we seek a little bit differently, not too much, and we find interesting sites giving real solutions. I've started my alone researches but dropped them since I've found these working solutions (this is one of the cases you're happy to find people smarter and more efficient than you). Nonetheless I will write some of my researches here, they could be of interest for someone now or in the future.

But first let's see working solutions I've tried (my video target was always the same but I've no reasons why to think they should not work with other videos). Not everybody is able to use these solution and because of this I am working on a C# porting of one of these. Hope I will finish it (I am not a C# programmer, but it's time I start to taste it a bit...)

GAWK solution. This was the second found code I've tried and the first to work. Unluckly perl scripts (one-liners) by the same author (see his other post) failed. I suspect it could be only because of the User-Agent, but I've not done tests yet. If it would be so, there's an easy fix.

Youtube-dl.py; this one looks cool, it seems to support several sites (despite its name), it looks well-written, and the -g option can be specified if one wants to use his/her custom downloader: I've tried wget (without spoofing) and worked!

And now, the part no-one is interested in: my researches. Read at your own risk (if you can waste your time I suggest studying the code of the gawk or python solutions, rather than reading what follows). If you want to read it, consider it as a muddle of scattered thoughts; possible audience maybe should be a little bit computer literate.

Discontinued analysis/study of the YT case

Request URL is simply /watch?v=ID where ID is a video_id identifying the video. This brings us to /v/ID, through the browser it sends back a compressed swf file. Disassembling the file with flasm we see defining a set of variables; one seems to hold the URL of the skin of the player (another swf file at http://s.ytimg.com/yt/swf/cps-vfl165272.swf at least in this case); video_id set to the same value of ID; a variable sk holds a key, it changes every time I download the file. They may appear other variables "mirroring" URL "parameters". Searching it seems like the work this swf file does was indeed done by a simple html file in ancient ages...

At some point this swf contains code that seems to construct a URL, so I followed it a bit and wrote the following pieces. Not so interesting after all :( A a little bit more readable form of the flasm-flavoured flash disassembled code is

main = function ('clip') ( ... )
{
  loadClip = createEmptyMovieClip ...........
  r:2 = new MovieClipLoader .................
  clip.addCallback .......
  
  . (nothing interesting here...)
  .
  .
  
  r:3 = clip.swf
        i.e.  "http://s.ytimg.com/yt/swf/cps-vfl165272.swf"
                         1         2         3         4
               0123456789012345678901234567890123456789012
  r:4 = clip.swf.split("/")[2]
        gets the domain ("s.ytimg.com")
   
  r:4 == "s.ytimg.com"
  branchIfTrue label5

        i.e. is the domain s.ytimg.com? Yes, in this case, so
        branch
  .
  . (see CASE B code)
  .
label5:
  loadClip.loadClip(r:3, r:2)

        where r:3 is clip.swf, see above
          and r:2 is an instance of MovieClipLoader
}

In case the domain is not s.ytimg.com, it builds an URL; this is not the case but it could be interesting.

* CASE B *

r:5 = clip.swf.indexOf("-vfl")
          r:5 is 29 in my case
r:6 = clip.swf.indexOf(".swf")
         r:6 is 39 in my case
r:7 = clip.swf.indexOf("/swf/") + 5
         r:7 is 21 + 5 (index pointing to cps- or whatever
                comes after /swf/ part
r:8 = "cps"
if not (r:5 > -1) then
     // -vfl not found in the string clip.swf
     r:8 = clip.swf.substring(r:7, r:6)
           i.e. everything after /swf/, less the three letters ext
end if

r:9 = loadClip._url.split("/")[2]
r:3 = "http://" + r:9 + "/swf/" + r:8 + ".swf"
      i.e. builds
         http://domain of _url/swf/thing.swf

The interesting part seems to be when the domain is not s.ytimg.com. But it appears the _url variable of clip, which is not set nowhere here... in this case at least. The domain matchs so no need to have _url set, but I wonder when it does not match. I suspect indeed this one is the wrong swf to look for. Maybe URL parameters may change things and this very same code serves other "purposes" too. Interesting to note that this flash says

System.security.allowDomain("*")

So theoretially this swf is usable also externally; this is obvious thinking about embedding. The set of variables the code assigns is:

// the "cover" of the video
iurl = 'http://i4.ytimg.com/vi/ID/hqdefault.jpg'
el   = 'embedded'  // embedded where? in the default flash player?
fs   = '1'         // full screen
title = '...'
avg_rating = '4.7547...'  // wow how many digits for the rating!
video_id = 'ID'
length_seconds = '..' // number for its length
allow_embed = '1'     // interesting...
swf = 'http://s.ytimg.com/yt/swf/cps-vfl165272.swf'
// Security Key? ?
sk  = 'TK755pvEYU-oGqmzRTwz7fq1dipYreRnC' // or alike
rel = '1'
cr  = 'US'
eurl = ''

I am wondering what happens if allow_embed is 0 and I modify it into 1 and use this as "embedding" trampoline.

In the html page of the video (the one we get with /watch?v=ID there are alternate addresses, serving informations in JSON+OMBED or XML+OEMBED, oembed, flying around these we can find an address to use with the RTS Protocol, tried this road, mplayer understand the protocol, but it seems the Google RTSP Server does not like too much it and stop the connection. (SDP used too).

Once upon a time it existed a so called get_video API, it seems to work still but it is different the way we can get the needed parameters (see youtube-dl.py with -g option), which are different too. In the URL given by youtube-dl.py appear video_id (which is ID), t (token... could it be sk? They are not the same, it seems), eurl (null...), el (detailpage), ps (default), gl (US), hl (en); some appears in the analysed swf too. But the most important is for sure the token (t).

Discontinued. Youtube-dl.py works, I'll look how it acts and write something runnable on Windows machine by people not interested in installing python on their system (bad very bad).

2010-05-16

If you know it, you can...

I confess it: I often creep in forums and "ask your question" sites, sometimes I add my answers to solve a problem or manifest an idea (and often I think it's the best answers of course). Mostly it's boring but it is also a cheap pastime and I find it relaxing.

Few hours ago someone asked help for the following problem: I have a JPG file N kbytes long, and I want to make it M kbytes long (M > N). A typical trait of these questions is an almost total inability of the asker to express what they want in a non-ambiguous way.

The already written answers talked about increasing JPG quality and/or resizing image's width and height; these actions have as consequence to make the file bigger.

The common answerer focused on the fact that the file was an image (that can be resized) in a specific format the "quality" can be set of. Since the asker talked about file sizes, I focused on the file instead and swifly thought of cat and dd: dd can be used to "cut" a stream at the desired size, while cat allows the creation of a stream bigger than M. Since the simplest way to make a file bigger is to pad it with zeros, I thought of /dev/zero too, and wrote the line

cat file.jpg /dev/zero |
   dd of=modified.jpg bs=1 count=800k

if we want the final file to be 800k bytes (800*1024) long. Neat and easy, isn't it? At this point someone can ask: but this way the JPG gets corrupted while the other answers have the merit not to waste it! And at this point it is useful a little bit of knowledge about JPG files: you can append data to a well-terminated JPG file without problem, since decoders stop to a marker which says the JPG file is finished. If you put data after the stop-marker, you do no harm.

The story taught me that a little bit of knowledge about file formats and classical "*nix" tools (in their GNU versions in my case) can solve a lot of problems easily, and that who does not know these things have as only option the use of softwares that do the job as needed; if there are not such softwares then you've not bricks you can build the solution with. If you don't have a bigger picture of what computers are for, your only idea will be to load you preferred image manipulation software and try to match the file size changing "parameters", and maybe you'll get it after several trial-error cycles... only to notice that now it looks different and you don't want it to look so different!!

So the suggestion is read read read read and learn learn learn learn (and install GNU coreutils:D)

And now a brief explanation of what the line does. cat A B concatenate two streams (e.g. two files) one after the other. The /dev/zero can be considered as a "virtual" infinite file made of all zero bytes. For this reason we need dd to select the right amount of bytes from the stream, copying to "output file" only the quantity we want (see e.g. this Wikipedia link about dd).

2010-05-10

PDF (Portable Document Format)

Few days ago I've found PDFs of interesting books (basically the same), old (and marked obsolete!) editions: Numerical Recipes in C, Numerical Recipes in Fortran 77, Numerical Recipes in Fortran 90, obsoleted by Numerical Recipes in C++ (not downloadable).

These PDFs are encrypted through a hateable system, handled by the FileOpen plugin (they have a site and purchase their system to "protect" PDF). You have also to install the plugin, and the only usable viewer will be the one that can use the plugin... as far as I know, only the original Adobe Reader (and related products).

I understand, but I am absolutely adverse to, the wrong idea behind DRM (basically it's what FileOpen is about). The dangers in that idea can be discussed longly, and we also can remember this funny comics about DRM in music. But this is not the major point for FileOpen for these PDFs.

The problem here is that in order to do its job, the plugin needs to connect to somewhere, and this means that a working connection is always needed to read those PDFs! Thus the P in PDF becomes untrue, since I can't "port" those files everywhere: if there's no a working internet connection, those files become nothing but a bunch of unusable data.

Some friends of mine dream about the future: computers are terminals, the operating system being just a life support for a browser, and "in" it you do everything (cloud computing for the whole set of what a computer could be and do). This would work until there's a (fast!) connection everywhere. Otherwise the "terminal" would become really unuseful hardware only.

I hate this idea (~~cloud computing~~ software as services for everything and for masses; it can't be "instead of", while it's ok to be "altogether with", according to "needs" and possibilities). It creates a dependency which is not desiderable at all. It is basicaly a threat for freedom of usage. I want to do the things I like to do with a computer without the need of an internet connection. There are things (like this blog) which implies the presence of internet, ok. But if I don't want to connect, I can, and I still can use the computer as an important tool or as recreational pastime.

What is the solution? (There's no solution since we are not the one who will decide). ... But I've removed the FileOpen plugin and deleted those PDFs (unreadable data without the plugin), after all I was just curious, I can wrote algorithms by myself and I can find freely accessible literature which does not depend on the net to be read ("download once, own forever" and public libraries).

By the way, RE or so on FileOpen was done (issues about avoiding connection, key expirations... exist). But it would mean to give too much importance to them; rather, we should simply reject every FileOpen-protected ebook/PDF, in favor of freely accessible one. So they would be forced to drop that silly technology and use instead a more user-friendly (internet-independent) solution to their problems ("protecting" their stuffs).

2010-05-06

Another lazy one-liner

Today I was back on twitter (guess, shintakezou) and found among people I am following an interesting link. I wanted to download almost everything and lazyly built the following line:

lynx http://address -dump |
   egrep -v thumb |
   egrep "[0-9]+\. http:" |
   egrep "mp3|jpg" |sed -e 's/[0-9]\+\.//;s/^ \+//' |
   xargs -n 1 wget

I am almost sure there's a better way, but I've just finished eating.

Since people may think this is mainly a p0rn scraping line (mixed with few good musical files), I claim it is not even though it could be used for that purpose too. In that address I've found many Bilibin images — I know this artist only because his paintings are often used on Fravia's site (R.I.P.) — and a mp3 file I am going to listen, the title is Nuclear weapons are morally infensible.

Flash must die!!

I've read this and felt better: I simply hate Flash from the very beginning. Moreover current GNU/Linux version I am using is resource hungry and sometimes crashs. I don't know if all the good things said in the article are true (on the Apple part), but I would really like to delete definitively Flash and Flash plugins from my system. Hope I can do it soon!

Post scriptum: I love PostScript and like PDF, so I also hope Adobe won't be destroyed by loosing their poor Flash technology market!