## 2010-05-27

Few weeks ago I have marked as incorrect an implementation of a task on RosettaCode in Fortran. It has struck me since it appeared elegant and short, so I wanted to run the code and... alas obtained a wrong result (AAAA). So without further analysing the code I have marked it as incorrect.

Yesterday I was looking for recent activities on RosettaCode and task to work on, when noticed that my "mark" was removed and in the talk page they've shown a gfortran bug... Unbelievable, I use gfortran to ru my computation, such a bug, as interpreted there, was horrorful!
(Even though recent gfortran versions fixed it).

Further investigation has shown that the bug is more subtle but less horrorful; first let us see how the bug manifests itself with the guilty fragment of code.

  character(2), dimension(2), parameter :: list = (/ 'ab', 'ba' /)  integer :: i  i = 1  write(*, *) list(:)(1:1) == list(1)(1:1)  ! ref. (A)  write(*, *) list(:)(1:1) == list(1)(i:i)  ! ref. (B)

We expect the same output, but we get something different! So at a glance the bug is in how gfortran compute expression with variables in slices, like if it does not grab the right value of i.

The problem is instead another. When constant slices of a known (at compile time) parameter array are requested, the slices are indeed computed at compile time. So that where there's no expression to be evaluated (even the simple i), the result is computed at compile time and no bug is shown.

Compiling with -S option allows us to study assembly output (on x86 machine):

 .section .rodata .align 4  .type A.2.545, @object .size A.2.545, 8A.2.545: .long 1 .long 0 .align 4

This is a piece of the readonly data section produced compiling a program containing only the line (A). What are those .long 1 and .long 0. Without breaking my head into debugging for real the code, I've suspected they are the result computed at compile time (True, False) and to verify it I simply changed 1 to 0, assembled/linked and run the code to obtain F F as output; if I change it into 1 1, I obtain T T, quod erat demonstrandum.

So the bug does not manifest itself since it is a runtime bug (maybe; we shall see it is not exactly so). At compile time, slicing is done correcly and so the comparing.

Let us see the code produced when only line (B) is in use:

.Ltext0: .section .rodata.LC0: .ascii "a".LC1: .ascii "b" .align 4  .type A.2.545, @object .size A.2.545, 8A.2.545: .long .LC0 .long .LC1 .align 4

Now the result (True and False values, i.e. 1 and 0) is missing since list(1)(i:i) must be computed at runtime. Nonetheless, list(:)(1:1) can still be computed at compile time and in fact it is: label A.2.545 point to pointers to two strings, and the "object" represents the array (/ "a", "b" /) which is the result of evaluation of list(:)(1:1).

Now list(1)(i:i) must be computed at runtime and the comparison must be carried at runtime, in fact later we find the fragment:

  movl A.2.545(,%eax,4), %eax movl %edx, 12(%esp) movl $1, 8(%esp) movl %eax, 4(%esp) movl$2, (%esp) call _gfortran_compare_string

which is missing from the case (A) (since de facto the comparison is done all at compiletime, and only the output routine appears). The problem is somewhere here or there... Again, since I am lazy, before going deep to debug the code let me see if there's a simpler way to grab a spark of enlightening.

Analysis I've done before and that you can find on talk page, made me think that the problem could be related someway with how Fortran handles "strings". I say, it is not so good at:D! They are indeed characters of given length and this is fixed. Let's see some code to understand better:

  character(2)  :: a  character(3)  :: b  character     :: c  c = 'a'  write(*,*) len(a)  b = 'cd'  write(*,*) len(b)  a = 'b'  write(*,*) len(c)  write(*, "(A,'|',A,'|',A,'|')") a, b, c

The output of this code (reformatted) is

231b |cd |a|

And it shows that in Fortran the length of the "string" is not determined by the real "string" we put in but by the size of the container. Moreover, unused room is padded with spaces (we need trim to remove the padding spaces). Nonetheless, comparing functions seem to consider this oddity and

'a' == 'a    '

is True! And also

a = 'a 'b = 'a  'write(*,*) a == b

gives of course True, which is odd (indeed, considered wrong!) in any other language I know. So comparing function seems to ignore trailing spaces. But it sometimes could make some confusion and how trailing spaces are really ignored? With this awareness/doubt in mind, I modified

.LC0: .ascii "a".LC1: .ascii "b" .align 4

into

.LC0: .ascii "a ".LC1: .ascii "b " .align 4

Then I "assembled and linked" (sounds cool, indeed I've just written gfortran modifiedcode.s!), run the code and BINGO the right output appears!! So now we think that the problem is at compiletime, since a trailing space is missing... But wait we've said trailing spaces are ignored so the comparing function should not consider them for real! Counterexample: let us add two spaces instead of one, to have "a ", or more, to have "a "... The right result does not change!1

So, it seems at least one space must be present. Is this a bug of the comparing function, or a bug in the code producing the static strings? Remember that those strings are the representation of the array (/ "a", "b" /) obtained with list(:)(1:1). Since list is an array of character(2), we could expect that the slice inherits the char(2) size. Maybe. But how it should be implemented in code? Maybe padding with spaces for real... But if this a requirement, then it means the comparing function is not buggy. If it is not a requirement and the rank of the "content" of a volatile object like list(:)(1:1) needs no to be inherited, then the string comparing function seems buggy... But why then it works when used "directly", in code as 'a' == 'a '?

Something is missing still.

Let us do another test. Let us "copy" the object list(:)(1:1) in another array, i.e. add this code

  character(1), dimension(2) :: t  t = list(:)(1:1)  print *, t == list(1)(i:i)

Now the result is correct!! So the problem as stated before is not in the usage of an expression (simply i:i in this case) but it is subtly bound to how object are evaluated at compiletime and stored in the produced code. Copying in a non "static" array fix the problem!

So another way to fix the problem should be to avoid that list(:)(1:1) is computed at compiletime and (we're supposing currently) stored "badly" in a way that confund the comparing function... I.e. we use list(:)(i:i) too... And in fact, doing so, fix the problem too! We're nearer.

The "static" word used before made me think about the parameter attribute... What if we remove it... Let's try... and wow, without it, everything is fine, always!! So the bug is in how parameter object are handled. At this point I am curious to know how the bug was fixed in recent version of gfortran!

But first I take a look at what is changing when copying list(:)(1:1) in a "dimension" made of character(N). (Changing N from 1 to whatever, does not change the output, which is always right).
When doing so, in the readonly data section we find only the representation of the initial array.

 .type list.544, @object .size list.544, 4list.544: .ascii "ab" .ascii "ba"

The rest is done at runtime (slicing, copying and so on). And so, everything works.

Now let us see what changes when the parameter attribute is removed. What can be noticed is that now everything is done at runtime, always. It is like that: parameter object are not supposed to change, so things done with them, when possible, can be computed at compile time (if there's an expression that must be computed, then the operations on a "parameter" array are done "deferred" at runtime). If no parameter attribute is given, the object may change at runtime, so the compiler does not try to do anything at compiletime.

So now we are more convinced that the bug is in the part of gfortran dealing with compile time "optimization" (for "immutable" objects). To let things work, we can remove the parameter attribute, or force the compiler to operate at runtime, e.g. using an expression that must be evaluated at runtime... Or we must be sure everything is computed at compiletime... Basically, if we mix things evaluated at compile time with things that must be evaluated at runtime, the bug shows up.

Note 1: look better at the code calling the comparing function... we see it appears a 2... hmm, what if this is the length of the string to be compared? Let's change

 movl $2, (%esp) call _gfortran_compare_string into  movl$1, (%esp) call _gfortran_compare_string

keeping "a" and "b" without spaces... Let's assembly and run and... bingo again. The fortran compare string function gets the length of the string to be compared as argument... So the bug is in here, since it should be $1 instead of$2... or in the static data... or ... hmmm