## 2010-07-20

### Syntax and efficiency in C

Time for a brand new post.

In my last job interview I was asked to write a function to copy a string. Thinking not about compactness, I wrote on the fly the following code

void strcpy(const char *s, char *d){  while( *s != '\0' ) {    *d = *s; s++; d++;  }  *d = '\0';}

That of course works, but it is longer than the need (and by the way, it is not compliant with what standard says about real strcpy, but this is not interesting since making a standard-compliant function was not the point of the question).

As noticed by the person making the interview, such a copy can be written in one line. When I've finished the interview and went back with the mind to what I've said and what I could have said, I realized that I could have defended my on-the-fly implementation: it is clearer (isn't it?), it has the same efficiency of the "compact" version, being this one possible just because of the great flexibility of the C syntax... which is, however, a boomerang if abused.

Second thought. The "long" code is clearer. Maybe (usually, not using C compactness ability makes code clearer; but likely in this particular case the one-line version is even clearer). But, my doubt is now: is it really the same with respect to efficiency?

Let us first see how to "convert" the long version to the short one. First, post-increments can be done "in place".

 *d++ = *s++;

Then, assignment can be put into the while:

  while( (*d++ = *s++) != '\0' ) ;

And here's the trap. We don't need the final "terminator-missing fix". Apparently, the short version looks also more efficient. But if we unroll the loops, we have the same amount of assignments, of course. Moreover, the short version assign the terminator reading it from the source; this gives two memory access (one for reading from s, one for writing into d).

On the other hand, the long version, does not need the extra reading access to memory and can be optimized on many processor with a single instruction writing a byte constant (read at the instruction fetch time).

So, it seems that is is the opposite: long version is (slightly) more efficient!

Of course modern compilers' optimization can cancel this apparent advantage. But let's look what gcc does with the basic "implicit" optimization level.

mstrcpy: pushl %ebp            # prolog movl %esp, %ebp.L2: movl 8(%ebp), %eax    # get s in eax   M movzbl (%eax), %edx   # *s in edx (dl) M movl 12(%ebp), %eax   # get d in eax   M movb %dl, (%eax)      # *d = *s (dl)   M movl 12(%ebp), %eax   #                M movzbl (%eax), %eax   # take char*d in eax M testb %al, %al        # I setne %al             # is not 0? in the same al  I addl $1, 12(%ebp) # these are the post M addl$1, 8(%ebp)      # increments of d and s M testb %al, %al        # check al   I jne .L2               # loop if not 0 ('\0')  JT popl %ebp             # epilog ret

mstrcpy2: pushl %ebp              # prolog movl %esp, %ebp jmp .L5                 # uncond. jump.L6: movl 8(%ebp), %eax      # get s in eax    M movzbl (%eax), %edx     # *s in edx (dl)  M movl 12(%ebp), %eax     # get d in eax    M movb %dl, (%eax)        # *d = *s         M addl $1, 12(%ebp) # d++ M addl$1, 8(%ebp)        # s++             M.L5: movl 8(%ebp), %eax      # get s           M movzbl (%eax), %eax     # *s in eax (al)  M testb %al, %al          # I jne .L6                 # exec body if not 0 JT movl 12(%ebp), %eax     # get d      M movb \$0, (%eax)         # *d = '\0'  M popl %ebp               # epilog ret