2010-07-20

Syntax and efficiency in C

Time for a brand new post.

In my last job interview I was asked to write a function to copy a string. Thinking not about compactness, I wrote on the fly the following code


void strcpy(const char *s, char *d)
{
while( *s != '\0' ) {
*d = *s; s++; d++;
}
*d = '\0';
}


That of course works, but it is longer than the need (and by the way, it is not compliant with what standard says about real strcpy, but this is not interesting since making a standard-compliant function was not the point of the question).

As noticed by the person making the interview, such a copy can be written in one line. When I've finished the interview and went back with the mind to what I've said and what I could have said, I realized that I could have defended my on-the-fly implementation: it is clearer (isn't it?), it has the same efficiency of the "compact" version, being this one possible just because of the great flexibility of the C syntax... which is, however, a boomerang if abused.

Second thought. The "long" code is clearer. Maybe (usually, not using C compactness ability makes code clearer; but likely in this particular case the one-line version is even clearer). But, my doubt is now: is it really the same with respect to efficiency?

Let us first see how to "convert" the long version to the short one. First, post-increments can be done "in place".


*d++ = *s++;


Then, assignment can be put into the while:


while( (*d++ = *s++) != '\0' ) ;


And here's the trap. We don't need the final "terminator-missing fix". Apparently, the short version looks also more efficient. But if we unroll the loops, we have the same amount of assignments, of course. Moreover, the short version assign the terminator reading it from the source; this gives two memory access (one for reading from s, one for writing into d).

On the other hand, the long version, does not need the extra reading access to memory and can be optimized on many processor with a single instruction writing a byte constant (read at the instruction fetch time).

So, it seems that is is the opposite: long version is (slightly) more efficient!

Of course modern compilers' optimization can cancel this apparent advantage. But let's look what gcc does with the basic "implicit" optimization level.

The "short version" (S) x86 code (with added comments):


mstrcpy:
pushl %ebp # prolog
movl %esp, %ebp
.L2:
movl 8(%ebp), %eax # get s in eax M
movzbl (%eax), %edx # *s in edx (dl) M
movl 12(%ebp), %eax # get d in eax M
movb %dl, (%eax) # *d = *s (dl) M
movl 12(%ebp), %eax # M
movzbl (%eax), %eax # take char*d in eax M
testb %al, %al # I
setne %al # is not 0? in the same al I
addl $1, 12(%ebp) # these are the post M
addl $1, 8(%ebp) # increments of d and s M
testb %al, %al # check al I
jne .L2 # loop if not 0 ('\0') JT
popl %ebp # epilog
ret


The "long version" (L) x86 code (with added comments)


mstrcpy2:
pushl %ebp # prolog
movl %esp, %ebp
jmp .L5 # uncond. jump
.L6:
movl 8(%ebp), %eax # get s in eax M
movzbl (%eax), %edx # *s in edx (dl) M
movl 12(%ebp), %eax # get d in eax M
movb %dl, (%eax) # *d = *s M
addl $1, 12(%ebp) # d++ M
addl $1, 8(%ebp) # s++ M
.L5:
movl 8(%ebp), %eax # get s M
movzbl (%eax), %eax # *s in eax (al) M
testb %al, %al # I
jne .L6 # exec body if not 0 JT
movl 12(%ebp), %eax # get d M
movb $0, (%eax) # *d = '\0' M
popl %ebp # epilog
ret


The M marks instructions that do memory access; let's count only the in-loop instructions first. S has 8 memory access instrs in loops; L has 8 memory access instrs in loop but S has 3 "register" instrs in loop, while L has none. So for S we have 8+3 in loop, for L only 8.

In S, when the '\0' is reached, the whole 8+3 instruction are performed; then the jump is not taken (JT stands for Jump Taken most of the times). In L, when the '\0' is found, (2+2)+1 instructions are executed (2 M after the .L5 label, and 2 M after the conditional jump). In fact S performs all the "actions" of the C code (post-increments inclusive) in the while; but L does only the test on *s and the final termination of d. (This means also that if we access s and d in the S version, they point one char beyond '\0', while in the L version we do not increment them, since we exit the loop, so s and d will point to the '\0').

On modern processors it is hard to say which is really more efficient. Likely, the difference in performance is negligible. Moreover, optimizations done by the compilers may level down the difference, if any.

It is however worth noting that L version does not produce a so longer code (considering also loop unrolling), and it could be even more efficient than S version!