TransWikia.com

Hailstone Sequence in NASM

Code Review Asked on October 27, 2021

For practice, I wrote some NASM code that prints out the hailstone sequence of a (unfortunately, hardcoded) number.

This is by far the most complex code I’ve ever written in NASM. I’d like advice on anything, but specifically:

  • I’m trying to abide by CDECL. Am I doing anything off?
  • The multiplication part seems overly complicated. The problem is, mul doesn’t take an immediate, and the register that I want to multiply is ebx, not eax, so I need to do a couple movs before I can multiply.
  • Anything else that’s worth mentioning.

hail.asm:

global _start

section .data
    newline: db `\n`
    end_str: db `1\n`

section .text
    print_string:  ; (char* string, int length)
        push ebp
        mov ebp, esp

        push ebx

        mov eax, 4
        mov ebx, 1
        mov ecx, [ebp + 8]
        mov edx, [ebp + 12]
        int 0x80

        pop ebx

        mov esp, ebp
        pop ebp

        ret


    print_int:  ; (int n_to_print)
        push ebp
        mov ebp, esp

        push ebx
        push esi

        mov esi, esp  ; So we can calculate how many were pushed easily

        mov ecx, [ebp + 8]

        .loop:
            mov edx, 0  ; Zeroing out edx for div
            mov eax, ecx  ; Num to be divided
            mov ebx, 10  ; Divide by 10
            div ebx
            mov ecx, eax  ; Quotient

            add edx, '0'
            push edx  ; Remainder

            cmp ecx, 0
            jne .loop

        mov eax, 4  ; Write
        mov ebx, 1  ; STDOUT
        mov ecx, esp  ; The string on the stack
        mov edx, esi
        sub edx, esp  ; Calculate how many bytes were pushed
        int 0x80

        add esp, edx

        pop esi
        pop ebx

        mov esp, ebp
        pop ebp

        ret


    main_loop:  ; (int starting_n)
        push ebp
        mov ebp, esp

        push ebx

        mov ebx, [ebp + 8]  ; ebx is the accumulator
        .loop:
            push ebx
            call print_int
            add esp, 4

            push 1
            push newline
            call print_string
            add esp, 8

            test ebx, 1
            jz .even
            .odd:
                mov eax, ebx
                mov ecx, 3  ; mul takes no immediate, so put 3 in a register
                mul ecx
                inc eax
                mov ebx, eax
                jmp .end

            .even:
                shr ebx, 1

            .end:
                cmp ebx, 1
                jnz .loop

        push 2
        push end_str
        call print_string
        add esp, 8

        pop ebx

        mov esp, ebp
        pop ebp

        ret


    _start:
        push 1000  ; The starting number
        call main_loop
        add esp, 4

        mov eax, 1
        mov ebx, 0
        int 0x80

Makefile:

nasm hail.asm -g -f elf32 -Wall -o hail.o
ld hail.o -m elf_i386 -o hail

One Answer

Multiplying by 3

The multiplication part seems overly complicated. The problem is, mul doesn't take an immediate, and the register that I want to multiply is ebx, not eax, so I need to do a couple movs before I can multiply.

This is all true, but based on the premise that the mul instruction must be used. Here are a couple of alternatives:

  • imul ebx, ebx, 3. The manual lists it as a signed multiplication, but that makes no difference here, because only the low half of the product is used.
  • lea ebx, [ebx + 2*ebx]. Even the +1 can be merged into it: lea ebx, [ebx + 2*ebx + 1]. As a reminder, lea evaluates the address expression on the right and stores it in the destination register; it does not access memory despite the square-bracket syntax. A 3-component lea takes 3 cycles on some processors (e.g. Haswell, Skylake), making it slightly slower than a 2-component lea and a separate inc. A 3-component lea is good on Ryzen.
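Both claims can be checked with a small C model. This is only a sketch of the arithmetic, not of the instructions themselves; the function names are made up for illustration, and uint32_t stands in for a 32-bit register.

```c
#include <stdint.h>

/* 3n + 1 computed the way the 3-component lea computes it: n + 2*n + 1.
   Unsigned arithmetic wraps modulo 2^32, just like a 32-bit register. */
uint32_t triple_plus_one(uint32_t n) {
    return n + (n << 1) + 1u;
}

/* Low 32 bits of the product when both inputs are treated as unsigned. */
uint32_t low_half_unsigned(uint32_t a, uint32_t b) {
    uint64_t product = (uint64_t)a * (uint64_t)b;
    return (uint32_t)product;
}

/* Low 32 bits of the product when both inputs are treated as signed.
   Assumes the usual two's complement conversion from uint32_t to int32_t. */
uint32_t low_half_signed(uint32_t a, uint32_t b) {
    int64_t product = (int64_t)(int32_t)a * (int64_t)(int32_t)b;
    return (uint32_t)(uint64_t)product;
}
```

For example, triple_plus_one(27) gives 82, and low_half_signed and low_half_unsigned agree even for inputs such as 0xFFFFFFFF, which is why the signedness of imul does not matter when only the low half is kept.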

Dividing by 10

The simplest way is of course to use the div instruction, but that's not the fastest way, and it's not what a compiler would do. Here is a faster way, similar to how compilers do it, based on multiplying by a fixed-point reciprocal of 10 (namely 2^35 / 10; the difference between 2^35 and 2^32 is compensated for by shifting right by 3, and the remaining division by 2^32 is implicit in taking the high half of the output of mul).

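; calculate quotient ecx/10
mov eax, 0xCCCCCCCD     ; fixed-point reciprocal of 10, scaled by 2^35
mul ecx                 ; edx:eax = ecx * 0xCCCCCCCD
shr edx, 3              ; edx = n / 10
mov eax, ecx            ; eax = n
mov ecx, edx            ; ecx = quotient
; calculate remainder as n - 10*(n/10)
lea edx, [edx + 4*edx]  ; edx = 5 * quotient
add edx, edx            ; edx = 10 * quotient
sub eax, edx            ; eax = remainder (in eax here, not in edx as after div)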
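
The same computation can be written in C to check the constant. This is a sketch that mirrors the instructions above step by step; div10 and mod10 are hypothetical names, and the 64-bit product stands in for the edx:eax pair.

```c
#include <stdint.h>

/* Quotient n / 10 via the fixed-point reciprocal 0xCCCCCCCD. */
uint32_t div10(uint32_t n) {
    uint64_t product = (uint64_t)n * 0xCCCCCCCDu;  /* mul: 64-bit result in edx:eax */
    return (uint32_t)(product >> 35);              /* take high half (>> 32), then shr 3 */
}

/* Remainder computed as n - 10 * (n / 10), as in the lea/add/sub sequence. */
uint32_t mod10(uint32_t n) {
    uint32_t q = div10(n);
    uint32_t five_q = q + 4u * q;    /* lea edx, [edx + 4*edx] */
    uint32_t ten_q = five_q + five_q; /* add edx, edx */
    return n - ten_q;                 /* sub eax, edx */
}
```

For instance, div10(1000) is 100 and mod10(1234) is 4; the results also match ordinary division at the top of the range, where div10(4294967295) is 429496729 with remainder 5.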

push edx in print_int

This will put 4 bytes on the stack for every character of the decimal representation of the integer, 1 actual char and 3 zeroes as filler. That looks fine when printed because a zero does not look like anything, so I'm not sure if this should be classed as a bug, but it just seems like an odd thing to do. The characters could be written to some buffer byte-by-byte, with a store and decrementing the pointer, then there would not be zeroes mixed in. A similar "subtract pointers to find the length"-trick could be used, that's a good trick.
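
A byte-by-byte version of that idea might look like the following C sketch. It is an illustration of the buffer approach rather than a translation of print_int; format_uint and its parameters are hypothetical, and the caller is assumed to supply a buffer with room for at least 10 digits, which is enough for any 32-bit value.

```c
#include <stddef.h>
#include <stdint.h>

/* Writes the decimal digits of n backwards, ending just before buf_end.
   Returns a pointer to the first digit and stores the digit count in *len.
   The output is not NUL-terminated; the length says how much to write. */
char *format_uint(uint32_t n, char *buf_end, size_t *len) {
    char *p = buf_end;
    do {
        *--p = (char)('0' + n % 10u);  /* store one digit, then move the pointer down */
        n /= 10u;
    } while (n != 0);
    *len = (size_t)(buf_end - p);      /* "subtract pointers to find the length" */
    return p;
}
```

With a 10-byte buffer buf, calling format_uint(1000, buf + 10, &len) leaves the four characters 1000 at the end of the buffer with len set to 4, and no filler bytes in between.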

Small tricks

mov edx, 0  ; Zeroing out edx for div

That's fine but xor edx, edx is preferred, unless the flags must be preserved.

    jmp .end
.even:

Given that n is odd, 3n+1 is even, so you could omit the jump and have the flow of execution fall straight into the "even" case. Of course that means that not all integers in the sequence are printed, so maybe you can't use this trick, depending on what you want from the program.

If skipping some numbers to accelerate the sequence is OK, here is another trick for that: skip a sequence of even numbers all at once by counting the trailing zeroes and shifting them all out.
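
To see which values this shortcut produces, it can be modelled in C. This is a portable sketch that uses a loop instead of a count-trailing-zeros instruction; strip_trailing_zeros and next_odd are hypothetical names, and the zero check is an assumption added so the loop terminates.

```c
#include <stdint.h>

/* Removes all trailing zero bits, i.e. divides by 2 until the value is odd. */
uint32_t strip_trailing_zeros(uint32_t n) {
    if (n == 0) {
        return 0;  /* zero has no odd part; return it unchanged */
    }
    while ((n & 1u) == 0) {
        n >>= 1;
    }
    return n;
}

/* One accelerated step from an odd n: compute 3n + 1, then skip every
   even number that follows by shifting out the trailing zeros. */
uint32_t next_odd(uint32_t n) {
    return strip_trailing_zeros(3u * n + 1u);
}
```

Starting from 27, next_odd gives 41 (skipping 82), and from 5 it goes straight to 1 (skipping 16, 8, 4 and 2). In assembly the same stripping takes two instructions: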

tzcnt ecx, ebx
shr ebx, cl

   mov esp, ebp
   pop ebp

If you want (it doesn't make a significant difference, so it's mostly personal preference), you can use leave instead of this pair of instructions. Pairing the leave with enter is not recommended because enter is slow, but leave itself is OK. GCC likes to use leave when it makes sense, but Clang and MSVC don't.

       cmp ecx, 0
       jne .loop

That's fine, but there are a couple of alternatives that you may find interesting:

  • test ecx, ecx
    jne .loop
    
    Saves a byte, thanks to not having to encode the zero explicitly.
  • jecxz .loop
    
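    This special case can be used because ecx is used. Only 2 bytes instead of 5 or 4. However, unlike a fusible arith/branch pair, this costs 2 µops on Intel processors. On Ryzen there is no downside. Note that jecxz jumps when ecx is zero, the opposite of the jne above, so it is not a drop-in replacement: it fits where the branch should be taken on zero, for example to leave the loop.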

Answered by harold on October 27, 2021
