These are my notes from studying x86-64 assembly.

References

General Notes

misc

GCC (GNU Compiler Collection) is a front end (driver). It calls as and ld for you. Pass the -v argument to gcc to see each step of the process.

The -g argument embeds debug info in the object file, e.g. gcc -c first.s -o first.o -g. After that, the list command in GDB can show the source code.

The table below lists the names of the x86-64 data sizes. Some differ from ARM. For example, on ARM a word (.word) means 32 bits.

term	size (bits)
byte	8
word	16
dword	32
qword	64

In x86-64, use .long or .int to specify a 32-bit integer, and .quad to specify a 64-bit integer. In ARM, use .word to specify a 32-bit integer. Note that .word and .short may have different lengths on different machines. They are machine dependent.

STDIN file descriptor is 0. STDOUT file descriptor is 1. STDERR file descriptor is 2.

Labels in the assembly program that begin with _, e.g. _start, follow a convention of C compilers. It is a simple form of name mangling. C++ has more complex name mangling.

Hello world program x86-64, position independent code

//first.s 
//pay attention to the `lea msg(%rip)` instruction 
.data
msg:
    .asciz "Hello, world!\n"
 
.extern printf
.extern flush
.text
    .global main # entry point
main:
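    pushq %rbp              # rbp is callee-saved, so save it first
    movq %rsp, %rbp         # for correct debugging
    andq $-16, %rsp         # keep the stack 16-byte aligned for the call
    lea msg(%rip), %rdi     # position independent code
    xorl %eax, %eax         # variadic call: no vector registers used
    call printf
    movq %rbp, %rsp
    popq %rbp
    xorq %rax, %rax         # return 0
    ret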

To build it.

gcc -c -g -o first.o first.s 
gcc -o first first.o

Refer to https://reverseengineering.stackexchange.com/questions/18007/disassembly-shows-lea-with-rip

Unlike 32bits modes instructions that were taken as absolute addresses (use 32-bit immediate offset addressing), the 64bits modes (a.k.a long-mode) are usually using 32-bit offset from the current RIP, not from 0x00000000 like before. That means that you don’t have to know the absolute address of something you want to reference, you only need to know how far away it is from the currently executing instruction.

There are very few addressing modes which use a full 64bit absolute address. Most addressing modes are 32bit offsets relative to one of the 64bit registers (usually RIP).

Hello world program x86-64

The program is copied from x86-64 Assembly on youtube, from Mike Shah

It refers to the linux syscall table

another linux syscall table

In intel format.

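/*first.s*/
.intel_syntax noprefix
.global _start 
.hello.str: 
    .ascii "12345678\n" 
.equ str_len, . - .hello.str    # GNU AS spelling of "equ $-..."; didn't test if it works

.text 

_start: 
    push rbp
    mov rbp, rsp 
    
    mov rax, 1                    # WRITE syscall number
    mov rdi, 1                    # STDOUT file descriptor
    lea rsi, [rip + .hello.str]   # address of the string
    mov rdx, str_len              # length; didn't test if it works in GNU AS
    syscall 
    
    mov rax, 60                   # EXIT syscall number
    mov rdi, 0                    # exit code
    syscall 
    
    pop rbp                       # never reached, the EXIT syscall does not return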

To build it.

as -o first.o first.s 
ld -o first first.o

In AT&T format.

/*first.s*/

.global _start 
.hello.str: 
    .ascii "12345678\n" 

.text 

_start: 
    push %rbp 
    movq %rsp, %rbp 
    movq $1, %rax    
    movq $1, %rdi   
    leaq .hello.str, %rsi    
    movq $9, %rdx  
    syscall 
    
    movq $60, %rax  
    movq $0, %rdi   
    syscall 
    
    pop %rbp

To build it, either of the two ways below works.

as -o first.o first.s 
ld -o first first.o 

gcc -c first.s -o first.o 
ld -o first first.o

Hello world program X86 32 bit

The 32-bit hello world program and its 64-bit counterpart both use system calls to write to the screen and exit the program. But 32 bit and 64 bit have different calling conventions, and their syscall tables are different too.

Refer to 32 bit syscall table.

Refer to 64 bit syscall table

In 32 bit, the EXIT syscall is 1, while in 64 bit it is 60.

In 32 bit, the WRITE syscall is 4, while in 64 bit it is 1.

In 32 bit, the EAX, EBX, ECX, EDX, ESI, EDI in sequence are used to specify the SYSCALL code and arguments.

In 64 bit, the RAX, RDI, RSI, RDX, R10, R8, R9 in sequence are used to specify the SYSCALL code and arguments. Refer to the call convention.

In 32 bit, the syscall instruction is not available. It will report illegal instruction if we use it.

Refer to stackoverflow on syscall and sysenter.

syscall is the default way of entering kernel mode on x86-64. This instruction is not available in 32 bit modes of operation on Intel processors.
sysenter is an instruction most frequently used to invoke system calls in 32 bit modes of operation. It is similar to syscall, a bit more difficult to use though, but that is the kernel’s concern. int 0x80 is a legacy way to invoke a system call and should be avoided.
The preferred way to invoke a system call is to use vDSO(virtual dynamic shared object), a part of memory mapped in each process address space that allows to use system calls more efficiently (for example, by not entering kernel mode in some cases at all). vDSO also takes care of more difficult, in comparison to the legacy int 0x80 way, handling of syscall or sysenter instructions.

Note: In 32 bit, function arguments are passed on the STACK. In 64 bit, the first 6 arguments are passed in 6 registers, and the remaining arguments are passed on the stack.

Note: int $0x80 and sysenter do not mean the same thing in 32 bit. Refer to the definitive guide to linux system calls.

The program below was tested OK on my 32-bit Ubuntu.

.data
hello: .ascii "hello world\n"
.bss
.text

.global _start
    // output hello world to screen
_start:
    push %ebp  ;//preserve ebp
    movl %esp, %ebp ;//put the current esp to ebp
    movl $1, %ebx ;//put STDOUT file descriptor to %ebx
    leal hello, %ecx  ;//put address of the str to %ecx
    movl $12, %edx ;//put the length of the str to %edx
    movl $4, %eax  ;//put the WRITE syscall number to eax
    int $0x80

    // exit the program
    movl $0, %ebx ;//put the exit code to ebx
    movl $1, %eax ;//put the EXIT syscall number to eax
    int $0x80
    pop %ebp

call instruction to call a function

/*first.s*/
# there is also .bss segment for not initialized global data 
.data 
.hello.str: 
    .ascii "12345678\n" 

.text 

_write_str: 
    movq %rsp, %rbp 
    movq $1, %rax    
    movq $1, %rdi   
    leaq .hello.str, %rsi    
    movq $9, %rdx  
    syscall 
    ret 
_exit:
    movq $60, %rax  
    movq $0, %rdi   
    syscall 
    ret 

.global _start 
_start: 
    call _write_str 
    call _exit 
    pop %rbp

The 64 bit x86 C Calling Convention

This section is copied from x86-64 call convention. It is for Linux. Microsoft Windows does not follow the same convention. Refer to x64 calling convention, Microsoft.

pop, push, call, ret instructions.

The caller’s rules:

Before calling a subroutine, the caller should save the contents of certain registers that are designated caller-saved. The caller-saved registers are r10, r11, and any registers that parameters are put into. If you want the contents of these registers to be preserved across the subroutine call, push them onto the stack.
To pass parameters to the subroutine, we put up to six of them into registers (in order: rdi, rsi, rdx, rcx, r8, r9). If there are more than six parameters to the subroutine, then push the rest onto the stack in reverse order (i.e. last parameter first) – since the stack grows down, the first of the extra parameters (really the seventh parameter) will be stored at the lowest address (this inversion of parameters was historically used to allow functions to be passed a variable number of parameters).
To call the subroutine, use the call instruction. This instruction places the return address on top of the parameters on the stack, and branches to the subroutine code.
After the subroutine returns, (i.e. immediately following the call instruction) the caller must remove any additional parameters (beyond the six stored in registers) from stack. This restores the stack to its state before the call was performed.
The caller can expect to find the return value of the subroutine in the register RAX.
The caller restores the contents of caller-saved registers (r10, r11, and any in the parameter passing registers) by popping them off of the stack. The caller can assume that no other registers were modified by the subroutine.

The Callee’s Rules:

Allocate local variables by using registers or making space on the stack. Recall, the stack grows down, so to make space on the top of the stack, the stack pointer should be decremented. The amount by which the stack pointer is decremented depends on the number of local variables needed. For example, if a local float and a local long (12 bytes total) were required, the stack pointer would need to be decremented by 12 to make space for these local variables: sub rsp, 12. As with parameters, local variables will be located at known offsets from the stack pointer.
Next, the values of any registers that are designated callee-saved that will be used by the function must be saved. To save registers, push them onto the stack. The callee-saved registers are RBX, RBP, and R12 through R15 (RSP will also be preserved by the call convention, but need not be pushed on the stack during this step). After these three actions are performed, the actual operation of the subroutine may proceed. When the subroutine is ready to return, the call convention rules continue.
When the function is done, the return value for the function should be placed in RAX if it is not already there.
The function must restore the old values of any callee-saved registers (RBX, RBP, and R12 through R15) that were modified. The register contents are restored by popping them from the stack. Note, the registers should be popped in the inverse order that they were pushed.
Next, we deallocate local variables. The easiest way to do this is to add to RSP the same amount that was subtracted from it in step 1.
Finally, we return to the caller by executing a ret instruction. This instruction will find and remove the appropriate return address from the stack.

If you look at the assembly generated by some compilers, you will see a few extra commands in there in the callee’s prologue:

push rbp ; at the start of the callee 
mov rbp, rsp
... 
pop rbp ; just before the ending `ret`

This code is unnecessary, and is a hold-over from the 32-bit calling convention. You can tell the compiler to not include this code by invoking it with the -fomit-frame-pointer flag.

It might be noted that the callee’s rules fall cleanly into two halves that are basically mirror images of one another. The first half of the rules apply to the beginning of the function, and are therefore commonly said to define the prologue to the function. The latter half of the rules apply to the end of the function, and are thus commonly said to define the epilogue of the function.

The 32 bit x86 C calling Convention

The section is copied from x86 32 bit call convention.

The Caller’s Rules

Before calling a subroutine, the caller should save the contents of certain registers that are designated caller-saved. The caller-saved registers are EAX, ECX, EDX. If you want the contents of these registers to be preserved across the subroutine call, push them onto the stack.
To pass parameters to the subroutine, push them onto the stack before the call. The parameters should be pushed in inverted order (i.e. last parameter first) – since the stack grows down, the first parameter will be stored at the lowest address (this inversion of parameters was historically used to allow functions to be passed a variable number of parameters).
To call the subroutine, use the call instruction. This instruction places the return address on top of the parameters on the stack, and branches to the subroutine code.
After the subroutine returns, (i.e. immediately following the call instruction) the caller must remove the parameters from stack. This restores the stack to its state before the call was performed.
The caller can expect to find the return value of the subroutine in the register EAX.
The caller restores the contents of caller-saved registers (EAX, ECX, EDX) by popping them off of the stack. The caller can assume that no other registers were modified by the subroutine.

The Callee’s Rules

At the beginning of the subroutine, the function should push the value of EBP onto the stack, and then copy the value of ESP into EBP using the following instructions:

push ebp 
mov ebp, esp ;Intel style instead of AT&T style

The reason for this initial action is the maintenance of the base pointer, EBP. The base pointer is used by convention as a point of reference for finding parameters and local variables on the stack. Essentially, when any subroutine is executing, the base pointer is a “snapshot” of the stack pointer value from when the subroutine started executing. Parameters and local variables will always be located at known, constant offsets away from the base pointer value. We push the old base pointer value at the beginning of the subroutine so that we can later restore the appropriate base pointer value for the caller when the subroutine returns. Remember, the caller isn’t expecting the subroutine to change the value of the base pointer. We then move the stack pointer into EBP to obtain our point of reference for accessing parameters and local variables.

Next, allocate local variables by making space on the stack. Recall, the stack grows down, so to make space on the top of the stack, the stack pointer should be decremented. The amount by which the stack pointer is decremented depends on the number of local variables needed. For example, if 3 local integers (4 bytes each) were required, the stack pointer would need to be decremented by 12 to make space for these local variables. I.e.:

sub esp, 12

As with parameters, local variables will be located at known offsets from the base pointer.

Next, the values of any registers that are designated callee-saved that will be used by the function must be saved. To save registers, push them onto the stack. The callee-saved registers are EBX, EDI and ESI (ESP and EBP will also be preserved by the call convention, but need not be pushed on the stack during this step). After these three actions are performed, the actual operation of the subroutine may proceed. When the subroutine is ready to return, the call convention rules continue:
When the function is done, the return value for the function should be placed in EAX if it is not already there.
The function must restore the old values of any callee-saved registers (EBX, EDI and ESI) that were modified. The register contents are restored by popping them from the stack. Note, the registers should be popped in the inverse order that they were pushed.
Next, we deallocate local variables. The obvious way to do this might be to add the appropriate value to the stack pointer (since the space was allocated by subtracting the needed amount from the stack pointer). In practice, a less error-prone way to deallocate the variables is to move the value in the base pointer into the stack pointer, i.e.:

mov esp, ebp

This trick works because the base pointer always contains the value that the stack pointer contained immediately prior to the allocation of the local variables.

Immediately before returning, we must restore the caller’s base pointer value by popping EBP off the stack. Remember, the first thing we did on entry to the subroutine was to push the base pointer to save its old value.
Finally, we return to the caller by executing a ret instruction. This instruction will find and remove the appropriate return address from the stack.

General purpose registers in X86-64

rax eax ax al
rbx ebx bx bl
rcx ecx cx cl
rdx edx dx dl
rsi esi si sil
rdi edi di dil
rbp ebp bp bpl
rsp esp sp spl
r8 r8d r8w r8b
r9 r9d r9w r9b
r10 r10d r10w r10b
r11 r11d r11w r11b
r12 r12d r12w r12b
r13 r13d r13w r13b
r14 r14d r14w r14b
r15 r15d r15w r15b

AT&T Syntax

Refer to AT&T assembly syntax

It’s AT&T assembly syntax:

source comes before destination
mnemonic suffixes indicate the size of the operands (q for quad, etc.)
registers are prefixed with % and immediate values with $
effective addresses are in the form DISP(BASE, INDEX, SCALE) (DISP + BASE + INDEX * SCALE)
Indirect jump/call operands indicated with * (as opposed to direct).

add, sub, inc, dec

inc and add 1 are different.

dec and sub 1 are different.

inc and dec won’t affect the carry flag, while add and sub will.

Refer to inc, dec instructions

A comparison of GAS and NASM

Refer to Linux assemblers: A comparison of GAS and NASM

Book Programming from the Ground Up by Jonathan Bartlett (Author)

Programming from the Ground Up uses Linux assembly language to teach new programmers the most important concepts in programming.

Book Professional Assembly Language by Richard Blum (Good)

It is said to use the AT&T syntax and the GNU AS assembler, but it seems to be for 32-bit x86. Once you know the differences between the x86-64 and x86-32 call conventions, that does not matter. Refer to install a 32-bit ubuntu in a 64-bit host via virtual box

Use vagrant to install a 32-bit Ubuntu in VirtualBox.

#put the Vagrantfile in a folder and run 'vagrant up'. 
#run 'vagrant ssh' to login to the VM. It is a terminal interface without GUI. 
#default user and passwd is vagrant/vagrant. 
#A folder /vagrant is mounted to refer to the folder in the host machine where the Vagrantfile is located. 
#run `sudo apt install gdb` to install gdb for debugging   
Vagrant.configure("2") do |config|
  config.vm.box = "reelio/trusty32"
  config.vm.box_version = "0.0.1"
end

Chap5 Moving data

The data section is declared using the .data directive.

There is another type of data section called .rodata. Any data elements defined in this section can only be accessed in readonly mode(thus the ro prefix).

Two statements are required to define a data element in the data section: a label and a directive. The label has no meaning to the processor; it is just a place for the assembler to reserve a specified amount of memory for the data element to be referenced by the label.

directive, data type 

.ascii, Text string 
.asciz, Null terminated text string 
.byte, Byte value 
.double, Double precision floating-point number 
.float, Single precision floating-point number 
.int, 32-bit integer number 
.long, 32-bit integer number 
.octa, 16-byte integer number 
.quad, 8-byte integer number 
.short, 16-bit integer number 
.single, single-precision floating-point number(same as .float)

You are not limited to defining just one value on the directive statement line. You can define multiple values on the line, with each value being placed in memory in the order it appears in the directive.

sizes: .long 100,150,200,250,300

You can also declare static data symbols in the data section. The .equ directive is used to set a constant value to a symbol that can be used in the text section, as shown in the following examples:

.equ factor, 3 
.equ LINUX_SYS_CALL, 0X60

To reference the static data element, you must use a dollar sign before the label name.

movl $LINUX_SYS_CALL, %eax

In BSS section, defining data elements is somewhat different from defining them in the data section. Instead of declaring specific data types, you just declare raw segments of memory that are reserved for whatever purpose you need them for.

The GNU assembler uses two directives to declare buffers, as shown in the following table.

.comm, Declares a common memory area for data that is not initialized 
.lcomm, Declares a local common memory area for data that is not initialized

The local common memory area is reserved for data that will not be accessed outside of the local assembly code.

.comm symbol, length 
.lcomm buffer, 10000

Local common memory areas cannot be accessed by functions outside of where they were declared(They can’t be used in .globl directives).

One benefit to declaring data in the bss section is that the data is not included in the executable program.

The .fill directive enables the assembler to automatically create the 10000 elements for you. The default is to create one byte per field, and fill it with zeros.

.section .data 
buffer: 
    .fill 10000 
.section .text 
.globl _start 
_start: 
    movl $1, %eax 
    movl $0, %ebx 
    int $0x80

The GNU assembler adds another dimension to the MOV instruction, in that the size of the data element moved must be declared. The size is declared by adding an additional character to the MOV mnemonic. Thus the instruction becomes movx, where x can be l for a 32-bit long word value, w for a 16-bit word value, b for an 8-bit byte value, q for an 8-byte value.

moving data values from memory to a register

By my test in both 32 bit and 64 bit OS, .int and .long both have size 4 bytes.

By my test in both 32 bit and 64 bit OS (take 64 bits as an example), movq (value), %rcx and movq value, %rcx mean the same thing. After execution of either of the 2 programs below, the value of %rcx is 3, instead of the address of value in the .data section.

# movtest1.s, first way 
.section .data 
# define value with 8 bytes and value 3 
value: .quad 3  

.section .text 
.globl _start 
_start: 
    nop 
    movq value,%rcx 
    movq $60, %rax 
    movq $42, %rdi 
    syscall

# movtest1.s, 2nd way 
.section .data 
# define value with 8 bytes and value 3 
value: .quad 3  

.section .text 
.globl _start 
_start: 
    nop 
    movq (value),%rcx 
    movq $60, %rax 
    movq $42, %rdi 
    syscall

Follow these steps to build and debug the program.

[root@centos8 asm]# as -gstabs -o movtest1.o movtest1.s 
[root@centos8 asm]# ld -o movtest1 movtest1.o 
[root@centos8 asm]# gdb ./movtest1 
(gdb) b *_start 
Breakpoint 1 at 0x4000b0: file movtest1.s, line 8.
(gdb) r
Starting program: /root/tmp/asm/movtest1 

Breakpoint 1, _start () at movtest1.s:8
8       nop 
(gdb) n
9       movq (value),%rcx 
(gdb) print/x $rcx 
$1 = 0x0
(gdb) n
10      movq $60, %rax 
(gdb) print/x $rcx 
$2 = 0x3
(gdb)

But if we put the $ prefix before value in the mov instruction, it will move the address instead of the value, as shown in the program below.

# movtest1, it will move the address instead of the value 
.section .data 
# define value with 8 bytes and value 3 
value: .quad 3  

.section .text 
.globl _start 
_start: 
    nop 
    movq $value,%rcx 
    movq $60, %rax 
    movq $42, %rdi 
    syscall

Follow these debug steps to demonstrate.

gdb ./movtest1 

(gdb) b *_start 
Breakpoint 1 at 0x4000b0: file movtest1.s, line 8.
(gdb) r
Starting program: /root/tmp/asm/movtest1 

Breakpoint 1, _start () at movtest1.s:8
8       nop 
(gdb) n
9       movq $value,%rcx 
(gdb) print/x $rcx 
$1 = 0x0
(gdb) n
10      movq $60, %rax 
(gdb) print/x $rcx 
$2 = 0x6000c8

moving data values from a register to memory

.section .data 
# define value with 8 bytes and value 3
value: .quad 3  

.section .text 
.globl _start 
_start: 
    nop 
    movq $100, %rcx 
    movq %rcx, value 
    
    # exit the program with error code 42 
    movq $60, %rax 
    movq $42, %rdi 
    syscall

[root@centos8 asm]# as -gstabs -o movtest1.o movtest1.s 
[root@centos8 asm]# ld -o movtest1 movtest1.o 
[root@centos8 asm]# gdb -q ./movtest1
Reading symbols from ./movtest1...done.
(gdb) b *_start
Breakpoint 1 at 0x4000b0: file movtest1.s, line 8.
(gdb) r
Starting program: /root/tmp/asm/movtest1 

Breakpoint 1, _start () at movtest1.s:8
8       nop 
(gdb) x/d &value 
0x6000d0:   3
(gdb) n
9       movq $100, %rcx 
(gdb) n
10      movq %rcx, value 
(gdb) n
13      movq $60, %rax 
(gdb) x/d &value 
0x6000d0:   100

Using indexed memory locations

You can specify more than one value on a directive to place in memory:

values: .int 10, 15, 20, 25, 30

When referencing data in the array, you must use an index system to determine which value you are accessing. The way this is done is called “indexed memory mode”. The memory location is determined by the following:

A base address
An offset address to add to the base address
The size of the data element
An index to determine which data element to select

The format of the expression is

base_address(offset_address, index, size)

The data value retrieved is located at base_address + offset_address + index * size. If any of the values are zero, they can be omitted (but the commas are still required as placeholders). The offset_address and index value must be registers, but the size value can be a numerical value. For example, to reference the value 20 from the values array shown, you would use the following instructions:

movl $2, %edi 
movl values(,%edi,4), %eax

Using indirect addressing with registers

Besides holding data, registers can also be used to hold memory addresses. When a register holds a memory address, it is referred to as a pointer. Accessing the data stored in the memory location using the pointer is called indirect addressing.

While using a label references the data value contained in the memory location, you can get the memory location address of the data value by placing a dollar sign $ in front of the label in the instruction.

// move the memory address the `values` label references to the EDI register 
movl $values, %edi 

movl %edx, 4(%edi)

movl %edx, -4(%edi)

Conditional Move Instructions

Conditional move instructions let you avoid JMP instructions (branches), which helps out the prefetch cache condition of the processor, usually speeding up the application.

cmovx source, destination, where x is a one- to three-letter code denoting the condition that will trigger the move action. The conditions are based on the current values in the RFLAGS register.

CF, Carry flag

OF, Overflow flag

PF, Parity flag

SF, Sign flag

ZF, Zero flag

//conditional move instructions for unsigned values 
cmova/cmovnbe, above/not below or equal 
cmovae/cmovnb, above or equal/not below 
cmovnc, not carry 
cmovb/cmovnae, below/not above or equal 
cmovc, carry 
cmovbe/cmovna, below or equal/not above 
cmove/cmovz, equal/zero 
cmovne/cmovnz, not equal/not zero 
cmovp/cmovpe, PF = 1, parity even 
cmovnp/cmovpo, not parity/parity odd, PF=0 

//conditional move instructions for signed values 
cmovg/cmovnle, greater/not less or equal 
cmovge/cmovnl, greater or equal/not less  
cmovl/cmovnge, less/not greater or equal 
cmovle/cmovng, less or equal/not greater 
cmovo, overflow
cmovno, not overflow  
cmovs, sign 
cmovns, not sign

Exchanging Data

One drawback to the MOV instructions is that it is difficult to switch the values of two registers without using a temporary intermediate register.

xchg, exchanges the values of two registers, or a register and a memory location 
BSWAP, reverses the byte order in a 32-bit register 
xadd, exchanges two values and stores the sum in the destination operand 
cmpxchg, compares a value with an external value and exchanges it with another. 
cmpxchg8b, compares two 64-bit values and exchanges it with another

Stack

pushl 
pushw 
pushq

popl
popw
popq

PUSHA/POPA    //not available in 64-bit 
PUSHAD/POPAD  //not available in 64-bit 
PUSHF/POPF
PUSHFD/POPFD

Chap6 Controlling Execution Flow

Read to page 158

Book Beginning x64 Assembly Programming by Jo Van Hoey

It includes introductions to asm on both Linux and Windows. It also includes some advanced instructions, for example AVX, SSE, etc.

Book X64 Assembly Language Step by Step

Notes while I read the book.

Chap1 It’s all in the Plan: Understanding What Computers Really Do

A computer program is a list of steps and tests, nothing more.

A test is the sort of either/or decision we make. First, you take a look at something that can go one of two ways. Then you do one of two things, depending on what you saw when you took a look.

Chap2 Alien Bases: Getting Your Arms Around Binary and Hexadecimal

octal

hexadecimal

binary

Chap3 Lifting the Hood: Discovering What Computers Actually Are

A bit is a single binary digit, either 1 or 0.

A byte is eight bits.

Two bytes side by side are called a word.

Two words side by side are called a double word.

A quad word consists of two double words.

A group of four bits is called a nybble.

Chap4 Location: Registers, Memory Addressing, and Knowing Where Things Are

The skill of assembly language consists of a deep comprehension of memory addressing. Everything else is details – and easy details, at that.

There are a fair number of different ways to address memory in the Intel/AMD CPU family. Each of these ways is called a Memory Model. There are 3 major memory models that you can use with the more recent members of the Intel family, and a number of minor variations on those three, especially the one in the middle.

Real mode flat model.

Real mode segmented model.

Protected-mode flat model(32-bit and 64-bit).

The 8080 was an 8-bit CPU(its general-purpose registers have 8-bits), meaning that it processed 8 bits of information at a time. However, it had 16 address lines coming out of it(it will address 64KB).

The 8080 memory-addressing scheme was very simple. You put a 16-bit address out on the address lines, and you got back the 8-bit value that was stored at that address.

The 8086 came after the 8080. It is a 16-bit CPU. It has 20 address lines.

The 8080 was used a lot. Intel wanted to make it easy for people to translate older software from the 8080 to the 8086. One way to do this was to make sure that a 16-bit addressing system such as that of the 8080 still worked. Even though the 8086 could address 16 times as much memory as the 8080 (16x64KB=1MB), Intel set up the 8086 so that a program could take some 64 KB segment within that megabyte of memory and run entirely inside it, just as though it were the smaller 8080 memory system. This was done by the use of segment registers.

Speaking of the 8086 and 8088, there are 4 segment registers(CS, DS, …).

This was very wise short-term thinking and catastrophically bad long-term thinking. Programs that needed more than 64KB of memory at a time had to use memory in 64KB chunks, switching between chunks by switching values into and out of segment registers.

To maintain backward compatibility with the ancient 8086 and 8088, newer CPUs were given the power to limit themselves to what the older chips could address and execute. When a Pentium-class or better CPU needs to run software written for the real-mode segmented model, it pulls a neat trick that, temporarily, makes it become an 8086. This was called virtual-86 mode, and it provided excellent backward compatibility for DOS software.

A segment may start every 16 bytes throughout the full megabyte of real memory.

CS, DS, SS, ES, FS, GS: Segment registers. All segment registers are 16 bits in size, irrespective of the CPU. FS and GS exist only in the 386 and later Intel x86 32-bit CPUs.

CS: Code Segment

DS: Data Segment

SS: Stack Segment

ES: Extra segment

FS, GS: Clones of ES

Segment registers become useless in application programming in X86-64. Operating systems use two of them for special purposes.

Do Intel’s x86-64 CPUs have 64 address lines? No (48 or 52).

In the x86-64 world, CPUs have 14 general purpose 64-bit registers, plus SP and BP.

There are eight 16-bit general-purpose registers: AX, BX, CX, DX, BP, SI, DI, SP (8086, 8088, 80186 and 80286).

EAX, EBX, ECX, EDX, EBP, ESI, EDI, ESP. (32 bits)

RAX, RBX, RCX, RDX, RBP, RSI, RDI, RSP. R8 to R15. (64 bits)

RAX(EAX(AX(AH,AL)))

RBX(EBX(BX(BH,BL)))

RCX(ECX(CX(CH,CL)))

RDX(EDX(DX(DH,DL)))

RSI(ESI(SI(SIL))) and so on for RDI, RSP.

RIP, EIP, IP

The new x64 registers R8-R15 can be addressed as 64 bits, 32 bits, 16 bits, and 8 bits. However, the AH/AL scheme for the low 16 bits is a trick reserved for only RAX-RDX. The naming scheme for the R registers provides a mnemonic: D for double word, W for word, and B for byte. For example, if you want to deal with the lowest 8 bits of R8, you use the name R8B. Don’t make the beginner’s mistake of assuming that R8, R8D, R8W, and R8B are four separate and independent registers! A better metaphor is to think of the register names as country/state/county/city.

IP register.

While executing a program, the CPU uses IP to keep track of where it is in the current code segment. Instructions come in different sizes, ranging typically from 1 to 15 bytes. The CPU knows the size of each instruction it executes.

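IP is notable in that it cannot be named as an ordinary operand: there is no mov to or from it. It changes only through jumps, calls and returns. In x86-64, RIP can still serve as the base of RIP-relative addressing, as in lea msg(%rip), %rdi.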

Flags register.

RFLAGS, EFLAGS, FLAGS. When the flag’s value is 1, we say that the flag is set. When the flag’s value is 0, we say that the flag is cleared.

Math Coprocessors and Their Registers (may be 128 bits or 256 bits)

Real-Mode Flat Model

Real-Mode Segmented Model

32-bit Protected Mode Flat Model

64-bit Long Mode

Chap5 The Right to Assemble: The Process of Creating Assembly Language Programs

Text files: are files that can be opened and examined meaningfully in a text editor, like notepad.

Binary files: are files containing values that do not display meaningfully as text.

Assemblers: read your source code files and generate an object code file containing the machine instructions that the CPU understands plus any data you’ve defined in your source code.

Linker: Object code files cannot themselves be run as programs. An additional step, called linking, is necessary to turn object code files into executable program files.

Symbol table: To process several object modules into a single executable module, the linker must first build an index called a symbol table, with an entry for every named item in every object module it links, with information on what name (called a symbol) refers to what location within the module.

Exe: Once the symbol table is complete, the linker builds an image of how the executable program will be arranged in memory when the operating system loads it. This image is then written to disk as the executable file. The most important thing about the image that the linker builds relates to addresses.

Holes: Object modules are allowed to refer to symbols in other object modules. During assembly, these external references are left as holes to be filled later, naturally enough, because the module in which these external symbols exist may not have been assembled or even written yet. As the linker builds an image of the eventual executable program file, it learns where all of the symbols are located within the image and thus can drop real addresses into all of the external reference holes.

Debugging info: Debugging information is, in a sense, a step backward. Portions of the source code, which was all stripped out early in the assembly process, are put back into the object module by the assembler. These portions of the source code are mostly the names of data items and procedures, and they’re embedded in the object file to make it easier for the programmer (you!) to see the names of data items when you debug the program.

Relocatability: Primordial microcomputers like 8080 systems running CP/M-80 had a simple memory architecture. Programs were written to be loaded and run at a specific physical memory address. For CP/M, this was 0100H. The programmer could assume that any program would start at 0100H and go up from there. Memory addresses of data items and procedures were actual physical addresses, and every time the program ran, its data items were loaded and referenced at precisely the same place in memory. This all changed with the arrival of the 8086, and 8086-specific operating systems such as CP/M-86 and PC DOS. Improvements in the Intel architecture introduced with the 8086 made it unnecessary for the program to be assembled for running at any specific physical memory address. This feature is called relocatability and is a necessary part of any modern operating system, especially when multiple programs may be running at once.

The author uses NASM to assemble ASM files, and the SASM IDE for editing, building and debugging. SASM stands for SimpleASM.

nasm -f elf64 -g -Fdwarf first.asm

Chap6 Linux and the Tools That Shape the Way You Work

SASM: Simple ASM

Make

Chap7 Following your instructions: Meeting Machine Instructions Up Close and Personal

instructions: xchg, inc, dec, jnz, jmp, neg, movsx, mul, div, imul, idiv

Immediate data is built right into its own machine instruction. Register data is stored in one of the CPU’s collection of internal registers. In contrast, memory data is stored somewhere in the sliver of system memory “owned” by a program, at a 64-bit memory address.

Only one of an instruction’s two operands may specify a memory location. You can’t move a memory value directly to another memory value. This is an inherent limitation of Intel CPUs of all generations.

To specify that we want the data at the memory location contained in a register rather than the data in the register itself, we use square brackets around the name of the register.

mov rax, [rbx]

mov rax, [rbx + 16]

mov rax, [rbx + rcx]

mov rax, [rbx + rcx + 11]

Whatever is inside the brackets is called the effective address of a data item in memory. At the current evolution of the Intel hardware, 2 registers may be added together to form the effective address, but not three or more.

Where the size issue gets tricky is when you write data in a register out to memory. NASM does not “remember” the size of variables, like higher-level languages do. It knows where EatMsg starts in memory, and that’s it. You have to tell NASM how many bytes of data to move. This is done by a size specifier.

mov byte [EatMsg], 'G'

Here we tell NASM that we want to move only a single byte out to memory by using the BYTE size specifier. Other size specifiers include WORD, DWORD, QWORD.

Only 18 bits of the RFlags register are actually flags. The rest are reserved for later use in future generations of Intel CPUs. Even among the defined flags, only a few are commonplace, and fewer still are useful when you’re just learning your way around. Some are used only inside system software like operating systems and are not available at all in userspace programs.

OF: The Overflow flag is set when the result of an arithmetic operation on a signed integer quantity becomes too large to fit in the operand it originally occupied. OF is generally used as the “carry flag” in signed arithmetic.

DF: The Direction flag is an oddball among the flags in that it tells the CPU something that you want it to know, rather than the other way around. It dictates the direction that activity moves (up-memory or down-memory) during the execution of string instructions. When DF is set, string instructions proceed from high memory toward low memory. When DF is cleared, string instructions proceed from low memory toward high memory.

IF: The Interrupt Enable flag is a two-way flag. The CPU sets it under certain conditions, and you can set it yourself using the STI and CLI instructions—though you probably won’t; see below. When IF is set, interrupts are enabled and may occur when requested. When IF is cleared, interrupts are ignored by the CPU. Ordinary programs could set and clear this flag with impunity in Real Mode, back in the DOS era. Under Linux (whether 32-bit or 64-bit) IF is reserved for the use of the operating system and sometimes its drivers. If you try to use the STI and CLI instructions within one of your programs, Linux will hand you a general protection fault, and your program will be terminated. Consider IF off-limits for userspace programming like we’re discussing in this book.

TF: When set, the Trap flag allows debuggers to manage single-stepping, by forcing the CPU to execute only a single instruction before calling an interrupt routine. This is not an especially useful flag for ordinary programming, and I won’t have anything more to say about it in this book.

SF: The Sign flag becomes set when the result of an operation forces the operand to become negative. By negative, we mean only that the highest-order bit in the operand (the sign bit) becomes 1 during a signed arithmetic operation. Any operation that leaves the sign of the result positive will clear SF.

ZF: The Zero flag becomes set when the results of an operation become zero. If the destination operand instead becomes some nonzero value, ZF is cleared. You’ll be using this one a lot for conditional jumps.

AF: The Auxiliary Carry flag is used only for BCD arithmetic. The BCD instructions are considered obsolete and are not present in x64.

PF: The Parity flag PF indicates whether the number of set (1) bits in the low-order byte of a result is even or odd. For example, if the result is 0F2H, PF will be cleared because 0F2H (11110010) contains an odd number of 1 bits.

CF: The Carry flag is used in unsigned arithmetic operations. If the result of an arithmetic or shift operation “carries out” a bit from the operand, CF becomes set. Otherwise, if nothing is carried out, CF is cleared.

The highest bit in the most significant byte of a signed value is the sign bit. If the sign bit is a 1-bit, the number is negative.

movsx: move with sign extension

xor rax, rax 
mov ax, -42 
movsx rbx, ax ;rbx will become -42 in two's complement

The mul instruction has an implicit operand. Immediate values cannot be used as operands for mul. MUL very helpfully sets the Carry flag CF when the value of the product overflows the low-order register.

explicit operand	implicit operand	implicit product
mul r/m8	AL	AX
mul r/m16	AX	DX:AX
mul r/m32	EAX	EDX:EAX
mul r/m64	RAX	RDX:RAX

explicit operand	implicit operand	quotient	remainder
div r/m8	AX	AL	AH
div r/m16	DX:AX	AX	DX
div r/m32	EDX:EAX	EAX	EDX
div r/m64	RDX:RAX	RAX	RDX

Note: r/m8 means 8bits register or memory

mov eax,447
mov ebx,1739
mul ebx
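
To see the implicit operands of mul and div working together, here is a minimal sketch in GAS AT&T syntax that repeats the multiplication above and then divides the product back. The labels overflow and finish are made up for this example. Note that in AT&T syntax the single explicit operand is written the same way, and the implicit EAX/EDX pair does not appear in the source at all.

```asm
# muldiv.s (hypothetical name): mul and div with their implicit operands
.text
.global _start
_start:
    mov $447, %eax
    mov $1739, %ebx
    mul %ebx              # EDX:EAX = EAX * EBX = 777333, so EDX = 0 and CF = 0
    jc overflow           # CF set would mean the product spilled into EDX
    div %ebx              # EDX:EAX / EBX: quotient 447 in EAX, remainder 0 in EDX
    mov %edx, %edi        # use the remainder as the exit status
    jmp finish
overflow:
    mov $1, %edi          # not expected for these inputs
finish:
    mov $60, %eax         # sys_exit
    syscall               # exit status 0
```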

Chap8 Creating programs that work

Ordinary user-space programs written in NASM for Linux are divided into 3 sections. The order in which these sections fall in your program really isn’t important, but by convention the .data section comes first, followed by the .bss section and then the .text section.

.data section contains data definitions of initialized data items.

.bss stands for Block Started by Symbol (sometimes glossed as Block/Buffer Start Symbol). It contains data items that have no values before the program begins running.

Data items in the .data section add to the size of your executable file. Data items in the .bss section do not. A buffer that takes up 16,000 bytes can be defined in .bss and add almost nothing (about 50 bytes for the description) to the executable file size. This is possible because of the way the Linux loader brings the programs into memory. When you build your executable file, the Linux linker adds info to the file describing all the symbols you’ve defined, including symbols naming data items. The loader knows which data items do not have initial values, and it allocates space in memory for them when it brings the executable in from disk. Data items with initial values are read in along with their values.

The actual machine instructions that make up your program go into the .text section. The .text section contains symbols called labels that identify locations in the program code for jumps and calls. All global labels must be declared in the .text section, or the labels cannot be “seen” outside your program.

Refer to compiler,linker,assembler,loader

Linux Compiler: A compiler is a specialized system tool that translates a program written in a specific programming language into the assembly language code.

Linux Assembler: The assembler translates our assembly code to the machine code and then stores the result in an object file. Moving further, the assembler gives a memory location to each object and instruction in our code. The memory location can be physical as well as virtual. A virtual location is an offset that is relative to the base address of the first instruction.

Linux Linker: The linker combines all external programs (such as libraries and other shared components) with our program to create a final executable.

Linux Loader: The loader is a specialized operating system module. It loads the final executable code into memory.

Labels must begin with a letter or else with an underscore, period, or question mark. These last three have special meanings to the assembler, so don’t use them until you know how NASM interprets them.

Labels must be followed by a colon when they are defined.

Labels are case sensitive.

PUSH: pushes a 16-bit or 64-bit register or memory value that is specified by you in your source code. Note that you can’t push an 8-bit nor a 32-bit value onto the stack! You’ll get an error if you try.

PUSHFQ: pushes the full 64-bit RFlags register onto the stack. The Q means quadword.

Any of the 16-bit and 64-bit general-purpose registers may be pushed individually onto the stack. You can’t push AL or BH or any other of the 8-bit registers. Immediate data can also be pushed; in 64-bit mode an immediate is encoded in at most 32 bits and sign-extended to 64 bits before it goes onto the stack. User-space Linux programs cannot push the segment registers onto the stack under any circumstance. With x64, segment registers belong to the OS and are unavailable to user-space programs. As odd as it might seem, 32-bit values (including all 32-bit registers) may not be pushed onto the stack.

POP

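popfq ; pop 8 bytes from the stack into RFlags
pop rcx ; pop 8 bytes into RCX
pop bx ; pop 2 bytes into BX
pop qword [rbx] ; pop 8 bytes into memory at the address in RBX (NASM needs the size specifier)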
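
The last-in, first-out behavior of PUSH and POP can be sketched as follows, in GAS AT&T syntax. The values 7 and 35 are arbitrary; the program swaps two registers through the stack and reports the difference as its exit status.

```asm
# pushpop.s (hypothetical name): swap two registers through the stack
.text
.global _start
_start:
    mov $7, %rax
    mov $35, %rbx
    push %rax             # stack (top first): 7
    push %rbx             # stack (top first): 35, 7
    pop %rax              # rax = 35, the value pushed last
    pop %rbx              # rbx = 7
    mov %rax, %rdi
    sub %rbx, %rdi        # rdi = 35 - 7 = 28
    mov $60, %rax         # sys_exit
    syscall               # exit status 28
```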

PUSHA, PUSHAD, POPA, POPAD are gone in x64, though they are available in x86-32. These instructions were used to push or pop all of the general-purpose registers at once. They were likely removed because x64 has many more general-purpose registers.

ABI: Application Binary Interface. It defines a collection of fundamental callable functions, generally supplied by the operating system, as is done in Linux. This definition describes how to pass parameters to the many kernel service functions. An ABI also defines how linkers link compiled or assembled modules into a single executable binary program.

RDI, RSI, RDX, R10, R8 and R9 specify a system call’s parameters. RAX is dedicated to the numeric code specifying the system call to be made.

The SYSCALL instruction itself overwrites RCX and R11, and the kernel puts the return value in RAX. After the SYSCALL returns, you can’t assume that RAX, RCX or R11 will have the same values they did before the SYSCALL.

How do we reserve bytes in the .bss section in NASM, and how do we do it in GAS?

;nasm assembly 
section .bss
Buff resb 1

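# gas assembly (comments start with #; in GAS for x86 a ; separates statements)
.bss
.lcomm buff1, 10 # reserve 10 bytes for a local common symbol buff1. The symbol is not declared global, so normally it is not visible to ld.
.comm buff2, 10 # declare a common symbol buff2 of 10 bytes, which ld may merge with same-named common symbols in other object files.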

jb and ja are for unsigned numbers, and jg and jl are for signed numbers.

jb: jump if below

ja: jump if above

jg: jump if greater

jl: jump if less

# upper.s
# a program to convert user inputted char from lower case to upper case
.data

.bss
.lcomm char,1 #local common buffer, with the name `char`, 1 byte size

.text
.global _start

_start:
    push %rbp
    mov %rsp, %rbp

    #read a char
    mov $0, %rax
    mov $0, %rdi
    lea char, %rsi
    mov $1, %rdx
    syscall

    lea char, %rax #put the addr of char to %rax
    movb (%rax),%bl #put the byte value pointed to by %rax into %bl
    cmp $0x61, %bl #compute %bl - 0x61 ('a') to set the flags, without changing %bl
    jb do_not_change #jump if %bl is below 'a'. Note the operand order differs between Intel and AT&T syntax
    cmp $0x7a, %bl
    ja do_not_change
    # convert the char to upper case
    sub $0x20, %bl
    movb %bl, (%rax) #put the byte value in %bl to memory pointed by %rax
do_not_change:

    #write a char
    mov $1, %rax
    mov $1, %rdi
    lea char, %rsi
    mov $1, %rdx
    syscall

    #exit the program
    mov $60, %rax #syscall number
    mov $42, %rdi #exit code
    syscall

    pop %rbp

Use the commands below to build the above program.

as -g -o upper.o upper.s
ld -o upper upper.o

In the above program, we should pay attention to the below facts.

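The way to declare a buffer in the .bss section via .lcomm or .comm.
When using lea, there is no $ prefix in front of the label, because lea takes a memory operand and yields its address. With mov, a $ prefix (mov $char, %rsi) loads the address as an immediate, while a bare label (mov char, %rsi) loads the contents of memory at that label.
To refer to memory, put the register holding the address between parentheses: movl 0x20(%rbx),%eax or movl (%rbx),%eax.
When using the sub and cmp instructions, pay attention to the operand order: AT&T syntax puts the source first and the destination last, the reverse of Intel syntax.
The AT&T syntax mnemonics have a size suffix: movb, movw, movl, movq.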

movb %bl, %al 
movw %bx, %ax 
movl %ebx, %eax 
movq %rbx, %rax

Chap9 Bits, Flags, Branches, and Tables, Easing Into Mainstream Assembly Coding

Instructions: SHL, SHR, ROL, ROR, RCL, RCR, AND, OR, XOR, NOT, JC, JNC, STC, CLC, JMP, JZ, JNZ, CMP, TEST, BT, JA, JAE, JB, JBE, JE, JNE, JG, JGE, JL, JLE, JNBE, JNLE, XLAT

Bits in assembly language are numbered, starting from 0. The least significant bit is the one with the least value in the binary number system. When you count bits, start with the bits on the right-hand end, and number them leftward from 0.

Note: CS, DS, SS, ES segment registers are not GP(general purpose) registers.

In x64, the shift instructions require either an immediate value from 0-255 or CL (CL only, not CX, ECX or RCX) as the count.

Shifting by 0 is pointless but allowed.

You cannot shift more positions than the destination register has. In 64-bit long mode, you cannot shift more than 63 counts. Attempting to do so won’t trigger an error. It just won’t work. CPU masks the count value to the 6 lowest bits before the instruction is executed.

The mask follows the operand size: for 8-, 16- and 32-bit operands (and therefore always in 32-bit protected mode), the CPU masks the count value to the 5 lowest bits.
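
The count masking is easy to observe. This is a minimal sketch in GAS AT&T syntax, with an arbitrary count of 65 chosen because 65 AND 63 is 1; a 64-bit shift by 65 therefore behaves like a shift by 1.

```asm
# shiftmask.s (hypothetical name): the shift count in CL is masked to 6 bits
.text
.global _start
_start:
    mov $1, %rdi
    mov $65, %cl          # 65 = 0b1000001; only the low 6 bits (= 1) are used
    shl %cl, %rdi         # behaves like shl $1: rdi = 2, not 0
    mov $60, %rax         # sys_exit
    syscall               # exit status 2
```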

Shifting a bit off the left end of a binary value doesn’t exactly send that bit into cosmic nothingness. A bit shifted out of the left end of a binary value is bumped into a temporary bucket for bits called the CF(Carry Flag). We can test the state of the CF with a branching instruction(JC, JNC).

JC: Jump if Carry

JNC: Jump if Not Carry

JZ: Jump if Zero

JNZ: Jump if not Zero

JMP: Jump

If you shift a bit into the CF, and then immediately execute another shift operation, the bit bumped into the CF earlier will be bumped off the end of the world into cosmic nothingness.
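
A minimal sketch of catching a shifted-out bit with JNC, in GAS AT&T syntax. The label done is made up for the example. The value 0x80 has only bit 7 set, so a single left shift of AL pushes that bit into CF.

```asm
# carry.s (hypothetical name): a bit shifted out of AL lands in CF
.text
.global _start
_start:
    xor %edi, %edi        # exit status 0 unless the carry is seen
    mov $0x80, %al        # only bit 7 is set (mov does not change flags)
    shl $1, %al           # bit 7 moves into CF; AL becomes 0
    jnc done              # not taken here, because CF = 1
    mov $1, %edi          # reached only when the bit arrived in CF
done:
    mov $60, %eax         # sys_exit
    syscall               # exit status 1
```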

If a bit’s destiny is not to be lost in cosmic nothingness, you need to use the rotate instructions RCL(Rotate Carry Left), RCR(Rotate Carry Right), ROL(Rotate Left) and ROR(Rotate Right). With ROL and ROR, a bit bumped off one end of the operand reappears at the opposite end of the operand (and is also copied into CF). RCL and RCR rotate through CF, so CF acts as one extra bit of the operand.

STC: Set Carry bit

CLC: Clear Carry bit

CF, Carry Flag

ZF, Zero Flag

SF, Sign Flag

DF, Direction Flag

OF, Overflow Flag

BT, Bit Test: it copies the specified bit of the first operand into the CF.

;intel syntax 
;check bit 4 of rax, counting from 0 
bt rax,4
jnc quit ; jump if bit 4 is not 1

X86-64 long mode memory addressing.

;BASE can be any GP register 
;INDEX can be any GP register except RSP
;SCALE can be 1,2,4,8
;DISP(displacement) can be any signed 32-bit constant
BASE + (INDEX * SCALE) + DISP
;example in intel syntax 
[rsi + rbp*4 + 9]
;example in AT&T syntax 
jmpq *0x402680(%rbp,%rax,8)
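
As a worked instance of BASE + (INDEX * SCALE) + DISP, here is a minimal sketch in GAS AT&T syntax, where the same address is written as DISP(BASE,INDEX,SCALE). The table and its values are invented for the example.

```asm
# addr.s (hypothetical name): DISP(BASE,INDEX,SCALE) in AT&T syntax
.data
table:
    .quad 10, 20, 30, 40  # four 8-byte entries
.text
.global _start
_start:
    lea table(%rip), %rbx        # BASE = address of table
    mov $2, %rcx                 # INDEX = 2
    mov 8(%rbx,%rcx,8), %rdi     # address = rbx + rcx*8 + 8 = table + 24, the fourth entry
    mov $60, %rax                # sys_exit
    syscall                      # exit status 40
```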

LEA has an off-label purpose: doing fast math without shifts, adds or MUL.

;intel syntax 
;multiply rdx by 3 
lea rdx,[rdx*2+rdx]
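
The same trick in GAS AT&T syntax, as a minimal sketch with arbitrary values. The first lea multiplies by 3 exactly as above; the second uses the scale and displacement together to compute x*4 + 1 in one instruction.

```asm
# leamath.s (hypothetical name): arithmetic with lea
.text
.global _start
_start:
    mov $7, %rdx
    lea (%rdx,%rdx,2), %rdx   # rdx = rdx + rdx*2 = 21
    lea 1(,%rdx,4), %rdi      # rdi = rdx*4 + 1 = 85
    mov $60, %rax             # sys_exit
    syscall                   # exit status 85
```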

Chap10 Dividing and Conquering: Using procedures and macros to battle program complexity

A procedure must begin with a label.

Somewhere within the procedure, there must be at least one ret instruction.

A procedure may use call to call another procedure.

CALL first pushes the address of the next instruction after itself onto the stack. Then CALL transfers the execution to the address represented by the label that names the procedure. The instructions contained in the procedure execute. The RET instruction pops the return address off the top of the stack and transfers execution to that address. Execution continues as though CALL had not changed the flow of instruction execution at all.
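
A minimal sketch of the CALL/RET round trip in GAS AT&T syntax. The procedure double_it is hypothetical; it takes its argument in RDI and returns the result in RAX, following the register convention of the x86-64 System V ABI.

```asm
# callret.s (hypothetical name): a procedure is a label plus a ret
.text
.global _start

double_it:                    # hypothetical procedure: rax = 2 * rdi
    lea (%rdi,%rdi), %rax
    ret                       # pops the return address pushed by call

_start:
    mov $21, %rdi
    call double_it            # pushes the address of the next instruction, then jumps
    mov %rax, %rdi            # execution resumes here after ret
    mov $60, %rax             # sys_exit
    syscall                   # exit status 42
```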

There is a convention for which registers must be preserved within a procedure and which do not. This convention is part of the x86-64 System V ABI(Application Binary Interface). Some registers are considered volatile, meaning that they can be changed by a procedure, and others are nonvolatile, which means they must be preserved.

Linux uses registers, too. It is important to know which registers are changed during system calls made via the SYSCALL instruction. There is no simple answer. It depends completely on which system call you make. But first and above all, the SYSCALL instruction itself makes use of 2 registers:

SYSCALL stores return address in the RCX register.

SYSCALL stores RFlags in the R11 register.

Every time you execute SYSCALL, RCX and R11 will be clobbered.

The system call number (in other words, which system call you’re calling) is always in RAX. A system call will accept up to six parameters. The registers used to pass parameters are in this order: RDI, RSI, RDX, R10, R8, and R9. In other words, the first parameter is passed in RDI. The second parameter is passed in RSI, and so on. No system call requires any parameters be passed to it on the stack.

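Note: Whether or not a register (like R9, say) is used to pass a parameter to a system call, the book advises against relying on it being preserved, and treats only seven registers as preserved across a system call: R12, R13, R14, R15, RBX, RSP, and RBP. The x86-64 System V ABI is less strict: it states that the kernel destroys only RCX and R11, with RAX carrying the return value.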

After a SYSCALL, RAX will contain a return value. If RAX is negative, it indicates an error occurred during the call; the value is the negated errno code. For most system calls, a 0 value indicates success.
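
Checking the sign of RAX after a system call might look like the sketch below, in GAS AT&T syntax. It deliberately calls sys_write with an invalid file descriptor (-1) so that the kernel returns a negative value; the labels msg and done are made up for the example.

```asm
# syserr.s (hypothetical name): test RAX for a negative return value
.data
msg:
    .ascii "x"
.text
.global _start
_start:
    mov $1, %rax              # sys_write
    mov $-1, %rdi             # deliberately invalid file descriptor
    lea msg(%rip), %rsi
    mov $1, %rdx
    syscall                   # expected to fail with a negative value in RAX
    xor %edi, %edi            # assume success: exit status 0
    test %rax, %rax           # sets SF from the sign bit of RAX
    jns done                  # non-negative: number of bytes written
    mov $1, %edi              # negative: report failure
done:
    mov $60, %eax             # sys_exit
    syscall                   # exit status 1 for this failing call
```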

Note: PUSHA, POPA, PUSHAD, POPAD are not available in x86-64.

Sooner or later, you are going to accidentally reuse a label.

This is a common enough problem that NASM’s authors created a feature to deal with it: local labels. Local labels are based on the fact that nearly all labels in assembly work (outside of names of subroutines and major sections) are “local” in nature, by which I mean that they are only referenced by jump instructions that are very close to them—perhaps only two or three instructions away.

Note: NASM and Gas may have different syntax for local labels. And it seems local labels and local symbols mean different things in GAS. Refer to as symbol-names

For NASM: Note that the label .modTest has a period in front of it. This period marks it as a local label. A local label is local to the first nonlocal label (that is, the first label not prefixed by a period; we call these global) that precedes it in the code.

; Go through the buffer and convert binary byte values to hex digits:
Scan:
xor rax,rax ; Clear RAX to 0
mov al,[Buff+rcx] ; Get a byte from the buffer into AL
mov rdx,rsi ; Copy total counter into RDX
and rdx,000000000000000Fh ; Mask out lowest 4 bits of char counter
call DumpChar ; Call the char poke procedure
; Bump the buffer pointer to the next char and see if buffer's done:
inc rsi ; Increment total chars processed counter
inc rcx ; Increment buffer pointer
cmp rcx,r15 ; Compare with # of chars in buffer
jb .modTest ; If we've processed all chars in buffer...
call LoadBuff ; ...go fill the buffer again
cmp r15,0 ; If r15=0, sys_read reached EOF on stdin
jbe Done ; If we get EOF, we're done
; See if we're at the end of a block of 16 and need to display a line:
.modTest:
test rsi,000000000000000Fh ; Test 4 lowest bits in counter for 0
jnz Scan ; If counter is *not* modulo 16, loop back
call PrintLine ; ...otherwise print the line
call ClearLine ; Clear hex dump line to 0's
jmp Scan ; Continue scanning the buffer

In NASM, a local label can also be referenced from outside its own scope by prefixing it with the global label it belongs to, as shown below. In a sense, under the covers, a local label is just the “tail” of a global label.

jne Calc.modTest

For GAS: Local labels are different from local symbols. Local labels help compilers and programmers use names temporarily. They create symbols which are guaranteed to be unique over the entire scope of the input source code and which can be referred to by a simple notation. To define a local label, write a label of the form ‘N:’ (where N represents any non-negative integer). To refer to the most recent previous definition of that label write ‘Nb’, using the same number as when you defined the label. To refer to the next definition of a local label, write ‘Nf’. The ‘b’ stands for “backwards” and the ‘f’ stands for “forwards”.

Here is an example:

1:        branch 1f
2:        branch 1b
1:        branch 2f
2:        branch 1b
Which is the equivalent of:

label_1:  branch label_3
label_2:  branch label_1
label_3:  branch label_4
label_4:  branch label_3
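
The schematic above uses a made-up branch instruction. A runnable sketch of the same idea on x86-64, in GAS AT&T syntax, is given below; it sums the numbers 10 down to 1 and reuses the local label 1 twice, once as a backward target and once as a forward target.

```asm
# locallabel.s (hypothetical name): numeric local labels with b and f suffixes
.text
.global _start
_start:
    xor %edi, %edi        # running sum
    mov $10, %ecx         # counter
1:  add %ecx, %edi        # sum += counter
    dec %ecx
    jnz 1b                # back to the nearest previous "1:"
    cmp $55, %edi         # 10 + 9 + ... + 1 = 55
    je 1f                 # forward to the next "1:"
    mov $255, %edi        # only reached if the sum were wrong
1:  mov $60, %eax         # sys_exit
    syscall               # exit status 55
```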

This is a real GAS example from stackoverflow:

_pg_dir:
startup_32:
    movl $0x10,%eax
    mov %ax,%ds
    mov %ax,%es
    mov %ax,%fs
    mov %ax,%gs
    lss _stack_start,%esp
    call setup_idt
    call setup_gdt
    movl $0x10,%eax     # reload all the segment registers
    mov %ax,%ds     # after changing gdt. CS was already
    mov %ax,%es     # reloaded in 'setup_gdt'
    mov %ax,%fs
    mov %ax,%gs
    lss _stack_start,%esp
    xorl %eax,%eax
1:  incl %eax       # check that A20 really IS enabled (local label 1 is defined here)
    movl %eax,0x000000
    cmpl %eax,0x100000
    je 1b           # jump back to the previous 1:
    movl %cr0,%eax      # check math chip
    andl $0x80000011,%eax   # Save PG,ET,PE
    testl $0x10,%eax
    jne 1f          # ET is set - 387 is present (jump forward to the next 1:)
    orl $4,%eax     # else set emulate bit
1:  movl %eax,%cr0  # local label 1 is defined again here
    jmp after_page_tables

short, near, far jumps:

A jump target that lies within 127 bytes of the conditional jump instruction is called a short jump. A jump target that is further away than 127 bytes but still within the current code segment is called a near jump.

There is a third kind of jump called a far jump, which involves leaving the current code segment entirely for whatever reason. In the old DOS real‐mode world, this meant specifying both a segment address and an offset address for the jump target. Far jumps were not used very often. In the 32‐bit protected mode and 64‐bit long mode, far jumps are extremely rare and involve all sorts of operating system complexity that I can’t go into in this book.

The problem really lies with the difference between short and near jumps. A short conditional jump instruction generates a short—and hence compact—binary opcode. Short jump opcodes are always two bytes in size, no more. Near jump opcodes are either four or six bytes in size, depending on various factors.

jne Scan ; Jump within 127 bytes in either direction
jne near Scan ; Jump anywhere in the current code segment

Once you create library files containing procedures, there are two ways to use them:
- A library file can be assembled separately to a .o file, which in turn can be linked by the Linux linker into other programs that you may write in the future.
- A library file can be included in the source code file of the main program, using a directive called %INCLUDE. This is what you must do to use libraries from programs written within SASM.

The very heart of programming in modules is “putting off” resolution of addresses until link time.

Declare a procedure external.

EXTERN myProc

Over in the other module where procedure myProc is actually defined, it isn’t enough just to define the procedure. An eyelet needs a hook. You have to warn the assembler that myProc will be referenced from outside the module. The assembler needs to forge the hook that will hook into the eyelet. You forge the hook by declaring the procedure global, meaning that other modules anywhere else in the program may freely reference the procedure.

GLOBAL myProc

In GAS:

.extern myProc # in the module that calls myProc
.global myProc # in the module that defines myProc

A procedure that is declared as GLOBAL where it is defined may be referenced from anywhere its label is declared as EXTERN.

What works for procedures works for data as well, and it can work in either direction. Your program can declare any named variable as GLOBAL, and that variable may then be used by any module in which the same variable name is declared as external with the EXTERN directive.

defining Macro in NASM:

%macro ExitProg 0
mov rsp,rbp ; Stack alignment epilog
pop rbp
mov rax,60 ; 60 = exit the program
mov rdi,0 ; Return value in rdi 0 = nothing to return
syscall ; Call syscall sys_exit to return to Linux
%endmacro

Macros with parameters.

%macro WriteCtr 3 ; %1 = row; %2 = String addr; %3 = String length
push rbx ; Save caller's RBX
push rdx ; Save caller's RDX
mov rdx,%3 ; Load string length into RDX
xor rbx,rbx ; Zero RBX
mov bl,SCRWIDTH ; Load the screen width value to BL
sub bl,dl ; Calc diff. of screen width and string length
shr bl,1 ; Divide difference by two for X value
GotoXY bl,%1 ; Position the cursor for display
WriteStr %2,%3 ; Write the string to the console
pop rdx ; Restore caller's RDX
pop rbx ; Restore caller's RBX
%endmacro

Call a macro

WriteCtr al,AdMsg,ADLEN

All labels defined within a macro are considered local to the macro and are handled specially by the assembler. A label in a macro is made local by beginning it with two percent symbols: %%.

%macro UpCase 2 ; %1 = Address of buffer; %2 = Chars in buffer
mov rdx,%1 ; Place the offset of the buffer into rdx
mov rcx,%2 ; Place the number of bytes in the buffer into rcx
%%IsLC: cmp byte [rdx+rcx-1],'a' ; Below 'a'?
jb %%Bump ; Not lowercase. Skip
cmp byte [rdx+rcx-1],'z' ; Above 'z'?
ja %%Bump ; Not lowercase. Skip
sub byte [rdx+rcx-1],20h ; Force byte in buffer to uppercase
%%Bump: dec rcx ; Decrement character count
jnz %%IsLC ; If more chars in the buffer, repeat
%endmacro

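Defining macros in GAS (refer to macros in GAS). Note that the example below targets 32-bit Linux: it uses the int $0x80 system call interface and pushl/popl, so it must be assembled with as --32 and linked with ld -m elf_i386.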

.section .data

   prompt_str:
      .ascii "Enter Your Name: "
   pstr_end:
      .set STR_SIZE, pstr_end - prompt_str

   greet_str:
      .ascii "Hello "

   gstr_end:
      .set GSTR_SIZE, gstr_end - greet_str

.section .bss

// Reserve 32 bytes of memory
   .lcomm  buff, 32

// A macro with two parameters
//  implements the write system call
   .macro write str, str_size 
      movl  $4, %eax
      movl  $1, %ebx
      movl  \str, %ecx
      movl  \str_size, %edx
      int   $0x80
   .endm


// Implements the read system call
   .macro read buff, buff_size
      movl  $3, %eax
      movl  $0, %ebx
      movl  \buff, %ecx
      movl  \buff_size, %edx
      int   $0x80
   .endm


.section .text

   .globl _start

   _start:
      write $prompt_str, $STR_SIZE
      read  $buff, $32

// Read returns the length in eax
      pushl %eax

// Print the hello text
      write $greet_str, $GSTR_SIZE

      popl  %edx

// edx = length returned by read
   write $buff, %edx

   _exit:
      movl  $1, %eax
      movl  $0, %ebx
      int   $0x80
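
For comparison, a 64-bit variant of the write macro could look like the sketch below. This is an illustration only, not part of the referenced example: the macro name write64 and the symbols msg and MSG_SIZE are made up, and the macro uses SYSCALL with the x86-64 system call numbers (1 for write, 60 for exit) instead of int $0x80.

```asm
# macro64.s (hypothetical name): a GAS macro wrapping the 64-bit write syscall
.macro write64 str, len       # hypothetical macro: write(1, str, len)
    mov $1, %rax              # sys_write
    mov $1, %rdi              # stdout
    lea \str(%rip), %rsi      # address of the string
    mov $\len, %rdx           # length in bytes
    syscall
.endm

.data
msg:
    .ascii "Hello\n"
    .set MSG_SIZE, . - msg    # length computed by the assembler

.text
.global _start
_start:
    write64 msg, MSG_SIZE     # expands to the five instructions above
    mov $60, %rax             # sys_exit
    xor %edi, %edi            # exit status 0
    syscall
```

Because the macro body is pasted in at each use, RAX, RDI, RSI and RDX (plus RCX and R11, via SYSCALL) are overwritten at the call site.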

Chap11, Strings and Things

Any contiguous sequence of bytes or larger units in memory may be considered a string – not simply sequences of human-readable characters.

The string instructions deal with these large sequences of bytes or larger units in an extraordinarily compact way: by executing an instruction loop entirely inside the CPU.

Pascal treats strings as a separate data type, with a length counter at the start of the string to indicate how many bytes are in the string. In C, a string has no length byte in front of it. A C string is said to end when a byte with a binary value of 0 is encountered.

Assembly strings are just contiguous regions of memory. Assembly strings have no boundary values or length indicators. You should instead think of strings in terms of the register values that define them.

There are 2 kinds of strings in x64 assembly work. Source strings are strings that you read from. Destination strings are strings that you write to.

A source string is pointed to by RSI. A destination string is pointed to by RDI.
The length of both kinds of strings is the value you place in RCX. How this length is acted upon by the CPU depends on the specific instruction and how it’s being used.
Data coming from a source string or going to a destination string must begin the trip from, end the trip at, or pass through register RAX.

stosb instruction: STOre String by Byte
- RDI must be loaded with the address of the destination string.
- RCX must be loaded with the number of times the value in AL is to be stored into the string.
- AL must be loaded with the 8-bit value to be stored into the string.
- The direction flag DF must be set or cleared, depending on whether you want the store to proceed up-memory (cleared; use CLD) or down-memory (set; use STD).

Once you set up these three registers, you can safely execute a stosb instruction. This is what happens:
- The byte value in AL is copied to the memory address stored in RDI.
- RDI is incremented by 1, such that it now points to the next byte in memory following the one just written to.

RCX is not decremented by stosb. RCX is decremented automatically onlyif you put the REP prefix in front of stosb. Lacking the REP prefix, you have to do the decrementing yourself, either explicitly through DEC or through the LOOP instruction.

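;intel syntax
cld                  ; clear DF so RDI counts up-memory
mov al, FILLCHR      ; byte value to store
mov rdi, VidBuff     ; address of the destination string
mov rcx, COLS*ROWS   ; number of bytes to store
rep stosb            ; store AL at [RDI], RCX times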

When DF is set, STOSB and its fellow string instructions work downhill, from higher to lower addresses. When DF is cleared, STOSB and its siblings work uphill, from lower to higher addresses. When DF is set, RDI is decremented during string instruction execution. When DF is cleared, RDI is incremented.
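As a rough mental model, STOSB and REP STOSB can be written out in C. The `regs` struct and the function names are assumptions made for this sketch; it models only the registers the instruction touches and ignores everything else the CPU does.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical register file: only what STOSB reads or writes. */
struct regs {
    uint8_t  al;    /* value to store */
    uint8_t *rdi;   /* destination pointer */
    uint64_t rcx;   /* repeat count, used only by the REP prefix */
    int      df;    /* direction flag: 0 = up-memory, 1 = down-memory */
};

/* One STOSB: store AL at [RDI], then step RDI by one byte. RCX is untouched. */
void stosb(struct regs *r)
{
    *r->rdi = r->al;
    r->rdi += r->df ? -1 : 1;
}

/* REP STOSB: repeat STOSB while RCX != 0, decrementing RCX after each store. */
void rep_stosb(struct regs *r)
{
    while (r->rcx != 0) {
        stosb(r);
        r->rcx--;
    }
}
```

With `df = 0`, `rcx = 8` and `rdi` at the start of an 8-byte buffer, `rep_stosb` fills the whole buffer and leaves `rdi` one past its end and `rcx` at 0.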

Loop instruction: combines the decrementing of RCX with a test and a jump. When executed, the loop instruction first decrements RCX by 1. It then checks whether RCX is now 0; it tests RCX itself, and neither reads nor changes the Zero flag. If RCX is 0, it falls through to the next instruction. If not, loop branches to the label specified as its operand.

  mov al,30h
dochar: 
  stosb 
  inc al 
  loop dochar
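The fragment above assumes RDI, RCX and DF were already set up. A C sketch of the same loop, with the function name `fill_digits` invented for illustration and DF assumed clear:

```c
#include <stdint.h>

/* Model of:  mov al,30h / dochar: stosb / inc al / loop dochar
   rdi and rcx stand in for the registers and must be set by the caller,
   just as the assembly fragment assumes. Returns the final RDI. */
uint8_t *fill_digits(uint8_t *rdi, uint64_t rcx)
{
    uint8_t al = 0x30;        /* mov al,30h  (ASCII '0') */
    do {
        *rdi++ = al;          /* stosb, with DF assumed clear */
        al++;                 /* inc al */
    } while (--rcx != 0);     /* loop dochar: decrement RCX, jump if RCX != 0 */
    return rdi;
}
```

Called with `rcx = 10`, this writes the ten characters `0123456789`. The do/while shape is deliberate: the body runs before the count is tested, so a starting count of 0 wraps around instead of skipping the loop, both here and with the real LOOP instruction. Guarding against that case is what JRCXZ is for.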

MUL, IMUL instructions: MUL treats its operand values as unsigned, whereas IMUL treats them as signed.
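The difference shows up when the same bit pattern is multiplied both ways. The sketch below models the 8-bit one-operand forms, where AL is multiplied by the operand and the 16-bit product lands in AX. The function names are made up, and the casts assume the usual two's complement conversion.

```c
#include <stdint.h>

/* MUL r/m8: unsigned AL * operand, 16-bit product in AX. */
uint16_t mul8(uint8_t al, uint8_t operand)
{
    return (uint16_t)((unsigned)al * (unsigned)operand);
}

/* IMUL r/m8: the same bits read as signed values, 16-bit product in AX. */
int16_t imul8(uint8_t al, uint8_t operand)
{
    return (int16_t)((int)(int8_t)al * (int)(int8_t)operand);
}
```

For AL = 0FFh and an operand of 2, `mul8` sees 255 * 2 = 510 (01FEh), while `imul8` sees -1 * 2 = -2 (0FFFEh).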

STOSB, STOSW, STOSD, STOSQ instructions: these store AL, AX, EAX, or RAX respectively. RDI is changed by the size of the quantity acted upon: 1, 2, 4, or 8 bytes.

X86-64 removed all BCD math instructions found in the x86 definition: AAA, DAA, DAS, AAS, AAM, AAD.

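MOVSB, MOVSW, MOVSD, MOVSQ instructions: the byte, word, dword, or qword at the address stored in RSI is copied to the address stored in RDI, and both RSI and RDI are then adjusted by that size. With the REP prefix, RCX such units are copied, so a whole block of memory is moved.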
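A minimal C model of REP MOVSB, assuming DF is clear. The registers are passed by address (an arrangement chosen just for this sketch) so the caller can inspect where they end up.

```c
#include <stddef.h>
#include <stdint.h>

/* REP MOVSB with DF clear: copy RCX bytes from [RSI] to [RDI], one at a time. */
void rep_movsb(const uint8_t **rsi, uint8_t **rdi, uint64_t *rcx)
{
    while (*rcx != 0) {
        **rdi = **rsi;   /* MOVSB: byte at [RSI] goes to [RDI] */
        (*rsi)++;        /* both pointers advance by one byte */
        (*rdi)++;
        (*rcx)--;        /* REP decrements RCX after each copy */
    }
}
```

After the call, RCX is 0 and both pointers sit just past the bytes that were copied. Note that no data passes through RAX.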

JRCXZ instruction: tests the RCX register. If RCX is zero when the instruction executes, it jumps to the specified label; otherwise it falls through. Note: there is no JRCXNZ instruction.

LOOP and LOOPNZ: LOOP decrements RCX and jumps back to the top of the loop until RCX reaches 0. LOOPNZ also decrements RCX, but jumps back only while RCX is not 0 and the zero flag ZF is clear.

When one of your programs begins running, any command-line arguments that were entered when the program was launched are passed to the program on the Linux stack.

SCASB instruction: Scan String by Byte.
- For up-memory searches, the CLD instruction is used to ensure that the DF is cleared.
- The address of the first byte of the string to be searched is placed in RDI.
- The value to be searched for is placed in 8-bit register AL.
- The maximum count is placed in RCX.

The prefix may be REPE or REPNE.

REPNE SCASB

Use REPNE: Repeat SCASB as long as the byte pointed to by RDI does not equal AL and RCX has not reached 0.

Use REPE: Repeat SCASB as long as [RDI] equals AL and RCX has not reached 0.
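A common use of REPNE SCASB is finding the terminating 0 of a C string. The C model below is a sketch with DF assumed clear; `repne_scasb` and `scan_length` are names invented here, and the return value stands in for ZF.

```c
#include <stddef.h>
#include <stdint.h>

/* REPNE SCASB with DF clear. Returns the final ZF (1 = a match was found).
   On return *rdi points one byte past the last byte compared.
   If RCX starts at 0 the real instruction leaves ZF untouched; this model
   simply reports 0 in that case. */
int repne_scasb(uint8_t al, const uint8_t **rdi, uint64_t *rcx)
{
    int zf = 0;
    while (*rcx != 0) {
        zf = (**rdi == al);   /* SCASB: compare AL with [RDI], set ZF */
        (*rdi)++;
        (*rcx)--;
        if (zf)
            break;            /* REPNE stops once ZF = 1 */
    }
    return zf;
}

/* Hypothetical helper: length of a 0-terminated string, scanning at most max bytes. */
size_t scan_length(const uint8_t *s, uint64_t max)
{
    const uint8_t *rdi = s;
    uint64_t rcx = max;
    if (repne_scasb(0, &rdi, &rcx))
        return (size_t)(rdi - s) - 1;   /* RDI overshot the terminator by one */
    return (size_t)max;                 /* no terminator within max bytes */
}
```

The subtraction of 1 matters: RDI is advanced after every comparison, including the one that matched, so it ends up one byte beyond the byte that was found.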

Program prologues and epilogues.

;intel syntax 
;prologue 
push rbp 
mov rbp, rsp 
and rsp, -16 

;epilogue 
mov rsp, rbp 
pop rbp
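The `and rsp, -16` step is plain bit arithmetic, which a small C sketch makes visible. The function name is made up for this note.

```c
#include <stdint.h>

/* and rsp, -16: clear the low four bits, rounding down to a multiple of 16. */
uint64_t align_down_16(uint64_t rsp)
{
    return rsp & (uint64_t)-16;   /* -16 is 0xFFFFFFFFFFFFFFF0 in two's complement */
}
```

For example, 0x7ffd1238 becomes 0x7ffd1230, and a value that is already a multiple of 16 is left unchanged. Because the stack grows toward lower addresses, rounding down only leaves a few unused bytes and never overlaps data already pushed. The original RSP is still available in RBP, which is why the epilogue restores it with `mov rsp, rbp`.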

Chap12, Calling external functions written in the C language

end

ideas:

how to use .equ . - MSGBEGIN to define the message length, to be used in sys call.

how to zero a register.

Recursive function call after loop.

Use MUL and SHIFT operations for multiplication.

conditional move instructions

how to inline assembly in Free PASCAL and GCC?

using macros?

local labels?

movsb, stosb, loop instructions?