Created Basic Assembly tutorial (markdown)

2025-04-24 13:05:51 +00:00 · 2017-11-04 18:54:45 -04:00 · 2017-11-04 18:54:45 -04:00 · f5673fc9ab
commit f5673fc9ab
1 changed files with 448 additions and 0 deletions
--- a/Basic-Assembly-tutorial.md
+++ b/Basic-Assembly-tutorial.md
@ -0,0 +1,448 @@
+## The very basics
+### Binary and hexadecimal number representation
+Computers can only "understand" the binary numbers (or base-2) as an electrical signal in its circuits. It is represented with the symbols 0 and 1. In this tutorial, each digit of a binary number is called a **bit**. We say that a bit is 'set' when it is equal to 1.
+
+For example, `0000 0101` is a 8-bit number (also called a **byte**) and `0000 0000 0000 0101` is a 16-bit binary number (also called a **word**).
+
+A number can be represented in a lot of different ways: `0101` is base-2 is the same number as `5` in base-10 and `0x5` in base-16. You don't really *convert* a binary number to a decimal number, you just change the way it is represented.
+
+Let's take a look at this number representation table for [unsigned](https://github.com/simon987/Much-Assembly-Required-3/wiki/Basic-Assembly-tutorial#number-representation---negative-numbers) 4-bit numbers:
+
+| base-16 | base-2 | base-10 |
+| ------- | ------ | ------- |
+| `0x0`   | `0000` | `0` |
+| `0x1`   | `0001` | `1` |
+| `0x2`   | `0010` | `2` |
+| `0x3`   | `0011` | `3` |
+| `0x4`   | `0100` | `4` |
+| `0x5`   | `0101` | `5` |
+| `0x6`   | `0110` | `6` |
+| `0x7`   | `0111` | `7` |
+| `0x8`   | `1000` | `8` |
+| `0x9`   | `1001` | `9` |
+| `0xA`   | `1010` | `10` |
+| `0xB`   | `1011` | `11` |
+| `0xC`   | `1100` | `12` |
+| `0xD`   | `1101` | `13` |
+| `0xE`   | `1110` | `14` |
+| `0xF`   | `1111` | `15` |
+
+Note that a 4-bit binary number can be easily represented with a single hexadecimal (base-16) digit. Hexadecimal numbers are often a lot easier to read than a string of bits and this is why it is widely used by programmers to represent binary numbers. Hexadecimal numbers are usually prefixed with `0x`.
+
+//todo: convert bin to dec? or maybe it's irrelevant?
+
+### Data types in computers
+In programming, there are various ways to use the data in the computer: integers, floating point numbers, Boolean, strings of characters, etc. A data type is just a way to interpret the binary data.
+
+For example, let's say the data stored in a certain location in the computer is `0100 0001 0000 0101`. We could interpret it as a single 16-bit number - `16645` - or two 8-bit numbers - `65` and `5` - or two [ASCII](http://www.asciitable.com/) characters - `A` and the `Enquiry character`. It is up to the programmer to decide how to interpret the data.
+
+Here is a short list of common data types:
+
+| Names | Size (bits) | range |
+| ----- | ----------- | ----- |
+| [Boolean](https://github.com/simon987/Much-Assembly-Required-3/wiki/Basic-Assembly-tutorial#booleans-and-boolean-logic) | 1 | `0` to `1` |
+| Byte, char, octet (Signed) | 8 | `-128` to `127`|
+| Byte, char, octet (Unsigned) | 8 | `0` to `255`|
+| Word, short (Signed) | 16 | `−32768` to `32767`|
+| Word, short (Unsigned) | 16 | `0` to `65535`|
+
+
+### Number representation - Negative numbers?
+We have seen how positive numbers are represented in binary, now let's take a look at how negative numbers are represented.
+
+The processor uses the most significant (leftmost, also called the sign bit) bit of a number to indicate the sign: 0 for positive numbers and 1 for negative numbers. To negate a number, we use the **two's complement** operation: we 'flip' each bit (0 becomes 1 and 1 becomes 0, this is what we call a logical NOT) and add 1 (See [Binary arithmetic](https://github.com/simon987/Much-Assembly-Required-3/wiki/Basic-Assembly-tutorial#binary-arithmetic) for more information).
+
+Let's take the number 5 for example (16-bit):
+```
+    0000 0000 0000 0101   # The two's complement (signed) value is  5, the unsigned value is 5
+    1111 1111 1111 1011   # The two's complement (signed) value is -5, the unsigned value is 65533
+    ^
+    The sign bit is set
+```
+We can see that the same data can be represented by a positive number and a negative number at the same time, it is up the programmer to decide to interpret it as a signed number or an unsigned number.
+
+### Binary arithmetic
+**Addition**
+
+It is quite simple to add two binary numbers together using the **carry** method:
+```
+1 + 0 = 1
+0 + 1 = 1
+1 + 1 = 0, carry 1
+```
+
+When the value of a bit exceeds 1, the excess needs to be **carried out** of this bit **into** the left bit (This is an important concept for the [overflow and carry flags](https://github.com/simon987/Much-Assembly-Required-3/wiki/Central-processing-unit-(CPU)#the-flags-register) later on in the tutorial)
+
+So a 16-bit addition in the CPU looks like this:
+```
+                 111    < Carried bits
+  0000 0000 0000 0101
+ 0000 0000 0000 0011
+  -------------------
+                 1000
+```
+The first (rightmost) bit's excess is **carried into** the second bit, the second bit's excess is **carried into** the third bit and the third bit's excess is **carried into** the fourth bit (We can also say that there is a **carry out** of the first bit, the second bit and the third bit)
+
+**Subtraction**
+
+The computer does not implement a way to execute the subtraction operation, they simply negate one of the operands and do an addition.
+
+For example:
+```
+5 - 3 = 5 + (-3)
+
+
+ 11111 1111 1111 1 1    < Carried bits
+  0000 0000 0000 0101
+ 1111 1111 1111 1101
+  -------------------
+  0000000000000000010
+```
+Note that the there is a carry out of the leftmost bit: the result couldn't fit in the 16-bit destination.
+
+**Multiplication**
+TODO
+**Division and remainder**
+TODO
+**Left/Right bit shift**
+TODO
+
+### Booleans and Boolean logic
+A 'Boolean' is a data type with only two possible values: **true** (represented by 1) and **false** (represented by 0). It is very useful in programming for conditional logic (e.g. IF my glass is empty THEN order another beer) and in electric circuits.
+Here is a table of the boolean operators used in computers
+
+| Operator | Names | Representation | Example |
+| --- | --- | --- | --- |
+| NOT | Logical negation | `!` | `!0 = 1` |
+| AND | Logical and | `&` | `0 & 1 = 0` |
+| OR | Logical inclusive or | `\|` | `0 \| 1 = 1` |
+| XOR | Logical exclusive or | `^` | `0 ^ 1 = 1` |
+
+**Logical negation (NOT)**
+The logical negation simply inverts (or flips) the given binary value.
+
+Truth table:
+
+| A | !A |
+| --- | --- |
+| `0` | `1` |
+| `1` | `0` |
+
+**Logical and (AND)**
+The result of a logical and is true if both of the operands are true.
+
+Truth table:
+
+| A | B | A & B |
+| --- | --- | --- |
+| `0` | `0` | `0` |
+| `0` | `1` | `0` |
+| `1` | `0` | `0` |
+| `1` | `1` | `1` |
+
+**Logical inclusive or (OR)**
+The result of a logical or is true if one or both of the operands are true.
+
+Truth table:
+
+| A | B | A \| B |
+| --- | --- | --- |
+| `0` | `0` | `0` |
+| `0` | `1` | `1` |
+| `1` | `0` | `1` |
+| `1` | `1` | `1` |
+
+**Logical exclusive or (XOR)**
+The result of a logical and is true if one of the operands are true, but not both.
+
+Truth table:
+
+| A | B | A ^ B |
+| --- | --- | --- |
+| `0` | `0` | `0` |
+| `0` | `1` | `1` |
+| `1` | `0` | `1` |
+| `1` | `1` | `0` |
+
+The same concepts apply for binary numbers:
+```
+     0000 1010 1111 1111
+AND  1000 1011 0111 0101
+     -------------------
+     0000 1010 0111 0101
+
+NOT  0011 1111 0000 1010
+     -------------------
+     1100 0000 1111 0101
+```
+
+## Computers and code
+### Basic computer architecture
+Here is a basic diagram of the system design of a modern computer
+![]()
+We can see that all there is a lot of I/O (input and output) hardware connected to the CPU, but we don't really have to worry about the exact way to access these from within the CPU because it is taken care of by the operating system (Linux, Windows...). In the real world, programmers use **system calls** which are understood by the operating system to request access to certain files on the hard drive or to display text on the screen.
+Only the CPU, memory and registers (in green) are relevant for the game.
+
+The central processing unit (CPU) is the 'brain' of the computer. It is capable of reading and writing data to the memory and the registers.
+
+The memory is a long array of locations accessible by the CPU.
+The registers are easily accessible locations accessible by the CPU. We will see in detail how the CPU interacts with the memory and the registers in the next sections.
+
+
+### Machine code and compilation
+Today, almost all of computer software is written using programming languages such as C/C++ and Java. With the use of a compiler or an interpreter, programmers translate computer code (written in plain text, readable by humans) into **machine code** that the CPU can understand. More specifically, into a list of **instructions**.
+
+Here's an example using C:
+```c
+int main(){
+
+    int a = 1;
+
+    if(a < 12){
+        return 1;
+    }else{
+        return 0;
+    }
+
+}
+```
+Using the GNU Compiler, the plain text C source file is translated into machine code:
+```
+...
+4855 e589 45c7 01fc 0000 8300 fc7d 7f0b
+b807 0001 0000 05eb 00b8 0000 5d00 00c3
+...
+```
+Using the `objdump` command on Linux, the machine code is translated human-readable assembly language:
+```assembly
+0000000000000000 <main>:
+   0:  55                     push   rbp
+   1:  48 89 e5               mov    rbp,rsp
+   4:  c7 45 fc 01 00 00 00   mov    DWORD PTR [rbp-4],1
+   b:  83 7d fc 0b            cmp    DWORD PTR [rbp-4],12
+   f:  7f 07                  jg     18
+  11:  b8 01 00 00 00         mov    eax,1
+  16:  eb 05                  jmp    1d
+  18:  b8 00 00 00 00         mov    eax,0
+  1d:  5d                     pop    rbp
+  1e:  c3                     ret
+```
+
+The machine code produced by the C compiler can be directly translated to assembly language: `00b8 0000` becomes `mov eax, 0` and `c3` becomes `ret`. We will see what these instructions do very soon. The compilation process varies greatly from language to language, but the end result must be understandable by the CPU. The instructions (machine code numbers) that the CPU can understand vary depending on the CPU's **instruction set**. Most modern consumer processors use the x86-64 (or amd64) instruction set.
+The [game's CPU](https://github.com/simon987/Much-Assembly-Required-3/wiki/Central-processing-unit-(CPU)) is based on the Intel 8086 processor (x86-16 instruction set)
+
+**So, what does a *32-bit* processor mean?**
+A 32-bit processor will handle data in 32-bit chunks, and do its calculations with 32-bit numbers. Intel x86 processors are backward compatible, meaning that a 32-bit processor can handle data and do its calculations with 16-bit and 8-bit numbers (That is why the `DWORD PTR` directive is required at offset 4 and b in the previous example: we are telling the CPU that we want to deal with 32-bit numbers (double word) for this instruction).
+
+That is not the case with the game's CPU, which is a purely 16-bit processor.
+
+### Memory and registers
+Random-access memory (RAM) is a piece of hardware capable of storing data and accessing it in no particular order (hence the name). You can think of the memory as a pile of blocks, each block having an address and storage for data. In the game, memory is only accessed in 16-bit chunks so each "block" holds 16 bits of data. Here's how it looks:
+
+[](link of pic)
+
+Each 16-bit block has an address (left).
+
+Registers are individual storage blocks that are located inside the CPU. They are used in some instructions and can be written to or accessed just like memory, but they are referred by their name (A, B, X...) instead of by an address. On modern CPUs, registers are much faster to access than memory.
+
+[](link of pic)
+Each register holds 16 bits of data
+
+### Execution cycle
+To execute a program, the CPU start by reading (fetching) the instruction located at the address stored in the instruction pointer (IP), then it will decode the instruction and increment the instruction pointer. Finally, it executes the instruction and the cycle starts over. The instruction pointer is a register that always holds the memory address of the next instruction.
+Let's pretend that we have a program that is compiled in machine code and copied into memory at the address 0x0400. To execute this program, IP is set to 0x400 (We say that the execution *jumps* to this address), the CPU fetches the instruction at this address in memory, then it is decoded and executed and the cycle starts over.
+The program can *jump* to other addresses depending on certain conditions using a jump instruction.
+
+This is normal if this seems very abstract, but don't worry, there will be practical examples in the next section.
+
+## Assembly language
+So we have seen how programmers can use programming languages to write computer programs, and we have seen how these programs are handled by the CPU, now let's take a look at how to write assembly code from scratch.
+
+### Instruction and operands
+Let's start with some an example:
+```assembly
+MOV A, 1
+INT 9
+BRK
+```
+You can try to execute it yourself by copying it directly into the game editor and uploading it. The code simply displays a warning icon on top of your controlled unit.
+
+The first instruction is a MOV (move) instruction. The syntax for the MOV instruction is MOV *destination*, *source* and it copies whatever is specified for the source operand in the destination operand, overwriting what was there before the instruction. The INT instruction will be explained in more detail later.
+
+The first line can be read as "Move `1` into `A`"
+
+General syntax for each line of the assembly code:
+```assembly
+label:    mnemonic  destination, source    ; comment
+```
+You can prefix any line with a name and a semicolon, this will define a label which is a symbolic name for the memory address of the code or data following it. During the assembly process, each mention of a label is replaced by this address and therefore makes them immediate values. We will see labels in more detail in the next section.
+Each instruction has a mnemonic to help remember it, the move instruction's mnemonic is MOV.
+The instruction's mnemonic is followed by one or two operands. There are 3 types of operands in the game: memory operand, register operand and immediate value operand.
+
+**Immediate value operand:**
+An immediate value operand is simple a base-10 or base-16 number (with 0x prefix). The following immediate operands are *valid*: `12, -3, 0x0002, 0x03, label_name, +4` and the following immediate operands are *invalid*: `12.3, -0x02, abcd`
+
+**Register operand:**
+A register operand is the name of a register. only the following registers can be used in instructions: `A, B, X, Y, SP, BP`, the register `FLAGS, IP` can't be modified or accessed directly. Register names and mnemonics are not case-sensitive.
+
+**Memory Operand:**
+A memory operand is a memory address in between square brackets. An immediate value or a register can be used to specify the address. The following memory operands are *valid*: `[12], [0xFFFE], [-3], [A], [BP], [label_name]`. You can also specify a register with a displacement value: `[A + 3], [BP-4], [X + label_name]`
+
+Take a moment familiarize yourself with [the various instructions available](https://github.com/simon987/Much-Assembly-Required-3/wiki/Instruction-set).
+
+Let's quickly observe some more complex instructions before moving to the next section.
+**MOV**
+Transfer a word from the source operand to the destination operand. The source operand remains unchanged and the destination operand is overwritten.
+**ADD**
+The destination operand is overwritten by the sum of the two operands
+```assembly
+MOV A, 0 ; A = 0
+MOV B, 2 ; B = 2
+ADD A, B ; A = A + B = 2
+```
+
+**MUL**
+Performs unsigned multiplication.
+
+The A register and the source operand are multiplied and the 32-bit result is stored in Y:A.
+```assembly
+MOV A, 6
+MUL 4
+```
+The result of the multiplication (A * 4 = 24) is stored in Y:A, the most significant (leftmost, or higher) word is stored in the Y register, and the least significant (rightmost, or lower) word is stored in the A register. The Y register is not overwritten if the higher word is zero (if the result of the multiplication is lower than 65536).
+```
+Y = 0x???? # The value of Y is unchanged
+A = 0x0018 # 6 * 4 = 24
+```
+One more example:
+```assembly
+MOV A, 0x8001
+MUL 0x0002
+```
+```
+Y = 0x0001
+A = 0x0002 # The complete result is Y:A = 0x00010002
+```
+If the result is greater than 65536 and Y is overwritten, the carry flag (CF) and the overflow flag (OF) are set, otherwise they are both reset to 0 (More information on the FLAGS register in the next section).
+
+**BRK**
+Sets the break flag, which stops the execution until the next game tick (see TODO).
+
+
+### Conditional execution
+As we have seen, the CPU always execute each instruction in order, one after the other. But what if we want to execute the same code over and over again in a loop or execute some code only in particular conditions?
+
+First, let's take a look at the JMP instruction (also called unconditional jump):
+```assembly
+        MOV A, 0
+        JMP myLabel
+        ADD A, 1
+myLabel:
+        BRK
+
+```
+In the above example code, the CPU will execute the MOV instruction, then the JMP instruction: the JMP instruction overwrites the value of the instruction pointer - the register that points to the next instruction to execute - with the value of the label *myLabel* which is the address of the code following it. So the instruction pointer will point to the next instruction to execute: the BRK instruction.Therefore, the ADD instruction is never executed. We can say the execution **jumps** to *myLabel*.
+If we were to move the label just above the JMP instruction, the CPU would execute the JMP instruction over and over again in an infinite loop.
+
+**Conditional jumps**
+Conditional jumps are instructions that only change the IP if a condition is met, here's a list of conditional jumps:
+
+| Mnemonic | Summary | Jump condition |
+| -------- | ------- | -------------- |
+| JGE | Jump if greater or equal | `SF = OF` |
+| JG | Jump if greater | `SF = OF & ZF = 0` |
+| JLE | Jump if less or equal | `SF != OF \| ZF = 1` |
+| JL | Jump if lower | `SF != OF` |
+| JNS | Jump if not sign | `SF = 0` |
+| JS | Jump if sign | `SF = 1`|
+| JNZ | Jump if not zero | `ZF = 0`|
+| JZ | Jump if zero | `ZF = 1`|
+
+You can read the FLAGS section on the [CPU page](https://github.com/simon987/Much-Assembly-Required-3/wiki/Central-processing-unit-(CPU)#the-flags-register) to know exactly in which condition the flags are set.
+
+Let's say that we wanted to check if the value of the register A was equal to 5, we could do this:
+```assembly
+SUB A, 5
+JZ some_label
+```
+This above code can be read as "Subtract 5 to A, Jump to some_label if the result of the last instruction was 0". This would work, but whatever was stored in the A register would be modified no matter what, and we might to keep that value for some other calculations. Luckily for us, the CMP (Compare) instruction exists and is the same as the SUB instruction except that it doesn't store the result anywhere: it only modifies the FLAGS register.
+
+Let's say that we want to check if the third bit of the A register was set:
+```assembly
+AND A, 0x0004 ; = 0000 0000 0000 0100
+JNZ some_label
+```
+If the result of the AND instruction different from 0, that means that the third bit of the A register is set. The TEST instruction is the same as the AND instruction except that it does not store the result anywhere. There is almost always a TEST or a CMP instruction preceding a conditional jump.
+
+### The stack, procedures and the stack frame
+
+In programming languages, we use procedures or functions to organize the code. Functions are a very important concept to master when reading or writing assembly language. This part can be quite hard to understand at first but it is very rewarding to be able to understand how real-world programs works on the lowest level.
+
+This is the basic structure of a function in most programming languages:
+* The function declaration
+  * Declare a return type
+  * Declare the parameters of the function
+* Local variable declaration
+* Statements
+* Return statement
+
+And this is how it looks in C:
+```c
+int doSomeMath(int a, int b){ //Function declaration
+    int result; //Local variable declaration
+
+    //Some statements
+    result = 0;
+    result = result + a;
+    result = result - b;
+    result = result + 1;
+   
+    //The return statement
+    return result;
+}
+```
+Let's go through this example line by line.
+The first line is the function declaration, we are telling the C compiler that the following block of code is our function called *doSomeMath*, and that it accepts two **arguments** of *int* type: a and b. We are also telling the compiler that the **return type** of the *doSomeMath* function is *int*.  
+In the second line (`int result = 0;`) we are declaring one **local variable** of type *int* called *result*. The local variables are stored in memory, so the compiler needs to know how much space to reserve for each variable.  
+The following 4 lines are simple statements that can each be translated into single instructions.
+The last statement (`return result`) indicates that we want to **return** to the callee (The code that *called* this function) the value inside the *result* local variable.  
+
+Let's see how it translates in assembly language:
+```assembly
+; Function doSomeMath - Does some math
+; Usage:
+; word doSomeMath(word a, word b);
+; Modifies: A
+doSomeMath:
+    push BP             ; Save previous stack frame
+    mov BP, SP
+
+    sub SP, 1           ; Reserve space for one 16-bit block of memory
+    
+    mov [BP-1], 0       ; result = 0
+    add [BP-1], [BP+2]  ; result = result + a
+    sub [BP-1], [BP+3]  ; result = result - b
+    add [BP-1], 1       ; result = result + 1
+
+    mov A, [BP-1]       ; The return value is stored in the A register
+
+    mov SP, BP          ; Undo the reservation of space
+    pop BP              ; Restore the previous stack frame
+    ret
+```
+To understand this example, we must first understand what is the **stack**. The **stack** is a data structure with a *Last In First Out* (LIFO) system:  the *last* inserted item goes out *first*. We can imagine the **stack** as a stack of plates in a restaurant: plates are taken from the top of the stack and new plates are placed on the top of the stack. We can't insert or remove data that is under other pieces of data, but we can modify it.  
+
+
+We use the term *push* when we add a piece of data on top of the stack and *pop* when we remove a piece of data from the stack.
+
+Visual representation of the stack:
+![Visual representation of the stack](link to picture)
+
+We can notice that this implementation of the stack is actually upside-down compared to a real-world stack of items: the stack grows downwards in memory, towards lower addresses. The important thing to remember is that the bottom of the stack is at a fixed address and that the top of the stack changes when an item is *pushed* on the stack or *popped* out. In the game's CPU, the **stack pointer** (SP) register always points to the top of the stack and the **base pointer** always points to the base of the current **stack frame**
+
+
+
+### System calls  
+//todo: system calls
+### Exemple code  
+//todo: walk through of exemple code from MAR-Exemple-Code repo