What is the actual relation between assembly, machine code, bytecode, and opcode?
I have read most of the SO questions about assembly and machine code, such as this, but they are too high level and do not show examples of actual assembly code being transformed into machine code. As a result, I still don't understand how it works at a deeper level.
The ideal answer to this question would show a specific example of some assembly code, such as the snippet below, and how each assembly instruction gets mapped to machine code, bytecode, and/or opcode. An answer like this would be very helpful to future people learning assembly, because so far in the past few days of digging I haven't found any clear summary.
The main things I am looking for are:
- a snippet of assembly code
- a snippet of machine code
- a mapping between the assembly snippet and the machine code snippet (how the mapping is done, or at least some general examples, plus how you know how to do it and where this information is documented on the web)
- how to interpret the machine code (for example, how opcodes fit in, and where to find documentation on what all those numbers mean)
Note: I don't have a computer science background, so I have just been slowly going lower level over the past several years and have now gotten to the point of wanting to understand assembly and machine code.
Relation Between Assembly and Machine Code
My current understanding is that an "assembler" (such as NASM) takes assembly code and creates machine code from it.
So when you assemble some code such as this example.asm:
global main
section .text
main:
    call write
write:
    mov rax, 0x2000004
    mov rdi, 1
    mov rsi, message
    mov rdx, length
    syscall
section .data
message: db 'Hello, world!', 0xa
length: equ $ - message
(compile it with nasm -f macho64 -o example.o example.asm). It outputs this example.o object file:
cffa edfe 0700 0001 0300 0000 0100 0000
0200 0000 0001 0000 0000 0000 0000 0000
1900 0000 e800 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
2e00 0000 0000 0000 2001 0000 0000 0000
2e00 0000 0000 0000 0700 0000 0700 0000
0200 0000 0000 0000 5f5f 7465 7874 0000
0000 0000 0000 0000 5f5f 5445 5854 0000
0000 0000 0000 0000 0000 0000 0000 0000
2000 0000 0000 0000 2001 0000 0000 0000
5001 0000 0100 0000 0005 0080 0000 0000
0000 0000 0000 0000 5f5f 6461 7461 0000
0000 0000 0000 0000 5f5f 4441 5441 0000
0000 0000 0000 0000 2000 0000 0000 0000
0e00 0000 0000 0000 4001 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0200 0000 1800 0000
5801 0000 0400 0000 9801 0000 1c00 0000
e800 0000 00b8 0400 0002 bf01 0000 0048
be00 0000 0000 0000 00ba 0e00 0000 0f05
4865 6c6c 6f2c 2077 6f72 6c64 210a 0000
1100 0000 0100 000e 0700 0000 0e01 0000
0500 0000 0000 0000 0d00 0000 0e02 0000
2000 0000 0000 0000 1500 0000 0200 0000
0e00 0000 0000 0000 0100 0000 0f01 0000
0000 0000 0000 0000 0073 7461 7274 0077
7269 7465 006d 6573 7361 6765 006c 656e
6774 6800
(that is the entire contents of example.o). When you then "link" that using ld -o example example.o, it gives you more machine code:
cffa edfe 0700 0001 0300 0080 0200 0000
0d00 0000 7803 0000 8500 0000 0000 0000
1900 0000 4800 0000 5f5f 5041 4745 5a45
524f 0000 0000 0000 0000 0000 0000 0000
0010 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 1900 0000 9800 0000
5f5f 5445 5854 0000 0000 0000 0000 0000
0010 0000 0000 0000 0010 0000 0000 0000
... 523 lines of this
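As far as I can tell, not every byte in these dumps is instruction encoding. The first 32 bytes of both files look like a Mach-O 64-bit header rather than code. Here is a minimal sketch that unpacks them, assuming the standard mach_header_64 layout of eight little-endian 32-bit fields; the names OBJECT_HEADER, LINKED_HEADER, and parse_header are my own.

```python
import struct

# First 32 bytes of each dump above, copied as-is.
OBJECT_HEADER = bytes.fromhex(
    "cffa edfe 0700 0001 0300 0000 0100 0000"
    "0200 0000 0001 0000 0000 0000 0000 0000"
)
LINKED_HEADER = bytes.fromhex(
    "cffa edfe 0700 0001 0300 0080 0200 0000"
    "0d00 0000 7803 0000 8500 0000 0000 0000"
)

# Field order assumed from the mach_header_64 struct.
FIELDS = ("magic", "cputype", "cpusubtype", "filetype",
          "ncmds", "sizeofcmds", "flags", "reserved")


def parse_header(raw):
    """Unpack eight little-endian uint32 values into a field dictionary."""
    values = struct.unpack("<8I", raw[:32])
    return dict(zip(FIELDS, values))


for name, raw in (("example.o", OBJECT_HEADER), ("example", LINKED_HEADER)):
    header = parse_header(raw)
    print(name, {key: hex(value) for key, value in header.items()})
```

If this reading is right, the magic value 0xfeedfacf marks a 64-bit Mach-O file, cputype 0x01000007 is x86-64, and filetype is 1 (object file) for example.o and 2 (executable) for the linked output. So a large part of each dump would be file-format bookkeeping around the actual instructions.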
But how did it go from assembly instructions to those numbers? Is there a standard reference that lists all of those numbers and what they mean for whatever architecture you are on (I am using x86-64 through NASM on OSX), and that shows how each group of numbers maps to each assembly instruction?
I understand that machine code is different for every machine, and there are dozens if not hundreds of different types of machines. So I am not currently looking for how assembly gets transformed to every one (that would be complicated). I just am interested in an example that illustrates how the transformation works, and any architecture can serve as the example. And from that point, I could go and research the specific architecture I am interested in and find the mapping.
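The only part of the example.o dump I think I can line up with the source is a 32-byte run near the end, starting at e8 00 00 00 00. Below is a rough sketch of my tentative reading. The instruction lengths in LAYOUT are my own assumption, based on the x86-64 encodings E8 (call rel32), B8+r (mov r32, imm32), REX.W B8+r (mov r64, imm64), and 0F 05 (syscall); the names TEXT, LAYOUT, and split_instructions are made up for illustration.

```python
# The 32 bytes copied from the example.o dump above.
TEXT = bytes.fromhex(
    "e800 0000 00b8 0400 0002 bf01 0000 0048"
    "be00 0000 0000 0000 00ba 0e00 0000 0f05"
)

# (source line, assumed encoded length in bytes)
LAYOUT = [
    ("call write", 5),          # e8 + 32-bit relative offset
    ("mov rax, 0x2000004", 5),  # b8 + imm32 (emitted as mov eax, imm32)
    ("mov rdi, 1", 5),          # bf + imm32 (emitted as mov edi, imm32)
    ("mov rsi, message", 10),   # 48 be + imm64 (address filled in later)
    ("mov rdx, length", 5),     # ba + imm32 (emitted as mov edx, imm32)
    ("syscall", 2),             # 0f 05
]


def split_instructions(code, layout):
    """Cut the byte string into (offset, bytes, source line) rows."""
    rows = []
    offset = 0
    for source, size in layout:
        rows.append((offset, code[offset:offset + size], source))
        offset += size
    return rows


ROWS = split_instructions(TEXT, LAYOUT)
for offset, chunk, source in ROWS:
    encoded = " ".join(f"{byte:02x}" for byte in chunk)
    print(f"{offset:02x}  {encoded:<30}  {source}")

# Immediates are stored little-endian right after the opcode byte.
print(hex(int.from_bytes(ROWS[1][1][1:], "little")))
```

Read this way, the first byte (or two) of each group selects the operation and register, and the remaining bytes are the operand in little-endian order, so 04 00 00 02 is 0x2000004 and 0e 00 00 00 is the 14-byte message length. The eight zero bytes after 48 be would be a placeholder for the address of message.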
Relation Between Assembly and Bytecode (or is it called "opcode"?)
So from my reading so far, assembly gets transformed into machine code as demonstrated above.
But now I get confused. I see people talk about bytecode, such as in this SO answer, showing stuff like this:
void myfunc(int a) {
    printf("%s", a);
}
The assembly for this function would look like this:
OP Params  OpName      Description
13 82 6a   PushString  82 means string, 6a is the address of "%s"
                       So this function pushes a pointer to "%s" on the stack.
13 83 00   PushInt     83 means integer, 00 means the one on the top of the stack.
                       So this function gets the integer at the top of the stack,
                       And pushes it on the stack again
17 13 88   Call        1388 is printf, so this calls the printf function
03 02      Pop         This pops the two things we pushed back off the stack
02         Return      This returns to the calling code.
This is where I get lost. Doing some digging, I can't tell whether each of those 2-digit hex numbers, like 13 82 6a, is individually called an "opcode", with the whole set of them called "bytecode" as a catch-all term. In addition, I can't find a table that lists all of these 2-digit hex numbers and explains how they relate to machine code or assembly.
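To make my confusion concrete, here is a minimal sketch of how I imagine the byte stream in that table would be decoded. Everything in it is hypothetical: the OPCODES table and its operand counts are inferred only from the quoted example, and I have folded PushString and PushInt into a single Push, since both start with 13.

```python
# Hypothetical opcode table: first byte -> (name, number of operand bytes).
OPCODES = {
    0x13: ("Push", 2),
    0x17: ("Call", 2),
    0x03: ("Pop", 1),
    0x02: ("Return", 0),
}

# The bytes from the table above, in order.
BYTECODE = bytes.fromhex("13 82 6a 13 83 00 17 13 88 03 02 02")


def disassemble(code):
    """Walk the byte stream, reading one opcode and then its operands."""
    position = 0
    instructions = []
    while position < len(code):
        name, operand_count = OPCODES[code[position]]
        start = position + 1
        operands = list(code[start:start + operand_count])
        instructions.append((name, operands))
        position = start + operand_count
    return instructions


for name, operands in disassemble(BYTECODE):
    print(name, [f"{value:02x}" for value in operands])
```

If this picture is right, only the first byte of each instruction would be the "opcode", the bytes after it would be operands, and "bytecode" would be the whole stream. The same value can then be either one depending on position: 13 is an opcode in Push but an operand in Call, and 02 is an operand of Pop but the opcode of Return.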
To summarize, I am very much looking forward to an example showing how assembly instructions map to machine code, and its relation to bytecode and/or opcodes. (I am not looking for how a compiler does this, just how the general mapping works.) I think this would clarify it not only for me but for many people down the road who are interested in learning more about the bare metal.
Another reason this would be valuable is that it would help in understanding how the LLVM compiler generates machine code. Does it have some sort of "complete list" of 2-digit opcodes or 4-digit machine code sequences, and know exactly how each maps to architecture-specific assembly? Where did that information come from? An answer to this overall question would make it much clearer how LLVM implemented its code generation.
Update
Update in response to @HansPassant's comment: I don't actually care about the precise distinctions between the words, sorry if that wasn't clear. I just want to know this: how does assembly map to machine code (and where on the web are the references that hold that information), and are opcodes or bytecode used anywhere in that process? And if so, how?