NGINX Unit Adds Assembly Language Support

Original: https://www.nginx.com/blog/nginx-unit-adds-assembly-language-support/


The NGINX Unit team is pleased to add support for a new programming language to our already versatile bag of tricks.

Despite its complexity, the assembly language is popular in all kinds of system software; we believe that web development can also benefit from the advantages it provides. Assembly has been used since the very beginning of the computing era and still retains a very active community of supporters. In the last few months, we’ve been receiving a lot of requests to add support for it, and finally its time has come. The result: web development in assembly has never been so easy.

Background

The benefits of using the assembly language for your web apps are immense:

All of the above makes the assembly language a very promising option for web development. For brevity only, this post focuses on the AMD64 (x86_64) architecture and uses the generic name x64. If you are not familiar with x64 assembly, please refer to the software developer manual for your processor and the System V Application Binary Interface (ABI). The System V ABI is very useful here because of the need to interface with NGINX Unit’s C‑based API.

Note that there are two common syntax flavors for x64 assembly, Intel and AT&T. The differences between them cause tremendous flame wars within the programmer community. To avoid controversy, we did our best to support both versions.

Your First ‘Hello World’ Website in Assembly

All application modules in NGINX Unit share a secret: they rely on a static library called libunit.a, which provides primitives for communication with the core processes. This post assumes you have followed the steps in the NGINX Unit installation guide to clone the sources.

In NGINX Unit, assembly support is built in (pretty much like the external application type), but you need to build libunit.a yourself by running these commands:

$ pwd
/home/user/unit
$ ./configure && make libunit-install

The second command creates the file build/libunit.a; that’s the only dependency. Now comes the fun part!

First, though, let’s iterate through several necessary assumptions:

The basic workflow for a worker to register itself with the NGINX Unit daemon is as follows:

  1. Allocate an nxt_unit_init struct and set the request_handler callback.
  2. Call nxt_unit_init(init) to initialize the client and register the application.
  3. Call nxt_unit_run(ctx) to start the request processing cycle.
  4. Call nxt_unit_done(ctx) at the end.

Pretty easy, right? But hey, assembly has no structs, so we need to allocate an opaque chunk of bytes and fill it manually. Remember that structures in C are by default padded to ensure the alignment that enables faster memory access. You can obtain more information on how this works here and see the init structure declaration here. Assembly programmers don’t need structs. Instead, they calculate offsets manually.
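To make the padding point concrete, here is a small C sketch. The struct demo_init and the helper names are hypothetical and are not the real nxt_unit_init_t; the sketch only shows how offsetof and sizeof reveal the numbers an assembly programmer ends up hard-coding. The expected values assume an LP64 target such as x86-64 Linux.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical struct for illustration only; NOT the real nxt_unit_init_t. */
struct demo_init {
    uint8_t   flag;              /* 1 byte, then 7 bytes of padding on LP64 */
    void    (*handler)(void *);  /* 8-byte pointer, must be 8-byte aligned  */
    uint32_t  count;             /* 4 bytes, then 4 bytes of tail padding   */
};

/* Offsets an assembly programmer would write down as constants. */
size_t demo_handler_offset(void) { return offsetof(struct demo_init, handler); }
size_t demo_count_offset(void)   { return offsetof(struct demo_init, count); }
size_t demo_init_size(void)      { return sizeof(struct demo_init); }

/* Returns 1 when the layout matches what we expect on an LP64 target. */
int demo_layout_matches_x64(void)
{
    return demo_handler_offset() == 8
        && demo_count_offset() == 16
        && demo_init_size() == 24;
}

/* Sanity check, in the spirit of verifying sizeof() before hard-coding it. */
void demo_layout_self_check(void)
{
    assert(demo_layout_matches_x64());
}
```

Although the three members add up to only 13 bytes, the structure occupies 24, which is why offsets copied from the C declaration by eye are easy to get wrong.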

So, without further ado, here’s the first code snippet, which just allocates the init structure and zeroes the memory:

The init_struct_size value was obtained from sizeof(nxt_unit_init_t) in C. You may want to check the size of the structure yourself (that’s just a sanity check in case of an almost improbable GCC ABI change or an update in NGINX Unit’s header files).

As promised, we’re using prologue and epilogue in each function to improve readability, but then we need to zero the %rbp register at line 37 to mark it as the first frame pointer because some older libc versions and gdb rely on it while unwinding the stack to generate back traces. Newer toolchains use Call Frame Information (CFI) directives to populate ELF sections with stack information for every function, but we’re not going to use them here. For more information, see the very user‑friendly DWARF specification.

On line 38, we make room on the stack for the init structure by decreasing the stack pointer. Be aware that the stack must be 16‑byte aligned on x64 before any call instruction is executed; this is mandatory, and your program may crash if you miss it. Section 3.2.2 of the System V ABI mandates that the stack be aligned properly when control is transferred to a function entry point. Effectively, this means that %rsp+8 is a multiple of 16 at that moment. The additional 8 bytes are needed because call instructions push the return address of the next instruction onto the stack; when control reaches the invoked function, it’s already aligned. Luckily, 192 is a multiple of 16, so we don’t need any adjustments.
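The alignment arithmetic can be checked with a few lines of C. This is only a sketch of the rounding rule; the helper names align16 and is_aligned16 are made up for this illustration and do not come from NGINX Unit or the ABI.

```c
#include <assert.h>
#include <stddef.h>

/* Round n up to the next multiple of 16 (n itself if already aligned). */
size_t align16(size_t n)
{
    return (n + 15u) & ~(size_t)15;
}

/* Returns 1 when n is a multiple of 16, 0 otherwise. */
int is_aligned16(size_t n)
{
    return (n & 15u) == 0;
}

/* A few cases worth remembering when sizing a stack frame by hand. */
void align16_self_check(void)
{
    assert(is_aligned16(192));      /* 192 = 12 * 16, no adjustment needed */
    assert(align16(192) == 192);
    assert(align16(12) == 16);      /* e.g. an 8-byte pointer plus a 4-byte int */
    assert(align16(17) == 32);
}
```

When a local area is not already a multiple of 16, reserving align16(size) bytes instead of size keeps %rsp correctly aligned for the next call.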

On line 42 we call our memzero function to zero the memory. Its code appears just below, but the declaration is as follows:

void memzero(void *ptr, size_t size);

Before calling memzero, on line 40 we move the address at the top of the stack (now pointing to the base of our locally allocated memory) to %rdi. Why %rdi, you may ask? Again, that’s the convention (see section 3.2.3 of the ABI). If the parameters are integers, the register‑passing sequence is %rdi, %rsi, %rdx, %rcx, %r8, and %r9.

Here’s the memzero function in full:

This code walks the memory starting at the address in %rdi and zeroes it at line 132. We initialize the counter (%rcx) with the second parameter at line 129. The loop terminates as soon as %rcx reaches zero. Note that this is a slower version of memset, spelled out here for educational purposes. A production version would do the zeroing in 8‑byte or 16‑byte chunks (using instructions such as movdqa or movups).
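For readers who think more easily in C, the two strategies can be sketched as follows. This is an illustration under assumptions, not a translation of the post's assembly: the function names are hypothetical, and the simple version presumably mirrors the described loop by clearing one byte per iteration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simple version: clear one byte per iteration until the counter hits zero. */
void memzero_bytes(void *ptr, size_t size)
{
    unsigned char *p = ptr;

    while (size != 0) {
        *p++ = 0;
        size--;
    }
}

/* Wider version: clear 8 bytes per iteration, then finish the tail bytewise. */
void memzero_words(void *ptr, size_t size)
{
    unsigned char *p = ptr;
    const uint64_t zero = 0;

    while (size >= sizeof zero) {
        /* memcpy sidesteps alignment and aliasing problems; optimizing
           compilers typically turn it into a single 8-byte store. */
        memcpy(p, &zero, sizeof zero);
        p += sizeof zero;
        size -= sizeof zero;
    }
    memzero_bytes(p, size);
}

/* Both versions must leave bytes outside the requested range untouched. */
void memzero_self_check(void)
{
    unsigned char buf[12];

    memset(buf, 0xAA, sizeof buf);
    memzero_words(buf + 1, 10);
    assert(buf[0] == 0xAA && buf[1] == 0 && buf[10] == 0 && buf[11] == 0xAA);
}
```

The wider version does the same work in roughly one eighth of the iterations, which is the idea behind using 8‑byte or 16‑byte stores in a production routine.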

The prologue and epilogue macros are defined like this:

If you remember old‑time x86 assembly programming, you will recognize these macros, as they were used by compilers for a long time. By including prologue and epilogue in every function, we help debuggers to walk the stack in case of a crash or while interactively debugging the program. Using them also improves readability because we can now refer to any local variable by its offset from the %rbp register. It also prevents bugs: if we skip prologue and epilogue but still address variables and parameters relative to %rsp, we have to update the offsets whenever we push or pop something, because %rsp moves.

The -fomit-frame-pointer option is the default on modern compilers, which means the compiler keeps track of local variables relative to %rsp. However, this mode trades readability of the assembly code for performance.

Now let’s gain some momentum and move on to the larger components:

This code sets the init.callbacks.request_handler to our request_handler function and calls the nxt_unit_init function, passing the init structure as a parameter.

Moreover, the only required member of the structure is the request_handler callback, which is called whenever a request arrives at this application.

The nxt_unit_init function on line 47 returns a context pointer or NULL in case of an error. The code that follows the call checks for the error condition and branches to handling if necessary.

The following code invokes the request processing function:

First, we save the value of ctx in %rbx. Again, why specifically %rbx? We could have saved it on the stack or in global statically allocated storage (BSS). However, this code executes within the topmost function of our program, so we can take advantage of the ABI and use %rbx: it’s a callee‑saved (non‑volatile) register that the calling convention lists as an optional base pointer, mostly for historical reasons. The i386 ABI used %ebx as the base pointer for the Global Offset Table (GOT), but x64 uses relative addressing with %rip as its base, so %rbx is free for our use. Because it is callee‑saved, invoked functions are required to preserve its value.

Then, we call nxt_unit_run to start reading input requests in an endless loop, calling our custom callback for each request. This function only returns when the NGINX Unit daemon asks the application to terminate. When this happens, we check the returned error code: if it’s not zero (NXT_UNIT_OK), the code branches to error handling again.

If the nxt_unit_run function returns successfully, we proceed to clean up by calling nxt_unit_done to release our resources and quit:

Now comes the interesting part, specifically the request_handler:

We start by reserving 16 bytes on the stack for the input request and the error code. Pointers are 8 bytes, and the error code is an integer (4 bytes), but we allocate a total of 16 bytes for alignment purposes, as explained before.

Then we proceed to call nxt_unit_response_init to initialize the response state. The function has the following signature:

It accepts the status code, the maximum number of fields to be returned, and the maximum size of a field. In this instance, our code passes the 200 OK status, a maximum of one field, and the $req_total_len variable, which is a constant derived from the content offsets in the .data section:

The variables in this snippet are calculated at assembly time.
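The same idea, lengths fixed before the program ever runs, has a direct C analogue in sizeof applied to string literals. The sketch below is hypothetical: the identifiers are invented for illustration, the body text is taken from the greeting shown later in this post, and the assumption that the total is the sum of exactly these three pieces is ours, not something the post confirms.

```c
#include <assert.h>
#include <stddef.h>

/* Data that would live in the .data section of the assembly program. */
const char field_name[]  = "Content-Type";
const char field_value[] = "text/plain";
const char body[]        = "Hello from x64 assembly";

/* sizeof counts the trailing NUL of a string literal, so subtract 1.
   All of these are compile-time constants; nothing is measured at run time. */
enum {
    FIELD_NAME_LEN  = sizeof(field_name) - 1,
    FIELD_VALUE_LEN = sizeof(field_value) - 1,
    BODY_LEN        = sizeof(body) - 1,
    TOTAL_LEN       = FIELD_NAME_LEN + FIELD_VALUE_LEN + BODY_LEN
};

/* Hypothetical helper: the size a response buffer would need for all parts. */
size_t response_total_len(void)
{
    return TOTAL_LEN;
}

void response_len_self_check(void)
{
    assert(FIELD_NAME_LEN == 12);   /* "Content-Type" */
    assert(FIELD_VALUE_LEN == 10);  /* "text/plain"   */
}
```

In assembly the equivalent trick is to subtract two labels, so the assembler works out each length from the distance between the start and the end of the data.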

After the nxt_unit_response_init call, we store the returned error code and check for errors. Then, the code proceeds to add a Content-Type field with value text/plain to the response:

Finally, we add the content body and send the response back to NGINX Unit:

The call to nxt_unit_response_send sends the headers and the content to the client, but the application can continue sending additional data by calling nxt_unit_buf_send. To finalize the response and release the allocated resources, a call to nxt_unit_request_done is required:

That’s basically it – the app is ready. We haven’t discussed some bits of code like error handling and data variables; the complete code is available in this file.

Building and Running Your App

To build and link your application, run these commands:

$ gcc -c -g hello.s -o hello.o
$ ld -o hello hello.o ./build/libunit.a -lc --dynamic-linker=/lib64/ld-linux-x86-64.so.2

When you’re ready to run your first assembly app on NGINX Unit, apply the config below (remember to update the executable path):

If the config fails to load, try changing the application type from asm to external.

Finally, navigate to http://localhost:8081/. There you should see the eagerly anticipated greeting from your app:

Hello from x64 assembly

Yay!

Conclusion

The need for highly effective web apps capable of supporting the ever‑increasing load on the global online infrastructure has never been clearer than now. Existing web applications are bloated with multiple tiers and layers of API scaffolding and obscure libraries, wasting precious disk space and CPU cycles. The issues of digital cloud pollution and big‑endian footprint from redundant code also call for immediate action from all key stakeholders. Here, we offer a recipe for the much‑needed change: a turn to the assembly language for web development should become the cornerstone of a new, better web. As a concise, surgically precise approach to software development, the assembly language was one of the drivers behind the fledgling IT industry; perhaps it can save us in these troubled times as well.

Safety warning: All sample code provided and referenced in today’s post was written by a team of highly trained monkeys. Use it at your own discretion and risk.

Retrieved by Nick Shadrin from the nginx.com website.