Implementing an AOT pipeline for FEX-Emu
Startup time and stutter are common problems in emulators that perform Dynamic Binary Translation (DBT), and FEX-Emu is no exception.
The problem is worse than in a typical console emulator: the amount of translated code is usually far larger, and a general-purpose OS such as Linux ships far more executable code than a console-optimized OS.
While optimizing the translator is important, when there are hundreds of megabytes of code to translate there is only one real solution: do the translations beforehand, also known as Ahead-Of-Time (AOT) translation.
AOT IR vs AOT OBJ
AOT IR caches our internal Intermediate Representation (IR), but still performs the final IR -> executable code generation step at runtime. Each fragment takes 0.1 ms on average to generate, with some outliers taking 100 ms or more.
The IR needs to be guest-position-independent (PIC), because different libraries are loaded at different addresses in different processes, and ASLR randomizes load addresses between runs.
AOT OBJ, which caches the generated "object code", additionally requires that code to be PIC, sharable between threads, and linked/relocated on load.
How AOT IR is implemented in FEX-Emu
FEX-Emu identifies binaries by tracking mmap system calls. It keeps a list of the files that contain executable code, and writes each file's id and full path to ~/.fex-emu/aotir/fileid.path.
AOT IR files are designed so they can be memory-mapped, and contain a sorted index that is binary-searched when the JIT translator looks up the IR for a given block. Entries include an xxh3 hash of the guest code.
For AOT generation, FEX-Emu scans ELF files using exports, debug symbols, and exception unwind tables to seed the translation list, then scans for CALL instructions to further augment it. This usually discovers around 90% of executable entrypoints.
How does it perform?
# Without AOT
$ time FEXLoader `which clang`
real 0m1.288s
# After AOT IR generation (2.5 minutes, 3.4GB on disk)
$ time FEXLoader --aotirload `which clang`
real 0m0.377s
Having a pregenerated AOT IR cache cuts clang's launch time to under a third of the original!
Further Work
The next steps are multithreading the AOT translator, reducing its memory usage, and implementing AOT OBJ.