Pishi Reloaded: Binary-only address sanitizer for macOS KEXTs.
In part 1 of my tutorial-style blog post about fuzzing, I discussed how we can instrument macOS KEXTs to collect code coverage at the basic block or edge level.
We talked about coverage-guided fuzzing, which uses code coverage as a metric to direct the fuzzer.
However, a fuzzer also requires additional feedback from the target to determine if it has discovered a vulnerability. Almost all fuzzers rely on crashes, panics, or BSODs as indicators.
Fuzzers send inputs to a program and monitor it for segmentation faults or crashes. A segmentation fault typically indicates a bug or vulnerability.
Fault detection has always been a challenge and a subject of study in fuzzing, and as lcamtuf notes, one of the most labor-intensive portions of any fuzzing project is the work needed to determine whether a particular crash poses a security risk.
In this second part of the series, I discuss memory error detectors, or sanitizers, which are tools and methods that increase the likelihood of triggering a crash when a bug or vulnerability is hit, and our approach to implementing a binary-only address sanitizer for macOS KEXTs. I also presented this project at zer0con2025.
XNU has a KASAN build in the KDK (I will describe this later), and you can even build a KASAN version of it yourself, but KEXTs are not covered. So, whenever you fuzz a KEXT, a vulnerability may go unnoticed. This is why I decided to work on this project.
It’s important to note that a panic is not the only sign of a vulnerability; for example, Mateusz Jurczyk and Gynvael Coldwind have demonstrated that memory access patterns can be used to identify double-fetch kernel vulnerabilities.
Vulnerabilities are sneaky
Imagine there is a use-after-free or a heap buffer overflow, two common memory safety vulnerabilities in C and C++, and your fuzzer has reached the vulnerable code. If nothing happens after triggering the vulnerability (e.g., no crash), the fuzzer won’t be notified and won’t save the generated input.
// Use after free
int main(int argc, char **argv) {
int *array = new int[100];
delete [] array;
return array[argc]; // UAF, but no crash.
}
// Heap buffer overflow
int main(int argc, char **argv) {
int *array = new int[100];
array[0] = 0;
int res = array[argc + 100]; // overflow, but no crash.
delete [] array;
return res;
}
A segmentation fault can happen for many reasons, such as attempting to access a nonexistent memory address or trying to access memory without the proper permissions. However, this is not the case for many triggered bugs. For example, freed memory is just another accessible chunk of the heap that returns to the heap manager. If you allocate a new heap chunk, it’s part of a larger memory block managed by the heap manager (memory managers usually allocate one or more pages and split them into chunks), so overflowing into it won’t be noticeable at the time of the write, unless another part of the code later uses the corrupted buffer.
To trigger a crash, you need to be lucky. For instance, if the index of an array is controllable by an attacker, a segmentation fault will only occur if the index is big enough: in a buffer of size 10, accessing buf[0x10000] may cause a crash, but buf[15] will not.
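To make this concrete, here is a minimal user-space sketch (the sizes are illustrative and my own): a small out-of-bounds index usually lands in still-mapped heap memory and stays silent, while a huge one is likely to hit an unmapped page.
#include <stdlib.h>
int main(void) {
    char *buf = malloc(10);
    if (!buf)
        return 1;
    char a = buf[15];      // out of bounds, but almost certainly still inside a mapped heap page: no fault
    char b = buf[0x10000]; // far outside the allocation: likely to hit an unmapped page and fault, but not guaranteed
    free(buf);
    return a + b;
}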
For the same reason, years ago I implemented my own tool to increase the likelihood of a kernel panic when a UAF is triggered in the Windows kernel heap.
A final note about crashes: not all aborts or synchronous exceptions indicate a vulnerability (sometimes the exception handling itself has a vulnerability). If you check Arm64’s Exception Syndrome Register, you’ll see many exceptions that do not point to an actual vulnerability. For example, a divide by zero or, unlike in the old days, a null pointer dereference is not necessarily a sign of a vulnerability, unless it is a side effect of some overflow. As I mentioned before, one of the most labor-intensive portions of any fuzzing project is the work needed to determine whether a particular crash poses a security risk. Microsoft even developed the !exploitable Crash Analyzer to help save time on identifying vulnerabilities.
An example in the ARM64 world is a PAC authentication failure panic, which might initially seem like an exploitable bug but could actually be a null pointer dereference. I highly recommend reading the sleh_synchronous function in XNU’s sleh.c.
Traces of an Old Time
More than 15 years ago, I started using Driver Verifier, with its special kernel pool enabled, as the first tool to detect memory corruption in the Windows kernel.
When special pool is enabled, the memory allocation function ExAllocatePoolWithTag and the corresponding free function ExFreePoolWithTag will use MmAllocateSpecialPool and MmFreeSpecialPool, respectively.
Each memory allocation requested by the driver is placed on a separate page.
The highest possible address that allows the allocation to fit on the page is returned,
so that the memory is aligned with the end of the page. The previous portion of the page is written with special patterns.
The previous page and the next page are marked inaccessible.
We will get a BSOD if the driver reads or writes the buffer after freeing it, or if the driver attempts to access memory past the end of the allocation.
The good side of special pool is that it works for closed-source kernel modules; the bad side is that it is very inefficient, because each allocation from the special pool consumes a whole page.
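The placement logic is easy to model. The following sketch is my own illustration of special-pool-style placement (not the actual Windows implementation): one allocation per page, pushed to the end of the page so that the very next byte belongs to a guard page.
#include <stdint.h>
#include <stddef.h>
#define PAGE_SIZE 0x1000u
// Hypothetical illustration: page_base points to a freshly mapped page; the pages
// immediately before and after it are assumed to be marked inaccessible by the caller.
static void *special_pool_place(uintptr_t page_base, size_t size, size_t align)
{
    uintptr_t end  = page_base + PAGE_SIZE;
    uintptr_t addr = (end - size) & ~(uintptr_t)(align - 1); // highest aligned address that still fits
    // The unused space at the start of the page would be filled with a special
    // pattern so corruption of it can be detected when the allocation is freed.
    return (void *)addr; // any access past `end` faults immediately
}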
Using virtual memory and paging is a very common method for detecting vulnerabilities in closed-source binaries.
PageHeap on Windows and libgmalloc on macOS employ the same technique to detect heap-memory-related vulnerabilities in user space. According to libgmalloc’s man page:
libgmalloc is used in place of the standard system malloc, and uses the virtual memory system to identify memory access bugs. Each malloc allocation is placed on its own virtual memory page (or pages). By default, the returned
address for the allocation is positioned such that the end of the allocated buffer is at the end of the last page, and the next page after that is kept unallocated. Thus, accesses beyond the end of the buffer cause a bad access error
immediately. When memory is freed, libgmalloc deallocates its virtual memory, so reads or writes to the freed buffer cause a bad access error. Bugs which had been difficult to isolate become immediately obvious, and you'll know
exactly which code is causing the problem.
Using virtual memory and dedicating a page to each allocation does not require instrumenting memory read or write instructions, and there is no need for the source code of the system under test.
There are many more methods to detect memory corruption; we will discuss some of them in the following sections.
Shadow memory and Valgrind
Although using pages to detect vulnerabilities is simple to implement, it cannot catch many types of vulnerabilities. Another method to detect memory vulnerabilities in closed-source binaries is to rewrite them in order to instrument every read or write instruction and verify it against shadow memory.
Shadow memory is a technique where a parallel memory space is used to track additional information about a program’s memory, such as initialization or access status. It helps detect issues like memory corruption and uninitialized memory access. There is a one-to-one relationship between the shadow memory and the main application memory.
Valgrind is one of the first and most well-known implementations of this method. Valgrind first translates the program into a temporary intermediate representation (IR). After the conversion, it can instrument or perform whatever transformations it wants on the IR, such as inserting extra instrumentation code around almost all memory read or write instructions. Then, at each instrumentation point, it checks whether the memory access is safe or poisoned with the help of the metadata stored in the shadow memory.
Valgrind also implements its own free and alloc functions to update the metadata in the shadow memory.
Shadow memory with exception handling.
The BoKASAN paper offers another binary-level method to validate memory accesses. It uses shadow memory, similar to Valgrind, but instead of rewriting the kernel module binary, it hooks into the kernel’s page fault and debug exception handlers to monitor memory access. When a sanitized memory region is accessed, a page fault occurs due to the unset present bit, triggering the page fault handler. The handler checks the memory address’s validity via shadow memory and raises a kernel panic for invalid accesses. If the access is valid, the instruction is single-stepped with the present bit temporarily set. After execution, the present bit is cleared to trap future accesses. BoKASAN applies selective sanitization by registering target processes and checking for sanitization during memory allocation: registered processes undergo sanitization, while others receive the default allocation. BoKASAN hooks the memory allocation functions (e.g., kmalloc) to allocate the object, create a red zone, and then initialize the shadow memory.
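The fault-handling flow can be sketched roughly as follows (this is my reading of the paper; every helper below is a hypothetical stub, not BoKASAN’s actual code):
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
// Stubs standing in for real page-table and shadow-memory manipulation.
static bool shadow_says_valid(uintptr_t addr, size_t size) { (void)addr; (void)size; return true; }
static void kernel_panic(const char *msg) { (void)msg; }
static void set_present_bit(uintptr_t addr)   { (void)addr; }
static void clear_present_bit(uintptr_t addr) { (void)addr; }
static void single_step_faulting_instruction(void) { }
// Sketch of the hooked page-fault path: sanitized objects live on pages whose
// present bit is deliberately cleared, so every access traps here first.
void bokasan_page_fault(uintptr_t fault_addr, size_t access_size)
{
    if (!shadow_says_valid(fault_addr, access_size)) {
        kernel_panic("BoKASAN: invalid access"); // OOB or use-after-free
        return;
    }
    set_present_bit(fault_addr);         // let the access complete
    single_step_faulting_instruction();  // debug exception fires after the instruction
    clear_present_bit(fault_addr);       // re-arm the trap for future accesses
}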
Using an exception handler for every memory access would consume a significant amount of CPU time, making it neither efficient nor a good approach.
LLVM and AddressSanitizer
When we have access to the source code, we can use AddressSanitizer (ASAN), a widely used memory vulnerability detector that combines shadow memory with source code-level instrumentation. ASAN leverages LLVM to instrument every memory read and write operation and introduces its own runtime library for memory allocation and deallocation (such as malloc/free). At these instrumented points, ASAN checks memory access by comparing it to the corresponding shadow memory.
The run-time library replaces malloc, free, and related functions. Whenever memory is allocated, additional memory is reserved beyond the requested block as poisoned redzones around allocated heap regions (meaning the metadata for those addresses in shadow memory is marked inaccessible), and the reuse of freed heap regions is delayed. If a memory read or write overflows into an adjacent redzone, ASAN detects the violation at the time of the read or write, after verifying the metadata, and reports it as a bug.
Our target is the kernel, so let’s talk about all three KASAN modes in Linux kernel and how the macOS kernel also implements its own version of it.
Generic or Software KASAN
As described on the KASAN page of the Linux kernel, Software KASAN modes use shadow memory to record whether each byte of memory is safe to access. It also uses compile-time instrumentation to insert shadow memory checks before each memory access.
Generic KASAN dedicates 1/8th of kernel memory to its shadow memory (16TB to cover 128TB on x86_64) and uses direct mapping with a scale and offset to translate a memory address to its corresponding shadow address.
For each memory read or write, the compiler inserts a call to one of the following functions to check the validity of the accessed memory; depending on the access size, N will be 1, 2, 4, 8, or 16.
void __asan_storeN(unsigned long addr, size_t size)
{
check_memory_region(addr, size, true);
}
It checks the shadow memory to determine whether the destination memory is poisoned.
static __always_inline void check_memory_region(unsigned long addr,
size_t size, bool write)
{
if (unlikely(size == 0))
return;
if (unlikely((void *)addr <
kasan_shadow_to_mem((void *)KASAN_SHADOW_START))) {
kasan_report(addr, size, write, _RET_IP_);
return;
}
if (likely(!memory_is_poisoned(addr, size)))
return;
kasan_report(addr, size, write, _RET_IP_);
}
The memory_is_poisoned_1 function uses kasan_mem_to_shadow to locate the corresponding byte in the shadow memory. It then checks the metadata to determine whether the memory is addressable. If the metadata indicates that the memory is addressable, execution continues normally. However, if the memory is not addressable, the code is accessing memory it should not, indicating a potential memory corruption or violation.
static __always_inline bool memory_is_poisoned_1(unsigned long addr)
{
s8 shadow_value = *(s8 *)kasan_mem_to_shadow((void *)addr);
if (unlikely(shadow_value)) {
s8 last_accessible_byte = addr & KASAN_SHADOW_MASK;
return unlikely(last_accessible_byte >= shadow_value);
}
return false;
}
static inline void *kasan_mem_to_shadow(const void *addr)
{
return (void *)((unsigned long)addr >> KASAN_SHADOW_SCALE_SHIFT)
+ KASAN_SHADOW_OFFSET;
}
As I mentioned, each byte in the shadow memory represents 8 bytes in the main memory. It can precisely indicate how many bytes are addressable, or if the memory is within a redzone.
If the shadow byte is 00, it means all 8 bytes are accessible. If the shadow byte is 01, 02, 03, 04, 05, 06, or 07, it means that the corresponding number of bytes (01, 02, 03, 04, 05, 06, or 07) are accessible.
Shadow byte legend (one shadow byte represents 8 application bytes):
Addressable: 00
Partially addressable: 01 02 03 04 05 06 07
Heap left redzone: fa
Freed heap region: fd
Stack left redzone: f1
Stack mid redzone: f2
Stack right redzone: f3
Stack after return: f5
Stack use after scope: f8
Global redzone: f9
Global init order: f6
Poisoned by user: f7
Container overflow: fc
Array cookie: ac
Intra object redzone: bb
ASan internal: fe
Left alloca redzone: ca
Right alloca redzone: cb
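To make the legend concrete, here is a small worked example of the byte-granularity check, mirroring the memory_is_poisoned_1 logic shown above (the object size and offsets are made up):
#include <stdint.h>
#include <stdbool.h>
// For a 5-byte object, its granule's shadow byte is 0x05: only the first
// 5 of the 8 bytes are addressable.
static bool is_poisoned(uintptr_t addr, int8_t shadow_value)
{
    if (shadow_value == 0)               // 00: the whole granule is addressable
        return false;
    int8_t offset_in_granule = addr & 7; // KASAN_SHADOW_MASK
    return offset_in_granule >= shadow_value;
}
// With shadow byte 0x05:
//   access at offset 4 -> 4 >= 5 is false -> allowed
//   access at offset 6 -> 6 >= 5 is true  -> poisoned, reported as a bug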
Let’s see a real output of disassembled memory access in AddressSanitizer.
int a = 0;
int* b = &a;
*b = 0x41;
If we instrument the above code with AddressSanitizer, it will be translated into the following code. Note that all the functions above are inlined for performance reasons.
local_38 = puVar2;
bVar3 = *(byte *)(((ulong)puVar2 >> 3) + 0x1000000000);
local_94 = (uint)bVar3;
if ((local_94 != 0) &&
(iVar1 = (int)(char)(((byte)puVar2 & 7) + 3), iVar4 = (char)bVar3 - iVar1,
iVar4 == 0 || (char)bVar3 < iVar1)) {
__asan_report_store4(iVar4,puVar2);
}
*(undefined4 *)puVar2 = 0x41;
As I mentioned before, allocation functions update the shadow memory, and when you free memory, it also updates the shadow memory to indicate that the memory is free.
The software-based AddressSanitizer (ASan) verifies memory at byte granularity. The compiler also inserts redzones around stack and global variables to be able to detect violations around them. For more detailed information, you can refer to the AddressSanitizerAlgorithm page and the original ASAN paper.
ASAN is very fast. The average slowdown of the instrumented program is ~2x, and it can detect almost all memory-related violations:
- Use after free (dangling pointer dereference)
- Heap buffer overflow
- Stack buffer overflow
- Global buffer overflow
- Use after return
- Use after scope
- Initialization order bugs
- Memory leaks
One known issue is that it cannot detect certain bugs. For instance, a memory access might land in the second array, which has a valid addressable flag in the shadow memory. This is an issue with shadow memory but not with tagging, which I will describe in the next section.
char *first = new char[100];
char *second = new char[1000];
first[500] = 0; // overflow past first that may land inside second, so ASAN sees an addressable shadow byte
Software Tag-Based KASAN
64-bit ARM processors only use part of the 64-bit address space for virtual addresses. All Armv8-A implementations support 48-bit virtual addresses; the actual number of virtual address bits depends on the memory addressing scheme configured in TCR_EL1 and can be expressed as 64 - TCR_EL1.TnSZ. For example, if TCR_EL1.T1SZ is set to 32, the size of the kernel region in the EL1 virtual address space is 2^32 bytes (0xFFFF_FFFF_0000_0000 to 0xFFFF_FFFF_FFFF_FFFF). Any access to an address outside of the configured ranges will generate an exception or translation fault.
TnSZ and TGn, i.e., the granule size for the TTBRn region, are two of the most important parts of the memory schema of VMSAv8-64.
I would like to mention that XNU also uses T1SZ, along with E0PD (check locore.s), as part of its Meltdown/Spectre mitigations (search for __ARM_KERNEL_PROTECT__ in XNU to see how other parts of the kernel have been modified to protect against potential architectural or microarchitectural vulnerabilities).
The Translation Control Register is part of the virtual memory control register functional group that controls the configuration of memory translation tables and address translation on Arm64.
/*
* Translation Control Register (TCR)
* Current (with 16KB granule support):
*
* 63 39 38 37 36 34 32 30 29 28 27 26 25 24 23 22 21 16 14 13 12 11 10 9 8 7 5 0
* +------+----+----+--+-+-----+-----+-----+-----+-----+----+--+------+-----+-----+-----+-----+----+-+----+
* | zero |TBI1|TBI0|AS|z| IPS | TG1 | SH1 |ORGN1|IRGN1|EPD1|A1| T1SZ | TG0 | SH0 |ORGN0|IRGN0|EPD0|z|T0SZ|
* +------+----+----+--+-+-----+-----+-----+-----+-----+----+--+------+-----+-----+-----+-----+----+-+----+
*
* TBI1: Top Byte Ignored for TTBR1 region
* TBI0: Top Byte Ignored for TTBR0 region
* T0SZ: Virtual address size for TTBR0
* T1SZ: Virtual address size for TTBR1
*/
From the list of bits in the TCR, two are important for us:
- TBI1: Top Byte Ignored for the TTBR1 region
- TBI0: Top Byte Ignored for the TTBR0 region
The Top Byte Ignore bit configures the CPU to ignore the top byte of virtual addresses.
This one byte is used to tag the address; it is exactly what Arm’s Memory Tagging Extension uses, except that MTE uses dedicated hardware storage for the tags.
The size of the PAC also depends on whether pointer tagging is enabled, since the PAC is stored in the remaining unused bits of the pointer. SiPearl’s white paper about CFI on Arm64 explains very clearly how virtual addresses work with respect to TBI and PAC. (image by SiPearl’s White Paper)
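To see what the tag actually looks like inside a pointer, here is a small arithmetic sketch (the address and tag values are made up): with TBI enabled, the MMU ignores the top byte during translation, so the tagged and untagged pointers refer to the same memory.
#include <stdint.h>
#include <stdio.h>
int main(void) {
    uint64_t raw = 0xfffffe2345678940ULL; // illustrative kernel-style virtual address
    uint8_t  tag = 0x3a;                  // random tag chosen at allocation time
    // Embed the tag in the top byte; with TBI enabled the hardware ignores it.
    uint64_t tagged = (raw & 0x00ffffffffffffffULL) | ((uint64_t)tag << 56);
    // Recover the tag later, e.g., at an instrumented load/store.
    uint8_t recovered = (uint8_t)(tagged >> 56);
    printf("tagged=%llx tag=%x\n", (unsigned long long)tagged, recovered);
    return 0;
}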
Software Tag-Based KASAN utilizes the Top Byte Ignore (TBI) feature of arm64 CPUs to store a pointer tag in the top byte of kernel pointers. Additionally, it uses shadow memory to associate memory tags with each 16-byte memory cell, reserving 1/16th of the kernel memory for this purpose.
When memory is allocated, Software Tag-Based KASAN generates a random tag, assigns it to the allocated memory, and embeds the same tag in the returned pointer.
The kernel build system like previous method employs LLVM’s compile-time instrumentation to insert checks before every memory access. These checks ensure that the tag of the accessed memory matches the tag of the pointer being used. If there is a tag mismatch, Software Tag-Based KASAN reports a bug.
Software Tag-Based KASAN uses 0xFF as a match-all pointer tag (accesses through pointers with the 0xFF pointer tag are not checked).
XNU implements both Classic KASAN and Software Tag-Based KASAN; the tag-based variant lives in kasan-tbi.c. A comment in kasan.c elegantly describes how the tag-based sanitizer is implemented:
For each 16-byte granule in the address space, one byte is reserved in the
shadow table. TBI (Top Byte Ignore) is used to associate a tag to each
VA pointer and to match it with the shadow table backing storage. This
mode of operation is similar to hardware memory tagging solutions (e.g. MTE)
and is not available on x86-64. Cost: ~8% of memory. No need for redzones
or quarantines. See kasan-tbi.c for details.
Using a software-based tag consumes less memory because it doesn’t rely on redzones or quarantines to detect violations. However, it has a 16-byte granularity, meaning it can only detect overflows that spill past the 16-byte granule, unless partial addressing is supported (described in the final section).
uint8_t *allocated_mem = ( uint8_t *)IOMalloc(4);
allocated_mem[10]; // this is an overflow but tag based KASAN can't detect it.
The arm_init function of XNU executes once on the boot CPU upon entry from iBoot. It then calls arm_setup_pre_sign, which subsequently invokes arm_set_kernel_tbi.
/*
* Routine: arm_setup_pre_sign
* Function: Perform HW initialization that must happen ahead of the first PAC sign
* operation.
*/
static void
arm_setup_pre_sign(void)
{
#if __arm64__
/* DATA TBI, if enabled, affects the number of VA bits that contain the signature */
arm_set_kernel_tbi();
#endif /* __arm64 */
}
If you build a KASAN-enabled version of XNU for an ARM SoC, the CONFIG_KERNEL_TBI preprocessor macro will be defined. In this case, the arm_set_kernel_tbi function configures TBI (Top Byte Ignore).
/*
* TBI (top-byte ignore) is an ARMv8 feature for ignoring the top 8 bits of
* address accesses. It can be enabled separately for TTBR0 (user) and
* TTBR1 (kernel).
*/
void
arm_set_kernel_tbi(void)
{
#if !__ARM_KERNEL_PROTECT__ && CONFIG_KERNEL_TBI
uint64_t old_tcr, new_tcr;
old_tcr = new_tcr = get_tcr();
/*
* For kernel configurations that require TBI support on
* PAC systems, we enable DATA TBI only.
*/
new_tcr |= TCR_TBI1_TOPBYTE_IGNORED;
new_tcr |= TCR_TBID1_ENABLE;
if (old_tcr != new_tcr) {
set_tcr(new_tcr);
sysreg_restore.tcr_el1 = new_tcr;
}
#endif /* !__ARM_KERNEL_PROTECT__ && CONFIG_KERNEL_TBI */
}
With the CONFIG_KERNEL_TBI preprocessor macro defined, memory allocation functions like zalloc and kalloc (including kalloc_type and kalloc_data) also tag the pointer upon allocation and update the shadow table with the proper flags in the vm_memtag_set_tag function.
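Conceptually, what such an allocation path does can be sketched like this (hypothetical helper names and stub bodies, not XNU’s actual code):
#include <stdint.h>
#include <stddef.h>
#define KASAN_GRANULE 16
// Stubs for illustration only; XNU does the real work in vm_memtag_assign_tag,
// vm_memtag_set_tag and the kalloc/zalloc paths.
static uint8_t  pick_random_tag(void)           { return 0xF3; /* e.g. random value | 0xF0 */ }
static uint8_t *shadow_byte_for(uintptr_t addr) { static uint8_t s; (void)addr; return &s; }
static void *tag_allocation(void *ptr, size_t size)
{
    uint8_t tag = pick_random_tag();
    // 1. Write the tag into the shadow table, one byte per 16-byte granule.
    for (size_t off = 0; off < size; off += KASAN_GRANULE)
        *shadow_byte_for((uintptr_t)ptr + off) = tag;
    // 2. Return the pointer with the same tag embedded in its top byte.
    return (void *)(((uintptr_t)ptr & 0x00ffffffffffffffULL) | ((uintptr_t)tag << 56));
}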
XNU sets up the shadow memory in arm_vm_init for KASAN builds, which ends up calling kasan_arm64_pte_map:
#if KASAN
/* record the extent of the physmap */
physmap_vbase = physmap_base;
physmap_vtop = physmap_end;
kasan_init();
#endif /* KASAN */
kasan_arm_pte_map() is the heart of the arch-specific handling of the shadow table. It walks the existing page tables that map shadow ranges and allocates/creates valid entries as required.
At this point, everything required for a working software tag-based KASAN has been described: pointer tagging while allocating memory, shadow memory mapping in the page tables at boot time, and read/write instrumentation to check the tag against the shadow memory, which happens at build time with the help of LLVM, and only for XNU itself.
Based on this knowledge, I aimed to implement all these ideas (e.g., enabling TBI, allocating shadow memory, and tagging pointers during allocations by hooking the memory allocation functions) in my KEXT.
However, I remembered that we have the KASAN build of the kernel in the KDK, so why not use it?
I booted the KASAN kernel with my KEXT, allocated some memory using IOMalloc and IOMallocData, and applied the same logic to locate the shadow memory and verify the tag. Unsurprisingly, it worked as expected.
void check_kasan(uint8_t *address)
{
uint8_t pointer_tag = (uint8_t)((uintptr_t)address >> 56);
uint8_t memory_tag = *(uint8_t *)(((((uintptr_t)address) | 0xFF00000000000000ULL) >> 4) + 0xF000000000000000ULL);
if ((pointer_tag != memory_tag) && (pointer_tag != 0xFF)) {
printf("[Pishi] Overflow panic\n");
// asm ("brk 0x0");
}
}
void test_heap_overflow()
{
uint8_t *allocated_mem = ( uint8_t *)IOMalloc(16);//IOMallocData(16);
if (!allocated_mem)
return ;
check_kasan(allocated_mem); // tag matches
check_kasan(allocated_mem + 17); // detect overflow
IOFree(allocated_mem, 16);
check_kasan(allocated_mem); // detect UAF
return;
}
Shadow for address
I’d like to have a brief discussion about shadow mapping in XNU and LLVM to better understand our check_kasan function and to learn how LLVM handles the instrumentation.
We know the pointer tag is in the top byte of the pointer, so we can extract it by shifting the address right by 56 bits.
uint8_t pointer_tag = (uint8_t)((uintptr_t)address >> 56);
But what are the following numbers, and why are we shifting and performing a bitwise OR?
uint8_t memory_tag = *(uint8_t *)(((((uintptr_t)address) | 0xFF00000000000000ULL) >> 4) + 0xF000000000000000ULL);
In XNU, the SHADOW_FOR_ADDRESS macro is used to map an address to shadow memory; it expands to the following preprocessor directives.
#define VM_KERNEL_STRIP_PAC(_v) (ptrauth_strip((void *)(uintptr_t)(_v), ptrauth_key_asia))
#define VM_KERNEL_STRIP_TAG(_v) (vm_memtag_canonicalize_address((vm_offset_t)_v))
#define VM_KERNEL_STRIP_PTR(_va) ((VM_KERNEL_STRIP_TAG(VM_KERNEL_STRIP_PAC((_va)))))
#define VM_KERNEL_STRIP_UPTR(_va) ((vm_address_t)VM_KERNEL_STRIP_PTR((uintptr_t)(_va)))
#define KASAN_STRIP_ADDR(_x) (VM_KERNEL_STRIP_UPTR(_x))
#define SHADOW_FOR_ADDRESS(x) (uint8_t *)(((KASAN_STRIP_ADDR(x)) >> KASAN_SCALE) + KASAN_OFFSET)
KASAN_SCALE and KASAN_OFFSET are defined at build time.
As you can see in the disassembled code of kasan_tbi_get_tag_address, it untags the pointer, strips the PAC, and then performs the mapping to shadow memory. VM_KERNEL_STRIP_PAC emits an xpaci instruction, which strips the Pointer Authentication Code.
uint8_t *
kasan_tbi_get_tag_address(vm_offset_t address)
{
return SHADOW_FOR_ADDRESS(address);
}
kasan_tbi_get_tag_address(vm_offset_t address)
e00099f0904 7f 23 03 d5 pacibsp
e00099f0908 fd 7b bf a9 stp x29,x30,[sp, #local_10]!
e00099f090c fd 03 00 91 mov x29,sp
e00099f0910 e0 43 c1 da xpaci address
e00099f0914 e1 01 80 52 mov w1,#0xf
e00099f0918 b0 ff ff 97 bl vm_memtag_add_ptr_tag vm_offset_t vm_memtag_add_ptr_ta
e00099f091c 08 00 fe d2 mov x8,#-0x1000000000000000
e00099f0920 08 fc 44 b3 bfxil x8,address,#0x4,#0x3c
e00099f0924 e0 03 08 aa mov address,x8
e00099f0928 fd 7b c1 a8 ldp x29=>local_10,x30,[sp], #0x10
e00099f092c ff 0f 5f d6 retab
LLVM also uses the same logic (except for stripping the PAC) in HWAddressSanitizer.cpp to map a pointer to shadow memory.
Shadow = (Mem >> scale) + offset
The scale is set to the default value of 0x4 (the same as in XNU), which effectively divides the address by 16, grouping addresses into 16-byte chunks; as I mentioned before, one byte is reserved in the shadow table for each 16-byte granule of the address space.
The offset is provided to LLVM by XNU through an argument:
-mllvm -hwasan-mapping-offset=$(KASAN_OFFSET)
The following LLVM code generates the untagging instructions and maps memory addresses to shadow at the instrumented points.
void HWAddressSanitizer::ShadowMapping::init(Triple &TargetTriple,
bool InstrumentWithCalls) {
Scale = kDefaultShadowScale; // Initialization of Scale
// ... rest of the initialization logic
}
void HWAddressSanitizer::untagPointerOperand(Instruction *I, Value *Addr) {
if (TargetTriple.isAArch64() || TargetTriple.getArch() == Triple::x86_64 ||
TargetTriple.isRISCV64())
return;
IRBuilder<> IRB(I);
Value *AddrLong = IRB.CreatePointerCast(Addr, IntptrTy);
Value *UntaggedPtr =
IRB.CreateIntToPtr(untagPointer(IRB, AddrLong), Addr->getType());
I->setOperand(getPointerOperandIndex(I), UntaggedPtr);
}
Value *HWAddressSanitizer::memToShadow(Value *Mem, IRBuilder<> &IRB) {
// Mem >> Scale
Value *Shadow = IRB.CreateLShr(Mem, Mapping.Scale);
if (Mapping.Offset == 0)
return IRB.CreateIntToPtr(Shadow, PtrTy);
// (Mem >> Scale) + Offset
return IRB.CreatePtrAdd(ShadowBase, Shadow);
}
If we don’t untag the address, the shift operation will produce an incorrect address instead of pointing into the shadow memory. At this point we know why we did a bitwise OR followed by a right shift of 4, then an addition.
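Putting numbers on it (the shift and the 0xF000000000000000 offset are the same constants used in check_kasan above; the pointer value itself is made up):
#include <stdint.h>
#include <stdio.h>
int main(void) {
    uintptr_t tagged = 0x3affffe234567890ULL; // made-up tagged kernel pointer, tag = 0x3a
    // Step 1: drop the tag by forcing the top byte back to the canonical 0xFF.
    uintptr_t canonical = tagged | 0xFF00000000000000ULL;
    // Step 2: scale by 16 (>> 4) and add the shadow offset.
    uintptr_t shadow = (canonical >> 4) + 0xF000000000000000ULL;
    printf("canonical=%llx shadow=%llx\n",
           (unsigned long long)canonical, (unsigned long long)shadow);
    return 0;
}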
Binary instrumentation
The KASAN build of the kernel maps shadow memory, enables TBI, and tags all allocated memory. However, because KEXTs are not instrumented, even though they use tagged pointers, there is no verification during read or write accesses. At this point, we just need to instrument every memory read and write operation, similar to what we did in Pishi, which brings its own set of challenges, as we will discuss in the subsequent sections.
Memory allocation in KEXT
Curious readers might wonder whether KEXTs use the same memory allocation mechanisms as the kernel. If they don’t, they might be using untagged pointers. We used IOMalloc and IOMallocData in our test, but do all KEXTs use memory allocators that tag pointers?
XNU provides two main ways to allocate memory, and every memory allocation function boils down to one of them:
- kmem_alloc and similar interfaces, for allocations at the granularity of pages.
- <kern/zalloc.h>, the zone allocator subsystem, which is a slab allocator of objects of fixed size.
IOKit also uses these memory allocators, and these APIs tag the pointers.
Let’s see how kalloc tags pointers, as it’s the main memory allocator in IOKit.
static inline struct kalloc_result
kalloc_zone(
zone_t z,
zone_stats_t zstats,
zalloc_flags_t flags,
vm_size_t req_size)
{
...
kalloc_mark_unused_space(kr.addr, esize, kr.size);
...
}
kalloc_mark_unused_space(void *addr, vm_size_t size, vm_size_t used)
{
kasan_tbi_retag_unused_space((vm_offset_t)addr, size, used ? :1);
}
void
kasan_tbi_retag_unused_space(vm_offset_t addr, vm_size_t size, vm_size_t used)
{
used = kasan_granule_round(used);
if (used < size) {
vm_offset_t unused_tag_addr = vm_memtag_assign_tag(addr + used, size - used); // assign tag to the pointer,
vm_memtag_set_tag(unused_tag_addr, size - used); // update the shadow memory
}
}
vm_memtag_assign_tag assigns a random tag to the pointer, and vm_memtag_set_tag updates the shadow memory for the allocated size.
This means everything is provided; we just need to instrument read and write instructions as described above.
For the sake of curiosity, let’s briefly discuss the new and delete C++ keywords in IOKit.
As documented in XNU’s api-basics.md, most, if not all, C++ objects used in conjunction with IOKit APIs should probably use OSObject as a base class. All subclasses of OSObject must declare and define one of IOKit’s OSDeclare* and OSDefine* macros. As part of those, an operator new and an operator delete are injected that force objects to enroll into kalloc_type.
#define __IODefineTypedOperators(type) \
void *type::operator new(size_t size __unused) \
{ \
return IOMallocType(type); \
} \
void type::operator delete(void *mem, size_t size __unused) \
{ \
IOFreeType(mem, type); \
}
As you can see, the new operator also uses IOMallocType under the hood, which is part of the new typed allocators API, kalloc_type.
For more information and a better understanding of XNU and IOKit memory allocation, read the following links.
I was also planning to instrument global and stack variables. As a test, I triggered a global memory overflow in XNU using its sysctl interface, but nothing happened. After checking the function in the kernel binary with Ghidra, I found that global memory is not instrumented.
#define STATIC_ARRAY_SZ 66
unsigned long static_array[STATIC_ARRAY_SZ];
static int
test_global_overflow(struct kasan_test __unused *t)
{
int i;
/* rookie error */
for (i = 0; i <= STATIC_ARRAY_SZ; i++) {
static_array[i] = i;
}
return 0;
}
A closer look at the XNU build system shows that the software tag-based KASAN in XNU does not instrument global memory.
The build system uses the -fsanitize=kernel-hwaddress
option, which does not instrument the stack or alloca calls by default, though these can be enabled with specific flags.
Stack instrumentation is enabled for all targets except WatchOS using:
-mllvm -hwasan-instrument-stack=$(HWASAN_INSTRUMENT_STACK).
However, no option is provided to enable instrumentation for global memory.
As I have mentioned before, the magic happens in HWAddressSanitizer.
I will not instrument global or stack memory; I will focus on instrumenting only dynamic memory allocations, as they cover most memory vulnerabilities. A non-linear stack overflow is very rare, and a linear stack overflow is not exploitable due to PAC.
I’ve shared some resources discussing the instrumentation of global variables and the stack in relation to MTE. Although these resources focus on MTE, the concepts are the same since both approaches use TBI for tagging. The key difference is that with MTE, comparisons happen automatically in the SoC, and MTE doesn’t require shadow memory.
- Address sanitizer global variable instrumentation
- ARM: Tagging global variables
- 2020 LLVM Developers’ Meeting: E. Stepanov “Memory tagging in LLVM and Android”
- Memory Tagging, how it improves C++ memory safety, and what does it mean for compiler optimizations - Kostya Serebryany
- Memory Tagging for the kernel — Tag-Based KASAN | Android Security Symposium 2020
- Akademy 2021 - ARM Memory Tagging Extension Fighting Memory Unsafety with Hardware
- Memory Tagging: A Memory Efficient Design
- ARM: Detecting memory safety violations
- Stack instrumentation with ARM Memory Tagging Extension
- LLVM: MemTagSanitizer
Instrumenting load and store instructions
Before we dive into the instrumentation details of load and store instructions, let me briefly categorize and discuss binary rewriting methods, as I promised to do in this second part.
- Trampolines: at each point where instrumentation is needed, the code is modified to jump to the instrumentation; at the end of the instrumentation, the code jumps back to where it left off. This method is simple and easy to implement because it keeps the original code layout the same, and no internal references or branches are broken. As you remember, we used this method in Pishi.
- Direct: in this method, code is either rewritten somewhere else from scratch, like a JIT, or shifted to create space for the instrumentation. All references and branches need to be carefully updated to ensure everything works correctly. TinyInst is a very good example.
- Lifting: this approach converts the binary code into a simpler intermediate representation (IR), similar to the one used in compilers like LLVM. The idea is that it’s easier to add instrumentation to this simpler form. After adding the instrumentation, the IR is converted back into binary code to create the final instrumented executable. QEMU’s TCG and Valgrind are two well-known examples.
- Symbolization: this method turns the binary into an assembly listing and then applies instrumentation to that listing.
I have to emphasize that sometimes these methods overlap, making it difficult to place an implementation into a single category.
To verify load and store instructions with the help of shadow memory and pointer tags, LLVM instruments the source code at the IR level, which abstracts read and write operations in a much simpler way than machine instructions; the IR is aware of the source and destination of every read and write operation.
LLVM finds interesting instructions before lowering them to Arm64, but we have to deal with Arm64 directly, which has a large number of instructions that read from or write to memory.
void HWAddressSanitizer::getInterestingMemoryOperands
{
...
if (LoadInst *LI = dyn_cast<LoadInst>(I))
...
else if (StoreInst *SI = dyn_cast<StoreInst>(I))
...
else if (AtomicRMWInst *RMW = dyn_cast<AtomicRMWInst>(I))
...
else if (AtomicCmpXchgInst *XCHG = dyn_cast<AtomicCmpXchgInst>(I))
...
}
As I mentioned, Arm64 has lots of load and store instructions:
Instruction = ["ldnp", "ldxrh", "ldarh", "stlr", "stlrb",
"ldur", "ldaxrh", "stur", "stnp", "prfm",
"ldrb", "stxp", "ldar", "ldpsw", "ldursb",
"ldaxrb", "stlxp", "prfum", "stlxr", "ldxrb",
"strh", "ldurb", "stxrh", "ldaxr", "str",
"ldrsh", "ldxr", "ldp", "ldxp", "ldursw",
"sturb", "stlxrh", "stxrb", "ldrsb", "sturh",
"ldr", "stp", "ldaxp", "ldarb", "stlxrb", "ldrh",
"ldursh", "ldurh", "strb", "stlrh", "stxr", "ldrsw",]
Like LLVM, we need to instrument load and store instructions to detect memory violations. Our trampolines should calculate the effective address of each instruction, based on the ARM64 addressing modes, and pass it to the check_kasan function, which uses the shadow memory and the pointer’s tag to detect mismatches, as we discussed.
Let’s explore the complexities and constraints involved in generating instructions that calculate effective addresses for the various addressing modes, and how I managed to overcome them.
Some instructions use an immediate value larger than what can be encoded directly in an add instruction. For example, add x0, x0, #0x1120 is not a valid Arm64 instruction, so we can’t use it to calculate the effective address of str x1, [x0, #0x1120]. Some immediate values are negative (e.g., ldur x10, [x8, #-0x18]), but the ADD (immediate) instruction in ARM64 does not support negative immediates.
So we can’t rely on a single instruction to compute the effective address.
| Instrumented instruction | Addressing Mode | Effective Address |
|---|---|---|
| ldr w0, [x2] | Base register only | Address in register x2 |
| ldr w0, [x2, #-0x100] | Base register + offset | Address in x2 - 0x100 |
| ldr w0, [x1, x2] | Base register + register | Address in x1 + x2 |
| ldr x0, [x1, x2, lsl #3] | Base + scaled register offset | Address in x1 + (x2 << 3) |
| ldr w0, [x2, #0x28]! | Pre-indexing | Address in x2 + 0x28 |
| ldr w0, [x2], #0x100 | Post-indexing | Address in x2 |
| LDR Wt, label | Literal addressing | Address of label |
However, by combining two instructions, such as mov and add, we can reliably compute the effective address for pre-indexing and base register + offset.
For example, consider the instruction:
ldr w0, [x2, #-0x100]
We can calculate the target address as follows:
mov x1, #-0x100
add x0, x1, x2
x0 now holds the effective address of the previous instruction. The decision to use x1 or x0 in the mov instruction is made during instrumentation: if the target instruction contains x0, then we use x1.
For other addressing modes, a single add instruction is sufficient. For instance:
ldrsw x16, [x17, x16, LSL #0x2]
---> add x0, x17, x16, LSL #0x2
and
ldr w0, [x1, x2]
---> add x0, x1, x2
By combining all the addressing models, we obtain the following:
+------ Post-indexing ---------+ +--- Base register only ---------+
| 1. ldr w0, [x1], #1 | | 2. ldr w0, [x1] |
+------------------------------+ +--------------------------------+
| |
v v
+---------------------------+
| nop |
| mov x0, R |
+---------------------------+
+-- Base register + reg offset --+ +--- Base + scaled register off ----+
| 3. ldr w0, [x0, x1] | | 4. add x0, x1, x2, lsl #3 |
+--------------------------------+ +------------------------------------+
| |
v v
+-------------------------+
| nop |
| add x0, R |
+-------------------------+
+------- Pre-indexing ------+ +-- Base register + offset ---+
| 5. ldr w0, [x1, 1]! | | 6. ldr w0, [x1, 1] |
+---------------------------+ +-----------------------------+
| |
v v
+-------------------------+
| mov R, R1 |
| add x0, R |
+-------------------------+
Here are a few other key points I’d like to highlight.
- The LDR Wd, =label_expr and LDR Wt, label forms become PC-relative instructions, e.g., ldr x0, 0x100003f98 <helloworld>, or sometimes just a mov for the pseudo-instruction. Due to the PC-relative addressing we can’t easily copy them somewhere else, but we can ignore them because we don’t instrument global variables.
We don’t instrument PC-relative addressing. I also have to mention that sometimes global variables are accessed through non-relative instructions, so we can’t differentiate them from heap memory and will instrument them. For example, the following code leads to the assembly below it.
struct Address {
char city[16];
};
struct Person {
char name[0x10];
struct Address* address; // Pointer to the second structure
};
// Declare global variables
struct Address globalAddress = { "bbbbb" };
struct Person globalPerson = { "aaaaa", &globalAddress };
void test_heap_overflow()
{
char* mem = globalPerson.address->city;
check_kasan_test((uint8_t*) mem + 0x19);
}
adrp x8,0x378000
add x8,x8,#0x100
ldr x8,[x8, #0x10] <<<-- globalPerson.address->city;
str x8,[sp, #0x8]
ldr x8,[sp, #0x8]
add x0,x8,#0x19
bl _check_kasan_test
But we don’t need to worry about this, because the pointer to global memory is tagged with KASAN_TBI_DEFAULT_TAG, so at the time of the check we will not get a kernel panic.
- stp and ldp instructions perform write and read operations on pairs of registers, respectively.
- The effective address for loading and storing floating-point registers is also calculated using general-purpose registers.
- Ensuring proper handling of post-indexing: for instance, in ldr w0, [x2], #0x100, the effective address is simply x2.
- Ghidra decodes some instructions in the following forms:
[a, b, UXTX #0]
[a, b, UXTX]
[a, b, LSL #0]
However, we cannot encode them back; they are equivalent to [a, b], i.e., a + (uint64_t)b, therefore it is safe to replace them with [a, b]. Similarly, we can safely replace [a, b, UXTW #0] with [a, b, UXTW], which is equivalent to a + (uint32_t)b.
Also, we don’t need to worry about user-mode pointers. Unlike the Windows kernel and its drivers, which can directly access user-mode memory (SMEP is enabled but SMAP is not), iOS enables the Privileged Access Never (PAN) feature. PAN, introduced with the ARMv8.1-A architecture in the Apple A10 processor, is a security mechanism that prevents the kernel from accessing virtual addresses shared with userspace. As a result, we can be sure that no part of our KEXTs interacts with user-mode memory directly.
But we should be careful about unprivileged load and store instructions.
Sometimes the OS does need to access unprivileged regions, for example, to write to a buffer owned by an application. To support this, the instruction set provides the LDTR and STTR instructions.
LDTR and STTR are unprivileged loads and stores: they are checked against the EL0 permissions even when executed by the OS at EL1 or EL2. Because these are explicitly unprivileged accesses, they are not blocked by PAN. (images by ARM)
ARM64 uses a weak memory model, meaning there is no requirement for non-dependent loads/stores to normal memory in program order to be observed by the memory system in that same order. To get sequentially consistent ordering, Arm64 provides explicit barrier instructions (DMB, DSB, and ISB), which have to be placed after the load or store instructions. This means that if there is any barrier, it will execute right after our trampoline (after executing the original instruction, the trampoline jumps back to the instruction following it), so we don’t need to worry about patched instructions that rely on barriers. Also, I have not seen any barrier instructions in the KEXTs; they use exclusive and atomic instructions, which rely on implicit barriers.
Based on the information above, we instrument the KEXT with the following steps:
- Enumerate memory read/write instructions; unlike collecting code coverage, we have to instrument all of them.
- Generate instructions to calculate the effective address of each instruction and pass it to _check_kasan. These instructions are rewritten into a trampoline that calls _check_kasan. If a tag mismatch is detected, the system panics. Otherwise, the trampoline restores the context, executes the original instruction, and transitions to the next instruction.
void check_kasan(uint8_t *address)
{
if (address) {
uint8_t pointer_tag = (uint8_t)((uintptr_t)address >> 56);
uint8_t memory_tag = *(uint8_t *)(((((uintptr_t)address) | 0xFF00000000000000ULL) >> 4) + 0xF000000000000000ULL);
if ((pointer_tag != memory_tag) && (pointer_tag != 0xFF)) {
asm ("brk 0x0");
}
}
}
void instrument_kasan()
{
asm volatile (
".rept " xstr(REPEAT_COUNT_KASAN) "\n" // Repeat the following block many times
" STR x30, [sp, #-16]!\n" // Save LR. we can't restore it in pop_regs. as we have jumped here.
" bl _push_regs\n"
" nop\n" // Instruction to calc effective address.
" nop\n" // Instruction to calc effective address and send it as arg0.
" bl _check_kasan\n"
" bl _pop_regs\n"
" LDR x30, [sp], #16\n" // Restore LR
" nop\n" // Placeholder for original instruction.
" nop\n" // Placeholder for jump back
".endr\n" // End of repetition
);
}
I have tested Pishi’s binary KASAN with several KEXTs, including APFS. By combining it with our code coverage, we can achieve effective fuzzing.
KASAN for intrinsic functions
Not all memory reads or writes happen through assigning or dereferencing pointers; sometimes they occur inside memory manipulation functions like memset and memcpy.
Generally, these functions are implemented as intrinsic functions. An intrinsic function is a function that the compiler implements directly, when possible, instead of relying on a library-provided version.
In the KASAN build, XNU has to provide its own versions of intrinsic functions, including __asan_memset, __asan_memcpy, and __asan_memmove, to validate memory accesses and detect violations.
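Conceptually, such a wrapper just validates both ranges before delegating to the real routine. A minimal sketch, with a stub standing in for the actual tag/shadow verification (this is not XNU’s implementation), could look like:
#include <stddef.h>
#include <string.h>
// Stub for the per-granule tag/shadow check performed by the KASAN runtime.
static void kasan_check_range(const void *addr, size_t size, int is_write)
{
    (void)addr; (void)size; (void)is_write;
}
// Sketch of an ASAN-aware memcpy: validate both ranges, then delegate.
static void *asan_memcpy_sketch(void *dst, const void *src, size_t n)
{
    kasan_check_range(src, n, 0); // reading n bytes from src
    kasan_check_range(dst, n, 1); // writing n bytes to dst
    return memcpy(dst, src, n);
}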
To make a KEXT’s memory manipulation functions KASAN-aware, we can find the addresses of these functions in the kernel collection via the DWARF file and then enumerate all the instructions in our KEXT. If there is a BL (branch and link) to one of the intrinsic functions, we replace the call with the corresponding __asan version.
mov x2,x22
bl _memcpy
mov w19,#0x0
mov w8,#0x1
intrinsic_functions = ["fffffe000a517f70", "___asan_memcpy"] # fffffe000a517f70 is _memcpy
for inst in instructions:
if inst.getMnemonicString() == "bl":
if str(inst.getOpObjects(0)[0]) == intrinsic_functions[0]: # first operand of BL is the target address.
hook_to_asan(inst.getAddress(), "___asan_memcpy")
mov x2,x22
bl ___asan_memcpy
mov w19,#0x0
mov w8,#0x1
Benchmark
Let’s instrument APFS using the KCOV and KASAN modes of Pishi, and benchmark it similarly to part 1 of the blog.
Demonstration
Let’s look at the following heap overflow in the kernel. We will trigger this vulnerability from user mode.
void heap_overflow(uint16_t oow)
{
char* a = (char* ) IOMalloc(16);
printf("[PISHI] allocated: 16 index: %u\n", oow);
a[oow] = 0x41;
}
This video shows that triggering this kernel vulnerability in a KEXT does not result in a kernel panic. To increase the chances of a kernel panic we will use Pishi.
In the second video, triggering this heap overflow in my instrumented KEXT causes a kernel panic.
Instead of directly invoking the brk instruction, I call XNU’s kasan_crash_report function to trigger a clean, classic KASAN-style kernel panic report.
The final note
LLVM instruments every memory access with the following code when building XNU with KASAN.
uint8_t shadow_tag = *(uint8_t *)((((uintptr_t)pointer | 0xff00000000000000) >> 4) + 0xf000000000000000);
uint8_t pointer_tag = (uint8_t)((uintptr_t)pointer >> 0x38);
if (((pointer_tag != shadow_tag) && (pointer_tag != 0xff)) &&
((0xf < shadow_tag ||
(((uint)shadow_tag <= ((uintptr_t)pointer & 0xf) + 7 ||
(pointer_tag != *(uint8_t *)((uintptr_t)pointer | 0xff0000000000000f))))))) {
SoftwareBreakpoint();
}
This results in the kernel binary being filled with useless code that is never executed. If you rewrite the above code with nested if statements, you will see that the 0xf < shadow_tag condition is always true, because kasan_tbi_full_tag, which is called by vm_memtag_assign_tag, always ORs the tag with 0xF0.
void check(uint8_t *pointer)
{
uint8_t shadow_tag = *(uint8_t *)(
(((uintptr_t)pointer | 0xff00000000000000ULL) >> 4) + 0xf000000000000000ULL
);
uint8_t pointer_tag = (uint8_t)((uintptr_t)pointer >> 0x38);
if ((pointer_tag != shadow_tag) && (pointer_tag != 0xff)) {
if (0xf < shadow_tag) { // if we are here then this `if` is always true.
SoftwareBreakpoint();
} else {
uint8_t memory_val = *(uint8_t *)((uintptr_t)pointer | 0xff0000000000000fULL);
uint32_t offset_check = ((uintptr_t)pointer & 0xf) + 7;
if ((uint)shadow_tag <= offset_check || (pointer_tag != memory_val)) {
SoftwareBreakpoint();
}
}
}
}
static inline uint8_t
kasan_tbi_full_tag(void)
{
return kasan_tbi_full_tags[kasan_tbi_lfsr_next() %
sizeof(kasan_tbi_full_tags)] | 0xF0;
}
Another reason the LLVM-generated code mentioned above is redundant is that tag-based KASAN cannot support partial addressing. This is because all memory allocated with kalloc comes from a pre-made collection of zones, one per size class (kalloc.16, kalloc.32, …). This ensures that allocated memory is always addressable in multiples of 16, making partial addressing ineffective. For example, if a single byte is requested, 16 bytes will be allocated; if a 15-byte overflow then occurs, it remains within the padding space, does not spill into the next allocated object, and goes undetected. Interestingly, if we write 16 bytes, even without Pishi, printf in a KASAN build of XNU will catch the mismatch, since its code is instrumented via KASAN as part of XNU.
char* a = (char* ) IOMalloc(1);
for ( int i=0; i < 15; i++) {
a[i] = 0x41;
}
printf("[Pishi] a %s\n", a);
I can’t classify these as vulnerabilities since they don’t overflow into the next object or its metadata. However, this does not hold for objects in zones that aren’t allocated in multiples of 16.
Intra-object vulnerabilities are another example of issues that KASAN or tag-based detection methods cannot identify:
struct S {
int buffer[5];
int other_field;
};
void Foo(S *s, int idx) {
s->buffer[idx] = 0; // if idx == 5, asan will not complain
}
- https://arxiv.org/pdf/1802.09517
- https://clang.llvm.org/docs/HardwareAssistedAddressSanitizerDesign.html
- https://www.microsoft.com/en-us/security/blog/2023/01/26/introducing-kernel-sanitizers-on-microsoft-platforms/
- RevARM: A Platform-Agnostic ARM Binary Rewriter for Security Applications
- REV.NG: A Unified Binary Analysis Framework to Recover CFGs and Function Boundaries