Introduction
This section covers the prerequisite knowledge and practical considerations for exploiting the SLUB allocator.
Kernel Heap Exploitation and CPU Core Binding
The SLUB allocator serves allocations from the current core's kmem_cache_cpu first. On a multi-core machine there is one kmem_cache_cpu per core, and because the scheduler balances load across cores, the exploit process may migrate between them. Kernel objects allocated during exploitation can therefore come from different kmem_cache_cpu instances, which complicates the exploitation model and lowers the success rate.
For example, if you set up a double free on core 0, and then when you're ready for the next step, the exp runs on core 1, things can get very confusing :(
Therefore, to keep the exploit stable, we bind the process to a specific CPU core. From our perspective, the SLUB model then reduces to a single kmem_cache_node plus a single kmem_cache_cpu, which makes exploitation much more convenient.
Here is a template for binding the exp process to a specific core:
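#define _GNU_SOURCE   /* needed for cpu_set_t, CPU_* and sched_setaffinity() */
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* to run the exp on the specific core only */
void bind_cpu(int core)
{
    cpu_set_t cpu_set;

    CPU_ZERO(&cpu_set);
    CPU_SET(core, &cpu_set);
    if (sched_setaffinity(getpid(), sizeof(cpu_set), &cpu_set) < 0) {
        perror("sched_setaffinity");   /* e.g. the core is not available to us */
        exit(EXIT_FAILURE);
    }
    printf("\033[34m\033[1m[*] Process bound to core \033[0m%d\n", core);
}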
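As a complement to the template above, here is a minimal sketch (not part of the original template) that pins the calling thread to the first core it is already allowed to run on and then reads the affinity mask back. The helper names `pin_to_first_allowed_cpu` and `is_pinned_to` are hypothetical, and the sketch assumes Linux with glibc, where `sched_getaffinity()` and the `CPU_*` macros are available under `_GNU_SOURCE`.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>

/*
 * Hypothetical helper: pin the calling thread to the lowest-numbered CPU
 * in its current affinity mask. Returns that core index, or -1 on failure.
 */
int pin_to_first_allowed_cpu(void)
{
    cpu_set_t allowed, target;

    /* pid 0 means "the calling thread" for both affinity calls */
    if (sched_getaffinity(0, sizeof(allowed), &allowed) < 0)
        return -1;

    for (int core = 0; core < CPU_SETSIZE; core++) {
        if (!CPU_ISSET(core, &allowed))
            continue;

        CPU_ZERO(&target);
        CPU_SET(core, &target);
        if (sched_setaffinity(0, sizeof(target), &target) < 0)
            return -1;
        return core;
    }
    return -1;
}

/*
 * Hypothetical helper: returns 1 if the calling thread's affinity mask
 * contains exactly `core`, 0 if not, -1 if the mask cannot be read.
 */
int is_pinned_to(int core)
{
    cpu_set_t current;

    assert(core >= 0 && core < CPU_SETSIZE);
    if (sched_getaffinity(0, sizeof(current), &current) < 0)
        return -1;
    return CPU_COUNT(&current) == 1 && CPU_ISSET(core, &current);
}
```

Choosing from the current mask can help inside containers or restricted cpusets, where a hard-coded core number may lie outside the allowed set and `sched_setaffinity()` would fail.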
Common kmalloc Flags
In kernel object allocation, the most commonly used function is kmalloc(), whose prototype is as follows (from kernel version 6.14.4):
static __always_inline __alloc_size(1) void *kmalloc_noprof(size_t size, gfp_t flags)
//...
#define kmalloc(...) alloc_hooks(kmalloc_noprof(__VA_ARGS__))
The first parameter, size, is the size of the kernel object to allocate. The second parameter, flags, holds the GFP (Get Free Pages) flags, which select the strategy the kernel uses for the allocation; each bit enables a different behavior. For example, setting ___GFP_KSWAPD_RECLAIM_BIT allows the allocation to wake kswapd for memory reclamation when memory is low.
GFP_KERNEL and GFP_KERNEL_ACCOUNT are the most common general-purpose allocation flags in the kernel; each is a combination of several frequently used bits. The only difference is that GFP_KERNEL_ACCOUNT additionally sets ___GFP_ACCOUNT_BIT, which marks the object for accounting through the MEMCG mechanism. This mechanism is part of Memory CGroups, is mainly used to track kernel object allocations, and takes effect when the kernel is built with CONFIG_MEMCG_KMEM=y. If CONFIG_MEMCG_KMEM is not enabled, GFP_KERNEL and GFP_KERNEL_ACCOUNT are equivalent.
We mainly focus on the changes in kernel object allocation when this mechanism is enabled:
- Under normal circumstances, their allocations all come from the same kmem_cache: the general kmalloc-xx.
- For kernels with the CONFIG_MEMCG_KMEM compile option enabled (usually enabled by default), a separate set of kmem_cache instances named kmalloc-cg-* is created for general objects allocated using GFP_KERNEL_ACCOUNT, resulting in isolation between objects using these two flags.
Before version 5.9, GFP_KERNEL and GFP_KERNEL_ACCOUNT allocations were isolated from each other. The isolation was removed in commit 10befea91b61 and reintroduced in kernel version 5.14 by commit 494c1dfe855e (both are linked under Reference).
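Whether a given target actually has the separate kmalloc-cg-* caches can be checked from a slabinfo listing. The following is a minimal sketch, assuming the usual /proc/slabinfo layout in which each data line starts with the cache name; the helper name `count_caches_with_prefix` is hypothetical. It takes an already opened stream so that it also works on a saved copy of the listing.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: count the lines of a slabinfo-style stream whose
 * first column (the cache name) starts with `prefix`.
 * Assumes no line is longer than the local buffer.
 */
int count_caches_with_prefix(FILE *fp, const char *prefix)
{
    char line[512];
    size_t plen;
    int hits = 0;

    assert(fp != NULL && prefix != NULL);
    plen = strlen(prefix);

    while (fgets(line, sizeof(line), fp)) {
        if (strncmp(line, prefix, plen) == 0)
            hits++;
    }
    return hits;
}
```

A caller would typically pass the result of `fopen("/proc/slabinfo", "r")` and the prefix `"kmalloc-cg-"`; note that this file is usually readable only by root, so in a CTF environment this is more of a debugging aid than an exploit step.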
Additionally, when the __GFP_ZERO flag is set, the allocated kernel object is zeroed before it is returned to the caller; this is what kzalloc() does.
When performing binary reverse analysis on kernel images or modules, we usually only see the flag as a constant value, so we need to manually analyze which bits are enabled. The following are the values corresponding to common allocation flags:
- GFP_KERNEL: 0xCC0
- GFP_KERNEL | __GFP_ZERO: 0xDC0
- GFP_KERNEL_ACCOUNT: 0x400CC0
- GFP_KERNEL_ACCOUNT | __GFP_ZERO: 0x400DC0
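When a decompiler only shows the raw constant, a small decoder can save some manual bit counting. The sketch below is an illustration, not kernel code: the helper name `decode_gfp` is hypothetical, and the bit values in the table are assumptions consistent with the constants listed above (they can differ between kernel versions, so check the gfp headers of the target kernel).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

struct gfp_bit {
    unsigned int mask;
    const char *name;
};

/* Assumed bit layout; only the bits needed for the constants above. */
static const struct gfp_bit gfp_bits[] = {
    { 0x000040u, "__GFP_IO" },
    { 0x000080u, "__GFP_FS" },
    { 0x000100u, "__GFP_ZERO" },
    { 0x000400u, "__GFP_DIRECT_RECLAIM" },
    { 0x000800u, "__GFP_KSWAPD_RECLAIM" },
    { 0x400000u, "__GFP_ACCOUNT" },
};

/*
 * Hypothetical helper: write an "A|B|C" description of `flags` into `out`.
 * Bits that are not in the table are appended as one hex value.
 * Output is truncated (but still NUL-terminated) if `out` is too small.
 */
char *decode_gfp(unsigned int flags, char *out, size_t len)
{
    size_t used;

    assert(out != NULL && len > 0);
    out[0] = '\0';

    for (size_t i = 0; i < sizeof(gfp_bits) / sizeof(gfp_bits[0]); i++) {
        if (!(flags & gfp_bits[i].mask))
            continue;
        used = strlen(out);
        snprintf(out + used, len - used, "%s%s",
                 used ? "|" : "", gfp_bits[i].name);
        flags &= ~gfp_bits[i].mask;
    }

    if (flags) {                       /* leftover, unknown bits */
        used = strlen(out);
        snprintf(out + used, len - used, "%s0x%x", used ? "|" : "", flags);
    }
    return out;
}
```

With this table, 0xCC0 decodes to the four bits that make up GFP_KERNEL, and 0x400DC0 adds `__GFP_ZERO` and `__GFP_ACCOUNT` on top of them.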
kmalloc Compiler Optimization
When the size of the kernel object to be allocated is a fixed value, we can know at compile time which kmem_cache the allocation will come from. Therefore, the kernel will optimize the call to kmalloc() into a call to kmem_cache_alloc_noprof(), whose prototype is as follows (from kernel version 6.14.4):
void *kmem_cache_alloc_noprof(struct kmem_cache *cachep,
gfp_t flags) __assume_slab_alignment __malloc;
#define kmem_cache_alloc(...) alloc_hooks(kmem_cache_alloc_noprof(__VA_ARGS__))
In fact, the properties of the kernel's native kmem_cache instances (i.e., kmalloc-xx, etc.) are roughly determined at compile time and are stored in a fixed order in a global kmem_cache array kmalloc_caches (defined in mm/slab_common.c). Therefore, when this type of compiler optimization occurs, you may see something like kmem_cache_alloc_noprof(kmalloc_caches[3], 0xCC0) in the decompiler. The original definition is as follows (from kernel version 6.14.4):
typedef struct kmem_cache * kmem_buckets[KMALLOC_SHIFT_HIGH + 1];
kmem_buckets kmalloc_caches[NR_KMALLOC_TYPES] __ro_after_init =
{ /* initialization for https://llvm.org/pr42570 */ };
EXPORT_SYMBOL(kmalloc_caches);
This variable is actually initialized through new_kmalloc_cache() in one of the kernel memory initialization functions create_kmalloc_caches(). The detailed analysis process is left as an exercise for the reader. Here we directly provide the code locations needed for index calculation (from kernel version 6.14.4, mm/slab_common.c, include/linux/slab.h):
#define INIT_KMALLOC_INFO(__size, __short_size) \
{ \
    .name[KMALLOC_NORMAL] = "kmalloc-" #__short_size, \
    KMALLOC_RCL_NAME(__short_size) \
    KMALLOC_CGROUP_NAME(__short_size) \
    KMALLOC_DMA_NAME(__short_size) \
    KMALLOC_RANDOM_NAME(RANDOM_KMALLOC_CACHES_NR, __short_size) \
    .size = __size, \
}
/*
 * kmalloc_info[] is to make slab_debug=,kmalloc-xx option work at boot time.
 * kmalloc_index() supports up to 2^21=2MB, so the final entry of the table is
 * kmalloc-2M.
 */
const struct kmalloc_info_struct kmalloc_info[] __initconst = {
    INIT_KMALLOC_INFO(0, 0),
    INIT_KMALLOC_INFO(96, 96),
    INIT_KMALLOC_INFO(192, 192),
    INIT_KMALLOC_INFO(8, 8),
    INIT_KMALLOC_INFO(16, 16),
    INIT_KMALLOC_INFO(32, 32),
    INIT_KMALLOC_INFO(64, 64),
    INIT_KMALLOC_INFO(128, 128),
    INIT_KMALLOC_INFO(256, 256),
    INIT_KMALLOC_INFO(512, 512),
    INIT_KMALLOC_INFO(1024, 1k),
    INIT_KMALLOC_INFO(2048, 2k),
    INIT_KMALLOC_INFO(4096, 4k),
    INIT_KMALLOC_INFO(8192, 8k),
    INIT_KMALLOC_INFO(16384, 16k),
    INIT_KMALLOC_INFO(32768, 32k),
    INIT_KMALLOC_INFO(65536, 64k),
    INIT_KMALLOC_INFO(131072, 128k),
    INIT_KMALLOC_INFO(262144, 256k),
    INIT_KMALLOC_INFO(524288, 512k),
    INIT_KMALLOC_INFO(1048576, 1M),
    INIT_KMALLOC_INFO(2097152, 2M)
};
enum kmalloc_cache_type {
    KMALLOC_NORMAL = 0,
#ifndef CONFIG_ZONE_DMA
    KMALLOC_DMA = KMALLOC_NORMAL,
#endif
#ifndef CONFIG_MEMCG
    KMALLOC_CGROUP = KMALLOC_NORMAL,
#endif
    KMALLOC_RANDOM_START = KMALLOC_NORMAL,
    KMALLOC_RANDOM_END = KMALLOC_RANDOM_START + RANDOM_KMALLOC_CACHES_NR,
#ifdef CONFIG_SLUB_TINY
    KMALLOC_RECLAIM = KMALLOC_NORMAL,
#else
    KMALLOC_RECLAIM,
#endif
#ifdef CONFIG_ZONE_DMA
    KMALLOC_DMA,
#endif
#ifdef CONFIG_MEMCG
    KMALLOC_CGROUP,
#endif
    NR_KMALLOC_TYPES
};
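The kmalloc_info[] table above also determines which index a given request size ends up at: the allocation goes to the smallest listed cache that still fits the object. The following user-space sketch illustrates that mapping for the first 14 entries. It is not the kernel's own index function; the helper name `size_to_kmalloc_index` is hypothetical, and it assumes a minimum cache size of 8 bytes and no caches above 8k.

```c
#include <assert.h>
#include <stddef.h>

/* Sizes in kmalloc_info[] order for indices 0..13, copied from the table. */
static const size_t kmalloc_sizes[] = {
    0, 96, 192, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192,
};
#define NR_SIZES (sizeof(kmalloc_sizes) / sizeof(kmalloc_sizes[0]))

/*
 * Hypothetical helper: index of the smallest listed cache that can hold
 * `size` bytes. Returns 0 for size 0 and -1 if the request exceeds 8k.
 */
int size_to_kmalloc_index(size_t size)
{
    int best = -1;

    if (size == 0)
        return 0;

    /* The table is not sorted (96 and 192 come first), so scan all of it. */
    for (size_t i = 1; i < NR_SIZES; i++) {
        if (kmalloc_sizes[i] < size)
            continue;
        if (best < 0 || kmalloc_sizes[i] < kmalloc_sizes[best])
            best = (int)i;
    }

    assert(best < 0 || kmalloc_sizes[best] >= size);
    return best;
}
```

Because 96 and 192 sit at indices 1 and 2 rather than in size order, a 65-byte object maps to index 1 (kmalloc-96) while a 97-byte object maps to index 7 (kmalloc-128).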
In the kernel source code, kmalloc_caches is indexed as kmalloc_caches[type][index], but in a decompiler it typically shows up as a flat, one-dimensional pointer array in which each type occupies a contiguous range of indices. Caches larger than 8k are typically not enabled, so each type generally has 14 slots (indices 0 to 13). Based on this rule, we can work out which cache kmalloc_caches[N] refers to. Here are two examples:
- kmalloc_caches[12]: the index falls within KMALLOC_NORMAL's range 0~13. Taking kmalloc_info[12 - 0], we get size 4k; the corresponding kmem_cache is kmalloc-4k and the corresponding flag is GFP_KERNEL.
- kmalloc_caches[54]: the index falls within KMALLOC_CGROUP's range 42~55. Taking kmalloc_info[54 - 42], we get size 4k; the corresponding kmem_cache is kmalloc-cg-4k and the corresponding flag is GFP_KERNEL_ACCOUNT.
This calculation assumes the kmalloc_cache_type order [KMALLOC_NORMAL, KMALLOC_RECLAIM, KMALLOC_DMA, KMALLOC_CGROUP], taken from the default Gentoo Linux configuration.
Note that this calculation method is only correct in general cases. For some complex special cases, index calculations beyond the KMALLOC_NORMAL range may not be reliable (i.e., calculations other than GFP_KERNEL may not be entirely accurate, because we may not be sure about the number of enabled types and the maximum cache size for each type). Therefore, a more accurate approach is to use dynamic debugging to check the corresponding kmem_cache::name.
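The index arithmetic from the two examples can be sketched as a small helper. This is only an illustration under the same assumptions as the examples: 14 slots per type and the type order NORMAL, RECLAIM, DMA, CGROUP. The helper name `flat_index_to_name` is hypothetical, and the name prefixes are the ones the kernel normally uses for these cache types; as noted above, the result should be confirmed against kmem_cache::name in a debugger.

```c
#include <assert.h>
#include <stdio.h>

#define CACHES_PER_TYPE 14   /* indices 0..13, i.e. up to the 8k cache */

/* Assumed type order; other kernel configurations will differ. */
static const char *const type_prefix[] = {
    "kmalloc-",      /* KMALLOC_NORMAL  */
    "kmalloc-rcl-",  /* KMALLOC_RECLAIM */
    "dma-kmalloc-",  /* KMALLOC_DMA     */
    "kmalloc-cg-",   /* KMALLOC_CGROUP  */
};

/* Short size names in kmalloc_info[] order for indices 0..13. */
static const char *const short_size[CACHES_PER_TYPE] = {
    "0", "96", "192", "8", "16", "32", "64",
    "128", "256", "512", "1k", "2k", "4k", "8k",
};

/*
 * Hypothetical helper: translate the flat index N seen in the decompiler
 * into a cache name. Returns 0 on success, -1 if N is outside the assumed
 * layout or the name does not fit into `out`.
 */
int flat_index_to_name(unsigned int n, char *out, size_t len)
{
    unsigned int type = n / CACHES_PER_TYPE;   /* which cache type    */
    unsigned int idx = n % CACHES_PER_TYPE;    /* kmalloc_info[] slot */
    int written;

    assert(out != NULL && len > 0);
    if (type >= sizeof(type_prefix) / sizeof(type_prefix[0]))
        return -1;

    written = snprintf(out, len, "%s%s", type_prefix[type], short_size[idx]);
    return (written < 0 || (size_t)written >= len) ? -1 : 0;
}
```

Index 0 of each type corresponds to the size-0 table entry and is included only to keep the arithmetic uniform; it is not a cache that real allocations would use.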
SLUB Merging & Isolation
The slab alias mechanism is a mechanism for reusing kmem_cache instances of equal/similar-sized objects:
- When a kmem_cache is being created, if there already exists a kmem_cache that can allocate equal/similar-sized objects, a new kmem_cache will not be created. Instead, an alias will be created for the existing kmem_cache and returned as the "new" kmem_cache.
For example, cred_jar is a kmem_cache specifically used for allocating cred structures. In Linux versions before 4.4, it was an alias for kmalloc-192, meaning that cred structures and other 192-sized objects would all be allocated from the same kmem_cache — kmalloc-192.
For kmem_cache instances that have the SLAB_ACCOUNT flag set during initialization, a new kmem_cache will be created instead of establishing an alias for the existing one. For example, in newer kernel versions, cred_jar and kmalloc-192 are two independent kmem_cache instances, which do not interfere with each other.
Reference
https://arttnba3.cn/2021/03/03/PWN-0X00-LINUX-KERNEL-PWN-PART-I/
https://arttnba3.cn/2023/02/24/OS-0X04-LINUX-KERNEL-MEMORY-6.2-PART-III/
https://kernel.org/doc/html/v5.4/admin-guide/cgroup-v1/memory.html
https://lwn.net/Articles/821664/
https://github.com/torvalds/linux/commit/10befea91b61c4e2c2d1df06a2e978d182fcf792
https://github.com/torvalds/linux/commit/494c1dfe855ec1f70f89552fce5eadf4a1717552