tarnish

A library for isolating crash-prone code in separate processes with automatic recovery.

Why?

Sometimes you need to run code that might crash your process. Maybe you're calling into a C library through FFI, and somewhere in that library there's a null pointer dereference waiting to happen. Or you're using a third-party sys-crate with brittle unsafe code. Or you're experimenting with code that panics unpredictably.

Rust's type system can't protect you from segfaults in C code. It can't prevent an abort() call in a dependency. When those things happen, the entire process terminates.

This library provides crash isolation: the fragile code runs in a separate process. If it crashes, the parent process survives and can restart the worker.

This library was born out of the need to wrap a brittle FFI binding that would occasionally segfault. That specific use case works well. I haven't extensively tested it beyond that, so proceed with appropriate caution for your use case.

Features

  • Crash isolation: Process-level isolation survives segfaults, panics, and aborts
  • Automatic recovery: Workers automatically restart after crashes
  • Type-safe messaging: Built-in serialization for process communication
  • Trait-based API: Simple, composable design
  • Cross-platform: Works anywhere Rust can spawn processes
  • General-purpose: Not limited to FFI use cases

Differences from std::panic::catch_unwind

std::panic::catch_unwind can only catch Rust panics that use unwinding.

| Feature                 | catch_unwind                 | tarnish                              |
|-------------------------|------------------------------|--------------------------------------|
| Isolation Level         | Thread-level (same process)  | Process-level (separate process)     |
| Survives segfaults      | ❌ No                        | ✅ Yes                               |
| Survives C/FFI crashes  | ❌ No                        | ✅ Yes                               |
| Survives abort()        | ❌ No                        | ✅ Yes                               |
| Survives panic-abort    | ❌ No                        | ✅ Yes                               |
| Survives stack overflow | ❌ No*                       | ✅ Yes                               |
| Performance             | Very fast (nanoseconds)      | Slower (process spawn + IPC)         |
| Memory overhead         | Minimal                      | Separate process (~few MB)           |
| Trait bounds            | Requires UnwindSafe          | Requires Serialize + Deserialize     |
| Data sharing            | Direct (same address space)  | Serialized (across process boundary) |

The two handle failures at different levels: catch_unwind catches panics within the same process, while tarnish isolates code in a separate process so it survives any crash.

Here are some examples of what catch_unwind cannot handle:

use std::panic;

// ❌ Segfault from unsafe code
panic::catch_unwind(|| unsafe {
    let ptr = std::ptr::null_mut::<i32>();
    *ptr = 42; // SEGFAULT - entire process dies
});

// ❌ C library crash
panic::catch_unwind(|| unsafe {
    libc::abort(); // Terminates entire process
});

// ❌ Panic with -C panic=abort
// (Terminates entire process)

// ❌ Stack overflow (usually)
// (May or may not be caught)

The cost of tarnish is higher latency and memory usage due to process spawning and inter-process communication. Use catch_unwind for pure Rust code, and tarnish when calling unsafe FFI or when you need absolute crash isolation.

How It Works

You implement a Task trait that encapsulates your risky business logic. When you spawn a worker, the library launches a new instance of your own binary, but with a special environment variable set. Your main() function checks for this variable at startup. If it's there, you know you're the worker subprocess, and you should run the worker loop. If it's not there, you're the parent, and you can spawn workers as needed.
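
Conceptually, the dispatch that tarnish::main (used in the examples below) performs on your behalf looks roughly like this sketch; the TARNISH_WORKER variable name is made up for illustration, and the real mechanism is internal to the library:

use std::env;

fn main() {
    // Hypothetical sketch of the check that tarnish::main does for you.
    if env::var_os("TARNISH_WORKER").is_some() {
        // Spawned as the worker subprocess: run the worker loop,
        // reading requests from stdin and writing responses to stdout.
        run_worker_loop();
    } else {
        // Running as the parent: spawn workers and hand out work.
        run_parent();
    }
}

fn run_worker_loop() { /* handled by the library in practice */ }
fn run_parent() { /* your application logic goes here */ }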

This is conceptually similar to the classic fork pattern on Unix systems, but it works on any platform that can spawn processes. The parent and worker communicate over stdin and stdout using messages serialized with postcard (see the Serialization Format section below for details). The concrete messaging format is an implementation detail that you should not rely on; it can change in future versions.

Serialization and deserialization happen automatically, so unlike with the raw Unix fork-exec model, you only need to worry about the business logic.

When a worker panics or crashes, the parent notices immediately: it spawns a fresh worker and returns an error for the failed call, leaving the retry decision to you (see When Tasks Crash below). If the crash was transient (cosmic ray, memory pressure, who knows), a retry succeeds. If it was deterministic (i.e., that input will always crash), the retry fails too, and you get the error back again. Either way, your parent process keeps running.

Task Trait

The Task trait looks like this:

pub trait Task: Default + 'static {
    type Input: Serialize + Deserialize;
    type Output: Serialize + Deserialize;
    type Error: Display;

    fn run(&mut self, input: Self::Input) -> Result<Self::Output, Self::Error>;
}

Your input and output types need to derive Serialize and Deserialize; everything else happens behind the scenes. You can also use types from the standard library like String for both input and output if that's all you need. (There is a blanket implementation for those.)
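
As a minimal sketch, here is a task that only passes String values back and forth, relying on the standard-library impls mentioned above (the EchoTask name is just for illustration):

use tarnish::{Process, Task};

#[derive(Default)]
struct EchoTask;

impl Task for EchoTask {
    type Input = String;
    type Output = String;
    type Error = String;

    fn run(&mut self, input: String) -> Result<String, String> {
        // No serde derives needed here: String already satisfies the bounds.
        Ok(format!("echo: {input}"))
    }
}

fn main() {
    tarnish::main::<EchoTask>(|| {
        let mut process = Process::<EchoTask>::spawn().expect("Failed to spawn worker");
        println!("{:?}", process.call("hello".to_string()));
    });
}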

Inline Tasks Using the task! Macro

For simple cases, you can use the task! macro instead of manually implementing the Task trait:

use tarnish::task;

fn main() {
    // Simple task with explicit return type
    let result = task!(calculate: || -> Result<i32, String> {
        // This code runs in an isolated process
        Ok(42)
    });

    // Task with default return type (tarnish::Result<()>)
    let result = task!(simple: || {
        // Do something that might crash
        Ok(())
    });
}

The macro automatically generates the Task implementation and handles process spawning. Each task runs in its own isolated process, so if it crashes, your main process survives.

The label (calculate, simple) is required to generate a unique type for each task. When a subprocess spawns, it uses this label to identify which task it should run. Each task! call in your binary needs a unique label. If this turns out to be a limitation, please open an issue.

The macro is perfect for quick isolation of crash-prone code blocks without the ceremony of defining a full Task struct. Use the full Task trait when you need persistent workers that handle multiple requests, maintain state between calls, or when you want more control over the lifecycle.
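
To see the isolation in action, the sketch below (assuming, as in the examples above, that the macro returns a Result you can inspect) aborts inside the task body; only the worker process dies, and the parent observes an error:

use tarnish::task;

fn main() {
    // The abort() kills only the isolated worker process.
    let result = task!(crashy: || -> Result<i32, String> {
        std::process::abort();
    });

    // The parent survives and sees the crash as an error.
    assert!(result.is_err());
    println!("Parent is still alive");
}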

Example Use-Case: Wrapping Crash-Prone FFI

The original use-case is isolating FFI calls that might crash, so let's look at an example in more detail.

use tarnish::{Task, Process};
use serde::{Serialize, Deserialize};

#[derive(Default)]
struct UnsafeFFIWrapper;

// Define your input/output types

#[derive(Serialize, Deserialize)]
struct Input {
    operation: String,
    data: Vec<u8>,
}

#[derive(Debug, Serialize, Deserialize)]
struct Output {
    success: bool,
    data: Vec<u8>,
}

impl Task for UnsafeFFIWrapper {
    type Input = Input;
    type Output = Output;
    type Error = String;

    fn run(&mut self, input: Input) -> Result<Output, String> {
        // This unsafe block might segfault!
        // If it does, only this worker process dies, not the parent.
        unsafe {
            let result = some_unsafe_c_function(
                input.data.as_ptr(),
                input.data.len()
            );

            if result.is_null() {
                return Err("C function returned null".to_string());
            }

            // Process the result...
            Ok(Output {
                success: true,
                data: vec![],
            })
        }
    }
}

fn main() {
    tarnish::main::<UnsafeFFIWrapper>(parent_main);
}

fn parent_main() {
    let mut process = Process::<UnsafeFFIWrapper>::spawn()
        .expect("Failed to spawn worker");

    let input = Input {
        operation: "transform".to_string(),
        data: vec![1, 2, 3, 4],
    };

    match process.call(input) {
        Ok(output) => {
            println!("FFI call succeeded: {:?}", output);
        }
        Err(e) => {
            // If the C code segfaulted, we get an error here,
            // but the parent process is still running
            eprintln!("Worker crashed or returned error: {}", e);
        }
    }
}

// Your unsafe FFI declaration
unsafe extern "C" {
    fn some_unsafe_c_function(data: *const u8, len: usize) -> *mut std::ffi::c_void;
}

Note how main() just calls tarnish::main() with the parent logic function; that call handles the parent-vs-worker dispatch described in How It Works.

Process Pools

For concurrent task execution, the library offers a ProcessPool that manages multiple worker processes. This is useful when your tasks are CPU-bound, which is common when wrapping *-sys crates.

use std::num::NonZeroUsize;
use tarnish::{Task, ProcessPool};

#[derive(Default)]
struct HeavyComputation;

impl Task for HeavyComputation {
    type Input = Vec<u8>;
    type Output = u64;
    type Error = String;

    fn run(&mut self, input: Vec<u8>) -> Result<u64, String> {
        // Do the expensive computation
        Ok(input.iter().map(|&x| x as u64).sum())
    }
}

fn main() {
    tarnish::main::<HeavyComputation>(|| {
        let size = NonZeroUsize::new(4).unwrap();
        let mut pool = ProcessPool::<HeavyComputation>::new(size)
            .expect("Failed to create pool");

        // Process tasks across 4 workers
        for i in 0..100 {
            let result = pool.call(vec![i; 1000]);
            println!("Result: {:?}", result);
        }
    });
}

Each pool provides a set of guarantees:

  • Uses round-robin scheduling to distribute work
  • Automatically restarts crashed workers, just like Process does
  • Each worker maintains its own isolated process memory
  • Workers persist between calls for efficiency

Shutdown

When you drop a Process handle, it sends a shutdown message to the task and waits for up to 5 seconds. If the task doesn't exit cleanly, it gets a SIGKILL.
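
No explicit shutdown call is needed; letting the handle go out of scope (or dropping it eagerly) triggers that sequence. A minimal sketch, reusing a process handle like the ones spawned in the examples above:

// `process` is a Process<SomeTask> handle as in the earlier examples.
let _ = process.call(input);

// Dropping the handle asks the worker to exit, waits up to 5 seconds,
// and escalates to SIGKILL if it has not exited by then.
drop(process);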

When Tasks Crash

When a task crashes mid-operation, process.call() automatically restarts the task and returns an error. The fresh task is ready for the next call.

You have to retry the operation yourself.

// Try once, retry on failure
let result = process.call(input.clone())
    .or_else(|_| process.call(input));

Or implement more sophisticated retry logic with backoff, attempt limits, and so on. The library keeps a fresh task available; you decide when to retry.
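
For example, a bounded retry loop with a simple backoff might look like the following sketch (the attempt count and delays are arbitrary; `process` and `input` are the same as in the snippet above):

use std::thread;
use std::time::Duration;

let mut result = process.call(input.clone());
for attempt in 1..=3 {
    if result.is_ok() {
        break;
    }
    // Back off a little longer after each failed attempt before retrying
    // against the freshly restarted worker.
    thread::sleep(Duration::from_millis(100 * attempt));
    result = process.call(input.clone());
}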

Serialization Format

Messages are serialized with postcard using COBS encoding. Postcard is a compact binary format (~10-20% the size of JSON), and COBS adds only ~0.4% overhead while providing natural frame delimiters.

How it works: COBS encoding transforms the binary data so that it contains no interior 0x00 bytes, which lets a 0x00 byte serve as the message delimiter.

This is an implementation detail, however, and may change in future versions.
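
To make the framing concrete, here is a standalone sketch using postcard's COBS helpers directly; it only illustrates the encoding and does not touch tarnish's internals (and it assumes postcard's alloc support is enabled):

use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize, Debug, PartialEq)]
struct Message {
    id: u32,
    payload: Vec<u8>,
}

fn main() {
    let msg = Message { id: 7, payload: vec![0, 1, 2] };

    // postcard + COBS: the encoded bytes contain no interior 0x00,
    // which is what lets a 0x00 byte act as a frame delimiter on the stream.
    let mut framed = postcard::to_allocvec_cobs(&msg).expect("encode failed");

    // COBS decoding happens in place, hence the mutable buffer.
    let decoded: Message = postcard::from_bytes_cobs(&mut framed).expect("decode failed");
    assert_eq!(decoded, msg);
}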

If you really don't want the serde dependency, you can disable the default features and implement MessageEncode and MessageDecode manually for your types. But honestly, you probably want serde.

Limitations

Warning

This library provides crash isolation, not security isolation. It protects your parent process from crashes (segfaults, panics, aborts), but does NOT sandbox malicious code. Worker processes have full access to the filesystem, network, and other system resources. Do not use this to run untrusted code.

Platform support: Tested on macOS. Probably works on other Unix-like systems. Windows support would require work around process spawning and signal handling.

Requirements: Tasks must implement Default (for spawning fresh workers) and be 'static (no borrowed data across process boundaries).

Similar Libraries

tarnish overlaps with several existing crates: Sandcrust, rusty-fork, Bastion, rust_supervisor, and general-purpose process crates like subprocess and async-process. The relevant points of comparison are process isolation, automatic restart, surviving segfaults, production readiness, built-in IPC, a trait-based API, an FFI focus, support for running external commands, and active maintenance.

About The Name

Tarnish is the protective layer that forms on metal when it's exposed to air. It looks like damage, but it's actually protecting the metal underneath from further corrosion. When you scratch it off, it just grows back.

I think you now understand where I'm going with this. This library is similar. The worker process is "the tarnish." It takes the hits so your main process doesn't have to. When it gets damaged, we regenerate it. The protection continues. This and the Rust pun.