How the Size of a Future Affects Performance

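In Rust async programming, there is a common claim that the size of a Future significantly affects performance. Have you ever doubted it? And if it is true, where does the performance difference come from? Today I read through some source code and wrote a few experiments to find out.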

How is the size of a Future calculated?

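To check whether the claim "Future size affects performance" holds, let's start with some simple code. The first task is to figure out how the size of a Future is determined. After all, as far as the compiler is concerned, Future is just a trait: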

pub trait Future {
    type Output;
    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output>;
}

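So does the size depend on the concrete type that implements the trait? I read through the smol source code and found that when a Future is spawned, the relevant code (in async-task) handles it like this: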

pub unsafe fn spawn_unchecked<'a, F, Fut, S>(
    self,
    future: F,
    schedule: S,
) -> (Runnable<M>, Task<Fut::Output, M>)
where
    F: FnOnce(&'a M) -> Fut,
    Fut: Future + 'a,
    S: Schedule<M>,
    M: 'a,
{
    // Allocate large futures on the heap.
    let ptr = if mem::size_of::<Fut>() >= 2048 {
        let future = |meta| {
            let future = future(meta);
            Box::pin(future)
        };
        RawTask::<_, Fut::Output, S, M>::allocate(future, schedule, self)
    } else {
        RawTask::<Fut, Fut::Output, S, M>::allocate(future, schedule, self)
    };

    let runnable = Runnable::from_raw(ptr);
    let task = Task {
        ptr,
        _marker: PhantomData,
    };
    (runnable, task)
}

Here mem::size_of::<Fut>() computes the size of the Future. Let me write a simple Future to verify:

use async_executor::Executor;
use futures_lite::future;
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll};

pub struct LargeFuture {
    pub data: [u8; 10240],
}

impl Future for LargeFuture {
    type Output = usize;

    fn poll(self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<Self::Output> {
        let value = self.data[0];
        println!("First byte: {}", value);
        Poll::Ready(self.data.len())
    }
}

fn main() {
    let ex = Executor::new();
    let large_future = LargeFuture { data: [0u8; 10240] };
    let res = future::block_on(ex.run(async { ex.spawn(large_future).await }));
    println!("Result: {}", res);
}

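After adding a log statement to the async-task spawn_unchecked function shown above, the printed size is 10256, exactly 16 bytes more than the struct itself. Tracing the call upward shows that the original Future is wrapped once more, so that when the Future has finished and is dropped, the task is removed from the runtime: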

let future = AsyncCallOnDrop::new(future, move || drop(state.active().try_remove(index)));
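The extra bytes can be reproduced without the executor. The following is a minimal sketch, not async-executor's actual code: WithCleanup, wrap, and the registry argument are hypothetical stand-ins, and the only assumption is that the cleanup closure captures one shared pointer plus one index, which is two machine words (16 bytes on a 64-bit target).

```rust
use std::future::Future;
use std::mem;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll};

/// Same shape as the LargeFuture in the article: 10240 bytes of payload.
pub struct LargeFuture {
    pub data: [u8; 10240],
}

impl Future for LargeFuture {
    type Output = usize;

    fn poll(self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<usize> {
        Poll::Ready(self.data.len())
    }
}

/// Hypothetical wrapper that runs a cleanup closure when it is dropped.
pub struct WithCleanup<F, C: FnMut()> {
    pub future: F,
    pub cleanup: C,
}

impl<F, C: FnMut()> Drop for WithCleanup<F, C> {
    fn drop(&mut self) {
        (self.cleanup)();
    }
}

/// Wraps `future` with a closure that captures a shared pointer and an index,
/// mimicking "remove slot `index` from the shared state".
pub fn wrap<F: Future>(
    future: F,
    registry: Arc<Vec<u8>>,
    index: usize,
) -> WithCleanup<F, impl FnMut()> {
    WithCleanup {
        future,
        cleanup: move || {
            // Touch both captures so the closure really owns them.
            let _ = (registry.len(), index);
        },
    }
}

/// Returns (size of the plain future, size of the wrapped future).
pub fn sizes() -> (usize, usize) {
    let plain = LargeFuture { data: [0u8; 10240] };
    let plain_size = mem::size_of_val(&plain);
    let wrapped = wrap(plain, Arc::new(Vec::new()), 0);
    (plain_size, mem::size_of_val(&wrapped))
}

fn main() {
    let (plain, wrapped) = sizes();
    println!("plain: {plain} bytes, wrapped: {wrapped} bytes");
}
```

The wrapped size should exceed the plain size by exactly the size of the closure's captures, which matches the 16-byte difference observed in the log.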

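This explains the slight increase in size. The size of a struct is easy to reason about, but how is the size of an async function calculated? That comes down to how the Rust compiler transforms async code.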

The async state machine: the giant beneath the iceberg

When you write a simple async fn, the Rust compiler quietly performs a complex transformation behind the scenes:

async fn function() -> usize {
    let data = [0u8; 102400];
    future::yield_now().await;
    data[0] as usize
}

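The compiler turns this code into a large state machine that tracks execution progress and stores every variable that lives across an .await point. Conceptually, the generated type encapsulates the state transitions roughly like this: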

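// 1. Define the states of the state machine
enum FunctionState {
    // Initial state
    Initial,

    // State after yield_now suspends; it must hold every variable that lives across the await point
    Suspended {
        data: [u8; 102400], // the whole array has to be stored!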
    },

    // Completed state
    Completed,
}

// 2. Define the state machine struct
struct FunctionFuture {
    // Current state
    state: FunctionState,

    // The yield_now future
    yield_fut: Option<YieldNow>,
}

// 3. Implement the Future trait for the state machine
impl Future for FunctionFuture {
    type Output = usize;

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<usize> {
        // Get a mutable reference to the pinned state machine
        let this = unsafe { self.get_unchecked_mut() };

        match &mut this.state {
            FunctionState::Initial => {
                // Create the large array
                let data = [0u8; 102400];
                // Create the yield future and store it
                this.yield_fut = Some(future::yield_now());

                // Transition state, storing everything that must live across the await
                this.state = FunctionState::Suspended { data };

                // Poll the yield future right away
                match Pin::new(&mut this.yield_fut.as_mut().unwrap()).poll(cx) {
                    Poll::Ready(_) => {
                        // If it completes immediately, return the result
                        if let FunctionState::Suspended { data } = &this.state {
                            let result = data[0] as usize;
                            this.state = FunctionState::Completed;
                            Poll::Ready(result)
                        } else {
                            unreachable!()
                        }
                    }
                    Poll::Pending => Poll::Pending,
                }
            }

            FunctionState::Suspended { data } => {
                // Keep polling the yield future
                match Pin::new(&mut this.yield_fut.as_mut().unwrap()).poll(cx) {
                    Poll::Ready(_) => {
                        // The yield is done: read the first element and return it
                        let result = data[0] as usize;
                        this.state = FunctionState::Completed;
                        Poll::Ready(result)
                    }
                    Poll::Pending => Poll::Pending,
                }
            }

            FunctionState::Completed => {
                panic!("Future polled after completion")
            }
        }
    }
}

As you can see, the Suspended state contains the large array. When the state switches from Initial to Suspended, data is kept in full.

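So for an async function, any local variable that must stay alive across an await is stored in the state machine, which can greatly increase the size of the Future generated at compile time.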
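This can be checked directly with std::mem::size_of_val, with no executor involved. The sketch below uses only the standard library; holds_array and holds_byte are illustrative names, and std::future::ready stands in for yield_now as the suspend point.

```rust
use std::future::ready;
use std::mem::size_of_val;

/// Keeps a 100 KiB array alive across the await point.
async fn holds_array() -> usize {
    let data = [0u8; 102400];
    ready(()).await; // suspend point
    data[0] as usize
}

/// Keeps only a single byte alive across the await point.
async fn holds_byte() -> usize {
    let data = 0u8;
    ready(()).await; // suspend point
    data as usize
}

/// Returns (size of holds_array's future, size of holds_byte's future).
/// Calling an async fn only builds the future; nothing is polled here.
pub fn sizes() -> (usize, usize) {
    (size_of_val(&holds_array()), size_of_val(&holds_byte()))
}

fn main() {
    let (big, small) = sizes();
    println!("holds_array: {big} bytes");
    println!("holds_byte:  {small} bytes");
}
```

The first future has to be at least as large as the array it stores, while the second needs only a few bytes for its state tag and the saved value.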

How size affects performance

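With the size of a Future pinned down, let's verify its effect on performance with code. The mem::size_of::<Fut>() >= 2048 check shown earlier means that when a Future is large enough, Box::pin(future) allocates it on the heap, which in theory adds overhead. This design may rest on a few considerations: a small Future is embedded directly in the task structure, which improves cache hit rates; embedding a large Future would bloat the task structure and take up too much stack space, which hurts performance.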
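To make the effect of that check concrete, here is a small sketch of what boxing changes. It is not async-task's code: BOX_THRESHOLD, would_box, and demo are hypothetical names, with the threshold copied from the snippet quoted earlier. The point is that once a Future is behind Pin<Box<_>>, the value that gets moved around is a single pointer, however large the state machine is.

```rust
use std::future::{ready, Future};
use std::mem::{size_of, size_of_val};
use std::pin::Pin;

/// Mirrors the 2048-byte check quoted from async-task (hypothetical constant name).
const BOX_THRESHOLD: usize = 2048;

/// A future that keeps a 10 KiB array alive across an await point.
async fn large() -> usize {
    let data = [0u8; 10240];
    ready(()).await;
    data.len()
}

/// Would a future of this type take the "allocate on the heap" branch?
pub fn would_box<F: Future>(_fut: &F) -> bool {
    size_of::<F>() >= BOX_THRESHOLD
}

/// Returns (inline size, size of the boxed handle, whether it crosses the threshold).
pub fn demo() -> (usize, usize, bool) {
    let fut = large();
    let inline_size = size_of_val(&fut);
    let crosses = would_box(&fut);
    // After boxing, the state machine lives on the heap; only the pointer is moved.
    let boxed: Pin<Box<_>> = Box::pin(fut);
    (inline_size, size_of_val(&boxed), crosses)
}

fn main() {
    let (inline_size, boxed_size, crosses) = demo();
    println!("inline: {inline_size} bytes, boxed handle: {boxed_size} bytes, boxed: {crosses}");
}
```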

My experiments confirm that when an async function holds a larger structure, the Future does run slower, even though the computation is identical:

RESULTS:
--------
Small Future (64B): 100000 iterations in 30.863125ms (avg: 308ns per iteration)
Medium Future (1KB): 100000 iterations in 61.100916ms (avg: 611ns per iteration)
Large Future (3KB): 100000 iterations in 105.185292ms (avg: 1.051µs per iteration)
Very Large Future (10KB): 100000 iterations in 273.469167ms (avg: 2.734µs per iteration)
Huge Large Future (100KB): 100000 iterations in 5.896455959s (avg: 58.964µs per iteration)

PERFORMANCE RATIOS (compared to Small Future):
-------------------------------------------
Medium Future (1KB): 1.98x slower
Large Future (3KB): 3.41x slower
Very Large Future (10KB): 8.88x slower
Huge Large Future (100KB): 191.44x slower
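For readers who want to try a similar measurement without any dependencies, here is a rough sketch of a timing harness. It is not the benchmark that produced the numbers above: it drives each future with a hand-written busy-polling block_on instead of spawning on an executor, and YieldOnce, with_payload, and time_it are hypothetical names. Absolute timings will differ by machine and build profile, so treat it only as a way to observe the trend.

```rust
use std::future::Future;
use std::pin::{pin, Pin};
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};
use std::time::{Duration, Instant};

/// Waker that does nothing; enough for a loop that polls until ready.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Returns Pending on the first poll and Ready on the second.
struct YieldOnce(bool);

impl Future for YieldOnce {
    type Output = ();

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        if self.0 {
            Poll::Ready(())
        } else {
            self.0 = true;
            cx.waker().wake_by_ref();
            Poll::Pending
        }
    }
}

/// Minimal busy-polling block_on; fine here because the futures never wait on I/O.
fn block_on<F: Future>(fut: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = pin!(fut);
    loop {
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return out;
        }
    }
}

/// Keeps an N-byte array alive across one suspend point (N must be at least 1).
async fn with_payload<const N: usize>() -> u64 {
    let data = [1u8; N];
    YieldOnce(false).await;
    data[0] as u64 + data[N - 1] as u64
}

/// Runs the future `iters` times; returns the summed results and the elapsed time.
pub fn time_it<const N: usize>(iters: u32) -> (u64, Duration) {
    let start = Instant::now();
    let mut sum = 0;
    for _ in 0..iters {
        sum += block_on(with_payload::<N>());
    }
    (sum, start.elapsed())
}

fn main() {
    let iters = 10_000;
    println!("64 B:    {:?}", time_it::<64>(iters).1);
    println!("1 KiB:   {:?}", time_it::<1024>(iters).1);
    println!("10 KiB:  {:?}", time_it::<10240>(iters).1);
    println!("100 KiB: {:?}", time_it::<102400>(iters).1);
}
```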

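While tweaking this async function I noticed something subtle. To keep data alive across the await, I deliberately used it at the end so the compiler would not optimize it away: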

async fn huge_large_future() -> u64 {
    let data = [1u8; 102400]; // 10KB * 10
    let len = data.len();
    future::yield_now().await;
    (data[0] + data[len - 1]) as u64
}

In theory, if we change it as follows, len is computed before the await and data is never referenced afterward, so the generated Future should be very small:

async fn huge_large_future() -> u64 {
    let data = [1u8; 102400]; // 10KB * 10
    let len = data.len();
    future::yield_now().await;
    0
}

fn main() {
    let ex = Executor::new();
    let task = ex.spawn(huge_large_future());
    let res = future::block_on(ex.run(task));
    eprintln!("Result: {}", res);
}

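However, I found that data is still kept in the state machine, even though len is never used afterward. This comes down to how the compiler decides whether a variable lives across an await. Of course, if data is explicitly scoped so that its lifetime ends before the await, it is not stored in the state machine: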

async fn huge_large_future() -> u64 {
    {
        let data = [1u8; 102400]; // 10KB * 10
        let len = data.len();
    }
    future::yield_now().await;
    0
}
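The three variants can be compared side by side by printing the size of each future. This is a sketch using only the standard library: the function names are illustrative, std::future::ready replaces yield_now as the suspend point, and len is renamed to _len where it is unused to silence warnings. The size printed for the middle variant depends on the compiler version, so no particular value is claimed for it here.

```rust
use std::future::ready;
use std::mem::size_of_val;

/// `data` is read after the await, so it must be stored in the state machine.
async fn used_after_await() -> u64 {
    let data = [1u8; 102400];
    let len = data.len();
    ready(()).await;
    (data[0] + data[len - 1]) as u64
}

/// `data` is only borrowed before the await and never touched afterward.
async fn borrowed_before_await() -> u64 {
    let data = [1u8; 102400];
    let _len = data.len();
    ready(()).await;
    0
}

/// `data` goes out of scope before the await.
async fn scoped_before_await() -> u64 {
    {
        let data = [1u8; 102400];
        let _len = data.len();
    }
    ready(()).await;
    0
}

/// Sizes of the three futures, in the order the functions are declared.
pub fn sizes() -> [usize; 3] {
    [
        size_of_val(&used_after_await()),
        size_of_val(&borrowed_before_await()),
        size_of_val(&scoped_before_await()),
    ]
}

fn main() {
    let [used, borrowed, scoped] = sizes();
    println!("used after await:      {used} bytes");
    println!("borrowed before await: {borrowed} bytes");
    println!("scoped before await:   {scoped} bytes");
}
```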

How the compiler decides which variables to store

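I looked into the Rust compiler source and found that whether a variable lives across an await is decided by the locals_live_across_suspend_points function, whose documentation says: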

/// The basic idea is as follows:
/// - a local is live until we encounter a `StorageDead` statement. In
///   case none exist, the local is considered to be always live.
/// - a local has to be stored if it is either directly used after the
///   the suspend point, or if it is live and has been previously borrowed.

In our code, let len = data.len() borrows data, so data is kept in the state machine. Perhaps there is still room for optimization here? I will ask the community.

Conclusion

All the experiment code is available at the following link: async-executor-examples.

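In Rust async programming, small changes to the code can cause significant swings in performance. A solid understanding of how the state machine is generated helps you write more efficient async code. Next time you write an async fn, ask yourself: how big is this state machine?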
