January 4, 2026
Draft v0.2
Note: Mo[nad].
Functional programming’s compositional abstractions—map, filter, fold—simplify reasoning about data transformations but incur overhead that has historically relegated these languages to domains where performance is secondary. Stream fusion addresses this for pipelines (Coutts, Leshchinskiy, and Stewart 2007), but real programs involve graph traversals, dynamic programming, balanced trees, and state machines.
We present Mo, a strict functional language demonstrating that three mechanisms—stream fusion, linearity-driven optimization (Wadler 1990), and opaque linear handles—together enable pure functional code to compile to systems-competitive performance. Across five diverse benchmark algorithms, Mo’s implementations all use less memory than hand-written Zig (Kelley 2016).
The thesis is not that Mo has arrived as a production language, but that the path exists: pure functional code can achieve systems-level performance. What remains is engineering, not fundamental research.
A note on scope: Mo targets application developers implementing algorithms—graph traversal, dynamic programming, balanced trees—with competitive performance. Most developers consume data structures rather than implement them; Mo provides efficient built-in structures (trees, hash maps, caches) that cover common needs. It is not a systems language for library authors requiring custom pointer-based layouts.
Mo’s foundation is guaranteed stream fusion. The expression
fold (\acc x -> add acc x) 0 (map (\x -> mul x 2) xs)
compiles to a single loop with no intermediate array. Inspecting the generated ARM64 assembly reveals SIMD vectorization—the `map` and `fold` combinators disappear entirely, replaced by vector loads (`ldp q4, q5`), doubling via addition (`add.2d v4, v4, v4`), and accumulation (`add.2d v0, v4, v0`), processing 8 elements per iteration.
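For intuition, the effect of fusion can be sketched in Python (an illustration only; Mo performs this transformation at compile time and then vectorizes the resulting loop):

```python
def fused_sum_double(xs):
    """Fused equivalent of fold add 0 (map (*2) xs): one pass, no intermediate list."""
    acc = 0
    for x in xs:          # single loop: map and fold combined
        acc += x * 2      # doubling happens inline; no temporary array
    return acc

def unfused_sum_double(xs):
    """Unfused version: allocates the intermediate list that fusion eliminates."""
    doubled = [x * 2 for x in xs]  # intermediate allocation
    return sum(doubled)
```

Both compute the same result; the fused form simply never materializes the mapped array.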
We validated this with a DataFrame library benchmark: filtering 10 million rows, computing a derived column, and performing grouped aggregation. Single-threaded, Mo completes in 2.64s versus Polars’ (Vink 2024) 2.37s (1.11x ratio). Memory efficiency favors Mo by ~700MB (1.75GB vs 2.45GB) due to fusion eliminating intermediate buffers and immediate buffer release.
But pipelines are the easy case. The harder question: what happens when algorithms don’t fit the pipeline mold?
We selected a benchmark suite targeting computational patterns historically problematic for functional languages:
| Algorithm | Pattern | Why It’s Hard |
|---|---|---|
| Graph BFS | Visited set, queue | Non-linear data flow, stateful traversal |
| Edit Distance | 2D dynamic programming | Random access into memo table |
| Red-Black Tree | Balanced tree with rebalancing | Pointer-heavy, recursive structure |
| LRU Cache | HashMap + linked list | Two data structures in sync |
| Regex NFA (Thompson 1968) | State machine simulation | Set operations, backtracking |
Initial results were sobering. Edit distance used 752x more memory than Zig. Red-black tree was 75x—consistent with the known difficulties of purely functional data structures (Okasaki 1999). These algorithms expose two categories of problems that stream fusion alone cannot address:
Fusion-limited algorithms (BFS, Edit Distance, Regex NFA): These map well to fold-based iteration but allocate unnecessarily within loops. Each iteration creates new arrays that could be reused.
Representation-limited algorithms (RB-Tree, LRU Cache): These require O(log n) or O(1) data structures. Simulating a red-black tree with purely functional arrays requires copying on each update, giving O(n) insertion—wrong algorithmic complexity.
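The cost gap behind the representation-limited category can be made concrete with a small Python sketch (hypothetical helpers, not Mo's API): a purely functional array update must copy the whole array, while a compiler that has proven linearity can emit a plain store.

```python
def persistent_update(arr, i, val):
    """Purely functional update: copies the whole array, O(n) per write."""
    new = arr.copy()   # every update pays a full copy
    new[i] = val
    return new

def inplace_update(arr, i, val):
    """What a compiler can emit when the array is provably linear: O(1) store."""
    arr[i] = val
    return arr
```

Applied inside a tree insertion, the O(n) copy per write is what turns O(log n) insertion into O(n).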
For fusion-limited algorithms, we extended linearity analysis to recognize common stateful patterns (Kiselyov et al. 2017). These optimizations are fully automatic.
Swap-buffer detection (automatic). Edit distance uses two DP rows, swapping each iteration. The compiler recognizes this pattern and allocates two buffers upfront rather than per-iteration:
-- prev_row and curr_row are the two DP rows
fold (\(prev_row, curr_row) i ->
let new_row = compute_dp_row prev_row curr_row i in
(curr_row, new_row) -- Swap: current becomes previous
) (row0, row1) (range 0 num_rows)
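The full two-row pattern the compiler recognizes corresponds to the classic space-optimized Levenshtein algorithm, sketched here in Python: two rows allocated once, swapped each iteration.

```python
def edit_distance(s1, s2):
    """Levenshtein distance with two reusable rows (the swap-buffer pattern)."""
    prev = list(range(len(s2) + 1))   # DP row 0: distance from empty prefix
    curr = [0] * (len(s2) + 1)        # scratch row, allocated once
    for i, c1 in enumerate(s1, start=1):
        curr[0] = i
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev, curr = curr, prev                # swap: current becomes previous
    return prev[len(s2)]
```

In Mo the same swap is expressed by returning `(curr_row, new_row)` from the fold body; the compiler maps it onto two preallocated buffers exactly as above.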
Buffer hoisting (automatic). When arrays are allocated inside folds but don’t escape, the compiler hoists allocation outside the loop. The compiler also inlines small functions to expose hidden allocations to the hoisting pass:
-- Before: allocates per iteration
fold (\state i ->
let buf = collect (map (\_ -> 0) (range 0 n)) in
use buf
) init iterations
// Compiled output (Zig):
const buf = try allocator.alloc(i64, n); // Allocated once
while (i < iterations) : (i += 1) {
    @memset(buf, 0); // Clear each iteration
    use(buf);
}
The hoisting optimization requires that buffer sizes are known before the loop. Sizes computed inside the loop (e.g., `n = length str2`) cannot be hoisted—a fundamental limitation of static analysis. For such cases, the recommended pattern is to precompute sizes before the loop.
Prefix slice optimization (automatic). When only a prefix of an array is meaningful (tracked via a length variable), the compiler avoids copying by returning slices:
arr[0..len] -- Returns slice, not copy
These optimizations reduced Edit Distance from 752x to 0.97x memory—now using less than Zig.
For representation-limited algorithms, we introduced opaque linear handles: efficient mutable data structures with pure functional semantics.
-- O(log n) red-black tree
let tree = treemap_put tree key value in
let (tree, result) = treemap_get tree key in
-- O(1) LRU cache
let cache = lru_put cache key value in
let (cache, value, found) = lru_get cache key in
Handles are opaque (users cannot inspect internals) and linear (used exactly once, then returned) (Tov and Pucella 2011). The compiler enforces uniqueness: using a handle after passing it to an operation is a compile error. This ensures the runtime can safely mutate the underlying structure while presenting a pure interface.
We implemented HashMap (O(1)), TreeMap (O(log n)), Deque (O(1)), and LRUCache (O(1)). The red-black tree benchmark dropped from 75x memory to 0.90x—again, less than Zig.
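The pure interface the LRU handle presents can be modeled in Python with `collections.OrderedDict` (an analogy for exposition; Mo's runtime representation is its own). Each operation takes the handle and returns it, mirroring the linear threading in Mo:

```python
from collections import OrderedDict

def lru_put(cache, key, value, capacity=128):
    """O(1) put; evicts the least-recently-used entry when over capacity."""
    cache[key] = value
    cache.move_to_end(key)          # mark as most recently used
    if len(cache) > capacity:
        cache.popitem(last=False)   # evict the LRU entry
    return cache                    # handle is returned, as in Mo

def lru_get(cache, key):
    """O(1) get; returns (cache, value, found), matching Mo's lru_get shape."""
    if key in cache:
        cache.move_to_end(key)
        return cache, cache[key], True
    return cache, None, False
```

In Mo, the linearity check additionally rejects any use of the old handle after a call, which is what makes internal mutation safe.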
After implementing both optimizations, all five benchmarks perform competitively:
| Benchmark | Mo vs Zig (Time) | Mo vs Zig (Memory) |
|---|---|---|
| Graph BFS | ~1x | 0.82x |
| Edit Distance | ~1x | 0.97x |
| Red-Black Tree | ~1x | 0.90x |
| LRU Cache | ~1x | 0.92x |
| Regex NFA | ~1x | 0.80x |
All five use less memory than hand-written Zig, with comparable execution time. A Zig expert could match Mo’s memory profile by hoisting allocations into arena allocators or manually threading buffers through loops. But this would transform clean, readable code into complex state-threading machinery. Mo provides this optimization automatically—the developer writes simple functional code, and the compiler generates the complex imperative version.
Stream fusion is necessary but not sufficient for general-purpose performance. Three mechanisms together achieve generality:
Linearity inference. The compiler automatically proves when values are used exactly once, enabling safe in-place mutation—without user-written ownership annotations. Unlike Rust, where developers explicitly manage lifetimes and borrowing, Mo infers linearity from dataflow. Users write `update_inplace arr i val` and the compiler verifies the array is linear. When linearity cannot be proven, the compiler rejects the program with a clear error (“array used after mutation”) rather than silently falling back to copying. This avoids performance cliffs—code either compiles with guaranteed in-place updates or doesn’t compile at all.
Pattern recognition. Common stateful patterns (swap-buffer, buffer hoisting) are detected and optimized automatically. The compiler extends fusion beyond pipelines to stateful folds.
Opaque handles. When the right data structure matters algorithmically (O(log n) trees, O(1) caches), handles provide efficient implementations without breaking purity.
Users write clean functional code:
let result = fold (\state i ->
let arr = fst state in
let total = snd state in
let new_arr = update_inplace arr i (compute i) in
(new_arr, add total 1)
) (init_arr, 0) (range 0 n) in
The compiler generates efficient imperative code with proper data structures and in-place mutation.
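The generated code for the fold above is, in spirit, the following imperative loop (a Python sketch; `compute` stands in for whatever per-element work the source program performs):

```python
def compute(i):
    # Hypothetical per-element workload standing in for the source program's compute.
    return i * i

def run_fold(init_arr, n):
    """Imperative equivalent of the fold: in-place writes plus a scalar counter."""
    arr = init_arr            # linearity lets the compiler reuse this buffer
    total = 0
    for i in range(n):
        arr[i] = compute(i)   # update_inplace becomes a plain store
        total += 1
    return arr, total
```

No tuple of intermediate arrays is ever constructed; the state pair collapses into a mutable buffer and a register.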
Mo’s approach—a functional language that compiles to efficient imperative code—invites comparison with Scala’s (Odersky et al. 2004) relationship to Java. Scala brought functional programming to the JVM, proving that developers could write in a higher-level paradigm while leveraging Java’s mature runtime and ecosystem.
But Scala inherited Java’s constraints: garbage collection, object overhead, and the JVM’s memory model. Fusion in Scala is limited by the need to interoperate with Java collections. The result is a functional language that’s more expressive than Java but rarely faster.
Mo takes a different path. Rather than layering functional abstractions over an existing imperative runtime, Mo’s compiler generates the imperative code. There’s no pre-existing runtime to constrain optimization. Stream fusion can eliminate allocations entirely because the compiler controls code generation end-to-end. Opaque handles can use mutation internally because the compiler enforces linearity.
This is closer to what MLton achieved for Standard ML: whole-program optimization with no runtime overhead (Weeks 2006). The difference is that Mo targets a systems language (Zig) rather than native code directly, inheriting Zig’s memory layout, C interoperability, and toolchain.
The comparison to Rust is instructive (Matsakis and Klock 2014). Rust achieves memory safety and performance through explicit ownership—developers annotate lifetimes and fight the borrow checker. The result is safe, fast code, but written in an imperative style with pervasive mutation. Mo achieves similar goals through inferred linearity—developers write pure functional code, and the compiler proves where mutation is safe. The tradeoff: Rust offers more control (users can define arbitrary pointer structures), while Mo offers more abstraction (users write declarative transformations).
There is also an ecosystem tradeoff. Scala developers can use any Java library. Mo developers cannot (yet) use arbitrary Zig or C libraries without FFI work. But for compute-intensive workloads where performance matters, Mo demonstrates that the functional-over-imperative approach can outperform hand-written systems code.
The benchmarks demonstrate the path exists. What remains is implementation work, not fundamental research:
User-defined handles. Currently, opaque handles are built into the compiler. Extending the handle system to user-defined structures is straightforward—the linearity enforcement already exists. Users would declare a type as linear, and the compiler would apply the same uniqueness checking it uses for built-in handles.
Explicit regions. Memory management relies on region-based reuse within loops; top-level handles currently rely on process exit for reclamation. Adding explicit regions for long-running applications follows naturally from the existing region-based reuse—the mechanism exists; it need only be exposed at module boundaries.
Constant propagation for hoisting. Buffer hoisting currently requires sizes to be literal constants or variables defined before the loop. Recognizing that `length (gen_string len ...)` equals `len` would enable hoisting in more cases. This requires interprocedural constant propagation.
FFI support. Mo compiles to Zig, which has seamless C interop. Exposing this to Mo programs requires syntax and type marshaling, not architectural changes.
Tooling. Compilation time is 3–6 seconds (Mo transpilation <1 second, Zig with `-O ReleaseFast` 2–5 seconds). Incremental compilation and source-level debugging would improve the development experience.
The thesis motivating Mo was: can pure functional code compile to systems-level performance without user annotations? For the algorithms we tested, the answer is yes. Three mechanisms—stream fusion, linearity inference, and opaque handles—together close the gap between functional semantics and imperative efficiency.
Our benchmark suite—graph traversal, dynamic programming, balanced trees, caches, and state machines—achieves memory usage below hand-written Zig across all cases. Users write pure, compositional code. The compiler generates efficient imperative code with proper data structures and in-place mutation.
Stream fusion started as a technique for eliminating intermediate arrays in pipelines. Extended with linearity tracking and opaque handles, it becomes a foundation for functional programming with systems-level performance. The path exists; what remains is engineering.
Mo will be open source. Benchmark code and implementations will be available at the project repository upon release.
This document was written with assistance from Claude (Anthropic).