Mo: Pure Functional Code, Imperative Performance

No Way Labs¹

January 4, 2026
Draft v0.2

Note: Mo[nad].

Functional programming’s compositional abstractions—map, filter, fold—simplify reasoning about data transformations but incur overhead that has historically relegated these languages to domains where performance is secondary. Stream fusion addresses this for pipelines (Coutts, Leshchinskiy, and Stewart 2007), but real programs involve graph traversals, dynamic programming, balanced trees, and state machines.

We present Mo, a strict functional language demonstrating that three mechanisms—stream fusion, linearity-driven optimization (Wadler 1990), and opaque linear handles—together enable pure functional code to compile to systems-competitive performance. Across five diverse algorithms, Mo matches hand-written Zig (Kelley 2016) in execution time and uses less memory in every case.

The thesis is not that Mo has arrived as a production language, but that the path exists: pure functional code can achieve systems-level performance. What remains is engineering, not fundamental research.

A note on scope: Mo targets application developers who implement algorithms—graph traversal, dynamic programming, balanced trees—and need competitive performance. Most developers consume data structures rather than implement them; Mo provides efficient built-in structures (trees, hash maps, caches) that cover common needs. It is not a systems language for library authors who require custom pointer-based layouts.

The Original Promise: Pipeline Fusion

Mo’s foundation is guaranteed stream fusion. The expression fold (\acc x -> add acc x) 0 (map (\x -> mul x 2) xs) compiles to a single loop with no intermediate array. Inspecting generated ARM64 assembly reveals SIMD vectorization—the map and fold combinators disappear entirely, replaced by vector load (ldp q4, q5), doubling via addition (add.2d v4, v4, v4), and accumulation (add.2d v0, v4, v0) processing 8 elements per iteration.
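Conceptually, fusion rewrites the pipeline into the single fold a programmer would write by hand; no intermediate array is ever materialized:

-- What the programmer writes:
fold (\acc x -> add acc x) 0 (map (\x -> mul x 2) xs)

-- What fusion produces, conceptually: one traversal, no intermediate array
fold (\acc x -> add acc (mul x 2)) 0 xs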

We validated this with a DataFrame library benchmark: filtering 10 million rows, computing a derived column, and performing grouped aggregation. Single-threaded, Mo completes in 2.64s versus Polars’ (Vink 2024) 2.37s (1.11x ratio). Memory efficiency favors Mo by ~700MB (1.75GB vs 2.45GB) due to fusion eliminating intermediate buffers and immediate buffer release.
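For reference, a minimal sketch of the benchmark's shape in Mo. The combinator names (filter, group_sum) and column accessors are hypothetical stand-ins for illustration, not Mo's actual DataFrame API:

-- Hypothetical combinators, for illustration only
let kept    = filter (\row -> gt (amount row) threshold) rows in
let derived = map (\row -> mul (price row) (qty row)) kept in
let totals  = group_sum (\row -> key row) derived in
totals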

But pipelines are the easy case. The harder question: what happens when algorithms don’t fit the pipeline mold?

The Generalization Challenge

We selected a benchmark suite targeting computational patterns historically problematic for functional languages:

Algorithm                   Pattern                          Why It’s Hard
Graph BFS                   Visited set, queue               Non-linear data flow, stateful traversal
Edit Distance               2D dynamic programming           Random access into memo table
Red-Black Tree              Balanced tree with rebalancing   Pointer-heavy, recursive structure
LRU Cache                   HashMap + linked list            Two data structures in sync
Regex NFA (Thompson 1968)   State machine simulation         Set operations, backtracking

Initial results were sobering. Edit distance used 752x more memory than Zig; the red-black tree used 75x—consistent with the known difficulties of purely functional data structures (Okasaki 1999). These algorithms expose two categories of problems that stream fusion alone cannot address:

Fusion-limited algorithms (BFS, Edit Distance, Regex NFA): These map well to fold-based iteration but allocate unnecessarily within loops. Each iteration creates new arrays that could be reused.

Representation-limited algorithms (RB-Tree, LRU Cache): These require O(log n) or O(1) data structures. Simulating a red-black tree with purely functional arrays requires copying on each update, giving O(n) insertion—wrong algorithmic complexity.
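For instance, updating one node of an array-backed tree under pure semantics must leave the old array observable, forcing a full copy. A minimal sketch, where update and lookup are hypothetical pure combinators:

-- Pure semantics: update returns a new array; the old array stays valid,
-- so the only safe implementation (absent linearity) is a full O(n) copy.
let nodes2 = update nodes idx rebalanced_node in
lookup nodes idx  -- old version still observable: the copy was unavoidable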

Solution 1: Stateful Pattern Fusion

For fusion-limited algorithms, we extended linearity analysis to recognize common stateful patterns (Kiselyov et al. 2017). These optimizations are fully automatic.

Swap-buffer detection (automatic). Edit distance uses two DP rows, swapping each iteration. The compiler recognizes this pattern and allocates two buffers upfront rather than per-iteration:

-- prev_row and curr_row are the two DP rows
fold (\(prev_row, curr_row) i ->
  let new_row = compute_dp_row prev_row curr_row i in
  (curr_row, new_row)  -- Swap: current becomes previous
) (row0, row1) (range 0 num_rows)

Buffer hoisting (automatic). When arrays are allocated inside folds but don’t escape, the compiler hoists allocation outside the loop. The compiler also inlines small functions to expose hidden allocations to the hoisting pass:

-- Before: allocates per iteration
fold (\state i ->
  let buf = collect (map (\_ -> 0) (range 0 n)) in
  use buf
) init iterations

-- Compiled output (Zig, simplified):
const buf = try allocator.alloc(i64, n);  // Allocated once, outside the loop
defer allocator.free(buf);
while (i < iterations) : (i += 1) {
    @memset(buf, 0);  // Cleared each iteration instead of reallocated
    use(buf);
}

The hoisting optimization requires that buffer sizes are known before the loop. Sizes computed inside the loop (e.g., n = length str2) cannot be hoisted—a fundamental limitation of static analysis. For such cases, the recommended pattern is to precompute sizes before the loop.
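The workaround, sketched below using the same combinators as the example above, is to bind the size before the fold so the hoisting pass can see it:

-- Hoistable: n is bound before the loop, so buf's size is loop-invariant
let n = length str2 in
fold (\state i ->
  let buf = collect (map (\_ -> 0) (range 0 n)) in
  use buf
) init iterations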

Prefix slice optimization (automatic). When only a prefix of an array is meaningful (tracked via a length variable), the compiler avoids copying by returning slices:

arr[0..len]  -- Returns slice, not copy
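A minimal sketch of where this applies, assuming Mo has if/then/else and a comparison primitive gt: matches are written into a preallocated buffer, and only the filled prefix is returned.

-- len tracks how much of out is meaningful; the final slice copies nothing
let (out, len) = fold (\(out, len) x ->
  if gt x 0
    then (update_inplace out len x, add len 1)
    else (out, len)
) (buf0, 0) xs in
out[0..len]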

These optimizations reduced Edit Distance from 752x to 0.97x memory—now using less than Zig.

Solution 2: Opaque Linear Handles

For representation-limited algorithms, we introduced opaque linear handles: efficient mutable data structures with pure functional semantics.

-- O(log n) red-black tree
let tree = treemap_put tree key value in
let (tree, result) = treemap_get tree key in

-- O(1) LRU cache
let cache = lru_put cache key value in
let (cache, value, found) = lru_get cache key in

Handles are opaque (users cannot inspect internals) and linear (used exactly once, then returned) (Tov and Pucella 2011). The compiler enforces uniqueness: using a handle after passing it to an operation is a compile error. This ensures the runtime can safely mutate the underlying structure while presenting a pure interface.
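For example, reusing a consumed handle fails to compile (the error wording below is illustrative):

-- Rejected: 'tree' is consumed by the first treemap_put
let tree1 = treemap_put tree k1 v1 in
let tree2 = treemap_put tree k2 v2 in
-- error: handle 'tree' used after being consumed

-- Accepted: the handle is threaded linearly through both operations
let tree1 = treemap_put tree  k1 v1 in
let tree2 = treemap_put tree1 k2 v2 in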

We implemented HashMap (O(1)), TreeMap (O(log n)), Deque (O(1)), and LRUCache (O(1)). The red-black tree benchmark dropped from 75x memory to 0.90x—again, less than Zig.

Results

After implementing both optimizations, all five benchmarks perform competitively:

Benchmark results. Time ratios ~1x indicate comparable performance; memory ratios <1.0 indicate Mo uses less.

Benchmark        Mo vs Zig (Time)   Mo vs Zig (Memory)
Graph BFS        ~1x                0.82x
Edit Distance    ~1x                0.97x
Red-Black Tree   ~1x                0.90x
LRU Cache        ~1x                0.92x
Regex NFA        ~1x                0.80x

All five use less memory than hand-written Zig, with comparable execution time. A Zig expert could match Mo’s memory profile by hoisting allocations into arena allocators or manually threading buffers through loops. But this would transform clean, readable code into complex state-threading machinery. Mo provides this optimization automatically—the developer writes simple functional code, and the compiler generates the complex imperative version.

The Key Insight

Stream fusion is necessary but not sufficient for general-purpose performance. Three mechanisms together achieve generality:

  1. Linearity inference. The compiler automatically proves when values are used exactly once, enabling safe in-place mutation—without user-written ownership annotations. Unlike Rust, where developers explicitly manage lifetimes and borrowing, Mo infers linearity from dataflow. Users write update_inplace arr i val and the compiler verifies the array is linear. When linearity cannot be proven, the compiler rejects the program with a clear error (“array used after mutation”) rather than silently falling back to copying (see the sketch after this list). This avoids performance cliffs—code either compiles with guaranteed in-place updates or doesn’t compile at all.

  2. Pattern recognition. Common stateful patterns (swap-buffer, buffer hoisting) are detected and optimized automatically. The compiler extends fusion beyond pipelines to stateful folds.

  3. Opaque handles. When the right data structure matters algorithmically (O(log n) trees, O(1) caches), handles provide efficient implementations without breaking purity.
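The sketch below illustrates item 1: the second use of arr is rejected because the in-place update consumed it (index is a hypothetical read primitive; the error wording is illustrative).

-- Rejected: arr was consumed by update_inplace
let arr2 = update_inplace arr i val in
let x = index arr 0 in   -- error: array used after mutation

-- Accepted: read from arr2, the live version returned by the update
let arr2 = update_inplace arr i val in
let x = index arr2 0 in
(arr2, x)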

Users write clean functional code:

let result = fold (\state i ->
  let arr = fst state in
  let total = snd state in
  let new_arr = update_inplace arr i (compute i) in
  (new_arr, add total 1)
) (init_arr, 0) (range 0 n) in

The compiler generates efficient imperative code with proper data structures and in-place mutation.

Mo’s approach—a functional language that compiles to efficient imperative code—invites comparison with Scala’s (Odersky et al. 2004) relationship to Java. Scala brought functional programming to the JVM, proving that developers could write in a higher-level paradigm while leveraging Java’s mature runtime and ecosystem.

But Scala inherited Java’s constraints: garbage collection, object overhead, and the JVM’s memory model. Fusion in Scala is limited by the need to interoperate with Java collections. The result is a functional language that’s more expressive than Java but rarely faster.

Mo takes a different path. Rather than layering functional abstractions over an existing imperative runtime, Mo’s compiler generates the imperative code. There’s no pre-existing runtime to constrain optimization. Stream fusion can eliminate allocations entirely because the compiler controls code generation end-to-end. Opaque handles can use mutation internally because the compiler enforces linearity.

This is closer to what MLton achieved for Standard ML: whole-program optimization that strips away abstraction overhead (Weeks 2006). The difference is that Mo targets a systems language (Zig) rather than native code directly, inheriting Zig’s memory layout, C interoperability, and toolchain.

The comparison to Rust is instructive (Matsakis and Klock 2014). Rust achieves memory safety and performance through explicit ownership—developers annotate lifetimes and fight the borrow checker. The result is safe, fast code, but written in an imperative style with pervasive mutation. Mo achieves similar goals through inferred linearity—developers write pure functional code, and the compiler proves where mutation is safe. The tradeoff: Rust offers more control (users can define arbitrary pointer structures), while Mo offers more abstraction (users write declarative transformations).

There is also an ecosystem tradeoff. Scala developers can use any Java library. Mo developers cannot (yet) use arbitrary Zig or C libraries without FFI work. But for compute-intensive workloads where performance matters, Mo demonstrates that the functional-over-imperative approach can outperform hand-written systems code.

What Remains to Build

The benchmarks demonstrate the path exists. What remains is implementation work, not fundamental research:

User-defined handles. Currently, opaque handles are built into the compiler. Extending the handle system to user-defined structures is straightforward—the linearity enforcement already exists. Users would declare a type as linear, and the compiler would apply the same uniqueness checking it uses for built-in handles.
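One possible surface syntax, purely illustrative since Mo does not yet support user-defined handles:

-- Hypothetical future syntax: a user-declared linear handle
linear type Ring = { buf : Array Int, head : Int, tail : Int }

-- Operations would consume the handle and return it, like built-in handles:
-- ring_push : Ring -> Int -> Ring
-- ring_pop  : Ring -> (Ring, Int)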

Explicit regions. Memory management relies on region-based reuse within loops; top-level handles currently rely on process exit for reclamation. Adding explicit regions for long-running applications follows naturally from the existing region-based reuse—the mechanism exists; it need only be exposed at module boundaries.

Constant propagation for hoisting. Buffer hoisting currently requires sizes to be literal constants or variables defined before the loop. Recognizing that length (gen_string len ...) equals len would enable hoisting in more cases. This requires interprocedural constant propagation.

FFI support. Mo compiles to Zig, which has seamless C interop. Exposing this to Mo programs requires syntax and type marshaling, not architectural changes.

Tooling. Compilation time is 3–6 seconds (Mo transpilation <1 second, Zig with -O ReleaseFast 2–5 seconds). Incremental compilation and source-level debugging would improve the development experience.

Conclusion

The thesis motivating Mo was: can pure functional code compile to systems-level performance without user annotations? For the algorithms we tested, the answer is yes. Three mechanisms—stream fusion, linearity inference, and opaque handles—together close the gap between functional semantics and imperative efficiency.

Our benchmark suite—graph traversal, dynamic programming, balanced trees, caches, and state machines—achieves memory usage below hand-written Zig across all cases. Users write pure, compositional code. The compiler generates efficient imperative code with proper data structures and in-place mutation.

Stream fusion started as a technique for eliminating intermediate arrays in pipelines. Extended with linearity tracking and opaque handles, it becomes a foundation for functional programming with systems-level performance. The path exists; what remains is engineering.


Mo will be open source. Benchmark code and implementations will be available at the project repository upon release.

Coutts, Duncan, Roman Leshchinskiy, and Don Stewart. 2007. “Stream Fusion: From Lists to Streams to Nothing at All.” In Proceedings of the 12th ACM SIGPLAN International Conference on Functional Programming (ICFP), 315–26. ACM. https://doi.org/10.1145/1291151.1291199.
Kelley, Andrew. 2016. “The Zig Programming Language.” https://ziglang.org.
Kiselyov, Oleg, Aggelos Biboudis, Nick Palladinos, and Yannis Smaragdakis. 2017. “Stream Fusion, to Completeness.” In Proceedings of the 44th ACM SIGPLAN Symposium on Principles of Programming Languages (POPL), 285–99. ACM. https://doi.org/10.1145/3009837.3009880.
Matsakis, Nicholas D., and Felix S. Klock. 2014. “The Rust Language.” In Proceedings of the 2014 ACM SIGAda Annual Conference on High Integrity Language Technology (HILT), 103–4. ACM. https://doi.org/10.1145/2663171.2663188.
Odersky, Martin, Philippe Altherr, Vincent Cremet, Burak Emir, Sebastian Maneth, Stéphane Micheloud, Nikolay Mihaylov, Michel Schinz, Erik Stenman, and Matthias Zenger. 2004. “An Overview of the Scala Programming Language.” In ECOOP Workshop on Multiparadigm Programming with Object-Oriented Languages.
Okasaki, Chris. 1999. Purely Functional Data Structures. Cambridge University Press.
Thompson, Ken. 1968. “Programming Techniques: Regular Expression Search Algorithm.” Communications of the ACM 11 (6): 419–22. https://doi.org/10.1145/363347.363387.
Tov, Jesse A., and Riccardo Pucella. 2011. “Practical Affine Types.” In Proceedings of the 38th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL), 447–58. ACM. https://doi.org/10.1145/1926385.1926436.
Vink, Ritchie. 2024. “Polars: Blazingly Fast DataFrames in Rust and Python.” https://pola.rs.
Wadler, Philip. 1990. “Linear Types Can Change the World!” In Programming Concepts and Methods, edited by M. Broy and C. Jones, 561–81. North-Holland.
Weeks, Stephen. 2006. “Whole-Program Compilation in MLton.” In Proceedings of the 2006 Workshop on ML, 1–1. ACM. https://doi.org/10.1145/1159876.1159877.

  1. This document was written with assistance from Claude (Anthropic).