A Linux Foundation Project
Project 05 · Open Inference Accelerator · OpenAIU

Open Inference Accelerator
Project Charter

An open source AI inference SoC — 32 AI accelerator cores paired with a dedicated cluster of small POWER ISA control cores, dual PCIe Gen5 and OpenCAPI host interfaces, and card-to-card chaining. Single-slot, 75W. Sixteen cards per chassis where a GPU system fits eight. Entire SoC is POWER ISA — no RISC-V, no hybrid ISA, fully auditable from silicon to firmware.

Status: Forming — seeking Founding Members
Phase: Pre-charter / Architecture
Process Target: 5nm (tape-out) / 7nm (fallback)
Form Factor: Single-slot PCIe · 75W TDP · OpenCAPI capable
License: Apache 2.0

Executive Summary

Current AI inference deployments are built around large, hot, expensive GPU cards — designed for training workloads and repurposed for inference. A standard AI inference server holds eight double-wide GPU cards drawing 350–700W each. The chassis runs at 3–5 kilowatts. The hardware is proprietary, the firmware is unauditable, and the economics are controlled by a single vendor.

OpenAIU is designed from first principles as an inference accelerator. Single-slot. 75W. A standard server chassis that today holds eight double-wide GPU cards holds sixteen OpenAIU cards. The same rack space, at roughly one-quarter the power draw per inference node, running open source silicon with auditable firmware from boot to inference kernel.

The SoC uses two tiers of POWER ISA cores: a dedicated POWER Control Cluster of four Microwatt-class in-order cores handling on-card firmware, host bridge management, NOC orchestration, and power/thermal control — and a 32-core AI compute array connected to those control cores via the Auxiliary Execution Unit (AXU) interface. High-bandwidth memory access is delivered through an open OMI buffer. The entire SoC — control plane and compute — runs POWER ISA. No RISC-V, no hybrid ISA, no proprietary management microcontroller.

The project will use agentic EDA AI tools — Cadence ChipStack AI and Synopsys DSO.ai — to compress what would traditionally be a 5-year design cycle into a 36-month delivery.

Value Proposition

Density: 2× cards per chassis

Single-slot form factor at 75W fits 16 cards where a GPU-based inference server holds 8 double-wide cards. Same chassis, double the inference nodes, each independently managed.

Power: ~¼ per inference node

At 75W vs 350–700W per GPU card, OpenAIU targets dramatically lower power per inference unit. Designed for sustained inference workloads — not peak GPU training bursts.

Fully Auditable Stack

AI core RTL, control firmware, and memory interface are Apache 2.0 open source. Every numerical operation is specifiable and verifiable — critical for regulated AI deployment under the EU AI Act and OCC model risk guidance.

No Licensing Fees

Open RTL, open firmware, open toolchain. Any OPF member can manufacture, deploy, or extend the design. No per-card royalties, no driver licensing, no ecosystem lock-in.

Performance targets are architectural goals based on design parameters. Validated benchmarks will be published following FPGA bring-up at milestone M6.

SoC Architecture

OpenAIU SoC — Block Diagram (POWER ISA Throughout)

  • AI Compute Array (8×4 grid): 32 open AI cores · 2D systolic arrays (matrix multiply) + 1D vector units · INT4/INT8/FP16/BF16 · bidirectional ring NOC interconnect · AXU-dispatched from the POWER control cluster
  • POWER Control Cluster: 4× Microwatt cores (POWER ISA) · in-order · on-card firmware · PCIe host bridge · NOC task scheduler · power & thermal management · AXU dispatch to the AI array · secure boot (OpenFSP chain)
  • On-Chip Memory: 64 MB SRAM + L3 · distributed SRAM tiles across the AI array · shared L3 for the POWER control cluster + AI cores · weight caching
  • Memory Interface (project gap): open OMI buffer · OpenCAPI OMI subset · >200 GB/s · LPDDR5X memory stacks · open buffer chip (ESI sub-project)
  • Host I/O: PCIe Gen5 ×16 / OpenCAPI · negotiated at boot · PCIe for x86/ARM hosts · OpenCAPI for POWER hosts · card-to-card chaining · managed by the POWER control cluster

The AI Core Array

Thirty-two AI accelerator cores arranged in an 8×4 grid, connected by a bidirectional ring bus. Each core contains:

  • A 2D systolic array for matrix multiplication — the dominant operation in transformer inference (attention, feed-forward layers). Systolic arrays pipeline data through a grid of multiply-accumulate units, passing partial results directly between neighboring units rather than through memory, so intermediate steps consume no memory bandwidth.
  • A 1D vector unit for activation functions, softmax, layer normalization, and element-wise operations that do not fit the systolic pattern.
  • Local SRAM tile for weight caching — keeping frequently accessed model weights on-chip to avoid off-chip bandwidth pressure.
  • Low-precision arithmetic support: INT4, INT8, FP16, and BF16. Quantized INT4/INT8 inference delivers 4–8× throughput improvement over FP32 with acceptable accuracy loss for most production models (see the quantization sketch below).
Core layout: 4× Microwatt POWER cores (control plane) and 32 AI cores (compute array), arranged as an 8-column grid — row 0: POWER control cluster · rows 1–4: AI compute array.
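
The throughput claim for quantized inference comes from doing the multiply-accumulate work in narrow integers and converting back to floating point only at the end. A minimal NumPy sketch of that flow is below; it is illustrative only, and the symmetric per-tensor scaling and rounding choice are assumptions for the example, not the OpenAIU datapath.

```python
# Minimal NumPy sketch of symmetric per-tensor INT8 quantized matmul.
# Illustrative only -- not the OpenAIU datapath; the scale choice and
# rounding mode are assumptions for the example.
import numpy as np

def quantize_int8(x):
    """Map an FP32 tensor onto int8 with a single symmetric scale."""
    scale = np.max(np.abs(x)) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def int8_matmul(a_fp32, b_fp32):
    """Quantize both operands, multiply in int32, dequantize the result."""
    qa, sa = quantize_int8(a_fp32)
    qb, sb = quantize_int8(b_fp32)
    acc = qa.astype(np.int32) @ qb.astype(np.int32)   # wide accumulator
    return acc.astype(np.float32) * (sa * sb)          # back to FP32

rng = np.random.default_rng(0)
a, b = rng.standard_normal((64, 64)), rng.standard_normal((64, 64))
err = np.abs(int8_matmul(a, b) - a @ b).max()
print(f"max abs error vs FP32 reference: {err:.4f}")
```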

The AXU Interface: Tight-Coupling to POWER ISA

What is the AXU (Auxiliary Execution Unit) Interface?

The POWER control cluster exposes an Auxiliary Execution Unit (AXU) interface — a standardized connection point defined in the POWER ISA that allows custom functional units to sit directly adjacent to the POWER execution pipeline. AXU operations are dispatched as POWER ISA instructions and share register state with the control core's integer and floating-point register files. This means the AI cores in OpenAIU are not a separate "device" communicating over PCIe or even an AXI bus — they are execution units from the CPU's perspective, with register-speed data transfer and zero DMA overhead for small tensors. The POWER ISA Matrix Multiply Assist (MMA) instructions are specifically designed to dispatch to AXU-class functional units of exactly this type.
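
For concreteness, the sketch below models an MMA-style outer-product-accumulate operation the way an AXU functional unit would see it: operands arrive from the control core's vector registers and accumulate into a small tile, with no descriptor or DMA setup. The 4-wide FP32 vectors and 4×4 accumulator follow the POWER MMA outer-product style; the Python model itself is a behavioral simplification, not the OpenAIU RTL.

```python
# Behavioral sketch of an MMA-style outer-product-accumulate ("ger")
# dispatched to an AXU functional unit. Operands come from the control
# core's vector registers, not from a DMA buffer -- that is the point
# of the tight AXU coupling. Shapes follow the POWER MMA style; the
# dispatch model is a simplification, not the OpenAIU RTL.
import numpy as np

class AxuMatrixUnit:
    def __init__(self):
        self.acc = np.zeros((4, 4), dtype=np.float32)  # accumulator tile

    def xvf32ger_pp(self, vsr_x, vsr_y):
        """Rank-1 update: acc += outer(x, y), both read from registers."""
        self.acc += np.outer(vsr_x, vsr_y)

# "Registers" on the control core -- 4-wide FP32 vectors.
x = np.array([1.0, 2.0, 3.0, 4.0], dtype=np.float32)
y = np.array([0.5, 0.5, 0.5, 0.5], dtype=np.float32)

unit = AxuMatrixUnit()
unit.xvf32ger_pp(x, y)      # one "instruction": no descriptor, no DMA
print(unit.acc)
```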

The POWER Control Cluster — Why Not RISC-V?

Modern AI accelerator SoCs commonly use a two-tier architecture: small management cores handle PCIe enumeration, DDR initialization, and NOC fabric orchestration, while the compute array handles the AI workload. It is the correct structural pattern — separating control plane concerns from compute plane concerns allows each tier to be optimized independently.

The typical industry choice for those management cores is RISC-V. OpenAIU makes a different choice: the control cores are Microwatt-class POWER ISA cores. This has three compounding advantages:

  • Single ISA throughout: The entire SoC — control plane and compute plane — is POWER ISA. The firmware running on the control cluster is compiled with the same POWER toolchain as the software stack on the host server. No ISA context switch, no separate toolchain, no RISC-V firmware binary sitting in an otherwise POWER ecosystem.
  • Microwatt is silicon-proven: Microwatt has already been taped out as a real ASIC using the OpenROAD open source EDA flow. It runs Linux on FPGA today. The control cluster is not a new core design — it is a known-good, open RTL block with production tapeout heritage.
  • Auditability is complete: The AXU interface, the Microwatt control cores, the AI compute array RTL, and the on-card firmware are all POWER ISA and all Apache 2.0 open source. RISC-V management cores used elsewhere may be open ISA, but the firmware running on them is typically proprietary. OpenAIU's control cluster firmware is part of the fully auditable stack.

The four Microwatt control cores are small, in-order, and clocked conservatively — their job is orchestration, not compute. They consume a fraction of the die area and power budget. The AI compute array does the heavy lifting; the POWER control cluster makes the card a first-class, autonomously bootable, fully managed PCIe device.

Host Interface: Dual PCIe Gen5 and OpenCAPI

OpenAIU supports two host interfaces — and this is the architectural feature that makes it viable across the broadest possible deployment base:

  • PCIe Gen5 ×16: Standard compatibility with any x86, ARM, or POWER server. Plug into any OCP NIC 3.0 or standard PCIe slot. Works in existing data center infrastructure today.
  • OpenCAPI: Direct cache-coherent attachment to POWER hosts. OpenCAPI delivers significantly higher bandwidth and lower latency than PCIe, with coherent memory access between host and accelerator — eliminating copy overhead for large model weight transfers. On POWER-based ESI reference platform servers, OpenCAPI is the primary interface.

The two interfaces are not exclusive — the card negotiates at boot time based on what the host presents. The same hardware ships to an x86 hyperscaler over PCIe and to a POWER-based regulated industry deployment over OpenCAPI, with the on-card POWER control cluster managing interface initialization in both cases.
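
A sketch of that boot-time policy is below. The function and field names are invented for illustration; the real logic runs in firmware on the Microwatt control cluster.

```python
# Hypothetical sketch of the boot-time host interface selection policy
# described above. Function and field names are invented for
# illustration; the real firmware runs on the Microwatt control cluster.
from dataclasses import dataclass

@dataclass
class HostPresence:
    opencapi_advertised: bool   # POWER host offering OpenCAPI attach
    pcie_link_detected: bool    # standard PCIe Gen5 link training seen

def select_host_interface(host: HostPresence) -> str:
    if host.opencapi_advertised:
        return "opencapi"        # coherent attach, primary on POWER hosts
    if host.pcie_link_detected:
        return "pcie-gen5-x16"   # fallback for x86/ARM hosts
    raise RuntimeError("no host interface detected")

print(select_host_interface(HostPresence(True, True)))    # -> opencapi
print(select_host_interface(HostPresence(False, True)))   # -> pcie-gen5-x16
```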

Card-to-Card Chaining and Inference Scale-Out

OpenCAPI's topology supports direct card-to-card communication without routing through the host CPU. Multiple OpenAIU cards can be chained into an inference fabric — sharing the KV cache for long-context inference, pipelining model layers across cards, or distributing batch workloads across the full card pool.
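
One way to read "pipelining model layers across cards" is sketched below: the layer list is partitioned into contiguous stages, one stage per chained card, and activations hop the card-to-card link between stages. Names are hypothetical and the loop is sequential for clarity; a production runtime would overlap micro-batches across cards.

```python
# Toy sketch of pipelining model layers across chained OpenAIU cards.
# Names are hypothetical; the execution loop is sequential for clarity.
import numpy as np

def partition_layers(layers, n_cards):
    """Split the layer list into n_cards contiguous stages."""
    splits = np.array_split(np.arange(len(layers)), n_cards)
    return [[layers[i] for i in idx] for idx in splits]

def run_pipeline(stages, x):
    """Run each stage in order; activations hop the card-to-card link."""
    for stage in stages:
        for layer in stage:
            x = layer(x)
        # here the activation tensor would traverse OpenCAPI to the next card
    return x

# 8 dummy "layers", 4 chained cards, 2 layers per card.
layers = [lambda v, w=np.float32(1.1): v * w for _ in range(8)]
stages = partition_layers(layers, n_cards=4)
print(run_pipeline(stages, np.ones(4, dtype=np.float32)))
```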

Chassis Density Comparison

  • Typical GPU inference server: 8 double-slot GPU cards · 350–700W each · 2,800–5,600W total · proprietary
  • OpenAIU inference server: 16 single-slot cards · 75W each · 1,200W total · OpenCAPI chained · open source
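
The totals above follow directly from card counts and per-card power; the few lines below make the arithmetic explicit, using only the figures quoted on this page.

```python
# Arithmetic behind the chassis comparison above (numbers from this page).
gpu_cards, gpu_watts = 8, (350, 700)        # double-slot GPU cards
aiu_cards, aiu_watts = 16, 75               # single-slot OpenAIU cards

gpu_total = tuple(gpu_cards * w for w in gpu_watts)   # (2800, 5600) W
aiu_total = aiu_cards * aiu_watts                      # 1200 W

print(f"GPU chassis: {gpu_total[0]}-{gpu_total[1]} W for {gpu_cards} nodes")
print(f"AIU chassis: {aiu_total} W for {aiu_cards} nodes")
print(f"W per inference node: GPU {gpu_watts[0]}-{gpu_watts[1]}, AIU {aiu_watts}")
```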

The Open OMI Challenge

The Most Significant Technical Gap in the Project

The Open Memory Interface (OMI) is a low-latency, high-bandwidth serial memory interface — a subset of the OpenCAPI specification — that delivers HBM-class bandwidth with a fraction of the die area of traditional parallel DDR interfaces. IBM uses OMI across POWER10, POWER11, and Spyre.

The problem: every existing OMI buffer chip (the component that bridges the OMI serial link to LPDDR5 or HBM memory) is proprietary. Microchip Technology's Explorer buffer chip is the only commercial option, and it is closed. OpenAIU therefore requires an open OMI buffer chip as a sub-project deliverable.

The path forward uses the Universal Memory Interface (UMI) from Zero ASIC as a transaction-layer reference, combined with an open-source SerDes implementation, to build a minimal OMI-compatible buffer in an accessible process node. This is explicitly the highest-risk deliverable in the project scope and is flagged for early funding prioritization.
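
The >200 GB/s target can be sanity-checked from channel width and signaling rate. The sketch below assumes OMI-style 8-lane channels at 25.6 or 32 GT/s and ignores framing overhead; these are assumptions for the estimate, not committed OpenAIU parameters.

```python
# Back-of-envelope check on the >200 GB/s OMI target. Lane count and
# signaling rates are assumptions for the estimate (OMI-style 8-lane
# channels at 25.6 or 32 GT/s), not committed OpenAIU parameters.
LANES_PER_CHANNEL = 8

def channel_gbps(gt_per_s, encoding_efficiency=1.0):
    """Per-direction bandwidth of one OMI-style channel, in GB/s."""
    return LANES_PER_CHANNEL * gt_per_s * encoding_efficiency / 8.0

for rate in (25.6, 32.0):
    per_ch = channel_gbps(rate)
    channels_needed = -(-200 // per_ch)   # ceiling division
    print(f"{rate} GT/s: {per_ch:.1f} GB/s per channel -> "
          f"{int(channels_needed)} channels for >200 GB/s")
```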

AI-Accelerated Design: Compressing the Timeline

A traditional silicon design project of this complexity — a multi-core AI SoC at 5nm — takes 5–7 years from architecture to tape-out. The OpenAIU project explicitly targets 36 months by using agentic AI-powered EDA tooling throughout the design flow. This is not hypothetical: the tools exist and are in commercial use as of 2025-2026.

Front-End / RTL · Cadence ChipStack AI ("Super Agent")

Agentic AI that automates RTL coding, test plan generation, regression orchestration, and bug fixing for Microwatt control cluster updates and the AI core RTL. Announced February 2026. Compresses front-end design from months to days for well-specified blocks.

Verification · Synopsys VSO.ai + Cadence Verisium

Autonomously reaches coverage targets up to 10× faster by identifying redundant tests and routing simulation resources toward uncovered functional states. Critical for verifying AI core correctness and AXU dispatch correctness across all precision formats.

Physical Design / PPA · Cadence Cerebrus AI Studio (reinforcement learning)

Uses reinforcement learning to optimize floorplanning, clock tree synthesis, and power gating across all 32 AI cores simultaneously — targeting the 75W envelope with up to 20% better power efficiency than manual closure. One engineer can drive multiple block closures in parallel.

RTL-to-GDSII · Synopsys DSO.ai (design space optimization)

Autonomous search across the implementation design space (synthesis strategies, placement constraints, routing options) to find minimum-power configurations at 5nm. Particularly valuable for the AI core array where repeated instances allow aggressive optimization sharing.

Analog / SerDes / OMI · Cadence Virtuoso Studio AI + Synopsys ASO.ai

For the analog-intensive open OMI SerDes and memory interface circuits. AI-assisted process migration takes an existing open-source SerDes design and retargets it to the 5nm node, dramatically reducing analog design iteration time — the slowest part of any mixed-signal SoC.

Chiplet Architecture · Multi-Die Integration (UCIe · 2.5D)

The AI core array and the Microwatt control cluster may be implemented as separate chiplets (AI cores at 5nm, control cluster at 7nm) connected via UCIe die-to-die interface. This reduces per-die yield risk and allows node-optimized manufacturing for each functional block. AI-driven EDA tools optimize the die partition boundary.

Regulatory Value: Auditable AI Inference

Regulated industries deploying AI for credit decisions, fraud detection, clinical decision support, or benefits eligibility face a specific problem with current AI accelerator hardware: the inference computation is performed on opaque proprietary silicon with no auditable firmware and no independent verification of numerical precision. Regulators in the EU (AI Act), US financial sector (OCC guidance on model risk), and healthcare (FDA guidance on AI-based devices) are increasingly focused on infrastructure provenance for AI systems.

OpenAIU is designed with auditability as a first-class architectural requirement:

  • Open AI core RTL: The matrix multiply units, vector units, and precision conversion logic are open source. An auditor can verify that INT8 quantization is implemented to specification — with no rounding modes hidden in proprietary firmware (a sketch of such a check follows this list).
  • Deterministic execution: OpenAIU will specify and document the exact numerical behavior of all supported precision formats — a property that proprietary GPU inference engines do not guarantee across driver versions.
  • Firmware chain of trust: Boot from OpenFSP (Project 03) with measured boot extending from the service processor through the PCIe card firmware. The entire software stack from power-on to inference kernel is cryptographically attested.
  • Model isolation: LPAR-equivalent partitioning of AI core resources — enforced in hardware — ensures model inference from different tenants cannot interfere. Relevant for shared regulated cloud environments.
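
The kind of arithmetic audit referred to in the first bullet can be as simple as replaying inputs through a documented reference rule and diffing against what the device reports. The sketch below uses round-to-nearest-even with saturation as an example of a specified rounding mode; the actual OpenAIU numerical specification is a project deliverable, not assumed here.

```python
# Sketch of an arithmetic audit check: compare quantized outputs reported
# by a device against a documented reference rule. Round-to-nearest-even
# is used here only as an *example* of a specified rounding mode -- the
# real OpenAIU specification is a project deliverable, not assumed here.
import numpy as np

def reference_quantize_int8(x, scale):
    """Specification reference: scale, round-to-nearest-even, saturate."""
    return np.clip(np.round(x / scale), -128, 127).astype(np.int8)

def audit(device_outputs, inputs, scale):
    """Return the indices where the device deviates from the reference."""
    expected = reference_quantize_int8(inputs, scale)
    return np.nonzero(device_outputs != expected)[0]

x = np.array([0.05, 0.15, 0.25, -0.25], dtype=np.float32)
scale = np.float32(0.1)
device_out = reference_quantize_int8(x, scale)    # stand-in device output
print("mismatches:", audit(device_out, x, scale))  # -> empty array
```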

Integration with the ESI Stack

OpenAIU is not a standalone device. It is designed as a component in the full ESI stack:

  • Microwatt (Community Project): The 4× Microwatt POWER ISA control cluster on the OpenAIU SoC draws directly from the open source Microwatt project — already silicon-proven via OpenROAD tapeout. OpenAIU is the first deployment of Microwatt inside a production AI accelerator SoC.
  • Project 02 (OpenHMC): OpenHMC manages OpenAIU card inventory, partitions inference capacity across LPARs, monitors power draw and thermals, and handles firmware updates — exactly as it manages compute LPARs. No separate management plane required.
  • Project 03 (OpenFSP): OpenFSP handles the PCIe card's out-of-band management, secure boot attestation, and hardware error reporting through the same Redfish API surface used for host servers (an inventory query sketch follows this list).
  • Project 04 (OCP Reference Platform): OpenAIU slots into the OCP NIC 3.0 and PCIe card slots on the reference platform chassis. When deployed with OpenCAPI-capable POWER hosts, multiple OpenAIU cards chain directly over OpenCAPI — enabling low-latency scale-out inference across a full rack without routing through the host CPU.
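
As a sketch of what the shared management surface looks like in practice, the snippet below enumerates a chassis's PCIe devices through generic Redfish resources. The BMC address, chassis ID, and credentials are placeholders, and the paths follow the standard Redfish schema rather than any published OpenFSP API.

```python
# Hedged sketch: enumerate accelerator cards through a standard Redfish
# PCIeDevices collection, as Project 03 would expose it. The BMC URL,
# chassis ID, and credentials are placeholders; resource paths follow
# the generic Redfish schema rather than a published OpenFSP API.
import requests

BMC = "https://bmc.example.org"          # placeholder OpenFSP endpoint
AUTH = ("admin", "password")             # placeholder credentials

def list_pcie_devices(chassis_id="1"):
    url = f"{BMC}/redfish/v1/Chassis/{chassis_id}/PCIeDevices"
    coll = requests.get(url, auth=AUTH, verify=False).json()
    for member in coll.get("Members", []):
        dev = requests.get(BMC + member["@odata.id"], auth=AUTH,
                           verify=False).json()
        print(dev.get("Id"), dev.get("Manufacturer"), dev.get("Model"))

if __name__ == "__main__":
    list_pcie_devices()
```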

Target Specifications

Parameter: Specification · Notes
AI Cores: 32 (8×4 grid) · Open RTL, 2D systolic + 1D vector per core
Precision: INT4 / INT8 / FP16 / BF16 · Hardware acceleration for all; INT4 for LLM inference
On-chip SRAM: ≥64 MB · Distributed tiles + shared L3
Memory Interface: OMI (open buffer) · OpenCAPI OMI subset; LPDDR5X memory stacks
Memory Bandwidth: >200 GB/s · Peak to AI core array
Control Processor: 4× Microwatt POWER ISA · AXU-coupled to AI array; MMA instruction dispatch; full POWER ISA
Host Interface: PCIe Gen5 ×16 / OpenCAPI · Negotiated at boot; OpenCAPI primary on POWER hosts; PCIe fallback for x86/ARM
Power Envelope: 75W TDP · PCIe card form factor, passive or active cooling
Process Node: 5nm (TSMC N5) · 7nm fallback; chiplet option for mixed nodes
Architecture: Monolithic or 2.5D chiplet · UCIe die-to-die interface for chiplet variant
Firmware: Open source (Apache 2.0) · Boot via OpenFSP; management via OpenHMC

Deliverables

AI Core Architecture Spec

Published microarchitecture specification for a single AI core: systolic array dimensions, vector unit design, SRAM tile, and AXU dispatch protocol. Month 4.

AI Core RTL (32 cores)

Synthesizable open source RTL for the full 32-core array including ring bus interconnect. Verification suite with INT4/INT8/FP16/BF16 test vectors. Month 14.

Open OMI Buffer RTL

Open source OMI buffer chip design — the most novel deliverable. SerDes, memory controller, and OMI protocol stack. FPGA-verified. Month 18.

Full SoC RTL Integration

Microwatt control cluster + AI array + OMI + PCIe Gen5 + OpenCAPI integrated into a complete SoC RTL. Functional simulation with reference AI models (BERT, LLaMA quantized). Month 22.

FPGA Prototype

Partial implementation on UltraScale+ — Microwatt control cluster + reduced AI core array (8 cores) — running quantized inference against reference benchmarks. Month 20.

ASIC Tape-Out Package

GDSII at 5nm (or 7nm). PPA closure verified with Cerebrus/DSO.ai. DRC clean. Third-party security audit of AI core arithmetic. Month 36.

Milestones

  • Month 1: TSC formed with representation from semiconductor, AI, and regulated industry backgrounds. EDA tool agreements established with Cadence and Synopsys for project use.
  • Month 3: AXU interface specification finalized. AI core microarchitecture v1.0 published for community review. OMI buffer chip feasibility study complete.
  • Month 6: Single AI core RTL complete. AXU dispatch working in simulation with Microwatt control cluster. MMA instruction encoding verified against POWER ISA specification.
  • Month 10: 8-core cluster RTL with ring bus interconnect. INT8 matrix multiply verified against reference implementation. Cadence Cerebrus PPA optimization begun on core cluster.
  • Month 14: Full 32-core array RTL complete. DSO.ai physical optimization pass targeting 75W at 5nm. OMI buffer SerDes RTL alpha.
  • Month 18: Open OMI buffer chip RTL complete. FPGA partial prototype: 8 AI cores + Microwatt control cluster running BERT INT8 inference. INT4 precision format verified.
  • Month 22: Full SoC RTL integration. LLaMA-class quantized model inference running in simulation. PCIe Gen5 and OpenCAPI interfaces integrated; card-to-card chaining validated in simulation.
  • Month 26: Timing closure at 5nm with Cerebrus/DSO.ai assistance. Third-party arithmetic audit of AI core precision formats. OpenHMC and OpenFSP integration validated.
  • Month 30: Pre-tape-out DRC and LVS clean. SBOM published. Security audit complete. Founding member sign-off on tape-out configuration.
  • Month 36: GDSII tape-out package submitted to foundry. First silicon expected 12–18 months post-submission. FPGA prototype units available to all Founding Members.

The Strategic Case

Every major regulated industry is deploying AI inference on-premises — for model risk compliance, data residency requirements, or latency-sensitive applications. They are doing it on proprietary GPU hardware they cannot audit, connected via proprietary software stacks they cannot verify, managed by cloud providers whose infrastructure they cannot inspect.

OpenAIU is the only project in the world combining: an open processor ISA with IBM patent coverage, open AI accelerator cores derived from published research architecture, auditable firmware from chip boot to inference kernel, and a hardware form factor compatible with multi-vendor manufacturing. For a bank that must explain to its regulator exactly what hardware and software computed a credit decision — OpenAIU is infrastructure that makes that explanation possible. No proprietary GPU can.