Performance Analysis — VAIOS | Vayu Technical Report

The performance of VAIOS was evaluated with a benchmark suite executed on 5 March 2026. The benchmarks assess the behavior of key subsystems, including task scheduling, inter-task communication, memory management, floating-point computation, and DMA-based data transfer. All tests ran on an STM32F401RE (Cortex-M4) platform with a 1 ms system tick.

The suite comprises 14 test cases across these subsystems, all of which completed without errors, timeouts, or failures, indicating stable operation under both isolated and concurrent workloads.

Performance benchmarks for VAIOS subsystems on STM32F401RE. The results demonstrate efficient kernel execution and high computational throughput using the hardware FPU.

Experimental Platform

The benchmarks were conducted on a Nucleo-F401RE development board. The configuration details are summarized in Table 6.1.

Experimental platform configuration for VAIOS performance characterization.
Field	Value
Board	STM32F401RE (Nucleo-64)
CPU	Cortex-M4 @ 84 MHz
FPU	Enabled (hard-float ABI)
DMA	Enabled (UART logging)
SysTick	1 ms period
Build	Release (-O2)
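The 1 ms SysTick period in Table 6.1 follows directly from the 84 MHz core clock. The sketch below computes the corresponding reload value for the 24-bit SysTick counter; with CMSIS, the equivalent on-target call would be `SysTick_Config(SystemCoreClock / 1000)`. The helper `systick_reload_for_period_us` is hypothetical and is not part of VAIOS.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* SysTick is a 24-bit down-counter: it counts from LOAD down to 0,
 * so one tick period lasts LOAD + 1 core clock cycles. */
#define SYSTICK_MAX_RELOAD 0x00FFFFFFu

/* Hypothetical helper: reload value for a tick period of `period_us`
 * microseconds at `core_hz`. Returns false if the period is not a whole
 * number of cycles or does not fit in the 24-bit LOAD register. */
static bool systick_reload_for_period_us(uint32_t core_hz, uint32_t period_us,
                                         uint32_t *reload_out)
{
    assert(reload_out != NULL);
    uint64_t cycles = (uint64_t)core_hz * period_us;
    if (cycles % 1000000u != 0) {
        return false;                 /* not an integer number of cycles */
    }
    cycles /= 1000000u;               /* core cycles per tick */
    if (cycles == 0 || cycles - 1 > SYSTICK_MAX_RELOAD) {
        return false;                 /* exceeds the 24-bit counter */
    }
    *reload_out = (uint32_t)(cycles - 1);
    return true;
}
```

At 84 MHz a 1 ms tick needs 84,000 cycles (reload 83,999), well within the 24-bit limit of 16,777,215, whereas a 1 s period (84,000,000 cycles) would not fit.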

Benchmark Summary

The suite spans five subsystem categories (FPU, DMA, TASK, IPC, and MEM) plus a concurrent stress test that exercises all of them together, all run on the 84 MHz STM32F401RE platform described above.

VAIOS Benchmark Results Summary. All 14 tests passed, demonstrating robust subsystem integration and performance.
Benchmark	Status	Duration	Throughput / Result
FPU: `sinf` throughput	PASS	4 ms	250,000 ops/s
FPU: `sqrtf` throughput	PASS	3 ms	333,333 ops/s
FPU: multi-task context-save	PASS	4 ms	250,000 ops/s
DMA: Memory-to-Memory (Looped)	PASS	20 ms	~2,000 KB/s
DMA: Concurrent Streams	PASS	21 ms	~1,904 KB/s
TASK: Context switch rate	PASS	212 ms	3,773 sw/s
TASK: Priority preemption	PASS	65 ms	Verified
TASK: Delay accuracy	PASS	101 ms	1% error
IPC: Semaphore ping-pong	PASS	31 ms	32,258 trips/s
IPC: Mutex shared counter	PASS	69 ms	Verified
MEM: Alloc/free throughput	PASS	97 ms	6,185 ops/s
MEM: Fragmentation resilience	PASS	—	Coalesced
STRESS: All subsystems concurrent	PASS	8,004 ms	Verified
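The Throughput column is consistent with throughput = operations / duration. The sketch below assumes the operation counts implied by the table (for example, 250,000 ops/s × 4 ms = 1,000 `sinf` calls) and assumes durations are read from the 1 ms tick counter, so each reported duration carries up to ±1 tick of uncertainty. Both assumptions are inferred rather than stated in the report, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Throughput in operations per second for `ops` operations completed
 * in `duration_ms` milliseconds. */
static double throughput_per_s(uint32_t ops, uint32_t duration_ms)
{
    assert(duration_ms > 0);
    return (double)ops * 1000.0 / (double)duration_ms;
}

/* Assuming the duration comes from a 1 ms tick counter, the true elapsed
 * time lies within one tick of the reported value. Report the range of
 * throughputs consistent with that measurement. */
static void throughput_bounds(uint32_t ops, uint32_t duration_ms,
                              double *lo, double *hi)
{
    assert(duration_ms > 1 && lo != NULL && hi != NULL);
    *lo = throughput_per_s(ops, duration_ms + 1);  /* longest possible time */
    *hi = throughput_per_s(ops, duration_ms - 1);  /* shortest possible time */
}
```

Under these assumptions, 1,000 `sinf` calls in 4 ms correspond to anywhere from 200,000 to about 333,333 ops/s, so runs lasting only a few ticks are best read as approximate. The table's figures also appear to be truncated rather than rounded (800 switches in 212 ms gives 3,773.6 sw/s, listed as 3,773).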

Aggregate memory allocation latency for varying workload sizes (Small: 1000 × 32 B, Medium: 500 × 128 B, Large: 100 × 1024 B). Because the allocator searches the heap linearly, latency grows with the number of heap blocks traversed.
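The linear-search behavior can be illustrated with a minimal first-fit allocator over a static arena. This is a hypothetical sketch, not the VAIOS allocator: the header layout, the `ff_` names, and the 4 KB arena size are assumptions. It records how many blocks each allocation visits (the traversal depth) and merges adjacent free blocks on free, the mechanism the fragmentation-resilience test refers to as coalescing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FF_ARENA_SIZE 4096u   /* hypothetical arena size in bytes */
#define FF_ALIGN      8u

/* Blocks are laid out back to back, each preceded by this header,
 * so the heap can be walked linearly from the start of the arena. */
typedef struct {
    size_t size;   /* payload size in bytes (multiple of FF_ALIGN) */
    int    free;   /* 1 if the block is available */
} ff_header;

#define FF_HDR (((sizeof(ff_header) + FF_ALIGN - 1) / FF_ALIGN) * FF_ALIGN)

static _Alignas(8) uint8_t ff_arena[FF_ARENA_SIZE];
static size_t ff_last_visits;   /* blocks visited by the last ff_alloc */

static ff_header *ff_at(size_t off) { return (ff_header *)(ff_arena + off); }

static void ff_init(void)
{
    ff_at(0)->size = FF_ARENA_SIZE - FF_HDR;   /* one free block spans all */
    ff_at(0)->free = 1;
}

/* First fit: walk from the start until a free block is large enough.
 * The cost grows with the number of blocks in front of the first fit. */
static void *ff_alloc(size_t n)
{
    n = (n + FF_ALIGN - 1) / FF_ALIGN * FF_ALIGN;
    ff_last_visits = 0;
    for (size_t off = 0; off < FF_ARENA_SIZE; off += FF_HDR + ff_at(off)->size) {
        ff_header *h = ff_at(off);
        ff_last_visits++;
        if (!h->free || h->size < n) continue;
        if (h->size >= n + FF_HDR + FF_ALIGN) {   /* split off the remainder */
            ff_header *rest = ff_at(off + FF_HDR + n);
            rest->size = h->size - n - FF_HDR;
            rest->free = 1;
            h->size = n;
        }
        h->free = 0;
        return ff_arena + off + FF_HDR;
    }
    return NULL;   /* no free block large enough */
}

/* Mark a block free, then merge every run of adjacent free blocks. */
static void ff_free(void *p)
{
    if (p == NULL) return;
    ff_header *h = (ff_header *)((uint8_t *)p - FF_HDR);
    assert(h->free == 0);   /* catch double frees */
    h->free = 1;
    for (size_t off = 0; off < FF_ARENA_SIZE; off += FF_HDR + ff_at(off)->size) {
        ff_header *cur = ff_at(off);
        while (cur->free) {
            size_t next = off + FF_HDR + cur->size;
            if (next >= FF_ARENA_SIZE || !ff_at(next)->free) break;
            cur->size += FF_HDR + ff_at(next)->size;   /* absorb neighbor */
        }
    }
}
```

After three 32-byte allocations, the third one visits three blocks; once all three are freed, coalescing restores a single free block, so a large request is satisfied after visiting only one block.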

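Task Scheduling: The scheduler sustained approximately 3,773 context switches per second with the release (-O2) build at 84 MHz. This corresponds to approximately 22,260 clock cycles per switch (84,000,000 / 3,773 ≈ 22,263). Because this figure divides the total elapsed time by the number of switches, it is an upper bound on the cost of a switch itself: it includes PendSV overhead and scheduler logic, but also any work the benchmark tasks perform between switches. Delay accuracy was within 1% (101 ms measured for a 100 ms requested delay), which matches the one-tick (1 ms) quantization error expected with a 1 ms system tick.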
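Both derived numbers in the scheduling paragraph are simple arithmetic, made explicit in the sketch below. It assumes that wake-ups happen only on tick boundaries, so a delay can overshoot by at most one tick; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Average core cycles per context switch, rounded to the nearest cycle.
 * This is a whole-benchmark average, so it also counts any work the test
 * tasks perform between switches. */
static uint32_t cycles_per_switch(uint32_t core_hz, uint32_t switches_per_s)
{
    assert(switches_per_s > 0);
    return (core_hz + switches_per_s / 2) / switches_per_s;
}

/* Assuming wake-ups occur only on tick boundaries, a delay can overshoot
 * by up to one tick. Worst-case relative error, in parts per thousand. */
static uint32_t worst_case_error_permille(uint32_t delay_ms, uint32_t tick_ms)
{
    assert(delay_ms > 0);
    return (tick_ms * 1000u) / delay_ms;
}
```

`cycles_per_switch(84000000, 3773)` yields 22,263 cycles, and a 100 ms delay on a 1 ms tick has a worst-case error of 10 permille (1%), matching the measured 101 ms. The same bound grows to 10% for a 10 ms delay, so short delays are proportionally less accurate.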

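Computational Performance: The hardware FPU performs efficiently. sqrtf achieved approximately 33% higher throughput than sinf (333,333 vs. 250,000 ops/s), consistent with sqrtf being backed by the Cortex-M4 FPU's single-precision square-root instruction (VSQRT.F32), whereas sinf has no hardware equivalent and is evaluated in software. Both runs lasted only 3 to 4 ms, however, so the entire difference amounts to a single 1 ms tick and the ratio should be treated as approximate. The multi-task test verified the FPU context preservation logic, with lazy stacking ensuring that only tasks with an active floating-point context incur the register save/restore overhead.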
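On ARMv7-M cores with the FP extension, a context switch can tell whether the outgoing task had an active floating-point context by checking bit 4 of the EXC_RETURN value in LR: when the bit is clear, the hardware stacked an extended frame and the kernel must also save S16-S31 in software. The host-runnable sketch below captures only that decision and the resulting per-task context size. Whether the VAIOS PendSV handler uses exactly this check is an assumption, and `task_needs_fp_save` and `context_words` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* EXC_RETURN bit 4: 0 = extended frame (FP state stacked),
 * 1 = basic integer-only frame (ARMv7-M with the FP extension). */
#define EXC_RETURN_BIT4 (1u << 4)

static bool task_needs_fp_save(uint32_t exc_return)
{
    assert((exc_return & 0xFFFFFF00u) == 0xFFFFFF00u);  /* valid EXC_RETURN */
    return (exc_return & EXC_RETURN_BIT4) == 0u;
}

/* Words saved for one switched-out task, excluding any alignment padding:
 * the hardware frame pushed on exception entry plus the registers the
 * kernel saves in software. With lazy stacking, space for S0-S15 and
 * FPSCR is reserved on entry but written only when the handler first
 * executes an FP instruction (e.g. the software save of S16-S31). */
static uint32_t context_words(uint32_t exc_return)
{
    uint32_t hw = 8u;   /* R0-R3, R12, LR, PC, xPSR */
    uint32_t sw = 8u;   /* R4-R11 */
    if (task_needs_fp_save(exc_return)) {
        hw += 18u;      /* S0-S15, FPSCR, reserved word */
        sw += 16u;      /* S16-S31 */
    }
    return hw + sw;
}
```

Under these assumptions an integer-only task (EXC_RETURN 0xFFFFFFFD) costs 16 words (64 bytes) per switch, while a task with live FP state (0xFFFFFFED) costs 50 words (200 bytes), which is the overhead integer-only tasks avoid.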

Subsystem Reliability: A 3,000 ms high-contention stress test executed 200 DMA transfers, 6,902 semaphore round trips, and numerous memory operations concurrently. No allocation failures or data corruption incidents were observed, indicating that the kernel remains stable under the kind of sustained concurrent load expected in autonomous flight. Several reliability fixes were made during benchmarking, including a bounded spin-wait in the logging subsystem and optimized DMA flag-clearing sequences that prevent spurious interrupts.
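A bounded spin-wait replaces an unconditional polling loop with one that gives up after a fixed budget, so a stalled transfer cannot hang the logger. The sketch below is hypothetical: the poll budget, the `bounded_wait` name, and polling through a callback are illustrative assumptions, not the VAIOS logging code. On hardware, the callback would read a volatile peripheral status flag; here a host-side stand-in simulates one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef bool (*ready_fn)(void *ctx);

/* Poll `ready` at most `max_polls` times. Returns true if the condition
 * became true within the budget and false on timeout, letting the caller
 * drop or defer the log message instead of blocking forever. */
static bool bounded_wait(ready_fn ready, void *ctx, uint32_t max_polls)
{
    assert(ready != NULL);
    for (uint32_t i = 0; i < max_polls; i++) {
        if (ready(ctx)) {
            return true;
        }
    }
    return false;
}

/* Host-side stand-in for a hardware flag: reports ready after
 * `polls_left` unsuccessful polls. */
typedef struct { uint32_t polls_left; } fake_flag;

static bool fake_flag_ready(void *ctx)
{
    fake_flag *f = ctx;
    if (f->polls_left == 0) return true;
    f->polls_left--;
    return false;
}
```

A flag that becomes ready after five polls is detected well within a 100-poll budget, while one that would need 1,000 polls times out and returns control to the caller.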