A circuit board labeled HTX301 Evaluation Platform features an HTX301 chip in the center.

Meet the 240W PCIe AI Accelerator That Packs 384GB to Run 700B-Parameter LLMs Locally—Using Less Than Half the Power of NVIDIA’s RTX PRO 6000 Blackwell

A Taiwan-based AI hardware and software company has introduced a new PCIe AI accelerator card aimed at making on-premises large language model inference far more practical. The newly announced Skymizer HTX301 is being positioned as a low-power alternative to building or renting large GPU clusters, with the company claiming it can run inference for models as large as 700B parameters locally while staying around 240W.

The idea behind HTX301 is straightforward: deliver enterprise-grade LLM inference in a familiar PCIe add-in card format, while keeping power draw, infrastructure complexity, and long-term cost predictable. Skymizer is targeting organizations that want on-prem AI for data sovereignty, deterministic latency, and a fixed hardware footprint rather than ongoing cloud spend.

Skymizer says HTX301 is its first inference chip built on the company’s HyperThought platform, featuring its next-generation LPU IP. The LPU approach is described as purpose-built for LLM workloads, focusing on performance-per-watt and efficient orchestration of the two major phases of inference: prefill and decode. According to the company, HTX301 pairs decode acceleration with unified prefill/decode orchestration to improve real-world throughput.
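
Since the pitch rests on how those two phases differ, a brief generic sketch may help: prefill processes the entire prompt in one large, compute-heavy pass and fills a KV cache, while decode generates one token at a time, rereading that cache at every step. The toy single-head attention loop below (plain NumPy with made-up dimensions; it is not Skymizer's implementation) illustrates the split.

```python
# Toy illustration of the two LLM inference phases the article describes:
# prefill (process the whole prompt at once) and decode (generate one token
# at a time against a growing KV cache). Generic sketch, not Skymizer's
# stack; dimensions and weights are invented for illustration.
import numpy as np

D = 64                      # model/head dimension (illustrative)
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(3))

def attend(q, K, V):
    """Single-head scaled dot-product attention over the cached K/V."""
    scores = q @ K.T / np.sqrt(D)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

# --- Prefill: one large, compute-bound pass over the full prompt ---
prompt = rng.standard_normal((128, D))       # 128 prompt "tokens"
K_cache, V_cache = prompt @ Wk, prompt @ Wv  # K/V for every prompt token

# --- Decode: many small steps, one token each, rereading the cache ---
x = prompt[-1]
for _ in range(16):                          # generate 16 tokens
    q = x @ Wq
    out = attend(q, K_cache, V_cache)        # reads the whole cache per step
    K_cache = np.vstack([K_cache, x @ Wk])   # cache grows by one row
    V_cache = np.vstack([V_cache, x @ Wv])
    x = out                                  # stand-in for the next token

print(K_cache.shape)  # (144, 64): 128 prompt + 16 generated tokens
```

The structural difference is why accelerators often treat the phases separately: prefill rewards raw compute, while decode rewards memory bandwidth.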

Physically, the card follows a standard PCIe accelerator design, with a main chip surrounded by memory packages; Skymizer notes that each board carries six HTX301 chips. Even though the chips are built on an older 28 nm process, the company claims strong efficiency and performance, citing figures such as 30 tokens per second from just 0.5 TOPS of compute and 100 GB/s of memory bandwidth. The platform is also described as scalable, enabling different deployment and configuration options.
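
One hedged way to sanity-check those figures: if decode is memory-bandwidth-bound and each generated token streams the full weight set from memory once, throughput is roughly bandwidth divided by model size. The model size below is an illustrative assumption, not a number Skymizer has published.

```python
# Back-of-envelope check on why decode throughput tracks memory bandwidth
# more than raw TOPS. Assumes a memory-bound decode in which every generated
# token reads all weights once.
bandwidth_gb_s = 100    # per the article's cited bandwidth figure
model_gb = 3.5          # hypothetical: a ~7B model at ~4 bits per weight

tokens_per_s = bandwidth_gb_s / model_gb
print(f"~{tokens_per_s:.0f} tokens/s upper bound")  # ~29 tokens/s
```

Under those assumptions, the estimate lands in the same ballpark as the cited 30 tokens per second, which is consistent with a bandwidth-limited decode rather than a compute-limited one.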

On model performance, Skymizer reports that its Octa-Core LPU can reach 240 tokens per second in Llama 2 7B prefill. The company also says multi-chip configurations can scale to around 1200 tokens per second on the same model, with support extending up to 700B-parameter models depending on configuration.
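
Taken at face value, and assuming the multi-chip figure corresponds to the six chips each board carries (the article does not state the chip count behind it), the cited numbers imply roughly 5x scaling, as the quick arithmetic below shows.

```python
# Scaling efficiency implied by the cited figures. The assumption that the
# ~1200 tok/s configuration uses all six on-board chips is ours, not
# Skymizer's.
single_chip = 240    # tokens/s, Llama 2 7B prefill per Octa-Core LPU
multi_chip = 1200    # tokens/s, multi-chip configuration
chips = 6

print(f"{multi_chip / single_chip:.1f}x over {chips} chips "
      f"(~{multi_chip / (single_chip * chips):.0%} scaling efficiency)")
```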

Memory capacity is another key part of the pitch. The PCIe card is said to support up to 384 GB of memory, using mainstream LPDDR4 and LPDDR5 rather than more specialized and expensive options such as HBM or GDDR. Skymizer frames this as a deliberate design choice intended to balance cost, capacity, and bandwidth for its target inference scenarios.
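
A quick capacity check suggests why 384 GB matters for the 700B claim: at 4 bits per weight (an assumed quantization level; Skymizer has not specified one), a 700B-parameter model's weights alone occupy about 350 GB, just under the card's ceiling, before accounting for the KV cache.

```python
# Rough capacity check for the 384 GB claim. The 4-bit quantization level
# is an assumption for illustration, not a stated Skymizer specification.
params = 700e9
bits_per_weight = 4

weight_gb = params * bits_per_weight / 8 / 1e9
print(f"{weight_gb:.0f} GB of weights")  # 350 GB, under the 384 GB ceiling
```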

To further reduce memory pressure, Skymizer highlights compression techniques built into the HTX301 architecture. For model weights (long-term memory), the company claims its compression outperforms open-source llama.cpp by 9% to 17.8%. For the KV cache (short-term memory), it cites compression with minimal perplexity impact, claiming losses ranging from under 0.06% to 3.52%.
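
As a rough illustration of why KV cache compression can be cheap in accuracy terms, the sketch below applies generic per-row int8 quantization to a synthetic cache tensor. This is a common, widely used technique and not necessarily what HTX301 implements.

```python
# Generic illustration of KV-cache compression via per-row int8 quantization.
# A common technique, shown here for intuition; not Skymizer's method.
import numpy as np

rng = np.random.default_rng(1)
kv = rng.standard_normal((1024, 128)).astype(np.float32)  # synthetic KV cache

scale = np.abs(kv).max(axis=1, keepdims=True) / 127.0     # per-row scale
kv_int8 = np.round(kv / scale).astype(np.int8)            # 4x smaller storage
kv_restored = kv_int8.astype(np.float32) * scale

rel_err = np.abs(kv_restored - kv).mean() / np.abs(kv).mean()
print(f"mean relative error: {rel_err:.4f}")  # small reconstruction error
```

Because the reconstruction error of such schemes is small relative to the cache values themselves, the downstream perplexity impact tends to be modest, which is consistent with the low-single-digit losses Skymizer cites.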

Power consumption is one of HTX301’s headline claims. Skymizer lists the accelerator at roughly 240W, well under half the power of top-end PCIe AI accelerators such as NVIDIA’s RTX PRO 6000 Blackwell, which can reach around 600W. If the company’s performance claims hold up in real deployments, that kind of efficiency could be appealing for businesses trying to bring LLM inference in-house without major upgrades to cooling, power delivery, or rack density.

Skymizer plans to preview the HTX301 at Computex this year, where more clarity should emerge around real-world throughput, model compatibility, software stack maturity, and how closely the product matches the early numbers. On paper, though, the concept is clear: a PCIe AI accelerator built for local LLM inference that aims to reduce the need for expensive, large-scale GPU installations—especially for entry-level enterprise deployments that want to keep AI workloads on their own servers.