RamTrend

AI Memory · May 5, 2026

Research Paper Explores HBM-PNM Cubes for Long-Context AI Attention

A new academic paper involving UC San Diego, Columbia, Yonsei University, NVIDIA, and Samsung proposes a memory-centric architecture that uses HBM processing-near-memory cubes for long-context LLM attention.

Price impact: 6 · Direction: neutral · Source: Semiconductor Engineering

The paper, titled AMMA, targets the decode-attention stage of large language model serving, where very long contexts can make memory movement a bottleneck. According to the abstract excerpt, the architecture shifts the design away from GPU compute dies and toward HBM-PNM cubes, aiming to increase available memory bandwidth for attention workloads while reducing wasted compute area and power. The work is still research rather than a commercial product announcement, so it should not be read as an immediate HBM demand forecast. Its relevance is directional: AI inference workloads are continuing to motivate designs that place more compute capability closer to high-bandwidth memory, a theme that could influence future accelerator packaging, HBM logic-die features, and memory-centric chiplet architectures.
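To see why decode attention tends to be limited by memory movement rather than compute, a rough back-of-envelope sketch helps. The code below is not from the AMMA paper. The function names, the model shape, and the bandwidth figure are all hypothetical placeholders chosen for illustration. It assumes each decode step reads the full key/value cache once and that every cached element feeds one multiply-add per query head that shares it.

```python
# Back-of-envelope model of decode attention as a memory-bound workload.
# All names and numbers here are illustrative assumptions, not figures
# from the AMMA paper or from any specific product.

def kv_cache_bytes(context_len, num_layers, num_kv_heads, head_dim,
                   bytes_per_elem=2):
    """Bytes of cached keys and values for one sequence.

    The leading factor of 2 counts the K tensor and the V tensor.
    """
    return 2 * context_len * num_layers * num_kv_heads * head_dim * bytes_per_elem


def min_decode_attention_seconds(cache_bytes, bandwidth_bytes_per_s):
    """Lower bound on one decode step's attention time if the whole
    cache must be streamed from memory once (ignores compute and overlap)."""
    return cache_bytes / bandwidth_bytes_per_s


def attention_flops_per_byte(bytes_per_elem=2, queries_per_kv_head=1):
    """Approximate arithmetic intensity of decode attention.

    Each cached element takes part in one multiply-add (2 FLOPs) per
    query head that shares it, and costs bytes_per_elem bytes to read.
    """
    return 2 * queries_per_kv_head / bytes_per_elem


if __name__ == "__main__":
    # Hypothetical model shape: 32 layers, 8 KV heads, head dimension 128,
    # 16-bit cache entries, 131,072-token context.
    cache = kv_cache_bytes(context_len=131_072, num_layers=32,
                           num_kv_heads=8, head_dim=128, bytes_per_elem=2)
    # Hypothetical memory bandwidth of 1 TB/s.
    step = min_decode_attention_seconds(cache, bandwidth_bytes_per_s=1e12)
    print(f"KV cache: {cache / 2**30:.1f} GiB")
    print(f"Bandwidth-bound attention time per token: {step * 1e3:.2f} ms")
    print(f"FLOPs per byte read: {attention_flops_per_byte():.1f}")
```

Under these assumptions the arithmetic intensity comes out at only a few FLOPs per byte, so the time per generated token scales with cache size divided by memory bandwidth. That is the general motivation for moving compute closer to HBM rather than adding more GPU compute area.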

NVIDIA · Samsung · HBM · PNM · PIM · multi-chiplet architectures · LLM inference