Bernhard's shared items

NVIDIA Tensor Core Evolution: From Volta To Blackwell
Tuesday June 24th, 2025 at 10:45 AM

SemiAnalysis

In our AI Scaling Laws article from late last year, we discussed how multiple stacks of AI scaling laws have continued to drive the AI industry forward, enabling greater than Moore’s Law growth in model capabilities as well as a commensurately rapid reduction in unit token costs. These scaling laws are driven by training and inference optimizations and innovations, but advancements in compute capabilities transcending Moore’s Law have also played a critical role.

On this front, in the AI Scaling Laws article, we revisited the decades-long debate around compute scaling, recounting the end of Dennard Scaling in the late 2000s as well as the end of classic Moore’s Law pace cost per transistor declines by the late 2010s. Despite this, compute capabilities have continued to improve at a rapid pace, with the baton being passed to other technologies such as advanced packaging, 3D stacking, new transistor types and specialized architectures such as the GPU.

When it comes to AI and deep learning, GPU compute capabilities have improved at a faster than Moore’s law pace, consistently delivering remarkable “Huang’s Law” performance improvements year after year. The technology that is at the heart of driving this improvement is the Tensor Core.

Though the Tensor Core is unquestionably the bedrock upon which the foundations of modern AI and machine learning are built, it is not well understood, even by many experienced practitioners in the field. The rapid evolution of GPU architecture and programming models that run on this architecture means that it is increasingly challenging for Machine Learning researchers and scientists to keep up with the latest changes to Tensor Cores and grasp the implications of these changes.

In this report, we will introduce the core features of the major datacenter GPUs, first explaining important first principles of performance engineering. We will then trace the evolution of Nvidia’s Tensor Core architectures and programming models, highlighting the motivations behind this evolution. Our end goal is to provide a resource for understanding Nvidia’s GPU architecture and offer intuitive insights into their architectural evolution. Only after explaining each architecture can we explain the beauty of the Blackwell Tensor Core and its new memory hierarchy.

A solid grasp of computer architecture is a prerequisite for following many of the explanations and discussions in this article. We provide a brief section on CUDA programming as a refresher, but we do not explain the foundational concepts of GPU architecture. Instead, we build on the forefront of Tensor Core knowledge, turning what is currently tribal knowledge into accessible, structured insight through detailed explanation.

Just as a university will teach 101 courses as well as 4000 level courses, different articles at SemiAnalysis will cater to varying levels of understanding of the subject matter as well as to readers in different vocations and specializations.

We would like to thank our collaborators:

Jay Shah, Colfax Research: Terrific CUTLASS tutorials and numerous meetings meticulously checking the technical details
Ben Spector, Stanford Hazy Research: Offered great insights into programming model change and writing advice
Tri Dao, Princeton and Together AI: Reviewed drafts and gave detailed feedback
Neil Movva, Together AI: Reviewed drafts and offered insights into GPU kernel writing
Charles Frye, Modal: Pedagogical GPU Glossary and general review of the draft
Simon Guo, Stanford PhD student: Illustrated the cover picture and reviewed the draft
NVIDIA: Shared context around the progression of Tensor Core designs. Teams include:
Many other GPU wizards

For a fixed problem size, Amdahl’s Law specifies the maximum speedup you can obtain by parallelizing with more compute resources. Concretely, scaling compute resources only drives down the execution time of the parallel portion, so the performance improvement is bounded by the serial portion. To quantify it, the maximum performance improvement is:

$$\text{speedup} = \frac{1}{(1 - S) + \frac{S}{p}}$$

where S is the fraction of execution time spent on parallelizable work and p is the speedup of the parallelizable work. In an ideal world where the parallel portion is perfectly parallelized, the speedup p can be the number of processing units.
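As a rough numerical sketch of Amdahl’s Law, the snippet below assumes the parallel portion is given as a fraction of the original runtime. The function names are illustrative and not taken from the article.

```python
def amdahl_speedup(parallel_fraction: float, p: float) -> float:
    """Overall speedup when `parallel_fraction` of the runtime is sped up p times."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if p <= 0:
        raise ValueError("p must be positive")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / p)


def amdahl_limit(parallel_fraction: float) -> float:
    """Upper bound on the speedup as p grows without bound."""
    serial_fraction = 1.0 - parallel_fraction
    return float("inf") if serial_fraction == 0.0 else 1.0 / serial_fraction
```

With 90% of the runtime parallelizable, a 10x speedup of the parallel part yields only about a 5.3x overall speedup, and no amount of extra compute can push it past roughly 10x.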

Strong and weak scaling describe the performance improvement of scaling compute resources for different problem setups. Strong scaling refers to scaling compute resources to solve a fixed-size problem, and Amdahl’s Law quantifies the speedup of strong scaling. On the other hand, weak scaling refers to scaling compute resources to solve larger problems in constant time, for example, processing a 4x larger image in the same time using 4x more compute resources. We recommend this blog post for more detailed explanations.

Strong and weak scaling imply different performance improvements across problem sizes. Strong scaling offers speedup for all problem sizes, while weak scaling only guarantees performance improvement when we use more compute to solve a larger problem.

Data movement is a sin because in terms of runtime and scaling, computation is cheap and data movement is expensive. Data movement is fundamentally slower because modern DRAM cells operate at tens of nanoseconds, while transistors switch at sub-nanosecond speed. Regarding scaling, while computation speed gains have slowed since the 2000s, memory speed has improved even more slowly, creating the memory wall.

In this section, we introduce the main Nvidia GPU architectures that use Tensor Cores, namely the Tesla V100 GPU, A100 Tensor Core GPU, H100 Tensor Core GPU, as well as the Blackwell GPU. We have also included a pre-Tensor Core section as a refresher for the CUDA programming model. We will briefly go over the major features and changes that are relevant to understanding the Tensor Core, and we defer the details to other sources, which we link in each subsection.

Parallel Thread Execution (PTX) is a virtual instruction set that abstracts over GPU generations. A PTX program describes a kernel function that is executed with a large number of GPU threads, which are executed on the GPU’s hardware execution units, i.e. CUDA cores. Threads are organized as a grid, and a grid consists of cooperative thread arrays (CTAs). PTX threads can access data from multiple state spaces, which are memory storage areas with different characteristics. Specifically, threads have per-thread registers, threads within a CTA have shared memory, and all threads can access global memory. For more information, please read this section of the CUDA documentation.

The GPU architecture is built around an array of streaming multiprocessors (SMs). An SM consists of scalar processing cores, a multithreaded instruction unit, and an on-chip shared memory. An SM maps each thread to a scalar processing core (also known as a CUDA core), and the multithreaded instruction unit manages threads in groups of 32 parallel threads called warps.

At instruction issue time, the instruction unit selects a warp and issues an instruction to the threads of the warp. This execution method is called single-instruction, multiple threads (SIMT). Similar to single-instruction, multiple data (SIMD), SIMT controls multiple processing elements with a single instruction, but unlike SIMD, SIMT specifies a single thread behavior instead of vector width. For more information, please read this section of the CUDA documentation.

Streaming Assembler (SASS) is the architecture-specific instruction set that PTX virtualizes over. See the CUDA binary utilities documentation for more information. Unfortunately, SASS is not well documented due to NVIDIA hiding their architecture ISA details from their competitors.

As deep learning became more prominent, the industry noticed that ML workloads needed hardware acceleration. Early in 2015, Google deployed TPUv1 to accelerate its internal ML workloads, and in 2017, Nvidia introduced dedicated hardware for matrix math. Although GPUs consume a small amount of energy when issuing instructions (~30pJ) because of their simple hardware pipeline, a simple floating-point operation consumes even less energy, at only 1.5pJ. Issuing the instruction therefore costs 20x more energy than the floating-point operation itself. As a result, performing matrix multiplication with a large number of simple floating-point instructions is power inefficient. To amortize the instruction overhead, we need complex instructions that perform more computation per instruction. To this end, Nvidia designed the half-precision matrix multiply and accumulate (HMMA) instruction, a specialized instruction that performs half-precision matrix multiplication. The corresponding dedicated hardware to execute this instruction is the Tensor Core, introduced in the Tesla V100 GPU of the Volta architecture in 2017. The Volta Tensor Core was added very late in the development of the Volta architecture, only a handful of months before tape-out, a testament to how fast Nvidia can pivot its architecture.
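To make the amortization argument concrete, here is a small sketch using the two energy figures quoted above. It assumes, purely for illustration, that the issue energy stays constant no matter how much math one instruction performs; the function name is hypothetical.

```python
INSTRUCTION_ISSUE_PJ = 30.0  # energy to issue one instruction (figure from the text)
FLOP_PJ = 1.5                # energy of one simple floating-point operation (from the text)


def overhead_ratio(ops_per_instruction: int) -> float:
    """Issue energy divided by the energy of the math that one instruction performs.

    Assumption: issue energy does not grow with the amount of work per instruction.
    """
    if ops_per_instruction <= 0:
        raise ValueError("ops_per_instruction must be positive")
    return INSTRUCTION_ISSUE_PJ / (FLOP_PJ * ops_per_instruction)
```

Under this simplified model, one operation per instruction gives the 20x overhead from the text, while an instruction that performs 64 operations brings the overhead down to roughly 0.3x of the math energy.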

The matrix multiply and accumulate (MMA) instruction computes D = A * B + C, where:

A is an M by K matrix
B is a K by N matrix
C and D are M by N matrices

We denote the matrix shape as MxNxK (written mMnNkK in PTX, e.g. m16n8k16).
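The semantics of D = A * B + C for an MxNxK shape can be sketched as a plain reference implementation. This is only a functional model on nested lists, not how a Tensor Core distributes the work across threads; the function name is illustrative.

```python
def mma_reference(a, b, c):
    """Return D = A * B + C for A (M x K), B (K x N), and C (M x N) as nested lists."""
    m, k = len(a), len(a[0])
    n = len(b[0])
    if len(b) != k:
        raise ValueError("inner dimensions of A and B must match")
    if len(c) != m or len(c[0]) != n:
        raise ValueError("C must be M x N")
    return [
        [c[i][j] + sum(a[i][x] * b[x][j] for x in range(k)) for j in range(n)]
        for i in range(m)
    ]
```

For example, a 2x1x3 MMA with A = [[1, 2, 3], [4, 5, 6]], B a column of ones, and C = [[0], [10]] returns [[6], [25]].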

To perform the full computation, we first load matrices A, B, and C from shared memory to thread registers, so that each thread holds fragments of the matrices. Second, we execute the MMA instruction, which reads the matrices from thread registers, performs computation on Tensor Cores, and stores the result to thread registers. Finally, we store the results from thread registers back to shared memory. The full computation is collectively performed by multiple threads, meaning that every step requires a synchronization between the collaborating threads.

An SM of a Tesla V100 GPU contains 8 Tensor Cores, grouped in partitions of two. Each Tensor Core is capable of computing an equivalent of 4x4x4 matrix multiplication per cycle, which amounts to 1024 FLOPs per cycle per SM.

NVIDIA designed the PTX instruction mma to target the lower-level instructions. On the Volta architecture, an MMA instruction performs an 8x8x4 matrix multiplication, and a quadpair of 8 threads participates in the operation by collectively holding the input and output matrices. Here T0 refers to thread 0, [T0, T1, T2, T3] and [T16, T17, T18, T19] are threadgroups, and the 2 threadgroups form a quadpair.

In terms of data types, Volta Tensor Cores support FP16 inputs with FP32 accumulation in correspondence with NVIDIA’s mixed-precision training technique. This technique showed it is possible to train models at lower precision without losing model accuracy.

To fully understand the MMA layout, please refer to Citadel’s microbenchmarking paper, Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking. To see the interleaved layout pattern for Volta Tensor Core MMAs, please read the slides Programming Tensor Cores: Native Tensor Cores with CUTLASS. Finally, for other information of the Volta architecture, please refer to the whitepaper NVIDIA Tesla V100 GPU Architecture.

Turing architecture includes the 2nd generation Tensor Cores, an enhanced version of Volta Tensor Cores, adding INT8 and INT4 precision support. Turing Tensor Cores support a new warp-level synchronous MMA, which we will discuss in the next section. Turing Tensor Cores also enabled Deep Learning Super Sampling (DLSS), marking the start of NVIDIA applying deep learning to gaming graphics. Interested readers can refer to NVIDIA’s blog post NVIDIA Turing Architecture In-Depth and the Turing architecture whitepaper.

With Ampere, NVIDIA introduced asynchronous data copy, a way of copying data directly from global memory to shared memory in an asynchronous fashion. To load data from global memory to shared memory on Volta, threads must first load data from global memory to registers, and then store it to shared memory. However, MMA instructions have high register usage and must share the register file with data-loading operations, causing high register pressure and wasting memory bandwidth for copying data in and out of RF.

Async data copy mitigates this issue by fetching data from global memory (DRAM) and directly storing it into shared memory (with optional L1 access), freeing up more registers for MMA instructions. Data loading and compute can then happen asynchronously, which is more difficult from a programming model perspective but unlocks higher performance.

This feature is implemented as PTX instruction thread-level async copy cp.async (documentation). The corresponding SASS is LDGSTS, asynchronous global to shared memory copy. The exact synchronization methods are async-group and mbarrier-based completion mechanisms, detailed here.

Ampere has 4 Tensor Cores per SM, and each Tensor Core is capable of performing 512 FLOPs per cycle, amounting to 2048 Dense FLOPs per cycle per SM, doubling the performance of Volta.
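The per-SM throughput figures quoted in this article for Volta and Ampere can be checked with a few lines of arithmetic, counting one multiply-accumulate as 2 FLOPs. The helper name is illustrative.

```python
def mma_flops(m: int, n: int, k: int) -> int:
    """FLOPs of one MxNxK matrix multiply-accumulate (each multiply-add counts as 2)."""
    return 2 * m * n * k


# Volta: 8 Tensor Cores per SM, each equivalent to a 4x4x4 MMA per cycle.
volta_flops_per_cycle_per_sm = 8 * mma_flops(4, 4, 4)

# Ampere: 4 Tensor Cores per SM, each performing 512 FLOPs per cycle.
ampere_flops_per_cycle_per_sm = 4 * 512
```

This gives 1024 FLOPs per cycle per SM for Volta and 2048 for Ampere, matching the 2x generational increase described in the text even though the number of Tensor Cores per SM was halved.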

While Volta requires a quadpair of 8 threads to participate in an MMA operation, Ampere requires a full warp of 32 threads. Having warp-wide MMA instructions simplifies the thread layout and reduces RF pressure on Ampere. For instance, here is the thread and data layout for mixed-precision floating point of shape 16x8x16:

NVIDIA introduced ldmatrix in Ampere, an enhanced vectorized load operation. Like mma, ldmatrix is warp-wide, meaning that a warp of threads collectively loads a matrix. Compared to issuing multiple load instructions, this reduces address generation register use, lowering register pressure. See the CUDA documentation for more information.

ldmatrix loads data to registers in a layout that matches the Tensor Core’s data layout. Compared to Volta’s interleaved pattern (see Programming Tensor Cores: Native Tensor Cores with CUTLASS), a simpler thread and data layout greatly improves the programming ergonomics. Watch the GTC talk Developing CUDA Kernels to Push Tensor Cores to the Absolute Limit on NVIDIA A100 to learn more about how exactly Ampere’s memory loading is coherent with the Tensor Core.

Ampere MMA features Brain Floating Point Format (BF16), which has become the de facto standard for half-precision data types. BF16 provides the same 8-bit exponent range as FP32 but with a 7-bit mantissa, allowing FP32-level dynamic range at half the storage cost. BF16 also removes the need for loss scaling in mixed-precision training.
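The relationship between BF16 and FP32 can be illustrated by treating BF16 as the top 16 bits of an FP32 value. The sketch below converts by truncation, which is a simplifying assumption (hardware conversions typically round), and the helper names are hypothetical.

```python
import struct


def fp32_bits(x: float) -> int:
    """Bit pattern of x rounded to IEEE 754 single precision."""
    return struct.unpack(">I", struct.pack(">f", x))[0]


def to_bf16_bits(x: float) -> int:
    """Keep the top 16 bits of the FP32 pattern (truncation, not rounding)."""
    return fp32_bits(x) >> 16


def from_bf16_bits(bits: int) -> float:
    """Expand a 16-bit BF16 pattern back to a Python float."""
    return struct.unpack(">f", struct.pack(">I", (bits & 0xFFFF) << 16))[0]


def bf16_fields(bits: int) -> tuple:
    """Split a BF16 pattern into (sign, 8-bit exponent, 7-bit mantissa)."""
    return (bits >> 15) & 0x1, (bits >> 7) & 0xFF, bits & 0x7F
```

Because the 8 exponent bits are preserved, a value such as 1e38, far above the FP16 maximum of 65504, survives the conversion as a finite number with reduced precision.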

As the number of SMs grew, the size disparity between an SM and the whole GPU increased. To offer a finer granularity of control between CTAs (which map to SMs) and the grid (which maps to the whole GPU), on Hopper, NVIDIA added a new thread hierarchy level, the thread block cluster, which maps to a group of SMs physically located in the same graphics processing cluster (GPC). A thread block cluster is also called a cooperative grid array (CGA) and is referred to as a cluster in the CUDA documentation (see here for more information).

CTAs in a thread block cluster are guaranteed to be co-scheduled on SMs in the same GPC and distributed one CTA per SM by default. The shared memory partitions of those SMs form a distributed shared memory (DSMEM). A thread can access the shared memory from another SM with low latency through the dedicated SM-to-SM network (without going through L2 cache). By exposing the GPC hardware execution unit to the programming model, programmers can reduce data movement and improve the data locality.

To improve data fetch efficiency, NVIDIA added the Tensor Memory Accelerator (TMA) to each Hopper SM. TMA is a dedicated hardware unit that accelerates asynchronous data transfers of large quantities between global and shared memory (bulk asynchronous copies).

A single thread in a CTA can initiate a TMA copy operation. TMA frees up threads to execute other independent work, handling address generation and offering additional benefits such as out-of-bounds handling. In PTX, the corresponding instruction is cp.async.bulk, detailed in this CUDA documentation section.

However, for small requests, TMA loads have higher latency than regular async data copies because of the address generation overhead. Thus, NVIDIA recommends programmers to use TMAs for large data copies to amortize the overhead. For example, in LLM inference, TMA is not suitable for workloads that load KV cache in small chunks, but works well when each chunk is a multiple of 16 bytes. For more concrete examples of this, see SGLang prefix caching, paper FlashInfer section 3.2.1, paper Hardware-Efficient Attention for Fast Decoding section 4.2, and ThunderKittens MLA decode.

TMA also supports a mode of loading data called multicast, where TMA loads data from global memory to shared memory of multiple SMs in a thread block cluster, specified by a multicast mask. Instead of issuing multiple global memory loads loading the same piece of data into multiple SMs, multicast completes it in one load. Specifically, multiple CTAs in a thread block cluster load a portion of the data into their corresponding SMEMs and share the data through DSMEM. This reduces L2 cache traffic and subsequently reduces HBM traffic. We recommend reading Jay Shah’s TMA tutorial for more details.

NVIDIA introduced a new type of MMA with Hopper, warpgroup-level MMA (wgmma). wgmma is warpgroup-wide, meaning that a warpgroup of 4 warps collectively performs an MMA operation. wgmma supports a wider range of shapes. For example, mixed-precision MMA supports m64nNk16, where N can be multiples of 8 from 8 to 256. wgmma lowers to a new set of SASS instructions: GMMA. In another example, half-precision wgmma instructions lower to HGMMA. See this CUDA documentation section for the details of MMA shapes and data types.

While all threads in a warpgroup collectively hold the output matrix in their registers, Hopper Tensor Cores can directly load operands from shared memory instead of registers, saving register space and bandwidth. Specifically, operand matrix A can reside in either registers or shared memory, while operand matrix B can only be accessed through shared memory. See the CUDA documentation wgmma section for the details of wgmma’s completion mechanism, SMEM layout, and more.

For data types, Hopper introduced 8-bit floating-point data types (E4M3 and E5M2) with FP32 accumulation. In practice, the accumulation path was implemented as a 22-bit fixed-point format (13-bit mantissa plus sign and exponent bits), limiting the dynamic range compared to true 32-bit accumulation. Due to the reduced Tensor Core precision, after every N_c accumulations the partial result has to be accumulated on the CUDA cores to avoid constraining training accuracy (see this paper, section 3.3.2). This reduced-precision accumulation improves efficiency, but comes at the cost of accuracy.

The extreme register pressure did not let up on Hopper, which motivated Tensor Memory (TMEM), a new piece of memory specialized for Tensor Core operations. On every SM, TMEM has 128 rows (lanes) and 512 columns of 4-byte cells, totaling to 256 KB, which is also the size of the register file on an SM.

TMEM has a restricted memory access pattern. Specifically, it takes a warpgroup to access the whole TMEM, and each warp in a warpgroup can only access a specific set of lanes. By limiting the memory access pattern, hardware designers can reduce the number of access ports, saving chip space. On the other hand, this design also means that epilogue operations need a warpgroup to operate. Unlike shared memory, programmers have to explicitly manage TMEM, including allocation, deallocation, and copying data in and out of TMEM.
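The TMEM capacity and the per-warp access restriction can be sketched with simple arithmetic. The capacity figures come from the text; the even split of lanes into contiguous blocks of 32 per warp is an assumption made here for illustration, and the names are hypothetical.

```python
TMEM_LANES = 128          # rows (from the text)
TMEM_COLUMNS = 512        # columns (from the text)
CELL_BYTES = 4            # bytes per cell (from the text)
WARPS_PER_WARPGROUP = 4


def tmem_bytes() -> int:
    """Total TMEM capacity per SM in bytes."""
    return TMEM_LANES * TMEM_COLUMNS * CELL_BYTES


def lanes_for_warp(warp_in_group: int) -> range:
    """Lanes reachable by one warp, assuming an even, contiguous split (illustrative)."""
    if not 0 <= warp_in_group < WARPS_PER_WARPGROUP:
        raise ValueError("warp index must be in [0, 4)")
    lanes_per_warp = TMEM_LANES // WARPS_PER_WARPGROUP
    start = warp_in_group * lanes_per_warp
    return range(start, start + lanes_per_warp)
```

The total works out to 262144 bytes, i.e. 256 KB, and under the assumed split all four warps of a warpgroup are needed to cover the 128 lanes.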

Two CTAs in a thread block cluster form a CTA pair if their CTA ranks in their thread block cluster differ by the last bit, e.g. 0 and 1, 4 and 5. A CTA pair maps to a Texture Processing Cluster (TPC), which consists of two SMs and combines with other TPCs to form a GPC. When Blackwell Tensor Core operations perform at a CTA pair granularity, the two CTAs are able to share input operands. This sharing reduces both SMEM capacity and bandwidth requirements.
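The CTA pair rule, where two ranks differ only in the last bit, amounts to flipping the lowest bit of the rank. A minimal sketch, with illustrative function names:

```python
def peer_cta_rank(rank: int) -> int:
    """Rank of the other CTA in the pair: flip the lowest bit."""
    return rank ^ 1


def is_cta_pair(rank_a: int, rank_b: int) -> bool:
    """True when the two ranks differ only in their last bit."""
    return (rank_a ^ rank_b) == 1
```

Ranks 0 and 1 or 4 and 5 form pairs, whereas ranks 1 and 2 do not, even though they are adjacent.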

The 5th-generation Tensor Core MMA instruction (tcgen05.mma in PTX) fully moved away from using registers for holding matrices. Operands now reside in shared memory and Tensor Memory.

Specifically, suppose the MMA computes D = A * B + D: matrix D resides in Tensor Memory, matrix B in shared memory, and matrix A in either. Not using thread registers removes the complex data layouts and frees up thread register space for other work such as epilogue operations. Unlike wgmma, which uses a warpgroup to initiate an MMA operation, tcgen05.mma has single-thread semantics, meaning that a single thread initiates an MMA operation. This removes the role of warps from issuing MMA.

One notable MMA variant is MMA.2SM, which uses 2 SMs to collectively perform an MMA operation. MMA.2SM executes at CTA-pair granularity, and since tcgen05.mma has single-thread semantics, a single thread in the leader CTA of the CTA pair launches MMA.2SM. Here we illustrate data path organization layout A. Layout A shows that MMA.2SM doubles the M dimension compared to the 1SM version (layout D), so the two SMs load different matrix A and D tiles. In addition, MMA.2SM splits matrix B, halving the amount of data loaded.

Matrix B is shared across the two SMs, meaning tiles B0 and B1 need to be communicated across the DSMEM. Although there is a bandwidth difference between DSMEM and SMEM, the effects on the coordination are minimal because we are loading smaller tiles. That said, we suspect that on Blackwell the communication bandwidth between SMs in a TPC is higher than DSMEM’s, so MMA.2SM leverages this to achieve better performance.

5th-gen Tensor Cores can also perform convolutions in addition to general matrix multiplication. tcgen05.mma supports weight-stationary patterns with a collector buffer, which caches matrix B for reuse. For more information, please refer to the CUDA documentation and the corresponding weight-stationary MMA instruction.

In terms of supported data types, Blackwell supports microscaling floating-point format (MXFP), including MXFP8, MXFP6, and MXFP4. See this paper for details. Blackwell also supports NVIDIA’s own NVFP4 format, which is known for being more accurate than MXFP4. This is likely because of its smaller block size, different scaling factor data format, and the two-level quantization method (See this GitHub issue). See this paper for data format comparisons.

With Blackwell, since FP8 and FP6 have the same theoretical throughput, we believe that they share physical circuits in Tensor Cores. In contrast, CDNA4 has 2x the FP6 throughput compared to FP8 because their FP6 units share data paths with FP4 instead. We believe that UDNA will switch to having FP6 units share with FP8 instead.

Ampere featured 2:4 structured sparsity, which in theory doubled the Tensor Core throughput. It achieves this by pruning the weight matrix such that for every 4 elements, 2 of them are zero. In this format, the matrix is compressed by removing zero elements, and an additional metadata index matrix records their positions, roughly halving the memory usage and bandwidth.
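The compression idea behind 2:4 structured sparsity can be sketched as follows. The metadata layout here, a list of kept positions per group of four, is a hypothetical simplification and not the hardware’s packed format; the function names are illustrative.

```python
def compress_2_4(row):
    """Compress a 2:4-sparse row into (values, positions), keeping 2 of every 4 elements."""
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    values, positions = [], []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        kept = [i for i, v in enumerate(group) if v != 0]
        if len(kept) > 2:
            raise ValueError("group violates the 2:4 pattern")
        # Pad with zero-valued positions so exactly two entries are stored per group.
        for i in range(4):
            if len(kept) == 2:
                break
            if i not in kept:
                kept.append(i)
        kept.sort()
        values.extend(group[i] for i in kept)
        positions.extend(kept)
    return values, positions


def decompress_2_4(values, positions):
    """Rebuild the dense row from compressed values and their in-group positions."""
    row = []
    for start in range(0, len(values), 2):
        group = [0, 0, 0, 0]
        for v, i in zip(values[start:start + 2], positions[start:start + 2]):
            group[i] = v
        row.extend(group)
    return row
```

The compressed value array is half the length of the dense row, which is where the roughly 2x saving in memory and bandwidth comes from; the position metadata adds a small overhead on top.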

According to this microbenchmarking paper from cracked Chinese engineers, Ampere’s structured sparsity can realize a 2x speedup for large-shape MMA operations at the instruction level. It also shows that on Hopper, structured sparsity instructions can reach a 2x speedup and save up to 2x on the memory bandwidth used to load weights.

Unfortunately, 2:4 structured sparsity GEMM kernels are unable to reach anywhere close to a 2x speedup compared to their dense counterparts on Hopper. This is due to difficulties in doing structured pruning while maintaining model accuracy, cuSPARSELt kernels being unoptimized, and TDP limitations. Except for Chinese AI labs and a limited number of experimental western research papers, most AI labs ignore 2:4 structured sparsity for production inferencing and focus on quantization & distillation. Meta is experimenting with it in Llama, but that is a dead-end path in many cases as well.

Furthermore, there is a lack of closed or open models that have shown performance improvements with 2:4 FP8 structured sparsity or 4:8 FP4 structured sparsity while maintaining zero accuracy loss, and a general lack of resources dedicated to structured pruning. We recommend that NVIDIA stop with Jensen math structured sparsity flops in keynotes & marketing material unless they start consistently showing SOTA open models being able to take advantage of structured pruning for inferencing. A good first step would be to do structured sparsity on DeepSeek and also show that performance can stack on top of other techniques like distillation & quantization like NVFP4.

In its fifth-generation Tensor Cores, NVIDIA introduced pair-wise 4:8 structured sparsity for the NVFP4 data type. In this scheme, every eight elements are grouped into four consecutive pairs, and exactly two of those pairs must contain non-zero values while the remaining two are pruned to zero. Because NVFP4 is a sub-byte data type, we believe this constraint motivated NVIDIA to adopt the pair-wise 4:8 pattern. Although 4:8 sparsity may appear more permissive than the earlier 2:4 pattern, the added pair-wise requirement means it is not, in practice, a more relaxed constraint for ML engineers seeking to preserve model accuracy while pruning.
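To see why the pair-wise requirement is stricter than an unconstrained 4:8 pattern, the sketch below checks both conditions on a block of eight elements. It treats a pair as pruned when both of its elements are zero; the function names are illustrative.

```python
def satisfies_plain_4_8(block) -> bool:
    """Unconstrained 4:8: at most 4 of the 8 elements are non-zero."""
    if len(block) != 8:
        raise ValueError("block must have 8 elements")
    return sum(1 for v in block if v != 0) <= 4


def satisfies_pairwise_4_8(block) -> bool:
    """Pair-wise 4:8: at least two of the four consecutive pairs are entirely zero."""
    if len(block) != 8:
        raise ValueError("block must have 8 elements")
    pairs = [block[i:i + 2] for i in range(0, 8, 2)]
    zero_pairs = sum(1 for pair in pairs if all(v == 0 for v in pair))
    return zero_pairs >= 2
```

A block such as [1, 0, 2, 0, 3, 0, 4, 0] has only four non-zero elements and would pass an unconstrained 4:8 check, but it fails the pair-wise check because every pair contains a non-zero value.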

Over generations, NVIDIA scaled the Tensor Core size more aggressively than the number of Tensor Cores. NVIDIA chose scaling the tensor core size rather than number of cores because it suits the performance characteristics of matrix multiplication better. Specifically, when scaling the problem size, matrix multiplication computation grows cubically, but data movement grows quadratically, meaning the arithmetic intensity grows linearly. O(n) arithmetic intensity, combined with the fact that data movement is more expensive than computation, incentivized the tensor core size increase.
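The linear growth of arithmetic intensity can be checked with a simple model of an n x n x n matrix multiplication. It assumes each of the three matrices is moved exactly once, which is an idealization, and the function name is hypothetical.

```python
def arithmetic_intensity(n: int, bytes_per_element: int = 2) -> float:
    """FLOPs per byte moved for an n x n x n matmul, assuming A, B, and C each move once."""
    flops = 2 * n ** 3                          # n^3 multiply-adds, 2 FLOPs each
    bytes_moved = 3 * n ** 2 * bytes_per_element  # read A, read B, write C
    return flops / bytes_moved
```

Under this model the intensity simplifies to 2n / (3 * bytes_per_element), so doubling n doubles the amount of computation available per byte of data movement.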

However, both scaling core size and number of cores come at the cost of quantization effects. Specifically, having a large number of cores suffers from the tile quantization effect, and having a large core size leads to the wave quantization effect. The wave quantization effect occurs when the number of work units isn’t fully divisible by the number of workers, causing utilization to drop when processing the final, smaller batch of work. Increasing Tensor Core size is essentially increasing the work unit size, resulting in low utilization for small matrices (see this ThunderKittens blog post).
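The wave quantization effect described above can be quantified with a short utilization model. It assumes equal-sized work units that are processed in full waves across identical workers; the worker count in the example is illustrative.

```python
def wave_utilization(num_work_units: int, num_workers: int) -> float:
    """Fraction of worker slots doing useful work when units run in full waves."""
    if num_work_units <= 0 or num_workers <= 0:
        raise ValueError("arguments must be positive")
    waves = (num_work_units + num_workers - 1) // num_workers  # ceiling division
    return num_work_units / (waves * num_workers)
```

With 100 workers, 100 work units reach full utilization, but 101 work units need a second, nearly empty wave and utilization drops to about 50%.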

The linear growth in arithmetic intensity also motivates the increase in MMA shape. Having larger MMA shapes enhances the operand sharing granularity. Specifically, launching fewer larger tiles would increase the data reuse, saving memory footprint and bandwidth of RF and SMEM. For architectures before Blackwell, this led to increasing the number of threads to collectively perform an MMA operation, from a quadpair of 8 threads (Volta), to a warp of 32 threads (Ampere), and then a warpgroup of 128 threads (Hopper).

Shared memory increased almost every generation, while register file size stayed constant. The reason for this is that Tensor Core throughput increase requires a deeper staging buffer.

Because Tensor Cores consume data much faster than global memory can load, we use a staging memory to buffer data, so memory loading can run ahead of MMA operations. Tensor Core throughput doubled every generation, but global memory load latency didn’t decrease and in fact increased. As a result, we need to increase the staging memory size for buffering more data. To implement this, NVIDIA chose shared memory as the staging memory for Tensor Cores, which explains why shared memory increased but register file size remained constant.
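One way to reason about the staging buffer size is a Little’s-law style estimate: the buffer must hold the data consumed during one load latency. This model and the numbers in the usage note are assumptions for illustration, not figures from the article.

```python
def staging_bytes_needed(consume_bytes_per_cycle: int, load_latency_cycles: int) -> int:
    """Bytes that must be in flight to keep the consumer busy for one load latency.

    Assumption: a steady-state model where throughput times latency gives the
    minimum buffered data; real kernels also depend on tile shapes and pipelining depth.
    """
    if consume_bytes_per_cycle < 0 or load_latency_cycles < 0:
        raise ValueError("arguments must be non-negative")
    return consume_bytes_per_cycle * load_latency_cycles
```

In this model, if the consumption rate doubles while the latency stays the same, the required staging memory doubles as well, for example from 256,000 bytes to 512,000 bytes with a hypothetical 1000-cycle latency.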

However, Blackwell’s shared memory size didn’t increase from Hopper. This is because tcgen05 MMA can leverage 2 SMs, so each SM’s shared memory only needs to load half of the operands. Thus, Blackwell’s shared memory size effectively doubled.

NVIDIA’s staging memory choice also explains why operand locations gradually moved away from registers to shared memory. That said, NVIDIA added TMEM on Blackwell to support the increased Tensor Core throughput. Since TMEM is placed closer to Tensor Cores, it can be more power efficient. In addition, having a separate memory increases the aggregate memory bandwidth for saturating the Tensor Cores.

Among all operands, matrix D always stays in TMEM. We can take advantage of TMEM’s power efficiency with this design because matrix D is more frequently accessed than matrix A and B. For example, to compute a tile in a naive tiled matrix multiplication, matrix D tile is accessed 2Kt times (Kt reads and Kt writes. Kt: The number of tiles along the K dimension), whereas matrix A tiles and matrix B tiles are accessed only once.
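The access counts for one output tile in a naive tiled matrix multiplication can be tallied with a small loop. This is a counting sketch only, with hypothetical names, and it performs no actual arithmetic.

```python
from collections import Counter


def count_tile_accesses(k_tiles: int) -> Counter:
    """Count accesses per tile while computing a single output tile D."""
    accesses = Counter()
    for kt in range(k_tiles):
        accesses[("A", kt)] += 1  # load the kt-th A tile along K
        accesses[("B", kt)] += 1  # load the kt-th B tile along K
        accesses[("D", 0)] += 1   # read the accumulator tile
        accesses[("D", 0)] += 1   # write the accumulator tile back
    return accesses
```

For Kt = 8, the D tile is touched 16 times while each individual A and B tile is touched once, which is why keeping D in the most power-efficient memory pays off.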

The “H” in HMMA stands for half precision, since it is a 16-bit format, while the “Q” in QMMA stands for quarter precision (8-bit), since 8 bits is a quarter of full precision (32 bits). The “O” in OMMA stands for “octal”, meaning one eighth of 32 bits, as OMMA is FP4.

MMA instructions seemingly jumped from synchronous to asynchronous. In reality, MMA instructions gradually became asynchronous at the SASS level because of the need to overlap instructions.

At the SASS level, an MMA operation involves executing one LDSM instruction to load matrix tiles from shared memory to the register file, and then two HMMA instructions to perform the MMA. During execution, the two HMMA instructions are issued asynchronously and block register usage with hardware interlocks. Since hardware interlocks disallow overlapping LDSM instructions, the sequential execution of one LDSM and two HMMA instructions creates a small bubble in the instruction issue pipeline. However, Tensor Cores have become so fast that this bubble causes a non-negligible performance loss, which calls for an asynchronous completion mechanism for MMA.

Hopper supports the asynchronous completion mechanisms commit and fence for wgmma. When the MMA instructions are issued, there are no hardware interlocks to guard register usage. Instead, the compiler schedules the loads for the next MMA and uses a fence instruction to keep the next MMA waiting. With Blackwell, the MMA operation is fully asynchronous. Instructions for loading into Tensor Memory (tcgen05.ld / tcgen05.st / tcgen05.cp) are all explicitly asynchronous.

Throughout each successive generation of NVIDIA Tensor Cores, NVIDIA continues to add lower precision data types, starting from 16-bit to 4-bits. This is because deep learning workloads are extremely tolerant of low precision. This is especially true for inference, where even lower precision can be used than during training. Low precision is more power efficient, takes up less silicon floor space and achieves higher compute throughput. In newer generations, we also see NVIDIA removing FP64 support to prioritize low precision data types under silicon area and power budgets.

Interestingly, the prioritization also affected integer data type support. Since Hopper, INT4 data types are deprecated, and on Blackwell Ultra, we see lower INT8 compute throughput. This is caused by the delayed popularity of low-precision integer data types. Although Turing supported INT8 and INT4, it wasn’t until 4 years later that new inference quantization methods were able to exploit the compactness of INT4 for serving LLMs. By that time, NVIDIA had already deprecated INT4 on Hopper.

Next, we will talk about how the programming model evolved, including the transition from high-occupancy to single-occupancy, the increase in explicit asynchronous execution, and how those designs relate to NVIDIA betting on strong scaling.

If readers would like to learn the basics of the CUDA programming model, hardware, and concepts, the GPU Glossary by Modal is a great resource for everything before Blackwell. To understand the big ideas of CUDA, we recommend all of Stephen Jones’ GTC talks (playlist here). For a deeper understanding of the memory features, the GTC talk CUDA Techniques to Maximize Memory Bandwidth and Hide Latency explains the memory features of Volta, Ampere, and Hopper, and Advanced Performance Optimization in CUDA dives deep into memory models. Finally, for Blackwell-specific resources, we recommend the GTC talk Programming Blackwell Tensor Cores with CUTLASS, the Colfax Research CUTLASS articles (latest one here), and the CUTLASS kernel examples.

Read the whole story

bernhardbock

17 days ago

reply

Anthropic: How we built our multi-agent research system
Tuesday June 24th, 2025 at 9:15 AM

Simon Willison's Weblog

Anthropic: How we built our multi-agent research system. OK, I'm sold on multi-agent LLM systems now.

I've been pretty skeptical of these until recently: why make your life more complicated by running multiple different prompts in parallel when you can usually get something useful done with a single, carefully-crafted prompt against a frontier model?

This detailed description from Anthropic about how they engineered their "Claude Research" tool has cured me of that skepticism.

Reverse engineering Claude Code had already shown me a mechanism where certain coding research tasks were passed off to a "sub-agent" using a tool call. This new article describes a more sophisticated approach.

They start strong by providing a clear definition of how they'll be using the term "agent" - it's the "tools in a loop" variant:

A multi-agent system consists of multiple agents (LLMs autonomously using tools in a loop) working together. Our Research feature involves an agent that plans a research process based on user queries, and then uses tools to create parallel agents that search for information simultaneously.

Why use multiple agents for a research system?

The essence of search is compression: distilling insights from a vast corpus. Subagents facilitate compression by operating in parallel with their own context windows, exploring different aspects of the question simultaneously before condensing the most important tokens for the lead research agent. [...]

Our internal evaluations show that multi-agent research systems excel especially for breadth-first queries that involve pursuing multiple independent directions simultaneously. We found that a multi-agent system with Claude Opus 4 as the lead agent and Claude Sonnet 4 subagents outperformed single-agent Claude Opus 4 by 90.2% on our internal research eval. For example, when asked to identify all the board members of the companies in the Information Technology S&P 500, the multi-agent system found the correct answers by decomposing this into tasks for subagents, while the single agent system failed to find the answer with slow, sequential searches.

As anyone who has spent time with Claude Code will already have noticed, the downside of this architecture is that it can burn a lot more tokens:

There is a downside: in practice, these architectures burn through tokens fast. In our data, agents typically use about 4× more tokens than chat interactions, and multi-agent systems use about 15× more tokens than chats. For economic viability, multi-agent systems require tasks where the value of the task is high enough to pay for the increased performance. [...]

We’ve found that multi-agent systems excel at valuable tasks that involve heavy parallelization, information that exceeds single context windows, and interfacing with numerous complex tools.

The key benefit is all about managing that 200,000 token context limit. Each sub-task has its own separate context, allowing much larger volumes of content to be processed as part of the research task.

Providing a "memory" mechanism is important as well:

The LeadResearcher begins by thinking through the approach and saving its plan to Memory to persist the context, since if the context window exceeds 200,000 tokens it will be truncated and it is important to retain the plan.

The rest of the article provides a detailed description of the prompt engineering process needed to build a truly effective system:

Early agents made errors like spawning 50 subagents for simple queries, scouring the web endlessly for nonexistent sources, and distracting each other with excessive updates. Since each agent is steered by a prompt, prompt engineering was our primary lever for improving these behaviors. [...]

In our system, the lead agent decomposes queries into subtasks and describes them to subagents. Each subagent needs an objective, an output format, guidance on the tools and sources to use, and clear task boundaries.

They got good results from having special agents help optimize those crucial tool descriptions:

We even created a tool-testing agent—when given a flawed MCP tool, it attempts to use the tool and then rewrites the tool description to avoid failures. By testing the tool dozens of times, this agent found key nuances and bugs. This process for improving tool ergonomics resulted in a 40% decrease in task completion time for future agents using the new description, because they were able to avoid most mistakes.

Sub-agents can run in parallel which provides significant performance boosts:

For speed, we introduced two kinds of parallelization: (1) the lead agent spins up 3-5 subagents in parallel rather than serially; (2) the subagents use 3+ tools in parallel. These changes cut research time by up to 90% for complex queries, allowing Research to do more work in minutes instead of hours while covering more information than other systems.
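The first kind of parallelization described above can be sketched with Python's asyncio. This is only an illustration: `run_subagent` and `lead_agent` are hypothetical names, and the subagent body is a stand-in that sleeps instead of calling a model with tools.

```python
import asyncio

async def run_subagent(task: str) -> str:
    """Hypothetical stand-in for a subagent; a real one would run an LLM tool loop."""
    await asyncio.sleep(0.1)  # simulate search and tool latency
    return f"findings for: {task}"

async def lead_agent(subtasks: list[str]) -> list[str]:
    # Start every subagent at once and wait for all of them, instead of
    # awaiting each one in turn (which would cost the sum of their latencies).
    return await asyncio.gather(*(run_subagent(t) for t in subtasks))

subtasks = [
    "board members of company A",
    "board members of company B",
    "board members of company C",
]
results = asyncio.run(lead_agent(subtasks))
for line in results:
    print(line)
```

`asyncio.gather` returns results in the order the subtasks were submitted, so the lead agent can match each finding back to the subtask that produced it.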

There's also an extensive section about their approach to evals - they found that LLM-as-a-judge worked well for them, but human evaluation was essential as well:

We often hear that AI developer teams delay creating evals because they believe that only large evals with hundreds of test cases are useful. However, it’s best to start with small-scale testing right away with a few examples, rather than delaying until you can build more thorough evals. [...]

In our case, human testers noticed that our early agents consistently chose SEO-optimized content farms over authoritative but less highly-ranked sources like academic PDFs or personal blogs. Adding source quality heuristics to our prompts helped resolve this issue.

There's so much useful, actionable advice in this piece. I haven't seen anything else about multi-agent system design that's anywhere near this practical.

They even added some example prompts from their Research system to their open source prompting cookbook. Here's the bit that encourages parallel tool use:

<use_parallel_tool_calls> For maximum efficiency, whenever you need to perform multiple independent operations, invoke all relevant tools simultaneously rather than sequentially. Call tools in parallel to run subagents at the same time. You MUST use parallel tool calls for creating multiple subagents (typically running 3 subagents at the same time) at the start of the research, unless it is a straightforward query. For all other queries, do any necessary quick initial planning or investigation yourself, then run multiple subagents in parallel. Leave any extensive tool calls to the subagents; instead, focus on running subagents in parallel efficiently. </use_parallel_tool_calls>

And an interesting description of the OODA research loop used by the sub-agents:

Research loop: Execute an excellent OODA (observe, orient, decide, act) loop by (a) observing what information has been gathered so far, what still needs to be gathered to accomplish the task, and what tools are available currently; (b) orienting toward what tools and queries would be best to gather the needed information and updating beliefs based on what has been learned so far; (c) making an informed, well-reasoned decision to use a specific tool in a certain way; (d) acting to use this tool. Repeat this loop in an efficient way to research well and learn based on new results.
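As a rough illustration of that loop's control flow, here is a small sketch. Everything in it is an assumption made for the example: `research_loop`, the tool names and the "first untried candidate" decision rule are hypothetical, and in the real system each step is carried out by the model rather than by fixed code.

```python
def research_loop(needed, tools, max_steps=10):
    """Run an observe/orient/decide/act loop until every topic has a finding."""
    gathered = {}   # topic -> finding
    tried = set()   # (tool_name, topic) pairs already attempted
    for _ in range(max_steps):
        # Observe: what has been gathered and what is still missing.
        missing = [t for t in needed if t not in gathered]
        if not missing:
            break
        # Orient: list the tool/topic combinations that have not been tried yet.
        candidates = [(name, t) for t in missing for name in tools
                      if (name, t) not in tried]
        if not candidates:
            break
        # Decide: pick one action (here simply the first candidate).
        name, topic = candidates[0]
        # Act: use the tool, then record the outcome for the next iteration.
        tried.add((name, topic))
        finding = tools[name](topic)
        if finding is not None:
            gathered[topic] = finding
    return gathered

# Hypothetical tools: the web search cannot answer one of the topics.
tools = {
    "web_search": lambda topic: None if topic == "internal metrics"
                  else f"web result for {topic}",
    "doc_lookup": lambda topic: f"doc entry for {topic}",
}
findings = research_loop(["release date", "internal metrics"], tools)
print(findings)
```

When the web search returns nothing for a topic, the next pass observes that the topic is still missing and falls back to the other tool, which is the "learn based on new results" part of the prompt.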

Tips on prompting ChatGPT for UK technology secretary Peter Kyle
Friday June 6th, 2025 at 11:00 AM

Simon Willison's Weblog

3rd June 2025

Back in March New Scientist reported on a successful Freedom of Information request they had filed requesting UK Secretary of State for Science, Innovation and Technology Peter Kyle’s ChatGPT logs:

New Scientist has obtained records of Kyle’s ChatGPT use under the Freedom of Information (FOI) Act, in what is believed to be a world-first test of whether chatbot interactions are subject to such laws.

What a fascinating precedent this could set!

They picked out some highlights they thought were particularly newsworthy. Personally I’d have loved to see that raw data to accompany the story.

Among the questions Kyle asked of ChatGPT was this one:

Why is AI adoption so slow in the UK small and medium business community?

(I pinged the New Scientist reporter, Chris Stokel-Walker, to confirm the exact wording here.)

This provides an irresistible example of the “jagged frontier” of LLMs in action. LLMs are great at some things, terrible at others and the difference between the two is often not obvious at all.

Experienced prompters will no doubt have the same reaction I did: that’s not going to give an accurate response! It’s worth digging into why those of us with a firmly developed sense of intuition around LLMs would jump straight to that conclusion.

The problem with this question is that it assumes a level of omniscience that even the very best LLMs do not possess.

At the very best, I would expect this prompt to spit out the approximate average of what had been published on that subject in time to be hoovered up by the training data for the GPT-4o training cutoff of September 2023.

(Here’s what I got just now running it against GPT-4o.)

This illustrates the first lesson of effective LLM usage: know your training cutoff dates. For many queries these are an essential factor in whether or not the LLM is likely to provide you with a useful answer.

Given the pace of change in the AI landscape, an answer based on September 2023 training data is unlikely to offer useful insights into the state of things in 2025.

It’s worth noting that there are tools that might do better at this. OpenAI’s Deep Research tool for example can run a barrage of searches against the web for recent information, then spend multiple minutes digesting those results, running follow-up searches and crunching that together into an impressive looking report.

(I still wouldn’t trust it for a question this broad though: the report format looks more credible than it is, and can suffer from misinformation by omission which is very difficult to spot.)

Deep Research only rolled out in February this year, so it is unlikely to be the tool Peter Kyle was using given likely delays in receiving the requested FOIA data.

What I would do instead

Off the top of my head, here are examples of prompts I would use if I wanted to get ChatGPT’s help digging into this particular question:

“Brainstorm potential reasons that UK SMBs might be slow to embrace recent advances in AI.” This would give me a starting point for my own thoughts about the subject, and may highlight some things I hadn’t considered that I should look into further.
“Identify key stakeholders in the UK SMB community who might have insights on this issue.” I wouldn’t expect anything comprehensive here, but it might turn up some initial names I could reach out to for interviews or further research.
“I work in UK Government: which departments should I contact that might have relevant information on this topic?” Given the size and complexity of the UK government even cabinet ministers could be excused from knowing every department.
“Suggest other approaches I could take to research this issue.” Another brainstorming prompt. I like prompts like this where “right or wrong” doesn’t particularly matter. LLMs are electric bicycles for the mind.
“Use your search tool: find recent credible studies on the subject and identify their authors.” I’ve been getting some good results from telling LLMs with good search tools, like o3 and o4-mini, to evaluate the “credibility” of sources they find. It’s a dumb prompting hack but it appears to work quite well: you can watch their reasoning traces and see how they place more faith in papers from well known publications, or newspapers with strong reputations for fact checking.

Prompts that do make sense

From the New Scientist article:

As well as seeking this advice, Kyle asked ChatGPT to define various terms relevant to his department: antimatter, quantum and digital inclusion. Two experts New Scientist spoke to said they were surprised by the quality of the responses when it came to ChatGPT’s definitions of quantum. “This is surprisingly good, in my opinion,” says Peter Knight at Imperial College London. “I think it’s not bad at all,” says Cristian Bonato at Heriot-Watt University in Edinburgh, UK.

This doesn’t surprise me at all. If you ask a good LLM for definitions of terms with strong, well established meanings you’re going to get great results almost every time.

My rule of thumb used to be that if a friend who had just read the Wikipedia page on a subject could answer my question then an LLM will be able to answer it too.

As the frontier models have grown stronger I’ve upgraded that rule of thumb. I now expect a good result for any mainstream-enough topic for which there was widespread consensus prior to that all-important training cutoff date.

Once again, it all comes down to intuition. The only way to get really strong intuition as to what will work with LLMs is to spend a huge amount of time using them, and paying a skeptical eye to everything that they produce.

Treating ChatGPT as an all knowing Oracle for anything outside of a two year stale Wikipedia version of the world’s knowledge is almost always a mistake.

Treating it as a brainstorming companion and electric bicycle for the mind is, I think, a much better strategy.

Should the UK technology secretary be using ChatGPT?

Some of the reporting I’ve seen around this story has seemed to suggest that Peter Kyle’s use of ChatGPT is embarrassing.

Personally, I think that if the UK’s Secretary of State for Science, Innovation and Technology was not exploring this family of technologies it would be a dereliction of duty!

The thing we can’t tell from these ChatGPT logs is how dependent he was on these results.

Did he idly throw some questions at ChatGPT out of curiosity to see what came back, then ignore that entirely, engage with his policy team and talk to experts in the field to get a detailed understanding of the issues at hand?

Or did he prompt ChatGPT, take the results as gospel and make policy decisions based on that sloppy interpretation of a two-year stale guess at the state of the world?

Those are the questions I’d like to see answered.

Introduction
Thursday April 3rd, 2025 at 6:10 AM

Module Federation is an architectural pattern for the decentralization of JavaScript applications (similar to microservices on the server-side). It allows you to share code and resources among multiple JavaScript applications (or micro-frontends). This can help you:

Reduce code duplication
Improve code maintainability
Lower the overall size of your applications
Enhance the performance of your applications

✨ What is Module Federation 2.0?

Module Federation 2.0 differs from the Module Federation built into Webpack 5: beyond the core features of module export, loading, and dependency sharing, it also provides dynamic type hinting, a Manifest, the Federation Runtime, and a Runtime Plugin System. These features make Module Federation more suitable for use as a micro-frontend architecture in large-scale Web applications.

🔥 Features

Module Federation has the following features:

🎯 Use Cases

Module Federation is suitable for the following scenarios:

Large Applications: For large applications, you can break the application into multiple micro-frontends and use Module Federation to share code and resources between them.
Microfrontend Architecture: Module Federation is an ideal tool for building microfrontend architectures.
Multi-team Development: Module Federation can assist multiple teams in collaboratively developing large applications.

🕠 History of Module Federation

Module Federation is a new feature introduced in Webpack 5, but its history dates back to 2017. At that time, the Webpack team began exploring a way to share code between multiple applications.

In 2018, Webpack 4.20 was released, introducing module hooks, which laid the foundation for the development of Module Federation.
In 2020, Webpack 5 was released, officially introducing the Module Federation feature.

Module Federation has become a powerful tool for building modern web applications.

🕰️ The Future of Module Federation

Module Federation aims to become an architectural method for building large web applications, similar to microservices in the backend. Module Federation will provide more capabilities to meet the foundational needs of large web application decentralization, currently including these parts:

Providing comprehensive Devtool tools
Offering more high-level framework capabilities like Router, Sandbox, SSR
Providing best practices for large web applications based on Module Federation

GitHub - PriorLabs/TabPFN
Thursday April 3rd, 2025 at 5:45 AM

Official installation (pip)

pip install tabpfn

OR installation from source

pip install "tabpfn @ git+https://github.com/PriorLabs/TabPFN.git"

OR local development installation

git clone https://github.com/PriorLabs/TabPFN.git
pip install -e "TabPFN[dev]"

from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

from tabpfn import TabPFNClassifier

# Load data
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)

# Initialize a classifier
clf = TabPFNClassifier()
clf.fit(X_train, y_train)

# Predict probabilities
prediction_probabilities = clf.predict_proba(X_test)
print("ROC AUC:", roc_auc_score(y_test, prediction_probabilities[:, 1]))

# Predict labels
predictions = clf.predict(X_test)
print("Accuracy", accuracy_score(y_test, predictions))

from sklearn.datasets import fetch_openml
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Assuming there is a TabPFNRegressor (if not, a different regressor should be used)
from tabpfn import TabPFNRegressor

# Load Boston Housing data
df = fetch_openml(data_id=531, as_frame=True)  # Boston Housing dataset
X = df.data
y = df.target.astype(float)  # Ensure target is float for regression

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)

# Initialize the regressor
regressor = TabPFNRegressor()
regressor.fit(X_train, y_train)

# Predict on the test set
predictions = regressor.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, predictions)
r2 = r2_score(y_test, predictions)
print("Mean Squared Error (MSE):", mse)
print("R² Score:", r2)

For optimal performance, use the AutoTabPFNClassifier or AutoTabPFNRegressor for post-hoc ensembling. These can be found in the TabPFN Extensions repository. Post-hoc ensembling combines multiple TabPFN models into an ensemble.

Steps for Best Results:

Install the extensions:

git clone https://github.com/priorlabs/tabpfn-extensions.git
pip install -e tabpfn-extensions

from tabpfn_extensions.post_hoc_ensembles.sklearn_interface import AutoTabPFNClassifier

clf = AutoTabPFNClassifier(max_time=120, device="cuda")  # 120 seconds tuning time
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)

Choose the right TabPFN implementation for your needs:

TabPFN Client
Simple API client for using TabPFN via cloud-based inference.
TabPFN Extensions
A powerful companion repository packed with advanced utilities, integrations, and features - great place to contribute:
- 🔍 interpretability: Gain insights with SHAP-based explanations, feature importance, and selection tools.
- 🕵️‍♂️ unsupervised: Tools for outlier detection and synthetic tabular data generation.
- 🧬 embeddings: Extract and use TabPFN’s internal learned embeddings for downstream tasks or analysis.
- 🧠 many_class: Handle multi-class classification problems that exceed TabPFN's built-in class limit.
- 🌲 rf_pfn: Combine TabPFN with traditional models like Random Forests for hybrid approaches.
- ⚙️ hpo: Automated hyperparameter optimization tailored to TabPFN.
- 🔁 post_hoc_ensembles: Boost performance by ensembling multiple TabPFN models post-training.
✨ To install:
```
git clone https://github.com/priorlabs/tabpfn-extensions.git
pip install -e tabpfn-extensions
```
TabPFN (this repo)
Core implementation for fast and local inference with PyTorch and CUDA support.
TabPFN UX
No-code graphical interface to explore TabPFN capabilities—ideal for business users and prototyping.

Prior Labs License (Apache 2.0 with additional attribution requirement): here

We're building the future of tabular machine learning and would love your involvement:

Connect & Learn:
- Join our Discord Community
- Read our Documentation
- Check out GitHub Issues
Contribute:
- Report bugs or request features
- Submit pull requests
- Share your research and use cases
Stay Updated: Star the repo and join Discord for the latest updates

You can read our paper explaining TabPFN here.

@article{hollmann2025tabpfn,
 title={Accurate predictions on small data with a tabular foundation model},
 author={Hollmann, Noah and M{\"u}ller, Samuel and Purucker, Lennart and
         Krishnakumar, Arjun and K{\"o}rfer, Max and Hoo, Shi Bin and
         Schirrmeister, Robin Tibor and Hutter, Frank},
 journal={Nature},
 year={2025},
 month={01},
 day={09},
 doi={10.1038/s41586-024-08328-6},
 publisher={Springer Nature},
 url={https://www.nature.com/articles/s41586-024-08328-6},
}

@inproceedings{hollmann2023tabpfn,
 title={TabPFN: A transformer that solves small tabular classification problems in a second},
 author={Hollmann, Noah and M{\"u}ller, Samuel and Eggensperger, Katharina and Hutter, Frank},
 booktitle={International Conference on Learning Representations 2023},
 year={2023}
}

Q: What dataset sizes work best with TabPFN?
A: TabPFN is optimized for datasets up to 10,000 rows. For larger datasets, consider using Random Forest preprocessing or other extensions. See our Colab notebook for strategies.

Q: Why can't I use TabPFN with Python 3.8?
A: TabPFN v2 requires Python 3.9+ due to newer language features. Compatible versions: 3.9, 3.10, 3.11, 3.12, 3.13.

Q: How do I use TabPFN without an internet connection?

TabPFN automatically downloads model weights when first used. For offline usage:

Using the Provided Download Script

If you have the TabPFN repository, you can use the included script to download all models (including ensemble variants):

# After installing TabPFN
python scripts/download_all_models.py

This script will download the main classifier and regressor models, as well as all ensemble variant models to your system's default cache directory.

Manual Download

Download the model files manually from HuggingFace:
- Classifier: tabpfn-v2-classifier.ckpt
- Regressor: tabpfn-v2-regressor.ckpt
Place the file in one of these locations:
- Specify directly: TabPFNClassifier(model_path="/path/to/model.ckpt")
- Set environment variable: os.environ["TABPFN_MODEL_CACHE_DIR"] = "/path/to/dir"
- Default OS cache directory:
  - Windows: %APPDATA%\tabpfn\
  - macOS: ~/Library/Caches/tabpfn/
  - Linux: ~/.cache/tabpfn/
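The lookup locations listed above can be checked before going offline with a small helper. This is a sketch only: `default_tabpfn_cache_dir` is a hypothetical function written for this example, and the assumption that the environment variable takes precedence over the OS default may not match the library's actual resolution logic.

```python
import os
import sys
from pathlib import Path

def default_tabpfn_cache_dir(platform: str, env: dict) -> Path:
    """Hypothetical helper mirroring the documented cache locations."""
    override = env.get("TABPFN_MODEL_CACHE_DIR")
    if override:
        # Assumption: an explicit cache directory wins over the OS default.
        return Path(override)
    if platform.startswith("win"):
        return Path(env.get("APPDATA", "")) / "tabpfn"
    if platform == "darwin":
        return Path.home() / "Library" / "Caches" / "tabpfn"
    return Path.home() / ".cache" / "tabpfn"

cache_dir = default_tabpfn_cache_dir(sys.platform, dict(os.environ))
for name in ("tabpfn-v2-classifier.ckpt", "tabpfn-v2-regressor.ckpt"):
    path = cache_dir / name
    print(path, "found" if path.exists() else "missing")
```

If a checkpoint is reported missing, placing the downloaded file at the printed path (or passing `model_path` directly, as noted above) should avoid the first-use download.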

Q: I'm getting a pickle error when loading the model. What should I do?
A: Try the following:

Download the newest version of tabpfn: pip install tabpfn --upgrade
Ensure model files downloaded correctly (re-download if needed)

Q: Can TabPFN handle missing values?
A: Yes!

Q: How can I improve TabPFN’s performance?
A: Best practices:

Use AutoTabPFNClassifier from TabPFN Extensions for post-hoc ensembling
Feature engineering: Add domain-specific features to improve model performance
Not effective:
- Adapt feature scaling
- Convert categorical features to numerical values (e.g., one-hot encoding)

python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
git clone https://github.com/PriorLabs/TabPFN.git
cd TabPFN
pip install -e ".[dev]"
pre-commit install

pre-commit run --all-files

Minimal CSS-only blurry image placeholders
Thursday April 3rd, 2025 at 4:15 AM
