Next Generation of FlashAttention

NVIDIA is excited to collaborate with Colfax, Together.ai, Meta, and Princeton University on their recent work exploiting the NVIDIA Hopper GPU architecture and Tensor Cores to accelerate key fused attention kernels using CUTLASS 3.

FlashAttention-3 incorporates key techniques to achieve 1.5–2.0x faster performance than FlashAttention-2 with FP16, reaching up to 740 TFLOPS. With FP8, FlashAttention-3 reaches up to 1.2 PFLOPS, with 2.6x smaller numerical error than baseline FP8 attention.
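For context, the operation these kernels fuse is standard scaled dot-product attention, softmax(QK^T / sqrt(d)) V. The sketch below is a minimal, unfused PyTorch reference of that math, not the FlashAttention-3 API; the shapes and dtypes are illustrative assumptions. A fused kernel computes the same result in a single pass, without materializing the full score matrix in global memory.

```python
# Minimal, unfused reference for scaled dot-product attention.
# Illustrative sketch only -- not the FlashAttention-3 interface.
import math
import torch

def reference_attention(q, k, v):
    # q, k, v: (batch, heads, seqlen, head_dim)
    scale = 1.0 / math.sqrt(q.size(-1))
    scores = torch.matmul(q, k.transpose(-2, -1)) * scale  # (B, H, S, S)
    probs = torch.softmax(scores, dim=-1)
    return torch.matmul(probs, v)                          # (B, H, S, d)

# Assumed example shapes; FP16 on GPU, FP32 fallback on CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32
q = torch.randn(1, 8, 1024, 64, dtype=dtype, device=device)
k, v = torch.randn_like(q), torch.randn_like(q)
out = reference_attention(q, k, v)  # same result a fused kernel produces
```

The unfused version writes the (S, S) score matrix to memory; fusing the matmul, softmax, and second matmul into one kernel is what lets FlashAttention scale to long sequences and approach peak Tensor Core throughput.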

CUTLASS is an open-source CUDA library that enables deep learning and HPC practitioners to achieve speed-of-light performance on NVIDIA Tensor Core GPUs, for custom algorithms and for research and production workloads alike.

For more information about the collaboration, see the post FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision and the accompanying research paper.
