An interview with Mike Clark, the Father of Zen — ‘Zen Daddy’ says 3nm Zen 5 is coming faster than you think; also talks compact cores for desktop chips


We interviewed Mike Clark, AMD’s Corporate Fellow Silicon Design Engineer, during the company’s recent Tech Day, where it unveiled the Zen 5 microarchitecture that powers the company’s Ryzen 9000 and Ryzen AI 300 processors. Clark, known as the ‘Father of Zen’ or, depending on which AMD employee you ask, the ‘Zen Daddy,’ has worked on AMD’s CPU architectures for 31 years. He was the lead architect of the first generation of Zen, which he unveiled at Hot Chips in 2016 while the company was teetering on the edge of bankruptcy. Over the last seven years, AMD has unveiled five generations of Zen, each delivering double-digit instructions per clock (IPC) improvements. Clark has led Zen’s development through all five generations, with a sixth in the hopper, transforming AMD from a struggling chipmaker into a stock market darling that has clawed back a significant amount of market share from Intel. AMD now has nearly twice the market cap of its long-time foe, and the architectures Clark drove served as the fuel for those stellar results.

AMD has used its compact Zen ‘c’ cores, smaller cores designed for background tasks much like Intel’s E-cores, to reduce cost and boost performance in its laptop processors. However, unlike its competitor, AMD hasn’t yet brought those cores to its desktop lineup. Zen 5c marks the second iteration of AMD’s compact cores, but they are not currently planned for the Ryzen 9000 family. Still, Clark said he thinks compact cores will come to future Ryzen desktop chips, and he expanded on the techniques the company uses for its unique implementation.

AMD’s Zen 5 architecture will span both the 4nm and 3nm process nodes, powering the next generation of AMD’s entire CPU product stack, from desktop and mobile PCs to its EPYC processors for the data center. Designing one cohesive underlying architecture to address all of those markets is an incredible engineering feat.
Clark expanded on the challenges of designing Zen 5 for both the 4nm and 3nm processes concurrently, saying the two versions are basically arriving “on top of each other.”

Intel has famously abandoned hardware acceleration for high-performance AVX-512 instructions in its consumer chips, but AMD’s Zen 5 marks the debut of full AVX-512 acceleration for the Ryzen family. Unlike Intel, which has to reduce clock speeds when its processors run AVX-512 workloads, AMD says these powerful instructions will run at the same clock speeds as standard integer operations. Clark also expanded on how the company achieved that feat and said that its Zen 5c cores can also run full AVX-512.

Below is a lightly edited transcript of the key points of our conversation with Clark.

Will Zen 5c ‘compact cores’ come to desktop PCs?

AMD’s approach to its compact Zen 5c cores is inherently different than Intel’s approach with its E-cores. As with Intel’s E-cores, AMD’s Zen 5c cores are designed to consume less space on a processor die than the ‘standard’ performance cores while delivering enough performance for less demanding tasks, thus saving power and delivering more compute horsepower per square millimeter than was previously possible (deep dive here). But the similarities end there. Unlike Intel, AMD employs the same microarchitecture and supports the same features in its smaller cores. With Zen 5, AMD has also designed the compact cores to deliver nearly the same performance as the larger cores, preventing the faster Zen 5 cores from waiting on the compact cores during threaded workloads. Clark said that he expects AMD’s compact cores to eventually come to desktop processors, explained that AMD uses a thread placement technique to steer certain workloads to the smaller cores, and expanded on how AMD has shrunk its standard cores to create Zen 5c.
Tom’s Hardware (TH): When you view Zen 5c compact cores, do you think they only have a place in power-constrained environments [mobile]? Could you see this coming over to desktop PCs, where power isn’t a consideration?

Mike Clark (MC): […] If we keep building the compact cores in the way that we talked about—which I think we will; I don’t know why I said it a little more theoretically—the hard part is really making sure we hit the right frequency point so that it’s balanced with however many [cores] you’re going to put down. But let’s say you’re really good at that; then there’s no reason not to put a compact core on a desktop.

Whether it’s the same performance at a given core count for the customer, and cheaper because there’s less area used, or we can squeeze even more cores onto a desktop because of the compact cores. And we couldn’t leverage them [performance cores] anyway because they were TDP-constrained when you got out to that many cores, so you may as well have used a compact core. I think as we get more experience with Windows and see that the scheduling does work well, I think you’ll see us, in desktop, using the compact cores to both get more cores and be more cost-effective. Because it’s wasted area [for performance cores] because we can’t run everything at that 5.7 GHz frequency.

TH: When using compact cores in a heterogeneous design, do you schedule workloads onto those cores using some sort of thread placement?

MC: We don’t have any hardware that can magically move threads or make it transparent to software, so we leverage software. We can build a table of capabilities of the different cores and dynamically update that table to give them feedback as things are going on so that they can manage where to place the thread for a lightly threaded workload. […] We expect both the classic cores and the throughput [Zen 5c] cores to keep up at the same level and not be burdened by the throughput core not really having enough compute.
The algorithm runs at the order of the slowest cores, so those throughput cores can run at a pretty high frequency so that we can handle true multi-threaded workloads. But then when you have multi-processing, you need to be smart about where you place things.

You should test it. I haven’t seen it, but you can run Teams, and you’ll see it on the compact cores. You can open up your browser, and it’ll go over to the performance cores because you need that burstiness. And then, when you’re done, it’ll go away; Teams will still stay on those compact cores, and you’ll get the best of both worlds.

TH: When you are looking at the standard core and shrinking it down while closely matching the performance capabilities so you don’t have thread dependency problems, how do you achieve that? Denser libraries, closer spacing?

MC: It’s more of the latter — the library’s the same. [..] There are sort of logical blocks, and there are even subblocks, but to hit the high frequency in certain critical speed paths, we have to break the design down into small pieces, which we then do custom work on. But at the end of the day, it’s a rectangle; things are further apart than they need to be, there’s whitespace, and that’s all to drive that high frequency. But then we say, ‘Okay, well, lower the max frequency.’ Then we can combine blocks together; we don’t need to do as much custom work, and it can pull the design in. It’s now just naturally smaller because we utilize the space better. When it was bigger, there was extra logic for repeaters and stuff like that, there’s buffering, and that all gets removed.

It’s amazing how much you can shrink the core at whatever target you picked, and then find a bunch of area and power to squeeze out of it. It was really just because of what we had to do to get that high frequency. Now, you could say, ‘Well, why aren’t you better at picking those small bundles?’ But we’ve been doing that for years, and we can’t perfect the smaller blocks.
It’s just kind of in the nature of the design.

How Zen 5 runs at normal frequencies while running AVX-512 workloads

TH: You mentioned that Zen 5 runs AVX-512 instructions at the same clocks as standard instructions. Intel has struggled with this for a long time, and they’ve done all kinds of things, like bifurcating AVX instructions into different classes denoted by power usage. Has Zen 5 employed any notable tweaks to keep the AVX-512 clocks high? What’s your secret to success?

MC: Fundamental to what I would call our secret to success is trying to introduce it at a point where it’s more balanced with the rest of the machine. That’s so it doesn’t look like such a one-off, and so you don’t have to treat it as such a one-off, which leads to all those problems. Now, it can obviously burn more power, but so could AVX-256. But it’s better that things grow together. If you imagine us trying to put AVX-512 on Zen 2, we had just grown from AVX-128 to AVX-256 at that time. I just have this balance thing; that’s what Zen is, and it’s just so in balance.

Now, we’ve learned as well. Even on the integer side, our schedulers burn a lot of power. And so, on both sides, I think a lot of the trick, and I’m sure Intel’s learned this too, is laying out the floor plan in a way that you’re cognizant of where hotspots are going to be, knowing also that you never get everything right, so putting in sensors everywhere, but especially where you’re worried. We’ve been good at getting those to work and using our firmware to manage that dynamically so that we can respond better. There are times when we do have to throttle down because multiple cores are using it, and it’s more TDP-constrained.
But that happens on the integer side, too.

TH: So frequencies would be pretty much in lockstep with integer?

MC: It’s just trying to sense it and react to it enough so that it’s not, ‘Oh, this one guy [core] did it, and we took everyone down [in frequency],’ when it’s not really that serious of a situation. So, it’s a management problem that we’ve grown to understand and deploy across the design, not just for AVX-512.

TH: When we look at the compact cores running AVX-512, do they run that at the standard full data path, full 512-bit width, or do they run double-pumped AVX-256?

MC: We can do either. For what we’re launching today in Strix Point, both the performance core and the compact core have the AVX cut-down [AVX-256] because they’re in a heterogeneous situation, and they’re in a mobile platform where area is at a premium. And while you could argue we could try to have it, we don’t want software to have to try to deal with something like that. Even though we cut it down on the performance core, which helps the area, we can have more throughput cores at some level. But we could build a compact core for other markets, and I think you’ll see that, where we do have the full 512-bit data path as well, because it’s great for AI and vector workloads. Even if it’s a more dense design, that doesn’t mean it doesn’t want great vector performance when it needs it.

The biggest challenge of Zen 5 design

TH: What was the biggest challenge you encountered with Zen 5 development?

MC: It was actually dealing with two technologies [designing Zen 5 for both the 4nm and 3nm process nodes], especially a technology that the previous generation was in. And trying to do so much change, and therefore the unavoidable reality that in 4nm it’s going to [consume] more power than it’s going to in 3nm, no matter how smart we are. But we need that flexibility in our roadmap, and it makes sense.
But still, that was really hard: trying to control having the two technologies and the features, and a feature that looks great in 3nm not looking so great in 4nm because of the power impact of the not-as-efficient transistor and how it affects the floorplan. Normally, we do the architecture in one, and then we port to the next one, and then you have a lot of time to deal with the two technologies in the floor plan. [..] It was just really challenging. But that gives Zen 6 a lot of room to improve.

And we’re going to deliver 3nm here in short order after 4nm; basically, they’re on top of each other. So the design teams building those are separate, but we’re trying to communicate and work together — it is still the same design. We’ve tried to keep it simple for our own sanity. We have all these designs we have to validate and build, and the more they’re different, the more things just get out of control. It drives complexity.

That was a challenge, and one we love because, like I said, now that we’ve done it, we’ve learned a lot from it. We’re going to be able to do it better the next time. That’s what makes this job so fun: constantly learning, constantly new challenges, and new innovation.
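The thread-placement scheme Clark describes, where software keeps a table of per-core capabilities and steers steady background work (like Teams) to compact cores and bursty foreground work (like a browser) to performance cores, can be sketched roughly as follows. This is a hypothetical illustration, not AMD’s actual driver or firmware interface; the core counts, clock speeds, and function names are invented for the example.

```python
from dataclasses import dataclass

@dataclass
class CoreInfo:
    core_id: int
    kind: str            # "performance" or "compact"
    max_freq_ghz: float  # capability hint the scheduler can consult

def build_capability_table():
    # Illustrative topology: 4 performance cores boosting to 5.7 GHz
    # plus 8 compact cores that top out at a lower clock.
    perf = [CoreInfo(i, "performance", 5.7) for i in range(4)]
    compact = [CoreInfo(i + 4, "compact", 3.3) for i in range(8)]
    return perf + compact

def place_thread(table, bursty):
    """Pick a core for a thread: bursty foreground work goes to the
    fastest performance core, steady background work to a compact core."""
    wanted = "performance" if bursty else "compact"
    pool = [c for c in table if c.kind == wanted]
    return max(pool, key=lambda c: c.max_freq_ghz)

table = build_capability_table()
browser_core = place_thread(table, bursty=True)   # needs burst performance
teams_core = place_thread(table, bursty=False)    # light background load
```

In a real system, the table would be updated dynamically (as Clark notes) based on thermal and utilization feedback, and the OS scheduler, not the application, would act on it.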
