What we know about AMD’s next-generation processors

Discussion in 'Windows 7 Hardware' started by kemical, May 19, 2010.

  1. kemical

    Taken from an article found here: What we know about AMD’s next-generation processors

    You might be surprised to learn that AMD is just seven months away from releasing new CPUs based on not one, but three, new designs. The Phenom II that we have known for the past 17 months will soon be put out to pasture, never to be seen again. Its replacements are built for the server, the desktop, the notebook and the netbook.
    Dubbed Bulldozer, Bobcat and Llano, the new processor designs are the final piece of AMD’s grand strategy to emerge from years of debt and struggle as a leaner, meaner company. For enthusiasts, they are something altogether more important: a clear sign that the fascinating war between AMD and Intel is about to go nuclear once again.
    Bulldozer: the chip for enthusiasts

    [IMG] A block diagram of a single Bulldozer module, or core.

    Chips based on Bulldozer will be scalable across any number of what AMD calls “modules” (shown above), each of which contains two CPU cores. It is postulated that each module is equipped with a technology called Cluster-Based Multi-Threading, or CMT.
    To understand CMT, we must first have an understanding of its lesser sibling, Simultaneous Multi-Threading (SMT), which you are likely to know by Intel’s name: Hyper-Threading. Though Intel did not create the technology, their implementation is by far the most famous.
    Intel’s implementation of SMT duplicates architectural states (the part of a CPU which holds the condition of a process) but not the execution engine. This allows their processors to maximize execution resources by busying silicon that would otherwise lie idle, or by injecting threads into the pipeline in the event of a stall.
    To give a real-world analogy, Intel’s implementation of SMT is similar to an automobile assembly plant with only one assembly line capable of taking a car from parts to completion. At every stage of the assembly, however, workers are standing by with completed parts to keep the line moving if there’s a problem. The workers can’t build a car (they don’t have a line), but they can make sure that line is always moving the car on to the next step without issue.
    Intel uses SMT in the same way: to ensure that the processor’s line is always busy moving to the next step, and today’s operating systems are increasingly intelligent at dispatching threads for this setup.
    The “problem” with this implementation of SMT is that one instruction window tracks the dispatch, execution and retirement of both threads. Going back to the assembly line, it would be like putting one supervisor in charge of watching the line and the workers—that supervisor can’t watch for problems with the line and the workers at the same time. Something is bound to fail. On a CPU, as in an assembly line, failures lead to a reduction in apparent performance.
    Each Bulldozer module, meanwhile, puts the plant on steroids not only by adding a second fully-functional assembly line, but by giving each line the ability to break one big stage down into several, parallel stages—little assembly lines that can be created, run, merged and closed on demand without sacrificing the efficiency of the main assembly line. This is CMT, and the Bulldozer can do it.
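    The assembly-line comparison can be made concrete with a toy model. The sketch below is an illustration under loud assumptions, not a simulation of any real Intel or AMD pipeline: each thread is a string in which "X" means the thread needs the execution engine this cycle and "." means it is stalled. The function names and the two sample threads are hypothetical.

```python
def run_one_at_a_time(a, b):
    """One assembly line, no SMT: finish thread a, then thread b."""
    return len(a) + len(b)


def run_smt(a, b):
    """One shared execution engine, two thread states (SMT-style)."""
    threads = (a, b)
    pos = [0, 0]   # index of the next token of each thread
    prefer = 0     # thread that wins the engine when both want it
    cycles = 0
    while pos[0] < len(a) or pos[1] < len(b):
        cycles += 1
        active = [t for t in (0, 1) if pos[t] < len(threads[t])]
        wants = [t for t in active if threads[t][pos[t]] == "X"]
        # A stalled thread just waits out its cycle; it needs no engine.
        for t in active:
            if t not in wants:
                pos[t] += 1
        if wants:
            winner = prefer if prefer in wants else wants[0]
            pos[winner] += 1
            if len(wants) == 2:
                prefer = 1 - winner  # alternate when both threads contend
    return cycles


def run_cmt(a, b):
    """Two integer cores in one module: the threads never compete."""
    return max(len(a), len(b))


thread_a = "XX..XX"
thread_b = "XX..XX"
print(run_one_at_a_time(thread_a, thread_b))  # 12
print(run_smt(thread_a, thread_b))            # 9
print(run_cmt(thread_a, thread_b))            # 6
```

    With these sample threads the shared engine hides the stalls but still serializes the busy cycles, while the two-core module never makes one thread wait on the other. The model deliberately ignores what the second integer core costs in silicon and power.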
    [IMG] CMT is more efficient and performs more consistently than Hyper-Threading.

    When a processor is done sending calculations through the pipeline, it stores that data in cache for programs to access (L1 DCache in the diagram below). In essence, these are the completed cars sitting in the parking lot waiting for transport. Intel processors have one parking lot that may contain a mix of cars and trucks, which reduces efficiency when a shipping company arrives to grab a shipment made exclusively of trucks. The Bulldozer plant has two parking lots, which gives that plant more flexibility to be efficient with storing and shipping.
    From end to end, the entire Bulldozer plant can do more, and do it more intelligently than the plants AMD and Intel run today.
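    The parking-lot point is about two threads sharing one L1 data cache versus each core having its own. Here is a minimal sketch under heavy assumptions: a direct-mapped cache, a contrived access pattern chosen so the two threads collide, and hypothetical names throughout. Real L1 caches are set-associative, which softens this effect considerably.

```python
class DirectMappedCache:
    """Toy cache: each address maps to exactly one line."""

    def __init__(self, num_lines):
        self.lines = [None] * num_lines
        self.misses = 0

    def access(self, address):
        slot = address % len(self.lines)
        if self.lines[slot] != address:
            self.misses += 1
            self.lines[slot] = address  # evict whatever was there


def interleave(a, b):
    """Alternate accesses from thread 0 and thread 1."""
    for x, y in zip(a, b):
        yield 0, x
        yield 1, y


# Contrived worst case: both threads loop over addresses that
# land on the same lines of an 8-line shared cache.
thread_a = [0, 1, 2, 3] * 4
thread_b = [8, 9, 10, 11] * 4

shared = DirectMappedCache(8)          # one lot for both threads
for _, addr in interleave(thread_a, thread_b):
    shared.access(addr)

private = [DirectMappedCache(4), DirectMappedCache(4)]  # same total size
for tid, addr in interleave(thread_a, thread_b):
    private[tid].access(addr)

shared_misses = shared.misses
private_misses = private[0].misses + private[1].misses
print(shared_misses, private_misses)  # 32 8
```

    Total capacity is identical in both setups; only the split differs. In this pattern the shared cache misses on every access because the threads keep evicting each other, while the private caches miss only on first touch.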
    [IMG] AMD Bulldozer

    Going back to raw architecture, both of Bulldozer’s lines share a single floating point scheduler (outlined in red), with two 128-bit FMAC pipelines. Fused multiply-accumulate (FMAC) computes a multiplication and an addition as one operation with a single rounding step, which improves floating point precision and grants Bulldozer a leg up on the Phenom II when it comes to calculating big equations more accurately and efficiently. And, when you realize that everything you do on a computer is a mathematical equation, you can see why this is important.
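    The precision benefit of a fused multiply-accumulate comes from rounding once instead of twice. The sketch below emulates that in software and says nothing about how Bulldozer implements it: the "fused" version uses exact rational arithmetic and rounds only at the end, and both function names are made up for illustration.

```python
from fractions import Fraction


def multiply_then_add(a, b, c):
    """Separate multiply and add: the product is rounded, then the sum."""
    return a * b + c


def fused_multiply_add(a, b, c):
    """Emulated FMA: compute a*b + c exactly, then round a single time."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)


a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

# The true product is 1 - 2**-60, which rounds to 1.0 as a double,
# so the unfused version loses the entire answer.
print(multiply_then_add(a, b, c))    # 0.0
print(fused_multiply_add(a, b, c))   # -8.673617379884035e-19
```

    The inputs are picked so that the intermediate product sits just below 1.0; a hardware FMA keeps those low-order bits until the final rounding, which is the effect the emulation reproduces.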
    A 128-bit floating point pipe is also a natural choice as AMD has announced SSE5 for the Bulldozer, an instruction extension that has several 128-bit multimedia instructions. Fusing the 128-bit FPUs will also allow the chip to crunch 256-bit Intel AVX instructions in just one cycle. SSE5 and AVX alone will take these processors to a whole new level of performance when it comes to multimedia, encryption and scientific research.
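    To picture how two 128-bit pipes could cover one 256-bit operation, consider the sketch below. It is only a model of the idea, assuming single-precision (32-bit) values; how AMD actually gangs the pipes together is not public, and the function names are hypothetical.

```python
PIPE_WIDTH_BITS = 128
FLOAT_BITS = 32                                  # single precision assumed
LANES_PER_PIPE = PIPE_WIDTH_BITS // FLOAT_BITS   # 4 values per pipe


def pipe_add(x, y):
    """One 128-bit pipe: adds four 32-bit lanes at once."""
    assert len(x) == len(y) == LANES_PER_PIPE
    return [p + q for p, q in zip(x, y)]


def add_256(x, y):
    """Model a 256-bit add as two 128-bit halves, one per pipe."""
    assert len(x) == len(y) == 2 * LANES_PER_PIPE
    low = pipe_add(x[:LANES_PER_PIPE], y[:LANES_PER_PIPE])
    high = pipe_add(x[LANES_PER_PIPE:], y[LANES_PER_PIPE:])
    return low + high


result = add_256([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0], [10.0] * 8)
print(result)  # [11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0]
```

    The flip side of this arrangement is that a 256-bit operation occupies both pipes, so the two cores in a module cannot each issue one at the same moment.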
    Finally, the Bulldozer brings forward the Phenom II’s cache hierarchy by dumping all the pipelines into shared pools of L2 and L3 cache. These shared L2 and L3 caches give either core on a Bulldozer module access to completed calculations that can be pulled back in to speed up a new task. This is standard for today’s processors.
    Your future Bulldozer CPU

    The first enthusiast CPU to employ the Bulldozer design is currently codenamed Zambezi, and it will contain four of these dual core modules for a total of eight cores. We also know for a fact that Zambezi will use socket AM3, meaning anyone with a DDR3 Phenom II motherboard will be ready to rock with a BIOS upgrade.
    What about performance?

    Unfortunately, there are some elements of the Bulldozer design that we just don’t understand yet, including:
    • How many cars the supervisor can send down the line at a time;
    • How many stages it takes to complete a car;
    • How AMD has configured the floating point unit (FPU) to run the numbers;
    • and how exactly AMD shares the single FPU amongst two independent assembly lines.
    Until this information turns up, we just can’t know how Bulldozer will compare to today’s processors. In the interim, we can only admire the genuinely different architecture and speculate over the diagram’s many ambiguities.
    Bobcat: the chip for netbooks

    Next on the launch deck is AMD’s “Bobcat” architecture, a chip explicitly designed to cater to products containing CPUs like the Athlon Neo or the Intel Atom.
    According to the company’s roadmaps, the first chip to launch with Bobcat architecture will be the 32nm Ontario APU, which combines two Bobcat modules and a rudimentary DirectX 11 chip on the same processor.
    [IMG] AMD Bobcat architecture

    Each Bobcat module is a single core design, with one supervisor (int scheduler) and one assembly line, which consists of the I-Pipes, Ld-Pipe and St-pipe in the diagram above. These can be considered specialized workers—electricians versus mechanics, for example—that perform unique tasks on the car while it is rolling down the line. You’ll note that Bulldozer, too, had four pipelines per int scheduler, but we just don’t know what kind of workers they are yet.
    The Bobcat’s integer pipe is paired with a dual-pipe FPU, ambiguously titled “A-Pipe” and “M-Pipe” in this diagram. We postulate that the “A” and “M” refer to the addition and multiplication/division floating point operations, respectively. The size of these pipelines—the number of bits they can calculate at a time—will not only determine what this processor is strongest at, but its complexity, and how it consumes power.
    On the topic of power, AMD claims that Bobcat is capable of dissipating less than 1 watt, which could mean something around 0.5W. A chip at that wattage isn’t doing much more than sitting around on standby, but it’s a healthy number for users looking for laptop designs with a long standby life. In practice, Bobcat’s actual TDP should be around 5 to 10W, which is perfect for netbook-sized laptops.
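    For a rough sense of what those wattages mean, here is some back-of-the-envelope arithmetic. The 30Wh battery is a hypothetical figure chosen for illustration, the calculation counts the CPU alone (no screen, memory or chipset), and TDP is a thermal ceiling rather than a typical draw, so treat the results as bounds at best.

```python
def hours_of_runtime(battery_wh, draw_watts):
    """Hours a battery lasts at a constant power draw."""
    return battery_wh / draw_watts


BATTERY_WH = 30.0  # hypothetical netbook battery, not from the article

standby_hours = hours_of_runtime(BATTERY_WH, 0.5)    # ~0.5W standby guess
low_tdp_hours = hours_of_runtime(BATTERY_WH, 5.0)    # 5W end of the range
high_tdp_hours = hours_of_runtime(BATTERY_WH, 10.0)  # 10W end of the range

print(standby_hours, low_tdp_hours, high_tdp_hours)  # 60.0 6.0 3.0
```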
    On the point of performance, AMD says it’ll weigh in at “90% of today’s mainstream performance” at less than half of the die size. If AMD’s definition of mainstream is the Athlon II—an assumption that bears out in their platform roadmaps—then Bobcat is essentially an Athlon II in a (much) smaller, cooler and quieter package. Not bad.
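    AMD’s claim implies a performance-per-area figure that is easy to work out. The sketch below simply plugs in the two numbers from the quote, taking "less than half" as exactly half, so the result is a lower bound; the function name is made up.

```python
def performance_per_area(relative_performance, relative_die_size):
    """Performance per unit of die area, relative to the baseline chip."""
    return relative_performance / relative_die_size


# 90% of mainstream performance in (at most) 50% of the die size.
bobcat_vs_mainstream = performance_per_area(0.90, 0.50)
print(bobcat_vs_mainstream)  # 1.8
```

    In other words, if the claim holds, each square millimeter of Bobcat delivers at least 1.8 times the performance of the mainstream chip it is compared against.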
    Bobcat’s most remarkable feature is not its architecture, however, but its design process. AMD has designed the Bobcat via high-level synthesis, or HLS. HLS is a process by which a chip’s design begins its life as a set of behaviors coded by a programmer in C++. Synthesis tools then translate that code into a hardware description from which a processor exhibiting the programmed behavior can be manufactured.
    HLS is a fascinating way to rapidly design and produce a chip that can easily be modified or ported to other processes for outstanding flexibility in the market. The trade off for this agility is frequency—Bobcat’s maximum clockspeed with an HLS-driven design is about 20% lower than it could have been were it designed “by hand.”
    All things considered, Bobcat will assuredly be faster than any ultra low-voltage chip in the market today; it will handily eclipse the Nano, the Atom and the Athlon Neo, by orders of magnitude on some metrics. Additionally, AMD’s decision to roll with HLS gives the firm the ability to respond to market conditions in ways its competitors simply cannot with current processes.
    Fusion: the chip for notebooks and budget desktops

    AMD’s acquisition of ATI Technologies was completed on October 26, 2006 and was accompanied by an official, and very important statement:
    AMD plans to create a new class of x86 processor that integrates the central processing unit (CPU) and graphics processing unit (GPU) at the silicon level with a broad set of design initiatives collectively codenamed “Fusion.”
    In other words, AMD announced that it would soon put a GPU and a CPU on the same chip. AMD calls these chips Accelerated Processing Units, or APUs. If you’re familiar with the CPU market, the APU might not be new to you: some of Intel’s Core i5 processors have a GPU onboard. Yes, Intel beat AMD to the punch, and it was almost a direct result of AMD’s financial hardship.
    Despite yielding the first design wins to its chief rival, there is a silver lining for AMD’s APU initiative: even AMD’s slowest modern GPU bloody annihilates anything Intel has to offer. This includes the GPUs AMD plans to stick inside its processors, starting next year with Llano.

    The Llano CPU is AMD’s first processor scheduled to adopt the Fusion APU design. Based on the die shots provided earlier this year, the chip strongly resembles an Athlon II X4 that has been shrunk from 45nm to 32nm to accommodate an onboard GPU.
    This would make perfect sense given that Llano and Propus are both oriented for the mainstream. Marrying existing technologies manufactured at a smaller size is much easier than starting over with a brand new architecture when none is needed.
    [IMG] An uncanny resemblance: Propus (Left) and Llano (Right)

    It is certainly worth noting that the above x-ray of the Llano is not complete; the bottom section of the chip has been cut off in press materials, meaning there’s even more silicon at play than we can see at this time.
    However, judging from what we can see, the Llano APU will feature 512KB to 1MB of L2 cache per core, no L3 cache and six Radeon HD 5000-series SIMD units for a total of 480 stream processors.
    In short, Llano is shaping up to be an Athlon II X4 with 66% of a Radeon HD 5750 on board. If that bears out, then it is more than capable of slugging Intel’s Clarkdale and Arrandale (Core i5) designs into the pavement without lifting much more than a few fingers.
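    The stream processor figures can be checked with simple arithmetic. The sketch assumes 80 stream processors per SIMD unit and 720 for the Radeon HD 5750, both taken from AMD’s published HD 5000-series specifications rather than from the die shot itself.

```python
SIMD_UNITS_ON_LLANO = 6        # counted from the die shot, per the article
STREAM_PROCS_PER_UNIT = 80     # assumption: HD 5000-series SIMD unit size
HD_5750_STREAM_PROCS = 720     # assumption: published HD 5750 spec

llano_stream_procs = SIMD_UNITS_ON_LLANO * STREAM_PROCS_PER_UNIT
share_of_5750 = llano_stream_procs / HD_5750_STREAM_PROCS

print(llano_stream_procs)              # 480
print(f"{share_of_5750:.1%}")          # 66.7%
```

    That is two-thirds of the HD 5750’s shader count, which matches the 66% figure. Clock speed and memory bandwidth are not visible in a die shot, so real-world performance could land well away from that ratio.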

    Before we head into our final thoughts, let’s take a moment to quickly summarize all the architectures that have been tossed around in this article.

    Family: Bulldozer
    Cores: 4 to 8
    Process: 32nm
    Socket: AM3
    Onboard GPU?: No
    Platform: Scorpius
    Role: Performance Desktop
    Launch date: Late 2010

    Family: Bobcat
    Cores: 2-4
    Process: 32nm
    Socket: N/A
    Onboard GPU?: Yes
    Platform: Brazos
    Role: Ultra Thins, Netbooks
    Launch date: 2011

    Family: Stars (Athlon II)
    Cores: 4
    Process: 32nm
    Socket: C32
    Onboard GPU?: Yes
    Platform: Brazos
    Role: Mainstream notebook, mainstream desktop
    Launch date: 2011
    Final thoughts

    AMD has been saying that “the future is Fusion” for years, and the company is just now in a place with its capital and processes to realize that future. By 2011, AMD will completely revamp their desktop, laptop and netbook offerings with three innovative and purpose-built CPU designs, all of which can be paired with on-die GPUs if the market demands it.
    You read that right: Llano isn’t the only design that can support an onboard GPU. AMD can pair Bulldozer and Bobcat modules with a GPU, too.
    Now, AMD’s first generation Fusion won’t have the performance to take on the discrete GPU market, but the groundwork is being laid. It will start with mainstream parts in laptops and low-voltage parts in netbooks. Economical desktop designs aren’t out of the question either, but there are signs that something much bigger is in the works.
    For example, Bulldozer may not be an APU now, but its relatively small floating point unit speaks to a future architecture that cedes floating point operations entirely to the GPU, a component that crushes the CPU in floating point performance.
    And indeed, in conversations with AMD, this is the paradigm they have been working to kickstart: a computing ecosystem that recognizes CPUs and GPUs alike as valid processors for a program. They envision a day when processing tasks are easily and automatically sent to the best processor for the job.
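    What "sent to the best processor for the job" might look like can be sketched in a few lines. This is purely a hypothetical illustration of the idea: the Task fields, the threshold and the selection rule are all invented here and do not describe any AMD software.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    parallel_items: int  # independent data elements the task can process
    branchy: bool        # True if the work is dominated by branching logic


def pick_processor(task, gpu_threshold=1024):
    """Route wide, regular work to the GPU and everything else to the CPU."""
    if task.parallel_items >= gpu_threshold and not task.branchy:
        return "GPU"
    return "CPU"


tasks = [
    Task("video filter", parallel_items=1920 * 1080, branchy=False),
    Task("spreadsheet recalculation", parallel_items=200, branchy=True),
    Task("game AI pathfinding", parallel_items=5000, branchy=True),
]
for task in tasks:
    print(task.name, "->", pick_processor(task))
```

    A real scheduler would also have to weigh the cost of moving data between the two processors, which is exactly what putting them on the same die is meant to reduce.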
    We are just beginning on that road, the one that blurs the line between the CPU and the video card, but AMD appears poised to make a confident first step. They have the resources, they have the engineers, and they have the drive. AMD is extremely passionate about where they’re going with their market strategy; talking to engineers and representatives at all levels of the company reveals an infectious enthusiasm that can’t be manufactured or faked.
    Do not believe for a moment that competition between AMD and Intel has waned: 2011 will be more exciting than ever.
