Can An ARM-Based Supercomputer Become the World’s Fastest?


BY ROBERT MCMILLAN


APR 3, 2012, 8:11 AM


The Barcelona Supercomputer Center is building one of the greenest high-performance computers on the planet, but if Alex Ramirez gets his way, it could also be the most powerful.


Ramirez, a manager with the center, is in the midst of building a new supercomputer, called Mont-Blanc, that will use the same kind of low-power chips you can find in tablets and smartphones today. Starting next month, his team will begin assembling the first Mont-Blanc prototype using Nvidia's Tegra 3 processors instead of the RISC or Intel x86-compatible processors used in virtually all of today's supercomputers. The Tegra 3 will handle communications between different parts of the system, while the actual number crunching will be done by yet-to-be-determined low-power multicore Nvidia graphics processors similar to the GeForce 520MX.

By June, Ramirez plans to run benchmarks for the widely followed Top500 supercomputer list, which measures how well computers perform a supercomputing benchmark program known as Linpack. But Ramirez says he's got his eyes on another target: the Green 500 list, which ranks computers by power efficiency, not raw performance. "There we expect to be in the top 10," Ramirez says.



Last November, the top computer on the Green 500 - a prototype of IBM's Blue Gene computer at the Thomas J. Watson Research Center - could do just over 2 billion calculations per second (2 gigaflops) per watt. When the Mont-Blanc prototype is up and running next month, it should be closer to 7 gigaflops per watt.
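For a sense of what the Green 500 actually measures: efficiency is simply sustained Linpack performance divided by power draw. Here is a minimal, host-only sketch of that arithmetic; the rack-level wattages are hypothetical, and only the gigaflops-per-watt ratios come from the figures quoted above.

```
#include <cstdio>

// Green 500-style efficiency: sustained Linpack gigaflops divided by watts.
static double gflops_per_watt(double linpack_gflops, double power_watts) {
    return linpack_gflops / power_watts;
}

int main() {
    // Hypothetical rack-scale numbers chosen so the ratios match the figures
    // quoted in the article (~2 GF/W for the Blue Gene prototype, ~7 GF/W for
    // the Mont-Blanc target); these are not measured results.
    double bluegene  = gflops_per_watt(2000.0, 1000.0);  // 2 TF at 1 kW
    double montblanc = gflops_per_watt(7000.0, 1000.0);  // 7 TF at 1 kW

    std::printf("Blue Gene prototype: %.1f gigaflops per watt\n", bluegene);
    std::printf("Mont-Blanc target:   %.1f gigaflops per watt (%.1fx better)\n",
                montblanc, montblanc / bluegene);
    return 0;
}
```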


Ramirez is tapping into one of the coolest trends in supercomputing: the drive to use low-power mobile and graphics processors to do high-power computing.



Because battery life is so important in mobile devices, chips like the Tegra 3 focus on using as little power as possible. The Mont-Blanc Tegra 3 chips will probably run in the 4-watt range.


That's nothing compared to an Intel Xeon chip, which can easily burn between 50 and 100 watts. The trick is that supercomputer programs have to be rewritten in order to take advantage of the GPUs and the Tegra 3. Nvidia has tried to help that along by releasing a software development kit that helps people like Ramirez write programs for its chipsets.

Ramirez expects to be on June's Top500 with a computer that uses between 2,000 and 4,000 processors. "Instead of using very few - but very big performance - processors… we're going to be using a lot of very low-power - but middle performance - processors," he says.
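The division of labor Ramirez describes - ARM cores handling coordination, GPUs handling the math - is the standard offload pattern supported by Nvidia's GPU-computing development kit, CUDA. Below is a minimal sketch of that pattern with a toy kernel and made-up sizes; it is an illustration, not Mont-Blanc code.

```
// CPU orchestrates, GPU computes: the host stages data and launches work,
// the device does the arithmetic. Requires the CUDA toolkit and a
// CUDA-capable GPU.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Toy "number crunching" kernel: each GPU thread updates one element.
__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    // Host side (the role the Tegra 3's ARM cores would play): set up data.
    float *hx = (float *)std::malloc(bytes);
    float *hy = (float *)std::malloc(bytes);
    for (int i = 0; i < n; ++i) { hx[i] = 1.0f; hy[i] = 2.0f; }

    // Device side: allocate GPU memory and copy the inputs over.
    float *dx = nullptr, *dy = nullptr;
    cudaMalloc((void **)&dx, bytes);
    cudaMalloc((void **)&dy, bytes);
    cudaMemcpy(dx, hx, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy, bytes, cudaMemcpyHostToDevice);

    // The CPU only launches the work and waits; the GPU does the arithmetic.
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    saxpy<<<blocks, threads>>>(n, 2.0f, dx, dy);
    cudaDeviceSynchronize();

    // Copy the result back and spot-check it (2.0 * 1.0 + 2.0 = 4.0).
    cudaMemcpy(hy, dy, bytes, cudaMemcpyDeviceToHost);
    std::printf("y[0] = %f\n", hy[0]);

    cudaFree(dx); cudaFree(dy);
    std::free(hx); std::free(hy);
    return 0;
}
```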

But things get really interesting when Nvidia starts shipping successors to the Tegra 3, including a new 64-bit chip based on a new Cortex A15 design from ARM Holdings. That processor will be able to take on some of the supercomputing workload being done by Ramirez's GPUs right now, and it could give him a real breakthrough in performance: four times the computer processing for essentially the same 4 watts of power.

But to take advantage of this next generation of chips, the Barcelona team will need to get their software finely tuned for this new and unproven architecture. They need to run the Linpack benchmark used by the Top500 group, but they'll also need to rewrite the research programs used by the university's scientists: software that simulates intricate chemistry and physics problems. That's going to be the tricky part.

If his bet pays off, though, Ramirez thinks this machine could pave the way for the most powerful system on the Top500 list by 2017. "We are working now toward a machine that could be deployed five years from now," he says. That system would probably be in the 200-petaflop range - or about 20 times as powerful as the top supercomputer in the world today, Japan's K Computer.

For all of the excitement about Mont-Blanc in the supercomputing world, ARM is sanguine about the project. It sees big bucks in all of those smartphones and tablets that consumers are buying - not in geeky supercomputers. Last month, ARM president Simon Segars told us that the Barcelona Supercomputing project was "interesting," but he downplayed the supercomputing market. "Supercomputers, for ARM, is not a high volume market," he said. "It's not something we spend a lot of time talking about. Ours is a business that is royalty and unit driven, so we're interested in high-volume markets."

Cade Metz contributed to this story. Photo: Barcelona Supercomputer Center


DISCUSSION THREADS


Tue 03 Apr 2012 8:20 AM


Myth of Echelon

Titles like "Can An ARM-Based Supercomputer Become the World’s Fastest?" are relative. Of course it could happen. Easily. If the computer uses 20,000 ARM chips instead of 1 Intel/AMD chip.. (I'm not saying that's how many chips would be needed to equal) promoted by wagnerrp wagnerrp @Myth of Echelon

Titles like "Can An ARM-Based Supercomputer Become the World's Fastest?" are completely bogus. No! It cannot! The ARM is not designed for performance, and it is definitely not designed for floating point performance. It is a GPU-based supercomputer, and the ARM is merely around to manage the more mundant tasks a GPU is not well suited for. Replace ARM with Opteron or Xeon and you have about a quarter of the top50 list. Myth of Echelon @wagnerrp

You obviously didn't read my comment at all.

wagnerrp @Myth of Echelon

No. I read it. It's close, but wrong, for the same reason we don't call systems like SETI@HOME and FOLDING@HOME, and oftentimes even "Beowulf clusters," supercomputers.

The traditional supercomputer was a single, very simple computational unit running serial code at insanely high speeds, using an array of co-processors to continually feed it data. For obvious reasons, this morphed into the MPP systems we see today; however, the problem set has remained the same, and while the problems can usually be broken into smaller computational domains for parallelization, they require fine-grained synchronization in your time stepping or else the solution will never converge.

Nearly all large supercomputer installations use Infiniband, or some custom network fabric, in a really screwy topology, to keep latency down and throughput up. Some lower-end ones use "cheaper" 10GbE. You could build a cluster using some GbE with a fat tree, but its primary role will be a bunch of smaller, independent, and less computationally intensive tasks. It may achieve the same synthetic throughput as something tightly bound with more expensive networking, but will never achieve the same real-world throughput on the type of problem "supercomputers" are designed for. Fewer but more powerful CPUs rule the day, and ARM will not compete in GFLOP/$ or GFLOP/Watt.

Edited by wagnerrp at 04/03/12 10:25 AM

monkunashi @wagnerrp

i take it you didn't read the article?


wagnerrp @monkunashi

Which goes back to my first comment: it's not an ARM-based supercomputer. It's a GPU-based supercomputer.

DeltaDAWG @wagnerrp

The problem with your logic is you've put a strict requirement on what a supercomputer needs to be when that requirement doesn't even exist. "While the supercomputers of the 1970s used only a few processors, in the 1990s, machines with thousands of processors began to appear and by the end of the 20th century, massively parallel supercomputers with tens of thousands of "off-the-shelf" processors were the norm."

monkunashi @wagnerrp

"But things get really interesting when Nvidia starts shipping successors to the Tegra 3, including a new 64-bit chip based on a new Cortex A15 design from ARM Holdings. That processor will be able to take on some of the supercomputing workload being done by Ramirez's GPUs right now and it could give him a real breakthrough in performance: four-times the computer processing for essentially the same 4 watts of power" promoted by wagnerrp wagnerrp @DeltaDAWG


My "strict requirement" is based off the problem set it is intended to solve. You've got grids, clusters, and supercomputers, defined based off the inter-connectivity between them. Grids are your @Home stuff. You use these to run your parametric studies, your Monte Carlo simulations, your offline post processing, your render farm. In a sense, it really is throwing science at the wall and seeing what sticks. You run tasks sufficiently small in scope that it can fit on one compute node, and that compute node may go off and run for days before reporting back. You run these with assorted computers on the internet (@Home), or at "cloud" facility, or on your employees' computers while they are not in the office. They are very cheap for their performance, and can scale well beyond the largest of supercomputers in peak throughput, but are limited in the kinds of problems they can efficiently solve. Clusters are the things you most often see at small to medium sized companies and educational institutions, or running the backends for dynamic websites. Your database cluster or redundant routers might communicate with time frames on the order of seconds. A physics simulation might run locally for tens of seconds on a single iteration, or many iterations on a single node, before nodes update their boundary conditions with neighboring blocks. These use commodity PCs or servers, and typically gigabit networking. More expensive ones might use 10GbE or Infiniband. They are more expensive than grids, but far cheaper than supercomputers. As with grids, they can be much more powerful than supercomputers, however due to the limited network capacity, individual tasks tend to be limited to a few hundred to a few thousand nodes. There is a bit of a blur between clusters and supercomputers, with low end supercomputers starting at the commodity hardware and Infiniband, and high end supercomputers using custom system boards, custom networking fabrics with strange multiply-linked topologies, and even custom CPUs. Here, network throughput and latency is more important than actual CPU performance. You want to be able to run large single problems spanning the entire computer, or very energetic simulations that require very fine synchronization between neighbor computational domains. You get as fast of processors as you can to limit your domain boundaries. You get as many cores per package as possible to limit communications distance and use high throughput local interconnects. You get as many packages per CPU module as possible. You get as many CPU modules per system board as possible. All of this is designed to approximate the performance of one single extremely high performance CPU as closely as possible. These cost far more per GFLOP of performance than clusters or grids, but they scale far better as your problem size increases. Different names for different design strategies for different types of problems, none better than the next except in their own particular strengths. An ARM cluster could be interesting if you have high integer performance needs. I could see such a thing being useful for SANs and the like. In fact, my RAID card uses an Intel XScale (ARM) chip for just that purpose. An ARM could be useful in a supercomputer if you use the ARM as an IO and task manager, on a die with other elements doing the actual computation. 
The Power A2 chip in IBM's BlueGene supercomputers operate in this manner, using an 18-core CPU where one is dedicated to running the local OS, and one is dedicated to managing the various external interconnects. A Tegra-based supercomputer would be doing exactly that, using one or more A9/A15 cores to keep the nVidia graphics shaders filled with data and operating at full speed, however at that point, it is not an ARM-based supercomputer. The ARM is merely providing management duties, ancillary to the actual computational effort.

DeltaDAWG @wagnerrp

It doesn't change the fact that if you cluster a billion of them, they could crunch some serious shit. There's nothing wrong with poking fun at the obvious. Like claiming potato skins are poisonous. Sure, if you eat a truck load at once.

wagnerrp @monkunashi

Except, that's all marketing mumbo-jumbo, and the real world doesn't operate that way. The A15 core is only about 40% faster than the existing A9 on the single precision integer workloads, and being 64-bit doesn't help at all. 64-bit operation helps with double precision integer workloads and nothing else. It allows access to more memory without having to go through some funky paging mechanism. The modified architecture may offer more registers, independent of whether it happens to be 32-bit or 64-bit. The FPU has been upgraded from the A9, but that is also completely independent of how large the integer units are.


64-bit operation means your integer unit is now roughly double the size, directly consuming more power. The real performance improvement from the A15 comes from a larger 3-way dispatch, up from 2-way on the A9, offering improved IPC at, again, roughly double the size. Twice the cores, a 40% bump in IPC, and a 50% bump in clock rate gets you 4x the performance of the current A9; however, that A9 would be running full out at only 1W, not "essentially the same 4 watts".
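The multiplication in that comment does work out as claimed; a trivial, host-only check (illustrative factors taken from the comment, not measured):

```
#include <cstdio>

int main() {
    const double core_factor  = 2.0;   // twice the cores
    const double ipc_factor   = 1.4;   // ~40% better IPC
    const double clock_factor = 1.5;   // ~50% higher clock
    std::printf("combined speedup: about %.1fx\n",
                core_factor * ipc_factor * clock_factor);  // ~4.2x
    return 0;
}
```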

wagnerrp @DeltaDAWG

Yes. It very much does. It all depends on your problem. If your problem is embarrassingly parallel, such that you break it into parts and never hear from the nodes again until the problem is completely solved, then by all means, get your billion ARMs, stuff them on a grid, and set them loose.

If your problem is anything else, the nodes will need to spend some amount of time during computation of the solution communicating with each other. If you add more compute nodes and spread the load between them, each node will churn through its computational period faster, and start communicating sooner. You still have at least as much data to transfer during communication, and with more domain boundaries, you likely have more. If you don't similarly improve your network backbone as you increase node counts, your computational periods will continue to decrease in length, and your communications periods will continue to increase in length, until such time as your CPUs are just sitting idle waiting for updated data all the time.

Hence the reason why "proper supercomputers" use such expensive networking gear and complex topologies: to allow them to continue to scale up in throughput without topping out from this bottleneck. Lower-performance individual CPUs mean you need that much more communications gear, with all the cost and overhead that entails, to link them all together.
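A toy model of that argument, with invented constants: as nodes are added, per-node compute time shrinks but a fixed per-iteration synchronization cost does not, so CPU utilization collapses unless the network improves in step.

```
#include <cstdio>

int main() {
    const double compute_seconds = 1000.0;  // total compute per iteration (made up)
    const double comm_seconds    = 1.0;     // per-iteration synchronization cost (made up)

    for (int p = 64; p <= 65536; p *= 4) {
        const double t_compute  = compute_seconds / p;  // shrinks as nodes are added
        const double t_comm     = comm_seconds;         // the network bottleneck stays
        const double efficiency = t_compute / (t_compute + t_comm);
        std::printf("%6d nodes: %8.3f s/iteration, CPUs busy %3.0f%% of the time\n",
                    p, t_compute + t_comm, efficiency * 100.0);
    }
    return 0;
}
```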
