
Tuesday, May 24, 2011

AMD’s Lee Howes on Heterogeneous Computing

As we approach the AMD Fusion Developer Summit in June, I want to use this opportunity to spotlight a few of the summit sessions that will be hosted by AMD experts. These blogs should give you a better idea of what you can expect to see and hear when the AMD team touches down in Bellevue, WA early this summer.
First up, I sat down with Dr. Lee Howes about his session, “The AMD Fusion APU Architecture – A Programmer’s Perspective // Programming Models for Heterogeneous Computing.” Lee is a parallel compute expert who works on the APU runtimes team, defining how high-level programming models will map onto current and future APU architectures. Here’s what he had to say.

Q: What’s the topic of your presentation at AFDS?
A: It’s important for programmers to understand what they are developing for and what they can expect from an architecture. In particular, developers need to be able to judge in advance how well their existing algorithms will map onto the various hardware available in a heterogeneous system like an AMD Fusion Accelerated Processing Unit (APU).
The aim of my talk at the AMD Fusion Developer Summit is to cover those fundamental concepts, so that developers can better understand what they stand to gain from a heterogeneous architecture and how best to go about solving their problem on one. We’ll then move on to programming models past, present and future, and how we expect these to change for future APU architectures.

Q: Quickly talk about the hurdles that the AMD Fusion APU architecture helps you overcome from a programming perspective.
A: One of the major problems with offloading work to the GPU is the latency and bandwidth limitations of the PCI Express bus that connects main memory to GPU memory. These limitations restrict the work that can be offloaded to large long-running kernels that do not have a tight turnaround time between producing data and the data being needed back in main CPU memory. The APU architecture is designed to help reduce this bottleneck by providing efficient access to shared memory. As we move forward during the next few years, we plan to continue to increase this flexibility and further reduce offload overhead, allowing developers to offload smaller units of work in tighter dependent loops without finding themselves limited by the interconnect.
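The offload trade-off Lee describes can be sketched as a simple cost model: offloading only pays off when the kernel’s compute saving outweighs the round-trip transfer cost over the interconnect. This is a minimal illustrative sketch, not AMD’s model; the bandwidth and latency figures are hypothetical PCIe-like numbers, and on an APU with shared memory the transfer term approaches zero.

```python
def offload_worthwhile(bytes_moved, cpu_time_s, gpu_time_s,
                       bandwidth_gbs=8.0, latency_s=10e-6):
    """Return True if offloading beats staying on the CPU.

    bandwidth_gbs and latency_s are hypothetical interconnect figures;
    shared memory on an APU would make transfer_s nearly disappear.
    """
    # Round trip: data out to the GPU and results back again.
    transfer_s = 2 * latency_s + bytes_moved / (bandwidth_gbs * 1e9)
    return gpu_time_s + transfer_s < cpu_time_s

# A large, long-running kernel amortizes the transfer cost:
print(offload_worthwhile(100e6, cpu_time_s=1.0, gpu_time_s=0.1))      # True
# A small kernel in a tight dependent loop is dominated by the transfer:
print(offload_worthwhile(1e6, cpu_time_s=0.0002, gpu_time_s=0.0001))  # False
```

Shrinking `transfer_s` is exactly what widens the set of kernels worth offloading, which is the point of tighter CPU–GPU integration.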

Q: Why has parallel computing long been adopted by academia and HPC, but failed to achieve mainstream adoption?
A: Developing parallel programs is difficult. Synchronization is hard to implement correctly; data sharing is easy to get wrong; and good performance is harder to achieve than people imagine. Data must be moved around, not only for correctness but also to place it in an appropriate location for fast access. These hurdles have limited the implementation of parallel applications to cases where developers have the time and financing to put in the necessary development effort. PhD students and developers of vast simulations fit in that category.
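The “synchronization is hard, data sharing is easy to get wrong” point can be made concrete with a minimal sketch: an unsynchronized read-modify-write on shared state can lose updates, while the lock-protected version cannot. The helper names here are my own, not from the talk.

```python
import threading

counter = 0
lock = threading.Lock()

def unsafe_increments(n):
    global counter
    for _ in range(n):
        tmp = counter        # read
        counter = tmp + 1    # write: another thread may have updated in between

def safe_increments(n):
    global counter
    for _ in range(n):
        with lock:           # lock makes the read-modify-write atomic
            counter += 1

def run(worker, n=100_000, threads=4):
    global counter
    counter = 0
    ts = [threading.Thread(target=worker, args=(n,)) for _ in range(threads)]
    for t in ts:
        t.start()
    for t in ts:
        t.join()
    return counter

print(run(unsafe_increments))  # often less than 400000: updates were lost
print(run(safe_increments))    # always 400000
```

The bug in the unsafe version is intermittent and timing-dependent, which is precisely why these errors are so expensive to find and fix.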
However, it’s not entirely true to say that parallel computing has failed to achieve mainstream adoption. Graphics processors are highly parallel and traditionally accessed through a relatively simple programming interface that casts the problem in graphics-specific terms while still mapping relatively efficiently to the hardware and making use of those parallel resources. It’s this factor that has allowed us to develop such parallel processors with a large market to recover the cost of the development effort.
And now, OpenCL is expanding the scope of adoption of parallel architectures by enabling the large base of C programmers to take advantage of this leap in computational performance and the associated power savings.
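OpenCL expresses computation as a kernel function executed once per work-item over an index space. As a rough sketch of that data-parallel model, here is a pure-Python emulation; the helper names are hypothetical, and real OpenCL would compile C kernel source and enqueue it on a device via the OpenCL runtime API.

```python
def run_kernel(kernel, global_size, *buffers):
    """Emulate launching `kernel` over `global_size` work-items.

    On a GPU these iterations would execute in parallel across
    many SIMD lanes and cores; here they simply run sequentially.
    """
    for gid in range(global_size):
        kernel(gid, *buffers)

def vec_add(gid, a, b, c):
    c[gid] = a[gid] + b[gid]   # each work-item handles one element

a = [1, 2, 3, 4]
b = [10, 20, 30, 40]
c = [0] * 4
run_kernel(vec_add, len(c), a, b, c)
print(c)  # [11, 22, 33, 44]
```

The appeal for C programmers is that `vec_add` looks like ordinary scalar C code; the runtime, not the developer, maps the index space onto the parallel hardware.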

Q: What driving factors are making parallel computing more relevant than ever to mainstream computing?
A: The scaling of CPU frequency has been limited in recent years due to power consumption and heat dissipation issues. Consequently, the only way to effectively use the continuing increase in transistor density is to increase the parallelism on the device. We can achieve this by increasing the number of standard cores, as we do on the multi-core x86 CPUs; by increasing the number of threads each core can process concurrently (which consumes transistors because it requires state storage); or by increasing the number of ALUs, as we see with the increasingly wide vector units on current architectures, moving through SSE, AVX and so on. A GPU combines these features, executing multiple thread contexts concurrently on wide SIMD units and with tens of cores. The combination of both CPU and GPU cores, and their availability as an integrated single device, makes parallelism ubiquitous. Using this widespread parallel hardware is vital if consumers are to see any improvement from buying new computing products: laptops, desktops, tablets and so on.
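The SIMD axis mentioned above can be sketched in pure Python: one vector instruction conceptually applies the same operation to a whole fixed-width chunk of data at once. This is an illustrative model only; the width of 4 stands in for something like 4 floats per SSE register, with AVX doubling it to 8.

```python
SIMD_WIDTH = 4  # hypothetical vector width, e.g. 4 floats per SSE register

def simd_add(a, b):
    """Add two equal-length sequences one SIMD_WIDTH chunk at a time."""
    out = []
    for i in range(0, len(a), SIMD_WIDTH):
        # On real hardware this whole chunk is a single vector instruction,
        # not a Python loop; the point is the data-parallel shape.
        out.extend(x + y for x, y in zip(a[i:i + SIMD_WIDTH],
                                         b[i:i + SIMD_WIDTH]))
    return out

print(simd_add([1, 2, 3, 4, 5, 6, 7, 8], [8, 7, 6, 5, 4, 3, 2, 1]))
# [9, 9, 9, 9, 9, 9, 9, 9]
```

Widening `SIMD_WIDTH` increases throughput without adding cores or threads, which is exactly the third scaling axis the answer describes.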

Q: Contrast what parallel programming models look like for CPU and GPGPU today vs. what they might look like for the APU in the future.
A: Although OpenCL can make developing applications that leverage the power of the multi-core CPU or the GPU much easier, both OpenCL and other current programming models continue to maintain a very clear distinction between the GPU and the CPU. This distinction limits the range of algorithms that can be cleanly and easily implemented. Over time, as we continue to integrate the CPU and the GPU, we expect the range of algorithms that can efficiently be mapped to an APU to increase, and for it to become easier to develop applications that leverage both capabilities without the developer needing the detailed architectural knowledge that is currently necessary.


