DMA-Assisted, Intranode Communication in GPU Accelerated Systems

Feng Ji*, Ashwin M. Aji†, James Dinan‡, Darius Buntinas‡, Pavan Balaji‡, Rajeev Thakur‡, Wu-chun Feng†, Xiaosong Ma*§

* Department of Computer Science, North Carolina State University. fji@ncsu.edu, ma@cs.ncsu.edu
† Department of Computer Science, Virginia Tech. {aaji, feng}@cs.vt.edu
‡ Math. and Computer Science, Argonne National Lab. {dinan, buntinas, balaji, thakur}@mcs.anl.gov
§ Computer Science and Mathematics Division, Oak Ridge National Laboratory

Abstract—Accelerator-awareness has become a pressing issue in data movement models, such as MPI, due to the rapid deployment of systems that utilize accelerators. In our previous work, we developed techniques to enhance MPI with accelerator-awareness, thus allowing applications to easily and efficiently communicate data between accelerator memories. In this paper, we extend this work with techniques to perform efficient data movement between accelerators within the same node using a DMA-assisted, peer-to-peer intranode communication technique that was recently introduced for NVIDIA GPUs. We present a detailed design of our new approach to intranode communication and evaluate its improvement to communication and application performance using micro-kernel benchmarks and a 2d stencil application kernel.

I. INTRODUCTION

In recent years, graphics processing units (GPUs) have emerged as excellent low-cost, power-efficient accelerators for general-purpose, highly parallel computations. Across a broad range of computational science, engineering, and analytics domains, GPUs have shown significant advantages over traditional CPUs. These results have, in turn, led to an increasing number of supercomputer systems being designed with GPUs. In the November 2011 Top500 list [1], for example, three of the top five supercomputers in the world utilized GPUs. Systems that can accommodate two or even four GPUs per node are fairly common today, and the price-performance benefit of GPUs indicates that such systems will become even more common over the next several years.

The processing units on current GPUs can operate only on data that is located in the on-board GPU memory. While the GPU programming libraries provide mechanisms for transferring data between host memory and GPU memory, challenges remain to efficiently schedule and synchronize these transfers. Managing data transfers has been previously studied [2]–[4] with respect to message passing libraries, specifically those implementing the Message Passing Interface (MPI) standard [5]. These studies concentrated on optimizing transfers between GPU memory and host memory, when the source and/or the destination of the data are on a GPU device. One of the shortcomings of these previous approaches is that they did not take advantage of systems where multiple GPUs were installed in the same compute node and data had to be moved between them. In our previous work [6], we designed a shared-memory approach that utilizes a common host memory buffer that is visible to both the source and destination process to reduce the number of memory copy operations required, thus improving overall performance. However, this approach still requires intervention from the host processor and memory to “stage” data before it can be moved between the two GPU devices.

Recently, a GPU IPC feature has been introduced on new GPU hardware from NVIDIA that allows GPU direct memory access (DMA) engines to directly move data from one GPU to another on the same node. In this paper, we present the design of an efficient intranode cross-GPU/CPU peer-to-peer communication scheme for MPI communication. We explore the use of the GPU’s on-board DMA engines and take advantage of the new cross-GPU, peer-to-peer data accessibility provided by GPU IPC. By utilizing the latest architectural features, our scheme can bypass the extra data paths through host memory used in current intranode GPU communication mechanisms for MPI and can instead perform a direct transfer between source and destination buffers. Furthermore, this work makes the following contributions to intranode GPU communication.

- We introduce the use of GPU DMA engines, GPUDirect [7] and CUDA IPC [8] in the design of a DMA-assisted, peer-to-peer, direct intranode MPI communication subsystem. We show that this design can be extended to optimize communication between CPU and GPU devices.
- We explore the design space of intranode communication with GPU devices, including protocol design and use of DMA engines.
- We implement our design by adapting MPICH2, a widely-used MPI implementation. We evaluate the performance on two typical GPU-accelerated systems. Our results show that the DMA-assisted peer-to-peer communication is beneficial, especially when participating GPU devices are close in a system. Applying our solution to the 2d stencil benchmark from SHOC [9], we demonstrate an average 4.7% and 2.3% performance improvement for...
has recently added support for several capabilities that can be utilized to enhance the efficiency of GPU data movement, including GPUDirect and CUDA IPC.

**GPUDirect** [7] is a recent CUDA feature, which enables direct, peer-to-peer GPU data transmission through GPU DMA engines, without any host processor intervention. In the past, when data needed to be moved between the device memories of two GPUs, it had to be “staged” in the host memory. With GPUDirect, the data can be transferred from one device directly to another. However, this feature is currently restricted to peer accessible devices—i.e., those that are attached to the same chipset or different chipsets that are connected via AMD HT. GPUDirect does not currently support Intel QPI connected cross-chipset GPU devices.

Another technology only available in CUDA, **CUDA IPC** [8], allows different processes to access the same buffer located in GPU device memory. Using this technique, a process can share a memory handle that references a device memory buffer with another process. This feature is useful for parallel applications with multiple processes running on the same node, such as MPI applications.

### B. MPI and MPICH2 Intranode Communication

MPI [5] is the industry standard for parallel programming on virtually all parallel computing architectures. Most popular MPI implementations provide highly optimized internode communication as well as intranode communication between cores and processors on the same node. MPICH2 [12], developed at Argonne National Laboratory, is a widely used open-source MPI implementation. Its intranode communication is handled by the *Nemesis* [13] communication subsystem. MPICH2 has two data transmission modes: eager mode, optimized for latency with shorter messages, and rendezvous mode, optimized for bandwidth with large messages. The rendezvous mode is implemented through the large message transfer (LMT) protocol in *Nemesis*. Currently, this protocol supports several transport methods that use shared-memory buffers and kernel-assisted single copy through host-side DMA. The shared-memory buffer implementation allocates buffers shared between the sender and receiver processes, into which the sender stores message data and from which the receiver removes it. The sender and receiver processes work in parallel to pipeline the memory copies.

In our previous work, we designed an approach to allow intranode communication from GPU buffers [6]. This eliminated the need for the application to explicitly copy data from the source GPU memory to the host memory before an MPI send operation. It also eliminated explicit data copying from the host memory to the destination GPU memory after an MPI receive operation. The shared-memory LMT implementation was modified to use GPU data movement commands to directly transfer PCIe transaction data into and out of an LMT buffer. However, this method still requires copying GPU-resident data to shared buffers in host memory, requiring two DMA transfers and intervention from the host processor for the data transfer to occur.

---

**Fig. 1.** GPU-Accelerated computing system architecture.
Each process within a node has its own virtual address space in CPU and GPU memories. A virtual address from one process cannot be dereferenced in the address space of another, without OS support for sharing memory mappings. While peer-to-peer GPU memory copies (via GPUDirect) are possible with CUDA, they are restricted to a single process. In previous work, we addressed the intranode communication problem by extending MPICH2’s Nemesis communication system and performing MPI communication between GPUs via host-side shared memory (shm).

As described in Section II-A, recent releases of CUDA (v4.1 or later) have exposed a new family of IPC functions, namely CUDA IPC [8], which provide the capability of exporting a memory handle to a GPU memory allocation from one process directly into the address space of another process within the same node. Using CUDA IPC, the MPI process driving the communication can issue an asynchronous DMA request, by calling cudaMemcpyAsync, to move data between participating GPUs directly. This feature, together with GPUDirect, can be used to perform direct, peer-to-peer data transfer. It also allows us to completely avoid pipelining through host-side shared memory buffers. It is important to note, however, that GPUDirect is limited to peer GPU devices that are connected to the same I/O hub or to different I/O hubs connected via AMD HT, and such “peer accessibility” must be queried from the GPU device. In our design, we use this approach for peer GPU devices and fall back to the original shared memory based approach for other GPU devices.

As no static binding exists between an MPI rank and a GPU device, a process can choose any available GPU at runtime. Therefore, a process cannot determine from its own information alone whether peer accessibility is available for a given pair of devices. To solve this problem, we use the handshake phase of the Nemesis LMT protocol (discussed in Section III-A) to exchange the peer accessibility information of devices before performing the communication. Note that the use of LMT limits the applicability of DMA-assisted communication to MPICH2 rendezvous mode, which is primarily used for large messages.

A. LMT Peer-GPU Protocol for Intranode Communication

In Nemesis, three LMT protocol models, PUT, GET, and COOPERATE, are provided for supporting intranode communication. The PUT and GET protocols are used to implement kernel-assisted, single-copy protocols, and the COOPERATE protocol is used for shared memory based intranode communication. The three protocol models differ in which process initiates the payload transfer. In this paper, we design an additional LMT Peer-GPU protocol, which can adaptively operate in PUT, GET, or COOPERATE mode, depending on peer accessibility.

We show the control flow of the LMT Peer-GPU protocol in Figure 2. When the sender starts to participate in the handshake, it retrieves the inter-process memory handle for the sender’s data buffer and then sends this memory handle along with the device number, packaged in a cookie, with the request to send (RTS) message to the receiver. When the RTS message arrives at the receiver process, it inspects the cookie and checks the peer accessibility of the two devices. If GPU peer accessibility is available, it performs peer-to-peer GPU communication using one of the three LMT protocols. Otherwise, it reverts to the shm approach. In another cookie created for the clear-to-send (CTS) message, the receiver embeds this decision, in order to inform the sender of the chosen protocol.

If the source and destination GPUs are peer accessible, the receiver can choose one of the three modes, namely PUT, GET, or COOPERATE. This choice is arbitrary when the GPUs are peer accessible in both directions, but it is constrained in special cases (as further explained in Section III-B).
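The receiver-side decision in this handshake can be sketched as follows. This is a minimal illustrative model in Python, not MPICH2 code: `Cookie`, `receiver_decide`, `keeneland_can_access`, and the string handles are hypothetical names, and the handling of one-directional accessibility is our assumption about how the special cases constrain the choice. The peer table mirrors the Keeneland row of Table I.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Cookie:
    """Hypothetical payload carried inside an RTS or CTS message."""
    device: int                  # GPU device number of the process that built the cookie
    mem_handle: Optional[str]    # stand-in for an interprocess memory handle
    mode: Optional[str] = None   # protocol decision; only set in the CTS cookie


def receiver_decide(rts: Cookie, recv_device: int, recv_handle: str,
                    can_access: Callable[[int, int], bool],
                    preferred: str = "GET") -> Cookie:
    """Inspect the RTS cookie and build the CTS cookie that carries the decision."""
    if preferred not in ("PUT", "GET", "COOPERATE"):
        raise ValueError("unknown peer-GPU mode: " + preferred)
    receiver_can_pull = can_access(recv_device, rts.device)
    sender_can_push = can_access(rts.device, recv_device)
    if receiver_can_pull and sender_can_push:
        mode = preferred          # mutually accessible: any of the three modes works
    elif receiver_can_pull:
        mode = "GET"              # assumption: only the receiver can drive the copy
    elif sender_can_push:
        mode = "PUT"              # assumption: only the sender can drive the copy
    else:
        mode = "SHM"              # no peer access: fall back to shared memory
    # The sender needs the receiver's handle only when it will write remotely.
    handle = recv_handle if mode in ("PUT", "COOPERATE") else None
    return Cookie(device=recv_device, mem_handle=handle, mode=mode)


# Peer accessibility as listed for Keeneland in Table I: only GPUs 1 and 2 are peers.
KEENELAND_PEERS = {(1, 2), (2, 1)}


def keeneland_can_access(a: int, b: int) -> bool:
    return a == b or (a, b) in KEENELAND_PEERS


cts_near = receiver_decide(Cookie(1, "handle-A"), 2, "handle-B", keeneland_can_access)
cts_far = receiver_decide(Cookie(0, "handle-C"), 2, "handle-D", keeneland_can_access)
print(cts_near.mode, cts_far.mode)   # GET SHM
```

In a real implementation the accessibility callback would be answered by querying the GPU device, and the handles would be opaque CUDA IPC handles rather than strings.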

a) LMT Peer-GPU GET: If the receiver decides to get the data after receiving the RTS, it opens the sender’s memory handle, maps it into its address space, and starts peer-GPU data movement. A progress element is inserted into MPICH2’s progress engine queue by the receiver and the DMA status is polled for completion. A DONE message is then sent to notify the sender of completion.

b) LMT Peer-GPU PUT: If the receiver decides to let the sender push the data, it retrieves the interprocess memory handle of the receive buffer, packages it with the device number as well as the decision in the CTS packet’s cookie, and sends it back to the sender. The sender opens the receiver’s memory handle, maps it into its address space and executes peer-GPU data transfer. A progress element is inserted by the sender. The progress engine is again polled for completion, and a DONE message is sent to notify the receiver of completion.

c) LMT Peer-GPU COOPERATE: If the receiver decides both sides can help the data transfer, after both interprocess memory handles are exchanged in RTS and CTS messages, the payload is divided into two halves; the sender puts one half in the receive buffer and the receiver gets the other half from the sender’s buffer. Progress elements are created on both sides. When the receiver is done, a COOKIE message is sent to the sender to notify it of the partial completion of the data transfer; the sender finally sends a DONE message to denote full completion.
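The payload split used by COOPERATE mode can be illustrated with a short sketch. The function names are hypothetical, byte arrays stand in for GPU buffers, and which half each side handles is an assumption made for illustration; in the real protocol the two copies are DMA requests issued concurrently on the two GPUs.

```python
def split_for_cooperate(nbytes):
    """Return the (offset, length) range for the sender's put and the receiver's get."""
    first = nbytes // 2
    return (0, first), (first, nbytes - first)


def cooperate_copy(src, dst):
    """Model both halves of a COOPERATE transfer on plain byte arrays."""
    put_range, get_range = split_for_cooperate(len(src))
    off, n = put_range
    dst[off:off + n] = src[off:off + n]   # sender pushes its half into the receive buffer
    off, n = get_range
    dst[off:off + n] = src[off:off + n]   # receiver pulls the other half from the send buffer
    return put_range, get_range


src = bytearray(range(7))
dst = bytearray(7)
ranges = cooperate_copy(src, dst)
print(ranges, dst == src)   # ((0, 3), (3, 4)) True
```

For odd payload sizes this sketch gives the extra byte to the receiver's half; the two ranges never overlap and together cover the whole message.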

The receiver’s decision may seem arbitrary in the mutually accessible case; however, protocol selection determines which GPU, and hence which DMA engine, is used for data movement: the driving process uses the DMA engine on its own GPU. Furthermore, DMA requests issued to the same engine are serialized. Thus, the choice of protocol can be critical in managing DMA contention, depending on the communication pattern.

B. Intranode MPI Communication between Host and Device

In addition to GPU-to-GPU communication, we address intranode MPI communication between the host memory and the device memory. Currently, only GPU device memory can be exported to another process in CUDA IPC. Main memory buffers cannot be exported without the support of operating system kernel modules. This means that the process with communication buffers in main memory (the host-side process) must be the one to initiate the payload transfer. Upon receiving the interprocess memory handle to the device memory buffer exported in the other process, the host-side process opens it and maps the memory to its address space and then initiates the data transfer to or from the device.

The host-side process needs a valid GPU context to request the GPU DMA engine for data communication between the host and the device. However, this may not always be possible. Depending on the availability of an active GPU context, two situations might arise for host-device MPI communication, as described below.

a) Attach: If no active GPU context is available, the host-side process can attach to any available GPU device, by creating a new context on that GPU. A good choice here is to attach to the same device that contains the communicating device buffer. The context is then cached and can be reused for future data transfers.

b) Relay: If an active GPU context is already available, the DMA engine of the corresponding GPU can act as a relay to perform the data transfers with the device-side process. Although we could temporarily change the active GPU context to use another GPU device (possibly the one with the communicating device buffer) and change it back after the communication is done, this approach is not feasible. Since the active device context is a global setting in CUDA, changing it would redirect all GPU commands issued concurrently with this communication onto the temporarily active device, potentially disrupting the user’s program.
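The choice between the two situations can be summarized in a small sketch. The function name and return convention are hypothetical, and context caching after an attach is omitted for brevity.

```python
from typing import Optional, Tuple


def plan_host_side_transfer(active_device: Optional[int],
                            buffer_device: int) -> Tuple[str, int]:
    """Pick the strategy and the GPU whose DMA engine drives a host-device transfer.

    active_device: device on which the host-side process already has an active
                   GPU context, or None if it has none.
    buffer_device: device that holds the exported device buffer.
    """
    if active_device is None:
        # Attach: create a context on the device that owns the buffer.
        return ("attach", buffer_device)
    # Relay: keep the current context and use its DMA engine, rather than
    # switching the globally active device underneath the user's program.
    return ("relay", active_device)


print(plan_host_side_transfer(None, 1))   # ('attach', 1)
print(plan_host_side_transfer(2, 1))      # ('relay', 2)
```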

C. Efficient Management of Memory Handles

In all the LMT peer-GPU communication protocols, we first get the interprocess memory handle of a memory region in one process, then open it and map it into the address space of the other process. While getting the memory handle (cudaIpcGetMemHandle) is a lightweight operation [11], we find that opening the handle (cudaIpcOpenMemHandle) is expensive, probably due to interactions in the host-side driver that import and export device buffer addresses.

In a preliminary design, we open a memory handle after the RTS/CTS message exchange at the beginning of the communication, and close it after the data transfer is done. Repeated opening and closing of the interprocess memory handle causes significant performance overhead.

Observing that many GPU programs have a relatively fixed memory allocation pattern (e.g., creating a large memory region before computation, reusing it for computation and data communication, and releasing it only much later), we choose to cache memory handles. When a communication is done, we therefore leave the memory handle open instead of closing it. During the next communication operation, when a memory handle arrives in an RTS/CTS packet’s cookie, we first check whether the memory handle has been cached locally. If so, memory handle caching eliminates the reopening and closing of the handle. We observe that latency is more than halved after applying this optimization.
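A minimal sketch of such a cache is shown below. `HandleCache` and its injected `open_handle` callable are hypothetical; the callable stands in for the expensive open operation (cudaIpcOpenMemHandle in the text), and handles are modeled as strings, although any hashable value, such as the raw bytes of an opaque handle, would serve as a key.

```python
class HandleCache:
    """Keep opened interprocess memory handles mapped across communications."""

    def __init__(self, open_handle):
        self._open_handle = open_handle   # stand-in for the expensive open call
        self._mapped = {}                 # handle -> locally mapped address
        self.open_calls = 0               # counts how often the slow path runs

    def get(self, handle):
        """Return the mapped address for a handle, opening it only on first use."""
        if handle not in self._mapped:
            self._mapped[handle] = self._open_handle(handle)
            self.open_calls += 1
        return self._mapped[handle]


cache = HandleCache(lambda h: "mapped:" + h)
pointers = [cache.get("buf0") for _ in range(3)]
print(cache.open_calls)   # 1
```

Three transfers that reuse the same buffer pay for a single open; entries are deliberately never evicted here, which is why a separate mechanism is needed to close them before the memory is freed.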

This design leads to a problem in closing a memory handle: the MPI runtime does not know when an open memory handle should be closed, yet a memory handle must be closed before the memory region is freed [11]. To solve this problem, we add a two-phase GPU memory free mechanism by providing two functions (gpuMemFree and gpuMemFree_commit). When a GPU memory free is called on a memory region, the request is only recorded, with the region's memory handle marked in case it was ever exported; the region is not released immediately. When a GPU memory free commit is called, the marked memory handles are exchanged with the other processes on that node. Any process that finds a marked handle in its cache closes it. After all processes finish closing memory handles, the GPU memory regions are released.
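The two-phase free can be modeled as follows. This is a sketch under stated assumptions: the `Process` class and its method names are hypothetical Python stand-ins for the gpuMemFree and gpuMemFree_commit functions named above, strings stand in for memory handles, and direct access to a peer object replaces the real exchange of marked handles between processes.

```python
class Process:
    """Hypothetical model of one MPI process on a node."""

    def __init__(self, rank):
        self.rank = rank
        self.allocated = set()   # handles of GPU regions owned by this process
        self.marked = set()      # regions whose free was requested but not committed
        self.opened = set()      # cached handles imported from other processes

    def gpu_malloc(self, handle):
        self.allocated.add(handle)

    def gpu_mem_free(self, handle):
        """Phase 1: record the request; the region stays allocated."""
        if handle not in self.allocated:
            raise KeyError(handle)
        self.marked.add(handle)

    def gpu_mem_free_commit(self, node_processes):
        """Phase 2: peers close the marked handles, then the regions are released."""
        for peer in node_processes:
            if peer is not self:
                peer.opened -= self.marked   # peer closes any cached copy
        self.allocated -= self.marked        # now it is safe to release the memory
        self.marked.clear()


owner, peer = Process(0), Process(1)
owner.gpu_malloc("buf0")
peer.opened.add("buf0")                      # peer cached the handle during a transfer
owner.gpu_mem_free("buf0")
still_allocated = "buf0" in owner.allocated  # True: phase 1 does not release
owner.gpu_mem_free_commit([owner, peer])
print(still_allocated, owner.allocated, peer.opened)   # True set() set()
```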

IV. EVALUATION

Our evaluation was conducted on two typical GPU-accelerated clusters. These systems are representative of current multi-GPU heterogeneous architectures, with different cross-socket interconnects, NUMA settings, and GPU connection topologies, as summarized in Table I.

The Keeneland [10] cluster is a National Science Foundation Track 2D system based on the HP SL390 and is located at Oak Ridge National Laboratory. This system is powered with NVIDIA Tesla M2070 GPUs. Each compute node is configured with two Intel Xeon X5660 hex-core CPUs, 24 GB main memory, 3 GPU devices connected through 2 I/O hubs; nodes are connected via single rail, QDR InfiniBand. The software environment is CentOS release 5.5 (Final) with Linux kernel 2.6.18-194.el5.perfctr and CUDA driver/runtime v4.1.

The Magellan [14] cluster is a DOE mid-range distributed computing research effort and is located at Argonne National Laboratory.
### TABLE I

<table>
<thead>
<tr>
<th>Cluster</th>
<th>NUMA nodes</th>
<th>Interconnect</th>
<th>GPUs</th>
<th>GPU Topology</th>
<th>Peer Access</th>
<th>Distance Between Peers</th>
</tr>
</thead>
<tbody>
<tr>
<td>Keeneland</td>
<td>2</td>
<td>Intel QPI</td>
<td>3</td>
<td>GPU 0: Node 0; GPU 1,2: Node 1</td>
<td>Only GPU 1 and 2</td>
<td>2 PCIe hops</td>
</tr>
<tr>
<td>Magellan</td>
<td>4</td>
<td>AMD HT</td>
<td>2</td>
<td>GPU 0: Node 0; GPU 1: Node 3</td>
<td>Yes</td>
<td>2 PCIe hops + 1 HT hop</td>
</tr>
</tbody>
</table>

Keeneland and Magellan System Architectures, including GPU topologies.

The system is powered with NVIDIA Tesla M2070 GPUs. Nodes are configured with four AMD Opteron 6128 quad-core CPUs, 64 GB main memory, and 2 GPU devices connected to 2 I/O hubs. The system interconnect is QDR InfiniBand and the software environment is CentOS Linux release 6.0 (Final) with Linux kernel 2.6.32-71.29.1.el6.x86_64 and CUDA driver/runtime v4.1.

Communication performance measurements on these systems were taken using the latency and bandwidth tests from OSU benchmark suite [15]. In addition, the impact of this work on application-level performance was measured using the Stencil2D kernel from the SHOC benchmark suite [9].

#### A. DMA-assisted, Intranode GPU-GPU Communication

We first evaluate the performance of our GPU DMA-assisted peer-to-peer intranode communication. In this test, we compare performance against our previous design (shm), the shared memory based data transfer approach. The latency and bandwidth tests both involve two processes, with both source and destination buffers in GPU memory. On Keeneland, we use GPUs 1 and 2, connected to the same I/O hub (near case). On Magellan, GPUs 0 and 1 are connected to two different I/O hubs (far case). All our experiments in this section evaluate large message transfers, i.e., messages larger than 64 KB in our current setting. We also always pin the controlling CPU process to the socket that is closest to the controlled GPU.

From this data, we see that DMA-assisted communication provides lower latency and higher bandwidth than shm when two near GPU devices are used, primarily because DMA-assisted LMT avoids data staging through host-side shared memory buffers and reduces contention on the shared I/O hub. However, when the two GPUs are attached to different I/O hubs, as in the far case shown in Figure 4, we find the opposite result. The two GPUs are now connected by a longer data path of three subchannels: a PCIe bus, an HT interconnect link, and another PCIe bus. Although DMA-assisted data movement avoids data staging in host-side shared memory buffers, this staging accounts for only a small portion of the data path. Meanwhile, GPU DMA-driven transactions travel serially over the data path; at any time, only one packet is traversing the three subchannels. In contrast, the shm protocol uses both the sender and the receiver, which write data into and read data from the staging buffers, respectively; this partitions the data path into two relays and creates more parallelism.

By comparing different LMT modes, we see that COOPERATE mode is never the best. This is a surprising result since, in cooperative mode, we split the data into two halves, and the two GPU DMA engines each drive half of the data transfer concurrently. However, results indicate that, in practice, this method is consistently slower than one of the one-sided modes. This may be caused by interference between the two GPU devices: when a peer direct access happens, the DMA engine communicates with a remote agent on the other GPU device to translate data locations and issue memory module commands. Therefore, when two DMA engines are working simultaneously, this can lead to contention in accessing these hardware resources.

We also evaluate GPU-GPU communication performance in the case where two processes share one GPU device, which can be a common case in practice because clusters typically have more CPU cores than GPU devices. Results are presented in Figure 5 for the Keeneland system, and similar results were observed on Magellan. From these results, we see that DMA-assisted data transfer is able to leverage fast data movement within the GPU device, resulting in an order of magnitude improvement in bandwidth and latency over *shm*.

#### B. DMA-assisted, Intranode GPU-Host Communication

The DMA-assisted communication protocol can also be used for intranode communication where one buffer is located in host memory and the other is in GPU memory. Figure 6 shows results for this case on Keeneland, where two processes are both pinned to NUMA node 1, where GPUs 1 and 2 are connected. We evaluate two cases, which are distinguished by whether the process using the host buffer is using other devices connected to the same I/O hub. Using this setup, we show results for *attached* and *relayed* transfers, as explained in Section III-B.

Results indicate that DMA-assisted LMT does not perform as well as *shm* on Keeneland; we observed similar results on Magellan for GPU-Host communication where processes are pinned to more distant NUMA nodes. When comparing attached transfer performance with *shm*, we observe that, although both data paths go from the GPU device to the host CPU, in *shm* data can be copied to the shared memory buffer without considering peer accessibility. In contrast, the attached case must start a new context on the device to access the exported memory. As a result, the overhead of establishing peer accessibility outweighs the benefit of eliminating main memory copies, especially when the overhead of creating a new context is large. Though this overhead is amortized by later reuse, it significantly impacts performance. We expect that, with improvements to GPUs and GPU drivers, this overhead will decrease on future devices.

The relayed case emulates the scenario where the CPU process is using another GPU device for some computation while it performs Host-GPU communication. In this situation, the CPU process must use the DMA engine on the currently active device. This case shows that using a remote DMA engine to relay the data between a host buffer and device memory results in poor performance. Thus, in both Host-GPU transfer cases, we should fall back to *shm* to provide the best performance.

#### C. Application Evaluation: Stencil2D

The Stencil2D kernel from the SHOC benchmark suite [9] measures the performance of a nine-point, two-dimensional stencil computation. It performs an iterative stencil computation on the GPU and requires a data exchange every haloWidth iterations. In this type of computation, processes are arranged in an $N$-dimensional Cartesian grid, and each process is assigned a corresponding section of an $N$-d array. Periodically, a process must obtain the values that its neighbors have calculated for the array elements that border its patch, known as its halo. This communication idiom, which is common across a broad range of iterative solvers, is therefore referred to as a halo exchange.
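The halo exchange idiom can be sketched for two vertically adjacent patches as follows. This is an illustrative model, not the SHOC Stencil2D code: `make_patch` and `exchange_halo` are hypothetical names, nested lists stand in for device buffers, and plain row copies stand in for the MPI send and receive operations that would carry the boundary rows between processes.

```python
def make_patch(rows, cols, halo, fill):
    """Build a patch with `halo` ghost rows above and below `rows` interior rows."""
    ghost_top = [[0.0] * cols for _ in range(halo)]
    interior = [[fill] * cols for _ in range(rows)]
    ghost_bottom = [[0.0] * cols for _ in range(halo)]
    return ghost_top + interior + ghost_bottom


def exchange_halo(upper, lower, halo):
    """Swap boundary rows between a patch and the patch directly below it.

    Assumes each patch has at least `halo` interior rows.
    """
    rows_upper = len(upper) - 2 * halo
    for k in range(halo):
        # Upper patch's last interior rows fill the lower patch's top ghost rows.
        lower[k] = list(upper[rows_upper + k])
        # Lower patch's first interior rows fill the upper patch's bottom ghost rows.
        upper[halo + rows_upper + k] = list(lower[halo + k])


upper = make_patch(rows=4, cols=3, halo=1, fill=1.0)
lower = make_patch(rows=4, cols=3, halo=1, fill=2.0)
exchange_halo(upper, lower, halo=1)
print(lower[0], upper[-1])   # [1.0, 1.0, 1.0] [2.0, 2.0, 2.0]
```

A wider halo moves more rows per exchange but lets each process run more stencil iterations before the next exchange is needed.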

Figure 7 shows the relative performance improvement of DMA-assisted communication over the original shared memory (*shm*) approach. Results were gathered on the Keeneland system for both single- and double-precision versions of the calculation. Overall, DMA-assisted communication provided an average speedup of 4.7% for single precision and 2.3% for double precision. Because double-precision performance is much lower than single-precision performance on GPUs, communication accounts for a smaller portion of the total runtime, resulting in a smaller overall benefit in the double-precision case.

In this workload, MPI ranks are assigned to GPU devices in a round-robin manner. This explains the high improvement factor seen in the case of 4 processes. In this case, ranks 0 and 3 are assigned to GPU 0, and ranks 1 and 2 are assigned to GPU 1. The halo exchange happens first vertically and then horizontally; in each step, DMA-assisted peer communication occurs between one pair of processes and the original *shm* protocol is used between the other pair. As a result, communication overlaps in a mutually beneficial pattern. This overlap also occurs in the 6-process case; however, additional MPI ranks sharing a GPU lead to a higher level of contention and result in a lower degree of performance improvement. The case of 2 processes results in a surprising decrease in performance. This may be due to PCIe bus contention, since both parties are trying to send and receive equal-size messages; we plan to continue studying this case in our future work.

When we analyze performance improvement relative to problem size for the 2048, 4096, and 8192 groups of matrix sizes (the matrix dimensions within each group are changed as needed to vary halo width), average speedups are 8.5%, 3.9%, and 1.6% for single precision, and 4.3%, 2.2%, and 0.3% for double precision. We observe that the amount of computation increases quadratically with problem size, which effectively reduces the fraction of time spent on communication and, as a result, the potential for performance improvement.
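The scaling argument can be made explicit with a simple model. The following is a sketch under assumptions that are ours rather than the paper's: computation time grows with the number of array elements, halo traffic grows with the length of the patch boundary, and $a$ and $b$ are hypothetical machine-dependent constants.

```latex
% N: matrix dimension; a, b: hypothetical per-element and per-boundary-element costs
T_{\mathrm{comp}}(N) = a N^{2}, \qquad T_{\mathrm{comm}}(N) = b N
% Fraction of runtime spent communicating:
f(N) = \frac{T_{\mathrm{comm}}(N)}{T_{\mathrm{comp}}(N) + T_{\mathrm{comm}}(N)}
     = \frac{b N}{a N^{2} + b N}
     = \frac{b}{a N + b}
```

Under this model $f(N)$ decreases monotonically as $N$ grows, so any speedup confined to the communication phase contributes a shrinking share of total runtime.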

We vary the halo widths within each matrix dimension group to explore the change in performance as the communication volume is increased; the increased halo width reduces the total communication time—e.g. for the problem size of 2048, the communication portion of total execution time decreases from 34% down to 14% for single-precision and 22% down to 9% for double-precision—and therefore reduces the benefit of improved data movement.

V. RELATED WORK

Several research efforts have investigated modifications to MPI to better facilitate hybrid MPI+GPU programming. Currently, only processes running on the CPU can perform MPI calls. Stuart et al. [16] have suggested several mechanisms for extending the MPI standard to provide native support for accelerators. One significant proposal would allow GPU threads to obtain MPI ranks and participate directly in MPI communication [17]. However, because GPUs lack network I/O functionality, CPU helper threads are needed, which presents challenges to performance modeling and may introduce new overheads.

Another major area of research utilizes the current model for MPI participation in GPU-accelerated systems and extends it with transparent solutions for interacting with accelerator data [2]–[4], [6]. An advantage of this approach is that it allows existing MPI programs to more easily benefit from GPUs by reducing the amount of programmer effort that must be expended to manage distinct host and device memories. This work falls into this category and differs from prior work in this space by developing techniques that use new accelerator features to accelerate intranode communication.

While MPI has traditionally been known as a system for internode communication, intranode communication has become equally important because of increasing core counts [18]–[22]. This work complements existing intranode communication efforts by studying the impact of GPUs on intranode communication systems.

VI. CONCLUDING REMARKS

In this work, we explored the design of an intranode communication subsystem for a GPU-aware MPI implementation that allows the programmer to supply device buffers directly to MPI calls. We developed mechanisms for direct, DMA-assisted peer-to-peer data transfers involving host and GPU buffers, as well as GPU and GPU buffers. Through communication benchmarking, we evaluated the performance of several significant alternatives in the design space and constructed a full system that utilizes the best protocols and parameters for each variation in calling context. Our communication benchmarking revealed that DMA-assisted peer-to-peer data transfer yields greater benefits when applied to GPU devices that are nearby; in some situations, DMA-assisted transfers can hurt performance, and our implementation falls back to an efficient shared memory transport.

We evaluated the performance impact of our modified MPI implementation on a halo exchange application kernel. When compared against the baseline shared memory data transfer method, an average speedup of 4.7% and 2.3% was observed for the stencil kernel for single- and double-precision computations, respectively.

ACKNOWLEDGEMENTS

This work was sponsored in part by an NSF CAREER Award (CNS-0546301), an NSF award (CNS-0915861), Xiaosong Ma's joint appointment between NCSU and ORNL, and the U.S. Department of Energy under Contract DE-AC02-06CH11357. This work was also supported in part by NSF grant I/UCRC IIP-0804155 via the NSF Center for High-Performance Reconfigurable Computing, NSF grant MRI-0960081, and NSF grant CSR-0916719. This work used resources provided by the Keeneland Computing Facility at the Georgia Institute of Technology, which is supported by the National Science Foundation under contract OCI-0910735.

REFERENCES


