Cecka Research

Low-Communication FMM-Accelerated FFT on GPUs

Conference Paper. Media: NVIDIA.

For distributed 1D FFTs, communication costs quickly dominate execution time as all industry-standard implementations perform three all-to-all transpositions of the data. In this work, we reformulate an algorithm that employs the Fast Multipole Method to reduce the communication requirements to approximately a single all-to-all transpose. We present a detailed and clear implementation strategy that relies heavily on existing library primitives, demonstrate that our strategy achieves consistent speed-ups between 1.3x and 2.2x against cuFFTXT on up to eight NVIDIA Tesla P100 GPUs, and develop an accurate compute model to analyze the performance and dependencies of the algorithm.

Tensor Contractions with Extended BLAS Kernels on CPU and GPU

Conference Paper. Slides. Media: ParallelForAll.

Typically, tensor contractions are reshaped so they may be evaluated with GEMM or other high-performance linear algebra primitives. We demonstrate that any explicit memory motion can easily dominate tensor contractions.

The StridedBatchedGEMM primitive allows these contractions to be performed without explicit copies or transposes and can provide up to 10x performance over other libraries. The primitive is now available in CUBLAS 8.0.

FMMTL: FMM Template Library

Code Repo. Book Chapter. Slides.

A generalized library designed to separate out research areas in fast structured dense matrix algorithms.

This library will facilitate continuing research in developing kernel expansions, scheduling and parallel evaluation of tree-based methods, and abstracted development applicable to a huge range of physics, graphics, and machine learning problems.

The library takes advantage of GPU acceleration when possible, and currently uses OpenMP parallelization for additional performance.

CrowdCL: Web-Based Volunteer Computing with WebCL

Code Repo. Presentation.

CrowdCL is an open-source framework for the rapid development of volunteer computing and OpenCL applications on the web. Drawing inspiration from existing GPU libraries like PyCUDA, CrowdCL provides an abstraction layer for WebCL aimed at reducing boilerplate and improving code readability. CrowdCL also provides developers with a framework to easily run computations in the background of a web page, which allows developers to distribute computations across a network of clients and aggregate results on a centralized server. We compare the performance of CrowdCL against serial implementations in Javascript and Java across a variety of platforms. Our benchmark results show strong promise for the web browser as a high-performance distributed computing platform.

Cauchy Fast Multipole Method for General Analytic Kernels

Presentation.

In this paper, we present a novel formulation of a kernel-independent fast multipole method.

The FMM has found many applications in the field of integral equations and boundary element methods, in particular by accelerating the solution of dense linear systems arising from such formulations. Standard FMMs are derived from analytical expansions of the kernel, for example using spherical harmonics or Taylor expansions. In recent years, the range of applicability and the ease of use of FMMs has been extended by the introduction of black-box and kernel-independent techniques. In these approaches, the user need only provide a subroutine to numerically calculate the interaction kernel. This allows changing the definition of the kernel with minimal changes to the computer program. This paper presents a novel kernel-independent FMM, which retains to diagonal multipole-to-local operators. The result is a significant reduction in the computational cost, in particular when high accuracy is needed. The approach is based on an analytic extension of the kernel, Cauchy’s integral formula, and the Laplace transform. We present a numerical analysis of the convergence and numerical results in the case of a multilevel level one-dimensional FMM.

The Fast Multipole Method's M2L on Graphics Processors

Review article. Application to Helmholz BIEM. GTC2012. Source code.

This research presents a number of algorithms to run the fast multipole method graphics processors. The methods described in the paper are applicable to any FMM in which the multipole-to-local (M2L) operator is a dense matrix and the matrix is precomputed. This is the case for example in the black-box fast multipole method (bbFMM) of Fong et al., which is a variant of the FMM that can handle large class of kernels.

In the FMM, two operators represent most of the computational cost and an optimal implementation typically tries to balance those two operators. One is the nearby interaction calculation (P2P) and the other is the M2L operation. We focus on the M2L. By combining multiple M2L operations and reordering the primitive loops of the M2L so that CUDA threads can reuse or share common data, these approaches reduce the movement of data in the GPU. Since memory bandwidth is the primary bottleneck of these methods, significant performance improvements are realized. Four M2L schemes are detailed and analyzed in the case of a uniform tree.

The four schemes are tested and compared with an optimized, OpenMP parallelized, multi-core CPU code. The accuracy of the GPU codes is found to be satisfactory and achieved performance over 200 Gflop/s on one NVIDIA Tesla C1060 GPU. This was compared against two quad-core Intel Xeon E5345 processors running at 2.33 GHz, for a combined peak performance of 149 Gflop/s for single-precision. We also present benchmarks on an NVIDIA C2050 GPU (a Fermi processor) in single and double precision.

Fourier Based Fast Multipole Method for the Helmholtz Equation

Presentation. Group Presentation. Source code.

The fast multipole method (FMM) has had great success in reducing the computational time required to solve the boundary integral form of the Helmholtz equation.

We present a formulation of the Helmholtz FMM using Fourier basis functions rather than spherical harmonics that accelerates some of the time-critical stages of the algorithm. With modifications to the transfer function in the precomputation stage of the FMM, the interpolation and anterpolation operators become straightforward applications of fast Fourier transforms and the transfer operator remains diagonal.

Using Fourier analysis, constructive algorithms are derived to a priori determine an integration quadrature for a given error tolerance. Sharp error bounds are derived and verified numerically. Various optimizations are considered to reduce the number of quadrature points and reduce the cost of computing the transfer function.

Application of Assembly of Finite Element Methods on Graphics Processors

Real-time Video. Presentation. Source code.

The previous research on accelerating the assembly of finite element methods on GPUs is applied to a non-linear Neo-Hookean model for large time-step elastodynamics. To solve the non-linear system of equations, a Newton-Raphson method is used which requires assembling and solving the linearized tangent system of equations for each iteration (approximately 4-6 iterations per time-step). In the linked paper we show that accelerating only the sparse matrix-vector multiplication via the GPU makes the assembly stage a sever bottleneck.

We show that by running each stage of the algorithm on the GPU, the assembly, solution, and visualization of a dynamic FEM problem can benefit from speed-ups in each stage and avoid expensive data transfers between the host device and the GPU coprocessor, resulting in simulations of unprecedented speed and size. The linked demonstration video shows the application to a volumetric mesh of a hand consisting of 125,127 elements and 28,796 nodes run in real time and interacting with the user.

Assembly of Finite Element Matrices on Graphics Processors

Poster. Presentation. Source code.

Graphics processing units (GPUs) have had great success in accelerating many numerical computations. In this research, we present apply their unique compute model to computations on unstructured meshes such as those in finite element methods. Multiple approaches in assembling and solving sparse linear systems with NVIDIA GPUs and the Compute Unified Device Architecture (CUDA) are presented and multiple strategies for efficient use of global, shared, and local memory, methods to achieve memory coalescing, the optimal choice of parameters, and the appropriate level of parallelism to choose are discussed. The pros and cons of each technique are presented in addition to how each can efficiently make use of the GPU hardware resources.

The more successful approaches are compared using a two-dimensional benchmark case on an NVIDIA GeForce 8800 GTX, an NVIDIA Tesla C1060, and a single core of an Intel Core 2 Quad CPU Q9450 2.66 GHz. The performance is analyzed as a function of the problem size (number of mesh nodes), and the order of the finite-element method. We find that with appropriate preprocessing and arrangement of support data, the GPU coprocessor achieves speedups of 30 or more in comparison to a well optimized serial implementation. We also find that the optimal assembly strategy depends on the order of polynomials used in the finite-element discretization.

Thus, the assembly, solution, and visualization of a dynamic FEM problem can be performed completely on the GPU, avoiding expensive data transfers between the host device and the GPU coprocessor. This is especially interesting for real-time visualization in graphics, design software, simulation, and gaming.

Vascular Tumor Modeling

Poster. Presentation.

This research was my undergraduate senior clinic project, which is similar to a thesis. Over the course of a year, I works on an interdisciplinary team of four Harvey Mudd seniors -- Alan Davidson (CS), Tiffany Head (Math), Dana Mohamed (Math-Bio), and Liam Robinson (Math) -- in conjunction with Los Alamos National Laboratory to vascularize an existing multiscale tumor model. The model was composed of

Extracellular model: Production, consumption, and diffusion of chemicals.
Cellular model: Monte Carlo governed cell movement and growth.
Intracellular model: Cell cycle, protein network, and protein expression.

We investigated the effects of vasculature on tumor growth as well as the impacts of various models for chemotherapy. The final report is linked above.

Primarily, I research and wrote a diffusion equation solver with time and space dependent source and sink terms which could be applied to systems with arbitrary Dirichlet boundary condition geometries in three dimensions. This diffusion model was composed of a backwards Euler approximation solved with an adaptive Gauss-Seidel iteration.

Quantum Computing Research

Presentation on Grover's Algorithm.

Linked above is a short summary of some of the research into quantum computing an quantum information I conducted with Professor Chris Stone over the summer of 2005. During that summer, we

Researched and developed a framework for quantum computing languages.
Investigated the requirements and restrictions of any quantum computing language: quantum operator and quantum bit manipulation, reversibility, state coherence, no cloning, etc.
Investigated current quantum computing algorithms and looked at their correctness, optimizations, improvements, and generalizations. Included were the Deutsch-Jozsa, Grover, and Shor algorithms in addition to teleportation, encryption, and operator decomposition.
Implemented some of these algorithms in various existing quantum computing simulation languages including QCL and Q-Language and also implemented small improvements to these languages.

Cris Cecka

Senior Research Scientist, NVIDIA Research
Adjunct Professor, Stanford University