Compiling with CUDA support
---------------------------

These tools allow CP2K to perform portions of the calculation on NVIDIA graphics cards using the CUDA API.  To use these features you need a CUDA-compatible graphics card.  In addition, you must download and install the CUDA toolkit, the CUDA SDK, and a CUDA-compatible NVIDIA driver.  These can be obtained from NVIDIA's website.

To compile CP2K with CUDA support, the arch file must be modified.  An example is provided in arch/Linux-x86-64-cuda.sopt.  The necessary modifications are:

BASIC:

1)  Add a line "NVCC = nvcc" pointing the makefile to the nvcc compiler.

2)  Add -D__CUDAPW and -D__FFTCU to the DFLAGS and NVFLAGS variables;
    optionally add -D__FFTSGL for faster single precision FFTs.

3)  Add libcudart.so, libcufft.so, and libcublas.so to the LIBS variable.

4)  If using this in conjunction with DBCSR (see the next section),
    specify an appropriate size for the GLOBAL / CUDA / MEMORY option
    in the input file.
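
The arch file changes in steps 1 to 3 can be sketched as the fragment below.  This is a minimal illustration, not a copy of arch/Linux-x86-64-cuda.sopt: the CUDA_PATH variable, its value, and the lib64 subdirectory are assumptions about a typical Linux installation, and the lines are assumed to come after the existing definitions of DFLAGS, NVFLAGS, and LIBS so that += appends to them.

```make
# Hypothetical install prefix of the CUDA toolkit; adjust to your system.
CUDA_PATH = /usr/local/cuda

# 1) Point the makefile to the nvcc compiler.
NVCC      = nvcc

# 2) Enable the CUDA code paths for the host compiler and for nvcc.
DFLAGS   += -D__CUDAPW -D__FFTCU
NVFLAGS  += -D__CUDAPW -D__FFTCU
# Optional: faster single precision FFTs.
# DFLAGS  += -D__FFTSGL
# NVFLAGS += -D__FFTSGL

# 3) Link against the CUDA runtime, cuFFT, and cuBLAS.
#    The library directory is an assumption (lib64 is common on x86-64 Linux).
LIBS     += -L$(CUDA_PATH)/lib64 -lcudart -lcufft -lcublas
```

If nvcc is not on the PATH, NVCC can instead be set to its full path.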

DBCSR (experimental, with limitations):

1. Add -D__DBCSR_CUDA to DFLAGS.

2. Add -lcudart and -lrt to the LIBS variable (ensuring that
   libcudart.so is in the library path; use -L followed by the
   directory containing libcudart.so to define it).

3. Consult the input reference manual for input file options that
   modify DBCSR CUDA behavior.
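
As a rough sketch, DBCSR steps 1 and 2 might look like this in the arch file.  As before, CUDA_PATH and the lib64 subdirectory are assumed names for illustration only, and the lines are assumed to follow the existing DFLAGS and LIBS definitions.

```make
# Hypothetical install prefix of the CUDA toolkit; adjust to your system.
CUDA_PATH = /usr/local/cuda

# 1. Enable the experimental DBCSR CUDA code path.
DFLAGS += -D__DBCSR_CUDA

# 2. Link against the CUDA runtime and librt.
#    -L takes the directory that contains libcudart.so, not the file itself.
LIBS   += -L$(CUDA_PATH)/lib64 -lcudart -lrt
```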

Then, compile as normal.

Alternatively, one can compile with support for running DGEMM and DSYMM on the GPU, but nothing else.  This is accomplished as above, but instead of __CUDAPW and __FFTSGL one should include __CUBLASDP.
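
For the DGEMM/DSYMM-only variant, the flag lines of the arch file would then reduce to something like the sketch below.  The text does not say whether __FFTCU or the cuFFT library should be kept in this mode, so both are left out here as an assumption; the remaining settings (NVCC and the library path) are taken to be the same as in the basic case.

```make
# Offload only DGEMM and DSYMM to the GPU via cuBLAS.
NVCC     = nvcc
DFLAGS  += -D__CUBLASDP
NVFLAGS += -D__CUBLASDP

# CUDA runtime and cuBLAS; add a -L option if they are not in the library path.
LIBS    += -lcudart -lcublas
```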

Features
--------
At the moment, the only portions of the calculation that can be performed on the graphics card are the FFT, the associated scatter/gather operations, and some linear algebra.  Because NVIDIA graphics cards reach full performance only for single precision floating point arithmetic, the FFT must be performed in single precision (hence the -D__FFTSGL option).

