[][src]Macro rustacuda::launch

macro_rules! launch {
    ($module:ident . $function:ident <<<$grid:expr, $block:expr, $shared:expr, $stream:ident>>>( $( $arg:expr),* )) => { ... };
    ($function:ident <<<$grid:expr, $block:expr, $shared:expr, $stream:ident>>>( $( $arg:expr),* )) => { ... };

Launch a kernel function asynchronously.


The format of this macro is designed to resemble the triple-chevron syntax used to launch kernels in CUDA C. There are two forms available:

This example is not tested
let result = launch!(module.function_name<<<grid, block, shared_memory_size, stream>>>(parameter1, parameter2...));

This will load a kernel called function_name from the module module and launch it with the given grid/block size on the given stream. Unlike in CUDA C, the shared memory size and stream parameters are not optional. The shared memory size is a number of bytes per thread for dynamic shared memory (Note that this uses extern __shared__ int x[] in CUDA C, not the fixed-length arrays created by __shared__ int x[64]. This will usually be zero.). stream must be the name of a Stream value. grid can be any value which implements Into<GridSize> (such as u32 values, tuples of up to three u32 values, and GridSize structures) and likewise block can be any value that implements Into<BlockSize>.

NOTE: due to some limitations of Rust's macro system, module and stream must be local variable names. Paths or function calls will not work.

The second form is similar:

This example is not tested
let result = launch!(function<<<grid, block, shared_memory_size, stream>>>(parameter1, parameter2...));

In this variant, the function parameter must be a variable. Use this form to avoid looking up the kernel function for each call.


Launching kernels must be done in an unsafe block. Calling a kernel is similar to calling a foreign-language function, as the kernel itself could be written in C or unsafe Rust. The kernel must accept the same number and type of parameters that are passed to the launch! macro. The kernel must not write invalid data (for example, invalid enums) into areas of memory that can be copied back to the host. The programmer must ensure that the host does not access device or unified memory that the kernel could write to until after calling stream.synchronize().


use rustacuda::memory::*;
use rustacuda::module::Module;
use rustacuda::stream::*;
use std::ffi::CString;

// Set up the context, load the module, and create a stream to run kernels in.
let _ctx = rustacuda::quick_init()?;
let ptx = CString::new(include_str!("../resources/add.ptx"))?;
let module = Module::load_from_string(&ptx)?;
let stream = Stream::new(StreamFlags::NON_BLOCKING, None)?;

// Create buffers for data
let mut in_x = DeviceBuffer::from_slice(&[1.0f32; 10])?;
let mut in_y = DeviceBuffer::from_slice(&[2.0f32; 10])?;
let mut out_1 = DeviceBuffer::from_slice(&[0.0f32; 10])?;
let mut out_2 = DeviceBuffer::from_slice(&[0.0f32; 10])?;

// This kernel adds each element in `in_x` and `in_y` and writes the result into `out`.
unsafe {
    // Launch the kernel with one block of one thread, no dynamic shared memory on `stream`.
    let result = launch!(module.sum<<<1, 1, 0, stream>>>(
    // `launch!` returns an error in case anything went wrong with the launch itself, but
    // kernel launches are asynchronous so errors caused by the kernel (eg. invalid memory
    // access) will show up later at some other CUDA API call (probably at `synchronize()`
    // below).

    // Launch the kernel again using the `function` form:
    let function_name = CString::new("sum")?;
    let sum = module.get_function(&function_name)?;
    // Launch with 1x1x1 (1) blocks of 10x1x1 (10) threads, to show that you can use tuples to
    // configure grid and block size.
    let result = launch!(sum<<<(1, 1, 1), (10, 1, 1), 0, stream>>>(

// Kernel launches are asynchronous, so we wait for the kernels to finish executing.

// Copy the results back to host memory
let mut out_host = [0.0f32; 20];
out_1.copy_to(&mut out_host[0..10])?;
out_2.copy_to(&mut out_host[10..20])?;

for x in out_host.iter() {
    assert_eq!(3.0, *x);