CUDA Kernels API Reference

This module provides CUDA-accelerated functions for trajectory planning and dynamics computation with automatic CPU fallback.

Availability Check

ManipulaPy.cuda_kernels.check_cuda_availability()[source]

Enhanced CUDA availability check with detailed diagnostics.

Check if CUDA is available and provide helpful diagnostic information.

Returns:

bool – True if CUDA is available, False otherwise

ManipulaPy.cuda_kernels.check_cupy_availability()[source]

Check CuPy availability for additional GPU operations.

Check if CuPy is available for GPU array operations.

Returns:

bool – True if CuPy is available, False otherwise

ManipulaPy.cuda_kernels.get_gpu_properties()[source]

Get comprehensive GPU properties for optimization.

Retrieve current CUDA device properties for kernel optimization and resource allocation.

Returns:

dict or None – GPU device properties including multiprocessor count, memory limits, etc.

Core CUDA Kernels

Trajectory Kernels

ManipulaPy.cuda_kernels.trajectory_kernel(*args, **kwargs)[source]

CUDA kernel for generating joint trajectory points with time-scaling.

Parameters:

thetastart (cuda.device_array) – Starting joint angles

thetaend (cuda.device_array) – Target joint angles

traj_pos (cuda.device_array) – Output trajectory positions

traj_vel (cuda.device_array) – Output trajectory velocities

traj_acc (cuda.device_array) – Output trajectory accelerations

Tf (float) – Total trajectory time

N (int) – Number of trajectory points

method (int) – Time scaling method (3=cubic, 5=quintic)

stream (int) – CUDA stream for kernel execution

ManipulaPy.cuda_kernels.cartesian_trajectory_kernel(*args, **kwargs)[source]

CUDA kernel for generating Cartesian trajectory with time-scaling.

Parameters:

pstart (cuda.device_array) – Starting point coordinates [x, y, z]

pend (cuda.device_array) – Ending point coordinates [x, y, z]

traj_pos (cuda.device_array) – Output trajectory positions

traj_vel (cuda.device_array) – Output trajectory velocities

traj_acc (cuda.device_array) – Output trajectory accelerations

Tf (float) – Total trajectory duration

N (int) – Number of trajectory points

method (int) – Time-scaling method (3=cubic, 5=quintic)

stream (int) – CUDA stream for kernel execution

ManipulaPy.cuda_kernels.batch_trajectory_kernel(*args, **kwargs)[source]

Optimized CUDA kernel for batch trajectory generation with time-scaling.

Parameters:

thetastart_batch (cuda.device_array) – Starting joint positions for each batch

thetaend_batch (cuda.device_array) – Ending joint positions for each batch

traj_pos_batch (cuda.device_array) – Output trajectory positions

traj_vel_batch (cuda.device_array) – Output trajectory velocities

traj_acc_batch (cuda.device_array) – Output trajectory accelerations

Tf (float) – Total trajectory duration

N (int) – Number of trajectory timesteps

method (int) – Time-scaling method (3=cubic, 5=quintic)

batch_size (int) – Number of trajectory batches

stream (int) – CUDA stream for kernel execution

Dynamics Kernels

ManipulaPy.cuda_kernels.inverse_dynamics_kernel(*args, **kwargs)[source]

Optimized CUDA kernel for computing inverse dynamics using 2D parallelization.

Parameters:

thetalist_trajectory (cuda.device_array) – Joint position trajectory

dthetalist_trajectory (cuda.device_array) – Joint velocity trajectory

ddthetalist_trajectory (cuda.device_array) – Joint acceleration trajectory

gravity_vector (cuda.device_array) – Gravity vector

Ftip (cuda.device_array) – End-effector wrench

Glist (cuda.device_array) – Mass matrix diagonal elements

Slist (cuda.device_array) – Velocity quadratic force coefficients

M (cuda.device_array) – Full mass matrix

torques_trajectory (cuda.device_array) – Output joint torque trajectory

torque_limits (cuda.device_array) – Joint torque limits

stream (int) – CUDA stream for kernel execution

ManipulaPy.cuda_kernels.forward_dynamics_kernel(*args, **kwargs)[source]

Compute forward dynamics for a robotic system using a CUDA kernel.

Parameters:

thetalist (cuda.device_array) – Initial joint positions

dthetalist (cuda.device_array) – Initial joint velocities

taumat (cuda.device_array) – Applied joint torques trajectory

g (cuda.device_array) – Gravity vector

Ftipmat (cuda.device_array) – End-effector wrenches

dt (float) – Total time step

intRes (int) – Integration resolution/substeps

Glist (cuda.device_array) – Mass matrix diagonal elements

Slist (cuda.device_array) – Velocity quadratic force coefficients

M (cuda.device_array) – Full mass matrix

thetamat (cuda.device_array) – Output joint position trajectory

dthetamat (cuda.device_array) – Output joint velocity trajectory

ddthetamat (cuda.device_array) – Output joint acceleration trajectory

joint_limits (cuda.device_array) – Joint position limits

stream (int) – CUDA stream for kernel execution

Potential Field Kernels

ManipulaPy.cuda_kernels.fused_potential_gradient_kernel(*args, **kwargs)[source]

CUDA kernel for computing potential and gradient for path planning.

Parameters:

positions (cuda.device_array) – Input positions to evaluate

goal (cuda.device_array) – Target goal point coordinates

obstacles (cuda.device_array) – Array of obstacle point coordinates

potential (cuda.device_array) – Output array for computed potential values

gradient (cuda.device_array) – Output array for computed gradient vectors

influence_distance (float) – Distance threshold for obstacle influence

stream (int) – CUDA stream for kernel execution

ManipulaPy.cuda_kernels.attractive_potential_kernel(*args, **kwargs)[source]

Legacy function - use fused_potential_gradient_kernel instead.

Legacy CUDA kernel for attractive potential field computation.

Parameters:

positions (cuda.device_array) – Query positions (N, 3)

goal (cuda.device_array) – Goal position [x, y, z]

potential (cuda.device_array) – Output potential values (N,)

ManipulaPy.cuda_kernels.repulsive_potential_kernel(*args, **kwargs)[source]

Legacy function - use fused_potential_gradient_kernel instead.

Legacy CUDA kernel for repulsive potential field computation.

Parameters:

positions (cuda.device_array) – Query positions (N, 3)

obstacles (cuda.device_array) – Obstacle positions (M, 3)

potential (cuda.device_array) – Output potential values (N,)

influence_distance (float) – Maximum influence distance

ManipulaPy.cuda_kernels.gradient_kernel(*args, **kwargs)[source]

Legacy function - use fused_potential_gradient_kernel instead.

Legacy CUDA kernel for numerical gradient computation.

Parameters:

potential (cuda.device_array) – Potential field values (N,)

gradient (cuda.device_array) – Output gradient (N-1,)

High-Level Wrappers

ManipulaPy.cuda_kernels.optimized_trajectory_generation(thetastart, thetaend, Tf, N, method, use_pinned=True, kernel_type='auto')[source]

Main entry point for optimized trajectory generation.

This function automatically selects the best kernel and configuration for maximum performance and 40x+ speedups.

Parameters:

thetastart – Start and end joint angles
thetaend – Start and end joint angles
Tf – Final time
N – Number of trajectory points
method – Time scaling method (3=cubic, 5=quintic)
use_pinned – Use pinned memory for faster transfers
kernel_type – Kernel selection (“auto”, “standard”, “vectorized”, etc.)

Generates an optimized trajectory using CUDA acceleration with automatic memory management.

Parameters:

thetastart (np.ndarray) – Initial joint configuration

thetaend (np.ndarray) – Final joint configuration

Tf (float) – Total trajectory duration

N (int) – Number of trajectory timesteps

method (int) – Trajectory generation method

use_pinned (bool) – Use pinned memory for faster GPU transfers

Returns:

tuple – (trajectory positions, trajectory velocities, trajectory accelerations)

ManipulaPy.cuda_kernels.optimized_potential_field(positions, goal, obstacles, influence_distance, use_pinned=True)[source]

Optimized potential field computation with CUDA acceleration.

Compute potential field and gradient for a set of positions using a CUDA-accelerated kernel.

Parameters:

positions (np.ndarray) – Input positions to compute potential field for

goal (np.ndarray) – Target goal position

obstacles (np.ndarray) – Array of obstacle positions

influence_distance (float) – Distance within which obstacles influence the potential field

use_pinned (bool) – Use pinned memory for faster GPU transfers

Returns:

tuple – (potential values, gradient vectors) for each input position

ManipulaPy.cuda_kernels.optimized_batch_trajectory_generation(thetastart_batch, thetaend_batch, Tf, N, method, use_pinned=True)[source]

Optimized batch trajectory generation for multiple trajectories.

Efficiently generate batch trajectories using CUDA acceleration.

Parameters:

thetastart_batch (np.ndarray) – Batch of initial joint configurations

thetaend_batch (np.ndarray) – Batch of final joint configurations

Tf (float) – Total trajectory duration

N (int) – Number of trajectory timesteps

method (int) – Trajectory generation method identifier

use_pinned (bool) – Use pinned memory for faster GPU transfers

Returns:

tuple – Batch of trajectory positions, velocities, and accelerations

CPU Fallback Functions

ManipulaPy.cuda_kernels.trajectory_cpu_fallback(thetastart, thetaend, Tf, N, method)[source]

Optimized CPU fallback using NumPy vectorization.

Compute trajectory positions, velocities, and accelerations on the CPU when CUDA is unavailable.

Parameters:

thetastart (np.ndarray) – Initial joint configurations

thetaend (np.ndarray) – Target joint configurations

Tf (float) – Total trajectory duration

N (int) – Number of trajectory points to generate

method (int) – Time scaling method (3=cubic, 5=quintic)

Returns:

tuple – (positions, velocities, accelerations) arrays

Memory Management

ManipulaPy.cuda_kernels.get_cuda_array(*args, **kwargs)[source]

Get a CUDA array from the memory pool.

Parameters:

shape (tuple) – Array dimensions

dtype (np.dtype) – Data type

Returns:

cuda.device_array – GPU array from memory pool

ManipulaPy.cuda_kernels.return_cuda_array(*args, **kwargs)[source]

Return a CUDA array to the memory pool.

Parameters:

array (cuda.device_array) – GPU array to return

ManipulaPy.cuda_kernels._h2d_pinned(arr)[source]

Optimized pinned memory transfer for maximum bandwidth.

Helper function for pinned memory H2D transfers.

Parameters:

arr (np.ndarray) – Array to transfer to device

Returns:

cuda.device_array – Device array with data transferred

Grid Configuration

ManipulaPy.cuda_kernels.make_1d_grid(size, threads=256)[source]

Create optimal 1D grid for maximum GPU utilization.

Create a 1D grid configuration for CUDA kernel launch with optimal thread and block sizing.

Parameters:

size (int) – Total number of elements or work items to process

threads (int) – Desired number of threads per block

Returns:

tuple – ((blocks,), (threads,)) for kernel launch configuration

ManipulaPy.cuda_kernels.make_2d_grid(N, num_joints, block_size=(128, 8))[source]

Create 2D grid configuration for CUDA kernel launch (backward compatibility).

This is the original function maintained for compatibility. For optimal performance, use make_2d_grid_optimized().

Compute optimal 2D grid configuration for CUDA kernel launch.

Parameters:

N (int) – First dimension of problem space

num_joints (int) – Second dimension of problem space

block_size (tuple) – Initial suggested block dimensions

Returns:

tuple – ((blocks_x, blocks_y), (threads_x, threads_y))

Parameters:

N (int) –
num_joints (int) –
block_size (Tuple[int, int]) –

Performance Tools

ManipulaPy.cuda_kernels.benchmark_kernel_performance(*args, **kwargs)[source]

Benchmark the performance of a specific CUDA kernel by executing it multiple times.

Parameters:

kernel_name (str) – Name of the kernel to benchmark

*args – Arguments to pass to the kernel function

num_runs (int) – Number of times to run the kernel

Returns:

dict or None – Performance metrics including average, std, min/max times

ManipulaPy.cuda_kernels.profile_start()[source]: Start CUDA profiling.

ManipulaPy.cuda_kernels.profile_stop()[source]: Stop CUDA profiling.

ManipulaPy.cuda_kernels._best_2d_config(*args, **kwargs)[source]

Auto-tune 2D CUDA kernel launch configuration for optimal performance.

Parameters:

N (int) – Number of time steps or trajectory points

J (int) – Number of joints or degrees of freedom

Returns:

tuple – ((grid_x, grid_y), (block_x, block_y))

Module Constants

ManipulaPy.cuda_kernels.CUDA_AVAILABLE = False

bool(x) -> bool

Returns True when the argument x is true, False otherwise. The builtins True and False are the only two instances of the class bool. The class bool is a subclass of the class int, and cannot be subclassed.

bool – True if CUDA is available, False otherwise

ManipulaPy.cuda_kernels.CUPY_AVAILABLE = True

bool(x) -> bool

Returns True when the argument x is true, False otherwise. The builtins True and False are the only two instances of the class bool. The class bool is a subclass of the class int, and cannot be subclassed.

bool – True if CuPy is available, False otherwise

ManipulaPy.cuda_kernels.FAST_MATH = True

bool(x) -> bool

Returns True when the argument x is true, False otherwise. The builtins True and False are the only two instances of the class bool. The class bool is a subclass of the class int, and cannot be subclassed.

bool – Whether fast math optimizations are enabled

ManipulaPy.cuda_kernels.float_t = <class 'numpy.float32'>

Single-precision floating-point number type, compatible with C float.

Character code:: 'f'
Canonical name:: numpy.single
Alias on this platform (Linux x86_64):: numpy.float32: 32-bit-precision floating-point number type: sign bit, 8 bits exponent, 23 bits mantissa.

type – Float precision type (float32 or float16)

Environment Variables

MANIPULAPY_FASTMATH: Set to “1” to enable fast math optimizations (~2x speedup with relaxed IEEE 754 compliance)
MANIPULAPY_USE_FP16: Set to “1” to use 16-bit floating point precision for memory-bound kernels