CUDA Kernels API Reference#

This module provides CUDA-accelerated functions for trajectory planning and dynamics computation with automatic CPU fallback.

Availability Check#

ManipulaPy.cuda_kernels.check_cuda_availability()[source]#

Enhanced CUDA availability check with detailed diagnostics.

Check if CUDA is available and provide helpful diagnostic information.

Returns:

  • bool – True if CUDA is available, False otherwise

Return type:

bool

ManipulaPy.cuda_kernels.check_cupy_availability()[source]#

Check CuPy availability for additional GPU operations.

Check if CuPy is available for GPU array operations.

Returns:

  • bool – True if CuPy is available, False otherwise

Return type:

bool

ManipulaPy.cuda_kernels.get_gpu_properties()[source]#

Get comprehensive GPU properties for optimization.

Retrieve current CUDA device properties for kernel optimization and resource allocation.

Returns:

  • dict or None – GPU device properties including multiprocessor count, memory limits, etc.

Return type:

Dict[str, Any] | None

Core CUDA Kernels#

Trajectory Kernels#

ManipulaPy.cuda_kernels.trajectory_kernel(*args, **kwargs)[source]#

Raise because the CUDA trajectory kernel is unavailable.

CUDA kernel for generating joint trajectory points with time-scaling.

Parameters:

  • thetastart (cuda.device_array) – Starting joint angles

  • thetaend (cuda.device_array) – Target joint angles

  • traj_pos (cuda.device_array) – Output trajectory positions

  • traj_vel (cuda.device_array) – Output trajectory velocities

  • traj_acc (cuda.device_array) – Output trajectory accelerations

  • Tf (float) – Total trajectory time

  • N (int) – Number of trajectory points

  • method (int) – Time scaling method (1=linear, 3=cubic, 5=quintic)

  • stream (int) – CUDA stream for kernel execution

Note

All trajectory kernels are safe at N <= 1 (returns tau = 0 so single-sample trajectories produce the start configuration instead of NaN/Inf from division by zero). Linear time-scaling (method=1) is supported in every trajectory kernel variant (trajectory_kernel, trajectory_kernel_vectorized, trajectory_kernel_memory_optimized, trajectory_kernel_warp_optimized, trajectory_kernel_cache_friendly, cartesian_trajectory_kernel, and batch_trajectory_kernel) as of v1.3.2; previously only cubic and quintic were honored.

Parameters:
Return type:

NoReturn

ManipulaPy.cuda_kernels.cartesian_trajectory_kernel(*args, **kwargs)[source]#

Raise because the CUDA Cartesian trajectory kernel is unavailable.

CUDA kernel for generating Cartesian trajectory with time-scaling.

Parameters:

  • pstart (cuda.device_array) – Starting point coordinates [x, y, z]

  • pend (cuda.device_array) – Ending point coordinates [x, y, z]

  • traj_pos (cuda.device_array) – Output trajectory positions

  • traj_vel (cuda.device_array) – Output trajectory velocities

  • traj_acc (cuda.device_array) – Output trajectory accelerations

  • Tf (float) – Total trajectory duration

  • N (int) – Number of trajectory points

  • method (int) – Time-scaling method (1=linear, 3=cubic, 5=quintic)

  • stream (int) – CUDA stream for kernel execution

Changed in version 1.3.2: Each thread now computes its own time scaling (no shared-memory reuse), eliminating a race where thread (0,0,0)’s t_idx scaling leaked into all other threads. Quintic acceleration also uses the correct s_ddot = 60 tau (1 - tau)(1 - 2 tau) / Tf**2.

Parameters:
Return type:

NoReturn

ManipulaPy.cuda_kernels.batch_trajectory_kernel(*args, **kwargs)[source]#

Raise because the CUDA batch trajectory kernel is unavailable.

Optimized CUDA kernel for batch trajectory generation with time-scaling.

Parameters:

  • thetastart_batch (cuda.device_array) – Starting joint positions for each batch

  • thetaend_batch (cuda.device_array) – Ending joint positions for each batch

  • traj_pos_batch (cuda.device_array) – Output trajectory positions

  • traj_vel_batch (cuda.device_array) – Output trajectory velocities

  • traj_acc_batch (cuda.device_array) – Output trajectory accelerations

  • Tf (float) – Total trajectory duration

  • N (int) – Number of trajectory timesteps

  • method (int) – Time-scaling method (1=linear, 3=cubic, 5=quintic)

  • batch_size (int) – Number of trajectory batches

  • stream (int) – CUDA stream for kernel execution

Changed in version 1.3.2: Per-thread time scaling replaces the previous shared-memory layout that broadcast a single thread’s t_idx to the whole block, and quintic acceleration uses the full 60 tau (1 - tau)(1 - 2 tau) / Tf**2 form.

Parameters:
Return type:

NoReturn

Dynamics Kernels#

ManipulaPy.cuda_kernels.inverse_dynamics_kernel(*args, **kwargs)[source]#

Raise because the CUDA inverse dynamics kernel is unavailable.

Optimized CUDA kernel for computing inverse dynamics using 2D parallelization.

Parameters:

  • thetalist_trajectory (cuda.device_array) – Joint position trajectory

  • dthetalist_trajectory (cuda.device_array) – Joint velocity trajectory

  • ddthetalist_trajectory (cuda.device_array) – Joint acceleration trajectory

  • gravity_vector (cuda.device_array) – Gravity vector

  • Ftip (cuda.device_array) – End-effector wrench

  • Glist (cuda.device_array) – Mass matrix diagonal elements

  • Slist (cuda.device_array) – Velocity quadratic force coefficients

  • M (cuda.device_array) – Full mass matrix

  • torques_trajectory (cuda.device_array) – Output joint torque trajectory

  • torque_limits (cuda.device_array) – Joint torque limits

  • stream (int) – CUDA stream for kernel execution

Parameters:
Return type:

NoReturn

ManipulaPy.cuda_kernels.forward_dynamics_kernel(*args, **kwargs)[source]#

Raise because the CUDA forward dynamics kernel is unavailable.

Compute forward dynamics for a robotic system using a CUDA kernel.

Parameters:

  • thetalist (cuda.device_array) – Initial joint positions

  • dthetalist (cuda.device_array) – Initial joint velocities

  • taumat (cuda.device_array) – Applied joint torques trajectory

  • g (cuda.device_array) – Gravity vector

  • Ftipmat (cuda.device_array) – End-effector wrenches

  • dt (float) – Total time step

  • intRes (int) – Integration resolution/substeps

  • Glist (cuda.device_array) – Mass matrix diagonal elements

  • Slist (cuda.device_array) – Velocity quadratic force coefficients

  • M (cuda.device_array) – Full mass matrix

  • thetamat (cuda.device_array) – Output joint position trajectory

  • dthetamat (cuda.device_array) – Output joint velocity trajectory

  • ddthetamat (cuda.device_array) – Output joint acceleration trajectory

  • joint_limits (cuda.device_array) – Joint position limits

  • stream (int) – CUDA stream for kernel execution

Changed in version 1.3.2: Each thread now integrates from the initial state up to its own t_idx instead of reading thetamat[t_idx-1], removing the temporal data race where threads at higher t_idx could read rows that lower-t_idx threads had not yet written.

Parameters:
Return type:

NoReturn

Potential Field Kernels#

ManipulaPy.cuda_kernels.fused_potential_gradient_kernel(*args, **kwargs)[source]#

Raise because the CUDA potential field kernel is unavailable.

Changed in version 1.3.2: Repulsive-gradient sign corrected. Previous versions produced an attracting repulsive field due to a sign error in the gradient factor (grad_factor is now -influence_term * dist_inv**3). Existing code relying on the v1.3.1 behavior must be reviewed – the sign flip is a correctness fix, not an API change.

CUDA kernel for computing potential and gradient for path planning.

Parameters:

  • positions (cuda.device_array) – Input positions to evaluate

  • goal (cuda.device_array) – Target goal point coordinates

  • obstacles (cuda.device_array) – Array of obstacle point coordinates

  • potential (cuda.device_array) – Output array for computed potential values

  • gradient (cuda.device_array) – Output array for computed gradient vectors

  • influence_distance (float) – Distance threshold for obstacle influence

  • stream (int) – CUDA stream for kernel execution

Parameters:
Return type:

NoReturn

ManipulaPy.cuda_kernels.attractive_potential_kernel(*args, **kwargs)[source]#

Legacy function - use fused_potential_gradient_kernel instead.

Legacy CUDA kernel for attractive potential field computation.

Parameters:

  • positions (cuda.device_array) – Query positions (N, 3)

  • goal (cuda.device_array) – Goal position [x, y, z]

  • potential (cuda.device_array) – Output potential values (N,)

Parameters:
Return type:

NoReturn

ManipulaPy.cuda_kernels.repulsive_potential_kernel(*args, **kwargs)[source]#

Legacy function - use fused_potential_gradient_kernel instead.

Legacy CUDA kernel for repulsive potential field computation.

Parameters:

  • positions (cuda.device_array) – Query positions (N, 3)

  • obstacles (cuda.device_array) – Obstacle positions (M, 3)

  • potential (cuda.device_array) – Output potential values (N,)

  • influence_distance (float) – Maximum influence distance

Parameters:
Return type:

NoReturn

ManipulaPy.cuda_kernels.gradient_kernel(*args, **kwargs)[source]#

Legacy function - use fused_potential_gradient_kernel instead.

Legacy CUDA kernel for numerical gradient computation.

Parameters:

  • potential (cuda.device_array) – Potential field values (N,)

  • gradient (cuda.device_array) – Output gradient (N-1,)

Parameters:
Return type:

NoReturn

High-Level Wrappers#

ManipulaPy.cuda_kernels.optimized_trajectory_generation(thetastart, thetaend, Tf, N, method, use_pinned=True, kernel_type='auto')[source]#

Main entry point for optimized trajectory generation.

This function automatically selects the best kernel and configuration for maximum performance and 40x+ speedups.

Parameters:
  • thetastart (Any) – Start and end joint angles

  • thetaend (Any) – Start and end joint angles

  • Tf (float) – Final time

  • N (int) – Number of trajectory points

  • method (int) – Time scaling method (3=cubic, 5=quintic)

  • use_pinned (bool) – Use pinned memory for faster transfers

  • kernel_type (str) – Kernel selection (β€œauto”, β€œstandard”, β€œvectorized”, etc.)

Return type:

Tuple[ndarray, ndarray, ndarray]

Generates an optimized trajectory using CUDA acceleration with automatic memory management.

Parameters:

  • thetastart (np.ndarray) – Initial joint configuration

  • thetaend (np.ndarray) – Final joint configuration

  • Tf (float) – Total trajectory duration

  • N (int) – Number of trajectory timesteps

  • method (int) – Trajectory generation method (1=linear, 3=cubic, 5=quintic)

  • use_pinned (bool) – Use pinned memory for faster GPU transfers

Returns:

  • tuple – (trajectory positions, trajectory velocities, trajectory accelerations)

ManipulaPy.cuda_kernels.optimized_potential_field(positions, goal, obstacles, influence_distance, use_pinned=True)[source]#

Optimized potential field computation with CUDA acceleration.

Parameters:
  • positions (ndarray) – (N, 3) ndarray of query point positions.

  • goal (ndarray) – (3,) ndarray, attractive goal position.

  • obstacles (ndarray) – (num_obstacles, 3) ndarray of obstacle positions.

  • influence_distance (float) – Repulsive influence radius; obstacles farther than this contribute nothing.

  • use_pinned (bool) – If True, use pinned host memory for host-to-device transfers.

Returns:

(potential, gradient) where potential is an (N,) float32 array of total potential values and gradient is an (N, 3) float32 array of potential gradients.

Return type:

Tuple[np.ndarray, np.ndarray]

Raises:

RuntimeError – If CUDA is not available.

Compute potential field and gradient for a set of positions using a CUDA-accelerated kernel.

Parameters:

  • positions (np.ndarray) – Input positions to compute potential field for

  • goal (np.ndarray) – Target goal position

  • obstacles (np.ndarray) – Array of obstacle positions

  • influence_distance (float) – Distance within which obstacles influence the potential field

  • use_pinned (bool) – Use pinned memory for faster GPU transfers

Returns:

  • tuple – (potential values, gradient vectors) for each input position

ManipulaPy.cuda_kernels.optimized_batch_trajectory_generation(thetastart_batch, thetaend_batch, Tf, N, method, use_pinned=True)[source]#

Optimized batch trajectory generation for multiple trajectories.

Parameters:
  • thetastart_batch (ndarray) – (batch_size, num_joints) ndarray of starting joint angles, radians.

  • thetaend_batch (ndarray) – (batch_size, num_joints) ndarray of ending joint angles, radians.

  • Tf (float) – Total trajectory duration, seconds.

  • N (int) – Number of trajectory time steps.

  • method (int) – Time-scaling order: 3 cubic, 5 quintic, else linear.

  • use_pinned (bool) – If True, use pinned host memory for host-to-device transfers.

Returns:

(traj_pos_batch, traj_vel_batch, traj_acc_batch), each a (batch_size, N, num_joints) float32 ndarray of joint positions (radians), velocities (radians/s), and accelerations (radians/s^2).

Return type:

Tuple[np.ndarray, np.ndarray, np.ndarray]

Raises:

RuntimeError – If CUDA is not available.

Efficiently generate batch trajectories using CUDA acceleration.

Parameters:

  • thetastart_batch (np.ndarray) – Batch of initial joint configurations

  • thetaend_batch (np.ndarray) – Batch of final joint configurations

  • Tf (float) – Total trajectory duration

  • N (int) – Number of trajectory timesteps

  • method (int) – Trajectory generation method identifier (1=linear, 3=cubic, 5=quintic)

  • use_pinned (bool) – Use pinned memory for faster GPU transfers

Returns:

  • tuple – Batch of trajectory positions, velocities, and accelerations

CPU Fallback Functions#

ManipulaPy.cuda_kernels.trajectory_cpu_fallback(thetastart, thetaend, Tf, N, method)[source]#

Optimized CPU fallback using NumPy vectorization.

Parameters:
  • thetastart (ndarray) – (num_joints,) ndarray of starting joint angles, radians.

  • thetaend (ndarray) – (num_joints,) ndarray of ending joint angles, radians.

  • Tf (float) – Total trajectory duration, seconds. Values <= 0 collapse to the start configuration with zero velocity and acceleration.

  • N (int) – Number of trajectory time steps. Values <= 1 collapse to the start configuration.

  • method (int) – Time-scaling polynomial order: 3 for cubic, 5 for quintic, any other value (e.g. 1) for linear.

Returns:

(traj_pos, traj_vel, traj_acc), each an (N, num_joints) float32 ndarray of joint positions (radians), velocities (radians/s), and accelerations (radians/s^2).

Return type:

Tuple[np.ndarray, np.ndarray, np.ndarray]

Compute trajectory positions, velocities, and accelerations on the CPU when CUDA is unavailable.

Parameters:

  • thetastart (np.ndarray) – Initial joint configurations

  • thetaend (np.ndarray) – Target joint configurations

  • Tf (float) – Total trajectory duration

  • N (int) – Number of trajectory points to generate

  • method (int) – Time scaling method (1=linear, 3=cubic, 5=quintic)

Returns:

  • tuple – (positions, velocities, accelerations) arrays

Memory Management#

ManipulaPy.cuda_kernels.get_cuda_array(*args, **kwargs)[source]#

Raise because the CUDA memory pool is unavailable.

Get a CUDA array from the memory pool.

Parameters:

  • shape (tuple) – Array dimensions

  • dtype (np.dtype) – Data type

Returns:

  • cuda.device_array – GPU array from memory pool

Parameters:
Return type:

NoReturn

ManipulaPy.cuda_kernels.return_cuda_array(*args, **kwargs)[source]#

Raise because the CUDA memory pool is unavailable.

Return a CUDA array to the memory pool.

Parameters:

  • array (cuda.device_array) – GPU array to return

Parameters:
Return type:

NoReturn

ManipulaPy.cuda_kernels._h2d_pinned(arr)[source]#

Host-to-device transfer with optional pinned-memory acceleration.

Pinned memory delivers ~3x peak transfer bandwidth on large arrays, but cuda.pinned_array is currently incompatible with several modern numba+driver combinations (see _PINNED_MEMORY_OPT_IN above). Plain cuda.to_device is correct on every supported configuration; pinned transfers are a pure performance optimisation that must be opted in.

Parameters:

arr (ndarray) – Host ndarray to copy to the device. Forced to C-contiguous layout if it is not already.

Returns:

A numba CUDA device array holding a copy of arr.

Raises:

RuntimeError – If CUDA is not available.

Return type:

Any

Helper function for pinned memory H2D transfers.

Parameters:

  • arr (np.ndarray) – Array to transfer to device

Returns:

  • cuda.device_array – Device array with data transferred

Grid Configuration#

ManipulaPy.cuda_kernels.make_1d_grid(size, threads=256)[source]#

Create optimal 1D grid for maximum GPU utilization.

Parameters:
  • size (int) – Total number of elements to cover with one thread each.

  • threads (int) – Initial thread-block size; overridden internally based on size for better occupancy.

Returns:

(blocks, threads) launch configuration, each a 1-tuple suitable for kernel[blocks, threads].

Return type:

Tuple[Tuple[int, …], Tuple[int, …]]

Create a 1D grid configuration for CUDA kernel launch with optimal thread and block sizing.

Parameters:

  • size (int) – Total number of elements or work items to process

  • threads (int) – Desired number of threads per block

Returns:

  • tuple – ((blocks,), (threads,)) for kernel launch configuration

ManipulaPy.cuda_kernels.make_2d_grid(N, num_joints, block_size=(128, 8))[source]#

Create 2D grid configuration for CUDA kernel launch (backward compatibility).

This is the original function maintained for compatibility. For optimal performance, use make_2d_grid_optimized().

Parameters:
  • N (int) – Number of trajectory time steps (X dimension of the grid).

  • num_joints (int) – Number of joints (Y dimension of the grid).

  • block_size (Tuple[int, int]) – Initial (threads_x, threads_y) block shape; shrunk for tiny problems and adjusted to reach a minimum block count.

Returns:

(grid, block) 2D launch configuration. Returns ((1, 1), (1, 1)) when CUDA is unavailable.

Return type:

Tuple[Tuple[int, int], Tuple[int, int]]

Compute optimal 2D grid configuration for CUDA kernel launch.

Parameters:

  • N (int) – First dimension of problem space

  • num_joints (int) – Second dimension of problem space

  • block_size (tuple) – Initial suggested block dimensions

Returns:

  • tuple – ((blocks_x, blocks_y), (threads_x, threads_y))

Performance Tools#

ManipulaPy.cuda_kernels.benchmark_kernel_performance(*args, **kwargs)[source]#

Report that CUDA benchmarking is unavailable.

Benchmark the performance of a specific CUDA kernel by executing it multiple times.

Parameters:

  • kernel_name (str) – Name of the kernel to benchmark

  • *args – Arguments to pass to the kernel function

  • num_runs (int) – Number of times to run the kernel

Returns:

  • dict or None – Performance metrics including average, std, min/max times

Parameters:
Return type:

Dict[str, Any] | None

ManipulaPy.cuda_kernels.profile_start()[source]#

No-op CUDA profiler start for CPU-only environments.

Start CUDA profiling.

Return type:

None

ManipulaPy.cuda_kernels.profile_stop()[source]#

Return empty CUDA profiler stats in CPU-only environments.

Stop CUDA profiling.

Return type:

Dict[str, Any]

ManipulaPy.cuda_kernels._best_2d_config(*args, **kwargs)[source]#

Return a minimal launch shape when CUDA is unavailable.

Auto-tune 2D CUDA kernel launch configuration for optimal performance.

Parameters:

  • N (int) – Number of time steps or trajectory points

  • J (int) – Number of joints or degrees of freedom

Returns:

  • tuple – ((grid_x, grid_y), (block_x, block_y))

Parameters:
Return type:

Tuple[Tuple[int, int], Tuple[int, int]]

Module Constants#

ManipulaPy.cuda_kernels.CUDA_AVAILABLE = False#

bool(x) -> bool

Returns True when the argument x is true, False otherwise. The builtins True and False are the only two instances of the class bool. The class bool is a subclass of the class int, and cannot be subclassed.

bool – True if CUDA is available, False otherwise

ManipulaPy.cuda_kernels.CUPY_AVAILABLE = True#

bool(x) -> bool

Returns True when the argument x is true, False otherwise. The builtins True and False are the only two instances of the class bool. The class bool is a subclass of the class int, and cannot be subclassed.

bool – True if CuPy is available, False otherwise

ManipulaPy.cuda_kernels.FAST_MATH = True#

bool(x) -> bool

Returns True when the argument x is true, False otherwise. The builtins True and False are the only two instances of the class bool. The class bool is a subclass of the class int, and cannot be subclassed.

bool – Whether fast math optimizations are enabled

ManipulaPy.cuda_kernels.float_t = <class 'numpy.float32'>#

Single-precision floating-point number type, compatible with C float.

Character code:

'f'

Canonical name:

numpy.single

Alias on this platform (Linux x86_64):

numpy.float32: 32-bit-precision floating-point number type: sign bit, 8 bits exponent, 23 bits mantissa.

type – Float precision type (float32 or float16)

Environment Variables#

MANIPULAPY_FASTMATH

Set to β€œ1” to enable fast math optimizations (~2x speedup with relaxed IEEE 754 compliance)

MANIPULAPY_USE_FP16

Set to β€œ1” to use 16-bit floating point precision for memory-bound kernels