CUDA Kernels API Reference

This module provides CUDA-accelerated functions for trajectory planning and dynamics computation with automatic CPU fallback.

Availability Check

ManipulaPy.cuda_kernels.check_cuda_availability()

Check whether CUDA is available and report detailed diagnostic information.

Returns:

bool – True if CUDA is available, False otherwise

Return type: bool

ManipulaPy.cuda_kernels.check_cupy_availability()

Check whether CuPy is available for additional GPU array operations.

Returns:

bool – True if CuPy is available, False otherwise

Return type: bool

ManipulaPy.cuda_kernels.get_gpu_properties()

Retrieve the properties of the current CUDA device for kernel optimization and resource allocation.

Returns:

dict or None – GPU device properties including multiprocessor count, memory limits, etc.

Return type: Dict[str, Any] | None
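A minimal sketch of how the three checks might be combined at startup. `describe_backend` is a hypothetical helper, not part of ManipulaPy; it assumes only the documented zero-argument signatures and that `get_gpu_properties()` may return None. The keys of the properties dict are not enumerated in this reference, so the sketch only counts them.

```python
from types import SimpleNamespace


def describe_backend(ck):
    """Summarise which backend a cuda_kernels-like module would use.

    `ck` is any object exposing check_cuda_availability(),
    check_cupy_availability() and get_gpu_properties().
    """
    if not ck.check_cuda_availability():
        return "cpu-fallback"
    props = ck.get_gpu_properties()  # dict or None, per the reference above
    if props is None:
        return "cuda (no device properties)"
    cupy = "with" if ck.check_cupy_availability() else "without"
    return f"cuda {cupy} cupy, {len(props)} properties reported"


# Stand-in for a CPU-only machine so the sketch runs without a GPU.
cpu_only = SimpleNamespace(
    check_cuda_availability=lambda: False,
    check_cupy_availability=lambda: False,
    get_gpu_properties=lambda: None,
)
print(describe_backend(cpu_only))
```

With the library installed, the module itself can be passed in place of the stand-in, for example `describe_backend(ManipulaPy.cuda_kernels)`.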

Core CUDA Kernels

Trajectory Kernels

ManipulaPy.cuda_kernels.trajectory_kernel(*args, **kwargs)

CUDA kernel for generating joint trajectory points with time-scaling. In CPU-only environments this name is a stub that raises when called, because the CUDA trajectory kernel is unavailable.

Parameters:

thetastart (cuda.device_array) – Starting joint angles

thetaend (cuda.device_array) – Target joint angles

traj_pos (cuda.device_array) – Output trajectory positions

traj_vel (cuda.device_array) – Output trajectory velocities

traj_acc (cuda.device_array) – Output trajectory accelerations

Tf (float) – Total trajectory time

N (int) – Number of trajectory points

method (int) – Time scaling method (1=linear, 3=cubic, 5=quintic)

stream (int) – CUDA stream for kernel execution

Note

All trajectory kernels are safe at N <= 1: they use tau = 0, so single-sample trajectories produce the start configuration instead of NaN/Inf from division by zero. As of v1.3.2, linear time-scaling (method=1) is supported in every trajectory kernel variant (trajectory_kernel, trajectory_kernel_vectorized, trajectory_kernel_memory_optimized, trajectory_kernel_warp_optimized, trajectory_kernel_cache_friendly, cartesian_trajectory_kernel, and batch_trajectory_kernel); previously only cubic and quintic were honored.

Parameters (CPU-only stub signature):

args (Any)
kwargs (Any)

Return type (CPU-only stub):

NoReturn
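The time-scaling behaviour described in the note can be sketched in plain Python. This is an illustration, not the kernel source: the function names are hypothetical, the sampling rule tau = idx / (N - 1) is an assumption consistent with the division-by-zero remark, and Tf is assumed to be positive.

```python
def sample_tau(idx, N):
    """Normalised time in [0, 1] for sample idx; 0 when N <= 1."""
    if N <= 1:
        return 0.0  # guard: avoids dividing by N - 1 == 0
    return idx / (N - 1)


def time_scaling(tau, Tf, method):
    """Return (s, s_dot, s_ddot) for method 3 (cubic), 5 (quintic), else linear."""
    if method == 3:
        s = 3 * tau**2 - 2 * tau**3
        s_dot = 6 * tau * (1 - tau) / Tf
        s_ddot = 6 * (1 - 2 * tau) / Tf**2
    elif method == 5:
        s = 10 * tau**3 - 15 * tau**4 + 6 * tau**5
        s_dot = 30 * tau**2 * (1 - tau) ** 2 / Tf
        s_ddot = 60 * tau * (1 - tau) * (1 - 2 * tau) / Tf**2
    else:  # linear, e.g. method == 1
        s, s_dot, s_ddot = tau, 1.0 / Tf, 0.0
    return s, s_dot, s_ddot


def joint_trajectory(thetastart, thetaend, Tf, N, method):
    """Positions, velocities and accelerations as N rows of per-joint values."""
    delta = [e - b for b, e in zip(thetastart, thetaend)]
    pos, vel, acc = [], [], []
    for idx in range(N):
        s, s_dot, s_ddot = time_scaling(sample_tau(idx, N), Tf, method)
        pos.append([b + s * d for b, d in zip(thetastart, delta)])
        vel.append([s_dot * d for d in delta])
        acc.append([s_ddot * d for d in delta])
    return pos, vel, acc


# A single-sample trajectory collapses to the start configuration.
print(joint_trajectory([0.0, 1.0], [1.0, 3.0], Tf=2.0, N=1, method=5)[0])
```

In the GPU version each (time step, joint) pair would be handled by its own thread; the loop here only stands in for that index space.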

ManipulaPy.cuda_kernels.cartesian_trajectory_kernel(*args, **kwargs)

CUDA kernel for generating Cartesian trajectory with time-scaling. In CPU-only environments this name is a stub that raises when called, because the CUDA Cartesian trajectory kernel is unavailable.

Parameters:

pstart (cuda.device_array) – Starting point coordinates [x, y, z]

pend (cuda.device_array) – Ending point coordinates [x, y, z]

traj_pos (cuda.device_array) – Output trajectory positions

traj_vel (cuda.device_array) – Output trajectory velocities

traj_acc (cuda.device_array) – Output trajectory accelerations

Tf (float) – Total trajectory duration

N (int) – Number of trajectory points

method (int) – Time-scaling method (1=linear, 3=cubic, 5=quintic)

stream (int) – CUDA stream for kernel execution

Changed in version 1.3.2: Each thread now computes its own time scaling (no shared-memory reuse), eliminating a race where thread (0,0,0)’s t_idx scaling leaked into all other threads. Quintic acceleration also uses the correct s_ddot = 60 tau (1 - tau)(1 - 2 tau) / Tf**2.

Parameters (CPU-only stub signature):

args (Any)
kwargs (Any)

Return type (CPU-only stub):

NoReturn
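The quintic acceleration quoted in the 1.3.2 change note can be checked numerically. The sketch below assumes the standard quintic scaling s(tau) = 10 tau^3 - 15 tau^4 + 6 tau^5 with tau = t / Tf, and compares the closed form against a central finite difference; the helper names are illustrative, not library functions.

```python
def quintic_s(tau):
    """Quintic time scaling, assumed form s = 10 tau^3 - 15 tau^4 + 6 tau^5."""
    return 10 * tau**3 - 15 * tau**4 + 6 * tau**5


def quintic_s_ddot(tau, Tf):
    """Closed-form second time derivative quoted in the change note."""
    return 60 * tau * (1 - tau) * (1 - 2 * tau) / Tf**2


def numeric_s_ddot(tau, Tf, h=1e-4):
    """Central second difference in tau, rescaled by 1 / Tf**2 since tau = t / Tf."""
    second_diff = quintic_s(tau + h) - 2 * quintic_s(tau) + quintic_s(tau - h)
    return second_diff / (h * h) / Tf**2


for tau in (0.1, 0.25, 0.5, 0.9):
    closed = quintic_s_ddot(tau, Tf=2.0)
    approx = numeric_s_ddot(tau, Tf=2.0)
    print(f"tau={tau}: closed={closed:.6f} numeric={approx:.6f}")
```

The acceleration changes sign at tau = 0.5, which is where the factor (1 - 2 tau) vanishes.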

ManipulaPy.cuda_kernels.batch_trajectory_kernel(*args, **kwargs)

Optimized CUDA kernel for batch trajectory generation with time-scaling. In CPU-only environments this name is a stub that raises when called, because the CUDA batch trajectory kernel is unavailable.

Parameters:

thetastart_batch (cuda.device_array) – Starting joint positions for each batch

thetaend_batch (cuda.device_array) – Ending joint positions for each batch

traj_pos_batch (cuda.device_array) – Output trajectory positions

traj_vel_batch (cuda.device_array) – Output trajectory velocities

traj_acc_batch (cuda.device_array) – Output trajectory accelerations

Tf (float) – Total trajectory duration

N (int) – Number of trajectory timesteps

method (int) – Time-scaling method (1=linear, 3=cubic, 5=quintic)

batch_size (int) – Number of trajectory batches

stream (int) – CUDA stream for kernel execution

Changed in version 1.3.2: Per-thread time scaling replaces the previous shared-memory layout that broadcast a single thread’s t_idx to the whole block, and quintic acceleration uses the full 60 tau (1 - tau)(1 - 2 tau) / Tf**2 form.

Parameters (CPU-only stub signature):

args (Any)
kwargs (Any)

Return type (CPU-only stub):

NoReturn

Dynamics Kernels

ManipulaPy.cuda_kernels.inverse_dynamics_kernel(*args, **kwargs)

Optimized CUDA kernel for computing inverse dynamics using 2D parallelization. In CPU-only environments this name is a stub that raises when called, because the CUDA inverse dynamics kernel is unavailable.

Parameters:

thetalist_trajectory (cuda.device_array) – Joint position trajectory

dthetalist_trajectory (cuda.device_array) – Joint velocity trajectory

ddthetalist_trajectory (cuda.device_array) – Joint acceleration trajectory

gravity_vector (cuda.device_array) – Gravity vector

Ftip (cuda.device_array) – End-effector wrench

Glist (cuda.device_array) – Mass matrix diagonal elements

Slist (cuda.device_array) – Velocity quadratic force coefficients

M (cuda.device_array) – Full mass matrix

torques_trajectory (cuda.device_array) – Output joint torque trajectory

torque_limits (cuda.device_array) – Joint torque limits

stream (int) – CUDA stream for kernel execution

Parameters (CPU-only stub signature):

args (Any)
kwargs (Any)

Return type (CPU-only stub):

NoReturn

ManipulaPy.cuda_kernels.forward_dynamics_kernel(*args, **kwargs)

Compute forward dynamics for a robotic system using a CUDA kernel. In CPU-only environments this name is a stub that raises when called, because the CUDA forward dynamics kernel is unavailable.

Parameters:

thetalist (cuda.device_array) – Initial joint positions

dthetalist (cuda.device_array) – Initial joint velocities

taumat (cuda.device_array) – Applied joint torques trajectory

g (cuda.device_array) – Gravity vector

Ftipmat (cuda.device_array) – End-effector wrenches

dt (float) – Total time step

intRes (int) – Integration resolution/substeps

Glist (cuda.device_array) – Mass matrix diagonal elements

Slist (cuda.device_array) – Velocity quadratic force coefficients

M (cuda.device_array) – Full mass matrix

thetamat (cuda.device_array) – Output joint position trajectory

dthetamat (cuda.device_array) – Output joint velocity trajectory

ddthetamat (cuda.device_array) – Output joint acceleration trajectory

joint_limits (cuda.device_array) – Joint position limits

stream (int) – CUDA stream for kernel execution

Changed in version 1.3.2: Each thread now integrates from the initial state up to its own t_idx instead of reading thetamat[t_idx-1], removing the temporal data race where threads at higher t_idx could read rows that lower-t_idx threads had not yet written.

Parameters (CPU-only stub signature):

args (Any)
kwargs (Any)

Return type (CPU-only stub):

NoReturn
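The race-free scheme from the 1.3.2 change note can be illustrated with a toy model. The sketch below is not the library's dynamics: it assumes a single joint behaving as a double integrator (acceleration = torque / inertia) with explicit Euler substeps, and `integrate_to` is a hypothetical name. The point is that each row is recomputed from the initial state, so rows can be filled in any order.

```python
def integrate_to(t_idx, theta0, dtheta0, torques, dt, int_res, inertia=1.0):
    """State at row t_idx, recomputed from the initial state.

    No other output row is read, so there is nothing to race on.
    """
    theta, dtheta, ddtheta = theta0, dtheta0, 0.0
    h = dt / int_res  # substep length
    for step in range(t_idx + 1):
        ddtheta = torques[step] / inertia
        for _ in range(int_res):  # explicit Euler substeps
            dtheta += ddtheta * h
            theta += dtheta * h
    return theta, dtheta, ddtheta


torques = [1.0, 0.0, -1.0, 0.5]
rows = [None] * len(torques)

# Fill the rows in reverse order to mimic threads finishing out of order.
for t_idx in reversed(range(len(torques))):
    rows[t_idx] = integrate_to(t_idx, 0.0, 0.0, torques, dt=0.1, int_res=4)

sequential = [
    integrate_to(i, 0.0, 0.0, torques, dt=0.1, int_res=4)
    for i in range(len(torques))
]
print(rows == sequential)
```

The cost of this design is redundant work: row t_idx repeats every step before it, so total work grows quadratically with the number of time steps in exchange for removing the dependency between threads.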

Potential Field Kernels

ManipulaPy.cuda_kernels.fused_potential_gradient_kernel(*args, **kwargs)

CUDA kernel for computing potential and gradient for path planning. In CPU-only environments this name is a stub that raises when called, because the CUDA potential field kernel is unavailable.

Changed in version 1.3.2: Repulsive-gradient sign corrected. Previous versions produced an attracting repulsive field due to a sign error in the gradient factor (grad_factor is now -influence_term * dist_inv**3). Existing code relying on the v1.3.1 behavior must be reviewed; the sign flip is a correctness fix, not an API change.

Parameters:

positions (cuda.device_array) – Input positions to evaluate

goal (cuda.device_array) – Target goal point coordinates

obstacles (cuda.device_array) – Array of obstacle point coordinates

potential (cuda.device_array) – Output array for computed potential values

gradient (cuda.device_array) – Output array for computed gradient vectors

influence_distance (float) – Distance threshold for obstacle influence

stream (int) – CUDA stream for kernel execution

Parameters (CPU-only stub signature):

args (Any)
kwargs (Any)

Return type (CPU-only stub):

NoReturn
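A plain-Python sketch of why the corrected sign matters. Only the expression grad_factor = -influence_term * dist_inv**3 comes from the change note; the quadratic attractive potential, the (1/d - 1/d0)^2 repulsive potential, and the gains `k_att` and `k_rep` are assumptions chosen because they are the textbook forms that yield that expression.

```python
import math


def potential_and_gradient(p, goal, obstacles, influence_distance,
                           k_att=1.0, k_rep=1.0):
    """Total potential and gradient at point p (k_att, k_rep are assumed gains)."""
    # Attractive part: U = 0.5 * k_att * |p - goal|^2, grad = k_att * (p - goal)
    diff = [pi - gi for pi, gi in zip(p, goal)]
    potential = 0.5 * k_att * sum(d * d for d in diff)
    gradient = [k_att * d for d in diff]

    for obs in obstacles:
        delta = [pi - oi for pi, oi in zip(p, obs)]
        dist = math.sqrt(sum(d * d for d in delta))
        if dist == 0.0 or dist >= influence_distance:
            continue  # outside the influence radius, or exactly on the obstacle
        dist_inv = 1.0 / dist
        term = dist_inv - 1.0 / influence_distance
        potential += 0.5 * k_rep * term * term
        influence_term = k_rep * term
        # Negative factor: the gradient points toward the obstacle, so
        # descending the potential moves away from it.
        grad_factor = -influence_term * dist_inv**3
        gradient = [g + grad_factor * d for g, d in zip(gradient, delta)]
    return potential, gradient


# Query point sits on the goal, 0.5 away from an obstacle at the origin.
pot, grad = potential_and_gradient(
    p=(0.5, 0.0, 0.0), goal=(0.5, 0.0, 0.0),
    obstacles=[(0.0, 0.0, 0.0)], influence_distance=1.0)
print(pot, grad)
```

Here the x-component of the gradient is negative, so a planner stepping along the negative gradient moves in +x, away from the obstacle. With the pre-1.3.2 sign the same step would move toward it.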

ManipulaPy.cuda_kernels.attractive_potential_kernel(*args, **kwargs)

Legacy CUDA kernel for attractive potential field computation. Use fused_potential_gradient_kernel instead.

Parameters:

positions (cuda.device_array) – Query positions (N, 3)

goal (cuda.device_array) – Goal position [x, y, z]

potential (cuda.device_array) – Output potential values (N,)

Parameters (CPU-only stub signature):

args (Any)
kwargs (Any)

Return type (CPU-only stub):

NoReturn

ManipulaPy.cuda_kernels.repulsive_potential_kernel(*args, **kwargs)

Legacy CUDA kernel for repulsive potential field computation. Use fused_potential_gradient_kernel instead.

Parameters:

positions (cuda.device_array) – Query positions (N, 3)

obstacles (cuda.device_array) – Obstacle positions (M, 3)

potential (cuda.device_array) – Output potential values (N,)

influence_distance (float) – Maximum influence distance

Parameters (CPU-only stub signature):

args (Any)
kwargs (Any)

Return type (CPU-only stub):

NoReturn

ManipulaPy.cuda_kernels.gradient_kernel(*args, **kwargs)

Legacy CUDA kernel for numerical gradient computation. Use fused_potential_gradient_kernel instead.

Parameters:

potential (cuda.device_array) – Potential field values (N,)

gradient (cuda.device_array) – Output gradient (N-1,)

Parameters (CPU-only stub signature):

args (Any)
kwargs (Any)

Return type (CPU-only stub):

NoReturn

High-Level Wrappers

ManipulaPy.cuda_kernels.optimized_trajectory_generation(thetastart, thetaend, Tf, N, method, use_pinned=True, kernel_type='auto')

Main entry point for optimized trajectory generation. Generates a trajectory using CUDA acceleration with automatic memory management. The kernel and launch configuration are selected automatically, targeting speedups of 40x or more.

Parameters:

thetastart (np.ndarray) – Initial joint configuration

thetaend (np.ndarray) – Final joint configuration

Tf (float) – Total trajectory duration

N (int) – Number of trajectory timesteps

method (int) – Time-scaling method (1=linear, 3=cubic, 5=quintic)

use_pinned (bool) – Use pinned memory for faster GPU transfers

kernel_type (str) – Kernel selection ("auto", "standard", "vectorized", etc.)

Returns:

tuple – (trajectory positions, trajectory velocities, trajectory accelerations)

Return type:

Tuple[ndarray, ndarray, ndarray]

ManipulaPy.cuda_kernels.optimized_potential_field(positions, goal, obstacles, influence_distance, use_pinned=True)

Compute the potential field and its gradient for a set of positions using a CUDA-accelerated kernel.

Parameters:

positions (ndarray) – (N, 3) ndarray of query point positions.
goal (ndarray) – (3,) ndarray, attractive goal position.
obstacles (ndarray) – (num_obstacles, 3) ndarray of obstacle positions.
influence_distance (float) – Repulsive influence radius; obstacles farther than this contribute nothing.
use_pinned (bool) – If True, use pinned host memory for host-to-device transfers.

Returns:

(potential, gradient) where potential is an (N,) float32 array of total potential values and gradient is an (N, 3) float32 array of potential gradients.

Return type:

Tuple[np.ndarray, np.ndarray]

Raises:

RuntimeError – If CUDA is not available.

ManipulaPy.cuda_kernels.optimized_batch_trajectory_generation(thetastart_batch, thetaend_batch, Tf, N, method, use_pinned=True)

Optimized batch trajectory generation: generates multiple trajectories in one call using CUDA acceleration.

Parameters:

thetastart_batch (ndarray) – (batch_size, num_joints) ndarray of starting joint angles, radians.
thetaend_batch (ndarray) – (batch_size, num_joints) ndarray of ending joint angles, radians.
Tf (float) – Total trajectory duration, seconds.
N (int) – Number of trajectory time steps.
method (int) – Time-scaling order: 3 for cubic, 5 for quintic, any other value (e.g. 1) for linear.
use_pinned (bool) – If True, use pinned host memory for host-to-device transfers.

Returns:

(traj_pos_batch, traj_vel_batch, traj_acc_batch), each a (batch_size, N, num_joints) float32 ndarray of joint positions (radians), velocities (radians/s), and accelerations (radians/s^2).

Return type:

Tuple[np.ndarray, np.ndarray, np.ndarray]

Raises:

RuntimeError – If CUDA is not available.

CPU Fallback Functions

ManipulaPy.cuda_kernels.trajectory_cpu_fallback(thetastart, thetaend, Tf, N, method)

Optimized CPU fallback using NumPy vectorization. Computes trajectory positions, velocities, and accelerations on the CPU when CUDA is unavailable.

Parameters:

thetastart (ndarray) – (num_joints,) ndarray of starting joint angles, radians.
thetaend (ndarray) – (num_joints,) ndarray of ending joint angles, radians.
Tf (float) – Total trajectory duration, seconds. Values <= 0 collapse to the start configuration with zero velocity and acceleration.
N (int) – Number of trajectory time steps. Values <= 1 collapse to the start configuration.
method (int) – Time-scaling polynomial order: 3 for cubic, 5 for quintic, any other value (e.g. 1) for linear.

Returns:

(traj_pos, traj_vel, traj_acc), each an (N, num_joints) float32 ndarray of joint positions (radians), velocities (radians/s), and accelerations (radians/s^2).

Return type:

Tuple[np.ndarray, np.ndarray, np.ndarray]

Memory Management

ManipulaPy.cuda_kernels.get_cuda_array(*args, **kwargs)

Get a CUDA array from the memory pool. In CPU-only environments this name is a stub that raises when called, because the CUDA memory pool is unavailable.

Parameters:

shape (tuple) – Array dimensions

dtype (np.dtype) – Data type

Returns:

cuda.device_array – GPU array from memory pool

Parameters (CPU-only stub signature):

args (Any)
kwargs (Any)

Return type (CPU-only stub):

NoReturn

ManipulaPy.cuda_kernels.return_cuda_array(*args, **kwargs)

Return a CUDA array to the memory pool. In CPU-only environments this name is a stub that raises when called, because the CUDA memory pool is unavailable.

Parameters:

array (cuda.device_array) – GPU array to return

Parameters (CPU-only stub signature):

args (Any)
kwargs (Any)

Return type (CPU-only stub):

NoReturn
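The get/return pairing can be illustrated with a toy pool. This is a sketch of the general pooling idea only: `ArrayPool` and `Buffer` are hypothetical, bytearrays stand in for device buffers, and the reference does not say how the real pool keys or recycles its arrays.

```python
from collections import defaultdict
from math import prod

# Bytes per element for the two precisions mentioned in this reference.
_ITEMSIZE = {"float32": 4, "float16": 2}


class Buffer:
    """Stand-in for a device array: a shape, a dtype name and raw storage."""

    def __init__(self, shape, dtype):
        self.shape = tuple(shape)
        self.dtype = dtype
        self.data = bytearray(prod(self.shape) * _ITEMSIZE[dtype])


class ArrayPool:
    """Toy pool that recycles buffers with an identical (shape, dtype)."""

    def __init__(self):
        self._free = defaultdict(list)

    def get(self, shape, dtype):
        bucket = self._free[(tuple(shape), dtype)]
        # Reuse a returned buffer when one matches, otherwise allocate.
        return bucket.pop() if bucket else Buffer(shape, dtype)

    def give_back(self, buf):
        self._free[(buf.shape, buf.dtype)].append(buf)


pool = ArrayPool()
first = pool.get((4, 3), "float32")
pool.give_back(first)
second = pool.get((4, 3), "float32")
print(second is first)
```

A recycled buffer still holds whatever was last written to it, so callers of any pool built this way should treat its contents as uninitialised.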

ManipulaPy.cuda_kernels._h2d_pinned(arr)

Host-to-device transfer with optional pinned-memory acceleration.

Pinned memory delivers ~3x peak transfer bandwidth on large arrays, but cuda.pinned_array is currently incompatible with several modern numba+driver combinations (see _PINNED_MEMORY_OPT_IN in the module source). Plain cuda.to_device is correct on every supported configuration; pinned transfers are a pure performance optimisation that must be explicitly opted into.

Parameters:

arr (ndarray) – Host ndarray to copy to the device. Forced to C-contiguous layout if it is not already.

Returns:

cuda.device_array – A numba CUDA device array holding a copy of arr.

Raises:

RuntimeError – If CUDA is not available.

Return type:

Any

Grid Configuration

ManipulaPy.cuda_kernels.make_1d_grid(size, threads=256)

Create a 1D grid configuration for a CUDA kernel launch, sized for maximum GPU utilization.

Parameters:

size (int) – Total number of elements to cover with one thread each.
threads (int) – Initial thread-block size; overridden internally based on size for better occupancy.

Returns:

((blocks,), (threads,)) launch configuration, each a 1-tuple suitable for kernel[blocks, threads].

Return type:

Tuple[Tuple[int, ...], Tuple[int, ...]]
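The core of any 1D launch configuration is a ceiling division, sketched below. `grid_1d` is a hypothetical stand-in: it keeps the requested block size as given and does not reproduce the occupancy-based override that make_1d_grid applies internally.

```python
def grid_1d(size, threads=256):
    """Return ((blocks,), (threads,)) with one thread per element."""
    # Ceiling division: enough blocks to cover `size`, and at least one.
    blocks = max(1, (size + threads - 1) // threads)
    return (blocks,), (threads,)


print(grid_1d(1000))
```

Because the last block is usually only partly used, a kernel launched with such a grid needs a bounds check on its global thread index.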

ManipulaPy.cuda_kernels.make_2d_grid(N, num_joints, block_size=(128, 8))

Create a 2D grid configuration for a CUDA kernel launch (backward compatibility).

This is the original function maintained for compatibility. For optimal performance, use make_2d_grid_optimized().

Parameters:

N (int) – Number of trajectory time steps (X dimension of the grid).
num_joints (int) – Number of joints (Y dimension of the grid).
block_size (Tuple[int, int]) – Initial (threads_x, threads_y) block shape; shrunk for tiny problems and adjusted to reach a minimum block count.

Returns:

((blocks_x, blocks_y), (threads_x, threads_y)) 2D launch configuration. Returns ((1, 1), (1, 1)) when CUDA is unavailable.

Return type:

Tuple[Tuple[int, int], Tuple[int, int]]
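A sketch of the 2D case, assuming the simplest reading of "shrunk for tiny problems": each block dimension is clamped to the problem size before the ceiling division. `grid_2d` is hypothetical and leaves out the minimum-block-count adjustment and the CUDA-unavailable branch described above.

```python
def grid_2d(N, num_joints, block_size=(128, 8)):
    """Return ((blocks_x, blocks_y), (threads_x, threads_y))."""
    # Clamp the block to the problem so tiny inputs do not launch idle threads.
    threads_x = max(1, min(block_size[0], N))
    threads_y = max(1, min(block_size[1], num_joints))
    # Ceiling division in each dimension.
    blocks_x = max(1, (N + threads_x - 1) // threads_x)
    blocks_y = max(1, (num_joints + threads_y - 1) // threads_y)
    return (blocks_x, blocks_y), (threads_x, threads_y)


# 1000 time steps for a 6-joint arm.
print(grid_2d(1000, 6))
```

For a typical arm the joint dimension fits inside a single block row, so the grid is effectively one-dimensional along the time axis.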

Performance Tools

ManipulaPy.cuda_kernels.benchmark_kernel_performance(*args, **kwargs)

Benchmark the performance of a specific CUDA kernel by executing it multiple times. In CPU-only environments the function only reports that CUDA benchmarking is unavailable.

Parameters:

kernel_name (str) – Name of the kernel to benchmark

*args – Arguments to pass to the kernel function

num_runs (int) – Number of times to run the kernel

Returns:

dict or None – Performance metrics including average, std, min/max times

Parameters (CPU-only stub signature):

args (Any)
kwargs (Any)

Return type:

Dict[str, Any] | None
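A generic timing loop of the kind described, sketched with the standard library. `benchmark` and the keys of its result dict are hypothetical, since the reference does not list the exact metric names. Kernel launches are asynchronous, so a real GPU benchmark would also need to synchronise the device before stopping the clock.

```python
import statistics
import time


def benchmark(fn, *args, num_runs=10):
    """Time fn(*args) num_runs times and summarise the wall-clock results."""
    times = []
    for _ in range(num_runs):
        start = time.perf_counter()
        fn(*args)
        times.append(time.perf_counter() - start)
    return {
        "avg_time": statistics.fmean(times),
        "std_time": statistics.pstdev(times),
        "min_time": min(times),
        "max_time": max(times),
        "num_runs": num_runs,
    }


stats = benchmark(sum, range(1000), num_runs=5)
print(sorted(stats))
```

The first run of a JIT-compiled kernel includes compilation time, so discarding a warm-up run before measuring is a common refinement.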

ManipulaPy.cuda_kernels.profile_start()

Start CUDA profiling. In CPU-only environments this is a no-op.

Return type: None

ManipulaPy.cuda_kernels.profile_stop()

Stop CUDA profiling. In CPU-only environments this returns empty profiler stats.

Return type: Dict[str, Any]

ManipulaPy.cuda_kernels._best_2d_config(*args, **kwargs)

Auto-tune 2D CUDA kernel launch configuration for optimal performance. When CUDA is unavailable, a minimal launch shape is returned.

Parameters:

N (int) – Number of time steps or trajectory points

J (int) – Number of joints or degrees of freedom

Returns:

tuple – ((grid_x, grid_y), (block_x, block_y))

Parameters (CPU-only stub signature):

args (Any)
kwargs (Any)

Return type:

Tuple[Tuple[int, int], Tuple[int, int]]

Module Constants

ManipulaPy.cuda_kernels.CUDA_AVAILABLE = False

bool – True if CUDA is available, False otherwise

ManipulaPy.cuda_kernels.CUPY_AVAILABLE = True

bool – True if CuPy is available, False otherwise

ManipulaPy.cuda_kernels.FAST_MATH = True

bool – Whether fast math optimizations are enabled

ManipulaPy.cuda_kernels.float_t = <class 'numpy.float32'>

type – Float precision type (float32 or float16)

Environment Variables

MANIPULAPY_FASTMATH: Set to “1” to enable fast math optimizations (~2x speedup with relaxed IEEE 754 compliance)
MANIPULAPY_USE_FP16: Set to “1” to use 16-bit floating point precision for memory-bound kernels
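A short sketch of setting these flags from Python. It assumes the variables are read when ManipulaPy.cuda_kernels is first imported, which the module-level FAST_MATH and float_t constants suggest but this reference does not state. `flag_enabled` is a hypothetical helper, and treating an unset variable as disabled is likewise an assumption.

```python
import os

# Set the flags before the first import of ManipulaPy.cuda_kernels
# (assumed to be the point at which they are read).
os.environ["MANIPULAPY_FASTMATH"] = "1"
os.environ["MANIPULAPY_USE_FP16"] = "0"


def flag_enabled(name):
    """Mirror the documented convention: the string "1" means enabled."""
    return os.environ.get(name, "0") == "1"


print(flag_enabled("MANIPULAPY_FASTMATH"), flag_enabled("MANIPULAPY_USE_FP16"))
```

The same effect can be had from the shell by exporting the variables before launching Python.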