CUDA Kernels API Reference#
This module provides CUDA-accelerated functions for trajectory planning and dynamics computation with automatic CPU fallback.
Availability Check#
- ManipulaPy.cuda_kernels.check_cuda_availability()[source]#
Enhanced CUDA availability check with detailed diagnostics.
Check if CUDA is available and provide helpful diagnostic information.
Returns:
bool β True if CUDA is available, False otherwise
- Return type:
- ManipulaPy.cuda_kernels.check_cupy_availability()[source]#
Check CuPy availability for additional GPU operations.
Check if CuPy is available for GPU array operations.
Returns:
bool β True if CuPy is available, False otherwise
- Return type:
Core CUDA Kernels#
Trajectory Kernels#
- ManipulaPy.cuda_kernels.trajectory_kernel(*args, **kwargs)[source]#
Raise because the CUDA trajectory kernel is unavailable.
CUDA kernel for generating joint trajectory points with time-scaling.
Parameters:
thetastart (cuda.device_array) β Starting joint angles
thetaend (cuda.device_array) β Target joint angles
traj_pos (cuda.device_array) β Output trajectory positions
traj_vel (cuda.device_array) β Output trajectory velocities
traj_acc (cuda.device_array) β Output trajectory accelerations
Tf (float) β Total trajectory time
N (int) β Number of trajectory points
method (int) β Time scaling method (1=linear, 3=cubic, 5=quintic)
stream (int) β CUDA stream for kernel execution
Note
All trajectory kernels are safe at
N <= 1(returnstau = 0so single-sample trajectories produce the start configuration instead of NaN/Inf from division by zero). Linear time-scaling (method=1) is supported in every trajectory kernel variant (trajectory_kernel,trajectory_kernel_vectorized,trajectory_kernel_memory_optimized,trajectory_kernel_warp_optimized,trajectory_kernel_cache_friendly,cartesian_trajectory_kernel, andbatch_trajectory_kernel) as of v1.3.2; previously only cubic and quintic were honored.
- ManipulaPy.cuda_kernels.cartesian_trajectory_kernel(*args, **kwargs)[source]#
Raise because the CUDA Cartesian trajectory kernel is unavailable.
CUDA kernel for generating Cartesian trajectory with time-scaling.
Parameters:
pstart (cuda.device_array) β Starting point coordinates [x, y, z]
pend (cuda.device_array) β Ending point coordinates [x, y, z]
traj_pos (cuda.device_array) β Output trajectory positions
traj_vel (cuda.device_array) β Output trajectory velocities
traj_acc (cuda.device_array) β Output trajectory accelerations
Tf (float) β Total trajectory duration
N (int) β Number of trajectory points
method (int) β Time-scaling method (1=linear, 3=cubic, 5=quintic)
stream (int) β CUDA stream for kernel execution
Changed in version 1.3.2: Each thread now computes its own time scaling (no shared-memory reuse), eliminating a race where thread (0,0,0)βs
t_idxscaling leaked into all other threads. Quintic acceleration also uses the corrects_ddot = 60 tau (1 - tau)(1 - 2 tau) / Tf**2.
- ManipulaPy.cuda_kernels.batch_trajectory_kernel(*args, **kwargs)[source]#
Raise because the CUDA batch trajectory kernel is unavailable.
Optimized CUDA kernel for batch trajectory generation with time-scaling.
Parameters:
thetastart_batch (cuda.device_array) β Starting joint positions for each batch
thetaend_batch (cuda.device_array) β Ending joint positions for each batch
traj_pos_batch (cuda.device_array) β Output trajectory positions
traj_vel_batch (cuda.device_array) β Output trajectory velocities
traj_acc_batch (cuda.device_array) β Output trajectory accelerations
Tf (float) β Total trajectory duration
N (int) β Number of trajectory timesteps
method (int) β Time-scaling method (1=linear, 3=cubic, 5=quintic)
batch_size (int) β Number of trajectory batches
stream (int) β CUDA stream for kernel execution
Changed in version 1.3.2: Per-thread time scaling replaces the previous shared-memory layout that broadcast a single threadβs
t_idxto the whole block, and quintic acceleration uses the full60 tau (1 - tau)(1 - 2 tau) / Tf**2form.
Dynamics Kernels#
- ManipulaPy.cuda_kernels.inverse_dynamics_kernel(*args, **kwargs)[source]#
Raise because the CUDA inverse dynamics kernel is unavailable.
Optimized CUDA kernel for computing inverse dynamics using 2D parallelization.
Parameters:
thetalist_trajectory (cuda.device_array) β Joint position trajectory
dthetalist_trajectory (cuda.device_array) β Joint velocity trajectory
ddthetalist_trajectory (cuda.device_array) β Joint acceleration trajectory
gravity_vector (cuda.device_array) β Gravity vector
Ftip (cuda.device_array) β End-effector wrench
Glist (cuda.device_array) β Mass matrix diagonal elements
Slist (cuda.device_array) β Velocity quadratic force coefficients
M (cuda.device_array) β Full mass matrix
torques_trajectory (cuda.device_array) β Output joint torque trajectory
torque_limits (cuda.device_array) β Joint torque limits
stream (int) β CUDA stream for kernel execution
- ManipulaPy.cuda_kernels.forward_dynamics_kernel(*args, **kwargs)[source]#
Raise because the CUDA forward dynamics kernel is unavailable.
Compute forward dynamics for a robotic system using a CUDA kernel.
Parameters:
thetalist (cuda.device_array) β Initial joint positions
dthetalist (cuda.device_array) β Initial joint velocities
taumat (cuda.device_array) β Applied joint torques trajectory
g (cuda.device_array) β Gravity vector
Ftipmat (cuda.device_array) β End-effector wrenches
dt (float) β Total time step
intRes (int) β Integration resolution/substeps
Glist (cuda.device_array) β Mass matrix diagonal elements
Slist (cuda.device_array) β Velocity quadratic force coefficients
M (cuda.device_array) β Full mass matrix
thetamat (cuda.device_array) β Output joint position trajectory
dthetamat (cuda.device_array) β Output joint velocity trajectory
ddthetamat (cuda.device_array) β Output joint acceleration trajectory
joint_limits (cuda.device_array) β Joint position limits
stream (int) β CUDA stream for kernel execution
Changed in version 1.3.2: Each thread now integrates from the initial state up to its own
t_idxinstead of readingthetamat[t_idx-1], removing the temporal data race where threads at highert_idxcould read rows that lower-t_idxthreads had not yet written.
Potential Field Kernels#
- ManipulaPy.cuda_kernels.fused_potential_gradient_kernel(*args, **kwargs)[source]#
Raise because the CUDA potential field kernel is unavailable.
Changed in version 1.3.2: Repulsive-gradient sign corrected. Previous versions produced an attracting repulsive field due to a sign error in the gradient factor (
grad_factoris now-influence_term * dist_inv**3). Existing code relying on the v1.3.1 behavior must be reviewed β the sign flip is a correctness fix, not an API change.CUDA kernel for computing potential and gradient for path planning.
Parameters:
positions (cuda.device_array) β Input positions to evaluate
goal (cuda.device_array) β Target goal point coordinates
obstacles (cuda.device_array) β Array of obstacle point coordinates
potential (cuda.device_array) β Output array for computed potential values
gradient (cuda.device_array) β Output array for computed gradient vectors
influence_distance (float) β Distance threshold for obstacle influence
stream (int) β CUDA stream for kernel execution
- ManipulaPy.cuda_kernels.attractive_potential_kernel(*args, **kwargs)[source]#
Legacy function - use fused_potential_gradient_kernel instead.
Legacy CUDA kernel for attractive potential field computation.
Parameters:
positions (cuda.device_array) β Query positions (N, 3)
goal (cuda.device_array) β Goal position [x, y, z]
potential (cuda.device_array) β Output potential values (N,)
- ManipulaPy.cuda_kernels.repulsive_potential_kernel(*args, **kwargs)[source]#
Legacy function - use fused_potential_gradient_kernel instead.
Legacy CUDA kernel for repulsive potential field computation.
Parameters:
positions (cuda.device_array) β Query positions (N, 3)
obstacles (cuda.device_array) β Obstacle positions (M, 3)
potential (cuda.device_array) β Output potential values (N,)
influence_distance (float) β Maximum influence distance
- ManipulaPy.cuda_kernels.gradient_kernel(*args, **kwargs)[source]#
Legacy function - use fused_potential_gradient_kernel instead.
Legacy CUDA kernel for numerical gradient computation.
Parameters:
potential (cuda.device_array) β Potential field values (N,)
gradient (cuda.device_array) β Output gradient (N-1,)
High-Level Wrappers#
- ManipulaPy.cuda_kernels.optimized_trajectory_generation(thetastart, thetaend, Tf, N, method, use_pinned=True, kernel_type='auto')[source]#
Main entry point for optimized trajectory generation.
This function automatically selects the best kernel and configuration for maximum performance and 40x+ speedups.
- Parameters:
thetastart (Any) β Start and end joint angles
thetaend (Any) β Start and end joint angles
Tf (float) β Final time
N (int) β Number of trajectory points
method (int) β Time scaling method (3=cubic, 5=quintic)
use_pinned (bool) β Use pinned memory for faster transfers
kernel_type (str) β Kernel selection (βautoβ, βstandardβ, βvectorizedβ, etc.)
- Return type:
Generates an optimized trajectory using CUDA acceleration with automatic memory management.
Parameters:
thetastart (np.ndarray) β Initial joint configuration
thetaend (np.ndarray) β Final joint configuration
Tf (float) β Total trajectory duration
N (int) β Number of trajectory timesteps
method (int) β Trajectory generation method (1=linear, 3=cubic, 5=quintic)
use_pinned (bool) β Use pinned memory for faster GPU transfers
Returns:
tuple β (trajectory positions, trajectory velocities, trajectory accelerations)
- ManipulaPy.cuda_kernels.optimized_potential_field(positions, goal, obstacles, influence_distance, use_pinned=True)[source]#
Optimized potential field computation with CUDA acceleration.
- Parameters:
positions (ndarray) β (N, 3) ndarray of query point positions.
goal (ndarray) β (3,) ndarray, attractive goal position.
obstacles (ndarray) β (num_obstacles, 3) ndarray of obstacle positions.
influence_distance (float) β Repulsive influence radius; obstacles farther than this contribute nothing.
use_pinned (bool) β If True, use pinned host memory for host-to-device transfers.
- Returns:
(potential, gradient)wherepotentialis an(N,)float32 array of total potential values andgradientis an(N, 3)float32 array of potential gradients.- Return type:
Tuple[np.ndarray, np.ndarray]
- Raises:
RuntimeError β If CUDA is not available.
Compute potential field and gradient for a set of positions using a CUDA-accelerated kernel.
Parameters:
positions (np.ndarray) β Input positions to compute potential field for
goal (np.ndarray) β Target goal position
obstacles (np.ndarray) β Array of obstacle positions
influence_distance (float) β Distance within which obstacles influence the potential field
use_pinned (bool) β Use pinned memory for faster GPU transfers
Returns:
tuple β (potential values, gradient vectors) for each input position
- ManipulaPy.cuda_kernels.optimized_batch_trajectory_generation(thetastart_batch, thetaend_batch, Tf, N, method, use_pinned=True)[source]#
Optimized batch trajectory generation for multiple trajectories.
- Parameters:
thetastart_batch (ndarray) β (batch_size, num_joints) ndarray of starting joint angles, radians.
thetaend_batch (ndarray) β (batch_size, num_joints) ndarray of ending joint angles, radians.
Tf (float) β Total trajectory duration, seconds.
N (int) β Number of trajectory time steps.
method (int) β Time-scaling order: 3 cubic, 5 quintic, else linear.
use_pinned (bool) β If True, use pinned host memory for host-to-device transfers.
- Returns:
(traj_pos_batch, traj_vel_batch, traj_acc_batch), each a(batch_size, N, num_joints)float32 ndarray of joint positions (radians), velocities (radians/s), and accelerations (radians/s^2).- Return type:
Tuple[np.ndarray, np.ndarray, np.ndarray]
- Raises:
RuntimeError β If CUDA is not available.
Efficiently generate batch trajectories using CUDA acceleration.
Parameters:
thetastart_batch (np.ndarray) β Batch of initial joint configurations
thetaend_batch (np.ndarray) β Batch of final joint configurations
Tf (float) β Total trajectory duration
N (int) β Number of trajectory timesteps
method (int) β Trajectory generation method identifier (1=linear, 3=cubic, 5=quintic)
use_pinned (bool) β Use pinned memory for faster GPU transfers
Returns:
tuple β Batch of trajectory positions, velocities, and accelerations
CPU Fallback Functions#
- ManipulaPy.cuda_kernels.trajectory_cpu_fallback(thetastart, thetaend, Tf, N, method)[source]#
Optimized CPU fallback using NumPy vectorization.
- Parameters:
thetastart (ndarray) β (num_joints,) ndarray of starting joint angles, radians.
thetaend (ndarray) β (num_joints,) ndarray of ending joint angles, radians.
Tf (float) β Total trajectory duration, seconds. Values <= 0 collapse to the start configuration with zero velocity and acceleration.
N (int) β Number of trajectory time steps. Values <= 1 collapse to the start configuration.
method (int) β Time-scaling polynomial order: 3 for cubic, 5 for quintic, any other value (e.g. 1) for linear.
- Returns:
(traj_pos, traj_vel, traj_acc), each an(N, num_joints)float32 ndarray of joint positions (radians), velocities (radians/s), and accelerations (radians/s^2).- Return type:
Tuple[np.ndarray, np.ndarray, np.ndarray]
Compute trajectory positions, velocities, and accelerations on the CPU when CUDA is unavailable.
Parameters:
thetastart (np.ndarray) β Initial joint configurations
thetaend (np.ndarray) β Target joint configurations
Tf (float) β Total trajectory duration
N (int) β Number of trajectory points to generate
method (int) β Time scaling method (1=linear, 3=cubic, 5=quintic)
Returns:
tuple β (positions, velocities, accelerations) arrays
Memory Management#
- ManipulaPy.cuda_kernels.get_cuda_array(*args, **kwargs)[source]#
Raise because the CUDA memory pool is unavailable.
Get a CUDA array from the memory pool.
Parameters:
shape (tuple) β Array dimensions
dtype (np.dtype) β Data type
Returns:
cuda.device_array β GPU array from memory pool
- ManipulaPy.cuda_kernels.return_cuda_array(*args, **kwargs)[source]#
Raise because the CUDA memory pool is unavailable.
Return a CUDA array to the memory pool.
Parameters:
array (cuda.device_array) β GPU array to return
- ManipulaPy.cuda_kernels._h2d_pinned(arr)[source]#
Host-to-device transfer with optional pinned-memory acceleration.
Pinned memory delivers ~3x peak transfer bandwidth on large arrays, but
cuda.pinned_arrayis currently incompatible with several modern numba+driver combinations (see_PINNED_MEMORY_OPT_INabove). Plaincuda.to_deviceis correct on every supported configuration; pinned transfers are a pure performance optimisation that must be opted in.- Parameters:
arr (ndarray) β Host ndarray to copy to the device. Forced to C-contiguous layout if it is not already.
- Returns:
A numba CUDA device array holding a copy of
arr.- Raises:
RuntimeError β If CUDA is not available.
- Return type:
Helper function for pinned memory H2D transfers.
Parameters:
arr (np.ndarray) β Array to transfer to device
Returns:
cuda.device_array β Device array with data transferred
Grid Configuration#
- ManipulaPy.cuda_kernels.make_1d_grid(size, threads=256)[source]#
Create optimal 1D grid for maximum GPU utilization.
- Parameters:
- Returns:
(blocks, threads)launch configuration, each a 1-tuple suitable forkernel[blocks, threads].- Return type:
Create a 1D grid configuration for CUDA kernel launch with optimal thread and block sizing.
Parameters:
size (int) β Total number of elements or work items to process
threads (int) β Desired number of threads per block
Returns:
tuple β ((blocks,), (threads,)) for kernel launch configuration
- ManipulaPy.cuda_kernels.make_2d_grid(N, num_joints, block_size=(128, 8))[source]#
Create 2D grid configuration for CUDA kernel launch (backward compatibility).
This is the original function maintained for compatibility. For optimal performance, use make_2d_grid_optimized().
- Parameters:
- Returns:
(grid, block)2D launch configuration. Returns((1, 1), (1, 1))when CUDA is unavailable.- Return type:
Compute optimal 2D grid configuration for CUDA kernel launch.
Parameters:
N (int) β First dimension of problem space
num_joints (int) β Second dimension of problem space
block_size (tuple) β Initial suggested block dimensions
Returns:
tuple β ((blocks_x, blocks_y), (threads_x, threads_y))
Performance Tools#
- ManipulaPy.cuda_kernels.benchmark_kernel_performance(*args, **kwargs)[source]#
Report that CUDA benchmarking is unavailable.
Benchmark the performance of a specific CUDA kernel by executing it multiple times.
Parameters:
kernel_name (str) β Name of the kernel to benchmark
*args β Arguments to pass to the kernel function
num_runs (int) β Number of times to run the kernel
Returns:
dict or None β Performance metrics including average, std, min/max times
- ManipulaPy.cuda_kernels.profile_start()[source]#
No-op CUDA profiler start for CPU-only environments.
Start CUDA profiling.
- Return type:
None
- ManipulaPy.cuda_kernels.profile_stop()[source]#
Return empty CUDA profiler stats in CPU-only environments.
Stop CUDA profiling.
- ManipulaPy.cuda_kernels._best_2d_config(*args, **kwargs)[source]#
Return a minimal launch shape when CUDA is unavailable.
Auto-tune 2D CUDA kernel launch configuration for optimal performance.
Parameters:
N (int) β Number of time steps or trajectory points
J (int) β Number of joints or degrees of freedom
Returns:
tuple β ((grid_x, grid_y), (block_x, block_y))
Module Constants#
- ManipulaPy.cuda_kernels.CUDA_AVAILABLE = False#
bool(x) -> bool
Returns True when the argument x is true, False otherwise. The builtins True and False are the only two instances of the class bool. The class bool is a subclass of the class int, and cannot be subclassed.
bool β True if CUDA is available, False otherwise
- ManipulaPy.cuda_kernels.CUPY_AVAILABLE = True#
bool(x) -> bool
Returns True when the argument x is true, False otherwise. The builtins True and False are the only two instances of the class bool. The class bool is a subclass of the class int, and cannot be subclassed.
bool β True if CuPy is available, False otherwise
- ManipulaPy.cuda_kernels.FAST_MATH = True#
bool(x) -> bool
Returns True when the argument x is true, False otherwise. The builtins True and False are the only two instances of the class bool. The class bool is a subclass of the class int, and cannot be subclassed.
bool β Whether fast math optimizations are enabled
- ManipulaPy.cuda_kernels.float_t = <class 'numpy.float32'>#
Single-precision floating-point number type, compatible with C
float.- Character code:
'f'- Canonical name:
numpy.single
- Alias on this platform (Linux x86_64):
numpy.float32: 32-bit-precision floating-point number type: sign bit, 8 bits exponent, 23 bits mantissa.
type β Float precision type (float32 or float16)
Environment Variables#
- MANIPULAPY_FASTMATH
Set to β1β to enable fast math optimizations (~2x speedup with relaxed IEEE 754 compliance)
- MANIPULAPY_USE_FP16
Set to β1β to use 16-bit floating point precision for memory-bound kernels