CUDA Kernels API Reference

This module provides CUDA-accelerated functions for trajectory planning and dynamics computation with automatic CPU fallback.

Availability Check

ManipulaPy.cuda_kernels.check_cuda_availability()[source]

Enhanced CUDA availability check with detailed diagnostics.

Check if CUDA is available and provide helpful diagnostic information.

Returns:

  • bool – True if CUDA is available, False otherwise

ManipulaPy.cuda_kernels.check_cupy_availability()[source]

Check CuPy availability for additional GPU operations.

Check if CuPy is available for GPU array operations.

Returns:

  • bool – True if CuPy is available, False otherwise

ManipulaPy.cuda_kernels.get_gpu_properties()[source]

Get comprehensive GPU properties for optimization.

Retrieve current CUDA device properties for kernel optimization and resource allocation.

Returns:

  • dict or None – GPU device properties including multiprocessor count, memory limits, etc.

Core CUDA Kernels

Trajectory Kernels

ManipulaPy.cuda_kernels.trajectory_kernel(*args, **kwargs)[source]

CUDA kernel for generating joint trajectory points with time-scaling.

Parameters:

  • thetastart (cuda.device_array) – Starting joint angles

  • thetaend (cuda.device_array) – Target joint angles

  • traj_pos (cuda.device_array) – Output trajectory positions

  • traj_vel (cuda.device_array) – Output trajectory velocities

  • traj_acc (cuda.device_array) – Output trajectory accelerations

  • Tf (float) – Total trajectory time

  • N (int) – Number of trajectory points

  • method (int) – Time scaling method (3=cubic, 5=quintic)

  • stream (int) – CUDA stream for kernel execution

ManipulaPy.cuda_kernels.cartesian_trajectory_kernel(*args, **kwargs)[source]

CUDA kernel for generating Cartesian trajectory with time-scaling.

Parameters:

  • pstart (cuda.device_array) – Starting point coordinates [x, y, z]

  • pend (cuda.device_array) – Ending point coordinates [x, y, z]

  • traj_pos (cuda.device_array) – Output trajectory positions

  • traj_vel (cuda.device_array) – Output trajectory velocities

  • traj_acc (cuda.device_array) – Output trajectory accelerations

  • Tf (float) – Total trajectory duration

  • N (int) – Number of trajectory points

  • method (int) – Time-scaling method (3=cubic, 5=quintic)

  • stream (int) – CUDA stream for kernel execution

ManipulaPy.cuda_kernels.batch_trajectory_kernel(*args, **kwargs)[source]

Optimized CUDA kernel for batch trajectory generation with time-scaling.

Parameters:

  • thetastart_batch (cuda.device_array) – Starting joint positions for each batch

  • thetaend_batch (cuda.device_array) – Ending joint positions for each batch

  • traj_pos_batch (cuda.device_array) – Output trajectory positions

  • traj_vel_batch (cuda.device_array) – Output trajectory velocities

  • traj_acc_batch (cuda.device_array) – Output trajectory accelerations

  • Tf (float) – Total trajectory duration

  • N (int) – Number of trajectory timesteps

  • method (int) – Time-scaling method (3=cubic, 5=quintic)

  • batch_size (int) – Number of trajectory batches

  • stream (int) – CUDA stream for kernel execution

Dynamics Kernels

ManipulaPy.cuda_kernels.inverse_dynamics_kernel(*args, **kwargs)[source]

Optimized CUDA kernel for computing inverse dynamics using 2D parallelization.

Parameters:

  • thetalist_trajectory (cuda.device_array) – Joint position trajectory

  • dthetalist_trajectory (cuda.device_array) – Joint velocity trajectory

  • ddthetalist_trajectory (cuda.device_array) – Joint acceleration trajectory

  • gravity_vector (cuda.device_array) – Gravity vector

  • Ftip (cuda.device_array) – End-effector wrench

  • Glist (cuda.device_array) – Mass matrix diagonal elements

  • Slist (cuda.device_array) – Velocity quadratic force coefficients

  • M (cuda.device_array) – Full mass matrix

  • torques_trajectory (cuda.device_array) – Output joint torque trajectory

  • torque_limits (cuda.device_array) – Joint torque limits

  • stream (int) – CUDA stream for kernel execution

ManipulaPy.cuda_kernels.forward_dynamics_kernel(*args, **kwargs)[source]

Compute forward dynamics for a robotic system using a CUDA kernel.

Parameters:

  • thetalist (cuda.device_array) – Initial joint positions

  • dthetalist (cuda.device_array) – Initial joint velocities

  • taumat (cuda.device_array) – Applied joint torques trajectory

  • g (cuda.device_array) – Gravity vector

  • Ftipmat (cuda.device_array) – End-effector wrenches

  • dt (float) – Total time step

  • intRes (int) – Integration resolution/substeps

  • Glist (cuda.device_array) – Mass matrix diagonal elements

  • Slist (cuda.device_array) – Velocity quadratic force coefficients

  • M (cuda.device_array) – Full mass matrix

  • thetamat (cuda.device_array) – Output joint position trajectory

  • dthetamat (cuda.device_array) – Output joint velocity trajectory

  • ddthetamat (cuda.device_array) – Output joint acceleration trajectory

  • joint_limits (cuda.device_array) – Joint position limits

  • stream (int) – CUDA stream for kernel execution

Potential Field Kernels

ManipulaPy.cuda_kernels.fused_potential_gradient_kernel(*args, **kwargs)[source]

CUDA kernel for computing potential and gradient for path planning.

Parameters:

  • positions (cuda.device_array) – Input positions to evaluate

  • goal (cuda.device_array) – Target goal point coordinates

  • obstacles (cuda.device_array) – Array of obstacle point coordinates

  • potential (cuda.device_array) – Output array for computed potential values

  • gradient (cuda.device_array) – Output array for computed gradient vectors

  • influence_distance (float) – Distance threshold for obstacle influence

  • stream (int) – CUDA stream for kernel execution

ManipulaPy.cuda_kernels.attractive_potential_kernel(*args, **kwargs)[source]

Legacy function - use fused_potential_gradient_kernel instead.

Legacy CUDA kernel for attractive potential field computation.

Parameters:

  • positions (cuda.device_array) – Query positions (N, 3)

  • goal (cuda.device_array) – Goal position [x, y, z]

  • potential (cuda.device_array) – Output potential values (N,)

ManipulaPy.cuda_kernels.repulsive_potential_kernel(*args, **kwargs)[source]

Legacy function - use fused_potential_gradient_kernel instead.

Legacy CUDA kernel for repulsive potential field computation.

Parameters:

  • positions (cuda.device_array) – Query positions (N, 3)

  • obstacles (cuda.device_array) – Obstacle positions (M, 3)

  • potential (cuda.device_array) – Output potential values (N,)

  • influence_distance (float) – Maximum influence distance

ManipulaPy.cuda_kernels.gradient_kernel(*args, **kwargs)[source]

Legacy function - use fused_potential_gradient_kernel instead.

Legacy CUDA kernel for numerical gradient computation.

Parameters:

  • potential (cuda.device_array) – Potential field values (N,)

  • gradient (cuda.device_array) – Output gradient (N-1,)

High-Level Wrappers

ManipulaPy.cuda_kernels.optimized_trajectory_generation(thetastart, thetaend, Tf, N, method, use_pinned=True, kernel_type='auto')[source]

Main entry point for optimized trajectory generation.

This function automatically selects the best kernel and configuration for maximum performance and 40x+ speedups.

Parameters:
  • thetastart – Start and end joint angles

  • thetaend – Start and end joint angles

  • Tf – Final time

  • N – Number of trajectory points

  • method – Time scaling method (3=cubic, 5=quintic)

  • use_pinned – Use pinned memory for faster transfers

  • kernel_type – Kernel selection (β€œauto”, β€œstandard”, β€œvectorized”, etc.)

Generates an optimized trajectory using CUDA acceleration with automatic memory management.

Parameters:

  • thetastart (np.ndarray) – Initial joint configuration

  • thetaend (np.ndarray) – Final joint configuration

  • Tf (float) – Total trajectory duration

  • N (int) – Number of trajectory timesteps

  • method (int) – Trajectory generation method

  • use_pinned (bool) – Use pinned memory for faster GPU transfers

Returns:

  • tuple – (trajectory positions, trajectory velocities, trajectory accelerations)

ManipulaPy.cuda_kernels.optimized_potential_field(positions, goal, obstacles, influence_distance, use_pinned=True)[source]

Optimized potential field computation with CUDA acceleration.

Compute potential field and gradient for a set of positions using a CUDA-accelerated kernel.

Parameters:

  • positions (np.ndarray) – Input positions to compute potential field for

  • goal (np.ndarray) – Target goal position

  • obstacles (np.ndarray) – Array of obstacle positions

  • influence_distance (float) – Distance within which obstacles influence the potential field

  • use_pinned (bool) – Use pinned memory for faster GPU transfers

Returns:

  • tuple – (potential values, gradient vectors) for each input position

ManipulaPy.cuda_kernels.optimized_batch_trajectory_generation(thetastart_batch, thetaend_batch, Tf, N, method, use_pinned=True)[source]

Optimized batch trajectory generation for multiple trajectories.

Efficiently generate batch trajectories using CUDA acceleration.

Parameters:

  • thetastart_batch (np.ndarray) – Batch of initial joint configurations

  • thetaend_batch (np.ndarray) – Batch of final joint configurations

  • Tf (float) – Total trajectory duration

  • N (int) – Number of trajectory timesteps

  • method (int) – Trajectory generation method identifier

  • use_pinned (bool) – Use pinned memory for faster GPU transfers

Returns:

  • tuple – Batch of trajectory positions, velocities, and accelerations

CPU Fallback Functions

ManipulaPy.cuda_kernels.trajectory_cpu_fallback(thetastart, thetaend, Tf, N, method)[source]

Optimized CPU fallback using NumPy vectorization.

Compute trajectory positions, velocities, and accelerations on the CPU when CUDA is unavailable.

Parameters:

  • thetastart (np.ndarray) – Initial joint configurations

  • thetaend (np.ndarray) – Target joint configurations

  • Tf (float) – Total trajectory duration

  • N (int) – Number of trajectory points to generate

  • method (int) – Time scaling method (3=cubic, 5=quintic)

Returns:

  • tuple – (positions, velocities, accelerations) arrays

Memory Management

ManipulaPy.cuda_kernels.get_cuda_array(*args, **kwargs)[source]

Get a CUDA array from the memory pool.

Parameters:

  • shape (tuple) – Array dimensions

  • dtype (np.dtype) – Data type

Returns:

  • cuda.device_array – GPU array from memory pool

ManipulaPy.cuda_kernels.return_cuda_array(*args, **kwargs)[source]

Return a CUDA array to the memory pool.

Parameters:

  • array (cuda.device_array) – GPU array to return

ManipulaPy.cuda_kernels._h2d_pinned(arr)[source]

Optimized pinned memory transfer for maximum bandwidth.

Helper function for pinned memory H2D transfers.

Parameters:

  • arr (np.ndarray) – Array to transfer to device

Returns:

  • cuda.device_array – Device array with data transferred

Grid Configuration

ManipulaPy.cuda_kernels.make_1d_grid(size, threads=256)[source]

Create optimal 1D grid for maximum GPU utilization.

Create a 1D grid configuration for CUDA kernel launch with optimal thread and block sizing.

Parameters:

  • size (int) – Total number of elements or work items to process

  • threads (int) – Desired number of threads per block

Returns:

  • tuple – ((blocks,), (threads,)) for kernel launch configuration

ManipulaPy.cuda_kernels.make_2d_grid(N, num_joints, block_size=(128, 8))[source]

Create 2D grid configuration for CUDA kernel launch (backward compatibility).

This is the original function maintained for compatibility. For optimal performance, use make_2d_grid_optimized().

Compute optimal 2D grid configuration for CUDA kernel launch.

Parameters:

  • N (int) – First dimension of problem space

  • num_joints (int) – Second dimension of problem space

  • block_size (tuple) – Initial suggested block dimensions

Returns:

  • tuple – ((blocks_x, blocks_y), (threads_x, threads_y))

Parameters:

Performance Tools

ManipulaPy.cuda_kernels.benchmark_kernel_performance(*args, **kwargs)[source]

Benchmark the performance of a specific CUDA kernel by executing it multiple times.

Parameters:

  • kernel_name (str) – Name of the kernel to benchmark

  • *args – Arguments to pass to the kernel function

  • num_runs (int) – Number of times to run the kernel

Returns:

  • dict or None – Performance metrics including average, std, min/max times

ManipulaPy.cuda_kernels.profile_start()[source]

Start CUDA profiling.

ManipulaPy.cuda_kernels.profile_stop()[source]

Stop CUDA profiling.

ManipulaPy.cuda_kernels._best_2d_config(*args, **kwargs)[source]

Auto-tune 2D CUDA kernel launch configuration for optimal performance.

Parameters:

  • N (int) – Number of time steps or trajectory points

  • J (int) – Number of joints or degrees of freedom

Returns:

  • tuple – ((grid_x, grid_y), (block_x, block_y))

Module Constants

ManipulaPy.cuda_kernels.CUDA_AVAILABLE = False

bool(x) -> bool

Returns True when the argument x is true, False otherwise. The builtins True and False are the only two instances of the class bool. The class bool is a subclass of the class int, and cannot be subclassed.

bool – True if CUDA is available, False otherwise

ManipulaPy.cuda_kernels.CUPY_AVAILABLE = True

bool(x) -> bool

Returns True when the argument x is true, False otherwise. The builtins True and False are the only two instances of the class bool. The class bool is a subclass of the class int, and cannot be subclassed.

bool – True if CuPy is available, False otherwise

ManipulaPy.cuda_kernels.FAST_MATH = True

bool(x) -> bool

Returns True when the argument x is true, False otherwise. The builtins True and False are the only two instances of the class bool. The class bool is a subclass of the class int, and cannot be subclassed.

bool – Whether fast math optimizations are enabled

ManipulaPy.cuda_kernels.float_t = <class 'numpy.float32'>

Single-precision floating-point number type, compatible with C float.

Character code:

'f'

Canonical name:

numpy.single

Alias on this platform (Linux x86_64):

numpy.float32: 32-bit-precision floating-point number type: sign bit, 8 bits exponent, 23 bits mantissa.

type – Float precision type (float32 or float16)

Environment Variables

MANIPULAPY_FASTMATH

Set to β€œ1” to enable fast math optimizations (~2x speedup with relaxed IEEE 754 compliance)

MANIPULAPY_USE_FP16

Set to β€œ1” to use 16-bit floating point precision for memory-bound kernels