.. _api-cuda-kernels: ========================== CUDA Kernels API Reference ========================== .. currentmodule:: ManipulaPy.cuda_kernels This module provides CUDA-accelerated functions for trajectory planning and dynamics computation with automatic CPU fallback. -------------------- Availability Check -------------------- .. autofunction:: check_cuda_availability Check if CUDA is available and provide helpful diagnostic information. **Returns:** - **bool** -- True if CUDA is available, False otherwise .. autofunction:: check_cupy_availability Check if CuPy is available for GPU array operations. **Returns:** - **bool** -- True if CuPy is available, False otherwise .. autofunction:: get_gpu_properties Retrieve current CUDA device properties for kernel optimization and resource allocation. **Returns:** - **dict or None** -- GPU device properties including multiprocessor count, memory limits, etc. ------------------- Core CUDA Kernels ------------------- Trajectory Kernels -------------------- .. autofunction:: trajectory_kernel CUDA kernel for generating joint trajectory points with time-scaling. **Parameters:** - **thetastart** (*cuda.device_array*) -- Starting joint angles - **thetaend** (*cuda.device_array*) -- Target joint angles - **traj_pos** (*cuda.device_array*) -- Output trajectory positions - **traj_vel** (*cuda.device_array*) -- Output trajectory velocities - **traj_acc** (*cuda.device_array*) -- Output trajectory accelerations - **Tf** (*float*) -- Total trajectory time - **N** (*int*) -- Number of trajectory points - **method** (*int*) -- Time scaling method (1=linear, 3=cubic, 5=quintic) - **stream** (*int*) -- CUDA stream for kernel execution .. note:: All trajectory kernels are safe at ``N <= 1`` (returns ``tau = 0`` so single-sample trajectories produce the start configuration instead of NaN/Inf from division by zero). Linear time-scaling (``method=1``) is supported in every trajectory kernel variant (``trajectory_kernel``, ``trajectory_kernel_vectorized``, ``trajectory_kernel_memory_optimized``, ``trajectory_kernel_warp_optimized``, ``trajectory_kernel_cache_friendly``, ``cartesian_trajectory_kernel``, and ``batch_trajectory_kernel``) as of v1.3.2; previously only cubic and quintic were honored. .. autofunction:: cartesian_trajectory_kernel CUDA kernel for generating Cartesian trajectory with time-scaling. **Parameters:** - **pstart** (*cuda.device_array*) -- Starting point coordinates [x, y, z] - **pend** (*cuda.device_array*) -- Ending point coordinates [x, y, z] - **traj_pos** (*cuda.device_array*) -- Output trajectory positions - **traj_vel** (*cuda.device_array*) -- Output trajectory velocities - **traj_acc** (*cuda.device_array*) -- Output trajectory accelerations - **Tf** (*float*) -- Total trajectory duration - **N** (*int*) -- Number of trajectory points - **method** (*int*) -- Time-scaling method (1=linear, 3=cubic, 5=quintic) - **stream** (*int*) -- CUDA stream for kernel execution .. versionchanged:: 1.3.2 Each thread now computes its own time scaling (no shared-memory reuse), eliminating a race where thread (0,0,0)'s ``t_idx`` scaling leaked into all other threads. Quintic acceleration also uses the correct ``s_ddot = 60 tau (1 - tau)(1 - 2 tau) / Tf**2``. .. autofunction:: batch_trajectory_kernel Optimized CUDA kernel for batch trajectory generation with time-scaling. **Parameters:** - **thetastart_batch** (*cuda.device_array*) -- Starting joint positions for each batch - **thetaend_batch** (*cuda.device_array*) -- Ending joint positions for each batch - **traj_pos_batch** (*cuda.device_array*) -- Output trajectory positions - **traj_vel_batch** (*cuda.device_array*) -- Output trajectory velocities - **traj_acc_batch** (*cuda.device_array*) -- Output trajectory accelerations - **Tf** (*float*) -- Total trajectory duration - **N** (*int*) -- Number of trajectory timesteps - **method** (*int*) -- Time-scaling method (1=linear, 3=cubic, 5=quintic) - **batch_size** (*int*) -- Number of trajectory batches - **stream** (*int*) -- CUDA stream for kernel execution .. versionchanged:: 1.3.2 Per-thread time scaling replaces the previous shared-memory layout that broadcast a single thread's ``t_idx`` to the whole block, and quintic acceleration uses the full ``60 tau (1 - tau)(1 - 2 tau) / Tf**2`` form. Dynamics Kernels ------------------- .. autofunction:: inverse_dynamics_kernel Optimized CUDA kernel for computing inverse dynamics using 2D parallelization. **Parameters:** - **thetalist_trajectory** (*cuda.device_array*) -- Joint position trajectory - **dthetalist_trajectory** (*cuda.device_array*) -- Joint velocity trajectory - **ddthetalist_trajectory** (*cuda.device_array*) -- Joint acceleration trajectory - **gravity_vector** (*cuda.device_array*) -- Gravity vector - **Ftip** (*cuda.device_array*) -- End-effector wrench - **Glist** (*cuda.device_array*) -- Mass matrix diagonal elements - **Slist** (*cuda.device_array*) -- Velocity quadratic force coefficients - **M** (*cuda.device_array*) -- Full mass matrix - **torques_trajectory** (*cuda.device_array*) -- Output joint torque trajectory - **torque_limits** (*cuda.device_array*) -- Joint torque limits - **stream** (*int*) -- CUDA stream for kernel execution .. autofunction:: forward_dynamics_kernel Compute forward dynamics for a robotic system using a CUDA kernel. **Parameters:** - **thetalist** (*cuda.device_array*) -- Initial joint positions - **dthetalist** (*cuda.device_array*) -- Initial joint velocities - **taumat** (*cuda.device_array*) -- Applied joint torques trajectory - **g** (*cuda.device_array*) -- Gravity vector - **Ftipmat** (*cuda.device_array*) -- End-effector wrenches - **dt** (*float*) -- Total time step - **intRes** (*int*) -- Integration resolution/substeps - **Glist** (*cuda.device_array*) -- Mass matrix diagonal elements - **Slist** (*cuda.device_array*) -- Velocity quadratic force coefficients - **M** (*cuda.device_array*) -- Full mass matrix - **thetamat** (*cuda.device_array*) -- Output joint position trajectory - **dthetamat** (*cuda.device_array*) -- Output joint velocity trajectory - **ddthetamat** (*cuda.device_array*) -- Output joint acceleration trajectory - **joint_limits** (*cuda.device_array*) -- Joint position limits - **stream** (*int*) -- CUDA stream for kernel execution .. versionchanged:: 1.3.2 Each thread now integrates from the initial state up to its own ``t_idx`` instead of reading ``thetamat[t_idx-1]``, removing the temporal data race where threads at higher ``t_idx`` could read rows that lower-``t_idx`` threads had not yet written. Potential Field Kernels -------------------------- .. autofunction:: fused_potential_gradient_kernel .. versionchanged:: 1.3.2 Repulsive-gradient sign corrected. Previous versions produced an *attracting* repulsive field due to a sign error in the gradient factor (``grad_factor`` is now ``-influence_term * dist_inv**3``). Existing code relying on the v1.3.1 behavior must be reviewed -- the sign flip is a correctness fix, not an API change. CUDA kernel for computing potential and gradient for path planning. **Parameters:** - **positions** (*cuda.device_array*) -- Input positions to evaluate - **goal** (*cuda.device_array*) -- Target goal point coordinates - **obstacles** (*cuda.device_array*) -- Array of obstacle point coordinates - **potential** (*cuda.device_array*) -- Output array for computed potential values - **gradient** (*cuda.device_array*) -- Output array for computed gradient vectors - **influence_distance** (*float*) -- Distance threshold for obstacle influence - **stream** (*int*) -- CUDA stream for kernel execution .. autofunction:: attractive_potential_kernel Legacy CUDA kernel for attractive potential field computation. **Parameters:** - **positions** (*cuda.device_array*) -- Query positions (N, 3) - **goal** (*cuda.device_array*) -- Goal position [x, y, z] - **potential** (*cuda.device_array*) -- Output potential values (N,) .. autofunction:: repulsive_potential_kernel Legacy CUDA kernel for repulsive potential field computation. **Parameters:** - **positions** (*cuda.device_array*) -- Query positions (N, 3) - **obstacles** (*cuda.device_array*) -- Obstacle positions (M, 3) - **potential** (*cuda.device_array*) -- Output potential values (N,) - **influence_distance** (*float*) -- Maximum influence distance .. autofunction:: gradient_kernel Legacy CUDA kernel for numerical gradient computation. **Parameters:** - **potential** (*cuda.device_array*) -- Potential field values (N,) - **gradient** (*cuda.device_array*) -- Output gradient (N-1,) ------------------- High-Level Wrappers ------------------- .. autofunction:: optimized_trajectory_generation Generates an optimized trajectory using CUDA acceleration with automatic memory management. **Parameters:** - **thetastart** (*np.ndarray*) -- Initial joint configuration - **thetaend** (*np.ndarray*) -- Final joint configuration - **Tf** (*float*) -- Total trajectory duration - **N** (*int*) -- Number of trajectory timesteps - **method** (*int*) -- Trajectory generation method (1=linear, 3=cubic, 5=quintic) - **use_pinned** (*bool*) -- Use pinned memory for faster GPU transfers **Returns:** - **tuple** -- (trajectory positions, trajectory velocities, trajectory accelerations) .. autofunction:: optimized_potential_field Compute potential field and gradient for a set of positions using a CUDA-accelerated kernel. **Parameters:** - **positions** (*np.ndarray*) -- Input positions to compute potential field for - **goal** (*np.ndarray*) -- Target goal position - **obstacles** (*np.ndarray*) -- Array of obstacle positions - **influence_distance** (*float*) -- Distance within which obstacles influence the potential field - **use_pinned** (*bool*) -- Use pinned memory for faster GPU transfers **Returns:** - **tuple** -- (potential values, gradient vectors) for each input position .. autofunction:: optimized_batch_trajectory_generation Efficiently generate batch trajectories using CUDA acceleration. **Parameters:** - **thetastart_batch** (*np.ndarray*) -- Batch of initial joint configurations - **thetaend_batch** (*np.ndarray*) -- Batch of final joint configurations - **Tf** (*float*) -- Total trajectory duration - **N** (*int*) -- Number of trajectory timesteps - **method** (*int*) -- Trajectory generation method identifier (1=linear, 3=cubic, 5=quintic) - **use_pinned** (*bool*) -- Use pinned memory for faster GPU transfers **Returns:** - **tuple** -- Batch of trajectory positions, velocities, and accelerations ---------------------- CPU Fallback Functions ---------------------- .. autofunction:: trajectory_cpu_fallback Compute trajectory positions, velocities, and accelerations on the CPU when CUDA is unavailable. **Parameters:** - **thetastart** (*np.ndarray*) -- Initial joint configurations - **thetaend** (*np.ndarray*) -- Target joint configurations - **Tf** (*float*) -- Total trajectory duration - **N** (*int*) -- Number of trajectory points to generate - **method** (*int*) -- Time scaling method (1=linear, 3=cubic, 5=quintic) **Returns:** - **tuple** -- (positions, velocities, accelerations) arrays ------------------- Memory Management ------------------- .. autofunction:: get_cuda_array Get a CUDA array from the memory pool. **Parameters:** - **shape** (*tuple*) -- Array dimensions - **dtype** (*np.dtype*) -- Data type **Returns:** - **cuda.device_array** -- GPU array from memory pool .. autofunction:: return_cuda_array Return a CUDA array to the memory pool. **Parameters:** - **array** (*cuda.device_array*) -- GPU array to return .. autofunction:: _h2d_pinned Helper function for pinned memory H2D transfers. **Parameters:** - **arr** (*np.ndarray*) -- Array to transfer to device **Returns:** - **cuda.device_array** -- Device array with data transferred ------------------- Grid Configuration ------------------- .. autofunction:: make_1d_grid Create a 1D grid configuration for CUDA kernel launch with optimal thread and block sizing. **Parameters:** - **size** (*int*) -- Total number of elements or work items to process - **threads** (*int*) -- Desired number of threads per block **Returns:** - **tuple** -- ((blocks,), (threads,)) for kernel launch configuration .. autofunction:: make_2d_grid Compute optimal 2D grid configuration for CUDA kernel launch. **Parameters:** - **N** (*int*) -- First dimension of problem space - **num_joints** (*int*) -- Second dimension of problem space - **block_size** (*tuple*) -- Initial suggested block dimensions **Returns:** - **tuple** -- ((blocks_x, blocks_y), (threads_x, threads_y)) ------------------- Performance Tools ------------------- .. autofunction:: benchmark_kernel_performance Benchmark the performance of a specific CUDA kernel by executing it multiple times. **Parameters:** - **kernel_name** (*str*) -- Name of the kernel to benchmark - ***args** -- Arguments to pass to the kernel function - **num_runs** (*int*) -- Number of times to run the kernel **Returns:** - **dict or None** -- Performance metrics including average, std, min/max times .. autofunction:: profile_start Start CUDA profiling. .. autofunction:: profile_stop Stop CUDA profiling. .. autofunction:: _best_2d_config Auto-tune 2D CUDA kernel launch configuration for optimal performance. **Parameters:** - **N** (*int*) -- Number of time steps or trajectory points - **J** (*int*) -- Number of joints or degrees of freedom **Returns:** - **tuple** -- ((grid_x, grid_y), (block_x, block_y)) ------------------- Module Constants ------------------- .. autodata:: CUDA_AVAILABLE **bool** -- True if CUDA is available, False otherwise .. autodata:: CUPY_AVAILABLE **bool** -- True if CuPy is available, False otherwise .. autodata:: FAST_MATH **bool** -- Whether fast math optimizations are enabled .. autodata:: float_t **type** -- Float precision type (float32 or float16) ---------------------- Environment Variables ---------------------- **MANIPULAPY_FASTMATH** Set to "1" to enable fast math optimizations (~2x speedup with relaxed IEEE 754 compliance) **MANIPULAPY_USE_FP16** Set to "1" to use 16-bit floating point precision for memory-bound kernels