mirror of https://github.com/opencv/opencv.git
fixed documents errors for GPU module
This commit is contained in:
parent 3bac10a1ca
commit f6974df279
@ -1,7 +1,8 @@
\section{Data Structures}
-\cvclass{gpu::DevMem2D\_}
+\cvclass{gpu::DevMem2D\_}\label{cppfunc.gpu.DevMem2D}
This is a simple lightweight class that encapsulates pitched memory on the GPU. It is intended to be passed to nvcc-compiled code, i.e. CUDA kernels. Its members can be called both from host and from device code.

\begin{lstlisting}
@ -30,8 +31,9 @@ template <typename T> struct DevMem2D_
\end{lstlisting}
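For illustration, here is a minimal sketch of how such a structure might be consumed by nvcc-compiled code; the kernel itself is hypothetical and not part of the module, and it assumes the size fields and row-pointer accessor from the declaration above:

\begin{lstlisting}
// Illustrative CUDA kernel: fills an image with a constant value.
// DevMem2D_ carries the data pointer, step (in bytes) and the
// image size, so nothing else has to be passed.
__global__ void fill(cv::gpu::DevMem2D_<float> img, float val)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < img.cols && y < img.rows)
        img.ptr(y)[x] = val;
}
\end{lstlisting}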
-\cvclass{gpu::PtrStep\_}
-This is class like DevMem2D\_ but contain only pointer and row step. Image sizes are excluded due to performance reasons.
+\cvclass{gpu::PtrStep\_}\label{cppfunc.gpu.PtrStep}
+This structure is similar to DevMem2D\_ but contains only a pointer and row step. Width and height fields are excluded for performance reasons.
\begin{lstlisting}
template<typename T> struct PtrStep_
@ -52,8 +54,8 @@ template<typename T> struct PtrStep_
\end{lstlisting}
-\cvclass{gpu::PtrElemStrp\_}
-This is class like DevMem2D\_ but contain only pointer and row step in elements. Image sizes are excluded due to performance reasons. This class is can only be constructed if sizeof(T) is multiple of 256.
+\cvclass{gpu::PtrElemStep\_}
+This structure is similar to DevMem2D\_ but contains only a pointer and row step counted in elements. Width and height fields are excluded for performance reasons. This class can only be constructed if sizeof(T) is a multiple of 256.
\begin{lstlisting}
template<typename T> struct PtrElemStep_ : public PtrStep_<T>
@ -67,9 +69,12 @@ template<typename T> struct PtrElemStep_ : public PtrStep_<T>
\cvclass{gpu::GpuMat}
-The base storage class for GPU memory with reference counting. Its interface is almost \cvCppCross{Mat} interface with some limitations, so using it won't be a problem. The limitations are no arbitrary dimensions support (only 2D), no functions that returns references to its data (because references on GPU are not valid for CPU), no expression templates technique support. Because of last limitation please take care with overloaded matrix operators - they cause memory allocations. The GpuMat class is convertible to cv::gpu::DevMem2D\_ and cv::gpu::PtrStep\_ so it can be passed to directly to kernel.
+The base storage class for GPU memory with reference counting. Its interface is almost the same as the \cvCppCross{Mat} interface, with some limitations, so using it won't be a problem. The limitations are: no arbitrary dimensions support (only 2D), no functions that return references to their data (because references on the GPU are not valid for the CPU), and no support for the expression templates technique. Because of the last limitation, please take care with overloaded matrix operators - they cause memory allocations. The GpuMat class is convertible to \hyperref[cppfunc.gpu.DevMem2D]{cv::gpu::DevMem2D\_} and \hyperref[cppfunc.gpu.PtrStep]{cv::gpu::PtrStep\_} so it can be passed directly to a kernel.
-\textbf{Please note:} In contrast with \cvCppCross{Mat}, I most cases \texttt{GpuMat::isContinuous() == false}, i.e. rows are aligned to size depending on hardware.
+\textbf{Please note:} In contrast with \cvCppCross{Mat}, in most cases \texttt{GpuMat::isContinuous() == false}, i.e. rows are aligned to a size depending on the hardware. Also, a single-row GpuMat is always a continuous matrix.
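As a minimal usage sketch (the file name and image type here are arbitrary):

\begin{lstlisting}
cv::Mat host_img = cv::imread("image.png", 0); // 8-bit, 1 channel
cv::gpu::GpuMat dev_img;

dev_img.upload(host_img);  // blocking host-to-device copy
// ... call GPU module functions on dev_img ...
cv::Mat result;
dev_img.download(result);  // blocking device-to-host copy
\end{lstlisting}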
\begin{lstlisting}
class CV_EXPORTS GpuMat
@ -88,8 +93,8 @@ public:
//! returns lightweight DevMem2D_ structure for passing
//to nvcc-compiled code. Contains size, data ptr and step.
-template <class T> operator DevMem2D\_<T>() const;
-template <class T> operator PtrStep\_<T>() const;
+template <class T> operator DevMem2D_<T>() const;
+template <class T> operator PtrStep_<T>() const;
//! performs blocking upload of data to GpuMat.
void upload(const cv::Mat& m);
@ -113,7 +118,7 @@ See also: \cvCppCross{Mat}
\cvclass{gpu::CudaMem}
This is a class with reference counting that wraps special memory type allocation functions from CUDA. Its interface is also \cvCppCross{Mat}-like, but with an additional memory type parameter (a usage sketch follows the list):

\begin{itemize}
-\item \texttt{ALLOC\_PAGE\_LOCKED} Sets page locked memory type, used commonly for fast and asynchronous upload/download data from/to GPU.
+\item \texttt{ALLOC\_PAGE\_LOCKED} Sets the page-locked memory type, commonly used for fast and asynchronous uploading/downloading of data from/to the GPU.
\item \texttt{ALLOC\_ZEROCOPY} Specifies zero-copy memory allocation, i.e. with the possibility to map host memory to the GPU address space, if supported.
\item \texttt{ALLOC\_WRITE\_COMBINED} Sets a write-combined buffer which is not cached by the CPU. Such buffers are used to supply the GPU with data when the GPU only reads it. The advantage is better CPU cache utilization.
\end{itemize}
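The sketch mentioned above; the buffer size and type are arbitrary, and the constructor and \texttt{createMatHeader} member are assumed to be as in the class synopsis below:

\begin{lstlisting}
// Page-locked host buffer for fast asynchronous transfers.
cv::gpu::CudaMem buf(480, 640, CV_8UC1,
                     cv::gpu::CudaMem::ALLOC_PAGE_LOCKED);

// Mat header over the same memory; no data is copied.
cv::Mat header = buf.createMatHeader();
// Fill header on the CPU, then upload it (a)synchronously.
\end{lstlisting}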
@ -168,14 +173,14 @@ CudaMem::operator GpuMat() const;
}
\cvCppFunc{gpu::CudaMem::canMapHostMemory}
-Returns true is current hardware support address space mapping and \texttt{ALLOC\_ZEROCOPY} memory allocation
+Returns true if the current hardware supports address space mapping and \texttt{ALLOC\_ZEROCOPY} memory allocation.
\cvdefCpp{static bool CudaMem::canMapHostMemory();}
\cvclass{gpu::Stream}
-This class is a queue class used for asynchronous calls. Some functions have overloads with additional \cvCppCross{gpu::Stream} parameter. The overloads do initialization work (allocate output buffers, upload constants, etc.), start GPU kernel and return before results are ready. A check if all operation are complete can be performed via \cvCppCross{gpu::Stream::queryIfComplete()}. Asynchronous upload/download have to be performed from/to page-locked buffers, i.e. using \cvCppCross{gpu::CudaMem} or \cvCppCross{Mat} header that points to a region of \cvCppCross{gpu::CudaMem}.
+This class encapsulates a queue of asynchronous calls. Some functions have overloads with an additional \cvCppCross{gpu::Stream} parameter. The overloads do initialization work (allocate output buffers, upload constants, etc.), start the GPU kernel and return before the results are ready. A check whether all operations are complete can be performed via \cvCppCross{gpu::Stream::queryIfComplete()}. Asynchronous upload/download has to be performed from/to page-locked buffers, i.e. using \cvCppCross{gpu::CudaMem} or a \cvCppCross{Mat} header that points to a region of \cvCppCross{gpu::CudaMem}.
\textbf{Please note the limitation}: currently it is not guaranteed that everything will work properly if one operation is enqueued twice with different data. Some functions use constant GPU memory, and the next call may update that memory before the previous call has finished. But calling different operations asynchronously is safe, because each operation has its own constant buffer. Memory copy/upload/download/set operations to buffers held by the user are also safe.
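A minimal sketch of the intended pattern, assuming \texttt{enqueueUpload}/\texttt{enqueueDownload} members (the asynchronous counterparts of \texttt{upload}/\texttt{download}); a sketch, not a complete program:

\begin{lstlisting}
cv::gpu::CudaMem src(480, 640, CV_8UC1,
                     cv::gpu::CudaMem::ALLOC_PAGE_LOCKED);
cv::gpu::CudaMem dst(480, 640, CV_8UC1,
                     cv::gpu::CudaMem::ALLOC_PAGE_LOCKED);
cv::gpu::GpuMat dev_src, dev_dst;
cv::gpu::Stream stream;

stream.enqueueUpload(src, dev_src);    // returns immediately
// ... enqueue asynchronous GPU operations: dev_src -> dev_dst ...
stream.enqueueDownload(dev_dst, dst);  // returns immediately

// Overlap CPU work here, then synchronize.
stream.waitForCompletion();
\end{lstlisting}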
@ -217,7 +222,7 @@ public:
\end{lstlisting}

\cvCppFunc{gpu::Stream::queryIfComplete}
-Returns true if current stream queue is finished, otherwise false.
+Returns true if the current stream queue is finished, otherwise false.
\cvdefCpp{bool Stream::queryIfComplete();}
\cvCppFunc{gpu::Stream::waitForCompletion}
@ -227,7 +232,7 @@ Blocks until all operations in the stream are complete.

\cvclass{gpu::StreamAccessor}
-This class provides possibility to get \texttt{cudaStream\_t} from \cvCppCross{gpu::Stream}. This class is declared in \texttt{stream\_accessor.hpp} because this is only public header that depend on Cuda Runtime API. Including it will bring the dependency to your code.
+This class provides a way to obtain \texttt{cudaStream\_t} from \cvCppCross{gpu::Stream}. It is declared in \texttt{stream\_accessor.hpp} because that is the only public header that depends on the Cuda Runtime API. Including it will bring the dependency into your code.
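For illustration, a usage sketch; the include path and the static \texttt{getStream} member are assumptions based on the declaration below:

\begin{lstlisting}
#include "opencv2/gpu/stream_accessor.hpp"

cv::gpu::Stream stream;
// Raw CUDA stream, e.g. for launching custom kernels:
cudaStream_t s = cv::gpu::StreamAccessor::getStream(stream);
\end{lstlisting}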
\begin{lstlisting}
struct StreamAccessor
@ -8,7 +8,7 @@ Returns number of CUDA-enabled devices installed. It is to be used before any ot
\cvCppFunc{gpu::setDevice}
-Sets device and initializes it for current thread. Call of this function can be omitted, but in this case a default device will be initialized on fist GPU usage.
+Sets a device and initializes it for the current thread. The call of this function can be omitted, but in this case a default device will be initialized on first GPU usage.
\cvdefCpp{void setDevice(int device);}
\begin{description}
@ -17,13 +17,13 @@ Sets device and initializes it for current thread. Call of this function can be
\cvCppFunc{gpu::getDevice}
-Returns current device index, which was set by \cvCppCross{gpu::getDevice} of initialized by default.
+Returns the current device index, which was set by \cvCppCross{gpu::setDevice} or initialized by default.
\cvdefCpp{int getDevice();}
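A short sketch of explicit device selection (device index 0 is an assumption for a single-GPU machine):

\begin{lstlisting}
if (cv::gpu::getCudaEnabledDeviceCount() > 0)
{
    cv::gpu::setDevice(0);              // initialize the first device
    int current = cv::gpu::getDevice(); // current == 0
}
\end{lstlisting}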
\cvCppFunc{gpu::getComputeCapability}
-Returns compute capability version for given device.
+Returns compute capability version for the given device.
\cvdefCpp{void getComputeCapability(int device, int\& major, int\& minor);}
\begin{description}
@ -42,7 +42,7 @@ Returns number of Streaming Multiprocessors for given device.
\cvCppFunc{gpu::getGpuMemInfo}
-Returns free and total memory for the current device.
+Returns free and total memory size for the current device.
\cvdefCpp{void getGpuMemInfo(size\_t\& free, size\_t\& total);}
\begin{description}
@ -6,11 +6,11 @@ The OpenCV GPU module is a set of classes and functions to utilize GPU computati
The GPU module is designed as a host-level API, i.e. if a user has precompiled OpenCV GPU binaries, it is not necessary to have the Cuda Toolkit installed or to deal with code that executes on the GPU. An additional advantage is that with the binaries, users can use any compiler for any platform. But probably a device-layer API will be introduced in the future to provide more agility and performance in the internal GPU module implementation and more functionality for users.
-External dependencies of the module are only libraries included in Cuda Toolkit and NVidia Performance Primitives library (NPP). These can be downloaded from NVidia site for all supported platforms. Only comparability with the latest Cuda Toolkit and NPP is provided for trunk OpenCV version and we switch to each new release very fast. So please keep it up to date. OpenCV GPU code can be compiled only on such platforms where Cuda Runtime Toolkit is supported by NVidia.
+The only external dependencies of the module are the libraries included in the Cuda Toolkit and the NVidia Performance Primitives library (NPP). These libraries can be downloaded from the NVidia site for all supported platforms. Only compatibility with the latest Cuda Toolkit and NPP is provided for the trunk OpenCV version, and we switch to each new release very fast, so please keep them up to date. OpenCV GPU code can be compiled only on platforms where the Cuda Runtime Toolkit is supported by NVidia.
-OpenCV GPU module is designed to make its usage as easier as it possible. It can be used without any knowledge about Cuda. But for advanced programming and extremely optimization it is highly recommended to learn principles of programming and optimization for GPU. This is helpful because of understanding how much each operation cost, what it does, and how is it better to call. In this case GPU module became an effective instrument of development computer vision algorithms for GPU on prototyping stage and when hard optimization is in process.
+The OpenCV GPU module is designed to make its usage as easy as possible. It can be used without any knowledge of Cuda. But for advanced programming and extreme optimization it is highly recommended to learn the principles of programming and optimization for the GPU. This helps in understanding how much each operation costs, what it does, and how it is best called. In this case the GPU module becomes an effective instrument for developing computer vision algorithms for the GPU at the prototyping stage and when heavy optimization is in progress.
-The OpenCV can be compiled with enabled and disabled \texttt{WITH\_CUDA} flag in CMake. Building with the flag set will force compilation of device code from GPU module and requires dependences above installed. If OpenCV is compiled without the flag, GPU module will also be built, but all functions from it will throw \cvCppCross{Exception} with \texttt{CV\_GpuNotSupported} error code, except \cvCppCross{gpu::getCudaEnabledDeviceCount()}. The last function will return zero GPU count in this case. Building OpenCV without CUDA does not perform device code compilation, so it does not require Cuda Toolkit installed and supported by NVidia compiler. Also such behavior makes it possible to develop in future smart enough algorithms for OpenCV, that can decide itself weather it is reasonable to call GPU or do their work in CPU or use both. Thereby disabling \texttt{WITH\_CUDA} flag will force using only CPU. The mechanism can be used also by OpenCV users in their applications to enable or disable GPU support.
+OpenCV can be compiled with the \texttt{WITH\_CUDA} flag in CMake enabled or disabled. Building with the flag set will force compilation of device code from the GPU module and requires the dependencies above to be installed. If OpenCV is compiled without the flag, the GPU module will also be built, but all its functions will throw \cvCppCross{Exception} with the \texttt{CV\_GpuNotSupported} error code, except \cvCppCross{gpu::getCudaEnabledDeviceCount()}. The last function will return a zero GPU count in this case. Building OpenCV without CUDA does not perform device code compilation, so it requires neither the Cuda Toolkit nor a platform supported by the NVidia compiler. Such behavior also makes it possible to develop, in the future, algorithms for OpenCV that are smart enough to decide themselves whether it is reasonable to call the GPU, do the work on the CPU, or use both. Thereby, disabling the \texttt{WITH\_CUDA} flag forces using only the CPU. The mechanism can also be used by OpenCV users in their applications to enable or disable GPU support.
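For illustration, a minimal sketch of the runtime guard this behavior enables in user code:

\begin{lstlisting}
// Returns 0 if OpenCV was built without CUDA support
// or if no CUDA-enabled device is present.
if (cv::gpu::getCudaEnabledDeviceCount() > 0)
{
    // safe to call GPU module functions
}
else
{
    // CPU-only code path
}
\end{lstlisting}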
\subsection{Compilation for different NVidia platforms.}
@ -20,34 +20,34 @@ On first GPU call run PTX code is passed to Just In Time (JIT) compilation for c
By default the following images are linked to the GPU module library:
\begin{itemize}
-\item Binaries for compute capabilities 1.3 and 2.0 (controlled by \texttt{CUDA\_ARCH\_GPU} in CMake)
+\item Binaries for compute capabilities 1.3 and 2.0 (controlled by \texttt{CUDA\_ARCH\_BIN} in CMake)
\item PTX code for compute capabilities 1.1 and 1.3 (controlled by \texttt{CUDA\_ARCH\_PTX} in CMake)
\end{itemize}
That means that for devices with CC 1.3 and 2.0, binary images are ready to run. For all newer platforms the PTX code for 1.3 is JIT'ed to a binary image. For devices with CC 1.1 and 1.2 the PTX for 1.1 is JIT'ed. For devices with CC 1.0 no code is present, and execution will fail at some point with \cvCppCross{Exception}. For platforms where JIT compilation is performed, the first run will be slow.
-Devices with compute capability 1.0 are supported by most of GPU functionality now. There are only a couple things that can’t run on it. They are guarded with asserts. But in future the number will raise, because of CC 1.0 support requires writing special implementation for it. We decided not to spend time for old platform support.
+Devices with compute capability 1.0 are supported by most of the GPU functionality now (just compile the library with the corresponding settings). There are only a couple of things that cannot run on them; these are guarded with asserts. But in the future their number will grow, because CC 1.0 support requires writing special implementations, and it was decided not to spend time on old platform support.
-Because of OpenCV can be compiled not for all architectures, there can be binary incompatibility between GPU and code linked to OpenCV. In this case unclear error is returned in arbitrary place. But there is a way to check for what platforms OpenCV GPU was built using \cvCppCross{gpu::isCompatibleWith} function.
+Because OpenCV may not be compiled for all architectures, there can be a binary incompatibility between the GPU and the code linked into OpenCV. In this case an unclear error is returned at an arbitrary place. But there is a way to check whether the module was built to be able to run on the given device, using the \cvCppCross{gpu::isCompatibleWith} function.
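A hedged sketch of such a check; the assumption that the function takes a device index should be verified against the function reference:

\begin{lstlisting}
int device = cv::gpu::getDevice();
if (!cv::gpu::isCompatibleWith(device))
{
    // The library contains no binary/PTX code runnable on
    // this device: fall back to the CPU implementation.
}
\end{lstlisting}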
\subsection{Threading and multi-threading.}
-Because GPU module is written using Cuda Runtime API, it derives from the API all practices and rules to work with threads. So on first the API call a Cuda context is created implicitly, attached and made current for the calling thread and. All farther operations, such as memory allocation, GPU kernels loads and compilation, will be associated with the context and the thread. Because another thread is not attached to the context, memory allocations done in first thread are not valid for it. For second thread another context will be created on first Cuda call. So by default different threads do not share resources.
+Because the GPU module is written using the Cuda Runtime API, it derives from the API all practices and rules for working with threads. So on the first API call a Cuda context is created implicitly, attached, and made current for the calling thread. All further operations, such as memory allocation, GPU kernel loading and compilation, will be associated with that context and thread. Because another thread is not attached to the context, memory allocations done in the first thread are not valid for it. For the second thread another context will be created on its first Cuda call. So by default different threads do not share resources.
But this limitation can be removed by using the Cuda Driver API (\textbf{Warning!} Interoperability between the Cuda Driver and Runtime APIs is supported only in Cuda Toolkit 3.1 and later). The Driver API allows retrieving a context reference and attaching it to another thread. In this case, if the context was created with the shared access policy, both threads can use the same resources. The shared access policy is now the default for implicit context creation.
-Also here is possible in Cuda Driver API to create context explicitly before first Cuda runtime call, and make it current for all necessary threads. Cuda Runtime API (and OpenCV functions respectively) will pick up it.
+It is also possible in the Cuda Driver API to create a context explicitly, before the first Cuda runtime call, and make it current for all necessary threads. The Cuda Runtime API (and OpenCV functions, respectively) will pick it up.
Maybe in the future the tricks above will be wrapped by OpenCV GPU utility functions (this is also necessary for multi-GPU modes).
\subsection{Multi-GPU}
-At current stage all OpenCV GPU algorithms are single GPU algorithms. So to utilize multiple GPUs users have to manually parallelize work between GPUs. Multi-GPU practices is also derived from Cuda APIs, so for detailed information please read Cuda documentation. Here is 2 ways to use several GPU:
+At the current stage all OpenCV GPU algorithms are single-GPU algorithms. So to utilize multiple GPUs, users have to manually parallelize the work between GPUs. Multi-GPU practice is also derived from the Cuda APIs, so for detailed information please read the Cuda documentation. Here are two ways to use several GPUs:
\begin{itemize}
-\item In case of using only synchronous functions, several threads for each GPU are created and for each thread CUDA context is initialized (explicitly by Driver API or by calling \newline \cvCppCross{cv::gpu::setDevice()}, cudaSetDevice) that is associated with the corresponding GPU (CUDA context is always associated only with one GPU). Now each thread can workload its own GPU.
+\item In case of using only synchronous functions: one thread is created per GPU, and for each thread a CUDA context is initialized (explicitly via the Driver API, or by calling \newline \cvCppCross{gpu::setDevice()} or cudaSetDevice) that is associated with the corresponding GPU (a CUDA context is always associated with exactly one GPU). Each thread can then put workload on its own GPU, as shown in the sketch after this list.
\item In case of asynchronous functions, it is possible to create several Cuda contexts associated with different GPUs but attached to one thread. This can be done only via the Driver API. Switching between devices is then done by making the corresponding context current for the thread. With non-blocking GPU calls the managing algorithm is clear.
\end{itemize}
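The sketch referenced in the first item; \texttt{gpuWorker} is a hypothetical user function, launched in its own thread per GPU by whatever threading library is at hand:

\begin{lstlisting}
// One invocation per GPU, each in its own thread.
void gpuWorker(int device, const cv::Mat& input, cv::Mat& output)
{
    cv::gpu::setDevice(device);  // bind this thread to one GPU

    cv::gpu::GpuMat dev_in, dev_out;
    dev_in.upload(input);
    // ... run the GPU part of the algorithm: dev_in -> dev_out ...
    dev_out.download(output);
}
\end{lstlisting}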
-While developing algorithms for multiple GPUs a data passing overhead have to be taken into consideration. For primitive functions and for small images it can be significant and this stops the idea to use several GPU. But for some high level algorithms Multi-GPU acceleration is suitable. For example, we have done parallelization of Stereo Block matching by dividing stereo pair on two parts horizontally with overlapping, processing each part on separate Fermi GPU, next download and merge resulting disparity. Performance for two GPU is about 180\%. As conclusion, may be in future Cuda context managing functions will be wrapped in GPU module and some multi-GPU high level algorithms be implemented. But now user has to do this manually.
+While developing algorithms for multiple GPUs, the data passing overhead has to be taken into consideration. For primitive functions and small images it can be significant, which defeats the idea of using several GPUs. But for some high-level algorithms multi-GPU acceleration is suitable. For example, we have parallelized Stereo Block Matching by dividing the stereo pair into two horizontal parts with overlap, processing each part on a separate Fermi GPU, and then downloading and merging the resulting disparity. The performance for two GPUs is about 180\%. In conclusion, maybe in the future the Cuda context management functions will be wrapped in the GPU module and some multi-GPU high-level algorithms will be implemented, but for now the user has to do this manually.
BIN doc/opencv.pdf (binary file not shown)