opencv/modules/dnn/src/op_cuda.cpp
alexlyulkov 1d1faaabef
Merge pull request #24411 from alexlyulkov:al/dnn-type-inference
Added int32, int64 support and type inference to dnn #24411

**Added type inference to dnn, similar to the existing shape inference, along with int32 and int64 support.**

- Added a getTypes method for layers that computes the types of a layer's outputs and internals from its input types (similar to getMemoryShapes). By default, the output and internal types equal the type of input[0].
- Added a type inference pipeline similar to the shape inference pipeline. The LayersShapes struct (used in the shape inference pipeline) now contains both shapes and types.
- All layer output blobs are now allocated using the types computed by type inference.
- Inputs and constants with int32 and int64 types are no longer automatically converted to float32.
- Added int32 and int64 support for all layers that perform indexing and for all layers required by the tests.

Added int32 and int64 support for CUDA:
- Added host<->device data transfer for int32 and int64
- Added int32 and int64 support for several layers (only slightly modified the CUDA C++ templates)

Passed all the accuracy tests on CPU, OCL, OCL_FP16, CUDA and CUDA_FP16 (except the RAFT model).

**CURRENT PROBLEMS**:
- The ONNX parser always converts int64 constants and layer attributes to int32, so some models with int64 constants don't work (e.g. RAFT). The solution is to disable the int64->int32 conversion and fix attribute reading in many ONNX layer parsers (https://github.com/opencv/opencv/issues/25102)
- I didn't add type inference and int support to the Vulkan backend, so it doesn't work at all now.
- Some layers don't support int yet, so some untested models may not work.

**CURRENT WORKAROUNDS**:
- CPU arg_layer indices are computed in int32 followed by an int32->int64 conversion (the master branch has the same workaround with an int32->float conversion)
- CPU and OCL pooling_layer indices are computed in float followed by a float->int64 conversion
- CPU gather_layer indices are computed in int32, so int64 indices are converted to int32 (the master branch has the same workaround with a float->int32 conversion)

**DISABLED TESTS**:
- RAFT model

**REMOVED TESTS**:
- Greater_input_dtype_int64 (it violates ONNX type rules: the whole test just compares a float tensor with an int constant)

**TODO IN NEXT PULL REQUESTS**:
- Add int64 support for ONNX parser
- Add int support for more layers
- Add int support for OCL (currently int layers just run on CPU)
- Add int tests
- Add int support for other backends
2024-03-01 17:07:38 +03:00


// This file is part of OpenCV project.
// It is subject to the license terms in the LICENSE file found in the top-level directory
// of this distribution and at http://opencv.org/license.html.
#include "precomp.hpp"
#ifdef HAVE_CUDA
#include "op_cuda.hpp"
#include "cuda4dnn/init.hpp"
#include "net_impl.hpp"
namespace cv { namespace dnn {
CV__DNN_INLINE_NS_BEGIN
void Net::Impl::initCUDABackend(const std::vector<LayerPin>& blobsToKeep_)
{
    CV_Assert(preferableBackend == DNN_BACKEND_CUDA);

    if (!cudaInfo) /* we need to check only once */
        cuda4dnn::checkVersions();

    if (cuda4dnn::getDeviceCount() <= 0)
        CV_Error(Error::StsError, "No CUDA capable device found.");

    if (cuda4dnn::getDevice() < 0)
        CV_Error(Error::StsError, "No CUDA capable device selected.");

    if (!cuda4dnn::isDeviceCompatible())
        CV_Error(Error::GpuNotSupported, "OpenCV was not built to work with the selected device. Please check CUDA_ARCH_PTX or CUDA_ARCH_BIN in your build configuration.");

    if (preferableTarget == DNN_TARGET_CUDA_FP16 && !cuda4dnn::doesDeviceSupportFP16())
    {
        CV_LOG_WARNING(NULL, "The selected CUDA device does not support FP16 target; switching to FP32 target.");
        preferableTarget = DNN_TARGET_CUDA;
    }

    if (!cudaInfo)
    {
        cuda4dnn::csl::CSLContext context;
        context.stream = cuda4dnn::csl::Stream(true);
        context.cublas_handle = cuda4dnn::csl::cublas::Handle(context.stream);
        context.cudnn_handle = cuda4dnn::csl::cudnn::Handle(context.stream);

        auto d2h_stream = cuda4dnn::csl::Stream(true); // stream for background D2H data transfers
        cudaInfo = std::unique_ptr<CudaInfo_t>(new CudaInfo_t(std::move(context), std::move(d2h_stream)));
    }

    cudaInfo->workspace = cuda4dnn::csl::Workspace(); // release workspace memory if any

    for (auto& layer : layers)
    {
        auto& ld = layer.second;
        if (ld.id == 0 && netInputLayer->supportBackend(preferableBackend))
        {
            for (auto& wrapper : ld.inputBlobsWrappers)
            {
                auto cudaWrapper = wrapper.dynamicCast<CUDABackendWrapper>();
                cudaWrapper->setStream(cudaInfo->context.stream, cudaInfo->d2h_stream);
            }
        }

        for (auto& wrapper : ld.outputBlobsWrappers)
        {
            auto cudaWrapper = wrapper.dynamicCast<CUDABackendWrapper>();
            cudaWrapper->setStream(cudaInfo->context.stream, cudaInfo->d2h_stream);
        }
    }

    for (auto& layer : layers)
    {
        auto& ld = layer.second;
        auto& layerInstance = ld.layerInstance;

        if (!layerInstance->supportBackend(DNN_BACKEND_CUDA))
        {
            std::ostringstream os;
            os << "CUDA backend will fallback to the CPU implementation for the layer \"" << ld.name
               << "\" of type " << ld.type << '\n';
            CV_LOG_INFO(NULL, os.str().c_str());
            continue;
        }

        /* we make a copy so that `initCUDA` doesn't modify `cudaInfo->context` */
        auto context = cudaInfo->context;
        auto node = layerInstance->initCUDA(&context, ld.inputBlobsWrappers, ld.outputBlobsWrappers);
        ld.backendNodes[DNN_BACKEND_CUDA] = node;

        if (!node.empty())
        {
            auto cudaNode = node.dynamicCast<CUDABackendNode>();
            cudaInfo->workspace.require(cudaNode->get_workspace_memory_in_bytes());
        }
    }

    if (blobsToKeep_.size() > 1)
    {
        for (const auto& pin : blobsToKeep_)
        {
            LayerData& ld = layers[pin.lid];
            ld.cudaD2HBackgroundTransfers.push_back(pin.oid);
        }
    }
}
CV__DNN_INLINE_NS_END
}} // namespace cv::dnn
#endif // HAVE_CUDA