Previously, only the default stream, cv::cuda::Stream::Null(), could be
used.
With this change, HOG::compute() can run in parallel over different
cuda::Stream instances.
The code has been reordered so that all data allocation is completed
first, and only then are the kernels launched in parallel over streams.
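
The "allocate first, then launch" ordering can be illustrated with a
minimal host-side sketch. This is not the HOG code from this change:
it is an analogy in which std::async tasks stand in for CUDA streams,
and fake_kernel and run_two_phase are hypothetical names introduced
only for illustration.

```cpp
#include <cstddef>
#include <functional>
#include <future>
#include <vector>

// Hypothetical stand-in for one kernel launch. It only writes into a
// buffer that was allocated beforehand, so it never allocates itself.
void fake_kernel(const std::vector<float>& input, std::vector<float>& output)
{
    for (std::size_t i = 0; i < input.size(); ++i)
        output[i] = input[i] * 2.0f;
}

// Hypothetical driver showing the ordering described above.
std::vector<std::vector<float>>
run_two_phase(const std::vector<std::vector<float>>& inputs)
{
    // Phase 1: complete every allocation before any work is launched.
    std::vector<std::vector<float>> outputs(inputs.size());
    for (std::size_t i = 0; i < inputs.size(); ++i)
        outputs[i].resize(inputs[i].size());

    // Phase 2: launch all work items concurrently. Each task plays the
    // role of one stream and touches only its own input/output pair.
    std::vector<std::future<void>> pending;
    pending.reserve(inputs.size());
    for (std::size_t i = 0; i < inputs.size(); ++i)
        pending.push_back(std::async(std::launch::async, fake_kernel,
                                     std::cref(inputs[i]),
                                     std::ref(outputs[i])));

    // Phase 3: wait for every task, analogous to synchronizing streams.
    for (auto& task : pending)
        task.get();

    return outputs;
}
```

Calling run_two_phase with two input buffers returns two output
buffers of matching sizes with every element doubled. Because no
allocation happens once the tasks are running, the tasks share no
state and can overlap freely, which is the property the reordering is
meant to provide for the kernels.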
Fixes #8177