See https://github.com/Itseez/opencv/issues/5721
COMMENTS:
* The second __syncthreads() is necessary, I am sure of that.
* The code works without the first __syncthreads() too, but I have however added it for symmetry. Anyway it doesn't affect time performances, I have checked it with some profiling with nvvp