* Vectorize calculating integral for line for single and multiple channels
* Single vector processing for 4-channels - 25-30% faster
* Single vector processing for 4-channels - 25-30% faster
* Fixed AVX512 code for 4 channels
* Disable 3 channel 8UC1 to 32S for SSE2 and SSE3 (slower). Use new version of 8UC1 to 64F for AVX512.
fixed the ordering of contour convex hull points
* partially fixed the issue #4539
* fixed warnings and test failures
* fixed integer overflow (issue #14521)
* added comment to force buildbot to re-run
* extended the test for the issue 4539. Check the expected behaviour on the original contour as well
* added comment; fixed typo, renamed another variable for a little better clarity
* added yet another part to the test for issue #4539, where we run convexHull and convexityDetects on the original contour, without any manipulations. the rest of the test stays the same
* fixed several problems when running tests on Mac:
* OCL_pyrUp
* OCL_flip
* some basic UMat tests
* histogram badarg test (out of range access)
* retained the storepix fix in ocl_flip only for 16U/16S datatype, where the OpenCL compiler on Mac generates incorrect code
* moved deletion of ACCESS_FAST flag to non-SVM branch (where SVM is shared virtual memory (in OpenCL 2.x), not support vector machine)
* force OpenCL to use read/write for GPU<=>CPU memory transfers on machines with discrete video only on Macs. On Windows/Linux the drivers are seemingly smart enough to implement map/unmap properly (and maybe more efficiently than explicit read/write)
Changes:
* UMat for blur + rotate resulting in a speedup of around 2X on an i7
* support for boards larger than specified allowing to cover full FOV
* support for markers moving the origin into the center of the board
* increase detection accuracy
The main change is for supporting boards that are larger than the FOV of
the camera and have their origin in the board center. This allows
building OEM calibration targets similar to the one from intel real
sense utilizing corner points as close as possible to the image border.
cuda4dnn(concat): write outputs from previous layers directly into concat's output
* eliminate concat by directly writing to its output buffer
* fix concat fusion not happening sometimes
* use a whitelist instead of a blacklist
* G-API/Samples: Added a simple "privacy masking camera" sample
The main idea is to host this code for an opencv.org blog post only
* G-API/Samples: Modified privacy masking camera code to look better for the post
* G-API/Samples: fix Windows (MSVC) support in Privacy Masking Camera
* G-API/Samples: Addressed the majority of review comments in PMC
* G-API/Samples: Use TickMeter to measure time + more info in cmd options
* G-API/Samples: fix yet another Windows warning in PMC
* G-API/Samples: Fix wording in PMC cmd arg parameters
* Fix wording, again
* G-API/Samples: Fix PMC cmd-line arguments, again
G-API: Using functors as kernel implementation
* Implement ability to create kernel impls from functors
* Clean up
* Replace make_ocv_functor to ocv_kernel
* Clean up
* Replace GCPUFunctor -> GOCVFunctor
* Move GOCVFunctor to cv::gapi::cpu namespace
* Implement override for rvalue and lvalue cases
* Fix comments to review
* Remove GAPI_EXPORT for template functions
* Fix indentation
fixed cv::moveWindow() on mac
* fixed cv::moveWindow() on mac (issue #16343). Thanks to cwreynolds and saskatchewancatch for the help!
* fixed warnings about _x0 and _y0
* fixed warnings about _x0 and _y0
* Fix NN resize with dimentions > 4
* add test check for nn resize with channels > 4
* Change types from float to double
* Del unnecessary test file. Move nn test to test_imgwarp. Add 5 channels test only.
* improved version of HoughCircles (HOUGH_GRADIENT_ALT method)
* trying to fix build problems on Windows
* fixed typo
* * fixed warnings on Windows
* make use of param2. make it minCos2 (minimal value of squared cosine between the gradient at the pixel edge and the vector connecting it with circle center). with minCos2=0.85 we can detect some more eyes :)
* * added description of HOUGH_GRADIENT_ALT
* cleaned up the implementation; added comments, replaced built-in numeic constants with symbolic constants
* rewrote circle_popcount() to use built-in popcount() if possible
* modified some of HoughCircles tests to use method parameter instead of the built-in loop
* fixed warnings on Windows
trying to fix handling file storages with extremely long lines
* trying to fix handling of file storages with extremely long lines: https://github.com/opencv/opencv/issues/11061
* * fixed errorneous pointer access in JSON parser.
* it's now crash-test time! temporarily set the initial parser buffer size to just 40 bytes. let's run all the test and check if the buffer is always correctly resized and handled
* fixed pointer use in JSON parser; added the proper test to catch this case
* fixed the test to make it more challenging. generate test json with
*
**
***
etc. shape
Fix compilation errors on GLES platforms
* Do not include glx.h when using GLES
GL/glx.h is included on all LINUX plattforms, which is wrong
for a number of reasons:
- GL_PERSPECTIVE_CORRECTION_HINT is defined in GL/gl.h, so we
want gl.h not glx.h, the latter just includes the former
- GL/gl.h is a Desktop GL header, and should not be included
on GLES plattforms
- GL/gl.h is already included via QtOpenGL ->
QtGui/qopengl.h on desktop plattforms
This fixes a problem when Qt is compiled with GLES, which
is often done on ARM platforms where desktop GL is not or
only poorly supported (e.g. slow due to emulation).
Fixes part of #9171.
* Only set GL_PERSPECTIVE_CORRECTION_HINT when GL version defines it
GL_PERSPECTIVE_CORRECTION_HINT does not exist in GLES 2.0/3.x,
and has been deprecated in OpenGL 3.0 core profiles.
Fixes part of #9171.
This is a correction of the previously missleading documentation and a warning related to a common calibration failure described in issue 15992
* corrected incorrect description of failed calibration state.
see issue 15992
* calib3d: apply suggestions from code review by catree
QR-Code detector : multiple detection
* change in qr-codes detection
* change in qr-codes detection
* change in test
* change in test
* add multiple detection
* multiple detection
* multiple detect
* add parallel implementation
* add functional for performance tests
* change in test
* add perftest
* returned implementation for 1 qr-code, added support for vector<Mat> and vector<vector<Point2f>> in MultipleDetectAndDecode
* deleted all lambda expressions
* changing in triangle sort
* fixed warnings
* fixed errors
* add java and python tests
* change in java tests
* change in java and python tests
* change in perf test
* change in qrcode.cpp
* add spaces
* change in qrcode.cpp
* change in qrcode.cpp
* change in qrcode.cpp
* change in java tests
* change in java tests
* solved problems
* solved problems
* change in java and python tests
* change in python tests
* change in python tests
* change in python tests
* change in methods name
* deleted sample qrcode_multi, change in qrcode.cpp
* change in perf tests
* change in objdetect.hpp
* deleted code duplication in sample qrcode.cpp
* returned spaces
* added spaces
* deleted draw function
* change in qrcode.cpp
* change in qrcode.cpp
* deleted all draw functions
* objdetect(QR): extractVerticalLines
* objdetect(QR): whitespaces
* objdetect(QR): simplify operations, avoid duplicated code
* change in interface, additional checks in java and python tests, added new key in sample for saving original image from camera
* fix warnings and errors in python test
* fix
* write in file with space key
* solved error with empty mat check in python test
* correct path to test image
* deleted spaces
* solved error with check empty mat in python tests
* added check of empty vector of points
* samples: rework qrcode.cpp
* objdetect(QR): fix API, input parameters must be first
* objdetect(QR): test/fix points layout
* Reduce LLC loads, stores and multiplies on MulTransposed - 8% faster on VSX
* Add is_same method so c++11 is not required
* Remove trailing whitespaces.
* Change is_same to DataType depth check
Added type check for solvePnPGeneric | Issue: #16049
* Added type check
* Added checks before type fix
* Tests for 16049
* calib3d: update solvePnP regression check (16049)
- Added `explicit` to `VideoCapture` constructors with 2
arguments, 1 of them has default value
- Applied library code style
- Introduced 2 debug macros to improve readability of the code
Vectorize minMaxIdx functions
* Updated documentation and intrinsic tests for v_reduce
* Add other files back in from the forced push
* Prevent an constant overflow with v_reduce for int8 type
* Another alternative to fix constant overflow warning.
* Fix another compiler warning.
* Update comments and change comparison form to be consistent with other vectorized loops.
* Change return type of v_reduce_min & max for v_uint8 and v_uint16 to be same as lane type.
* Cast v_reduce functions to int to avoid overflow. Reduce number of parameters in MINMAXIDX_REDUCE macro.
* Restore cast type for v_reduce_min & max to LaneType
support eltwise sum with different number of input channels in CUDA backend
* add shortcut primitive
* add offsets in shortcut kernel
* skip tests involving more than two inputs
* remove redundant modulus operation
* support multiple inputs
* remove whole file indentation
* skip acc in0 trunc test if weighted
* use shortcut iff channels are unequal
Enable cuda4dnn on hardware without support for __half
* Enable cuda4dnn on hardware without support for half (ie. compute capability < 5.3)
Update CMakeLists.txt
Lowered minimum CC to 3.0
* UPD: added ifdef on new copy kernel
* added fp16 support detection at runtime
* Clarified #if condition on atomicAdd definition
* More explicit CMake error message
Fix implicit conversion from array to scalar in python bindings
* Fix wrong conversion behavior for primitive types
- Introduce ArgTypeInfo namedtuple instead of plain tuple.
If strict conversion parameter for type is set to true, it is
handled like object argument in PyArg_ParseTupleAndKeywords and
converted to concrete type with the appropriate pyopencv_to function
call.
- Remove deadcode and unused variables.
- Fix implicit conversion from numpy array with 1 element to scalar
- Fix narrowing conversion to size_t type.
* Fix wrong conversion behavior for primitive types
- Introduce ArgTypeInfo namedtuple instead of plain tuple.
If strict conversion parameter for type is set to true, it is
handled like object argument in PyArg_ParseTupleAndKeywords and
converted to concrete type with the appropriate pyopencv_to function
call.
- Remove deadcode and unused variables.
- Fix implicit conversion from numpy array with 1 element to scalar
- Fix narrowing conversion to size_t type.·
- Enable tests with wrong conversion behavior
- Restrict passing None as value
- Restrict bool to integer/floating types conversion
* Add PyIntType support for Python 2
* Remove possible narrowing conversion of size_t
* Bindings conversion update
- Remove unused macro
- Add better conversion for types to numpy types descriptors
- Add argument name to fail messages
- NoneType treated as a valid argument. Better handling will be added
as a standalone patch
* Add descriptor specialization for size_t
* Add check for signed to unsigned integer conversion safety
- If signed integer is positive it can be safely converted
to unsigned
- Add check for plain python 2 objects
- Add check for numpy scalars
- Add simple type_traits implementation for better code style
* Resolve type "overflow" false negative in safe casting check
- Move type_traits to separate header
* Add copyright message to type_traits.hpp
* Limit conversion scope for integral numpy types
- Made canBeSafelyCasted specialized only for size_t, so
type_traits header became unused and was removed.
- Added clarification about descriptor pointer
Add lightweight IE hardware targets checks
nGraph: Concat with paddings
Enable more nGraph tests
Restore FP32->FP16 for GPU plugin of IE
try to fix buildbot
Use lightweight IE targets check only starts from R4
- some of `icvCvt_BGR*` functions have R with B channels
swapped what leads to the wrong conversion
- renames misleading `rgb` variable name to `bgr`
- swap back the conversion coefficients, `cB` should be the first
Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
Actually, we can do this in constant time. xofs always
contains same or increasing offset values. We can instead
find the most extreme value used and never attempt to load it.
Similarly, we can note for all dx >= 0 and dx < (dwidth - cn)
where xofs[dx] + cn < xofs[dwidth-cn] implies dx < (dwidth - cn).
Thus, we can use this to control our loop termination optimally.
This fixes#16137 with little or no performance impact. I have
also added a debug check as a sanity check.
* add cv::compare test when Mat type == CV_16F
* add assertion in cv::compare when src.depth() == CV_16F
* cv::compare assertion minor fix
* core: add more checks
* enable tests for DNN_TARGET_CUDA_FP16
* disable deconvolution tests
* disable shortcut tests
* fix typos and some minor changes
* dnn(test): skip CUDA FP16 test too (run_pool_max)
* Handle det == 0 in findCircle3pts.
Issue 16051 shows a case where findCircle3pts returns NaN for the
center coordinates and radius due to dividing by a determinant of 0. In
this case, the points are colinear, so the longest distance between any
2 points is the diameter of the minimum enclosing circle.
* imgproc(test): update test checks for minEnclosingCircle()
* imgproc: fix handling of special cases in minEnclosingCircle()
G-API: Tutorial: Face beautification algorithm implementation
* Introduce a tutorial on face beautification algorithm
- small typo issue in render_ocv.cpp
* Addressing comments rgarnov smirnov-alexey
* Eltwise::DIV support in Halide backend
* fix typo
* remove div from generated test suite to pass CI, switching to manual test...
* ensure divisor not near to zero
* use randu
* dnn(test): update test data for Eltwise.Accuracy/DIV layer test
Add checks for empty operands in Matrix expressions that don't check properly
* Starting to add checks for empty operands in Matrix expressions that
don't check properly.
* Adding checks and delcarations for checker functions
* Fix signatures and add checks for each class of Matrix Expr operation
* Make it catch the right exception
* Don't expose helper functions to public API
* G-API: Added G-API Overview slides & its source code
- Sample code snippets are moved to separate files;
- Introduced a separate benchmark to measure Fluid/OpenCV
performance;
- Added notes on API changes (it is still a 4.0, not a 4.2 talk!)
- Added a "Metropolis" beamer download-n-build script.
* G-API: Addressed review issues on G-API overview slides
* imgproc: Prevent 1B overrun of 8C3 SIMD optimization
The fourth value read via v_load_q is essentially ignored,
but can cause trouble if it happens to cross page boundaries.
The final few iterations may attempt to read the most extreme
elements of S, which will read 1B beyond the array in most
aligment cases. Dynamically compute the stop. This could be
hoised from the loop, but will require a more extensive change.
Likewise, cleanup the iteration increment statements to make
it more obvious they do channel count (3) elements per pass.
This should resolve#16137
* imgproc(resize): extra check
dnn(eltwise): fix handling of different number of channels
* dnn(test): reproducer for Eltwise layer issue from PR16063
* dnn(eltwise): rework support for inputs with different channels
* dnn(eltwise): get rid of finalize(), variableChannels
* dnn(eltwise): update input sorting by number of channels
- do not swap inputs if number of channels are same after truncation
* dnn(test): skip "shortcut" with batch size 2 on MYRIAD targets
G-API: Fix various issues for 4.2 release
* G-API: Fix issues reported by Coverity
- Fixed: passing values by value instead of passing by reference
* G-API: Fix redundant std::move()'s in return statements
Fixes#15903
* G-API: Added a smarter handling of Stop messages in the pipeline
- This should fix the "expected 100, got 99 frames" problem
- Fixes#15882
* G-API: Pass enum instead of GKernelPackage in Streaming test parameters
- Likely fixes#15836
* G-API: Address review issues in new bugfix comments
* G-API-NG/Docs: Added a tutorial page on interactive face detection sample
- Introduced a "--ser" option to run the pipeline serially for
benchmarking purposes
- Reorganized sample code to better fit the documentation;
- Fixed a couple of issues (mainly typos) in the public headers
* G-API-NG/Docs: Reflected meta-less compilation in new G-API tutorial
* G-API-NG/Docs: Addressed review comments on Face Analytics Pipeline example
cuda4dnn(resize): process multiple channels each iteration
* resize bilinear: process multiple chans. per iter.
* remove unused headers
* correct dispatch logic
* resize_nn: process multiple chans. per iter.
* resize: HResizeLinear reduce duplicate work
There appears to be a 2x unroll of the HResizeLinear against k,
however the k value is only incremented by 1 during the unroll. This
results in k - 1 duplicate passes when k > 1.
Likewise, the final pass may not respect the work done by the vector
loop. Start it with the offset returned by the vector op if
implemented. Note, no vector ops are implemented today.
The performance is most noticable on a linear downscale. A set of
performance tests are added to characterize this. The performance
improvement is 10-50% depending on the scaling.
* imgproc: vectorize HResizeLinear
Performance is mostly gated by the gather operations
for x inputs.
Likewise, provide a 2x unroll against k, this reduces the
number of alpha gathers by 1/2 for larger k.
While not a 4x improvement, it still performs substantially
better under P9 for a 1.4x improvement. P8 baseline is
1.05-1.10x due to reduced VSX instruction set.
For float types, this results in a more modest
1.2x improvement.
* Update U8 processing for non-bitexact linear resize
* core: hal: vsx: improve v_load_expand_q
With a little help, we can do this quickly without gprs on
all VSX enabled targets.
* resize: Fix cn == 3 step per feedback
Per feedback, ensure we don't overrun. This was caught via the
failure observed in Test_TensorFlow.inception_accuracy.
Test create custom layer in python
* check is contiguos
* Add custom layer test
* Fix test
* Remove assert
* Move assert to pyopencv dnn
* remove assert
* Add unregister
* Fix python2
* proto to bytearray
* Fix data type
* G-API: Addressed various documentation issues
- Fixed various typos and missing references;
- Added brief documentaion on G_TYPED_KERNEL and G_COMPOUND_KERNEL macros;
- Briefly described GComputationT<>;
- Briefly described G-API data objects (in a group section).
* G-API: Some clean-ups in doxygen, also a chapter on Render API
* G-API: Expose more graph compilation arguments in the documentation
* G-API: Address documentation review comments
* calib3d: use normalized input in solvePnPGeneric()
* calib3d: java regression test for solvePnPGeneric
* calib3d: python regression test for solvePnPGeneric
* core: disable invalid constructors in C API by default
- C API objects will lose their default initializers through constructors
* samples: stop using of C API
Fix cudacodec python
* Add python bindings to cudacodec.
* Allow args with CV_OUT GpuMat& or CV_OUT cuda::GpuMat& to generate python bindings that allow the argument to be an optional output in the same way as OutputArray.
* Add wrapper flag to indicate that an OutputArray is a GpuMat.
* python: drop CV_GPU, extra checks in test
* Remove "cuda::GpuMat" check rom python parser
G-API-NG/Streaming: don't require explicit metadata in compileStreaming()
* First probably working version
Hardcode gose to setSource() :)
* Pre final version of move metadata declaration from compileStreaming() to setSource().
* G-API-NG/Streaming: recovered the existing Streaming functionality
- The auto-meta test is disabling since it crashes.
- Restored .gitignore
* G-API-NG/Streaming: Made the meta-less compileStreaming() work
- Works fine even with OpenCV backend;
- Fluid doesn't support such kind of compilation so far - to be fixed
* G-API-NG/Streaming: Fix Fluid to support meta-less compilation
- Introduced a notion of metadata-sensitive passes and slightly
refactored GCompiler and GFluidBackend to support that
- Fixed a TwoVideoSourcesFail test on streaming
* Add three smoke streaming tests to gapi_streaming_tests.
All three teste run pipeline with two different input sets
1) SmokeTest_Two_Const_Mats test run pipeline with two const Mats
2) SmokeTest_One_Video_One_Const_Scalar test run pipleline with Mat(video source) and const Scalar
3) SmokeTest_One_Video_One_Const_Vector test run pipeline with Mat(video source) and const Vector
# Please enter the commit message for your changes. Lines starting
* style fix
* Some review stuff
* Some review stuff
* Added Swish and Mish activations
* Fixed whitespace errors
* Kernel implementation done
* Added function for launching kernel
* Changed type of 1.0
* Attempt to add test for Swish and Mish
* Resolving type mismatch for log
* exp from device
* Use log1pexp instead of adding 1
* Added openCL kernels
(1/4) Revert "Correct image borders and principal point computation in cv::stereoRectify"
This reverts commit 93ff1fb2f2.
(2/4) Revert "fix calib3d changes in 6836 plus some others"
This reverts commit fa42a1cfc2.
(3/4) Revert "fix compiler warning"
This reverts commit b3d55489d3.
(4/4) Revert "add test for 6836"
This reverts commit d06b8c4ea9.
Tests for argument conversion of Python bindings generator
* Tests for parsing elemental types from Python bindings
- Add positive and negative tests for int, float, double, size_t,
const char*, bool.
- Tests with wrong conversion behavior are skipped.
* Move implicit conversion of bool to integer/floating types to wrong
conversion behavior.
Fix incorrect use of std::move() in g-api perf tests
* First version
* Fix perfomace tests
Replace
c.apply(...);
with
cc = c.compile(...);
cc(...);
* Remove output meta arguments from .compile()
* Style fix
* Remove useless commented string
* Stick to common pattern : i.e. use gin() and gout() explicitly.
* Use cc(gin(...), gout(...)) in all cases.
* Fix infinite loop when trying to change state of the busy camera
- Add finite number of attempts in tryIoctl functions
10 by default.
* Introduced new flag for ioctl call to handle EBUSY
Improving VSX performance of integral function
* Adding support for vector get function on VSX datatypes so the
integral function gains a bit of performance.
* Removing get as a datatype member function and implementing a new HAL
instruction v_extract_n to get the n-th element of a vector register.
* Adding SSE/NEON/AVX intrinsics.
* Implement new HAL instruction v_broadcast_element on VSX/AVX/NEON/SSE.
* core(simd): add tests for v_extract_n/v_broadcast_element
- updated docs
- commented out code to repair compilation
- added WASM and MSA default implementations
* core(simd): fix compilation
- x86: avoid _mm256_extract_epi64/32/16/8 with MSVS 2015
- x86: _mm_extract_epi64 is 64-bit only
* cleanup
Add retrieve encoded frame to VideoCapture
* Add capacity to retrieve the encoded frame from a VideoCapture object.
* Correct raw codec and pixle format output from ffmpeg capture.
* Remove warnings from build.
* Added VideoCaptureRaw subclass.
* Include abstract base class VideoCaptureBase and rename new subclass VideoContainer as suggested by mshabunin.
* Remove using.
* Change base class name for compatibility with jave bindings generator.
* Move grab and retrieve and add override specifier
* Add setRaw and readRaw to IVideoCapture interface
-setRaw to disable video decoding and enable bitstream filters from mp4 to h254 and h265.
-readRaw to return the raw undecoded/filtered bitstream.
Add createRawCapture to initiate a backend with setRaw enabled.
Remove inheritance and use an independant VideoContainer subclass with IVideoCapture member.
* Address unused parameter warings.
Remove VideoContainer from python bindings as it no longer returns a Mat.
Use opencv type uchar instead of unsigned char.
Add missing destructor to VideoContainer class.
* Address build warnings and include all params in documentation.
* Include deprecated bitstream filtering API.
* Update codec_id query to work with older ffmpeg api's.
Change api version defines to be consistent - most recent api version first.
* Fix typo.
* Update test to work with naming of new files in the extra repo
* Investigate test failure
* Check bytes read by ffmpeg
* Removed mp4 video container test
* Applied suggested changes.
* videoio: rework API for extraction of RAW video streams
- FFmpeg only
* address review comments
Introducing the sample of Face Beautification algorithm implemented via Graph-API
* Introducing the sample of Face Beautification algorithm implemented via Graph-API
- 'gapi/samples/face_beautification.cpp' added
- FIXME added in 'gcpukernel.hpp'
* INF_ENGINE fix
- preprocessing clauses added not to run the sample without Inference Engine
* INF_ENGINE fix 2
- warnings removed
* Fixes
- checking IE version cut as there is no dependency
- some alignments fixed
- the comment about preprocessing commands fixed
* ie::backend() issue fix (according to dmatveev)
- as the sample needs the cv::gapi::ie::backend() to be defined regardless of having IE or not, there is its throw-error definition in `giebackend.cpp` now (by dmatveev)
- for the same reason, #includes in `giebackend.hpp` are fixed
- HAVE_INF_ENGINE check is removed from the sample
Implement Camera Multiplexing API
* IdideoCapture + two wrong function
function waitAny
Add errors catcher
Stub for Python added.
Sifting warnings
One test added
Two tests for camera and Perf tests added
* Perf sync and async tests for waitAny() added, waitAnyInterior() deleted, getDeviceHandle() deleted
* Variable OPENCV_TEST_CAMERA_LIST added
* Without fps set
* ASSERT_FAILED for environment variable
* Perf tests is DISABLED_
* --Trailing whitespace
* Return false from cap.cpp deleted
* Two functions deleted from interface, +range for, +environment variable in test_camera
* Space deleted
* printf deleted, perror added
* CV_WRAP deleted, cv2 cleared from stubs
* -- space
* default timeout added
* @param changed
* place of waitAny changed
* --whitespace
* ++function description
* function description changed
* revert unused changes
* videoio: rework API for VideoCapture::waitAny()
Supported ONNX Squeeze, ReduceL2 and Eltwise::DIV
* Support eltwise div
* Fix test
* OpenCL support added
* refactoring
* fix code style
* Only squeeze with axes supported
* Convert moments in tile algorithms to HAL (1.3x faster for VSX).
* Adding NEON code back in for non 64-bit platforms.
* Remove floats from post processing.