zdenop
0e72733121
Merge pull request #2305 from stweil/fuzz
...
Fix Heap-buffer-overflow in GenericVector<int>::size (issue #2298)
2019-03-10 16:36:26 +01:00
Stefan Weil
71d4990c6d
Fix Heap-buffer-overflow in GenericVector<int>::size (issue #2298)
...
Credit to OSS-Fuzz:
This fixes a security issue which was reported by OSS-Fuzz, see details at
https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=13590 .
Also add some assertions to catch similar bugs.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-03-10 16:12:30 +01:00
Robert Schubert
3912cb1c33
LSTM char_whitelist/blacklist (6ac2ff0): more robust
...
- unicharset can be null too
2019-03-09 10:40:40 +01:00
Stefan Weil
b7279f6d67
unittest: Remove tmp directory from repository and create it during build
...
This fixes out of tree builds.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-03-08 16:08:16 +01:00
Stefan Weil
bd95c9d2b8
unittest: Add missing libarchive
...
It is needed for the tests if Tesseract was built with libarchive.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-03-08 15:50:14 +01:00
Stefan Weil
b20f89006e
unittest: Add another file from Abseil
...
It is needed for newer versions of Abseil.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-03-08 15:46:38 +01:00
Robert Schubert
b45999088c
LSTM char_whitelist/blacklist (6ac2ff0): multi-code chars
...
- move decision from ComputeTopN to ContinueContext, where
it belongs: block context continuations which emit final
codes translating to disabled unichar_ids.
(The normal logic for fallback from top2 > top2 > rest
will apply.)
- pass UNICHARSET refs appropriately
2019-03-08 12:30:16 +01:00
Robert Schubert
8012d5e653
LSTM char_whitelist/blacklist (6ac2ff0): also sublangs
2019-03-07 18:32:50 +01:00
Robert Schubert
6ac2ff083e
trying to add tessedit_char_whitelist etc. again:
...
- ignore matrix outputs in ComputeTopN if they
belong to a disabled unichar_id
- pass UNICHARSET refs to check that
- in SetBlackAndWhitelist, also update the unicharset
of the lstm_recognizer_ instance, if any
2019-03-07 01:37:23 +01:00
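The first bullet of commit 6ac2ff083e above (ignore matrix outputs that belong to a disabled unichar_id) can be illustrated with a minimal, self-contained sketch. This is not the Tesseract implementation: `TopNEnabled`, its parameters, and the plain `std::vector<bool>` mask are hypothetical stand-ins for the real ComputeTopN and UNICHARSET machinery, and a later commit in this log (b45999088c) moves the decision to ContinueContext.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not a Tesseract function): returns the indices of the
// n highest-scoring outputs, skipping every index whose entry in `enabled`
// is false or missing. Indices stand in for unichar_ids here.
std::vector<int> TopNEnabled(const std::vector<float>& outputs,
                             const std::vector<bool>& enabled,
                             std::size_t n) {
  std::vector<int> ids;
  for (std::size_t i = 0; i < outputs.size(); ++i) {
    // A disabled (blacklisted or non-whitelisted) id never becomes a candidate.
    if (i < enabled.size() && enabled[i]) ids.push_back(static_cast<int>(i));
  }
  n = std::min(n, ids.size());
  // Order only the first n candidates by descending score.
  std::partial_sort(ids.begin(),
                    ids.begin() + static_cast<std::ptrdiff_t>(n), ids.end(),
                    [&outputs](int a, int b) { return outputs[a] > outputs[b]; });
  ids.resize(n);
  return ids;
}
```

With scores {0.1, 0.9, 0.5, 0.7} and index 1 disabled, asking for the top two yields indices 3 and 2: the highest score overall is skipped because its id is disabled.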
zdenop
f80085c0bf
Merge pull request #2289 from Armyke/master
...
Added an additional optional --tmp_dir parameter to specify the tempo…
2019-03-06 15:03:14 +01:00
zdenop
fe5c82fd24
Merge pull request #2291 from cjmayo/man_configfile
...
Document that configfile can be a file path
2019-03-06 10:19:27 +01:00
Chris Mayo
a9d3efb6e3
Document that configfile can be a file path
...
Useful for custom config or when pointing tessdata to alternate
traineddata.
2019-03-05 19:47:54 +00:00
zdenop
868a623f8d
Merge pull request #2290 from stweil/libarchive
...
Add initial support for traineddata files in standard archive formats
2019-03-05 17:42:13 +01:00
Stefan Weil
1c7e00611b
Add initial support for traineddata files in standard archive formats
...
This requires libarchive-dev.
Tesseract can now load traineddata files in any of the archive formats
which are supported by libarchive. Example of a zipped BagIt archive:
$ unzip -l /usr/local/share/tessdata/zip.traineddata
Archive:  /usr/local/share/tessdata/zip.traineddata
  Length      Date    Time    Name
---------  ---------- -----   ----
       55  2019-03-05 15:27   bagit.txt
        0  2019-03-05 15:25   data/
     1557  2019-03-05 15:28   manifest-sha256.txt
  1082890  2019-03-05 15:25   data/eng.word-dawg
  1487588  2019-03-05 15:25   data/eng.lstm
     7477  2019-03-05 15:25   data/eng.unicharset
    63346  2019-03-05 15:25   data/eng.shapetable
   976552  2019-03-05 15:25   data/eng.inttemp
    13408  2019-03-05 15:25   data/eng.normproto
     4322  2019-03-05 15:25   data/eng.punc-dawg
     4738  2019-03-05 15:25   data/eng.lstm-number-dawg
     1410  2019-03-05 15:25   data/eng.freq-dawg
      844  2019-03-05 15:25   data/eng.pffmtable
     6360  2019-03-05 15:25   data/eng.lstm-unicharset
     1012  2019-03-05 15:25   data/eng.lstm-recoder
     1047  2019-03-05 15:25   data/eng.unicharambigs
     4322  2019-03-05 15:25   data/eng.lstm-punc-dawg
 16109842  2019-03-05 15:25   data/eng.bigram-dawg
       80  2019-03-05 15:25   data/eng.version
     6426  2019-03-05 15:25   data/eng.number-dawg
  3694794  2019-03-05 15:25   data/eng.lstm-word-dawg
---------                     -------
 23468070                     21 files
`combine_tessdata -d` and `combine_tessdata -u` also work.
The traineddata files in the new format can be generated with
standard tools like zip or tar.
More work is needed for other training tools and big endian support.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-03-05 17:18:48 +01:00
Armyke
56b04d4ea7
Added the same --tmp_dir flag to tesstrain_utils.sh
2019-03-04 14:05:25 +00:00
Armyke
25fa392887
Added an optional --tmp_dir parameter to specify the directory in which tesstrain.py creates its temporary training files. The main motivation is slow read/write on HDDs: anyone who wants to speed up the process can point tmp_dir at a directory on an SSD.
2019-03-04 13:26:53 +00:00
Stefan Weil
7fbde96a04
Format new code with clang-format
...
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-03-02 20:26:07 +01:00
Stefan Weil
38fac625cd
Format new code with clang-format
...
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-03-02 20:01:48 +01:00
Shree
a0202bac70
Rename function to TessBaseAPIGetTsvText to be consistent with the Create method
2019-03-02 16:29:53 +00:00
zdenop
5de2a21b3f
Merge pull request #2283 from Shreeshrii/lstmbox
...
Add missing renderers to C-API
2019-03-02 15:15:34 +01:00
zdenop
198c90b124
Merge pull request #2285 from stweil/opt
...
PAGE_RES_IT: Optimize compare operators by using inline code
2019-03-02 15:13:14 +01:00
Stefan Weil
9c90894ff0
PAGE_RES_IT: Optimize compare operators by using inline code
...
Avoiding a function call will make both the == and != operators faster.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-03-02 14:57:16 +01:00
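The idea behind commit 9c90894ff0 above can be sketched as follows: comparison operators defined inside the class body are implicitly inline, so the compiler is free to expand them at the call site instead of emitting a function call. `WordIterator` and its members are hypothetical and only stand in for PAGE_RES_IT; the real class layout is not shown in this log.

```cpp
#include <cassert>

// Hypothetical stand-in for an iterator such as PAGE_RES_IT; the member
// names are illustrative and not taken from the Tesseract sources.
struct WordIterator {
  const void* block = nullptr;
  const void* row = nullptr;
  const void* word = nullptr;

  // Defined in the class body, hence implicitly inline.
  bool operator==(const WordIterator& other) const {
    return block == other.block && row == other.row && word == other.word;
  }

  // Expressed through operator== so the two can never disagree.
  bool operator!=(const WordIterator& other) const {
    return !(*this == other);
  }
};
```

Two iterators pointing at the same word compare equal, and changing any one of the three pointers makes them unequal. Whether a call is actually inlined remains the compiler's decision.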
Egor Pugin
7cc97c25ca
Merge pull request #2284 from stweil/fix
...
Fix some compiler warnings
2019-03-02 16:35:55 +03:00
Stefan Weil
295996ed05
commandlineflags: Fix compiler warnings (signed/unsigned)
...
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-03-02 14:21:04 +01:00
Stefan Weil
eb14726aac
ICOORD: Fix old type casts
...
This fixes compiler warnings and avoids unnecessary conversions
between float and double.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-03-02 14:04:54 +01:00
Stefan Weil
fb0f1bcf66
BoxChar: Fix compiler warnings (signed/unsigned)
...
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-03-02 14:04:54 +01:00
Stefan Weil
0e1a1fc3cf
Validator: Fix compiler warnings (signed/unsigned)
...
This also fixes a regression in validate_grapheme_test introduced
by commit 32e9d7c8f5.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-03-02 13:05:03 +01:00
Shree
c7e8131efc
Add TSV option to C-API
2019-03-02 09:50:54 +00:00
Shree
22c099348b
rename LSTMBOX to LSTMBox
2019-03-02 09:11:47 +00:00
zdenop
f5a7ca26e7
Merge pull request #2244 from Shreeshrii/mya
...
Fix Myanmar validation rules as per Unicode charts
2019-03-01 18:37:36 +01:00
zdenop
2ba8e0061a
Merge branch 'master' into mya
2019-03-01 18:37:24 +01:00
zdenop
0b354f2b84
Merge pull request #2282 from Shreeshrii/configs
...
Add lstmbox and wordstrbox to C-API
2019-03-01 18:33:29 +01:00
Shree
c33f03e33e
Add lstmbox and wordstrbox to capi.h
2019-03-01 17:16:59 +00:00
Shree
76ec21df3d
Add lstmbox and wordstrbox to C-API
2019-03-01 16:40:41 +00:00
zdenop
c4b5178296
Merge pull request #2280 from Shreeshrii/configs
...
install lstmbox and wordstrbox config files
2019-03-01 17:20:55 +01:00
Shree
08e96516c6
install lstmbox and wordstrbox config files
2019-03-01 15:26:59 +00:00
zdenop
a783009189
Merge pull request #2279 from Shreeshrii/tamil_numbers
...
correct handling of 0BF0-0BFA Tamil numbers and symbols
2019-03-01 14:36:20 +01:00
zdenop
646b043d2c
use space instead of tab
2019-03-01 14:36:09 +01:00
Shree
5ee1deaea2
correct handling of 0BF0-0BFA Tamil numbers and symbols
2019-03-01 13:21:49 +00:00
zdenop
d7ddc4c5b7
Merge pull request #2270 from Shreeshrii/U_ARABIC_NUMBER
...
Treat U_ARABIC_NUMBER as LTR
2019-02-28 09:27:54 +01:00
zdenop
2a69a4b4e1
Merge pull request #2275 from russiaayya/patch-1
...
Change option -l to --lang
2019-02-27 20:22:32 +01:00
russiaayya
c6cc54aa76
Change option -l to --lang
2019-02-27 12:55:34 -05:00
zdenop
12c1225a5f
Merge pull request #2271 from stweil/refactor
...
Refactor class Network
2019-02-27 07:43:13 +01:00
Stefan Weil
13bd96fd37
Merge pull request #2272 from nijel/patch-1
...
Allow UTF-8 variant of C locale
2019-02-26 22:25:48 +01:00
Michal Čihař
14c4494f42
Allow UTF-8 variant of C locale
...
It behaves the same in scanf, but it allows proper handling of Unicode
chars.
2019-02-26 21:37:33 +01:00
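A locale check of the kind commit 14c4494f42 above relaxes might look like the sketch below. It assumes the accepted names are exactly "C" and "C.UTF-8" and that the LC_NUMERIC category is the one of interest; both the helper names and these details are assumptions, not code taken from Tesseract.

```cpp
#include <cassert>
#include <clocale>
#include <cstring>

// Hypothetical helper: accepts the plain C locale and its UTF-8 variant.
bool IsCLocaleVariant(const char* name) {
  return name != nullptr &&
         (std::strcmp(name, "C") == 0 || std::strcmp(name, "C.UTF-8") == 0);
}

// Queries the current LC_NUMERIC locale without changing it
// (passing nullptr makes setlocale a pure query).
bool CurrentNumericLocaleIsC() {
  return IsCLocaleVariant(std::setlocale(LC_NUMERIC, nullptr));
}
```

A caller could assert `CurrentNumericLocaleIsC()` before parsing numbers with scanf-style functions, since a locale with a different decimal separator would change how such input is read.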
Stefan Weil
98dd3b6351
Refactor class Network
...
That class is an abstract class with several pure virtual functions.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-02-26 16:55:31 +01:00
zdenop
cf85054453
Merge pull request #2269 from Shreeshrii/grapheme
...
Use alternative way to comment a block of code
2019-02-26 16:51:34 +01:00
Shree
25b02bf1f2
Treat U_ARABIC_NUMBER as LTR
2019-02-26 09:51:21 +00:00
Shreeshrii
2f71fe280c
Use alternative way to comment a block of code (using the C preprocessor).
...
https://github.com/tesseract-ocr/tesseract/pull/2268#pullrequestreview-207605382
Thanks @amitdo
2019-02-26 15:05:51 +05:30
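The technique named in commit 2f71fe280c above, disabling a block with the C preprocessor rather than with comment markers, is commonly written as `#if 0 ... #endif`. The function below is a hypothetical illustration, not the test code touched by that commit.

```cpp
#include <cassert>

// Hypothetical example: counts how many checks are actually compiled in.
int CountActiveChecks() {
  int checks = 0;
  ++checks;  // This check is compiled and executed.
#if 0
  // Everything up to #endif is removed by the preprocessor. Unlike an
  // enclosing /* ... */ pair, this works even if the disabled code
  // contains block comments of its own.
  ++checks; /* a nested block comment is harmless here */
#endif
  return checks;
}
```

Here `CountActiveChecks()` returns 1, because the second increment never reaches the compiler. Changing `#if 0` to `#if 1` re-enables the block without touching its contents.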
zdenop
9ddf267907
Merge pull request #2268 from Shreeshrii/grapheme
...
Remove test for Word started with a combiner
2019-02-25 20:59:40 +01:00