[TextExtractor]Fix error blanks in Japanese OCR (#22443)

* fix error blanks in japanese OCR

Kanji, Hiragana, Katakana, and Hankaku-Katakana (half-width Katakana) do not need a blank between them, so the check should not be limited to the CJKUnifiedIdeographs range. There may be more symbols that don't require spaces, such as \u3001 and \u3002, but leaving those to improvements in the OCR engine may be a better choice.
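
As a rough illustration of the idea, the sketch below joins recognized words and omits the blank when both neighbours contain Japanese-script characters. The regex is the one added in this commit; the `JoinWords` helper, its joining rule, and the sample inputs are hypothetical and are not taken from ImageMethods.cs.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

// Same pattern as in this commit: Kanji, Hiragana, Katakana, and half-width Katakana.
var noSpaceRegex = new Regex(@"\p{IsCJKUnifiedIdeographs}|\p{IsHiragana}|\p{IsKatakana}|[\uFF61-\uFF9F]");

// Hypothetical helper (assumption): insert a blank between two words
// unless both of them match the pattern above.
string JoinWords(IEnumerable<string> words)
{
    var sb = new StringBuilder();
    bool isFirst = true;
    bool prevMatched = false;
    foreach (string word in words)
    {
        bool matched = noSpaceRegex.IsMatch(word);
        if (!isFirst && !(prevMatched && matched))
        {
            sb.Append(' ');
        }

        sb.Append(word);
        prevMatched = matched;
        isFirst = false;
    }

    return sb.ToString();
}

string japanese = JoinWords(new[] { "日本", "語", "の", "テキスト" }); // returns "日本語のテキスト"
string english = JoinWords(new[] { "Hello", "world" });               // returns "Hello world"
Console.WriteLine(japanese);
Console.WriteLine(english);
```

With only `\p{IsCJKUnifiedIdeographs}` in the pattern, the Hiragana and Katakana words in the first example would not match, so blanks would be inserted around them.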

* Update ImageMethods.cs

fixing spelling

* Update expect.txt

adding in Hankaku

* Update ImageMethods.cs
AO2233 2022-12-09 22:07:45 +08:00 committed by GitHub
parent 08d569ccf6
commit a8a618af1d
2 changed files with 5 additions and 1 deletions

@@ -618,6 +618,7 @@ HACCEL
 handlekeyboardhookevent
 handlerroutine
 hangeul
+Hankaku
 hanselman
 Hanzi
 Hardlines

@@ -147,7 +147,10 @@ internal class ImageMethods
 }
 else
 {
-var cjkRegex = new Regex(@"\p{IsCJKUnifiedIdeographs}");
+// Kanji, Hiragana, Katakana, and Hankaku-Katakana do not need blanks between them (not only the characters in CJKUnifiedIdeographs).
+// There may be more symbols that don't require spaces, such as \u3001 and \u3002.
+// var cjkRegex = new Regex(@"\p{IsCJKUnifiedIdeographs}|\p{IsHiragana}|\p{IsKatakana}|[\uFF61-\uFF9F]|[\u3000-\u3003]");
+var cjkRegex = new Regex(@"\p{IsCJKUnifiedIdeographs}|\p{IsHiragana}|\p{IsKatakana}|[\uFF61-\uFF9F]");
 foreach (OcrLine ocrLine in ocrResult.Lines)
 {