Why can't Tesseract recognize the English words in this image?

Question

asked Feb 19, 2021 in Technique[技术] by 深蓝 (71.8m points)

I am using Tesseract 4.0 to recognize English words, but it fails on this one image: no words are recognized at all.

Can anyone give me a tip? Thanks.

    import pytesseract

    r = pytesseract.image_to_string('6.jpg', lang='eng')
    print(r)

Update:

I tried an online OCR website and it works, but why?

How can I get Tesseract to recognize it?


1 Answer

深蓝 · Answer 1 · 2021-02-19T03:49:03+0000

The problem is that Tesseract is not rotation-invariant, so a tilted image needs additional pre-processing before OCR. Start by rotating the image by 4 degrees:

    import cv2, imutils
    img = imutils.rotate_bound(cv2.imread("YD90o.png"), 4)

Next, apply an adaptive threshold to the rotated image. The resulting binary image is the `thr` that is passed to pytesseract.
To read the image with pytesseract, you need to set an additional configuration option:

```
pytesseract.image_to_string(thr, lang="eng", config="--psm 6")
```

PSM (page segmentation mode) 6 means "Assume a single uniform block of text."

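To print the last recognized line:

    txt = pytesseract.image_to_string(thr, lang="eng", config="--psm 6")
    # Drop the form feed ('\f') that ends the output, then split into lines.
    txt = txt.replace('\f', '').split('\n')
    print(txt[-2])  # second-to-last element; the last one is empty after a trailing newline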
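Indexing the second-to-last element assumes the output ends with exactly one trailing newline. If that is not guaranteed, a small helper can pick the last non-blank line instead. This is a minimal sketch and `last_text_line` is a hypothetical name, not part of pytesseract.

```python
def last_text_line(raw: str) -> str:
    """Return the last non-blank line of raw OCR output, or '' if there is none."""
    # Remove the form feed that often terminates the output, then split on newlines.
    lines = [line.strip() for line in raw.replace("\f", "").splitlines()]
    non_blank = [line for line in lines if line]
    return non_blank[-1] if non_blank else ""
```

It would be used as `print(last_text_line(pytesseract.image_to_string(thr, lang="eng", config="--psm 6")))`.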

The website might use a deep-learning method to detect the words in the image. However, when I use newocr.com, the result is:

oy Eee a
setuP me -
continve ae