Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


Why can't Tesseract recognize the English words in this image?

I am using Tesseract 4.0 to recognize English words, but it fails on this one image: no words are recognized at all.

Can anyone give me a tip? Thanks.

    import pytesseract

    r = pytesseract.image_to_string('6.jpg', lang='eng')
    print(r)

[image: the failing image]

Update:

I tried OCR with an online service,

https://www.newocr.com/

and it works. Why?

How can I get Tesseract to recognize it?



1 Answer


The problem is that pytesseract is not rotation-invariant, so the image needs additional pre-processing before OCR.

  • My first idea is to rotate the image by a small angle (4 degrees here):

        img = imutils.rotate_bound(cv2.imread("YD90o.png"), 4)

    Result: [image: rotated result]
  • Next, apply an adaptive threshold to the rotated image (thr below is the thresholded result):

    [image: thresholded result]
  • To read the text with pytesseract, you need to set an additional configuration option:

        pytesseract.image_to_string(thr, lang="eng", config="--psm 6")

    PSM (page segmentation mode) 6 means "Assume a single uniform block of text."

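  • Result: you want the last line of text in the image, so take the last line of the OCR output:

        txt = pytesseract.image_to_string(thr, lang="eng", config="--psm 6")
        # Drop the form feed character from the output, then split into lines
        txt = txt.replace('\f', '').split('\n')
        print(txt[-2])  # second-to-last element, as the output ends with a newline

    Output:

        Continue Setub ie Gene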
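
The steps above can be put together roughly as follows. This is only a sketch: the answer does not show its thresholding code, so the grayscale conversion, the Gaussian method, the block size (21), and the constant (10) are assumptions, and the function names preprocess, last_text_line, and read_last_line are made up for illustration. The rotation angle (4) and "--psm 6" come from the answer.

```python
def preprocess(path, angle=4):
    """Rotate the image by `angle` degrees, then binarize it."""
    # Third-party imports are local so that last_text_line() below
    # can be used even where OpenCV and imutils are not installed.
    import cv2
    import imutils

    img = cv2.imread(path)
    if img is None:
        raise FileNotFoundError(path)

    # rotate_bound enlarges the canvas so no part of the image is cut off.
    rotated = imutils.rotate_bound(img, angle)

    # Adaptive thresholding needs a single-channel image.
    gray = cv2.cvtColor(rotated, cv2.COLOR_BGR2GRAY)

    # Assumed parameters: Gaussian method, 21x21 neighbourhood, constant 10.
    # These will likely need tuning for a different image.
    return cv2.adaptiveThreshold(
        gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 21, 10
    )


def last_text_line(text):
    """Return the last non-blank line of OCR output, or '' if there is none."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    return lines[-1] if lines else ""


def read_last_line(path, angle=4):
    """Pre-process the image, run Tesseract with PSM 6, return the last line."""
    import pytesseract

    thr = preprocess(path, angle)
    text = pytesseract.image_to_string(thr, lang="eng", config="--psm 6")
    return last_text_line(text)
```

Calling read_last_line("YD90o.png") would run the whole pipeline on the image from the answer. Filtering out blank lines in last_text_line is a slightly more defensive alternative to indexing with a fixed offset, since it does not depend on exactly how the OCR output is terminated.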
      

The website might use a deep-learning method to detect the words in the image. However, when I tried newocr.com myself, the result was:

oy Eee a
setuP me -
continve ae
