Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


regex - Request for help to speed up batch program for 17,000 TXT files

I have over 17,000 scanned pages (for a local history archive), which I have OCRed with Tesseract into individual TXT files. I want to be able to locate every page containing a search word of more than 3 lower-case letters. So for each TXT file I need to:

  1. Delete all rubbish from the OCR text, i.e. non-alphanumeric characters - jrepl "[^a-zA-Z0-9\s]" "" /x /f %%G /O -
  2. Remove 1, 2 and 3 letter words - jrepl "\b\w{1,3}\b" "" /x /f %%G /O -
  3. Change all characters to lower case - jrepl "(\w)" "$1.toLowerCase()" /i /j /x /f %%G /O -
  4. Put each remaining word on its own line so that the words can be sorted - jrepl "\s" "\n" /x /f %%G /O -
  5. Finally, sort all unique words into alphabetical order and write the modified TXT file - sort /UNIQUE %%G /O %%G
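
For reference, the five steps above amount to a single text transformation. The following is a minimal sketch of it in Python rather than JREPL, so it is not the batch file described here; the function name `index_words` is hypothetical, and the sort order is plain code-point order, which may differ from what Windows `sort` produces.

```python
import re


def index_words(text: str) -> list[str]:
    """Return the sorted, unique, lower-case words of 4+ characters in text."""
    # Step 1: drop everything that is not an ASCII letter, digit or whitespace.
    cleaned = re.sub(r"[^a-zA-Z0-9\s]", "", text)
    # Step 3: lower-case the whole text in one call.
    lowered = cleaned.lower()
    # Steps 2 and 4: split on whitespace and keep only words longer than 3
    # characters; a set removes duplicates.
    words = {word for word in lowered.split() if len(word) > 3}
    # Step 5: sort the unique words (code-point order: digits before letters).
    return sorted(words)


if __name__ == "__main__":
    sample = "The Old Mill, built in 1842; the mill-pond & MILL race!"
    print("\n".join(index_words(sample)))
```

On the sample line this prints 1842, built, mill, millpond and race, one per line. Doing all five steps in one pass over the text in memory avoids rereading and rewriting each file once per step.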

I have a batch file that does the above using JREPL, but it is very slow. It has been running for over 100 HOURS and I'm not even halfway. Any suggestions for speeding up the processing? I am running Windows 10. Thanks.

question from:https://stackoverflow.com/questions/65850015/request-for-help-to-speed-up-batch-program-for-17-000-txt-files


1 Answer


Solution?

Since your existing batch file already does what you want, and testing a replacement would no doubt occupy some hours itself:

Split the 17,000 files (or those that remain unprocessed) into as many separate directories as you have cores, then start your existing batch file on each directory. Since it's the weekend, leave the processes running overnight. With 8 cores, the 100-plus hours of work still to go should be done in 15 hours or so, while you catch up on sleep or gardening or whatever.
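
The split itself can be scripted. Below is a minimal sketch in Python, assuming all the TXT files sit directly in one directory; the function names and the `batchNN` subdirectory names are hypothetical, and the batch count defaults to whatever `os.cpu_count()` reports.

```python
import os
from pathlib import Path


def split_into_batches(files, n_batches):
    """Deal files round-robin into n_batches lists of near-equal size."""
    batches = [[] for _ in range(n_batches)]
    for i, name in enumerate(sorted(files)):
        batches[i % n_batches].append(name)
    return batches


def move_batches(src_dir, n_batches=None):
    """Move every *.txt file in src_dir into batch00, batch01, ... below it."""
    src = Path(src_dir)
    n = n_batches or os.cpu_count() or 1
    # Collect the file list first, before any subdirectory is created.
    files = [p for p in src.glob("*.txt") if p.is_file()]
    for k, batch in enumerate(split_into_batches(files, n)):
        dest = src / f"batch{k:02d}"
        dest.mkdir(exist_ok=True)
        for path in batch:
            # A rename within the same drive is a cheap move.
            path.rename(dest / path.name)
```

Calling `move_batches` on the directory of unprocessed files (on a copy first, to be safe) leaves one subdirectory per core, and the existing batch file can then be started once in each. Dealing the files out round-robin keeps the batches within one file of each other in size.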

