Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


java - Regex search pattern in very large file

I'd like to search for a pattern in a very large file (e.g. over 1 GB) that consists of a single line. It is not possible to load it into memory. Currently, I use BufferedReader to read into buffers (1024 chars each). The main steps:

  1. Read data into two buffers
  2. Search for the pattern in both buffers
  3. Increment a counter if the pattern was found
  4. Copy the second buffer into the first
  5. Load new data into the second buffer
  6. Search for the pattern in both buffers
  7. Increment the counter if the pattern was found
  8. Repeat the above steps (starting from 4) until EOF
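
The steps above can be sketched roughly as follows. This is a minimal illustration, not the asker's actual code: the class name TwoBufferSearch and the method containsMatch are hypothetical, the counter is simplified to a yes/no answer, and the two buffers are kept as the two halves of one array. It assumes the pattern has no anchors or lookarounds that depend on text outside the current window.

```java
import java.io.IOException;
import java.io.Reader;
import java.nio.CharBuffer;
import java.util.regex.Pattern;

// Hypothetical helper illustrating the two-buffer (sliding window) search.
final class TwoBufferSearch {

    // Returns true if some window of two consecutive chunks contains a match.
    // Matches longer than the window are NOT detected (the known limitation).
    static boolean containsMatch(Reader in, Pattern pattern, int chunkSize) throws IOException {
        char[] window = new char[2 * chunkSize];
        // Step 1: read data into both halves of the window.
        int len = fill(in, window, 0, window.length);
        while (true) {
            // Steps 2 and 6: search the pattern in both buffers at once.
            if (pattern.matcher(CharBuffer.wrap(window, 0, len)).find()) {
                return true;
            }
            if (len < window.length) {
                return false; // EOF was reached while filling this window
            }
            // Step 4: copy the second buffer into the first.
            System.arraycopy(window, chunkSize, window, 0, chunkSize);
            // Step 5: load new data into the second buffer.
            int read = fill(in, window, chunkSize, chunkSize);
            if (read == 0) {
                return false; // nothing new to search
            }
            len = chunkSize + read;
        }
    }

    // Reader.read may return fewer chars than requested, so loop until full or EOF.
    private static int fill(Reader in, char[] buf, int off, int len) throws IOException {
        int total = 0;
        while (total < len) {
            int n = in.read(buf, off + total, len - total);
            if (n < 0) {
                break;
            }
            total += n;
        }
        return total;
    }
}
```

With a chunk size of 4, this sketch finds "baab" even when it straddles a chunk boundary, but it reports no match for a "b", twenty "a"s and a final "b", because no 8-character window ever holds both "b" characters.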

That algorithm (two buffers) lets me avoid the situation where the searched piece of text is split across chunks. It works like a charm as long as the match fits within the two buffers. But I can't handle the case where the match is longer, say as long as 3 buffers (I only have data in two buffers, so the match will fail!). What's more, I can imagine a case like this:

  1. Prepare a 1 GB single-line file that consists of "baaaaaaa(....)aaaaab"
  2. Search for the pattern ba*b.
  3. The whole file matches the pattern!
  4. I don't have to print the result; I only have to be able to say: "Yes, I was able to find the pattern" or "No, I wasn't able to find it".
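
A test file of that shape can be produced without holding it in memory. The sketch below is only an illustration: the class name WorstCaseFile and its write method are hypothetical, and it assumes ASCII output so that one char is one byte on disk.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical generator for a single-line file of the form "baaa...aaab".
final class WorstCaseFile {

    // Writes totalChars characters: 'b', then (totalChars - 2) times 'a', then 'b'.
    static void write(Path target, long totalChars) throws IOException {
        if (totalChars < 2) {
            throw new IllegalArgumentException("need at least the two 'b' characters");
        }
        char[] block = new char[8192];
        Arrays.fill(block, 'a');
        try (BufferedWriter out = Files.newBufferedWriter(target, StandardCharsets.US_ASCII)) {
            out.write('b');
            long remaining = totalChars - 2;
            // Stream the run of 'a' in fixed-size blocks.
            while (remaining > 0) {
                int n = (int) Math.min(block.length, remaining);
                out.write(block, 0, n);
                remaining -= n;
            }
            out.write('b');
        }
    }
}
```

Calling write with a path and 1L << 30 as the size would give a file of about 1 GB; a much smaller size is enough to exercise the search logic.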

Is this possible with Java? I mean:

  1. The ability to determine whether a pattern is present in the file (without loading the whole line into memory; see the case above)
  2. A way to handle the case where the match is longer than a chunk.

I hope my explanation is pretty clear.


1 Answer


I think the solution for you would be to implement CharSequence as a wrapper over very large text files.

Why? Because building a Matcher from a Pattern takes a CharSequence as an argument.

Of course, easier said than done... But then you only have three methods to implement (length(), charAt() and subSequence()), so that shouldn't be too hard...
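
As a rough idea of what such a wrapper could look like, here is a minimal sketch. It is not the implementation mentioned in this answer: the class name LargeFileCharSequence and its open method are hypothetical, it assumes a single-byte encoding (ISO-8859-1 or ASCII) so that a char index equals a byte offset, and it uses a memory-mapped buffer, which limits it to files under 2 GiB (length() returns an int anyway).

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical CharSequence view over a memory-mapped, single-byte-encoded file.
final class LargeFileCharSequence implements CharSequence {
    private final ByteBuffer bytes; // mapped file contents, paged in by the OS on demand
    private final int start;        // inclusive offset of this view
    private final int end;          // exclusive offset of this view

    private LargeFileCharSequence(ByteBuffer bytes, int start, int end) {
        this.bytes = bytes;
        this.start = start;
        this.end = end;
    }

    static LargeFileCharSequence open(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = ch.size();
            if (size > Integer.MAX_VALUE) {
                throw new IOException("file too large for a single mapping: " + size);
            }
            // The mapping stays valid after the channel is closed.
            ByteBuffer mapped = ch.map(FileChannel.MapMode.READ_ONLY, 0, size);
            return new LargeFileCharSequence(mapped, 0, (int) size);
        }
    }

    @Override
    public int length() {
        return end - start;
    }

    @Override
    public char charAt(int index) {
        if (index < 0 || index >= length()) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        // Assumption: one byte per char (ISO-8859-1); mask to avoid sign extension.
        return (char) (bytes.get(start + index) & 0xFF);
    }

    @Override
    public CharSequence subSequence(int from, int to) {
        if (from < 0 || to > length() || from > to) {
            throw new IndexOutOfBoundsException(from + ".." + to);
        }
        // Shares the mapped buffer; no data is copied.
        return new LargeFileCharSequence(bytes, start + from, start + to);
    }

    @Override
    public String toString() {
        // Copies the view into memory; only sensible for small subsequences.
        StringBuilder sb = new StringBuilder(length());
        for (int i = 0; i < length(); i++) {
            sb.append(charAt(i));
        }
        return sb.toString();
    }
}
```

Under those assumptions, Pattern.compile("ba*b").matcher(LargeFileCharSequence.open(path)).find() answers the yes/no question without building a String for the whole line. A multi-byte encoding such as UTF-8 would need a different index-to-offset scheme.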


EDIT: I took the plunge and ate my own dog food. The "worst part" is that it actually works!

