Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


elasticsearch - Word boundary in Lucene regex

I'd like to make a regex query in Elasticsearch with word boundaries; however, it looks like the Lucene regex engine doesn't support \b. What workarounds can I use?


He who fights dragons for too long becomes a dragon himself; gaze too long into the abyss, and the abyss gazes back…

1 Answer


In the Elasticsearch (Lucene) regex flavor, there is no direct equivalent to a word boundary. A leading \b is something like (^|[^A-Za-z0-9_]) if the word starts with a word char, and a trailing \b is something like ($|[^A-Za-z0-9_]) if the word ends with a word char.

Thus, we need to make sure that the word is preceded by either a non-word char or the start of the string, and followed by either a non-word char or the end of the string. Since a Lucene regex is anchored by default (it must match the whole value), all we need to do to make [^A-Za-z0-9_] optional at the start/end of the string is add .* beside it and wrap the pair in an optional group:

(.*[^A-Za-z0-9_])?word([^A-Za-z0-9_].*)?
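As a rough illustration, the pattern above could be generated for any word and wrapped in a regexp query body. This is a minimal sketch, not a definitive implementation: the helper names `lucene_word_pattern` and `regexp_query` and the field name `title.keyword` are hypothetical, and the word is assumed to consist of ASCII word chars only, so no Lucene escaping is needed.

```python
import json
import re

# Negated word-char class used by the workaround in place of \b.
NON_WORD = "[^A-Za-z0-9_]"


def lucene_word_pattern(word: str) -> str:
    """Build the whole-value pattern that emulates \\bword\\b (hypothetical helper)."""
    # Assumption: only plain word chars are accepted, so nothing needs escaping.
    if not re.fullmatch(r"[A-Za-z0-9_]+", word):
        raise ValueError("word must contain only [A-Za-z0-9_] characters")
    return f"(.*{NON_WORD})?{word}({NON_WORD}.*)?"


def regexp_query(field: str, word: str) -> dict:
    """Wrap the pattern in an Elasticsearch regexp query body (hypothetical helper)."""
    return {"query": {"regexp": {field: {"value": lucene_word_pattern(word)}}}}


if __name__ == "__main__":
    # "title.keyword" is an assumed field name used only for illustration.
    print(json.dumps(regexp_query("title.keyword", "word"), indent=2))
```

Because a regexp query is matched against indexed terms, a pattern like this is mainly relevant for fields whose values are stored as a single term, such as keyword fields.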

Details

  • (.*[^A-Za-z0-9_])? - an optional group: either nothing (the word sits at the start of the string) or any 0+ chars (other than line break chars; otherwise use (.|\n)*) followed by one char that is not a word char
  • word - the literal word
  • ([^A-Za-z0-9_].*)? - an optional group: one char that is not a word char followed by any 0+ chars, up to the end of the string (the end-of-string position is implicit in a Lucene regex).
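The behaviour described in the list can be checked locally with a small sketch. It assumes Python's `re.fullmatch` is a fair stand-in for Lucene's implicit anchoring; that holds for this pattern because it uses only constructs common to both flavors, but it is not a general Lucene emulator. The helper name `matches_whole_value` is hypothetical.

```python
import re

# The workaround pattern from the answer, unchanged.
PATTERN = r"(.*[^A-Za-z0-9_])?word([^A-Za-z0-9_].*)?"


def matches_whole_value(text: str) -> bool:
    """Return True if the entire value matches, mimicking Lucene's implicit anchors."""
    return re.fullmatch(PATTERN, text) is not None


if __name__ == "__main__":
    samples = ["word", "a word here", "word.", "my-word",
               "swordfish", "password", "words", "word_x"]
    for sample in samples:
        print(f"{sample!r}: {matches_whole_value(sample)}")
```

The first four samples match because "word" is delimited by the string edges or by non-word chars; the last four do not, since a letter, digit, or underscore touches the word on at least one side.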
