java - Lucene how can I get position of found query?

asked Oct 17, 2021 in Technique[技术] by 深蓝 (71.8m points)

I have a QueryParser, and I want to find the string "War Force" in my text:

TextWord[0]: 2003
TextWord[1]: 09
TextWord[2]: 22T19
TextWord[3]: 01
TextWord[4]: 14Z
TextWord[5]: Book0
TextWord[6]: WEAPONRY
TextWord[7]: NATO2
TextWord[8]: Bar
TextWord[9]: WEAPONRY
TextWord[10]: State
TextWord[11]: WEAPONRY
TextWord[12]: 123
TextWord[13]: War
TextWord[14]: WORD1
TextWord[15]: Force
TextWord[16]: And
TextWord[17]: Book4
TextWord[18]: Book
TextWord[19]: WEAPONRY
TextWord[20]: Book6
TextWord[21]: Terrorist.
TextWord[22]: And
TextWord[23]: WEAPONRY
TextWord[24]: 18
TextWord[25]: 31
TextWord[26]: state
TextWord[27]: AND

I can see that the phrase matches when I use a phrase slop of 1 (that is: "war" word1 "force").

I can find the position of "war" or "force":

        DirectoryReader reader = DirectoryReader.open(this.memoryIndex);
        IndexSearcher searcher = new IndexSearcher(reader);
        
        QueryParser queryParser = new QueryParser("tags", new StandardAnalyzer());
        Query query = queryParser.parse("\"War Force\"~1");
        TopDocs results = searcher.search(query, 1);

        for (ScoreDoc scoreDoc : results.scoreDocs) {

            Fields termVs = reader.getTermVectors(scoreDoc.doc);
            Terms f = termVs.terms("tags");

            String searchTerm = "War".toLowerCase();
            BytesRef ref = new BytesRef(searchTerm);

            TermsEnum te = f.iterator();
            PostingsEnum docsAndPosEnum = null;
            if (te.seekExact(ref)) {
                
                docsAndPosEnum = te.postings(docsAndPosEnum, PostingsEnum.ALL);
                int nextDoc = docsAndPosEnum.nextDoc();
                assert nextDoc != DocIdSetIterator.NO_MORE_DOCS;
                final int fr = docsAndPosEnum.freq();
                final int p = docsAndPosEnum.nextPosition();
                final int o = docsAndPosEnum.startOffset();

                System.out.println("Word: " + ref.utf8ToString());
                System.out.println("Position: " + p + ", startOffset: " + o + " length: " + ref.length + " Freq: " + fr);
                if (fr > 1) {
                    for (int iter = 1; iter <= fr - 1; iter++) {
                        System.out.println("Position: " + docsAndPosEnum.nextPosition());
                    }
                }
            }

            System.out.println("Finish");
        }

But I can't find the position of the whole matched phrase, "War Force". How can I get the position of the matched query result?

1 Answer

深蓝 · Answer 1 · 2021-10-17T03:10:11+0000

There is probably more than one way to do this, but I suggest using the FastVectorHighlighter, as it gives you access to position and offset data.

Indexing Requirements

To use this approach, the field you search must store term vector data, including positions and offsets, when the index is created:

final String fieldName = "body";
// a shorter version of the input data in the question, for testing:
final String content = "State WEAPONRY 123 War WORD1 Force And Book4 Book WEAPONRY";

FieldType fieldType = new FieldType();
fieldType.setStored(true);
fieldType.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
fieldType.setStoreTermVectors(true);
fieldType.setStoreTermVectorPositions(true);
fieldType.setStoreTermVectorOffsets(true);

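// "writer" is assumed to be an IndexWriter you have already opened on your index:
Document doc = new Document();
doc.add(new Field(fieldName, content, fieldType));
writer.addDocument(doc);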

(This may significantly increase the size of your indexed data, if you are not already capturing term vectors.)

Library Requirements

The fast vector highlighter is part of the lucene-highlighter library:

<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-highlighter</artifactId>
    <version>8.9.0</version>
</dependency>

Search Example

Assume the following query:

final String searchTerm = "\"War Force\"~1";

We expect this to find War WORD1 Force from our test data.
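
As a rough illustration of why a slop of 1 is enough here, the following sketch counts the tokens sitting between the two query terms in the test data. It does not use Lucene: the tokenize and gapBetween helpers are hypothetical, tokenization is approximated by lowercasing and splitting on whitespace, and only the simple case of two terms appearing in query order is covered.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

final class SlopSketch {

    // Simplified stand-in for analysis (an assumption): lowercase, then split on whitespace.
    static List<String> tokenize(String text) {
        return Arrays.asList(text.toLowerCase(Locale.ROOT).trim().split("\\s+"));
    }

    // Counts the tokens between the first occurrence of `first` and the next
    // occurrence of `second` after it. Returns -1 if either term is missing.
    static int gapBetween(List<String> tokens, String first, String second) {
        int start = tokens.indexOf(first);
        if (start < 0) {
            return -1;
        }
        return tokens.subList(start + 1, tokens.size()).indexOf(second);
    }

    public static void main(String[] args) {
        List<String> tokens = tokenize(
                "State WEAPONRY 123 War WORD1 Force And Book4 Book WEAPONRY");
        // One token (word1) sits between "war" and "force".
        System.out.println(gapBetween(tokens, "war", "force")); // prints 1
    }
}
```

With one intervening token, the phrase does not match at slop 0 but does match at slop 1, which is consistent with what the question reports.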

The first part of the process performs a standard query execution, using the classic query parser:

Directory dir = FSDirectory.open(Paths.get(indexPath));
try (DirectoryReader dirReader = DirectoryReader.open(dir)) {
    IndexSearcher indexSearcher = new IndexSearcher(dirReader);
    Analyzer analyzer = new StandardAnalyzer();
    QueryParser parser = new QueryParser(fieldName, analyzer);
    Query query = parser.parse(searchTerm);
    TopDocs topDocs = indexSearcher.search(query, 100);
    ScoreDoc[] hits = topDocs.scoreDocs;
    for (ScoreDoc hit : hits) {
        handleHit(hit, query, dirReader, indexSearcher);
    }
}

The handleHit() method (shown below) is where we use the FastVectorHighlighter.

If you only want to perform highlighting (and do not need position/offset data), you can use:

FastVectorHighlighter fvh = new FastVectorHighlighter();
FieldQuery fieldQuery = fvh.getFieldQuery(query, dirReader);
String fragment = fvh.getBestFragment(fieldQuery, dirReader, docId, fieldName, fragCharSize);

But to access the extra data we need, you can do the following:

boolean phraseHighlight = true;
boolean fieldMatch = true;
FieldQuery fieldQuery = new FieldQuery(query, dirReader, phraseHighlight, fieldMatch);

FieldTermStack fieldTermStack = new FieldTermStack(dirReader, hit.doc, fieldName, fieldQuery);
FieldPhraseList fieldPhraseList = new FieldPhraseList(fieldTermStack, fieldQuery);
FragListBuilder fragListBuilder = new SimpleFragListBuilder();
FragmentsBuilder fragmentsBuilder = new SimpleFragmentsBuilder();
FastVectorHighlighter fvh = new FastVectorHighlighter(phraseHighlight, fieldMatch,
        fragListBuilder, fragmentsBuilder);

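This builds a FieldPhraseList from the document's term vectors and the field query, along with a FastVectorHighlighter that uses the same phraseHighlight and fieldMatch settings. The FieldPhraseList is populated when it is constructed.

The getBestFragments call now becomes: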

// use whatever you want for these settings:
int fragCharSize = 100;
int maxNumFragments = 100;
String[] preTags = new String[]{"-->"};
String[] postTags = new String[]{"<--"};

Encoder encoder = new DefaultEncoder();
// the fragments string array contains the highlighted results:
String[] fragments = fvh.getBestFragments(fieldQuery, dirReader, hit.doc,
        fieldName, fragCharSize, maxNumFragments, fragListBuilder,
        fragmentsBuilder, preTags, postTags, encoder);

And finally we can use the fieldPhraseList to access the data we need:

// the following gives you access to positions and offsets:
fieldPhraseList.getPhraseList().forEach(weightedPhraseInfo -> {
    int phraseStartOffset = weightedPhraseInfo.getStartOffset(); // 19
    int phraseEndOffset = weightedPhraseInfo.getEndOffset();     // 34
    weightedPhraseInfo.getTermsInfos().forEach(termInfo -> {
        String term = termInfo.getText();                // "war"  "force"
        int termPosition = termInfo.getPosition() + 1;    // 4      6
        int termStartOffset = termInfo.getStartOffset(); // 19     29
        int termEndOffset = termInfo.getEndOffset();     // 22     34
    });
});

The phraseStartOffset and phraseEndOffset are character counts telling us where the whole phrase can be found in the source document:

State WEAPONRY 123 War WORD1 Force And Book4 Book WEAPONRY

So, in our case, this is the string from offsets 19 through 34 (offset 0 is the position on the left hand side of the first "S").
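
To make the offsets concrete, here is a minimal sketch in plain Java (no Lucene involved) that slices the test string using the values reported above. The slice helper is a hypothetical name used only for illustration, and the sketch assumes the end offset is exclusive, which is what the numbers above imply.

```java
final class OffsetSketch {

    // Returns the text between a start offset (inclusive) and an end offset (exclusive).
    static String slice(String source, int startOffset, int endOffset) {
        return source.substring(startOffset, endOffset);
    }

    public static void main(String[] args) {
        String content = "State WEAPONRY 123 War WORD1 Force And Book4 Book WEAPONRY";
        System.out.println(slice(content, 19, 34)); // prints "War WORD1 Force"
        System.out.println(slice(content, 19, 22)); // prints "War"
        System.out.println(slice(content, 29, 34)); // prints "Force"
    }
}
```

Because the offsets index into the original stored text, the slices keep the original capitalization even though the indexed terms are lowercased.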

Then, for each specific term ("war" and "force") in the search query, we can access its offsets and also its word position (termPosition). Position 0 is the first word, so I add 1 to this index to give "war" at position 4 and "force" at position 6 in the original document:

1     2        3   4   5     6     7   8     9    10
State WEAPONRY 123 War WORD1 Force And Book4 Book WEAPONRY
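
The numbering above can be cross-checked with a small sketch in plain Java. It is only an approximation: it assumes that lowercasing and splitting on whitespace yields the same tokens as the analyzer does for this simple input, and oneBasedPositions is a hypothetical helper, not a Lucene method.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class PositionSketch {

    // Returns the 1-based word positions at which `term` (already lowercased) occurs.
    static List<Integer> oneBasedPositions(String content, String term) {
        String[] words = content.trim().split("\\s+");
        List<Integer> positions = new ArrayList<>();
        for (int i = 0; i < words.length; i++) {
            if (words[i].toLowerCase(Locale.ROOT).equals(term)) {
                positions.add(i + 1); // zero-based index plus 1, as in the answer
            }
        }
        return positions;
    }

    public static void main(String[] args) {
        String content = "State WEAPONRY 123 War WORD1 Force And Book4 Book WEAPONRY";
        System.out.println(oneBasedPositions(content, "war"));      // prints [4]
        System.out.println(oneBasedPositions(content, "force"));    // prints [6]
        System.out.println(oneBasedPositions(content, "weaponry")); // prints [2, 10]
    }
}
```

For real documents, rely on the positions Lucene reports, since punctuation and analyzer settings can make them differ from a simple whitespace count.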

Here is the complete code for reference:

import java.io.IOException;
import java.math.BigDecimal;
import java.math.RoundingMode;
import java.nio.file.Paths;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.highlight.DefaultEncoder;
import org.apache.lucene.search.highlight.Encoder;
import org.apache.lucene.search.vectorhighlight.FastVectorHighlighter;
import org.apache.lucene.search.vectorhighlight.FieldPhraseList;
import org.apache.lucene.search.vectorhighlight.FieldQuery;
import org.apache.lucene.search.vectorhighlight.FieldTermStack;
import org.apache.lucene.search.vectorhighlight.FragListBuilder;
import org.apache.lucene.search.vectorhighlight.FragmentsBuilder;
import org.apache.lucene.search.vectorhighlight.SimpleFragListBuilder;
import org.apache.lucene.search.vectorhighlight.SimpleFragmentsBuilder;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class VectorIndexHighlighterDemo {

    final String indexPath = "./index";
    final String fieldName = "body";
    final String searchTerm = "\"War Force\"~1";

    public void doDemo() throws IOException, ParseException {

        Directory dir = FSDirectory.open(Paths.get(indexPath));
        try ( DirectoryReader dirReader = DirectoryReader.open(dir)) {
            IndexSearcher indexSearcher = new IndexSearcher(dirReader);
            Analyzer analyzer = new StandardAnalyzer();
            QueryParser parser = new QueryParser(fieldName, analyzer);
            Query query = parser.parse(searchTerm);

            System.out.println();
            System.out.println("Search term: [" + searchTerm + "]");
            System.out.println("Parsed query: [" + query.toString() + "]");

            TopDocs topDocs = indexSearcher.search(query, 100);

            ScoreDoc[] hits = topDocs.scoreDocs;
            for (ScoreDoc hit : hits) {
                handleHit(hit, query, dirReader, indexSearcher);
            }
        }
    }

    private void handleHit(ScoreDoc hit, Query query, DirectoryReader dirReader,
            IndexSearcher indexSearcher) throws IOException {

        boolean phraseHighlight = true;
        boolean fieldMatch = true;
        FieldQuery fieldQuery = new FieldQuery(query, dirReader, phraseHighlight, fieldMatch);

        FieldTermStack fieldTermStack = new FieldTermStack(dirReader, hit.doc, fieldName, fieldQuery);
        FieldPhraseList fieldPhraseList = new FieldPhraseList(fieldTermStack, fieldQuery);
        FragListBuilder fragListBuilder = new SimpleFragListBuilder();
        FragmentsBuilder fragmentsBuilder = new SimpleFragmentsBuilder();
        FastVectorHighlighter fvh = new FastVectorHighlighter(phraseHighlight, fieldMatch,
                fragListBuilder, fragmentsBuilder);

        // use whatever you want for these settings:
        int fragCharSize = 100;
        int maxNumFragments = 100;
        String[] preTags = new String[]{"-->"};
        String[] postTags = new String[]{"<--"};
        
        Encoder encoder = new DefaultEncoder();
        // the fragments string array contains the highlighted results:
        String[] fragments = fvh.getBestFragments(fieldQuery, dirReader, hit.doc,
                fieldName, fragCharSize, maxNumFragments, fragListBuilder,
                fragmentsBuilder, preTags, postTags, encoder);

        // the following gives you access to positions and offsets:
        fieldPhraseList.getPhraseList().forEach(weightedPhraseInfo -> {
            int phraseStartOffset = weightedPhraseInfo.getStartOffset(); // 19
            int phraseEndOffset = weightedPhraseInfo.getEndOffset();     // 34
            weightedPhraseInfo.getTermsInfos().forEach(termInfo -> {
                String term = termInfo.getText();                // "war"  "force"
                int termPosition = termInfo.getPosition() + 1;    // 4      6
                int termStartOffset = termInfo.getStartOffset(); // 19     29
                int termEndOffset = termInfo.getEndOffset();     // 22     34
            });
        });

        // get the scores, also, if needed:
        BigDecimal score = new BigDecimal(String.valueOf(hit.score))
                .setScale(3, RoundingMode.HALF_EVEN);
        Document hitDoc = indexSearcher.doc(hit.doc);
    }

}
