java - How to get positions from a document term vector in Lucene?

asked Oct 24, 2021 in Technique by 深蓝 (71.8m points)

I need to iterate over all documents in a Lucene index and obtain the positions at which each term occurs in each document. As far as I can tell from the Lucene javadoc, the way to do this is something like the following:

IndexReader ir = obtainIndexReader();
Terms tv = ir.getTermVector( doc, field );
TermsEnum terms = tv.iterator();
PostingsEnum p = null;
while( terms.next() != null ) {
    p = terms.postings( p, PostingsEnum.ALL );
    while( p.nextDoc() != PostingsEnum.NO_MORE_DOCS ) {
        int freq = p.freq();
        for( int i = 0; i < freq; i++ ) {
            int pos = p.nextPosition();   // Always returns -1!!!
            BytesRef data = p.getPayload();
            doStuff( freq, pos, data ); // Fails miserably, of course.
        }
    }
}

However, even though (1) the index does include positions for the relevant field and (2) the term vector claims to have positions (i.e., tv.hasPositions() == true), I keep getting -1 for every position.

First, am I doing something wrong? Is there an alternative way of iterating over postings on a per-document basis? Second: What is going on anyway? The index contains positions, the Terms instance returned by getTermVector claims to include positions, and I'm looking at the correct position values in Luke, yet I still get -1 when I try to access said values in my code. What gives?

EDIT: The relevant field was configured with the following options:

    FieldType ft = new FieldType();
    ft.setIndexOptions( IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS );
    ft.setStoreTermVectors( true );
    ft.setStoreTermVectorOffsets( true );
    ft.setStoreTermVectorPayloads( true );
    ft.setStoreTermVectorPositions( true );
    ft.setTokenized( true );
    return ft;
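
For reference, the data the loop above is expected to recover (for each term in one document, its frequency and the token positions at which it occurs) can be sketched in plain Java with no Lucene dependency. This is only an illustration of the expected shape of the result, not Lucene's implementation: the class name TermPositionsSketch and the method positionsByTerm are hypothetical, and the tokenization (lowercasing and splitting on whitespace) is an assumption that stands in for whatever analyzer the index actually uses.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical, Lucene-free model of a per-document term vector with positions.
public class TermPositionsSketch {

    // Maps each term to the 0-based token positions at which it occurs.
    // Assumption: tokens are lowercased and separated by whitespace.
    public static Map<String, List<Integer>> positionsByTerm(String text) {
        Map<String, List<Integer>> positions = new LinkedHashMap<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return positions; // no tokens, so no terms
        }
        String[] tokens = trimmed.toLowerCase(Locale.ROOT).split("\\s+");
        for (int pos = 0; pos < tokens.length; pos++) {
            positions.computeIfAbsent(tokens[pos], t -> new ArrayList<>()).add(pos);
        }
        return positions;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> tv = positionsByTerm("to be or not to be");
        for (Map.Entry<String, List<Integer>> e : tv.entrySet()) {
            // The frequency is the number of recorded positions for the term.
            System.out.println(e.getKey()
                    + " freq=" + e.getValue().size()
                    + " positions=" + e.getValue());
        }
    }
}
```

In these terms, the question is why every entry in each term's position list comes back as -1 when read through PostingsEnum.nextPosition(), although the frequencies look right.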


1 Answer

深蓝 · Answer 1 · 2021-10-23T19:39:46+0000

Did you set FieldType.setStoreTermVectorPositions(true) on your field type at index time? http://lucene.apache.org/core/5_5_0/core/org/apache/lucene/document/FieldType.html#setStoreTermVectorPositions(boolean)
