Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


lucene - not_indexed field is stored in index

I'm trying to optimize my elasticsearch schema.

I have a field which is a URL. I do not need to query or filter on it, just retrieve it.

My understanding is that a field that is defined as "index":"no" is not indexed, but is still stored in the index. (see slide 5 in http://www.slideshare.net/nitin_stephens/lucene-basics) This should correspond to Lucene's UnIndexed, right?

This confuses me. Is there a way to store some fields without them taking more storage than their content alone, and without encumbering the index for the other fields?

What am I missing?



1 Answer


I'm new to posting on Stack Exchange but believe I can help a bit!

There are a few considerations here:

Analyzing

As you don't want to do extra work at index time, you should set "index": "no". The field is then not indexed at all, so it is not run through any tokenizers or filters.

Furthermore, a query directed at that specific field will not find it: (no hits)

{
    "query": {
        "term": {
            "url": "http://www.domain.com/exact/url/that/was/sent/to/elasticsearch"
        }
    }
}

*here "url" is the field name.

However, the value will still be searchable through the _all field: (might have a hit)

{
    "query": {
        "term": {
            "_all": "http://www.domain.com/exact/url/that/was/sent/to/elasticsearch"
        }
    }
}
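
To make the difference between the two queries concrete, here is a minimal Python sketch that builds both request bodies. The helper name term_query is made up for illustration, and the comments only restate the behaviour described above; nothing is sent to a server.

```python
import json


def term_query(field, value):
    """Build a term query body against a single field (hypothetical helper)."""
    return {"query": {"term": {field: value}}}


url_value = "http://www.domain.com/exact/url/that/was/sent/to/elasticsearch"

# Aimed at the field itself: no hits expected when "url" has "index": "no".
by_field = term_query("url", url_value)

# Aimed at _all: may still hit unless the field is excluded from _all.
via_all = term_query("_all", url_value)

print(json.dumps(by_field, indent=4))
print(json.dumps(via_all, indent=4))
```

The two bodies differ only in the field name, which is why an unindexed field can still surface through _all.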

_all field

By default every field gets put in the _all field. Set "include_in_all": "false" to stop that. This might not be an issue for you, as you are unlikely to search against the _all field with a URL by mistake.

I was working with a schema where countries were stored as 2-letter codes, e.g. "NO" means Norway. It is quite possible that someone would search the _all field for "NO", so I made sure to set "include_in_all": "false" on that field.

Note: Any query where you don't specify a field explicitly will be executed against the _all field.
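
As a rough illustration of the country-code case, the sketch below switches include_in_all off on an existing field mapping. The helper name exclude_from_all, the field name "country" and the "not_analyzed" setting are assumptions for the example, not something taken from the schema mentioned above.

```python
import json


def exclude_from_all(field_mapping):
    """Return a copy of a field mapping that is kept out of the _all field."""
    updated = dict(field_mapping)
    updated["include_in_all"] = "false"
    return updated


# Assumed starting point: a code field that is indexed as a single exact term.
country_base = {"type": "string", "index": "not_analyzed"}
country = exclude_from_all(country_base)

print(json.dumps({"country": country}, indent=4))
```

The field stays searchable by name (a term query on "country"), but a search that falls back to _all no longer matches "NO" through it.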

Storing

By default, elasticsearch will store your entire document (unanalyzed, as you sent it), and this will be returned to you in a hit's _source field. If you turn this off (perhaps because your elasticsearch db is getting huge), then you need to explicitly set "store": "yes" on each field you want to get back. (One thing to notice is that store takes yes or no and not true or false - it tripped me up!)

Note: if you do this you will need to request the fields you want returned to you explicitly. e.g.:

curl -XGET 'http://path/index_name/type_name/id?fields=url,another_field'
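
If you do go down that route, the pieces fit together roughly as in this Python sketch: a mapping with _source switched off and each field stored individually, plus the GET URL that asks for those fields. The helper names, the localhost:9200 address and the document id are assumptions for illustration; the mapping keys are the legacy ones used in this answer.

```python
import json
from urllib.parse import urlencode


def stored_fields_mapping(type_name, field_names):
    """Mapping with _source disabled and each listed field stored on its own."""
    properties = {
        name: {"type": "string", "index": "no", "store": "yes"}
        for name in field_names
    }
    return {type_name: {"_source": {"enabled": False}, "properties": properties}}


def get_url(base_url, index_name, type_name, doc_id, fields):
    """URL for fetching one document while naming the stored fields to return."""
    query = urlencode({"fields": ",".join(fields)}, safe=",")
    return f"{base_url}/{index_name}/{type_name}/{doc_id}?{query}"


fields = ["url", "another_field"]
mapping = stored_fields_mapping("type_name", fields)
url = get_url("http://localhost:9200", "index_name", "type_name", "1", fields)

print(json.dumps(mapping, indent=4))
print(url)
```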

Finally...

I would let elasticsearch store your whole document (the default) and use the following mapping, adding the mappings for your other fields next to "url" under "properties".

{
    "type_name": {
        "properties": {
            "url": {
                "type": "string",
                "index": "no",
                "include_in_all": "false"
            }
        }
    }
}
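
For completeness, here is one way the mapping could be sent when creating the index, sketched in Python with the standard library only. The address localhost:9200, the index name and the helper name create_index_request are assumptions; the request is built but deliberately not sent, so the snippet runs without a server.

```python
import json
import urllib.request


def create_index_request(base_url, index_name, type_mapping):
    """Build (but do not send) a PUT request that creates an index with a mapping."""
    body = json.dumps({"mappings": type_mapping}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/{index_name}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


type_mapping = {
    "type_name": {
        "properties": {
            "url": {"type": "string", "index": "no", "include_in_all": "false"}
        }
    }
}

request = create_index_request("http://localhost:9200", "index_name", type_mapping)
print(request.get_method(), request.full_url)
# Passing `request` to urllib.request.urlopen would actually send it.
```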

Source: elasticsearch documentation

