Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
1.6k views
in Technique[技术] by (71.8m points)

mysql - Solr not returning all documents after importing with the Data Handler

I have a Solr 8.7.0 installation and I'm using the Data Handler importer plugin via a MySQLi connection.

I have four entities declared:

<dataConfig>
  <dataSource type="JdbcDataSource"
              driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost:3306/hmsscot_bassculture"
              user="myuser"
              password="mypw"/>
  <document>
    <entity name="author" query="select id,type,firstname,surname,biographical_info,extrainfo from bassculture_author">
      <field column="id" name="id"/>
      <field column="type" name="type"/>
      <field column="firstname" name="firstname"/>
      <field column="surname" name="surname"/>
      <field column="biographical_info" name="biographical_info"/>
      <field column="extrainfo" name="extrainfo"/>
    </entity>

    <entity name="source" query="select id,type,short_title,full_title,publisher,author_id,orientation,variants from bassculture_source">
      <field column="id" name="id"/>
      <field column="type" name="type"/>
      <field column="short_title" name="short_title"/>
      <field column="full_title" name="full_title"/>
      <field column="publisher" name="publisher"/>
      <field column="author_id" name="author_id"/>
      <entity name="author" query="SELECT s.*, CONCAT(ba.firstname, ' ', ba.surname) AS author FROM bassculture_source s, bassculture_author ba WHERE s.id=${source.id} AND s.author_id = ba.id;">
        <field column="author" name="author"/>
      </entity>
      <field column="description" name="description"/>
      <field column="orientation" name="orientation"/>
      <field column="variants" name="variants"/>
    </entity>

    <entity name="copy" query="select id,type,folder,source_id,item_notes,seller,library,shelfmark,pagination,dimensions from bassculture_item">
      <field column="id" name="id"/>
      <field column="type" name="type"/>
      <field column="folder" name="folder"/>
      <field column="source_id" name="source_id"/>
      <entity name="source_title" query="select id,short_title from bassculture_source where id=${copy.source_id}">
        <field column="short_title" name="source_title"/>
      </entity>
      <entity name="source_author" query="SELECT bt.*, CONCAT(ba.firstname, ' ', ba.surname) AS source_author FROM bassculture_tune bt, bassculture_item c, bassculture_source s, bassculture_author ba WHERE c.id=${copy.id} AND c.source_id = s.id AND s.author_id = ba.id;">
        <field column="source_author" name="source_author"/>
      </entity>
      <field column="item_notes" name="item_notes"/>
      <field column="seller" name="seller"/>
      <field column="library" name="library"/>
      <field column="shelfmark" name="shelfmark"/>
      <field column="paginations" name="pagination"/>
      <field column="dimensions" name="dimension"/>
    </entity>

    <entity name="tune" query="select id,type,name,start_page,alternate_spellings,item_id from bassculture_tune">
      <field column="id" name="id"/>
      <field column="type" name="type"/>
      <field column="name" name="name"/>
      <entity name="source_title" query="select s.* FROM bassculture_source s, bassculture_item c, bassculture_tune bt where bt.id=${tune.id} AND c.source_id = s.id AND bt.item_id = c.id">
        <field column="short_title" name="source_title"/>
      </entity>
      <entity name="tune_author" query="SELECT bt.*, CONCAT(ba.firstname, ' ', ba.surname, ' ', ba.extrainfo) AS tune_author FROM bassculture_tune bt, bassculture_item c, bassculture_source s, bassculture_author ba WHERE bt.id=${tune.id} AND bt.item_id = c.id AND c.source_id = s.id AND s.author_id = ba.id;">
        <field column="tune_author" name="tune_author" />
      </entity>
      <field column="start_page" name="start_page"/>
      <field column="alternate_spellings" name="alternate_spellings"/>
      <field column="item_id" name="item_id"/>
    </entity>

  </document>
</dataConfig>

Now, I'm experiencing something which doesn't make sense to me. If I run the data importer leaving the 'entity' drop-down blank (i.e. import all entities):

enter image description here

I get:

Indexing completed. Added/Updated: 2357 documents. Deleted 0 documents. (Duration: 13s)

This is the correct number of documents (authors+sources+copies+tunes). Nevertheless, when I query the database I only get 1938 documents:

  "responseHeader":{
    "status":0,
    "QTime":103,
    "params":{
      "q":"*:*",
      "_":"1609335106436"}},
  "response":{"numFound":1938,"start":0,"numFoundExact":true,"docs":[
      {
    [...]

This are only the tunes (last entity in the configuration file above). I also see this in the dashboard:

enter image description here

If on the other hand I select the entities one by one (e.g. author etc...):

enter image description here

the plugin imports correctly the author, tune, and copy entities (each time the . query reflects the documents imported). Once I get to the fourth entity though (tune), the index apparently 'forgets' about the previous three entities - although after running it, plugin reports 'documents deleted: 0' - and the . query goes back to only 1938 documents found (i.e. only tunes).

There's no error message in the logs. What am I missing?

PARTIAL SOLUTION

I managed to add a prefix to the id in order to differentiate the four different data, so that unique IDs don't get rewritten, eg:

SELECT name,start_page,alternate_spellings,item_id, CONCAT('tune_', id) AS id, 'tune' as type FROM bassculture_tune;

Nevertheless, I need the database id (without the prefix) of the current tune, in this case, for some later comparison, eg:

  <entity name="tune_author" query="SELECT bt.*, CONCAT(ba.firstname, ' ', ba.surname, ' ', ba.extrainfo) AS tune_author FROM bassculture_tune bt, bassculture_item c, bassculture_source s, bassculture_author ba WHERE bt.id=${tune.id} AND bt.item_id = c.id AND c.source_id = s.id AND s.author_id = ba.id;">
    <field column="tune_author" name="tune_author" />
  </entity>

Since ${tune.id} now has a prefix the whole query doesn't do what I need any more. Is there a way to strip the prefix locally?

Edit 2

The query

<entity name="tune_author" query="select s.* FROM bassculture_source s, bassculture_item c, bassculture_tune bt WHERE bt.id=REPLACE(${tune.id}, 'tune_', '') AND c.source_id = s.id AND bt.item_id = c.id;">

throws an error (unable to execute query) on importing data on Solr.

This is the error in the Solr log:

Caused by: java.sql.SQLSyntaxErrorException: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'SELECT REPLACE(tune_1, 'tune_', ''), AND c.source_id = s.id AND bt.item_id = c.i' at line 1

PS

Something like

select item_id FROM bassculture_tune bt WHERE bt.id= (SELECT REPLACE('tune_1', 'tune_', ''));

on MySQL console works just fine.

Introducing variables

I'm trying my luck with a variable now:

<entity name="this_tune_id" query="SET @this_tune_id = REPLACE('${tune.id}','tune_','');">
        </entity>
      <entity name="source_title" query="select s.* FROM bassculture_source s, bassculture_item c, bassculture_tune bt WHERE c.source_id = s.id AND bt.item_id = c.id AND bt.id = ${this_tune_id};">
        <field column="short_title" name="source_title"/>
      </entity>

This gives me a

org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.ArrayIndexOutOfBoundsException: Index -1 out of bounds for length 1

error.

FINAL SOLUTION

I am storing the database ID as this_tune_id, and the Solr id (with the prefix) as id, so that I can use this_tune_id for my queries while still storing a prefixed id in Solr:

<entity name="tune" query="SELECT name,start_page,alternate_spellings,item_id, id AS this_tune_id, CONCAT('tune_', id) AS id, 'tune' as type FROM bassculture_tune;">

  <field column="name" name="name"/>

  <entity name="source_title" query="select s.* FROM bassculture_source s, bassculture_item c, bassculture_tune bt WHERE c.source_id = s.id AND bt.item_id = c.id AND bt.id = ${tune.this_tune_id};">

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

The screenshot containing data from your import reveals the reason: maxDocs shows there has been 2357 documents imported; but there is 419 that has been marked as deleted. Your unique key field (usually id) has overlap between the documents you're importing, resulting in the newer documents overwriting the older ones.

419 documents has been overwritten by documents imported later because over overlapping ids.

You can solve this by prepending the entity type to your ids (there is no need for the ids to be numeric) - the easiest way is to prefix it in your SQL:

SELECT CONCAT('tune_', id) AS id, FROM ..
SELECT CONCAT('author_', id) AS id, .. FROM ..
... repeating for each source ..

That way the id for an author will be author_1 and will not overwrite tune_1 as it would otherwise, where both would have 1 as their ids.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...