Solr / SOLR-2094

When using an XPathEntityProcessor nested within another entity, the xpathReader isn't reinitialized for each new document

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.4.1
    • Fix Version/s: 6.3
    • Labels:
      None
    • Environment:

      Solr 1.4

      Description

      I have a DIH config with a SqlEntityProcessor that retrieves a table. I then have a sub-entity of type XPathEntityProcessor, which takes a value from the table as input to parse through an XML doc.
      I find that the first document is created correctly, but the xpathReader of the XPathEntityProcessor is not reinitialized for the following documents, so the first document's input is reused.

      <dataSource name="hivseqdb" driver="com.mysql.jdbc.Driver"
                  url="l"
                  user="hivseqdb" password="hivseqdb" batchSize="1"/>

      <dataSource name="xmlFile" type="FileDataSource" />

      <document>
        <entity name="Sequence" dataSource="hivseqdb" pk="se_id"
                query="SELECT * FROM hivseqdb.sequenceentry where se_id != '1'">

          <entity name="FMA_Tissue_Hierarchy"
                  dataSource="xmlFile"
                  pk="fma-id"
                  forEach="/tissue-samples"
                  processor="XPathEntityProcessor"
                  url="/opt/hivseqdb/solr/conf/sub_ontology_translated.xml"
                  stream="true">
            <field column="tissue-antology-parent-path" xpath="/tissue-samples/tissue[@fma-id='${Sequence.sampleTissueCode}']/parent-path"/>
          </entity>


      DocBuilder does call init on the XPathEntityProcessor, but there is a conditional in the init method that checks whether the xpathReader is null:

        public void init(Context context) {
          super.init(context);
          if (xpathReader == null)
            initXpathReader();
          pk = context.getEntityAttribute("pk");
          dataSource = context.getDataSource();
          rowIterator = null;
      
        }
      

      So the xpathReader is used again and again. Is there a way to reinitialize the xpathReader for every document? Or what is the specific design reason for preserving it?
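
The effect of the null check can be reproduced outside Solr with a minimal sketch. The classes below are hypothetical stand-ins written for illustration, not the DIH classes: `TemplateReader` is assumed to resolve its XPath template once, at construction, and `CachingProcessor.init` mirrors the `if (xpathReader == null)` guard quoted above.

```java
import java.util.Map;

/** Hypothetical stand-in: resolves its XPath template once, at construction. */
final class TemplateReader {
    final String resolvedXpath;

    TemplateReader(String template, Map<String, String> vars) {
        String s = template;
        for (Map.Entry<String, String> e : vars.entrySet()) {
            // Literal replacement of ${name} with the value from the parent row.
            s = s.replace("${" + e.getKey() + "}", e.getValue());
        }
        resolvedXpath = s;
    }
}

/** Hypothetical stand-in for an entity processor that keeps its reader across init() calls. */
final class CachingProcessor {
    private final String template;
    private TemplateReader reader;

    CachingProcessor(String template) {
        this.template = template;
    }

    void init(Map<String, String> parentRow) {
        // Same shape as the guard in the issue: the reader is built once, then reused.
        if (reader == null) {
            reader = new TemplateReader(template, parentRow);
        }
    }

    String currentXpath() {
        return reader.resolvedXpath;
    }
}

final class StaleReaderDemo {
    public static void main(String[] args) {
        CachingProcessor p = new CachingProcessor(
            "/tissue-samples/tissue[@fma-id='${Sequence.sampleTissueCode}']/parent-path");
        p.init(Map.of("Sequence.sampleTissueCode", "A"));
        System.out.println(p.currentXpath());
        p.init(Map.of("Sequence.sampleTissueCode", "B"));
        // The second call does not rebuild the reader, so the xpath is still the one built for "A".
        System.out.println(p.currentXpath());
    }
}
```

In this model the second parent row never reaches the reader, which matches the symptom reported: every document after the first is built from the first document's input.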

      1. SOLR-2094.patch
        14 kB
        Cao Manh Dat
      2. SOLR-2094.patch
        11 kB
        Cao Manh Dat
      3. SOLR-2094.patch
        11 kB
        Noble Paul
      4. SOLR-2094.patch
        12 kB
        Noble Paul

        Activity

        daanbiere@gmail.com Daan Biere added a comment -

        I've got exactly the same problem (SOLR 3.6).
        Is there any way to avoid this behaviour?

        nialloc Niall O'Connor added a comment -

        In dataimporthandler/src/main/java/org/apache/solr/handler/dataimport/XPathEntityProcessor.java, I removed the "if (xpathReader == null)" check and rebuilt the package so that the xpathReader was re-initialized.

        I didn't commit this change since there was no activity on this issue.

        daanbiere@gmail.com Daan Biere added a comment - - edited

        Hi Niall,

        Thank you for your quick reply.
        That is exactly what I just did: I checked out the 3.5 branch and removed the line in
        ./solr/contrib/dataimporthandler/src/java/org/apache/solr/handler/dataimport/XPathEntityProcessor.java
        so now I have:

          public void init(Context context) {
            super.init(context);
            initXpathReader();
            pk = context.getEntityAttribute("pk");
        

        but the problem persists.
        My configuration is exactly like yours; in fact, I copied your config and changed the xpaths and database parameters.

        sejbot Tobias Berg added a comment -

        I'm having the same issue in 3.6. If there is a reason not to re-initialize the XPathReader every time, maybe the variable resolution could be moved from the initialization to readRow().

        caomanhdat Cao Manh Dat added a comment -

        In this patch, I solved the problem by resolving and caching variables when rows are read, rather than resolving them when the XPathRecordReader is initialized, as was done before.
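
The idea of resolving at read time can be sketched roughly as follows. This is only an illustration of the approach, not the attached patch: `LazyTemplate` and its `resolve` method are hypothetical names, and the cache here is assumed to hold just the most recent resolution.

```java
import java.util.Map;

/** Hypothetical helper: keeps the raw template and resolves it when a row is read. */
final class LazyTemplate {
    private final String template;
    private Map<String, String> lastVars;   // variables used for the cached result
    private String lastResolved;            // cached result for lastVars

    LazyTemplate(String template) {
        this.template = template;
    }

    /** Resolves ${name} placeholders against the current variables, reusing the last result if unchanged. */
    String resolve(Map<String, String> vars) {
        if (!vars.equals(lastVars)) {
            String s = template;
            for (Map.Entry<String, String> e : vars.entrySet()) {
                s = s.replace("${" + e.getKey() + "}", e.getValue());
            }
            lastVars = Map.copyOf(vars);    // defensive copy so later caller changes do not affect the cache
            lastResolved = s;
        }
        return lastResolved;
    }
}

final class LazyTemplateDemo {
    public static void main(String[] args) {
        LazyTemplate t = new LazyTemplate("/tissue-samples/tissue[@fma-id='${Sequence.sampleTissueCode}']/parent-path");
        System.out.println(t.resolve(Map.of("Sequence.sampleTissueCode", "A")));
        System.out.println(t.resolve(Map.of("Sequence.sampleTissueCode", "B")));
    }
}
```

Because resolution happens on each read, the object holding the template can stay cached while the resolved xpath still follows the current parent row.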

        noble.paul Noble Paul added a comment -

        Diagnosis:

        XPathRecordReader objects are cached and reused between XMLs. It's OK as long as the xpaths themselves don't have any variables. If the xpath has a variable such as

         
        xpath="/tissue-samples/tissue[@fma-id='${Sequence.sampleTissueCode}']/parent-path" 
        

        then it needs to be recreated before starting with each XML file.

        Solutions:

        1. Make XPathRecordReader aware of the templates and recompute them before each XML
        2. If templates are present in xpath or forEach, discard the XPathRecordReader instance before every XML

        For the sake of simplicity, I would recommend #2.
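
Option 2 can be sketched as follows. This is a simplified model under stated assumptions, not the committed change: `TemplateAwareProcessor` is a hypothetical class, a template is assumed to be any expression containing `${`, and a list of resolved expressions stands in for the XPathRecordReader instance.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical processor that rebuilds its reader on every init() only when templates are present. */
final class TemplateAwareProcessor {
    private final List<String> expressions;  // forEach plus the field xpaths
    private final boolean hasTemplates;      // true if any expression contains a ${...} placeholder
    private List<String> reader;             // resolved expressions stand in for the reader
    private int rebuilds;

    TemplateAwareProcessor(List<String> expressions) {
        this.expressions = List.copyOf(expressions);
        this.hasTemplates = expressions.stream().anyMatch(e -> e.contains("${"));
    }

    void init(Map<String, String> parentRow) {
        // Without templates the cached reader is reused; with templates it is discarded and rebuilt.
        if (reader == null || hasTemplates) {
            List<String> resolved = new ArrayList<>();
            for (String expression : expressions) {
                String s = expression;
                for (Map.Entry<String, String> v : parentRow.entrySet()) {
                    s = s.replace("${" + v.getKey() + "}", v.getValue());
                }
                resolved.add(s);
            }
            reader = resolved;
            rebuilds++;
        }
    }

    List<String> reader() {
        return reader;
    }

    int rebuilds() {
        return rebuilds;
    }
}

final class TemplateAwareDemo {
    public static void main(String[] args) {
        TemplateAwareProcessor p = new TemplateAwareProcessor(List.of(
            "/tissue-samples",
            "/tissue-samples/tissue[@fma-id='${Sequence.sampleTissueCode}']/parent-path"));
        p.init(Map.of("Sequence.sampleTissueCode", "A"));
        p.init(Map.of("Sequence.sampleTissueCode", "B"));
        System.out.println(p.reader() + " after " + p.rebuilds() + " rebuilds");
    }
}
```

The trade-off in this model is that configurations without variables keep the cheap cached path, while templated ones pay the cost of a rebuild for each parent row.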

        caomanhdat Cao Manh Dat added a comment -

        Thanks, Noble Paul.
        This patch is based on solution #2; it is much cleaner and simpler.

        noble.paul Noble Paul added a comment -

        there was a bug

        jira-bot ASF subversion and git services added a comment -

        Commit d6b6e74703d5f2d29c110d3a7d9491306af9be2c in lucene-solr's branch refs/heads/master from Noble Paul
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=d6b6e74 ]

        SOLR-2094: XPathEntityProcessor should reinitialize the XPathRecordReader instance if the 'forEach' or 'xpath' attributes are templates & it is not a root entity

        jira-bot ASF subversion and git services added a comment -

        Commit fa6fbc08342da5b4e4a4073e587f40892297a9f7 in lucene-solr's branch refs/heads/branch_6x from Noble Paul
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=fa6fbc0 ]

        SOLR-2094: XPathEntityProcessor should reinitialize the XPathRecordReader instance if the 'forEach' or 'xpath' attributes are templates & it is not a root entity

        shalinmangar Shalin Shekhar Mangar added a comment -

        Closing after 6.3.0 release.


          People

          • Assignee:
            noble.paul Noble Paul
            Reporter:
            nialloc Niall O'Connor
          • Votes:
            1
            Watchers:
            6
