Solr
  1. Solr
  2. SOLR-1437

DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 1.4
    • Fix Version/s: 1.4
    • Labels:
      None

      Description

      As per http://www.nabble.com/Re%3A-Extract-info-from-parent-node-during-data-import-%28redirect%3A%29-td25471162.html it would be nice to be able to use expressions such as //tagname when parsing XML documents.

      1. putNulls_using_bitset.patch
        5 kB
        Noble Paul
      2. SOLR-1437.patch
        48 kB
        Fergus McMenemie
      3. masked_NPE.patch
        2 kB
        Noble Paul
      4. SOLR-1437.patch
        49 kB
        Noble Paul
      5. SOLR-1437.patch
        49 kB
        Noble Paul
      6. SOLR-1437.patch
        49 kB
        Fergus McMenemie
      7. SOLR-1437.patch
        47 kB
        Noble Paul
      8. SOLR-1437.patch
        46 kB
        Fergus McMenemie

        Activity

        Hide
        Noble Paul added a comment -

        it may not be viable to target this for 1.4

        Show
        Noble Paul added a comment - it may not be viable to target this for 1.4
        Hide
        Fergus McMenemie added a comment -

        A pity we may not make the 1.4 release, but I guess there is no harm in trying!

        Looking through the code for XPathRecordReader I see a variable skipNextEvent inside the parse method. Can anybody explain why we need to skip an event at the end of a text block?

        Show
        Fergus McMenemie added a comment - A pity we may not make the 1.4 release, but I guess there is no harm in trying! Looking through the code for XPathRecordReader I see a variable skipNextEvent inside the parse method. Can anybody explain why we need to skip an event at the end of a text block?
        Hide
        Noble Paul added a comment -

        for any normal event , parser.next(); should be called in each iteration. But for CDATA it should not do so because handling of CDATA itself would have consumed the next event

        Show
        Noble Paul added a comment - for any normal event , parser.next(); should be called in each iteration. But for CDATA it should not do so because handling of CDATA itself would have consumed the next event
        Hide
        Fergus McMenemie added a comment -

        OK. Would it be bad if I recoded things around a renamed variable "eventAlreadyFetched"? IMHO it far more self-explanatory.

        Show
        Fergus McMenemie added a comment - OK. Would it be bad if I recoded things around a renamed variable "eventAlreadyFetched"? IMHO it far more self-explanatory.
        Hide
        Noble Paul added a comment - - edited

        renaming done in trunk

        let us be clear about what all do we wish to achieve in wild-card support

        • Do we wish to support wild-cards in the forEach parameter
        • How to handle the following case
          example
          d=  /a/b/c/d
          e  = //e
          

          and if the real path to the tag e = /a/x/e . it would not really hard to traverse both the paths

        Just keep in mind that It is hard to achieve everything. But is is easy to achieve a subset of features and meet most of the requirements. We are not trying to implement a complete framework for xml handling

        Show
        Noble Paul added a comment - - edited renaming done in trunk let us be clear about what all do we wish to achieve in wild-card support Do we wish to support wild-cards in the forEach parameter How to handle the following case example d= /a/b/c/d e = //e and if the real path to the tag e = /a/x/e . it would not really hard to traverse both the paths Just keep in mind that It is hard to achieve everything. But is is easy to achieve a subset of features and meet most of the requirements. We are not trying to implement a complete framework for xml handling
        Hide
        Fergus McMenemie added a comment - - edited

        Good points.

        I am still making sure I understand the existing code completely. Adding lots of comments.

        I was not going to allow wild-cards in the forEach parameter. Can you think of a use case?

        Your second point is covered. At this point my plan is to scrap skipTag, replacing it with equivalent inline code. Then after I have found some way of recording //tagnames within the node tree (I am considering the fact that currently it is invalid for the rootNode to have attributes; and I was thinking of abusing the rootNode attributes field and using it to store //tagname Node trees). Then where skipTag would currently be called, I would instead compare the parsed localname against the "attributes" of the rootTag.

        Potential issues I intend to ignored are. If we have:

         
        column="e" xpath="//e/f[@qualifier='fullTitle']"
        column="d" xpath="/a/b/c/d" flatten="true"
        

        and the XML stream contains

         
        /a/b/c/d/e...
        

        Then I will not populate column 'e'. Also the expression for column 'd' which is an absolute Xpath takes precedence over that for column 'e'.

        Thoughts?

        Show
        Fergus McMenemie added a comment - - edited Good points. I am still making sure I understand the existing code completely. Adding lots of comments. I was not going to allow wild-cards in the forEach parameter. Can you think of a use case? Your second point is covered. At this point my plan is to scrap skipTag, replacing it with equivalent inline code. Then after I have found some way of recording //tagnames within the node tree (I am considering the fact that currently it is invalid for the rootNode to have attributes; and I was thinking of abusing the rootNode attributes field and using it to store //tagname Node trees). Then where skipTag would currently be called, I would instead compare the parsed localname against the "attributes" of the rootTag. Potential issues I intend to ignored are. If we have: column="e" xpath="//e/f[@qualifier='fullTitle']" column="d" xpath="/a/b/c/d" flatten="true" and the XML stream contains /a/b/c/d/e... Then I will not populate column 'e'. Also the expression for column 'd' which is an absolute Xpath takes precedence over that for column 'e'. Thoughts?
        Hide
        Noble Paul added a comment -

        try not to make too many changes. If you have javadocs in your patch give it as a separate patch and I shall commit it immedietly .

        Keeping the patch small can make it easy for me to review it.

        Show
        Noble Paul added a comment - try not to make too many changes. If you have javadocs in your patch give it as a separate patch and I shall commit it immedietly . Keeping the patch small can make it easy for me to review it.
        Hide
        Noble Paul added a comment -

        I was not going to allow wild-cards in the forEach parameter. Can you think of a use case?

        not required.

        If a path can be reached in a non-wildcard xpath it should take precedence over the wild-card one

        Show
        Noble Paul added a comment - I was not going to allow wild-cards in the forEach parameter. Can you think of a use case? not required. If a path can be reached in a non-wildcard xpath it should take precedence over the wild-card one
        Hide
        Fergus McMenemie added a comment -

        A first patch to XPathRecordReader.java. These changes pass "ant clean test" and my own test cases. The changes are all about adding or improving comments and JavaDoc. There are very minor code changes where method definitions are changed from single to multiple lines, to allow descriptions to be added against the method arguments.

        One bigger code change involves removal of the skiptag method and its replacement with some in-iine code in the one place it was called from.

        Show
        Fergus McMenemie added a comment - A first patch to XPathRecordReader.java. These changes pass "ant clean test" and my own test cases. The changes are all about adding or improving comments and JavaDoc. There are very minor code changes where method definitions are changed from single to multiple lines, to allow descriptions to be added against the method arguments. One bigger code change involves removal of the skiptag method and its replacement with some in-iine code in the one place it was called from.
        Hide
        Noble Paul added a comment -

        the patch looks fine.

        • It is not created from the latest trunk. SO it did not apply
        • Remove the references to Solr (and other stuff like dataSource) in the documentation. XPathRecordReader is completely independent of Solr. We actually use it extensively in our internal projects for xml parsing
        Show
        Noble Paul added a comment - the patch looks fine. It is not created from the latest trunk. SO it did not apply Remove the references to Solr (and other stuff like dataSource) in the documentation. XPathRecordReader is completely independent of Solr. We actually use it extensively in our internal projects for xml parsing
        Hide
        Fergus McMenemie added a comment -

        Good to see you reuse your own code!

        This new patch is the same as the previous version excepting that the references to SOLR and datasource etc have been rewritten.

        Also, Noble, can you check over and review my comments around line 237 in the file XPathRecordReader.java. Is this correct?

              } else {
                // can we ever get here? This means we are collecting for an Xpath
                // that is outwith any forEach expression
                if (attributes != null || hasText)
                  valuesAddedinThisFrame = new HashSet<String>();
                stack.push(valuesAddedinThisFrame);
              }
        
        Show
        Fergus McMenemie added a comment - Good to see you reuse your own code! This new patch is the same as the previous version excepting that the references to SOLR and datasource etc have been rewritten. Also, Noble, can you check over and review my comments around line 237 in the file XPathRecordReader.java. Is this correct? } else { // can we ever get here? This means we are collecting for an Xpath // that is outwith any forEach expression if (attributes != null || hasText) valuesAddedinThisFrame = new HashSet< String >(); stack.push(valuesAddedinThisFrame); }
        Hide
        Noble Paul added a comment -

        committed r816577
        thanks Fergus

        Show
        Noble Paul added a comment - committed r816577 thanks Fergus
        Hide
        Lance Norskog added a comment -

        Should we just steal the streaming XPath implementation from TIKA? Does it do all we need?

        Note that it seems to only be used for their OpenDocument readers, so it might not have the breadth of testing that we would need.

        http://lucene.apache.org/tika/apidocs/org/apache/tika/sax/xpath/package-tree.html

        http://issues.apache.org/jira/browse/TIKA-116
        http://issues.apache.org/jira/browse/TIKA-129

        Show
        Lance Norskog added a comment - Should we just steal the streaming XPath implementation from TIKA? Does it do all we need? Note that it seems to only be used for their OpenDocument readers, so it might not have the breadth of testing that we would need. http://lucene.apache.org/tika/apidocs/org/apache/tika/sax/xpath/package-tree.html http://issues.apache.org/jira/browse/TIKA-116 http://issues.apache.org/jira/browse/TIKA-129
        Hide
        Noble Paul added a comment -

        DIH needs streaming of a records which contain a bunch of fields. The Tika one is not designed for that kind of consumption

        Show
        Noble Paul added a comment - DIH needs streaming of a records which contain a bunch of fields. The Tika one is not designed for that kind of consumption
        Hide
        Fergus McMenemie added a comment -

        Noble,

        Playing with the code... some observations I would like confirmed.

        1) inside parse() the valuesAddedinThisFrame HashSet and the Stack<Set<String>> stack variables are only used to aid in the clean up after out-puting record.

        2) The code seems unable to collect text for a forEach xpath. So for the following fragment of code

            String xml="<root>\n"
                     + "  <status>live</status>\n"
                     + "  <contenido id=\"10097\" idioma=\"cat\">\n"
                     + "    Cats can be cute\n"
                     + "    <antetitulo></antetitulo>\n"
                     + "    <titulo>\n           This is my title\n    </titulo>\n"
                     + "    <resumen>\n          This is my summary\n   </resumen>\n"
                     + "    <texto>\n     This is the body of my text\n   </texto>\n"
                     + "    </contenido>\n"
                     + "</root>";
            XPathRecordReader rr = new XPathRecordReader("/root/contenido");
            rr.addField("cat"   ,"/root/contenido", false); //  ***** FAILS *****
            rr.addField("id",    "/root/contenido/@id", false);
        

        we can get the string associated with the id attrbute of <contenido> but not its child text! Is this a design goal, or just the way the code ended up behaving. Do we want it to continue to work this way?

        Show
        Fergus McMenemie added a comment - Noble, Playing with the code... some observations I would like confirmed. 1) inside parse() the valuesAddedinThisFrame HashSet and the Stack<Set<String>> stack variables are only used to aid in the clean up after out-puting record. 2) The code seems unable to collect text for a forEach xpath. So for the following fragment of code String xml= "<root>\n" + " <status>live</status>\n" + " <contenido id=\" 10097\ " idioma=\" cat\ ">\n" + " Cats can be cute\n" + " <antetitulo></antetitulo>\n" + " <titulo>\n This is my title\n </titulo>\n" + " <resumen>\n This is my summary\n </resumen>\n" + " <texto>\n This is the body of my text\n </texto>\n" + " </contenido>\n" + "</root>" ; XPathRecordReader rr = new XPathRecordReader( "/root/contenido" ); rr.addField( "cat" , "/root/contenido" , false ); // ***** FAILS ***** rr.addField( "id" , "/root/contenido/@id" , false ); we can get the string associated with the id attrbute of <contenido> but not its child text! Is this a design goal, or just the way the code ended up behaving. Do we want it to continue to work this way?
        Hide
        Noble Paul added a comment -

        it is not an intentional.

        Show
        Noble Paul added a comment - it is not an intentional.
        Hide
        Noble Paul added a comment -

        inside parse() the valuesAddedinThisFrame HashSet and the Stack<Set<String>> stack variables are only used to aid in the clean up after out-puting record.

        yes,

        there is a testcase sameForEachAndXpath() which has same forEach and field. So, something is strange here

        Show
        Noble Paul added a comment - inside parse() the valuesAddedinThisFrame HashSet and the Stack<Set<String>> stack variables are only used to aid in the clean up after out-puting record. yes, there is a testcase sameForEachAndXpath() which has same forEach and field. So, something is strange here
        Hide
        Fergus McMenemie added a comment -

        Apologies! my testcase which uses the same Xpath with the forEach and field value works; once I removed the typo!

        Show
        Fergus McMenemie added a comment - Apologies! my testcase which uses the same Xpath with the forEach and field value works; once I removed the typo!
        Hide
        Fergus McMenemie added a comment -

        Still trying to understand all the ins and outs of the code!

        Around line 250 in XPathRecordReader.java we have the following expression "recordStarted && !isRecord". Why is this done? Why do we want to call putText() with a null value? Why do we only call putText Nodes where isRecord is false?

              try {
                if (attributes != null) {
                  for (Node node : attributes) {
                    String value = parser.getAttributeValue(null, node.name);
                    if (value != null || (recordStarted && !isRecord)) {
                      putText(values, value, node.fieldName, node.multiValued);
                      valuesAddedinThisFrame.add(node.fieldName);
                    }
                  }
                }
        
        Show
        Fergus McMenemie added a comment - Still trying to understand all the ins and outs of the code! Around line 250 in XPathRecordReader.java we have the following expression "recordStarted && !isRecord". Why is this done? Why do we want to call putText() with a null value? Why do we only call putText Nodes where isRecord is false? try { if (attributes != null ) { for (Node node : attributes) { String value = parser.getAttributeValue( null , node.name); if (value != null || (recordStarted && !isRecord)) { putText(values, value, node.fieldName, node.multiValued); valuesAddedinThisFrame.add(node.fieldName); } } }
        Hide
        Fergus McMenemie added a comment -

        A patch of XPathRecordReader.java adding a few cosmetic changes to comments plus.....

        A rewritten core of the parse() method. IMHO it is simpler to understand with the inner while and special paser.next state variable removed and it is probably faster as well.

        Rearranged method getOrAddChild renaming it to getOrAddNode in preparation for adding support for // construct. This method is called from build().

        While in build(), renamed local variable "name" to "xpseg" to better reflect that it holds a segment from an Xpath. And to remove warnings about local variables masking class variables.

        "ant clean test" runs OK with these changes!

        Show
        Fergus McMenemie added a comment - A patch of XPathRecordReader.java adding a few cosmetic changes to comments plus..... A rewritten core of the parse() method. IMHO it is simpler to understand with the inner while and special paser.next state variable removed and it is probably faster as well. Rearranged method getOrAddChild renaming it to getOrAddNode in preparation for adding support for // construct. This method is called from build(). While in build(), renamed local variable "name" to "xpseg" to better reflect that it holds a segment from an Xpath. And to remove warnings about local variables masking class variables. "ant clean test" runs OK with these changes!
        Hide
        Fergus McMenemie added a comment -

        Making good progress! The //expression and its testcase is done and appear to work well. I would have submitted a patch already except that having only altered "XPathRecordReader.java" I am now trying to get "ant clean test" to stop failing in org.apache.solr.handler.dataimport.TestErrorHandling.

        Also while I do not yet have a total grasp of the code I think it is possible to have XPathRecordReader.java only return fields which are children of the forEach expression when a record is emitted. I consider this desirable and would like to press ahead with this extra change; are there other views?

        Having established that "Set<String> valuesAddedinThisFrame" and "Stack<Set<String>> stack", in Node are only used for clearing up after a record has been emitted. I am now pondering the purpose of the HashSet childrenFound and putNulls. Any suggestions?

        Lastly what are the chances of this making the 1.4 release?

        Show
        Fergus McMenemie added a comment - Making good progress! The //expression and its testcase is done and appear to work well. I would have submitted a patch already except that having only altered "XPathRecordReader.java" I am now trying to get "ant clean test" to stop failing in org.apache.solr.handler.dataimport.TestErrorHandling. Also while I do not yet have a total grasp of the code I think it is possible to have XPathRecordReader.java only return fields which are children of the forEach expression when a record is emitted. I consider this desirable and would like to press ahead with this extra change; are there other views? Having established that "Set<String> valuesAddedinThisFrame" and "Stack<Set<String>> stack", in Node are only used for clearing up after a record has been emitted. I am now pondering the purpose of the HashSet childrenFound and putNulls. Any suggestions? Lastly what are the chances of this making the 1.4 release?
        Hide
        Fergus McMenemie added a comment -

        XPathRecordReader enhanced to allow use of //abc type expressions. Sorry but there are lots of changes!

        Lots of comments added which attempt to clarify its operation. Ditto for TestXPathRecordReader where the test strings have been unwrapped and formatted to make them clearer.

        Passes all tests except for TestErrorHandling.java which I think is sensitive to the new features.

        Show
        Fergus McMenemie added a comment - XPathRecordReader enhanced to allow use of //abc type expressions. Sorry but there are lots of changes! Lots of comments added which attempt to clarify its operation. Ditto for TestXPathRecordReader where the test strings have been unwrapped and formatted to make them clearer. Passes all tests except for TestErrorHandling.java which I think is sensitive to the new features.
        Hide
        Fergus McMenemie added a comment -

        incorporated SOLR-1465 into this patch!

        Show
        Fergus McMenemie added a comment - incorporated SOLR-1465 into this patch!
        Hide
        Noble Paul added a comment -

        Fergus can you give the patch updated to trunk?

        Show
        Noble Paul added a comment - Fergus can you give the patch updated to trunk?
        Hide
        Fergus McMenemie added a comment -

        regenerated patch against latest trunk

        Show
        Fergus McMenemie added a comment - regenerated patch against latest trunk
        Hide
        Noble Paul added a comment -

        removed a few unused lines and formatted at some places.

        Fergus, I am tempted to make it go in 1.4?
        It looks largely OK to me. What else do you plan to add ? Do we need more testcases?

        Show
        Noble Paul added a comment - removed a few unused lines and formatted at some places. Fergus, I am tempted to make it go in 1.4? It looks largely OK to me. What else do you plan to add ? Do we need more testcases?
        Hide
        Fergus McMenemie added a comment -

        I am quite pleased with it as far as it goes and think it would be good for 1.4. I have tested it against my test set of 3000 XML documents and replacing:

                <field column="para1" name="text"                xpath="/record/sect1/para" flatten="true"/>
                <field column="para2" name="text"                xpath="/record/list/listitem/para" flatten="true"/>
                <field column="para32"     name="text"                        xpath="/record/address/para"  flatten="true" />
                <field column="para40"     name="text"                        xpath="/record/authoredBy/para"  flatten="true" />
                <field column="para43"     name="text"                        xpath="/record/dataGroup/address/para"  flatten="true" />
                <field column="para47"     name="text"                        xpath="/record/dataGroup/keyPersonnel/doubleList/first/para"  flatten="true" />
                <field column="para49"     name="text"                        xpath="/record/dataGroup/keyPersonnel/doubleList/second/para"  flatten="true" />
                <field column="para50"     name="text"                        xpath="/record/dataGroup/keyPersonnel/para"  flatten="true" />
                <field column="para51"     name="text"                        xpath="/record/dataGroup/para"  flatten="true" />
                <field column="para57"     name="text"                        xpath="/record/doubleList/first/para"  flatten="true" />
                <field column="para59"     name="text"                        xpath="/record/doubleList/second/para"  flatten="true" />
                <field column="para63"     name="text"                        xpath="/record/keyPersonnel/doubleList/first/para"  flatten="true" />
                <field column="para65"     name="text"                        xpath="/record/keyPersonnel/doubleList/second/para"  flatten="true" />
                <field column="para68"     name="text"                        xpath="/record/list/listItem/para"  flatten="true" />
                <field column="para75"     name="text"                        xpath="/record/mediaBlock/doubleList/first/para"  flatten="true" />
                <field column="para77"     name="text"                        xpath="/record/mediaBlock/doubleList/second/para"  flatten="true" />
                <field column="para172"     name="text"                        xpath="/record/noteGroup/note/para"  flatten="true" />
                <field column="para174"     name="text"                        xpath="/record/para"  flatten="true" />
                <field column="para179"     name="text"                        xpath="/record/relatedInfo/list/listItem/relatedArticle/para"  flatten="true" />
                <field column="para184"     name="text"                        xpath="/record/sect1/address/dataGroup/para"  flatten="true" />
                <field column="para185"     name="text"                        xpath="/record/sect1/address/para"  flatten="true" />
                <field column="para195"     name="text"                        xpath="/record/sect1/dataGroup/address/para"  flatten="true" />
                <field column="para199"     name="text"                        xpath="/record/sect1/dataGroup/keyPersonnel/doubleList/first/para"  flatten="true" />
                <field column="para201"     name="text"                        xpath="/record/sect1/dataGroup/keyPersonnel/doubleList/second/para"  flatten="true" />
                <field column="para202"     name="text"                        xpath="/record/sect1/dataGroup/keyPersonnel/para"  flatten="true" />
                <field column="para203"     name="text"                        xpath="/record/sect1/dataGroup/para"  flatten="true" />
                <field column="para208"     name="text"                        xpath="/record/sect1/doubleList/first/para"  flatten="true" />
                <field column="para212"     name="text"                        xpath="/record/sect1/doubleList/second/list/listItem/para"  flatten="true" />
                <field column="para213"     name="text"                        xpath="/record/sect1/doubleList/second/para"  flatten="true" />
                <field column="para217"     name="text"                        xpath="/record/sect1/keyPersonnel/doubleList/first/para"  flatten="true" />
                <field column="para219"     name="text"                        xpath="/record/sect1/keyPersonnel/doubleList/second/para"  flatten="true" />
                <field column="para220"     name="text"                        xpath="/record/sect1/keyPersonnel/para"  flatten="true" />
                <field column="para225"     name="text"                        xpath="/record/sect1/list/listItem/list/listItem/para"  flatten="true" />
                <field column="para226"     name="text"                        xpath="/record/sect1/list/listItem/para"  flatten="true" />
                <field column="para240"     name="text"                        xpath="/record/sect1/para"  flatten="true" />
                <field column="para244"     name="text"                        xpath="/record/sect1/sect2/doubleList/first/para"  flatten="true" />
                <field column="para246"     name="text"                        xpath="/record/sect1/sect2/doubleList/second/para"  flatten="true" />
                <field column="para251"     name="text"                        xpath="/record/sect1/sect2/list/listItem/list/listItem/para"  flatten="true" />
                <field column="para252"     name="text"                        xpath="/record/sect1/sect2/list/listItem/para"  flatten="true" />
                <field column="para258"     name="text"                        xpath="/record/sect1/sect2/noteGroup/note/para"  flatten="true" />
                <field column="para259"     name="text"                        xpath="/record/sect1/sect2/para"  flatten="true" />
                <field column="para265"     name="text"                        xpath="/record/sect1/sect2/sect3/list/listItem/list/listItem/para"  flatten="true" />
                <field column="para266"     name="text"                        xpath="/record/sect1/sect2/sect3/list/listItem/para"  flatten="true" />
                <field column="para271"     name="text"                        xpath="/record/sect1/sect2/sect3/para"  flatten="true" />
                <field column="para275"     name="text"                        xpath="/record/sect1/sect2/sect3/sect4/list/listItem/para"  flatten="true" />
                <field column="para279"     name="text"                        xpath="/record/sect1/sect2/sect3/sect4/para"  flatten="true" />
                <field column="para284"     name="text"                        xpath="/record/sect1/sect2/sect3/sect4/sect5/para"  flatten="true" />
                <field column="para295"     name="text"                        xpath="/record/sect1/sect2/sect3/table/tgroup/tbody/row/entry/noteGroup/note/para"  flatten="true" />
                <field column="para297"     name="text"                        xpath="/record/sect1/sect2/sect3/table/tgroup/tbody/row/entry/para"  flatten="true" />
                <field column="para301"     name="text"                        xpath="/record/sect1/sect2/sect3/table/tgroup/thead/row/entry/para"  flatten="true" />
                <field column="para312"     name="text"                        xpath="/record/sect1/sect2/table/tgroup/tbody/row/entry/list/listItem/para"  flatten="true" />
                <field column="para315"     name="text"                        xpath="/record/sect1/sect2/table/tgroup/tbody/row/entry/noteGroup/note/para"  flatten="true" />
                <field column="para316"     name="text"                        xpath="/record/sect1/sect2/table/tgroup/tbody/row/entry/noteGroup/para"  flatten="true" />
                <field column="para318"     name="text"                        xpath="/record/sect1/sect2/table/tgroup/tbody/row/entry/para"  flatten="true" />
                <field column="para322"     name="text"                        xpath="/record/sect1/sect2/table/tgroup/thead/row/entry/para"  flatten="true" />
                <field column="para341"     name="text"                        xpath="/record/sect1/table/tgroup/tbody/row/entry/noteGroup/note/para"  flatten="true" />
                <field column="para342"     name="text"                        xpath="/record/sect1/table/tgroup/tbody/row/entry/noteGroup/para"  flatten="true" />
                <field column="para344"     name="text"                        xpath="/record/sect1/table/tgroup/tbody/row/entry/para"  flatten="true" />
                <field column="para348"     name="text"                        xpath="/record/sect1/table/tgroup/thead/row/entry/para"  flatten="true" />
                <field column="para371"     name="text"                        xpath="/record/table/tgroup/tbody/row/entry/noteGroup/note/para"  flatten="true" />
                <field column="para373"     name="text"                        xpath="/record/table/tgroup/tbody/row/entry/para"  flatten="true" />
                <field column="para377"     name="text"                        xpath="/record/table/tgroup/thead/row/entry/para"  flatten="true" />
        {code]
        
        with 
        
        

        <field column="text" xpath="//para" flatten="true"/>

        
        

        The indexes seemed equivalent and time to index was also equivalent.

        I have one concern which should be addressed before any 1.4 release. I still do not understand the purpose of the HashSet childrenFound and putNulls, if its important then I suspect that whatever is done to childNodes when an end_element is parsed also needs done to descNodes; but I have a feeling the whole lot may be unnecessary and can be removed. If it is required we need to explain it.

        The last change I would like to see, which I am happy to leave to 1.5, involves making sure emitted records do not contain tags from parent nodes unless they are stipulated by "commonField"

        Show
        Fergus McMenemie added a comment - I am quite pleased with it as far as it goes and think it would be good for 1.4. I have tested it against my test set of 3000 XML documents and replacing: <field column= "para1" name= "text" xpath= "/record/sect1/para" flatten= " true " /> <field column= "para2" name= "text" xpath= "/record/list/listitem/para" flatten= " true " /> <field column= "para32" name= "text" xpath= "/record/address/para" flatten= " true " /> <field column= "para40" name= "text" xpath= "/record/authoredBy/para" flatten= " true " /> <field column= "para43" name= "text" xpath= "/record/dataGroup/address/para" flatten= " true " /> <field column= "para47" name= "text" xpath= "/record/dataGroup/keyPersonnel/doubleList/first/para" flatten= " true " /> <field column= "para49" name= "text" xpath= "/record/dataGroup/keyPersonnel/doubleList/second/para" flatten= " true " /> <field column= "para50" name= "text" xpath= "/record/dataGroup/keyPersonnel/para" flatten= " true " /> <field column= "para51" name= "text" xpath= "/record/dataGroup/para" flatten= " true " /> <field column= "para57" name= "text" xpath= "/record/doubleList/first/para" flatten= " true " /> <field column= "para59" name= "text" xpath= "/record/doubleList/second/para" flatten= " true " /> <field column= "para63" name= "text" xpath= "/record/keyPersonnel/doubleList/first/para" flatten= " true " /> <field column= "para65" name= "text" xpath= "/record/keyPersonnel/doubleList/second/para" flatten= " true " /> <field column= "para68" name= "text" xpath= "/record/list/listItem/para" flatten= " true " /> <field column= "para75" name= "text" xpath= "/record/mediaBlock/doubleList/first/para" flatten= " true " /> <field column= "para77" name= "text" xpath= "/record/mediaBlock/doubleList/second/para" flatten= " true " /> <field column= "para172" name= "text" xpath= "/record/noteGroup/note/para" flatten= " true " /> <field column= "para174" name= "text" xpath= "/record/para" flatten= " true " /> <field column= "para179" name= "text" xpath= "/record/relatedInfo/list/listItem/relatedArticle/para" flatten= " true " /> <field column= "para184" name= "text" xpath= "/record/sect1/address/dataGroup/para" flatten= " true " /> <field column= "para185" name= "text" xpath= "/record/sect1/address/para" flatten= " true " /> <field column= "para195" name= "text" xpath= "/record/sect1/dataGroup/address/para" flatten= " true " /> <field column= "para199" name= "text" xpath= "/record/sect1/dataGroup/keyPersonnel/doubleList/first/para" flatten= " true " /> <field column= "para201" name= "text" xpath= "/record/sect1/dataGroup/keyPersonnel/doubleList/second/para" flatten= " true " /> <field column= "para202" name= "text" xpath= "/record/sect1/dataGroup/keyPersonnel/para" flatten= " true " /> <field column= "para203" name= "text" xpath= "/record/sect1/dataGroup/para" flatten= " true " /> <field column= "para208" name= "text" xpath= "/record/sect1/doubleList/first/para" flatten= " true " /> <field column= "para212" name= "text" xpath= "/record/sect1/doubleList/second/list/listItem/para" flatten= " true " /> <field column= "para213" name= "text" xpath= "/record/sect1/doubleList/second/para" flatten= " true " /> <field column= "para217" name= "text" xpath= "/record/sect1/keyPersonnel/doubleList/first/para" flatten= " true " /> <field column= "para219" name= "text" xpath= "/record/sect1/keyPersonnel/doubleList/second/para" flatten= " true " /> <field column= "para220" name= "text" xpath= "/record/sect1/keyPersonnel/para" flatten= " true " /> <field column= "para225" name= "text" xpath= "/record/sect1/list/listItem/list/listItem/para" flatten= " true " /> <field column= "para226" name= "text" xpath= "/record/sect1/list/listItem/para" flatten= " true " /> <field column= "para240" name= "text" xpath= "/record/sect1/para" flatten= " true " /> <field column= "para244" name= "text" xpath= "/record/sect1/sect2/doubleList/first/para" flatten= " true " /> <field column= "para246" name= "text" xpath= "/record/sect1/sect2/doubleList/second/para" flatten= " true " /> <field column= "para251" name= "text" xpath= "/record/sect1/sect2/list/listItem/list/listItem/para" flatten= " true " /> <field column= "para252" name= "text" xpath= "/record/sect1/sect2/list/listItem/para" flatten= " true " /> <field column= "para258" name= "text" xpath= "/record/sect1/sect2/noteGroup/note/para" flatten= " true " /> <field column= "para259" name= "text" xpath= "/record/sect1/sect2/para" flatten= " true " /> <field column= "para265" name= "text" xpath= "/record/sect1/sect2/sect3/list/listItem/list/listItem/para" flatten= " true " /> <field column= "para266" name= "text" xpath= "/record/sect1/sect2/sect3/list/listItem/para" flatten= " true " /> <field column= "para271" name= "text" xpath= "/record/sect1/sect2/sect3/para" flatten= " true " /> <field column= "para275" name= "text" xpath= "/record/sect1/sect2/sect3/sect4/list/listItem/para" flatten= " true " /> <field column= "para279" name= "text" xpath= "/record/sect1/sect2/sect3/sect4/para" flatten= " true " /> <field column= "para284" name= "text" xpath= "/record/sect1/sect2/sect3/sect4/sect5/para" flatten= " true " /> <field column= "para295" name= "text" xpath= "/record/sect1/sect2/sect3/table/tgroup/tbody/row/entry/noteGroup/note/para" flatten= " true " /> <field column= "para297" name= "text" xpath= "/record/sect1/sect2/sect3/table/tgroup/tbody/row/entry/para" flatten= " true " /> <field column= "para301" name= "text" xpath= "/record/sect1/sect2/sect3/table/tgroup/thead/row/entry/para" flatten= " true " /> <field column= "para312" name= "text" xpath= "/record/sect1/sect2/table/tgroup/tbody/row/entry/list/listItem/para" flatten= " true " /> <field column= "para315" name= "text" xpath= "/record/sect1/sect2/table/tgroup/tbody/row/entry/noteGroup/note/para" flatten= " true " /> <field column= "para316" name= "text" xpath= "/record/sect1/sect2/table/tgroup/tbody/row/entry/noteGroup/para" flatten= " true " /> <field column= "para318" name= "text" xpath= "/record/sect1/sect2/table/tgroup/tbody/row/entry/para" flatten= " true " /> <field column= "para322" name= "text" xpath= "/record/sect1/sect2/table/tgroup/thead/row/entry/para" flatten= " true " /> <field column= "para341" name= "text" xpath= "/record/sect1/table/tgroup/tbody/row/entry/noteGroup/note/para" flatten= " true " /> <field column= "para342" name= "text" xpath= "/record/sect1/table/tgroup/tbody/row/entry/noteGroup/para" flatten= " true " /> <field column= "para344" name= "text" xpath= "/record/sect1/table/tgroup/tbody/row/entry/para" flatten= " true " /> <field column= "para348" name= "text" xpath= "/record/sect1/table/tgroup/thead/row/entry/para" flatten= " true " /> <field column= "para371" name= "text" xpath= "/record/table/tgroup/tbody/row/entry/noteGroup/note/para" flatten= " true " /> <field column= "para373" name= "text" xpath= "/record/table/tgroup/tbody/row/entry/para" flatten= " true " /> <field column= "para377" name= "text" xpath= "/record/table/tgroup/thead/row/entry/para" flatten= " true " /> {code] with <field column="text" xpath="//para" flatten="true"/> The indexes seemed equivalent and time to index was also equivalent. I have one concern which should be addressed before any 1.4 release. I still do not understand the purpose of the HashSet childrenFound and putNulls, if its important then I suspect that whatever is done to childNodes when an end_element is parsed also needs done to descNodes; but I have a feeling the whole lot may be unnecessary and can be removed. If it is required we need to explain it. The last change I would like to see, which I am happy to leave to 1.5, involves making sure emitted records do not contain tags from parent nodes unless they are stipulated by "commonField"
        Hide
        Noble Paul added a comment -

        the putNulls are very important . I can probably add a testcase for that and probably it will be evident to you.

        one another concern I have with the patch is that every node which is not relevant results in walking up the tree to figure out if there was a wild-card . So even if somebody does not use wild-cards he ends up paying the price. Each node can keep a boolean flag if that branch has a wild-card upstream and this flag has to be set at the time of building the tree.

        Show
        Noble Paul added a comment - the putNulls are very important . I can probably add a testcase for that and probably it will be evident to you. one another concern I have with the patch is that every node which is not relevant results in walking up the tree to figure out if there was a wild-card . So even if somebody does not use wild-cards he ends up paying the price. Each node can keep a boolean flag if that branch has a wild-card upstream and this flag has to be set at the time of building the tree.
        Hide
        Fergus McMenemie added a comment -

        Re your concern. Correct, i was was also concerned about this, and of course how bad it is depends on the nature of your data and the tree of interesting nodes. However, testing indicated it did not seem to cause to much of an impact. It is not quite as bad as you say; it has to walk up the tree ONCE , not for every node which is not relivent, but rather every time we depart from the list of interesting nodes. Nevertheless your solution is solution is a good one.

        Show
        Fergus McMenemie added a comment - Re your concern. Correct, i was was also concerned about this, and of course how bad it is depends on the nature of your data and the tree of interesting nodes. However, testing indicated it did not seem to cause to much of an impact. It is not quite as bad as you say; it has to walk up the tree ONCE , not for every node which is not relivent, but rather every time we depart from the list of interesting nodes. Nevertheless your solution is solution is a good one.
        Hide
        Fergus McMenemie added a comment -

        I am doing more work on this to:

        • improve performance by avoiding having to walk back up tree
        • to review use of putNulls
        • avoid emitting parent nodes of an emitted record
        Show
        Fergus McMenemie added a comment - I am doing more work on this to: improve performance by avoiding having to walk back up tree to review use of putNulls avoid emitting parent nodes of an emitted record
        Hide
        Fergus McMenemie added a comment -

        OK!

        • I have a better method of dealing with //* searches.
        • I think I know how to only emitting fields associated with the current record.
        • PutNulls: I can see what it is doing but I still dont know why it is needed. I could understand if there was a requirement that every valid hasText node for a record must be defined or null; but I dont think that is what it id doing? Can you help?
        Show
        Fergus McMenemie added a comment - OK! I have a better method of dealing with //* searches. I think I know how to only emitting fields associated with the current record. PutNulls: I can see what it is doing but I still dont know why it is needed. I could understand if there was a requirement that every valid hasText node for a record must be defined or null; but I dont think that is what it id doing? Can you help?
        Hide
        Fergus McMenemie added a comment -

        I propose to have this patch which only addresses the //* issue submitted in time for 1.4 and close this issue.

        The other new feature and sorting out pushNulls can be left for later.

        Show
        Fergus McMenemie added a comment - I propose to have this patch which only addresses the //* issue submitted in time for 1.4 and close this issue. The other new feature and sorting out pushNulls can be left for later.
        Hide
        Noble Paul added a comment - - edited

        PutNulls: I can see what it is doing but I still dont know why it is needed.

        This is not very useful if you are indexing directly w/o tranformers. Imagine , if you write a transformer which wouldhave to make some business decision if one value is missing, this helps.

        I see that you have optimized for the case where no wild card is used .

        Fergus, I am +tive of putting this into 1.4 (if you are done with it). This is clearly a huge improvment over what was already existing . Whatever other enhancements are required , can be done later.

        Show
        Noble Paul added a comment - - edited PutNulls: I can see what it is doing but I still dont know why it is needed. This is not very useful if you are indexing directly w/o tranformers. Imagine , if you write a transformer which wouldhave to make some business decision if one value is missing, this helps. I see that you have optimized for the case where no wild card is used . Fergus, I am +tive of putting this into 1.4 (if you are done with it). This is clearly a huge improvment over what was already existing . Whatever other enhancements are required , can be done later.
        Hide
        Fergus McMenemie added a comment -

        I am done!

        Show
        Fergus McMenemie added a comment - I am done!
        Hide
        Noble Paul added a comment -

        There is a crazy bug (i've added a testcase) we need to fix this before committing

        Show
        Noble Paul added a comment - There is a crazy bug (i've added a testcase) we need to fix this before committing
        Hide
        Noble Paul added a comment -

        any_decendent_from_root fails

        Show
        Noble Paul added a comment - any_decendent_from_root fails
        Hide
        Noble Paul added a comment -

        This must be fixed anyway in the trunk . I am not opening another issue because my previous patch includes this fix too

        Show
        Noble Paul added a comment - This must be fixed anyway in the trunk . I am not opening another issue because my previous patch includes this fix too
        Hide
        Fergus McMenemie added a comment -

        OK, having a look at it right now.

        Show
        Fergus McMenemie added a comment - OK, having a look at it right now.
        Hide
        Fergus McMenemie added a comment - - edited

        Hmmm, the version I uploaded works. I think that while polishing my code you removed an "else return;" from XPathRecordReader.java around line 317. I guess you thought it was a redundant statement, not sure what is going on at all. But I think that code is unchanged from the original.

        Show
        Fergus McMenemie added a comment - - edited Hmmm, the version I uploaded works. I think that while polishing my code you removed an "else return;" from XPathRecordReader.java around line 317. I guess you thought it was a redundant statement, not sure what is going on at all. But I think that code is unchanged from the original.
        Hide
        Noble Paul added a comment -

        the else return; was a bug . look at the mask_NPE patch . finally must not have a return statement it leads to suprression of exceptions

        Show
        Noble Paul added a comment - the else return; was a bug . look at the mask_NPE patch . finally must not have a return statement it leads to suprression of exceptions
        Hide
        Fergus McMenemie added a comment -

        Sorted; added another check for null.

        Show
        Fergus McMenemie added a comment - Sorted; added another check for null.
        Hide
        Noble Paul added a comment -

        committed r822154

        Thanks Fergus

        Show
        Noble Paul added a comment - committed r822154 Thanks Fergus
        Hide
        Noble Paul added a comment -

        another enhancement to use a BitSet instead of a HashSet

        Show
        Noble Paul added a comment - another enhancement to use a BitSet instead of a HashSet
        Hide
        Grant Ingersoll added a comment -

        Bulk close for Solr 1.4

        Show
        Grant Ingersoll added a comment - Bulk close for Solr 1.4

          People

          • Assignee:
            Noble Paul
            Reporter:
            Fergus McMenemie
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Time Tracking

              Estimated:
              Original Estimate - 672h
              672h
              Remaining:
              Remaining Estimate - 672h
              672h
              Logged:
              Time Spent - Not Specified
              Not Specified

                Development