Solr / SOLR-1033

DIH transformers should be able to access current entity's namespace

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.4
    • Fix Version/s: 1.4
    • Labels:
      None
    • Environment:

      All operating systems and software platforms

      Description

      It can be very useful to reuse the output from a DIH template in other templates and/or regex transformers. Currently this cannot be done. The resolver is initialized at the start of the transformer run with whatever values exist for each column name at that instant. As the transformer executes, it may define new values for column names. My change is intended to update the hash used by the resolver after each successful transformation.

      This only applies to the template and regex transformers.
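
The difference between resolving against a one-time snapshot and resolving against a refreshed namespace can be sketched in a few lines of Python. This is a simplified model, not Solr code: the `resolve` helper, the two `run_*` functions, the field list, and the assumption that unknown placeholders resolve to an empty string are all hypothetical and exist only for illustration.

```python
import re

PLACEHOLDER = re.compile(r"\$\{([^}]+)\}")


def resolve(template, namespace):
    """Replace each ${name} with namespace[name]; unknown names become ''."""
    return PLACEHOLDER.sub(lambda m: str(namespace.get(m.group(1), "")), template)


# Hypothetical field definitions for an entity named "e": (column, template).
fields = [("a", "hello"), ("b", "${e.a} world")]


def run_with_snapshot(row):
    # Behaviour described in the report: the namespace is captured once,
    # before any field has been transformed.
    snapshot = {f"e.{k}": v for k, v in row.items()}
    for column, template in fields:
        row[column] = resolve(template, snapshot)
    return row


def run_with_refresh(row):
    # Intended behaviour: the namespace is rebuilt after each transformation,
    # so later fields can see earlier output.
    for column, template in fields:
        namespace = {f"e.{k}": v for k, v in row.items()}
        row[column] = resolve(template, namespace)
    return row


old = run_with_snapshot({})
new = run_with_refresh({})
print(old)  # {'a': 'hello', 'b': ' world'}
print(new)  # {'a': 'hello', 'b': 'hello world'}
```

In the snapshot variant, column `b` loses the value of `a` because `a` did not exist when the namespace was captured.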

      1. SOLR-1033.patch
        4 kB
        Noble Paul
      2. SOLR-1033.patch
        9 kB
        Noble Paul
      3. SOLR-1033.patch
        16 kB
        Fergus McMenemie

        Activity

        Fergus McMenemie created issue -
        Fergus McMenemie added a comment -

        A patch to address the issue.

        Yet again, I cannot get one of the unit tests to work. I am hoping that folk better than me can point out where I am going wrong!

        Fergus McMenemie made changes -
        Field Original Value New Value
        Attachment SOLR-1033.patch [ 12400657 ]
        Noble Paul added a comment -

        The output of one transformer can be consumed by another. For example:

        <entity transformer="TemplateTransformer,RegexTransformer">
          <field column="a" template="hello"/>
          <field column="b" regex="(.*)" sourceColName="a"/>
        </entity> 
        

        In this case, the output of TemplateTransformer goes to 'a'. The RegexTransformer can read from column 'a' and put its result into column 'b'. It is still possible to have another transformer which reads from 'b' and puts the value into 'c'.

        Is this the use case, or am I missing something?
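
The chaining described in this comment can be modelled roughly as follows. This is a simplified Python sketch, not the DIH implementation: the two transformer functions and the dictionary form of the field definitions are hypothetical, and the regex step only handles the single-group case without `replaceWith`.

```python
import re


def template_transformer(row, fields):
    # Writes the literal template text into the target column.
    for field in fields:
        if "template" in field:
            row[field["column"]] = field["template"]
    return row


def regex_transformer(row, fields):
    # Reads from sourceColName (or the column itself) and stores group 1.
    for field in fields:
        if "regex" in field:
            source = row.get(field.get("sourceColName", field["column"]))
            if source is None:
                continue
            match = re.search(field["regex"], source)
            if match:
                row[field["column"]] = match.group(1)
    return row


fields = [
    {"column": "a", "template": "hello"},
    {"column": "b", "regex": "(.*)", "sourceColName": "a"},
]

# Transformers run in the order listed in the entity's transformer attribute,
# and each one receives the row produced by the previous one.
row = {}
for transformer in (template_transformer, regex_transformer):
    row = transformer(row, fields)

print(row)  # {'a': 'hello', 'b': 'hello'}
```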

        Fergus McMenemie added a comment - - edited

        Sorry. I was not as clear as I could have been. No, the use case is more like:

          <entity name="e" transformer="TemplateTransformer,RegexTransformer">
            <field column="a" template="hello"/>
            <field column="c" template="hello world"/>
            <field column="b" regex="${e.a}(.*)" sourceColName="c"/>
          </entity>
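
What this configuration asks for is that the `regex` attribute itself is resolved against the current row before it is compiled. A minimal Python sketch of that idea follows; the `resolve` helper is hypothetical, the row is assumed to already hold the two template outputs, and the substituted value is inserted into the pattern unescaped.

```python
import re

PLACEHOLDER = re.compile(r"\$\{([^}]+)\}")


def resolve(text, namespace):
    """Replace each ${name} with namespace[name]; unknown names become ''."""
    return PLACEHOLDER.sub(lambda m: str(namespace.get(m.group(1), "")), text)


entity = "e"
# Row state after the two template fields have run.
row = {"a": "hello", "c": "hello world"}

# Expose the current row under the entity's name, then resolve the regex.
namespace = {f"{entity}.{k}": v for k, v in row.items()}
pattern = resolve("${e.a}(.*)", namespace)

# Apply the resolved pattern to the source column "c" and keep group 1.
match = re.search(pattern, row["c"])
row["b"] = match.group(1) if match else None

print(pattern)         # hello(.*)
print(repr(row["b"]))  # ' world'
```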
        
        Fergus McMenemie added a comment -

        Following on from Noble's comments, I realised that the test case for regex was not testing or highlighting the use case at all. This patch contains a new, working regex JUnit test case.

        Fergus McMenemie made changes -
        Attachment SOLR-1033.patch [ 12400718 ]
        Noble Paul added a comment -

        Fergus, the changes required for TemplateTransformer were clear and your fix is right.
        Can you give the use case for RegexTransformer also?

        Fergus McMenemie added a comment -

        Noble,

        Sure. However I need a little help. What is it I need to do?

        1) reference the examples I posted to solr-user in JIRA?

        2) simplify/clarify what was posted to solr-user?

        3) include a snippet in JIRA?

        4) add example explicitly showing reuse of regex output in another regex?

        5) or details of the problem I am trying to solve right now?

        I had thought the general case included below was sufficient!

        Regards Fergus.


        Noble Paul added a comment -

        "Sure. However I need a little help. What is it I need to do?"

        A simple use case with an example which demonstrates the feature.

        The TemplateTransformer example you provided was self-explanatory. If you can give a similar one, that is more than sufficient.

        Fergus McMenemie added a comment - - edited

        OK, here goes. My document contains references to embedded imagery. For each image there is the image itself along with a thumbnail and caption. The source document contains:

        <mediaObject vurl="1043130" imageType="graphic"/>

        I have a search application that searches only the captions associated with a given image. It would be nice to populate Solr fields with the correct relative path to each image and thumbnail at index time. The problem is that although the thumbnail is:

        s${e.vurl}.jpg

        the name of the image itself varies depending on the first letter of the image type, imageType. It could be one of 'picture', 'graphic', 'lineDrawing' or 'map', i.e.:

        p${e.vurl}.jpg
        g${e.vurl}.jpg
        l${e.vurl}.jpg
        m${e.vurl}.jpg

        My patch would allow the following sort of thing to be added to a data-config. I feel this considerably increases its power and usefulness.

        <entity name="x" .... transformer="TemplateTransformer,RegexTransformer">
          <field column="fileWebPath"            template="${jc.fileAbsolutePath}" regex="${dataimporter.request.contentdir}(.*)" replaceWith="/ford$1" />
          <field column="vurl"                          xpath="/record/mediaBlock/mediaObject/@vurl" />
          <field column="imagetype"               xpath="/record/mediaBlock/mediaObject/@imageType" regex="^(\w).*"/>
          <field column="imgWebPathICON"  regex="(.*)/.*" replaceWith="$1/imagery/s${x.vurl}.jpg" sourceColName="fileWebPath"/>
          <field column="imgWebPathFULL"  regex="(.*)/.*" replaceWith="$1/imagery/${x.imagetype}${x.vurl}.jpg"  sourceColName="fileWebPath"/>
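
To make the intended result concrete, the following Python sketch walks the five fields above for the sample `mediaObject` element. It is a model under stated assumptions, not DIH itself: the values of `jc.fileAbsolutePath` and `dataimporter.request.contentdir` are invented, the `resolve` and `regex_replace` helpers are hypothetical, and only the `$1` back-reference is supported.

```python
import re

PLACEHOLDER = re.compile(r"\$\{([^}]+)\}")


def resolve(text, namespace):
    """Replace each ${name} with namespace[name]; unknown names become ''."""
    return PLACEHOLDER.sub(lambda m: str(namespace.get(m.group(1), "")), text)


def regex_replace(pattern, replace_with, source):
    """Mimic a Java-style replaceAll that only uses the $1 back-reference."""
    return re.sub(pattern, lambda m: replace_with.replace("$1", m.group(1)), source)


# Hypothetical outer variables; only vurl and imageType come from the sample.
outer = {
    "jc.fileAbsolutePath": "/data/content/docs/rec1.xml",
    "dataimporter.request.contentdir": "/data/content",
}
row = {"vurl": "1043130", "imagetype": "graphic"}


def namespace():
    # Outer variables plus the current row of entity "x", rebuilt on each call.
    return {**outer, **{f"x.{k}": v for k, v in row.items()}}


# fileWebPath: template first, then regex/replaceWith on the same column.
row["fileWebPath"] = resolve("${jc.fileAbsolutePath}", namespace())
row["fileWebPath"] = regex_replace(
    resolve("${dataimporter.request.contentdir}(.*)", namespace()),
    resolve("/ford$1", namespace()),
    row["fileWebPath"],
)

# imagetype: keep only the first letter of the image type.
row["imagetype"] = re.search(r"^(\w).*", row["imagetype"]).group(1)

# The two image paths reuse vurl and the already-shortened imagetype.
for column, replace_with in [
    ("imgWebPathICON", "$1/imagery/s${x.vurl}.jpg"),
    ("imgWebPathFULL", "$1/imagery/${x.imagetype}${x.vurl}.jpg"),
]:
    row[column] = regex_replace(
        "(.*)/.*", resolve(replace_with, namespace()), row["fileWebPath"]
    )

print(row["fileWebPath"])     # /ford/docs/rec1.xml
print(row["imgWebPathICON"])  # /ford/docs/imagery/s1043130.jpg
print(row["imgWebPathFULL"])  # /ford/docs/imagery/g1043130.jpg
```

The last field only produces the expected path because `${x.imagetype}` is resolved after the earlier regex has reduced it to a single letter.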
        
        Noble Paul added a comment -

        If I am not wrong, the output of one transformation in RegexTransformer is available in the next transformation, because the value is added to the same row object. So it should work once the TemplateTransformer is fixed.

        Fergus McMenemie added a comment -

        Not sure I am following what you say. If I number the different steps in my example entity as follows:-

        <entity name="x" .... transformer="TemplateTransformer,RegexTransformer">
        1  <field column="fileWebPath"     template="${jc.fileAbsolutePath}" regex="${dataimporter.request.contentdir}(.*)" replaceWith="/ford$1" />
        2  <field column="vurl"            xpath="/record/mediaBlock/mediaObject/@vurl" />
        3  <field column="imagetype"       xpath="/record/mediaBlock/mediaObject/@imageType" regex="^(\w).*"/>
        4  <field column="imgWebPathICON"  regex="(.*)/.*" replaceWith="$1/imagery/s${x.vurl}.jpg" sourceColName="fileWebPath"/>
        5  <field column="imgWebPathFULL"  regex="(.*)/.*" replaceWith="$1/imagery/${x.imagetype}${x.vurl}.jpg"  sourceColName="fileWebPath"/>
        

        We see that column 5 involves a regex which in turn involves columns 3 and 2. Column 3 is itself a regex. We therefore have the output from one regex being used within another regex. So as far as I can see we need the fix made to both the TemplateTransformer and the RegexTransformer.

        Noble Paul added a comment - - edited

        OK, I see your point. You are constructing the regex replacements themselves with templates. I missed that.

        I am wondering if the system can be modified so that the current entity's rows are always available to all transformers. It can be done as a simple change in EntityProcessorBase#applyTransformers.
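
The shape of that change can be sketched in Python as a loop that re-registers the current row under the entity's name before each transformer runs. This is only an illustration of the idea, not the committed patch: the `Resolver` class, `apply_transformers`, and the two toy transformers are hypothetical, and the resolver splits names on the first dot only.

```python
import re

PLACEHOLDER = re.compile(r"\$\{([^}]+)\}")


class Resolver:
    """Maps namespace names to dicts and resolves ${namespace.key} placeholders."""

    def __init__(self):
        self.namespaces = {}

    def add_namespace(self, name, values):
        self.namespaces[name] = values

    def resolve(self, text):
        def lookup(match):
            # Split "e.a" into namespace "e" and key "a" (first dot only).
            ns, _, key = match.group(1).partition(".")
            return str(self.namespaces.get(ns, {}).get(key, ""))

        return PLACEHOLDER.sub(lookup, text)


def apply_transformers(entity_name, row, transformers, resolver):
    for transform in transformers:
        # Re-register the current row so each transformer sees earlier output.
        resolver.add_namespace(entity_name, row)
        row = transform(row, resolver)
    return row


def set_a(row, resolver):
    # Toy transformer: writes a literal value into column "a".
    return {**row, "a": "hello"}


def set_b(row, resolver):
    # Toy transformer: builds column "b" from the current value of "a".
    return {**row, "b": resolver.resolve("${e.a} world")}


result = apply_transformers("e", {}, [set_a, set_b], Resolver())
print(result)  # {'a': 'hello', 'b': 'hello world'}
```

Because the registration happens in the shared loop rather than inside each transformer, any transformer that resolves its attributes through the resolver picks up the current row without further changes.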

        Noble Paul added a comment -

        This should help all other transformers implicitly support templating.

        Noble Paul made changes -
        Attachment SOLR-1033.patch [ 12400766 ]
        Fergus McMenemie added a comment - - edited

        Your comment about modifying the system "to have the current entities rows be available always to all transformers" is good and will produce the fastest, most efficient code.

        But I need to be sure we are not using the term "template" twice in different ways. You say "you are constructing the regex replacements themselves with templates", by which you mean using the ${XXX} syntax and not the output from a TemplateTransformer?

        Anyway I have backed out my patch and applied yours. Everything seems fine, but I am still testing.

        Thanks very much.

        Noble Paul added a comment -

        "You say 'you are constructing the regex replacements themselves with templates', by which you mean using the ${XXX} syntax and not the output from a TemplateTransformer?"

        When I said 'template' I meant any string with ${xxx} content. The 'template' attribute is the only value TemplateTransformer is interested in.

        Any attribute value in DIH is potentially a template. Some are honoured and some are not. I hope we can make it work consistently across all of them.

        Noble Paul added a comment -

        The complete patch. XPathEntityProcessor needed some rework.

        Noble Paul made changes -
        Attachment SOLR-1033.patch [ 12400828 ]
        Fergus McMenemie added a comment -

        Hmmm, some thoughts and an enhanced patch for your consideration.

        Surely the test cases should still be revised to test the new functionality.

        Also, as the XPathEntityProcessor has been revised, I felt this might be the best time to sort out some formatting typos within the error messages.

        Fergus McMenemie made changes -
        Attachment SOLR-1033.patch [ 12400845 ]
        Fergus McMenemie made changes -
        Attachment SOLR-1033.patch [ 12400657 ]
        Fergus McMenemie made changes -
        Attachment SOLR-1033.patch [ 12400718 ]
        Noble Paul added a comment -

        Fergus,
        Looks good. Thanks

        Shalin Shekhar Mangar added a comment -

        Updating issue title per the final resolution.

        Patch looks good, I'll commit shortly.

        Shalin Shekhar Mangar made changes -
        Summary: "DIH transformers cannot reuse output from previous transformations" changed to "DIH transformers should be able to access current entity's namespace"
        Assignee Shalin Shekhar Mangar [ shalinmangar ]
        Shalin Shekhar Mangar added a comment -

        Committed revision 747664.

        Thanks Fergus and Noble!

        Shalin Shekhar Mangar made changes -
        Status Open [ 1 ] Resolved [ 5 ]
        Resolution Fixed [ 1 ]
        Grant Ingersoll added a comment -

        Bulk close for Solr 1.4

        Grant Ingersoll made changes -
        Status Resolved [ 5 ] Closed [ 6 ]

          People

          • Assignee: Shalin Shekhar Mangar
          • Reporter: Fergus McMenemie

          Time Tracking

          • Original Estimate: 24h
          • Remaining Estimate: 24h
          • Time Spent: Not Specified