SOLR-9601: DIH: Radically simplify Tika example to only show relevant configuration

    Details

      Description

      The Solr DIH examples are legacy examples that show how DIH works. However, they include full configurations that may obscure the teaching points. This is no longer needed, as we have 3 full-blown examples in the configsets.

      Specifically for Tika, the field type definitions were at some point simplified to require fewer support files in the configuration directory. This, however, means that we now have field definitions that have the same names as in other examples, but different definitions.

      Importantly, Tika does not use most (any?) of those modified definitions. They are there just for completeness. Similarly, the solrconfig.xml includes the extract handler even though we are demonstrating a different path of using Tika. Somebody grepping through the config files may get confused about which configuration aspects contribute to which experience.
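
      To make the two paths concrete, here is a rough sketch of the two handler declarations in question as they might appear inside solrconfig.xml. The handler names, the fmap.content default, and the file name tika-data-config.xml are assumptions for illustration and are not copied from the example.

```xml
<!-- Sketch only: names and default values are illustrative assumptions. -->

<!-- Path 1: the extract handler (Solr Cell). Files are posted directly to
     this handler, which runs Tika itself. The DIH demonstration does not
     exercise it. -->
<requestHandler name="/update/extract"
                class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="fmap.content">text</str>
  </lst>
</requestHandler>

<!-- Path 2: DIH. Tika is invoked from an entity processor declared in the
     DIH configuration file referenced here. -->
<requestHandler name="/dataimport" class="solr.DataImportHandler">
  <lst name="defaults">
    <str name="config">tika-data-config.xml</str>
  </lst>
</requestHandler>
```

      Only the second declaration is involved in the DIH Tika extraction path.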

      I am planning to significantly simplify the configuration and schema of the Tika example to *only* show the DIH Tika extraction path. It will end up as a very short and focused example.

      1. tika2_20170308.tgz
        2 kB
        Alexandre Rafalovitch
      2. tika2_20170316.tgz
        2 kB
        Alexandre Rafalovitch

          Activity

          dsmiley David Smiley added a comment -

          +1 and to all the /example configs for that matter – same principle. Keep the relevant parts that are to be exercised; no kitchen sinks that need to be maintained.

          arafalov Alexandre Rafalovitch added a comment -

          It is a little hard to generate a readable diff between the original Tika example and the one I created. So, for ease of testing, I just created it as a separate tika2 core that can be dropped next to the other DIH cores.

          I removed all of the unused gunk, so the remaining files are tiny. I wish I could remove the infoStream section, but the default is false and I am not sure I should.

          I've also added a prototype-oriented demo of a wildcard field, renamed and simplified the text field definition, and did other minor cleanup in what is left.

          I am not sure if I need to worry about docValues here.

          Also, I have commented out the uniqueKey section, but the corresponding id field definition is missing. It was missing in the original example too, so I am not sure it is worth adding to the commented-out section.

          This is a big change (even if the resulting files are tiny), so I would appreciate people commenting on it before I actually commit it.

          dsmiley David Smiley added a comment -

          I took a look but I admit I haven't actually used it.

          • I think you should remove the infoStream part; this is all about being minimalist.
          • I think you should add a comment to solrconfig.xml like
            <!-- MINIMALIST CONFIG JUST TO SHOW DIH. Real configs aren't so minimal. -->
          • I agree not to worry about docValues; keep it simple and focused.
          • I think you should declare the uniqueKey.
          arafalov Alexandre Rafalovitch added a comment -

          The original example did not have a uniqueKey. And I don't think the PDF document provided one, though I could map a field (e.g. fileName) to it.

          dsmiley David Smiley added a comment -

          Ok; unless there would be some problem in declaring a uniqueKey (e.g. some tutorial it would break?), I think it should now have one regardless. The fact that Solr supports a schema without a uniqueKey is sometimes useful, but it's generally not recommended.

          arafalov Alexandre Rafalovitch added a comment -

          It turns out there is a problem with having, and populating, a uniqueKey. Tika extract does not give us a meaningful key. The nearest one is resourceName, but it is not made available when parsing through DIH because, I suspect, we abstract the filesystem too well.

          I could rename title to id and change its type to string, but that is bending over a bit too far, I think. I guess I could map it to id and copyField it to title. Would that be reasonable?
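
          For illustration, the "map to id and copyField to title" option might look roughly like the schema fragment below. This is only a sketch: the type name text_simple and the field attributes are assumptions, and the extracted title would also have to be mapped to id in the DIH configuration.

```xml
<!-- Sketch only: type names and attributes are illustrative assumptions. -->

<!-- The extracted title would be stored here verbatim and act as the key. -->
<field name="id" type="string" indexed="true" stored="true" required="true"/>

<!-- A searchable copy of the same value, analyzed as text. -->
<field name="title" type="text_simple" indexed="true" stored="true"/>
<copyField source="id" dest="title"/>

<uniqueKey>id</uniqueKey>
```

          The drawback is that two documents with the same title would overwrite each other.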

          Ok on removing infoStream, though we have a logging setting that uses it for all examples globally; but I could add a comment in that file I guess.

          solrconfig.xml already has a long comment about the example being minimalistic.

          dsmiley David Smiley added a comment -

          I see. In that case, I suggest adding a comment in schema.xml mentioning why we didn't bother defining a uniqueKey.

          Thanks for doing this cleanup.

          arafalov Alexandre Rafalovitch added a comment -

          Another version. I made TikaEntityProcessor an inner entity of FileListEntityProcessor, so the file name is now exposed as part of the outer entity.

          This allowed me to demonstrate rootEntity and another processor type, as well as provide a uniqueKey.
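
          The nesting described above can be sketched as the DIH configuration below. It is not the attached file: the entity names, the baseDir placeholder, the file name pattern, and the field mappings are assumptions for illustration.

```xml
<!-- Sketch only: entity names, paths and field mappings are illustrative. -->
<dataConfig>
  <dataSource type="BinFileDataSource"/>
  <document>
    <!-- Outer entity: lists files and exposes their names. rootEntity="false"
         means its rows are not documents themselves; the inner entity's are. -->
    <entity name="file" processor="FileListEntityProcessor" dataSource="null"
            baseDir="/path/to/docs" fileName=".*\.pdf"
            rootEntity="false">

      <!-- The file name from the outer entity supplies the unique key. -->
      <field column="file" name="id"/>

      <!-- Inner entity: runs Tika on each file found by the outer entity. -->
      <entity name="pdf" processor="TikaEntityProcessor"
              url="${file.fileAbsolutePath}" format="text">
        <field column="title" name="title" meta="true"/>
        <field column="text" name="text"/>
      </entity>
    </entity>
  </document>
</dataConfig>
```

          Each PDF found under baseDir would become one Solr document, combining the outer entity's file name with the inner entity's extracted fields.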

          I also commented out the dynamicField *. If it gets uncommented, a couple of extra fields from the FileListEntityProcessor will show up, so there is a nice hidden reward for curiosity...
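
          A schema fragment in that spirit might look like the sketch below. The field names, types, and attributes are assumptions for illustration rather than the contents of the attachment.

```xml
<!-- Sketch only: names, types and attributes are illustrative assumptions. -->

<!-- Populated from the file name exposed by the outer entity. -->
<field name="id" type="string" indexed="true" stored="true" required="true"/>
<uniqueKey>id</uniqueKey>

<!-- Catch-all wildcard field, left commented out. Uncommenting it would let
     otherwise unmapped columns from FileListEntityProcessor (for example
     fileSize or fileLastModified) appear in the indexed documents. -->
<!--
<dynamicField name="*" type="string" indexed="true" stored="true" multiValued="true"/>
-->
```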

          This should be ready to go with some formatting cleanup (4 spaces offset? whitespace before closing xml tags? anything else?).

          Any final comments?

          jira-bot ASF subversion and git services added a comment -

          Commit b02626de5071c543eb6e8deea450266218238c9e in lucene-solr's branch refs/heads/master from Alexandre Rafalovitch
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=b02626d ]

          SOLR-9601: DIH Tika example is now minimal
          Only keep definitions and files required to show Tika-extraction in DIH

          jira-bot ASF subversion and git services added a comment -

          Commit 812b0eebf3d50a141b952af27bbf7c225df5072d in lucene-solr's branch refs/heads/branch_6x from Alexandre Rafalovitch
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=812b0ee ]

          SOLR-9601: DIH Tika example is now minimal.
          Only keep definitions and files required to show Tika-extraction in DIH

          varunthacker Varun Thacker added a comment -

          Looks so much better now! Thanks Alexandre!

          jira-bot ASF subversion and git services added a comment -

          Commit 2319d69fd3d5b67729f31b5796cc1eb68220b664 in lucene-solr's branch refs/heads/master from Cassandra Targett
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=2319d69 ]

          Ref Guide: update DIH docs for SOLR-7383; SOLR-9601; plus major surgery on page layout

          jira-bot ASF subversion and git services added a comment -

          Commit 2d054965a5c5313a486540c79ef29b0dbf05bc70 in lucene-solr's branch refs/heads/branch_6x from Cassandra Targett
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=2d05496 ]

          Ref Guide: update DIH docs for SOLR-7383; SOLR-9601; plus major surgery on page layout

          jira-bot ASF subversion and git services added a comment -

          Commit fd8ac5b959f26c8a979752c9bf61bb8a545b2e3a in lucene-solr's branch refs/heads/branch_6_6 from Cassandra Targett
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=fd8ac5b ]

          Ref Guide: update DIH docs for SOLR-7383; SOLR-9601; plus major surgery on page layout


            People

            • Assignee:
              arafalov Alexandre Rafalovitch
              Reporter:
              arafalov Alexandre Rafalovitch