Solr / SOLR-1358

Integration of Tika and DataImportHandler

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.5
    • Labels:
      None

      Description

      At the moment, it is impossible to configure Solr so that it builds documents from data that comes from both PDF documents and database table columns. To accomplish this today, the user has to add a preprocessing step that converts the PDF files into plain text files. Therefore, I would like to see an integration of Solr Cell into DIH that makes this preprocessing obsolete.
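For illustration, here is a sketch of the kind of configuration this request is after, assuming the TikaEntityProcessor and BinURLDataSource proposed in the comments below. The JDBC driver and URL, the table `documents`, its columns `id`, `title` and `pdf_url`, the entity and data source names, and the target field `content` are all hypothetical, and the snippet is untested.

```xml
<dataConfig>
  <!-- Hypothetical database connection; replace driver and url with real values -->
  <dataSource name="db" type="JdbcDataSource"
              driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost/example"/>
  <!-- Binary source that the Tika entity reads the PDF from -->
  <dataSource name="bin" type="BinURLDataSource"/>
  <document>
    <!-- Outer entity: one row per document, taken from the (hypothetical) table -->
    <entity name="doc" dataSource="db" query="SELECT id, title, pdf_url FROM documents">
      <field column="id" name="id"/>
      <field column="title" name="title"/>
      <!-- Inner entity: extracts the PDF body for the current row -->
      <entity name="pdf" processor="TikaEntityProcessor" dataSource="bin"
              url="${doc.pdf_url}" format="text">
        <field column="text" name="content"/>
      </entity>
    </entity>
  </document>
</dataConfig>
```

With a layout like this, the database columns and the extracted PDF text would end up in the same Solr document, with no separate conversion step.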

      1. SOLR-1358.patch
        20 kB
        Akshay K. Ukey
      2. SOLR-1358.patch
        7 kB
        Akshay K. Ukey
      3. SOLR-1358.patch
        7 kB
        Noble Paul
      4. SOLR-1358.patch
        7 kB
        Akshay K. Ukey

        Issue Links

          Activity

          Sascha Szott created issue -
          Noble Paul added a comment - - edited

          Let us provide a new TikaEntityProcessor

          <dataConfig>
            <!-- Use any DataSource<InputStream> implementation -->
            <dataSource type="BinURLDataSource"/>
            <document>
              <!-- 'format' can be text|xml|html|none. It is the format in which the body
                   (the implicit 'text' field) is emitted. The default is 'text';
                   format="none" means the body is not emitted. -->
              <entity processor="TikaEntityProcessor" tikaConfig="tikaconfig.xml" url="${some.var.goes.here}" format="text">
                <!-- Do the appropriate mapping here; meta="true" marks a metadata field -->
                <field column="Author" meta="true" name="author"/>
                <field column="title" meta="true" name="docTitle"/>
                <!-- 'text' is an implicit field emitted by TikaEntityProcessor; map it appropriately -->
                <field column="text"/>
              </entity>
            </document>
          </dataConfig>
          

          With format=xml|html, an XPathEntityProcessor can be nested inside the TikaEntityProcessor entity. This may help users extract more deeply nested data from a file. It is even possible to create multiple documents from a single file.
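A sketch of how such nesting might look, under some assumptions that are not part of the proposal above: the emitted 'text' field is handed to the inner entity through DIH's FieldReaderDataSource and its dataField attribute, rootEntity="false" is used so that each inner row becomes its own document, and the XPath expressions (one div per page) are hypothetical and depend on what Tika actually emits for a given file. The entity, data source and field names are illustrative, and the snippet is untested.

```xml
<dataConfig>
  <dataSource name="bin" type="BinURLDataSource"/>
  <!-- Assumption: reads XML out of a field produced by the parent entity -->
  <dataSource name="fld" type="FieldReaderDataSource"/>
  <document>
    <!-- rootEntity="false": documents are created by the nested entity instead -->
    <entity name="tika" processor="TikaEntityProcessor" dataSource="bin"
            url="${some.var.goes.here}" format="xml" rootEntity="false">
      <!-- Hypothetical paths: one Solr document per div in the emitted body -->
      <entity name="page" processor="XPathEntityProcessor" dataSource="fld"
              dataField="tika.text" forEach="/html/body/div">
        <field column="page_text" xpath="/html/body/div/p"/>
      </entity>
    </entity>
  </document>
</dataConfig>
```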

          Noble Paul made changes -
          Field Original Value New Value
          Link This issue is blocked by SOLR-1583 [ SOLR-1583 ]
          Noble Paul made changes -
          Summary: "Integration of Solr Cell and DataImportHandler" -> "Integration of Tika and DataImportHandler"
          Noble Paul made changes -
          Assignee Noble Paul [ noble.paul ]
          Akshay K. Ukey added a comment -

          First cut patch. Not tested.

          Akshay K. Ukey made changes -
          Attachment SOLR-1358.patch [ 12427339 ]
          Noble Paul added a comment -

          cleaned a bit

          Noble Paul made changes -
          Attachment SOLR-1358.patch [ 12427340 ]
          Noble Paul made changes -
          Comment [ Configuration with attribute to select format of emitted content:

          {code:xml}
          <dataConfig>
           <!-- use any of type DataSource<InputStream> -->
            <dataSource type="BinURLDataSource"/>
            <document>
           <!-- 'emitFormat' can be one of text | html | xml -->
              <entity processor="TikaEntityProcessor" tikaConfig="tikaconfig.xml" url="${some.var.goes.here}" emitFormat="xml" >
                <!--Do appropriate mapping here meta="true" means it is a metadata field -->
                <field column="Author" meta="true" name="author"/>
                <field column="title" meta="true" name="docTitle"/>
                <!--'text' is an implicit field emitted by TikaEntityProcessor . Map it appropriately-->
                <field column="text"/>
               </entity>
            </document>
          </dataConfig>
          {code}

          With 'emitFormat' different EntityProcessors can be chained. E.g. using "xml" value will allow chaining XPathEntityProcessor with TikaEntityProcessor for further custom processing. ]
          Noble Paul added a comment -

          onError implemented

          Noble Paul made changes -
          Attachment SOLR-1358.patch [ 12427425 ]
          Noble Paul made changes -
          Attachment SOLR-1358.patch [ 12427429 ]
          Noble Paul made changes -
          Attachment SOLR-1358.patch [ 12427340 ]
          Noble Paul made changes -
          Attachment SOLR-1358.patch [ 12427425 ]
          Akshay K. Ukey made changes -
          Attachment SOLR-1358.patch [ 12427474 ]
          Akshay K. Ukey added a comment -

          Patch with a fix to avoid reading from the data source continuously.

          Akshay K. Ukey added a comment - - edited

          Patch with a test case; the Tika parser is now configurable via the parser attribute of the entity tag.
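As a rough illustration of these two options on the entity tag: the attribute names parser and onError come from the comments in this issue, but the values below are assumptions. The parser value is assumed to be a fully qualified Tika parser class name, and onError is assumed to accept DIH's usual abort|skip|continue values. Untested sketch:

```xml
<!-- Assumed values: a specific Tika parser class, and skipping files that fail to parse -->
<entity name="pdf" processor="TikaEntityProcessor"
        url="${some.var.goes.here}" format="text"
        parser="org.apache.tika.parser.pdf.PDFParser"
        onError="skip">
  <field column="text"/>
</entity>
```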

          Akshay K. Ukey made changes -
          Attachment SOLR-1358.patch [ 12427721 ]
          Noble Paul added a comment -

          committed r889613

          Thanks Akshay

          Noble Paul made changes -
          Status: Open [ 1 ] -> Resolved [ 5 ]
          Fix Version/s 1.5 [ 12313566 ]
          Resolution Fixed [ 1 ]

            People

            • Assignee:
              Noble Paul
            • Reporter:
              Sascha Szott
            • Votes:
              2
            • Watchers:
              10
