Solr / SOLR-1358

Integration of Tika and DataImportHandler

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.5
    • Labels:
      None

      Description

      At the moment, it is impossible to configure Solr so that it builds documents from data that comes from both PDF documents and database table columns. Currently, the user has to add a preprocessing step that converts the PDF files into plain text files. Therefore, I would like to see Solr Cell integrated into DIH, which would make that preprocessing obsolete.

      1. SOLR-1358.patch
        7 kB
        Akshay K. Ukey
      2. SOLR-1358.patch
        7 kB
        Noble Paul
      3. SOLR-1358.patch
        7 kB
        Akshay K. Ukey
      4. SOLR-1358.patch
        20 kB
        Akshay K. Ukey

          Activity

          Noble Paul added a comment - - edited

          Let us provide a new TikaEntityProcessor:

          <dataConfig>
            <!-- Use any data source of type DataSource<InputStream> -->
            <dataSource type="BinURLDataSource"/>
            <document>
              <!-- The value of format can be text|xml|html|none. This is the format in which the body is emitted (the 'text' field); the implicit field 'text' will have that format.
                   The default value is 'text' (if not specified). format="none" means the body is not emitted. -->
              <entity processor="TikaEntityProcessor" tikaConfig="tikaconfig.xml" url="${some.var.goes.here}" format="text">
                <!-- Do the appropriate mapping here. meta="true" means it is a metadata field. -->
                <field column="Author" meta="true" name="author"/>
                <field column="title" meta="true" name="docTitle"/>
                <!-- 'text' is an implicit field emitted by TikaEntityProcessor. Map it appropriately. -->
                <field column="text"/>
              </entity>
            </document>
          </dataConfig>
          

          With format=xml|html, an XPathEntityProcessor can be nested inside this entity. This may help users extract more deeply nested data from a file. It is even possible to create multiple documents from a single file.
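
          A rough sketch of what such nesting might look like, not a tested configuration. It assumes the nested entity can read the parent's implicit 'text' field through a FieldReaderDataSource (via dataField), and that rootEntity="false" on the Tika entity makes each nested row its own Solr document. The data source names, entity names, the 'paragraph' column and the xpaths are hypothetical and depend on the XHTML that Tika actually emits for the file.

          ```xml
          <dataConfig>
            <!-- Binary source the Tika entity reads the file from -->
            <dataSource name="bin" type="BinURLDataSource"/>
            <!-- Assumption: hands a field of the parent row to the nested entity -->
            <dataSource name="fld" type="FieldReaderDataSource"/>
            <document>
              <!-- rootEntity="false": rows of the nested entity become the documents -->
              <entity name="tika" processor="TikaEntityProcessor" dataSource="bin"
                      url="${some.var.goes.here}" format="xml" rootEntity="false">
                <!-- Hypothetical xpaths: one document per <div> of the emitted body,
                     collecting the text of its <p> children -->
                <entity name="part" processor="XPathEntityProcessor" dataSource="fld"
                        dataField="tika.text" forEach="/html/body/div">
                  <field column="paragraph" xpath="/html/body/div/p"/>
                </entity>
              </entity>
            </document>
          </dataConfig>
          ```

          The forEach expression decides how many documents one file yields, so it is the part to adjust first once the real structure of the emitted body is known.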

          Akshay K. Ukey added a comment -

          First cut patch. Not tested.

          Noble Paul added a comment -

          cleaned a bit

          Noble Paul added a comment -

          onError implemented

          Akshay K. Ukey added a comment -

          Patch with fix for avoiding reading from data source continuously.

          Akshay K. Ukey added a comment - - edited

          Patch with a test case, and with the Tika parser configurable via the parser attribute of the entity tag.
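
          For illustration only, an entity using these attributes might look like the fragment below. It assumes that parser takes a fully qualified Tika parser class name and that onError accepts the usual DIH values (abort, skip, continue); neither is spelled out in this thread, so check the committed code before relying on it.

          ```xml
          <!-- Assumption: parser names a Tika Parser class; onError="skip" drops a
               document that fails to parse instead of aborting the whole import -->
          <entity processor="TikaEntityProcessor" url="${some.var.goes.here}" format="text"
                  parser="org.apache.tika.parser.pdf.PDFParser" onError="skip">
            <field column="text"/>
          </entity>
          ```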

          Noble Paul added a comment -

          committed r889613

          Thanks Akshay


            People

            • Assignee:
              Noble Paul
              Reporter:
              Sascha Szott
            • Votes:
              2
              Watchers:
              7
