Solr / SOLR-1358

Integration of Tika and DataImportHandler

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.5
    • Labels:
      None

      Description

      At the moment, it is impossible to configure Solr so that it builds documents from data that comes from both PDF documents and database table columns. Currently, to accomplish this, the user has to add a preprocessing step that converts PDF files into plain text files. Therefore, I would like to see an integration of Solr Cell into DIH that makes this preprocessing obsolete.
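
      As a rough sketch of the kind of configuration such an integration could enable: a database entity supplies the location of each PDF to a nested Tika entity, so both sources contribute to one Solr document. This assumes the TikaEntityProcessor, BinURLDataSource, and 'emitFormat' attribute described in the comment on this issue (the committed attribute names may differ). The JDBC driver, connection URL, table, column, and Solr field names are hypothetical.

      {code:xml}
      <dataConfig>
        <!-- hypothetical database connection; replace driver and url with real values -->
        <dataSource name="db" type="JdbcDataSource" driver="com.example.jdbc.Driver" url="jdbc:example://localhost/docs"/>
        <!-- binary data source that streams the PDF to Tika -->
        <dataSource name="bin" type="BinURLDataSource"/>
        <document>
          <!-- outer entity: one row per document, read from the database -->
          <entity name="doc" dataSource="db" query="SELECT ID, TITLE, PDF_URL FROM DOCUMENTS">
            <field column="ID" name="id"/>
            <field column="TITLE" name="title"/>
            <!-- inner entity: extracts text from the PDF referenced by the current row -->
            <entity name="pdf" dataSource="bin" processor="TikaEntityProcessor" url="${doc.PDF_URL}" emitFormat="text">
              <field column="text" name="content"/>
            </entity>
          </entity>
        </document>
      </dataConfig>
      {code}

      Depending on the JDBC driver, the case of the column name in ${doc.PDF_URL} may need to match what the driver returns.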

      1. SOLR-1358.patch
        20 kB
        Akshay K. Ukey
      2. SOLR-1358.patch
        7 kB
        Akshay K. Ukey
      3. SOLR-1358.patch
        7 kB
        Noble Paul
      4. SOLR-1358.patch
        7 kB
        Akshay K. Ukey

        Issue Links

          Activity

          Sascha Szott created issue -
          Noble Paul made changes -
          Field Original Value New Value
          Link This issue is blocked by SOLR-1583 [ SOLR-1583 ]
          Noble Paul made changes -
          Summary Integration of Solr Cell and DataImportHandler Integration of Tika and DataImportHandler
          Noble Paul made changes -
          Assignee Noble Paul [ noble.paul ]
          Akshay K. Ukey made changes -
          Attachment SOLR-1358.patch [ 12427339 ]
          Noble Paul made changes -
          Attachment SOLR-1358.patch [ 12427340 ]
          Noble Paul made changes -
          Comment [ Configuration with attribute to select format of emitted content:

          {code:xml}
          <dataConfig>
            <!-- use any data source of type DataSource<InputStream> -->
            <dataSource type="BinURLDataSource"/>
            <document>
              <!-- 'emitFormat' can be one of: text | html | xml -->
              <entity processor="TikaEntityProcessor" tikaConfig="tikaconfig.xml" url="${some.var.goes.here}" emitFormat="xml">
                <!-- Map fields as appropriate; meta="true" marks a metadata field -->
                <field column="Author" meta="true" name="author"/>
                <field column="title" meta="true" name="docTitle"/>
                <!-- 'text' is an implicit field emitted by TikaEntityProcessor; map it as appropriate -->
                <field column="text"/>
              </entity>
            </document>
          </dataConfig>
          {code}

          With 'emitFormat', different EntityProcessors can be chained. For example, setting it to "xml" allows chaining XPathEntityProcessor after TikaEntityProcessor for further custom processing. ]
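
          A hedged sketch of the chaining described in the comment above: the outer TikaEntityProcessor emits XML into its implicit 'text' column, and a nested XPathEntityProcessor reads that column through a FieldReaderDataSource. The entity names, the forEach path, and the 'paragraph' field are hypothetical. The markup Tika actually emits is an assumption, so the XPath expressions would need to be adapted to it.

          {code:xml}
          <dataConfig>
            <dataSource name="bin" type="BinURLDataSource"/>
            <!-- reads its input from a field of the parent entity instead of a file or URL -->
            <dataSource name="fld" type="FieldReaderDataSource"/>
            <document>
              <entity name="tika" dataSource="bin" processor="TikaEntityProcessor" url="${some.var.goes.here}" emitFormat="xml">
                <!-- 'dataField' points at the XML emitted by the parent entity -->
                <entity name="para" dataSource="fld" processor="XPathEntityProcessor" dataField="tika.text" forEach="/html/body/p">
                  <field column="paragraph" xpath="/html/body/p"/>
                </entity>
              </entity>
            </document>
          </dataConfig>
          {code}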
          Noble Paul made changes -
          Attachment SOLR-1358.patch [ 12427425 ]
          Noble Paul made changes -
          Attachment SOLR-1358.patch [ 12427429 ]
          Noble Paul made changes -
          Attachment SOLR-1358.patch [ 12427340 ]
          Noble Paul made changes -
          Attachment SOLR-1358.patch [ 12427425 ]
          Akshay K. Ukey made changes -
          Attachment SOLR-1358.patch [ 12427474 ]
          Akshay K. Ukey made changes -
          Attachment SOLR-1358.patch [ 12427721 ]
          Noble Paul made changes -
          Status Open [ 1 ] Resolved [ 5 ]
          Fix Version/s 1.5 [ 12313566 ]
          Resolution Fixed [ 1 ]

            People

            • Assignee:
              Noble Paul
              Reporter:
              Sascha Szott
            • Votes:
              2
              Watchers:
              10
