Solr / SOLR-2096

DIH should be able to read data directly from HDFS for indexing

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Won't Fix
    • Affects Version/s: 1.4.1
    • Fix Version/s: 4.9, 6.0
    • Labels:
      None

      Description

      DIH doesn't support reading from the hdfs:// protocol, which makes it hard to index data generated by a MapReduce (M/R) job. The attached tarball contains a subclass of URLDataSource along with an HDFSReader that adds this support. The data is assumed to be in text format and processable by the LineEntityProcessor.

      Here is an example DIH-Config snippet:
      <dataConfig>
        <dataSource name="queryData" type="org.apache.solr.handler.dataimport.hdfs.HDFSDataSource"
                    baseUrl="hdfs://<YOURSERVER>:9000/" encoding="UTF-8"
                    connectionTimeout="5000" readTimeout="10000"/>
        <document name="autoSuggester">
          <entity name="jc" processor="LineEntityProcessor"
                  url="<YOUR FOLDER>/part*" dataSource="queryData">
            <!-- Field mappings here if necessary -->
          </entity>
        </document>
      </dataConfig>
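
      The field-mappings comment in the snippet leaves the mapping open. As a minimal sketch, assuming the M/R job wrote tab-separated key/value lines (the default layout of Hadoop's TextOutputFormat), the entity could split each line with DIH's RegexTransformer. LineEntityProcessor emits each line in a single column named rawLine; the names query and count are hypothetical and would need matching fields in the Solr schema.

```xml
<entity name="jc" processor="LineEntityProcessor"
        url="<YOUR FOLDER>/part*" dataSource="queryData"
        transformer="RegexTransformer">
  <!-- rawLine is the single column LineEntityProcessor produces per input line -->
  <!-- query and count are hypothetical names: group 1 goes to query, group 2 to count -->
  <field column="rawLine" regex="^([^\t]*)\t(.*)$" groupNames="query,count"/>
</entity>
```

      If the part files hold one value per line instead, no transformer is needed and rawLine can be mapped to a schema field directly.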

        Attachments

        1. hdfs_reader.tar
          30 kB
          Amit Nithian

            People

            • Assignee: Unassigned
            • Reporter: Amit Nithian (anithian)
            • Votes: 2
            • Watchers: 5