Solr / SOLR-2096

DIH should be able to read data directly from HDFS for indexing


Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Won't Fix
    • Affects Version/s: 1.4.1
    • Fix Version/s: 4.9, 6.0
    • None

    Description

      DIH doesn't support reading from the hdfs:// protocol, which makes it hard to index data generated by an M/R job. The attached tarball contains a subclass of URLDataSource, along with an HDFSReader, that adds this support. The data is assumed to be in text format and processable by the LineEntityProcessor.

      Here is an example DIH-Config snippet:
      <dataConfig>
        <dataSource name="queryData" type="org.apache.solr.handler.dataimport.hdfs.HDFSDataSource"
                    baseUrl="hdfs://<YOURSERVER>:9000/" encoding="UTF-8"
                    connectionTimeout="5000" readTimeout="10000"/>
        <document name="autoSuggester">
          <entity name="jc" processor="LineEntityProcessor"
                  url="<YOUR FOLDER>/part*" dataSource="queryData">
            <!-- Field mappings here if necessary -->
          </entity>
        </document>
      </dataConfig>
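
The `url="<YOUR FOLDER>/part*"` pattern suggests that the data source expands a glob over the M/R output files and hands their contents to the LineEntityProcessor one line at a time. The attached tarball is not reproduced here, so the following is only a minimal sketch of that idea and not the actual HDFSDataSource code. It assumes the local filesystem as a stand-in for HDFS and uses only the Java standard library; the class name `PartFileLines` and the method `readLines` are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper: the local filesystem stands in for HDFS here.
final class PartFileLines {

    /**
     * Collects every line from the files in dir whose names match the glob
     * (for example "part*"), visiting the files in sorted path order.
     */
    static List<String> readLines(Path dir, String glob) throws IOException {
        // Expand the glob into the list of matching part files.
        List<Path> parts = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, glob)) {
            for (Path p : stream) {
                parts.add(p);
            }
        }
        // Directory iteration order is unspecified, so sort for a stable result.
        Collections.sort(parts);

        // Read each part file as UTF-8 text, one record per line.
        List<String> lines = new ArrayList<>();
        for (Path part : parts) {
            try (BufferedReader reader = Files.newBufferedReader(part, StandardCharsets.UTF_8)) {
                String line;
                while ((line = reader.readLine()) != null) {
                    lines.add(line);
                }
            }
        }
        return lines;
    }
}
```

Calling `PartFileLines.readLines(dir, "part*")` on a directory of job output would skip files such as `_SUCCESS` that do not match the pattern. A real HDFS-backed version would need to replace the `java.nio.file` calls with the Hadoop filesystem client and would more likely stream lines lazily than collect them in memory.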

      Attachments

        1. hdfs_reader.tar
          30 kB
          Amit Nithian


          People

            Assignee: Unassigned
            Reporter: Amit Nithian (anithian)
            Votes: 2
            Watchers: 4
