Details
Type: New Feature
Status: Closed
Priority: Minor
Resolution: Won't Fix
Affects Version/s: 1.4.1
Fix Version/s: None
Description
DIH (the DataImportHandler) doesn't support reading from the hdfs:// protocol, which makes it hard to index data generated by an M/R job. This tarball contains a subclass of URLDataSource along with an HDFSReader that adds this support. The data is assumed to be in text format and processable by the LineEntityProcessor.
Here is an example DIH-Config snippet:
<dataConfig>
  <dataSource name="queryData" type="org.apache.solr.handler.dataimport.hdfs.HDFSDataSource"
              baseUrl="hdfs://<YOURSERVER>:9000/" encoding="UTF-8"
              connectionTimeout="5000" readTimeout="10000"/>
  <document name="autoSuggester">
    <entity name="jc" processor="LineEntityProcessor"
            url="<YOUR FOLDER>/part*" dataSource="queryData">
      <!-- Field mappings here if necessary -->
    </entity>
  </document>
</dataConfig>
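As a rough illustration of what the field mappings might look like: LineEntityProcessor emits each line of input in a single column named rawLine, so a transformer is normally needed to split it into fields. The sketch below assumes the M/R job wrote tab-separated key/value lines (the usual shape of Hadoop text output) and uses DIH's RegexTransformer to pull out each half. The field names query and count are hypothetical, and the placeholders are the same as in the snippet above; adjust both to match your schema and data.

```xml
<entity name="jc" processor="LineEntityProcessor"
        url="<YOUR FOLDER>/part*" dataSource="queryData"
        transformer="RegexTransformer">
  <!-- Assumption: each line looks like "some query<TAB>42" -->
  <!-- Text before the first tab goes into the (hypothetical) query field -->
  <field column="query" sourceColName="rawLine" regex="^([^\t]*)\t.*$"/>
  <!-- Text after the first tab goes into the (hypothetical) count field -->
  <field column="count" sourceColName="rawLine" regex="^[^\t]*\t(.*)$"/>
</entity>
```

If the job's output uses a different delimiter or has more columns, only the regex attributes should need to change.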