Description
I found a discrepancy between the execution paths Nutch follows in local standalone mode and in server mode.
In local standalone mode, once the indexing process is done, the document and its fields are indexed and committed in Solr, and the document is returned if queried immediately. When the same indexing is done through server mode, the document is indexed but not committed in Solr, so it is not returned if queried immediately. After Solr is restarted, the indexed document is returned when queried.
I browsed through the IndexingJob.java file to understand the cause for this. I found out:
- There are two different entry paths for the local standalone mode and the server mode
  - Server mode entry point: public Map<String, Object> run(Map<String, Object> args)
  - Standalone mode entry point:
    - public int run(String[] args)
    - public void index(String batchId)
- The local standalone mode path does more work than the server mode path
  - The public void index(String batchId) function first calls the server mode path: public Map<String, Object> run(Map<String, Object> args)
  - It then performs these extra steps:
    - Gets the IndexWriters
    - Using the IndexWriters, describes them
    - Using the IndexWriters, commits if COMMIT_INDEX=true is specified in the configuration
- These extra steps are not performed in the server mode
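The two paths described above can be modelled with a small, self-contained sketch. This is not the Nutch source: `SketchIndexWriters`, `CurrentIndexingJobSketch`, the `"batch"` key, and the `commitIndex` flag are hypothetical stand-ins, used only to show why the commit is reached from the standalone path and not from the server path.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the index writers; NOT the real Nutch class.
class SketchIndexWriters {
    // Records each operation so the two paths can be compared.
    final List<String> calls = new ArrayList<>();

    String describe() {
        calls.add("describe");
        return "SketchIndexWriters";
    }

    void commit() {
        calls.add("commit");
    }
}

// Toy model of the current layout described in this issue.
class CurrentIndexingJobSketch {
    final SketchIndexWriters writers = new SketchIndexWriters();
    final boolean commitIndex; // assumed stand-in for the COMMIT_INDEX setting

    CurrentIndexingJobSketch(boolean commitIndex) {
        this.commitIndex = commitIndex;
    }

    // Server mode entry point: indexes the batch but never commits.
    public Map<String, Object> run(Map<String, Object> args) {
        writers.calls.add("index:" + args.get("batch"));
        return new HashMap<>();
    }

    // Standalone mode entry point: delegates to index(batchId).
    public int run(String[] args) {
        index(args[0]);
        return 0;
    }

    // Standalone-only step: server path first, then describe and commit.
    public void index(String batchId) {
        Map<String, Object> argMap = new HashMap<>();
        argMap.put("batch", batchId);
        run(argMap);
        writers.describe();
        if (commitIndex) {
            writers.commit();
        }
    }
}
```

In this model, calling `run(new String[] {"batch-1"})` records an index, a describe, and a commit, while calling `run(Map)` directly records only the index step, which mirrors the behaviour observed against Solr.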
I feel the execution paths for both modes should be the same, and hence propose to:
- Move the extra steps done using IndexWriters in public void index(String batchId) to the end of the server mode execution path, i.e. the public Map<String, Object> run(Map<String, Object> args) function
- Call the public Map<String, Object> run(Map<String, Object> args) function directly from the standalone mode entry point: public int run(String[] args)
- public void index(String batchId) then becomes redundant and can be safely removed.
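A minimal sketch of the proposed layout, under the same caveat that every name here (`ProposedWritersSketch`, `ProposedIndexingJobSketch`, the `"batch"` key, `commitIndex`) is a hypothetical stand-in and not the attached patch. It keeps public int run(String[] args) as a thin wrapper that calls run(Map) directly, as the second point describes, with the describe and commit steps moved to the end of run(Map).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the index writers; NOT the real Nutch class.
class ProposedWritersSketch {
    // Records each operation so both entry points can be compared.
    final List<String> calls = new ArrayList<>();

    String describe() {
        calls.add("describe");
        return "ProposedWritersSketch";
    }

    void commit() {
        calls.add("commit");
    }
}

// Toy model of the proposed layout: one shared execution path.
class ProposedIndexingJobSketch {
    final ProposedWritersSketch writers = new ProposedWritersSketch();
    final boolean commitIndex; // assumed stand-in for the COMMIT_INDEX setting

    ProposedIndexingJobSketch(boolean commitIndex) {
        this.commitIndex = commitIndex;
    }

    // Shared path: index, then describe and (optionally) commit.
    public Map<String, Object> run(Map<String, Object> args) {
        writers.calls.add("index:" + args.get("batch"));
        writers.describe();
        if (commitIndex) {
            writers.commit();
        }
        return new HashMap<>();
    }

    // Standalone entry point: builds the argument map and calls run(Map).
    public int run(String[] args) {
        Map<String, Object> argMap = new HashMap<>();
        argMap.put("batch", args[0]);
        run(argMap);
        return 0;
    }
}
```

With this shape, both entry points record the same sequence of operations, so a document indexed through server mode would be committed under the same condition as one indexed in standalone mode.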
I have attached the proposed patch to this issue. Kindly review and approve it.