Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.11
    • Component/s: indexer
    • Labels: None
    • Flags: Patch

      Description

      Once we have made the indexers pluggable, we should add a plugin for Amazon CloudSearch. See http://aws.amazon.com/cloudsearch/. Apparently it uses a JSON-based representation, the Search Data Format (SDF), which we could reuse for a file-based indexer.
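
      For orientation, an SDF batch for the 2011-02-01 document API is essentially a JSON array of add/delete operations. The sketch below is illustrative only (the document ids and field names are invented); see the AWS CloudSearch documentation for the exact schema:

        [
          { "type": "add", "id": "example-doc-1", "version": 1, "lang": "en",
            "fields": { "title": "Example page", "content": "Example body text" } },
          { "type": "delete", "id": "example-doc-2", "version": 2 }
        ]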

        Issue Links

          Activity

          tomsolr Tom Hill added a comment -

          Indexer that sends output to CloudSearch

          tomsolr Tom Hill added a comment -

          I've attached a patch that adds CloudSearch as a pluggable indexing back-end.

          Slightly verbose description of how to test:

          1. Create a CloudSearch domain
          Note the document endpoint.
          I created the following fields in the domain (field name, status, index type, options):

          anchor Active text (Result)
          author Active literal (Search Result)
          boost Active literal (Search Result)
          cache Active literal (Search Result)
          content Active text (Result)
          content_length Active literal (Search Result)
          digest Active literal (Search Result)
          feed Active literal (Search Result)
          host Active literal (Search Result)
          id Active literal (Search Result)
          lang Active literal (Search Result)
          published_date Active uint ()
          segment Active literal (Search Result)
          subcollection Active literal (Search Result)
          tag Active literal (Search Result)
          text Active text (Result)
          title Active text (Result)
          tstamp Active uint ()
          type Active literal (Search Result)
          updated_date Active uint ()
          url Active text (Result)

          2. Check out Nutch
          git clone https://github.com/apache/nutch
          3. Switch to the 1.7 branch
          git checkout -t origin/branch-1.7
          4. Apply the attached patch
          I created it with: git diff remotes/origin/branch-1.7 --no-prefix > indexer-cloudsearch.patch
          and applied it with: patch -p0 -i ~/code/nutch/indexer-cloudsearch.patch
          5. Edit conf/nutch-default.xml
          Add the document endpoint under the cloudsearch parameters (add http:// on the front and /2011-02-01/documents/batch on the end).
          Change the line with "indexer-solr" to "indexer-cloudsearch". (See the example property sketch after these steps.)
          6. Build Nutch
          Just run "ant" in the top directory.
          This builds the "runtime" directory, and "local" under that.
          7. cd to nutch/runtime/local
          8. Follow the tutorial at http://wiki.apache.org/nutch/NutchTutorial:
          1) You've done step #1 already.
          2) Step 2 I didn't have to do; it was all correct already.
          3) Do step 3, but stop before 3.1.
          a) Then do this: bin/nutch crawl urls -dir crawl -depth 3 -topN 5
          b) Skip 3.2 through 5.x.
          4) Skip tutorial step 4.
          5) Skip tutorial step 5.
          6) Do parts of step 6:
          Check that the domain is ready.
          Then just run the one line:
          bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/*
          Don't worry about the URL, it's ignored. The real URL comes from nutch-default.xml (set above).
          (This is a hack, since I'm not sure how to integrate properly. Hopefully someone can help here.)
          9. Check logs/hadoop.log
          It should show the adds sent to CloudSearch. Errors show up there, too.
          You might have to set the logging level to INFO in nutch/runtime/local/conf/log4j.properties.
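
          For step 5, the edits would look roughly like the two properties below. This is only a sketch: cloudsearch.endpoint is the key reported by the plugin, the endpoint value is a placeholder for your own domain, and the plugin.includes value assumes the stock 1.7 plugin list with "indexer-solr" swapped for "indexer-cloudsearch", so double-check it against your nutch-default.xml.

          <!-- Document endpoint of the CloudSearch domain (placeholder hostname) -->
          <property>
            <name>cloudsearch.endpoint</name>
            <value>http://doc-YOURDOMAIN.us-east-1.cloudsearch.amazonaws.com/2011-02-01/documents/batch</value>
          </property>

          <!-- Same plugin list as the default, with the indexing back-end switched -->
          <property>
            <name>plugin.includes</name>
            <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-cloudsearch|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
          </property>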

          djc391 Daniel Ciborowski added a comment -

          Does the above patch disable Solr indexing?

          tomsolr Tom Hill added a comment -

          I believe you can configure either in nutch-default.xml, but not both.

          jnioche Julien Nioche added a comment -

          Tom - by convention, changes made by users go in nutch-site.xml, whereas nutch-default.xml is used to list the parameters and their default values. It is just a convention, but it helps to find what has been changed specifically for a given setup.

          You can have multiple indexing backends in use at the same time and, e.g., index with both Solr and Elasticsearch, provided of course that their respective plugins are activated in nutch-site.xml and that they are properly configured.
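
          As an illustration, running two back-ends at once is just a matter of listing both indexer plugins in plugin.includes in nutch-site.xml. The sketch below assumes the plugin directories are named indexer-solr and indexer-elastic; check the folder names under src/plugin for the exact ids in your version.

          <!-- Activate both the Solr and Elasticsearch indexing plugins (ids assumed) -->
          <property>
            <name>plugin.includes</name>
            <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-(solr|elastic)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
          </property>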

          tomsolr Tom Hill added a comment -

          Thanks for the clarification!

          djc391 Daniel Ciborowski added a comment -

          I have followed the above process, but am getting errors when trying to do "bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/*"

          13/09/06 18:03:19 ERROR security.UserGroupInformation: PriviledgedActionException as:hadoop cause:org.apache.hadoop.mapred.InvalidInputException: Input Pattern hdfs://10.148.178.153:9000/user/hadoop/crawl/segments/*/crawl_fetch matches 0 files
          Input Pattern hdfs://10.148.178.153:9000/user/hadoop/crawl/segments/*/crawl_parse matches 0 files
          Input Pattern hdfs://10.148.178.153:9000/user/hadoop/crawl/segments/*/parse_data matches 0 files
          Input Pattern hdfs://10.148.178.153:9000/user/hadoop/crawl/segments/*/parse_text matches 0 files
          Input path does not exist: hdfs://10.148.178.153:9000/user/hadoop/crawl/linkdb/current
          13/09/06 18:03:19 ERROR indexer.IndexingJob: Indexer: org.apache.hadoop.mapred.InvalidInputException: Input Pattern hdfs://10.148.178.153:9000/user/hadoop/crawl/segments/*/crawl_fetch matches 0 files
          Input Pattern hdfs://10.148.178.153:9000/user/hadoop/crawl/segments/*/crawl_parse matches 0 files
          Input Pattern hdfs://10.148.178.153:9000/user/hadoop/crawl/segments/*/parse_data matches 0 files
          Input Pattern hdfs://10.148.178.153:9000/user/hadoop/crawl/segments/*/parse_text matches 0 files
          Input path does not exist: hdfs://10.148.178.153:9000/user/hadoop/crawl/linkdb/current

          Any suggestions?

          tomsolr Tom Hill added a comment -

          Did you do step 3a: "bin/nutch crawl urls -dir crawl -depth 3 -topN 5"?

          djc391 Daniel Ciborowski added a comment -

          I did, but I noticed that this is not creating a crawl/segments/ folder after running.

          tomsolr Tom Hill added a comment -

          I don't believe my patch affects processing at that point. Could you try the steps on an unpatched Nutch 1.7, and make sure the crawl is working properly?

          djc391 Daniel Ciborowski added a comment - edited

          Okay, so that issue of no segments being placed in HDFS was because I was using runtime/deploy/ instead of runtime/local/, so I'll worry about that later. I got it to run and all, but now I am running into this error:

          CloudSearchIndexWriter
          cloudsearch.endpoint : URL of the CloudSearch domain's document endpoint. (mandatory)

          I have set the value in conf/nutch-default.xml like this:

          <value>http://doc-placesearch-BLAHBLAHBLAH.us-east-1.cloudsearch.amazonaws.com/2011-02-01/documents/batch</value>

          My CloudSearch access policy is set to 0.0.0.0/0.

          tomsolr Tom Hill added a comment -

          It seems to print that message for me, even when it works. I may not have done something correctly.

          Please check your logs in the logs directory and see what they say. Or check your CloudSearch domain and see if the documents made it there.

          djc391 Daniel Ciborowski added a comment -

          Sorry, you are right: I have 10 documents in there.

          Now to try and figure out how to get this running with runtime/deploy, so that I can index my items on HDFS. I am not sure if I am getting errors because of HDFS, or because I am running on Amazon EMR.
          Thanks for the help! Once I have finished my "install" script I will post it.

          djc391 Daniel Ciborowski added a comment - edited

          git clone https://github.com/apache/nutch
          wget https://issues.apache.org/jira/secure/attachment/12601469/0023883254_1377197869_indexer-cloudsearch.patch
          cd nutch/
          git checkout -t origin/branch-1.7
          patch -p0 -i ~/0023883254_1377197869_indexer-cloudsearch.patch
          vi conf/nutch-site.xml
          ant
          cd runtime/local/
          mkdir -p urls
          echo "http://www.princeton.edu/" > ./urls/seeds.txt
          bin/nutch crawl urls -dir crawl -depth 3 -topN 5
          bin/nutch index crawl/crawldb -linkdb crawl/linkdb crawl/segments/*

          The vi step is where I add my crawler name, change solr to cloudsearch, and add my endpoint URL. I tried to do this with sed to replace lines but couldn't figure it out. (A scripted alternative is sketched below.)
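
          For a fully scripted setup, one alternative to vi/sed is to keep a prepared nutch-site.xml next to the install script and copy it into conf/ before running ant. The sketch below assumes http.agent.name is the property for the crawler name (the standard Nutch one) and reuses the plugin.includes and cloudsearch.endpoint properties from the earlier sketch:

          <?xml version="1.0"?>
          <configuration>
            <!-- Crawler name required by Nutch -->
            <property>
              <name>http.agent.name</name>
              <value>MyTestCrawler</value>
            </property>
            <!-- plus the plugin.includes and cloudsearch.endpoint properties shown earlier -->
          </configuration>

          Then something like "cp nutch-site-cloudsearch.xml nutch/conf/nutch-site.xml" (file name hypothetical) replaces the vi step.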

          Edits based on feedback.

          jnioche Julien Nioche added a comment -

          Why are you using the solrindex command? The generic 'nutch index' one would make more sense. Look at the content of the nutch script to see how the solrindex command is converted into the generic one.
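
          For anyone following along: the mapping in bin/nutch is roughly of the shape below (paraphrased, not a verbatim quote of the script). Both commands end up running the org.apache.nutch.indexer.IndexingJob class seen in the stack traces above; solrindex merely turns its first argument into the solr.server.url property.

          # solrindex: generic indexer plus a Solr URL property taken from the first argument
          elif [ "$COMMAND" = "solrindex" ] ; then
            CLASS="org.apache.nutch.indexer.IndexingJob -D solr.server.url=$1"
            shift
          # index: the generic indexer
          elif [ "$COMMAND" = "index" ] ; then
            CLASS=org.apache.nutch.indexer.IndexingJob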

          tomsolr Tom Hill added a comment -

          I was just trying to change as little as possible from the example. I'll take a look.

          tomsolr Tom Hill added a comment -

          @Daniel, per Julien's comment, you should probably be editing nutch-site.xml, instead of nutch-default.xml. That should make it easier, as you can just keep the edited version around and copy it over.

          djc391 Daniel Ciborowski added a comment -

          Updated my script.

          I am running into errors now if I use "segments/*" when trying to run in deployed HDFS mode. If I select an individual segment then it works fine.

          djc391 Daniel Ciborowski added a comment - edited

          Current Error Message
          java.io.IOException: Split metadata size exceeded 10000000. Aborting job job_201309111525_0003
          at org.apache.hadoop.mapreduce.split.SplitMetaInfoReader.readSplitMetaInfo(SplitMetaInfoReader.java:48)
          at org.apache.hadoop.mapred.JobInProgress.createSplits(JobInProgress.java:1079)
          at org.apache.hadoop.mapred.JobInProgress.initTasks(JobInProgress.java:969)
          at org.apache.hadoop.mapred.JobTracker.initJob(JobTracker.java:4237)
          at org.apache.hadoop.mapred.EagerTaskInitializationListener$InitJob.run(EagerTaskInitializationListener.java:79)
          at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
          at java.lang.Thread.run(Thread.java:724)

          13/09/11 17:10:46 ERROR indexer.IndexingJob: Indexer: java.io.IOException: Job failed!
          at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1320)
          at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:123)
          at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:185)
          at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
          at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:195)
          at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
          at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
          at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
          at java.lang.reflect.Method.invoke(Method.java:606)
          at org.apache.hadoop.util.RunJar.main(RunJar.java:187)

          Solution:
          Add the property "mapreduce.jobtracker.split.metainfo.maxsize" and set its value to -1, like this:
          <!-- In: conf/mapred-site.xml -->
          <property>
            <name>mapreduce.jobtracker.split.metainfo.maxsize</name>
            <value>-1</value>
          </property>

          http://blog.dongjinleekr.com/my-hadoop-job-crashes-with-split-metadata-size-exceeded/

          tomsolr Tom Hill added a comment -

          I don't think that has any relation to the CloudSearch indexer. Googling the error "Split metadata size exceeded 10000000." gets a number of discussions of this, and how to fix it. How much data do you have?

          tomsolr Tom Hill added a comment -

          Can CloudSearchIndexWriter.write() be called from multiple threads? If so, I need to synchronize some methods. (And I think the Solr indexer does, too.)

          jnioche Julien Nioche added a comment -

          Hi Tom. It is not currently called from multiple threads, but that could be the case in the future, so it would be safer to make your code thread safe.

          jnioche Julien Nioche added a comment -

          Tom,

          I had a quick look at your plugin. Here are a few things I found:

          • serious bug: the batch doesn't get cleared in the CloudSearchBatcher. As a result the batch gets larger and larger, with the same docs sent multiple times
          • build your patch against the trunk - not a released version - some things might have moved since
          • move the README file to the src/plugin/indexer-cloudsearch dir
          • populate the language field from the value generated by the LanguageIndexingFilter, with a default of 'en' if it's not there
          • in the README and in your comments above, maybe explain where to find the docs for CloudSearch, how to create a domain, declare the fields, etc. People usually know how to apply a patch and run Nutch, but not how to deal with CloudSearch
          • use the generic indexer - not the solr command

          Thanks

          Julien

          jnioche Julien Nioche added a comment -

          And maybe allow an option to dump the JSON batch to a file? That would be useful for debugging and also for detecting the fields automatically with cs-configure-from-sdf.

          Julien

          jnioche Julien Nioche added a comment -

          I had another look at the code. It should handle documents marked for deletion and have more robust handling of the fields (e.g. with a mapping mechanism as in Solr). It currently fails to remove unsupported characters if they are in fields other than the 2 you hardcoded. The regex which checks the validity of a field name is not correct, as it can let through strings starting with an underscore, which is not allowed.
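
          For reference, the CloudSearch field-name rules (start with a letter; lowercase letters, digits and underscores only; a length limit of 3 to 64 characters if memory serves) would correspond to a pattern along these lines. This is a sketch of the constraint, not the plugin's actual regex:

          [a-z][a-z0-9_]{2,63}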

          tomsolr Tom Hill added a comment -

          Thanks for the thorough review. I've already got the serious bug fixed; I'm just doing some testing before uploading the fixed version. I'll try to get the things you mentioned covered in the next version.

          tomsolr Tom Hill added a comment -

          And I'll try to get mapping, and writing to a file, covered in my version. Going forward, it seems like this might be common functionality for a base class for all the indexers.

          tomsolr Tom Hill added a comment -

          Updated patch. Fixes the bug where the batch was not cleared.

          ji.kwon.lim Ji Kwon Lim added a comment -

          Hi,

          We are attempting to use Nutch with CloudSearch, and we are using the patch provided in this ticket. However, we noticed that the patch seems to be incomplete, requiring a manual change to org.apache.nutch.parse.MetaTagsParser.java to replace all references to 'metadata.add("metatag."' with 'metadata.add("metatag_"', changing out the period for an underscore. Is there a newer patch that addresses this issue, or a newer process altogether for getting Nutch to work with CloudSearch? If not, could we get an update to the patch to include the change to org.apache.nutch.parse.MetaTagsParser.java that's necessary for the indexer to work properly?

          Regards,

          Ji Kwon Lim

          jnioche Julien Nioche added a comment -

          New implementation of the CloudSearchIndexWriter; it uses the latest version of the CloudSearch API. See the README file for instructions.

          jnioche Julien Nioche added a comment -

          New version of the patch, which fixes a small issue with the config when running in BATCH_DUMP mode.

          jorgelbg Jorge Luis Betancourt Gonzalez added a comment -

          +1. I haven't been able to do any tests (no access to CloudSearch), but so far it's looking good! Does anyone else want to comment?

          jnioche Julien Nioche added a comment -

          Thanks Jorge Luis Betancourt Gonzalez. Will commit soon unless someone objects.

          jnioche Julien Nioche added a comment -

          Committed to trunk in revision 1697911.

          Thanks for the comments and review.

          lewismc Lewis John McGibbney added a comment -

          Nice patch, folks.


            People

            • Assignee: jnioche Julien Nioche
            • Reporter: jnioche Julien Nioche
            • Votes: 1
            • Watchers: 8
