Solr / SOLR-1301

Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.7, 5.0
    • Component/s: None
    • Labels:
      None

      Description

      This patch contains a contrib module that provides distributed indexing (using Hadoop) to Solr EmbeddedSolrServer. The idea behind this module is twofold:

      • provide an API that is familiar to Hadoop developers, i.e. that of OutputFormat
      • avoid unnecessary export and (de)serialization of data maintained on HDFS. SolrOutputFormat consumes data produced by reduce tasks directly, without storing it in intermediate files. Furthermore, by using an EmbeddedSolrServer, the indexing task is split into as many parts as there are reducers, and the data to be indexed is not sent over the network.

      Design
      ----------

      Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, which in turn uses SolrRecordWriter to write this data. SolrRecordWriter instantiates an EmbeddedSolrServer, and it also instantiates an implementation of SolrDocumentConverter, which is responsible for turning a Hadoop (key, value) pair into a SolrInputDocument. This data is then added to a batch, which is periodically submitted to the EmbeddedSolrServer. When a reduce task completes and the OutputFormat is closed, SolrRecordWriter calls commit() and optimize() on the EmbeddedSolrServer.
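      For illustration, a converter for simple text records might look roughly like the following sketch. It assumes SolrDocumentConverter exposes a single convert(key, value) method returning a collection of documents; the class and field names here are made up.

        import java.util.Collection;
        import java.util.Collections;

        import org.apache.hadoop.io.Text;
        import org.apache.solr.common.SolrInputDocument;

        // Hypothetical converter: turns a (docId, text) pair emitted by a reducer
        // into a single SolrInputDocument. Field names must match the schema in
        // the supplied solr.home/conf.
        public class TextDocumentConverter extends SolrDocumentConverter<Text, Text> {
          @Override
          public Collection<SolrInputDocument> convert(Text key, Text value) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", key.toString());
            doc.addField("text", value.toString());
            return Collections.singletonList(doc);
          }
        }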

      The API provides facilities to specify an arbitrary existing solr.home directory, from which the conf/ and lib/ files will be taken.

      This process results in the creation of as many partial Solr home directories as there were reduce tasks. The output shards are placed in the output directory on the default filesystem (e.g. HDFS). Such part-NNNNN directories can be used to run N shard servers. Users can also choose the number of reduce tasks; with a single reduce task the output consists of a single shard.

      An example application is provided that processes large CSV files using this API. It uses custom CSV processing to avoid (de)serialization overhead.
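      For orientation, a minimal driver wiring up this output format might look roughly like the sketch below (old mapred API). The two setup helpers are assumptions based on the description above, not necessarily the patch's exact method names.

        import java.io.File;

        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.mapred.FileInputFormat;
        import org.apache.hadoop.mapred.FileOutputFormat;
        import org.apache.hadoop.mapred.JobClient;
        import org.apache.hadoop.mapred.JobConf;

        public class CsvIndexerDriver {
          public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(CsvIndexerDriver.class);
            conf.setJobName("csv-to-solr-shards");

            // One shard per reducer; 4 reducers yield part-00000 .. part-00003.
            conf.setNumReduceTasks(4);
            conf.setOutputFormat(SolrOutputFormat.class);

            // Hypothetical helpers: ship the local solr.home (conf/ and lib/) with
            // the job and register the converter; exact names depend on the patch.
            SolrOutputFormat.setupSolrHomeCache(new File("/path/to/solr/home"), conf);
            SolrDocumentConverter.setSolrDocumentConverter(TextDocumentConverter.class, conf);

            // Mapper/reducer classes and input format setup omitted for brevity.
            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));
            JobClient.runJob(conf);
          }
        }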

      This patch relies on hadoop-core-0.19.1.jar - I have attached the jar to this issue; you should put it in contrib/hadoop/lib.

      Note: the development of this patch was sponsored by an anonymous contributor and approved for release under Apache License.

      1. hadoop.patch (28 kB) - Andrzej Bialecki
      2. hadoop-0.19.1-core.jar (2.27 MB) - Andrzej Bialecki
      3. SOLR-1301.patch (33 kB) - Jason Rutherglen
      4. SOLR-1301.patch (34 kB) - Jason Rutherglen
      5. SOLR-1301.patch (34 kB) - Kris Jirapinyo
      6. SolrRecordWriter.java (10 kB) - Jason Venner (www.prohadoop.com)
      7. README.txt (3 kB) - Jason Venner (www.prohadoop.com)
      8. SOLR-1301.patch (60 kB) - Jason Rutherglen
      9. commons-logging-1.0.4.jar (37 kB) - Jason Rutherglen
      10. commons-logging-api-1.0.4.jar (26 kB) - Jason Rutherglen
      11. log4j-1.2.15.jar (383 kB) - Jason Rutherglen
      12. SOLR-1301.patch (61 kB) - Jason Rutherglen
      13. SOLR-1301.patch (61 kB) - Jason Rutherglen
      14. SOLR-1301.patch (61 kB) - Jason Rutherglen
      15. SOLR-1301-hadoop-0-20.patch (57 kB) - Alexander Kanarsky
      16. SOLR-1301-hadoop-0-20.patch (57 kB) - Matt Revelle
      17. hadoop-0.20.1-core.jar (2.56 MB) - Jason Rutherglen
      18. SOLR-1301.patch (64 kB) - Alexander Kanarsky
      19. SOLR-1301.patch (64 kB) - Alexander Kanarsky
      20. SOLR-1301.patch (58 kB) - Alexander Kanarsky
      21. hadoop-core-0.20.2-cdh3u3.jar (3.43 MB) - Alexander Kanarsky
      22. SOLR-1301.patch (58 kB) - Greg Bowyer
      23. SOLR-1301.patch (963 kB) - Mark Miller
      24. SOLR-1301.patch (2.33 MB) - Mark Miller
      25. SOLR-1301.patch (2.44 MB) - Mark Miller
      26. SOLR-1301.patch (2.47 MB) - Mark Miller
      27. SOLR-1301.patch (2.49 MB) - Mark Miller
      28. SOLR-1301.patch (4.59 MB) - Mark Miller
      29. SOLR-1301.patch (4.59 MB) - Mark Miller
      30. SOLR-1301-maven-intellij.patch (39 kB) - Steve Rowe


          Activity

          Jason Rutherglen added a comment -

          I downloaded the patch. I'd like to be able to execute this as an ant target and, if possible, integrate test cases.

          Jason Rutherglen added a comment -

          I think we'll want to integrate this patch with Katta which
          conveniently creates many shards and merges them in Hadoop.

          The merging in Hadoop is a bit tricky: if we create numerous
          shards using the current patch, merging the shards into the
          existing index with MergeIndexesCommand would likely create too
          much IO and CPU overhead on what would be the search server and
          possibly degrade search performance.

          Instead of maintaining separate write servers, I would rather
          allocate all Solr servers as read only, and rely on Hadoop (with
          EC2) for quickly reindexing all documents (i.e. for a schema
          change) or incremental indexing. I'm not sure how document
          updates should be handled.

          Jason Rutherglen added a comment -

          Though I haven't tested it, in browsing the Katta code it looks
          like it isn't merging shards in Hadoop, as I couldn't find a call
          to IW.addIndexes (which is used to merge indexes in different
          directories). I'm not sure how expensive it is to copy two
          shards out of HDFS, merge them, then copy the newly merged shard
          back to HDFS, delete the old shards, then notify Zookeeper of
          the changes. Maybe we can expand on this patch to add those
          capabilities, or add the functionality to Katta.

          Andrzej Bialecki added a comment -

          This patch is intended to work with Solr as it is now, and the idea is to use Hadoop to build shards (in the Solr sense) so that they can be used by the current Solr distributed search. I have no idea how / whether Katta/Zookeeper fits into this picture - if you want to pursue this integration, I feel it would be best to do it in a separate issue.

          Jason Rutherglen added a comment -

          Andrzej,

          • Are you going to add a way to automatically add an index to a Solr core?
          • Are you planning on adding test cases for this patch?
          • How does one set the maximum size of a generated shard?
          Andrzej Bialecki added a comment -

          Are you going to add a way to automatically add an index to a Solr core?

          This way already exists by (ab)using the forced replication to a slave from a temporary master.

          Are you planning on adding test cases for this patch?

          This functionality requires a running Hadoop cluster. I'm not sure how to write functional tests without bringing more Hadoop dependencies. I could add unit tests that test some aspects of the patch, but they would be trivial.

          How does one set the maximum size of a generated shard?

          One doesn't, at the moment. The size of each shard (in number of documents) is the total number of records divided by the number of reduce tasks.

          Ken Krugler added a comment -

          Hi Jason,

          Re Katta, you're right that it doesn't support merging indexes. In a way, it does run-time merging by searching across multiple shards, though you can't add new shards to a deployed index. Some people work around this by searching in all indexes, which is a different level of run-time merging that Katta supports.

          In general I think everybody agrees that merging indexes on the search servers is a bad idea, due to potentially high loads impacting end-user search performance. The most common approach I've seen is to use a general map-reduce job in Hadoop to generate N shards, controlled via the number of reducers, and then deploy this as a brand new Katta index.

          – Ken

          Jason Venner (www.prohadoop.com) added a comment -

          Anyone using this patch set?

          What shard sizes are people using for the reduce output that will be merged, and for the final shard used for searching?

          I am going to give it a try. I have one change I am going to hack in: allow an existing zip file to hold the Solr data, to avoid having to rebuild and push it out for each job run.

          Jason Rutherglen added a comment -

          Jv,

          I've used the patch. It works, though I'm sure there are changes and additions that can be made. Why would you want to put the data (i.e. the index files) into the zip?

          -J

          Jason Venner (www.prohadoop.com) added a comment -

          Currently you pass the directory of your Solr conf/lib to the job, which makes a zip file and loads it into HDFS. My mod would be to simply allow pointing at an existing config zip file in HDFS, to minimize the job start time.
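          A rough sketch of that shortcut, assuming the zip already sits in HDFS and the writer reads a pointer to it from the job configuration (the "solr.home.zip" property name below is made up):

            import org.apache.hadoop.filecache.DistributedCache;
            import org.apache.hadoop.fs.Path;
            import org.apache.hadoop.mapred.JobConf;

            class ConfigZipSetup {
              // Reuse a config zip that already lives in HDFS instead of re-zipping
              // and uploading the local solr.home on every job run.
              static void useExistingConfigZip(JobConf conf) {
                Path existingConfigZip = new Path("hdfs:///solr/conf/solr-home.zip");
                DistributedCache.addCacheArchive(existingConfigZip.toUri(), conf);
                conf.set("solr.home.zip", existingConfigZip.toString()); // hypothetical key read by SolrRecordWriter
              }
            }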

          Jason Venner (www.prohadoop.com) added a comment -

          I have used this at a decent scale, and will be adding a few patches to allow multiple tasks per machine to build indexes.

          The code currently uses the same directory in /tmp for the Solr config, and if multiple tasks are running, the directory may be removed by earlier tasks that finish.

          Jason Venner (www.prohadoop.com) added a comment -

          Updated patch with slightly clearer documentation; it is safe to use when more than one instance of the output format is running on a single machine.
          It also attempts to resolve the task timeouts during the index update and index optimize phases of index building.

          Jason Rutherglen added a comment -

          I think we can parallelize the indexing in SolrRecordWriter?
          Meaning multiple threads add docs to Solr? Most machines will
          have multiple cores and only one task is running per machine at
          a time?

          Jason Venner (www.prohadoop.com) added a comment -

          In my case, I have 6 tasks per machine, but only 4 disks, so I should
          actually throttle.

          It is an interesting trade-off question whether to make the writes run in
          the background or to just block the job.

          In operation on my cluster, the write time for each batch increases slowly
          from nothing to 20 minutes as the job runs, and the system buffer cache and
          the disk arms get saturated.

          I also was lazy and didn't feel like holding exceptions from a background
          writing thread and delivering them back to the calling thread on the next
          write or close call.
          There are some operational Hadoop API features that assume an error is
          associated with the (key, value) that is being output. On the flip
          side, the buffered batch is already hiding potential errors associated with a
          specific record...

          The case you are referring to would be the MultiThreadedMapper case, which is
          not widely used. The only change for that, which I should have done, is to make
          write and close thread safe, which they explicitly are not right now.

          Jason Rutherglen added a comment -

          Here's jv ning's patch as a regular patch file, to more easily see the changes. JV, can you add notes about what's changed and why?

          Jason Rutherglen added a comment -

          Should we add ThreadedIndexWriter (from Lucene in Action
          source http://www.manning.com/hatcher3/) type of functionality
          where we add documents in parallel using a thread pool? This
          could increase performance on multicore machines.

          Yonik Seeley added a comment -

          I don't know anything about ThreadedIndexWriter, but the SolrJ StreamingUpdateSolrServer uses multiple threads on the client side (and thus causes multiple threads to be used on the server side) to increase concurrency. It's quite speedy.

          Jason Venner (www.prohadoop.com) added a comment -

          Within a Map/Reduce task, there is usually a significant constraint on available ram, cpu and disk bandwidth.

          For my current use case, if we turn on the various background write threads, I will need to make these configurable to help me manage the task resource consumption.

          In the ideal world, the Map/Reduce framework, via cluster and job configuration, is taking care of running the tasks in parallel, and the tuning to optimize throughput is happening at that level.

          Jason Venner (www.prohadoop.com) added a comment -

          My notes on the patch update were in a README.txt that didn't make it into the .patch for some reason.
          Basically, the per task configuration uses the unpacked data that the framework creates, rather than creating a new configuration directory.
          All paths used are task specific, so that multiple tasks can run concurrently without colliding.
          The write method has a heartbeat thread so that the task does not get killed if the batch write takes more than 600 seconds.
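          The heartbeat idea amounts to a daemon thread that keeps reporting progress while a long batch write or optimize is in flight; a minimal sketch, assuming the old mapred Progressable API (class and method names are illustrative, not necessarily those in the patch):

            import org.apache.hadoop.util.Progressable;

            // Reports progress periodically so the TaskTracker does not kill the task
            // (default timeout: mapred.task.timeout, 600 seconds) while a long batch
            // write or optimize is running.
            class HeartBeater extends Thread {
              private final Progressable progress;
              private volatile boolean needed = false;
              private volatile boolean closed = false;

              HeartBeater(Progressable progress) {
                this.progress = progress;
                setDaemon(true);
              }

              void needHeartBeat()   { needed = true; }
              void cancelHeartBeat() { needed = false; }
              void shutdown()        { closed = true; interrupt(); }

              @Override
              public void run() {
                while (!closed) {
                  if (needed) {
                    progress.progress(); // tell the framework the task is still alive
                  }
                  try {
                    Thread.sleep(60000L); // well under the 600 second timeout
                  } catch (InterruptedException e) {
                    // loop re-checks the closed flag
                  }
                }
              }
            }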

          Jason Rutherglen added a comment -
          • I implemented a thread pool version, which eliminates the need
            for adding in batches (which I'm not sure was necessary?)
          • A numthreads property may be set
          • A maxqueuesize property may be set, which fulfills the same
            function as batching
          • It is still untested
          Jason Rutherglen added a comment -

          Yonik,

          It looks like StreamingUpdateSolrServer can't be used with EmbeddedSolrServer because of the requirement for a URL.

          Jason Rutherglen added a comment -

          In the ideal world, the Map/Reduce framework, via
          cluster and job configuration, is taking care of running the
          tasks in parallel, and the tuning to optimize throughput is
          happening at that level.

          True, Hadoop should probably manage calling
          SolrRecordWriter.write from multiple threads, or maybe it
          already does? In that case there wouldn't be a need for thread
          pooling or batching/queuing.

          Kris Jirapinyo added a comment -

          Because we are using MultipleOutputFormat, we can't have the data directory be the task_id: while the indexes were building, the directories were conflicting. This patch uses a random UUID instead as the data directory, so that if more than one shard is being created under a reducer, the directories will not conflict.

          Jason Venner (www.prohadoop.com) added a comment -

          I have an updated version that uses a sequence number, to ensure uniqueness
          in the multiple output format case.

          Same concept, shorter path.

          Jason Venner (www.prohadoop.com) added a comment -

          Updated SolrRecordWriter that uses a static AtomicLong to create a unique sequence number for each index instance created.
          This allows safe use of MultipleOutputFormat as well as the original patch which allowed multiple task instances per machine.

          The only potential issue is that the writer will block during index updates, which may cause the task to run slower than it could.
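          The naming scheme described above boils down to something like the following sketch (class and variable names are illustrative):

            import java.io.File;
            import java.util.concurrent.atomic.AtomicLong;

            class ShardPathFactory {
              // One counter per JVM: each writer instance gets a unique suffix, so several
              // writers in one task (MultipleOutputFormat) or several tasks on one machine
              // never collide on the same local working directory.
              private static final AtomicLong SEQUENCE = new AtomicLong(0);

              static File uniqueLocalDir(File localTmpDir, String taskId) {
                return new File(localTmpDir, "solr-" + taskId + "." + SEQUENCE.incrementAndGet());
              }
            }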

          Jason Venner (www.prohadoop.com) added a comment -

          Readme for the patch.

          Jason Rutherglen added a comment -

          Here's a new patch, with log jar dependencies.

          The heartbeat is improved, and there's a queuing mechanism that can result in faster execution times.
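          One plausible shape for such a queuing mechanism is a fixed thread pool fed by a bounded queue, where a full queue blocks the producing reduce thread and so plays the role the batch size used to play. A minimal sketch, using generic java.util.concurrent primitives (the class name and wiring are illustrative, not the patch's exact code):

            import java.util.concurrent.ArrayBlockingQueue;
            import java.util.concurrent.ThreadPoolExecutor;
            import java.util.concurrent.TimeUnit;

            import org.apache.solr.client.solrj.SolrServer;
            import org.apache.solr.common.SolrInputDocument;

            class QueuedIndexer {
              private final SolrServer solr;
              private final ThreadPoolExecutor pool;

              QueuedIndexer(SolrServer solr, int numThreads, int maxQueueSize) {
                this.solr = solr;
                // CallerRunsPolicy makes a full queue throttle the reducer thread itself,
                // which provides the same back-pressure that batching gave.
                this.pool = new ThreadPoolExecutor(
                    numThreads, numThreads, 0L, TimeUnit.MILLISECONDS,
                    new ArrayBlockingQueue<Runnable>(maxQueueSize),
                    new ThreadPoolExecutor.CallerRunsPolicy());
              }

              void add(final SolrInputDocument doc) {
                pool.execute(new Runnable() {
                  public void run() {
                    try {
                      solr.add(doc);
                    } catch (Exception e) {
                      throw new RuntimeException(e); // a real writer would collect and rethrow on close
                    }
                  }
                });
              }

              void close() throws InterruptedException {
                pool.shutdown();
                pool.awaitTermination(Long.MAX_VALUE, TimeUnit.SECONDS);
              }
            }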

          Jason Rutherglen added a comment -

          An additional required library for the latest patch.

          Jason Rutherglen added a comment -

          We need to include the schema.xml in the shard stored in HDFS, as otherwise we could get confused about which schema the index was built with.

          We don't need to include solrconfig.xml in HDFS, because it only defines parameters related to how the index is built.

          Jason Venner (www.prohadoop.com) added a comment -

          I need to update this patch; there is an error in the close method that
          causes timeouts to occur.

          Basically, the two lines in the close method need to be swapped, as shown
          below. There is still the possibility of close happening too early in rare
          use cases, which is why I haven't updated the patch yet.
          Basically, close can't proceed until the thread pool is done AND
          batchWriter.executingBatches.get() == 0.

          From:
          batchWriter.close(reporter, core);
          heartBeater.needHeartBeat();

          To:
          heartBeater.needHeartBeat();
          batchWriter.close(reporter, core);
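          Put together, the close path described here looks roughly like the sketch below; batchWriter, executingBatches, core and the heartbeater are the patch's own members as discussed above, while the pool handling and exact waits are illustrative:

            // Fields (heartBeater, indexerPool, batchWriter, core) belong to the
            // surrounding SolrRecordWriter; this is only a sketch of the ordering.
            public void close(Reporter reporter) throws IOException {
              heartBeater.needHeartBeat();            // keep progress flowing during the flush
              try {
                indexerPool.shutdown();
                while (!indexerPool.awaitTermination(1, TimeUnit.SECONDS)) {
                  // wait for all queued document adds to finish
                }
                while (batchWriter.executingBatches.get() > 0) {
                  Thread.sleep(100);                  // wait for in-flight batches to drain
                }
                batchWriter.close(reporter, core);    // commit/optimize, then package the shard
              } catch (InterruptedException e) {
                throw new IOException(e);
              } finally {
                heartBeater.cancelHeartBeat();
              }
            }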

          Jason Rutherglen added a comment -

          Thanks for the update Jason. It runs great, I've generated over a terabyte of indexes using the patch. Now I'm trying to deploy them, and that's harder!

          Jason Rutherglen added a comment -

          Here's an update that includes the change Jason mentioned above
          (needHeartBeat in SRW.close). I've run this patch in production;
          however, I was unable to turn off logging due to complexities
          with SLF4J layering over Hadoop's logging, where I could not
          turn off the Solr update logs. I had to comment out the logging
          lines in Solr to ensure the Hadoop logs did not fill up.

          Grant Ingersoll added a comment -

          Seems like this would make the most sense as a contrib module to Solr.

          Grant Ingersoll added a comment -

          Furthermore, by using an EmbeddedSolrServer, the indexing task is split into as many parts as there are reducers, and the data to be indexed is not sent over the network.

          I'm curious about not sending data over the network. Have you tried the Streaming Server or even just the regular one? How would this work for someone who already has a separate Solr cluster set up?

          Also, I haven't looked closely at the patch, but if I understand correctly, it is writing out the indexes to the local disks on the Hadoop cluster?

          Andrzej Bialecki added a comment -

          I'm curious about not sending data over the network. Have you tried the Streaming Server or even just the regular one?

          Hmm, I don't think this would make sense - the whole point of this patch is to distribute the load by indexing into multiple Solr instances that use the same config - and this can be an existing user's config, including the components from ${solr.home}/lib.

          How would this work with someone who already has a separate Solr cluster setup?

          It wouldn't - partly because there is no canonical Solr cluster setup against which to code this ... Would that be the same cluster (1:1 mapping) as the Hadoop cluster?

          Also, I haven't looked closely at the patch, but if I understand correctly, it is writing out the indexes to the local disks on the Hadoop cluster?

          HDFS doesn't support enough POSIX to support writing Lucene indexes directly to HDFS - for this reason indexes are always created on local storage of each node, and then after closing they are copied to HDFS.
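          The copy-out step itself only needs the standard FileSystem API; a minimal sketch, with paths and names made up for illustration:

            import java.io.IOException;

            import org.apache.hadoop.conf.Configuration;
            import org.apache.hadoop.fs.FileSystem;
            import org.apache.hadoop.fs.Path;

            class ShardCopier {
              // Build the index on local disk, then push the finished shard to the job's
              // output directory on HDFS once the core has been closed and optimized.
              static void copyShardToHdfs(Configuration conf, String localIndexDir, String hdfsShardDir)
                  throws IOException {
                FileSystem fs = FileSystem.get(conf);
                fs.copyFromLocalFile(false /* keep local copy */, true /* overwrite */,
                    new Path("file://" + localIndexDir), new Path(hdfsShardDir));
              }
            }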

          Grant Ingersoll added a comment -

          Hmm, I don't think this would make sense - the whole point of this patch is to distribute the load by indexing into multiple Solr instances that use the same config - and this can be an existing user's config including the components from ${solr.home}/lib .

          Obviously, you would need to have a configuration of Solr indexing servers. This could easily be obtained from ZK per the other work being done. Then, the reduce steps just create their SolrServer based on those values and can index directly. (I realize the ZK stuff didn't exist when you put up this patch.)

          HDFS doesn't support enough POSIX to support writing Lucene indexes directly to HDFS - for this reason indexes are always created on local storage of each node, and then after closing they are copied to HDFS.

          Right, and then copied down from HDFS and installed in Solr, correct? You still have the issue of knowing which Solr instances get which shards off of HDFS, right? Just seems like a little more configuration knowledge could alleviate all that extra copying/installing, etc.

          Jason Rutherglen added a comment -

          Andrzej's model works great in production. We have both 1)
          master -> slave for incremental updates, and 2) indexing in Hadoop
          with this patch, where we then deploy each new core/shard in a
          balanced fashion to many servers. They're two separate
          modalities. The ZK stuff (as it's modeled today) isn't useful
          here, because I want the schema I indexed with as part of the
          zip file stored in HDFS (or S3, or wherever).

          Any sort of ZK thingy is good for managing the cores/shards
          across many servers, however Katta does this already (so we're
          reinventing the same thing - not necessarily a bad thing if we
          also have a clear path for incremental indexing, as discussed
          above). Ultimately, the Solr server can be viewed as simply a
          container for cores, and the cloud + ZK branch as a manager of
          cores/shards. Anything more ambitious will probably be overkill,
          and this is what I believe Ted has been trying to get at.

          Andrzej Bialecki added a comment -

          If we could somehow get a mapping from a mapred task on node X to a particular target Solr server (beyond the two obvious choices, i.e. a single URL for one Solr, or localhost for per-node Solrs), then sure, why not. And you are right that we wouldn't use the embedded Solr in that case. But this patch solves a different problem, and it solves it within the facilities of the current config.

          Right, and then copied down from HDFS and installed in Solr, correct? You still have the issue of knowing which Solr instances get which shards off of HDFS, right? Just seems like a little more configuration knowledge could alleviate all that extra copying/installing, etc.

          Yes. But that would be a completely different scenario - we could wrap it in a Hadoop OutputFormat as well, but the implementation would be totally different from this patch.

          Grant Ingersoll added a comment -

          Don't confuse the ZK stuff for search with the indexing side. Using ZK was just an example of a way to get the list of Solr indexing nodes. What I meant was that the Hadoop job could simply know what the set of master indexers is and send the documents directly to them. Then the slaves simply pull the replications from there. It all works with existing capabilities, instead of needing scripts, etc. to pull shards down.

          Jason Rutherglen added a comment -

          What I meant was the Hadoop job could simply know what
          the set of master indexers are and send the documents directly
          to them

          One can use Hadoop for this purpose; we have implemented the
          system this way for the incremental indexes, however it
          doesn't require a separate patch or contrib module. The problem
          with the Hadoop streaming model is that it doesn't scale well
          if, for example, we need to reindex using the CJKAnalyzer, or
          Basis' analyzer, etc. We use SOLR-1301 for reindexing loads of
          data as fast as possible by parallelizing the indexing. There
          are lots of little things I'd like to add to the functionality;
          implementing ZK-based core management takes a higher priority,
          though, as I spend a lot of time doing this manually today.

          Grant Ingersoll added a comment -

          I don't follow how sending docs to a suite of master indexers prevents incremental (re)indexing or any of the analyzers. Those are all on the Solr side, not Hadoop. BTW, I'm not talking about "Hadoop Streaming", just the notion of Hadoop streaming the output of the reduce tasks to the Solr indexing servers.

          Jason Rutherglen added a comment -

          Hadoop streaming the output of the reduce tasks to the Solr
          indexing servers.

          Yes, this is what we've implemented; it's just normal Solr
          HTTP-based indexing, right? It works well to a limited degree,
          though for particular implementation reasons it can be less than
          ideal. The balanced, distributed shards/cores system works far
          better and enables us to use less hardware (but I'm not going
          into all the details here).

          One issue I can mention is the switchover to a new set of
          incremental servers (which happens when the old servers fill
          up); I'm looking to automate this, and will likely focus on it
          and the core management in the cloud branch.

          Jason Rutherglen added a comment -

          I started on the Solr wiki page for this guy...

          http://wiki.apache.org/solr/HadoopIndexing

          Kevin Peterson added a comment -

          As written, the SolrRecordWriter constructor (line 942 in the patch) constructs a path relative to the local directories, which is not automatically cleaned up by Hadoop. If this is changed to something like

          temp = new Path(job.getWorkingDirectory(), "solr/_" + job.get("mapred.task.id") + '.' + sequence.incrementAndGet() + '.' + perm.getName());

          this will be in the task's working directory and automatically cleaned up on exit. I've checked with #hadoop and the consensus seems to be that this will be cleaned up unless the TT dies.

          Ted Dunning added a comment -

          It is critical to put indexes in the task-local area on both local and HDFS storage, not just because of task cleanup but also because a task may be run more than once. Hadoop handles all the race conditions that would otherwise result.

          Jason Rutherglen added a comment -

          This update includes Kevin's recommended path change.

          Jason Rutherglen added a comment -

          There's a bug caused by the latest change:

          java.io.IOException: java.lang.IllegalArgumentException: Wrong FS: hdfs://mi-prod-app01.ec2.biz360.com:9000/user/hadoop/solr/_attempt_201001212110_2841_r_000001_0.1.index-a, expected: file:///
          at org.apache.solr.hadoop.SolrRecordWriter.close(SolrRecordWriter.java:371)
          at com.biz360.mi.index.hadoop.HadoopIndexer$ArticleReducer.reduce(HadoopIndexer.java:147)
          at com.biz360.mi.index.hadoop.HadoopIndexer$ArticleReducer.reduce(HadoopIndexer.java:103)
          at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:463)
          at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
          at org.apache.hadoop.mapred.Child.main(Child.java:170)
          Caused by: java.lang.IllegalArgumentException: Wrong FS: hdfs://mi-prod-app01.ec2.biz360.com:9000/user/hadoop/solr/_attempt_201001212110_2841_r_000001_0.1.index-a, expected: file:///
          at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:305)
          at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:47)
          at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:357)
          at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:245)
          at org.apache.solr.hadoop.SolrRecordWriter.zipDirectory(SolrRecordWriter.java:459)
          at org.apache.solr.hadoop.SolrRecordWriter.packZipFile(SolrRecordWriter.java:390)
          at org.apache.solr.hadoop.SolrRecordWriter.close(SolrRecordWriter.java:362)
          ... 5 more

          Karthik K added a comment -

          Did the latest patch involve an upgrade of the HDFS / patched HDFS that is running? If there was a change, was the fs migration script run to upgrade the file system being referenced?

          Also, what version of HDFS is being used? Would that be 0.19.1?

          Hadoop 0.20.2 is on the roadmap for release very soon. Of particular interest in that release would be HDFS-127, to recover (gracefully) from failed reads.

          Kevin Peterson added a comment -

          I pointed you in the wrong direction. It isn't getWorkingDirectory. I'm trying to find the standard way to get to ${mapred.local.dir}/taskTracker/jobcache/$jobid/$taskid or construct a Path using the current working directory, but I'm having trouble making sense of which directories refer to local and which to HDFS.

          Jason Rutherglen added a comment -

          I'm testing deleting the temp dir in SRW.close's finally clause...

          Jason Rutherglen added a comment -

          I added the following to the SRW.close method's finally clause:

          FileUtils.forceDelete(new File(temp.toString()));
          
          shyjuThomas added a comment -

          I need to perform Solr indexing in a MapReduce task, to achieve parallelism. I have noticed 2 Jira issues related to that: SOLR-1045 & SOLR-1301.

          I have tried out the patches available with both issues, and my observations are given below:
          1. The SOLR-1301 patch performs input-record to key-value conversion in the Map phase; the Hadoop (key, value) to SolrInputDocument conversion and the actual indexing happen in the Reduce phase.
          Meanwhile, the SOLR-1045 patch performs the record-to-document conversion and the actual indexing in the Map phase; the user can make use of the Reducer to perform merging of multiple indices (if required), or alternatively configure the number of reducers to equal the number of shards.
          2. The SOLR-1301 patch doesn't support merging of the indices, while the SOLR-1045 patch does.
          3. With the SOLR-1301 patch, no big activity happens in the Map phase (only input-record to key-value conversion). Most of the heavy work (especially the indexing) happens in the Reduce phase. If we need the final output as a single index, we can use only one reducer, which means a bottleneck at the Reducer and almost the whole operation running non-parallel.
          But the case is different with the SOLR-1045 patch: it achieves better parallelism when the number of map tasks is greater than the number of reduce tasks, which is usually the case.

          Based on these observations, I have a few questions. (I am a beginner to the Hadoop & Solr world, so please forgive me if my questions are silly):
          1. As per the above observations, the SOLR-1045 patch is functionally better (performance I have not verified yet). Can anyone tell me what actual advantage the SOLR-1301 patch offers over the SOLR-1045 patch?
          2. If both Jira issues are trying to solve the same problem, do we really need 2 separate issues?

          NOTE: I felt this Jira issue is more active than SOLR-1045. That's why I posted my comment here.

          Ted Dunning added a comment -

          Based on these observations, I have a few questions. (I am a beginner in the Hadoop and Solr world, so please forgive me if my questions are silly):
          1. Per the above observations, the SOLR-1045 patch is functionally better (I have not verified performance yet). Can anyone tell me what actual advantage the SOLR-1301 patch offers over the SOLR-1045 patch?
          2. If both JIRA issues are trying to solve the same problem, do we really need two separate issues?

          In the katta community, the recommended practice started with the SOLR-1045 behavior (what I call map-side indexing), but I think the consensus now is that the SOLR-1301 behavior (what I call reduce-side indexing) is much, much better. This is not necessarily the obvious result given your observations. There are some operational differences between katta and Solr that might make the conclusions different, but what I have observed is the following:

          a) index merging is a really bad idea that seems very attractive to begin with, because it is actually pretty expensive and doesn't solve the real problem of bad document distribution across shards. It is much better to simply have lots of shards per machine (aka micro-sharding) and use one reducer per shard. For large indexes, this gives entirely acceptable performance. On a pretty small cluster, we can index 50-100 million large documents in multiple ways in 2-3 hours. Index merging gives you no benefit compared to reduce-side indexing and just increases code complexity.

          b) map-side indexing leaves you with indexes that are heavily skewed by being composed of documents from a single input split. At retrieval time, this means that different shards have very different term frequency profiles and very different numbers of relevant documents. This makes lots of statistics very difficult, including term frequency computation, term weighting, and determining the number of documents to retrieve. Map-side indexing virtually guarantees that you have to do two cluster queries, one to gather term frequency statistics and another to do the actual query. With reduce-side indexing, you can provide strong probabilistic bounds on how different the statistics in each shard can be, so you can use local term statistics, and you can depend on the score distribution being the same, which radically decreases the number of documents you need to retrieve from each shard.

          c) reduce-side indexing improves the balance of computation during retrieval. If (as is the rule) some document subsets are hotter than others due, say, to data-source boosting or recency boosting, you will have very bad cluster utilization with the skewed shards from map-side indexing, while with reduce-side indexing all shards will cost about the same for any query, leading to good cluster utilization and faster queries.

          d) reduce-side indexing has properties that can be mathematically stated and proved. Map-side indexing only has comparable properties if you make unrealistic assumptions about your original data.

          e) micro-sharding allows very simple and very effective use of multiple cores on multiple machines in a search cluster. This can be very difficult to do with large shards or a single index.

          Now, as you say, these advantages may evaporate if you are looking to produce a single output index. That seems, however, to contradict the whole point of scaling. If you need to scale indexing, presumably you also need to scale search speed and throughput. As such you probably want to have many shards rather than few. Conversely, if you can stand to search a single index, then you probably can stand to index on a single machine.

          Another thing to think about is the fact that Solr doesn't yet do micro-sharding or clustering very well and, in particular, doesn't handle multiple shards per core. That will be changing before long, however, and it is very dangerous to design for the past rather than the future.

          In case you didn't notice, I strongly suggest you stick with reduce-side indexing.
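
          A minimal sketch of how reduce-side indexing typically routes documents, assuming one reducer per micro-shard and a Text document id; the class and names below are illustrative only, not part of either patch:

          import org.apache.hadoop.io.Text;
          import org.apache.hadoop.mapreduce.Partitioner;

          // Route each document to a shard by hashing its unique key, so every
          // reducer (= one micro-shard) receives a statistically similar mix of
          // documents, which is what keeps per-shard term statistics comparable.
          public class ShardByDocIdPartitioner extends Partitioner<Text, Text> {
            @Override
            public int getPartition(Text docId, Text value, int numShards) {
              // Mask with Integer.MAX_VALUE to keep the result non-negative.
              return (docId.hashCode() & Integer.MAX_VALUE) % numShards;
            }
          }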

          Jason Rutherglen added a comment -

          In production the latest patch does not leave temporary files behind... though we previously had failed tasks, so perhaps there's still a bug; we won't know until we run out of disk space again.

          Jason Rutherglen added a comment -

          There still seems to be a bug where the temporary directory index isn't deleted on job completion.

          Matt Revelle added a comment -

          Hi Jason, I noticed a few problems with the latest patch and am working on fixes. At SolrRecordWriter:L373, FileUtils.forceDelete is called, but prior to this the directory to delete may have been moved. The move happens when the output isn't a zip file, at line 364, with a call to FileSystem#completeLocalOutput.

          Less important: the ls process, which runs when logging is set to the debug level (SolrRecordWriter:L278), appears to not always exit properly and throws an exception.

          Jason Rutherglen added a comment -

          Matt, interesting. I'm most concerned about the leftover files, which are still an issue. In production I use a script that deletes the leftovers, which isn't ideal but works. I'm not sure if the SolrRecordWriter:L373 bug is related to that; I'm always zipping the indexes into HDFS.

          Alexander Kanarsky added a comment -

          This is the version of the patch rewritten to use the new mapreduce API in Hadoop 0.20. I did a quick port of the patch from 2010-02-02 11:57 AM without any optimizations, just to make it work with the new syntax. There are some slight changes around local filesystem temp file name generation, etc. The CSVReducer class was added just to pass a proper context for using counters in BatchWriter; if you know a better way to do this, please let me know. Tested with Hadoop 0.20.2 on CSV data, with both compressed and non-compressed output; it seems to be OK, but no extensive regression testing was performed. Code review and suggestions/corrections are welcome.

          Matt Revelle added a comment -

          Updated the latest patch to include a check for the temp file before calling FileUtils.forceDelete.
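
          A minimal, self-contained sketch of the kind of guard described above (the helper class and method names are illustrative, not the actual patch code): only delete the local temp directory if it still exists, since FileSystem#completeLocalOutput may already have moved it.

          import java.io.File;
          import java.io.IOException;
          import org.apache.commons.io.FileUtils;

          class TempDirCleanup {
            // Only delete the local temp directory if it is still present, because
            // FileSystem#completeLocalOutput may already have moved it away.
            static void deleteIfPresent(File localTempDir) throws IOException {
              if (localTempDir.exists()) {
                FileUtils.forceDelete(localTempDir);
              }
            }
          }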

          Jason Rutherglen added a comment -

          Matt, can you post a patch including the contrib directory structure (and build.xml)?

          Jason Rutherglen added a comment -

          Matt, never mind, I'm just using the patch as is (i.e., as a part of Solr core).

          Matt Revelle added a comment -

          Jason, Ok. =)

          Viktors Rotanovs added a comment -

          It looks like when the converter returns only one document, which is the most common case, the number of batches will be equal to the number of documents. In an earlier version of this patch, documents were accumulated and then sent as a batch, and this is what a comment in SolrRecordWriter still says.

          Matt Revelle added a comment -

          Viktors: That must have been a regression from Alexander's patch to support the newer Hadoop API. I may have a chance to investigate today. If anyone else takes it on, please leave a comment.

          Hoss Man added a comment -

          Bulk updating 240 Solr issues to set the Fix Version to "next" per the process outlined in this email...

          http://mail-archives.apache.org/mod_mbox/lucene-dev/201005.mbox/%3Calpine.DEB.1.10.1005251052040.24672@radix.cryptio.net%3E

          The selection criteria were "Unresolved" with a Fix Version of 1.5, 1.6, 3.1, or 4.0. Email notifications were suppressed.

          A unique token for finding these 240 issues in the future: hossversioncleanup20100527

          Otis Gospodnetic added a comment -

          I see comments and patches adding support for newer versions of Hadoop. But has anyone used these patches with Elastic Map Reduce (EMR) on EC2?

          Koji Sekiguchi added a comment -

          We are using this patch (Andrzej's version + custom code) for several of our projects. It works great, but sometimes we get OOM errors when indexing new input data that includes unexpectedly large records. I think it would be worthwhile if SolrRecordWriter could have a bufferSizeMB setting (like IndexWriter) to flush buffered docs, rather than working on a batchSize basis. Thoughts?
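
          A rough, self-contained sketch of what size-based flushing could look like - the class, the bufferSizeMB parameter, and the caller-supplied size estimate are assumptions for illustration, not existing patch code:

          import java.util.ArrayList;
          import java.util.List;
          import org.apache.solr.client.solrj.SolrServer;
          import org.apache.solr.common.SolrInputDocument;

          class SizeBasedBatcher {
            private final SolrServer solr;
            private final long bufferSizeBytes;
            private final List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
            private long bufferedBytes = 0;

            SizeBasedBatcher(SolrServer solr, int bufferSizeMB) {
              this.solr = solr;
              this.bufferSizeBytes = bufferSizeMB * 1024L * 1024L;
            }

            // The caller passes an estimated in-memory size for the document; the batch
            // is flushed to the (embedded) server once the estimate exceeds the budget.
            void add(SolrInputDocument doc, long estimatedSizeInBytes) throws Exception {
              batch.add(doc);
              bufferedBytes += estimatedSizeInBytes;
              if (bufferedBytes >= bufferSizeBytes) {
                solr.add(batch);
                batch.clear();
                bufferedBytes = 0;
              }
            }
          }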

          Alexander Kanarsky added a comment -

          Matt, I think Viktors mentioned the original batching logic eliminated by Jason (see his comment from 10/Sep/09 02:46 PM)

          Jason Rutherglen added a comment -

          Matt, I think Viktors mentioned the original batching logic eliminated by Jason (see his comment from 10/Sep/09 02:46 PM)

          Right, I don't think we need batching because no efficiency will be gained (i.e., there's no network overhead being eliminated).

          Mathias Walter added a comment -

          I tried this patch with Hadoop 0.20.2. It works pretty well, except when speculative execution is enabled (at least for the reducer). In that case, some tasks run twice. The first task attempt creates the zip file. The second tries to do so as well and fails. Unfortunately, the first attempt also fails. I've added an fs.exists(perm) check to the SolrRecordWriter.packZipFile method, but the first attempt still fails with the following exception, right after the last write and at nearly the same time the other attempt tests for the existence of the zip file:

          2010-08-06 15:35:33,883 INFO com.excerbt.mapreduce.solrindexing.SolrRecordWriter: RawPath /hadoop/hdfs5/tmp/solr_attempt_201007231114_0068_r_000001_0.1/data/index/_4.frq, baseName part-00001, root /hadoop/hdfs5/tmp/solr_attempt_201007231114_0068_r_000001_0.1, inZip 2 part-00001/data/index/_4.frq
          2010-08-06 15:35:36,164 ERROR com.excerbt.mapreduce.solrindexing.SolrRecordWriter: packZipFile exception {}
          java.io.IOException: Filesystem closed
          at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:230)
          at org.apache.hadoop.hdfs.DFSClient.access$600(DFSClient.java:65)
          at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.writeChunk(DFSClient.java:3058)
          at org.apache.hadoop.fs.FSOutputSummer.writeChecksumChunk(FSOutputSummer.java:150)
          at org.apache.hadoop.fs.FSOutputSummer.write1(FSOutputSummer.java:100)
          at org.apache.hadoop.fs.FSOutputSummer.write(FSOutputSummer.java:86)
          at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:49)
          at java.io.DataOutputStream.write(DataOutputStream.java:90)
          at java.util.zip.DeflaterOutputStream.deflate(DeflaterOutputStream.java:161)
          at java.util.zip.DeflaterOutputStream.write(DeflaterOutputStream.java:118)
          at java.util.zip.ZipOutputStream.write(ZipOutputStream.java:272)
          at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:51)
          at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:87)
          at com.excerbt.mapreduce.solrindexing.SolrRecordWriter.zipDirectory(SolrRecordWriter.java:493)
          at com.excerbt.mapreduce.solrindexing.SolrRecordWriter.zipDirectory(SolrRecordWriter.java:469)
          at com.excerbt.mapreduce.solrindexing.SolrRecordWriter.zipDirectory(SolrRecordWriter.java:469)
          at com.excerbt.mapreduce.solrindexing.SolrRecordWriter.zipDirectory(SolrRecordWriter.java:469)
          at com.excerbt.mapreduce.solrindexing.SolrRecordWriter.packZipFile(SolrRecordWriter.java:385)
          at com.excerbt.mapreduce.solrindexing.SolrRecordWriter.close(SolrRecordWriter.java:349)
          at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:567)
          at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:408)
          at org.apache.hadoop.mapred.Child.main(Child.java:170)

          That's really strange. I checked this for many jobs. Also, the incomplete zip file is not removed after this exception.

          Alexander Kanarsky added a comment -

          Mathias, I did not test the 0.20 patch with speculative execution for the reducers, but it is probably failing because the other task attempt deletes the perm file (see the SolrRecordWriter constructor: there is a cleanup fs.delete(perm, true) call after the perm Path is constructed).

          If you really need speculative execution for the reducers, you could try to use the Reducer context to construct the perm file using getWorkOutputPath instead of getOutputPath() (in that case, if a particular attempt is successful, its perm file should be promoted to the task work dir automatically - see the explanation of that side effect here: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapreduce/lib/output/FileOutputFormat.html#getWorkOutputPath%28org.apache.hadoop.mapreduce.TaskInputOutputContext%29),

          i.e. instead of

          perm = new Path(FileOutputFormat.getOutputPath(context), getOutFileName(context, "part"));

          try to use something like this:

          Reducer.Context rContext = contextMap.get(context.getTaskAttemptID().getTaskID());
          perm = new Path(FileOutputFormat.getWorkOutputPath(rContext), getOutFileName(context, "part"));

          Daniel Ivan Pizarro added a comment -

          I'm getting the following error:

          java.lang.IllegalStateException: Failed to initialize record writer for , attempt_local_0001_r_000000_0

          Where can I find instructions to run the CSV uploader?

          (The readme file says "Please read the original patch readme for details on the CSV bulk uploader.", and I can't find that readme file.)

          Grant Ingersoll added a comment -

          I think this should be a contrib module. Alexander, would you be willing to update it to trunk and make it a Solr contrib?

          Alexander Kanarsky added a comment -

          Grant, sure. Will do this in the next couple of days.

          Alexander Kanarsky added a comment - - edited

          The latest 0.20 patch is repackaged to be placed under contrib, as it was initially (build.xml is included), and tested against the current trunk. As usual, after applying the patch, put the 4 lib jars (hadoop, log4j, and the two commons-logging jars) into contrib/hadoop/lib. No unit tests as of now, but I hope to add some soon. Here is the big question: as Andrzej once mentioned, the unit tests require a running Hadoop cluster. One approach is to make the patch and unit tests work with the Hadoop mini-cluster (ClusterMapReduceTestCase); however, this will bring in some extra dependencies needed to run the cluster (like Jetty). Another idea is to use "your own" cluster and just configure access to it in the unit tests; this approach seems logical but potentially may give different test results on different clusters, and also may not give some low-level access to the execution needed for tests. So what is your opinion on how the tests for solr-hadoop should be run? I am not really happy with the idea of starting and running a Hadoop cluster while performing the Solr unit tests, but this still could be a better option than no unit tests at all.
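
          For reference, a rough sketch of what a mini-cluster based test could look like - assuming ClusterMapReduceTestCase from the Hadoop 0.20 test jar is on the classpath; the test class, paths, and the steps in the comments are illustrative only:

          import org.apache.hadoop.fs.Path;
          import org.apache.hadoop.mapred.ClusterMapReduceTestCase;
          import org.apache.hadoop.mapred.JobConf;

          public class SolrOutputFormatMiniClusterTest extends ClusterMapReduceTestCase {
            public void testIndexesSingleShard() throws Exception {
              // setUp() in the base class starts a mini DFS + MR cluster.
              JobConf conf = createJobConf();          // pre-configured for the mini cluster
              Path input = new Path("input");
              Path output = new Path("output");
              getFileSystem().mkdirs(input);
              // ...copy a small CSV file into 'input', configure SolrOutputFormat and a
              // single reducer on 'conf', submit the job, then assert that 'output'
              // contains a part-00000 shard with the expected documents.
            }
          }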

          Jason Rutherglen added a comment -

          Alexander,

          I think we'll need to use Hadoop's Mini Cluster in order to have a proper unit test. Adding Jetty as a dependency shouldn't be too much of a problem as Solr already includes a small version of Jetty? That being said, it doesn't mean it's fun to write the unit test. I can assist if needed.

          Dhruv Bansal added a comment -

          I am unable to compile SOLR 1.4.1 after patching with the latest (2010-09-20 04:40 AM) SOLR-1301.patch.

          $ wget http://mirror.cloudera.com/apache//lucene/solr/1.4.1/apache-solr-1.4.1.tgz
          ...
          $ tar -xzf apache-solr-1.4.1.tgz
          $ cd apache-solr-1.4.1/contrib
          apache-solr-1.4.1/contrib$ wget https://issues.apache.org/jira/secure/attachment/12455023/SOLR-1301.patch
          apache-solr-1.4.1/contrib$ patch -p2 -i SOLR-1301.patch
          ...
          apache-solr-1.4.1/contrib$ mkdir lib
          apache-solr-1.4.1/contrib$ cd lib
          apache-solr-1.4.1/contrib/lib$ wget .. # download hadoop, log4j, commons-logging, commons-logging-api jars from top of this page
          ...
          apache-solr-1.4.1/contrib/lib$ cd ../..
          apache-solr-1.4.1$ ant dist -k
          
          ...
          
          compile:
              [javac] Compiling 9 source files to /home/dhruv/projects/infochimps/search/apache-solr-1.4.1/contrib/hadoop/build/classes
          Target 'compile' failed with message 'The following error occurred while executing this line:
          /home/dhruv/projects/infochimps/search/apache-solr-1.4.1/common-build.xml:159: Reference lucene.classpath not found.'.
          Cannot execute 'build' - 'compile' failed or was not executed.
          Cannot execute 'dist' - 'build' failed or was not executed.
             [subant] File '/home/dhruv/projects/infochimps/search/apache-solr-1.4.1/contrib/hadoop/build.xml' failed with message 'The following error occurred whil\
          e executing this line:
             [subant] /home/dhruv/projects/infochimps/search/apache-solr-1.4.1/contrib/hadoop/build.xml:65: The following error occurred while executing this line:
             [subant] /home/dhruv/projects/infochimps/search/apache-solr-1.4.1/common-build.xml:159: Reference lucene.classpath not found.'.
          
          ....
          

          Am I following the procedure properly? I'm able to build SOLR just fine out of the box as well as after applying SOLR-1395.

          Alexander Kanarsky added a comment -

          Dhruv, thank you, I overlooked this reference. To fix the issue, please go to the build.xml in the contrib/hadoop folder and delete the line "<path refid="lucene.classpath"/>" - or just download the new version of the patch (attached). You followed the procedure properly, except that the hadoop and logging jars are supposed to go into contrib/hadoop/lib, not contrib/lib.

          Alexander Kanarsky added a comment - - edited

          Note for Hadoop 0.21 users: the current patch can be used "as is" with 0.21, but you will need to make sure to compile it with the appropriate jars (hadoop-common-0.21.0.jar and hadoop-mapred-0.21.0.jar instead of hadoop-0.20.x-core.jar). Also, as a workaround, I had to put all the relevant jars (solr, solrj, etc.) into the lib folder of the job's jar file (i.e. apache-solr-hadoop-xxx-dev.jar) to avoid InvocationTargetException/ClassNotFound exceptions I did not have with Hadoop 0.20.

          Lance Norskog added a comment -

          Hadoop has something called MRUnit, a unit test framework. Is it possible to use that for this purpose?

          Robert Muir added a comment -

          Bulk move 3.2 -> 3.3

          Mark Johnson added a comment -

          It appears that this issue has fallen by the wayside. Is there still a plan to roll this into contrib? Why has all activity stopped on it?

          Mark Johnson added a comment -

          Also, does anyone have the JSON converter listed in the readme?

          Alexander Kanarsky added a comment -

          Mark, I planned to add some unit tests and the packaging for Hadoop 0.21.x, but unfortunately I had no time for this. The problem with unit tests is that you need either to use your own external Hadoop cluster or to run a mini-cluster, and neither way works well for a Solr contrib module in my opinion. I tried the MRUnit approach a while ago with 0.20.x, without success. Maybe I will get back to this and try again with 0.21, but I do not anticipate that until mid-September.

          Robert Muir added a comment -

          3.4 -> 3.5

          Viktors Rotanovs added a comment -

          Beware: with the ZIP option enabled, this patch probably has a 4 GiB limit on entries inside the zip file, because of a file format limitation. I was able to generate a 6.8 GB zip file with this patch, but unzip -t fails when encountering a >10 GB file inside it:

          $ unzip -t /tmp/test.zip
          Archive: /tmp/test.zip
          warning [/tmp/test.zip]: 4294967296 extra bytes at beginning or within zipfile
          (attempting to process anyway)
          file #1: bad zipfile offset (local header sig): 4294967296
          (attempting to re-compensate)
          testing: part-00000/ OK
          testing: part-00000/conf/ OK
          testing: part-00000/conf/schema.xml OK
          testing: part-00000/data/ OK
          testing: part-00000/data/spellchecker/ OK
          testing: part-00000/data/spellchecker/segments_1 OK
          testing: part-00000/data/spellchecker/segments.gen OK
          testing: part-00000/data/index/ OK
          testing: part-00000/data/index/_10i.nrm OK
          testing: part-00000/data/index/_10i.tii OK
          testing: part-00000/data/index/_10i.tis OK
          testing: part-00000/data/index/_10i.fnm OK
          testing: part-00000/data/index/segments_2 OK
          testing: part-00000/data/index/_10i.fdx OK
          testing: part-00000/data/index/_10i.prx OK
          testing: part-00000/data/index/_10i.fdt
          error: invalid compressed data to inflate
          file #17: bad zipfile offset (local header sig): 1528156471
          (attempting to re-compensate)
          testing: part-00000/data/index/_10i.frq

          Alexander Kanarsky added a comment -

          Viktors, can you increase the number of reducers to avoid big output files? As for the zip format, Java 7 seems to support Zip64 extensions; alternatively, we could add an option to generate jtar'ed output or something similar.
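
          If switching the archiving code is on the table, one possible direction (an illustrative sketch under that assumption, not part of the patch) is Apache Commons Compress, whose ZipArchiveOutputStream can write Zip64 records for entries larger than 4 GiB:

          import java.io.File;
          import java.io.FileInputStream;
          import java.io.IOException;
          import org.apache.commons.compress.archivers.zip.Zip64Mode;
          import org.apache.commons.compress.archivers.zip.ZipArchiveEntry;
          import org.apache.commons.compress.archivers.zip.ZipArchiveOutputStream;
          import org.apache.commons.compress.utils.IOUtils;

          public class Zip64Sketch {
            public static void zipSingleFile(File src, File dest) throws IOException {
              ZipArchiveOutputStream out = new ZipArchiveOutputStream(dest);
              out.setUseZip64(Zip64Mode.AsNeeded);   // emit Zip64 records when an entry exceeds 4 GiB
              try {
                out.putArchiveEntry(new ZipArchiveEntry(src, src.getName()));
                FileInputStream in = new FileInputStream(src);
                try {
                  IOUtils.copy(in, out);             // stream the file contents into the entry
                } finally {
                  in.close();
                }
                out.closeArchiveEntry();
              } finally {
                out.close();
              }
            }
          }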

          Mark Johnson added a comment -

          Has anyone updated this contrib to work with the new ant tasks in solr 3.4?

          Alexander Kanarsky added a comment -

          SOLR-1301 patch modified to work with the Solr 3.x ant build; tested with Solr 3.5.0 and Cloudera CDH3u3 (also attached).

          Alexander Kanarsky added a comment - - edited

          Note: the hadoop-core-0.20.2-cdh3u3.jar is part of Cloudera's CDH3 Hadoop distribution and is licensed under the Apache License v2.0.

          Alexander Kanarsky added a comment -

          OK, so I changed the patch to work with the 3.5 ant build and re-tested it with Solr 3.5 and Cloudera's CDH3u3 (both the build and the CSV test run in pseudo-distributed mode). Still no unit tests, but I am working on this.

          No changes compared to the previous version, except that I had to comment out the code that sets the debug level dynamically in SolrRecordWriter, because of conflicts with the slf4j parts in current Solr; I think it is minor, but if not, please feel free to resolve this and update the patch. With this done, there is no need to put the log4j and commons-logging jars in hadoop/lib at compile time anymore, only the hadoop jar. I provided the hadoop-core-0.20.2-cdh3u3.jar used for testing as part of the patch, but you can use other 0.20.x versions if you'd like; it should also work with Hadoop 0.21.x. Note that you still need to make the other related jars (solr, solrj, lucene, commons, etc.) available while running your job; one way to do this is to put all the needed jars into the lib subfolder of the apache-solr-hadoop jar, and other ways are described here: http://www.cloudera.com/blog/2011/01/how-to-include-third-party-libraries-in-your-map-reduce-job/.

          Finally, the quick steps to get the patch compiled (on linux):
          1. get the solr source tarball (apache-solr-3.5.0-src.tgz in this example), put it into some folder, cd there
          2. tar -xzf apache-solr-3.5.0-src.tgz
          3. cd apache-solr-3.5.0/solr
          4. wget https://issues.apache.org/jira/secure/attachment/12515662/SOLR-1301.patch
          5. patch -p0 -i SOLR-1301.patch
          6. mkdir contrib/hadoop/lib
          7. cd contrib/hadoop/lib
          8. wget https://issues.apache.org/jira/secure/attachment/12515663/hadoop-core-0.20.2-cdh3u3.jar
          9. cd ../../..
          10. ant dist

          and you should have the apache-solr-hadoop-3.5-SNAPSHOT.jar in solr/dist folder.

          Hoss Man added a comment -

          Bulk of fixVersion=3.6 -> fixVersion=4.0 for issues that have no assignee and have not been updated recently.

          Email notification suppressed to prevent mass spam.
          Pseudo-unique token identifying these issues: hoss20120321nofix36

          Greg Bowyer added a comment -

          Updated patch that deals with a minor change post SOLR-3204

          Otis Gospodnetic added a comment - - edited

          Noticed this is issue #11 in terms of votes and has 50 watchers.
          From the description, this is only about MapReduce-based indexing.

          Qs:

          • Are there plans to make it possible to use Solr as Input for MapReduce?
          • Mark Miller, do you know if anything related will be contributed to Solr by Cloudera?
          • Is SOLR-1045 a subset of this issue and thus closable?
          Mark Miller added a comment -

          Yeah, we have taken this issue as a starting point and extended and polished it quite a bit. Wolfgang will provide an initial 'dump' of the work that was done shortly, but there will still be some work integrating it into our build system and trunk.

          Alexander Kanarsky added a comment - - edited

          Otis Gospodnetic, do you mean using the Solr query result as a MapReduce job input?
          Also, regarding SOLR-1045: it is a different approach (indexing in the Map phase vs. the Reduce phase - a great explanation by Ted is up here: https://issues.apache.org/jira/browse/SOLR-1301#comment-12828961)

          Otis Gospodnetic added a comment -

          Alexander Kanarsky - yes, take Solr results and use them for MR input, as well as run an MR job and index into Solr (SOLR-1045).

          Mark Miller added a comment -

          As I mentioned above, Cloudera has done a lot to move this issue forward. I've been working on converting the build system from Maven to Ivy+Ant and will post my current progress before long.

          Mark Miller added a comment -

          Here is a patch with my current progress.

          This is a Solr contrib module that can build Solr indexes in HDFS via MapReduce. It builds upon the Solr support for reading and writing to HDFS.

          It supports a GoLive feature that allows merging into a running cluster as the final step of the MapReduce job.

          There is fairly comprehensive help documentation as part of the MapReduceIndexerTool.

          For ETL, Morphlines from the open source Cloudera CDK is used: https://github.com/cloudera/cdk/tree/master/cdk-morphlines This is the same ETL library that the Solr integration with Apache Flume uses.

          What I have recently done: updated to the latest code, accounted for the fact that 5x now requires solr.xml, converted the Maven build to Ivy+Ant, updated license files, fixed validation errors, integrated the tests fully into the test framework, and got the tests passing.

          All tests are passing with this patch for me, but there are still a variety of issues to address:

          • Run against both yarn and mr1 - the Maven build would run the unit tests against yarn or mr1 depending on the profile chosen on the command line; this patch runs against yarn.
          • The MiniYarnCluster used for unit tests is hard coded to use the 'current-working-dir'/target path. This is a bad and illegal location. For the moment, I've relaxed the Lucene tests policy file to allow read/writes anywhere - this needs to be addressed before committing.
          • We depend on some Morphline commands that depend on Solr - this could cause us problems in the future, and I think we want to own the code for these commands in Solr.
          • There are thread leaks in the tests that should be looked into - some might not be avoidable as in other Hadoop tests (as we wait for fixes from the Hadoop project).
          • We need to sync up with the latest code from the maven version - there have been some changes since this code was extracted.

          There are a number of new contributors to this issue that I will be sure to enumerate in CHANGES.

          I'll add whatever I'm forgetting in a later comment.

          Mark Miller added a comment -

          Another thing I have not looked at: The final jar that is created in the dist for the MapReduceIndexerTool - it likely still needs tweaking.

          Mark Miller added a comment -

          And another note: need to add support for the skip hadoop tests system property as well.

          Phani Chaitanya Vempaty added a comment -

          I wanted to look at the code, but after I downloaded solr-4.4.0 and applied the patch, I'm not able to create the eclipse project. It says that there is no ivy.xml in the solr-mr directory, and it is indeed missing. I created one and now everything is fine. Is ivy.xml missing because this is an initial cut, or am I doing something wrong? Below is my ivy.xml.

          <ivy-module version="2.0">
              <info organisation="org.apache.solr" module="solr-mr"/>
              <dependencies>
                <dependency org="org.apache.hadoop" name="hadoop-common" rev="2.0.5-alpha" transitive="false"/>
                <dependency org="org.apache.hadoop" name="hadoop-core" rev="1.2.1" transitive="false"/>
                <dependency org="org.apache.hadoop" name="hadoop-hdfs" rev="2.0.5-alpha" transitive="false"/>
                <dependency org="org.apache.hadoop" name="hadoop-mapreduce" rev="2.0.5-alpha" transitive="false"/>
                <dependency org="org.apache.hadoop" name="hadoop-mapreduce-client" rev="2.0.5-alpha" transitive="false"/>
                <dependency org="org.apache.hadoop" name="hadoop-mapreduce-client-core" rev="2.0.5-alpha" transitive="false"/>
                <dependency org="org.apache.hadoop" name="hadoop-mapred" rev="0.22.0" transitive="false"/>
                <dependency org="org.apache.hadoop" name="hadoop-yarn" rev="2.0.5-alpha" transitive="false"/>
                <dependency org="com.codahale.metrics" name="metrics-core" rev="3.0.1" transitive="false"/>
                <dependency org="com.cloudera.cdk" name="cdk-morphlines-core" rev="0.7.0" transitive="false"/>
                <dependency org="com.cloudera.cdk" name="cdk-morphlines-solr-core" rev="0.7.0" transitive="false"/>
                <dependency org="org.skife.com.typesafe.config" name="typesafe-config" rev="0.3.0" transitive="false"/>
                <dependency org="net.sourceforge.argparse4j" name="argparse4j" rev="0.4.1" transitive="false"/>
                <exclude org="*" ext="*" matcher="regexp" type="${ivy.exclude.types}"/>
              </dependencies>
          </ivy-module>
          

          Though I can at least now open the eclipse project to view the code, I still have some compile errors in the project, which I guess are mainly due to the hadoop versions I have in the above ivy.xml file w.r.t. hadoop-core and others (I'm not able to find 2.0.5-alpha at http://repo1.maven.org/maven2/org/apache/hadoop/hadoop-core/ - I did not look into CDH jars though). I'm also not able to compile the code base after applying this patch, for this very reason.

          Mark Miller added a comment -

          Is ivy.xml missing because this is an initial cut, or am I doing something wrong?

          No, it should be there - not sure why it wouldn't have made it into the patch. I'll post another before long.

          wolfgang hoschek added a comment -

          FYI, one thing that's definitely off in that ad hoc ivy.xml above is that it should use com.typesafe rather than org.skife.com.typesafe.config. Use version 1.0.2 of it. See http://search.maven.org/#search%7Cga%7C1%7Ctypesafe-config
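
          For illustration, the corrected Ivy line would then presumably look something like this (a sketch based on the coordinates named above; the configuration attributes simply mirror the rest of the ad hoc file):

            <dependency org="com.typesafe" name="config" rev="1.0.2" transitive="false"/>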

          Maybe best to wait for Mark to post our full ivy.xml, though.

          (Moving all our solr-mr dependencies from Cloudera Search maven to ivy was a bit of a beast).

          wolfgang hoschek added a comment -

          By the way, the docs and the downstream code for our solr-mr contrib submission are here: https://github.com/cloudera/search/tree/master/search-mr

          wolfgang hoschek added a comment -

          This new solr-mr contrib uses morphlines for ETL from MapReduce into Solr. To get started, here are some pointers for morphlines background material and code:

          • code: https://github.com/cloudera/cdk/tree/master/cdk-morphlines
          • blog post: http://blog.cloudera.com/blog/2013/07/morphlines-the-easy-way-to-build-and-integrate-etl-apps-for-apache-hadoop/
          • reference guide: http://cloudera.github.io/cdk/docs/current/cdk-morphlines/morphlinesReferenceGuide.html
          • slides: http://www.slideshare.net/cloudera/using-morphlines-for-onthefly-etl
          • talk recording: http://www.youtube.com/watch?v=iR48cRSbW6A

          Phani Chaitanya Vempaty added a comment -

          Thanks Wolfgang. I corrected it in my xml, though will be waiting for Mark to give the full version of the xml file.

          Mark Miller added a comment - - edited

          Here is another patch. No major changes. I've confirmed it has ivy.xml in it.

          I made some minor tweaks to fix issues 'precommit' brought up, but it's currently failing on 3 javadoc warnings due to some new hadoop test dependencies:

            [javadoc] /ssd/workspace3/lucene-solr-5x-mr/solr/contrib/solr-mr/lib/hadoop-mapreduce-client-jobclient-2.0.5-alpha-tests.jar(org/apache/hadoop/mapreduce/TestLocalRunner.class): warning: Cannot find annotation method 'timeout()' in type 'Test': class file for org.junit.Test not found
            [javadoc] /ssd/workspace3/lucene-solr-5x-mr/solr/contrib/solr-mr/lib/hadoop-common-2.0.5-alpha-tests.jar(org/apache/hadoop/util/TestClassUtil.class): warning: Cannot find annotation method 'timeout()' in type 'Test'
            [javadoc] /ssd/workspace3/lucene-solr-5x-mr/solr/contrib/solr-mr/lib/hadoop-common-2.0.5-alpha-tests.jar(org/apache/hadoop/io/TestSortedMapWritable.class): warning: Cannot find annotation method 'timeout()' in type 'Test'
          
          Mark Miller added a comment -

          Given that my first patch was 963k and the latest is 2.33MB, it seems the first was missing a bunch of things due to me not yet adding them to svn - sorry about that - the latest patch should be complete.

          Mark Miller added a comment -

          FYI - wolfgang hoschek and Patrick Hunt are the primary authors of the code - it was based on the initial patches in this issue, but has been heavily processed and expanded. Full unit tests have also been written. Gregory Chanan, Roman Shaposhnik, I, and Eric Wong also contributed thoughts, code, and features. I have not yet added a CHANGES entry to the patch, but this is the current list of authors that would go in, along with the authors of the previous work in this JIRA.

          Mark Miller added a comment -

          Note so I do not forget: Wolfgang just mentioned that we are missing runtime libs that our tests don't require - such as Tika libs - our tests of course don't exercise all of them.

          I'll need to make sure we pull all of those in.

          Mark Miller added a comment -

          I'm working on making a new patch with some changes:

          • I updated to Hadoop 2.1.0 beta
          • I updated the versions of some of the dependencies
          • I added some run time dependencies that the tests don't require as well as their license files
          • I started working around the issue where the Yarn cluster is hard coded to write to the CWD/target illegal location. This involved copying and modding some Hadoop test files until we can get Hadoop to make things more flexible. Unfortunately there is still an issue - the mkdirs used by hdfs requires write permissions all the way up the tree it seems, whether the directory you are making already exists or not - this keeps the yarn mini test cluster from being able to run with our test policy. It fails in init when it does this mkdir. Don't know of a good workaround at the moment.
          Mark Miller added a comment -

          Latest patch attached.

          Jack Krupansky added a comment -

          Fix version still says 4.5.

          Mark Miller added a comment -

          I've worked out the javadoc warnings - that has led to some new issue(s) with jtidy. It's failing on SolrReducer.html.

          Mark Miller added a comment -

          Got it - the jtidy output was very generic (failed, returned 1), but I worked out that the problem was some '<' and '>' in the javadoc of SolrReducer and another class or two. After addressing that and adding a couple package.html files, the precommit ant task now passes.
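
          For reference, the usual fix is to escape the angle brackets as HTML entities or wrap the expression in {@code ...}; a minimal illustration on a placeholder class (not the actual SolrReducer javadoc):

            /**
             * Escaping generics in Javadoc: either write &lt;K, V&gt; as HTML entities or wrap
             * the whole expression in a code tag, e.g. {@code Reducer<K1, V1, K2, V2>}, so the
             * generated HTML stays well-formed and jtidy stops complaining.
             */
            public class JavadocEscapingExample {
            }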

          I have a variety of items still on the TODO list, but I think the critical path to an initial commit is:

          • Move the Solr Morphline commands in.
          • Get the tests to run without a hacked test.policy file - see my comment above about FileSystem#mkDirs.
          • Look at the final jar we produce and how it works with the dependencies (eg it's currently going to the extraction contrib for tika, etc).

          I don't think the other outstanding issues are blocking an initial commit.

          Also, FYI, since I did not mention it: the previous patch will run the mini cluster tests based on the tests.disableHdfs sys prop now, so that is checked off.

          Mark Miller added a comment -

          I've moved in the Solr Morphlines code from cdk-morphlines-solr-core and cdk-morphlines-solr-cell. I've made them compliant with the test framework and got the tests passing. There are still some ant precommit and license issues to handle, but otherwise this should be mostly done. I'm still unclear on what will be required for packaging, but I am tackling packaging last.

          I've also updated the Tika parser dependencies to include a couple that Solr did not have.

          Once I wrap up the loose ends on this I'll attach my latest patch, and then only two issues remain on the critical path:

          • Get the tests to run without a hacked test.policy file.
          • Dist packaging.
          Mark Miller added a comment - - edited

          Latest patch with Solr Morphlines and the other items I mentioned above.

          New issues though.

          Something is still writing to {CWD}/target that I need to track down.

          Precommit is not yet passing with Solr morphlines code - need to resolve use of forbidden apis:
          [forbidden-apis] Forbidden class/interface use: com.sun.org.apache.xml.internal.serialize.OutputFormat [non-public internal runtime class]
          [forbidden-apis] in org.apache.solr.hadoop.morphline.solrcell.SolrCellBuilder$SolrCell (SolrCellBuilder.java:242)
          [forbidden-apis] Forbidden class/interface use: com.sun.org.apache.xml.internal.serialize.XMLSerializer [non-public internal runtime class]
          [forbidden-apis] in org.apache.solr.hadoop.morphline.solrcell.SolrCellBuilder$SolrCell (SolrCellBuilder.java:242)
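
          For reference, the usual way around those two internal classes is the public javax.xml.transform API; a minimal sketch of the idea (not the actual SolrCellBuilder code) might look like:

            import java.io.StringWriter;
            import javax.xml.transform.OutputKeys;
            import javax.xml.transform.Transformer;
            import javax.xml.transform.TransformerFactory;
            import javax.xml.transform.dom.DOMSource;
            import javax.xml.transform.stream.StreamResult;
            import org.w3c.dom.Document;

            public class XmlSerializationExample {
              // Serialize a DOM document with public JDK APIs instead of the
              // forbidden com.sun.org.apache.xml.internal.serialize classes.
              static String toXmlString(Document doc) throws Exception {
                Transformer transformer = TransformerFactory.newInstance().newTransformer();
                transformer.setOutputProperty(OutputKeys.INDENT, "yes");
                StringWriter writer = new StringWriter();
                transformer.transform(new DOMSource(doc), new StreamResult(writer));
                return writer.toString();
              }
            }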

          Also, I'm not sure where exactly the Solr morphlines should end up - as their own module or where I put them, but this is where they are for now.

          wolfgang hoschek added a comment -

          cdk-morphlines-solr-core and cdk-morphlines-solr-cell should remain separate and be available through separate maven modules so that clients such as Flume Solr Sink and Hbase Indexer can continue to choose to depend (or not depend) on them. For example, not everyone wants tika and its dependency chain.

          wolfgang hoschek added a comment -

          Seems like the patch still misses tika-xmp.

          Mark Miller added a comment -

          This is likely the last patch I'll put up for a bit - I'm on vacation from Wed-Mon.

          Patch Notes:

          ant precommit passes again. I've fixed the forbidden api calls and a couple of minor javadoc issues in the new morphlines code. Also fixed a more problematic javadocs issue caused by broken links from the morphlines code to the extraction code, which come from extending extraction classes.

          I've added tika-xmp to the extraction dependencies.

          I don't like that tests can pass when some necessary run-time jars are missing - we will likely need to look into adding simple tests that cause each necessary jar to be used, or even just have hack tests that try to create a class from each of the offending jars or something. I'll save that for a follow-up issue though - the solr cell morphlines tests actually upped the number of dependencies the tests hit quite a bit, at least.
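
          As a rough illustration of such a "hack test" (a sketch only - the test class and the class name being probed are illustrative, not anything in the patch):

            import org.junit.Test;
            import static org.junit.Assert.assertNotNull;

            public class RuntimeJarPresenceTest {
              // Fails fast if a runtime-only jar is missing from the classpath; probe any
              // class that is unique to the jar being checked (the name here is illustrative).
              @Test
              public void runtimeOnlyJarIsPresent() throws Exception {
                assertNotNull(Class.forName("org.apache.tika.parser.AutoDetectParser"));
              }
            }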

          There is also a test speed issue that is not on the critical path - on my fast machine that runs 8 tests in parallel, this adds about 4-5 minutes to the tests. It would be good to try to minimize some of the longer tests for standard runs, and keep them as is for @nightly runs. That can wait until post-commit though.

          That leaves the following 2 critical path items to deal with:

          • Get the tests to run without a hacked test.policy file.
          • Dist packaging. This includes things like creation of the final MapReduceIndexerTool jar file and dealing with its dependencies, as well as the location of the morphlines code and how it is distributed.

          Other than that we are looking pretty good - all tests passing and precommit passing.

          Mark Miller added a comment -

          I have a new patch I'm cleaning up that tackles some of the packaging:

          • Split out solr-morphlines-core and solr-morphlines-cell into their own modules.
          • Updated to trunk and the new modules are now using the new dependency version tracking system.
          • Fixed an issue in the code around the TokenStream contract being violated - the latest code detected this and failed a test; end() and close() are now called (see the sketch after this list).
          • Updated to use Morphlines from CDK 0.8.
          • Setup the main class in the solr-mr jar manifest.
          • I enabled an ignored test which exposed a few bugs because of the required solr.xml in Solr 5.0 - I addressed those bugs.
          • Added a missing metrics health-check dependency that somehow popped up.
          • I played around with naming the solr-mr artifact MapReduceIndexTool.jar, but the system really wants us to follow the rules of the artifacts and have something like solr-solr-mr-5.0.jar. Anything else has some random issues, such as with javadoc, and if your name does not start with solr-, it will be changed to start with lucene-. I'm not yet sure if it's worth the trouble to expand the system or use a different name, so for now it's still just using the default jar name based on the contrib module name (solr-mr).
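
          A minimal sketch of the consumer-side contract mentioned in the TokenStream bullet above (generic Lucene usage, not the patch's actual code):

            import java.io.IOException;
            import java.io.StringReader;
            import org.apache.lucene.analysis.Analyzer;
            import org.apache.lucene.analysis.TokenStream;
            import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

            public class TokenStreamContractExample {
              // The TokenStream contract: reset(), incrementToken() until it returns false,
              // then end() and finally close() - the last two calls were the ones missing.
              static void consume(Analyzer analyzer, String field, String text) throws IOException {
                TokenStream stream = analyzer.tokenStream(field, new StringReader(text));
                try {
                  CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
                  stream.reset();
                  while (stream.incrementToken()) {
                    System.out.println(term.toString());
                  }
                  stream.end();
                } finally {
                  stream.close();
                }
              }
            }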

          Besides the naming issue, there are a couple other things to button up:

          • How we are going to set up the classpath - script, in the manifest, leave it up to the user and doc, etc.
          • All dependencies are currently in solr-morphlines-core - this was a simple way to split out the modules since solr-mr and solr-morphlines-cell depend on solr-morphlines-core.

          Finally, we will probably need some help from Steve Rowe to get the Maven build setup correctly.

          I spent a bunch of time trying to use asm to work around the hacked test policy issue. There are multiple problems I ran into. One is that another module uses asm 4.1, but Hadoop brings in asm 3.1 - if you are doing some asm coding, this can cause compile issues with your ide (at least eclipse). It also ends up being really hard to get an injection in the right place because of how the yarn code is structured. After spending a bunch of time trying to get this to work, I'm backing out and considering other options.

          Mark Miller added a comment -

          Here is the patch I referred to in the above comment.

          Precommit still passing and tests passing with the tests policy hack.

          Rafał Kuć added a comment -

          Mark, is the version attached to this issue the newest patch, or maybe you have something newer?

          Mark Miller added a comment -

          Hey Rafal - that is the latest at the moment - I've gotten side-tracked with other things. Shortly, I'll upload a new patch that changes around the dependencies between modules a bit. Beyond that, there is figuring out the strategy for the classpath, some manual testing, and finally working around that darn test policy issue.

          However, things should be in a usable state regardless of those remaining issues.

          Rafał Kuć added a comment -

          Thanks Mark

          Mark Miller added a comment - - edited

          New Patch.

          • Updated to trunk
          • A pass at putting dependencies in the correct modules
          • A script for running the MapReduceIndexTool - classpath in the manifest doesn't seem very nice.
          • Updated to CDK 0.8.1

          I'm sure there are a variety of other things to polish, fix, decide and finalize, as well as code to sync up - but nothing that needs to be done before this is committed. I need to get this in asap as it's a large burden to maintain over time.

          Except for the test policy issue. That is the only remaining blocker I know of for committing.

          Also have to do a bit of manual testing.

          You can run the tool by running Solr's 'ant package' and then expand one of the release zip/tgz files. Try something like:

          cd solr/example/scripts/solr-mr
          sh solr-mr.sh --help

          Mark Miller added a comment -

          I have a plan for the test policy issue.

          • disable the couple large integration tests that have a problem by default
          • use a 'no.test.policy' sys prop of some kind to allow those tests to run in a local jenkins setup with no test policy running (see the sketch below)
          • file a jira with yarn requesting that they take pity on us and make it so that yarn will run with our test policy (e.g., if the dir already exists, don't mkdirs up its parents to the root)
          • once that makes it into a release, we can re-enable these couple of tests by default
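
          As a rough illustration of the guard described in the second bullet (a sketch only - the class, method, and test body are placeholders; only the 'no.test.policy' property name comes from the plan above):

            import org.apache.lucene.util.LuceneTestCase;
            import org.junit.Before;
            import org.junit.Test;

            public class LargeMiniMRIntegrationTestExample extends LuceneTestCase {
              @Before
              public void onlyRunWhenPolicyDisabled() {
                // Opt-in only: e.g. a local jenkins job runs 'ant test -Dno.test.policy=true'
                // with the Lucene tests policy file disabled.
                assumeTrue("requires -Dno.test.policy=true", Boolean.getBoolean("no.test.policy"));
              }

              @Test
              public void testAgainstMiniYarnCluster() throws Exception {
                // heavy MiniYARNCluster-based test body would go here
              }
            }
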
          Mark Miller added a comment -

          Steve Rowe sir - I'm going to need your assistance I think

          Steve Rowe added a comment -

          Steve Rowe sir - I'm going to need your assistance I think

          I'll take a look.

          Steve Rowe added a comment -

          FYI, when I apply the latest patch against trunk using svn patch SOLR-1301.patch, svn says:

          Skipped missing target: 'solr/example/scripts/cloud-scripts/zkcli.sh'
          Skipped missing target: 'solr/example/scripts/cloud-scripts/zkcli.bat'
          

          The patch assumes these two files already exist, but solr/example/scripts/ doesn't exist on trunk.

          When I run the following before applying the patch, svn no longer complains about those scripts:

          svn mkdir solr/example/scripts
          svn mv solr/example/cloud-scripts solr/example/scripts/
          
          Mark Miller added a comment -

          Sorry about that - eclipse gets confused sometimes when you do some local refactoring.

          I'll commit to 5x and it will be easier to work on this.

          Steve Rowe added a comment -

          I'll commit to 5x and it will be easier to work on this.

          I'm working on the Maven build, and am almost there - do you want to wait for a revised patch? I also made some minor modifications to the Ant build: removed solr-mr dependency from solr-morphlines-cell; resolving solr-morphlines-core test deps to test-lib/ instead of lib/.

          Mark Miller added a comment -

          I was a little scared of committing from a patch and not the svn checkout I've been working from. It's such a huge patch

          However, whatever is easiest for you.

          Mark Miller added a comment -

          removed solr-mr dependency from solr-morphlines-cell; resolving solr-morphlines-core test deps to test-lib/ instead of lib/.

          Nice, thanks.

          Steve Rowe added a comment -

          It's such a huge patch

          Yeah, it is...

          I guess you should commit, then I'll make a patch of the diff between my modified patched dir and what you commit. That should be the simplest/least scary.

          Steve Rowe added a comment -

          Mark, BTW, some tests are failing for me in solr-morphlines-cell and in solr-morphlines-core, and test(s?) are hanging in solr-mr (OS X, Oracle JDK 1.7.0_25), both with and without my modifications. Hopefully it's just a local thing.

          Steve Rowe added a comment -

          Patch that extends the Maven and IntelliJ builds to include the three new modules. Minor Ant build modifications included as well.

          This patch was created with dev-tools/scripts/diffSources.py, comparing two trunk directories patched with Mark's latest patch, the second of which has my modifications as well. So it has to be applied after Mark's latest patch.

          Mark Miller added a comment -

          Thanks Steve - took me a bit, but I'm about ready to commit to 5x.

          ASF subversion and git services added a comment -

          Commit 1547139 from Mark Miller in branch 'dev/trunk'
          [ https://svn.apache.org/r1547139 ]

          SOLR-1301: Add a Solr contrib that allows for building Solr indexes via Hadoop's MapReduce.

          ASF subversion and git services added a comment -

          Commit 1547187 from Mark Miller in branch 'dev/trunk'
          [ https://svn.apache.org/r1547187 ]

          SOLR-1301: Ivy likes to act funny if you don't declare compile and test resources in the same dependency.

          Mark Miller added a comment -

          I've set up a local jenkins job to run the two tests that have a problem with the test policy/manager. Next I'll file a JIRA issue for Yarn.

          Mark Miller added a comment -

          One issue that I had to work around will be solved with https://issues.apache.org/jira/browse/YARN-1442

          Uwe Schindler added a comment -

          Hi,
          it seems to resolve correctly now. There is one inconsistency: the folder names. The new contribs all have "solr-" in the folder name, which is inconsistent with the others. I would prefer to rename the folders with svn mv and maybe fix some paths in dependencies and maven. The build.xml files use the correct name already, so JAR files are named correctly.
          Uwe

          Mark Miller added a comment -

          Removing solr from the module names sounds good to me.

          ASF subversion and git services added a comment -

          Commit 1547232 from Uwe Schindler in branch 'dev/trunk'
          [ https://svn.apache.org/r1547232 ]

          SOLR-1301: Fix compilation for Java 8 (the Java 8 compiler is more picky, but it's not a Java 8 regression: the code was just wrong)

          Uwe Schindler added a comment -

          I found out that some tests don't work on Windows, for the same reason that the MiniDFS tests don't work in Solr-Core: some crazy command line tools are missing. I would mark all those tests with the same assume as the HdfsDirectory tests?

          Should I start doing this?
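
          For reference, the kind of assume mentioned above is roughly the following (a sketch; the base-class name is a placeholder):

            import org.apache.lucene.util.Constants;
            import org.apache.lucene.util.LuceneTestCase;

            public abstract class WindowsAssumingTestBase extends LuceneTestCase {
              @Override
              public void setUp() throws Exception {
                super.setUp();
                // Skip on Windows, where the Hadoop mini clusters need native command
                // line tools that are typically not available.
                assumeFalse("Hadoop-based tests do not work on Windows", Constants.WINDOWS);
              }
            }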

          Mark Miller added a comment -

          Hmm...yeah, you might as well. I'll investigate on my VM.

          wolfgang hoschek added a comment -

          There is also a known issue in that Morphlines don't work on Windows, because the Guava ClassPath utility doesn't work with Windows path conventions. For example, see http://mail-archives.apache.org/mod_mbox/flume-dev/201310.mbox/%3C5ACFFCD9-4AD7-4E6E-8365-CEADFAC78B1A@cloudera.com%3E

          ASF subversion and git services added a comment -

          Commit 1547239 from Uwe Schindler in branch 'dev/trunk'
          [ https://svn.apache.org/r1547239 ]

          SOLR-1301: Fix windows problem with escaping of folder name (see crazy https://github.com/typesafehub/config/blob/master/HOCON.md for correct format: string must be quoted and escaped like javascript)

          ASF subversion and git services added a comment -

          Commit 1547242 from Uwe Schindler in branch 'dev/trunk'
          [ https://svn.apache.org/r1547242 ]

          SOLR-1301: Ignore windows tests that cannot work because they use UNIX semantics. Also remove a never-executed test which tests nothing

          Uwe Schindler added a comment -

          OK, I fixed the test suite to pass on Windows.

          Mark Miller added a comment -

          For posterity, there is a thread on the dev list where we are working through an issue with Saxon on java 8 and ibm's j9. Wolfgang filed https://saxonica.plan.io/issues/1944 upstream. (Saxon is pulled in via cdk-morphlines-saxon).

          ASF subversion and git services added a comment -

          Commit 1547442 from Mark Miller in branch 'dev/trunk'
          [ https://svn.apache.org/r1547442 ]

          SOLR-1301: Ignore these tests on java 8 and j9 for now.

          Mark Miller added a comment -

          Removing Solr from the module names would give:

          • morphlines-cell - Perhaps this should be morphlines-extraction? We have always made cell / extraction confusing. The module folder is extraction though, so I see that as the name. We really should standardize on one name.
          • morphlines-core - Removing Solr is a bit confusing - morphlines-core is a module in the morphlines project - this is a morphlines module with stuff for interacting with Solr - perhaps we just call it morphlines?
          • mr - Seems we should rename this. Steve suggested map-reduce-indexer in IRC, which seems good to me.

          ASF subversion and git services added a comment -

          Commit 1547498 from Steve Rowe in branch 'dev/trunk'
          [ https://svn.apache.org/r1547498 ]

          SOLR-1301: remove unnecessary (POM-only) dependency org.apache.hadoop:hadoop-yarn-server

          wolfgang hoschek added a comment - - edited

          module/dir names

          I propose morphlines-solr-core and morphlines-solr-cell as names. This avoids confusion by fitting nicely with the existing naming pattern, which is cdk-morphlines-solr-core and cdk-morphlines-solr-cell. (https://github.com/cloudera/cdk/tree/master/cdk-morphlines). Thoughts?

          wolfgang hoschek added a comment -

          +1 to "map-reduce-indexer" module name/dir.

          Steve Rowe added a comment -

          I propose morphlines-solr-core and morphlines-solr-cell as names. This avoids confusion by fitting nicely with the existing naming pattern, which is cdk-morphlines-solr-core and cdk-morphlines-solr-cell. (https://github.com/cloudera/cdk/tree/master/cdk-morphlines). Thoughts?

          The problem with these two names is that the artifact names will have "solr-" prepended, and then "solr" will occur twice in their names: solr-morphlines-solr-core-4.7.0.jar, solr-morphlines-solr-cell-4.7.0.jar. Yuck.

          Mark Miller added a comment -

          That sounds fine to me.

          Mark Miller added a comment -

          Yuck.

          Whoops - cross posted. Yeah, didn't realize that - not ideal.

          wolfgang hoschek added a comment -

          The problem with these two names is that the artifact names will have "solr-" prepended, and then "solr" will occur twice in their names: solr-morphlines-solr-core-4.7.0.jar, solr-morphlines-solr-cell-4.7.0.jar. Yuck.

          Ah, argh. In this light, what Mark suggested seems good to me as well.

          Steve Rowe added a comment -

          In this light, what Mark suggested seems good to me as well.

          +1 to:

          contrib name          artifact name
          morphlines-core       solr-morphlines-core
          morphlines-cell       solr-morphlines-cell
          map-reduce-indexer    solr-map-reduce-indexer
          wolfgang hoschek added a comment -

          +1 on Steve's suggestion as well. Thanks for helping out!

          wolfgang hoschek added a comment - - edited

          Upon a bit more reflection, it might be better to call the contrib "map-reduce" and the artifact "solr-map-reduce". This keeps the door open to potentially adding things later like a Hadoop SolrInputFormat, i.e. reading from solr via MR, rather than just writing to solr via MR.

          ASF subversion and git services added a comment -

          Commit 1547819 from Mark Miller in branch 'dev/trunk'
          [ https://svn.apache.org/r1547819 ]

          SOLR-1301: Straighten out module names so that they match current convention

          ASF subversion and git services added a comment -

          Commit 1547871 from Mark Miller in branch 'dev/trunk'
          [ https://svn.apache.org/r1547871 ]

          SOLR-1301: Merge in latest solr-map-reduce updates.

          ASF subversion and git services added a comment -

          Commit 1547879 from Mark Miller in branch 'dev/trunk'
          [ https://svn.apache.org/r1547879 ]

          SOLR-1301: Merge in latest morphlines module updates.

          Mark Miller added a comment -

          MorphlineGoLiveMiniMRTest, which is ignored while the test policy issue gets straightened out, is now too slow for the standard test run. Before re-enabling it, we will have to tone it down for non-nightly runs.

          wolfgang hoschek added a comment -

          There are also some fixes downstream in cdk-morphlines-core and cdk-morphlines-solr-cell that would be good to push upstream.

          wolfgang hoschek added a comment -

Minor nit: we could remove jobConf.setBoolean(ExtractingParams.IGNORE_TIKA_EXCEPTION, false) in MorphlineBasicMiniMRTest and MorphlineGoLiveMiniMRTest because that flag is no longer needed, and removing it drops an unnecessary dependency on Tika.

          Mark Miller added a comment -

          it removes an unnecessary dependency on tika.

          Whoops - that is why I had changed to just using the string param and I accidentally just reverted that in the merge. I'll remove the params entirely.

          ASF subversion and git services added a comment -

          Commit 1547962 from Mark Miller in branch 'dev/trunk'
          [ https://svn.apache.org/r1547962 ]

          SOLR-1301: Clean up.

wolfgang hoschek added a comment - edited

          FWIW, a current printout of --help showing the CLI options is here: https://github.com/cloudera/search/tree/master_1.1.0/search-mr

          ASF subversion and git services added a comment -

          Commit 1548319 from Steve Rowe in branch 'dev/trunk'
          [ https://svn.apache.org/r1548319 ]

          SOLR-1301: ignore '.iml' in new Solr contribs' directories; put new Solr contribs' lib/ and test-lib/ directories under Subversion control; ignore '.jar' in these directories

          Mark Miller added a comment -

          Getting started at the moment might be a bit daunting - to help people get started, to help with testing, and to help with figuring out what we need to provide to improve usability, I've started the following GitHub project: https://github.com/markrmiller/solr-map-reduce-example

          It's a script that downloads Hadoop and a nightly build of Solr and then builds an index via map-reduce and deploys that index to Solr.

          For now, it's just for looking - it won't actually work until I make a couple commits so that the standard example config files will correctly work with the map-reduce module.

          This should lower the barrier to entry for anyone that wants to play with things and serve as a nice guide for those looking to try this out on a real cluster.

I'll make the commit(s) I referenced above later today, when I wake up.

          ASF subversion and git services added a comment -

          Commit 1548600 from Mark Miller in branch 'dev/trunk'
          [ https://svn.apache.org/r1548600 ]

          SOLR-1301: Fix a couple of bugs around setting up the embedded Solr instance.

          Mark Miller added a comment -

          My plan is to merge this back to 4X before long - I do think we should mark it as an experimental module though and avoid promising strong back compat for a couple of releases. 4X releases frequently and we want to gather some feedback before locking in too much.

          ASF subversion and git services added a comment -

          Commit 1548605 from Mark Miller in branch 'dev/trunk'
          [ https://svn.apache.org/r1548605 ]

          SOLR-1301: Update to Morphlines 0.9.0

          Mark Miller added a comment -

          If you want to try this out, this example repo script should now be working for everyone: https://github.com/markrmiller/solr-map-reduce-example

It works on Linux, and I just updated it to work on OS X (at least on my machines).

wolfgang hoschek added a comment - edited

There are also some important fixes downstream in 0.9.0 of cdk-morphlines-solr-core and cdk-morphlines-solr-cell that would be good to merge upstream (solr locator race, solr cell bug, etc.). Also, there are new morphline module jars to add with 0.9.0 and existing jars to update (upstream is also missing some morphline modules from 0.8 as well).

          ASF subversion and git services added a comment -

          Commit 1548795 from Mark Miller in branch 'dev/trunk'
          [ https://svn.apache.org/r1548795 ]

          SOLR-1301: Update jar checksums for Morphlines 0.9.0

          Steve Rowe added a comment -

          The Maven Jenkins build on trunk has been failing for a while because com.sun.jersey:jersey-bundle:1.8, a morphlines-core dependency, causes ant validate-maven-dependencies to fail - here's a log excerpt from the most recent failure https://builds.apache.org/job/Lucene-Solr-Maven-trunk/1046/console:

               [echo] Building solr-map-reduce...
          
          -validate-maven-dependencies.init:
          
          -validate-maven-dependencies:
          [artifact:dependencies] [INFO] snapshot org.apache.solr:solr-cell:5.0-SNAPSHOT: checking for updates from maven-restlet
          [artifact:dependencies] [INFO] snapshot org.apache.solr:solr-cell:5.0-SNAPSHOT: checking for updates from releases.cloudera.com
          [artifact:dependencies] [INFO] snapshot org.apache.solr:solr-morphlines-cell:5.0-SNAPSHOT: checking for updates from maven-restlet
          [artifact:dependencies] [INFO] snapshot org.apache.solr:solr-morphlines-cell:5.0-SNAPSHOT: checking for updates from releases.cloudera.com
          [artifact:dependencies] [INFO] snapshot org.apache.solr:solr-morphlines-core:5.0-SNAPSHOT: checking for updates from maven-restlet
          [artifact:dependencies] [INFO] snapshot org.apache.solr:solr-morphlines-core:5.0-SNAPSHOT: checking for updates from releases.cloudera.com
          [artifact:dependencies] An error has occurred while processing the Maven artifact tasks.
          [artifact:dependencies]  Diagnosis:
          [artifact:dependencies] 
          [artifact:dependencies] Unable to resolve artifact: Unable to get dependency information: Unable to read the metadata file for artifact 'com.sun.jersey:jersey-bundle:jar': Cannot find parent: com.sun.jersey:jersey-project for project: null:jersey-bundle:jar:null for project null:jersey-bundle:jar:null
          [artifact:dependencies]   com.sun.jersey:jersey-bundle:jar:1.8
          [artifact:dependencies] 
          [artifact:dependencies] from the specified remote repositories:
          [artifact:dependencies]   central (http://repo1.maven.org/maven2),
          [artifact:dependencies]   releases.cloudera.com (https://repository.cloudera.com/artifactory/libs-release),
          [artifact:dependencies]   maven-restlet (http://maven.restlet.org),
          [artifact:dependencies]   Nexus (http://repository.apache.org/snapshots)
          [artifact:dependencies] 
          [artifact:dependencies] Path to dependency: 
          [artifact:dependencies] 	1) org.apache.solr:solr-map-reduce:jar:5.0-SNAPSHOT
          [artifact:dependencies] 
          [artifact:dependencies] 
          [artifact:dependencies] Not a v4.0.0 POM. for project com.sun.jersey:jersey-project at /home/hudson/.m2/repository/com/sun/jersey/jersey-project/1.8/jersey-project-1.8.pom
          

          I couldn't reproduce locally.

          Turns out the parent POM in question, at /home/hudson/.m2/repository/com/sun/jersey/jersey-project/1.8/jersey-project-1.8.pom, has the wrong contents:

          <html>
          <head><title>301 Moved Permanently</title></head>
          <body bgcolor="white">
          <center><h1>301 Moved Permanently</h1></center>
          <hr><center>nginx/0.6.39</center>
          </body>
          </html>
          

I replaced this by manually downloading the correct POM and its checksum file from Maven Central and putting them in the hudson user's local Maven repository.

Mark Miller: While investigating this failure, I tried dropping the triggering Ivy dependency com.sun.jersey:jersey-bundle, and all enabled tests succeeded. Okay with you to drop this dependency? The description from the POM says:

          <description>
          A bundle containing code of all jar-based modules that provide JAX-RS and Jersey-related features. Such a bundle is *only intended* for developers that do not use Maven's dependency system. The bundle does not include code for contributes, tests and samples.
          </description>
          

          Sounds like it's a sneaky replacement for transitive dependencies? IMHO, if we need some of the classes this jar provides, we should declare direct dependencies on the appropriate artifacts.

          Mark Miller added a comment -

          if we need some of the classes this jar provides, we should declare direct dependencies on the appropriate artifacts.

Right - Wolfgang likely knows best when it comes to Morphlines. At a minimum, I think we should pull the necessary jars in explicitly. I've got to take a look at what they are.

wolfgang hoschek added a comment - edited

I'm not aware of anything needing Jersey, except that perhaps Hadoop pulls it in.

The combined dependencies of all morphline modules are listed here: http://cloudera.github.io/cdk/docs/current/cdk-morphlines/cdk-morphlines-all/dependencies.html

The dependencies of each individual morphline module are here: http://cloudera.github.io/cdk/docs/current/dependencies.html

          The source and POMs are here, as usual: https://github.com/cloudera/cdk/tree/master/cdk-morphlines

By the way, a somewhat separate issue: it seems to me that the Ivy dependencies for solr-morphlines-core, solr-morphlines-cell, and solr-map-reduce are a bit backwards upstream, in that solr-morphlines-core currently pulls in a ton of dependencies that it doesn't need; those deps should instead be pulled in by solr-map-reduce (which is essentially an out-of-the-box app that bundles user-level deps). Correspondingly, it would be good to organize Ivy and Maven upstream in such a way that:

          • solr-map-reduce should depend on solr-morphlines-cell plus cdk-morphlines-all minus cdk-morphlines-solr-cell (now upstream) minus cdk-morphlines-solr-core (now upstream) plus xyz
          • solr-morphlines-cell should depend on solr-morphlines-core plus xyz
          • solr-morphlines-core should depend on cdk-morphlines-core plus xyz

More concretely, FWIW, to see how the deps look in production releases downstream, review the following POMs:

          https://github.com/cloudera/cdk/blob/master/cdk-morphlines/cdk-morphlines-solr-core/pom.xml

          and

          https://github.com/cloudera/cdk/blob/master/cdk-morphlines/cdk-morphlines-solr-cell/pom.xml

          and

          https://github.com/cloudera/search/blob/master_1.1.0/search-mr/pom.xml

          Steve Rowe added a comment -

          wolfgang hoschek, I'm lost: what do you mean by "upstream"/"downstream"? In my experience, "upstream" refers to a parent project, i.e. one from which the project in question is derived, and "downstream" is the child/derived project. I don't know the history here, but you seem to be referring to the solr contribs when you say "upstream"? If that's true, then my understanding of these terms is the opposite of how you're using them. Maybe the question I should be asking is: what is/are the relationship(s) between/among cdk-morphlines-solr-* and solr-morphlines-*?

And (I assume) relatedly, how does cdk-morphlines-all relate to cdk-morphlines-solr-core/-cell?

          wolfgang hoschek added a comment -

Apologies for the confusion. We are upstreaming cdk-morphlines-solr-cell into the solr contrib solr-morphlines-cell, cdk-morphlines-solr-core into the solr contrib solr-morphlines-core, and search-mr into the solr contrib solr-map-reduce. Once the upstreaming is done, these old modules will go away. Next, "downstream" will be made identical to "upstream", plus perhaps some critical fixes as necessary, and the upstream/downstream terms will apply in the way folks usually think about them. We are not quite there yet today, but getting there...

          cdk-morphlines-all is simply a convenience pom that includes all the other morphline poms so there's less to type for users who like a bit more auto magic.

Steve Rowe added a comment - edited

And (I assume) relatedly, how does cdk-morphlines-all relate to cdk-morphlines-solr-core/-cell?

          I can answer this one myself from https://github.com/cloudera/cdk/blob/master/cdk-morphlines/cdk-morphlines-all/pom.xml: it's an aggregation-only module that depends on all of the cdk-morphlines-* modules.

          Mark Miller added a comment -

          I'm not aware of anything needing jersey except perhaps hadoop pulls that in.

          Yeah, tests use this for running hadoop.

          Gary Schulte added a comment -

          FYI, a colleague and I just spent the better part of a week trying to get the latest 1301 patch against 4.6 working in our cdh 4.1.2 dev environment, and/or a local cdh 4.3 cluster.

          We discovered that while the indexing process itself worked and we could see the docs and index merges in the reducer output logs, we never actually ended up with anything in the data directories in hdfs for our shards.

Presumably, Hadoop 2.0 silently fails to do a distributed write when Solr is using HDFS for a core's data directory. After reverting SolrRecordWriter to the prior behavior of generating a local index and copying it to HDFS on completion, we were able to get MR indexing to work.
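
For reference, a minimal standalone sketch of the "build the index locally, then copy it to HDFS on completion" workaround described above, using only the stock Hadoop FileSystem API; the class name and arguments are illustrative assumptions, not the actual SolrRecordWriter code:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CopyLocalIndexToHdfs {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path localIndexDir = new Path(args[0]); // locally built Lucene index directory
    Path hdfsShardDir = new Path(args[1]);  // target shard directory in HDFS
    // Recursively copy the locally built index into HDFS once indexing completes.
    fs.copyFromLocalFile(false /* delSrc */, true /* overwrite */, localIndexDir, hdfsShardDir);
  }
}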

          Mark Miller added a comment -

          Sorry - latest patch is no good due to a bug. It was writing the data to the local filesystem. A lot has been committed beyond the last patch.

          Mark Miller added a comment -

          You need at least the commit above that talks about fixing where we set system properties in the solr record writer.

Gary Schulte added a comment - edited

          I am getting the same behavior from solr/contrib/map-reduce in http://svn.apache.org/repos/asf/lucene/dev/trunk

          I just verified I am able to reproduce this behavior on cdh 4.1.2 even after https://svn.apache.org/r1548600

          Mark Miller added a comment -

Strange - it should work fine. If I run the GitHub project above, it puts the indexes in HDFS and they are merged into Solr. It uses 5x from a couple of days ago.

          Mark Miller added a comment -

          It's the "fix a couple bugs around setting up embeddedsolrserver" commit. Keep in mind your solrconfig will need to have the directoryFactory setup to be subbed by sys prop currently - as it is by default..

          Gary Schulte added a comment -

          The example works fine for us also. The reality is that we are still on java 1.6 for the most part and therefore can't use Solr 5.x. All of our testing is with java 1.6 and lucene_solr_4_6.

          We've tried using solr-mr with the 1301 patch against 4.6, as well as 'transplanting' contrib/map-reduce from trunk into the 4.6 branch. Both yield the same behavior. Indexing works, but the indexes never 'arrive' in hdfs.

          Perhaps there is an issue with solr-core and hdfs that was addressed in trunk that we haven't picked up? (due to our java 1.6 source restriction)

          Mark Miller added a comment -

          I'd bet the hdfs directory is not being set for some reason. I was seeing the same thing until that commit. Look around for an errant folder being created on the local fs that starts with hdfs.

Gary Schulte added a comment - edited

          The directoryFactory appears to have been the root of the issue. We were adapting our local solrconfig for use in the embedded solr server and did not have :

<directoryFactory name="DirectoryFactory"
                  class="${solr.directoryFactory:solr.NRTCachingDirectoryFactory}"/>

          in our setup. In light of that, we can confirm it works on cdh 4.1.2. Thx

          Mark Miller added a comment -

          Thanks for closing the loop on that. That part is fragile - will be improved.

          Gary Schulte added a comment -

Some additional feedback: it would be convenient if we could ignore the underscore ("_") hidden files as well as the dot (".") hidden files when reading input files from HDFS.

When trying to index an AvroStorage directory created by Pig, we are having to send each part name individually, because the job will fail if we pass the directory. Passing the directory, we end up picking up "_logs/*", "_SUCCESS", etc., and the corresponding Avro morphlines map jobs fail.
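
For illustration, a minimal sketch of one way to skip such hidden files, assuming a plain Hadoop MapReduce job that reads its input through FileInputFormat; it is not taken from the MapReduceIndexerTool itself:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// Skips the "_logs", "_SUCCESS" and "."-prefixed entries produced by Pig/MapReduce.
public class VisibleFilesOnlyFilter implements PathFilter {
  @Override
  public boolean accept(Path path) {
    String name = path.getName();
    return !name.startsWith("_") && !name.startsWith(".");
  }

  public static void configure(Job job) {
    FileInputFormat.setInputPathFilter(job, VisibleFilesOnlyFilter.class);
  }
}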

wolfgang hoschek added a comment - edited

          Might be best to write a program that generates the list of files and then explicitly provide that file list to the MR job, e.g. via the --input-list option. For example you could use the HDFS version of the Linux file system 'find' command for that (HdfsFindTool doc and code here: https://github.com/cloudera/search/tree/master_1.1.0/search-mr#hdfsfindtool)
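
For what it's worth, a standalone sketch (class name and arguments are illustrative assumptions) that prints the fully qualified paths of the non-hidden files under a directory; its output could be saved to a file and handed to the job via --input-list as suggested above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListVisibleFiles {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // List the immediate children of the given directory, e.g. a Pig AvroStorage output dir.
    for (FileStatus stat : fs.listStatus(new Path(args[0]))) {
      String name = stat.getPath().getName();
      if (!stat.isDirectory() && !name.startsWith("_") && !name.startsWith(".")) {
        System.out.println(stat.getPath().toString());
      }
    }
  }
}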

          wolfgang hoschek added a comment -

          it would be convenient if we could ignore the underscore ("_") hidden files in hdfs as well as the "." hidden files when reading input files from hdfs.

          +1

          ASF subversion and git services added a comment -

          Commit 1552381 from Mark Miller in branch 'dev/trunk'
          [ https://svn.apache.org/r1552381 ]

          SOLR-1301: Update to Kite 0.10 from CDK 0.9

          ASF subversion and git services added a comment -

          Commit 1552398 from Mark Miller in branch 'dev/trunk'
          [ https://svn.apache.org/r1552398 ]

          SOLR-1301: Merge Morphlines modules up to Kite 0.10 and CDK 0.9

          ASF subversion and git services added a comment -

          Commit 1553184 from Mark Miller in branch 'dev/trunk'
          [ https://svn.apache.org/r1553184 ]

          SOLR-1301: Ignore this test on Windows - there is a problem with Windows paths and Morphlines.

          ASF subversion and git services added a comment -

          Commit 1553281 from Steve Rowe in branch 'dev/trunk'
          [ https://svn.apache.org/r1553281 ]

          SOLR-1301: maven config: fix map-reduce test compilation problem by adding dependency on morphline-core's test jar

wolfgang hoschek added a comment -

Also see https://issues.cloudera.org/browse/CDK-262

          ASF subversion and git services added a comment -

          Commit 1555647 from Mark Miller in branch 'dev/trunk'
          [ https://svn.apache.org/r1555647 ]

          SOLR-1301: make debugging these tests a whole lot easier by sending map reduce job logging to std out

          Mark Miller added a comment -

Bah - I think the above commit only works on MapReduce 1 as far as sending the logs to stdout goes. I tried to write some code to do it for MapReduce 2, but I have not been able to figure out how to programmatically get the secret key to hash the request URL for the HTTP logs API.

          ASF subversion and git services added a comment -

          Commit 1556846 from Steve Rowe in branch 'dev/trunk'
          [ https://svn.apache.org/r1556846 ]

          SOLR-1301: IntelliJ config: morphlines-cell Solr contrib needs lucene-core test-scope dependency

          ASF subversion and git services added a comment -

          Commit 1558520 from Mark Miller in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1558520 ]

          SOLR-1301: Add a Solr contrib that allows for building Solr indexes via Hadoop's MapReduce.

          ASF subversion and git services added a comment -

          Commit 1558522 from Mark Miller in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1558522 ]

          SOLR-1301: Ivy likes to act funny if you don't declare compile and test resources in the same dependency.

          ASF subversion and git services added a comment -

          Commit 1558523 from Mark Miller in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1558523 ]

          SOLR-1301: Fix compilation for Java 8 (the Java 8 compiler is more picky, but it's not a Java 8 regression: the code was just wrong)

          ASF subversion and git services added a comment -

          Commit 1558524 from Mark Miller in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1558524 ]

          SOLR-1301: Fix windows problem with escaping of folder name (see crazy https://github.com/typesafehub/config/blob/master/HOCON.md for correct format: string must be quoted and escaped like javascript)

          ASF subversion and git services added a comment -

          Commit 1558525 from Mark Miller in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1558525 ]

          SOLR-1301: Ignore windows tests that cannot work because they use UNIX semantics. Also remove a never-executed test which tests nothing

          ASF subversion and git services added a comment -

          Commit 1558529 from Mark Miller in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1558529 ]

          SOLR-1301: Ignore these tests on java 8 and j9 for now.

          ASF subversion and git services added a comment -

          Commit 1558533 from Mark Miller in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1558533 ]

          SOLR-1301: remove unnecessary (POM-only) dependency org.apache.hadoop:hadoop-yarn-server

          ASF subversion and git services added a comment -

          Commit 1558540 from Mark Miller in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1558540 ]

          SOLR-1301: Straighten out module names so that they match current convention

          ASF subversion and git services added a comment -

          Commit 1558541 from Mark Miller in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1558541 ]

          SOLR-1301: Merge in latest solr-map-reduce updates.

          ASF subversion and git services added a comment -

          Commit 1558544 from Mark Miller in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1558544 ]

          SOLR-1301: Merge in latest morphlines module updates.

          ASF subversion and git services added a comment -

          Commit 1558545 from Mark Miller in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1558545 ]

          SOLR-1301: Clean up.

          ASF subversion and git services added a comment -

          Commit 1558547 from Mark Miller in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1558547 ]

          SOLR-1301: ignore '.iml' in new Solr contribs' directories; put new Solr contribs' lib/ and test-lib/ directories under Subversion control; ignore '.jar' in these directories

          ASF subversion and git services added a comment -

          Commit 1558548 from Mark Miller in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1558548 ]

          SOLR-1301: Fix a couple of bugs around setting up the embedded Solr instance.

          ASF subversion and git services added a comment -

          Commit 1558551 from Mark Miller in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1558551 ]

          SOLR-1301: Update to Morphlines 0.9.0

          ASF subversion and git services added a comment -

          Commit 1558553 from Mark Miller in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1558553 ]

          SOLR-1301: Update jar checksums for Morphlines 0.9.0

          ASF subversion and git services added a comment -

          Commit 1558572 from Mark Miller in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1558572 ]

          SOLR-1301: Update to Kite 0.10 from CDK 0.9

          ASF subversion and git services added a comment -

          Commit 1558580 from Mark Miller in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1558580 ]

          SOLR-1301: Merge Morphlines modules up to Kite 0.10 and CDK 0.9

          ASF subversion and git services added a comment -

          Commit 1558582 from Mark Miller in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1558582 ]

          SOLR-1301: Ignore this test on Windows - there is a problem with Windows paths and Morphlines.

          ASF subversion and git services added a comment -

          Commit 1558584 from Mark Miller in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1558584 ]

          SOLR-1301: maven config: fix map-reduce test compilation problem by adding dependency on morphline-core's test jar

          ASF subversion and git services added a comment -

          Commit 1558586 from Mark Miller in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1558586 ]

          SOLR-1301: make debugging these tests a whole lot easier by sending map reduce job logging to std out

          ASF subversion and git services added a comment -

          Commit 1558588 from Mark Miller in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1558588 ]

          SOLR-1301: IntelliJ config: morphlines-cell Solr contrib needs lucene-core test-scope dependency

          ASF subversion and git services added a comment -

          Commit 1558647 from Mark Miller in branch 'dev/trunk'
          [ https://svn.apache.org/r1558647 ]

          SOLR-1301: Move CHANGES entry to 4.7

          ASF subversion and git services added a comment -

          Commit 1558670 from Mark Miller in branch 'dev/trunk'
          [ https://svn.apache.org/r1558670 ]

          SOLR-1301: Throw an error if HdfsDirectoryFactory is not configured for now.

          ASF subversion and git services added a comment -

          Commit 1558671 from Mark Miller in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1558671 ]

          SOLR-1301: Throw an error if HdfsDirectoryFactory is not configured for now.

          Mark Miller added a comment -

          Major performance issue that relates to this:

          SOLR-5667 Performance problem when not using hdfs block cache.

          ASF subversion and git services added a comment -

          Commit 1567337 from Mark Miller in branch 'dev/trunk'
          [ https://svn.apache.org/r1567337 ]

          SOLR-1301: Implement the set-map-reduce-classpath.sh script.

          ASF subversion and git services added a comment -

          Commit 1567340 from Mark Miller in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1567340 ]

          SOLR-1301: Implement the set-map-reduce-classpath.sh script.

          Christian Moen added a comment -

          I've been reading through (pretty much all) the comments on this JIRA and I'd like to thank you all for the great effort you have put into this.

          ASF subversion and git services added a comment -

          Commit 1568317 from Mark Miller in branch 'dev/trunk'
          [ https://svn.apache.org/r1568317 ]

          SOLR-1301: Add some readme files.

          ASF subversion and git services added a comment -

          Commit 1568318 from Mark Miller in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1568318 ]

          SOLR-1301: Add some readme files.

          Mark Miller added a comment -

Plenty of comments to go through! Time to finally close this issue out. I'll file a new issue for some remaining work.

          Mark Miller added a comment -

          I filed SOLR-5729 for some additional work post 4.7.

          ASF subversion and git services added a comment -

          Commit 1568328 from Mark Miller in branch 'dev/trunk'
          [ https://svn.apache.org/r1568328 ]

          SOLR-1301: Fix spelling.

          ASF subversion and git services added a comment -

          Commit 1568329 from Mark Miller in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1568329 ]

          SOLR-1301: Fix spelling.

          rulinma added a comment -

          mark.


People

• Assignee: Mark Miller
• Reporter: Andrzej Bialecki
• Votes: 30
• Watchers: 63
