  1. Solr
  2. SOLR-5786

MapReduceIndexerTool --help output is missing large parts of the help text



    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 4.7
    • Fix Version/s: 4.8
    • Component/s: contrib - MapReduce
    • Labels: None


      As already mentioned repeatedly and at length, this is a regression introduced by the fix in https://issues.apache.org/jira/browse/SOLR-5605

      Here is the diff of --help output before SOLR-5605 vs after SOLR-5605:

      <                          lucene  segments  left  in   this  index.  Merging
      <                          segments involves reading  and  rewriting all data
      <                          in all these  segment  files, potentially multiple
      <                          times,  which  is  very  I/O  intensive  and  time
      <                          consuming. However, an  index  with fewer segments
      <                          can later be merged  faster,  and  it can later be
      <                          queried  faster  once  deployed  to  a  live  Solr
      <                          serving shard. Set  maxSegments  to  1 to optimize
      <                          the index for low query  latency. In a nutshell, a
      <                          small maxSegments  value  trades  indexing latency
      <                          for subsequently improved query  latency. This can
      <                          be  a  reasonable  trade-off  for  batch  indexing
      <                          systems. (default: 1)
      <   --fair-scheduler-pool STRING
      <                          Optional tuning knob  that  indicates  the name of
      <                          the fair scheduler  pool  to  submit  jobs to. The
      <                          Fair Scheduler is a  pluggable MapReduce scheduler
      <                          that provides a way to  share large clusters. Fair
      <                          scheduling is a method  of  assigning resources to
      <                          jobs such that all jobs  get, on average, an equal
      <                          share of resources  over  time.  When  there  is a
      <                          single job  running,  that  job  uses  the  entire
      <                          cluster. When  other  jobs  are  submitted,  tasks
      <                          slots that free up are  assigned  to the new jobs,
      <                          so that each job gets  roughly  the same amount of
      <                          CPU time.  Unlike  the  default  Hadoop scheduler,
      <                          which forms a queue of  jobs, this lets short jobs
      <                          finish in reasonable time  while not starving long
      <                          jobs. It is also an  easy  way  to share a cluster
      <                          between multiple of users.  Fair  sharing can also
      <                          work with  job  priorities  -  the  priorities are
      <                          used as  weights  to  determine  the  fraction  of
      <                          total compute time that each job gets.
      <   --dry-run              Run in local mode  and  print  documents to stdout
      <                          instead of loading them  into  Solr. This executes
      <                          the  morphline  in  the  client  process  (without
      <                          submitting a job  to  MR)  for  quicker turnaround
      <                          during early  trial  &  debug  sessions. (default:
      <                          false)
      <   --log4j FILE           Relative or absolute  path  to  a log4j.properties
      <                          config file on the  local  file  system. This file
      <                          will  be  uploaded  to   each  MR  task.  Example:
      <                          /path/to/log4j.properties
      <   --verbose, -v          Turn on verbose output. (default: false)
      <   --show-non-solr-cloud  Also show options for  Non-SolrCloud  mode as part
      <                          of --help. (default: false)
      < Required arguments:
      <   --output-dir HDFS_URI  HDFS directory to  write  Solr  indexes to. Inside
      <                          there one  output  directory  per  shard  will  be
      <                          generated.    Example:     hdfs://c2202.mycompany.
      <                          com/user/$USER/test
      <   --morphline-file FILE  Relative or absolute path  to  a local config file
      <                          that contains one  or  more  morphlines.  The file
      <                          must     be      UTF-8      encoded.      Example:
      <                          /path/to/morphline.conf
      < Cluster arguments:
      <   Arguments that provide information about your Solr cluster. 
      <   --zk-host STRING       The address of a ZooKeeper  ensemble being used by
      <                          a SolrCloud cluster. This  ZooKeeper ensemble will
      <                          be examined  to  determine  the  number  of output
      <                          shards to create  as  well  as  the  Solr  URLs to
      <                          merge the output shards into  when using the --go-
      <                          live option. Requires that  you  also  pass the --
      <                          collection to merge the shards into.
      <                          The   --zk-host   option   implements   the   same
      <                          partitioning semantics as  the  standard SolrCloud
      <                          Near-Real-Time (NRT)  API.  This  enables  to  mix
      <                          batch  updates  from   MapReduce   ingestion  with
      <                          updates from standard  Solr  NRT  ingestion on the
      <                          same SolrCloud  cluster,  using  identical  unique
      <                          document keys.
      <                          Format is: a  list  of  comma  separated host:port
      <                          pairs,  each  corresponding   to   a   zk  server.
      <                          Example: ',,
      <                          2183' If the optional  chroot  suffix  is used the
      <                          example  would  look  like:  ',
      <                ,'     where
      <                          the client would  be  rooted  at  '/solr'  and all
      <                          paths would  be  relative  to  this  root  -  i.e.
      <                          getting/setting/etc... '/foo/bar' would  result in
      <                          operations being run on  '/solr/foo/bar' (from the
      <                          server perspective).
      < Go live arguments:
      <   Arguments for  merging  the  shards  that  are  built  into  a  live Solr
      <   cluster. Also see the Cluster arguments.
      <   --go-live              Allows you to  optionally  merge  the  final index
      <                          shards into a  live  Solr  cluster  after they are
      <                          built. You can pass the  ZooKeeper address with --
      <                          zk-host and the relevant  cluster information will
      <                          be auto detected.  (default: false)
      <   --collection STRING    The SolrCloud  collection  to  merge  shards  into
      <                          when  using  --go-live   and  --zk-host.  Example:
      <                          collection1
      <   --go-live-threads INTEGER
      <                          Tuning knob that indicates  the  maximum number of
      <                          live merges  to  run  in  parallel  at  one  time.
      <                          (default: 1000)

      As already mentioned repeatedly and at length, this bug is caused by a change related to buffer flushing in argparse4j >= 0.4.2.

      The fix is to apply CDH-16434 to MapReduceIndexerTool.java as follows:

      -            parser.printHelp(new PrintWriter(System.out));  
      +            parser.printHelp();
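
      The flushing behaviour behind this can be sketched in isolation. The following is a minimal illustration, not the argparse4j code path itself: it assumes the missing help text was still sitting in the PrintWriter's internal buffer when the process exited. The class name FlushDemo and its helper method are hypothetical, and a ByteArrayOutputStream stands in for System.out so the effect can be measured.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintWriter;

public class FlushDemo {

    /**
     * Writes text through a PrintWriter created without auto-flush and
     * reports how many bytes reached the underlying stream before and
     * after an explicit flush().
     */
    static int[] bytesBeforeAndAfterFlush(String text) {
        // Stand-in for System.out so that the byte count can be inspected.
        ByteArrayOutputStream sink = new ByteArrayOutputStream();

        // Same constructor shape as new PrintWriter(System.out): buffered, no auto-flush.
        PrintWriter writer = new PrintWriter(sink);
        writer.print(text);

        int before = sink.size(); // short text is still held in the writer's buffer
        writer.flush();           // pushes the buffered characters to the sink
        int after = sink.size();

        return new int[] {before, after};
    }

    public static void main(String[] args) {
        int[] sizes = bytesBeforeAndAfterFlush("--help text");
        System.out.println("bytes before flush: " + sizes[0]);
        System.out.println("bytes after flush:  " + sizes[1]);
    }
}
```

      Under that assumption, whoever creates the PrintWriter is responsible for flushing it. The one-line change above sidesteps the problem by not creating a writer in the caller at all; keeping the writer in a local variable and calling flush() on it after printHelp would be another way to get the same effect.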


      Assignee: Mark Miller (markrmiller@gmail.com)
      Reporter: Wolfgang Hoschek (whoschek)