Lucene - Core / LUCENE-1209

If setConfig(Config config) is called in resetInputs(), you can turn term vectors off and on by round

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Trivial
    • Resolution: Fixed
    • Affects Version/s: 2.4
    • Fix Version/s: None
    • Component/s: modules/benchmark
    • Labels:
      None
    • Lucene Fields:
      Patch Available

      Description

      I want to be able to run one benchmark that tests things using term vectors and not using term vectors.

      Currently this is not easy because you cannot specify term vectors per round.

      You do have to create a new index per round, but I would still prefer this automation to running two separate tests.

      If it doesn't affect anything else, it would be great to have setConfig(Config config) called in BasicDocMaker.resetInputs(). This would keep the term vector options up to date per round if you reset.

      - Mark
      1. reset_config.patch
        0.6 kB
        Mark Miller
      2. reset_config.patch
        4 kB
        Doron Cohen
      3. reset_config.patch
        6 kB
        Doron Cohen

        Activity

        Doron Cohen added a comment -

        Committed, thanks Mark!

        Doron Cohen added a comment -

        QualityTest fails with the previous patch, exposing a related bug in ReutersDocMaker:
        it does not reset its list of files when setConfig() is called. That was not required before,
        but now that setConfig() is called more than once, the list of collected files must be cleared.
        The attached file fixes this, and all benchmark tests pass.
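
The kind of bug described here can be shown with a minimal sketch. The class and method names below are hypothetical and are not the actual ReutersDocMaker code; the sketch only assumes a doc maker that collects input files into a list each time its configuration is set.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; NOT the real ReutersDocMaker.
class FileListSketch {
    /** A doc maker that gathers its input files whenever it is configured. */
    static class CollectingDocMaker {
        private final List<String> inputFiles = new ArrayList<>();
        private final boolean clearOnSetConfig;

        CollectingDocMaker(boolean clearOnSetConfig) {
            this.clearOnSetConfig = clearOnSetConfig;
        }

        void setConfig(List<String> filesInDocsDir) {
            if (clearOnSetConfig) {
                inputFiles.clear(); // the fix: start from an empty list
            }
            inputFiles.addAll(filesInDocsDir);
        }

        int fileCount() {
            return inputFiles.size();
        }
    }

    /** Calls setConfig the given number of times and returns the list size. */
    static int countAfter(boolean clearOnSetConfig, int calls) {
        List<String> files = Arrays.asList("a.txt", "b.txt", "c.txt");
        CollectingDocMaker maker = new CollectingDocMaker(clearOnSetConfig);
        for (int i = 0; i < calls; i++) {
            maker.setConfig(files);
        }
        return maker.fileCount();
    }

    public static void main(String[] args) {
        System.out.println("without clearing: " + countAfter(false, 2));
        System.out.println("with clearing:    " + countAfter(true, 2));
    }
}
```

Without clearing, every extra setConfig call appends the same files again, so a maker that was only ever configured once would never have shown the problem.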

        Doron Cohen added a comment -

        same fix + test case that fails without the fix.

        Doron Cohen added a comment -

        OK, I can see it now, you're right.
        So all per-round doc maker settings were ignored and the first round's settings were always used.
        I am updating TestPerfTasksLogic.testIndexWriterSettings() to catch this bug.
        Thanks for catching this,
        Doron

        Mark Miller added a comment - - edited

        My algorithm is below.

        I see "Round 0->1: doc.term.vector:false->true" as well. However, if I put a debug print on what is returned from public boolean get (String name, boolean dflt), it is only ever called once for "doc.term.vector", as well as for the other properties read in setConfig.

        More importantly, let's say I set it to true:false. If I look at the work/index directory on the second run, there are certainly term vectors. That's how I noticed this to begin with: I was looking at the index and saw the term vector files on every round. It's possible I have something messed up, but every time I run through everything again, it really does not seem to be working. If I set term vectors to false:true, they are never made in any round.

        >>Mark you are right that setConfig is called just once, at start.
        >>At least for setting properties by round this should be sufficient.
        >>I wonder why this doesn't work for you.

        I think this admits the problem, right? The get for every property in setConfig is only called once. That loads up the "false:true", returns false, and sets up "true" to be returned on the next call. The next time you call get on Config you would get the "true", but there is no next time. It's only done once, so it shows up correctly in the output "Round 0->1: doc.term.vector:false->true", but get is only ever called once and so only loads false.

        - Mark

        ram.flush.mb=flush:32:32
        compound=false
        
        analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer
        directory=FSDirectory
        
        doc.stored=true
        doc.tokenized=tok:false:true
        doc.term.vector=vec:true:false
        doc.term.vector.offsets=tvo:false:true
        doc.term.vector.positions=tvp:false:true
        doc.add.log.step=2000
        
        docs.dir=reuters-out
        
        doc.maker=org.apache.lucene.benchmark.byTask.feeds.ReutersDocMaker
        
        query.maker=org.apache.lucene.benchmark.byTask.feeds.ReutersQueryMaker
        
        # task at this depth or less would print when they start
        task.max.depth.log=2
        
        log.queries=true
        # -------------------------------------------------------------------------------------
        
        { "Rounds"
              
            ResetSystemErase
        
                CreateIndex
                { "MAddDocs" AddDoc(60) } : 20000
                Optimize
                CloseIndex
          
            OpenReader
              { "SrchTrvRetNewRdr" SearchTravRet(10) > : 1000
            CloseReader
            OpenReader
              { "SearchHlgtSameRdr" SearchTravRetHighlight(size[20],highlight[20],mergeContiguous[true],maxFrags[0],fields[body]) > : 1000
        
            CloseReader
        
            RepSumByPref SearchHlgtSameRdr
        
            NewRound
        
        } : 2
        
        RepSumByNameRound
        RepSumByName
        RepSumByPrefRound MAddDocs
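
The behaviour described in this comment can be illustrated with a small sketch. The Config and DocMaker classes below are hypothetical stand-ins, not the Lucene benchmark classes; the sketch assumes only what the comment states: the property is read inside setConfig, and resetInputs runs once per round.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for illustration only; these are NOT the real
// Lucene benchmark Config or BasicDocMaker classes.
class RoundConfigSketch {
    /** Holds one value per round and returns the value for the current round. */
    static class Config {
        private final boolean[] valuesByRound;
        private int round = 0;

        Config(boolean... valuesByRound) {
            this.valuesByRound = valuesByRound;
        }

        boolean get() {
            return valuesByRound[round % valuesByRound.length];
        }

        void newRound() {
            round++;
        }
    }

    /** Caches the property at the moment setConfig is called. */
    static class DocMaker {
        private Config config;
        private boolean termVector;
        private final boolean rereadOnReset;

        DocMaker(boolean rereadOnReset) {
            this.rereadOnReset = rereadOnReset;
        }

        void setConfig(Config config) {
            this.config = config;
            this.termVector = config.get(); // the value is read only here
        }

        void resetInputs() {
            if (rereadOnReset) {
                setConfig(config); // the change proposed in this issue
            }
        }

        boolean usesTermVector() {
            return termVector;
        }
    }

    /** Runs the given number of rounds and records the setting used in each. */
    static List<Boolean> run(boolean rereadOnReset, int rounds) {
        Config config = new Config(true, false); // like vec:true:false
        DocMaker maker = new DocMaker(rereadOnReset);
        maker.setConfig(config);
        List<Boolean> seen = new ArrayList<>();
        for (int i = 0; i < rounds; i++) {
            maker.resetInputs();
            seen.add(maker.usesTermVector());
            config.newRound();
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println("setConfig once:      " + run(false, 2));
        System.out.println("setConfig per reset: " + run(true, 2));
    }
}
```

In the first case the config object itself advances to the next round, which matches the "Round 0->1" log line, but the doc maker keeps the value it cached at startup.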
        
        Doron Cohen added a comment -

        Mark you are right that setConfig is called just once, at start.
        At least for setting properties by round this should be sufficient.
        I wonder why this doesn't work for you.

        I tried with this one:

        compound=true
        
        analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer
        directory=RamDirectory
        
        doc.stored=true
        doc.tokenized=true
        doc.term.vector=termVec:false:true
        doc.add.log.step=10
        
        doc.maker=org.apache.lucene.benchmark.byTask.feeds.SimpleDocMaker
        task.max.depth.log=1
        
        {
        
            { "Populate"
                CreateIndex
                { AddDoc > : 50
                Optimize
                CloseIndex
            >
        
            ResetSystemErase
            NewRound
        
        } : 2
        
        RepSumByName
        RepSelectByPref Populate
        

        And got this output:

         Working Directory: work
         Running algorithm from: conf\termVecByRound.alg
         ------------> config properties:
         analyzer = org.apache.lucene.analysis.standard.StandardAnalyzer
         compound = true
         directory = RamDirectory
         doc.add.log.step = 10
         doc.maker = org.apache.lucene.benchmark.byTask.feeds.SimpleDocMaker
         doc.stored = true
         doc.term.vector = termVec:false:true
         doc.tokenized = true
         task.max.depth.log = 1
         work.dir = work
         -------------------------------
         ------------> algorithm:
         Seq {
             Seq_2 {
                 Populate {
                     CreateIndex
                     Seq_50 {
                         AddDoc
                     > * 50
                     Optimize
                     CloseIndex
                 >
                 ResetSystemErase
                 NewRound
             } * 2
             RepSumByName
             RepSelectByPref Populate
         }
         
         ------------> starting task: Seq
         ------------> starting task: Seq_2
         --> 0.1 sec: main processed (add) 10 docs
         --> 0.1 sec: main processed (add) 20 docs
         --> 0.11 sec: main processed (add) 30 docs
         --> 0.11 sec: main processed (add) 40 docs
         --> 0.11 sec: main processed (add) 50 docs
         ------------> SimpleDocMaker statistics (0): 
         num docs added since last inputs reset:                   50
         total bytes added since last inputs reset:             42,150
         
         
         
         --> Round 0-->1:   doc.term.vector:false-->true
         
         --> 0 sec: main processed (add) 60 docs
         --> 0 sec: main processed (add) 70 docs
         --> 0 sec: main processed (add) 80 docs
         --> 0 sec: main processed (add) 90 docs
         --> 0 sec: main processed (add) 100 docs
         ------------> SimpleDocMaker statistics (1): 
         num docs added since last inputs reset:                   50
         total bytes added since last inputs reset:             42,150
         
         
         
         --> Round 1-->2:   doc.term.vector:true-->false
         
         
         ------------> Report Sum By (any) Name (2 about 3 out of 4)
         Operation   round termVec   runCnt   recsPerRun        rec/s  elapsedSec    avgUsedMem    avgTotalMem
         Seq_2           0   false        1          106        530.0        0.20       639,912      5,177,344
         Populate        -       -        2           53        706.7        0.15       839,552      5,177,344
         
         
         ------------> Report Select By Prefix (Populate) (2 about 2 out of 4)
         Operation   round termVec   runCnt   recsPerRun        rec/s  elapsedSec    avgUsedMem    avgTotalMem
         Populate        0   false        1           53        378.6        0.14       858,080      5,177,344
         Populate -  -   1 -  true -  -   1 -  -  -   53 -  - 5,300.0 -  -   0.01 -  -  821,024 -  - 5,177,344
         
         ####################
         ###  D O N E !!! ###
         ####################
        

        Note in particular this line:

        [java] --> Round 0-->1:   doc.term.vector:false-->true 
        

        Note that a NewRound command is required in order for the round number to change.

            NewRound
        

        A possible cause for error is that the property definition parsing requires a property name prefix for multi-valued properties.
        So this would not work as expected:

        doc.term.vector=false:true
        

        But this will work:

        doc.term.vector=termVec:false:true
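
Why the prefix matters can be sketched as follows. This is a simplified illustration, not the benchmark's actual parsing code; it assumes that the text before the first ':' is taken as the name and everything after it as the per-round values. The class and method names are hypothetical.

```java
import java.util.Arrays;

// Hypothetical illustration of a "name:value:value" property definition.
class PrefixParseSketch {
    /** Returns the text before the first ':', or null if there is no ':'. */
    static String columnName(String definition) {
        int k = definition.indexOf(':');
        return k < 0 ? null : definition.substring(0, k);
    }

    /** Returns the per-round values: everything after the first ':'. */
    static String[] values(String definition) {
        int k = definition.indexOf(':');
        if (k < 0) {
            return new String[] { definition }; // single-valued property
        }
        return definition.substring(k + 1).split(":");
    }

    public static void main(String[] args) {
        for (String def : new String[] { "termVec:false:true", "false:true" }) {
            System.out.println(def + " -> name=" + columnName(def)
                    + " values=" + Arrays.toString(values(def)));
        }
    }
}
```

Under that assumption, the unprefixed form loses its first value to the name slot and is left with a single value for every round.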
        

        If it still doesn't work for you, can you post here the algorithm?

        Mark Miller added a comment -

        It seems to me that it's not working right. Everything that is set in public void setConfig(Config config) is only set once for me, not per round, unless I apply the above patch. This means that I cannot seem to set tokenizing, storing, or term vectors per round.

        From what I can tell, this is because setConfig is only called once, so only the first value is ever read for those properties. The patch above calls setConfig in the resetInputs method, which does get called per round. I'm not sure that is the best fix, but I know I cannot currently set those per round and have anything but the first setting take effect.

        - Mark
        Doron Cohen added a comment - - edited

        Config maintains properties by round, so this should do the trick:

        doc.term.vector=tvf:true:false
        

        It sets term-vectors to true in round 0, false in round 1, true in round 2, etc.
        Also, a column is added to the reports with the value of this property ('tvf').
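
The round-by-round cycling described above can be sketched like this. It is a minimal illustration with hypothetical names, assuming the values simply wrap around once the list is exhausted; it is not the Config implementation.

```java
// Hypothetical illustration of cycling through per-round values.
class RoundCycleSketch {
    /** Picks the value for a round, wrapping around when the values run out. */
    static boolean valueForRound(boolean[] valuesByRound, int round) {
        return valuesByRound[round % valuesByRound.length];
    }

    public static void main(String[] args) {
        boolean[] tvf = { true, false }; // the values of tvf:true:false
        for (int round = 0; round < 4; round++) {
            System.out.println("round " + round + ": " + valueForRound(tvf, round));
        }
    }
}
```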

        Unless you already tried this and it didn't work?


          People

          • Assignee:
            Doron Cohen
            Reporter:
            Mark Miller