Lucene - Core / LUCENE-3440

FastVectorHighlighter: IDF-weighted terms for ordered fragments

    Details

    • Lucene Fields:
      New, Patch Available

      Description

      The FastVectorHighlighter gives every term found in a fragment an equal weight. As a result, fragments with a high number of matched words or, in the worst case, a high number of very common words are ranked higher than fragments that contain all of the terms used in the original query.

      This patch provides ordered fragments with IDF-weighted terms:

      total weight = total weight + IDF for unique term per fragment * boost of query;

      The ranking formula should be the same as, or at least similar to, the one used in org.apache.lucene.search.highlight.QueryTermScorer.
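As a rough illustration of the formula above, the following sketch adds one IDF value per unique term in a fragment, multiplied by the query boost. The class name, method name, and the IDF numbers are hypothetical and only serve the example; this is not the code from the patch.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class IdfWeightSketch {

    // Adds idf * boost once per unique term in the fragment.
    // idfByTerm holds precomputed IDF values (an assumption of this sketch).
    static double totalWeight(List<String> fragmentTerms,
                              Map<String, Double> idfByTerm,
                              double queryBoost) {
        Set<String> seen = new HashSet<>();
        double total = 0.0;
        for (String term : fragmentTerms) {
            if (seen.add(term)) { // true only for the first occurrence
                total += idfByTerm.getOrDefault(term, 0.0) * queryBoost;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // Made-up IDF values: the common word "das" gets a low IDF.
        Map<String, Double> idf = new HashMap<>();
        idf.put("das", 1.0);
        idf.put("alte", 3.0);
        idf.put("testament", 4.0);

        List<String> repeated = Arrays.asList("das", "das", "das", "das");
        List<String> complete = Arrays.asList("das", "alte", "testament");

        System.out.println(totalWeight(repeated, idf, 2.0));
        System.out.println(totalWeight(complete, idf, 2.0));
    }
}
```

With equal weights per term, the fragment with four occurrences of a common word would win; with one IDF contribution per unique term, the fragment covering all query terms scores higher.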

      The patch is simple, but it works for us.

      Some ideas:

      • A better approach would be moving the whole fragments-scoring into a separate class.
      • Switch scoring via parameter
      • Exact phrases should be given an even better score, regardless of whether a phrase-query was executed or not
      • edismax/dismax-parameters pf, ps and pf^boost should be observed and corresponding fragments should be ranked higher
      1. weight-vs-boost_table02.html
        1 kB
        Sebastian Lutze
      2. weight-vs-boost_table01.html
        0.6 kB
        Sebastian Lutze
      3. LUCENE-4.0-SNAPSHOT-3440-9.patch
        63 kB
        Sebastian Lutze
      4. LUCENE-3440.patch
        60 kB
        Koji Sekiguchi
      5. LUCENE-3440.patch
        61 kB
        Sebastian Lutze
      6. LUCENE-3440.patch
        64 kB
        Sebastian Lutze
      7. LUCENE-3440_3.6.1-SNAPSHOT.patch
        76 kB
        Sebastian Lutze

        Activity

        Sebastian Lutze added a comment -

        Works for lucene_solr_branch_3x.

        Sebastian Lutze added a comment -

        Oops, wrong patch ... here's the right one.

        Koji Sekiguchi added a comment -

        I think this is an interesting point of view, thanks! But I couldn't apply the patch to the latest trunk:

        [koji@MacBook LUCENE-3440]$ patch -p0 --dry-run < LUCENE-3440.patch 
        patching file lucene/contrib/highlighter/src/java/org/apache/lucene/search/vectorhighlight/FieldFragList.java
        patching file lucene/contrib/highlighter/src/java/org/apache/lucene/search/vectorhighlight/FieldPhraseList.java
        patching file lucene/contrib/highlighter/src/java/org/apache/lucene/search/vectorhighlight/FieldTermStack.java
        Hunk #1 FAILED at 31.
        Hunk #2 FAILED at 96.
        Hunk #3 FAILED at 108.
        Hunk #4 succeeded at 148 (offset -9 lines).
        3 out of 4 hunks FAILED -- saving rejects to file lucene/contrib/highlighter/src/java/org/apache/lucene/search/vectorhighlight/FieldTermStack.java.rej
        

        Can you verify that?

        Sebastian Lutze added a comment - - edited

        No, I can't verify that. It's my first patch, so maybe I did something wrong. The patch was built from branch_3x with the Subversion plug-in for Eclipse. I checked out today's branch_3x (Import -> SVN -> Checkout projects ...) a few minutes ago and patched it (Team -> Apply patch). No problem with my setup.

        Another approach:

        Assuming a user searches for a single word, they would rather see fragments with an accumulation of that word:

        Boost with number of distinct terms per fragment
        for( WeightedPhraseInfo phraseInfo : phraseInfoList ){
         SubInfo subInfo = new SubInfo( phraseInfo.text, phraseInfo.termsOffsets, phraseInfo.seqnum );
         subInfos.add( subInfo );
                
         Iterator it = phraseInfo.termInfos.iterator();
         TermInfo ti;
                
         while ( it.hasNext() ) {
          ti = ( TermInfo ) it.next();
          distinctTerms.add( ti.text );
          totalBoost += ti.weight * phraseInfo.boost;
         }
        }
        totalBoost *= distinctTerms.size();
        
        Koji Sekiguchi added a comment -

        Ah, I see. I was looking at trunk, but you made the patch for 3x. I'll take a look.

        Sebastian Lutze added a comment -

        Here is another patch.

        • The calculation of WeightedFragInfo.totalBoost remains unmodified
        • A new field WeightedFragInfo.totalWeight has been introduced
        • A new class, WeightOrderFragmentsBuilder, now sorts by WeightedFragInfo.totalWeight
        Koji Sekiguchi added a comment -

        Hi,

        1. Which patch do you want me to try?
        2. Can you make that for trunk branch?
        Sebastian Lutze added a comment -

        Patch for branch_3x.

        Sebastian Lutze added a comment -

        Patch for trunk.

        Sebastian Lutze added a comment - - edited

        Hi Koji,

        1. Which patch do you want me to try?

        It doesn't matter. This is the first time in a long while that I have worked with trunk. I'm looking forward to the new admin-interface in solr/lucene-4.0!

        2. Can you make that for trunk branch?

        Here we go. This version is slightly different: the weight is now boosted by the normalized number of terms per fragment:

        for( WeightedPhraseInfo phraseInfo : phraseInfoList ){
         SubInfo subInfo = new SubInfo( phraseInfo.text, phraseInfo.termsOffsets, phraseInfo.seqnum );
         subInfos.add( subInfo );
         Iterator it = phraseInfo.termInfos.iterator();
         TermInfo ti;
         totalBoost += phraseInfo.boost;
         while ( it.hasNext() ) {
          ti = ( TermInfo ) it.next();
          // add each distinct term's squared weight only once per fragment
          if ( uniqueTerms.add( ti.text ) )
           totalWeight += Math.pow(ti.weight, 2) * phraseInfo.boost;
          // every occurrence counts toward the number of terms per fragment
          termsPerFrag++;
         }
        }
        // equivalent to multiplying by Math.sqrt( termsPerFrag )
        totalWeight *= termsPerFrag * ( 1 / Math.sqrt( termsPerFrag ) );
        

        Due to a significant lack of mathematical knowledge, a very intuitive solution.
        But it seems to work very well, at least for our data (highly multilingual, mostly historical, dirty OCRed books, journals and papers).

        Koji Sekiguchi added a comment -

        Patch looks great! A few comments:

        1. For the new totalWeight, add getter method and modify toString() in WeightedFragInfo().
        2. The patch uses hard-coded DefaultSimilarity to calculate idf. I don't think that a custom similarity can be used here, too. If so, how about just copying idf method rather than creating a similarity object?
        3. Please do not hesitate to update ScoreComparator (do not add WeightOrderFragmentsBuilder)
        4. Could you update package javadoc ( https://builds.apache.org//job/Lucene-trunk/javadoc/contrib-highlighter/org/apache/lucene/search/vectorhighlight/package-summary.html#package_description ) and insert totalWeight into description and figures.
        5. use docFreq(String field, BytesRef term) version for trunk to avoid creating Term object.

        Due to a significant lack of mathematical knowledge, a very intuitive solution.

        I agree. I think if there is a table so that we can compare totalBoost (current) and totalWeight (patch) with real values, it helps a lot.

        Sebastian Lutze added a comment -

        Patch looks great!

        Thanks.

        1. For the new totalWeight, add getter method and modify toString() in WeightedFragInfo().

        Okay.

        2. The patch uses hard-coded DefaultSimilarity to calculate idf. I don't think that a custom similarity can be used here, too. If so, how about just copying idf method rather than creating a similarity object?

        I played a little with log((numDocs - docFreq + 0.5) / (docFreq + 0.5)) but it seems to make no difference. If I'm not mistaken, there is no method IndexReader.getSimilarity() or IndexReader.getDefaultSimilarity().

        Therefore: Okay.
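To make the two IDF variants in this exchange concrete, here is a small sketch that evaluates both for a few document frequencies. It assumes that the hard-coded DefaultSimilarity computes idf as 1 + ln(numDocs / (docFreq + 1)), and it reads the alternative formula as log((numDocs - docFreq + 0.5) / (docFreq + 0.5)); the class and method names are hypothetical.

```java
final class IdfVariantsSketch {

    // Classic Lucene-style idf, as assumed for DefaultSimilarity.
    static double classicIdf(long docFreq, long numDocs) {
        return 1.0 + Math.log(numDocs / (double) (docFreq + 1));
    }

    // Alternative idf with 0.5 smoothing, as read from the comment above.
    static double bm25StyleIdf(long docFreq, long numDocs) {
        return Math.log((numDocs - docFreq + 0.5) / (docFreq + 0.5));
    }

    public static void main(String[] args) {
        long numDocs = 1000;
        long[] docFreqs = {1, 10, 100, 600};
        for (long df : docFreqs) {
            System.out.println("docFreq=" + df
                    + " classic=" + classicIdf(df, numDocs)
                    + " bm25Style=" + bm25StyleIdf(df, numDocs));
        }
    }
}
```

Both variants decrease as docFreq grows, so rare terms rank above common ones either way, which may explain why swapping them made little visible difference. One thing to keep in mind is that the second variant turns negative once a term occurs in more than half of the documents.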

        3. Please do not hesitate to update ScoreComparator (do not add WeightOrderFragmentsBuilder)

        Hm, I thought about something like that:

        <highlighting>
          <fragmentsBuilder name="ordered" class="org.apache.solr.highlight.ScoreOrderFragmentsBuilder" default="false"/>
          <fragmentsBuilder name="weighted" class="org.apache.solr.highlight.WeightOrderFragmentsBuilder" default="true"/>
        </highlighting>
        

        For Solr-users (like me). If somebody would like to use the boost-based ordering, he could. Maybe, for some use-cases the boost-based approach is better than the weighted one.

        4. Could you update package javadoc ( https://builds.apache.org//job/Lucene-trunk/javadoc/contrib-highlighter/org/apache/lucene/search/vectorhighlight/package-summary.html#package_description ) and insert totalWeight into description and figures.

        Okay.

        5. use docFreq(String field, BytesRef term) version for trunk to avoid creating Term object.

        Okay.

        I agree. I think if there is a table so that we can compare totalBoost (current) and totalWeight (patch) with real values, it helps a lot.

        I'll write a proof-of-concept test class. But this may take some time.

        I discovered a little problem with overlapping terms, depending on the analyzing-process:

        WeightedPhraseInfo.addIfNoOverlap() drops the second part of hyphenated words (for example: social-economics). The result is that all information in TermInfo is lost and not available for computing the fragment's weight. I simply modified WeightedPhraseInfo.addIfNoOverlap() a little to change this behavior:

        void addIfNoOverlap( WeightedPhraseInfo wpi ){
         for( WeightedPhraseInfo existWpi : phraseList ){
          if( existWpi.isOffsetOverlap( wpi ) ) {
           existWpi.termInfos.addAll( wpi.termInfos );
           return;
          }
         }
         phraseList.add( wpi );
        }
        

        But I am not sure whether there could be some unforeseen side effects.
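The behavior of the modified method can be tried out in isolation with a simplified stand-in. In the sketch below, PhraseInfo, its fields, and the overlap test are hypothetical simplifications (the real WeightedPhraseInfo.isOffsetOverlap() may differ in detail), and the offsets are made up.

```java
import java.util.ArrayList;
import java.util.List;

final class OverlapMergeSketch {

    // Simplified stand-in for WeightedPhraseInfo: offsets plus term texts.
    static final class PhraseInfo {
        final int start;
        final int end;
        final List<String> terms = new ArrayList<>();

        PhraseInfo(int start, int end, String term) {
            this.start = start;
            this.end = end;
            this.terms.add(term);
        }

        // Assumed overlap rule: half-open ranges [start, end) intersect.
        boolean overlaps(PhraseInfo other) {
            return start < other.end && other.start < end;
        }
    }

    // Mirrors the modified method: on overlap, keep the existing phrase
    // but carry over the term information of the dropped one.
    static void addIfNoOverlap(List<PhraseInfo> phraseList, PhraseInfo candidate) {
        for (PhraseInfo existing : phraseList) {
            if (existing.overlaps(candidate)) {
                existing.terms.addAll(candidate.terms);
                return;
            }
        }
        phraseList.add(candidate);
    }

    public static void main(String[] args) {
        List<PhraseInfo> phrases = new ArrayList<>();
        addIfNoOverlap(phrases, new PhraseInfo(0, 16, "social-economics"));
        addIfNoOverlap(phrases, new PhraseInfo(7, 16, "economics")); // overlaps
        addIfNoOverlap(phrases, new PhraseInfo(20, 27, "history"));  // separate
        System.out.println(phrases.size() + " phrases, first has terms " + phrases.get(0).terms);
    }
}
```

Note that the merged entry keeps the offsets of the first phrase; only the term information is carried over, so highlighting positions stay as before while the weight calculation sees both terms.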

        Koji Sekiguchi added a comment -

        Hm, I thought about something like that:

        <highlighting>
          <fragmentsBuilder name="ordered" class="org.apache.solr.highlight.ScoreOrderFragmentsBuilder" default="false"/>
          <fragmentsBuilder name="weighted" class="org.apache.solr.highlight.WeightOrderFragmentsBuilder" default="true"/>
        </highlighting>
        

        For Solr-users (like me). If somebody would like to use the boost-based ordering, he could. Maybe, for some use-cases the boost-based approach is better than the weighted one.

        I thought that, too. But I saw the following in the patch:

        public List<WeightedFragInfo> getWeightedFragInfoList( List<WeightedFragInfo> src ) {
            Collections.sort( src, new ScoreComparator() );
        //    Collections.sort( src, new WeightComparator() );
            return src;
        }
        

        And I thought you wanted to use WeightComparator from ScoreOrderFragmentsBuilder.

        Well then, let's introduce WeightOrderFragmentsBuilder.
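The difference between the two orderings comes down to which field the comparator reads. The following sketch uses a hypothetical FragInfo class and two comparators; the three sample rows use totalWeight and totalBoost values from the comparison tables posted later in this issue. It is an illustration only, not the ScoreComparator or WeightComparator from the patch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class FragmentOrderSketch {

    // Hypothetical stand-in for WeightedFragInfo.
    static final class FragInfo {
        final String text;
        final double totalBoost;
        final double totalWeight;

        FragInfo(String text, double totalBoost, double totalWeight) {
            this.text = text;
            this.totalBoost = totalBoost;
            this.totalWeight = totalWeight;
        }
    }

    // Highest totalBoost first (boost-based ordering).
    static final Comparator<FragInfo> BY_BOOST =
            (a, b) -> Double.compare(b.totalBoost, a.totalBoost);

    // Highest totalWeight first (weight-based ordering).
    static final Comparator<FragInfo> BY_WEIGHT =
            (a, b) -> Double.compare(b.totalWeight, a.totalWeight);

    // Returns a sorted copy so the input list is left untouched.
    static List<FragInfo> ordered(List<FragInfo> src, Comparator<FragInfo> order) {
        List<FragInfo> copy = new ArrayList<>(src);
        copy.sort(order);
        return copy;
    }

    public static void main(String[] args) {
        List<FragInfo> frags = Arrays.asList(
                new FragInfo("das testament", 2.0, 2.9178061),
                new FragInfo("das das das das", 4.0, 1.5566137),
                new FragInfo("das alte testament", 3.0, 5.799069));
        System.out.println("by boost:  " + ordered(frags, BY_BOOST).get(0).text);
        System.out.println("by weight: " + ordered(frags, BY_WEIGHT).get(0).text);
    }
}
```

Keeping both comparators available is what makes the two-builder configuration sensible: users can choose the ordering per fragmentsBuilder instead of having one hard-wired.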

        Sebastian Lutze added a comment -

        Patch for 3.5. Docs are still missing.

        Sebastian Lutze added a comment -

        WeightOrderFragmentsBuilder_table01.html:
        A one-word-query for 'testament'. Obviously, the sum-of-distinct-weights-approach makes no difference to the existing one.

        Sebastian Lutze added a comment - - edited

        WeightOrderFragmentsBuilder_table02.html:
        A multi-word-query for 'das alte testament'. Obviously, the sum-of-boosts-approach scores "das das das das" higher than "das alte testament".

        Sebastian Lutze added a comment - - edited

        LUCENE-3.5-SNAPSHOT-3440-6-ProofOfConcept.java:
        The two tables are created by this simple class. I took, representatively, some single pages as documents from our book-stock to build a "bag-of-words".

        Sebastian Lutze added a comment -

        Hm, I tried to do that all with trunk but:

        29.09.2011 15:43:09 org.apache.solr.common.SolrException log
        SEVERE: java.lang.VerifyError: class org.apache.lucene.analysis.ReusableAnalyzerBase overrides final method tokenStream.(Ljava/lang/String;Ljava/io/Reader;)Lorg/apache/lucene/analysis/TokenStream;
        	at java.lang.ClassLoader.defineClass1(Native Method)
        	at java.lang.ClassLoader.defineClassCond(ClassLoader.java:632)
        	at java.lang.ClassLoader.defineClass(ClassLoader.java:616)
        	at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141)
        	at org.apache.catalina.loader.WebappClassLoader.findClassInternal(WebappClassLoader.java:2733)
        	at org.apache.catalina.loader.WebappClassLoader.findClass(WebappClassLoader.java:1124)
        	at org.apache.catalina.loader.WebappClassLoader.loadClass(WebappClassLoader.java:1612)
        	at org.apache.catalina.loader.WebappClassLoader.loadClass(WebappClassLoader.java:1491)
        	at java.lang.ClassLoader.defineClass1(Native Method)
        	at java.lang.ClassLoader.defineClassCond(ClassLoader.java:632)
        	at java.lang.ClassLoader.defineClass(ClassLoader.java:616)
        	at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141)
        	at org.apache.catalina.loader.WebappClassLoader.findClassInternal(WebappClassLoader.java:2733)
        	at org.apache.catalina.loader.WebappClassLoader.findClass(WebappClassLoader.java:1124)
        	at org.apache.catalina.loader.WebappClassLoader.loadClass(WebappClassLoader.java:1612)
        	at org.apache.catalina.loader.WebappClassLoader.loadClass(WebappClassLoader.java:1491)
        	at java.lang.Class.forName0(Native Method)
        	at java.lang.Class.forName(Class.java:247)
        	at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:403)
        	at org.apache.solr.core.SolrCore.createInstance(SolrCore.java:407)
        	at org.apache.solr.core.SolrCore.createInitInstance(SolrCore.java:456)
        	at org.apache.solr.core.SolrCore.initPlugins(SolrCore.java:1653)
        	at org.apache.solr.core.SolrCore.initPlugins(SolrCore.java:1647)
        	at org.apache.solr.core.SolrCore.initPlugins(SolrCore.java:1680)
        	at org.apache.solr.core.SolrCore.loadSearchComponents(SolrCore.java:875)
        	at org.apache.solr.core.SolrCore.<init>(SolrCore.java:574)
        	at org.apache.solr.core.SolrCore.<init>(SolrCore.java:507)
        	at org.apache.solr.core.CoreContainer.create(CoreContainer.java:653)
        	at org.apache.solr.core.CoreContainer.load(CoreContainer.java:407)
        	at org.apache.solr.core.CoreContainer.load(CoreContainer.java:292)
        	at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:241)
        	at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:93)
        	at org.apache.catalina.core.ApplicationFilterConfig.getFilter(ApplicationFilterConfig.java:295)
        	at org.apache.catalina.core.ApplicationFilterConfig.setFilterDef(ApplicationFilterConfig.java:422)
        	at org.apache.catalina.core.ApplicationFilterConfig.<init>(ApplicationFilterConfig.java:115)
        	at org.apache.catalina.core.StandardContext.filterStart(StandardContext.java:4001)
        	at org.apache.catalina.core.StandardContext.start(StandardContext.java:4651)
        	at org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1045)
        	at org.apache.catalina.core.StandardHost.start(StandardHost.java:785)
        	at org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1045)
        	at org.apache.catalina.core.StandardEngine.start(StandardEngine.java:445)
        	at org.apache.catalina.core.StandardService.start(StandardService.java:519)
        	at org.apache.catalina.core.StandardServer.start(StandardServer.java:710)
        	at org.apache.catalina.startup.Catalina.start(Catalina.java:581)
        	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        	at java.lang.reflect.Method.invoke(Method.java:597)
        	at org.apache.catalina.startup.Bootstrap.start(Bootstrap.java:289)
        	at org.apache.catalina.startup.Bootstrap.main(Bootstrap.java:414)
        
        Sebastian Lutze added a comment - - edited

        WeightOrderFragmentsBuilder_table01.html:
        A one-word-query for testament. Obviously, the sum-of-distinct-weights-approach makes no difference to the existing one.

        Terms in fragment      totalWeight   totalBoost
        testament testament    1.8171139     2.0
        testament              1.2848935     1.0
        testament              1.2848935     1.0
        testament              1.2848935     1.0
        testament              1.2848935     1.0
        testament              1.2848935     1.0

        WeightOrderFragmentsBuilder_table02.html:
        A multi-word-query for das alte testament. Obviously, the sum-of-boosts-approach scores das das das das higher than das alte testament.

        Terms in fragment      totalWeight   totalBoost
        das alte testament     5.799069      3.0
        das alte testament     5.799069      3.0
        das testament alte     5.799069      3.0
        das alte testament     5.799069      3.0
        das testament          2.9178061     2.0
        das alte               2.9178061     2.0
        testament testament    1.8171139     2.0
        das das das das        1.5566137     4.0
        das das das            1.348067      3.0
        alte                   1.2848935     1.0
        alte                   1.2848935     1.0
        das das                1.100692      2.0
        das das                1.100692      2.0
        das                    0.77830684    1.0
        das                    0.77830684    1.0
        das                    0.77830684    1.0
        das                    0.77830684    1.0
        das                    0.77830684    1.0
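
The totalWeight column appears to be consistent with the formula from the trunk patch comment: the squared weight of each unique term is added once, and the sum is multiplied by the square root of the number of terms in the fragment. The sketch below checks this against a few rows. It reads the squared per-term weights off the single-term rows (0.77830684 for das, 1.2848935 for alte and testament) and assumes a boost of 1.0 per term; the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class TableCheckSketch {

    // Sum of squared weights over unique terms, times sqrt(term count).
    static double totalWeight(List<String> terms, Map<String, Double> squaredWeights) {
        Set<String> unique = new HashSet<>();
        double sum = 0.0;
        for (String term : terms) {
            if (unique.add(term)) {
                sum += squaredWeights.get(term);
            }
        }
        return sum * Math.sqrt(terms.size());
    }

    // Squared weights taken from the single-term rows of the tables.
    static Map<String, Double> tableWeights() {
        Map<String, Double> w = new HashMap<>();
        w.put("das", 0.77830684);
        w.put("alte", 1.2848935);
        w.put("testament", 1.2848935);
        return w;
    }

    public static void main(String[] args) {
        Map<String, Double> w = tableWeights();
        System.out.println(totalWeight(Arrays.asList("das", "alte", "testament"), w));
        System.out.println(totalWeight(Arrays.asList("das", "testament"), w));
        System.out.println(totalWeight(Arrays.asList("das", "das", "das", "das"), w));
        System.out.println(totalWeight(Arrays.asList("testament", "testament"), w));
    }
}
```

Under these assumptions the computed values agree with the table rows to about five decimal places, which also shows why repeating a common word gains only a square-root factor while adding a new distinct term adds its full squared weight.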

        Sebastian Lutze added a comment - - edited

        Here's the patch for 4.0. I forgot to update my Solr-plugin-lib to 4.0-SNAPSHOT.

        Another patch, another idea!

        Some thoughts:

        • With the last patch, sum-of-distinct-weights will be calculated anyhow, even if ScoreOrderFragmentsBuilder is used.
        • Also regardless of further calculations, FieldTermsStack retrieves document frequency for each term from IndexReader in any case.
        • Solr-Developers have no chance to implement a FragmentsBuilder-plugin with their custom-scoring for fragments, because the weighting-formula is "hard-coded" in WeightedFragInfo. BTW, that's the reason I started to work on this patch anyway.

        Possible Solution:

        1. Collect and pass all needed information to the BaseFragmentsBuilder-implementation

        • Introduction of TermInfo.fieldName
        • Introduction of WeightedFragInfo.phraseInfos
        • Passing an instance of IndexReader as argument to BaseFragmentsBuilder.getWeightedFragInfoList() in order to get the needed statistical data from the index

        2. Move the calculation of sum-of-boosts to ScoreOrderFragmentsBuilder.calculateScore()

            
          /**
           * Compute WeightedFragInfo.score based on query-boosts
           * @throws IOException 
           */
          public List<WeightedFragInfo> calculateScore( List<WeightedFragInfo> weightedFragInfos, IndexReader reader ) throws IOException{
            for( WeightedFragInfo wfi : weightedFragInfos ){
              for( WeightedPhraseInfo wpi : wfi.phraseInfos ){
                wfi.score += wpi.boost;
              }
            }
            return weightedFragInfos;
          }
        

        3. Calculation of sum-of-distinct-weights with WeightOrderFragmentsBuilder.calculateScore()

        • In this patch WeightOrderFragmentsBuilder is a subclass of ScoreOrderFragmentsBuilder.
        • But I think the introduction of an abstract class OrderedFragmentsBuilder as superclass of ScoreOrderFragmentsBuilder and WeightOrderFragmentsBuilder would be a better strategy.
        • Moving calculateScore() into BaseFragmentsBuilder and making it abstract would be another idea.
        • The sum-of-distinct-weight-approach is the same as presented in the last patch.
          /**
           * Compute WeightedFragInfo.score based on IDF-weighted terms
           * @throws IOException 
           */
          @Override
          public List<WeightedFragInfo> calculateScore( List<WeightedFragInfo> weightedFragInfos, IndexReader reader ) throws IOException{
            
            // cache of IDF weights per term text, shared across all fragments
            Map<String, Float> lookup = new HashMap<String, Float>(); 
            // terms already counted in the current fragment
            HashSet<String> distinctTerms  = new HashSet<String>();
            
            // numDocs() already excludes deleted documents
            int numDocs = reader.numDocs();
            
            int docFreq;
            int length;
            float boost;
            float weight;
            
            for( WeightedFragInfo wfi : weightedFragInfos ){
              distinctTerms.clear();
              length = 0;
              boost = 0;
              for( WeightedPhraseInfo wpi : wfi.phraseInfos ){
                for( TermInfo ti : wpi.termInfos ) {
                  length++;
                  // count every term only once per fragment
                  if( !distinctTerms.add( ti.text ) ) 
                    continue;
                  if ( lookup.containsKey( ti.text ) )
                    weight = lookup.get( ti.text ).floatValue();
                  else {
                    docFreq = reader.docFreq( new Term( ti.fieldName, ti.text ) );
                    weight = ( float ) ( Math.log( numDocs / ( double ) ( docFreq + 1 ) ) + 1.0 );
                    lookup.put( ti.text, Float.valueOf( weight ) );
                  }
                  boost += Math.pow( weight, 2 ) * wpi.boost;
                }
              }
              // equivalent to boost * sqrt( length )
              wfi.score = ( float ) ( boost * length * ( 1 / Math.sqrt( length ) ) );
            }
            
            return weightedFragInfos;
          }
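
        The IDF term in the listing above is ln( numDocs / ( docFreq + 1 ) ) + 1. A small self-contained sketch shows how it separates common terms from rare ones. The index size and document frequencies are made-up numbers, and the class name is hypothetical.

```java
final class IdfWeightSketch {

  // Same formula as in calculateScore(): ln( numDocs / ( docFreq + 1 ) ) + 1
  static float idf(int numDocs, int docFreq) {
    return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
  }

  public static void main(String[] args) {
    int numDocs = 10_000;              // assumed index size
    int[] docFreqs = {9_000, 500, 3};  // assumed: very common, medium, rare term
    for (int docFreq : docFreqs) {
      System.out.printf("docFreq=%d idf=%.3f%n", docFreq, idf(numDocs, docFreq));
    }
  }
}
```

        The more documents contain a term, the closer its weight gets to 1.0, so a stop-word-like term such as das contributes far less than a rare one.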
        

        With this approach programmers can implement their own fragments-weighting with ease, simply by overriding calculateScore().
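
        A sketch of what such a custom scoring could look like. FragInfo and PhraseInfo are simplified stand-ins for WeightedFragInfo and WeightedPhraseInfo, not the real Lucene classes, the IndexReader argument is left out, and the "count each distinct phrase once" rule in the subclass is only an illustration.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class CalculateScoreSketch {

  // Stand-in for WeightedPhraseInfo (hypothetical, simplified).
  static final class PhraseInfo {
    final String text;
    final float boost;
    PhraseInfo(String text, float boost) { this.text = text; this.boost = boost; }
  }

  // Stand-in for WeightedFragInfo (hypothetical, simplified).
  static final class FragInfo {
    final List<PhraseInfo> phraseInfos = new ArrayList<>();
    float score;
  }

  // Default behaviour: sum of query boosts over all phrase occurrences.
  static class BoostScoring {
    List<FragInfo> calculateScore(List<FragInfo> fragInfos) {
      for (FragInfo fragInfo : fragInfos) {
        fragInfo.score = 0f;
        for (PhraseInfo phraseInfo : fragInfo.phraseInfos) {
          fragInfo.score += phraseInfo.boost;
        }
      }
      return fragInfos;
    }
  }

  // Custom behaviour: only the scoring method is replaced.
  static class DistinctPhraseScoring extends BoostScoring {
    @Override
    List<FragInfo> calculateScore(List<FragInfo> fragInfos) {
      for (FragInfo fragInfo : fragInfos) {
        Set<String> seen = new HashSet<>();
        fragInfo.score = 0f;
        for (PhraseInfo phraseInfo : fragInfo.phraseInfos) {
          if (seen.add(phraseInfo.text)) {   // skip repeated phrases
            fragInfo.score += phraseInfo.boost;
          }
        }
      }
      return fragInfos;
    }
  }

  public static void main(String[] args) {
    FragInfo fragInfo = new FragInfo();
    for (int i = 0; i < 3; i++) {
      fragInfo.phraseInfos.add(new PhraseInfo("das", 1.0f));
    }
    List<FragInfo> fragInfos = List.of(fragInfo);
    new BoostScoring().calculateScore(fragInfos);
    System.out.println("boost score: " + fragInfo.score);
    new DistinctPhraseScoring().calculateScore(fragInfos);
    System.out.println("distinct score: " + fragInfo.score);
  }
}
```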

        I think the major drawback of this idea is that the FragmentsBuilder must traverse the whole stack of WeightedFragInfo once again. Since we have tomes with more than 3000 pages of OCR, this could be a problem, but I can't confirm that for sure. One way to avoid this would be to make FieldFragList "pluggable" via an interface "FragList", so that the FragmentsBuilder-plugin could be parametrized with the intended implementation of FragList:

        <highlighter>
         <fragmentsBuilder name="weight-ordered" class="org.apache.solr.highlight.OrderedFragmentsBuilder">
          <fragList class="org.apache.lucene.search.vectorhighlight.WeightedFragList" />
         </fragmentsBuilder>
         <fragmentsBuilder name="boost-ordered" class="org.apache.solr.highlight.OrderedFragmentsBuilder">
          <fragList class="org.apache.lucene.search.vectorhighlight.BoostedFragList" />
         </fragmentsBuilder>
        </highlighter>
        

        Further notes:

        • As shown in this patch "WeightedFragInfo.totalBoost" should be renamed to "WeightedFragInfo.score".
        • "ScoreOrderFragmentsBuilder" should be renamed to "BoostOrderFragmentsBuilder".
        Sebastian Lutze added a comment -

        Patch for 4.0 trunk.

        Sebastian Lutze added a comment -

        Hm, since FieldFragList is created in SimpleFragListBuilder.createFieldFragList(), it should look more like this:

        <highlighter>
         <fragListBuilder name="simple-boosted" class="org.apache.solr.highlight.SimpleFragListBuilder">
          <fragList name="boosted" class="org.apache.lucene.search.vectorhighlight.BoostedFragList"/>
         </fragListBuilder>
         <fragListBuilder name="simple-weighted" class="org.apache.solr.highlight.SimpleFragListBuilder" default="true">
          <fragList name="weighted" class="org.apache.lucene.search.vectorhighlight.WeightedFragList"/>
         </fragListBuilder>
         <fragmentsBuilder name="ordered" class="org.apache.solr.highlight.ScoreOrderFragmentsBuilder" default="true"/>
        </highlighter>
        
        Sebastian Lutze added a comment -

        Another patch for 4.0. This one makes FieldFragList "pluggable".

        This patch contains:

        • Introduction of interface FieldFragList
        • Introduction of abstract class BaseFieldFragList which contains SubInfo and FieldFragInfo (I renamed WeightedFragInfo)
        • Introduction of class SimpleFieldFragList (default)
        • Introduction of class WeightedFieldFragList
        • Introduction of abstract class BaseFragListBuilder
        • Introduction of class SimpleFragListBuilder (default)
        • Introduction of class WeightedFragListBuilder

        The weighting-formula now depends on the implementation of
        FieldFragList.add(int startOffset, int endOffset, List<FieldPhraseInfo> phraseInfoList):

          /* (non-Javadoc)
           * @see org.apache.lucene.search.vectorhighlight.FieldFragList#getFragInfos()
           */ 
          @Override
          public void add( int startOffset, int endOffset, List<FieldPhraseInfo> phraseInfoList ) {
            float score = 0;
            List<SubInfo> subInfos = new ArrayList<SubInfo>();
            for( FieldPhraseInfo phraseInfo : phraseInfoList ){
              subInfos.add( new SubInfo( phraseInfo.getText(), phraseInfo.getTermsOffset(), phraseInfo.getSeqnum() ) );
              score += phraseInfo.getBoost();
            }
            getFragInfos().add( new FieldFragInfo( startOffset, endOffset, subInfos, score ) );
          }
        

        The chosen FieldFragList depends on FragListBuilder.createFieldFragList( FieldPhraseList fieldPhraseList, int fragCharSize ):

          /* (non-Javadoc)
           * @see org.apache.lucene.search.vectorhighlight.FragListBuilder#createFieldFragList(FieldPhraseList fieldPhraseList, int fragCharSize)
           */ 
          @Override
          public FieldFragList createFieldFragList( FieldPhraseList fieldPhraseList, int fragCharSize ){
            return createFieldFragList( fieldPhraseList, new SimpleFieldFragList( fragCharSize ), fragCharSize );
          } 
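
        To make the division of labour concrete: the builder picks the list implementation, and the list's add() decides how a fragment is scored. Below is a simplified, self-contained sketch with stand-in types. None of these are the real Lucene classes, the names are hypothetical, and the weighted variant takes precomputed term weights as an assumption instead of reading them from the index.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class PluggableFragListSketch {

  // Stand-in for FieldPhraseInfo: a matched term with its query boost and an IDF-like weight.
  static final class Phrase {
    final String text;
    final float boost;
    final float weight;
    Phrase(String text, float boost, float weight) {
      this.text = text;
      this.boost = boost;
      this.weight = weight;
    }
  }

  // Stand-in for FieldFragList: subclasses decide how a fragment is scored in add().
  abstract static class FragList {
    final List<Float> scores = new ArrayList<>();
    abstract void add(List<Phrase> phrases);
  }

  // Sum of boosts, i.e. the existing behaviour.
  static final class SimpleFragList extends FragList {
    @Override
    void add(List<Phrase> phrases) {
      float score = 0f;
      for (Phrase phrase : phrases) {
        score += phrase.boost;
      }
      scores.add(score);
    }
  }

  // Sum of distinct weights, scaled by sqrt(number of phrases).
  static final class WeightedFragList extends FragList {
    @Override
    void add(List<Phrase> phrases) {
      Set<String> distinct = new HashSet<>();
      float sum = 0f;
      for (Phrase phrase : phrases) {
        if (distinct.add(phrase.text)) {
          sum += phrase.weight * phrase.boost;
        }
      }
      scores.add((float) (sum * Math.sqrt(phrases.size())));
    }
  }

  // Stand-in for FragListBuilder: the only place that names a concrete list type.
  interface FragListBuilder {
    FragList createFragList();
  }

  static float scoreWith(FragListBuilder builder, List<Phrase> phrases) {
    FragList fragList = builder.createFragList();
    fragList.add(phrases);
    return fragList.scores.get(0);
  }

  public static void main(String[] args) {
    Phrase das = new Phrase("das", 1f, 0.78f);  // assumed weights
    Phrase alte = new Phrase("alte", 1f, 1.28f);
    Phrase testament = new Phrase("testament", 1f, 1.28f);
    List<Phrase> repeated = List.of(das, das, das, das);
    List<Phrase> allTerms = List.of(das, alte, testament);
    System.out.println("simple:   " + scoreWith(SimpleFragList::new, repeated)
        + " vs " + scoreWith(SimpleFragList::new, allTerms));
    System.out.println("weighted: " + scoreWith(WeightedFragList::new, repeated)
        + " vs " + scoreWith(WeightedFragList::new, allTerms));
  }
}
```

        Swapping the builder is enough to change the ranking. Nothing downstream of the fragment list has to know which scoring was used.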
        

        Of course, Solr-config could look like this:

        <highlighter>
         <fragListBuilder name="simple" class="org.apache.solr.highlight.SimpleFragListBuilder"/>
         <fragListBuilder name="weighted" class="org.apache.solr.highlight.WeightedFragListBuilder" default="true"/>
         <fragmentsBuilder name="ordered" class="org.apache.solr.highlight.ScoreOrderFragmentsBuilder" default="true"/>
        </highlighter>
        

        I think this is the best possible approach: it maintains backwards compatibility while also doing some refactoring that should make it easier to plug in different approaches in the future.

        But, after a few weeks of banging my head against the wall I have to admit: I have no idea.

        Sebastian Lutze added a comment - - edited

        Patch for trunk (1178632).

        Koji Sekiguchi added a comment -

        Hi sebastian, thank you for the continuous work on this! I'd like to take a look at them this week.

        Sebastian Lutze added a comment -

        Patch for branch_3x (1177996).

        Koji Sekiguchi added a comment -

        Very nice progress, thanks! I think this is close to being committable. The following must be done first:

        1. update description and figures of the package javadoc ( https://builds.apache.org//job/Lucene-trunk/javadoc/contrib-highlighter/org/apache/lucene/search/vectorhighlight/package-summary.html#package_description )
        2. update the test cases; currently they do not compile.
        Sebastian Lutze added a comment -

        Patch for 3.5-SNAPSHOT & 4.0-SNAPSHOT

        Sebastian Lutze added a comment -

        Okay, here we go again.

        This patch contains:

        • Fixed docs
        • Fixed test cases
        Koji Sekiguchi added a comment -

        In the latest patch, FieldFragList becomes an interface, and an abstract class BaseFieldFragList, which implements the interface, is introduced. But I think it is strange that the javadoc of the add() method says that the interface depends on FieldFragInfo, which is defined in the abstract class.

        * convert the list of FieldPhraseInfo to FieldFragInfo, then add it to the fragInfos
        

        How about just making FieldFragList an abstract class and not introducing BaseFieldFragList at all?

        Koji Sekiguchi added a comment -

        In this patch, I removed the FieldFragList interface, renamed BaseFieldFragList to FieldFragList, and moved the javadocs from the interface to the abstract class.

        I'm still working on it.

        Koji Sekiguchi added a comment -

        Ah, sebastian, I think you needed to check "Grant license to ASF for inclusion in ASF works" when you attach your patch. Can you remove the latest patches and reattach them with that flag? Thanks!

        Koji Sekiguchi added a comment -

        And I found a lot of test errors...

        Sebastian Lutze added a comment -

        Hi Koji, the patch doesn't work because of https://issues.apache.org/jira/browse/LUCENE-3513.

        And I found a lot of test errors...

        Frankly, I didn't run the tests because I thought the changes provided with the last patch shouldn't affect the original behavior.
        I'll have a look into it. But this may take some time, due to the fact that I have no knowledge about the test-framework.

        Koji Sekiguchi added a comment -

        Hi sebastian,

        Frankly, I didn't run the tests because I thought the changes provided with the last patch shouldn't affect the original behavior.
        I'll have a look into it. But this may take some time, due to the fact that I have no knowledge about the test-framework.

        Ok, no problem. I'll look at the test cases (hopefully next week or so). But can you take care of the following so we can move forward?

        Ah, sebastian, I think you needed to check "Grant license to ASF for inclusion in ASF works" when you attach your patch. Can you remove the latest patches and reattach them with that flag? Thanks!

        Koji Sekiguchi added a comment -

        I've removed my latest patch because it carried the ASF license grant flag, which was not right: it was entirely based on sebastian's patch, which had not been granted to the ASF.

        Sebastian Lutze added a comment -

        Ah, sebastian, I think you needed to check "Grant license to ASF for inclusion in ASF works" when you attach your patch. Can you remove the latest patches and reattach them with that flag? Thanks!

        Sorry, I forgot that. Done.

        Koji Sekiguchi added a comment -

        New patch, still has failures in test, though.

        Sebastian Lutze added a comment - - edited

        Patch for trunk 1205430. Works for me so far.

        • Test fixed
        • New test case "WeightedFragListBuilderTest"
        Hoss Man added a comment -

        Bulk of fixVersion=3.6 -> fixVersion=4.0 for issues that have no assignee and have not been updated recently.

        email notification suppressed to prevent mass-spam
        pseudo-unique token identifying these issues: hoss20120321nofix36

        Sebastian Lutze added a comment -

        Patch for 3.6.1-SNAPSHOT.

        Simon Willnauer added a comment -

        Koji, do you wanna get this in any time? Now is likely a good time since 4.0 is getting close. We won't apply this to 3.6.1 since that is a bugfix only release if it is going to happen at all.

        Simon Willnauer added a comment -

        remove 3.6.1 from fix version - bugfix only release

        Koji Sekiguchi added a comment -

        Koji, do you wanna get this in any time? Now is likely a good time since 4.0 is getting close.

        Hi Simon, thank you for bringing this up! Yes, I do want sebastian's great patch to get into 4.0. It has been on my TODO list for a long time, but I couldn't find time to look into it deeply. I'm very sorry about that.

        If I remember correctly, when I tried the previous patch, I got test errors. Then sebastian fixed them and attached an updated patch. I looked into the updated tests, but I couldn't understand them very well at the time. Just after that, I was assigned to something else and had no time left for this.

        Anyway, the idea of this ticket is definitely great and should be committed. So can someone take it over?

        Sebastian Lutze added a comment -

        Hi Koji,
        hi Simon,

        if there is something to do for me, please let me know.

        Maybe it would be better to split the patch in several smaller ones, e.g.

        1. Use Getters/Setters where possible in FVH
        2. Make FieldFragList interface and BaseFieldFragList abstract class
        3. Introduction of SimpleFieldFragList and SimpleFragListBuilder as default
        4. Introduction of WeightedFieldFragList and WeightedFragListBuilder
        5. Integration into Solr

        When's the 4.0-release scheduled, anyway?

        A patch for trunk 1342490 is on its way.

        Sebastian Lutze added a comment -

        Patch for trunk (1342490)

        Koji Sekiguchi added a comment -

        Hi sebastian!

        Maybe it would be better to split the patch in several smaller ones, e.g.

        This is a great idea and it helps me a lot! If you could provide them one by one for trunk, I think I can review the smaller patches and commit them one by one.

        Sebastian Lutze added a comment -

        Hi Koji,

        This is a great idea and it helps me a lot! If you could provide them one by one for trunk, I think I can review the smaller patch and commit them one by one.

        Okay, let's give it a try. Here is the first one:

        https://issues.apache.org/jira/browse/LUCENE-4091

        This one simply adds getters. Tests were okay.

        Koji Sekiguchi added a comment -

        Hi sebastian,

        I committed LUCENE-4091 in trunk and branch_4x. For the credit, I will give it in CHANGES.txt when committing the main body (LUCENE-3440) patch.

        Sebastian Lutze added a comment -

        Hi Koji,

        I committed LUCENE-4091 in trunk and branch_4x. For the credit, I will give it in CHANGES.txt when committing the main body (LUCENE-3440) patch.

        great, here is the next one:

        https://issues.apache.org/jira/browse/LUCENE-4107

        This one simply makes FieldFragList abstract and "pluggable". Tests were okay.

        Koji Sekiguchi added a comment -

        Hi sebastian,

        I committed LUCENE-4107 in trunk and branch_4x.

        Sebastian Lutze added a comment - - edited

        Hi Koji,

        > I committed LUCENE-4107 in trunk and branch_4x.

        That was fast!

        https://issues.apache.org/jira/browse/LUCENE-4113

        This one introduces and maintains IDF-weight for FieldTermStack.TermInfo.

        Sebastian Lutze added a comment -

        Hi Koji,

        I was just wondering about

        https://issues.apache.org/jira/browse/LUCENE-2949

        Koji Sekiguchi added a comment -

        Hi Sebastian,

        I committed LUCENE-4113 in trunk and branch_4x. Is the next the last one?

        Sebastian Lutze added a comment -

        Hi Koji,

        > Is the next the last one?

        Almost. The next thing would be the Solr integration.

        So, I just realized: trunk is not trunk anymore!

        This one is for branch_4x:

        https://issues.apache.org/jira/browse/LUCENE-4133

        Tests are fine.

        Koji Sekiguchi added a comment -

        Hi Sebastian,

        I've committed LUCENE-4133.

        I'm going to close and mark this issue as resolved because I think Lucene part has been completed. Can you open a separate issue for Solr part?

        This is a great improvement for FVH. I really appreciate what you've done!

        Koji Sekiguchi added a comment -

        Thanks, Sebastian!

        Sebastian Lutze added a comment -

        Hi Koji,

        > I'm going to close and mark this issue as resolved because I think Lucene part has been completed.

        That's really awesome!

        > Can you open a separate issue for Solr part?

        Sure.

        > This is a great improvement for FVH. I really appreciate what you've done!

        It was an honor for me!

        Sebastian Lutze added a comment -

        Hi Koji,

        Here's the Solr integration:

        https://issues.apache.org/jira/browse/SOLR-3542


          People

          • Assignee:
            Koji Sekiguchi
            Reporter:
            Sebastian Lutze