Details

    • Lucene Fields:
      New

      Description

      Currently the classifiers return only the single "best matching" class. Anyone who wants to use them for more complex tasks has to modify these classes to get the second and third results as well. If returning a list is possible and does not cost many resources, why not do it? (We already iterate over a list anyway.)

      The Bayes classifier returned values that were too small, and there was a bug with floats reaching zero. That was fixed by working with logarithms. It would be nice to scale the class scores so that they sum to one; then the returned scores and relevance of two documents could be compared. (Without this, the word count of the test documents affects the resulting score.)

      In bullet points:

      • In the Bayes classifier: normalized score values, returned as result lists.
      • In the KNN classifier: the possibility to return a result list.
      • Make ClassificationResult Comparable for list sorting.
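
A minimal sketch of the score scaling requested above, assuming the classifier has one finite log score per class. The class and method names are hypothetical, and the max-subtraction (log-sum-exp) trick is a standard technique swapped in for illustration; it is not necessarily what the attached patches do.

```java
import java.util.Arrays;

/** Hypothetical helper, not part of Lucene: scales log scores so they sum to one. */
final class ScoreNormalizationSketch {

  /** Converts per-class log scores into values in [0, 1] that sum to one. */
  static double[] normalize(double[] logScores) {
    if (logScores.length == 0) {
      return new double[0];
    }
    // Subtracting the largest log score keeps exp() away from underflow to zero.
    double max = Double.NEGATIVE_INFINITY;
    for (double s : logScores) {
      max = Math.max(max, s);
    }
    double[] out = new double[logScores.length];
    double sum = 0d;
    for (int i = 0; i < logScores.length; i++) {
      out[i] = Math.exp(logScores[i] - max); // the best class becomes exp(0) = 1
      sum += out[i];
    }
    for (int i = 0; i < out.length; i++) {
      out[i] /= sum;
    }
    return out;
  }

  public static void main(String[] args) {
    // Log scores this small would all become 0.0 if exponentiated directly.
    double[] logScores = {-1050.0, -1052.0, -1060.0};
    System.out.println(Arrays.toString(normalize(logScores)));
  }
}
```

Because only the differences between log scores matter, a longer document that shifts every class score by a similar amount no longer changes the reported values.
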
      1. 0810-base.patch
        15 kB
        Gergő Törcsvári
      2. 0803-base.patch
        10 kB
        Gergő Törcsvári
      3. 0730.patch
        10 kB
        Gergő Törcsvári
      4. 06-06-5699.patch
        19 kB
        Gergő Törcsvári

        Activity

        Anshum Gupta added a comment -

        Bulk close after 5.0 release.

        ASF subversion and git services added a comment -

        Commit 1638715 from Tommaso Teofili in branch 'dev/trunk'
        [ https://svn.apache.org/r1638715 ]

        LUCENE-5699 - normalized score for boolean perceptron classifier

        ASF subversion and git services added a comment -

        Commit 1620122 from Tommaso Teofili in branch 'dev/trunk'
        [ https://svn.apache.org/r1620122 ]

        LUCENE-5699 - added missing javadoc for atomic reader

        ASF subversion and git services added a comment -

        Commit 1620022 from Tommaso Teofili in branch 'dev/trunk'
        [ https://svn.apache.org/r1620022 ]

        LUCENE-5699 - fixed javadoc

        ASF subversion and git services added a comment -

        Commit 1619933 from Tommaso Teofili in branch 'dev/trunk'
        [ https://svn.apache.org/r1619933 ]

        LUCENE-5699 - fixed javadoc

        Michael McCandless added a comment -

        Actually it looks like that train is now leaving the station ... I think we should target next release (4.11)?

        Tommaso Teofili added a comment -

        sure, I've added fix version 4.10

        Michael McCandless added a comment -

        OK thanks Tommaso Teofili, I guess we should leave this reopened and set a fix version to remind us to backport it ...

        ASF subversion and git services added a comment -

        Commit 1619699 from Tommaso Teofili in branch 'dev/trunk'
        [ https://svn.apache.org/r1619699 ]

        LUCENE-5699 - fixed javadoc

        Tommaso Teofili added a comment -

        thanks Michael, I've fixed the missing javadoc. For the question on backporting to 4.0 I'm generally +1 on that, just this introduces a new API which needs better testing (a patch covering it should be available in LUCENE-5698) before merging into the stable branch IMHO.

        Michael McCandless added a comment -

        Reopen to resolve "ant precommit" failures and maybe backport question ...

        Michael McCandless added a comment -

        This commit caused "ant precommit" failures on trunk:

             [exec] build/docs/classification/org/apache/lucene/classification/SimpleNaiveBayesClassifier.html
             [exec]   missing Fields: analyzer
             [exec]   missing Fields: atomicReader
             [exec]   missing Fields: classFieldName
             [exec]   missing Fields: indexSearcher
             [exec]   missing Fields: query
             [exec]   missing Fields: textFieldNames
             [exec]   missing Methods: countDocsWithClass()
             [exec]   missing Methods: tokenizeDoc(java.lang.String)
             [exec]
             [exec] Missing javadocs were found!
        

        Also, was it intentional that this wasn't backported to 4.x?

        ASF subversion and git services added a comment -

        Commit 1619053 from Tommaso Teofili in branch 'dev/trunk'
        [ https://svn.apache.org/r1619053 ]

        LUCENE-5699 - patch from Gergő Törcsvári for normalized score and return lists in classification

        Tommaso Teofili added a comment -

        thanks Gergő, the latest patch looks good.

        Tommaso Teofili added a comment - - edited

        thanks Gergő, the patch looks much better.

        "When I first tried to use Lucene Classification, one of the bigger problems was that the scores that come back mean nothing. Basically the classifier returns the class and a random number. If you have two texts and push them through the classifier, the scores don't help you figure out which result is more trustworthy."

        while the classification score of course isn't a random number, I agree the score should be normalized between 0 and 1, the higher the better (basically this amounts to a probability measure).
        Regarding the implementation, I don't think the API needs to be touched for this; normalized scores should always be returned in ClassificationResults by Classifier#assignClass method implementations.

        "If you can tell the user how sure you are, it's not a big step to also want to tell them what the other options are: what are the 3 or 5 most relevant classes."

        ok, the use case sounds reasonable; however, my only concern (which extends to the normalization implementation, as it's based on the generation of lists) is that the current implementation may not scale well if you have a huge number of classes.

        Regarding API introduction, I would be in favor of introducing something like Classifier#getClasses(String text) returning a List<ClassificationResult> for this use case, and alternatively or additionally Classifier#getClasses(String text, int max) to limit the number of classes returned (as the user is probably interested in the first N classes rather than the whole list of classes).
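
The two proposed methods could look roughly like the sketch below. All type names here (ScoredClass, ListingClassifier, FixedClassifier) are hypothetical stand-ins so that the example compiles on its own; they are not the Lucene API, and the default method is only one possible way to derive the limited variant from the full list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for ClassificationResult: a class label plus a normalized score. */
final class ScoredClass<T> {
  final T assignedClass;
  final double score;

  ScoredClass(T assignedClass, double score) {
    this.assignedClass = assignedClass;
    this.score = score;
  }
}

/** Hypothetical interface mirroring the two suggested getClasses variants. */
interface ListingClassifier<T> {

  /** Returns all candidate classes, best first. */
  List<ScoredClass<T>> getClasses(String text);

  /** Returns at most max candidate classes, best first. */
  default List<ScoredClass<T>> getClasses(String text, int max) {
    List<ScoredClass<T>> all = getClasses(text);
    int limit = Math.max(0, Math.min(max, all.size()));
    return new ArrayList<ScoredClass<T>>(all.subList(0, limit));
  }
}

/** Toy implementation with fixed scores, only to make the sketch runnable. */
final class FixedClassifier implements ListingClassifier<String> {

  @Override
  public List<ScoredClass<String>> getClasses(String text) {
    return Arrays.asList(
        new ScoredClass<String>("sports", 0.7),
        new ScoredClass<String>("politics", 0.2),
        new ScoredClass<String>("tech", 0.1));
  }

  public static void main(String[] args) {
    for (ScoredClass<String> c : new FixedClassifier().getClasses("some text", 2)) {
      System.out.println(c.assignedClass + " " + c.score);
    }
  }
}
```

A real implementation with a huge number of classes would probably avoid materializing the full list before truncating it, which is the scaling concern raised above.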

        Gergő Törcsvári added a comment -

        So why are the normalized and normalizedList functions useful?

        First of all, why normalized?
        When I first tried to use Lucene Classification, one of the bigger problems was that the scores that come back mean nothing. Basically the classifier returns the class and a random number. If you have two texts and push them through the classifier, the scores don't help you figure out which result is more trustworthy.
        The normalized values make that possible. If you want to tell the user how sure you are, the normalized values help you out.

        Second, why lists?
        If you can tell the user how sure you are, it's not a big step to also want to tell them what the other options are: what are the 3 or 5 most relevant classes.
        Most of the classification algorithms have those numbers a priori.

        The problem with the normalization and the lists:
        Sadly, not all classification algorithms have lists; they just drop classes. So this can't go straight into the API, because some classification methods never have a list or a score.

        I have two API suggestions:
        In the first, the Classifier interface gets the normalized and normalizedList functions, and some of the implementations throw exceptions if somebody tries to use them.
        Or the Classifier interface doesn't get them, but some classifiers can provide these functions.
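
The second suggestion can be sketched as an optional capability interface that callers check for. Every name below is hypothetical and simplified to plain strings; under the first suggestion, the non-listing classifier would instead have to implement normalizedList and throw something like UnsupportedOperationException.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical minimal contract: every classifier can name the best class. */
interface BasicClassifier {
  String assignClass(String text);
}

/** Hypothetical optional interface for classifiers that can rank several classes. */
interface ClassListing {
  List<String> normalizedList(String text);
}

/** A classifier that only knows the best class. */
final class BestOnlyClassifier implements BasicClassifier {
  @Override
  public String assignClass(String text) {
    return "spam";
  }
}

/** A classifier that additionally exposes a ranked list. */
final class RankingClassifier implements BasicClassifier, ClassListing {
  @Override
  public String assignClass(String text) {
    return normalizedList(text).get(0);
  }

  @Override
  public List<String> normalizedList(String text) {
    return Arrays.asList("spam", "ham");
  }
}

final class CapabilitySketch {

  /** Returns the ranked classes if the classifier supports them, else a one-element list. */
  static List<String> classesFor(BasicClassifier classifier, String text) {
    if (classifier instanceof ClassListing) {
      return ((ClassListing) classifier).normalizedList(text);
    }
    return Collections.singletonList(classifier.assignClass(text));
  }

  public static void main(String[] args) {
    System.out.println(classesFor(new BestOnlyClassifier(), "text"));
    System.out.println(classesFor(new RankingClassifier(), "text"));
  }
}
```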

        Gergő Törcsvári added a comment -

        Code in Lucene format, maximum search instead of sort. New NormalizedList function. Cleaner code + doc.

        Gergő Törcsvári added a comment -

        I did some measurements with the lists (results in the readme):
        https://github.com/tg44/Java-Short-vs-Max

        The basic max search is approx. 20x faster, but I think this loss is still negligible compared to a KNN search over more than 1000 docs.

        In fact, it would be possible to go back to functions that search only for the first result, or to take out the sorting and do a maximum search where only one result is needed and sort only if a list is needed.
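
The trade-off discussed here can be sketched as follows: a linear maximum search when one result is needed, a full sort when the whole list is needed, and, as a middle ground that is my own addition rather than something from the patch, a bounded heap when only the first N results are needed. Scores are plain doubles and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

final class TopScoresSketch {

  /** Linear scan: sufficient when only the best score is needed. */
  static double max(List<Double> scores) {
    double best = Double.NEGATIVE_INFINITY;
    for (double s : scores) {
      if (s > best) {
        best = s;
      }
    }
    return best;
  }

  /** Full sort, best first: needed when the whole ranked list is returned. */
  static List<Double> sortedDescending(List<Double> scores) {
    List<Double> copy = new ArrayList<Double>(scores);
    Collections.sort(copy, Collections.<Double>reverseOrder());
    return copy;
  }

  /** Bounded min-heap: keeps only the n largest scores, then orders them best first. */
  static List<Double> topN(List<Double> scores, int n) {
    if (n <= 0) {
      return new ArrayList<Double>();
    }
    PriorityQueue<Double> heap = new PriorityQueue<Double>(n);
    for (double s : scores) {
      if (heap.size() < n) {
        heap.add(s);
      } else if (s > heap.peek()) {
        heap.poll(); // drop the smallest of the current top n
        heap.add(s);
      }
    }
    List<Double> result = new ArrayList<Double>(heap);
    Collections.sort(result, Collections.<Double>reverseOrder());
    return result;
  }

  public static void main(String[] args) {
    List<Double> scores = Arrays.asList(0.1, 0.6, 0.05, 0.25);
    System.out.println(max(scores));
    System.out.println(sortedDescending(scores));
    System.out.println(topN(scores, 2));
  }
}
```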

        Gergő Törcsvári added a comment -

        Yes, the compile error was something like that; I pressed Ctrl+Shift+O to organize imports and it vanished in Eclipse. (But it builds in Eclipse without error...) My bad.

        In the KNN there was a maximum search. Building the list, sorting it and picking the first element is not cost-efficient if you have a huge number of classes, that's totally true. But if you have a huge number of classes, the list building and Collections.sort will be the least of your problems in the cost calculation. If you have few classes, list building and max search have the same complexity, and Collections.sort is the time you waste, but it will be fast because of the small number of elements. That's the reason why I did it this way; I don't think the search time increases significantly.

        The public functions that are not in Classifier are there because not all classifiers can return lists, but for those that can, this could be a huge usability boost. There are two ways to go: add a new function to Classifier and have the non-listing classifiers return a one-element list, or make an additional interface. As far as I can see, these are the only public functions of this kind.

        Tommaso Teofili added a comment -

        I get some compile errors when trying to build the classification module (with 'ant clean compile'):

        common.compile-core:
            [mkdir] Created dir: /Users/tommaso/Documents/workspaces/lucene/trunk/lucene/build/classification/classes/java
            [javac] Compiling 6 source files to /Users/tommaso/Documents/workspaces/lucene/trunk/lucene/build/classification/classes/java
            [javac] /Users/tommaso/Documents/workspaces/lucene/trunk/lucene/classification/src/java/org/apache/lucene/classification/KNearestNeighborClassifier.java:37: error: package org.mockito.internal.listeners does not exist
            [javac] import org.mockito.internal.listeners.CollectCreatedMocks;
            [javac]                                      ^
            [javac] /Users/tommaso/Documents/workspaces/lucene/trunk/lucene/classification/src/java/org/apache/lucene/classification/ClassificationResult.java:24: warning: [rawtypes] found raw type: Comparable
            [javac] public class ClassificationResult<T> implements Comparable{
            [javac]                                                 ^
            [javac]   missing type arguments for generic class Comparable<T>
            [javac]   where T is a type-variable:
            [javac]     T extends Object declared in interface Comparable
            [javac] /Users/tommaso/Documents/workspaces/lucene/trunk/lucene/classification/src/java/org/apache/lucene/classification/ClassificationResult.java:69: warning: [unchecked] unchecked cast
            [javac]             ClassificationResult<T> b = (ClassificationResult<T>) o;
            [javac]                                                                   ^
            [javac]   required: ClassificationResult<T>
            [javac]   found:    Object
            [javac]   where T is a type-variable:
            [javac]     T extends Object declared in class ClassificationResult
            [javac] /Users/tommaso/Documents/workspaces/lucene/trunk/lucene/classification/src/java/org/apache/lucene/classification/KNearestNeighborClassifier.java:132: warning: [unchecked] unchecked method invocation: method sort in class Collections is applied to given types
            [javac]         Collections.sort(returnList);
            [javac]                         ^
            [javac]   required: List<T>
            [javac]   found: List<ClassificationResult<BytesRef>>
            [javac]   where T is a type-variable:
            [javac]     T extends Comparable<? super T> declared in method <T>sort(List<T>)
            [javac] /Users/tommaso/Documents/workspaces/lucene/trunk/lucene/classification/src/java/org/apache/lucene/classification/SimpleNaiveBayesClassifier.java:182: warning: [unchecked] unchecked method invocation: method sort in class Collections is applied to given types
            [javac]       Collections.sort(dataList);
            [javac]                       ^
            [javac]   required: List<T>
            [javac]   found: List<ClassificationResult<BytesRef>>
            [javac]   where T is a type-variable:
            [javac]     T extends Comparable<? super T> declared in method <T>sort(List<T>)
            [javac] 1 error
            [javac] 4 warnings
        

        The fix for the compile error is trivial. However, apart from the strange import of org.mockito.internal.listeners.CollectCreatedMocks in KNN (which I guess is caused by some "automatic organize import" kind of IDE magic), I'm not sure about the suggested approach of creating multiple lists of classification results, sorting them manually and then selecting just one of those items; it seems a bit costly. Also I would like to avoid defining public methods if they're not needed (they can actually be private).
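
The rawtypes and unchecked warnings in the log above come from implementing the raw Comparable type. A minimal sketch of a type-safe variant follows, using a hypothetical simplified class rather than the real ClassificationResult; ordering by descending score is an assumption chosen so that the best result ends up first after sorting.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical simplified stand-in for ClassificationResult. */
final class ResultSketch<T> implements Comparable<ResultSketch<T>> {
  private final T assignedClass;
  private final double score;

  ResultSketch(T assignedClass, double score) {
    this.assignedClass = assignedClass;
    this.score = score;
  }

  T getAssignedClass() {
    return assignedClass;
  }

  double getScore() {
    return score;
  }

  /** Higher scores sort first; no cast from Object is needed with the typed interface. */
  @Override
  public int compareTo(ResultSketch<T> other) {
    return Double.compare(other.score, this.score);
  }
}

final class ResultSketchDemo {

  /** Builds a small list and sorts it without any unchecked warning. */
  static List<ResultSketch<String>> ranked() {
    List<ResultSketch<String>> results = new ArrayList<ResultSketch<String>>();
    results.add(new ResultSketch<String>("b", 0.2));
    results.add(new ResultSketch<String>("a", 0.7));
    results.add(new ResultSketch<String>("c", 0.1));
    Collections.sort(results);
    return results;
  }

  public static void main(String[] args) {
    for (ResultSketch<String> r : ranked()) {
      System.out.println(r.getAssignedClass() + " " + r.getScore());
    }
  }
}
```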

        Shawn Heisey added a comment -

        When I commit LUCENE-5747, Lucene/Solr's project-specific settings will override any configured automatic save actions in Eclipse.

        Shawn Heisey added a comment -

        "This patch includes all the mentioned features. It contains some really ugly modifications because of Eclipse's auto-formatting and automatic import organizing."

        Eclipse will not automatically reformat or organize imports unless you have changed its default configuration to turn these options on. Can you redo your changes and save without these options turned on?

        It also looks like some of the new formatting is using a different format specification than the one that comes with Lucene/Solr. The correct specification is automatically used when you run "ant eclipse" and import the project into Eclipse. Some of it looks correct, which is very odd.

        Gergő Törcsvári added a comment -

        This patch includes all the mentioned features. It contains some really ugly modifications because of Eclipse's auto-formatting and automatic import organizing.

        It also contains the modifications for the online BayesClassifier.

        The main changes:
        List building and Collections.sort instead of max search.
        docsWithClassSize is calculated in every search instead of only once.
        Because of the list, it is possible to scale the score sum to 1 (lines 180-201 in snbc).

        The "online" function is not tested yet; the scaling seems to work.


          People

          • Assignee:
            Tommaso Teofili
            Reporter:
            Gergő Törcsvári
          • Votes:
            0
            Watchers:
            6