  1. Lucene - Core
  2. LUCENE-7196

DataSplitter should be providing class centric doc sets in all generated indexes

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 6.1
    • Component/s: modules/classification
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      DataSplitter currently creates three indexes (train/test/cv) out of an original index for the evaluation of Classifiers. However, "class coverage" in the generated indexes is not guaranteed: for example, the training index could end up containing only documents belonging to 50% of the class set, so classifiers trained on it may not be very effective. To provide a more consistent evaluation, each generated index should contain split-ratio * |docs in c| documents for each class c.
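
      The per-class rule above can be illustrated with a minimal sketch. It is not the DataSplitter implementation and uses no Lucene APIs: the class name StratifiedSplitSketch, the Split holder, and the choice to round down with floor are all assumptions made for illustration. Documents are represented only by integer ids that have already been grouped by class.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of a class-centric split; not Lucene code.
public class StratifiedSplitSketch {

    /** Holds the doc ids assigned to each of the three generated sets. */
    public static final class Split {
        public final List<Integer> train = new ArrayList<>();
        public final List<Integer> test = new ArrayList<>();
        public final List<Integer> cv = new ArrayList<>();
    }

    /**
     * For every class c, sends floor(testRatio * |docs in c|) docs to test,
     * floor(cvRatio * |docs in c|) docs to cv, and the remainder to train.
     * Rounding down is an assumption; the issue does not specify rounding.
     */
    public static Split split(Map<String, List<Integer>> docsByClass,
                              double testRatio, double cvRatio) {
        Split result = new Split();
        for (List<Integer> docs : docsByClass.values()) {
            int n = docs.size();
            int testCount = (int) Math.floor(testRatio * n);
            int cvCount = (int) Math.floor(cvRatio * n);
            // Take consecutive slices of this class's docs for each set.
            result.test.addAll(docs.subList(0, testCount));
            result.cv.addAll(docs.subList(testCount, testCount + cvCount));
            result.train.addAll(docs.subList(testCount + cvCount, n));
        }
        return result;
    }
}
```

      Because the ratios are applied class by class rather than to the index as a whole, every class with enough documents is represented in each set in roughly the same proportion. Very small classes can still round down to zero documents in the test or cv set under this floor-based assumption.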

            People

            • Assignee:
              Tommaso Teofili (teofili)
            • Reporter:
              Tommaso Teofili (teofili)