  1. Lucene - Core
  2. LUCENE-7196

DataSplitter should be providing class centric doc sets in all generated indexes

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 6.1
    • Component/s: modules/classification
    • Labels: None
    • Lucene Fields: New

    Description

      DataSplitter currently creates three indexes (train/test/cv) from an original index for evaluating Classifiers. However, "class coverage" in the generated indexes is not guaranteed: for example, the training index could end up containing only documents that belong to 50% of the classes, so classifiers trained on it may not be very effective. To provide more consistent evaluation, each generated index should contain split-ratio * |docs in c| documents for each class c.
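
      The per-class rule above can be illustrated with a small, Lucene-free sketch. This is not the DataSplitter implementation or its API: the class StratifiedSplitSketch, the Splits holder, and the split method are hypothetical names, documents are represented only by integer ids already grouped by class, and rounding down with the remainder going to the cv split is an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these names exist in Lucene.
public class StratifiedSplitSketch {

    /** Holds the document ids assigned to each of the three splits. */
    public static final class Splits {
        public final List<Integer> train = new ArrayList<>();
        public final List<Integer> test = new ArrayList<>();
        public final List<Integer> cv = new ArrayList<>();
    }

    /**
     * For every class c, sends floor(trainRatio * |docs in c|) documents to train,
     * floor(testRatio * |docs in c|) to test, and whatever is left to cv
     * (assumption: the remainder policy is not specified by the issue).
     */
    public static Splits split(Map<String, List<Integer>> docsByClass,
                               double trainRatio, double testRatio) {
        if (trainRatio < 0 || testRatio < 0 || trainRatio + testRatio > 1.0) {
            throw new IllegalArgumentException("ratios must be >= 0 and sum to at most 1");
        }
        Splits out = new Splits();
        for (List<Integer> docs : docsByClass.values()) {
            int n = docs.size();
            int trainEnd = Math.min(n, (int) Math.floor(trainRatio * n));
            int testEnd = Math.min(n, trainEnd + (int) Math.floor(testRatio * n));
            out.train.addAll(docs.subList(0, trainEnd));
            out.test.addAll(docs.subList(trainEnd, testEnd));
            out.cv.addAll(docs.subList(testEnd, n));
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up example data: 10 docs in one class, 5 in another.
        Map<String, List<Integer>> docsByClass = new LinkedHashMap<>();
        docsByClass.put("sports", List.of(0, 1, 2, 3, 4, 5, 6, 7, 8, 9));
        docsByClass.put("politics", List.of(10, 11, 12, 13, 14));

        Splits s = split(docsByClass, 0.6, 0.2);
        System.out.println("train=" + s.train.size()
                + " test=" + s.test.size()
                + " cv=" + s.cv.size());
    }
}
```

      Because the ratio is applied class by class rather than to the index as a whole, every class with enough documents appears in every split. A real implementation would likely also shuffle each class's documents before slicing; that step is left out here to keep the sketch deterministic.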

          People

            Assignee: Tommaso Teofili (teofili)
            Reporter: Tommaso Teofili (teofili)