Solr / SOLR-11741

Offline training mode for schema guessing

Details

    • Improvement
    • Status: Open
    • Major
    • Resolution: Unresolved

    Description

      Our data-driven schema guessing fails in many situations. For example, if the first document has a field with the value "0", the field is guessed as Long, and subsequent documents with "0.0" in that field are rejected. Similarly, if the same field has alphanumeric contents in a later document, that document is rejected. Also, single- vs. multi-valued field guessing is not ideal.

      Proposing an offline training mode where Solr accepts a batch of documents and returns a guessed schema without indexing them. This schema can then be used for the actual indexing. I think the original idea is from Hoss.

      I think the initial implementation can be based on an UpdateRequestProcessor. We can hash out the API as we go along.
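
      To make the idea concrete, here is a rough sketch of how offline guessing over a whole batch could pick the most accommodating type per field instead of committing to the first value seen. It is plain Java and does not use any Solr API; the names SchemaGuesser, GuessedType, FieldGuess and guess are hypothetical, and the LONG < DOUBLE < TEXT ordering is an assumption for illustration, not the rule from the attached patches.

```java
import java.util.Collection;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SchemaGuesser {

    /** Candidate types, ordered from least to most accommodating (assumed ordering). */
    public enum GuessedType { LONG, DOUBLE, TEXT }

    /** Running guess for one field across the whole batch. */
    public static final class FieldGuess {
        public GuessedType type = GuessedType.LONG;
        public boolean multiValued = false;
    }

    /** Narrowest type that can hold a single value. */
    static GuessedType typeOf(String value) {
        try {
            Long.parseLong(value);
            return GuessedType.LONG;
        } catch (NumberFormatException ignored) {
            // not a long, try double next
        }
        try {
            // Note: parseDouble also accepts forms such as "1e3" or "NaN".
            Double.parseDouble(value);
            return GuessedType.DOUBLE;
        } catch (NumberFormatException ignored) {
            return GuessedType.TEXT;
        }
    }

    /** Scans every document (nothing is indexed) and widens each field's guess as needed. */
    public static Map<String, FieldGuess> guess(List<? extends Map<String, ?>> docs) {
        Map<String, FieldGuess> schema = new LinkedHashMap<>();
        for (Map<String, ?> doc : docs) {
            for (Map.Entry<String, ?> entry : doc.entrySet()) {
                FieldGuess g = schema.computeIfAbsent(entry.getKey(), k -> new FieldGuess());
                Object raw = entry.getValue();
                Collection<?> values = (raw instanceof Collection)
                        ? (Collection<?>) raw
                        : Collections.singletonList(raw);
                if (values.size() > 1) {
                    g.multiValued = true; // one document with several values is enough
                }
                for (Object v : values) {
                    if (v == null) {
                        continue; // missing values carry no type information
                    }
                    GuessedType t = typeOf(String.valueOf(v));
                    if (t.compareTo(g.type) > 0) {
                        g.type = t; // only ever widen, never narrow
                    }
                }
            }
        }
        return schema;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> docs = List.of(
                Map.<String, Object>of("price", "0"),
                Map.<String, Object>of("price", "0.0"),
                Map.<String, Object>of("price", "12", "tags", List.of("a", "b")));
        guess(docs).forEach((name, g) ->
                System.out.println(name + ": " + g.type + (g.multiValued ? " (multiValued)" : "")));
    }
}
```

      Because the guess is only ever widened, the order of documents in the batch does not matter: "0" followed by "0.0" ends up as DOUBLE rather than a rejected document. An UpdateRequestProcessor-based version would presumably feed each incoming document into something like guess instead of passing it on for indexing.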

      Attachments

        1. RuleForMostAccomodatingField.png
          14 kB
          Abhishek Kumar Singh
        2. screenshot-1.png
          17 kB
          Abhishek Kumar Singh
        3. screenshot-3.png
          15 kB
          Abhishek Kumar Singh
        4. SOLR-11741.patch
          1019 kB
          Abhishek Kumar Singh
        5. SOLR-11741.patch
          1019 kB
          Abhishek Kumar Singh
        6. SOLR-11741.patch
          77 kB
          Abhishek Kumar Singh
        7. SOLR-11741-temp.patch
          10 kB
          Abhishek Kumar Singh

            People

              Ishan Chattopadhyaya (ichattopadhyaya)