LUCENE-6005 (Lucene - Core)

Explore alternative to Document/Field/FieldType API

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: 6.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Auto-prefix terms (LUCENE-5879) is blocked because it's impossible in
      Lucene today to add a simple API to use it, and I don't think we
      should commit features that only super-experts can figure out how to
      use: that's evil.

      The only realistic "workaround" for such new features is to instead
      add them directly to the various servers on top of Lucene, since they
      all already have nice schema APIs.

      I opened LUCENE-5989 to try to take at least a baby step towards making
      it easier to use auto-prefix terms, so you can easily add singleton
      binary tokens, but even that has proven controversial.

      Net/net I think we have to solve the root cause of this by fixing the
      Document/Field/FieldType API so that new index-level features can have
      a usable API, properly defaulted for the right types of fields.

      Towards that, I'm exploring a replacement for
      Document/Field/FieldType. The idea is to expose simple methods on the
      document class (no more separate Field and FieldType classes):

          doc.addLargeText("body", "some text");
          doc.addShortText("title", "a title");
          doc.addAtom("id", "29jafnn");
          doc.addBinary("bytes", new byte[7]);
          doc.addNumber("number", 17);
      

      And then expose a separate FieldTypes class, which you pass to the
      constructor of the new document class and which lets you set all the
      various per-field settings (stored, doc values, etc.). E.g.:

          types.enableStored("id");
      

      FieldTypes is a write-once schema, and it throws exceptions if you try
      to make invalid changes once a given setting is already written
      (e.g. enabling norms after having disabled them). It will (I haven't
      implemented this yet) save its state into IndexWriter's commitData, so
      it's available when you open a new IndexWriter for append and when you
      open a reader.
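
      To make the write-once behavior concrete, here is a minimal,
      self-contained sketch. It is not the FieldTypes code from the branch:
      the class name WriteOnceSchemaSketch, its methods, and the string
      setting names are all hypothetical, and it only shows the idea of
      accepting a setting once and rejecting a later conflicting change.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy write-once schema (hypothetical, not Lucene's FieldTypes). */
final class WriteOnceSchemaSketch {
  // field name -> (setting name -> value); all names are illustrative
  private final Map<String, Map<String, Boolean>> settings = new HashMap<>();

  /** Records a boolean setting, or throws if it conflicts with the value already written. */
  void set(String field, String setting, boolean value) {
    Map<String, Boolean> perField = settings.computeIfAbsent(field, f -> new HashMap<>());
    // putIfAbsent never overwrites, so the first written value always wins
    Boolean previous = perField.putIfAbsent(setting, value);
    if (previous != null && previous != value) {
      throw new IllegalStateException(
          "field \"" + field + "\": cannot change " + setting
              + " from " + previous + " to " + value);
    }
  }

  /** Returns the recorded value, or null if the setting was never written. */
  Boolean get(String field, String setting) {
    Map<String, Boolean> perField = settings.get(field);
    return perField == null ? null : perField.get(setting);
  }
}
```

      With this sketch, calling set("title", "norms", false) twice is
      harmless, while a later set("title", "norms", true) throws
      IllegalStateException and leaves the stored value unchanged.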

      It has methods to set all the per-field settings (analyzer, stored,
      term vectors, norms, index options, doc values type), and chooses
      "reasonable" defaults based on the value's type when it suddenly sees
      a new field. For example, when you add a number, it's indexed for
      range querying and sorting (numeric doc values) by default.
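
      The "defaults on first sighting" idea could look roughly like the
      sketch below. Everything in it is an assumption made for illustration
      (TypeDefaultsSketch, ValueKind, Settings, onValue); only the NUMBER
      default mirrors what is described above, and the defaults for the
      other kinds are placeholders rather than what the branch does.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy registry that picks per-field defaults the first time a field is seen (hypothetical). */
final class TypeDefaultsSketch {
  enum ValueKind { TEXT, ATOM, BINARY, NUMBER }

  /** Illustrative settings only; not the real FieldTypes option set. */
  static final class Settings {
    final ValueKind kind;
    final boolean rangeQueries;     // indexed so range queries work
    final boolean numericDocValues; // numeric doc values, used for sorting

    Settings(ValueKind kind, boolean rangeQueries, boolean numericDocValues) {
      this.kind = kind;
      this.rangeQueries = rangeQueries;
      this.numericDocValues = numericDocValues;
    }
  }

  private final Map<String, Settings> fields = new LinkedHashMap<>();

  /** Called for every added value; only the first sighting of a field chooses defaults. */
  Settings onValue(String field, ValueKind kind) {
    Settings existing = fields.get(field);
    if (existing == null) {
      // NUMBER follows the description above; other kinds are placeholders.
      boolean isNumber = kind == ValueKind.NUMBER;
      existing = new Settings(kind, isNumber, isNumber);
      fields.put(field, existing);
    } else if (existing.kind != kind) {
      throw new IllegalStateException(
          "field \"" + field + "\" is already " + existing.kind + ", cannot add " + kind);
    }
    return existing;
  }
}
```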

      FieldTypes provides the analyzer and codec (a little messy) that you
      pass to IndexWriterConfig. Since it's effectively a persistent
      schema, it knows all about the available fields at search time, so we
      could use it to create queries (checking if they are valid given that
      field's type). Query parsers and highlighters could consult it.
      Default UIs (above Lucene) could use it, etc. This is all in the
      future; I think for this issue the goal should be to "just" provide a
      "better" index-time API but not yet make use of it at search time.

      So with this change, for auto-prefix terms, we could add an "enable
      range queries/filters" option, but then validate that the selected
      postings format supports such an option.

      I know this exploration will be horribly controversial, but
      realistically I don't think Lucene can move on much further if we
      can't finally address this schema problem head on.

      This is long overdue.

        Activity

        jira-bot ASF subversion and git services added a comment -

        Commit 1658277 from Michael McCandless in branch 'dev/branches/lucene6005'
        [ https://svn.apache.org/r1658277 ]

        LUCENE-6005: merge trunk

        jira-bot ASF subversion and git services added a comment -

        Commit 1656281 from Michael McCandless in branch 'dev/branches/lucene6005'
        [ https://svn.apache.org/r1656281 ]

        LUCENE-6005: checkpoint

        jira-bot ASF subversion and git services added a comment -

        Commit 1649347 from Michael McCandless in branch 'dev/branches/lucene6005'
        [ https://svn.apache.org/r1649347 ]

        LUCENE-6005: merge trunk

        jira-bot ASF subversion and git services added a comment -

        Commit 1643662 from Michael McCandless in branch 'dev/branches/lucene6005'
        [ https://svn.apache.org/r1643662 ]

        LUCENE-6005: merge trunk

        jira-bot ASF subversion and git services added a comment -

        Commit 1643659 from Michael McCandless in branch 'dev/branches/lucene6005'
        [ https://svn.apache.org/r1643659 ]

        LUCENE-6005: checkpoint

        jira-bot ASF subversion and git services added a comment -

        Commit 1642537 from Michael McCandless in branch 'dev/branches/lucene6005'
        [ https://svn.apache.org/r1642537 ]

        LUCENE-6005: merge trunk

        jira-bot ASF subversion and git services added a comment -

        Commit 1642535 from Michael McCandless in branch 'dev/branches/lucene6005'
        [ https://svn.apache.org/r1642535 ]

        LUCENE-6005: checkpoint

        jira-bot ASF subversion and git services added a comment -

        Commit 1642230 from Michael McCandless in branch 'dev/branches/lucene6005'
        [ https://svn.apache.org/r1642230 ]

        LUCENE-6005: merge trunk

        jira-bot ASF subversion and git services added a comment -

        Commit 1642229 from Michael McCandless in branch 'dev/branches/lucene6005'
        [ https://svn.apache.org/r1642229 ]

        LUCENE-6005: merge trunk

        jira-bot ASF subversion and git services added a comment -

        Commit 1642110 from Michael McCandless in branch 'dev/branches/lucene6005'
        [ https://svn.apache.org/r1642110 ]

        LUCENE-6005: checkpoint

        jira-bot ASF subversion and git services added a comment -

        Commit 1640099 from Michael McCandless in branch 'dev/branches/lucene6005'
        [ https://svn.apache.org/r1640099 ]

        LUCENE-6005: merge trunk

        jira-bot ASF subversion and git services added a comment -

        Commit 1640053 from Michael McCandless in branch 'dev/branches/lucene6005'
        [ https://svn.apache.org/r1640053 ]

        LUCENE-6005: checkpoint current changese

        jira-bot ASF subversion and git services added a comment -

        Commit 1638204 from Michael McCandless in branch 'dev/branches/lucene6005'
        [ https://svn.apache.org/r1638204 ]

        LUCENE-6005: merge trunk

        jira-bot ASF subversion and git services added a comment -

        Commit 1638066 from Michael McCandless in branch 'dev/branches/lucene6005'
        [ https://svn.apache.org/r1638066 ]

        LUCENE-6005: cutover more tests

        jira-bot ASF subversion and git services added a comment -

        Commit 1637544 from Michael McCandless in branch 'dev/branches/lucene6005'
        [ https://svn.apache.org/r1637544 ]

        LUCENE-6005: merge trunk

        jira-bot ASF subversion and git services added a comment -

        Commit 1637540 from Michael McCandless in branch 'dev/branches/lucene6005'
        [ https://svn.apache.org/r1637540 ]

        LUCENE-6005: add UNIQUE_ATOM type (for primary key fields), which IW and CheckIndex enforce; add IW.getReaderManager(); add exists filter support (enabled by default); cutover some more tests / fix nocommits

        jira-bot ASF subversion and git services added a comment -

        Commit 1636528 from Michael McCandless in branch 'dev/branches/lucene6005'
        [ https://svn.apache.org/r1636528 ]

        LUCENE-6005: fix sneaky auto-prefix bug, cutover more tests

        jira-bot ASF subversion and git services added a comment -

        Commit 1636293 from Michael McCandless in branch 'dev/branches/lucene6005'
        [ https://svn.apache.org/r1636293 ]

        LUCENE-6005: add Date, InetAddress types; add min/maxTokenLength; add maxTokenCount; use ValueType.NONE not null; each FieldType now stores Luceneversion it was created by

        jira-bot ASF subversion and git services added a comment -

        Commit 1635912 from Michael McCandless in branch 'dev/branches/lucene6005'
        [ https://svn.apache.org/r1635912 ]

        LUCENE-6005: add sort missing first/last

        jira-bot ASF subversion and git services added a comment -

        Commit 1635908 from Michael McCandless in branch 'dev/branches/lucene6005'
        [ https://svn.apache.org/r1635908 ]

        LUCENE-6005: merge trunk

        jira-bot ASF subversion and git services added a comment -

        Commit 1635898 from Michael McCandless in branch 'dev/branches/lucene6005'
        [ https://svn.apache.org/r1635898 ]

        LUCENE-6005: StoredDocument -> Document2

        jira-bot ASF subversion and git services added a comment -

        Commit 1635002 from Michael McCandless in branch 'dev/branches/lucene6005'
        [ https://svn.apache.org/r1635002 ]

        LUCENE-6005: cutover to auto-prefix

        jira-bot ASF subversion and git services added a comment -

        Commit 1635000 from Michael McCandless in branch 'dev/branches/lucene6005'
        [ https://svn.apache.org/r1635000 ]

        LUCENE-6005: fix test failures

        jira-bot ASF subversion and git services added a comment -

        Commit 1634823 from Michael McCandless in branch 'dev/branches/lucene6005'
        [ https://svn.apache.org/r1634823 ]

        LUCENE-6005: merge trunk

        jira-bot ASF subversion and git services added a comment -

        Commit 1634820 from Michael McCandless in branch 'dev/branches/lucene6005'
        [ https://svn.apache.org/r1634820 ]

        LUCENE-6005: checkpoint current state

        jira-bot ASF subversion and git services added a comment -

        Commit 1633597 from Michael McCandless in branch 'dev/branches/lucene6005'
        [ https://svn.apache.org/r1633597 ]

        LUCENE-6005: add default sort order; don't use polymorphism with native types; add pos/offset gap; add highlighting; break out query and index analyzer

        mikemccand Michael McCandless added a comment -

        I committed the current work-in-progress to a new branch
        (https://svn.apache.org/repos/asf/lucene/dev/branches/lucene6005).

        I added a new FieldTypes class (holds the optional write-once schema)
        and Document2 (to replace Document eventually).

        Net/net I think the approach can work well: it's a minimally intrusive
        API to optionally build up the write-once schema. You can skip the
        API entirely and it will "learn" your schema by seeing which Java
        types you are adding to your documents and setting sensible defaults
        accordingly. It's quite a bit simpler than the current oal.document
        API: no more separate XXXField nor FieldType classes.

        Indexed binary tokens work, via Document2.addAtom(...) (LUCENE-5989).

        You can turn on/off sorting for a field, and this "translates" to the
        appropriate DV type; I want to improve this by letting you specify the
        default sort order, and also [eventually] specify collator. I plan to
        similarly enable highlighting.

        I also added search-time APIs, e.g. newSort, newTermQuery,
        newRangeQuery. These methods throw clear exceptions if the field name
        is unknown, or it wasn't indexed with a type that "matches" that
        method.
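
        As a rough illustration of that kind of search-time check (a toy
        sketch, not the code on the branch): the class
        SearchTimeChecksSketch, its ValueKind enum, and the string it
        returns in place of a real Query are all hypothetical, and the rule
        that only NUMBER fields accept range queries is an assumption made
        to keep the example small.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy schema-aware query factory (hypothetical, not Lucene's FieldTypes). */
final class SearchTimeChecksSketch {
  enum ValueKind { TEXT, ATOM, NUMBER }

  private final Map<String, ValueKind> fields = new HashMap<>();

  /** Remembers the kind of value a field was indexed with. */
  void record(String field, ValueKind kind) {
    fields.put(field, kind);
  }

  /** Returns a textual stand-in for a range query, after validating the field. */
  String newRangeQuery(String field, long min, long max) {
    ValueKind kind = fields.get(field);
    if (kind == null) {
      throw new IllegalArgumentException("unknown field \"" + field + "\"");
    }
    if (kind != ValueKind.NUMBER) {
      // Assumption for this sketch: only NUMBER fields support ranges.
      throw new IllegalArgumentException(
          "field \"" + field + "\" was indexed as " + kind + "; range queries need NUMBER");
    }
    return field + ":[" + min + " TO " + max + "]";
  }
}
```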

        There are still many issues and nocommits:

        • Analyzer is passed to FieldTypes now; I would like to remove it
          from IndexWriterConfig. To do this, I think I need to push
          multi-valued field handling out of IndexWriter up into "user
          space"... I already removed IndexableFieldType.tokenized as a
          first step.
        • Analyzers can't be serialized, so the app will have to
          re-initialize them on startup (as it must do anyway today with
          PerFieldAnalyzerWrapper). Same for Similarity.
        • You can only set per-field DocValuesFormat and PostingsFormat.
        • I only cut over a couple of tests, but they lose randomness since
          FieldTypes provides the default IndexWriterConfig, vs
          LTC.newIWC().
        • I had to suck in a fork of KeywordTokenizer.
        jira-bot ASF subversion and git services added a comment -

        Commit 1633314 from Michael McCandless in branch 'dev/branches/lucene6005'
        [ https://svn.apache.org/r1633314 ]

        LUCENE-6005: work in progress

        jira-bot ASF subversion and git services added a comment -

        Commit 1633312 from Michael McCandless in branch 'dev/branches/lucene6005'
        [ https://svn.apache.org/r1633312 ]

        LUCENE-6005: make branch

        rjernst Ryan Ernst added a comment -

        +1 overall. This is sorely needed. I also think we should "level the playing field" for trunk: start by getting trunk back to the same state as 5x with the document api, so that if this is ready in time for 5.0, it can be much more easily backported.

        rcmuir Robert Muir added a comment -

        Is there some reason why you would serialize this in the commit? FieldInfos is a much better place imo.

        mikemccand Michael McCandless added a comment -

        Trunk (6.0) only fix version ...


          People

          • Assignee:
            mikemccand Michael McCandless
            Reporter:
            mikemccand Michael McCandless
          • Votes:
            1
            Watchers:
            5
