Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 4.0-ALPHA
    • Fix Version/s: 4.0-ALPHA
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Its confusing how we present CSF functionality to the user, its actually not a "field" but an "attribute" of a field like STORED or INDEXED.

      Otherwise, its really hard to think about CSF because there is a mismatch between the APIs and the index format.

      1. LUCENE-3453.patch
        242 kB
        Michael McCandless
      2. LUCENE-3453.patch
        81 kB
        Michael McCandless

        Activity

        Hide
        Chris Male added a comment -

        I'm not sure what the better alternative is, but +1 to removing this class.

        Show
        Chris Male added a comment - I'm not sure what the better alternative is, but +1 to removing this class.
        Hide
        Simon Willnauer added a comment -

        +1

        Show
        Simon Willnauer added a comment - +1
        Hide
        Chris Male added a comment -

        Hey Robert,

        Are you putting something together on this, or should I give it a shot?

        Show
        Chris Male added a comment - Hey Robert, Are you putting something together on this, or should I give it a shot?
        Hide
        Robert Muir added a comment -

        please take it! it was just an idea after some discussion with Andrzej, who was experimenting in Luke (I think if you are not careful its easy to get norms with your indexdocvaluesfield?)

        also I noticed in the tests that the added dv fields were hitting up Similarity...

        I have no ideas on naming or api, maybe UNINVERTED?

        Show
        Robert Muir added a comment - please take it! it was just an idea after some discussion with Andrzej, who was experimenting in Luke (I think if you are not careful its easy to get norms with your indexdocvaluesfield?) also I noticed in the tests that the added dv fields were hitting up Similarity... I have no ideas on naming or api, maybe UNINVERTED?
        Hide
        Michael McCandless added a comment -

        +1

        Another spookiness I noticed but didn't yet make a test for to confirm: if you try to add a Document that has the same field w/ docValues twice, you get a scary non-obvious exception.

        Show
        Michael McCandless added a comment - +1 Another spookiness I noticed but didn't yet make a test for to confirm: if you try to add a Document that has the same field w/ docValues twice, you get a scary non-obvious exception.
        Hide
        Chris Male added a comment -

        I find it a little unclear how we want users to use DocValues during indexing.

        Do we want people to mark any Field as also being DocValues? In which case it becomes as Robert said, an attribute of a field and kind of an indexing strategy. Or we do see DocValues fields being more standalone? in which case IndexDocValuesField probably makes sense, just like we have NumericField.

        Show
        Chris Male added a comment - I find it a little unclear how we want users to use DocValues during indexing. Do we want people to mark any Field as also being DocValues? In which case it becomes as Robert said, an attribute of a field and kind of an indexing strategy. Or we do see DocValues fields being more standalone? in which case IndexDocValuesField probably makes sense, just like we have NumericField.
        Hide
        Chris Male added a comment -

        Okay so a battle plan:

        DocValues is basically an attribute of a Field and is a way of processing that Field's value. So to remove the need for IndexDocValues, lets:

        • Move ValueType to FieldType. FieldType is the appropriate place for that kind of metadata. This forces the user to define the ValueType of their DocValues when they initialize the FT.
        • Remove PerFieldDocValues docValues() from Field as its implicit when you consider the ValueType in combination with the actual value.
        • Change DocValuesConsumer to consume an IndexableField. It'll be responsible for then creating the PerFieldDocValues by looking at the ValueType.

        When we get round to introducing StorableField/Type, the DocValues ValueType will go over to StorableFieldType, more closely aligning DocValues and traditional value storing.

        Show
        Chris Male added a comment - Okay so a battle plan: DocValues is basically an attribute of a Field and is a way of processing that Field's value. So to remove the need for IndexDocValues, lets: Move ValueType to FieldType. FieldType is the appropriate place for that kind of metadata. This forces the user to define the ValueType of their DocValues when they initialize the FT. Remove PerFieldDocValues docValues() from Field as its implicit when you consider the ValueType in combination with the actual value. Change DocValuesConsumer to consume an IndexableField. It'll be responsible for then creating the PerFieldDocValues by looking at the ValueType. When we get round to introducing StorableField/Type, the DocValues ValueType will go over to StorableFieldType, more closely aligning DocValues and traditional value storing.
        Hide
        Michael McCandless added a comment -

        That sounds great!

        This makes our expert APIs (direct Field/FieldType construction) much
        cleaner for creating a index doc values field.

        After this we can separately take up what sugar classes/methods we can
        add to make it easy for non-expert users to create doc values. Maybe
        static methods like NumericField.newIntValueField(17) and
        BinaryField.newFixedValueField(bytes) for example...

        I'll open a new issue about accidentally adding same DV field twice...

        Show
        Michael McCandless added a comment - That sounds great! This makes our expert APIs (direct Field/FieldType construction) much cleaner for creating a index doc values field. After this we can separately take up what sugar classes/methods we can add to make it easy for non-expert users to create doc values. Maybe static methods like NumericField.newIntValueField(17) and BinaryField.newFixedValueField(bytes) for example... I'll open a new issue about accidentally adding same DV field twice...
        Hide
        Michael McCandless added a comment -

        I opened LUCENE-3461 for catching accidental multi-valued DV fields.

        Show
        Michael McCandless added a comment - I opened LUCENE-3461 for catching accidental multi-valued DV fields.
        Hide
        Simon Willnauer added a comment -

        hey chris what is the status here?

        Show
        Simon Willnauer added a comment - hey chris what is the status here?
        Hide
        Michael McCandless added a comment -

        I'm taking a stab at this one... here's an initial rough patch
        (tons of nocommits!). I moved docValueType from IndexableField to
        IndexableFieldType, simplified DocValuesField somewhat and cleaned up
        a few things... I'll make a branch so we can iterate there.

        Show
        Michael McCandless added a comment - I'm taking a stab at this one... here's an initial rough patch (tons of nocommits!). I moved docValueType from IndexableField to IndexableFieldType, simplified DocValuesField somewhat and cleaned up a few things... I'll make a branch so we can iterate there.
        Hide
        Michael McCandless added a comment -

        Applyable patch; I still need to go through it more closely but I
        think it's basically ready. I'd like to commit it soon... it's a big
        patch so it's going to quickly get stale/conflict. We can further
        iterate on trunk...

        I ended up keeping (but simplifying) DocValuesField, as sugar, and
        making a number of other simplifications:

        • Removed custom Comparator from DocValuesField, meaning sort is
          always by unsigned byte (UTF8/unicode) order now.
        • Moved numeric field type information out of IndexableField into
          FieldType.
        • LUCENE-3682: added in "transition layer" to construct Field using
          the pre-4.0 enums (Store, Index, TermVector)
        • LUCENE-3694: moved all setXXX methods in NumericField and
          DocValuesField up into Field.java (as setValue methods)
        • Renamed/generalized BinaryField to StoredField, so that you can
          use it to store any value, including numerics. I generalized the
          codecs so they mark a stored field as numeric if its
          .numericValue() is non-null, and at document load time they now
          always create a StoredField instance. This also means you have to
          call .numericValue (not .stringValue()) on the returned
          StoredField, after indexing a NumericField.
        • LUCENE-3616: fail if you try to create a TokenStream field that's
          stored.
        • Added numerics to Field ctor/setValue/numericValue; this is used
          by DocValuesField, NumericField, StoredField
        • You can still reuse a Field (change its value), however I madte
          this more strict: you can only change the value to another value
          of the same type.
        Show
        Michael McCandless added a comment - Applyable patch; I still need to go through it more closely but I think it's basically ready. I'd like to commit it soon... it's a big patch so it's going to quickly get stale/conflict. We can further iterate on trunk... I ended up keeping (but simplifying) DocValuesField, as sugar, and making a number of other simplifications: Removed custom Comparator from DocValuesField, meaning sort is always by unsigned byte (UTF8/unicode) order now. Moved numeric field type information out of IndexableField into FieldType. LUCENE-3682 : added in "transition layer" to construct Field using the pre-4.0 enums (Store, Index, TermVector) LUCENE-3694 : moved all setXXX methods in NumericField and DocValuesField up into Field.java (as setValue methods) Renamed/generalized BinaryField to StoredField, so that you can use it to store any value, including numerics. I generalized the codecs so they mark a stored field as numeric if its .numericValue() is non-null, and at document load time they now always create a StoredField instance. This also means you have to call .numericValue (not .stringValue()) on the returned StoredField, after indexing a NumericField. LUCENE-3616 : fail if you try to create a TokenStream field that's stored. Added numerics to Field ctor/setValue/numericValue; this is used by DocValuesField, NumericField, StoredField You can still reuse a Field (change its value), however I madte this more strict: you can only change the value to another value of the same type.
        Hide
        Chris Male added a comment - - edited

        LUCENE-3694: moved all setXXX methods in NumericField and
        DocValuesField up into Field.java (as setValue methods)

        I like this idea, but I wonder if it could be a source of confusion. People who are indexing numerical content need to use NumericField or DocValuesField ideally. Having it appear as though they can use Field and get the same indexing behavior as from NumericField could lead to bugs, right?

        Renamed/generalized BinaryField to StoredField, so that you can
        use it to store any value, including numerics. I generalized the
        codecs so they mark a stored field as numeric if its
        .numericValue() is non-null, and at document load time they now
        always create a StoredField instance. This also means you have to
        call .numericValue (not .stringValue()) on the returned
        StoredField, after indexing a NumericField.

        Pretty cool idea I have to say. Somewhere during all these discussion we talked about allowing bytes to be indexed, I guess when/if we ever get to that, we'll need to re-create a BinaryField?

        You can still reuse a Field (change its value), however I madte
        this more strict: you can only change the value to another value
        of the same type.

        Good idea. Could we make this cleaner by having a more general DataType enum? Each constructor with its typed parameter could then set the DataType. In each of the setXXX methods we can just check if the DataType is appropriate, rather than using instanceof/isBinary() etc. This could be internal to Field at the moment but could prove useful down the line.

        Show
        Chris Male added a comment - - edited LUCENE-3694 : moved all setXXX methods in NumericField and DocValuesField up into Field.java (as setValue methods) I like this idea, but I wonder if it could be a source of confusion. People who are indexing numerical content need to use NumericField or DocValuesField ideally. Having it appear as though they can use Field and get the same indexing behavior as from NumericField could lead to bugs, right? Renamed/generalized BinaryField to StoredField, so that you can use it to store any value, including numerics. I generalized the codecs so they mark a stored field as numeric if its .numericValue() is non-null, and at document load time they now always create a StoredField instance. This also means you have to call .numericValue (not .stringValue()) on the returned StoredField, after indexing a NumericField. Pretty cool idea I have to say. Somewhere during all these discussion we talked about allowing bytes to be indexed, I guess when/if we ever get to that, we'll need to re-create a BinaryField? You can still reuse a Field (change its value), however I madte this more strict: you can only change the value to another value of the same type. Good idea. Could we make this cleaner by having a more general DataType enum? Each constructor with its typed parameter could then set the DataType. In each of the setXXX methods we can just check if the DataType is appropriate, rather than using instanceof/isBinary() etc. This could be internal to Field at the moment but could prove useful down the line.
        Hide
        Michael McCandless added a comment -

        LUCENE-3694: moved all setXXX methods in NumericField and DocValuesField up into Field.java (as setValue methods)

        I like this idea, but I wonder if it could be a source of confusion. People who are indexing numerical content need to use NumericField or DocValuesField ideally. Having it appear as though they can use Field and get the same indexing behavior as from NumericField could lead to bugs, right?

        In fact you can simply use Field directly to index a numeric field or
        a doc values field, ie, DocValuesField/NumericField are truly just
        sugar now. You can do everything with Field (expert) that you can do
        with these sugar classes... or at least I think so

        Pretty cool idea I have to say. Somewhere during all these discussion we talked about allowing bytes to be indexed, I guess when/if we ever get to that, we'll need to re-create a BinaryField?

        Yeah I think we'd address that if/when we somehow allow indexing of
        byte[] valued fields... even so, expert apps could code directly to
        IF/IFT (or even subclass Field possibly) if they have some way to
        "index" byte[] content...

        You can still reuse a Field (change its value), however I madte this more strict: you can only change the value to another value of the same type.

        Good idea. Could we make this cleaner by having a more general DataType enum? Each constructor with its typed parameter could then set the DataType. In each of the setXXX methods we can just check if the DataType is appropriate, rather than using instanceof/isBinary() etc. This could be internal to Field at the moment but could prove useful down the line.

        We could explore that (making Field's fieldData more strongly typed),
        instead of dynamically checking types. Field has been dynamically
        typed forever...

        But I think we can/should do that after committing this first cleanup?
        I suspect that could be a major change just by itself...

        This would clean up tokenStream creation too – instead of probing
        dynamically for string/reader/pre-tokenized we'd switch on the type.

        Show
        Michael McCandless added a comment - LUCENE-3694 : moved all setXXX methods in NumericField and DocValuesField up into Field.java (as setValue methods) I like this idea, but I wonder if it could be a source of confusion. People who are indexing numerical content need to use NumericField or DocValuesField ideally. Having it appear as though they can use Field and get the same indexing behavior as from NumericField could lead to bugs, right? In fact you can simply use Field directly to index a numeric field or a doc values field, ie, DocValuesField/NumericField are truly just sugar now. You can do everything with Field (expert) that you can do with these sugar classes... or at least I think so Pretty cool idea I have to say. Somewhere during all these discussion we talked about allowing bytes to be indexed, I guess when/if we ever get to that, we'll need to re-create a BinaryField? Yeah I think we'd address that if/when we somehow allow indexing of byte[] valued fields... even so, expert apps could code directly to IF/IFT (or even subclass Field possibly) if they have some way to "index" byte[] content... You can still reuse a Field (change its value), however I madte this more strict: you can only change the value to another value of the same type. Good idea. Could we make this cleaner by having a more general DataType enum? Each constructor with its typed parameter could then set the DataType. In each of the setXXX methods we can just check if the DataType is appropriate, rather than using instanceof/isBinary() etc. This could be internal to Field at the moment but could prove useful down the line. We could explore that (making Field's fieldData more strongly typed), instead of dynamically checking types. Field has been dynamically typed forever... But I think we can/should do that after committing this first cleanup? I suspect that could be a major change just by itself... This would clean up tokenStream creation too – instead of probing dynamically for string/reader/pre-tokenized we'd switch on the type.
        Hide
        Chris Male added a comment -

        In fact you can simply use Field directly to index a numeric field or
        a doc values field, ie, DocValuesField/NumericField are truly just
        sugar now. You can do everything with Field (expert) that you can do
        with these sugar classes... or at least I think so

        Wow, I hadn't explored the patch deeply enough. Big +1 to this!

        But I think we can/should do that after committing this first cleanup?
        I suspect that could be a major change just by itself...

        Yeah definitely, the more I've thought about it the more impact it could have. Definitely worth another issue.

        Show
        Chris Male added a comment - In fact you can simply use Field directly to index a numeric field or a doc values field, ie, DocValuesField/NumericField are truly just sugar now. You can do everything with Field (expert) that you can do with these sugar classes... or at least I think so Wow, I hadn't explored the patch deeply enough. Big +1 to this! But I think we can/should do that after committing this first cleanup? I suspect that could be a major change just by itself... Yeah definitely, the more I've thought about it the more impact it could have. Definitely worth another issue.

          People

          • Assignee:
            Michael McCandless
            Reporter:
            Robert Muir
          • Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development