Lucene.Net
  1. Lucene.Net
  2. LUCENENET-414

The definition of CharArraySet is dangerously confusing and leads to bugs when used.

    Details

      Description

      Right now, CharArraySet derives from System.Collections.Hashtable, but doesn't actually use this base type for storing elements.
      However, the StandardAnalyzer.STOP_WORDS_SET is exposed as a System.Collections.Hashtable. The trivial code to build your own stopword set using the StandardAnalyzer.STOP_WORDS_SET and adding your own set of stopwords like this:

      CharArraySet myStopWords = new CharArraySet(StandardAnalyzer.STOP_WORDS_SET, ignoreCase: false);
      foreach (string domainSpecificStopWord in DomainSpecificStopWords)
      stopWords.Add(domainSpecificStopWord);

      ... will fail because the CharArraySet accepts an ICollection, which will be passed the Hashtable instance of STOP_WORDS_SET: the resulting myStopWords will only contain the DomainSpecificStopWords, and not those from STOP_WORDS_SET.

      One workaround would be to replace the first line with this:

      CharArraySet stopWords = new CharArraySet(StandardAnalyzer.STOP_WORDS_SET.Count + DomainSpecificStopWords.Length, ignoreCase: false);
      foreach (string domainSpecificStopWord in (CharArraySet)StandardAnalyzer.STOP_WORDS_SET)
      stopWords.Add(domainSpecificStopWord);

      ... but this makes use of the implementation detail (the STOP_WORDS_SET is really an UnmodifiableCharArraySet which is itself a CharArraySet). It works because it forces the foreach() to use the correct CharArraySet.GetEnumerator(), which is defined as a "new" method (this has a bad code smell to it)

      At least 2 possibilities exist to solve this problem:

      • Make CharArraySet use the Hashtable instance and a custom comparator, instead of its own implementation.
      • Make CharArraySet use HashSet<char[]>, defined in .NET 4.0.

        Activity

        Hide
        Digy added a comment -

        Hi Vincent,

        I changed the CharArraySet implementation.
        Can you take a look at 2.9.4g branch?

        ( https://svn.apache.org/repos/asf/incubator/lucene.net/branches/Lucene.Net_2_9_4g )

        DIGY

        Show
        Digy added a comment - Hi Vincent, I changed the CharArraySet implementation. Can you take a look at 2.9.4g branch? ( https://svn.apache.org/repos/asf/incubator/lucene.net/branches/Lucene.Net_2_9_4g ) DIGY
        Hide
        Van Den Berghe, Vincent added a comment -

        Hello Digy,

        Thanks for your response.
        I don't want to sound overly pedantic (but please tell me if I do), but this changed implementation solves only part of the problem.
        Now, CharArraySet derives from Set<T>, which itself derives from List<T>. Items are now stored both in this base class, as in the private HashSet<string> _Set.
        However, because List<T> doesn't define its modifiers Add(T), Clear() and Remove(T) as virtual, the derived implementation defines them as "new".
        This violates a variant of the Liskov substitution principle: an operation on the derived type has not the same effect as the same operation on the base type.
        In this case, it means that the following code will cause the items in the List<T> base type and in the _Set to be desynchronized:

        CharArraySet set=...
        List<string> same=set;
        same.Add("whatever");
        // at this point, same.Contains("whatever")==true but set.Contains("whatever")==false even though it's the same instance.

        You might rightfully retort that this never happens and I should mind my own business, but I know at least one poor soul who did just that: me .

        On a completely unrelated matter, the new implementation has 2 methods:

        public void Add(System.Collections.Generic.IList<T> items)
        public void Add(Support.Set<T> items)

        .. which can be collapsed into one, since the only thing used in both cases is the enumerator:

        public void Add(IEnumerable<T> items)

        I don't recall the design rule, but it's something like "to increase reuse, make your function parameters are general as possible, but their return value as specific as possible".
        I am unable to get 2.9.4g to investigate further, but if you are moving towards the Generic collections in Lucene, the following implementation should be a drop-in replacement, without suffering from the aforementioned quirks:

        [Serializable]
        public class Set<T> : ICollection<T>
        {
        private readonly System.Collections.Generic.HashSet<T> _Set = new System.Collections.Generic.HashSet<T>();
        bool _ReadOnly = false;

        public Set()
        {
        }

        public Set(bool readOnly)

        { this._ReadOnly = readOnly; }

        public bool ReadOnly
        {
        set

        { _ReadOnly = value; }

        get

        { return _ReadOnly; }

        }

        public virtual void Add(T item)

        { if (_ReadOnly) throw new NotSupportedException(); if (_Set.Contains(item)) return; _Set.Add(item); }

        public void Add(IEnumerable<T> items)
        {
        if (_ReadOnly) throw new NotSupportedException();
        foreach (T item in items)

        { if (_Set.Contains(item)) continue; _Set.Add(item); }

        }

        public void Clear()

        { if (_ReadOnly) throw new NotSupportedException(); _Set.Clear(); }

        public bool Contains(T item)

        { return _Set.Contains(item); }

        public void CopyTo(T[] array, int arrayIndex)

        { _Set.CopyTo(array, arrayIndex); }

        public int Count
        {
        get

        { return _Set.Count; }

        }

        public bool IsReadOnly
        {
        get

        { return _ReadOnly; }

        }

        public bool Remove(T item)

        { if (_ReadOnly) throw new NotSupportedException(); return _Set.Remove(item); }

        public IEnumerator<T> GetEnumerator()

        { return _Set.GetEnumerator(); }

        System.Collections.IEnumerator System.Collections.IEnumerable.GetEnumerator()
        { return _Set.GetEnumerator(); }

        public void RemoveAll(System.Collections.Generic.IList<T> list)

        { if (_ReadOnly) throw new NotSupportedException(); }

        public void RetainAll(System.Collections.Generic.IList<T> list)
        { if (_ReadOnly) throw new NotSupportedException(); }

        public void RemoveAll(System.Collections.ArrayList list)

        { if (_ReadOnly) throw new NotSupportedException(); }

        public void RetainAll(System.Collections.ArrayList list)
        { if (_ReadOnly) throw new NotSupportedException(); }

        }

        I've copied RemoveAll/RetainAll verbatim, since I don't know what they are for. I also left them nonvirtual.
        Note that it implements ICollection<T> and not IList<T>: from what I can determine, IList<T> is not necessary. It's also weird to have an indexed access on a type that by definition cannot have one.
        My medicine won't work if non-generic collection interfaces are still lurking in other parts of the code, because List<T>, for historical reasons, also implements the non-generic IList and ICollection interfaces.
        Alas, this opens another can of worms (IList<T> doesn't implement IList, ICollection<T> doesn't implement ICollection, but IEnumerable<T> implements IEnumerable), but the margin of this e-mail is too small to write down the cure.
        I'll stop rambling now and will try my own medicine as soon as I'm able to get the 2.9.4g from the ASF SVN.

        In the meantime, have a nice week-end and thank you for being the cornerstone of .NET's premier search engine.

        Vincent

        Show
        Van Den Berghe, Vincent added a comment - Hello Digy, Thanks for your response. I don't want to sound overly pedantic (but please tell me if I do), but this changed implementation solves only part of the problem. Now, CharArraySet derives from Set<T>, which itself derives from List<T>. Items are now stored both in this base class, as in the private HashSet<string> _Set. However, because List<T> doesn't define its modifiers Add(T), Clear() and Remove(T) as virtual, the derived implementation defines them as "new". This violates a variant of the Liskov substitution principle: an operation on the derived type has not the same effect as the same operation on the base type. In this case, it means that the following code will cause the items in the List<T> base type and in the _Set to be desynchronized: CharArraySet set=... List<string> same=set; same.Add("whatever"); // at this point, same.Contains("whatever")==true but set.Contains("whatever")==false even though it's the same instance. You might rightfully retort that this never happens and I should mind my own business, but I know at least one poor soul who did just that: me . On a completely unrelated matter, the new implementation has 2 methods: public void Add(System.Collections.Generic.IList<T> items) public void Add(Support.Set<T> items) .. which can be collapsed into one, since the only thing used in both cases is the enumerator: public void Add(IEnumerable<T> items) I don't recall the design rule, but it's something like "to increase reuse, make your function parameters are general as possible, but their return value as specific as possible". I am unable to get 2.9.4g to investigate further, but if you are moving towards the Generic collections in Lucene, the following implementation should be a drop-in replacement, without suffering from the aforementioned quirks: [Serializable] public class Set<T> : ICollection<T> { private readonly System.Collections.Generic.HashSet<T> _Set = new System.Collections.Generic.HashSet<T>(); bool _ReadOnly = false; public Set() { } public Set(bool readOnly) { this._ReadOnly = readOnly; } public bool ReadOnly { set { _ReadOnly = value; } get { return _ReadOnly; } } public virtual void Add(T item) { if (_ReadOnly) throw new NotSupportedException(); if (_Set.Contains(item)) return; _Set.Add(item); } public void Add(IEnumerable<T> items) { if (_ReadOnly) throw new NotSupportedException(); foreach (T item in items) { if (_Set.Contains(item)) continue; _Set.Add(item); } } public void Clear() { if (_ReadOnly) throw new NotSupportedException(); _Set.Clear(); } public bool Contains(T item) { return _Set.Contains(item); } public void CopyTo(T[] array, int arrayIndex) { _Set.CopyTo(array, arrayIndex); } public int Count { get { return _Set.Count; } } public bool IsReadOnly { get { return _ReadOnly; } } public bool Remove(T item) { if (_ReadOnly) throw new NotSupportedException(); return _Set.Remove(item); } public IEnumerator<T> GetEnumerator() { return _Set.GetEnumerator(); } System.Collections.IEnumerator System.Collections.IEnumerable.GetEnumerator() { return _Set.GetEnumerator(); } public void RemoveAll(System.Collections.Generic.IList<T> list) { if (_ReadOnly) throw new NotSupportedException(); } public void RetainAll(System.Collections.Generic.IList<T> list) { if (_ReadOnly) throw new NotSupportedException(); } public void RemoveAll(System.Collections.ArrayList list) { if (_ReadOnly) throw new NotSupportedException(); } public void RetainAll(System.Collections.ArrayList list) { if (_ReadOnly) throw new NotSupportedException(); } } I've copied RemoveAll/RetainAll verbatim, since I don't know what they are for. I also left them nonvirtual. Note that it implements ICollection<T> and not IList<T>: from what I can determine, IList<T> is not necessary. It's also weird to have an indexed access on a type that by definition cannot have one. My medicine won't work if non-generic collection interfaces are still lurking in other parts of the code, because List<T>, for historical reasons, also implements the non-generic IList and ICollection interfaces. Alas, this opens another can of worms (IList<T> doesn't implement IList, ICollection<T> doesn't implement ICollection, but IEnumerable<T> implements IEnumerable), but the margin of this e-mail is too small to write down the cure. I'll stop rambling now and will try my own medicine as soon as I'm able to get the 2.9.4g from the ASF SVN. In the meantime, have a nice week-end and thank you for being the cornerstone of .NET's premier search engine. Vincent
        Hide
        Digy added a comment -

        Fixed in 2.9.4g
        DIGY

        Show
        Digy added a comment - Fixed in 2.9.4g DIGY
        Hide
        Digy added a comment -

        I think the simplest solution for 2.9.4 will be adding

                //
                public virtual bool Add(object key, object value)
                {
                    if (key is string)
                        return Add((string)key);
                    else if (key is char[])
                        return Add((char[])key);
                    return false;
                }
        
        

        to CharArraySet

        DIGY

        Show
        Digy added a comment - I think the simplest solution for 2.9.4 will be adding // public virtual bool Add(object key, object value) { if (key is string) return Add((string)key); else if (key is char []) return Add(( char [])key); return false ; } to CharArraySet DIGY
        Hide
        Digy added a comment -

        Fixed.

        DIGY

        Show
        Digy added a comment - Fixed. DIGY

          People

          • Assignee:
            Unassigned
            Reporter:
            Vincent Van Den Berghe
          • Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development