Solr / SOLR-1406

Refactor FileDataSource and FileListEntityProcessor to be more extensible

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.4
    • Fix Version/s: 1.4
    • Labels:
      None

      Description

      FileDataSource should make openStream method protected so we can extend FileDataSource for other File types such as GZip, by controlling the underlying InputStreamReader implementation being returned.
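A minimal sketch of the extension point described above, assuming a protected openStream hook. SimpleFileSource is a stand-in written for this example, not the real FileDataSource, and GzipFileSource and GzipDemo are hypothetical names; the actual Solr signature may differ.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Stand-in for FileDataSource (NOT the real Solr class). The only point is
// that openStream is protected, so a subclass can choose the Reader.
class SimpleFileSource {
    protected Reader openStream(File file) throws IOException {
        return new InputStreamReader(new FileInputStream(file), StandardCharsets.UTF_8);
    }

    // Reads the whole file through whatever Reader openStream returns.
    public String readAll(File file) {
        try (BufferedReader reader = new BufferedReader(openStream(file))) {
            StringBuilder text = new StringBuilder();
            int c;
            while ((c = reader.read()) != -1) {
                text.append((char) c);
            }
            return text.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}

// Hypothetical subclass: decompresses GZip content before decoding it.
class GzipFileSource extends SimpleFileSource {
    @Override
    protected Reader openStream(File file) throws IOException {
        return new InputStreamReader(
                new GZIPInputStream(new FileInputStream(file)), StandardCharsets.UTF_8);
    }
}

class GzipDemo {
    // Writes text to a temporary .gz file and reads it back via GzipFileSource.
    static String roundTrip(String text) {
        try {
            File tmp = File.createTempFile("sample", ".xml.gz");
            tmp.deleteOnExit();
            try (Writer writer = new OutputStreamWriter(
                    new GZIPOutputStream(new FileOutputStream(tmp)), StandardCharsets.UTF_8)) {
                writer.write(text);
            }
            return new GzipFileSource().readAll(tmp);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because only the Reader construction is overridden, the rest of the reading logic stays in the base class, which is the benefit the request is after.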

      FileListEntityProcessor needs to aggregate a list of files that were processed and expose that list in an accessible way so that further processing on that file list can be done in the close method. For example, deletion or archiving.

      Another improvement would be that in the event of an indexing rollback event, processing of the close method either does not occur, or the close method is allowed access to that event, to prevent processing within the close method if necessary.
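The two requests above (a recorded file list and a rollback-aware close) could look roughly like the following. This is only an illustration: TrackingFileProcessor, its close(boolean rolledBack) parameter, and the callback are all hypothetical and do not correspond to an existing Solr API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical processor, not a Solr class: records each file it hands out
// and post-processes the list in close() unless the import was rolled back.
class TrackingFileProcessor {
    private final List<String> processed = new ArrayList<>();
    private final Consumer<List<String>> onSuccess;

    // onSuccess receives the processed paths, e.g. to delete or archive them.
    TrackingFileProcessor(Consumer<List<String>> onSuccess) {
        this.onSuccess = onSuccess;
    }

    // Called once per file as it is indexed.
    void process(String path) {
        processed.add(path);
    }

    // Read-only view of the files handled so far.
    List<String> getProcessedFiles() {
        return Collections.unmodifiableList(processed);
    }

    // rolledBack is an assumed flag standing in for "access to the rollback event".
    void close(boolean rolledBack) {
        if (!rolledBack) {
            onSuccess.accept(new ArrayList<>(processed));
        }
        processed.clear();
    }
}
```

Passing the rollback state into close keeps the cleanup decision in one place: the files are left untouched whenever the index changes were not committed.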

      1. ASF.LICENSE.NOT.GRANTED--image.gif
        3 kB
        Luke Forehand
      2. SOLR-1406.patch
        13 kB
        Shalin Shekhar Mangar
      3. SOLR-1406.patch
        6 kB
        Shalin Shekhar Mangar
      4. SOLR-1406.patch
        2 kB
        Luke Forehand

        Activity

        Luke Forehand added a comment -

        Patch that adds getDataConfig to Context interface, and ContextImpl implementation.

        i.e.

        public DataConfig getDataConfig() {
            return this.dataImporter.getConfig();
        }
        
        Noble Paul added a comment -

        It is easy to expose DataConfig object. But the problem is I will have less flexibility in changing that once it is exposed. So, which information do you require exactly?

        Luke Forehand added a comment -

        I am implementing EventListeners for both onImportStart and onImportEnd that will work with a custom GZipFileDataSource for indexing. Upon import start I will calculate a file list using the baseDir attribute of the FileListEntityProcessor, putting that list in a session attribute with Global scope. Upon import end I will iterate over this file list and archive/remove these specific files. The purpose is that if other files are added to the baseDir during indexing, they won't be archived/removed by my onImportEnd event implementation and will be available for the next full indexing operation (which happens on a schedule).

        <dataConfig>
          <dataSource name="myfilereader" type="FileDataSource"/>
          <document>
            <entity name="jc" rootEntity="false" dataSource="null"
                    processor="FileListEntityProcessor"
                    fileName="^.*\.xml$" recursive="true"
                    baseDir="/usr/local/apache2/htdocs/imagery">
              ....
            </entity>
          </document>
        </dataConfig>
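The snapshot-then-archive idea described in this comment can be sketched independently of Solr. ImportFileSnapshot, take, archiveTo, and SnapshotDemo are hypothetical names, and the event listener wiring is left out; the sketch only shows why files that arrive after the snapshot are left alone.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper: remembers which files existed at import start.
class ImportFileSnapshot {
    private final List<Path> files;

    private ImportFileSnapshot(List<Path> files) {
        this.files = files;
    }

    // Import start: recursively collect files whose name matches the regex.
    static ImportFileSnapshot take(Path baseDir, String fileNameRegex) {
        Pattern pattern = Pattern.compile(fileNameRegex);
        try (Stream<Path> walk = Files.walk(baseDir)) {
            List<Path> matched = walk
                    .filter(Files::isRegularFile)
                    .filter(p -> pattern.matcher(p.getFileName().toString()).find())
                    .sorted()
                    .collect(Collectors.toList());
            return new ImportFileSnapshot(matched);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    List<Path> files() {
        return files;
    }

    // Import end: move only the snapshotted files; returns how many were moved.
    int archiveTo(Path archiveDir) {
        try {
            Files.createDirectories(archiveDir);
            int moved = 0;
            for (Path file : files) {
                if (Files.exists(file)) {
                    Files.move(file, archiveDir.resolve(file.getFileName()),
                            StandardCopyOption.REPLACE_EXISTING);
                    moved++;
                }
            }
            return moved;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}

class SnapshotDemo {
    // Returns the file names left in the base directory after archiving.
    static List<String> run() {
        try {
            Path base = Files.createTempDirectory("imagery");
            Files.write(base.resolve("a.xml"), new byte[0]);
            ImportFileSnapshot snapshot = ImportFileSnapshot.take(base, "^.*\\.xml$");
            Files.write(base.resolve("late.xml"), new byte[0]); // arrives during indexing
            snapshot.archiveTo(Files.createTempDirectory("archive"));
            try (Stream<Path> left = Files.list(base)) {
                return left.map(p -> p.getFileName().toString()).collect(Collectors.toList());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Note that this sketch flattens everything into one archive directory, so two files with the same name in different subdirectories would overwrite each other; a real implementation would need to preserve relative paths.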
        
        Shalin Shekhar Mangar added a comment -

        Instead of exposing DataConfig, perhaps you can extend FileListEntityProcessor, override close(), and do these tasks there? Note that EntityProcessor#close() was introduced in Solr 1.4.

        Luke Forehand added a comment -

        I could extend FileListEntityProcessor if it were written in a more extensible way, for example, by exposing its baseUrl and fileName private members through accessor methods, and by refactoring some of the private methods that do fileName filtering so that they are reusable and protected.

        Shalin Shekhar Mangar added a comment -

        "I could extend FileListEntityProcessor if it was written in a more extensible way, for example, exposing its baseUrl and fileName private members with accessor methods, and refactoring some of the private methods that do fileName filtering so that they are reusable and protected."

        Ah, I see. Well, that is easier than exposing DataConfig. DataConfig was never really meant to be exposed. We need to have another look at DataConfig before making it a public API. How about you create an issue (or rename this one) to make FileListEntityProcessor more extensible rather than exposing DataConfig? We can get that in for 1.4.

        Shalin Shekhar Mangar added a comment -
        1. Made FileDataSource and FileListEntityProcessor more extensible.
        2. I also found that biggerThan and smallerThan were two parameters in FileListEntityProcessor which were never being set. I've fixed that.

        I have not exposed FileListEntityProcessor's getFolderFiles method because I'd prefer to keep the implementation private for now so that we can change it in future without worrying about back-compat (see SOLR-1313). However, if one wants to know the file names being processed, one can override EntityProcessorBase#getNext.
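The getNext override suggested here might look like the sketch below. RowProcessor is a stand-in for EntityProcessorBase written for this example, RecordingProcessor is a hypothetical subclass, and the "fileAbsolutePath" row key is an assumption about what each row carries; check the real class before relying on any of these names.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Stand-in for EntityProcessorBase (NOT the real Solr class): getNext()
// returns one row per file, or null when there are no more rows.
class RowProcessor {
    private final Iterator<Map<String, Object>> rows;

    RowProcessor(List<Map<String, Object>> rows) {
        this.rows = rows.iterator();
    }

    protected Map<String, Object> getNext() {
        return rows.hasNext() ? rows.next() : null;
    }
}

// Hypothetical subclass that records the path of every row it hands out.
class RecordingProcessor extends RowProcessor {
    // Assumed row key holding the file path.
    static final String PATH_KEY = "fileAbsolutePath";

    private final List<String> seen = new ArrayList<>();

    RecordingProcessor(List<Map<String, Object>> rows) {
        super(rows);
    }

    @Override
    protected Map<String, Object> getNext() {
        Map<String, Object> row = super.getNext();
        if (row != null && row.get(PATH_KEY) != null) {
            seen.add(row.get(PATH_KEY).toString());
        }
        return row;
    }

    // Paths observed so far, for later cleanup such as archiving.
    List<String> seenFiles() {
        return Collections.unmodifiableList(seen);
    }
}
```

Recording at getNext time means the list contains only files that were actually handed to the indexer, without depending on how the file listing itself is implemented.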

        Luke, does this help in your use-case?

        Shalin Shekhar Mangar added a comment -
        1. Fixed logic of resolving variables in newerThan, olderThan, biggerThan and smallerThan
        2. Added tests for biggerThan and smallerThan
        Shalin Shekhar Mangar added a comment -

        Committed revision 812122.

        Thanks Luke!

        Luke Forehand added a comment - edited

        I believe this does help me a lot, thanks for your time!

        Luke Forehand
        Core Development
        Networked Insights, Inc. <http://www.networkedinsights.com/>

        Grant Ingersoll added a comment -

        Bulk close Solr 1.4 issues


          People

          • Assignee:
            Shalin Shekhar Mangar
            Reporter:
            Luke Forehand
          • Votes:
            0
            Watchers:
            1
