Uploaded image for project: 'Solr'
  1. Solr
  2. SOLR-1060

a new DIH EnityProcessor allowing text file lists of files to be indexed

    XMLWordPrintableJSON

Details

    • New Feature
    • Status: Closed
    • Major
    • Resolution: Fixed
    • 1.4
    • 1.4
    • None

    Description

      I have finished a new DIH EntityProcessor. It is designed around the idea that whatever demon is used to maintain your content store it is likely to drop a report or log file explaining what has changed within your content store. I wish to use this report file to control the indexing of the new or changed content and the removal of old content. The report files, perhaps from un-tar or un-zip, are likely to reference jpegs and directory stubs which need to be ignored. I assumed a file based content repository but this should be expanded to handle URI's as well

      I feel that the current FileListEntityProcessor is poorly named. It should be called the dirWalkEntityProcessor or dirCrawlEntityProcessor or such. And this new EntityProcessor should have the name FileListEntityProcessor. However what is done is done. I then came up with manifestEnityProcessor which I thought suited, manifest files are all over the content sets I deal with and the dictionary definition seemed close enough ("ships manifest"). However how about ChangeListEntityProcessor

             <entity name="jc"
                     processor="ManifestEntityProcessor"
                     baseDir="/Volumes/Techmore/ts/aaa/schema/data"
                     rootEntity="false"
                     dataSource="null"
      
                     allowRegex="^.*\.xml$"
                     blockRegex="usc2009"
                     manifestFileName="/Volumes/ts/man-find.txt"
                     docAddRegex=".*"
                     >
      

      The new entity fields are as follows.

      manifestFileName is the required location of the manifest file. If this value is relative, it assumed to be relative to baseDir.

      allowRegex is an optional attribute that if present discards any line which does not match the regExp

      blockRegex is an optional attribute that is applied after any allowRegex and discards any line which matches the regExp

      docAddRegex is a required regex to identify lines which when matched should cause docs to be added to the index. As well as matching the line it should also return the portion of the line which contains the filepath as group(1)

      docDeleteRegex is an optional value of a regex to identify documents which when matched should be deleted from the index. As well as matching the line it should also return the portion of the line which contains the filepath as group(1) *PLANNED*

      Attachments

        1. regex-fix.patch
          0.8 kB
          Noble Paul
        2. SOLR-1060.patch
          28 kB
          Shalin Shekhar Mangar
        3. SOLR-1060.patch
          27 kB
          Fergus McMenemie
        4. SOLR-1060.patch
          31 kB
          Shalin Shekhar Mangar
        5. SOLR-1060.patch
          34 kB
          Shalin Shekhar Mangar
        6. SOLR-1060.patch
          38 kB
          Fergus McMenemie
        7. SOLR-1060.patch
          20 kB
          Shalin Shekhar Mangar
        8. SOLR-1060.patch
          19 kB
          Fergus McMenemie
        9. SOLR-1060.patch
          13 kB
          Fergus McMenemie
        10. SOLR-1060.patch
          8 kB
          Fergus McMenemie

        Issue Links

          Activity

            People

              shalin Shalin Shekhar Mangar
              fergus Fergus McMenemie
              Votes:
              0 Vote for this issue
              Watchers:
              0 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated:
                  Original Estimate - 120h
                  120h
                  Remaining:
                  Remaining Estimate - 120h
                  120h
                  Logged:
                  Time Spent - Not Specified
                  Not Specified