SOLR-1060

A new DIH EntityProcessor allowing text-file lists of files to be indexed

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.4
    • Fix Version/s: 1.4
    • Labels: None

      Description

      I have finished a new DIH EntityProcessor. It is designed around the idea that whatever daemon is used to maintain your content store, it is likely to drop a report or log file explaining what has changed within your content store. I wish to use this report file to control the indexing of the new or changed content and the removal of old content. The report files, perhaps from un-tar or un-zip, are likely to reference JPEGs and directory stubs which need to be ignored. I assumed a file-based content repository, but this should be expanded to handle URIs as well.

      I feel that the current FileListEntityProcessor is poorly named. It should be called the DirWalkEntityProcessor or DirCrawlEntityProcessor or such, and this new EntityProcessor should have the name FileListEntityProcessor. However, what is done is done. I then came up with ManifestEntityProcessor, which I thought suited; manifest files are all over the content sets I deal with, and the dictionary definition seemed close enough ("ship's manifest"). However, how about ChangeListEntityProcessor?

             <entity name="jc"
                     processor="ManifestEntityProcessor"
                     baseDir="/Volumes/Techmore/ts/aaa/schema/data"
                     rootEntity="false"
                     dataSource="null"
      
                     allowRegex="^.*\.xml$"
                     blockRegex="usc2009"
                     manifestFileName="/Volumes/ts/man-find.txt"
                     docAddRegex=".*"
                     >
      

      The new entity fields are as follows.

      manifestFileName is the required location of the manifest file. If this value is relative, it is assumed to be relative to baseDir.

      allowRegex is an optional attribute; if present, any line which does not match the regex is discarded.

      blockRegex is an optional attribute that is applied after any allowRegex and discards any line which matches the regex.

      docAddRegex is a required regex to identify lines which, when matched, should cause docs to be added to the index. As well as matching the line, it should also return the portion of the line which contains the filepath as group(1).

      docDeleteRegex is an optional regex to identify documents which, when matched, should be deleted from the index. As well as matching the line, it should also return the portion of the line which contains the filepath as group(1). *PLANNED*
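
      For illustration only, here is a minimal sketch of how the per-line filtering described above could behave. The attribute semantics come from this description; the class and method names are hypothetical, not part of any patch.

         import java.util.regex.Matcher;
         import java.util.regex.Pattern;

         // Hypothetical sketch of the per-line filtering described above.
         public class ManifestLineFilter {
           private final Pattern allow;  // allowRegex: null means accept everything
           private final Pattern block;  // blockRegex: null means block nothing
           private final Pattern docAdd; // docAddRegex: group(1) should capture the filepath

           public ManifestLineFilter(String allowRegex, String blockRegex, String docAddRegex) {
             this.allow = allowRegex == null ? null : Pattern.compile(allowRegex);
             this.block = blockRegex == null ? null : Pattern.compile(blockRegex);
             this.docAdd = Pattern.compile(docAddRegex);
           }

           /** Returns the filepath to index, or null if the line should be ignored. */
           public String filePathToAdd(String line) {
             if (allow != null && !allow.matcher(line).find()) return null; // fails allowRegex
             if (block != null && block.matcher(line).find()) return null;  // hits blockRegex
             Matcher m = docAdd.matcher(line);
             if (!m.find()) return null;
             // group(1) carries the filepath when the regex captures one; fall back to the match
             return (m.groupCount() >= 1 && m.group(1) != null) ? m.group(1) : m.group();
           }
         }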

      1. regex-fix.patch
        0.8 kB
        Noble Paul
      2. SOLR-1060.patch
        28 kB
        Shalin Shekhar Mangar
      3. SOLR-1060.patch
        27 kB
        Fergus McMenemie
      4. SOLR-1060.patch
        31 kB
        Shalin Shekhar Mangar
      5. SOLR-1060.patch
        34 kB
        Shalin Shekhar Mangar
      6. SOLR-1060.patch
        38 kB
        Fergus McMenemie
      7. SOLR-1060.patch
        20 kB
        Shalin Shekhar Mangar
      8. SOLR-1060.patch
        19 kB
        Fergus McMenemie
      9. SOLR-1060.patch
        13 kB
        Fergus McMenemie
      10. SOLR-1060.patch
        8 kB
        Fergus McMenemie


          Activity

          Noble Paul added a comment -

          A few comments.

          • ChangeListEntityProcessor is preferred over ManifestEntityProcessor. manifestFileName can be changed to fileName.
          • allowRegex and blockRegex can be renamed to something else; how about acceptLineRegex and omitLineRegex?
          • dataSource='null' is not required from 1.4 onwards.
          Fergus McMenemie added a comment -

          This is by no means the finished article. It has no test case and only deals with a list file on disk; further, the list file can only refer to files that are also on disk. None of the URI stuff is there yet; I guess I ran out of brain power and could not find anything suitable to copy and rework. Having a baseDir feature that seamlessly applies to both www and disk cases is a pain.

          However, I have implemented all the suggestions and given it a good testing; works for me.

          Fergus McMenemie added a comment -

          I am working on minor changes to the (now inappropriately named) HttpDataSource to allow it to access file:// based resources more cleanly. It should IMHO have been called URIDataSource.

          I am also rewriting ChangeListEntityProcessor to allow it to cooperate with a child entity which is using either HttpDataSource or FileDataSource. This cooperation will not be automatic; the person setting up the data-config.xml will need to take account of which datasource the child is using when configuring the parent ChangeListEntityProcessor. And of course IMHO ... FileDataSource should actually have been called DiskDataSource or FilesysDataSource.

          Shalin Shekhar Mangar added a comment -

          I haven't had a chance to look at the patch yet. But some general comments/questions:

          1. Looks like the ChangeListEntityProcessor is trying to do two things – read a file line by line, and process it according to add/delete instructions.
          2. How about we have a LineEntityProcessor (we can change the name) which just reads text files line by line, and a ChangeListEntityProcessor which extends LineEntityProcessor and detects/processes changes according to the regex? This could enable someone to write a CSVTransformer on top of LineEntityProcessor, for instance (a rough sketch follows this list).
          3. Is it necessary to change the semantics of HttpDataSource? It was meant to read an HTTP response and one should use FileDataSource for reading files. Why mix the functionalities at all?
          4. What is the cooperation that needs to be built with ChangeListEntityProcessor and HTTP/File DataSource?
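
          A rough sketch of what that split might look like; the class names come from the suggestion above, while the bodies are hypothetical and only assume the DIH convention that nextRow() returns null at end of input:

          import java.io.BufferedReader;
          import java.io.IOException;
          import java.util.HashMap;
          import java.util.Map;
          import java.util.regex.Pattern;

          // Hypothetical base class: just turns lines into rows, one row per line.
          class LineEntityProcessor {
            protected final BufferedReader reader;

            LineEntityProcessor(BufferedReader reader) { this.reader = reader; }

            /** Returns one row per line, or null at end of input (the DIH convention). */
            public Map<String, Object> nextRow() {
              try {
                String line = reader.readLine();
                if (line == null) return null;
                Map<String, Object> row = new HashMap<String, Object>();
                row.put("rawLine", line);
                return row;
              } catch (IOException e) {
                throw new RuntimeException(e);
              }
            }
          }

          // Hypothetical subclass: layers change-list filtering on top of the line reader.
          class ChangeListEntityProcessor extends LineEntityProcessor {
            private final Pattern acceptLine = Pattern.compile("^.*\\.xml$"); // acceptLineRegex
            private final Pattern omitLine = Pattern.compile("usc2009");      // omitLineRegex

            ChangeListEntityProcessor(BufferedReader reader) { super(reader); }

            @Override
            public Map<String, Object> nextRow() {
              Map<String, Object> row;
              while ((row = super.nextRow()) != null) {
                String line = (String) row.get("rawLine");
                if (!acceptLine.matcher(line).find()) continue; // discarded: no match
                if (omitLine.matcher(line).find()) continue;    // discarded: blocked
                return row; // add/delete analysis of the accepted line would go here
              }
              return null;
            }
          }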
          Fergus McMenemie added a comment -
          1. Correct.
          2. Great idea, I am fine with that; but the next patch I submit will still do both. I will work on separating them once it is working.
          3. Sort of; it either needs a change or we need a new URLDataSource. FileDataSource is fine for reading files from filesystems, but it does not handle the file:// syntax. HttpDataSource already half supports reading from file:// locations (if the URL does not contain the protocol string http:// then baseUrl can be used to add any protocol!). But after considering the input from Otis Gospodnetic and Paul Libbrecht, and thinking for a bit, it seemed reasonable that the location of the ChangeList could be specified using file:/// or http:// syntax as well as using plain old filesystem syntax. Likewise the lines within the ChangeList could use either URL or filesystem syntax; this in turn also applies to the baseDir.
          4. I have been playing about with ChangeListEntityProcessor feeding rows to XPathEntityProcessor. ChangeListEntityProcessor currently supports use of either FileDataSource or HttpDataSource by XPathEntityProcessor. However, depending on which DataSource is specified within XPathEntityProcessor, ChangeListEntityProcessor needs to be configured to return rows with the appropriate syntax. (I think!)
          Shalin Shekhar Mangar added a comment -

          My concern is that we have two data sources whose names identify their respective functionality. With this change FileDataSource becomes redundant and HttpDataSource does not give the impression that it can read files too. I assume that everyone will be generating the changeset using their own sweet tools/programs. Therefore it is a simple task for the changeset generator to generate http/file separately or mark them differently. Then one can use different root entities.

          The ChangeListEntityProcessor should not care whether the changelist contains a filepath or url – they are all strings to it which should be added to a map and passed along. If it is a delete set, then it should set $deleteDocByQuery or $deleteDocById along with $skipDoc and return. The ChangeListEntityProcessor should not try to collaborate with any other entity. It does not need to know about them, just as current EntityProcessors do not know about each other.

          For example:

          <dataConfig>
            <dataSource name="file" type="FileDataSource"/>
            <document>
              <entity name="changeSet" processor="ChangeSetEntityProcessor"
                      rootEntity="false"
                      allowRegex="^.*\.xml$"
                      blockRegex="usc2009"
                      manifestFileName="/Volumes/ts/man-find.txt"
                      docAddRegex=".*">
                <entity name="indexer" processor="XPathEntityProcessor"
                        dataSource="file"
                        forEach="/root/a"
                        url="${changeSet.filename}">
          
                </entity>
              </entity>
            </document>
          </dataConfig>
          

          Here the ${changeSet.filename} is just a normal key/val in the changeSet entity's row map.

          Noble Paul added a comment -

          How about we have a LineEntityProcessor (we can change the name) which just reads text files line by line and a ChangeListEntityProcessor which extends LineEntityProcessor

          Does the ChangeListEntityProcessor have to be an EntityProcessor? EntityProcessors are heavyweight components. They are complex and ugly. What stops it from being a Transformer? A Transformer is simple and it can do almost everything an EntityProcessor does. The only thing a Transformer does not usually do is generate its own data from a DataSource.

          This is just a thought.
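
          For what it's worth, a minimal sketch of the Transformer route, assuming some upstream entity has already put the raw line into the row. The transformRow signature is DIH's Transformer contract; the class name, field names and the DELETE line format here are made up:

          import java.util.Map;
          import java.util.regex.Matcher;
          import java.util.regex.Pattern;

          import org.apache.solr.handler.dataimport.Context;
          import org.apache.solr.handler.dataimport.Transformer;

          // Hypothetical Transformer doing the add/delete classification per row.
          public class ChangeListTransformer extends Transformer {
            // Made-up change-list format: lines starting with DELETE name a doc to remove.
            private static final Pattern DELETE = Pattern.compile("^DELETE\\s+(.*)$");

            @Override
            public Object transformRow(Map<String, Object> row, Context context) {
              String line = (String) row.get("rawLine");
              if (line == null) return row;
              Matcher m = DELETE.matcher(line);
              if (m.matches()) {
                row.put("$deleteDocByQuery", "fileAbsolutePath:" + m.group(1));
                row.put("$skipDoc", "true"); // do not also index the row being deleted
              } else {
                row.put("fileAbsolutePath", line);
              }
              return row;
            }
          }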

          FileDataSource is fine for reading files from filesystems but it does not handle the file:// syntax. HttpDataSource already half supports reading from file:// locations (if the URL does not contain the protocol string http:// then baseUrl can be used to add any protocol!).

          Is it a possibility that the kind of data source is not known in advance? Then the user would be able to configure an appropriate data source for that entity.

          Fergus McMenemie added a comment -

          Oh dear, this is getting complicated!

          "My concern is that we have two data sources whose names identify their respective functionality. With this change FileDataSource becomes redundant and HttpDataSource does not give the impression that it can read files too. I assume that everyone will be generating the changeset using their own sweet tools/programs. Therefore it is a simple task for the changeset generator to generate http/file separately or mark them differently. Then one can use different root entities."

          Hmmmm, no, I am not sure about this.

          1. Firstly, I agree "FileDataSource becomes redundant and HttpDataSource .. can read files"; bit of a mess really. Ideally I think we need a new dataSource that can read from either a filesystem or a URI.
          2. I, the poor old content indexer, am often presented with the manifest as a fait accompli. It comes as part of the update kit, and I have little or no control of its format. I would have to organise some middleware to sort out its format if we restrict DIH, which would be a pity, since the proposed changes should allow Solr to directly handle every case I have seen, and I suspect that is well over 80% of the use case.
          3. Even if the lines read from the changelist are simple filepaths, how we access those files will depend on other factors. They could be on a local or remote machine; the lines read from the file will not indicate this. As Noble implies, we may not know this ahead of time, so we need to be able to pass parameters into the system which supply that information.

          <thinking out loud>

          1. We need to be able to read lines describing changes we may wish to make to our index from a file:// location, a RESTful web service or a URL.
          2. The lines read will need to be analysed for two purposes: a) to identify the portion of the line we are interested in; b) to reformat that portion such that it can be passed to the child entity, which will in turn pass it to a dataSource.
          3. We do not know which dataSource the child entity may be using, which makes the reformatting stage 2b) a bit trickier. Hence the required cooperation.

          1) and 2a) could be done by ChangeListEntityProcessor (as Noble says, we need an EntityProcessor because it is generating data... without even a datasource!)
          2b) could be done by a transformer; information will need to be available to the transformer to allow it to deal with local or remote access.
          3)?????
          </thinking out loud>

          For the moment I was intending to build 1), 2a) and 2b) into the ChangeListEntityProcessor; it does not appear to be bad. Once done, perhaps we can look again at a need to lift 2b) into a separate EntityProcessor or Transformer.

          Fergus McMenemie added a comment -

          Oh, and by the way: the "Configuration of HttpDataSource" section of the DIH wiki, where it describes the entity url attribute, says...

          url (required) : The url used to invoke the REST API. (Can be templatized). if the data souce is file this must be the file location

          So I guess the cat is already out of the bag about HttpDataSource reading file://

          Noble Paul added a comment -

          Fergus, it is not a bad idea to have a URIDataSource if that simplifies the problem. The URIDataSource can wrap FileDataSource/HttpDataSource (if that is convenient).

          Fergus McMenemie added a comment -

          A couple of lines changed in HttpDataSource and we have a URIDataSource!

          -      if (query.startsWith("http:")) {
          -        url = new URL(query);
          -      } else {
          -        url = new URL(baseUrl + query);
          -      }
          +      if ( URIMETHOD.matcher(query).find()) url = new URL(query);
          +      else url = new URL(baseUrl + query);
          ...
          +  private static final Pattern URIMETHOD = Pattern.compile("\\w{3,}:");
          

          As I said at the start, HttpDataSource already half supports other protocols; it's just that it assumes http: when deciding whether to prepend the baseUrl. A minor bug, probably. Is it worth wrapping? Also, given the cat is already out of the bag, shouldn't we just tweak HttpDataSource?
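
          If it helps, here is a quick, hypothetical check of what that pattern accepts; the {3,} seems to be what keeps single drive letters like C: from being mistaken for a protocol:

          import java.util.regex.Pattern;

          public class UriMethodCheck {
            private static final Pattern URIMETHOD = Pattern.compile("\\w{3,}:");

            public static void main(String[] args) {
              // Protocol prefixes match, so the query is used as-is...
              System.out.println(URIMETHOD.matcher("http://host/a.xml").find());  // true
              System.out.println(URIMETHOD.matcher("file:///tmp/a.xml").find());  // true
              // ...while bare paths and drive letters fall through to baseUrl + query.
              System.out.println(URIMETHOD.matcher("C:/tmp/a.xml").find());       // false
              System.out.println(URIMETHOD.matcher("data/a.xml").find());         // false
            }
          }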

          Noble Paul added a comment -

          OK. Let us do this:

          make the change to HttpDataSource and then create a new DataSource, URIDataSource extends HttpDataSource.
          So users can use both in this release, and let us deprecate HttpDataSource in favor of URIDataSource in the future.

          Fergus McMenemie added a comment - edited

          I have rewritten and tested the ChangeListEntityProcessor such that it supports URLs. This allows the list of changes to be fetched from a local file, a simple URL or any RESTful-type web service. The list of changes must appear as one change per line; the line can contain an absolute file:/// or http:// pathname, or it can be a relative pathname. The entity attribute baseLocation specifies a prefix to be used with relative pathnames. baseLocation must be a valid URL; file:/// or http:// for the moment. The entity attributes are as follows:

          • fileName is the required URL location of the change list. If this value is relative, it is assumed to be relative to baseLocation.
          • acceptLineRegex is an optional attribute; if present, any line read from the change list which does not match the regex is discarded.
          • omitLineRegex is an optional attribute that is applied after any acceptLineRegex and discards any line read from the change list which matches the regex.
          • docAddRegex is an optional regex to identify lines which, when matched, should cause docs to be added to the index. As well as matching the line, it should also return the portion of the line which is to be treated as the pathname, as group(1). If not specified, the whole line is assumed to be a valid pathname.
          • docDeleteRegex is an optional regex to identify documents which, when matched, should be deleted from the index. As well as matching the line, it should also return the portion of the line which contains the filepath as group(1). PLANNED WORK, see SOLR-1059
          • baseLocation is a required prefix added to fileName or to lines read from the change list which do not appear to be absolute http:// or file:/// URLs.

          Here is a sample of the way I used it:

                 <entity name="jc"
                         processor="ChangeListEntityProcessor"
                         acceptLineRegex="^.*\.xml$"
                         omitLineRegex="usc2009"
                         fileName="file:///Volumes/ts/man-findlsurl.txt"
                         rootEntity="false"
                         dataSource="null"
                         baseLocation="http://localhost/ford/"
                         docAddRegex="\s+([^ ]*)$"
                         >
          

          This entity returns a row containing a single "fileAbsolutePath" field for each pathname accepted from the changelist. If the docDeleteRegex is matched then further fields will also be returned: $deleteDocId=?? and $deleteDocQuery=??. What do I need to set these values to?

          I have also created a URLDataSource; it seems to work. However, "an expert" had better review what I have done; I am still very inexperienced re Java best practice. On that topic: why did we not rename the existing HttpDataSource to URLDataSource and then make HttpDataSource a wrapper for URLDataSource?

          Testing with my sample of 40000 documents reveals no noticeable slowdown compared with FileListEntityProcessor.

          Fergus McMenemie added a comment -

          Oops. Another version of the patch, but formatted how ASF like things formatted. I forgot to mention that I also changed HttpDataSource and FileDataSource as follows:

          1. harmonised the LOG messages for the individual files processed, making them equivalent and only outputting at DEBUG level.
          2. HttpDataSource was altered to make it a generic URL reader, and of course URLDataSource is merely a wrapper around it.
          Shalin Shekhar Mangar added a comment -

          This is great!

          $deleteDocId=?? and $deleteDocQuery=??. What do I need to set these values to?

          Set them to a boolean as a string.

          why did we not rename the existing httpDataSource to URLDataSource and then make httpDataSource a wrapper for URLDataSource?

          +1. Let's do this.

          Noble Paul added a comment -

          Hi Fergus,
          did you consider splitting the ChangeListEntityProcessor into two:

          • LineEntityProcessor, and
          • ChangeListEntityProcessor extends LineEntityProcessor

          Tomorrow someone is definitely going to ask for a LineEntityProcessor.

          why did we not rename the existing httpDataSource to URLDataSource

          +1

          Fergus McMenemie added a comment - edited

          Yes, briefly, I did, and could not see how it could be done nicely; however, it is quite possible I am misunderstanding things.

          To recap, the idea was to split "ChangeListEntityProcessor" into two halves. The first half would deal with reading lines from file:/// or http:// locations, with features to allow lines to be omitted or accepted. The second half would focus on analyzing the line, turning it into add/delete instructions and identifying the portion of the line which was to be operated on. Is this correct?

          If my understanding is correct, then if baseLocation was allowed to be empty and "docAddRegex" and "docDeleteRegex" were not supplied, the line from the changelist could be returned by the entity exactly as read from the file. Further, if "acceptLineRegex" and "omitLineRegex" are also undefined, then the whole file is returned to the next entity. Would that make it the same as part one?

          I had looked at removing all my code for doing the second half described above, replacing it with transformers. I guess as long as the TemplateTransformer can assign to the fields $deleteDocId and $deleteDocQuery then it is do-able. Is the following valid? In the following I always assign to $deleteDocQuery but make $deleteDocId true/false to control actual deletion.

          <entity name="jc"
                     processor="ChangeListEntityProcessor"
                     fileName="file:///Volumes/ts/man-findlsurl.txt"
                     rootEntity="false"
                     baseLocation="http://localhost/ford/"
                     transformer="TemplateTransformer,RegexTransformer">
                     >
          <field column="id"                regex=".*(-- find jucy bit--).*" replaceWith="$1" \>
          <field column="$deleteDocQuery"   regex=".*(-- find jucy bit--).*" replaceWith="$1"      sourceColName="fileAbsolutePath"/>
          <field column="$deleteDocId"      template="false"   regex=".*(-- find add/del bit--).*" replaceWith="true" sourceColName="fileAbsolutePath"/>
          

          ?

          Noble Paul added a comment -

          To recap, the idea was to split "ChangeListEntityProcessor" into two halves. The first half would ....

          Right.

          The second part may become a bit easier with SOLR-1061

          Fergus McMenemie added a comment -

          OK. I will try and see if I can do as outlined above.

          However, I still don't think I have understood the use of $deleteDocQuery and $deleteDocId properly. Shalin says they are a boolean and a string, yet SOLR-1059 implies they are both strings.

          Now, following the information from SOLR-1059, I think I need to set $deleteDocQuery to a valid Solr query. If so, my entity needs to know the Solr field name to use and what it contains; I do not have control of these items. I either need to mandate to the user that, for deletes to function, the field, say "fileAbsolutePath", has to be defined in the schema.xml and be equal to the filename returned by the entity. I am trying to test this today... I am not sure the transformers provide the flexibility I need to do everything.

          Shalin Shekhar Mangar added a comment - edited

          Shalin says they are a boolean and a string. Yet Solr-1059 implies they are both strings.

          Sorry for the confusion. I meant that the value should be "true" or "false" (boolean values as a string type)

          I don't know what I was thinking when I wrote that. I'm still stuck in the boolean world of $skipDoc etc. Please disregard my mumblings. Yes, they are both strings.

          Fergus McMenemie added a comment - edited

          Hi,

          I have the following snippet from my data-config.xml. This is after removing all code from ChangeListEntityProcessor which deals with finding the juicy part of the line. However, I get tracebacks whenever I start Tomcat, saying that my schema.xml has no mention of $deleteQuery. Do I have to declare a field $deleteQuery in my schema.xml? If so, it is rather ugly!

          I was wondering if perhaps field/column names beginning with '$' could be considered magic, or in some fashion local to data-config.xml, and skip whatever check is causing Tomcat to bomb. With the new power of transformers I could see a need for "temporary variables" within data-config.xml.

          Mar 17, 2009 8:54:31 PM org.apache.solr.handler.dataimport.DataImporter loadDataConfig
          INFO: Data Configuration loaded successfully
          Mar 17, 2009 8:54:31 PM org.apache.solr.handler.dataimport.DataImportHandler inform
          SEVERE: Exception while loading DataImporter
          org.apache.solr.handler.dataimport.DataImportHandlerException: There are errors in the Schema
          The field :$deleteQuery present in DataConfig does not have a counterpart in Solr Schema
          
          	at org.apache.solr.handler.dataimport.DataImporter.<init>(DataImporter.java:109)
          	at org.apache.solr.handler.dataimport.DataImportHandler.inform(DataImportHandler.java:96)
          	at org.apache.solr.core.SolrResourceLoader.inform(SolrResourceLoader.java:388)
          	at org.apache.solr.core.SolrCore.<init>(SolrCore.java:571)
          	at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:122)
          	at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:69)
          	at org.apache.catalina.core.ApplicationFilterConfig.getFilter(ApplicationFilterConfig.java:223)
          
          <dataConfig>
            <dataSource name="myFILEreader" type="FileDataSource"/>    
            <dataSource name="myURIreader"  type="URLDataSource" />    
              <document>
                <entity name="jc"
                         processor="ChangeListEntityProcessor"
                         acceptLineRegex="^.*\.xml$"
                         omitLineRegex="usc2009"
                         fileName="file:///Volumes/spare/ts/man-findlsurl.txt"
          	       rootEntity="false"
          	       dataSource="null"
                         baseLocation="file:///Volumes/spare/ts/ford"
          	       transformer="RegexTransformer"
          	       >
          	<!-- the following columns are only defined if the regex matches -->
          	<field column="fileAbsolutePath"    regex="\s+([^ ]*)$" replaceWith="${jc.baseLocation}/$1"  sourceColName="rawLine"/>
          	<field column="$deleteQuery"        regex="^DELETE\s+"  replaceWith="${jc.fileAbsolutePath}" sourceColName="rawLine"/> 	       
          
          	<entity name="x"
          		dataSource="myurireader"
          		processor="XPathEntityProcessor"
          		url="${jc.fileAbsolutePath}"
          		rootEntity="true"
          		flatten="true"
          		stream="false"
          		forEach="/record | /record/mediaBlock"
          		transformer="DateFormatTransformer,TemplateTransformer,RegexTransformer">
          
          <field column="fileAbsolutePath"                 template="${jc.fileAbsolutePath}" />
          <field column="fileWebPath"                      template="${jc.fileAbsolutePath}" regex="${dataimporter.request.fordinstalldir}(.*)" replaceWith="/ford$1"/>
          <field column="fileWebDir"                       regex="(.*)/.*" replaceWith="$1" sourceColName="fileWebPath"/>
          
          Shalin Shekhar Mangar added a comment -

          However I get tracebacks when ever I start tomcat saying that my schema.xml has no mention of $deleteQuery. Do I have to declare a field $deleteQuery in my schema.xml; if so it is rather ugly!

          EntityProcessorBase should remove such variables after using them. Glancing at the code, I think we will see the same problem if we put $skipDoc as "false". We need to update the SOLR-1059 patch.
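
          A minimal sketch of that, assuming row is the usual DIH row map; the idea is that $-prefixed special commands ($deleteQuery, $skipDoc, ...) are consumed by DIH itself and should never reach the schema check or the built document:

          import java.util.Iterator;
          import java.util.Map;

          final class SpecialCommandStripper {
            /** Remove DIH special commands from a row once they have been acted upon. */
            static void removeSpecialCommands(Map<String, Object> row) {
              Iterator<String> it = row.keySet().iterator();
              while (it.hasNext()) {
                if (it.next().startsWith("$")) it.remove();
              }
            }
          }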

          Noble Paul added a comment -

          Shalin, we need to remove the check done in DataImporter for variables being present.

          Fergus McMenemie added a comment -

          Thanks for the changes to SOLR-1059. I am now attempting to test document deletion; it is not going too well!

          Mar 18, 2009 1:12:48 PM org.apache.solr.core.SolrCore execute
          INFO: [] webapp=/apache-solr-1.4-dev path=/dataimport params={command=full-import&clean=false&entity=single-delete&single=/Volumes/spare/ts/schema/data/news/fdw2008/jn71796.xml} status=0 QTime=1 
          Mar 18, 2009 1:12:48 PM org.apache.solr.handler.dataimport.SolrWriter readIndexerProperties
          INFO: Read dataimport.properties
          Mar 18, 2009 1:12:48 PM org.apache.solr.handler.dataimport.DataImporter doFullImport
          INFO: Starting Full Import
          Mar 18, 2009 1:12:48 PM org.apache.solr.handler.dataimport.SolrWriter readIndexerProperties
          INFO: Read dataimport.properties
          Mar 18, 2009 1:12:48 PM org.apache.solr.handler.dataimport.SolrWriter upload
          WARNING: Error creating document : SolrInputDocument[{}]
          org.apache.solr.common.SolrException: Document [null] missing required field: vdkvgwkey
          	at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:292)
          	at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:59)
          	at org.apache.solr.handler.dataimport.SolrWriter.upload(SolrWriter.java:67)
          	at org.apache.solr.handler.dataimport.DataImportHandler$1.upload(DataImportHandler.java:274)
          	at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:373)
          	at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:221)
          	at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:163)
          	at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:309)
          	at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:367)
          	at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:348)
          Mar 18, 2009 1:12:48 PM org.apache.solr.handler.dataimport.SolrWriter upload
          WARNING: Error creating document : SolrInputDocument[{}]
          org.apache.solr.common.SolrException: Document [null] missing required field: vdkvgwkey
          	at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:292)
          	at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:59)
          	at org.apache.solr.handler.dataimport.SolrWriter.upload(SolrWriter.java:67)
          	at org.apache.solr.handler.dataimport.DataImportHandler$1.upload(DataImportHandler.java:274)
          	at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:373)
          	at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:221)
          	at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:163)
          	at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:309)
          	at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:367)
          	at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:348)
          Mar 18, 2009 1:12:48 PM org.apache.solr.handler.dataimport.SolrWriter upload
          WARNING: Error creating document : SolrInputDocument[{}]
          org.apache.solr.common.SolrException: Document [null] missing required field: vdkvgwkey
          	at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:292)
          	at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:59)
          	at org.apache.solr.handler.dataimport.SolrWriter.upload(SolrWriter.java:67)
          	at org.apache.solr.handler.dataimport.DataImportHandler$1.upload(DataImportHandler.java:274)
          	at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:373)
          	at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:221)
          	at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:163)
          	at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:309)
          	at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:367)
          	at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:348)
          Mar 18, 2009 1:12:48 PM org.apache.solr.handler.dataimport.DocBuilder execute
          INFO: Time taken = 0:0:0.41
          
          

          My entity is as follows:

               <entity name="single-delete"
          		 dataSource="myFILEreader"
          		 processor="XPathEntityProcessor"
          		 url="${dataimporter.request.single}"
          		 rootEntity="true"
          		 flatten="true"
          		 stream="false"
          		 forEach="/record | /record/mediaBlock"
          		 transformer="RegexTransformer">
          
                <!-- the following columns are only defined if the regex matches -->
                <field column="fileAbsolutePath"    template="${dataimporter.request.single}" /> 
                <field column="$deleteQuery"        template="fileAbsolutePath:${dataimporter.request.single}" /> 	       
                <field column="vdkvgwkey"           template="${dataimporter.request.single}" /> 
                </entity>
          

          And repeating what was shown in the traceback, my test was:

          get 'http://localhost:8080/apache-solr-1.4-dev/dataimport?command=full-import&entity=single-delete&clean=false&single=/Volumes/spare/ts/ford/schema/data/news/fdw2008/jn71796.xml'

          Hide
          Noble Paul added a comment -

          Hi Fergus,
          The issue here is that when a $deleteQuery/$deleteId is present, the document is still inserted. One way out is to set $skipDoc in the same row; alternatively, we could add a check to DIH to avoid inserting a doc when the uniqueKey is absent.

          Fergus McMenemie added a comment -

          I do like SOLR complaining if the ID is missing or not unique. Or I guess I need to set $skipDoc="true" (is that syntax correct?). However, I think that $skipDoc should be invoked internally whenever $deleteQuery/$deleteId is present.

          For the moment I will try $skipDoc="true" using an extra transform.

          Noble Paul added a comment -

          I do like SOLR complaining if the ID is missing

          The only problem with this solution is that it may not work if <uniqueKey> is not specified.

          I guess I need to set $skipDoc="true" (is that syntax correct?).

          yes.
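
          For reference, a minimal sketch of that workaround as a field declaration (assuming TemplateTransformer is enabled on the entity, as in the configs below):

               <field column="$skipDoc" template="true" />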

          Fergus McMenemie added a comment -

          Did you notice that, besides <uniqueKey> not being specified, the whole document was empty!

          Mar 18, 2009 1:12:48 PM org.apache.solr.handler.dataimport.SolrWriter upload
          WARNING: Error creating document : SolrInputDocument[{}]
          
          Fergus McMenemie added a comment -

          I have applied the latest version of SOLR-1059 and I just cannot get delete to work!

               <entity name="single-delete"
          		 dataSource="myURIreader"
          		 processor="XPathEntityProcessor"
          		 url="${dataimporter.request.single}"
          		 rootEntity="true"
          		 flatten="true"
          		 stream="false"
          		 forEach="/record | /record/mediaBlock"
          		 transformer="TemplateTransformer">
          
                <field column="$skipDoc"            template="true" /> 
                <field column="fileAbsolutePath"    template="${dataimporter.request.single}" /> 
                <field column="$deleteDocByQuery"   template="fileAbsolutePath:${dataimporter.request.single}" /> 	       
                <field column="vdkvgwkey"           template="${dataimporter.request.single}" /> 
                </entity>
          

          And here is a section from the log file showing that, after an attempt to delete the document, it is still in the index; it was not removed.

          Mar 19, 2009 5:24:52 PM org.apache.solr.core.SolrCore execute
          INFO: [] webapp=/apache-solr-1.4-dev path=/select params={wt=xml&q=fileAbsolutePath:file\:///Volumes/spare/ts/janes/schema/janesxml/data/news/jdw/jdw2008/jni71796.xml} hits=3 status=0 QTime=10 
          
          
          
          Mar 19, 2009 5:25:04 PM org.apache.solr.core.SolrCore execute
          INFO: [] webapp=/apache-solr-1.4-dev path=/dataimport params={command=full-import&clean=false&entity=single-delete&commit=true&single=file\:///Volumes/spare/ts/janes/schema/janesxml/data/news/jdw/jdw2008/jni71796.xml} status=0 QTime=0 
          Mar 19, 2009 5:25:04 PM org.apache.solr.handler.dataimport.SolrWriter readIndexerProperties
          INFO: Read dataimport.properties
          Mar 19, 2009 5:25:04 PM org.apache.solr.handler.dataimport.DataImporter doFullImport
          INFO: Starting Full Import
          Mar 19, 2009 5:25:04 PM org.apache.solr.handler.dataimport.SolrWriter readIndexerProperties
          INFO: Read dataimport.properties
          Mar 19, 2009 5:25:04 PM org.apache.solr.handler.dataimport.URLDataSource getData
          SEVERE: Exception thrown while getting data
          java.net.MalformedURLException: no protocol: nullfile\:///Volumes/spare/ts/janes/schema/janesxml/data/news/jdw/jdw2008/jni71796.xml
          	at java.net.URL.<init>(URL.java:567)
          	at java.net.URL.<init>(URL.java:464)
          	at java.net.URL.<init>(URL.java:413)
          	at org.apache.solr.handler.dataimport.URLDataSource.getData(URLDataSource.java:88)
          	at org.apache.solr.handler.dataimport.URLDataSource.getData(URLDataSource.java:47)
          	at org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:239)
          	at org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:182)
          	at org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:165)
          	at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:335)
          	at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:221)
          	at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:163)
          	at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:309)
          	at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:367)
          	at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:348)
          Mar 19, 2009 5:25:04 PM org.apache.solr.handler.dataimport.DocBuilder buildDocument
          SEVERE: Exception while processing: single-delete document : SolrInputDocument[{}]
          org.apache.solr.handler.dataimport.DataImportHandlerException: Exception in invoking url null Processing Document # 1
          	at org.apache.solr.handler.dataimport.URLDataSource.getData(URLDataSource.java:112)
          	at org.apache.solr.handler.dataimport.URLDataSource.getData(URLDataSource.java:47)
          	at org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:239)
          	at org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:182)
          	at org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:165)
          	at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:335)
          	at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:221)
          	at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:163)
          	at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:309)
          	at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:367)
          	at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:348)
          Caused by: java.net.MalformedURLException: no protocol: nullfile\:///Volumes/spare/ts/janes/schema/janesxml/data/news/jdw/jdw2008/jni71796.xml
          	at java.net.URL.<init>(URL.java:567)
          	at java.net.URL.<init>(URL.java:464)
          	at java.net.URL.<init>(URL.java:413)
          	at org.apache.solr.handler.dataimport.URLDataSource.getData(URLDataSource.java:88)
          	... 10 more
          Mar 19, 2009 5:25:04 PM org.apache.solr.handler.dataimport.DataImporter doFullImport
          SEVERE: Full Import failed
          org.apache.solr.handler.dataimport.DataImportHandlerException: Exception in invoking url null Processing Document # 1
          	at org.apache.solr.handler.dataimport.URLDataSource.getData(URLDataSource.java:112)
          	at org.apache.solr.handler.dataimport.URLDataSource.getData(URLDataSource.java:47)
          	at org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:239)
          	at org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:182)
          	at org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:165)
          	at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:335)
          	at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:221)
          	at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:163)
          	at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:309)
          	at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:367)
          	at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:348)
          Caused by: java.net.MalformedURLException: no protocol: nullfile\:///Volumes/spare/ts/janes/schema/janesxml/data/news/jdw/jdw2008/jni71796.xml
          	at java.net.URL.<init>(URL.java:567)
          	at java.net.URL.<init>(URL.java:464)
          	at java.net.URL.<init>(URL.java:413)
          	at org.apache.solr.handler.dataimport.URLDataSource.getData(URLDataSource.java:88)
          	... 10 more
          Mar 19, 2009 5:25:04 PM org.apache.solr.update.DirectUpdateHandler2 rollback
          INFO: start rollback
          Mar 19, 2009 5:25:04 PM org.apache.solr.update.DirectUpdateHandler2 rollback
          INFO: end_rollback
          Mar 19, 2009 5:25:04 PM org.apache.solr.update.DirectUpdateHandler2 commit
          INFO: start commit(optimize=false,waitFlush=false,waitSearcher=true)
          Mar 19, 2009 5:25:04 PM org.apache.solr.search.SolrIndexSearcher <init>
          INFO: Opening Searcher@281e7e main
          Mar 19, 2009 5:25:04 PM org.apache.solr.update.DirectUpdateHandler2 commit
          INFO: end_commit_flush
          Mar 19, 2009 5:25:04 PM org.apache.solr.search.SolrIndexSearcher warm
          INFO: autowarming Searcher@281e7e main from Searcher@7740f6 main
          	fieldValueCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
          Mar 19, 2009 5:25:04 PM org.apache.solr.search.SolrIndexSearcher warm
          INFO: autowarming result for Searcher@281e7e main
          	fieldValueCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
          Mar 19, 2009 5:25:04 PM org.apache.solr.search.SolrIndexSearcher warm
          INFO: autowarming Searcher@281e7e main from Searcher@7740f6 main
          	filterCache{lookups=6,hits=6,hitratio=1.00,inserts=0,evictions=0,size=9,warmupTime=16,cumulative_lookups=25,cumulative_hits=25,cumulative_hitratio=1.00,cumulative_inserts=2,cumulative_evictions=0}
          Mar 19, 2009 5:25:04 PM org.apache.solr.search.SolrIndexSearcher warm
          INFO: autowarming result for Searcher@281e7e main
          	filterCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=9,warmupTime=16,cumulative_lookups=25,cumulative_hits=25,cumulative_hitratio=1.00,cumulative_inserts=2,cumulative_evictions=0}
          Mar 19, 2009 5:25:04 PM org.apache.solr.search.SolrIndexSearcher warm
          INFO: autowarming Searcher@281e7e main from Searcher@7740f6 main
          	queryResultCache{lookups=2,hits=2,hitratio=1.00,inserts=7,evictions=0,size=7,warmupTime=8,cumulative_lookups=9,cumulative_hits=7,cumulative_hitratio=0.77,cumulative_inserts=2,cumulative_evictions=0}
          Mar 19, 2009 5:25:04 PM org.apache.solr.search.SolrIndexSearcher warm
          INFO: autowarming result for Searcher@281e7e main
          	queryResultCache{lookups=0,hits=0,hitratio=0.00,inserts=7,evictions=0,size=7,warmupTime=8,cumulative_lookups=9,cumulative_hits=7,cumulative_hitratio=0.77,cumulative_inserts=2,cumulative_evictions=0}
          Mar 19, 2009 5:25:04 PM org.apache.solr.search.SolrIndexSearcher warm
          INFO: autowarming Searcher@281e7e main from Searcher@7740f6 main
          	documentCache{lookups=18,hits=15,hitratio=0.83,inserts=26,evictions=0,size=26,warmupTime=0,cumulative_lookups=165,cumulative_hits=149,cumulative_hitratio=0.90,cumulative_inserts=16,cumulative_evictions=0}
          Mar 19, 2009 5:25:04 PM org.apache.solr.search.SolrIndexSearcher warm
          INFO: autowarming result for Searcher@281e7e main
          	documentCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=165,cumulative_hits=149,cumulative_hitratio=0.90,cumulative_inserts=16,cumulative_evictions=0}
          Mar 19, 2009 5:25:04 PM org.apache.solr.core.QuerySenderListener newSearcher
          INFO: QuerySenderListener sending requests to Searcher@281e7e main
          Mar 19, 2009 5:25:04 PM org.apache.solr.core.SolrCore execute
          INFO: [] webapp=null path=null params={rows=10&start=0&q=solr} hits=0 status=0 QTime=6 
          Mar 19, 2009 5:25:04 PM org.apache.solr.core.SolrCore execute
          INFO: [] webapp=null path=null params={rows=10&start=0&q=rocks} hits=90 status=0 QTime=34 
          Mar 19, 2009 5:25:04 PM org.apache.solr.core.SolrCore execute
          INFO: [] webapp=null path=null params={q=static+newSearcher+warming+query+from+solrconfig.xml} hits=12327 status=0 QTime=98 
          Mar 19, 2009 5:25:04 PM org.apache.solr.core.QuerySenderListener newSearcher
          INFO: QuerySenderListener done.
          Mar 19, 2009 5:25:04 PM org.apache.solr.core.SolrCore registerSearcher
          INFO: [] Registered new searcher Searcher@281e7e main
          Mar 19, 2009 5:25:04 PM org.apache.solr.search.SolrIndexSearcher close
          INFO: Closing Searcher@7740f6 main
          	fieldValueCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
          	filterCache{lookups=6,hits=6,hitratio=1.00,inserts=0,evictions=0,size=9,warmupTime=16,cumulative_lookups=25,cumulative_hits=25,cumulative_hitratio=1.00,cumulative_inserts=2,cumulative_evictions=0}
          	queryResultCache{lookups=2,hits=2,hitratio=1.00,inserts=7,evictions=0,size=7,warmupTime=8,cumulative_lookups=9,cumulative_hits=7,cumulative_hitratio=0.77,cumulative_inserts=2,cumulative_evictions=0}
          	documentCache{lookups=18,hits=15,hitratio=0.83,inserts=26,evictions=0,size=26,warmupTime=0,cumulative_lookups=165,cumulative_hits=149,cumulative_hitratio=0.90,cumulative_inserts=16,cumulative_evictions=0}
          
          
          
          Mar 19, 2009 5:25:12 PM org.apache.solr.core.SolrCore execute
          INFO: [] webapp=/apache-solr-1.4-dev path=/select params={wt=xml&q=fileAbsolutePath:file\:///Volumes/spare/ts/janes/schema/janesxml/data/news/jdw/jdw2008/jni71796.xml} hits=3 status=0 QTime=11 
          

          Any hints on what I should try next?

          Shalin Shekhar Mangar added a comment -

          I have applied the latest version of SOLR-1059 and I just cannot get delete to work!

          SOLR-1059 is now committed to trunk so you do not need to apply the patch anymore. There was a slight change too: the delete flag variables
          have been renamed to "$deleteDocById" and "$deleteDocByQuery".

          I see that you are escaping the ':' character. You need to escape it only if you are specifying it in the 'q' parameter. For any other parameter ('single' in this case)
          you do not need to escape it. The escaped ':' is causing this exception.
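
          For example, reusing the requests from the logs in this thread: the 'single' parameter goes to /dataimport unescaped, while the ':' must still be escaped inside the q parameter of /select:

               get 'http://localhost:8080/apache-solr-1.4-dev/dataimport?command=full-import&entity=single-delete&clean=false&commit=true&single=file:///Volumes/spare/ts/janes/schema/janesxml/data/news/jdw/jdw2008/jni71796.xml'
               get 'http://localhost:8080/apache-solr-1.4-dev/select?wt=xml&q=fileAbsolutePath:file\:///Volumes/spare/ts/janes/schema/janesxml/data/news/jdw/jdw2008/jni71796.xml'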

          Fergus McMenemie added a comment -

          Yes, I spotted you had committed SOLR-1059. I backed out that patch and did an "svn update" to get the new changes. I had changed my data-config.xml as shown above and was already using $deleteDocByQuery. Removing the escaping of the ':' I get the following:-

          Mar 19, 2009 6:31:27 PM org.apache.solr.core.SolrCore execute
          INFO: [] webapp=/apache-solr-1.4-dev path=/dataimport params={command=full-import&clean=false&entity=single-delete&commit=true&single=file:///Volumes/spare/ts/janes/schema/janesxml/data/news/jdw/jdw2008/jni71796.xml} status=0 QTime=0 
          Mar 19, 2009 6:31:27 PM org.apache.solr.handler.dataimport.SolrWriter readIndexerProperties
          INFO: Read dataimport.properties
          Mar 19, 2009 6:31:27 PM org.apache.solr.handler.dataimport.DataImporter doFullImport
          INFO: Starting Full Import
          Mar 19, 2009 6:31:27 PM org.apache.solr.handler.dataimport.SolrWriter readIndexerProperties
          INFO: Read dataimport.properties
          Mar 19, 2009 6:31:27 PM org.apache.solr.handler.dataimport.DocBuilder execute
          INFO: Time taken = 0:0:0.14
          Mar 19, 2009 6:31:42 PM org.apache.solr.core.SolrCore execute
          INFO: [] webapp=/apache-solr-1.4-dev path=/select params={wt=xml&q=fileAbsolutePath:file\:///Volumes/spare/ts/janes/schema/janesxml/data/news/jdw/jdw2008/jni71796.xml} hits=3 status=0 QTime=11 
          

          Any ideas or should I start adding log statements all over the place?

          Shalin Shekhar Mangar added a comment -

          I'm not sure what to make of the above log. Are the documents not getting deleted? In your data-config I see that skipDoc is true. No documents will be added at all.

          Fergus McMenemie added a comment -

          Correct, documents are not getting deleted. Line 2 from the log shows:-

          path=/dataimport command=full-import&clean=false&entity=single-delete&commit=true&single=file:///Volumes/spare/ts/janes/schema/janesxml/data/news/jdw/jdw2008/jni71796.xml

          Line 12 is me doing a query for the same document:-

          path=/select params={wt=xml&q=fileAbsolutePath:file\:///Volumes/spare/ts/janes/schema/janesxml/data/news/jdw/jdw2008/jni71796.xml} hits=3 status=0 QTime=11

          which returns three hits. So the documents have not been deleted! Removing the $skipDoc=true and rerunning the delete I get:-

          Mar 19, 2009 7:33:34 PM org.apache.solr.core.SolrCore execute
          INFO: [] webapp=/apache-solr-1.4-dev path=/dataimport params={command=full-import&clean=false&entity=single-delete&commit=true&single=file:///Volumes/spare/ts/janes/schema/janesxml/data/news/jdw/jdw2008/jni71796.xml} status=0 QTime=0 
          Mar 19, 2009 7:33:34 PM org.apache.solr.handler.dataimport.SolrWriter readIndexerProperties
          INFO: Read dataimport.properties
          Mar 19, 2009 7:33:34 PM org.apache.solr.handler.dataimport.DataImporter doFullImport
          INFO: Starting Full Import
          Mar 19, 2009 7:33:34 PM org.apache.solr.handler.dataimport.SolrWriter readIndexerProperties
          INFO: Read dataimport.properties
          Mar 19, 2009 7:33:34 PM org.apache.solr.handler.dataimport.SolrWriter deleteByQuery
          INFO: Deleting documents from Solr with query: fileAbsolutePath:file:///Volumes/spare/ts/janes/schema/janesxml/data/news/jdw/jdw2008/jni71796.xml
          Mar 19, 2009 7:33:34 PM org.apache.solr.common.SolrException log
          SEVERE: org.apache.lucene.queryParser.ParseException: Cannot parse 'fileAbsolutePath:file:///Volumes/spare/ts/janes/schema/janesxml/data/news/jdw/jdw2008/jni71796.xml': Encountered " ":" ": "" at line 1, column 21.
          Was expecting one of:
              <EOF> 
              <AND> ...
              <OR> ...
              <NOT> ...
              "+" ...
              "-" ...
              "(" ...
              "*" ...
              "^" ...
              <QUOTED> ...
              <TERM> ...
              <FUZZY_SLOP> ...
              <PREFIXTERM> ...
              <WILDTERM> ...
              "[" ...
              "{" ...
              <NUMBER> ...
              
          	at org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:177)
          	at org.apache.solr.search.QueryParsing.parseQuery(QueryParsing.java:74)
          	at org.apache.solr.search.QueryParsing.parseQuery(QueryParsing.java:63)
          	at org.apache.solr.update.DirectUpdateHandler2.deleteByQuery(DirectUpdateHandler2.java:314)
          	at org.apache.solr.update.processor.RunUpdateProcessor.processDelete(RunUpdateProcessorFactory.java:70)
          	at org.apache.solr.handler.dataimport.SolrWriter.deleteByQuery(SolrWriter.java:153)
          	at org.apache.solr.handler.dataimport.DocBuilder.addFields(DocBuilder.java:449)
          	at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:358)
          	at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:221)
          	at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:163)
          	at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:309)
          	at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:367)
          	at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:348)
          
          Mar 19, 2009 7:33:34 PM org.apache.solr.handler.dataimport.DataImporter doFullImport
          SEVERE: Full Import failed
          org.apache.solr.handler.dataimport.DataImportHandlerException: org.apache.solr.common.SolrException: Error parsing Lucene query
          	at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:400)
          	at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:221)
          	at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:163)
          	at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:309)
          	at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:367)
          	at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:348)
          Caused by: org.apache.solr.common.SolrException: Error parsing Lucene query
          	at org.apache.solr.search.QueryParsing.parseQuery(QueryParsing.java:84)
          	at org.apache.solr.search.QueryParsing.parseQuery(QueryParsing.java:63)
          	at org.apache.solr.update.DirectUpdateHandler2.deleteByQuery(DirectUpdateHandler2.java:314)
          	at org.apache.solr.update.processor.RunUpdateProcessor.processDelete(RunUpdateProcessorFactory.java:70)
          	at org.apache.solr.handler.dataimport.SolrWriter.deleteByQuery(SolrWriter.java:153)
          	at org.apache.solr.handler.dataimport.DocBuilder.addFields(DocBuilder.java:449)
          	at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:358)
          	... 5 more
          Caused by: org.apache.lucene.queryParser.ParseException: Cannot parse 'fileAbsolutePath:file:///Volumes/spare/ts/janes/schema/janesxml/data/news/jdw/jdw2008/jni71796.xml': Encountered " ":" ": "" at line 1, column 21.
          Was expecting one of:
              <EOF> 
              <AND> ...
              <OR> ...
              <NOT> ...
              "+" ...
              "-" ...
              "(" ...
              "*" ...
              "^" ...
              <QUOTED> ...
              <TERM> ...
              <FUZZY_SLOP> ...
              <PREFIXTERM> ...
              <WILDTERM> ...
              "[" ...
              "{" ...
              <NUMBER> ...
              
          	at org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:177)
          	at org.apache.solr.search.QueryParsing.parseQuery(QueryParsing.java:74)
          	... 11 more
          Mar 19, 2009 7:33:34 PM org.apache.solr.update.DirectUpdateHandler2 rollback
          INFO: start rollback
          Mar 19, 2009 7:33:34 PM org.apache.solr.update.DirectUpdateHandler2 rollback
          INFO: end_rollback
          Mar 19, 2009 7:33:34 PM org.apache.solr.update.DirectUpdateHandler2 commit
          INFO: start commit(optimize=false,waitFlush=false,waitSearcher=true)
          Mar 19, 2009 7:33:34 PM org.apache.solr.search.SolrIndexSearcher <init>
          INFO: Opening Searcher@86b804 main
          Mar 19, 2009 7:33:34 PM org.apache.solr.update.DirectUpdateHandler2 commit
          INFO: end_commit_flush
          Mar 19, 2009 7:33:34 PM org.apache.solr.search.SolrIndexSearcher warm
          INFO: autowarming Searcher@86b804 main from Searcher@281e7e main
          	fieldValueCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
          Mar 19, 2009 7:33:34 PM org.apache.solr.search.SolrIndexSearcher warm
          INFO: autowarming result for Searcher@86b804 main
          	fieldValueCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
          Mar 19, 2009 7:33:34 PM org.apache.solr.search.SolrIndexSearcher warm
          INFO: autowarming Searcher@86b804 main from Searcher@281e7e main
          	filterCache{lookups=6,hits=6,hitratio=1.00,inserts=0,evictions=0,size=9,warmupTime=16,cumulative_lookups=31,cumulative_hits=31,cumulative_hitratio=1.00,cumulative_inserts=2,cumulative_evictions=0}
          Mar 19, 2009 7:33:35 PM org.apache.solr.search.SolrIndexSearcher warm
          INFO: autowarming result for Searcher@86b804 main
          	filterCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=9,warmupTime=39,cumulative_lookups=31,cumulative_hits=31,cumulative_hitratio=1.00,cumulative_inserts=2,cumulative_evictions=0}
          Mar 19, 2009 7:33:35 PM org.apache.solr.search.SolrIndexSearcher warm
          INFO: autowarming Searcher@86b804 main from Searcher@281e7e main
          	queryResultCache{lookups=2,hits=2,hitratio=1.00,inserts=7,evictions=0,size=7,warmupTime=8,cumulative_lookups=11,cumulative_hits=9,cumulative_hitratio=0.81,cumulative_inserts=2,cumulative_evictions=0}
          Mar 19, 2009 7:33:35 PM org.apache.solr.search.SolrIndexSearcher warm
          INFO: autowarming result for Searcher@86b804 main
          	queryResultCache{lookups=0,hits=0,hitratio=0.00,inserts=7,evictions=0,size=7,warmupTime=9,cumulative_lookups=11,cumulative_hits=9,cumulative_hitratio=0.81,cumulative_inserts=2,cumulative_evictions=0}
          Mar 19, 2009 7:33:35 PM org.apache.solr.search.SolrIndexSearcher warm
          INFO: autowarming Searcher@86b804 main from Searcher@281e7e main
          	documentCache{lookups=18,hits=15,hitratio=0.83,inserts=26,evictions=0,size=26,warmupTime=0,cumulative_lookups=183,cumulative_hits=164,cumulative_hitratio=0.89,cumulative_inserts=19,cumulative_evictions=0}
          Mar 19, 2009 7:33:35 PM org.apache.solr.search.SolrIndexSearcher warm
          INFO: autowarming result for Searcher@86b804 main
          	documentCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=183,cumulative_hits=164,cumulative_hitratio=0.89,cumulative_inserts=19,cumulative_evictions=0}
          Mar 19, 2009 7:33:35 PM org.apache.solr.core.QuerySenderListener newSearcher
          INFO: QuerySenderListener sending requests to Searcher@86b804 main
          Mar 19, 2009 7:33:35 PM org.apache.solr.core.SolrCore execute
          INFO: [] webapp=null path=null params={rows=10&start=0&q=solr} hits=0 status=0 QTime=3 
          Mar 19, 2009 7:33:35 PM org.apache.solr.core.SolrCore execute
          INFO: [] webapp=null path=null params={rows=10&start=0&q=rocks} hits=90 status=0 QTime=16 
          Mar 19, 2009 7:33:35 PM org.apache.solr.core.SolrCore execute
          INFO: [] webapp=null path=null params={q=static+newSearcher+warming+query+from+solrconfig.xml} hits=12327 status=0 QTime=96 
          Mar 19, 2009 7:33:35 PM org.apache.solr.core.QuerySenderListener newSearcher
          INFO: QuerySenderListener done.
          Mar 19, 2009 7:33:35 PM org.apache.solr.core.SolrCore registerSearcher
          INFO: [] Registered new searcher Searcher@86b804 main
          Mar 19, 2009 7:33:35 PM org.apache.solr.search.SolrIndexSearcher close
          INFO: Closing Searcher@281e7e main
          	fieldValueCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
          	filterCache{lookups=6,hits=6,hitratio=1.00,inserts=0,evictions=0,size=9,warmupTime=16,cumulative_lookups=31,cumulative_hits=31,cumulative_hitratio=1.00,cumulative_inserts=2,cumulative_evictions=0}
          	queryResultCache{lookups=2,hits=2,hitratio=1.00,inserts=7,evictions=0,size=7,warmupTime=8,cumulative_lookups=11,cumulative_hits=9,cumulative_hitratio=0.81,cumulative_inserts=2,cumulative_evictions=0}
          	documentCache{lookups=18,hits=15,hitratio=0.83,inserts=26,evictions=0,size=26,warmupTime=0,cumulative_lookups=183,cumulative_hits=164,cumulative_hitratio=0.89,cumulative_inserts=19,cumulative_evictions=0}
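
          One possible workaround for the ParseException above (a sketch, not verified here): quote the value in the template so the query parser treats the whole URL as a single phrase, making the embedded ':' characters harmless:

               <field column="$deleteDocByQuery" template="fileAbsolutePath:&quot;${dataimporter.request.single}&quot;" />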
          
          
          Fergus McMenemie added a comment -

          Lots of weirdness going on here; do not bother looking into this further till I get things sorted out!

          Noble Paul added a comment -

          Hi Fergus,

          • If the document is empty, no attempt is made to add it.
          • There is a new LogTransformer checked into the trunk; you can use that to log any information (a usage sketch follows).
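
          A usage sketch of LogTransformer (attribute names per its documentation; the entity and template text here are illustrative, with other attributes elided):

               <entity name="jc"
                       processor="ChangeListEntityProcessor"
                       transformer="LogTransformer"
                       logTemplate="processing ${jc.rawLine}"
                       logLevel="info"
                       ... >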
          Fergus McMenemie added a comment -

          Some of my weirdness was resolved by an "ant clean". However, I have got myself into a whole pile of regex problems. The RegexTransformer just does not allow me to do what I need, which is to conditionally populate the field $deleteDocByQuery. A snippet from my data-config follows:-

             <entity name="jc"
          	     processor="ChangeListEntityProcessor"
          	     acceptLineRegex="^.*\.xml$"
          	     omitLineRegex="usc2009"
          	     fileName="file:///Volumes/spare/ts/man-findlsurl.txt"
          	     rootEntity="false"
          	     dataSource="null"
          	     baseLocation="file:///Volumes/spare/ts/ford/schema/"
          	     transformer="RegexTransformer"
          	     >
                <field column="fileAbsolutePath"    regex="^.*\s+([^ ]*)$" replaceWith="${jc.baseLocation}/$1"  sourceColName="rawLine"/>
                <field column="$deleteDocByQuery"   regex="^DELETE.*"      replaceWith="fileAbsolutePath:${jc.fileAbsolutePath}" sourceColName="rawLine"/> 	       
          
                <entity name="x"
          	      dataSource="myURIreader"
          	      processor="XPathEntityProcessor"
          

          The trouble is that this

          <field column="$deleteDocByQuery" regex="^DELETE.*" sourceColName="rawLine"/>

          leaves $deleteDocByQuery undefined if the regex is unmatched, which is good. However

          <field column="$deleteDocByQuery" regex="^DELETE.*" sourceColName="rawLine" replaceWith="fileAbsolutePath:${jc.fileAbsolutePath}" />

          leaves $deleteDocByQuery equal to rawLine if the regex is unmatched. I do not think this is desirable. I tried a workaround which, after reading the code, I thought I could get away with

          <field column="$deleteDocByQuery" regex="^DELETE.*" sourceColName="rawLine" replaceWith="fileAbsolutePath:${jc.fileAbsolutePath}" groupNames="$deleteDocByQuery" />

          however the presence of the "replaceWith" attribute disables the "groupNames" functionality.

          I think something needs to be sorted out; or at least the documentation needs further clarification.

          Noble Paul added a comment -

          Hi Fergus,
          I guess you are trying to do something with the in-built Transformers that would be better handled by code (a custom/new Transformer).
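
          For illustration, a rough sketch of what such a custom Transformer might look like for this case: it populates $deleteDocByQuery only when the raw line is a DELETE entry, and otherwise leaves the row untouched. The class name is hypothetical and this is not committed code; the rawLine and fileAbsolutePath names follow the config above.

          import java.util.Map;
          import org.apache.solr.handler.dataimport.Context;
          import org.apache.solr.handler.dataimport.Transformer;

          // Illustrative only: conditionally sets $deleteDocByQuery when the
          // raw line is a DELETE entry; on any other line the field stays unset.
          public class DeleteLineTransformer extends Transformer {
            @Override
            public Object transformRow(Map<String, Object> row, Context context) {
              Object rawLine = row.get("rawLine");
              Object path = row.get("fileAbsolutePath");
              if (rawLine != null && path != null
                  && rawLine.toString().startsWith("DELETE")) {
                row.put("$deleteDocByQuery", "fileAbsolutePath:" + path);
              }
              return row;
            }
          }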

          Fergus McMenemie added a comment -

          My original patch did all this in the ChangeListEntityProcessor, as an option! However, as a separate issue, I do think we have an ambiguity in the face-value behaviour of the following code when a mismatch occurs.

             <field column="$deleteDocByQuery" regex="^DELETE.*" sourceColName="rawLine"/>
             <field column="$deleteDocByQuery" regex="^DELETE.*" sourceColName="rawLine" replaceWith="fileAbsolutePath:${jc.fileAbsolutePath}" />
          

          While I do understand that under the hood one is a match and the other a replace, I think we could enhance the existing transformer somehow to streamline its interface. After all, a new custom Transformer would just be a regex by another name. Not sure what to do for the best: 1) I could put my optional code back into ChangeListEntityProcessor? 2) I can also get around the problem with temporary fields, but it is rather ugly:-

          <entity name="jc"
          	     processor="ChangeListEntityProcessor"
          	     acceptLineRegex="^.*\.xml$"
          	     omitLineRegex="usc2009"
          	     fileName="file:///Volumes/spare/ts/man-findlsurl.txt"
          	     rootEntity="false"
          	     dataSource="null"
          	     baseLocation="file:///Volumes/spare/ts/ford/schema/"
          	     transformer="RegexTransformer"
          	     >
                <field column="fileAbsolutePath"    regex="^.*\s+([^ ]*)$" replaceWith="${jc.baseLocation}/$1"  sourceColName="rawLine"/>
                <field column="dummy"                  regex="^DELETE.*"      replaceWith="fileAbsolutePath:${jc.fileAbsolutePath}" sourceColName="rawLine"/> 	       
                <field column="$deleteDocByQuery"   regex="^fileAbsolutePath:"  sourceColName="dummy"/> 	       
          
                <entity name="x"
          	      dataSource="myURIreader"
          	      processor="XPathEntityProcessor"
          
          Noble Paul added a comment -

          leaves $deleteDocByQuery equal to rawLine if the regex is unmatched

          Is it a good idea to not do anything if the regex is not matched? That is, do not do the replaceAll().

          Fergus McMenemie added a comment -

          I really don't know; I think it is the syntax that is confusing more than anything else. The first case is clearly a call to a plain olde matcher. The second case is not so clear or explicit, despite your wiki docs, which do say you are doing a replace. How about a new RegexTransformer keyword "matcher"?

          1   <field column="$deleteDocByQuery" matcher="^DELETE.*" sourceColName="rawLine"/>
          2   <field column="$deleteDocByQuery"     regex="^DELETE.*" sourceColName="rawLine"/>
          3   <field column="$deleteDocByQuery" matcher="^DELETE.*" sourceColName="rawLine" replaceWith="fileAbsolutePath:${jc.fileAbsolutePath}" />
          4   <field column="$deleteDocByQuery"     regex="^DELETE.*" sourceColName="rawLine" replaceWith="fileAbsolutePath:${jc.fileAbsolutePath}" />
          

          Cases 1) and 2) behave identically, with "regex" being deprecated.
          Cases 3) and 4) differ: case 4) is the existing behavior, which boils down to a replace. Case 3), however, performs a match first; if the match succeeds then the replace is performed. If the match fails, $deleteDocByQuery is unchanged; if it was null, it stays null.
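
          To pin down the proposed case 3) semantics in plain java.util.regex terms (a sketch of the intended behaviour, not of RegexTransformer internals):

          import java.util.regex.Matcher;
          import java.util.regex.Pattern;

          public class MatcherSemanticsDemo {
            // Case 3): replace only when the pattern actually matches; on a
            // mismatch return null so the caller leaves the column untouched.
            static String matchThenReplace(String input, String regex, String replaceWith) {
              Matcher m = Pattern.compile(regex).matcher(input);
              return m.find() ? m.replaceAll(replaceWith) : null;
            }

            public static void main(String[] args) {
              // Matches: the line is rewritten, so the column gets the new value.
              System.out.println(matchThenReplace("DELETE a/b.xml", "^DELETE.*", "deleted"));
              // No match: prints null, i.e. $deleteDocByQuery stays unset.
              System.out.println(matchThenReplace("ADD a/b.xml", "^DELETE.*", "deleted"));
            }
          }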

          Noble Paul added a comment -

          This should check if the regex matches; if not, it doesn't replace.

          Fergus McMenemie added a comment -

          This is the latest version of my patch; the principal item of work was ChangeListEntityProcessor.java. However, a new URLDataSource.java data source was added which is almost identical to the now deprecated HttpDataSource.java. I also edited FileDataSource.java to make the log messages equivalent to those produced by URLDataSource.java.

          It is as good as working; however, I cannot get deleteDocByQuery to function properly.

          Fergus McMenemie added a comment -

          Your patched version of regex seems great; thanks very much. I think it is better, more useful behavior.

          Next... can't get delete to work. Using a data-config of

          <entity name="single-delete"
          		 dataSource="myURIreader"
          		 processor="XPathEntityProcessor"
          		 url="${dataimporter.request.single}"
          		 rootEntity="true"
          		 stream="false"
          		 forEach="/record | /record/mediaBlock"
          		 transformer="TemplateTransformer">
          
                <field column="fileAbsolutePath"    template="${dataimporter.request.single}" /> 
                <field column="$deleteDocByQuery"   template="fileAbsolutePath:${dataimporter.request.single}" /> 	       
                <field column="vdkvgwkey"           template="${dataimporter.request.single}" /> 
                </entity>
          

          If I enter

          get 'http://localhost:8080/apache-solr-1.4-dev/dataimport?command=full-import&entity=single-delete&clean=false&commit=true&single=file:///Volumes/spare/ts/janes/schema/janesxml/data/news/jdw/jdw2008/jni71796.xml'

          I get

          Mar 20, 2009 1:33:36 PM org.apache.solr.core.SolrCore execute
          INFO: [] webapp=/apache-solr-1.4-dev path=/select params={wt=xml&q="sea+stallion"&qt=} hits=10 status=0 QTime=54 
          Mar 20, 2009 1:34:01 PM org.apache.solr.core.SolrCore execute
          INFO: [] webapp=/apache-solr-1.4-dev path=/dataimport params={command=full-import&clean=false&entity=single-delete&commit=true&single=file:///Volumes/spare/ts/janes/schema/janesxml/data/news/jdw/jdw2008/jni71796.xml} status=0 QTime=1 
          Mar 20, 2009 1:34:01 PM org.apache.solr.handler.dataimport.SolrWriter readIndexerProperties
          INFO: Read dataimport.properties
          Mar 20, 2009 1:34:01 PM org.apache.solr.handler.dataimport.DataImporter doFullImport
          INFO: Starting Full Import
          Mar 20, 2009 1:34:01 PM org.apache.solr.handler.dataimport.SolrWriter readIndexerProperties
          INFO: Read dataimport.properties
          Mar 20, 2009 1:34:01 PM org.apache.solr.handler.dataimport.SolrWriter deleteByQuery
          INFO: Deleting documents from Solr with query: fileAbsolutePath:file:///Volumes/spare/ts/janes/schema/janesxml/data/news/jdw/jdw2008/jni71796.xml
          Mar 20, 2009 1:34:01 PM org.apache.solr.common.SolrException log
          SEVERE: org.apache.lucene.queryParser.ParseException: Cannot parse 'fileAbsolutePath:file:///Volumes/spare/ts/janes/schema/janesxml/data/news/jdw/jdw2008/jni71796.xml': Encountered " ":" ": "" at line 1, column 21.
          Was expecting one of:
              <EOF> 
              <AND> ...
              <OR> ...
              <NOT> ...
              "+" ...
          

          However if I escape the ':' and enter

          get 'http://localhost:8080/apache-solr-1.4-dev/dataimport?command=full-import&entity=single-delete&clean=false&commit=true&single=file\:///Volumes/spare/ts/janes/schema/janesxml/data/news/jdw/jdw2008/jni71796.xml'

          in this case it looks as though the deleteDocByQuery is being ignored!

          Mar 20, 2009 1:34:22 PM org.apache.solr.core.SolrCore execute
          INFO: [] webapp=/apache-solr-1.4-dev path=/dataimport params={command=full-import&clean=false&entity=single-delete&commit=true&single=file\:///Volumes/spare/ts/janes/schema/janesxml/data/news/jdw/jdw2008/jni71796.xml} status=0 QTime=0 
          Mar 20, 2009 1:34:22 PM org.apache.solr.handler.dataimport.SolrWriter readIndexerProperties
          INFO: Read dataimport.properties
          Mar 20, 2009 1:34:22 PM org.apache.solr.handler.dataimport.DataImporter doFullImport
          INFO: Starting Full Import
          Mar 20, 2009 1:34:22 PM org.apache.solr.handler.dataimport.SolrWriter readIndexerProperties
          INFO: Read dataimport.properties
          Mar 20, 2009 1:34:22 PM org.apache.solr.handler.dataimport.URLDataSource getData
          SEVERE: Exception thrown while getting data
          java.net.MalformedURLException: no protocol: nullfile\:///Volumes/spare/ts/janes/schema/janesxml/data/news/jdw/jdw2008/jni71796.xml
          	at java.net.URL.<init>(URL.java:567)
          	at java.net.URL.<init>(URL.java:464)
          	at java.net.URL.<init>(URL.java:413)
          	at org.apache.solr.handler.dataimport.URLDataSource.getData(URLDataSource.java:88)
          	at org.apache.solr.handler.dataimport.URLDataSource.getData(URLDataSource.java:47)
          	at org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:239)
          	at org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:182)
          	at org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:165)
          	at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:335)
          	at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:221)
          	at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:163)
          	at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:309)
          	at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:367)
          	at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:348)
          Mar 20, 2009 1:34:22 PM org.apache.solr.handler.dataimport.DocBuilder buildDocument
          SEVERE: Exception while processing: single-delete document : SolrInputDocument[{}]
          org.apache.solr.handler.dataimport.DataImportHandlerException: Exception in invoking url null Processing Document # 1
          	at org.apache.solr.handler.dataimport.URLDataSource.getData(URLDataSource.java:112)
          	at org.apache.solr.handler.dataimport.URLDataSource.getData(URLDataSource.java:47)
          
          Shalin Shekhar Mangar added a comment -

          I was having trouble applying the patch as there were some conflicts in HttpDataSource. This patch is in sync with trunk.

          Shalin Shekhar Mangar added a comment -

          The ParseException is because when we try to delete the document, the file path being given contains a ':' character. A valid Solr query cannot contain such characters unescaped. The best way is to use ClientUtils.escapeQueryChars on the value of the query. I see that you are creating the delete query through a template. Perhaps we need an Evaluator which can escape query characters using the ClientUtils.escapeQueryChars method?

          <field column="$deleteDocByQuery"   template="fileAbsolutePath:${dataimporter.functions.escapeQueryChars(dataimporter.request.single)}" />
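
          For what it is worth, ClientUtils.escapeQueryChars can be exercised stand-alone to preview what the escaped delete query would look like. A minimal sketch (the Evaluator wiring itself does not exist yet at this point in the thread):

          import org.apache.solr.client.solrj.util.ClientUtils;

          public class EscapeDemo {
            public static void main(String[] args) {
              String path = "file:///Volumes/spare/ts/janes/schema/janesxml/data/news/jdw/jdw2008/jni71796.xml";
              // escapeQueryChars backslash-escapes Lucene query metacharacters
              // such as ':' so the whole path parses as a single term value.
              String q = "fileAbsolutePath:" + ClientUtils.escapeQueryChars(path);
              System.out.println(q);
            }
          }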
          
          Fergus McMenemie added a comment -

          Downloaded your version of my patch. Thanks for taking a look at it and making the improvements.

          However I still can't get things to work. My solr-data.xml is now as follows:-

               <entity name="single-delete"
          		 dataSource="myURIreader"
          		 processor="XPathEntityProcessor"
          		 url="${dataimporter.request.single}"
          		 rootEntity="true"
          		 flatten="true"
          		 stream="false"
          		 forEach="/record | /record/mediaBlock"
          		 transformer="TemplateTransformer">
          
                <field column="fileAbsolutePath"    template="${dataimporter.request.single}" /> 
                <field column="$deleteDocByQuery"   template="fileAbsolutePath:${dataimporter.functions.escapeQueryChars(dataimporter.request.single)}" /> 	       
                <field column="vdkvgwkey"           template="${dataimporter.request.single}" /> 
                </entity>
          
          

          But an attempt to delete a document produces the following..

          Mar 23, 2009 12:45:42 PM org.apache.solr.core.SolrCore execute
          INFO: [] webapp=/apache-solr-1.4-dev path=/dataimport params={command=full-import&clean=false&entity=single-delete&commit=true&single=file:///Volumes/spare/ts/janes/schema/janesxml/data/news/jdw/jdw2008/jni71796.xml} status=0 QTime=1 
          Mar 23, 2009 12:45:42 PM org.apache.solr.handler.dataimport.SolrWriter readIndexerProperties
          INFO: Read dataimport.properties
          Mar 23, 2009 12:45:42 PM org.apache.solr.handler.dataimport.DataImporter doFullImport
          INFO: Starting Full Import
          Mar 23, 2009 12:45:42 PM org.apache.solr.handler.dataimport.SolrWriter readIndexerProperties
          INFO: Read dataimport.properties
          Mar 23, 2009 12:45:42 PM org.apache.solr.handler.dataimport.TemplateTransformer transformRow
          WARNING: Unable to resolve variable: dataimporter.functions.escapeQueryChars(dataimporter.request.single) while parsing expression: fileAbsolutePath:${dataimporter.functions.escapeQueryChars(dataimporter.request.single)}
          Mar 23, 2009 12:45:42 PM org.apache.solr.core.SolrDeletionPolicy onInit
          INFO: SolrDeletionPolicy.onInit: commits:num=1
          	commit{dir=/Volumes/spare/ts/solrnightlyjanes/data/index,segFN=segments_3,version=1237809265075,generation=3,filenames=[_5.nrm, _5.tii, _5.tis, _5.fdx, _5.prx, _5.fdt, _5.fnm, segments_3, _5.frq]
          Mar 23, 2009 12:45:42 PM org.apache.solr.core.SolrDeletionPolicy updateCommits
          INFO: last commit = 1237809265075
          Mar 23, 2009 12:45:42 PM org.apache.solr.handler.dataimport.TemplateTransformer transformRow
          WARNING: Unable to resolve variable: dataimporter.functions.escapeQueryChars(dataimporter.request.single) while parsing expression: fileAbsolutePath:${dataimporter.functions.escapeQueryChars(dataimporter.request.single)}
          Mar 23, 2009 12:45:42 PM org.apache.solr.handler.dataimport.TemplateTransformer transformRow
          WARNING: Unable to resolve variable: dataimporter.functions.escapeQueryChars(dataimporter.request.single) while parsing expression: fileAbsolutePath:${dataimporter.functions.escapeQueryChars(dataimporter.request.single)}
          Mar 23, 2009 12:45:42 PM org.apache.solr.handler.dataimport.DocBuilder commit
          INFO: Full Import completed successfully
          Mar 23, 2009 12:45:42 PM org.apache.solr.update.DirectUpdateHandler2 commit
          INFO: start commit(optimize=true,waitFlush=false,waitSearcher=true)
          
          Shalin Shekhar Mangar added a comment -

          However I still can't get things to work. My solr-data.xml is now as follows:-

          Sorry, I was not very clear. I meant that we need to create a new Evaluator which can escape query characters. I'll create an issue and give a patch.

          Shalin Shekhar Mangar added a comment -

          OK I opened SOLR-1083 for this enhancement

          Fergus McMenemie added a comment -

          Here I am at ApacheCon. I finally got email going and it looks as though it is going to stop raining.

          And to top it all I have been able to delete documents. Fantastic. Thanks very much.

          I now have very flexible and powerful functionality that is IMHO miles better than the equivalent functionality in the commercial search engines I have used.

          However, I tried deleting some documents that don't exist and was wondering if the full traceback was required. A simple message should be enough?

          I will review and comment the patch and re-upload ASAP.

          Fergus McMenemie added a comment -

          A more complete version of the patch with docs and an expanded regex test case. Ready for submission?

          Fergus McMenemie added a comment -

          Now with test case for the ChangeListEntityProcessor

          Fergus McMenemie added a comment -

          Oops. Deleted my patch! Uploading again.

          Fergus McMenemie added a comment -

          This time it is right!

          Shalin Shekhar Mangar added a comment -

          Fergus, ChangeListEntityProcessor seems to duplicate URIDataSource's functionality instead of using it. Why is that?

          Fergus McMenemie added a comment -

          Hmmm,

          Are you referring to the fragment of code inside ChangeListEntityProcessor that opens the changelist, and its similarity to the functionality in URIDataSource?

          I had not thought about arranging some kind of nested use of URIDataSource... is that what you are thinking about?

          Shalin Shekhar Mangar added a comment -

          Are you referring to the fragment of code inside ChangeListEntityProcessor that opens the changelist, and its similarity to the functionality in URIDataSource?

          Yes.

          I had not thought about arranging some kind of nested use of URIDataSource... is that what you are thinking about?

          Not exactly. EntityProcessors do not access http/files directly. That's what DataSources are for. The ChangeListEntityProcessor should just use context.getDataSource() instead of creating a URLConnection directly. The only problem with that approach is that the baseLocation must be specified on the <dataSource>. If you really need it to be returned with the row, you can put a template field with its value, assuming the baseLocation is fixed.

          The more I look at this, the more I feel that the name 'ChangeListEntityProcessor' is misleading. It doesn't really have anything to do with changes. It is actually what I imagined a LineEntityProcessor would be: it just streams lines one by one after accepting or rejecting some lines with regex. Whatever else you need to do (for your original use-case) can be done with nested entities and/or custom transformers.

          What are the changes to TestRegexTransformer that this patch includes? Are these tests that you wrote for the RegexTransformer improvements/fixes that you found earlier? If yes, we should commit them through a different issue. Same should be done for the URIDataSource and associated changes.

          What do you think?
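
          To make that separation concrete, the processor side might look roughly like this; a sketch assuming the entity's data source is a DataSource<Reader> and the location attribute is still called fileName (field, method and class names here are illustrative, and the committed code may differ):

          import java.io.BufferedReader;
          import java.io.Reader;
          import org.apache.solr.handler.dataimport.Context;
          import org.apache.solr.handler.dataimport.DataSource;
          import org.apache.solr.handler.dataimport.EntityProcessorBase;

          // Sketch only: the processor asks the Context for the configured
          // DataSource and hands it the location string; all file/http
          // handling stays inside the DataSource implementation.
          public class SketchLineProcessor extends EntityProcessorBase {
            private BufferedReader reader;

            @SuppressWarnings("unchecked")
            private void openReader(Context context) {
              DataSource<Reader> dataSource = context.getDataSource();
              String location = context.getEntityAttribute("fileName");
              reader = new BufferedReader(dataSource.getData(location));
            }
          }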

          Shalin Shekhar Mangar added a comment -

          Also, you should override EntityProcessorBase#destroy and close the reader object in it.
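
          A minimal sketch of that cleanup, assuming the processor holds its reader in a BufferedReader field (the field name is illustrative):

          @Override
          public void destroy() {
            // Close the underlying reader when the entity is done; there is
            // nothing useful to do if close fails during teardown.
            if (reader != null) {
              try {
                reader.close();
              } catch (java.io.IOException e) {
                // ignore
              } finally {
                reader = null;
              }
            }
            super.destroy();
          }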

          Fergus McMenemie added a comment -

          Oh boy, is this taking ages! Taking the points in order:

          1) OK. I will try to rewrite it again with this point in mind.

          2) This beast has changed its spots big time and I agree its name is now totally inappropriate. It could at a pinch process CSV, tab-separated or many other line-orientated text formats, and the power provided by nested entities and/or custom transformers is considerable. LineEntityProcessor or LineFilterEntityProcessor are good names.

          3) The existing tests in TestRegexTransformer focused on patterns that matched and checked that the expected things happened. My extra tests exercise the behaviour when patterns fail to match. This was an issue earlier on.

          Shalin Shekhar Mangar added a comment -

          Oh boy, is this taking ages! Taking the points in order:

          I didn't mean to overwhelm you. I can pick it up from here. I have a half-cooked patch with the above changes.

          Fergus McMenemie added a comment -

          No! It is a learning process for me. It is just that it is taking so long to get it sorted.

          I will rename the processor

          I am of course very interested in your patch.

          I agree the included TestRegexTransformer patch could perhaps be another JIRA issue. Should I open a new one or reopen SOLR-1080?

          I will add an override for EntityProcessorBase#destroy

          Shalin Shekhar Mangar added a comment -

          OK, here's the patch. Completely untested, sorry.

          1. Removed explicit URLConnection creation. Just use the entity's data source
          2. Does not return baseLocation (because now only the data source knows about it)
          3. Removes a lot of stuff that I think we don't need (and I may be wrong here)
          4. Closes the reader in destroy
          5. Still has TestRegexTransformer changes

          All yours now. Big thanks for bearing with me!

          Fergus McMenemie added a comment -

          Hmmm, how do I apply your patch?

          I tried "patch -p0 < SOLR-1060.patch" and "patch -p1 < SOLR-1060.patch" and it just asked all kinds of difficult questions.

          Shalin Shekhar Mangar added a comment -

          This one should work.

          I've removed the changes to HttpDataSource (the deprecation), which were causing this funniness.

          Fergus McMenemie added a comment -

          Thanks. It is working fine now.

          I have used it on my data and all seems fine; I am now trying to get the testcase working.

          Are you happy with the name "LineFilterEntityProcessor" or do you prefer "LineEntityProcessor"?

          Also what should I do about the TestRegexTransformer testcase?

          Shalin Shekhar Mangar added a comment - edited

          Are you happy with the name "LineFilterEntityProcessor" or do you prefer "LineEntityProcessor"?

          I prefer LineEntityProcessor

          Also what should I do about the TestRegexTransformer testcase?

          Attach it to SOLR-1080. That can go in immediately.

          Fergus McMenemie added a comment -

          Your patch was almost perfect,

          • I sorted comments and other details to suit the new model
          • Renamed the entity to LineEntityProcessor
          • Fixed the unit test module
          • Moved testing of regex stuff elsewhere
          • Tested with my own apps
          Shalin Shekhar Mangar added a comment -

          Thanks Fergus.

          Even though LineEntityProcessor was originally conceived by you for reading files/URLs from a text file, that does not need to be mentioned in the javadocs. I think it can confuse users. The purpose of LineEntityProcessor is simple: just read line by line, accept/reject, and pass on. The documentation should not be more complicated than that.

          Also look at SOLR-1120 that I just opened. There are just so many things in the entity processors that even I cannot keep track of. It is a big change, but very much needed, so let me circle back to this issue after taking care of it.

          Some of the gotchas are:

          1. Right way to clean up. Contrary to my previous comments, destroy is not the right place to do the cleanup.
          2. applyTransformer can return multiple rows which are cached in the entity processor base class
          3. The onError attribute needs to be handled correctly, e.g. abort, skip, continue

          I'll take this forward from here.

          Shalin Shekhar Mangar added a comment -
          1. Renamed fileName attribute to url to be consistent with XPathEntityProcessor
          2. Renamed omitLineRegex to skipLineRegex (renamed internal variables as well)
          3. Updated javadocs to remove mentions of change lists (except in one place)

          All tests pass. I'll commit this shortly.
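
          For reference, a minimal entity configuration under the final naming would look like this (paths and data source name are illustrative):

          <entity name="jc"
                  processor="LineEntityProcessor"
                  url="file:///Volumes/ts/man-find.txt"
                  acceptLineRegex="^.*\.xml$"
                  skipLineRegex="usc2009"
                  rootEntity="false"
                  dataSource="myURIreader">
            <!-- each emitted row carries the raw line; nested entities or
                 transformers turn it into adds/deletes, as in the configs above -->
          </entity>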

          Shalin Shekhar Mangar added a comment -

          Committed revision 766638.

          Thanks Fergus!

          Grant Ingersoll added a comment -

          Bulk close for Solr 1.4


            People

            • Assignee:
              Shalin Shekhar Mangar
              Reporter:
              Fergus McMenemie
            • Votes:
              0
              Watchers:
              0

                Time Tracking

                 Estimated:
                 Original Estimate - 120h
                 Remaining:
                 Remaining Estimate - 120h
                 Logged:
                 Time Spent - Not Specified
