SOLR-2864

DataImportHandler has non-deterministic sort order for XML files

Details

    Description

      DataImportHandler's FileListEntityProcessor relies on Java's File.list() method to retrieve the list of files from the configured dataimport directory, but list() does not guarantee any particular order (1). This means that if two files update the same record, the result is non-deterministic: whichever file happens to be processed last wins. In practice, list() often returns the names lexicographically sorted, but this is not guaranteed (2).
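
      To illustrate the underlying point, here is a minimal sketch of making the listing order explicit by sorting the array that File.list() returns. The class and method names (SortedFileListing, listSorted) are hypothetical and are not taken from DataImportHandler or from the attached patches.

```java
import java.io.File;
import java.util.Arrays;

public class SortedFileListing {

    /**
     * Returns the entry names of dir in lexicographic order.
     * Returns an empty array if dir cannot be listed.
     */
    static String[] listSorted(File dir) {
        // File.list() returns null if dir is not a directory or an I/O error occurs,
        // and makes no promise about the order of the names it returns.
        String[] names = dir.list();
        if (names == null) {
            return new String[0];
        }
        // String's natural ordering is lexicographic, so the result no longer
        // depends on the platform or file system.
        Arrays.sort(names);
        return names;
    }

    public static void main(String[] args) {
        File dir = new File(args.length > 0 ? args[0] : ".");
        for (String name : listSorted(dir)) {
            System.out.println(name);
        }
    }
}
```

      Sorting by name only makes the order repeatable. It does not by itself make the order correct for files that update the same record.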

      An example of how you can get into trouble is to imagine the following:

      xyz.xml – Created one hour ago. Contains updates to records "Foo" and "Bar".
      abc.xml – Created one minute ago. Contains updates to records "Bar" and "Baz".

      In this case, the newer file, abc.xml, would (likely, but not guaranteed) be processed first, updating the "Bar" and "Baz" records. Next, the older file, xyz.xml, would update "Foo" and overwrite "Bar" with outdated changes.
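
      The scenario above can be simulated in a few lines. This is only a sketch: ImportFile, apply, byName and oldestFirst are hypothetical names invented for the illustration, the timestamps are made up, and the "index" is a plain map that records which file last wrote each record.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ImportOrderDemo {

    /** Hypothetical stand-in for one import file. */
    static final class ImportFile {
        final String name;
        final long lastModifiedMillis;
        final List<String> recordIds;

        ImportFile(String name, long lastModifiedMillis, List<String> recordIds) {
            this.name = name;
            this.lastModifiedMillis = lastModifiedMillis;
            this.recordIds = recordIds;
        }
    }

    /** Applies the files in the given order; a later file overwrites an earlier one. */
    static Map<String, String> apply(List<ImportFile> files) {
        Map<String, String> index = new LinkedHashMap<>();
        for (ImportFile f : files) {
            for (String id : f.recordIds) {
                index.put(id, f.name); // remember which file wrote this record last
            }
        }
        return index;
    }

    /** Returns a copy of files ordered by name (lexicographic). */
    static List<ImportFile> byName(List<ImportFile> files) {
        List<ImportFile> sorted = new ArrayList<>(files);
        sorted.sort(Comparator.comparing((ImportFile f) -> f.name));
        return sorted;
    }

    /** Returns a copy of files ordered by modification time, oldest first. */
    static List<ImportFile> oldestFirst(List<ImportFile> files) {
        List<ImportFile> sorted = new ArrayList<>(files);
        sorted.sort(Comparator.comparingLong((ImportFile f) -> f.lastModifiedMillis));
        return sorted;
    }

    public static void main(String[] args) {
        // xyz.xml is the older file; abc.xml was written 59 minutes later.
        List<ImportFile> files = Arrays.asList(
            new ImportFile("xyz.xml", 0L, Arrays.asList("Foo", "Bar")),
            new ImportFile("abc.xml", 59 * 60 * 1000L, Arrays.asList("Bar", "Baz")));

        // Name order runs abc.xml first, so the stale xyz.xml wins for "Bar".
        System.out.println(apply(byName(files)).get("Bar"));      // prints "xyz.xml"
        // Oldest-first order lets the newest update win.
        System.out.println(apply(oldestFirst(files)).get("Bar")); // prints "abc.xml"
    }
}
```

      Ordering by modification time is one possible choice here, assuming file timestamps reflect the order in which the updates were produced.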

      (1) Per http://download.oracle.com/javase/1,5,0/docs/api/java/io/File.html#list%28%29

      "There is no guarantee that the name strings in the resulting array will appear in any specific order; they are not, in particular, guaranteed to appear in alphabetical order."

      (2) Even if it were guaranteed, lexicographical sorting would give you the following order for numerically named files:

      1.xml
      10.xml
      2.xml
      ...
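
      If numerically named files should run in numeric order, a "natural order" comparator is one option. The sketch below is an assumption about what such a comparator could look like. NaturalFileNameOrder and NATURAL are hypothetical names, and only runs of ASCII digits are treated as numbers.

```java
import java.util.Arrays;
import java.util.Comparator;

public class NaturalFileNameOrder {

    /** Compares names piece by piece: digit runs by numeric value, other characters by code unit. */
    static final Comparator<String> NATURAL = (a, b) -> {
        int i = 0, j = 0;
        while (i < a.length() && j < b.length()) {
            char ca = a.charAt(i), cb = b.charAt(j);
            if (isDigit(ca) && isDigit(cb)) {
                int iEnd = digitRunEnd(a, i), jEnd = digitRunEnd(b, j);
                int cmp = compareDigitRuns(a.substring(i, iEnd), b.substring(j, jEnd));
                if (cmp != 0) {
                    return cmp;
                }
                i = iEnd;
                j = jEnd;
            } else {
                if (ca != cb) {
                    return Character.compare(ca, cb);
                }
                i++;
                j++;
            }
        }
        // The shorter remainder sorts first; fall back to plain String order
        // for names that differ only in leading zeros (e.g. "01.xml" vs "1.xml").
        int cmp = Integer.compare(a.length() - i, b.length() - j);
        return cmp != 0 ? cmp : a.compareTo(b);
    };

    static boolean isDigit(char c) {
        return c >= '0' && c <= '9';
    }

    /** Returns the index just past the run of digits that starts at index start. */
    static int digitRunEnd(String s, int start) {
        int end = start;
        while (end < s.length() && isDigit(s.charAt(end))) {
            end++;
        }
        return end;
    }

    /** Compares two digit strings by numeric value without parsing, so long runs cannot overflow. */
    static int compareDigitRuns(String x, String y) {
        x = stripLeadingZeros(x);
        y = stripLeadingZeros(y);
        if (x.length() != y.length()) {
            return Integer.compare(x.length(), y.length());
        }
        return x.compareTo(y);
    }

    static String stripLeadingZeros(String s) {
        int k = 0;
        while (k < s.length() - 1 && s.charAt(k) == '0') {
            k++;
        }
        return s.substring(k);
    }

    public static void main(String[] args) {
        String[] names = {"10.xml", "2.xml", "1.xml"};
        Arrays.sort(names);
        System.out.println(Arrays.toString(names)); // prints [1.xml, 10.xml, 2.xml]
        Arrays.sort(names, NATURAL);
        System.out.println(Arrays.toString(names)); // prints [1.xml, 2.xml, 10.xml]
    }
}
```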

      Attachments

        1. lucene-2864.patch
          7 kB
          Gabriel Cooper
        2. lucene-2864.patch
          3 kB
          Gabriel Cooper

            People

              shalin Shalin Shekhar Mangar
              blackbox@inanutshell.us Gabriel Cooper
              Votes: 0
              Watchers: 3

                Time Tracking

                  Original Estimate: 1h
                  Remaining Estimate: 1h
                  Time Spent: Not Specified