Solr
  1. Solr
  2. SOLR-2864

DataImportHandler has non-deterministic sort order for XML files

    Details

      Description

      DataImportHandler's FileListEntityProcessor relies on Java's File.list() method to retrieve a list of files from the configured dataimport directory, but list() does not guarantee a sort order (1). This means that if you have two files that update the same record, the results are non-deterministic. Typically, list() does in fact return them lexigraphically sorted, but this is not guaranteed (2).

      An example of how you can get into trouble is to imagine the following:

      xyz.xml – Created one hour ago. Contains updates to records "Foo" and "Bar".
      abc.xml – Created one minute ago. Contains updates to records "Bar" and "Baz".

      In this case, the newest file, in abc.xml, would (likely, but not guaranteed) be run first, updating the "Bar" and "Baz" records. Next, the older file, xyz.xml, would update "Foo" and overwrite "Bar" with outdated changes.

      (1) Per http://download.oracle.com/javase/1,5,0/docs/api/java/io/File.html#list%28%29

      "There is no guarantee that the name strings in the resulting array will appear in any specific order; they are not, in particular, guaranteed to appear in alphabetical order."

      (2) Even if it was guaranteed, lexigraphical sorting would give you the following sort order:

      1.xml
      10.xml
      2.xml
      ...

      1. lucene-2864.patch
        3 kB
        Gabriel Cooper
      2. lucene-2864.patch
        7 kB
        Gabriel Cooper

        Activity

        Hide
        Gabriel Cooper added a comment -

        Patch to sort files by created/modified time.

        Show
        Gabriel Cooper added a comment - Patch to sort files by created/modified time.
        Hide
        Hoss Man added a comment -

        Gabriel: one thing that's not clear to me from your patch is the interplay between the deterministic sorting and the recursion: if i'm reading this right, directories are sorted deterministically by modification date, then they are walked recursively – as opposed to doing a recursive walk, then sorting all the matching files by date. i have no opinion wether that's good or bad, but it seems worthy of consideration, testing, and documentation.

        minor nits:

        1) if the goal is to be deterministic, then only sorting on last mod date doesn't seem like enough – there should also be a secondary sort on something guaranteed to be unique (like fullpath or name) correct?

        2) the CREATED_FIRST and CREATED_SECOND filename constants in your tests should be used consistently, there's no reason for multiple "a.xml" and "b.xml" strings in the test. using the constants everywhere will help make it clear that those files are being created in a specific order for a specific reason.

        Show
        Hoss Man added a comment - Gabriel: one thing that's not clear to me from your patch is the interplay between the deterministic sorting and the recursion: if i'm reading this right, directories are sorted deterministically by modification date, then they are walked recursively – as opposed to doing a recursive walk, then sorting all the matching files by date. i have no opinion wether that's good or bad, but it seems worthy of consideration, testing, and documentation. minor nits: 1) if the goal is to be deterministic, then only sorting on last mod date doesn't seem like enough – there should also be a secondary sort on something guaranteed to be unique (like fullpath or name) correct? 2) the CREATED_FIRST and CREATED_SECOND filename constants in your tests should be used consistently, there's no reason for multiple "a.xml" and "b.xml" strings in the test. using the constants everywhere will help make it clear that those files are being created in a specific order for a specific reason.
        Hide
        Gabriel Cooper added a comment -

        Excellent feedback. I've changed the sort to be depth-first, sorting directories alphabetically, then traversing their files first by date, then by name.

        I've changed the first test to accommodate your feedback, and modified the recursion test to test that.

        Interestingly, the recursion was technically broken and test never tested anything. It used the child directory as its base then ran recursively through ... that one directory.

        Show
        Gabriel Cooper added a comment - Excellent feedback. I've changed the sort to be depth-first, sorting directories alphabetically, then traversing their files first by date, then by name. I've changed the first test to accommodate your feedback, and modified the recursion test to test that. Interestingly, the recursion was technically broken and test never tested anything. It used the child directory as its base then ran recursively through ... that one directory.
        Hide
        Gabriel Cooper added a comment -

        In re-reading my comment, I realized I wasn't clear on that last sentence. Recursion of course works, but the recursion test didn't actually test recursion due to using the wrong directory as the base.

        Show
        Gabriel Cooper added a comment - In re-reading my comment, I realized I wasn't clear on that last sentence. Recursion of course works, but the recursion test didn't actually test recursion due to using the wrong directory as the base.
        Hide
        Shalin Shekhar Mangar added a comment -

        Thanks for the patch Gabriel.

        1. Your use-case seems to be about making sure that when you add new files to a directory, an old file does not overwrite the new records – one could use "newerThan" to process only the new files.
        2. FileListEntityProcessor has never guaranteed order so technically this is not a bug.
        3. What you have proposed is a very arbitrary sort order. In particular, sort order for directories is different than the order for files. Probably it is relevant to your use-case but once we start down this path, we will have people asking for other sort orders.

        That being said, I'd hate for your work to go waste. A change in walk order shouldn't affect anyone because the order was never guaranteed anyway but at the least, we should have the same sort order for both directories and files otherwise the scenario you've described in the issue description can still happen.

        Show
        Shalin Shekhar Mangar added a comment - Thanks for the patch Gabriel. 1. Your use-case seems to be about making sure that when you add new files to a directory, an old file does not overwrite the new records – one could use "newerThan" to process only the new files. 2. FileListEntityProcessor has never guaranteed order so technically this is not a bug. 3. What you have proposed is a very arbitrary sort order. In particular, sort order for directories is different than the order for files. Probably it is relevant to your use-case but once we start down this path, we will have people asking for other sort orders. That being said, I'd hate for your work to go waste. A change in walk order shouldn't affect anyone because the order was never guaranteed anyway but at the least, we should have the same sort order for both directories and files otherwise the scenario you've described in the issue description can still happen.
        Hide
        Gabriel Cooper added a comment -

        newerThan helps keep you from processing the same file twice, but has no relation to sort order of unprocessed files. The reason I sort directories by name is because directory modification dates don't always update when their contents change, e.g. grandchild modifications. To your point, however, you could change it to traverse all directories, find all files, then sort them all. Seems the likelihood of that being what the user wants to be lower, but is easier to communicate.

        Show
        Gabriel Cooper added a comment - newerThan helps keep you from processing the same file twice, but has no relation to sort order of unprocessed files. The reason I sort directories by name is because directory modification dates don't always update when their contents change, e.g. grandchild modifications. To your point, however, you could change it to traverse all directories, find all files, then sort them all. Seems the likelihood of that being what the user wants to be lower, but is easier to communicate.
        Hide
        Hoss Man added a comment -

        FileListEntityProcessor has never guaranteed order so technically this is not a bug.

        I think the crux of the issue is that the order is non-deterministic, and that itself seems like a (design) bug ... we've never said they would be in any particular order, but we probably should have. files should be processed in some consistent order so that multiple full-import runs produce logically consistent indexes.

        The reason I sort directories by name is because directory modification dates don't always update when their contents change, e.g. grandchild modifications.

        True, but it's a lot easier to explain "all files are sorted by mod date and then processed; if a file is a directory then it is processed recursively" then to try and explain that in a given directory, the subdirectories are sorted by name and then recursed, but the files are sorted by date and processed after all the subdirectories (at least, i think that's what happens, kinda of got lost reading the Comparator code)

        The important thing is definitely to make the behavior reliably reproducible, but ideally we should do that in a way that's simple to understand. what you've got now is reproducible, but it would be just as reproducible (and much easier to explain) if it either did a recursive walk and then sorted by "date,name" , or sorted each dir by "date,name" then did the recursive walk.

        i think that if we're going to worry about the fact that "directory modification dates don't always update when their contents change" therefore directories should be sorted by name, then it might be worth considering the idea of just ignoring last modified all together and just sort everything by name – that's guaranteed to be deterministic, and gives the user total control over the order that files/dirs are processed.

        (But persoally: i think a simple "date,name" sort on all files (regardless of whether they are directories) and then "process" would be fine)

        Show
        Hoss Man added a comment - FileListEntityProcessor has never guaranteed order so technically this is not a bug. I think the crux of the issue is that the order is non-deterministic, and that itself seems like a (design) bug ... we've never said they would be in any particular order, but we probably should have. files should be processed in some consistent order so that multiple full-import runs produce logically consistent indexes. The reason I sort directories by name is because directory modification dates don't always update when their contents change, e.g. grandchild modifications. True, but it's a lot easier to explain "all files are sorted by mod date and then processed; if a file is a directory then it is processed recursively" then to try and explain that in a given directory, the subdirectories are sorted by name and then recursed, but the files are sorted by date and processed after all the subdirectories (at least, i think that's what happens, kinda of got lost reading the Comparator code) The important thing is definitely to make the behavior reliably reproducible, but ideally we should do that in a way that's simple to understand. what you've got now is reproducible, but it would be just as reproducible (and much easier to explain) if it either did a recursive walk and then sorted by "date,name" , or sorted each dir by "date,name" then did the recursive walk. i think that if we're going to worry about the fact that "directory modification dates don't always update when their contents change" therefore directories should be sorted by name, then it might be worth considering the idea of just ignoring last modified all together and just sort everything by name – that's guaranteed to be deterministic, and gives the user total control over the order that files/dirs are processed. (But persoally: i think a simple "date,name" sort on all files (regardless of whether they are directories) and then "process" would be fine)
        Hide
        Hoss Man added a comment -

        bulk fixing the version info for 4.0-ALPHA and 4.0 all affected issues have "hoss20120711-bulk-40-change" in comment

        Show
        Hoss Man added a comment - bulk fixing the version info for 4.0-ALPHA and 4.0 all affected issues have "hoss20120711-bulk-40-change" in comment
        Hide
        Robert Muir added a comment -

        rmuir20120906-bulk-40-change

        Show
        Robert Muir added a comment - rmuir20120906-bulk-40-change
        Hide
        Robert Muir added a comment -

        moving all 4.0 issues not touched in a month to 4.1

        Show
        Robert Muir added a comment - moving all 4.0 issues not touched in a month to 4.1
        Hide
        Steve Rowe added a comment -

        Bulk move 4.4 issues to 4.5 and 5.0

        Show
        Steve Rowe added a comment - Bulk move 4.4 issues to 4.5 and 5.0
        Hide
        Uwe Schindler added a comment -

        Move issue to Solr 4.9.

        Show
        Uwe Schindler added a comment - Move issue to Solr 4.9.

          People

          • Assignee:
            Shalin Shekhar Mangar
            Reporter:
            Gabriel Cooper
          • Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:

              Time Tracking

              Estimated:
              Original Estimate - 1h
              1h
              Remaining:
              Remaining Estimate - 1h
              1h
              Logged:
              Time Spent - Not Specified
              Not Specified

                Development