  1. ManifoldCF
  2. CONNECTORS-1197

FileSystem output connector error with some file names

Details

    Description

      I'm having problems running a job that uses a web crawl as its source and a file system output connector as its target.

      The job terminates with an error like the following (I think it is caused by special characters in the file name).

      Error: Could not create file 'E:\ManifoldCF\http\nypost.com\2015\05\06\bloombergs-the-man-to-beat-hillary-for-democratic-nomination?msg=fail&shared=email': E:\ManifoldCF\http\nypost.com\2015\05\06\bloombergs-the-man-to-beat-hillary-for-democratic-nomination?msg=fail&shared=email (The filename, directory name, or volume label syntax is incorrect)

        Activity

          kwright@metacarta.com Karl Wright added a comment -

          second fix:
          r1678329 (trunk)
          r1678330 (dev_1x)

          kwright@metacarta.com Karl Wright added a comment -

          r1678300 (trunk)
          r1678301 (dev_1x)

          aasta Andrea added a comment -

          From my point of view the first solution would be better. The second one is valid, but I would end up missing some documents in the output without a specific "crawling reason". Does that make sense to you?

          kwright@metacarta.com Karl Wright added a comment -

          Hi Andrea,

          It is not possible to just detect a failure and then modify the document name, for several reasons. One is that Java does not give us good feedback about exactly what is wrong with the filename. Another is that the connector also has to handle document deletion, which has an entirely different error structure.

          Your only choices are therefore the following:
          (1) A special "windows" mode, which does an entirely different character mapping and where no attempt is made to be wget compliant at all;
          (2) Skipping any files whose names cause hard errors on write.

          Thanks.
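Option (2) above could look roughly like the following. This is only a sketch under stated assumptions, not the code committed for this issue: the class `SkipOnBadName` and the method `writeOrSkip` are hypothetical, and it uses `java.nio.file` rather than whatever file API the connector actually uses. The idea is to treat any hard error while resolving or writing the target path as "skip this document" instead of aborting the job.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.InvalidPathException;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, for illustration only.
final class SkipOnBadName {

    /** Returns true if the file was written, false if the document was skipped. */
    static boolean writeOrSkip(Path baseDir, String relativeName, byte[] content) {
        try {
            // resolve() throws InvalidPathException when the platform rejects the name
            // (for example '?' on Windows, or a NUL character anywhere).
            Path target = baseDir.resolve(relativeName);
            Path parent = target.getParent();
            if (parent != null) {
                Files.createDirectories(parent);
            }
            Files.write(target, content);
            return true;
        } catch (InvalidPathException | IOException e) {
            // Hard error on write: log it and skip the document rather than fail the job.
            System.err.println("Skipping '" + relativeName + "': " + e.getMessage());
            return false;
        }
    }

    public static void main(String[] args) {
        Path base = Paths.get(System.getProperty("java.io.tmpdir"));
        byte[] body = {104, 105};
        System.out.println(writeOrSkip(base, "skip-sketch-ok.txt", body));
        System.out.println(writeOrSkip(base, "bad\0name", body));
    }
}
```

A real connector would also have to apply the same skip decision on deletion, since a document that was never written cannot be removed later.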

          aasta Andrea added a comment -

          Hi Karl,
          thanks for your reply.
          From a user's point of view, I think the expectation would be to have the document stored in any case, for example by replacing every unacceptable character with a substitute (just a _, for example). In my opinion this would be the best solution, but skipping the document could also be one.

          Thank you!
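As a minimal sketch of the replacement idea above (not the fix that was committed for this issue), the following Java snippet replaces each character that Windows rejects in a file name with `_`. The class and method names are hypothetical, and the reserved-character set comes from the general Windows file naming rules, not from the connector's code.

```java
// Hypothetical helper, for illustration only.
final class WindowsNameSanitizer {

    // Characters Windows does not allow in a file or directory name.
    private static final String RESERVED = "<>:\"/\\|?*";

    /** Sanitizes a single path segment (not a whole path, since '\' and '/' are replaced too). */
    static String sanitizeSegment(String segment) {
        StringBuilder sb = new StringBuilder(segment.length());
        for (int i = 0; i < segment.length(); i++) {
            char c = segment.charAt(i);
            // ASCII control characters (below 0x20) are rejected by Windows as well.
            if (c < 0x20 || RESERVED.indexOf(c) >= 0) {
                sb.append('_');
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String name = "bloombergs-the-man-to-beat-hillary-for-democratic-nomination?msg=fail&shared=email";
        // prints bloombergs-the-man-to-beat-hillary-for-democratic-nomination_msg=fail&shared=email
        System.out.println(sanitizeSegment(name));
    }
}
```

Note that a plain substitution like this can map two different URLs to the same file name, and it does not cover reserved device names such as CON or names ending in a dot or space.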

          kwright@metacarta.com Karl Wright added a comment -

          Hi Andrea,

          As discussed on the list, this connector is meant to be wget-compliant. Since wget is a Unix tool, it may well choose file names that are not compatible with the Windows operating system. That's to be expected.

          The question is, what do you want it to do in that case? The easiest thing to do would be to just skip the file entirely. Is that your suggestion?


          People

            kwright@metacarta.com Karl Wright
            aasta Andrea