ManifoldCF / CONNECTORS-1547

No activity record for excluded documents in WebCrawlerConnector

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: ManifoldCF 2.12
    • Component/s: Web connector
    • Labels:
      None

      Description

      Hi,

      I noticed that no activity record is logged for documents excluded by the Document Filter transformation connector when the repository connector is the WebCrawler connector.

      To reproduce the issue on MCF out of the box :

      Null output connector 

      Web repository connector 

      Job :

      • DocumentFilter added which only accepts application/msword (doc/docx) documents

      The simple history does not mention the excluded documents (except for HTML documents). They have a fetch activity and nothing else (see simple_history_web.jpg).
      The excluded documents are only visible in the MCF log (with DEBUG verbosity enabled for connectors):

      Removing url 'https://www.datafari.com/assets/img/Logo_Datafari_4_Condensed_No_D_20180606_30x30.png' because it had the wrong content type ('image/png')

      (see manifoldcf_local_files.log)

      The related code is in WebcrawlerConnector.java l.904 :

      fetchStatus.contextMessage = "it had the wrong content type ('"+contentType+"')";
      fetchStatus.resultSignal = RESULT_NO_DOCUMENT;
      activityResultCode = null;

      The activityResultCode is null here, which appears to be why no activity record is logged for the excluded document.

      If we configure the same job with a Local File system connector and the same Document Filter transformation connector, the simple history lists all the excluded documents (see simple_history_files.jpg). In that case the code sets a specific error code, and an activity record is logged (class FileConnector l. 415):

      if (!activities.checkMimeTypeIndexable(mimeType))
      {
        errorCode = activities.EXCLUDED_MIMETYPE;
        errorDesc = "Excluded because mime type ('"+mimeType+"')";
        Logging.connectors.debug("Skipping file '"+documentIdentifier+"' because mime type ('"+mimeType+"') was excluded by output connector.");
        activities.noDocument(documentIdentifier,versionString);
        continue;
      }

      So I think the Web Crawler connector should behave the same way as FileConnector and explicitly record every document excluded by the user's filter.
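
To illustrate the suggestion, here is a minimal, self-contained sketch of the difference between the two behaviours. It is not ManifoldCF code: ActivityLog, ExclusionSketch, recordIfPresent, excludeCurrent and excludeProposed are hypothetical names, and the string value given to EXCLUDED_MIMETYPE is an assumption. It also assumes that the connector only writes an activity record when the result code is non-null.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the simple history; not a ManifoldCF class.
class ActivityLog {
    final List<String> records = new ArrayList<>();

    // Assumption: a null result code means no activity record is written.
    void recordIfPresent(String url, String resultCode, String description) {
        if (resultCode != null) {
            records.add(url + " | " + resultCode + " | " + description);
        }
    }
}

class ExclusionSketch {
    // Assumed value; FileConnector uses activities.EXCLUDED_MIMETYPE.
    static final String EXCLUDED_MIMETYPE = "EXCLUDEDMIMETYPE";

    // Behaviour described in this report: the result code stays null.
    static void excludeCurrent(ActivityLog log, String url, String contentType) {
        String contextMessage = "it had the wrong content type ('" + contentType + "')";
        String activityResultCode = null;
        log.recordIfPresent(url, activityResultCode, contextMessage);
    }

    // Suggested behaviour: set an explicit code so the exclusion is recorded.
    static void excludeProposed(ActivityLog log, String url, String contentType) {
        String contextMessage = "it had the wrong content type ('" + contentType + "')";
        String activityResultCode = EXCLUDED_MIMETYPE;
        log.recordIfPresent(url, activityResultCode, contextMessage);
    }

    public static void main(String[] args) {
        ActivityLog current = new ActivityLog();
        excludeCurrent(current, "https://example.com/logo.png", "image/png");
        System.out.println("current behaviour, records: " + current.records.size());

        ActivityLog proposed = new ActivityLog();
        excludeProposed(proposed, "https://example.com/logo.png", "image/png");
        System.out.println("proposed behaviour, records: " + proposed.records.size());
    }
}
```

With the explicit code, the excluded URL would show up in the history together with the reason, as it already does for the file system connector.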

       

      Best regards,

      Olivier

        Attachments

        1. simple_history_web.jpg
          224 kB
          Olivier Tavard
        2. manifoldcf_web.log
          185 kB
          Olivier Tavard
        3. simple_history_files.jpg
          229 kB
          Olivier Tavard
        4. manifoldcf_local_files.log
          24 kB
          Olivier Tavard

          Activity

            People

            • Assignee:
              kwright@metacarta.com Karl Wright
              Reporter:
              olivierfl Olivier Tavard
            • Votes:
              0
              Watchers:
              2