  1. ManifoldCF
  2. CONNECTORS-1598

session based authentication cannot register 401



    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: ManifoldCF 2.12
    • Fix Version/s: None
    • Component/s: Web connector
    • Labels:



      Access to a specific domain is restricted because it is A) an intranet service and B) based on an employee/customer profile.
      So that manifold can authenticate, there is a specific 'domain/login' page with a form, and manifold was configured to enter its username and password there. A session cookie is then set, so manifold is authenticated to access all resources. If a request for a resource is not authenticated, the service returns a 401. In that case the body of the 401 response contains the same form that is present on 'domain/login'.


      The only way we have been able to get manifold authenticated was by specifying session-based credentials AND providing 'domain/login' as a seed in the job as well. The only other seed in the job is a sitemap.
      This is of course not ideal, since it can easily happen that the sitemap seed gets processed first; the sitemap fetch then returns a 401 and the job stops.
      Another possible scenario with this configuration is that the cookie expires, all other resources return 401, and they get deleted from the index (Elasticsearch). There is also another job (different language, same domain), and we have observed that job using the cookie from the previous job.

      Current session-based access credentials configuration:

      --url regular expression : https://\domain/
      --login pages:
      ---login url regexp : 'login'
      ---page type : form
      ---identification regexp is set to match the form-name
      ---form parameters are filled with the correct parameters

      This is verified to work, but as I understand it, it only works because the login page is part of the seeds, so the login url regexp matches when the crawler comes across that page. There is not yet any configuration that redirects (for example) to this page when manifold receives a 401.

      My goal was then to remove the login page from the seeds and configure the job so that each time a fetch returns a 401, manifold knows to go to the login page. In pseudo code:

      --If not authenticated (fetch returned 401)
      ---redirect to login
      ---retry resource
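
To make the intended behaviour concrete, here is a minimal Python sketch of the pseudo code above. It is not ManifoldCF code and does not use any ManifoldCF API: `fetch_with_login`, `fetch`, `login` and `max_logins` are hypothetical names, and the fetch and login steps are passed in as plain functions so the logic can be read on its own.

```python
from typing import Callable, Tuple

Response = Tuple[int, str]  # (HTTP status code, response body)


def fetch_with_login(
    url: str,
    fetch: Callable[[str], Response],
    login: Callable[[], bool],
    max_logins: int = 1,
) -> Response:
    """Fetch url; on a 401, log in and retry, at most max_logins times."""
    status, body = fetch(url)
    logins = 0
    while status == 401 and logins < max_logins:
        if not login():
            break  # login itself failed: hand the 401 back to the caller
        logins += 1
        status, body = fetch(url)  # retry the original resource
    return status, body
```

The retry is capped so that a resource which keeps answering 401 after a successful login (for example because the profile lacks access) cannot cause a login loop.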


      Based on the documentation here: https://manifoldcf.apache.org/release/release-2.12/en_US/end-user-documentation.html#webrepository I tried a few different configurations. The first thing I noticed is that in the comparison table, 'page based authentication' only mentions 4xx and 'session based authentication' only mentions 3xx.

      At this time my biggest question is: are these response codes bound to the difference in settings between page-based and session-based authentication? As far as I have been able to see, whenever manifold receives a 401 it logs "ignoring url {{url}} because it failed to fetch (status=401, ..."
      Am I unable to use session-based authentication when the service returns 401s?


      Configuration attempts (all failed):
      - For all attempts the login page was removed from the seeds.
      - In general I kept the above configuration with page type 'form', in case I managed to redirect manifold to this page.
      - The documentation's list of the kinds of content that a web connection can recognize as a login page includes the option "A page that has specific content on it, as described by a regular expression". Following that description, I tried the page type 'content' setting, with the identification regexp set to '.*' for testing and an override url set to 'domain/login'. My hope was that the match-all regexp would send manifold to the login page for every url it fetches.
      - Since the content of a 401 response also includes the same form as the login page, I tried page type 'form' with the identification regexp and override form parameters supplied just like above, only with the "login url regexp" set to '.*'. My hope was that any page could have the form recognized if it is returned as a 401.

      In both cases the only thing I could see is that manifold fetched the sitemap, received a 401 and logged "ignoring url {{url}} because it failed to fetch (status=401, ..."
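
For reference, this is the difference I expected between the two "login url regexp" values, as a small Python sketch. It uses Python's `re` module rather than the Java regular expressions ManifoldCF uses, and it assumes the regexp only has to match somewhere in the url (substring semantics), which I have not verified. The helper `matching_urls` and the example urls are made up for illustration.

```python
import re
from typing import List


def matching_urls(login_url_regexp: str, urls: List[str]) -> List[str]:
    """Return the urls in which the regexp matches somewhere (assumed semantics)."""
    pattern = re.compile(login_url_regexp)
    return [url for url in urls if pattern.search(url)]


# Hypothetical urls standing in for the real domain.
urls = [
    "https://domain/login",
    "https://domain/sitemap.xml",
    "https://domain/docs/page1",
]

only_login = matching_urls("login", urls)  # just the login page
every_url = matching_urls(".*", urls)      # every url in the list
```

With '.*' every url is a candidate login page, yet the log line above suggests the 401 response is discarded before any login-page matching is applied to it.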

      Some questions:

      • Is there anything that can be done when manifold receives a 401?
      • Is 4xx tied to page-based authentication and 3xx tied to session-based authentication?
      • Is there some other configuration/logic that I am missing and could try out?

      A minimal-effort solution would be a way to make manifold start at the login page and not do any crawling (most importantly, no deleting) when it is unable to authenticate. Together with this, a way to remove the session cookie when the job is done would also be necessary, so that manifold does not pick up an old cookie that has since expired.

      Side note: is there any way to make manifold not delete documents when it receives a 401?
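
The minimal-effort behaviour I have in mind could look roughly like the Python sketch below. Everything in it is hypothetical (`Action`, `action_for_status`, `run_job` and the choice of 404/410 as the only "really gone" codes are my assumptions, not existing ManifoldCF behaviour); it only illustrates the two rules: log in first and do nothing if that fails, and never treat a 401 as a reason to delete.

```python
from enum import Enum
from typing import Callable, Dict, Iterable


class Action(Enum):
    INDEX = "index"    # (re)index the fetched document
    KEEP = "keep"      # leave whatever is already in the index untouched
    DELETE = "delete"  # remove the document from the index


def action_for_status(status: int) -> Action:
    """Map an HTTP status to an index action; a 401 never deletes."""
    if 200 <= status < 300:
        return Action.INDEX
    if status in (404, 410):  # assumed to be the only "really gone" codes
        return Action.DELETE
    return Action.KEEP        # 401, 5xx, ...: not evidence the document is gone


def run_job(
    login: Callable[[], bool],
    urls: Iterable[str],
    fetch_status: Callable[[str], int],
) -> Dict[str, Action]:
    """Log in first; if that fails, crawl nothing and change nothing."""
    if not login():
        return {}
    return {url: action_for_status(fetch_status(url)) for url in urls}
```

Returning an empty plan when the login fails means the job neither crawls nor deletes, which would cover the expired-cookie scenario described earlier.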




            • Assignee:
              goovaertsr roel goovaerts