XML Commons / XMLCOMMONS-61

Please make catalog use the default instead of an afterthought

    Details

      Description

      W3C gets an immense amount of DTD traffic, with the user-agent often identifying
      itself only as Python or Java.

      http://www.w3.org/blog/systeam/2008/02/08/w3c_s_excessive_dtd_traffic

      In a number of cases we have heard back from people affected by our automated
      blocking, indicating they are running Xalan and/or Xerces to do such things as
      validating XML or running XSL transforms. We have directed some of those we
      have corresponded with to your catalog instructions.

      http://xerces.apache.org/xerces2-j/faq-xcatalogs.html

      The vast majority of Xalan/Xerces installations most likely implement neither
      catalogs nor caching of external DTDs and other schemata. It would seem the
      resolver does not care about HTTP response codes or caching directives.

      http://www.ietf.org/rfc/rfc2616.txt

      Better than a default catalog would be a caching XML Catalog resolver, which I
      understand is part of Glassfish.

      http://norman.walsh.name/2007/09/07/treadLightly

      There are other Java libraries contributing to this traffic as well. Xalan and
      Xerces are widely used, important libraries. Your assistance in reducing this
      excessive traffic to W3C and others hosting standards schemata would be greatly
      appreciated.

        Activity

        Michael Glavassevich added a comment - - edited

        > A schema that really does change frequently (even on every call) can simply send no cache HTTP headers.

        Seems like something that should be handled at the protocol level (i.e. java.net.*). This isn't specific to XML. Also, Xerces and other XML processors I'm familiar with don't read HTTP headers. They just delegate the URL handling (just passing through the URI string it got from the application without inspecting it) to Java's java.net.URL class and that could return a byte stream from a cache if appropriate. Still a risk of breaking someone (headers sometimes lie) but at least the behaviour would be consistent across the Java platform.

        > That is probably pretty rare edge case but like you say could be an optional configuration to disable per application or server wide.

        I wasn't thinking about a schema which changes every time you ask for it, but can imagine frequent updates of a schema that is still under development or one that is constructed from example data and keeps getting updated to accommodate more examples as they're discovered.

        > If the cached catalog resides in memory instead of disk yes there will be a modest memory cost.

        Xerces and other general purpose XML libraries do not know what environment they're going to be deployed into. There may be no local disk drive on the device running the JVM or the application may not have write permissions. Storage costs may be modest for W3C schemata but I've seen several industry standard schemas that are many megabytes in size. The schemas used by a complex business application may take up a significant amount of the heap if they were all cached in memory at once.

        Ted Guild added a comment -

        A schema that really does change frequently (even on every call) can simply send no cache HTTP headers. That is probably pretty rare edge case but like you say could be an optional configuration to disable per application or server wide. I would wager the majority use fairly static schemata. Evidently many XML processing applications tolerate not getting schemata from us, or we would get many more complaints about our efforts to keep this traffic in check, so another option could be to not reference the schemata at all.

        If the cached catalog resides in memory instead of disk yes there will be a modest memory cost. Retrieving over the internet from a server that may block or tarpit you, or respond slowly due to network latency, bandwidth rate limiting or overload, is potentially a far greater performance hit than reading from local disk.

        Michael Glavassevich added a comment - - edited

        Ted, I recall having a discussion with one of your colleagues (might have been Philippe Le Hegaret) several years ago about this caching catalog concept. It's fine in principle but XML libraries aren't static things. There may be multiple applications running in the same JVM, each with their own requirements for loading schemata and each managing multiple instances of parsers, transformers, etc. Any solution which is going to be imposed globally is going to break someone out there (e.g. a schema which really does change at runtime and needs to be loaded multiple times, or an application with strict memory requirements which wants that schema to be garbage collected whenever it's not doing XML processing) and we cannot make that the default (and the JDK really can't do that either if they care at all about compatibility), though including a caching catalog resolver which a user could choose to enable per application would be okay.
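
The opt-in caching resolver discussed in this comment could look roughly like the sketch below. It is only an illustration, not code from Xerces or XML Commons: the class name `CachingEntityResolver` and the injected `fetcher` function are hypothetical, and the cache is a plain in-memory map with no eviction and no regard for HTTP caching directives, so the memory concerns raised in this thread still apply.

```java
import java.io.ByteArrayInputStream;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

/** Hypothetical opt-in resolver that keeps fetched external entities in memory. */
final class CachingEntityResolver implements EntityResolver {
    // system ID -> raw bytes of the entity (DTD, schema, ...)
    private final Map<String, byte[]> cache = new ConcurrentHashMap<>();
    // Assumption: the caller supplies how an uncached entity is retrieved.
    private final Function<String, byte[]> fetcher;

    CachingEntityResolver(Function<String, byte[]> fetcher) {
        this.fetcher = fetcher;
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId) {
        if (systemId == null) {
            return null; // nothing to key on: let the parser use its default handling
        }
        // The fetcher runs only on the first request for a given system ID.
        byte[] bytes = cache.computeIfAbsent(systemId, fetcher);
        if (bytes == null) {
            return null; // fetcher declined: fall back to the parser default
        }
        InputSource source = new InputSource(new ByteArrayInputStream(bytes));
        source.setPublicId(publicId);
        source.setSystemId(systemId);
        return source;
    }

    /** Number of distinct entities currently held in memory. */
    int size() {
        return cache.size();
    }
}
```

An application would enable it explicitly, for example through `DocumentBuilder.setEntityResolver(...)`, so other applications in the same JVM keep the existing default behaviour.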

        Ted Guild added a comment - - edited

        Michael,

        I'm surprised to get a response 2 years after filing this bug report. Thank you for replying. I only noticed an email today, perhaps due to a status change, letting me know there was a reply.

        It would be great if Xalan and Xerces came with an XML catalog by default like many XML processing libraries do. Yes, one can add this afterwards, but since it is optional and not there from the start, the vast majority of instances do not use the tools you mention.

        We have blocked (HTTP 503, TCP), tarpitted and have made efforts to educate yet the traffic just grows (I've seen peaks at half a billion a day).

        As maintaining a catalog to include additional schemata for emerging XML formats is tedious, the suggestion is to have the resolver write these to the catalog itself, i.e. a caching catalog.

        We also recently started talking to JDK engineers to see if they can't do this upstream for all XML libraries.

        Michael Glavassevich added a comment -

        Ted, I'm not sure what you're suggesting we do. Changing default behaviours has the potential to break many applications. We simply cannot do that.

        There are well documented ways for applications to avoid or reduce network access, including the use of XML Catalogs, custom entity resolvers and the grammar caching facilities supported by Xerces and also the JAXP standard. The tools are there. People should be using them.

        I believe improving the situation is a matter of education. The more folks you block with a 503 response the more they'll realize that they need to do something and will have to change their application for it to work again.
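
As an illustration of the custom entity resolver approach mentioned in this comment, the following sketch maps known system identifiers to local copies so the parser does not fetch them over the network. The class name `LocalDtdResolver`, the `parse` helper and the mapping itself are hypothetical; only the standard SAX and JAXP interfaces (`EntityResolver`, `InputSource`, `DocumentBuilder.setEntityResolver`) are assumed.

```java
import java.io.IOException;
import java.io.StringReader;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

/** Hypothetical resolver that serves known DTDs from local files. */
final class LocalDtdResolver implements EntityResolver {
    // system ID as written in the DOCTYPE -> local copy on disk
    private final Map<String, Path> localCopies;

    LocalDtdResolver(Map<String, Path> localCopies) {
        this.localCopies = new HashMap<>(localCopies);
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId) throws IOException {
        Path local = (systemId == null) ? null : localCopies.get(systemId);
        if (local == null) {
            // Unknown entity: returning null lets the parser fall back to its
            // default behaviour, which may still open a network connection.
            return null;
        }
        InputSource source = new InputSource(Files.newInputStream(local));
        source.setPublicId(publicId);
        // Keep the original system ID so relative references resolve as before.
        source.setSystemId(systemId);
        return source;
    }

    /** Parses the given XML text with the supplied resolver installed. */
    static Document parse(String xml, EntityResolver resolver) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        builder.setEntityResolver(resolver);
        return builder.parse(new InputSource(new StringReader(xml)));
    }
}
```

A full XML Catalog resolver covers the same need declaratively; this sketch only shows the minimal hook an application can install per parser instance.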


          People

          • Assignee: Unassigned
          • Reporter: Ted Guild