Tika / TIKA-2449

Enabling extraction of standard references from text

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.17
    • Component/s: handler
    • Labels:

      Description

      Apache Tika currently provides many ContentHandler implementations that help to de-obfuscate specific information from text. For instance, the PhoneExtractingContentHandler extracts phone numbers while parsing.

      This improvement adds the StandardsExtractingContentHandler to Tika, a new ContentHandler that relies on regular expressions in order to identify and extract standard references from text.
      A standard reference is a reference to a norm, convention, or requirement (i.e., a standard) released by a standard organization. This work mainly focuses on identifying and extracting the references to standards cited within a given document (e.g., a SOW/PWS), so that the references can be stored and provided to the user as additional metadata when the StandardsExtractingContentHandler is used.

      In addition to the patch, the first version of the StandardsExtractingContentHandler, along with an example class to easily execute the handler, is available on GitHub. The following sections describe in more detail how the StandardsExtractingContentHandler has been developed.

      Background

      From a technical perspective, a standard reference is a string that is usually composed of two parts:

      1. the name of the standard organization;
      2. the alphanumeric identifier of the standard within the organization.
        Specifically, the first part can include the acronym, the full name of the standard organization, or both. The second part is an alphanumeric string that identifies the standard within the organization and may contain one or more separation symbols (e.g., "-", "_", ".") depending on the format adopted by the organization.

      Furthermore, standard references are usually reported within the "Applicable Documents" or "References" section of a SOW, and they can also be cited within sections whose header includes the word "standard", "requirement", "guideline", or "compliance".

      Consequently, the citation of standard references within a SOW/PWS document can be summarized by the following rules:

      • RULE #1: standard references are usually reported within the section named "Applicable Documents" or "References".
      • RULE #2: standard references can also be cited within sections including the word "compliance" or another semantically equivalent word in their name.
      • RULE #3: a standard reference is composed of two parts:
        • Name of the standard organization (acronym, full name, or both).
        • Alphanumeric identifier of the standard within the organization.
      • RULE #4: The name of the standard organization includes the acronym or the full name or both. The name must belong to the set of standard organizations S = O U V, where O represents the set of open standard organizations (e.g., ANSI) and V represents the set of vendor-specific standard organizations (e.g., Motorola).
      • RULE #5: A separation symbol (e.g., "-", "_", "." or whitespace) can be used between the name of the standard organization and the alphanumeric identifier.
      • RULE #6: The alphanumeric identifier of the standard is composed of alphabetic and numeric characters, possibly split in two or more parts by a separation symbol (e.g., "-", "_", ".").

      On the basis of the above rules, here are some examples of formats used for reporting standard references within a SOW/PWS:

      • <ORGANIZATION_ACRONYM><SEPARATION_SYMBOL><ALPHANUMERIC_IDENTIFIER>
      • <ORGANIZATION_ACRONYM><SEPARATION_SYMBOL>(<ORGANIZATION_FULL_NAME>)<SEPARATION_SYMBOL><ALPHANUMERIC_IDENTIFIER>
      • <ORGANIZATION_FULL_NAME><SEPARATION_SYMBOL>(<ORGANIZATION_ACRONYM>)<SEPARATION_SYMBOL><ALPHANUMERIC_IDENTIFIER>

      Moreover, some standards are sometimes released by two standard organizations. In this case, the standard reference can be reported as follows:

      • <MAIN_ORGANIZATION_ACRONYM>/<SECOND_ORGANIZATION_ACRONYM><SEPARATION_SYMBOL><ALPHANUMERIC_IDENTIFIER>
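
      As an illustration, the acronym-based formats above can be approximated with a small Java pattern. This is only a sketch under stated assumptions: the organization list is a hypothetical sample of the set S (it is not the list shipped with Tika), the class, method, and constant names are invented for this example, and the formats that embed the full organization name are not covered.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class ReferenceFormatSketch {
    // Hypothetical sample of the set S = O U V (assumption, not Tika's list).
    static final List<String> ORGANIZATIONS =
            Arrays.asList("ANSI", "IEEE", "ISO", "TIA", "Motorola");

    // RULE #4: the organization name must belong to S.
    static final String ORG = ORGANIZATIONS.stream()
            .map(Pattern::quote)
            .collect(Collectors.joining("|", "(?:", ")"));

    // RULE #5: one separation symbol between organization and identifier.
    static final String SEPARATOR = "[-_.\\s]";

    // RULE #6: alphanumeric parts, optionally split by "-", "_" or ".".
    static final String IDENTIFIER = "[A-Z0-9]+(?:[-_.][A-Z0-9]+)*";

    // Covers <ACRONYM><SEP><ID> and <MAIN>/<SECOND><SEP><ID>.
    static final Pattern REFERENCE = Pattern.compile(
            "(?<main>" + ORG + ")(?:/(?<second>" + ORG + "))?"
            + SEPARATOR + "(?<id>" + IDENTIFIER + ")");

    // Returns the identifier of the first reference found, or null if none.
    static String identifierOf(String text) {
        Matcher m = REFERENCE.matcher(text);
        return m.find() ? m.group("id") : null;
    }

    public static void main(String[] args) {
        System.out.println(identifierOf("ANSI/TIA-222-G"));
        System.out.println(identifierOf("ISO 9001"));
    }
}
```

      Quoting each organization name with Pattern.quote keeps names that contain regex metacharacters from changing the meaning of the pattern.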

      Regular Expressions

      The StandardsExtractingContentHandler uses a helper class named StandardsText that relies on Java regular expressions and provides some methods to identify headers and standard references, and determine the score of the references found within the given text.

      Here are the main regular expressions used within the StandardsText class:

      • REGEX_HEADER: regular expression to match only uppercase headers.
          (\d+\.(\d+\.?)*)\p{Blank}+([A-Z]+(\s[A-Z]+)*){5,}
          
      • REGEX_APPLICABLE_DOCUMENTS: regular expression to match the header of "APPLICABLE DOCUMENTS" and equivalent sections.
          (?i:.*APPLICABLE\sDOCUMENTS|REFERENCE|STANDARD|REQUIREMENT|GUIDELINE|COMPLIANCE.*)
          
      • REGEX_FALLBACK: regular expression to match a string that is supposed to be a standard reference.
          \(?(?<mainOrganization>[A-Z]\w+)\)?((\s?(?<separator>\/)\s?)(\w+\s)*\(?(?<secondOrganization>[A-Z]\w+)\)?)?(\s(Publication|Standard))?(-|\s)?(?<identifier>([0-9]{3,}|([A-Z]+(-|_|\.)?[0-9]{2,}))((-|_|\.)?[A-Z0-9]+)*)
          
      • REGEX_STANDARD: regular expression to match the standard organization within a string potentially representing a standard reference.
        This regular expression is obtained by using a helper class named StandardOrganizations, which provides a list of the most important standard organizations reported on Wikipedia. The list is composed of international standard organizations, regional standard organizations, and, among nationally-based standard organizations, the American and British ones. Other lists of standard organizations are reported on OpenStandards and the IBR Standards Portal.
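
      To see how the first three expressions behave, they can be compiled with java.util.regex and applied to a few sample strings. The patterns below are copied from the list above; everything else (the class name, the helper methods, the sample inputs, and the choice of find() rather than matches()) is an assumption made for this sketch and does not necessarily mirror how StandardsText applies them.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StandardsRegexDemo {
    // Patterns copied from the issue description.
    static final Pattern HEADER = Pattern.compile(
            "(\\d+\\.(\\d+\\.?)*)\\p{Blank}+([A-Z]+(\\s[A-Z]+)*){5,}");
    static final Pattern APPLICABLE_DOCUMENTS = Pattern.compile(
            "(?i:.*APPLICABLE\\sDOCUMENTS|REFERENCE|STANDARD|REQUIREMENT|GUIDELINE|COMPLIANCE.*)");
    static final Pattern FALLBACK = Pattern.compile(
            "\\(?(?<mainOrganization>[A-Z]\\w+)\\)?"
            + "((\\s?(?<separator>\\/)\\s?)(\\w+\\s)*\\(?(?<secondOrganization>[A-Z]\\w+)\\)?)?"
            + "(\\s(Publication|Standard))?(-|\\s)?"
            + "(?<identifier>([0-9]{3,}|([A-Z]+(-|_|\\.)?[0-9]{2,}))((-|_|\\.)?[A-Z0-9]+)*)");

    // True if the line contains a numbered, uppercase header.
    static boolean isHeader(String line) {
        return HEADER.matcher(line).find();
    }

    // True if the header names an "Applicable Documents"-like section.
    static boolean isApplicableDocuments(String header) {
        return APPLICABLE_DOCUMENTS.matcher(header).find();
    }

    // Returns {mainOrganization, secondOrganization, identifier}, or null if
    // nothing matches; secondOrganization is null when it is absent.
    static String[] parse(String text) {
        Matcher m = FALLBACK.matcher(text);
        if (!m.find()) {
            return null;
        }
        return new String[] {
            m.group("mainOrganization"),
            m.group("secondOrganization"),
            m.group("identifier")
        };
    }

    public static void main(String[] args) {
        System.out.println(isHeader("2.0 APPLICABLE DOCUMENTS"));
        System.out.println(String.join(" | ", parse("ANSI/TIA-222-G")));
    }
}
```

      REGEX_STANDARD is not shown because it is derived from the StandardOrganizations list rather than written as a fixed string.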

      How To Use The Standards Extraction Capability

      The identification of standard references performed by the StandardsExtractingContentHandler is based on the following steps (see also the attached flow chart):

      1. search for headers;
      2. search for patterns that are supposed to be standard references (basically, every string mostly composed of uppercase letters followed by alphanumeric characters);
      3. assign each potential standard reference an initial score of 0.25;
      4. increase by 0.50 the score of references that include the name of a known standard organization;
      5. increase by 0.25 the score of references found within "Applicable Documents" and equivalent sections;
      6. return the standard references along with their scores;
      7. add the standard references as additional metadata.
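
      The scoring in steps 3 to 5 can be sketched in a few lines of Java. This is a minimal illustration of the arithmetic only, not the code from the patch: the class name, the method signature, and the sample set of known organizations are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class ScoringSketch {
    // Hypothetical sample of known standard organizations (assumption).
    static final Set<String> KNOWN_ORGANIZATIONS =
            new HashSet<>(Arrays.asList("ANSI", "IEEE", "ISO"));

    // Computes the score of one candidate reference.
    static double score(String organization, boolean inApplicableDocuments) {
        double score = 0.25;                      // step 3: initial score
        if (KNOWN_ORGANIZATIONS.contains(organization)) {
            score += 0.50;                        // step 4: known organization
        }
        if (inApplicableDocuments) {
            score += 0.25;                        // step 5: found in an "Applicable Documents"-like section
        }
        return score;
    }

    public static void main(String[] args) {
        System.out.println(score("ISO", true));
        System.out.println(score("XYZ", false));
    }
}
```

      With these increments a reference can only score 0.25, 0.50, 0.75, or 1.0, so a caller could keep, for example, only the references that name a known organization by filtering on a score of at least 0.75.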

      The unit test is implemented within the StandardsExtractingContentHandlerTest class and extracts the standard references from a SOW downloaded from the FOIA Library. This SOW is also attached as a PDF.

      The StandardsExtractionExample class demonstrates how to use the StandardsExtractingContentHandler to get a list of the standard references from every file in a directory.

      The attached patch includes all the changes needed to add support for standards extraction.

      1. flowchart_standards_extraction_v02.png
        96 kB
        Giuseppe Totaro
      2. flowchart_standards_extraction.png
        91 kB
        Giuseppe Totaro
      3. SOW-TacCOM.pdf
        140 kB
        Giuseppe Totaro
      4. standards_extraction.patch
        33 kB
        Giuseppe Totaro

          Activity

          githubbot ASF GitHub Bot added a comment -

          lewismc commented on issue #204: Improvement for TIKA-2449 contributed by Giuseppe Totaro
          URL: https://github.com/apache/tika/pull/204#issuecomment-327693145

          Hi @giuseppetotaro I think this is an excellent addition. I would suggest that although you have grouped the HashMap of Standards Org's by geographical governing entity/country, it may be better to have these listed alphabetically. This is really the only suggestion I would make.

          ----------------------------------------------------------------
          This is an automated message from the Apache Git Service.
          To respond to the message, please log on GitHub and use the
          URL above to go to the specific comment.

          For queries about this service, please contact Infrastructure at:
          users@infra.apache.org

          githubbot ASF GitHub Bot added a comment -

          giuseppetotaro commented on issue #204: Improvement for TIKA-2449 contributed by Giuseppe Totaro
          URL: https://github.com/apache/tika/pull/204#issuecomment-327944445

          Hi @lewismc, thank you very much for your insightful suggestion. I have updated the `StandardOrganizations` class by using the `TreeMap` instead of the `HashMap` (commit 31625a2ad6b9a4de73aa322845fb1e38ba96177d).
          Furthermore, I have added a new regular expression that matches the word "publication" or "standard" within a pattern that is supposed to be a standard (commit 7b869c0c1b132feb691dca645fe9bc689bf320e2). You can find the new workflow [here](https://issues.apache.org/jira/secure/attachment/12885939/flowchart_standards_extraction_v02.png).

          githubbot ASF GitHub Bot added a comment -

          lewismc commented on issue #204: Improvement for TIKA-2449 contributed by Giuseppe Totaro
          URL: https://github.com/apache/tika/pull/204#issuecomment-327951058

          Very good @giuseppetotaro , from your workflow I see you have two decision points for determining whether the entity "Matches Standard Organizations"... is the duplicated logic intentional?

          githubbot ASF GitHub Bot added a comment -

          giuseppetotaro commented on issue #204: Improvement for TIKA-2449 contributed by Giuseppe Totaro
          URL: https://github.com/apache/tika/pull/204#issuecomment-327959755

          Thanks @lewismc. It was my mistake. I have updated the flowchart and now it is correct. Thanks a lot.

          githubbot ASF GitHub Bot added a comment -

          giuseppetotaro commented on issue #204: Improvement for TIKA-2449 contributed by Giuseppe Totaro
          URL: https://github.com/apache/tika/pull/204#issuecomment-328175432

          Hi @chrismattmann, hi @lewismc, I think that it would be useful to have the standard references along with the related scores in the `Metadata` object. To the best of my knowledge, there is no way to store an "associative array" within a `Metadata` object (i.e., an array of key-value pairs where the key is the reference and the value is the score). Do you think we can extend the `Metadata` class to also include arrays of objects? Do you have other solutions?
          I look forward to your feedback.

          githubbot ASF GitHub Bot added a comment -

          lewismc commented on issue #204: Improvement for TIKA-2449 contributed by Giuseppe Totaro
          URL: https://github.com/apache/tika/pull/204#issuecomment-328186142

          @giuseppetotaro this is a long standing issue... there is a JIRA issue for this and loads of context at
          https://issues.apache.org/jira/browse/TIKA-1607
          In all honesty, I am +1 for committing this to master as is. There are no real changes... it just adds new content for a very useful feature.

          githubbot ASF GitHub Bot added a comment -

          chrismattmann commented on issue #204: Improvement for TIKA-2449 contributed by Giuseppe Totaro
          URL: https://github.com/apache/tika/pull/204#issuecomment-328186743

          +1 to commit as is! 👍

          githubbot ASF GitHub Bot added a comment -

          giuseppetotaro closed pull request #204: TIKA-2449: Enabling extraction of standard references from text
          URL: https://github.com/apache/tika/pull/204

          hudson Hudson added a comment -

          SUCCESS: Integrated in Jenkins build Tika-trunk #1362 (See https://builds.apache.org/job/Tika-trunk/1362/)
          Improvement for TIKA-2449 contributed by Giuseppe Totaro (totaropeppe: https://github.com/apache/tika/commit/e76302196ebcafb7b51fce37fbe8256e6c0fbc51)

          • (add) tika-parsers/src/test/java/org/apache/tika/sax/StandardsExtractingContentHandlerTest.java
          • (add) tika-core/src/main/java/org/apache/tika/sax/StandardOrganizations.java
          • (add) tika-core/src/main/java/org/apache/tika/sax/StandardReference.java
          • (add) tika-core/src/main/java/org/apache/tika/sax/StandardsExtractingContentHandler.java
          • (add) tika-parsers/src/test/resources/test-documents/testStandardsExtractor.pdf
          • (add) tika-core/src/main/java/org/apache/tika/sax/StandardsText.java
          • (add) tika-example/src/main/java/org/apache/tika/example/StandardsExtractionExample.java

          TIKA-2449: Enabling extraction of standard references from text (totaropeppe: https://github.com/apache/tika/commit/db89ab3ca701077f2615647667d868ca1cf9a728)

          • (edit) CHANGES.txt

            People

            • Assignee:
              gostep Giuseppe Totaro
              Reporter:
              gostep Giuseppe Totaro