  1. Tika
  2. TIKA-1614

Geo Topic Parser


    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.9
    • Component/s: parser
    • Labels:

      Description

      ##Description

      This program aims to identify geonames in unstructured text data for the NSF polar research project: https://github.com/NSF-Polar-Cyberinfrastructure/datavis-hackathon/issues/1

      This project is a content-based geotagging solution built from a variety of NLP tools, and it can be used for any geotagging purpose.

      ##Workflow

      1. Plain text input is passed to the geoparser.

      2. Location names are extracted from the text using OpenNLP NER.

      3. The extracted location names are assigned one of two roles:

      • The most frequent location name is chosen as the best match for the input text.
      • The other extracted locations are treated as alternatives, all ranked equally.

      4. For each location extracted above, the parser searches the gazetteer for the best-matching GeoName object and returns the resolved objects with their fields (name in gazetteer, longitude, latitude).
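
The role assignment in step 3 can be sketched in plain Java. This is only an illustration under stated assumptions, not the code from the attached patch: the class `LocationRanker` and its method `rank` are hypothetical names, the input is assumed to be the list of location names already produced by the NER step, and ties for "most frequent" are assumed to be broken by first appearance.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of the TIKA-1614 patch.
class LocationRanker {

    // Returns the distinct names with the most frequent one first (best match),
    // followed by the remaining names (alternatives) in first-seen order.
    // Assumption: ties for most frequent go to the name seen first.
    static List<String> rank(List<String> extractedNames) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String name : extractedNames) {
            counts.merge(name, 1, Integer::sum);
        }

        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() > bestCount) {
                best = entry.getKey();
                bestCount = entry.getValue();
            }
        }

        List<String> ranked = new ArrayList<>();
        if (best != null) {
            ranked.add(best);
            for (String name : counts.keySet()) {
                if (!name.equals(best)) {
                    ranked.add(name);
                }
            }
        }
        return ranked;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("Alaska", "Greenland", "Alaska", "Svalbard");
        System.out.println(rank(names)); // prints [Alaska, Greenland, Svalbard]
    }
}
```

The first element of the returned list plays the role of the best match, and the rest are the equally weighted alternatives that step 4 would resolve against the gazetteer.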

      ##How to Use
      Caution: This program requires at least 1.2 GB of disk space to build the Lucene index.

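      ```Java
      import java.io.InputStream;
      import org.apache.tika.metadata.Metadata;
      import org.apache.tika.parser.ParseContext;
      import org.apache.tika.parser.Parser;
      // Package name assumed from the Tika 1.9 source tree.
      import org.apache.tika.parser.geo.topic.GeoParserConfig;
      import org.apache.tika.sax.BodyContentHandler;

      public class GeoParserExample {
          // geoparser is the geo topic parser instance; the two paths point to the
          // gazetteer and the OpenNLP NER model.
          static void printGeoMetadata(Parser geoparser, InputStream stream,
                  String gazetteerPath, String nerPath) throws Exception {
              Metadata metadata = new Metadata();
              ParseContext context = new ParseContext();

              // Tell the parser where to find the gazetteer and the NER model.
              GeoParserConfig config = new GeoParserConfig();
              config.setGazetterPath(gazetteerPath);
              config.setNERModelPath(nerPath);
              context.set(GeoParserConfig.class, config);

              geoparser.parse(stream, new BodyContentHandler(), metadata, context);

              // Print every metadata field the parser produced.
              for (String name : metadata.names()) {
                  String value = metadata.get(name);
                  System.out.println(name + " " + value);
              }
          }
      }
      ```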
      This parser adds geographic information to Tika's Metadata object.

      Fields for best matched location:
      ```
      Geographic_NAME
      Geographic_LONGTITUDE
      Geographic_LATITUDE
      ```
      Fields for alternatives:
      ```
      Geographic_NAME1
      Geographic_LONGTITUDE1
      Geographic_LATITUDE1

      Geographic_NAME2
      Geographic_LONGTITUDE2
      Geographic_LATITUDE2

      ...

      ```
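
As a rough sketch of how the numbered alternative fields could be read back, the helper below walks the suffixes 1, 2, ... until a name field is missing. It makes several assumptions: a plain `Map<String, String>` stands in for Tika's Metadata object so that the example runs without Tika on the classpath, the class `GeoFieldReader` and its method `alternatives` are hypothetical, the suffixes are assumed to be contiguous and to start at 1, and the values in `main` are placeholders rather than real parser output.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; a Map stands in for Tika's Metadata object.
class GeoFieldReader {

    // Collects "name (longitude, latitude)" strings for Geographic_NAME1,
    // Geographic_NAME2, ... and stops at the first missing suffix.
    static List<String> alternatives(Map<String, String> fields) {
        List<String> result = new ArrayList<>();
        for (int i = 1; fields.get("Geographic_NAME" + i) != null; i++) {
            String name = fields.get("Geographic_NAME" + i);
            String longitude = fields.get("Geographic_LONGTITUDE" + i);
            String latitude = fields.get("Geographic_LATITUDE" + i);
            result.add(name + " (" + longitude + ", " + latitude + ")");
        }
        return result;
    }

    public static void main(String[] args) {
        // Placeholder values only; real values come from the parser.
        Map<String, String> fields = new HashMap<>();
        fields.put("Geographic_NAME1", "PlaceA");
        fields.put("Geographic_LONGTITUDE1", "10.0");
        fields.put("Geographic_LATITUDE1", "20.0");
        for (String alternative : alternatives(fields)) {
            System.out.println(alternative);
        }
    }
}
```

With a real Metadata object, the same loop would call `metadata.get(...)`, which returns null for a field that is not present.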
      If you have any questions, contact me: anyayunli@gmail.com

        Attachments

        1. TIKA-1614.Mattmann.Li.052405.patch.txt
          26 kB
          Chris A. Mattmann


            People

            • Assignee:
              chrismattmann Chris A. Mattmann
              Reporter:
              diefunction Anya Yun Li
            • Votes:
              1
              Watchers:
              5
