OpenNLP / OPENNLP-579

Framework to dynamically link N-best matches from external data to named entities by type (EntityLinker framework)

    Details

    • Type: Wish
    • Status: Reopened
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 1.6.0
    • Fix Version/s: 1.6.0
    • Component/s: Entity Linker
    • Labels:
    • Environment: Any

      Description

      A framework for integrating and linking external data to named entities. For instance, geocoding or georeferencing location entities against geonames gazetteers can be implemented as an EntityLinker. This ticket was initially created to solve the georeferencing/geolocating/geotagging problem specifically, but the framework should allow linkage of any external data to any entity type. Commercial applications that do this are expensive, and there are many free gazetteers one could use to build solutions with OpenNLP.
      UPDATE: The current implementation of the GeoEntityLinker uses Lucene to store the gazetteers and provides utilities for indexing them. The implementation returns latitude, longitude, and other gazetteer fields for toponyms extracted with NER.
      All extracted toponyms are scored in four ways: fuzzy string matching, binning by location, context modeling, and country-mention proximity. These scores provide a basis for deciding which gazetteer candidates are worth keeping.

      1. entitylinker.properties
        0.5 kB
        Mark Giaconia
      2. opennlp.geoentitylinker.countrycontext.txt
        16 kB
        Mark Giaconia

        Activity

        Mark Giaconia added a comment -

        Initial (intentionally simple) implementation for review.

        Mark Giaconia added a comment -

        The attached zip contains a folder (actually the package) called geonamefind, which contains classes and a SQL script for implementing a simple solution to the wish. There is an example class and a default PostGIS implementation.

        Mark Giaconia added a comment -

        properties file

        Joern Kottmann added a comment -

        Thanks for sharing this!

        What the code basically does is geolocate a place name that was previously detected by the name finder.

        Why do you need the GeoSpan integration for the name finder? The GeoGazateer could simply run after the standard name finder has found all location mentions in a document, similar to how it is done in the coref component.

        Can you recommend a data set I could use for a test?

        Joern Kottmann added a comment -

        The issue needs to remain open until we have committed the contribution.

        Mark Giaconia added a comment -

        The SQL file in the attached zip has sample data (about 600 cities), so if you stand up a PostGIS instance and run the script, it will create the database and the tables and load the data. Debug the Example class and it will get a hit on New York.

        As for other gazetteers, USGS has a large gazetteer that could be loaded into the Postgres database: http://geonames.usgs.gov/domestic/download_data.htm

        Good point about the GeoSpan integration, I was just trying to go for maximum encapsulation, but I am open to any ideas you have.
        Thanks!

        Joern Kottmann added a comment -

        Nice, I will give it a try with the USGS data. Depending on the deployment of OpenNLP, many people already do some of the necessary pre-processing (e.g. tokenization or location name detection) or use software that already integrates OpenNLP.

        To support these use cases the GeoGazateer should be designed to run over a document assuming the following processing was already done:

        • sentence detection
        • tokenization
        • location name detection

        Additionally, it should be possible to swap out the implementation, e.g. by using a factory to produce an instance based on a model or property file.

        Having this in mind, I propose to change the GeoGazateer interface like this:

        • text, sentences, tokenization and all location names are passed to the find method
        • the find method returns a list of GeoSpan objects which link to the GeoGazEntry objects

        What do you think?

        Mark Giaconia added a comment -

        Nice Idea... the factory approach will allow for multiple geodatabases to be easily integrated.
        Just to try and get on the same page, I feel like we have the following use cases/user stories:
        User Story 1: I want the GeoNameFinder to do everything for me. I just want to instantiate my gaz, namefinder, and wordtokenizer and let the GeoNameFinder do the rest.
        In this case one would pass the gaz, tokennamefinder, and wordtokenizer to the GeoNameFinder's constructor, then pass sentence text into the find(String) method, which uses the wordtokenizer, tokennamefinder, and gaz to do everything and return GeoSpans. This is for people who don't have a business logic layer on top of their namefinders to begin with. This is basically what I have implemented so far, but I think it could be changed to work through the find method rather than through the constructor.
        User Story 2: I have a robust entity resolution engine built on top of OpenNLP, so I want to pass in only my resolved entity spans, the word tokens, and the simple name of a gazetteer that I configured in a properties file. In this case the find method would be overloaded with these options and return a List<GeoSpan>. This is what you (Joern) proposed.
        User Story 3: I have several GeoDatabases, and I want to be able to use them all at once in my geotagging, so that I can consolidate GeoSpans from all of them. I want to be able to call a gaz by name, or use them all at once.
        This is very much like User Story 2, but the factory would dynamically instantiate a list of gazetteers based on a configuration (props file).
        Sound good?
        Thanks for the quick feedback!

        Joern Kottmann added a comment -

        User Story 1 has always been ignored by OpenNLP because it is usually very easy to write a processing loop as part of the integration code in a few lines, and people usually have very specific needs anyway. For some it's problematic if they can't access the sentence and token segmentation (e.g. for snippet creation as part of a search result); some already have the sentences (like you), and some don't.

        Instead of writing tons of small util methods to support all these cases we came to the conclusion that it is easier to communicate how to write these processing loops, which also has the advantage that people then precisely understand which steps are performed.

        +1 for User Story 2 and 3.

        Additionally, I would like to propose that we make the GeoGazateer interface more generic. I think it should be possible to reuse it for type-independent linking of entities (person, organization, all sorts of IDs). As far as I can see, this could be accomplished by making the returned Span type generic.

        For example:

            interface EntityLinker<T extends Span> {
                List<T> find(...);
                ...
            }

        What do you think?
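
A minimal, self-contained sketch of how the generic interface proposed above might look in use. Everything beyond the `EntityLinker<T extends Span>` shape is an assumption made for illustration: the `Span` class is a stand-in for OpenNLP's own Span, the `find` parameter list is a guess at what the elided `find(...)` could take, and `GeoSpan` and `FixedPointGeoLinker` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class EntityLinkerSketch {

    /** Stand-in for OpenNLP's Span (assumption): a typed token range [start, end). */
    static class Span {
        final int start;
        final int end;
        final String type;

        Span(int start, int end, String type) {
            this.start = start;
            this.end = end;
            this.type = type;
        }
    }

    /** Hypothetical Span subtype that carries the linked location data. */
    static class GeoSpan extends Span {
        final double latitude;
        final double longitude;

        GeoSpan(Span source, double latitude, double longitude) {
            super(source.start, source.end, source.type);
            this.latitude = latitude;
            this.longitude = longitude;
        }
    }

    /** The generic shape from the comment; the parameters are assumed. */
    interface EntityLinker<T extends Span> {
        List<T> find(String[] tokens, Span[] names);
    }

    /** Toy linker: attaches one fixed coordinate to every "location" span. */
    static class FixedPointGeoLinker implements EntityLinker<GeoSpan> {
        @Override
        public List<GeoSpan> find(String[] tokens, Span[] names) {
            List<GeoSpan> linked = new ArrayList<>();
            for (Span name : names) {
                if ("location".equals(name.type)) {
                    linked.add(new GeoSpan(name, 40.71, -74.01));
                }
            }
            return linked;
        }
    }

    public static void main(String[] args) {
        String[] tokens = {"Mark", "flew", "to", "New", "York"};
        Span[] names = {new Span(0, 1, "person"), new Span(3, 5, "location")};
        for (GeoSpan span : new FixedPointGeoLinker().find(tokens, names)) {
            System.out.println(span.type + " [" + span.start + ", " + span.end + ") -> "
                    + span.latitude + ", " + span.longitude);
        }
    }
}
```

Because the return type is a type parameter, a person or organization linker could return its own Span subtype without changing the interface.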

        Mark Giaconia added a comment -

        I had a feeling you were going to suggest generics. It makes sense, because the current GeoSpan might not fit the structure of every gazetteer.
        I agree. I will disregard User Story 1, make the interface generic, and go for User Stories 2 and 3.
        Once I submit those we can take another look and refactor as the group desires. Sound good?
        thanks

        Joern Kottmann added a comment -

        +1, yes. Any suggestion for a name of the component? Is Entity Linker a confusing name?

        Mark Giaconia added a comment -

        "EntityLinker" sounds confusing (to me), as if its goal would be to establish a relationship between two or more entities, like generating a graph. I think what we are doing is providing a framework for enriching the content of any entity (by allowing the dynamic extension of a Span via generics), so maybe call the component "Entity Extender" and the interface ExtendableEntity<T extends Span>. I can already think of a nice Date/Time entity extension for formatting date hierarchies like an OLAP time dimension.
        I am open to anything.. thanks

        Joern Kottmann added a comment -

        Question: What is the key task the component is trying to solve? Is it to enrich an entity or to assign an id to an entity? The latter case is often called entity disambiguation or entity resolution. I thought of fetching the data from the GIS as more of a convenience than the actual task; once the id is known, it could be done afterwards.

        Thinking a bit more about it leads me to the question of whether we want to support n-best results per input Span.
        Your proposed find method would simply return the best match, whereas a second find method could return the n-best matches. Depending on the implementation, we might also want the ability to return a confidence score. Anyway, with the currently proposed design it would be feasible to add all this one day.
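
The n-best idea above can be sketched as follows. This is only an illustration under stated assumptions: the `Candidate` class, its `confidence` field, and the `nBest`/`best` helpers are hypothetical names, and the thread does not specify how scores would actually be computed.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class NBestSketch {

    /** Hypothetical candidate: an external record id plus a confidence score. */
    static class Candidate {
        final String id;
        final double confidence;

        Candidate(String id, double confidence) {
            this.id = id;
            this.confidence = confidence;
        }
    }

    /** Returns up to n candidates, highest confidence first; the input list is not modified. */
    static List<Candidate> nBest(List<Candidate> candidates, int n) {
        List<Candidate> sorted = new ArrayList<>(candidates);
        sorted.sort(Comparator.comparingDouble((Candidate c) -> c.confidence).reversed());
        int limit = Math.min(Math.max(n, 0), sorted.size());
        return new ArrayList<>(sorted.subList(0, limit));
    }

    /** The single best match is the 1-best list's only element, or null if there are no candidates. */
    static Candidate best(List<Candidate> candidates) {
        List<Candidate> top = nBest(candidates, 1);
        return top.isEmpty() ? null : top.get(0);
    }
}
```

Implementing the single-result method on top of the n-best method keeps the two variants consistent with each other.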

        Jason Baldridge added a comment -

        Mike Speriosu and I have a new paper on doing this (and it uses OpenNLP for NER):

        Mike Speriosu and Jason Baldridge. Text-Driven Toponym Resolution using Indirect Supervision. To appear in proceedings of ACL 2013.

        Here's the abstract:

        Toponym resolvers identify the specific locations referred to by ambiguous placenames in text. Most resolvers are based on heuristics using spatial relationships between multiple toponyms in a document, or metadata such as population. This paper shows that text-driven disambiguation for toponyms is far more effective. We exploit document-level geotags to indirectly generate training instances for text classifiers for toponym resolution, and show that textual cues can be straightforwardly integrated with other commonly used ones. Results are given for both 19th century texts pertaining to the American Civil War and 20th century newswire articles.

        Here's a PDF of the paper: http://www.jasonbaldridge.com/papers/speriosu-baldridge-acl2013.pdf
        Here's the code: https://github.com/utcompling/fieldspring

        Hope this might help! (Or at least be of interest.)

        -Jason

        Mark Giaconia added a comment -

        Jason, thank you for that paper. It is simply outstanding... I hope to implement some of those concepts eventually (if it is in keeping with the spirit of the project).
        I am almost done with round II of my implementation and hope to resubmit early next week. It now supports factory-driven "linker" instantiation based on entity type, and within a linker implementation another factory allows plug and play of any number of gazetteers (not just geographic gazetteers).
        Thanks!

        Jason Baldridge added a comment -

        Great! Feel free to get in touch with Mike and me if you have any questions. Also, you might find some of the data at the Fieldspring code repo to be useful. (We're still tightening up the TR-CoNLL data to make it easier for others to obtain and process.)

        Mark Giaconia added a comment -

        Joern, I never responded to your 23/May/13 14:50 comment. I agree... It seems like enrichment because a resolution layer would have to exist on top of this for sure. You will see when I resubmit that find() returns N best matches and the GeoSpan is basically a Span with a list of candidate gaz entries.
        Also, one can now have any number of entity-type-specific Linkers, and within each linker there are many "linkables." So my example now has a GeoLinker that uses N "Linkable" geo-gazetteers.
        I also created two gazetteers in MySQL (the entire GeoNames and USGS datasets).
        With this framework, users will be able to link entities to any "linkable" data they have; the framework is no longer location specific.
        BTW, I like the name EntityLinker now

        Mark Giaconia added a comment -

        Implemented a factory driven framework for extending Spans and namefinders, and provided three implementations of linkables and a location oriented EntityLinker.
        The EntityLinker.framework package holds the factories, interfaces, and base classes.
        The entitylinker package has implementations and an Example.
        Needs documentation

        Mark Giaconia added a comment -

        Here are the links to the gazetteers to load into MySQL:
        NGA Geonames
        http://earth-info.nga.mil/gns/html/namefiles.htm
        click on : Click here to Download a single compressed zip file that contains the entire country files dataset (Approximately 376MB compressed/1.72GB uncompressed)
        USGS
        http://geonames.usgs.gov/domestic/download_data.htm
        click on : NationalFile_20130404.zip - Download all national features in one .zip file

        Mark Giaconia added a comment -

        Known issues with the code uploaded today:
        1. Each gazetteer implemented (as a Linkable) may require different formatting of the extracted entity, so just passing in the spans and words is probably not good enough. Each Linkable should implement some kind of formatter. For instance, if an entity is "New York," one gazetteer may want the search to imply OR between the words and another may want to imply AND. I can also see something like n-grams or stemming being done behind the scenes, depending on the gazetteer (I chose MySQL because its text indexing is very flexible). Regardless, I think a formatter should be part of a linkable so that an entity can be dynamically adapted to whatever system it is being linked to (SOLR, HBase, Oracle, etc.).
        2. If there is more than one EntityLinker for an entity type, then each Linker will get all linkables. I need to sort this out so that each Linker has its own set of linkables, rather than linkables being global to a particular linker type.
        Let me know what you think....
        thanks
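
The formatter idea from issue 1 above could look roughly like this. It is a sketch under assumptions: the `EntityFormatter` interface and the `joining` helper are hypothetical names, and the plain "AND"/"OR" query syntax is illustrative only, since each backend has its own query language.

```java
class LinkableFormatterSketch {

    /** Hypothetical per-Linkable hook that adapts entity text to a backend's query syntax. */
    interface EntityFormatter {
        String format(String entityText);
    }

    /** Builds a formatter that joins the whitespace-separated words with the given operator. */
    static EntityFormatter joining(String operator) {
        return text -> String.join(" " + operator + " ", text.trim().split("\\s+"));
    }

    public static void main(String[] args) {
        EntityFormatter and = joining("AND");
        EntityFormatter or = joining("OR");
        System.out.println(and.format("New York"));
        System.out.println(or.format("New York"));
    }
}
```

Each Linkable would hold its own formatter, so the same extracted entity can be phrased differently for each data source.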

        Joern Kottmann added a comment -

        Thanks for taking time to work in the changes.

        Let's discuss how we should pass the document (sentences, tokens, names) to the find method. We definitely want to do this consistently across the interfaces in OpenNLP. Currently there is one other interface that already does this, DocumentNameFinder, but we never implemented it. I started a discussion on the dev list to decide how we will do this in the future; please have a look there and participate.

        Some users might want to use different EntityLinkers at the same time, e.g. one which links only locations and a second one which links person entities. To support this we need to change the factory a bit. If I understand it correctly, there can currently be only one properties file, right? I suggest making this stateless, e.g. EntityLinkerFactory.createEntityLinker(InputStream propertiesFile). The method returns a ready-to-use EntityLinker or throws an exception if it can't be created. If a user wants to use a couple of different linkers, they can call createEntityLinker multiple times to instantiate them all.

        One more issue is the exceptions you are throwing from the EntityLinker.find method. I can see that things under the hood can go wrong when connecting to a database or some external resource, but we should handle this consistently across components as well. Let's discuss this on the dev list too.
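
A rough sketch of the stateless factory method suggested above. Only the `createEntityLinker(InputStream propertiesFile)` shape comes from the comment; the `linker.impl` property key, the one-method `EntityLinker` interface, the `NoOpLinker` class, and the choice of `IOException` for failures are all assumptions made to keep the example self-contained.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

class StatelessFactorySketch {

    /** Stand-in for the real EntityLinker interface (assumption). */
    interface EntityLinker {
        String describe();
    }

    /** Trivial implementation so the example can be run. */
    static class NoOpLinker implements EntityLinker {
        @Override
        public String describe() {
            return "no-op linker";
        }
    }

    /** Hypothetical property key naming the implementation class. */
    static final String IMPL_KEY = "linker.impl";

    /** Creates a ready-to-use linker from the given properties, keeping no state between calls. */
    static EntityLinker createEntityLinker(InputStream propertiesFile) throws IOException {
        Properties props = new Properties();
        props.load(propertiesFile);
        String className = props.getProperty(IMPL_KEY);
        if (className == null) {
            throw new IOException("missing property: " + IMPL_KEY);
        }
        try {
            Class<?> impl = Class.forName(className.trim());
            return (EntityLinker) impl.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException | ClassCastException e) {
            throw new IOException("cannot create linker " + className, e);
        }
    }

    public static void main(String[] args) throws IOException {
        String config = IMPL_KEY + "=" + NoOpLinker.class.getName() + "\n";
        InputStream in = new ByteArrayInputStream(config.getBytes(StandardCharsets.ISO_8859_1));
        System.out.println(createEntityLinker(in).describe());
    }
}
```

Since the method holds no shared state, a caller can invoke it once per properties stream to obtain independent location and person linkers.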

        Mark Giaconia added a comment -

        Thanks Joern. I will participate in the discussions.
        As for my submission, it has one properties file, but it allows for any number of linkers, and linkers have many Linkables. Take a close look at the property file I uploaded, a section of it looks like this:

        linker.location.linkables=opennlp.tools.entitylinker.PostGISGeoGazImpl,opennlp.tools.entitylinker.MySQLUSGSGazLinkable,opennlp.tools.entitylinker.MySQLGeoNamesGazLinkable
        linker.location=opennlp.tools.entitylinker.GeoEntityLinker
        linker.person=opennlp.consumer.PersonLinker
        linker.person.linkables=classA, classB

        Notice that there are many entries starting with linker., and for each of those keys (i.e., linker.location and linker.person) there is a comma-separated list of implementing classes. This property file therefore supports any number of linkers for any number of entity types, and the factory supports this by taking the entity type as an argument and returning a list of linkers for the type passed in. A user can make calls to the factory with any types they have and get back a list of linkers for all of those types.

        Take a look at the way I get the list of locationLinkers from the factory in the Example class, where I pass in spans[0].getType() as the factory's argument. As long as there is a linker.<type> entry in the props file, a call to the factory can get any type of linker from the same properties file.

        Also the Linkable factory supports more than one linked data source (Linkable) within a linker.
        Here is a snippet from the Example class where I make the factory call to get the location linkers:

        List<EntityLinker> linkers = EntityLinkerFactory.getInstance().getLinkers(spans[0].getType());

        One could make as many calls like this as needed:
        List<EntityLinker> locationLinkers = EntityLinkerFactory.getInstance().getLinkers("location");
        List<EntityLinker> personLinkers = EntityLinkerFactory.getInstance().getLinkers("person");
        List<EntityLinker> organizationLinkers = EntityLinkerFactory.getInstance().getLinkers("organization");
        The factory returns a list so that we can have multiple Linkers for a single type.

        As for complying with OpenNLP standard conventions, I will get more familiar with them.
        thanks again
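
The properties layout described above can be read with a few lines of standard Java. This sketch only parses the keys into class-name lists and does not instantiate anything; the helper names (`parse`, `linkersFor`, `linkablesFor`) are hypothetical and are not the actual EntityLinkerFactory API.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

class LinkerConfigSketch {

    /** Loads properties from text; a file stream would be used in practice. */
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    /** Splits a comma-separated property value into trimmed, non-empty class names. */
    static List<String> classNames(Properties props, String key) {
        List<String> names = new ArrayList<>();
        for (String part : props.getProperty(key, "").split(",")) {
            String name = part.trim();
            if (!name.isEmpty()) {
                names.add(name);
            }
        }
        return names;
    }

    /** Linker classes configured under linker.<type>. */
    static List<String> linkersFor(Properties props, String entityType) {
        return classNames(props, "linker." + entityType);
    }

    /** Linkable classes configured under linker.<type>.linkables. */
    static List<String> linkablesFor(Properties props, String entityType) {
        return classNames(props, "linker." + entityType + ".linkables");
    }

    public static void main(String[] args) {
        Properties props = parse(
                "linker.location=opennlp.tools.entitylinker.GeoEntityLinker\n"
                + "linker.person=opennlp.consumer.PersonLinker\n"
                + "linker.person.linkables=classA, classB\n");
        System.out.println(linkersFor(props, "location"));
        System.out.println(linkablesFor(props, "person"));
    }
}
```

A factory built on top of this would instantiate each listed class and return the resulting list for the requested entity type; an unconfigured type simply yields an empty list.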

        Mark Giaconia added a comment -

        Please take a look at the latest upload. At its essence, it uses these three interfaces:

        /**
         * Allows for processing of a complete document. Ties to the EntityLinker framework, and can
         * optionally return custom Spans as per the EntityLinker configuration for each entity type.
         *
         * @author Mark Giaconia
         */
        public interface LinkableDocumentNameFinder {
          Document find(String[] sentences, Tokenizer tokenizer, List<TokenNameFinder> nameFinders, boolean linkable);
          Document find(String documentText, SentenceDetector sentenceDetector, Tokenizer tokenizer, List<TokenNameFinder> nameFinders, boolean linkable);
        }

        Notice the Document object: to keep this clean I went with a more OO approach after looking at the current DocumentNameFinder. The Document object contains a List<Sentence>; look in the domain package for the details of those objects.

        A LinkableDocumentNameFinder instantiates a list of linkables for each entity type that it discovers via the EntityLinkerFactory (which works off of a Properties file).

        public interface EntityLinker<T extends Set<? extends Span>> {
          T find(String[] tokens, Span[] spans, List<Class> linkables);
          T find(String[] tokens, Span[] spans);
        }

        Each EntityLinker impl uses any number of Linkable interface impls via the LinkableFactory (which also works off of a Properties file):

        public interface Linkable<T extends Set<? extends BaseLink>> extends Formatable {
          T find(String textToSearchFor);
          T find(String locationText, List<String> whereConditions);
          T getHierarchyFor(BaseLink entry);
        }

        The Formatable interface exists because different databases will require different formatting of the entity's text before it is passed into the query:

        public interface Formatable {
          String format(String entity);
        }

        Take a look at the Example and DefaultLinkableDocumentNameFinderImpl classes.
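As an illustration of the Formatable idea, here is a minimal sketch of one possible implementation. The interface is copied from the comment above; NormalizingFormatter is a hypothetical class, and the normalisation it performs (trimming, collapsing whitespace, lower-casing) is only an assumption about what a data source might need, not what the attached gazetteer implementations actually do.

```java
import java.util.Locale;

// Mirrors the Formatable interface quoted in the comment above.
interface Formatable {
  String format(String entity);
}

// Hypothetical implementation: normalises an entity mention before it is used in a lookup.
final class NormalizingFormatter implements Formatable {
  @Override
  public String format(String entity) {
    // Trim, collapse runs of whitespace to one space, and lower-case
    // with a fixed locale so the result does not depend on the JVM default.
    return entity.trim().replaceAll("\\s+", " ").toLowerCase(Locale.ROOT);
  }
}
```

A formatter like this only prepares the text; values passed to a SQL backend would still be best bound through prepared-statement parameters rather than concatenated into the query.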

        Mark Giaconia added a comment -

        New interfaces, fewer throws clauses, more flexibility; introduces a DocumentNameFinder as well. New Document and Sentence objects.

        Joern Kottmann added a comment -

        Sorry, I wasn't aware that multiple entries can be placed in the properties file. The factory should still not be static, because then all usages of the entity linker in an application running in one JVM would need to share one properties file. This causes various problems for applications using it. For example, in UIMA it's common to run multiple instances of a component at once, each with a different configuration.

        What do you think?

        Joern Kottmann added a comment -

        On the dev list we briefly discussed how we could pass a document to a find method like the one that needs to be declared in the EntityLinker.
        Based on this, the find method could be defined as follows:
        T find(String text, Span sentences[], Span tokens[], Span[] names)
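A minimal sketch of how an implementation could get back from such spans to strings, assuming every span holds character offsets into text with an exclusive end. OffsetSpan is a stand-in for opennlp.tools.util.Span so that the example compiles on its own, and CoveredTextSketch is a hypothetical helper name.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for opennlp.tools.util.Span so the sketch compiles on its own.
final class OffsetSpan {
  final int start; // inclusive character offset into the document text
  final int end;   // exclusive character offset into the document text

  OffsetSpan(int start, int end) {
    this.start = start;
    this.end = end;
  }
}

final class CoveredTextSketch {

  private CoveredTextSketch() {}

  // Returns the substring of text covered by each span, in span order.
  static List<String> coveredText(String text, OffsetSpan[] spans) {
    List<String> out = new ArrayList<>();
    for (OffsetSpan span : spans) {
      out.add(text.substring(span.start, span.end));
    }
    return out;
  }
}
```

With the text "Paris is nice. So is Rome." and name spans [0,5) and [21,25), coveredText yields "Paris" and "Rome".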

        Mark Giaconia added a comment -

        About the static props: based on your comments I agree that the properties should not be static. I will change that ASAP, thanks.

        As for the DocumentNameFinder find method, I basically wrapped "T find(String text, Span sentences[], Span tokens[], Span[] names)" into objects, which I think are more understandable and more object oriented (IMO). To me it would be clunky, inside the find method, to work with all those primitive arrays and keep track of which tokens belong to which sentence, which spans to which tokens, which names to which spans, and so on. In the model I proposed it is very simple to create and process the document objects.

        Also, if someone has already produced sentences and tokens and run NER to get the spans and names, it seems they would have no need to pass anything to an entity linker besides the name or names and the entity type. So I'm not sure why this wouldn't work: find(String name, String type) and find(List<String> names, String type), where type invokes the proper linker for the given names via the factory. The user could then associate the return values with the spans they created before making the call.

        I wrote it the way it is now because, if the user has a SentenceDetector, Tokenizer, and name finder defined the way they want them, why not just pass them in with the doc text, or the sentences, or a list of documents? They could still handle the details in their own implementation if they wanted. What do you think? ...just throwing ideas around... please don't perceive any hostility
        Thanks!

        MG

        Joern Kottmann added a comment -

        Creating a document object model is not really easy. We discussed this in great detail in the past and came to the conclusion that a document model always hides from the user which input information a component actually needs. If you design a document model, it should somehow work for all or most of our components; otherwise it gets difficult to let users build their own processing pipelines. Making it work for most of the components means it needs to hold all kinds of different annotations.

        So let's say we have a document model which can hold the following information:

        • sentences
        • tokens
        • pos tags
        • named entities

        Let's say that after doing sentence and token detection, a user wants to do named entity detection, but now it's unclear to the user whether pos tags are mandatory for the name finder or not. With these primitive arrays the user always knows exactly which input a component expects, and if it is not provided the result is a compile error.

        Anyway, OpenNLP is a library and is often embedded into applications which define some kind of document model, or into frameworks which do that (e.g. GATE, UIMA), so for the more complex use cases the user has to deal with this problem himself.
        Currently OpenNLP just provides the core processing engines. All other tasks which are needed have to be solved by the user or by specialized frameworks; these include, for example, scalability, fault tolerance, and resource handling.

        Mark Giaconia added a comment -

        I understand, I will abandon the DocumentNameFinder and just continue to refine the EntityLinker framework. I will remove the LinkableDocNameFinder from the linker framework package, and consider that another issue altogether (even though my OO design side is feeling very sad about it)
        thanks!
        MG

        Mark Giaconia added a comment - - edited

        Please take a close look at the EntityLinker framework. It needs scrutiny. (attached entitylinker_8Jun2013 file).
        It consists of two packages and a properties file.
        Drop the folder into the tools project and debug the Example class's main method; it has three example methods. The first example requires no dependencies, so you should be able to step through everything.

        The other two examples require PostGIS and MySQL, with the USGS and Geonames gazetteers "installed" on each. The scripts to do that are in the entitylinker package, and you will need to put the correct password in the properties file.
        Thoughts:

        • The properties object should be passed all the way through to the implementing Linkable so it can be used for arbitrary property lookups (for DB conns etc.); I think this would be helpful.
        • I think it would benefit from some base classes that implement some of the basics.
        • The factory should pool objects, because there is a lot of unnecessary instantiation at this point (or the way the factories are called needs to be managed better...). This becomes difficult when Span arrays can contain multiple types of spans.
        • The find method that uses the Document object is purely experimental, but let me know what you think.

        Thanks!
        MG

        Mark Giaconia added a comment -

        Added Map/cache of linkers and linkables to factories to support lazy instantiation, significant performance gain if inside a high throughput pipeline.
        Added a BaseEntityLinker abstract class that reduces the whole framework down to some simple calls, so a class that extends BaseEntityLinker will have an easy time working with the framework.
        I will post again as soon as I finish working on the EntityLinkerProperties object. It is currently opening a stream every time someone needs a property.
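One way to avoid opening a stream for every lookup is to read the source once and answer later lookups from memory. The sketch below illustrates that idea only; CachedLinkerProperties is a hypothetical class and makes no claim about how the actual EntityLinkerProperties object was fixed.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical properties wrapper: loads its source once, then serves lookups from memory.
final class CachedLinkerProperties {
  private final Properties props = new Properties();

  // Reads the stream exactly once; the caller remains responsible for closing it.
  CachedLinkerProperties(InputStream in) {
    try {
      props.load(in);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // Returns the configured value, or defaultValue when the key is absent.
  String getProperty(String key, String defaultValue) {
    return props.getProperty(key, defaultValue);
  }
}
```

Because the wrapper is an ordinary instance rather than static state, two components in the same JVM can each hold their own configuration.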
        thanks
        MG

        Mark Giaconia added a comment -

        -Added BaseEntityLinker abstract class, and an example of using it to the Example class.
        -Added object pools (Maps) to Factories
        -Fixed EntityLinkerProperties
        -Added other small efficiencies
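The object pools mentioned above could be sketched roughly as follows, assuming one cached instance per key (for example per entity type or class name). LazyInstanceCache is a hypothetical name, and the uploaded factories may organise their Maps differently.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical generic cache: creates one instance per key on first request.
final class LazyInstanceCache<K, V> {
  private final Map<K, V> instances = new ConcurrentHashMap<>();
  private final Function<K, V> creator;

  LazyInstanceCache(Function<K, V> creator) {
    this.creator = creator;
  }

  V get(K key) {
    // ConcurrentHashMap.computeIfAbsent applies the creator at most once per key,
    // so concurrent callers asking for the same key receive the same instance.
    return instances.computeIfAbsent(key, creator);
  }
}
```

A shared instance is only safe if the cached linker itself is thread safe, which is part of the trade-off discussed later in this thread.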

        please have a look

        Joern Kottmann added a comment -

        I reviewed the EntityLinkerFactory; in its current implementation it is not thread safe. To make it thread safe I suggest that we create a static factory method and remove all the state variables from the factory.

        It could look like this:

        class EntityLinkerFactory {
          static EntityLinker createEntityLinker(Properties properties) throws ... { ... }
        }

        Each invocation of the factory will produce a new instance. I don't think it is necessary to implement caching; if user code calls it frequently, it can do the caching on its own, or use a design which makes caching unnecessary.

        What do you think?

        Why do you distinguish between the Linkable and the EntityLinker? Couldn't that simply be the same interface? I really liked how that was done in one of the first versions you uploaded here (only EntityLinker), because it made things really simple. The implementation of the EntityLinker can simply support linking multiple types if necessary.
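A minimal sketch of a stateless static factory along the lines suggested above. The property key "linker", the SketchLinker interface, and EchoLinker are assumptions made for illustration; the sketch also wraps reflection failures in an unchecked exception, since the checked exceptions of the proposed method are left open.

```java
import java.util.Properties;

// Hypothetical minimal linker contract used only for this sketch.
interface SketchLinker {
  String link(String name);
}

// Hypothetical implementation with the no-arg constructor the factory relies on.
class EchoLinker implements SketchLinker {
  public EchoLinker() {}

  @Override
  public String link(String name) {
    return "linked:" + name;
  }
}

final class StatelessLinkerFactory {

  private StatelessLinkerFactory() {}

  // The factory has no fields, so concurrent callers share nothing;
  // every call reads the class name and returns a fresh instance.
  static SketchLinker createEntityLinker(Properties properties) {
    String className = properties.getProperty("linker"); // assumed key
    if (className == null) {
      throw new IllegalArgumentException("missing 'linker' property");
    }
    try {
      Class<?> type = Class.forName(className);
      return (SketchLinker) type.getDeclaredConstructor().newInstance();
    } catch (ReflectiveOperationException e) {
      throw new IllegalArgumentException("cannot instantiate " + className, e);
    }
  }
}
```

Because nothing is cached, two callers with different Properties objects get independently configured linkers, which is the multi-instance case raised earlier for UIMA.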

        Mark Giaconia added a comment -

        Thanks for the feedback, I agree that a static method in the factory would be cleaner... and I will remove the Map of EntityLinkers as well.
        As for the EntityLinker / Linkable separation: I did it like this so that OpenNLP gives users the ability to create a series of Linkable sources and then plug them into any EntityLinkers they have. Without the Linkable interface the user would have to "do their own thing" inside an EntityLinker impl (which they still have the option to do; they can always create their own Linkable series).
        Currently, if a user creates a Linker and some Linkables with the interfaces, they can configure them in the props file and use the BaseEntityLinker abstract class, which seems quite clean.
        I can certainly remove the Linkable interface; I just felt that providing a sub-framework might be helpful, and it provides a logical separation.

        Perhaps we could just consider/promote the Linkable as optional, since it is essentially the way I personally chose to do my Geo-EntityLinker implementations.
        I will resubmit the factory in a bit.

        Joern Kottmann added a comment -

        It should be possible to define an Aggregated Entity Linker which takes multiple EntityLinkers and merges their results. A user can define the exact behavior inside a properties file, for example:

        LocAndOrgLinkers.properties
        linker=....
        loc.postgis=....
        loc.fuzzyMatching=...
        org.mysql=....

        The Aggregated Entity Linker strips the type prefix (loc., org.) from the parameter names and instantiates two Entity Linkers. Its find method forwards the call to the two actual linkers, and a merged result is returned to the caller.
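The forwarding-and-merging part of this idea can be sketched as below. NameLinker is a deliberately simplified, hypothetical contract (names in, link descriptions out) rather than the EntityLinker interface under discussion, and merging is assumed to mean concatenating the delegates' results in order.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simplified linker contract: returns link descriptions for the given names.
interface NameLinker {
  List<String> find(String[] names);
}

// Forwards each call to every delegate and concatenates the results in delegate order.
final class AggregatedLinker implements NameLinker {
  private final List<NameLinker> delegates;

  AggregatedLinker(List<NameLinker> delegates) {
    this.delegates = new ArrayList<>(delegates); // defensive copy
  }

  @Override
  public List<String> find(String[] names) {
    List<String> merged = new ArrayList<>();
    for (NameLinker delegate : delegates) {
      merged.addAll(delegate.find(names));
    }
    return merged;
  }
}
```

Since the aggregate implements the same interface as its delegates, callers do not need to know whether they hold one linker or several.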

        Mark Giaconia added a comment -

        sounds good, I'll give it a shot

        Mark Giaconia added a comment -

        Not sure if you saw my post on the dev thread... I implemented the functionality of an aggregated entity linker, but not as a separate interface, and the factory is now thread safe. The BaseEntityLinker abstract class takes care of aggregation by detecting when an input Span[] contains multiple entity types (via getType()). A user can also optionally pass in a String[] of entity types to constrain linker creation to only those types (defined like any other linkers in a properties file).

        As a summary, here are the method signatures for the BaseEntityLinker abstract class:

        // auto-detects whether the Span[] contains more than one type
        protected ArrayList<LinkedSpan<T>> getLinkedSpans(String[] tokens, Span[] spans, EntityLinkerProperties properties)
        // types are filtered with the first arg
        protected ArrayList<LinkedSpan<T>> getAggregatedLinkedSpans(String[] entitytypes, String[] tokens, Span[] spans, EntityLinkerProperties properties)
        protected Document<T> getLinkedSpans(Document<T> document, EntityLinkerProperties properties)
        public List<Document<T>> getLinkedSpans(List<Document<T>> documents, EntityLinkerProperties properties)

        // the class declaration looks like this
        public abstract class BaseEntityLinker<T extends BaseLink> {...}

        So far the basic steps to use the framework flow like this:
        1. Create an implementation of an EntityLinker.
           Optionally, use the Linkable and LinkableFactory framework inside your EntityLinker impl (recommended due to its configuration-driven extensibility).
        2. Create a props file and add entries using the following format:
           linker.location=opennlp.tools.entitylinker.GeoEntityLinker
           linker.location.linkables=opennlp.tools.entitylinker.PostGISGeoGazImpl,opennlp.tools.entitylinker.MySQLUSGSGazLinkable,opennlp.tools.entitylinker.MySQLGeoNamesGazLinkable
        3. Create a class that extends BaseEntityLinker.
        4. Use the OpenNLP name finder et al. in your own class, and retrieve LinkedSpans via the class that extends BaseEntityLinker.
        5. Do something awesome with the LinkedSpans.
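Detecting that a Span[] contains several entity types amounts to grouping the spans by getType() so that each group can be handed to the linker configured for that type. The sketch below shows only that grouping step; TypedSpan is a stand-in for opennlp.tools.util.Span, and SpanGrouping is a hypothetical helper, not the BaseEntityLinker code from the attachment.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Stand-in for opennlp.tools.util.Span, reduced to the fields used here.
final class TypedSpan {
  final int start;
  final int end;
  final String type;

  TypedSpan(int start, int end, String type) {
    this.start = start;
    this.end = end;
    this.type = type;
  }

  String getType() {
    return type;
  }
}

final class SpanGrouping {

  private SpanGrouping() {}

  // Groups spans by entity type, preserving the order in which types first appear.
  static Map<String, List<TypedSpan>> byType(TypedSpan[] spans) {
    Map<String, List<TypedSpan>> groups = new LinkedHashMap<>();
    for (TypedSpan span : spans) {
      groups.computeIfAbsent(span.getType(), t -> new ArrayList<>()).add(span);
    }
    return groups;
  }
}
```

More than one key in the resulting map means the input mixes entity types, and each entry's key can be used to look up the linker for that type.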

        I also refined the Document and Sentence objects to make them more useful for the georeferencing impl I am working on.

        thanks!

        Mark Giaconia added a comment - - edited

        See attached zip EntityLinker_13June2013 for latest...
        Threadsafe factory
        factory supports "aggregated" types
        BaseEntityLinker for ease of use
        Document domain model

        Joern Kottmann added a comment -

        Thanks for your update, sorry for my late reply, I was quite busy this week with other stuff.

        I would like to get this contribution pulled in soon. Based on the discussion we had on the mailing list and here, could you update the EntityLinker to look like this?

        public interface EntityLinker<T extends Span> {
          List<T> find(String text, Span sentences[], Span tokens[], Span names[]);
        }

        The document is passed in as text, with sentence, token and name spans. The Linkable and Document model classes should be removed.

        Mark Giaconia added a comment -

        Will do. Note, though, that with the signature List<T> find(String text, Span sentences[], Span tokens[], Span names[]) we are assuming the user just wants all the sentences and the doc text available inside this method. They would have to keep track, outside this call, of which sentence each of the name spans and tokens belongs to, because with this approach the spans and tokens are not inherently correlated with a particular sentence.
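If all three span arrays carried character offsets into text, the sentence a name belongs to could be recovered by containment, as in the sketch below. That is an assumption: the name finder normally returns spans indexed by token, which would first need converting to character offsets. CharSpan stands in for opennlp.tools.util.Span and SentenceLookup is a hypothetical helper.

```java
// Stand-in for opennlp.tools.util.Span with character offsets (end exclusive).
final class CharSpan {
  final int start;
  final int end;

  CharSpan(int start, int end) {
    this.start = start;
    this.end = end;
  }

  // True when other lies entirely inside this span.
  boolean contains(CharSpan other) {
    return start <= other.start && other.end <= end;
  }
}

final class SentenceLookup {

  private SentenceLookup() {}

  // Returns the index of the first sentence that fully contains the name span,
  // or -1 when no sentence contains it (for example a span crossing a boundary).
  static int sentenceIndexOf(CharSpan[] sentences, CharSpan name) {
    for (int i = 0; i < sentences.length; i++) {
      if (sentences[i].contains(name)) {
        return i;
      }
    }
    return -1;
  }
}
```

The linear scan is fine for short documents; for long ones, sorted sentence spans would allow a binary search on the start offset.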
        Removal of the linkable Interface will definitely simplify the framework.
        I would still like to provide an implementation of a georeferencing EntityLinker, but separate from the actual contribution to the baseline tools api, perhaps as a useful example of using the framework?

        Thanks!

        Joern Kottmann added a comment -

        Exactly, we will align the other tools to produce this as output from the pre-processing steps. Yes, please leave the georeferencing implementation inside the contribution.

        Mark Giaconia added a comment -

        ok, let me know how I can help with refactoring to align the other tools to produce this as output.

        Mark Giaconia added a comment -

        A couple of thoughts. I completed the changes, but while implementing the GeoEntityLinker I realized it would be useful (perhaps necessary in some cases) to have the two overloads below in the EntityLinker interface. Let me know what you think. Descriptions follow; sorry for the long post.

        // overload with an int sentenceIndex
        List<T> find(String text, Span sentences[], Span tokens[], Span nameSpans[], int sentenceIndex);
        // tokens are String[] rather than Span[]
        List<T> find(String text, Span sentences[], String tokens[], Span nameSpans[]);

        Descriptions:

        List<T> find(String text, Span sentences[], Span tokens[], Span nameSpans[], int sentenceIndex);

        This overload takes a sentenceIndex into sentences[], so that when the implementation builds a String[] of tokens from tokens[] and then uses nameSpans[] to build the String[] of names for the search, it knows which sentence to use. This is useful when iterating over sentences externally, getting names, and linking the names. Without the int overload, the find method would have to hard-code an index into sentences[], or the caller would always have to pass the sentence of interest as the first element, or pass only one element in sentences[].
        Here is an example from my GeoEntityLinker impl:

        @Override
        public List<LinkedSpan> find(String text, Span[] sentences, Span[] tokens, Span[] names, int sentenceIndex) {
          List<LinkedSpan> spans = new ArrayList<LinkedSpan>();
          // Get the sentence from the text using sentenceIndex. Building the whole array
          // of sentence strings on every call will be inefficient on large documents.
          String sentenceINeedTokensFor = Span.spansToStrings(sentences, text)[sentenceIndex];
          // Get the String[] tokens needed to resolve the names.
          String[] stringtokens = Span.spansToStrings(tokens, sentenceINeedTokensFor);
          // Get the names based on the tokens.
          String[] matches = Span.spansToStrings(names, stringtokens);
          for (int i = 0; i < matches.length; i++) {
            // process...
          }
          return spans;
        }

        List<T> find(String text, Span sentences[], String tokens[], Span nameSpans[]);

        This overload accepts a String[] of tokens rather than a Span[] of tokens, which eliminates the problem above. The implementation has what it needs to generate names from tokens[] and nameSpans[], and only needs to touch the sentences and text if desired.
        This allows for simpler processing and is much more efficient, because a sentence array does not have to be generated on every call just to get the tokens as a String[].

        @Override
        public List<LinkedSpan> find(String text, Span[] sentences, String[] tokens, Span[] names) {
          List<LinkedSpan> spans = new ArrayList<LinkedSpan>();
          // Just get the names using tokens[] and names[].
          String[] matches = Span.spansToStrings(names, tokens);
          for (int i = 0; i < matches.length; i++) {
            // process...
          }
          return spans;
        }
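
        To make the difference between the two overloads concrete, here is a minimal self-contained sketch. SimpleSpan and SpanDemo are hypothetical stand-ins written for this illustration, not the OpenNLP classes. The sketch assumes that sentence spans are character offsets into the text and that name spans are token indexes with an exclusive end.

```java
import java.util.Arrays;

// Hypothetical stand-in for a span: a half-open [start, end) range.
final class SimpleSpan {
    final int start;
    final int end;

    SimpleSpan(int start, int end) {
        this.start = start;
        this.end = end;
    }
}

final class SpanDemo {
    // Character-offset spans -> covered substrings.
    // This is the extra step the Span[] tokens overload has to pay for.
    static String[] coveredText(SimpleSpan[] spans, String text) {
        String[] out = new String[spans.length];
        for (int i = 0; i < spans.length; i++) {
            out[i] = text.substring(spans[i].start, spans[i].end);
        }
        return out;
    }

    // Token-index spans -> names, joining the covered tokens with single spaces.
    // With String[] tokens already in hand, this is the only step required.
    static String[] namesFromTokens(SimpleSpan[] nameSpans, String[] tokens) {
        String[] out = new String[nameSpans.length];
        for (int i = 0; i < nameSpans.length; i++) {
            String[] covered = Arrays.copyOfRange(tokens, nameSpans[i].start, nameSpans[i].end);
            out[i] = String.join(" ", covered);
        }
        return out;
    }
}
```

        With tokens {"I", "flew", "to", "New", "York", "."} and a name span of (3, 5), namesFromTokens yields "New York" without touching the sentence array or the document text at all.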

        Mark Giaconia added a comment -

        Framework package changes:
        -Removed Linkable
        -Changed EntityLinker interface methods to desired method (and added 2 more)

        GeoEntityLinker Impl
        -is functional but needs work on resolution aspects... working on it

        MySQL
        -the SQL script will install the Geonames and USGS gazetteers (change the path to the files at the bottom of the script)
        -when this is complete the Example class will run

        Mark Giaconia added a comment -

        properties file attached

        Mark Giaconia added a comment - - edited
        • GeoEntityLinker is functional (in a basic sense) against the USGS and Geonames gazetteers. I ran ~100K sentences through it and produced about 20K locations, and the results look pretty good (currently it is a "high precision, low recall" approach).
        • Implements the concept of "country context" at the document level to help resolve locations.
        • Needs a better scoring approach.
        • Currently no fuzzy string matching is used to match the initial NER result with the gazetteer entries; it is a boolean-type search against a MySQL text index. I may try a fuzzy search and then use something like an n-gram signature comparison to score the match at a finer level. It currently performs well thanks to the MySQL text index, but recall would suffer with obscure location names.
        • Filtering with country context is configurable (when true, only locations within the countries found at the document level are returned).
        • The list of items that indicate countries is stored in the database, so it is extensible.

        General capability provided by the GeoEntityLinker: finding geonames within unstructured text by linking a gazetteer to location named entities (geotagging, georeferencing, geo-enabling text).
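
        As an illustration of the kind of n-gram comparison mentioned above, here is a minimal sketch that scores two names by the Dice coefficient over their character n-gram sets. This is an assumption about what such a scorer could look like, not the GeoEntityLinker's actual scoring; the class name NgramSimilarity and its methods are made up for this example.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: compares an NER result with a gazetteer entry
// using the overlap of their character n-grams.
final class NgramSimilarity {
    // Collect the distinct character n-grams of the lower-cased, trimmed string.
    static Set<String> ngrams(String s, int n) {
        Set<String> grams = new HashSet<>();
        String norm = s.toLowerCase(Locale.ROOT).trim();
        for (int i = 0; i + n <= norm.length(); i++) {
            grams.add(norm.substring(i, i + n));
        }
        return grams;
    }

    // Dice coefficient: 2 * |intersection| / (|A| + |B|), a value in [0, 1].
    static double dice(String a, String b, int n) {
        Set<String> ga = ngrams(a, n);
        Set<String> gb = ngrams(b, n);
        if (ga.isEmpty() && gb.isEmpty()) {
            // Both strings are shorter than n: fall back to an exact comparison.
            return a.trim().equalsIgnoreCase(b.trim()) ? 1.0 : 0.0;
        }
        Set<String> shared = new HashSet<>(ga);
        shared.retainAll(gb);
        return 2.0 * shared.size() / (ga.size() + gb.size());
    }
}
```

        A score like this could be used to rank the candidates returned by a looser search, so that a misspelled toponym still scores close to its gazetteer entry while unrelated names score near zero.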
        Joern Kottmann added a comment -

        Thanks for the update; I pulled the contribution in today. For further modifications, please send us patches that can be applied against trunk.

        • It would be nice to have a package.java file which describes the entitylinker
        • We need some documentation
        • I removed the Example file, there should be some integration into the OpenNLP command line to demonstrate this component
        Mark Giaconia added a comment -

        thanks Joern, I'll submit a few patches soon, and some documentation including package.java

        Mark Giaconia added a comment -

        The upload for 588 contains the following updates:

        • a setEntitylinkerProperties method in the EntityLinker interface, which allows properties to be passed generically (and optionally) through the factory to any EntityLinker impl
        • fixes for a few bugs in the GeoNames lookup
        • improved scoring; the score distribution is not yet normalized, and the MySQL score is simply modified (reduced or boosted)
        • a package-info.java

        I will work on user guide material and the CLI, and continue to refine.
        Thanks
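
        The idea of passing properties through the factory to any linker can be sketched as follows. All names here (SketchLinker, SketchGeoLinker, SketchLinkerFactory, and the sketch.index.path key) are hypothetical and exist only for this illustration; they are not the OpenNLP interface or the keys of the attached properties file.

```java
import java.util.Properties;

// Hypothetical linker contract with an optional properties hook.
interface SketchLinker {
    void setProperties(Properties props);

    String describe();
}

// Hypothetical implementation that reads one made-up setting.
final class SketchGeoLinker implements SketchLinker {
    private String indexPath = "(unset)";

    @Override
    public void setProperties(Properties props) {
        indexPath = props.getProperty("sketch.index.path", "(unset)");
    }

    @Override
    public String describe() {
        return "geo linker using index at " + indexPath;
    }
}

final class SketchLinkerFactory {
    // The factory picks an implementation by entity type and hands the
    // properties through without knowing which keys the implementation reads.
    static SketchLinker getLinker(String entityType, Properties props) {
        SketchLinker linker;
        if ("location".equals(entityType)) {
            linker = new SketchGeoLinker();
        } else {
            throw new IllegalArgumentException("no linker for type: " + entityType);
        }
        if (props != null) {
            linker.setProperties(props); // optional: skipped when no properties are given
        }
        return linker;
    }
}
```

        The point of the design is that the factory stays generic: each implementation decides which settings it needs, and callers that have no configuration can pass nothing.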
        Mark Giaconia added a comment - - edited

        Added the following:

        • Better scoring of GeoEntities
        • LinkedSpan<> now has two additional fields so consumers can store a sentenceIndex and the original searchTerm (for use downstream). Previously, once the List<LinkedSpan<BaseLink>> was returned and consolidated, there was no way to know which sentence each LinkedSpan came from after the sentence/document processing loop had exited.
        • Removed unnecessary println statements
        • Did some general cleanup

        Notes: The Example class is still there in case you want to debug or run it; I understand it is not part of the baseline. Working on documentation, will submit soon.
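
        To show why carrying the sentence index on each span helps downstream, here is a small sketch. SketchLinkedSpan and SentenceTagging are hypothetical classes for this illustration only and do not mirror the real LinkedSpan API; the sketch assumes each linked span records the index of its source sentence and the term that was searched.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical linked span that remembers where it came from.
final class SketchLinkedSpan {
    final String searchTerm;  // the original term sent to the gazetteer lookup
    final int sentenceIndex;  // index of the sentence the term was found in

    SketchLinkedSpan(String searchTerm, int sentenceIndex) {
        this.searchTerm = searchTerm;
        this.sentenceIndex = sentenceIndex;
    }
}

final class SentenceTagging {
    // Regroup a consolidated list by sentence after the processing loop has
    // finished; a TreeMap keeps the sentences in document order.
    static Map<Integer, List<String>> termsBySentence(List<SketchLinkedSpan> spans) {
        Map<Integer, List<String>> grouped = new TreeMap<>();
        for (SketchLinkedSpan span : spans) {
            grouped.computeIfAbsent(span.sentenceIndex, k -> new ArrayList<>())
                   .add(span.searchTerm);
        }
        return grouped;
    }
}
```

        Without the stored index, the consolidated list would only contain the spans themselves, and the grouping above could not be rebuilt once the loop over sentences had ended.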
        Mark Giaconia added a comment -

        New Database setup scripts relevant to the latest committed changes (as of 15 Oct). Contains new indexes, and improved stored procedures

        Mark Giaconia added a comment -

        Properties file and country context file for the GeoEntityLinker are attached.
        Gazetteers can be downloaded here:
        NGA GeoNames:
        http://earth-info.nga.mil/gns/html/geonames_20131101.zip
        USGS:
        http://geonames.usgs.gov/docs/stategaz/NationalFile_20131020.zip

        Once these are downloaded, unzip them to a directory. Then use the GazateerIndexer class in the addons geoentitylinker package to create the Lucene indexes (this takes about an hour in total). Once the indexes are complete, enter their paths in the entitylinker.properties file, along with the path to the attached opennlp.geoentitylinker.countrycontext.txt file. Then use the following code to run the GeoEntityLinker:

        // point to the directory that holds your entitylinker.properties file (note the trailing separator)
        String modelPath = "C:\\apache\\entitylinker\\";
        EntityLinkerProperties properties = new EntityLinkerProperties(new File(modelPath + "entitylinker.properties"));
        // do NER with a location model to get some spans,
        // then use the factory to get your linker
        List<LinkedSpan> consolidatedLinkedData = EntityLinkerFactory.getLinker(
            "location", properties).find(document, sentenceSpans, allTokensInDoc, allnamesInDoc);

        Better documentation to follow.


          People

          • Assignee: Joern Kottmann
          • Reporter: Mark Giaconia
          • Votes: 0
          • Watchers: 3

          Time Tracking

          • Original Estimate: 1,082h
          • Remaining Estimate: 1,082h
          • Time Spent: Not Specified

                Development