Details
Type: New Feature
Status: Closed
Priority: Major
Resolution: Fixed
Description
To make Stanbol useful, especially in offline mode, it needs some statistical models and entity / topic indices. Those indices can be huge (several GB for all the entities of DBpedia and GeoNames, for instance) and hence cannot be packaged as part of the default distribution. However, it is very desirable to embed some default statistical models and small indices:
- OpenNLP sentence detector for English
- OpenNLP name finder models for English for organizations, people, places
- Solr index for the top 10000 most popular entities (of type organizations, people, places), as measured by the number of incoming links in the Wikipedia article graph
- Solr index for the top 1000 most popular topics, as measured by the number of Wikipedia articles categorized in the corresponding category or one of its subcategories
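The popularity ranking used for the two indices can be sketched as follows. This is only an illustration, assuming the link graph is available as (source, target) pairs; the function name `top_entities_by_inlinks` and the `allowed` filter are hypothetical and not part of Stanbol.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple


def top_entities_by_inlinks(
    links: Iterable[Tuple[str, str]],
    allowed: Set[str],
    n: int,
) -> List[Tuple[str, int]]:
    """Return the n entities with the most incoming links.

    links   -- (source, target) pairs from the article graph (assumed input format)
    allowed -- entities of the wanted types (organizations, people, places)
    """
    # Count one incoming link per pair whose target is an allowed entity.
    counts = Counter(target for _source, target in links if target in allowed)
    # Highest count first; ties are broken alphabetically so the result is stable.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:n]


if __name__ == "__main__":
    sample_links = [
        ("A", "Paris"), ("B", "Paris"),
        ("A", "IBM"), ("B", "IBM"), ("C", "IBM"),
        ("A", "Some list page"),
    ]
    print(top_entities_by_inlinks(sample_links, {"Paris", "IBM"}, n=10000))
```

The same ranking would apply to topics by replacing link pairs with (article, category) pairs, with subcategory membership expanded beforehand.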
The goal is to keep that maven artifact less than 100 MB (ideally even smaller) so that it does not put a big barrier to entry to people downloading the default distribution of Stanbol.
To avoid slowing down the svn repo, those data files will not be put under version control; only the pom.xml and a script to rebuild the artifact from a previous version of the jar will be.
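A rebuild script of this kind could look roughly like the sketch below. It is not the script referred to in this issue: the function names, the `.bin` / `.zip` suffix filter and the directory layout are assumptions made for illustration. It relies only on the fact that a jar is a zip archive.

```python
import zipfile
from pathlib import Path
from typing import List, Tuple

SIZE_BUDGET_BYTES = 100 * 1024 * 1024  # the 100 MB goal stated above


def extract_data_files(
    previous_jar: Path,
    target_dir: Path,
    suffixes: Tuple[str, ...] = (".bin", ".zip"),  # assumed data file extensions
) -> List[str]:
    """Copy the data entries of an older jar into target_dir.

    Returns the names of the extracted entries.
    """
    extracted = []
    with zipfile.ZipFile(previous_jar) as jar:
        for info in jar.infolist():
            # Skip directories and anything that is not a data file
            # (manifest, pom.properties, ...).
            if info.is_dir() or not info.filename.endswith(suffixes):
                continue
            jar.extract(info, target_dir)
            extracted.append(info.filename)
    return extracted


def within_budget(jar_path: Path, budget: int = SIZE_BUDGET_BYTES) -> bool:
    """Check that the packaged jar stays below the size budget."""
    return jar_path.stat().st_size < budget
```

After extraction, the regular maven build would repackage the files from `target_dir`, and `within_budget` could be run on the resulting jar as a sanity check.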
Attachments
1. package english opennlp models | Closed | Olivier Grisel
2. package solr index for popular entities | Closed | Unassigned
3. package solr index for popular topics | Closed | Unassigned