Details
Question
Status: Closed
Minor
Resolution: Fixed
Description
Hello.
I'm Juan Vargas, a web developer at Notedlinks S.L. in Spain.
I have spent a few days trying to create a Spanish index from the DBpedia 3.8 files for use with the Stanbol Enhancer, following the instructions at https://github.com/apache/stanbol/blob/trunk/entityhub/indexing/dbpedia/README.md. That is:
1. Build the indexing tool
- cd {stanbol-source}/entityhub/indexing/genericrdf/ (where {stanbol-source} is the directory where Stanbol is installed; this requires Stanbol, see http://stanbol.apache.org/docs/trunk/tutorial.html)
- mvn assembly:single
- move org.apache.stanbol.entityhub.indexing.dbpedia-*-jar-with-dependencies.jar to the target directory where I plan to build the index
2. Create the sub-folders in the target directory
- java -jar org.apache.stanbol.entityhub.indexing.dbpedia*-jar-with-dependencies.jar init
3. Download the DBpedia dump files and copy them to 'indexing/resources/rdfdata':
http://downloads.dbpedia.org/3.8/dbpedia_3.6.owl.bz2 (generic, for any language)
http://downloads.dbpedia.org/3.8/es/instance_types_es.nt.bz2
http://downloads.dbpedia.org/3.8/es/labels_es.nt.bz2
http://downloads.dbpedia.org/3.8/es/short_abstracts_es.nt.bz2
http://downloads.dbpedia.org/3.8/es/long_abstracts_es.nt.bz2
http://downloads.dbpedia.org/3.8/es/geo_coordinates_es.nt.bz2
http://downloads.dbpedia.org/3.8/es/persondata_es.nt.bz2 (does not seem to exist for Spanish; is it a problem if it is not used?)
http://downloads.dbpedia.org/3.8/es/article_categories_es.nt.bz2
http://downloads.dbpedia.org/3.8/es/category_labels_es.nt.bz2
http://downloads.dbpedia.org/3.8/es/skos_categories_es.nt.bz2
http://downloads.dbpedia.org/3.8/es/redirects_es.nt.bz2
4. Generate the entity scores and copy them to 'indexing/resources':
- curl http://downloads.dbpedia.org/3.8/es/page_links_en.nt.bz2 | bzcat | sed -e 's/.*<http\:\/\/es\.dbpedia\.org\/resource\/\([^>]*\)> ./\1/' | sort | uniq -c | sort -nr > incoming_links.txt (changes for Spanish: the resource URL, and 'en' replaced by 'es'; see the suggested notes in the README linked above)
5. Configure the index:
- I left the defaults, since I do not really understand how to configure it.
6. Run the jar to create the index:
- java -jar org.apache.stanbol.entityhub.indexing.dbpedia*-jar-with-dependencies.jar index
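To make the link-counting pipeline from step 4 easier to follow, here is a minimal sketch that runs the same sed, sort and uniq stages on a tiny made-up sample instead of the real dump. The function name `count_incoming_links`, the sample triples and the `wikiPageWikiLink` property in them are assumptions for illustration only, and the sed expression assumes the capture group is meant to match the whole resource name of the link target.

```shell
#!/bin/sh
# Hypothetical helper: reads page-link N-Triples on stdin and prints one
# "<count> <resource name>" line per link target, most-linked first.
count_incoming_links() {
    # Greedy ".*" skips to the LAST es.dbpedia.org resource on the line
    # (the object of the triple) and keeps only its local name.
    sed -e 's/.*<http:\/\/es\.dbpedia\.org\/resource\/\([^>]*\)> \./\1/' \
        | sort \
        | uniq -c \
        | sort -nr
}

# Made-up sample data standing in for the real page links dump.
sample='<http://es.dbpedia.org/resource/A> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://es.dbpedia.org/resource/Madrid> .
<http://es.dbpedia.org/resource/B> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://es.dbpedia.org/resource/Madrid> .
<http://es.dbpedia.org/resource/A> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://es.dbpedia.org/resource/Sevilla> .'

printf '%s\n' "$sample" | count_incoming_links
```

With the real dump, the same function would sit between `bzcat` and the redirect to incoming_links.txt.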
The execution crashes, and the stack trace is as follows:
10:42:36,037 [Thread-3] ERROR source.ResourceLoader - Unable to load resource /home/juan/stanbol-index/indexing/resources/rdfdata/redirects_es.nt.bz2
org.openjena.riot.RiotException: [line: 5854, col: 103] Broken token: http://es.dbpedia.org/resource/Pactos_de_
at org.openjena.riot.ErrorHandlerLib$ErrorHandlerStd.fatal(ErrorHandlerLib.java:97)
at org.openjena.riot.lang.LangBase.raiseException(LangBase.java:205)
at org.openjena.riot.lang.LangBase.nextToken(LangBase.java:152)
at org.openjena.riot.lang.LangNQuads.parseOne(LangNQuads.java:42)
at org.openjena.riot.lang.LangNQuads.parseOne(LangNQuads.java:22)
at org.openjena.riot.lang.LangNTuple.runParser(LangNTuple.java:58)
at org.openjena.riot.lang.LangBase.parse(LangBase.java:75)
at org.openjena.riot.RiotReader.parseQuads(RiotReader.java:173)
at com.hp.hpl.jena.tdb.store.bulkloader.BulkLoader.loadQuads$(BulkLoader.java:154)
at com.hp.hpl.jena.tdb.store.bulkloader.BulkLoader.loadDataset(BulkLoader.java:113)
at com.hp.hpl.jena.tdb.TDBLoader.loadDataset$(TDBLoader.java:282)
at com.hp.hpl.jena.tdb.TDBLoader.loadDataset(TDBLoader.java:193)
at com.hp.hpl.jena.tdb.TDBLoader.load(TDBLoader.java:74)
at org.apache.stanbol.entityhub.indexing.source.jenatdb.RdfResourceImporter.importResource(RdfResourceImporter.java:75)
at org.apache.stanbol.entityhub.indexing.core.source.ResourceLoader.loadResource(ResourceLoader.java:201)
at org.apache.stanbol.entityhub.indexing.core.source.ResourceLoader.loadResources(ResourceLoader.java:137)
at org.apache.stanbol.entityhub.indexing.source.jenatdb.RdfIndexingSource.initialise(RdfIndexingSource.java:272)
at org.apache.stanbol.entityhub.indexing.core.impl.IndexingSourceInitialiser.run(IndexingSourceInitialiser.java:43)
at java.lang.Thread.run(Thread.java:679)
Looking at the redirects_es.nt.bz2 file:
5852 <http://es.dbpedia.org/resource/Tratados_Lateranos> <http://dbpedia.org/ontology/wikiPageRedirects> <http://es.dbpedia.org/resource/Pactos_de_Letr\u00E1n> .
5853 <http://es.dbpedia.org/resource/Tratado_Laterano> <http://dbpedia.org/ontology/wikiPageRedirects> <http://es.dbpedia.org/resource/Pactos_de_Letr\u00E1n> .
5854 <http://es.dbpedia.org/resource/Tratado_Lateranense> <http://dbpedia.org/ontology/wikiPageRedirects> <http://es.dbpedia.org/resource/Pactos_de_Letr\u00E1n> .
5855 <http://es.dbpedia.org/resource/Tratados_Lateranenses> <http://dbpedia.org/ontology/wikiPageRedirects> <http://es.dbpedia.org/resource/Pactos_de_Letr\u00E1n> .
I don't see any error. Could someone help me and tell me whether there is anything unusual here?
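For anyone who wants to repeat this inspection, the following is a minimal sketch of how a single line can be pulled out of a compressed dump and how the archive itself can be tested for damage. The function names `show_line` and `check_archive` are hypothetical; `bzip2 -dc` is used as an equivalent of `bzcat`, and `bzip2 -t` is the standard integrity test. A clean result only shows that the file decompresses correctly with the bzip2 command-line tool, not that the indexer reads it the same way.

```shell
#!/bin/sh
# Hypothetical helper: print one line of a bzip2-compressed file.
# $1 = path to the .bz2 file, $2 = 1-based line number
show_line() {
    bzip2 -dc "$1" | sed -n "${2}p"
}

# Hypothetical helper: succeed only if the archive passes bzip2's
# built-in integrity test (bzip2 reports any damage on stderr).
check_archive() {
    bzip2 -t "$1"
}

# Example use, with the file and line number from the error message:
#   check_archive indexing/resources/rdfdata/redirects_es.nt.bz2
#   show_line indexing/resources/rdfdata/redirects_es.nt.bz2 5854
```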
I also tried building an English DBpedia 3.8 index, to check whether I was doing something wrong in the Spanish one. It seemed OK at first, but a few minutes later I got:
11:23:32,576 [Thread-3] ERROR source.ResourceLoader - Unable to load resource /home/juan/stanbol-index/indexing/resources/rdfdata/short_abstracts_en.nt.bz2
org.openjena.riot.RiotException: [line: 1880, col: 96] Broken token: Bambara, also known as Bamana, and Bamanankan by speakers of the language, is a language spoken in Mali, and to a lesser extent Burkina Faso, Senegal by as many as six million people (in
at org.openjena.riot.ErrorHandlerLib$ErrorHandlerStd.fatal(ErrorHandlerLib.java:97)
at org.openjena.riot.lang.LangBase.raiseException(LangBase.java:205)
at org.openjena.riot.lang.LangBase.nextToken(LangBase.java:152)
at org.openjena.riot.lang.LangNQuads.parseOne(LangNQuads.java:42)
at org.openjena.riot.lang.LangNQuads.parseOne(LangNQuads.java:22)
at org.openjena.riot.lang.LangNTuple.runParser(LangNTuple.java:58)
at org.openjena.riot.lang.LangBase.parse(LangBase.java:75)
at org.openjena.riot.RiotReader.parseQuads(RiotReader.java:173)
at com.hp.hpl.jena.tdb.store.bulkloader.BulkLoader.loadQuads$(BulkLoader.java:154)
at com.hp.hpl.jena.tdb.store.bulkloader.BulkLoader.loadDataset(BulkLoader.java:113)
at com.hp.hpl.jena.tdb.TDBLoader.loadDataset$(TDBLoader.java:282)
at com.hp.hpl.jena.tdb.TDBLoader.loadDataset(TDBLoader.java:193)
at com.hp.hpl.jena.tdb.TDBLoader.load(TDBLoader.java:74)
at org.apache.stanbol.entityhub.indexing.source.jenatdb.RdfResourceImporter.importResource(RdfResourceImporter.java:75)
at org.apache.stanbol.entityhub.indexing.core.source.ResourceLoader.loadResource(ResourceLoader.java:201)
at org.apache.stanbol.entityhub.indexing.core.source.ResourceLoader.loadResources(ResourceLoader.java:137)
at org.apache.stanbol.entityhub.indexing.source.jenatdb.RdfIndexingSource.initialise(RdfIndexingSource.java:272)
at org.apache.stanbol.entityhub.indexing.core.impl.IndexingSourceInitialiser.run(IndexingSourceInitialiser.java:43)
at java.lang.Thread.run(Thread.java:679)
Looking at short_abstracts_en.nt.bz2:
1879 <http://dbpedia.org/resource/Bernard_of_Clairvaux> <http://www.w3.org/2000/01/rdf-schema#comment> "Bernard of Clairvaux, O. Cist (1090 \u2013 August 20, 1153) was a French abbot and the primary builder of the reforming Cistercian order. After the death of his mother, Bernard sought admission into the Cistercian order. Three years later, he was sent to found a new abbey at an isolated clearing in a glen known as the Val d'Absinthe, about 15\u00A0km southeast of Bar-sur-Aube. According to tradition, Bernard founded the monastery on 25 June 1115, naming it Claire Vall\u00E9e, which evolved into Clairvaux."@en .
1880 <http://dbpedia.org/resource/Bambara_language> <http://www.w3.org/2000/01/rdf-schema#comment> "Bambara, also known as Bamana, and Bamanankan by speakers of the language, is a language spoken in Mali, and to a lesser extent Burkina Faso, Senegal by as many as six million people (including second language users). The Bambara language is the language of people of the Bambara ethnic group, numbering about 4,000,000 people, but serves also as a lingua franca in Mali (it is estimated that about 80% of the population speak it as a first or second language)."@en .
1881 <http://dbpedia.org/resource/Bishkek> <http://www.w3.org/2000/01/rdf-schema#comment> "Bishkek, formerly Pishpek and Frunze, is the capital and the largest city of Kyrgyzstan. Bishkek is also the administrative centre of Chuy Province which surrounds the city, even though the city itself is not part of the province but rather a province-level unit of Kyrgyzstan. The name is thought to derive from a Kyrgyz word for a churn used to make fermented mare's milk, the Kyrgyz national drink."@en .
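The lines above look well-formed by eye. As a crude mechanical cross-check, the sketch below lists every line of an N-Triples stream that does not end with the closing " ." terminator, which is how a statement cut off in the middle would show up. The function name `find_unterminated_lines` is hypothetical, and this is only a rough heuristic under that one assumption, not a real N-Triples validator.

```shell
#!/bin/sh
# Hypothetical helper: reads N-Triples on stdin and prints
# "<line number>: <line>" for every line that does not end in " .".
# Exit status is 1 if at least one such line was found, 0 otherwise.
find_unterminated_lines() {
    awk '!/ \.$/ { print NR ": " $0; bad = 1 } END { exit bad }'
}

# Example use on a decompressed stream:
#   bzip2 -dc indexing/resources/rdfdata/short_abstracts_en.nt.bz2 | find_unterminated_lines
```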
Could someone explain why errors like "Broken token" appear, or tell me whether I am doing something wrong? I believe I followed the guide correctly. Thanks, and I hope this information helps others who try to create indexes for Apache Stanbol, which is a really great project. Nice work!
Best,
Juan.