So if we can pack long streams of 1s with
freqs and positions I think this is probably useful for a lot of people.
Yes, if the overhead is minimal, it might not be an issue in certain cases.
Additionally, for the .doc, I see it's smaller in the AFOR-3 case too. Is
your "Ent" basically a measure of doc deltas? I'm confused about exactly
what it is.
Yes, Ent is just a delta representation of the id of the entity (which can be considered the document id). I only changed the name of the concept, because SIREn principally manipulates entities rather than documents. In my case, an entity is just a set of attribute-value pairs, similar to a document in Lucene.
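To make the "delta representation" concrete, here is a minimal sketch in Python. It is not SIREn or Lucene code; the function names and the sample posting list are hypothetical, and it assumes ids are strictly increasing and start at 1, so the first gap is the id itself.

```python
def to_deltas(entity_ids):
    """Turn a strictly increasing list of entity ids into gaps (deltas)."""
    deltas = []
    previous = 0  # assumption: ids start at 1, so the first gap equals the first id
    for entity_id in entity_ids:
        deltas.append(entity_id - previous)
        previous = entity_id
    return deltas


def from_deltas(deltas):
    """Rebuild the original ids by summing the gaps."""
    ids = []
    current = 0
    for gap in deltas:
        current += gap
        ids.append(current)
    return ids


# Hypothetical posting list for one term: entities 3-6 and 10-12 contain it.
postings = [3, 4, 5, 6, 10, 11, 12]
print(to_deltas(postings))  # [3, 1, 1, 1, 4, 1, 1]
```

Consecutive entity ids show up as a gap of 1, which is the value the compression scheme then has to store.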
Because I would think if you take e.g. Geonames, the place
names in the dataset are not in random order but actually "batched" by
country for example, so you would have long streams of docdelta=1 for
I checked, and the Geonames dataset was sorted alphabetically by URL name:
as were the DBpedia and Sindice datasets.
So yes, this might have (good) consequences for the docdelta list in certain datasets such as Geonames, and especially when indexing semi-structured data, since the schema of the data in one dataset is generally identical across entities/documents. It is therefore likely that we will see long runs of 1 for certain terms or schema terms.
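As a rough illustration of why long runs of 1 matter, the sketch below compares two hypothetical delta lists: one for a schema term that occurs in every consecutive entity, and one for a term scattered across the index. It uses plain fixed-size block bit packing (each block stored at the width of its largest value), which is a simplification and not the AFOR algorithm itself; the block size of 4 and the sample data are assumptions.

```python
def longest_run_of_ones(deltas):
    """Length of the longest consecutive stretch of gaps equal to 1."""
    longest = current = 0
    for gap in deltas:
        current = current + 1 if gap == 1 else 0
        longest = max(longest, current)
    return longest


def block_bit_widths(deltas, block_size=4):
    """Bits per value needed for each fixed-size block (at least 1 bit)."""
    widths = []
    for start in range(0, len(deltas), block_size):
        block = deltas[start:start + block_size]
        # The largest gap in the block dictates the width for the whole block.
        widths.append(max(1, max(block).bit_length()))
    return widths


schema_term = [1] * 8                       # term present in 8 consecutive entities
scattered_term = [5, 1, 17, 3, 9, 2, 1, 30]  # term spread across the index

print(longest_run_of_ones(schema_term), block_bit_widths(schema_term))
# 8 [1, 1]
print(longest_run_of_ones(scattered_term), block_bit_widths(scattered_term))
# 1 [5, 5]
```

Under these assumptions the schema term costs 1 bit per gap while the scattered term costs 5, which is the kind of saving a sorted or batched dataset could give for such terms.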