Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: 6.0
    • Component/s: SimplePostTool
    • Labels:
      None

      Description

      When trying to index some Freebase articles, such as:

      http://maven.tamingtext.com/freebase-wex-2011-01-18-articles-first10k.tsv

      using the SimplePostTool (bin/post), I ran into a few minor things along the way that would help new users trying to get their content indexed.

      First, I tried the naive approach:

      $ bin/post -c freebase ./freebase-wex-2011-01-18-articles-first10k.tsv 
      

      Didn't work ... here's the output:

      SimplePostTool: WARNING: Skipping freebase-wex-2011-01-18-articles-first10k.tsv. Unsupported file type for auto mode.
      1 files indexed.
      

      Ummm ... no, 1 file was not indexed. Instead, the output should be something like:

      SimplePostTool: WARNING: Skipping freebase-wex-2011-01-18-articles-first10k.tsv. Unsupported file type for auto mode.
      0 of 1 files indexed.
      
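For what it's worth, the counting behind that summary line could look roughly like the sketch below. This is not SimplePostTool's actual code; the function names `is_supported` and `summarize` and the extension list are made up for illustration, and the real tool would POST each supported file where the sketch only counts it.

```shell
# Hypothetical sketch of "N of M files indexed" reporting.
# The extension list is an assumption, not the tool's real auto-mode list.
is_supported() {
  case "$1" in
    *.xml|*.json|*.csv) return 0 ;;
    *) return 1 ;;
  esac
}

summarize() {
  total=0
  indexed=0
  for f in "$@"; do
    total=$((total + 1))
    if is_supported "$f"; then
      # A real tool would send the file to Solr here; the sketch only counts it.
      indexed=$((indexed + 1))
    else
      echo "WARNING: Skipping $f. Unsupported file type for auto mode." >&2
    fi
  done
  echo "$indexed of $total files indexed."
}
```

Tracking skipped files separately from the total is the whole point: the summary can then never claim a skipped file was indexed.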

      Besides the misleading output, shouldn't tsv be a supported file type for auto-mode? It's a common enough format ...

      So I renamed the file to .csv instead and re-ran ... this time I get:

      $ mv freebase-wex-2011-01-18-articles-first10k.tsv freebase-wex-2011-01-18-articles-first10k.csv
      $ bin/post -c freebase ./freebase-wex-2011-01-18-articles-first10k.csv
      
      ERROR - 2015-01-28 16:24:16.074; org.apache.solr.common.SolrException; org.apache.solr.common.SolrException: CSVLoader: input=null, line=1,expected 108 values but got 4
      

      Hmmm ... OK ... I did a little Googling and discovered I needed to set the separator to %09, the URL-encoded tab character (again, the tool should just recognize TSV as a supported format):

      $ bin/post -c freebase -params "separator=%09&escape=\\" ./freebase-wex-2011-01-18-articles-first10k.csv
      

      Success! (Of course, I had to add a header line to the file too, but there's little we can do about that.)
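Adding the header line can be scripted. A minimal sketch follows, assuming a helper named `add_header` (hypothetical) and placeholder column names; the real header has to name one field per column in the actual file.

```shell
# Sketch: write a header line followed by the original rows into a new file.
# $1 = tab-separated header line, $2 = input file, $3 = output file
add_header() {
  { printf '%s\n' "$1"; cat "$2"; } > "$3"
}

# Example call; "id", "title" and "body" are placeholder column names,
# not the real Freebase WEX schema:
#   add_header "$(printf 'id\ttitle\tbody')" articles.tsv articles-with-header.csv
```

Writing to a new file rather than editing in place keeps the original download intact in case the header turns out to be wrong.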

        Activity

        thelabdude Timothy Potter added a comment -

        Also, it would be nice to have an option to skip bad docs, keep progressing through the file, and then give a nice report about which docs were bad (i.e., line number and short error message).
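Until such an option exists, a rough pre-check outside the tool can produce a similar report. The sketch below assumes a hypothetical helper `report_bad_rows` and a file whose first line is the header. It only compares tab-separated field counts, so it knows nothing about Solr's escaping rules and will not catch every document Solr would reject.

```shell
# Sketch: list rows whose tab-separated field count differs from the header's.
# $1 = path to a TSV file whose first line is the header
report_bad_rows() {
  awk -F '\t' '
    NR == 1 { expected = NF; next }
    NF != expected {
      printf "line %d: expected %d values but got %d\n", NR, expected, NF
      bad++
    }
    END { printf "%d bad row(s)\n", bad + 0 }
  ' "$1"
}
```

The message format mirrors the CSVLoader error quoted in the description, but covers every mismatched line instead of stopping at the first one.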

          People

          • Assignee:
            ehatcher Erik Hatcher
            Reporter:
            thelabdude Timothy Potter
          • Votes:
            1
            Watchers:
            2
