|
|
| |
|
NUTCH-698 |
FIXED
|
CrawlDb is corrupted after a few crawl cycles
|
|
|
| |
|
NUTCH-694 |
FIXED
|
Distributed Search Server fails
|
|
|
| |
|
NUTCH-688 |
FIXED
|
Fix missing/wrong headers in source files
|
|
|
| |
|
NUTCH-631 |
FIXED
|
MoreIndexingFilter fails with NoSuchElementException
|
|
|
| |
|
NUTCH-515 |
FIXED
|
Next fetch time is set incorrectly
|
|
|
| |
|
NUTCH-722 |
FIXED
|
Nutch contains jars that we cannot redistribute
|
|
|
| |
|
NUTCH-621 |
FIXED
|
Nutch needs to declare its crypto usage
|
|
|
| |
|
NUTCH-703 |
FIXED
|
Upgrade to Hadoop 0.19.1
|
|
|
| |
|
NUTCH-724 |
DUPLICATE
|
Drop the JAI libraries
|
|
|
| |
|
NUTCH-678 |
FIXED
|
Hadoop 0.19 requires an update of jets3t
|
|
|
| |
|
NUTCH-641 |
FIXED
|
IndexSorter incorrectly copies stored fields
|
|
|
| |
|
NUTCH-700 |
FIXED
|
Neko1.9.11 goes into a loop
|
|
|
| |
|
NUTCH-508 |
FIXED
|
${hadoop.log.dir} and ${hadoop.log.file} are not propagated to the tasktracker
|
|
|
| |
|
NUTCH-61 |
FIXED
|
Adaptive re-fetch interval. Detecting unmodified content
|
|
|
| |
|
NUTCH-652 |
FIXED
|
AdaptiveFetchSchedule#setFetchSchedule doesn't calculate fetch interval correctly
|
|
|
| |
|
NUTCH-727 |
FIXED
|
Add KEYS file to release artifact
|
|
|
| |
|
NUTCH-699 |
FIXED
|
Add an "official" solr schema for solr integration
|
|
|
| |
|
NUTCH-603 |
FIXED
|
Add more default url normalizations
|
|
|
| |
|
NUTCH-586 |
FIXED
|
Add option to run compiled classes w/o job file
|
|
|
| |
|
NUTCH-279 |
FIXED
|
Additions for regex-normalize
|
|
|
| |
|
NUTCH-602 |
FIXED
|
Allow configurable number of handlers for search servers
|
|
|
| |
|
NUTCH-565 |
FIXED
|
Arc File to Nutch Segments Converter
|
|
|
| |
|
NUTCH-488 |
FIXED
|
Avoid parsing unnecessary links and get a more relevant outlink list
|
|
|
| |
|
NUTCH-485 |
FIXED
|
Change HtmlParseFilters to return ParseResult object instead of Parse object
|
|
|
| |
|
NUTCH-605 |
FIXED
|
Change deprecated configuration methods for Hadoop
|
|
|
| |
|
NUTCH-643 |
FIXED
|
ClassCastException in PdfParser on encrypted PDF with empty password
|
|
|
| |
|
NUTCH-545 |
FIXED
|
Configuration and OnlineClusterer get initialized in every request.
|
|
|
| |
|
NUTCH-669 |
FIXED
|
Consolidate code for Fetcher and Fetcher2
|
|
|
| |
|
NUTCH-532 |
FIXED
|
CrawlDbMerger: wrong computation of last fetch time
|
|
|
| |
|
NUTCH-684 |
FIXED
|
Dedup support for Solr
|
|
|
| |
|
NUTCH-467 |
FIXED
|
DeleteDuplicate fails if Segment index directory has 0 documents
|
|
|
| |
|
NUTCH-525 |
FIXED
|
DeleteDuplicates generates ArrayIndexOutOfBoundsException when trying to rerun dedup on a segment
|
|
|
| |
|
NUTCH-668 |
FIXED
|
Domain URL Filter
|
|
|
| |
|
NUTCH-613 |
FIXED
|
Empty Summaries and Cached Pages
|
|
|
| |
|
NUTCH-497 |
FIXED
|
Extreme Nested Tags causes StackOverflowException in DomContentUtils...Spider Trap
|
|
|
| |
|
NUTCH-579 |
FIXED
|
Feed plugin only indexes one post per feed due to identical digest
|
|
|
| |
|
NUTCH-413 |
FIXED
|
Fetcher ignores -noParsing command line option
|
|
|
| |
|
NUTCH-597 |
FIXED
|
Fetcher2 - java.lang.NullPointerException when host does not exist and fetcher.threads.per.host.by.ip is set to true causes threads to finish.
|
|
|
| |
|
NUTCH-474 |
FIXED
|
Fetcher2 sets server-delay and blocking checks incorrectly
|
|
|
| |
|
NUTCH-126 |
FIXED
|
Fetching via https does not work with a proxy (patch)
|
|
|
| |
|
NUTCH-518 |
FIXED
|
Fix OpicScoringFilter to respect scoring filter chaining
|
|
|
| |
|
NUTCH-382 |
FIXED
|
Fix for NUTCH-365 introduced a bug if generate.max.per.host.by.ip is enabled
|
|
|
| |
|
NUTCH-471 |
FIXED
|
Fix synchronization in NutchBean creation
|
|
|
| |
|
NUTCH-74 |
FIXED
|
French Analyzer Plugin
|
|
|
| |
|
NUTCH-503 |
FIXED
|
Generator exits incorrectly for small fetchlists
|
|
|
| |
|
NUTCH-554 |
FIXED
|
Generator throws java.io.IOException and dies on injected urls with no protocol
|
|
|
| |
|
NUTCH-636 |
FIXED
|
Http client plug-in https doesn't work on IBM JRE
|
|
|
| |
|
NUTCH-561 |
FIXED
|
HttpClient plugin does not work with NTLM authentication
|
|
|
| |
|
NUTCH-501 |
FIXED
|
Implement a different caching mechanism for objects cached in configuration
|
|
|
| |
|
NUTCH-574 |
FIXED
|
Including inlink anchor text in index can create irrelevant search results.
|
|
|
| |
|
NUTCH-510 |
FIXED
|
IndexMerger delete working dir
|
|
|
| |
|
NUTCH-393 |
FIXED
|
Indexer doesn't handle null documents returned by filters
|
|
|
| |
|
NUTCH-442 |
FIXED
|
Integrate Solr/Nutch
|
|
|
| |
|
NUTCH-671 |
FIXED
|
JSP errors in Nutch searcher webapp running with Tomcat 6
|
|
|
| |
|
NUTCH-723 |
FIXED
|
LICENCE.txt is lacking info that should be there
|
|
|
| |
|
NUTCH-635 |
FIXED
|
LinkAnalysis Tool for Nutch
|
|
|
| |
|
NUTCH-533 |
FIXED
|
LinkDbMerger: url normalized is not updated in the key and inlinks list
|
|
|
| |
|
NUTCH-261 |
FIXED
|
Multi Language Support
|
|
|
| |
|
NUTCH-725 |
FIXED
|
NOTICE.txt is lacking info that should be there
|
|
|
| |
|
NUTCH-575 |
FIXED
|
NPE in OpenSearchServlet when summary is null
|
|
|
| |
|
NUTCH-559 |
FIXED
|
NTLM, Basic and Digest Authentication schemes for web/proxy server
|
|
|
| |
|
NUTCH-504 |
FIXED
|
NUTCH-443 broke parsing during fetching
|
|
|
| |
|
NUTCH-487 |
FIXED
|
Neko HTML parser goes on default settings.
|
|
|
| |
|
NUTCH-646 |
FIXED
|
New Indexing Framework for Nutch
|
|
|
| |
|
NUTCH-516 |
FIXED
|
Next fetch time is not set when it is a CrawlDatum.STATUS_FETCH_GONE
|
|
|
| |
|
NUTCH-529 |
FIXED
|
NodeWalker.skipChildren doesn't work for more than 1 child.
|
|
|
| |
|
NUTCH-593 |
FIXED
|
Nutch crawl problem
|
|
|
| |
|
NUTCH-506 |
FIXED
|
Nutch should delegate compression to Hadoop
|
|
|
| |
|
NUTCH-614 |
FIXED
|
Order Inlinks by OPIC score of parent page
|
|
|
| |
|
NUTCH-392 |
FIXED
|
OutputFormat implementations should pass on Progressable
|
|
|
| |
|
NUTCH-220 |
FIXED
|
PDF Box can't parse document: java.lang.NullPointerException
|
|
|
| |
|
NUTCH-550 |
FIXED
|
Parse fails if db.max.outlinks.per.page is -1
|
|
|
| |
|
NUTCH-645 |
FIXED
|
Parse-swf unit test failing
|
|
|
| |
|
NUTCH-535 |
FIXED
|
ParseData's contentMeta accumulates unnecessary values during parse
|
|
|
| |
|
NUTCH-634 |
FIXED
|
Patch - Nutch - Hadoop 0.17.1
|
|
|
| |
|
NUTCH-726 |
FIXED
|
README.txt is lacking info that should be there
|
|
|
| |
|
NUTCH-615 |
FIXED
|
Redirected URLs are fetched without setting any FetchInterval
|
|
|
| |
|
NUTCH-547 |
FIXED
|
Redirection handling: YahooSlurp's algorithm
|
|
|
| |
|
NUTCH-339 |
FIXED
|
Refactor nutch to allow fetcher improvements
|
|
|
| |
|
NUTCH-598 |
FIXED
|
Remove deprecated use of ToolBase, Migration to the new implementation
|
|
|
| |
|
NUTCH-434 |
FIXED
|
Replace usage of ObjectWritable with something based on GenericWritable
|
|
|
| |
|
NUTCH-616 |
FIXED
|
Reset Fetch Retry counter when fetch is successful
|
|
|
| |
|
NUTCH-647 |
FIXED
|
Resolve URLs tool
|
|
|
| |
|
NUTCH-682 |
FIXED
|
SOLR indexer does not set boost on the document
|
|
|
| |
|
NUTCH-534 |
FIXED
|
SegmentMerger: add -normalize option
|
|
|
| |
|
NUTCH-715 |
FIXED
|
Subcollection plugin doesn't work with default subcollections.xml file
|
|
|
| |
|
NUTCH-153 |
FIXED
|
TextParser is only supposed to parse plain text, but if given postscript, it can take hours and then fail
|
|
|
| |
|
NUTCH-618 |
FIXED
|
Tika error "Media type alias already exists"
|
|
|
| |
|
NUTCH-439 |
FIXED
|
Top Level Domains Indexing / Scoring
|
|
|
| |
|
NUTCH-612 |
FIXED
|
URL filtering is always disabled in Generator when invoked by Crawl
|
|
|
| |
|
NUTCH-489 |
FIXED
|
URLFilter-suffix management of the url path when the url contains some query parameters
|
|
|
| |
|
NUTCH-642 |
FIXED
|
Unit tests fail when run in non-local mode
|
|
|
| |
|
NUTCH-607 |
FIXED
|
Update build.xml to include tika jar in war file
|
|
|
| |
|
NUTCH-691 |
FIXED
|
Update jakarta poi jars to the most relevant version
|
|
|
| |
|
NUTCH-552 |
FIXED
|
Upgrade Nutch to Hadoop 0.15.x
|
|
|
| |
|
NUTCH-604 |
FIXED
|
Upgrade Nutch to Lucene 2.3.0
|
|
|
| |
|
NUTCH-587 |
FIXED
|
Upgrade Nutch to use Hadoop 0.15.3 release
|
|
|
| |
|
NUTCH-611 |
FIXED
|
Upgrade Nutch to use Hadoop 0.16
|
|
|
| |
|
NUTCH-663 |
FIXED
|
Upgrade Nutch to use Hadoop 0.19
|
|
|
| |
|
NUTCH-662 |
FIXED
|
Upgrade Nutch to use Lucene 2.4
|
|
|
| |
|
NUTCH-608 |
FIXED
|
Upgrade nutch to use released apache-tika-0.1-incubating
|
|
|
| |
|
NUTCH-653 |
FIXED
|
Upgrade to hadoop 0.18
|
|
|
| |
|
NUTCH-517 |
FIXED
|
build encoding should be UTF-8
|
|
|
| |
|
NUTCH-626 |
FIXED
|
fetcher2 breaks out the domain with db.ignore.external.links set at cross domain redirects
|
|
|
| |
|
NUTCH-546 |
FIXED
|
file URLs are filtered out by the crawler
|
|
|
| |
|
NUTCH-481 |
FIXED
|
http.content.limit is broken in the protocol-httpclient plugin
|
|
|
| |
|
NUTCH-695 |
FIXED
|
incorrect mime type detection by MoreIndexingFilter plugin
|
|
|
| |
|
NUTCH-507 |
FIXED
|
lib-lucene-analyzers jar definition is wrong in plugin.xml
|
|
|
| |
|
NUTCH-25 |
FIXED
|
needs 'character encoding' detector
|
|
|
| |
|
NUTCH-120 |
FIXED
|
one "bad" link on a page kills parsing
|
|
|
| |
|
NUTCH-353 |
FIXED
|
pages that serverside forwards will be refetched every time
|
|
|
| |
|
NUTCH-681 |
FIXED
|
parse-mp3 compilation problem
|
|
|
| |
|
NUTCH-571 |
FIXED
|
parse-mp3 plugin doesn't always index album of mp3
|
|
|
| |
|
NUTCH-560 |
FIXED
|
protocol-httpclient reading more bytes than http.content.limit
|
|
|
| |
|
NUTCH-419 |
FIXED
|
unavailable robots.txt kills fetch
|
|
|
| |
|
NUTCH-584 |
FIXED
|
urls missing from fetchlist
|
|
|
| |
|
NUTCH-530 |
WON'T FIX
|
Add a combiner to improve performance on updatedb
|
|
|
| |
|
NUTCH-637 |
WON'T FIX
|
Add method to nutch and tika system(Code written)
|
|
|
| |
|
NUTCH-486 |
WON'T FIX
|
Break searcher dependency on commons-cli
|
|
|
| |
|
NUTCH-632 |
WON'T FIX
|
Bug in TextParser with encoding
|
|
|
| |
|
NUTCH-748 |
WON'T FIX
|
DiskChecker Could not find
|
|
|
| |
|
NUTCH-590 |
WON'T FIX
|
Index multiple docs per call using IndexingFilter extension point
|
|
|
| |
|
NUTCH-82 |
WON'T FIX
|
Nutch Commands should run on Windows without external tools
|
|
|
| |
|
NUTCH-155 |
WON'T FIX
|
Remove web gui from the distribution to "contrib" and use OpenSearch Servlet
|
|
|
| |
|
NUTCH-526 |
WON'T FIX
|
Use a combiner in LinkDbMerger to improve the performance as in LinkDb
|
|
|
| |
|
NUTCH-357 |
WON'T FIX
|
crawling simulation
|
|
|
| |
|
NUTCH-661 |
WON'T FIX
|
errors when the uri contains space characters
|
|
|
| |
|
NUTCH-599 |
WON'T FIX
|
nutch crawl and index problem
|
|
|
| |
|
NUTCH-630 |
DUPLICATE
|
Error caused by index-more plugin in the latest svn revision - 652259
|
|
|
| |
|
NUTCH-592 |
DUPLICATE
|
Fetcher2 : NPE for page with status ProtocolStatus.TEMP_MOVED
|
|
|
| |
|
NUTCH-701 |
DUPLICATE
|
Replace Fetcher with Fetcher2
|
|
|
| |
|
NUTCH-491 |
DUPLICATE
|
dedup fails with ArrayIndexOutOfBoundsException
|
|
|
| |
|
NUTCH-572 |
INVALID
|
Scoring and redirected Urls
|
|
|
| |
|
NUTCH-452 |
INCOMPLETE
|
Nutch JSF/My Faces Search Frontend
|
|
|
| |
|
NUTCH-262 |
INCOMPLETE
|
NUTCH-261
Summary excerpts and highlights problems
|
|
|
| |
|
NUTCH-531 |
CANNOT REPRODUCE
|
Pages with no ContentType cause a Null Pointer exception
|
|
|
| |
|
NUTCH-398 |
CANNOT REPRODUCE
|
map-reduce very slow when crawling on single server
|
|
|
| |
|
NUTCH-687 |
FIXED
|
Add RAT
|
|
|
| |
|
NUTCH-500 |
FIXED
|
Add hadoop masters configuration file into conf folder
|
|
|
| |
|
NUTCH-582 |
FIXED
|
Add missing type parameters
|
|
|
| |
|
NUTCH-345 |
FIXED
|
Add support for Content-Encoding: deflated
|
|
|
| |
|
NUTCH-765 |
FIXED
|
Allow Crawl class to call Either Solr or Lucene Indexer
|
|
|
| |
|
NUTCH-620 |
FIXED
|
BasicURLNormalizer should collapse runs of slashes with a single slash
|
|
|
| |
|
NUTCH-502 |
FIXED
|
Bug in SegmentReader causes infinite loop
|
|
|
| |
|
NUTCH-639 |
FIXED
|
Change LuceneDocumentWrapper visibility from private to protected
|
|
|
| |
|
NUTCH-161 |
FIXED
|
Change Plain text parser to use parser.character.encoding.default property for fall back encoding
|
|
|
| |
|
NUTCH-528 |
FIXED
|
CrawlDbReader: add some new stats + dump into a csv format
|
|
|
| |
|
NUTCH-494 |
FIXED
|
FindBugs: CrawlDbReader and DeleteDuplicates
|
|
|
| |
|
NUTCH-539 |
FIXED
|
HttpClient plugin does not work with BasicAuthentication
|
|
|
| |
|
NUTCH-563 |
FIXED
|
Include custom fields in BasicQueryFilter
|
|
|
| |
|
NUTCH-711 |
FIXED
|
Indexer failing after upgrade to Hadoop 0.19.1
|
|
|
| |
|
NUTCH-514 |
FIXED
|
Indexer should only index pages with fetch status SUCCESS
|
|
|
| |
|
NUTCH-667 |
FIXED
|
Input Format for working with Content in Hadoop Streaming
|
|
|
| |
|
NUTCH-676 |
FIXED
|
MapWritable is written inefficiently and confusingly
|
|
|
| |
|
NUTCH-548 |
FIXED
|
Move URLNormalizer from Outlink to ParseOutputFormat
|
|
|
| |
|
NUTCH-683 |
FIXED
|
NUTCH-676 broke CrawlDbMerger
|
|
|
| |
|
NUTCH-505 |
FIXED
|
Outlink urls should be validated
|
|
|
| |
|
NUTCH-411 |
FIXED
|
Parse ignores meta refresh redirection
|
|
|
| |
|
NUTCH-633 |
FIXED
|
ParseSegment no longer allows reparsing
|
|
|
| |
|
NUTCH-596 |
FIXED
|
ParseSegments parse content even if it's not CrawlDatum.STATUS_FETCH_SUCCESS
|
|
|
| |
|
NUTCH-444 |
FIXED
|
Possibly use a different library to parse RSS feed for improved performance and compatibility
|
|
|
| |
|
NUTCH-567 |
FIXED
|
Proper (?) handling of URIs in TagSoup.
|
|
|
| |
|
NUTCH-601 |
FIXED
|
Recrawling on existing crawl directory using force option
|
|
|
| |
|
NUTCH-536 |
FIXED
|
Reduce number of warnings in nutch core
|
|
|
| |
|
NUTCH-606 |
FIXED
|
Refactoring of Generator, run all urls through checks
|
|
|
| |
|
NUTCH-651 |
FIXED
|
Remove bin/{start|stop}-balancer.sh from svn tracking
|
|
|
| |
|
NUTCH-580 |
FIXED
|
Remove deprecated hadoop api calls (FS)
|
|
|
| |
|
NUTCH-446 |
FIXED
|
RobotRulesParser should ignore Crawl-delay values of other bots in robots.txt
|
|
|
| |
|
NUTCH-468 |
FIXED
|
Scoring filter should distribute score to all outlinks at once
|
|
|
| |
|
NUTCH-665 |
FIXED
|
Search Load Testing Tool
|
|
|
| |
|
NUTCH-495 |
FIXED
|
Unnecessary delays in Fetcher2
|
|
|
| |
|
NUTCH-680 |
FIXED
|
Update external jars to latest versions
|
|
|
| |
|
NUTCH-544 |
FIXED
|
Upgrade Carrot2 clustering plugin to the newest stable release (2.1)
|
|
|
| |
|
NUTCH-498 |
FIXED
|
Use Combiner in LinkDb to increase speed of linkdb generation
|
|
|
| |
|
NUTCH-522 |
FIXED
|
Use URLValidator in the Injector
|
|
|
| |
|
NUTCH-443 |
FIXED
|
allow parsers to return multiple Parse object, this will speed up the rss parser
|
|
|
| |
|
NUTCH-359 |
FIXED
|
extraction of links will fail for whole page if one single link cannot be parsed
|
|
|
| |
|
NUTCH-456 |
FIXED
|
parse msexcel plugin speedup
|
|
|
| |
|
NUTCH-247 |
FIXED
|
robot parser to restrict.
|
|
|
| |
|
NUTCH-720 |
FIXED
|
site: search operator with no query term
|
|
|
| |
|
NUTCH-171 |
WON'T FIX
|
Bring back multiple segment support for Generate / Update
|
|
|
| |
|
NUTCH-451 |
WON'T FIX
|
Tool to recover partial fetcher output
|
|
|
| |
|
NUTCH-509 |
WON'T FIX
|
Update Crawldb: avoid starting a job if there is no valid segment
|
|
|
| |
|
NUTCH-330 |
WON'T FIX
|
command line tool to search a Lucene index
|
|
|
| |
|
NUTCH-553 |
DUPLICATE
|
Add more normalization rules to regex-normalize file.
|
|
|
| |
|
NUTCH-448 |
LATER
|
Allow Plugin Includes and Excludes from File
|
|
|
| |
|
NUTCH-223 |
FIXED
|
Crawl.java uses Integer.MAX_VALUE for -topN where Generator.java uses Long.MAX_VALUE for -topN
|
|
|
| |
|
NUTCH-538 |
FIXED
|
Delete unused classes under o.a.n.util
|
|
|
| |
|
NUTCH-484 |
FIXED
|
Nutch Nightly API link is broken in site
|
|
|
| |
|
NUTCH-499 |
FIXED
|
Refactor LinkDb and LinkDbMerger to reuse code
|
|
|
| |
|
NUTCH-482 |
FIXED
|
Remove redundant plugin lib-log4j
|
|
|
| |
|
NUTCH-483 |
FIXED
|
remove redundant commons-logging jar from ontology plugin
|
|
|
| |
|
NUTCH-513 |
FIXED
|
suffix-urlfilter.txt does not have a template
|
|
|
| |
|
NUTCH-654 |
FIXED
|
urlfilter-regex's main does not work
|
|
|
|
|
| |
|
NUTCH-354 |
FIXED
|
MapWritable, nextEntry is not reset when Entries are recycled
|
|
|
| |
|
NUTCH-400 |
FIXED
|
Update & add missing license headers
|
|
|
| |
|
NUTCH-273 |
FIXED
|
When a page is redirected, the original url is NOT updated.
|
|
|
| |
|
NUTCH-332 |
FIXED
|
doubling score caused by page internal anchors.
|
|
|
| |
|
NUTCH-233 |
FIXED
|
wrong regular expression hangs reduce process forever
|
|
|
| |
|
NUTCH-336 |
FIXED
|
Harvested links shouldn't get db.score.injected in addition to inbound contributions
|
|
|
| |
|
NUTCH-341 |
FIXED
|
IndexMerger now deletes entire <workingdir> after completing
|
|
|
| |
|
NUTCH-105 |
FIXED
|
Network error during robots.txt fetch causes file to be ignored
|
|
|
| |
|
NUTCH-167 |
FIXED
|
Observation of <META NAME="ROBOTS" CONTENT="NOARCHIVE"> directive
|
|
|
| |
|
NUTCH-361 |
FIXED
|
generator create fetchlist randomly
|
|
|
| |
|
NUTCH-433 |
FIXED
|
java.io.EOFException in newer nightlies in mergesegs or indexing from hadoop.io.DataOutputBuffer
|
|
|
| |
|
NUTCH-318 |
FIXED
|
log4j not properly configured, readdb doesn't give any information
|
|
|
| |
|
NUTCH-350 |
FIXED
|
urls blocked db.fetch.retry.max * http.max.delays times during fetching are marked as STATUS_DB_GONE
|
|
|
| |
|
NUTCH-381 |
WON'T FIX
|
Ignore external links does not work as expected
|
|
|
| |
|
NUTCH-277 |
CANNOT REPRODUCE
|
Fetcher dies because of "max. redirects" (avoiding infinite loop)
|
|
|
| |
|
NUTCH-331 |
CANNOT REPRODUCE
|
Fetcher incorrectly reports task progress to tasktracker resulting in skipped URLs
|
|
|
| |
|
NUTCH-258 |
CANNOT REPRODUCE
|
Once Nutch logs a SEVERE log item, Nutch fails forevermore
|
|
|
| |
|
NUTCH-417 |
FIXED
|
After upgrade to hadoop-0.9.1, parsing and indexing doesn't work.
|
|
|
| |
|
NUTCH-340 |
FIXED
|
Bug(s) in 0.8 tutorial
|
|
|
| |
|
NUTCH-347 |
FIXED
|
Build: plugins' Jars not found
|
|
|
| |
|
NUTCH-405 |
FIXED
|
Content object is not properly initialized in map method of ParseSegment
|
|
|
| |
|
NUTCH-416 |
FIXED
|
CrawlDatum status and CrawlDbReducer refactoring
|
|
|
| |
|
NUTCH-371 |
FIXED
|
DeleteDuplicates should remove documents with duplicate URLs
|
|
|
| |
|
NUTCH-367 |
FIXED
|
DistributedSearch throws ClassCastException
|
|
|
| |
|
NUTCH-322 |
FIXED
|
Fetcher discards ProtocolStatus, doesn't store redirected pages
|
|
|
| |
|
NUTCH-337 |
FIXED
|
Fetcher ignores the fetcher.parse value configured in config file
|
|
|
| |
|
NUTCH-344 |
FIXED
|
Fetcher threads blocked on synchronized block in cleanExpiredServerBlocks
|
|
|
| |
|
NUTCH-404 |
FIXED
|
Fix LinkDB Usage - implementation mismatch
|
|
|
| |
|
NUTCH-418 |
FIXED
|
Fixes parsing of XHTML (e.g. title)
|
|
|
| |
|
NUTCH-365 |
FIXED
|
Flexible URL normalization
|
|
|
| |
|
NUTCH-415 |
FIXED
|
Generate should mark selected records in crawlDB
|
|
|
| |
|
NUTCH-401 |
FIXED
|
Hardcoded /tmp directory in SegmentReader
|
|
|
| |
|
NUTCH-395 |
FIXED
|
Increase fetching speed
|
|
|
| |
|
NUTCH-432 |
FIXED
|
JAVA_PLATFORM with spaces (i.e. Mac OS X-ppc-32) breaks bin/nutch script
|
|
|
| |
|
NUTCH-403 |
FIXED
|
Make URL filtering optional in Generator
|
|
|
| |
|
NUTCH-437 |
FIXED
|
MapFile in Hadoop Trunk has changed, must update references
|
|
|
| |
|
NUTCH-378 |
FIXED
|
MetaWrapper decorator
|
|
|
| |
|
NUTCH-406 |
FIXED
|
Metadata tries to write null values
|
|
|
| |
|
NUTCH-646 |
FIXED
|
New Indexing Framework for Nutch
|
|
|
| |
|
NUTCH-253 |
FIXED
|
Normalize Host during Generate
|
|
|
| |
|
NUTCH-428 |
FIXED
|
NullPointerException
|
|
|
| |
|
NUTCH-614 |
FIXED
|
Order Inlinks by OPIC score of parent page
|
|
|
| |
|
NUTCH-379 |
FIXED
|
ParseUtil does not pass through the content's URL to the ParserFactory
|
|
|
| |
|
NUTCH-391 |
FIXED
|
ParseUtil logs file contents to log file when it cannot find parser
|
|
|
| |
|
NUTCH-384 |
FIXED
|
Protocol-file plugin does not allow the parse plugins framework to operate properly
|
|
|
| |
|
NUTCH-362 |
FIXED
|
Remove parse-text from unsupported filetypes in parse-plugins.xml
|
|
|
| |
|
NUTCH-394 |
FIXED
|
Searching via Tomcat / nutch-0.9-dev.war raises exception
|
|
|
| |
|
NUTCH-360 |
FIXED
|
Switch nutch to use java 5 source format
|
|
|
| |
|
NUTCH-305 |
FIXED
|
Update crawl and url filter lists to exclude jpeg|JPEG|bmp|BMP
|
|
|
| |
|
NUTCH-459 |
FIXED
|
Upgrade Nutch to Hadoop 0.12.1
|
|
|
| |
|
NUTCH-383 |
FIXED
|
Upgrade Nutch to Hadoop 0.7
|
|
|
| |
|
NUTCH-205 |
FIXED
|
Wrong 'fetch date' for non available pages
|
|
|
| |
|
NUTCH-266 |
FIXED
|
hadoop bug when doing updatedb
|
|
|
| |
|
NUTCH-387 |
FIXED
|
host normalization in Generator$Selector
|
|
|
| |
|
NUTCH-430 |
FIXED
|
integer overflow in HashComparator.compare
|
|
|
| |
|
NUTCH-425 |
FIXED
|
parse-js pollutes anchor text with base URL of source page
|
|
|
| |
|
NUTCH-374 |
FIXED
|
when http.content.limit is set to -1 and Response.CONTENT_ENCODING is gzip or x-gzip, it cannot fetch anything.
|
|
|
| |
|
NUTCH-675 |
WON'T FIX
|
Reduce tasks do not report their status and are killed by jobtracker
|
|
|
| |
|
NUTCH-543 |
DUPLICATE
|
CLONE -some problem about the Nutch cache
|
|
|
| |
|
NUTCH-581 |
FIXED
|
DistributedSearch does not update search servers added to search-servers.txt on the fly
|
|
|
| |
|
NUTCH-68 |
FIXED
|
A tool to generate arbitrary fetchlists
|
|
|
| |
|
NUTCH-421 |
FIXED
|
Allow predetermined running order of index filters
|
|
|
| |
|
NUTCH-399 |
FIXED
|
Change CommandRunner to use concurrent api from jdk
|
|
|
| |
|
NUTCH-440 |
FIXED
|
Command line utilities should exit with an error message when given wrong arguments
|
|
|
| |
|
NUTCH-226 |
FIXED
|
CrawlDb Filter tool
|
|
|
| |
|
NUTCH-420 |
FIXED
|
DeleteDuplicates.HashPartitioner depends on the order of IndexDocs
|
|
|
| |
|
NUTCH-274 |
FIXED
|
Empty row in/at end of URL-list results in error
|
|
|
| |
|
NUTCH-325 |
FIXED
|
UrlFilters.java throws NPE in case urlfilter.order contains Filters that are not in plugin.includes
|
|
|
| |
|
NUTCH-388 |
FIXED
|
nutch-default.xml has outdated example for urlfilter.order
|
|
|
| |
|
NUTCH-426 |
FIXED
|
parse-js skips parsing if found URL fails java.net.URL parse
|
|
|
| |
|
NUTCH-246 |
FIXED
|
segment size is never as big as topN or crawlDB size in a distributed deployment
|
|
|
| |
|
NUTCH-524 |
WON'T FIX
|
Generate Problem with Single Node
|
|
|
| |
|
NUTCH-390 |
FIXED
|
Javadoc warnings
|
|
|
| |
|
NUTCH-338 |
FIXED
|
Remove the text parser as an option for parsing PDF files in parse-plugins.xml
|
|
|
|
Maintenance release for 0.8 branch
|
|
| |
|
NUTCH-354 |
FIXED
|
MapWritable, nextEntry is not reset when Entries are recycled
|
|
|
| |
|
NUTCH-332 |
FIXED
|
doubling score caused by page internal anchors.
|
|
|
| |
|
NUTCH-336 |
FIXED
|
Harvested links shouldn't get db.score.injected in addition to inbound contributions
|
|
|
| |
|
NUTCH-341 |
FIXED
|
IndexMerger now deletes entire <workingdir> after completing
|
|
|
| |
|
NUTCH-105 |
FIXED
|
Network error during robots.txt fetch causes file to be ignored
|
|
|
| |
|
NUTCH-318 |
FIXED
|
log4j not properly configured, readdb doesn't give any information
|
|
|
| |
|
NUTCH-350 |
FIXED
|
urls blocked db.fetch.retry.max * http.max.delays times during fetching are marked as STATUS_DB_GONE
|
|
|
| |
|
NUTCH-337 |
FIXED
|
Fetcher ignores the fetcher.parse value configured in config file
|
|
|
| |
|
NUTCH-344 |
FIXED
|
Fetcher threads blocked on synchronized block in cleanExpiredServerBlocks
|
|
|
| |
|
NUTCH-462 |
FIXED
|
Noarchive urls are available via the cache link
|
|
|
| |
|
NUTCH-205 |
FIXED
|
Wrong 'fetch date' for non available pages
|
|
|
| |
|
NUTCH-266 |
FIXED
|
hadoop bug when doing updatedb
|
|
|
| |
|
NUTCH-338 |
FIXED
|
Remove the text parser as an option for parsing PDF files in parse-plugins.xml
|
|
|