
Nutch General Workflow

2014-04-09 17:26
A Nutch crawl runs as a sequence of batch operations:

1. Inject -> Populates CrawlDB from seed list

2. Generate -> Selects URLs to fetch in segment

3. Fetch -> Fetches URLs from segment

4. Parse -> Parses content (text + metadata)

5. UpdateDB -> Updates CrawlDB (new URLs, new status...)

6. InvertLinks -> Builds the web graph (inverted links, stored in the LinkDB)

7. SOLR Index -> Send docs to SOLR

8. SOLR Dedup -> Remove duplicate docs based on signature

Repeat steps 2 to 8

Or use the all-in-one crawl script
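
To make the step order concrete, here is a minimal Python sketch that only builds the command lines for one crawl round and does not run anything. It assumes a Nutch 1.x style `bin/nutch` launcher with the subcommand names used in the Nutch tutorial of that era; the paths (`crawl`, `urls`), the Solr URL, the segment name, and the helper names `inject_command` and `round_commands` are hypothetical, so check them against your own installation and version.

```python
from typing import List

# Assumed launcher path, relative to the Nutch installation directory.
NUTCH = "bin/nutch"


def inject_command(crawl_dir: str, seed_dir: str) -> List[str]:
    """Step 1: run once to populate the CrawlDB from the seed list."""
    return [NUTCH, "inject", f"{crawl_dir}/crawldb", seed_dir]


def round_commands(crawl_dir: str, segment: str, solr_url: str) -> List[List[str]]:
    """Steps 2 to 8 for a single crawl round, in execution order.

    `segment` is the directory that the generate step creates under
    <crawl_dir>/segments. In a real run its name is only known after
    generate has finished, so it is passed in here as an assumption.
    """
    crawldb = f"{crawl_dir}/crawldb"
    linkdb = f"{crawl_dir}/linkdb"
    segments = f"{crawl_dir}/segments"
    return [
        [NUTCH, "generate", crawldb, segments],          # 2. select URLs into a segment
        [NUTCH, "fetch", segment],                       # 3. fetch URLs from the segment
        [NUTCH, "parse", segment],                       # 4. parse text + metadata
        [NUTCH, "updatedb", crawldb, segment],           # 5. new URLs, new status
        [NUTCH, "invertlinks", linkdb, "-dir", segments],  # 6. inverted links
        [NUTCH, "solrindex", solr_url, crawldb, "-linkdb", linkdb, segment],  # 7. send docs
        [NUTCH, "solrdedup", solr_url],                  # 8. remove duplicates
    ]


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    print(" ".join(inject_command("crawl", "urls")))
    for cmd in round_commands("crawl", "crawl/segments/20140409172600",
                              "http://localhost:8983/solr"):
        print(" ".join(cmd))
```

To repeat steps 2 to 8, call `round_commands` again with the segment produced by the next generate step; the inject command is not repeated.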