Author: 傻丫丫69_678 | Source: Internet | 2022-12-22 20:56
I need to index a whole lot of webpages, what good webcrawler utilities are there? I'm preferably after something that .NET can talk to, but that's not a showstopper.
What I really need is something that I can give a site URL to, and it will follow every link and store the content for indexing.
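The "follow every link on one site" requirement boils down to extracting hyperlinks from each fetched page and keeping only those that stay on the starting host. A minimal sketch of that step, using only the Python standard library (the function names and sample HTML are illustrative, not from any particular crawler):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkExtractor(HTMLParser):
    """Collects hrefs from <a> tags, resolved to absolute URLs against a base URL."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

def same_site_links(html, base_url):
    """Return only links on the same host as base_url -- the rule a
    single-site spider uses to decide what to crawl next."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    host = urlparse(base_url).netloc
    return [u for u in parser.links if urlparse(u).netloc == host]
```

A real crawler would repeat this breadth-first: fetch a page, store its content for the indexer, queue any same-site links not yet visited.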
6 Answers
Searcharoo.NET contains a spider that crawls and indexes content, and a search engine to use it. You should be able to find your way around the Searcharoo.Indexer.EXE code to trap the content as it's downloaded, and add your own custom code from there...
It's very basic (all the source code is included, and is explained in six CodeProject articles, the most recent of which is Searcharoo v6): the spider follows links, image maps, and images; obeys ROBOTS directives; and parses some non-HTML file types. It is intended for single websites (not the entire web).
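"Obeys ROBOTS directives" means the spider checks the site's robots.txt before fetching a URL. Python's standard library has a parser for exactly this; a small sketch (the robots.txt content and user-agent name here are made up for illustration):

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt content; a crawler would normally download this
# from http://<host>/robots.txt before fetching anything else.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

allowed = rp.can_fetch("MySpider", "http://example.com/index.html")   # True
blocked = rp.can_fetch("MySpider", "http://example.com/private/a")    # False
```

A polite spider calls `can_fetch` for every candidate URL and skips the disallowed ones.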
Nutch/Lucene is almost certainly a more robust/commercial-grade solution - but I have not looked at their code. Not sure what you are wanting to accomplish, but have you also seen Microsoft Search Server Express?
Disclaimer: I am the author of Searcharoo; just offering it here as an option.