Friends, Technology, Web2.0 - What I am reading


   

A Preview, and a Roundup of Crawling Issues

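It has been a long time since we released a new product or upgrade, and Xiaoxia (小虾) apologizes to everyone. Here is a preview: we plan to launch a major upgrade within the next two to three weeks (a schedule slip cannot be ruled out *^_^*). The details can't be shared yet, haha.

In the days ahead, Zhuaxia (抓虾) will keep working to improve the reading experience and to build innovative products, and we hope everyone will keep supporting Zhuaxia as always.

---------- divider ---------

People often ask Xiaoxia about slow crawling and about feeds that stop updating, for example why a feed has gone many days without an update. Here Xiaoxia explains, in one place, the likely causes.

1. The ETag has not changed, so the crawl frequency drops. To reduce load and bandwidth on the server being crawled, our spider first tests whether the ETag has changed. If it has, the spider fetches immediately; otherwise the fetch is deferred. If your blog has been updated but its ETag has not, the crawl frequency will drop.

2. Frequent XML parse errors slow crawling down. If the XML file of an RSS feed contains errors, the crawl frequency is lowered.

3. The server is unstable. If HTTP errors occur frequently within a period of time, then once a certain number of consecutive HTTP errors has been reached, the spider automatically lowers the crawl frequency for the target site, or even stops crawling it.

4. Redirects are misconfigured. If you set up 301 or 302 redirects, please be careful and make sure you know exactly what they mean and what they may cause. For example, we once saw a user redirect URL A to URL B and then redirect URL B back to URL A. This created a redirect loop, so the spider stopped automatically after the allowed number of redirects (generally no more than 3) and the crawl failed. We have also seen users set a wrong redirect target, which produced 404 and similar errors. So please take care when setting up redirects. When Zhuaxia's spider meets a 302, it does not change the address used for the next crawl; when it meets a 301, it automatically changes the channel's crawl address.

5. DNS resolution errors. If you are able to, please choose a stable, reliable DNS server.

6. Undefined errors? Wahaha, there are always some baffling errors now and then. If you run into any problem, please email Xiaoxia and Xiaoxia will help you sort it out.

How do you contact Xiaoxia? The same four ways as always! Which four? Who knows, who knows?

If you know, reply to this post and list the four ways. Replies whose sequence number ends in 9 will receive a small prize (limited to 10 people). Choose one of the following: one week in the newcomer trial channel, one appearance in the weekly blog recommendations, or a set of one small Zhuaxia cup and one small badge.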
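The redirect behavior described in item 4 can be illustrated with a small sketch. It is not Zhuaxia's actual spider code: the names `Response`, `CrawlError`, `resolve_feed_url`, and the injected `fetch` callable are hypothetical, and the limit of 3 is taken from the post's "generally no more than 3". The post does not say how mixed chains are handled, so the sketch assumes the stored address is only updated while every hop so far has been a 301.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

MAX_REDIRECTS = 3  # assumption, based on "generally no more than 3" in the post


@dataclass
class Response:
    """Hypothetical, minimal view of an HTTP response."""
    status: int
    location: Optional[str] = None  # redirect target for 301/302


class CrawlError(Exception):
    """Raised when the crawl fails (redirect loop, 404, and so on)."""


def resolve_feed_url(
    stored_url: str,
    fetch: Callable[[str], Response],
    max_redirects: int = MAX_REDIRECTS,
) -> Tuple[str, str]:
    """Follow redirects starting from stored_url.

    Returns (url_fetched_now, url_to_store_for_next_crawl).
    A 301 moves the stored address; a 302 leaves it unchanged.
    """
    current = stored_url
    new_stored = stored_url
    only_permanent_so_far = True  # assumption: policy for mixed chains

    # One initial request plus at most max_redirects follow-ups.
    for _ in range(max_redirects + 1):
        resp = fetch(current)
        if resp.status in (301, 302):
            if resp.location is None:
                raise CrawlError(f"redirect without target at {current}")
            if resp.status == 301 and only_permanent_so_far:
                new_stored = resp.location  # permanent move: remember it
            else:
                only_permanent_so_far = False  # temporary hop: keep old address
            current = resp.location
            continue
        if resp.status >= 400:
            raise CrawlError(f"HTTP {resp.status} at {current}")
        return current, new_stored

    # Still being redirected after the allowed number of hops (e.g. A -> B -> A).
    raise CrawlError(f"more than {max_redirects} redirects from {stored_url}")
```

Because `fetch` is passed in, the function can be tried without a network by backing it with a dictionary of canned responses. When a `CrawlError` is raised the caller never receives a new address, so the stored address stays as it was.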
>>Source Link
>>Blog: Zhuaxia Diary (抓虾日记)
>>Publish Date: 12/16/2007 7:01:43 AM
>>Keywords: url spider





