<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>SEO India &#187; Web crawler</title>
	<atom:link href="http://www.snvinfotech.com/seo-india/tag/web-crawler/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.snvinfotech.com/seo-india</link>
	<description>People are searching, where are you?</description>
	<lastBuildDate>Mon, 27 Dec 2010 08:34:34 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.3</generator>
		<item>
		<title>Web Page Crawling</title>
		<link>http://www.snvinfotech.com/seo-india/seo-services-2/</link>
		<comments>http://www.snvinfotech.com/seo-india/seo-services-2/#comments</comments>
		<pubDate>Sat, 02 May 2009 13:09:30 +0000</pubDate>
		<dc:creator>SNV InfoTech</dc:creator>
				<category><![CDATA[SEO Updates]]></category>
		<category><![CDATA[seo services]]></category>
		<category><![CDATA[seo services india]]></category>
		<category><![CDATA[Web crawler]]></category>

		<guid isPermaLink="false">http://www.snvinfotech.com/seo-india/?p=215</guid>
		<description><![CDATA[A Web crawler is a computer program that automatically browses the World Wide Web. Web crawlers are also called ants, automatic indexers, bots, worms, Web spiders, Web robots, or wanderers. This process is called Web crawling or spidering. Many sites, in particular search engines, use spidering as a means of providing up-to-date data. From the [...]]]></description>
			<content:encoded><![CDATA[<p>A <strong>Web crawler</strong> is a computer program that automatically browses the World Wide Web. Web crawlers are also called ants, automatic indexers, bots, worms, Web spiders, Web robots, or wanderers.</p>
<p>This process is called Web crawling or spidering. Many sites, in particular search engines, use spidering as a means of providing up-to-date data.<span id="more-215"></span><br />
From the beginning, a key motivation for designing Web crawlers has been to retrieve Web pages and add them or their representations to a local repository. Such a repository may then serve particular application needs such as those of a Web search engine. In its simplest form, a crawler starts from a seed page and then follows the links within it to reach other pages.</p>
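<p>The seed-and-follow idea can be sketched in a few lines of Python. This is only an illustration, not a production crawler: the <code>crawl</code> function, the <code>get_links</code> callback, and the <code>TOY_WEB</code> dictionary are hypothetical names, and the dictionary stands in for real fetching and parsing so the example runs without a network.</p>

```python
from collections import deque

def crawl(seed, get_links, max_pages=100):
    """Breadth-first crawl: visit the seed, then the pages it links to."""
    frontier = deque([seed])   # URLs waiting to be visited
    seen = {seed}              # URLs already queued, to avoid revisits
    order = []                 # URLs in the order they were visited
    while frontier and len(order) < max_pages:
        url = frontier.popleft()
        order.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order

# Hypothetical in-memory "web" standing in for real fetch-and-parse calls.
TOY_WEB = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/b", "http://example.com/c"],
    "http://example.com/b": ["http://example.com/"],
    "http://example.com/c": [],
}

visited = crawl("http://example.com/", lambda url: TOY_WEB.get(url, []))
print(visited)
```

<p>The <code>seen</code> set is what keeps the crawler from looping forever when pages link back to each other, and <code>max_pages</code> gives it a hard stopping point.</p>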
<p>Running a web crawler is a challenging task, with tricky performance and reliability issues.<br />
To scale to hundreds of millions of web pages, Google has a fast distributed crawling system. Each crawler keeps roughly 300 connections open at once, which is necessary to retrieve web pages at a fast enough pace. At peak speeds, the system can crawl over 100 web pages per second using four crawlers, which amounts to roughly 600 KB of data per second, or about 6 KB per page. We have also been able to crawl up to tens of thousands of pages within a few minutes.</p>
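<p>Keeping many connections open at once, but never more than a fixed number, is commonly done with a semaphore. The sketch below uses Python's <code>asyncio</code> for this; it is not how Google's crawler is implemented. <code>crawl_concurrently</code> and <code>FakeFetcher</code> are hypothetical names, and the fetcher only simulates network latency so the example runs offline.</p>

```python
import asyncio

async def crawl_concurrently(urls, fetch, max_connections=300):
    """Fetch all urls, keeping at most max_connections requests in flight."""
    gate = asyncio.Semaphore(max_connections)

    async def bounded_fetch(url):
        async with gate:           # wait here if the limit is reached
            return await fetch(url)

    return await asyncio.gather(*(bounded_fetch(u) for u in urls))

class FakeFetcher:
    """Hypothetical stand-in for a network fetch; records peak concurrency."""
    def __init__(self):
        self.in_flight = 0
        self.peak = 0

    async def __call__(self, url):
        self.in_flight += 1
        self.peak = max(self.peak, self.in_flight)
        await asyncio.sleep(0.001)   # pretend to wait on the network
        self.in_flight -= 1
        return f"body of {url}"

fetcher = FakeFetcher()
urls = [f"http://example.com/page{i}" for i in range(50)]
pages = asyncio.run(crawl_concurrently(urls, fetcher, max_connections=10))
print(len(pages), fetcher.peak)
```

<p>Results come back in the same order as <code>urls</code>, and <code>fetcher.peak</code> never exceeds the chosen limit.</p>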
<p>To fetch a Web page, we need an HTTP client that sends an HTTP request for the page and reads the response. The client needs timeouts so that it does not spend excessive time on slow servers or on reading very large pages.</p>
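<p>A minimal sketch of such a client, using Python's standard <code>urllib.request</code>, might look like the following. The <code>fetch</code> function, its default values, and the <code>opener</code> parameter are assumptions made for illustration. Note that the <code>timeout</code> passed to <code>urlopen</code> applies to individual blocking socket operations, not to the request as a whole, so the byte limit is what bounds the work done on large pages.</p>

```python
import urllib.request

def fetch(url, timeout=10.0, max_bytes=1_000_000, opener=urllib.request.urlopen):
    """Return up to max_bytes of the response body, or None on failure."""
    try:
        with opener(url, timeout=timeout) as response:
            # Reading a bounded number of bytes caps time spent on huge pages.
            return response.read(max_bytes)
    except OSError:
        # URLError and socket timeouts are both OSError subclasses.
        return None
```

<p>Calling <code>fetch</code> with a page URL returns the raw bytes of the page, or <code>None</code> if the server is unreachable or too slow. The <code>opener</code> parameter exists only so the function can be exercised without a network.</p>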
<p>Once a page has been fetched, we need to parse its content to extract information that will feed, and possibly guide, the future path of the crawler. Parsing may involve simple hyperlink/URL extraction, or the more complex process of tidying up the HTML content in order to analyze the HTML tag tree.</p>
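<p>The simpler case, hyperlink extraction, can be sketched with Python's standard <code>html.parser</code> module. <code>LinkExtractor</code> and <code>extract_links</code> are hypothetical names used for illustration, and the sample markup is made up. Relative links are resolved against the URL of the page they came from.</p>

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect absolute URLs from the href attributes of anchor tags."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, value))

def extract_links(html, base_url):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links

sample = '<p><a href="/about">About</a> <a href="http://example.org/">Other</a></p>'
print(extract_links(sample, "http://example.com/index.html"))
```

<p>The list returned by <code>extract_links</code> is what a crawler would add to its queue of pages still to visit.</p>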
<p>Posted by: <strong><a href="http://www.snvinfotech.com" target="_self">SEO Service provider company India</a></strong></p>
]]></content:encoded>
			<wfw:commentRss>http://www.snvinfotech.com/seo-india/seo-services-2/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

