Web Robots (2006)

Over the last few years I have become quite proficient at writing web-robots in Delphi, and had an article published on the subject in Delphi Developer magazine.

If you want to write robots, you need to know a bit about the internet. I'll add links in the future, but for now here are a few pointers.

What is a web-robot? Simply, a computer program that can retrieve information from the World Wide Web. It doesn't run on the web; it runs on your PC (or a server) and fetches information much as you or I would when browsing the net.

How does it work? If you don't know, try this: browse to any website you know, say AltaVista, and right-click anywhere on the page except over a graphic. Look for 'View Source' and click it. If it's grayed out, you right-clicked a graphic; try again. Otherwise Notepad should open with a page of HTML code. Everything you saw on the web page is described by this HTML, and with a bit of string processing you can extract the data.
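
If you would rather do the same thing from code, here is a minimal sketch in Delphi using Indy's TIdHTTP component (the URL is just an example; any page will do):

    program FetchPage;
    {$APPTYPE CONSOLE}
    uses
      IdHTTP;
    var
      Http: TIdHTTP;
      Html: string;
    begin
      Http := TIdHTTP.Create(nil);
      try
        // Fetch the raw HTML - the same text you see with View Source
        Html := Http.Get('http://www.altavista.com/');
        WriteLn('Fetched ', Length(Html), ' characters of HTML');
      finally
        Http.Free;
      end;
    end.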

Is that it? Well, pretty much. To have a web-robot walk through a web-site it has to locate the links, which usually look like this: <a href="http://www.pinpub.com">Link to Pinpub</a>. It then extracts the link and follows it. If you want to download graphics, sounds etc., just look for the link and go to that address.
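
Locating the links is ordinary string processing. Here is a rough sketch in Delphi; it assumes double-quoted hrefs, and real pages are messier than that (single quotes, relative URLs, odd spacing), so treat it as a starting point rather than a parser:

    program LinkList;
    {$APPTYPE CONSOLE}
    uses
      Classes, SysUtils, StrUtils, IdHTTP;

    // Collect everything between href=" and the next quote
    function ExtractLinks(const Html: string): TStringList;
    var
      Upper: string;
      P, Q: Integer;
    begin
      Result := TStringList.Create;
      Upper := UpperCase(Html);           // case-insensitive search
      P := Pos('HREF="', Upper);
      while P > 0 do
      begin
        Q := PosEx('"', Html, P + 6);     // the closing quote
        if Q = 0 then
          Break;
        Result.Add(Copy(Html, P + 6, Q - P - 6));
        P := PosEx('HREF="', Upper, Q);   // on to the next link
      end;
    end;

    var
      Http: TIdHTTP;
      Links: TStringList;
      I: Integer;
    begin
      Http := TIdHTTP.Create(nil);
      try
        Links := ExtractLinks(Http.Get('http://www.pinpub.com/'));
        try
          for I := 0 to Links.Count - 1 do
            WriteLn(Links[I]);
        finally
          Links.Free;
        end;
      finally
        Http.Free;
      end;
    end.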

What about forms? Well, yes, a web-robot has to be able to navigate forms too: the electronic equivalent of clicking the 'Submit' button. Where it gets slightly complicated is handling the difference between 'Get' and 'Post'. Excuse me? If you click a submit button and end up at a URL that looks like www.awebsite.com/search.htm?value=100, the bit after the ? is a parameter called value with an actual value of 100; you can see it because the form uses a get. If you don't see a ? or anything after the URL (but you filled in a form), then it will have used a post.
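
In Delphi the two look like this with TIdHTTP (the site and the value field are made up to match the example above):

    program GetAndPost;
    {$APPTYPE CONSOLE}
    uses
      Classes, IdHTTP;
    var
      Http: TIdHTTP;
      Fields: TStringList;
      Html: string;
    begin
      Http := TIdHTTP.Create(nil);
      Fields := TStringList.Create;
      try
        // GET: the parameters ride along in the URL after the ?
        Html := Http.Get('http://www.awebsite.com/search.htm?value=100');

        // POST: the same parameters travel in the request body,
        // which is why you never see them in the address bar
        Fields.Add('value=100');
        Html := Http.Post('http://www.awebsite.com/search.htm', Fields);
        WriteLn('Post returned ', Length(Html), ' characters');
      finally
        Fields.Free;
        Http.Free;
      end;
    end.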

If you want to see this for yourself, you will need access to a web server. Save the HTML source of a page with a form (in IE: right-click on the page, then View Source), look for the word 'post' and change it to 'get', and also change the action="url" attribute to action="". Now save it out on to the web server and browse to it: when you submit the form, the parameters will show up after a ? in the URL.

So what would you use a web-robot for?

When I was at Homedirectory I was asked to obtain the number of properties on a rival website. As I didn't want a big clue pointing back to us (like our IP address), I obtained a list of 600 anonymous proxies and, from our own website logs, culled 500 different user agents. As the site in question had a postcode search which returned the number of properties in a postal district (there are ~2,800 in the UK), I processed the list in random order, using a random user agent (this is the field in which your browser identifies itself to the server) and a randomly picked proxy. It took about 5 hours to run and returned 72,000, which was a lot less than that site claimed. Interestingly enough, after just 120 of the 2,800 districts had been scanned, the estimate was already 70,000 properties. I should have paid more attention when studying stats (sample sizes needed) at university.
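
The mechanics of the disguise are straightforward with TIdHTTP, which lets you set the proxy and the user agent per request. Something along these lines, where the three input files are invented for the sketch:

    program PostcodeScan;
    {$APPTYPE CONSOLE}
    uses
      Classes, SysUtils, IdHTTP;
    var
      Proxies, Agents, Districts: TStringList;
      Http: TIdHTTP;
      Proxy: string;
      I, P: Integer;
    begin
      Randomize;
      Proxies := TStringList.Create;
      Agents := TStringList.Create;
      Districts := TStringList.Create;
      Http := TIdHTTP.Create(nil);
      try
        Proxies.LoadFromFile('proxies.txt');      // host:port, one per line
        Agents.LoadFromFile('agents.txt');        // user agents culled from the logs
        Districts.LoadFromFile('districts.txt');  // the ~2,800 postal districts
        for I := 0 to Districts.Count - 1 do
        begin
          // Pick a fresh disguise for every request
          Proxy := Proxies[Random(Proxies.Count)];
          P := Pos(':', Proxy);
          Http.ProxyParams.ProxyServer := Copy(Proxy, 1, P - 1);
          Http.ProxyParams.ProxyPort := StrToInt(Copy(Proxy, P + 1, MaxInt));
          Http.Request.UserAgent := Agents[Random(Agents.Count)];
          // ...fetch the search page for Districts[I] here and
          // string-process the property count out of the HTML...
        end;
      finally
        Http.Free;
        Districts.Free;
        Agents.Free;
        Proxies.Free;
      end;
    end.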

Out of curiosity, I ran the same query against our own site directly, without the proxies. It took an hour to run and, for the duration of that hour, tripled our site traffic!

If you would like me to create this type of web robot for you commercially, please contact me for a quote.

Permission given to reprint/use on the web so long as it includes a link to my website.