Writing a web-robot in Delphi (2001)
In this article, I'll show you how to create a simple Web robot that runs multiple parallel searches on a search engine, visits each Web site in the results, and downloads the page. It uses the ActiveX components provided by Internet Explorer 4 and later.
Caveat: the code as originally written worked with AltaVista, but that site has probably changed a dozen times since, so it's about as likely to work as you are to bicycle up Mt. Everest! Copernicus (www.copernic.com) is an awesome (and free) search-engine searcher, and they issue upgrades for specific engines on a regular basis. If you want to write one, play with Copernicus. I rate it 11 out of 10. (No, I have no financial or other connection to them; I'm just a very, very satisfied customer.)
Although it sounds exotic, a bot (also known as a spider, intelligent agent, Web robot, crawler, robot, and so on) is simply a program that visits a number of Web sites. The best-known bots are, of course, the spiders used by various search engines to catalog new content. Take a look on the Web and you'll find lots of references and details. There's even a book on the subject, published by Microsoft Press: Programming Bots, Spiders and Intelligent Agents in Visual C++, by David Pallmann (ISBN 0-7356-0565-3). It's well worth getting if you're interested in writing bots and you don't mind wading through C++ code.
When you create a bot, you should be aware that material your bot gathers from the sites it visits may well be copyrighted, so be careful how you use it. Another thing to keep in mind: if your bot visits a Web site repeatedly, it might upset the site's owners, particularly if they carry paid advertising. (For a similar reaction, just mention ad-skipping video recorders or TiVo to advertising people.) Of course, if your bot does hammer a particular Web site and gets noticed, you might find that your IP address is no longer allowed access to that site (the dreaded 403!). In that case, a dialup account where the ISP gives you a dynamic IP address is probably a much better idea. I'll discuss the Robot Exclusion Standard later in this respect.
The major problem with rolling your own bot isn't writing the code; it's the speed of your Internet link. For serious crawling, you need a broadband link, not dialup!
Microsoft has made life a lot easier for bot creators (and for virus and trojan authors!) through its usual practice of including a couple of ActiveX browsing objects in Internet Explorer (IE) since version 4. This reusable "engine" approach is actually to be admired, if only it weren't misused so much! If you use these objects, they take care of 99 percent of the difficult stuff, such as Internet access, firewalls, and using HTTP to download the pages' HTML. IE has a lot of functionality built in, and much of it is accessible. IE 3 had some objects in it, but I'm not sure whether they're usable in the same way.
If you're an ardent IE hater, take heart! You don't have to betray your principles or skip this article. When you use IE's objects, you never actually see IE; it's fully integrated into Windows.
WebBrowser is the name of the ActiveX object from IE. With Delphi 3, if you have IE installed on your PC, you must create the type library unit: go to Import ActiveX Controls in Delphi, select Microsoft Internet Controls, and click Install. You should now see TWebBrowser_V1, TWebBrowser, and TShellFolderViewOC on the ActiveX tab of the component palette. We'll be using TWebBrowser. Delphi 4 presents a problem due to changes in ActiveX handling between Delphi versions 3 and 4: a program that ran fine under Delphi 3 generates an EOleSysError under Delphi 4, "CoInitialize not Called." The type library Pascal source for the control is twice the size in Delphi 5 that it was in Delphi 3. If you have Delphi 4, I suggest you either upgrade to Delphi 5 or find someone who has it and see whether the Delphi 5 shdocvw.pas works for you. All of the IE object functionality is contained in shdocvw.dll.
If you have Delphi 5 or 6, you don't need to do this. TWebBrowser has replaced the older THTML component and is the last component on the Internet tab. You also get a demo of its use in the folder Demos/Coolstuff, and the Help file has some material on TWebBrowser, but curiously, Borland hasn't added the TWebBrowser_V1 component, even though it's in the source file. If you want to find out more about using TWebBrowser or TWebBrowser_V1, go to www.microsoft.com and do a search, or get Delphi 5 for the Help file!
TWebBrowser is a very easy component to work with. About half of its properties can be ignored, as they control the onscreen look of the visible IE interface, such as toolbars or full-screen display. The Visible property determines whether we can see the browser window. In the final application, the user will never see it, but it can be useful for debugging.
The simplest way of using WebBrowser is to call the Navigate(URL) method, handle the OnNavigateComplete2 event, and use the Document property to access the downloaded page. Although there are two other events, OnDocumentComplete and OnDownloadComplete, that should help determine whether a Web page was successfully downloaded, I found it easier to process everything from OnNavigateComplete2. This only triggers when the browser has successfully moved to the specified URL; however, it's confused by multiple frames, so some extra care has to be taken, as you'll see.
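A minimal sketch of this pattern might look like the following. The form and field names (TMainForm, fWebBrowser) are assumptions for illustration, not the article's actual source:

```pascal
uses SHDocVw, MSHTML, Variants;

procedure TMainForm.StartFetch(const URL: string);
begin
  // Hook the event first, then kick off the download; everything
  // else happens when OnNavigateComplete2 fires.
  fWebBrowser.OnNavigateComplete2 := WebBrowserNavigateComplete2;
  fWebBrowser.Navigate(URL);
end;

procedure TMainForm.WebBrowserNavigateComplete2(ASender: TObject;
  const pDisp: IDispatch; var URL: OleVariant);
var
  Doc: IHTMLDocument2;
begin
  // The Document property is only meaningful once navigation
  // has completed; it may still be empty for a failed fetch.
  if not VarIsEmpty(fWebBrowser.Document) then
  begin
    Doc := fWebBrowser.Document as IHTMLDocument2;
    // ... scan Doc for result links here ...
  end;
end;
```

Note that Navigate returns immediately; the program is event-driven from this point on.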
WebBrowser provides several properties that simplify the task of extracting data, including links, anchors, applets, forms, frames, style sheets, and a few more. The only problem, especially when frames are used, is sorting the wheat from the chaff: which links are valid, and which might be ads or other services? In that case, the only way to do it is to scan the HTML and extract the relevant information. As each page is downloaded, it can be accessed directly.
As you might know, many search engines don't index frames. An HTML document can consist of one or more frames. Which frame holds the stuff you want? It's usually the first (and often only) page. The only reliable way is to walk through each page looking for text and then search that text for the strings that identify results. A much greater problem is the multiple triggering of the various DocumentComplete, DownloadComplete, and NavigationComplete events. Much debugging, cursing, and hair removal occurred before I realized what was happening. Ignore DownloadComplete; instead, use either DocumentComplete or NavigationComplete. After some more experiments, I found the best way was to use one of them and check whether a document was ready with VarIsEmpty(fWebBrowser.Document). Then get the document from the browser (Document := fWebBrowser.Document), count the number of frames, and index through them. Frame 0 uses the script.top, while other frames use frames.Item(index); note the Basic-style array indexing. From each frame, check document.body and extract the actual text from it. Be aware that CreateTextRange will raise an exception if there's no text, as in a banner object; hence the try-except to catch it. At this point, we have the complete HTML code, and all we do next is extract the results and navigation links from it.
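The frame walk just described might be sketched like this. The class and field names follow the article's components, but the exact COM plumbing here is my assumption:

```pascal
uses SHDocVw, MSHTML, Classes, SysUtils, Variants;

// Walk every frame of the current document, collect each frame's
// text, and skip text-less frames (banners) via the exception
// handler, as described above.
procedure TSearchEngine.ExtractAllText(Lines: TStrings);
var
  Doc: IHTMLDocument2;
  FrameDoc: IHTMLDocument2;
  Window: IHTMLWindow2;
  Range: IHTMLTxtRange;
  Index: OleVariant;
  I: Integer;
begin
  if VarIsEmpty(fWebBrowser.Document) then
    Exit; // nothing loaded yet
  Doc := fWebBrowser.Document as IHTMLDocument2;
  for I := 0 to Doc.frames.length - 1 do
  begin
    Index := I;
    // frames.item returns a variant holding the frame's window
    Window := IUnknown(Doc.frames.item(Index)) as IHTMLWindow2;
    FrameDoc := Window.document;
    try
      // createTextRange raises an exception when the frame has no
      // text body (e.g. a banner object), so just skip those.
      Range := (FrameDoc.body as IHTMLBodyElement).createTextRange;
      Lines.Add(Range.text);
    except
      // no text in this frame; ignore it
    end;
  end;
end;
```

For a frameless page, frames.length is 0 and the top-level Doc.body can be processed the same way.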
When I was testing this, I used the word Options on AltaVista. The page that was returned contained 85 links according to WebBrowser (the document.links.Items(Index) property is used). Of these, just 10 were results; the rest came from ads, banners, and the like, as well as the navigation links. The layout is different for each search engine's results, and an HTML-analysis object would make a good subject for a future article. I've stuck with AltaVista, as other search engines lay things out in their own way. To keep the code short, I've used two text strings, "AltaVista found" and "Result Pages", to mark the start and stop of the result links. All URLs (look for "href=") that occur between these two strings and don't contain the text defined in the ignore text (like "jump.altavista") are used.
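A standalone sketch of that marker-based extraction, under the assumption that the markers and ignore text are passed in as plain strings (the function name and parameters are mine, and this version is case-sensitive, unlike real-world HTML):

```pascal
uses Classes, SysUtils;

// Collect every href value found between StartMark and StopMark,
// skipping any URL that contains IgnoreText. Returns the number
// of links added to Links.
function ExtractResultLinks(const Html, StartMark, StopMark,
  IgnoreText: string; Links: TStrings): Integer;
var
  P, StartPos, StopPos, QuotePos: Integer;
  Work, Url: string;
begin
  Result := 0;
  StartPos := Pos(StartMark, Html);
  StopPos := Pos(StopMark, Html);
  if (StartPos = 0) or (StopPos = 0) or (StopPos <= StartPos) then
    Exit; // markers missing or out of order: no result section
  // Work only on the slice between the two markers.
  Work := Copy(Html, StartPos, StopPos - StartPos);
  P := Pos('href="', Work);
  while P > 0 do
  begin
    Delete(Work, 1, P + Length('href="') - 1);
    QuotePos := Pos('"', Work);
    if QuotePos = 0 then
      Break; // unterminated attribute; stop scanning
    Url := Copy(Work, 1, QuotePos - 1);
    if Pos(IgnoreText, Url) = 0 then
    begin
      Links.Add(Url);
      Inc(Result);
    end;
    P := Pos('href="', Work);
  end;
end;
```

It would be called with the article's AltaVista markers, e.g. ExtractResultLinks(Html, 'AltaVista found', 'Result Pages', 'jump.altavista', Links).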
Everything centers on three components: TSearchEngine, TFetchResult, and TResultHandler. A list of TSearchEngines is constructed at program start, with each object holding the details needed to use that engine plus an instance of the WebBrowser object. The search string is passed to this list, and each component then starts a query with its own engine. Result links are returned to a central list, taking care to serialize access to this list through a simple guard variable (fBusy) so two list additions can't occur at the same time. If there were just one operation being done, this could be avoided, but there's a list search as well, and that takes time, so the guard variable must be used.
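The guard could be as simple as the following sketch (method and field names are assumptions). This works only because everything runs in the single VCL thread, where re-entrancy comes from ProcessMessages rather than true concurrency:

```pascal
uses Classes, Forms;

// Serialize access to the shared results list with the fBusy flag:
// only one caller at a time may search or extend the list.
procedure TResultHandler.AddLink(const Url: string);
begin
  while fBusy do
    Application.ProcessMessages; // wait until the list is free
  fBusy := True;
  try
    if fLinks.IndexOf(Url) = -1 then // the slow list search
      fLinks.Add(Url);
  finally
    fBusy := False; // always release the guard
  end;
end;
```

With real worker threads, a TCriticalSection would replace the flag.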
For a typical search engine like AltaVista, the query is processed by a cgi-bin script called query, with various parameters added. For the search string "search text" (quotes included), it looks something like this: pg=q&kl=en&q=%22search text%22&stq=20, which I read as pg=q (it's a query), kl=en (language is English), and stq=20 (results start with the 20th result).
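Building that query string can be sketched as below; the host and script path are my assumption of AltaVista's 2001 layout, and the parameter names come from the example above:

```pascal
uses SysUtils;

// Build an AltaVista-style query URL for a quoted search phrase,
// starting at result number StartAt.
function BuildQueryURL(const SearchText: string; StartAt: Integer): string;
var
  Encoded: string;
begin
  // %22 is the URL-encoded double quote; a real robot would also
  // encode spaces and other reserved characters in SearchText.
  Encoded := '%22' + SearchText + '%22';
  Result := 'http://www.altavista.com/cgi-bin/query?pg=q&kl=en&q=' +
    Encoded + '&stq=' + IntToStr(StartAt);
end;
```

NavigateNextResultPage would simply call this again with StartAt bumped by the page size.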
The class TSearchEngine has methods PostQuestion, ExtractResultLinks, and NavigateNextResultPage, and sets the WebBrowser event handlers to act accordingly. Most of the application's time is spent doing nothing more than waiting for the WebBrowser events to trigger. I've included a simple state mechanism so that users of this class can determine what's happening, and the class can keep track of what it's supposed to be doing.
For each result link found, a TFetchResult component is created and added to a list of fetches. Every instance of this class has its own WebBrowser component and event-handler code. I used a list to simplify tracking all the fetches. I use a timeout period (default 240 seconds), and every 30 seconds the entire list of fetches is scanned and checked for timeouts. One global timer is lighter on Windows resources than a timer for each object. As it's also difficult to determine exactly when a page has been fully downloaded, this timeout provides a tidier way of doing it.
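The 30-second sweep might look like this sketch, assuming each TFetchResult records its start time and exposes a Cancel method (both my assumptions):

```pascal
uses Classes, SysUtils;

// Fired every 30 seconds by one global TTimer: scan all outstanding
// fetches and drop any that have exceeded the timeout.
procedure TMainForm.TimeoutTimerTimer(Sender: TObject);
var
  I: Integer;
  Fetch: TFetchResult;
begin
  for I := fFetches.Count - 1 downto 0 do // downto: we delete as we go
  begin
    Fetch := TFetchResult(fFetches[I]);
    // TDateTime is measured in days, so scale the elapsed time
    // to seconds before comparing against the timeout.
    if (Now - Fetch.StartTime) * 86400.0 > fTimeoutSecs then
    begin
      Fetch.Cancel;      // stop the WebBrowser navigation
      fFetches.Delete(I);
      Fetch.Free;
    end;
  end;
end;
```

Iterating downwards keeps the remaining indexes valid after each Delete.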
If a fetch succeeds, the HTML contents of the document are saved to a results folder. I haven't included the graphics, in order to keep the code shorter. The filename is derived from the frame name after removing unacceptable characters.
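A simple sanitizer for that filename step might look like this (the function name and fallback name are mine; the forbidden set is the characters Windows rejects in filenames):

```pascal
// Derive a safe filename from a frame name by dropping characters
// that Windows does not allow in filenames.
function SafeFileName(const FrameName: string): string;
const
  Forbidden: set of Char = ['\', '/', ':', '*', '?', '"', '<', '>', '|'];
var
  I: Integer;
begin
  Result := '';
  for I := 1 to Length(FrameName) do
    if not (FrameName[I] in Forbidden) then
      Result := Result + FrameName[I];
  if Result = '' then
    Result := 'untitled'; // frame name was empty or all forbidden
  Result := Result + '.html';
end;
```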
In a sense, this is a multithreaded application with a fair degree of parallelism, although there are no threads used explicitly. Each of the WebBrowser components is used independently, one in each TSearchEngine and one in each TFetchResult. There seems to be no upper limit to the number of parallel searches, although the overall speed is obviously subject to the bandwidth of the Internet link.
If you were writing a retail product, you'd probably create a config file with some way of defining the query format for each engine it can handle, as well as how to extract the result links. Every search engine produces results in a different format and might well change that format whenever it likes.
There's also a high probability that two or more searches will yield the same URL. I get around that by searching the list of current fetches. (In a real product I'd suggest a separate list, since this one only holds fetches currently in progress.) Before fetching the results from a particular URL, a quick search is done of this list to make sure that the URL hasn't already been fetched. If it hasn't, the URL is added to the list and the fetch proceeds.
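The check-then-fetch step can be sketched as follows; fFetchedUrls here is the suggested separate TStringList, so URLs stay recorded even after their fetches complete (all names are assumptions):

```pascal
uses Classes;

// Start a fetch only if this URL has not been seen before.
// Returns True if a new fetch was started.
function TMainForm.StartFetchIfNew(const Url: string): Boolean;
begin
  Result := fFetchedUrls.IndexOf(Url) = -1;
  if Result then
  begin
    fFetchedUrls.Add(Url);                 // remember it permanently
    fFetches.Add(TFetchResult.Create(Url)); // and begin downloading
  end;
end;
```

Setting fFetchedUrls.Sorted := True would make the IndexOf lookup a binary search, which matters once the list grows.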
As always with search engines, you need to choose your search text carefully to avoid multi-million-hit results. A limit, hard-coded at 100, is set for each TSearchEngine.
One last point: be careful with Application.ProcessMessages when you use WebBrowser. I've avoided it except where fetches are added, which waits if the resultProcessor is busy. I don't know if it's the way events are sent, but I found that setting a label and calling Application.ProcessMessages could force the same event to happen again. This occurred more in the debugger, but if you get odd behavior, comment those calls out.
I think this code is somewhat rough and ready, but it works very well. Possibly the least satisfactory part is detecting which frame has text; I found no way to do it other than exception catching, so if you run it in the debugger, expect a few trapped exceptions. It's also not a very polished application; there could be a better user interface, for example. But it does what I intended: it performs several parallel searches and downloads at the same time.
Robots not welcome
A standard called the Robot Exclusion Standard has emerged, which Web sites can use to specify directory trees that shouldn't be indexed by spiders, because the pages change frequently or contain executables or non-text files.
I read that the latest estimates suggest there are more than 800 million Web pages in existence, and that the search engines combined have indexed fewer than half of the total. The biggest search engines have indexes covering only 150-200 million pages. So anything that limits the "noise" is welcome, and one way of doing this is through this standard. Just place a robots.txt text file in the root, for example www.altavista.com/robots.txt, like the following:
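The example file did not survive in this copy of the article; a minimal robots.txt matching the description below would be:

```
User-agent: *
Disallow: /cgi-bin/
```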
This stipulates that no Web robot should look in /cgi-bin. Of course, your robots don't have to heed it; it's meant mainly for Web spiders. But it's bad manners to ignore it, and it makes sense to go along with it unless you really need to see what's there.
How, you might wonder, do Web-site people know when a robot has visited? Quite simply: it's the UserAgent name in the header passed to the Web server. This is most often Mozilla, but it might be Scooter (the AltaVista spider) or others. Mozilla (for "Mosaic killer") was the working name for Netscape, and IE has also adopted it. It might be possible to change this in the fifth parameter of the TWebBrowser Navigate method, which holds additional headers. Or you could write your own HTTP interface instead of using IE and then specify any name you like. Given that what we're doing might not be very popular with search engines, it's probably best to do it behind the anonymity of a Mozilla browser! Additionally, you might want to add a 10-second delay between jumping pages; that way, it will at least look like a human operator and not a program.

Source Code (TBD)
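For the curious, passing extra headers through Navigate's fifth parameter might be sketched like this. Whether IE actually honors a User-Agent override for every request is not guaranteed; treat it as an experiment:

```pascal
uses SHDocVw;

// Navigate to Url while attempting to supply a custom UserAgent
// via the additional-headers parameter of Navigate.
procedure TSearchEngine.NavigateAs(const Url, AgentName: string);
var
  Flags, TargetFrame, PostData, Headers: OleVariant;
begin
  Flags := 0;
  TargetFrame := '';
  PostData := '';
  Headers := 'User-Agent: ' + AgentName + #13#10;
  fWebBrowser.Navigate(Url, Flags, TargetFrame, PostData, Headers);
end;
```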