Friday, November 25, 2005

Google Parser

Download source - 25.2 Kb
Google is definitely one of the most useful websites on the net. I use it every day and, as with other things I use frequently, I wanted to customize it to my own changing needs. As a developer, the first thing that came to mind for automating my searches was a class representing the search engine. When I looked at Google's web site, I saw that they provide a very nice API which is accessible through Web Services.

For some strange reason, I couldn't access the web service from inside the company, behind the firewall... So I had to look into other options. Looking through CodeProject's web site, I found that I am not the only one interested in this subject. There are already some nice articles about it. Stuart Konen provides some nice code in C++, but I was looking for a .NET assembly. Kisilevich Slava provides code with a much better interface and an even more powerful engine, which gets its results from different search engines. But (s)he relies on a separate HTML parser, which makes the code dependent on it, and I wanted to have full control over the parsing of the page.
So, I came up with this parser...
Providing a Google interface involves two challenges. First, you have to write code that establishes a connection, sends your HTTP request and gets the response back. This is very easy at home, where the permissions are granted automatically. At the office there is a proxy which blocks unattended internet access, so the issue is to provide the DefaultCredentials as Internet Explorer would do. Fortunately Mr. Gates has provided an easy solution for this:
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
request.Proxy = WebProxy.GetDefaultProxy();
request.Proxy.Credentials = System.Net.CredentialCache.DefaultCredentials;
Once this is done, you can request anything Internet Explorer can. For more information on DefaultCredentials, see MSDN.
The second challenge was to parse the response. This is the tricky part. Some do this by converting the HTML to XML and then using an XML parser. Others go through COM and use HtmlDocument to access the tags as DOM objects. Others, like me, prefer parsing it with regular expressions. The last approach surely gives the least readable code, but it is fast and elegant.
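To make the regular-expression approach a bit more concrete, here is a minimal sketch that pulls links and their titles out of an HTML fragment. The LinkExtractor class, its pattern and the markup it expects are assumptions made for this example only; they are not the code from the download and they do not match Google's actual HTML.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LinkExtractor
{
    // Hypothetical pattern: matches <a href="...">title</a>.
    // Real result pages use different markup, and it changes over time.
    private static readonly Regex Anchor = new Regex(
        "<a\\s+href=\"(?<link>[^\"]+)\"[^>]*>(?<title>.*?)</a>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns (link, title) pairs in the order they appear in the HTML.
    public static List<KeyValuePair<string, string>> Extract(string html)
    {
        List<KeyValuePair<string, string>> found = new List<KeyValuePair<string, string>>();
        foreach (Match m in Anchor.Matches(html))
        {
            // Strip nested tags such as <b> from the title text.
            string title = Regex.Replace(m.Groups["title"].Value, "<[^>]+>", "");
            found.Add(new KeyValuePair<string, string>(m.Groups["link"].Value, title));
        }
        return found;
    }
}
```

Named groups such as link and title keep the pattern easier to follow than numbered groups, which helps a little with the readability problem mentioned above.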

Describing Google's Search Object
Each result item has the following attributes:
  • link to the website
  • description of the found item
  • some fragment of text having one or more words from the query
  • a link to Cache
  • and a link to Similar pages
The last two are not supported in my demo.
This brings us to our first class, Item, which is constructed as follows:
public Item(string link, string description, string text)
{
    this._Link = link;
    this._Description = description;
    this._Text = text;
}
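For context, a complete Item class built around that constructor might look like the sketch below. The constructor and the three field names come from the snippet above; the read-only property names Link and Text are assumptions. Description has to be a public property, because the demo binds the list box to it through DisplayMember.

```csharp
public class Item
{
    private readonly string _Link;
    private readonly string _Description;
    private readonly string _Text;

    public Item(string link, string description, string text)
    {
        this._Link = link;
        this._Description = description;
        this._Text = text;
    }

    // Read-only properties: data binding works on public properties, not on fields.
    public string Link { get { return _Link; } }
    public string Description { get { return _Description; } }
    public string Text { get { return _Text; } }
}
```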
The Searcher class provides the following properties:
  • Count
  • From
  • To
  • ItemsPerPage
  • Url and
  • Results
Of these properties, only ItemsPerPage is read/write; the rest are read-only.
It has three public methods, all of which initiate a search. The first one, which must also be called first, is the Search method that takes a query parameter.
public void Search(string query)
Once you call this method, the class gets the response and parses the page to fill in Count and Results. Then you can call the subsequent search methods exactly as you would page through results on the Google page.
public void Search(long from)
public void SearchNext()
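One way these three methods could fit together is sketched below. This is not the code from the download: the SearcherSketch class, its private fields and the URL format are assumptions, built on Google's q, num and start query parameters, and no HTTP request is made.

```csharp
using System;

// Hypothetical stand-in for the paging part of Searcher.
public class SearcherSketch
{
    private string _Query = "";
    private long _Start;           // zero-based offset of the first result requested
    public int ItemsPerPage = 10;

    // Assumed URL format: q = query text, num = page size, start = offset.
    public string Url
    {
        get
        {
            return "http://www.google.com/search?q=" + Uri.EscapeDataString(_Query)
                 + "&num=" + ItemsPerPage + "&start=" + _Start;
        }
    }

    // A new query always starts again at the first result.
    public void Search(string query) { _Query = query; _Start = 0; }

    // Jump to an arbitrary offset within the current query.
    public void Search(long from) { _Start = Math.Max(0, from); }

    // Advance by exactly one page.
    public void SearchNext() { Search(_Start + ItemsPerPage); }
}
```

With this arrangement SearchNext is only a convenience over Search(long), so the request and parsing logic would need to live in one place.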

As I was working on this project, I realized that Google's developers change the HTML layout more often than they change the logo on the front page, so I had to keep checking that my code was still working. Then again, they never promised anyone to keep the same HTML layout, which makes this project very unstable. In other words, nothing guarantees that it will keep working. That is why I decided to have a project with full control over the whole code, without using any third-party library. That way, I can modify the code easily to accommodate any changes in the Google HTML format.
The whole parsing process has been split into three sections. First I have to get the counts, and I am doing that in the GetCounts method. Then I get the division of results and parse the items out of it in a loop implemented in GetResults. For each item, I parse the HTML to get its properties, and that happens within ParseItem.
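The three-step split can be sketched as follows. Only the method names GetCounts, GetResults and ParseItem come from the text above; the ParsedPage and PageParser types, the patterns and the layout they expect are made up for illustration and are not Google's real markup.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text.RegularExpressions;

public class ParsedPage
{
    public long From, To, Count;
    // Each entry holds { link, description, text } for one result.
    public List<string[]> Items = new List<string[]>();
}

public static class PageParser
{
    // Every pattern below targets a made-up layout, not Google's real HTML.
    private static readonly Regex CountsPattern = new Regex(
        @"Results\s+<b>(?<from>[0-9][0-9,]*)</b>\s*-\s*<b>(?<to>[0-9][0-9,]*)</b>" +
        @"\s+of\s+about\s+<b>(?<count>[0-9][0-9,]*)</b>");
    private static readonly Regex ResultPattern = new Regex(
        "<div class=\"r\">(?<body>.*?)</div>", RegexOptions.Singleline);
    private static readonly Regex ItemPattern = new Regex(
        "<a href=\"(?<link>[^\"]+)\">(?<desc>.*?)</a>\\s*<span>(?<text>.*?)</span>",
        RegexOptions.Singleline);

    public static ParsedPage Parse(string html)
    {
        ParsedPage page = new ParsedPage();
        GetCounts(html, page);
        GetResults(html, page);
        return page;
    }

    // Step 1: read a summary line such as "Results 1 - 10 of about 1,230".
    private static void GetCounts(string html, ParsedPage page)
    {
        Match m = CountsPattern.Match(html);
        if (!m.Success) return; // no results, or the layout has changed
        page.From = ToLong(m.Groups["from"].Value);
        page.To = ToLong(m.Groups["to"].Value);
        page.Count = ToLong(m.Groups["count"].Value);
    }

    // Step 2: find each result block and hand it to ParseItem.
    private static void GetResults(string html, ParsedPage page)
    {
        foreach (Match m in ResultPattern.Matches(html))
        {
            string[] item = ParseItem(m.Groups["body"].Value);
            if (item != null) page.Items.Add(item);
        }
    }

    // Step 3: split one result block into link, description and text.
    private static string[] ParseItem(string body)
    {
        Match m = ItemPattern.Match(body);
        if (!m.Success) return null;
        return new string[] { m.Groups["link"].Value, m.Groups["desc"].Value, m.Groups["text"].Value };
    }

    // Removes thousands separators before parsing, e.g. "1,230" becomes 1230.
    private static long ToLong(string digits)
    {
        return long.Parse(digits.Replace(",", ""), CultureInfo.InvariantCulture);
    }
}
```

Keeping each pattern next to the one method that uses it means a layout change usually touches a single regular expression rather than the whole parser.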
Using the Code
In my demo I have a form to query Google and show the results in a list box. I have also provided a WebBrowser control to see how it looks within Internet Explorer. There is also a link on the title that opens Internet Explorer outside the application. The code using this class is pretty simple.
private Searcher google = new Searcher();
try
{
    google.ItemsPerPage = nudItemsPerPage.Value;
    google.Search(txtSearch.Text);
    if (google.Results == null) return;
    btnNext.Enabled = true;
}
catch (Exception ex)
{
    // error handling is not shown in this excerpt
}
lstLinks.DataSource = google.Results;
lstLinks.DisplayMember = "Description";

Last Words
There are plenty of ways to parse your HTML code. Depending on the problem you will need to choose one. Working with regular expressions is the most fascinating one and there is undoubtedly a lot to explore in that area.
