Friday, November 25, 2005

Google Parser



Download source - 25.2 Kb
Introduction
Google is definitely one of the most useful websites on the net. I use it every day and, like other things I use frequently, I wanted to customize it to my changing needs. As a developer, the first thing that came to mind for automating my activities was a class representing this search engine. When I looked at their web site, I saw that they provide a very nice API that is accessible through Web Services.


For some strange reason, I couldn't access the web service from inside the company, behind the firewall, so I had to look into other options. Looking at CodeProject's web site, I found that I am not the only one interested in this subject; there have already been some nice articles written about it. Stuart Konen provides nice code in C++, but I was looking for a .NET assembly. Kisilevich Slava provides code with a much better interface and an even more powerful engine that gets its results from different search engines. But (s)he uses another HTML parser, which makes the code dependent on it, and I wanted to have full control over the parsing of the page.
So, I came up with this parser...
Challenges
Providing a Google interface involves two challenges. First, you have to write code that establishes a connection, sends your HTTP request and gets the response back. This is very easy at home, where access is granted automatically. At the office, however, there is a proxy that blocks unauthenticated internet access, so the issue is to supply the DefaultCredentials as Internet Explorer would. Fortunately Mr. Gates has provided an easy solution for this:
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
request.Proxy = WebProxy.GetDefaultProxy();
request.Proxy.Credentials = System.Net.CredentialCache.DefaultCredentials;
Once this is done, you can request anything Internet Explorer can. For more information on DefaultCredentials, see MSDN.
The second challenge was to parse the response. This is the tricky part. Some do this by converting the HTML to XML and then using XML parsers. Others use COM and HtmlDocument to access the tags as DOM objects. Still others, like me, prefer parsing it with regular expressions. The last approach surely produces the least readable code, but it is fast and elegant.
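As a rough illustration of the regular-expression approach, the sketch below pulls the link and title out of anchor tags. The sample markup, the LinkExtractor class and the pattern are all assumptions made for this example; they are not Google's actual markup and not the code from the download.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class LinkExtractor
{
    // Matches <a ... href="URL" ...>TITLE</a>; the title may contain nested tags such as <b>.
    private static readonly Regex Anchor = new Regex(
        "<a\\s[^>]*?href=\"(?<url>[^\"]+)\"[^>]*>(?<title>.*?)</a>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Matches any single tag, so it can be stripped from a fragment.
    private static readonly Regex Tag = new Regex("<[^>]+>");

    // Returns (url, plain-text title) pairs in document order.
    public static List<KeyValuePair<string, string>> Extract(string html)
    {
        List<KeyValuePair<string, string>> links = new List<KeyValuePair<string, string>>();
        foreach (Match m in Anchor.Matches(html))
        {
            string title = Tag.Replace(m.Groups["title"].Value, "");
            links.Add(new KeyValuePair<string, string>(m.Groups["url"].Value, title));
        }
        return links;
    }
}

public static class LinkExtractorDemo
{
    public static void Main()
    {
        // Made-up markup; real result pages look different and change often.
        string html =
            "<p><a class=\"l\" href=\"http://example.com/a\">First <b>hit</b></a></p>" +
            "<p><a href=\"http://example.com/b\">Second hit</a></p>";

        foreach (KeyValuePair<string, string> link in LinkExtractor.Extract(html))
            Console.WriteLine(link.Key + " -> " + link.Value);
    }
}
```

The pattern is deliberately narrow: it expects double-quoted href values and would need adjusting for any other markup.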

Describing Google's Search Object
Item
Each result item has the following attributes:
  • link to the website
  • description of the found item
  • some fragment of text having one or more words from the query
  • a link to Cache
  • and a link to Similar pages
Of these, the last two are not supported in my demo.
This brings us to our first class Item which gets constructed as follows:
public Item(string link, string description, string text)
{
    this._Link = link;
    this._Description = description;
    this._Text = text;
}
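For context, here is a minimal sketch of how the surrounding class might look. The constructor and the field names come from the snippet above, and a Description property is implied by the data binding shown later; the Link and Text property names and the ToString override are assumptions.

```csharp
using System;

public class Item
{
    private readonly string _Link;
    private readonly string _Description;
    private readonly string _Text;

    public Item(string link, string description, string text)
    {
        this._Link = link;
        this._Description = description;
        this._Text = text;
    }

    // Read-only properties (names other than Description are assumed).
    public string Link { get { return _Link; } }
    public string Description { get { return _Description; } }
    public string Text { get { return _Text; } }

    // Assumed convenience override, handy when printing an item.
    public override string ToString()
    {
        return _Description + " (" + _Link + ")";
    }
}

public static class ItemDemo
{
    public static void Main()
    {
        Item item = new Item("http://example.com/", "Example Domain", "This domain is for use in examples.");
        Console.WriteLine(item);
    }
}
```

Exposing the values as public properties rather than public fields matters for Windows Forms data binding, which resolves a DisplayMember name against properties.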
Searcher
The Searcher class provides the following properties:
  • Count
  • From
  • To
  • ItemsPerPage
  • Url and
  • Results
Of these properties, only ItemsPerPage is read/write; the rest are read-only.
It has three public methods, all of which initiate a search. The first, which must also be called first, is the Search method that takes a query:
public void Search(string query)
Once you call this method, the class gets the response and parses the page to obtain Count and Results. Then you can call the subsequent search methods exactly as you would on the Google page.
public void Search(long from)
public void SearchNext()
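The paging behind these methods can be sketched as follows. This is only an illustration under stated assumptions: the PageCursor class is hypothetical, From is taken to be 1-based, and the URL uses the familiar q, num and start query-string parameters. No request is actually sent.

```csharp
using System;

// Hypothetical stand-in for the paging part of Searcher.
public class PageCursor
{
    private readonly string query;
    private readonly long count;    // total hits, as parsed from the first response
    private long from = 1;          // 1-based index of the first item on the page (assumed)
    private int itemsPerPage = 10;

    public PageCursor(string query, long count)
    {
        this.query = query;
        this.count = count;
    }

    public long Count { get { return count; } }
    public long From { get { return from; } }
    public long To { get { return Math.Min(from + itemsPerPage - 1, count); } }
    public int ItemsPerPage { get { return itemsPerPage; } set { itemsPerPage = value; } }

    // The request URL for the current page; start is zero-based.
    public string Url
    {
        get
        {
            return "http://www.google.com/search?q=" + Uri.EscapeDataString(query)
                 + "&num=" + itemsPerPage
                 + "&start=" + (from - 1);
        }
    }

    // Jumps to the page that starts at the given item.
    public void Search(long newFrom)
    {
        if (newFrom < 1 || newFrom > count)
            throw new ArgumentOutOfRangeException("newFrom");
        from = newFrom;
    }

    // Moves to the page after the current one; throws if already on the last page.
    public void SearchNext()
    {
        Search(To + 1);
    }
}

public static class PageCursorDemo
{
    public static void Main()
    {
        PageCursor cursor = new PageCursor("static events", 25);
        Console.WriteLine(cursor.From + " - " + cursor.To);   // 1 - 10
        cursor.SearchNext();
        cursor.SearchNext();
        Console.WriteLine(cursor.From + " - " + cursor.To);   // 21 - 25
        Console.WriteLine(cursor.Url);
    }
}
```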

Parsing
As I was working on this project, I realized that Google's developers change the HTML layout more often than they change the logo on the front page, so I had to keep checking that my code still worked. Then again, they never promised anyone that the layout would stay the same, which makes this project very unstable: nothing guarantees that it will still work after a while. That is why I decided to keep full control over the whole code and not use any third-party library. That way, I can easily modify the code to accommodate any change in the Google HTML format.
The parsing process is split into three stages. First I get the counts, which I do in the GetCounts method. Then I get the division that holds the results and parse the items out of it in a loop implemented in GetResults. For each item, I parse the HTML to get its properties, and that happens in ParseItem.
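The three stages can be sketched roughly as below. Only the method names GetCounts, GetResults and ParseItem come from the text; the signatures, the patterns, the Hit class and the sample markup are invented for this example and do not reflect Google's real pages or the code in the download.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical result holder used only by this sketch.
public class Hit
{
    public string Link;
    public string Description;
    public string Text;
}

public static class PageParser
{
    // Stage 1: the total, e.g. "of about 1,230" (assumed wording).
    private static readonly Regex CountPattern =
        new Regex(@"of about\s+(?<count>\d+(,\d{3})*)");

    // Stage 2: the division that holds all result items (assumed id, no nested divs).
    private static readonly Regex ResultsPattern =
        new Regex("<div id=\"res\">(?<body>.*?)</div>", RegexOptions.Singleline);

    // Stage 3: one item = link, title and text fragment (assumed layout).
    private static readonly Regex ItemPattern = new Regex(
        "<p class=\"g\"><a href=\"(?<link>[^\"]+)\">(?<title>.*?)</a><br>(?<text>.*?)</p>",
        RegexOptions.Singleline);

    private static readonly Regex TagPattern = new Regex("<[^>]+>");

    // Returns the total number of hits, or 0 if the count is not found.
    public static long GetCounts(string html)
    {
        Match m = CountPattern.Match(html);
        if (!m.Success) return 0;
        return long.Parse(m.Groups["count"].Value,
                          NumberStyles.AllowThousands, CultureInfo.InvariantCulture);
    }

    // Finds the results division and parses every item inside it.
    public static List<Hit> GetResults(string html)
    {
        List<Hit> hits = new List<Hit>();
        Match block = ResultsPattern.Match(html);
        if (!block.Success) return hits;
        foreach (Match m in ItemPattern.Matches(block.Groups["body"].Value))
            hits.Add(ParseItem(m));
        return hits;
    }

    // Turns one item match into a Hit, stripping inline tags such as <b>.
    private static Hit ParseItem(Match m)
    {
        Hit hit = new Hit();
        hit.Link = m.Groups["link"].Value;
        hit.Description = TagPattern.Replace(m.Groups["title"].Value, "");
        hit.Text = TagPattern.Replace(m.Groups["text"].Value, "");
        return hit;
    }
}

public static class PageParserDemo
{
    public static void Main()
    {
        string html =
            "<div id=\"stats\">Results 1 - 2 of about 1,230 for <b>parser</b></div>" +
            "<div id=\"res\">" +
            "<p class=\"g\"><a href=\"http://example.com/a\">First <b>parser</b></a><br>Text about a <b>parser</b>.</p>" +
            "<p class=\"g\"><a href=\"http://example.com/b\">Second</a><br>More text.</p>" +
            "</div>";

        Console.WriteLine(PageParser.GetCounts(html));
        foreach (Hit hit in PageParser.GetResults(html))
            Console.WriteLine(hit.Link + " | " + hit.Description + " | " + hit.Text);
    }
}
```

Keeping each pattern in one named field means a layout change usually touches a single line, which is the kind of control the paragraph above argues for.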
Using the Code
In my demo I have a form that queries Google and shows the results in a list box. I have also provided a WebBrowser control to show how the page looks in Internet Explorer, and there is a link on the title that opens Internet Explorer outside the application. The code using this class is pretty simple.
private Searcher google = new Searcher();
...
try
{
    // NumericUpDown.Value is a decimal; the cast assumes ItemsPerPage is an integer type
    google.ItemsPerPage = (int)nudItemsPerPage.Value;
    google.Search(txtSearch.Text);
    if (google.Results == null) return;
    btnNext.Enabled = true;
}
catch (Exception ex)
{
    MessageBox.Show(ex.Message);
}
lstLinks.DataSource = google.Results;
lstLinks.DisplayMember = "Description";

Last Words
There are plenty of ways to parse HTML, and you will need to choose one depending on the problem. Working with regular expressions is the most fascinating of them, and there is undoubtedly a lot to explore in that area.

Tuesday, November 1, 2005

Static Events

Introduction


I just had a conversation with one of my colleagues in which he mentioned using static events. The subject was new to me, so I want to investigate it in this article.

Background (Why Static?)

Basically, the idea is to have something shared among all loaded instances of a class, and to ensure that changing the static property causes all instances to update their content right away, without the changer having to iterate through the existing objects and figure out which ones need updating. In effect, the intelligence is built into the class so that all instances know what to do when the static property changes.

This reminds me of the exchange rate example, which is very well known to those who implement banking systems: all transactions respect the current exchange rate. But I don't recall using static events for that. We treated the rate as a separate object and made sure that there was only one instance of it at a time, and all transaction instances knew where to find it when needed. There is a fine difference, though. The transactions do not need to know about changes to the exchange rate as they happen; they simply request the current value at the moment they use it.

That is not enough when, for example, we want the user interface to react immediately to changes in UI characteristics such as the font, as if in real time. It would be very easy if we could have a static property in the Font class called currentFont, a static method to change that value, and a static event that tells all instances when they need to update their appearance.
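The "ask when needed" arrangement described for the exchange rate can be sketched as follows. The ExchangeRate and Transaction classes and all member names are hypothetical, chosen only to show that nothing is pushed to the transactions.

```csharp
using System;

// Hypothetical single shared rate object.
public sealed class ExchangeRate
{
    public static readonly ExchangeRate Current = new ExchangeRate();

    private decimal rate = 1.0m;

    private ExchangeRate() { }   // private constructor: only one instance exists

    public decimal Rate
    {
        get { return rate; }
        set { rate = value; }
    }
}

public class Transaction
{
    private readonly decimal amount;

    public Transaction(decimal amount)
    {
        this.amount = amount;
    }

    // The rate is looked up at the moment of use; nothing notifies the transaction.
    public decimal ConvertedAmount()
    {
        return amount * ExchangeRate.Current.Rate;
    }
}

public static class ExchangeRateDemo
{
    public static void Main()
    {
        Transaction t = new Transaction(100m);
        ExchangeRate.Current.Rate = 1.2m;
        Console.WriteLine(t.ConvertedAmount());
        ExchangeRate.Current.Rate = 1.5m;   // t is not told about this change
        Console.WriteLine(t.ConvertedAmount());
    }
}
```

This pull model suits values that are read on demand. A label on screen has no such moment of use, which is why the font case calls for a notification instead.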

Using the code

It is clear that we need a static field that all instances respect. Let's say we need a static field called Font that all the labels will use to refresh when this base field has been changed.

public class MyLabel : System.Windows.Forms.Label
{
    // The static field all class instances will respect
    private static Font font = new Font("Verdana", 11);

This field requires a static property setter to allow us to make changes to the base field.

public static Font Font
{
    set
    {
        font = value;
        OnMyFontChange(new FontChangedEventArgs(font));
    }
}

As you can see, this is where we set the static variable, but we also call the notification method, which invokes the delegate.

private static void OnMyFontChange(FontChangedEventArgs e)
{
    if (MyFontChanged != null)
        MyFontChanged(null, e);
}

Now almost everything is set. All we need to do is to make sure that every instance subscribes to this event. And that is what we do in the constructor of the class.

public MyLabel()
{
    // Every instance subscribes to this event
    MyLabel.MyFontChanged += new FontChangedEventHandler(this.ChangeBaseFont);
}

The delegated method is where we use the changed value to refresh the UI.

private void ChangeBaseFont(object sender, FontChangedEventArgs e)
{
    base.Font = e.Font;
    base.Invalidate();
}
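The fragments above leave out the delegate, the event arguments class and the event declaration. The following console sketch puts the whole pattern together under stated assumptions: to run without Windows Forms it uses a font name string instead of a Font object, and the Widget class, its members and the Dispose-based unsubscribe are illustrative additions rather than the code from the demo.

```csharp
using System;

// Assumed shape of the event arguments: they carry the new value.
public class FontChangedEventArgs : EventArgs
{
    private readonly string fontName;
    public FontChangedEventArgs(string fontName) { this.fontName = fontName; }
    public string FontName { get { return fontName; } }
}

// Assumed shape of the delegate.
public delegate void FontChangedEventHandler(object sender, FontChangedEventArgs e);

// Hypothetical stand-in for MyLabel.
public class Widget : IDisposable
{
    private static string sharedFont = "Verdana";

    // The static event: one subscriber list shared by the whole class.
    public static event FontChangedEventHandler SharedFontChanged;

    public static string SharedFont
    {
        get { return sharedFont; }
        set
        {
            sharedFont = value;
            FontChangedEventHandler handler = SharedFontChanged;
            if (handler != null)
                handler(null, new FontChangedEventArgs(value));   // no instance is the sender
        }
    }

    private string currentFont = sharedFont;
    public string CurrentFont { get { return currentFont; } }

    public Widget()
    {
        // Every instance subscribes to the static event.
        SharedFontChanged += OnSharedFontChanged;
    }

    private void OnSharedFontChanged(object sender, FontChangedEventArgs e)
    {
        currentFont = e.FontName;
    }

    // The static event references every subscriber, so unsubscribe when done.
    public void Dispose()
    {
        SharedFontChanged -= OnSharedFontChanged;
    }
}

public static class WidgetDemo
{
    public static void Main()
    {
        Widget a = new Widget();
        Widget b = new Widget();

        Widget.SharedFont = "Tahoma";                             // both instances are notified
        Console.WriteLine(a.CurrentFont + " " + b.CurrentFont);   // Tahoma Tahoma

        b.Dispose();                                              // b stops listening
        Widget.SharedFont = "Arial";
        Console.WriteLine(a.CurrentFont + " " + b.CurrentFont);   // Arial Tahoma
    }
}
```

One design point worth keeping in mind: because the subscriber list is static, it keeps every subscribed instance reachable until that instance unsubscribes. For a control such as MyLabel, overriding Dispose(bool) would be a natural place to remove the handler.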

Note:

Of course, we could read the static field directly instead of passing the value along in a new EventArgs class. It just happens to be implemented this way, and that choice has nothing to do with the subject at hand.
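A minimal sketch of that alternative, using the standard EventHandler delegate with EventArgs.Empty and letting each handler read the static value itself. The Badge class and its members are hypothetical, and a font name string again stands in for a Font object.

```csharp
using System;

// Hypothetical class; no custom EventArgs or delegate type is needed.
public class Badge
{
    private static string font = "Verdana";

    public static event EventHandler FontChanged;

    public static string Font
    {
        get { return font; }
        set
        {
            font = value;
            EventHandler handler = FontChanged;
            if (handler != null)
                handler(null, EventArgs.Empty);   // the event carries no data
        }
    }

    private string shownFont = font;
    public string ShownFont { get { return shownFont; } }

    public Badge()
    {
        FontChanged += OnFontChanged;
    }

    private void OnFontChanged(object sender, EventArgs e)
    {
        // Read the shared value directly instead of taking it from the arguments.
        shownFont = Font;
    }
}

public static class BadgeDemo
{
    public static void Main()
    {
        Badge badge = new Badge();
        Badge.Font = "Courier New";
        Console.WriteLine(badge.ShownFont);
    }
}
```

This variant needs a static getter, which the setter-only property shown earlier does not have.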

In the demo, I have provided a test application using this label control and it demonstrates how it will update multiple screens by changing the base Font property.