One of the things I was working on recently is 'screen scraping' data from several sites and presenting it on a single page. Jaxer seems like the perfect tool for this, as I can fetch the data server-side, pull it all together, then present the final page to the browser. It was so exciting to see this work that I thought I'd write this post to share my experience with you!
We’re now working on adding support in Jaxer for programmatically creating 'window' objects, filling them with content from a remote URL, having that content actually execute, and then going into that window object to pull DOM elements out.
Unfortunately for me, that work isn't ready yet, so I wanted to see if there was some workaround I could use in the meantime. Well, after some research and trial and error, I found one. The only catch with this code is that any JavaScript on the page(s) you are fetching will not execute. That’s coming soon.
Note that all of the code below can be contained within a single script tag:
So, the first thing I wanted to do was make it really easy to fetch a remote document and grab some content out of it. I didn't want to use string matching or regular expressions of any kind; I wanted to use my trusted 'getElementById' to fetch named elements directly. (Of course I ran into sites that don't name their elements, but fortunately many use named classes, so I had to fetch them that way, but I get ahead of myself.)
Let's get started. The following two lines are all that is required to fetch a remote URL synchronously on the server and then pull out a named element:
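Here is a minimal sketch of that fetch-then-lookup step, wrapped in a small function so the two lines can be tried in isolation. The function name fetchElementById, the example.com URL and the "content" ID are placeholders made up for illustration, not the sites or names actually used; getDocument stands in for the getDocumentFromURL() helper described later in this post.

```javascript
// Sketch only: fetchElementById is a hypothetical wrapper, and the URL and
// element ID in the usage comment below are placeholders.
// getDocument is expected to behave like the getDocumentFromURL() helper
// described later in this post: take a URL, return a Document.
function fetchElementById(getDocument, url, id) {
    var doc = getDocument(url);      // line 1: fetch the page and get a Document
    return doc.getElementById(id);   // line 2: pull out the named element (null if absent)
}

// Intended use inside a server-side script:
// var item = fetchElementById(getDocumentFromURL, "http://example.com/", "content");
```

Passing the document loader in as an argument is only a convenience for experimenting; on a real page the two commented lines inside the function are all you need.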
Now that we have 'item' as the element we were trying to fetch, we just have to add that element to our current blank page. I created a simple function called 'addElementToPage(title, element)' that takes care of this for me. Here's the next line:
That's it. Now I do that 2 more times to fetch content from 2 other sites.
You'll notice that I didn't use getElementById() as those elements were not named, but fortunately they had class names associated with them. I found a convenient function called getElementsByClass() that I used in this case. Since it returns an array of elements, I use the [0] index to retrieve the first item in the list, which in my case, is probably the only item.
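As a rough sketch of the class-based lookup, the helper below takes the first match and returns null when nothing matches, which is a little safer than indexing [0] blindly if the remote page changes. The name firstElementByClass and the "story" class in the usage comment are hypothetical; findByClass stands in for a getElementsByClass(className, node) style function.

```javascript
// Sketch only: firstElementByClass is a hypothetical helper.
// findByClass is expected to take (className, rootNode) and return an
// array of matching elements, like getElementsByClass() in this post.
function firstElementByClass(findByClass, doc, className) {
    var matches = findByClass(className, doc);
    // Return null instead of undefined when the class is not found.
    return matches.length > 0 ? matches[0] : null;
}

// Intended use inside a server-side script ("story" is a placeholder class):
// var item = firstElementByClass(getElementsByClass, doc, "story");
```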
Believe it or not, that's basically it. At this point, your three 'DOM fragments' have been fetched and inserted into your new document. Here's the result:
Following are the convenience functions that help make the magic happen. The first function is addElementToPage(). It simply creates an H2 tag containing the title and a DIV tag that holds the contents of the fetched element:
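A minimal sketch of what such a function could look like follows. It is a reconstruction from the description above rather than the original listing. The optional third argument, targetDoc, is an addition made here so the function can be exercised outside a page; leave it out and the global document is used. Copying innerHTML (instead of moving the node) is also an assumption, chosen because the fetched element lives in a different document.

```javascript
// Sketch of addElementToPage(title, element), reconstructed from the
// description in the text. targetDoc is an extra, optional parameter
// (an assumption, not part of the described signature).
function addElementToPage(title, element, targetDoc) {
    var doc = targetDoc || document;

    // H2 heading carrying the title text.
    var heading = doc.createElement("h2");
    heading.appendChild(doc.createTextNode(title));

    // DIV holding a copy of the fetched element's markup.
    var container = doc.createElement("div");
    container.innerHTML = element.innerHTML;

    doc.body.appendChild(heading);
    doc.body.appendChild(container);
    return container;
}
```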
This next function, getDocumentFromURL(), is the one that does most of the work. (I found some good info on the subject of HTML to DOM here: http://jszen.blogspot.com/2007/02/how-to-parse-html-strings-into-dom.html) It first goes and retrieves the remote page. Then it creates a 'document fragment' from the contents of the fetched site. That fragment is then added to a dynamically created IFRAME. Finally, the Document object from the IFRAME is fetched and returned. In short, we can pass in a URL, get the string value, place it into an IFRAME, then pull out the resulting Document object so that we can work on it.
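The four steps above can be sketched roughly as follows. This is a reconstruction under assumptions, not the original listing: it assumes Jaxer.Web.get(url) returns the response body as a string, that the IFRAME's contentDocument is available as soon as the IFRAME is appended, and it adds two optional parameters (fetchText and hostDoc, both hypothetical) so the function can be tried with stand-ins outside Jaxer.

```javascript
// Sketch of getDocumentFromURL(url). fetchText and hostDoc are optional,
// hypothetical parameters; the defaults assume a Jaxer server-side context.
function getDocumentFromURL(url, fetchText, hostDoc) {
    fetchText = fetchText || function (u) { return Jaxer.Web.get(u); };
    hostDoc = hostDoc || document;

    // 1. Retrieve the remote page as a string.
    var html = fetchText(url);

    // 2. Parse the string into a document fragment.
    var range = hostDoc.createRange();
    range.selectNode(hostDoc.body);
    var fragment = range.createContextualFragment(html);

    // 3. Create an IFRAME and copy the fragment into its document.
    var iframe = hostDoc.createElement("iframe");
    iframe.style.display = "none";
    hostDoc.body.appendChild(iframe);
    var frameDoc = iframe.contentDocument;
    frameDoc.body.appendChild(frameDoc.importNode(fragment, true));

    // 4. Hand back the IFRAME's Document for normal DOM queries.
    return frameDoc;
}
```

Because the markup is parsed into a fragment rather than loaded as a live page, any scripts it contains are not run, which matches the limitation mentioned earlier.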
This final function was found at http://www.dustindiaz.com/getelementsbyclass/ and walks a node looking for elements with a specific class name. It was used in the case where elements don't have IDs, so a class name is used instead.
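For reference, here is a small sketch in the spirit of that function. It is not a verbatim copy of the linked code; it keeps the (searchClass, node, tag) argument order and compares whole class tokens, so "story" does not match "storybook".

```javascript
// Sketch in the spirit of getElementsByClass(searchClass, node, tag);
// not a verbatim copy of the linked original.
function getElementsByClass(searchClass, node, tag) {
    node = node || document;   // default to the current document
    tag = tag || "*";          // default to every element type

    var candidates = node.getElementsByTagName(tag);
    var matches = [];
    for (var i = 0; i < candidates.length; i++) {
        // className is a space-separated list; compare whole tokens.
        var classes = String(candidates[i].className || "").split(/\s+/);
        for (var j = 0; j < classes.length; j++) {
            if (classes[j] === searchClass) {
                matches.push(candidates[i]);
                break;
            }
        }
    }
    return matches;
}
```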
What's exciting about this sample is that it is relatively simple, uses the full power of server-side JavaScript and more importantly, Jaxer's cool server-side DOM capability to enable real 'DOM scraping'. Once window object creation is finished in Jaxer (real soon now), then you'll be able to fetch remote pages, execute their integrated code, then proceed to fetch out items from the resulting DOM.