We're finalizing our Jaxer 1.0 release, and I thought I'd go back to Paul's 'DOM Scraping' example and see what Jaxer 1.0 has to offer.
A major new feature is Jaxer.Sandbox, which lets you open new server-side sandboxed 'windows' and load pages into them. You can GET a page from any URL, or even POST some content and load back the response. You can control whether JavaScript is executed in the window, whether it honors meta refreshes, whether it loads synchronously or asynchronously, and so on. Most important, the window and its contents have no access to Jaxer and its APIs, nor to your app, but your app has full access to the window.
There's a lot of other new goodness in Jaxer 1.0, including the officially released version of the Mozilla engine found in Firefox 3. So, for example, getElementsByClassName is natively implemented (see John Resig's speed comparison), alongside other Mozilla features such as built-in XPath functionality and a very robust DOM feature set: just what you need for some serious 'screen scraping', mashups, and content repurposing.
Let's see it in action. First, let's grab the same three pages Paul used and extract some choice content elements into our page before sending it to the browser. We'll reuse the same Jaxer.Sandbox instance three times, each time loading a different page into it and grabbing some content by id or by class name. The page is basically a single script block containing the server-side JavaScript, along with a single helper function that adds the given contents from the sandbox to the current page.
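As a rough illustration of what such a helper might look like, here is a minimal sketch that uses only standard DOM calls. The name `addFromSandbox`, its parameters, and the `{ id: ... }` / `{ className: ... }` selector shape are assumptions made for this example, not Jaxer's API; `sourceDoc` stands in for the document loaded in the sandbox and `targetDoc` for the page being built.

```javascript
// Hypothetical helper (not part of Jaxer): copy nodes from a sandboxed
// document into a container on the current page.
// selector is either { id: "..." } or { className: "..." }.
function addFromSandbox(sourceDoc, targetDoc, container, selector) {
    var found = [];
    if (selector.id) {
        var byId = sourceDoc.getElementById(selector.id);
        if (byId) {
            found.push(byId);
        }
    } else if (selector.className) {
        // getElementsByClassName returns a live collection, so take a
        // snapshot before working with it.
        found = Array.prototype.slice.call(
            sourceDoc.getElementsByClassName(selector.className));
    }
    for (var i = 0; i < found.length; i++) {
        // importNode(node, true) makes a deep copy owned by targetDoc,
        // leaving the sandboxed document untouched.
        container.appendChild(targetDoc.importNode(found[i], true));
    }
    return found.length; // number of nodes copied
}
```

Copying with `importNode` rather than moving the original nodes keeps the sandboxed document intact, which matters when the same sandbox is inspected more than once before the next page is loaded into it.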
Here's the resulting page:

That's not too bad, but the timestamps on the News.com section don't look good, and there are a bunch of JavaScript errors in the browser from left-over JavaScript content that was in the original WashingtonPost.com page. We can quickly clean that up with another tiny DOM helper function that removes every node in any DOM NodeList you pass to it. And while we're at it, we'll also remove the reference to the client-side Jaxer framework, since we won't be needing it here. The script block is updated accordingly.
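A helper of that kind can be sketched in a few lines of standard DOM code. The name `removeNodes` is an assumption for this example; the one subtlety worth showing is that lists returned by calls such as getElementsByTagName are live, so they shrink while you remove from them.

```javascript
// Hypothetical helper: remove every node in a NodeList (or any array-like
// list of nodes) from its parent. Returns the number of nodes removed.
function removeNodes(nodeList) {
    // Copy first: a live list would skip entries if we removed nodes
    // while iterating over it directly.
    var nodes = Array.prototype.slice.call(nodeList);
    var removed = 0;
    for (var i = 0; i < nodes.length; i++) {
        var node = nodes[i];
        if (node && node.parentNode) {
            node.parentNode.removeChild(node);
            removed++;
        }
    }
    return removed;
}
```

Assuming `sandboxDocument` refers to the document loaded in the sandbox, a call such as `removeNodes(sandboxDocument.getElementsByTagName("script"))` would strip left-over script elements before the content is copied into the page.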
Now that's much better:

But if you run this example yourself, you'll see it's slow: each of the original servers is a bit slow to load its entire page, and you have to load all three server-side before your page is ready to be sent to the browser. We could load all three in parallel, using Jaxer.Sandbox's async capabilities and the Jaxer.Thread.waitFor() method to wait for them to load. But even then the user would see nothing in the browser until the slowest one was done loading server-side.

So let's load them asynchronously client-side instead, by creating a server-side function called getFragment that loads a URL and returns some of its contents, and setting getFragment.proxy = true so we can call it asynchronously from the client. The page now has two script blocks, one that runs in the browser and one that runs on the server. The result looks the same as the previous screenshot, but the order in which the sections are displayed may vary, depending on which remote server responds first.
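To make the ordering behavior concrete, here is a minimal client-side sketch. It does not use Jaxer's generated proxy directly: `fetchFragment(callback, url, selector)` is a hypothetical stand-in for whatever asynchronous, client-callable form of getFragment is available, and `loadFragments`, `render`, and the request fields are likewise names assumed for this example.

```javascript
// Hypothetical client-side loader: fire off all requests at once and
// render each section as soon as its response arrives.
function loadFragments(requests, fetchFragment, render) {
    var arrivalOrder = [];
    for (var i = 0; i < requests.length; i++) {
        // The immediately-invoked function gives each callback its own
        // `request`, instead of all callbacks sharing the loop variable.
        (function (request) {
            fetchFragment(function (html) {
                arrivalOrder.push(request.name);
                render(request.name, html);
            }, request.url, request.selector);
        })(requests[i]);
    }
    // Filled in over time, in the order the remote servers respond.
    return arrivalOrder;
}
```

Because nothing waits on the slowest server, the first section appears as soon as the fastest response comes back. If a fixed on-page order matters, one option is to have `render` write each fragment into a placeholder element reserved for that `name`, so arrival order no longer determines layout.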