Probably the most common technique traditionally used to extract data from web pages is to cook up some regular expressions that match the pieces you want (e.g., URLs and link titles). Our screen-scraper software actually started out as an application written in Perl for this very reason. In addition to regular expressions, you might also use some code written in something like Java or Active Server Pages to parse out larger chunks of text. Using raw regular expressions to pull out the data can be a little intimidating to the uninitiated, and can get a bit messy when a script contains a lot of them. At the same time, if you're already familiar with regular expressions, and your scraping project is relatively small, they can be a great solution.

Other techniques for getting the data out can get very sophisticated, as algorithms that make use of artificial intelligence and such are applied to the page. Some programs will actually analyze the semantic content of an HTML page, then intelligently pull out the pieces that are of interest. Still other approaches deal with developing "ontologies", or hierarchical vocabularies intended to represent the content domain.

There are a number of companies (including our own) that offer commercial applications specifically intended to do screen-scraping. The applications vary quite a bit, but for medium to large-sized projects they're often a good solution. Each one will have its own learning curve, so you should plan on taking time to learn the ins and outs of a new application. Especially if you plan on doing a fair amount of screen-scraping, it's probably a good idea to at least shop around for a screen-scraping application, as it will likely save you time and money in the long run.

So what's the best approach to data extraction? It really depends on what your needs are, and what resources you have at your disposal. Here are some of the pros and cons of the various approaches, as well as suggestions on when you might use each one:

Raw regular expressions and code

Advantages:
- If you're already familiar with regular expressions and at least one programming language, this can be a quick solution.
- Regular expressions allow for a fair amount of "fuzziness" in the matching such that minor changes to the content won't break them.
- You likely don't need to learn any new languages or tools (again, assuming you're already familiar with regular expressions and a programming language).
- Regular expressions are supported in almost all modern programming languages. Heck, even VBScript has a regular expression engine. It's also nice because the various regular expression implementations don't vary too significantly in their syntax.

Disadvantages:
- They can be complex for those who don't have a lot of experience with them. Learning regular expressions isn't like going from Perl to Java. It's more like going from Perl to XSLT, where you have to wrap your mind around a completely different way of viewing the problem.
- They're often confusing to analyze. Take a look through some of the regular expressions people have created to match something as simple as an email address and you'll see what I mean.
- If the content you're trying to match changes (e.g., they change the web page by adding a new "font" tag), you'll likely need to update your regular expressions to account for the change.
- The data discovery portion of the process (traversing various web pages to get to the page containing the data you want) will still need to be handled, and can get fairly complex if you need to deal with cookies and such.

When to use this approach: You'll most likely use straight regular expressions in screen-scraping when you have a small job you want to get done quickly. Especially if you already know regular expressions, there's no sense in getting into other tools if all you need to do is pull some news headlines off a site.

Ontologies and artificial intelligence

Advantages:
- You create it once and it can more or less extract the data from any page within the content domain you're targeting.
- The data model is generally built in. For example, if you're extracting data about cars from web sites, the extraction engine already knows what the make, model, and price are, so it can easily map them to existing data structures (e.g., insert the data into the correct locations in your database).
- There is relatively little long-term maintenance required. As web sites change, you will likely need to do very little to your extraction engine in order to account for the changes.

Disadvantages:
- It's relatively complex to create and work with such an engine. The level of expertise required to even understand an extraction engine that uses artificial intelligence and ontologies is much higher than what is required to deal with regular expressions.
- These types of engines are expensive to build. There are commercial offerings that will give you the basis for doing this type of data extraction, but you still need to configure them to work with the specific content domain you're targeting.
- You still have to deal with the data discovery portion of the process, which may not fit as well with this approach (meaning you may have to create an entirely separate engine to handle data discovery). Data discovery is the process of crawling web sites such that you arrive at the pages where you want to extract data.

When to use this approach: Typically you'll only get into ontologies and artificial intelligence when you're planning on extracting information from a very large number of sources. It also makes sense to do this when the data you're trying to extract is in a very unstructured format (e.g., newspaper classified ads). In cases where the data is very structured (meaning there are clear labels identifying the various data fields), it may make more sense to go with regular expressions or a screen-scraping application.

Screen-scraping software

Advantages:
- Abstracts most of the complicated stuff away. You can do some pretty sophisticated things in most screen-scraping applications without knowing anything about regular expressions, HTTP, or cookies.
- Dramatically reduces the amount of time required to set up a site to be scraped. Once you learn a particular screen-scraping application, the amount of time it takes to scrape sites is significantly lower than with other methods.
- Support from a commercial company. If you run into trouble while using a commercial screen-scraping application, chances are there are support forums and help lines where you can get assistance.

Disadvantages:
- The learning curve. Each screen-scraping application has its own way of going about things. This may imply learning a new scripting language in addition to familiarizing yourself with how the core application works.
- A potential cost. Most ready-to-go screen-scraping applications are commercial, so you'll likely be paying in dollars as well as time for this solution.
- A proprietary approach. Any time you use a proprietary application to solve a computing problem (and proprietary is obviously a matter of degree) you're locking yourself into using that approach. This may or may not be a big deal, but you should at least consider how well the application you're using will integrate with other software applications you currently have. For example, once the screen-scraping application has extracted the data, how easy is it for you to get to that data from your own code?

When to use this approach: Screen-scraping applications vary widely in their ease-of-use, price, and suitability to tackle a broad range of scenarios. Chances are, though, that if you don't mind paying a bit, you can save yourself a significant amount of time by using one. If you're doing a quick scrape of a single page, you can use just about any language with regular expressions. If you want to extract data from hundreds of web sites that are all formatted differently, you're probably better off investing in a complex system that uses ontologies and/or artificial intelligence. For just about everything else, though, you may want to consider investing in an application specifically designed for screen-scraping.

As an aside, I thought I should also mention a recent project we've been involved with that has actually required a hybrid of two of the aforementioned methods. We're currently working on a project that deals with extracting newspaper classified ads. The data in classifieds is about as unstructured as you can get. For example, in a real estate ad the term "number of bedrooms" can be written about 25 different ways. The data extraction portion of the process is one that lends itself well to an ontologies-based approach, which is what we've done. However, we still had to handle the data discovery portion. We decided to use screen-scraper for that, and it's handling it just great. The basic process is that screen-scraper traverses the various pages of the site, pulling out raw chunks of data that constitute the classified ads. These ads then get passed to code we've written that uses ontologies in order to extract the individual pieces we're after. Once the data has been extracted, we then insert it into a database.
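
To make the two-stage idea behind that hybrid approach a bit more concrete, here is a minimal Python sketch. It is not the code from the project described above and it does not use screen-scraper's API. The sample page, the function names (`find_ads`, `extract_bedrooms`, `scrape`), and the handful of bedroom phrasings are all made up for illustration. A small regex-backed word list is also only a toy stand-in for a real ontology, which would cover far more variants and fields.

```python
import re

# Hypothetical listing page. A real discovery stage would crawl the site
# (handling links, cookies, and so on) to arrive at pages like this one.
SAMPLE_PAGE = """
<html><body>
<div class="ad">Sunny 3 bdrm home near park, $250,000</div>
<div class="ad">Cozy two-bedroom condo, $1,200/mo</div>
<div class="ad">4BR/2BA ranch on large lot</div>
<div class="ad">Studio apartment downtown</div>
</body></html>
"""

# Stage 1 (stand-in for data discovery): pull out the raw ad chunks.
# The markup pattern is an assumption about this sample page only.
AD_CHUNK = re.compile(r'<div class="ad">(.*?)</div>', re.DOTALL)

# Stage 2 (toy vocabulary): a few of the many ways "number of bedrooms"
# can be written in a classified ad.
NUMBER_WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}
BEDROOMS = re.compile(
    r"\b(\d+|one|two|three|four|five)[\s-]*(?:bedrooms?|bdrms?|beds?|brs?)\b",
    re.IGNORECASE,
)


def find_ads(html):
    """Return the raw text of each ad chunk found in the page."""
    return [chunk.strip() for chunk in AD_CHUNK.findall(html)]


def extract_bedrooms(ad_text):
    """Return the bedroom count as an int, or None if no phrasing matches."""
    match = BEDROOMS.search(ad_text)
    if match is None:
        return None
    raw = match.group(1).lower()
    return int(raw) if raw.isdigit() else NUMBER_WORDS[raw]


def scrape(html):
    """Run both stages and return one record per ad."""
    return [{"text": ad, "bedrooms": extract_bedrooms(ad)} for ad in find_ads(html)]


if __name__ == "__main__":
    for record in scrape(SAMPLE_PAGE):
        print(record)
```

Keeping the two stages in separate functions mirrors the split described above: the discovery side can be swapped for a crawler or a screen-scraping application without touching the extraction vocabulary, and the resulting records can then be inserted into a database.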