Three Common Methods For Web Data Extraction

Probably the most common technique used traditionally to extract data from web pages is to cook up some regular expressions that match the pieces you want (e.g., URLs and link titles). Our screen-scraper software actually started out as an application written in Perl for this very reason. In addition to regular expressions, you might also use some code written in something like Java or Active Server Pages to parse out larger chunks of text. Using raw regular expressions to pull out the data can be a little intimidating to the uninitiated, and can get a bit messy when a script contains a lot of them. At the same time, if you're already familiar with regular expressions, and your scraping project is relatively small, they can be a great solution.

Other techniques for getting the data out can get very sophisticated as algorithms that make use of artificial intelligence and such are applied to the page. Some programs will actually analyze the semantic content of an HTML page, then intelligently pull out the pieces that are of interest. Still other approaches deal with developing "ontologies", or hierarchical vocabularies intended to represent the content domain.

There are a number of companies (including our own) that offer commercial applications specifically intended to do screen-scraping. The applications vary quite a bit, but for medium to large-sized projects they're often a good solution. Each one will have its own learning curve, so you should plan on taking time to learn the ins and outs of a new application. Especially if you plan on doing a fair amount of screen-scraping it's probably a good idea to at least shop around for a screen-scraping application, as it will likely save you time and money in the long run.

So what's the best approach to data extraction? It really depends on what your needs are, and what resources you have at your disposal. Here are some of the pros and cons of the various approaches, as well as suggestions on when you might use each one:

Raw regular expressions and code

Advantages:
- If you're already familiar with regular expressions and at least one programming language, this can be a quick solution.
- Regular expressions allow for a fair amount of "fuzziness" in the matching such that minor changes to the content won't break them.
- You likely don't need to learn any new languages or tools (again, assuming you're already familiar with regular expressions and a programming language).
- Regular expressions are supported in almost all modern programming languages. Heck, even VBScript has a regular expression engine. It's also nice because the various regular expression implementations don't vary too significantly in their syntax.

Disadvantages:
- They can be complex for those that don't have a lot of experience with them. Learning regular expressions isn't like going from Perl to Java. It's more like going from Perl to XSLT, where you have to wrap your mind around a completely different way of viewing the problem.
- They're often confusing to analyze. Take a look through some of the regular expressions people have created to match something as simple as an email address and you'll see what I mean.
- If the content you're trying to match changes (e.g., they change the web page by adding a new "font" tag) you'll likely need to update your regular expressions to account for the change.
- The data discovery portion of the process (traversing various web pages to get to the page containing the data you want) will still need to be handled, and can get fairly complex if you need to deal with cookies and such.

When to use this approach: You'll most likely use straight regular expressions in screen-scraping when you have a small job you want to get done quickly. Especially if you already know regular expressions, there's no sense in getting into other tools if all you need to do is pull some news headlines off of a site.

Ontologies and artificial intelligence

Advantages:
- You create it once and it can more or less extract the data from any page within the content domain you're targeting.
- The data model is generally built in. For example, if you're extracting data about cars from web sites the extraction engine already knows what the make, model, and price are, so it can easily map them to existing data structures (e.g., insert the data into the correct locations in your database).
- There is relatively little long-term maintenance required. As web sites change you likely will need to do very little to your extraction engine in order to account for the changes.

Disadvantages:
- It's relatively complex to create and work with such an engine. The level of expertise required to even understand an extraction engine that uses artificial intelligence and ontologies is much higher than what is required to deal with regular expressions.
- These types of engines are expensive to build. There are commercial offerings that will give you the basis for doing this type of data extraction, but you still need to configure them to work with the specific content domain you're targeting.
- You still have to deal with the data discovery portion of the process, which may not fit as well with this approach (meaning you may have to create an entirely separate engine to handle data discovery). Data discovery is the process of crawling web sites such that you arrive at the pages where you want to extract data.

When to use this approach: Typically you'll only get into ontologies and artificial intelligence when you're planning on extracting information from a very large number of sources. It also makes sense to do this when the data you're trying to extract is in a very unstructured format (e.g., newspaper classified ads). In cases where the data is very structured (meaning there are clear labels identifying the various data fields), it may make more sense to go with regular expressions or a screen-scraping application.

Screen-scraping software

Advantages:
- Abstracts most of the complicated stuff away. You can do some pretty sophisticated things in most screen-scraping applications without knowing anything about regular expressions, HTTP, or cookies.
- Dramatically reduces the amount of time required to set up a site to be scraped. Once you learn a particular screen-scraping application the amount of time it requires to scrape sites vs. other methods is significantly lowered.
- Support from a commercial company. If you run into trouble while using a commercial screen-scraping application, chances are there are support forums and help lines where you can get assistance.

Disadvantages:
- The learning curve. Each screen-scraping application has its own way of going about things. This may imply learning a new scripting language in addition to familiarizing yourself with how the core application works.
- A potential cost. Most ready-to-go screen-scraping applications are commercial, so you'll likely be paying in dollars as well as time for this solution.
- A proprietary approach. Any time you use a proprietary application to solve a computing problem (and proprietary is obviously a matter of degree) you're locking yourself into using that approach. This may or may not be a big deal, but you should at least consider how well the application you're using will integrate with other software applications you currently have. For example, once the screen-scraping application has extracted the data how easy is it for you to get to that data from your own code?

When to use this approach: Screen-scraping applications vary widely in their ease-of-use, price, and suitability to tackle a broad range of scenarios. Chances are, though, that if you don't mind paying a bit, you can save yourself a significant amount of time by using one. If you're doing a quick scrape of a single page you can use just about any language with regular expressions. If you want to extract data from hundreds of web sites that are all formatted differently you're probably better off investing in a complex system that uses ontologies and/or artificial intelligence. For just about everything else, though, you may want to consider investing in an application specifically designed for screen-scraping.

As an aside, I thought I should also mention a recent project we've been involved with that has actually required a hybrid approach of two of the aforementioned methods. We're currently working on a project that deals with extracting newspaper classified ads. The data in classifieds is about as unstructured as you can get. For example, in a real estate ad the term "number of bedrooms" can be written about 25 different ways. The data extraction portion of the process is one that lends itself well to an ontologies-based approach, which is what we've done. However, we still had to handle the data discovery portion. We decided to use screen-scraper for that, and it's handling it just great. The basic process is that screen-scraper traverses the various pages of the site, pulling out raw chunks of data that constitute the classified ads. These ads then get passed to code we've written that uses ontologies in order to extract the individual pieces we're after. Once the data has been extracted we then insert it into a database.
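
To give a rough feel for the extraction half of that hybrid process, here is a minimal Python sketch. It is not the code from our project, and a flat list of synonyms is a far cry from a real ontology. It only illustrates the idea of mapping many surface forms of "number of bedrooms" onto a single field, then storing the result. The vocabulary, the function names (extract_bedrooms, store_ads), and the table layout are all assumptions made up for this illustration, and the raw ad text is assumed to have already been collected by the data discovery step.

```python
import re
import sqlite3

# Hypothetical mini-vocabulary: a few of the many ways an ad might say "bedrooms".
BEDROOM_TERMS = ["bedrooms", "bedroom", "bdrms", "bdrm", "beds", "bed", "brs", "br", "bd"]

# Spelled-out counts we are willing to recognize (an assumption, not exhaustive).
NUMBER_WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}

# A count (digits or a number word), an optional space or hyphen, then a term.
BEDROOM_PATTERN = re.compile(
    r"\b(?P<count>\d+|" + "|".join(NUMBER_WORDS) + r")"
    r"\s*-?\s*(?:" + "|".join(BEDROOM_TERMS) + r")\b",
    re.IGNORECASE,
)


def extract_bedrooms(ad_text):
    """Return the bedroom count found in a raw ad, or None if nothing matches."""
    match = BEDROOM_PATTERN.search(ad_text)
    if match is None:
        return None
    raw = match.group("count").lower()
    return int(raw) if raw.isdigit() else NUMBER_WORDS[raw]


def store_ads(raw_ads, connection):
    """Extract the bedroom count from each raw ad and insert a row per ad."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS ads (raw_text TEXT, bedrooms INTEGER)"
    )
    rows = [(ad, extract_bedrooms(ad)) for ad in raw_ads]
    connection.executemany("INSERT INTO ads VALUES (?, ?)", rows)
    connection.commit()
    return len(rows)


if __name__ == "__main__":
    # Stand-ins for the raw chunks a data discovery step would hand over.
    ads = ["Charming 3BR/2BA near park", "Two bedroom house, big yard", "Vacant lot"]
    db = sqlite3.connect(":memory:")
    store_ads(ads, db)
    for row in db.execute("SELECT raw_text, bedrooms FROM ads"):
        print(row)
```

Even this toy version shows why the vocabulary matters more than any single pattern: adding a new way of writing "bedrooms" means adding one entry to the list rather than rewriting the expression.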