With a computer and a connection, people all over
the world finally have equal access to information.
But finding the exact information you're seeking
is still the major difference between success and
failure. Here's a roundup of various search-related
nuggets from the last year.
Search Tips:
Visualize the page where the info lives. This
is probably the biggest secret to success. Search
engines don't know anything about information; they
just know about web pages. Instead of focusing directly
on the information you're trying to learn, imagine what
types of web pages would hold this info—what
other terms would be on that page, which domains
might discuss this particular aspect of the subject—think
about what type of page might hide that info. Hunt
down the page which might hold the info, rather than
just the info itself.
For instance, if you're trying to figure out why
a particular Netscape version shows a white gap in
a table, then a term like "netscape table gap" can
pull up too many pages, with many hits being discussion
groups where people ask about a particular gap. A
page which holds the actual information you're seeking
would probably talk about multiple browsers, and
would also talk about how to fix the gap, so a longer
term like "netscape explorer table space gap fix
flush" would directly bring up information about
fixing browser table differences. Think about how
the solution might be described, rather than just
the problem you're seeing. There's a particular page
which hides your quarry; how can you best hunt down
that certain page?
Use the advanced operators for the search engine. Most
search engines can search on phrases rather than
isolated words if you put the phrase in quotes. A
minus sign or NOT in front of a term can exclude
the word from a search... parentheses let you group
terms for OR searches... different search engines
let you tweak searches in different ways. Google's Advanced
Search Page exposes some of their lesser-known techniques and operators.
For instance, adding "site:macromedia.com inurl:support" to
a term will restrict the search to Macromedia TechNotes,
or you can search my
blog by adding the term "inurl:jdmx.blogspot.com" to
a query. Or try adding "filetype:swf" to a term to
find only SWF files. Each engine offers different
special operators, and getting familiar with these
advanced tools can let you find more things, more
quickly. The Fagan Finder Search
Engine Ultimate Interface configures itself for the
advanced operators of multiple engines.
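To make the operator idea concrete, here's a rough Python sketch that glues plain terms and a few of the operators mentioned above into one query string. The function names are made up for illustration, and the base address and the "q" parameter are assumptions on my part, so check them against whichever engine you actually use.

```python
from urllib.parse import urlencode


def build_query(terms, site=None, inurl=None, filetype=None, exclude=()):
    """Combine plain terms with a few advanced operators into one query string."""
    parts = list(terms)
    if site:
        parts.append(f"site:{site}")
    if inurl:
        parts.append(f"inurl:{inurl}")
    if filetype:
        parts.append(f"filetype:{filetype}")
    # A leading minus sign excludes a word from the results.
    parts.extend(f"-{word}" for word in exclude)
    return " ".join(parts)


def search_url(query, base="https://www.google.com/search"):
    # The base address and the "q" parameter name are assumptions for illustration.
    return f"{base}?{urlencode({'q': query})}"


query = build_query(['"table gap"'], site="macromedia.com", inurl="support")
print(query)
print(search_url(query))
```

Putting the phrase in quotes inside the term list keeps it together as a phrase search, just as it would if you typed it by hand.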
Scan a set of candidate pages simultaneously. You
can tell a little bit about a page's actual contents
from the caption in the search results page, but
that's no replacement for actually examining a number
of candidate pages. The browser's "Find Text..." command
can help bring you directly to the search term's
location in the page, particularly if you use the
keyboard shortcut to "Find Again".
For scanning candidate pages, I use Mozilla's tabbed
browsing—a context-click will load a candidate
page beneath my search results, so I can quickly
invoke half-a-dozen pages in a single window and
then quickly scan and dismiss each. Other people
use browser sidebars to hold their results while
they scan through pages one at a time. If your browser
doesn't offer such features, then use its keyboard
shortcut to open each candidate page in its own window,
so that you can quickly scan multiple candidates
without losing your search results.
Try again. It's nice if you can find a page
with the desired info with your first set of search
terms, but not all prey is so easy to catch. Looking
at the search results, and examining candidate pages,
will often suggest new phrases to search on, or terms
to omit from the results. Sometimes the results of
the first search will suggest a domain or different
engine which might contain the info you're seeking.
Search Tools:
Different engines, different indices. Google's
great, but it doesn't
find everything. AllTheWeb spiders differently
and seems to find more non-English documents. WiseNut
and each of the other search engines traverse a different
network in finding documents. If you're confident
that a document exists, and can't locate it in one
engine, then cross-checking in another engine can
help.
This is the idea behind meta-search engines, which
query a number of different engines and then evaluate
the total results. A meta-search can take a little
longer for its multiple queries, and the value of
a meta-search may have declined a bit as a result
of Google's primacy, but their aggregation and unique
sortings give a different picture of the web's true
contents. Try Pandia
MetaSearch to see the advantages and disadvantages
of searching multiple engines simultaneously.
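Here's a small Python sketch of the merging step a meta-search has to do once the individual engines have answered. The scoring rule (each URL earns 1/rank from every list that returns it) is my own simple stand-in, not how Pandia or any real meta-search engine works, and the URLs are made-up samples.

```python
from collections import defaultdict


def merge_results(result_lists):
    """Merge ranked URL lists; a URL scores 1/rank in every list that returns it."""
    scores = defaultdict(float)
    for results in result_lists:
        for rank, url in enumerate(results, start=1):
            scores[url] += 1.0 / rank
    # Highest combined score first; ties are broken alphabetically.
    return sorted(scores, key=lambda url: (-scores[url], url))


# Hypothetical results from two engines for the same query.
engine_a = ["example.org/fix", "example.com/forum", "example.net/faq"]
engine_b = ["example.net/faq", "example.org/fix"]
print(merge_results([engine_a, engine_b]))
```

A page that two engines both rank highly floats above a page that only one engine found, which is the "different picture" the aggregation gives you.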
Weblogs are updated too frequently to usefully
show up in a mainstream search engine, but smaller
and more specific engines like Daypop and Technorati are
quick enough to offer searching of daily news and
commentary. Like Google
News and Froogle,
these engines query different document repositories
than a typical search engine would.
Different engines, different rankings. If
a query returns too many documents to quickly scan,
then the order in which these documents are returned
usually determines whether the search succeeds or
fails. Different engines rank documents in different
ways, and as site owners try to optimize their pages
for high rankings the engines frequently evolve in
response. Different engines can recommend different
documents for identical terms.
Early engines ranked results by how many times
the term appeared in a page, or whether the term
was in a page's keywords, title, or address. Google
added the innovation of ranking pages by how many
pages out there linked to this document—Google
harvested dispersed human judgment by realizing useful
documents were linked more often than trivial documents. Teoma takes
this a step further, by measuring incoming links
from authoritative documents in the same circle of
influence.
(As an example, if an early search engine were
looking for a brain surgeon, it would walk along
the street looking for a sign saying "Brain Surgeon
Office". Google would ask lots of people if they
knew of a brain surgeon. Teoma would find a group
of brain surgeons and ask them who the best one is.)
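The difference between the first two approaches shows up even in a toy example. This Python sketch ranks the same four made-up pages twice, once by how often the term appears and once by how many other pages link in. It is a deliberately crude illustration of the idea; the real engines' formulas are far more involved and aren't described here.

```python
# Four hypothetical pages: their text and the pages they link to.
pages = {
    "a.example": {"text": "brain surgeon brain surgeon brain surgeon", "links": []},
    "b.example": {"text": "Dr. Lee is a brain surgeon", "links": ["c.example"]},
    "c.example": {"text": "I recommend this brain surgeon", "links": ["b.example"]},
    "d.example": {"text": "my brain surgeon was great", "links": ["b.example"]},
}


def rank_by_term_count(pages, term):
    """Early-engine style: more occurrences of the term means a higher rank."""
    counts = {url: page["text"].lower().count(term) for url, page in pages.items()}
    return sorted(counts, key=lambda url: (-counts[url], url))


def rank_by_inbound_links(pages):
    """Link-counting style: more pages pointing at you means a higher rank."""
    inbound = {url: 0 for url in pages}
    for page in pages.values():
        for target in page["links"]:
            if target in inbound:
                inbound[target] += 1
    return sorted(inbound, key=lambda url: (-inbound[url], url))


print(rank_by_term_count(pages, "brain surgeon"))
print(rank_by_inbound_links(pages))
```

The page that just repeats the term wins the first ranking, while the page other people actually point to wins the second.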
Searching versus navigating. Sometimes it's
just plain hard to make a search term accurate yet
restricted enough to give a good set of candidate
pages. In cases like this a directory service can
help you explore an area without necessarily finding
keywords to group sites. Drill down through the Open
Directory Project to listings like their web
authoring section. You can then do a normal search
within a given category. Another way to find subject
directories like this is to do a regular engine search
with the term "wiki" to find listings of resources.
The Invisible Web. Web search engines index
pages. Not all of the information available on the
web happens to live in an HTML page. A lot of it
lives in databases. If you regularly do specialized
research, then get familiar with its particular web-accessible
databases or you'll miss much of what's available!
Here's a listing of
many sites which go into this subject in detail.
(Warning: Once you start prowling around in here
it's easy to lose track of time.... ;-)
Usenet predated the World Wide Web, and holds much
early Internet history. But newsgroup messages aren't
HTML documents and won't show up in most search engines.
The original DejaNews search engine archived these
documents and presented them through an HTML interface,
and was later rescued and revived as Google
Groups.
What about web pages which have been removed from
the web? These usually disappear from a mainstream
search engine the next time the site is spidered.
But a specialized engine like the Internet Archive's Wayback
Machine keeps those old pages searchable.
There's tons of material on the web which is simply
invisible to standard search engines...!
Bookmarks and Bookmarklets. If there's a
query you regularly make, consider saving it as a
bookmark in your browser. I regularly check whether
Google News has found new articles about Macromedia,
and this
URL takes me directly there with the newest articles
right on top. This
URL came from a search at Technorati to find
the newest blogs which link to my blog.
If you're instead regularly submitting different
requests to the same engine, then a bookmark with
a "javascript:" pseudo-URL can speed things up by
asking for the query terms in a dialog box, so you
can get your results with only one page loading.
Here's a bookmarklet I use for searching
my blog, and another I use to find recent
postings to the Macromedia newsgroups. (Caveat:
These work in my Mozilla browsers, and the JavaScript
may be slightly different in other browsers.)
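If you'd like to see what's inside such a pseudo-URL, here's a Python sketch that writes one out for any search address. This is not the bookmarklet I use, just an illustration: the function name is hypothetical, the "%s" placeholder is my own convention, and the template is assumed to contain no single quotes.

```python
def make_bookmarklet(url_template, prompt_text="Search for:"):
    """Return a javascript: pseudo-URL that prompts for terms and loads the results.

    url_template must contain the placeholder %s where the query belongs.
    Assumes neither argument contains a single quote.
    """
    before, after = url_template.split("%s", 1)
    script = (
        f"var q=prompt('{prompt_text}');"
        f"if(q){{location.href='{before}'+encodeURIComponent(q)+'{after}';}}"
    )
    # Wrapping the script in a function keeps its variable out of the page.
    return "javascript:(function(){" + script + "})();"


print(make_bookmarklet("https://www.google.com/search?q=%s+inurl:jdmx.blogspot.com"))
```

Paste the printed line into the address field of a new bookmark. As with the caveat above, browsers differ, so test it in yours.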
Bookmarklets can actually do things in your
browser, such as turning off CSS or performing an
analysis of the page content. A little bit of time
spent browsing these
sites can completely change the way you use a
web browser.
Both bookmarks for saved queries and bookmarklets
for JavaScript actions are tricks to make a document
browser act more like a real application. If your
browsers are like mine, then each has its own long
unique set of bookmarks—the system doesn't
scale well—but for regular searches and adjustable
searches, bookmarks and bookmarklets can definitely
help.
Search engine optimization. All of the above
help you find stuff. But suppose you want to make
things easy to find? Each search engine has its own
way of finding web pages, and its own way of determining
which pages to present first. Each also regularly
changes its methods in response to spammers. For
more searching techniques, as well as search-optimization
advice, keep an eye on sites like SearchDay, Search
Engine News, Pandia
Search World, Google
Weblog, Fagan
Finder, and similar sites.
Search Toys:
Here's the fun stuff! Search engines learn about
many pages as part of their job, but "finding stuff" isn't
the only action they can perform. Most of these toys
are based on the Google search engine, because of
Google's popularity and its web services, but also
because Google Labs tries out new approaches too.
(Many of these toys use "HTML scraping", where
a server-side or client-side process strips desired
information out of an HTML page. Scraping like this
can be a fragile process, because you're looking
for patterns in a particular document's markup to
retrieve the desired information, and changes to
the document's markup can break the scraper. The
move to web services and search APIs, with XML-formatted
returns, will make it much easier, more robust, and
more generalizable to retrieve specific information
from engines like this.)
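A tiny Python sketch shows why scraping is so brittle. The sample markup below is invented for illustration and is not any engine's real results page; the point is only that the pattern depends on the exact tags around the number.

```python
import re

# Invented sample markup; a real results page will look different.
SAMPLE_PAGE = (
    "<p>Results <b>1</b> - <b>10</b> of about <b>1,230,000</b> "
    "for <b>flash</b></p>"
)

# The scraper looks for the count inside one specific piece of markup.
COUNT_PATTERN = re.compile(r"of about <b>([\d,]+)</b>")


def scrape_result_count(html):
    """Return the result count found in the page, or None if the markup changed."""
    match = COUNT_PATTERN.search(html)
    if match is None:
        return None
    return int(match.group(1).replace(",", ""))


print(scrape_result_count(SAMPLE_PAGE))
# The same page with a cosmetic markup change breaks the scraper.
print(scrape_result_count(SAMPLE_PAGE.replace("<b>", "<strong>")))
```

Nothing about the data changed in the second call, only the presentation, and that is enough to lose the number.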
Googlefight: Enter
two terms, and this page submits a search to Google
on each, then scrapes the results to find which has
more search results—it finds which is more
popular on the web. The earliest implementations
of this idea used other search engines for web spelling
checks.
Googlism: Submit
a term, whether a name or place or even a date, and
Googlism scrapes the captioned search results to
find phrases people use to describe this name. (The "who", "what" and
other qualifiers presumably choose the parsing style
for the returned results.) Do a vanity search on
your own name to find what the web "thinks" of you....
Google
News and Froogle:
These two Google engines appeared during 2002,
and both search particular data sets. Google News
does regular and timely indexing of over 4000 news
feeds and applies heuristics to group articles
together on common subjects. I like it because
it's a quick way to find new articles on particular
subjects (try ColdFusion or accordion),
and also because it's a way to quickly compare
news articles from different news services on the
same subject. Froogle indexes into online stores
to quickly find different offerings and prices
for a particular consumer good (try "coffee
maker" or laptop
computers below $1000).
Blogdex and DayPop
Top 40: These two engines also search specialized
data sets, in this case the weblogs that thousands
of people write to each day. They tabulate the
links within these blogs to find the hottest stories
in the blogosphere. (You can also perform specific
URL or text searches in this type of engine to
find commentary on a particular story.) The day's
top hits are often of mixed value—a Reuters
news item on watercolor paintings made by Capuchin
monkeys may be frequently linked but of little
longterm importance—yet Blogdex and Daypop
succeed in offering a unique glimpse into what
many, many people find important each day.
Technorati's
Google Juice and Google
Rank: David Sifry's new blog-indexing service
offers a few twists. Google Juice can show how
highly Google ranks a particular URL for a particular
search term, while Google Rank lists the top returns
for any given term. You can do both these tasks
manually, but these massagings of Google returns
definitely save time. Even better are his watchlist subscriptions,
which push data to you when a ranking changes.
These services can be particularly useful when
you're using search-engine optimization techniques
to gauge a site's ranking.
Google
People: This toy from AvaQuest is a little
different from the above ones, because it attempts
to find an actual answer to a question which may
be hard to resolve using just normal search techniques.
Enter a "who is" question, like "Who discovered
radium?" A parser pulls search terms from your
question (presumably "discovered radium" in this
case), submits the search to Google, and retrieves
the results. It then consults a database of names
to identify proper names within returned pages,
compares the popularity of retrieved names across
matching pages, and then guesses the name of the
person who most likely fulfills the request. What
I find particularly exciting about this example
is that it massages search engine results into
doing a new type of task... it utilizes multiple
databases to automatically draw inferences about
the world.
Jon
Udell's Library Lookup: If you're reading an
online book catalog which includes its International
Standard Book Number (ISBN) in its URL, then this
set of bookmarklets can scrape the unique book
identifier out of the address and submit a query
to a local library to see if that book is currently
available on the shelf. (Some libraries expose
their book catalog as a web service.) This doesn't
use Google or other search engines directly, but
like Google People, Jon's Library Lookup combines
a book database with a particular library database
to do real work which would be difficult or tedious
to do through other means. It also uses online
engines in combination with web services, which
brings us to the next topic....
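The scraping half of that trick is small enough to sketch in Python. This is not Jon's code, which runs as a bookmarklet in the browser; the bookstore and library addresses, the "isbn" parameter, and the sample identifier are all hypothetical. It only looks for a ten-character ISBN and makes no attempt to verify the check digit.

```python
import re
from urllib.parse import urlencode

# Nine digits plus a final digit or X, not embedded in a longer number.
ISBN_PATTERN = re.compile(r"(?<!\d)(\d{9}[\dXx])(?![\dXx])")


def extract_isbn(book_url):
    """Return the first ten-character ISBN-like string in the URL, or None."""
    match = ISBN_PATTERN.search(book_url)
    return match.group(1).upper() if match else None


def library_query_url(book_url, library_base="https://library.example/search"):
    """Build a catalog query for the book; the library address is hypothetical."""
    isbn = extract_isbn(book_url)
    if isbn is None:
        return None
    return f"{library_base}?{urlencode({'isbn': isbn})}"


print(library_query_url("https://bookstore.example/dp/0596003838/"))
```

Each library catalog expects its own query format, so the second function is the part you'd have to adapt.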
Search tomorrow:
Search engines started as a way to find documents
which contained particular text. The next stage was
finding the most authoritative document for a particular
term. Search engines then grew to try to index as
many documents as possible, to handle as many different
requests as possible.
Now the pendulum is starting to swing the other
way—towards specialized databases, like DayPop
and Google News—towards specialized requests,
like "What are people saying about so-and-so?" or "Who
discovered radium?"—towards specialized tasks,
like finding whether the local library has a particular
book available right now, or finding a nearby restaurant
which has a particular item on the menu.
If you have a computer running Mac OS X,
you're probably already familiar with the built-in
Sherlock application. (If you're not yet using this
OS on any of your computers, then you can see screenshots here.)
Sherlock version 2.0 was essentially a skin over
an existing search engine, but the 3.0 version is
another type of application entirely. It pulls information
from several online databases to help you with a
single task.
Take a look at its "Yellow Pages" application.
It already knows how to find a variety of online
directories. After you search on a name and choose
an address, Sherlock quickly submits the name to
a mapping service to pull down directions. It knows
where you are because it's a local application which
can remember data across sessions.
Or look at the "Movies" application. It starts
off with your zip code, either your stored default
location or a location you enter while traveling.
Sherlock retrieves a list of local theatres within
a given radius. Choose a theatre, and it shows you
which films are playing. Choose a film, and it gives
you showtimes. You can pull up a short description
of the film, or a video trailer, and in some cases
can order tickets online.
In these types of applications, multiple data
sources are pulled for appropriate content to perform
one specific task. Instead of the focus being
on the contents of a particular database, the focus
is on what a person is trying to do: track their
packages, monitor particular stocks, plan a night
out on the town. The product is no longer the data
that someone has collected; the product is instead
the experience that the author is delivering.
How is this done? It's nothing new, really—Shockwave
applications have been able to do data-mining across
servers for years. The Watson application
from Karelia Software was the first to become really
popular at these types of tasks, and many of the
applications inside Sherlock 3 are apparently directly
inspired by their Watson counterparts.
Why this explosion now? Previously, web servers
delivered just HTML pages. Data and presentation
were mixed together... you'd have to parse the HTML
manually to extract the data you wanted. (Watson
still has parsing
dictionaries telling how to isolate data within
different HTML pages.) The recent emphasis on XML-formatted
feeds, whether as RSS (Rich Site Summary) or Google
APIs or web services or other feeds, has given us
pure data without a presentation layer. More web
servers are exposing more pure information than ever
before.
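Compare the scraping approach with reading an XML feed. In this Python sketch the feed is a made-up sample in the general shape of an RSS 2.0 document, and the traffic service is hypothetical. Notice there's no pattern-matching against presentation markup at all.

```python
import xml.etree.ElementTree as ET

# A made-up feed for illustration; the service and its items are hypothetical.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Sample Traffic Feed</title>
    <item>
      <title>Route 101 southbound: stalled truck</title>
      <link>http://traffic.example/1</link>
    </item>
    <item>
      <title>Route 280 northbound: clear</title>
      <link>http://traffic.example/2</link>
    </item>
  </channel>
</rss>"""


def read_items(feed_xml):
    """Return each item's title and link as a list of dictionaries."""
    root = ET.fromstring(feed_xml)
    return [
        {
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
        }
        for item in root.iter("item")
    ]


for entry in read_items(SAMPLE_FEED):
    print(entry["title"], "->", entry["link"])
```

The publisher can restyle their web pages as often as they like and this code keeps working, as long as the feed's element names stay the same.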
This is where I think web-indexing will go—towards completing
a task, rather than just finding a document.
News aggregators, such as AmphetaDesk,
can grab information from many sources, but we
still need to add some smarts so they can watch
 for particular news and alert you (e.g., "hang out
in the background and tell me immediately if the
traffic service reports anything about Route 101
southbound during the late afternoon each workday").
There are significant legal and ethical issues
to resolve about how to use data that others expose
on the web. Some of these tasks can be done on
demand by bookmarked search queries, but background
applications which watch the web for you seem to
me to be a hot area for growth over the next two
years.
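The "smarts" in that traffic example could start out as a rule as small as this Python sketch. The function name and keywords are mine, and treating "late afternoon" as 3pm to 7pm is purely an assumption; a real background watcher would also need to fetch the feed on a schedule and remember which items it had already reported.

```python
from datetime import datetime, time


def should_alert(title, now, keywords=("route 101", "southbound"),
                 start=time(15, 0), end=time(19, 0)):
    """True when it's a workday afternoon and the title mentions every keyword."""
    is_workday = now.weekday() < 5  # Monday is 0, Friday is 4
    in_window = start <= now.time() <= end  # assumed "late afternoon" hours
    mentions_all = all(word in title.lower() for word in keywords)
    return is_workday and in_window and mentions_all


headline = "Route 101 southbound: stalled truck"
print(should_alert(headline, datetime(2003, 1, 6, 16, 30)))  # a Monday afternoon
print(should_alert(headline, datetime(2003, 1, 4, 16, 30)))  # a Saturday
```

The same headline triggers an alert on the workday and stays quiet on the weekend, which is exactly the kind of filtering today's aggregators leave to the reader.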
Whew! That's more than what I expected to type
here, but there's a lot of fun and exciting stuff
going on in the world of search engines and related
services. I hope there's at least a thing or two
in the above that can help make your time on the
web more fun in 2003. Drop a note on the related
item in my blog if
you've got any comments, thanks!