JD's Forum

John Dowdell

John Dowdell joined Macromedia in 1993 and listens to people in the online communities. He likes to make complex things simpler, and keeps a daily weblog of related news.

Searching Tips, Tools, Toys, and Tomorrow


With a computer and a connection, people all over the world finally have equal access to information. But finding the exact information you're seeking is still the major difference between success and failure. Here's a roundup of various search-related nuggets from the last year.


Search Tips:

Visualize the page where the info lives. This is probably the biggest secret to success. Search engines don't know anything about information; they just know about web pages. Instead of focusing directly on the information you're trying to learn, imagine what types of web pages would hold this info: what other terms would be on that page, which domains might discuss this particular aspect of the subject, what type of page might hide that info. Hunt down the page which might hold the info, rather than just the info itself.

For instance, if you're trying to figure out why a particular Netscape version shows a white gap in a table, then a term like "netscape table gap" can pull up too many pages, with many hits being discussion groups where people ask about a particular gap. A page which holds the actual information you're seeking would probably talk about multiple browsers, and would also talk about how to fix the gap, so a longer term like "netscape explorer table space gap fix flush" would directly bring up information about fixing browser table differences. Think about how the solution might be described, rather than just the problem you're seeing. There's a particular page which hides your quarry; how can you best hunt down that certain page?

Use the advanced operators for the search engine. Most search engines can search on phrases rather than isolated words if you put the phrase in quotes. A minus sign or NOT in front of a term can exclude the word from a search... parentheses let you group terms for OR searches... different search engines let you tweak searches in different ways. Google's Advanced Search Page exposes some of their lesser-known techniques and operators. For instance, adding "site:macromedia.com inurl:support" to a term will restrict the search to Macromedia TechNotes, or you can search my blog by adding the term "inurl:jdmx.blogspot.com" to a query. Or try adding "filetype:swf" to a term to find only SWF files. Each engine offers different special operators, and getting familiar with these advanced tools can let you find more things, more quickly. The Fagan Finder Search Engine Ultimate Interface adapts itself to the advanced operators of multiple engines.
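
Here's a minimal Python sketch of how these operators combine into a single query string. The function names are made up for illustration, and the base URL with its "q" parameter is an assumption about one engine's address format; other engines use different addresses and operators.

```python
from urllib.parse import urlencode


def build_query(terms, phrase=None, exclude=(), site=None, inurl=None, filetype=None):
    """Assemble a search string from plain terms and optional operators."""
    parts = list(terms)
    if phrase:
        parts.append(f'"{phrase}"')  # quotes request a phrase search
    parts.extend(f"-{word}" for word in exclude)  # minus sign excludes a word
    if site:
        parts.append(f"site:{site}")
    if inurl:
        parts.append(f"inurl:{inurl}")
    if filetype:
        parts.append(f"filetype:{filetype}")
    return " ".join(parts)


def search_url(query, base="https://www.google.com/search"):
    """Turn the query string into an address you can open or bookmark.

    The base address and the "q" parameter are assumptions for this sketch.
    """
    return f"{base}?{urlencode({'q': query})}"


query = build_query(["table", "gap"], site="macromedia.com", inurl="support")
print(query)
print(search_url(query))
```

The same builder covers the other examples above: pass filetype="swf" to look only for SWF files, or inurl="jdmx.blogspot.com" to stay within one blog.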

Scan a set of candidate pages simultaneously. You can tell a little bit about a page's actual contents from the caption in the search results page, but that's no replacement for actually examining a number of candidate pages. The browser's "Find Text..." command can help bring you directly to the search term's location in the page, particularly if you use the keyboard shortcut to "Find Again".

For scanning candidate pages, I use Mozilla's tabbed browsing—a context-click will load a candidate page beneath my search results, so I can quickly invoke half-a-dozen pages in a single window and then quickly scan and dismiss each. Other people use browser sidebars to hold their results while they scan through pages one at a time. If your browser doesn't offer such features, then use its keyboard shortcut to open each candidate page in its own window, so that you can quickly scan multiple candidates without losing your search results.

Try again. It's nice if you can find a page with the desired info with your first set of search terms, but not all prey is so easy to catch. Looking at the search results, and examining candidate pages, will often suggest new phrases to search on, or terms to omit from the results. Sometimes the results of the first search will suggest a domain or different engine which might contain the info you're seeking.


Search Tools:

Different engines, different indices. Google's great, but it doesn't find everything. AllTheWeb spiders differently and seems to find more non-English documents, and WiseNut and each of the other search engines traverse a different network in finding documents. If you're confident that a document exists and can't locate it in one engine, then cross-checking in another engine can help.

This is the idea behind meta-search engines, which query a number of different engines and then evaluate the total results. A meta-search can take a little longer for its multiple queries, and the value of a meta-search may have declined a bit as a result of Google's primacy, but their aggregation and unique sortings give a different picture of the web's true contents. Try Pandia MetaSearch to see the advantages and disadvantages of searching multiple engines simultaneously.
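
To make the aggregation step concrete, here's a small Python sketch of one way to merge ranked lists from several engines. The scoring rule (each engine gives a page 1/rank points) is my own assumption for illustration, not how Pandia or any particular meta-search engine works, and the engine names and addresses are placeholders.

```python
from collections import defaultdict


def merge_results(results_by_engine):
    """Combine ranked URL lists from several engines into one ranking.

    Each URL earns 1/rank points from every engine that returned it, so a
    page listed by several engines rises above a page listed by only one.
    """
    scores = defaultdict(float)
    for ranked_urls in results_by_engine.values():
        for rank, url in enumerate(ranked_urls, start=1):
            scores[url] += 1.0 / rank
    # Highest combined score first; ties are broken alphabetically.
    return sorted(scores, key=lambda url: (-scores[url], url))


# Placeholder results, not real engine output.
sample = {
    "engine_a": ["a.example/fix", "b.example/faq", "c.example/forum"],
    "engine_b": ["b.example/faq", "d.example/notes"],
    "engine_c": ["b.example/faq", "a.example/fix"],
}
print(merge_results(sample))
```

A page that no single engine ranks first can still come out on top when most engines agree it belongs near the top, which is the "different picture" a meta-search offers.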

Weblogs are updated too frequently to show up usefully in a mainstream search engine, but smaller and more specific engines like Daypop and Technorati are quick enough to offer searching of daily news and commentary. Like Google News and Froogle, these engines query different document repositories than a typical search engine would.

Different engines, different rankings. If a query returns too many documents to quickly scan, then the order in which these documents are returned usually determines whether the search succeeds or fails. Different engines rank documents in different ways, and as site owners try to optimize their pages for high rankings the engines frequently evolve in response. Different engines can recommend different documents for identical terms.

Early engines ranked results by how many times the term appeared in a page, or whether the term was in a page's keywords, title, or address. Google added the innovation of ranking pages by how many pages out there linked to this document—Google harvested dispersed human judgment by realizing useful documents were linked more often than trivial documents. Teoma takes this a step further, by measuring incoming links from authoritative documents in the same circle of influence.

(As an example, if an early search engine were looking for a brain surgeon, it would walk along the street looking for a sign saying "Brain Surgeon Office". Google would ask lots of people if they knew of a brain surgeon. Teoma would find a group of brain surgeons and ask them who the best one is.)
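
The difference between the first two approaches can be sketched in a few lines of Python. This is a deliberately simplified illustration: it counts raw term occurrences and raw incoming links, which is far cruder than what any real engine does, and the four pages are made up.

```python
def rank_by_term_count(pages, term):
    """Early-engine style: order pages by how often the term appears."""
    return sorted(
        pages, key=lambda name: -pages[name]["text"].lower().split().count(term)
    )


def rank_by_inbound_links(pages):
    """Link-based style: order pages by how many other pages link to them."""
    inbound = {name: 0 for name in pages}
    for name, page in pages.items():
        for target in page["links"]:
            if target in inbound and target != name:
                inbound[target] += 1
    return sorted(pages, key=lambda name: -inbound[name])


# Made-up pages: "spam" repeats the term, "clinic" is the one others link to.
pages = {
    "spam": {"text": "surgeon surgeon surgeon surgeon", "links": []},
    "clinic": {"text": "our brain surgeon sees patients daily", "links": []},
    "review": {"text": "this surgeon is recommended", "links": ["clinic"]},
    "directory": {"text": "local doctors", "links": ["clinic", "review"]},
}
print(rank_by_term_count(pages, "surgeon"))
print(rank_by_inbound_links(pages))
```

The term counter puts the keyword-stuffed page first, while the link counter prefers the page that other authors chose to point at.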

Searching versus navigating. Sometimes it's just plain hard to make a search term accurate yet restricted enough to give a good set of candidate pages. In cases like this a directory service can help you explore an area without necessarily finding keywords to group sites. Drill down through the Open Directory Project to listings like their web authoring section. You can then do a normal search within a given category. Another way to find subject directories like this is to do a regular engine search with the term "wiki" to find listings of resources.

The Invisible Web. Web search engines index pages. Not all of the information available on the web happens to live in an HTML page. A lot of it lives in databases. If you regularly do specialized research, then get familiar with your field's web-accessible databases or you'll miss much of what's available! Here's a listing of many sites which go into this subject in detail. (Warning: Once you start prowling around in here it's easy to lose track of time.... ;-)

Usenet predated the World Wide Web, and holds much early Internet history. But newsgroup messages aren't HTML documents and won't show up in most search engines. The original DejaNews search engine archived these documents and presented them through an HTML interface, and was later rescued and revived as Google Groups.

What about web pages which have been removed from the web? These usually disappear from a mainstream search engine the next time the site is spidered. But a specialized engine like the Internet Archive's Wayback Machine keeps those old pages searchable.

There's tons of material on the web which is simply invisible to standard search engines...!

Bookmarks and Bookmarklets. If there's a query you regularly make, consider saving it as a bookmark in your browser. I regularly check whether Google News has found new articles about Macromedia, and this URL takes me directly there with the newest articles right on top. This URL came from a search at Technorati to find the newest blogs which link to my blog.

If you're instead regularly submitting different requests to the same engine, then a bookmark with a "javascript:" pseudo-URL can speed things up by asking for the query terms in a dialog box, so you can get your results with only one page loading. Here's a bookmarklet I use for searching my blog, and another I use to find recent postings to the Macromedia newsgroups. (Caveat: These work in my Mozilla browsers, and the JavaScript may be slightly different in other browsers.)
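
As a rough illustration of what such a bookmarklet contains, here's a Python sketch that generates the "javascript:" pseudo-URL text for a given search address. The helper name and the example address are assumptions for illustration, and the same browser caveat applies to the generated JavaScript.

```python
def make_bookmarklet(search_url_prefix, prompt_text="Search for:"):
    """Return a javascript: pseudo-URL that asks for terms, then loads results.

    search_url_prefix is everything up to the query value. Neither argument
    may contain a single quote, since both are embedded in quoted JavaScript.
    """
    script = (
        "(function(){"
        f"var q=prompt('{prompt_text}');"
        f"if(q){{location.href='{search_url_prefix}'+encodeURIComponent(q);}}"
        "})();"
    )
    return "javascript:" + script


# Assumed search address; substitute the engine and parameters you use.
print(make_bookmarklet("https://www.google.com/search?q="))
```

Paste the printed text into the address field of a new bookmark. Clicking the bookmark then shows the dialog box and loads only the results page.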

Bookmarklets can actually do things in your browser, such as turning off CSS or performing an analysis of the page content. A little bit of time spent browsing these sites can completely change the way you use a web browser.

Both bookmarks for saved queries and bookmarklets for JavaScript actions are tricks to make a document browser act more like a real application. If your browsers are like mine, then each has its own long, unique set of bookmarks, so the system doesn't scale well. But for regular searches and adjustable searches, bookmarks and bookmarklets can definitely help.

Search engine optimization. All of the above help you find stuff. But suppose you want to make things easy to find? Each search engine has its own way of finding web pages, and its own way of determining which pages to present first. Each also regularly changes its methods in response to spammers. For more searching techniques, as well as search-optimization advice, keep an eye on sites like SearchDay, Search Engine News, Pandia Search World, Google Weblog, Fagan Finder, and similar sites.


Search Toys:

Here's the fun stuff! Search engines learn about many pages as part of their job, but "finding stuff" isn't the only action they can perform. Most of these toys are based on the Google search engine, because of Google's popularity and its web services, but also because Google Labs tries out new approaches too.

(Many of these toys use "HTML scraping", where a server-side or client-side process strips desired information out of an HTML page. Scraping like this can be a fragile process, because you're looking for patterns in a particular document's markup to retrieve the desired information, and changes to the document's markup can break the scraper. The move to web services and search APIs, with XML-formatted returns, will make it much easier, more robust, and more generalizable to retrieve specific information from engines like this.)
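
Here's a small Python sketch of why scraping is fragile compared with an XML return. The HTML snippets, the XML reply, and its element name are all invented for this example; real result pages and real APIs look different.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical result-page markup; real pages differ and change over time.
HTML_V1 = '<div class="stats">Results 1 - 10 of about <b>1,250</b></div>'
HTML_V2 = '<p id="count">About <span>1,250</span> results</p>'  # after a redesign

# Hypothetical XML return carrying the same figure as plain data.
XML_REPLY = "<response><estimatedTotal>1250</estimatedTotal></response>"


def scrape_count(html):
    """Pull the hit count out of markup by matching one specific pattern."""
    match = re.search(r"of about <b>([\d,]+)</b>", html)
    return int(match.group(1).replace(",", "")) if match else None


def parse_count(xml_text):
    """Read the hit count from a named XML element, with no markup guessing."""
    return int(ET.fromstring(xml_text).findtext("estimatedTotal"))


print(scrape_count(HTML_V1))  # the pattern matches the old layout
print(scrape_count(HTML_V2))  # the pattern finds nothing after the redesign
print(parse_count(XML_REPLY))
```

The scraper depends on the exact wording and tags around the number, so a cosmetic redesign silently breaks it, while the XML reader only depends on the element's name.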

Googlefight: Enter two terms, and this page submits a search to Google on each, then scrapes the results to find which has more search results—it finds which is more popular on the web. The earliest implementations of this idea used other search engines for web spelling checks.

Googlism: Submit a term, whether a name or place or even a date, and Googlism scrapes the captioned search results to find phrases people use to describe this name. (The "who", "what" and other qualifiers presumably choose the parsing style for the returned results.) Do a vanity search on your own name to find what the web "thinks" of you....

Google News and Froogle: These two Google engines appeared during 2002, and both search particular data sets. Google News does regular and timely indexing of over 4000 news feeds and applies heuristics to group articles together on common subjects. I like it because it's a quick way to find new articles on particular subjects (try ColdFusion or accordion), and also because it's a way to quickly compare news articles from different news services on the same subject. Froogle indexes into online stores to quickly find different offerings and prices for a particular consumer good (try "coffee maker" or laptop computers below $1000).

Blogdex and Daypop Top 40: These two engines also search specialized data sets, in this case the weblogs that thousands of people write to each day. They tabulate the links within these blogs to find the hottest stories in the blogosphere. (You can also perform specific URL or text searches in this type of engine to find commentary on a particular story.) The day's top hits are often of mixed value (a Reuters news item on watercolor paintings made by Capuchin monkeys may be frequently linked but of little long-term importance), yet Blogdex and Daypop succeed in offering a unique glimpse into what many, many people find important each day.
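
The tabulation idea is simple enough to sketch in Python. This is only my guess at the general approach, not how Blogdex or Daypop are actually built; the posts and addresses are made up, and the regular expression is a crude stand-in for real HTML parsing.

```python
import re
from collections import Counter

LINK_PATTERN = re.compile(r'href="(https?://[^"]+)"')


def top_links(posts, limit=3):
    """Count how many distinct posts link to each URL and return the leaders."""
    counts = Counter()
    for html in posts:
        # A set keeps one post from voting twice for the same URL.
        counts.update(set(LINK_PATTERN.findall(html)))
    # Most-linked first; ties are broken alphabetically.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:limit]


# Made-up weblog posts.
posts = [
    '<a href="http://news.example/monkeys">monkey art</a> '
    '<a href="http://news.example/monkeys">again</a>',
    '<a href="http://news.example/monkeys">ha</a> '
    '<a href="http://tech.example/search">search</a>',
    '<a href="http://tech.example/search">good read</a>',
    '<a href="http://news.example/monkeys">wow</a>',
]
print(top_links(posts))
```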

Technorati's Google Juice and Google Rank: David Sifry's new blog-indexing service offers a few twists. Google Juice can show how highly Google ranks a particular URL for a particular search term, while Google Rank lists the top returns for any given term. You can do both these tasks manually, but these massagings of Google returns definitely save time. Even better are his watchlist subscriptions, which push data to you when a ranking changes. These services can be particularly useful when you're using search-engine optimization techniques to gauge a site's ranking.

Google People: This toy from AvaQuest is a little different from the above ones, because it attempts to find an actual answer to a question which may be hard to resolve using just normal search techniques. Enter a "who is" question, like "Who discovered radium?" A parser pulls search terms from your question (presumably "discovered radium" in this case), submits the search to Google, and retrieves the results. It then consults a database of names to identify proper names within returned pages, compares the popularity of retrieved names across matching pages, and then guesses the name of the person who most likely fulfills the request. What I find particularly exciting about this example is that it massages search engine results into doing a new type of task... it utilizes multiple databases to automatically draw inferences about the world.
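
The steps described above can be sketched in Python as follows. Everything here is a stand-in: the stop-word list, the tiny name list, and the canned snippets are invented for illustration, and the real service presumably uses a much larger name database and live search results.

```python
import re
from collections import Counter

STOP_WORDS = {"who", "is", "was", "the", "a", "an", "of"}

# Stand-in for the database of names the service consults.
KNOWN_NAMES = ["Marie Curie", "Pierre Curie", "Henri Becquerel"]


def extract_terms(question):
    """Drop question words and punctuation to leave the search terms."""
    words = re.findall(r"[a-z0-9]+", question.lower())
    return [word for word in words if word not in STOP_WORDS]


def guess_person(snippets):
    """Pick the known name that appears in the most returned snippets."""
    counts = Counter()
    for text in snippets:
        for name in KNOWN_NAMES:
            if name in text:
                counts[name] += 1
    return counts.most_common(1)[0][0] if counts else None


# Canned text standing in for pages a search engine might return.
snippets = [
    "Marie Curie and Pierre Curie discovered radium in 1898.",
    "Radium was discovered by Marie Curie, working with her husband.",
    "Henri Becquerel discovered radioactivity; Marie Curie later isolated radium.",
]
print(extract_terms("Who discovered radium?"))
print(guess_person(snippets))
```

The guess is only a popularity vote across matching pages, so it's an inference rather than a verified answer, which is exactly what makes the approach interesting.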

Jon Udell's Library Lookup: If you're reading an online book catalog which includes its International Standard Book Number (ISBN) in its URL, then this set of bookmarklets can scrape the unique book identifier out of the address and submit a query to a local library to see if that book is currently available on the shelf. (Some libraries expose their book catalog as a web service.) This doesn't use Google or other search engines directly, but like Google People, Jon's Library Lookup combines a book database with a particular library database to do real work which would be difficult or tedious to do through other means. It also uses online engines in combination with web services, which brings us to the next topic....
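
Here's a Python sketch of the same idea, for readers who want to see the moving parts. It is not Jon's bookmarklet code: the bookstore address, the library catalog address, and its "isbn" parameter are all hypothetical, and it only handles ten-character ISBNs.

```python
import re
from urllib.parse import urlencode

# Nine digits plus a final digit or X, set off by non-alphanumeric characters.
ISBN10_PATTERN = re.compile(r"(?<![0-9A-Za-z])(\d{9}[\dXx])(?![0-9A-Za-z])")


def isbn_from_url(url):
    """Return the first ISBN-10 candidate found in the address, if any."""
    match = ISBN10_PATTERN.search(url)
    return match.group(1).upper() if match else None


def is_valid_isbn10(isbn):
    """ISBN-10 checksum: the weighted digit sum must be divisible by 11."""
    total = 0
    for position, char in enumerate(isbn):
        value = 10 if char == "X" else int(char)
        total += (10 - position) * value
    return total % 11 == 0


def library_query_url(isbn, catalog_base="https://catalog.library.example/search"):
    """Build a lookup address for a hypothetical library catalog."""
    return f"{catalog_base}?{urlencode({'isbn': isbn})}"


page = "https://books.example/exec/obidos/ASIN/0306406152/ref=xyz"
isbn = isbn_from_url(page)
if isbn and is_valid_isbn10(isbn):
    print(library_query_url(isbn))
```

The checksum test is a cheap way to avoid sending the library a ten-digit number that merely looks like an ISBN.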


Search Tomorrow:

Search engines started as a way to find documents which contained particular text. The next stage was finding the most authoritative document for a particular term. Search engines then grew to try to index as many documents as possible, to handle as many different requests as possible.

Now the pendulum is starting to swing the other way: towards specialized databases, like Daypop and Google News; towards specialized requests, like "What are people saying about so-and-so?" or "Who discovered radium?"; and towards specialized tasks, like finding whether the local library has a particular book available right now, or finding a nearby restaurant which has a particular item on the menu.

If you have a computer running Mac OS X, you're probably already familiar with the built-in Sherlock application. (If you're not yet using this OS on any of your computers, then you can see screenshots here.) Sherlock version 2.0 was essentially a skin over an existing search engine, but the 3.0 version is another type of application entirely. It pulls information from several online databases to help you with a single task.

Take a look at its "Yellow Pages" application. It already knows how to find a variety of online directories. After you search on a name and choose an address, Sherlock quickly submits the name to a mapping service to pull down directions. It knows where you are because it's a local application which can remember data across sessions.

Or look at the "Movies" application. It starts off with your zip code, either your stored default location or a location you enter while traveling. Sherlock retrieves a list of local theatres within a given radius. Choose a theatre, and it shows you which films are playing. Choose a film, and it gives you showtimes. You can pull up a short description of the film, or a video trailer, and in some cases can order tickets online.

In these types of applications, multiple data sources are pulled for appropriate content to perform one specific task. Instead of the focus being on the contents of a particular database, the focus is on what a person is trying to do: track their packages, monitor particular stocks, plan a night out on the town. The product is no longer the data that someone has collected; the product is instead the experience that the author is delivering.

How is this done? It's nothing new, really—Shockwave applications have been able to do data-mining across servers for years. The Watson application from Karelia Software was the first to become really popular at these types of tasks, and many of the applications inside Sherlock 3 are apparently directly inspired by their Watson counterparts.

Why this explosion now? Previously, web servers delivered just HTML pages. Data and presentation were mixed together... you'd have to parse the HTML manually to extract the data you wanted. (Watson still has parsing dictionaries telling how to isolate data within different HTML pages.) The recent emphasis on XML-formatted feeds, whether as RSS or Google APIs or web services or other feeds, has given us pure data without a presentation layer. More web servers are exposing more pure information than ever before.

This is where I think web-indexing will go: towards completing a task, rather than just finding a document. News aggregators, such as AmphetaDesk, can grab information from many sources, but we still need to add some smarts so they can watch for particular news and alert you (e.g., "hang out in the background and tell me immediately if the traffic service reports anything about Route 101 southbound during the late afternoon each workday"). There are significant legal and ethical issues to resolve about how to use data that others expose on the web. Some of these tasks can be done on demand by bookmarked search queries, but background applications which watch the web for you seem to me to be a hot area for growth over the next two years.
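
A first step towards that kind of watcher can be sketched in Python. The feed below is made up and follows the common RSS 2.0 layout; the function name is hypothetical, and a real watcher would also need to fetch the feed on a schedule and respect the time-of-day rule, both of which are left out here.

```python
import xml.etree.ElementTree as ET

# Made-up feed in the common RSS 2.0 layout (channel > item > title/description).
SAMPLE_FEED = """<rss version="2.0"><channel>
  <title>Hypothetical Traffic Service</title>
  <item><title>Stalled truck on Route 101 southbound</title>
        <description>Right lane blocked near the airport.</description></item>
  <item><title>Bridge toll plaza clear</title>
        <description>No delays reported.</description></item>
</channel></rss>"""


def matching_items(feed_xml, required_words):
    """Return titles of feed items that mention every required word."""
    wanted = [word.lower() for word in required_words]
    hits = []
    for item in ET.fromstring(feed_xml).iter("item"):
        text = " ".join(
            (item.findtext(tag) or "") for tag in ("title", "description")
        ).lower()
        if all(word in text for word in wanted):
            hits.append(item.findtext("title"))
    return hits


print(matching_items(SAMPLE_FEED, ["Route 101", "southbound"]))
```

Because the feed is pure data, the filter needs no knowledge of any page layout, which is what makes background watching practical.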

Whew! That's more than what I expected to type here, but there's a lot of fun and exciting stuff going on in the world of search engines and related services. I hope there's at least a thing or two in the above that can help make your time on the web more fun in 2003. Drop a note on the related item in my blog if you've got any comments, thanks!