Using Verity Vspider in ColdFusion Search Applications


Joe Cronin

www.verity.com

Created: 22 August 2005
User Level: Intermediate

Early on, Macromedia understood the need to provide ColdFusion developers with the ability to integrate advanced search features into their applications. Since 1997, it has integrated Verity into ColdFusion to provide that advanced search. Today, hundreds of thousands of developers have taken advantage of this Verity search to enhance the value and functionality of ColdFusion-based sites and applications. The search technology embedded in ColdFusion MX 7 is Verity’s flagship search product, Verity K2, the leading enterprise search software on the market today. With this integration of Verity K2 into ColdFusion, developers have access to the most sophisticated and powerful search capabilities available at a fraction of the cost of acquiring Verity K2 search separately.

This article explains how you can use a specific piece of the Verity search functionality: the Verity Spider (also called vspider.exe or vspider). Vspider is a tool that indexes content and builds collections that users can search. An important new vspider feature is available for ColdFusion MX 6.1 and 7.0: you can now extend vspider to build collections from data stored on a server other than the one hosting your ColdFusion server. This feature lets you implement an enterprise-wide search solution with ColdFusion. Learn more in the Verity white paper, "Understanding Verity’s ColdFusion Search Expansion Pack."

Requirements

To complete this tutorial you will need to install the following software and files:

ColdFusion MX (version 6.1 or above)

Some Background on Indexing and Creating Collections

The search functionality within ColdFusion performs searches against collections, not against the actual documents and database records within ColdFusion. A Verity collection is a special index that you create with Verity Spider or the ColdFusion tag, cfindex. These functions locate all the searchable documents and/or database content and extract the text and metadata within each document or record and other information, such as document zone and field data, word proximity, and the physical file system address or URL. Verity gathers all of this information together in the Verity collection. By combining this information into one index and running searches against it, rather than having to locate and access the actual documents and databases each time a user searches for information, you dramatically increase the speed and relevancy of your ColdFusion search capabilities. Verity also makes available advanced features, such as document summaries in results lists and the ability to limit searches to specific groups of documents.

ColdFusion supports collections of the following basic data types:
  • Text files such as HTML and CFML pages
  • Binary documents such as Microsoft Word documents, Adobe Acrobat PDF files, and so forth (see the complete list of supported file types at the end of this paper)
  • Record sets returned from a query, or a ColdFusion query object, including those returned by the cfquery, cfldap, and cfpop tags
  • Individual documents or an entire directory tree

Within ColdFusion, you can build searches of multiple collections, each of which can focus on a specific group of documents or queries, according to subject, document type, location, or any other logical grouping. Because you can perform searches against multiple collections, developers have substantial flexibility in designing a search interface.

Two Options: Deciding When to Use cfindex Versus Vspider

You can generate collections with either vspider or cfindex. When should you use vspider, and when should you use cfindex? The following table helps you decide.

Deciding When to Use cfindex Versus Vspider

Function                                          | cfindex | vspider
Indexes ColdFusion documents                      | Yes     | Yes
Indexes file system documents                     | Yes     | Yes
Indexes documents outside of ColdFusion Server *  | No      | Yes
Indexes a wide range of doc types                 | Yes     | Yes
Indexes dynamic content                           | No      | Yes
Configure and use through CF Administrator        | Yes     | No
Configure and use through command-line interface  | No      | Yes
Schedule indexing jobs                            | Yes     | Yes

With vspider, you can index web-based and file system documents in over two hundred of the most popular application document formats, including Microsoft Office, WordPerfect, ASCII text, HTML, SGML, XML and PDF (Adobe Acrobat) documents.

Vspider uses HTTP to "crawl" web servers and collect content to index. Vspider starts crawling at a particular web address you specify with the -start parameter value, for example, http://www.macromedia.com. Vspider requests this page and processes it, collecting all the words from the page and adding them to the index. It also collects all the referring links to other pages and adds these pages to a queue to process in a manner similar to the first page.

There are two main advantages to using vspider instead of cfindex:

  1. The vspider index can include dynamic content generated through HTTP requests to the web server. When you use the cfindex tag or index a collection through the Macromedia ColdFusion MX Administrator, you cannot include dynamic content.
  2. Because vspider requests pages from the web server, it indexes only pages that are visible through the web server. Web administrators often have files in directories that the web server can access but that are not linked from other files. Without a referring link, these files are invisible to users. If you indexed with the cfindex tag instead, these pages would inadvertently become visible to users. Indexing with vspider through the web server ensures that pages without a referring link are not indexed into the collection and are therefore unavailable to users.
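
The crawl described above, and the way it leaves unlinked pages out of the collection, can be pictured with a small sketch. This is a toy model in Python, not vspider's implementation: the SITE dictionary, its URLs, and the crawl function are hypothetical stand-ins for a web server and for real HTTP requests.

```python
from collections import deque

# Hypothetical site: each URL maps to (page text, links found on that page).
# orphan.html exists on the server, but no other page links to it.
SITE = {
    "http://localhost/index.cfm": ("welcome to the site", ["http://localhost/products.cfm"]),
    "http://localhost/products.cfm": ("product catalog", ["http://localhost/index.cfm"]),
    "http://localhost/orphan.html": ("draft page", []),
}


def crawl(start):
    """Return {url: words} for every page reachable by links from start."""
    index = {}
    queue = deque([start])
    seen = {start}
    while queue:
        url = queue.popleft()
        if url not in SITE:
            continue                    # link to a page that does not exist
        text, links = SITE[url]
        index[url] = text.split()       # "index" the words on the page
        for link in links:              # queue links that were not seen before
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return index


index = crawl("http://localhost/index.cfm")
print(sorted(index))
```

Because orphan.html has no referring link, the queue never reaches it, so it never enters the index. A file-system walk over the same directory would have picked it up.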

About Registering Collections

When you index using the standard ColdFusion search tags, ColdFusion MX communicates with a private-branded Verity K2 server, called the ColdFusion MX Search Server, which creates the collection and indexes documents for the tag.

Unlike the cfcollection tag, vspider acts directly on the collection without the use of ColdFusion MX Search Server. Vspider also has the ability to create a collection on its own. However, since vspider acts directly on the collection, the ColdFusion MX Search Server has no knowledge that the collection exists. The collection won’t be available for search through ColdFusion MX unless you specify the collection information to the ColdFusion MX Search Server explicitly. To do so, use ColdFusion Administrator to register the collection with the K2 Server by specifying the "create" option, as follows:

<cfcollection action="create" collection="myCollection" path="cf_root\verity\Data\Colls">

If the collection exists, as in this case, ColdFusion Administrator will simply register the collection with the K2 Server.

The following is a simple example of a command line for creating a collection called myCollection:

vspider -collection cf_root\verity\Data\Colls\myCollection 
-style cf_root\verity\Data\stylesets\ColdFusionVspider

Expanding the Domain of Vspider to Enterprise Content

Many customers use ColdFusion as the basis for their enterprise search initiatives. However, the out-of-the-box version of vspider can index and search only content stored on the same machine that hosts your ColdFusion server. Because of the sophistication of the search capabilities in Verity K2, the adoption of ColdFusion within many enterprises, and the desire to reduce and simplify development efforts, many developers want to expand the scope of the built-in vspider search capabilities to include content stored on other machines. This capability is now available through the Verity ColdFusion Search Expansion Pack. Learn more by downloading the white paper, "Understanding Verity’s ColdFusion Search Expansion Pack."

Getting Started with Vspider

Some of the examples in this article index content stored on remote web servers. Because the default vspider license is restricted to localhost indexing, the examples that describe non-localhost indexing require the Verity ColdFusion Search Expansion Pack.

You can find the vspider command-line utility in the cf_root\verity\k2\platform\bin directory of your ColdFusion MX installation. The easiest way to reference vspider is to add the \bin directory to your PATH environment variable. On Unix platforms, add the \bin directory to the LD_LIBRARY_PATH environment variable.

Vspider has a few basic parameters. The following is a simple command-line example of using vspider:

vspider -collection cf_root\verity\Data\Colls\myCollection 
-style cf_root\verity\Data\stylesets\ColdFusionVspider 
-cgiok 
-start http://www.macromedia.com

The definitions of the parameters are as follows:

Parameter | Specification
-collection | Specifies the file system path to the collection.
-start | Specifies the starting point for crawling and indexing documents. If your website has multiple starting points, use multiple -start arguments in your command line.
-cgiok | Specifies that vspider will index dynamic content. Although the name suggests that vspider will index only content generated by CGI, the option applies to any dynamic content.
-style | Specifies the file system path to the style file that defines the schema of the collection. Vspider uses its own style files, which differ from those of ColdFusion MX, so be sure to use these style files with vspider.
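
If you launch vspider from a script, it can help to assemble the arguments as a list rather than as one long string. The following is a minimal Python sketch that uses only the four options described above. The build_vspider_args helper is hypothetical, and the cf_root paths are the placeholders used elsewhere in this article, not real paths.

```python
def build_vspider_args(collection, style, starts, cgiok=False):
    """Assemble a vspider argument list from the four basic options."""
    args = ["vspider", "-collection", collection, "-style", style]
    if cgiok:
        args.append("-cgiok")       # bare switch, takes no value
    for url in starts:              # one -start per starting point
        args += ["-start", url]
    return args


args = build_vspider_args(
    collection=r"cf_root\verity\Data\Colls\myCollection",
    style=r"cf_root\verity\Data\stylesets\ColdFusionVspider",
    starts=["http://www.macromedia.com"],
    cgiok=True,
)
print(" ".join(args))
# To actually run it (assumes vspider is on your PATH):
#   import subprocess; subprocess.run(args, check=True)
```

Keeping each option and value as a separate list element avoids quoting problems when a path contains spaces.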

Vspider Examples

The following section contains examples of indexing using vspider in different scenarios. You can use these examples as a starting point for developing your own scripts.

Spidering a Single Web Server

An index that spiders a single web server has the following goals:

  • To index all content on a single web server (-start)
  • To ensure that the spider doesn’t follow any links to other web servers (-host)
  • To index dynamically-created web pages (-cgiok)

The syntax to use is as follows:

vspider -cmdfile /verity/vspider/intra.cmd 

The file intra.cmd contains the following specifications:

-collection icd.coll 
-start http://sigma.macromedia.com
-style cf_root\verity\Data\stylesets\ColdFusionVspider
-host  sigma.macromedia.com
-cgiok 

Spidering a Single Web Server, Excluding Certain Pages

An index that spiders a single web server but excludes certain pages has the following goals:

  • To index all content served on a single web server (-start)
  • To ensure that the spider doesn’t follow any links to other web servers (-host)
  • To index dynamically-created web pages (-cgiok)
  • To exclude pages that are contained in a particular directory, or that have a particular pattern in their URL (-exclude)

The syntax to use is as follows:

vspider -cmdfile /verity/vspider/intra.cmd 

The file intra.cmd contains the following specifications:

-collection icd.coll 
-start http://sigma.macromedia.com
-style cf_root\verity\Data\stylesets\ColdFusionVspider
-host  sigma.macromedia.com
-exclude */underconstruction/*
-cgiok
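
To get a feel for which URLs a pattern such as */underconstruction/* would exclude, you can try it against sample URLs. The sketch below uses Python's fnmatch module as a stand-in. It assumes that the asterisk matches any run of characters, including slashes. Vspider's own pattern matching may differ in detail, and the sample URLs are hypothetical.

```python
from fnmatch import fnmatchcase

EXCLUDE = "*/underconstruction/*"   # the pattern from intra.cmd above

# Hypothetical URLs on the spidered server
urls = [
    "http://sigma.macromedia.com/index.cfm",
    "http://sigma.macromedia.com/underconstruction/new.cfm",
    "http://sigma.macromedia.com/docs/underconstruction/draft.html",
]

# Assumption: '*' matches any run of characters, including '/'.
kept = [u for u in urls if not fnmatchcase(u, EXCLUDE)]
print(kept)
```

Under that assumption, the pattern excludes the underconstruction directory at any depth, not only at the top level of the site.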

Spidering Your Entire Intranet

An index that spiders an entire intranet has the following goals:

  • To index your internal web servers (-start)
  • To use two starting URLs (-start)
  • To ensure that the spider does not index pages outside of your intranet’s domain (-domain)
  • To index web servers that contain dynamically-created web pages (-cgiok)

The syntax to use is as follows:

vspider -cmdfile /verity/vspider/intra.cmd 

The file intra.cmd contains the following specifications:

-collection icd.coll 
-start http://sigma.macromedia.com
-start http://colt.macromedia.com
-style cf_root\verity\Data\stylesets\ColdFusionVspider
-domain macromedia.com 
-cgiok 
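
If you maintain several command files, you may prefer to generate them from a script. The following Python sketch writes the intranet command file shown above, one option per line. The render_cmdfile helper is hypothetical, and the file goes to a temporary directory purely for illustration.

```python
from pathlib import Path
import tempfile


def render_cmdfile(options):
    """Render (option, value) pairs one per line; a value of None is a bare switch."""
    lines = []
    for name, value in options:
        lines.append(name if value is None else f"{name} {value}")
    return "\n".join(lines) + "\n"


options = [
    ("-collection", "icd.coll"),
    ("-start", "http://sigma.macromedia.com"),   # repeated option: two starting URLs
    ("-start", "http://colt.macromedia.com"),
    ("-style", r"cf_root\verity\Data\stylesets\ColdFusionVspider"),
    ("-domain", "macromedia.com"),
    ("-cgiok", None),
]

cmd_path = Path(tempfile.mkdtemp()) / "intra.cmd"
cmd_path.write_text(render_cmdfile(options))
print(cmd_path.read_text())
# Then run: vspider -cmdfile <path to intra.cmd>
```

A list of pairs is used instead of a dictionary so that an option such as -start can appear more than once.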

Excluding HTML Documents Based on Content

The following index crawls an entire website, parsing pages for links to other documents, but does not index any HTML document that contains the text "welcome" in the <title> tag. The syntax is as follows:

vspider -cmdfile /verity/spider/skip1.cmd 

The file skip1.cmd contains the following specifications:

-collection icd.coll 
-start http://www.mysite.com 
-style cf_root\verity\Data\stylesets\ColdFusionVspider
-indskip title "welcome" 
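
The behavior described for this example, where a page is skipped for indexing but still parsed for links, can be sketched in Python with the standard html.parser module. This illustrates the idea only and is not how vspider is implemented. The PageScanner class, the scan function, and the sample page are hypothetical, and the case-insensitive match is an assumption.

```python
from html.parser import HTMLParser


class PageScanner(HTMLParser):
    """Collect the <title> text and every <a href> link from one HTML page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def scan(html, skip_text="welcome"):
    """Return (should_index, links). Links are returned even for skipped pages."""
    scanner = PageScanner()
    scanner.feed(html)
    # Assumption: the match is a case-insensitive substring test.
    should_index = skip_text.lower() not in scanner.title.lower()
    return should_index, scanner.links


page = ('<html><head><title>Welcome to My Site</title></head>'
        '<body><a href="/docs/a.html">Docs</a></body></html>')
print(scan(page))
```

The sample page is not indexed because its title contains "welcome", but its link to /docs/a.html is still returned so that the crawl can continue from there.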

Adding to an Existing Collection

Use the following syntax to add only Microsoft Word and Excel documents to an existing collection:

vspider -collection icd.coll 
-start http://www.mysite.com
-indmimeinclude application/msword 
-indmimeinclude application/excel 

The -indmimeinclude option tells vspider to index only the specified MIME types. This example contains a second instance of -indmimeinclude, which is necessary to index a second MIME type. Alternatively, you could include all values in a single instance of -indmimeinclude.

Updating Only Certain Documents

You do not need to specify -style when you update an existing collection, because the collection already contains style files. To update a large collection, but only with documents that were last indexed at least 30 hours ago, use the following syntax:

vspider -cmdfile /verity/spider/update.cmd 

The file update.cmd contains the following specifications:

-collection icd.coll 
-refresh 
-refreshtime 1 day 6 hours 
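
The value "1 day 6 hours" is the 30-hour threshold mentioned above. The following Python sketch works through the arithmetic and shows which documents such a threshold would select. The timestamps and URLs are hypothetical, and the selection logic is only an illustration of the rule, not vspider's code.

```python
from datetime import datetime, timedelta

# "-refreshtime 1 day 6 hours" expressed as a duration
refresh_age = timedelta(days=1, hours=6)
print(refresh_age.total_seconds() / 3600)   # 30.0 hours

# Hypothetical last-indexed timestamps, relative to a fixed "now"
now = datetime(2005, 8, 22, 12, 0)
last_indexed = {
    "http://www.mysite.com/a.html": now - timedelta(hours=2),
    "http://www.mysite.com/b.html": now - timedelta(hours=30),
    "http://www.mysite.com/c.html": now - timedelta(days=3),
}

# Select documents that were last indexed at least 30 hours ago
due = sorted(url for url, ts in last_indexed.items() if now - ts >= refresh_age)
print(due)
```

Only b.html and c.html are due for a refresh. The document indexed two hours ago is left alone.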

Making Sense of Vspider Command-line Options

Some of the available command-line options for vspider are as follows:

  • -include/indinclude
  • -exclude/indexclude
  • -mimeinclude/indmimeinclude
  • -mimeexclude/indmimeexclude

These command-line options can seem a bit confusing at first. But as you use this tool, you will see that they all have their use. The difficulty is determining which option to use.

Take a look at the options -include and -indinclude. With the -include option, vspider processes only pages that meet the expression criteria. Processing a page means indexing the page and following the links within the page.

In the following command-line example:

-include '*memo*'
-start 'http://web.macromedia.com/docs'

The starting page does not meet the expression criteria (-include '*memo*'); therefore, vspider will not index this page or follow the links within it. In other words, vspider will exit successfully, having indexed nothing.

On the other hand, if you specify the -indinclude option, vspider indexes only pages that meet the expression criteria. It still reads pages that do not meet the criteria, which makes it possible to follow the links within those pages. If you changed the previous command-line example to use -indinclude '*memo*', vspider would begin at the -start page, read it, and follow links from that page, perhaps to http://web.macromedia.com/docs/memos, to find and index content.

The same logic applies to the other options (-exclude/indexclude and so on). Understanding the nuances of these options will save you many headaches as you try to index your content.
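
The difference between -include and -indinclude is easier to see on a small link graph. The following Python sketch models the two behaviors as this article describes them. It is a toy model, not vspider's code: the LINKS graph and the spider function are hypothetical, and fnmatch stands in for vspider's pattern matching.

```python
from collections import deque
from fnmatch import fnmatchcase

# Hypothetical link graph: URL -> links found on that page
LINKS = {
    "http://web.macromedia.com/docs": ["http://web.macromedia.com/docs/memos"],
    "http://web.macromedia.com/docs/memos": ["http://web.macromedia.com/docs/memos/1.html"],
    "http://web.macromedia.com/docs/memos/1.html": [],
}


def spider(start, pattern, mode):
    """mode 'include': the pattern gates indexing AND link following.
    mode 'indinclude': the pattern gates indexing only; every page is read."""
    indexed, seen, queue = [], {start}, deque([start])
    while queue:
        url = queue.popleft()
        matches = fnmatchcase(url, pattern)
        if mode == "include" and not matches:
            continue                      # page is not processed at all
        if matches:
            indexed.append(url)
        for link in LINKS.get(url, []):   # follow the links on this page
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return indexed


start = "http://web.macromedia.com/docs"
print(spider(start, "*memo*", "include"))      # nothing is indexed
print(spider(start, "*memo*", "indinclude"))   # the two memo pages are indexed
```

With -include, the crawl stops at the starting page because that page fails the pattern. With -indinclude, the starting page is read but not indexed, and the crawl goes on to find the memo pages.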

This article explained how you can use Verity search functionality in your ColdFusion applications. The Verity Search Expansion Pack allows you to build collections from data stored on a server other than the one hosting your ColdFusion server. This feature lets you implement an enterprise-wide search solution with ColdFusion. Learn more in the Verity white paper, "Understanding Verity’s ColdFusion Search Expansion Pack."

About the author

Joe Cronin is director of Technical Services in Verity, Inc.'s Channel Partners group. He has a Bachelor of Science degree in computer engineering technology from Wentworth Institute of Technology. Verity is recognized by industry analysts such as Gartner, IDC, and the Delphi Group as the market leader in enterprise search software, including text search, classification, recommendation, monitoring, and concept extraction solutions. For more information, write to cfsearch@verity.com.