About the Verity Spider


The Verity spider is the indexing component embedded in Information Server. During the indexing process, the spider automatically detects the document type and indexes it appropriately. For information about the document filters used to index and display the supported document types, including WYSIWYG, PDF, HTML, and ASCII, refer to Chapter 13.

There are two interfaces for the Verity spider. The GUI spider appears in the SEARCH'97 Information Server Indexing Manager; the command-line spider, called vspider, is included with the product media in:

installdir/platform/admin/

where installdir is the full path to the directory in which you installed Information Server, and platform represents the name of your platform (for example, _solaris for Solaris).

The Indexing Manager and the command-line spider run as separate instances, so they are separate indexers. Each indexer requires its own set of style files. Style files determine configuration characteristics for a collection. The Indexing Manager uses a default set of style files. For more information, see "The Role of Style Files" in Chapter 12.

The command-line spider is installed during the basic product installation procedure. The features of the Verity spider are controlled through licensing options. By default, the command-line spider can walk through a file system's directory structures.

Licensing Options

Licensing options for the Verity spider control the behavior of the GUI and command-line spider, as described below.

Option
    Description

Default behavior
    Web crawling of the local host (must be the host machine for ) and file walking of files available on the network.

Default domain
    Adds to the default behavior, web crawling of the default domain (the domain for the local host).

Default domain plus remote hosts
    Adds to the default behavior, web crawling of the default domain plus web crawling of remote hosts. With this option enabled, the GUI spider is limited to indexing a single domain at a time.

The license file is called ind.lic, and it is stored in the admin directory in this location:

installdir/platform/admin/

where installdir is the product installation directory, and platform represents the name of your platform (for example, _solaris for Solaris).

The license file path is set in the inetsrch.ini file by the LicenseFile parameter in the UniversalSpider section.
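
As an illustration only, the following Python sketch looks up the LicenseFile parameter in the UniversalSpider section. It assumes that inetsrch.ini uses a plain section and key=value layout that Python's configparser module can read; the sample path and the function name are hypothetical and are not taken from the product.

```python
import configparser

# Hypothetical sample content; a real inetsrch.ini contains many other settings.
SAMPLE_INI = """
[UniversalSpider]
LicenseFile=/opt/verity/_solaris/admin/ind.lic
"""


def license_file_from_ini(text):
    """Return the LicenseFile value from the UniversalSpider section, or None."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    # fallback=None covers both a missing section and a missing parameter.
    return parser.get("UniversalSpider", "LicenseFile", fallback=None)


print(license_file_from_ini(SAMPLE_INI))
```

If the function returns None, the section or the parameter was not found in the text supplied.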

When you run the GUI spider, the enabled licensing options are reported in the application's log file. For the command-line spider, the enabled licensing options are printed at run time.

Meta Collections

A meta collection is a group of collections defined in a collection map file. When you submit a new indexing task to Indexing Manager (the HTML interface to the Verity spider), Information Server transparently creates a special meta collection structure in the default collection directory for the server. For information about setting the default collection directory, see "Setting Path Defaults" in Chapter 3, Setting Up Your Server.

The meta collection structure created by the Indexing Manager consists of a parent meta collection directory, five subdirectories, and a collection map file. The parent directory and map file exist at the same level. The parent directory and map file are automatically named after the first eight characters of the collection name you specify on the New Indexing Task page in Indexing Manager. For information about the syntax of the collection map file, refer to Appendix D.

Each subdirectory in the meta collection structure corresponds to a Verity collection. Each collection created with Indexing Manager consumes five collections from the total of 128 collections which Information Server supports.
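
The naming rule and the collection limit described above can be worked through with a short Python sketch. The function names are hypothetical; the figures (eight characters, five collections per indexing task, 128 collections in total) come from the text.

```python
MAX_COLLECTIONS = 128      # total collections Information Server supports
COLLECTIONS_PER_TASK = 5   # collections consumed by each Indexing Manager collection
NAME_LENGTH = 8            # characters of the collection name that are kept


def meta_collection_base_name(collection_name):
    """Return the base name shared by the parent directory and the map file."""
    return collection_name[:NAME_LENGTH]


def remaining_indexing_manager_collections(collections_in_use=0):
    """Return how many more Indexing Manager collections fit under the limit."""
    return (MAX_COLLECTIONS - collections_in_use) // COLLECTIONS_PER_TASK


print(meta_collection_base_name("engineering_docs"))
print(remaining_indexing_manager_collections())
```

With no other collections in use, the limit allows 25 collections built with Indexing Manager (25 x 5 = 125), which leaves three collections unused.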

The meta collection structure is created automatically and should not be altered. For more general information about Verity collections, their role and function in your application, refer to Chapter 12, Verity Collections.

Meta Collections not Supported by Command-line Spider

A collection built by the Indexing Manager can't be updated directly using the command-line spider in this release. This is because an indexing task submitted to the command-line spider produces a single universal collection, not a meta collection.

NOTE: In previous releases, the command-line spider produced meta collections identical in format to the meta collection built by the Indexing Manager. The command-line spider that shipped with Information Server V3.1 Service Pack 2 and Service Pack 3, called Verity Spider V3.5, generated universal collections.

You can update a meta collection using the command-line spider only if you first upgrade the meta collection to a single, universal collection. For complete information about the upgrade process, with usage examples, refer to the Verity Spider User's Guide.

Upgrading from Information Server V3.1

All Information Server V3.1 collections built by the Indexing Manager and/or the command-line spider can be searched by Information Server. However, meta collections can't be updated by the command-line spider, and universal collections can't be updated by the Indexing Manager.

Remember that you can search meta collections built by the Verity spider in previous releases; however, you can't use the command-line spider to add, delete, or change documents in them.

DNS Access for Indexing

To index a remote host, whether within the same network or across the world through the Internet, the host on which SEARCH'97 Information Server is running must be able to "find" the remote host by way of a Domain Name System (DNS) server.

For example, if you are running Information Server on the host Gatherer within your company's network, and you want to index a distant web site, Gatherer must have access to a DNS server that can find that web site.

Error Messages

The following messages are common indicators of Information Server's inability to do a DNS lookup of the site to be indexed.

Testing for Access

If you receive one of the above error messages, or aren't sure you can access a certain site, you can use either ping or nslookup to test the availability of a site in question. These programs are available on both UNIX and Windows NT.

From the host which is running SEARCH'97 Information Server, use either ping or nslookup against the site in question. If you don't see a resolved IP address for the remote hostname, then you will not be able to index the site.
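
If you prefer a scripted check, the same lookup can be sketched in Python with the standard socket module. This is not part of Information Server; the function name is hypothetical, and the script must be run on the host that is running Information Server for the result to be meaningful.

```python
import socket


def resolve_host(hostname):
    """Return the sorted IP addresses for hostname, or [] if the lookup fails."""
    try:
        results = socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        # The name could not be resolved from this host.
        return []
    # Each result ends with a socket address whose first item is the IP address.
    return sorted({result[4][0] for result in results})
```

An empty list corresponds to the case described above, in which no resolved IP address is shown for the remote hostname and the site cannot be indexed.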

Contact your network administrator for information about DNS access to the sites you want to index.





Copyright © 1998, Verity, Inc. All rights reserved.