The Indexing Process
To make a set of documents available for searching, you create a new collection using an indexing application, such as the Verity spider (Indexing Manager or vspider), and then identify the new collection to the search server.
The following illustration shows how an indexing application interacts with the Verity engine and the components of a collection.
How the Verity Engine Indexes Documents
The Verity engine uses a gateway to access the files (or other repositories) containing the document set to be indexed. By default, the file system gateway is used to index documents residing on file systems in various operating system environments. When the source documents require a different access method, such as HTTP or SQL, the indexing application specifies the appropriate gateway, which is identified in the style.vgw file ("vgw" stands for Verity gateway). The Verity spider uses an HTTP gateway called vgw_url to index documents.
After opening a document using the gateway, the search engine uses the appropriate filter to perform the following operations:
Field tokens are used to populate the documents table, while word tokens are used to generate the word list. The documents table and word list are described in more detail in "Collection Directory-Key Components."
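The split between field tokens and word tokens can be pictured with a small sketch. This is an illustration only, not the Verity engine's implementation: the function name index_documents, the "body" field name, and the whitespace tokenizer are all assumptions made for the example.

```python
from collections import defaultdict


def index_documents(docs):
    """Build a documents table and a word list from already-parsed documents.

    docs maps a document key to a dict of field name -> text. Treating the
    "body" field as the source of word tokens is an assumption of this sketch.
    """
    documents_table = {}          # document key -> field values
    word_list = defaultdict(set)  # word -> keys of the documents containing it
    for key, fields in docs.items():
        # Field tokens: every field except the body goes into the documents table.
        documents_table[key] = {k: v for k, v in fields.items() if k != "body"}
        # Word tokens: lowercased, whitespace-separated words from the body.
        for word in fields.get("body", "").lower().split():
            word_list[word].add(key)
    return documents_table, dict(word_list)


docs = {
    "/docs/a.txt": {"title": "Spider notes", "body": "The spider indexes documents"},
    "/docs/b.txt": {"title": "Gateway notes", "body": "The gateway opens documents"},
}
table, words = index_documents(docs)
print(table["/docs/a.txt"])        # {'title': 'Spider notes'}
print(sorted(words["documents"]))  # ['/docs/a.txt', '/docs/b.txt']
```

In this sketch, a search for a word consults the word list to find matching document keys, and the documents table then supplies the field values shown in a results list.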
The gateway used to index documents in a collection is also used to access document information for results list generation and to display the document for viewing. Each collection supports only one gateway.
Verity Indexers
The following table summarizes the indexing applications available with Verity Information Server.
Collections have the same architecture regardless of which indexer built them. The type of the document key values, however, depends on whether a file system gateway or an HTTP gateway is used to index the documents. If you are indexing a web site for searching using Information Server, then document keys are stored as URLs. If you are indexing documents in the file system, then document keys are stored as path names.
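An application that reads document keys from a results list may need to tell the two key types apart. The following is a minimal sketch of one way to do that. The function name key_kind and the scheme-based heuristic are assumptions for illustration, not part of any Verity API.

```python
from urllib.parse import urlsplit


def key_kind(document_key):
    """Classify a document key as 'url' or 'path'.

    Heuristic assumed for this sketch: keys with an http or https scheme are
    treated as URLs; everything else is treated as a file system path name.
    """
    scheme = urlsplit(document_key).scheme.lower()
    return "url" if scheme in ("http", "https") else "path"


print(key_kind("http://www.example.com/index.html"))  # url
print(key_kind("/usr/docs/index.html"))               # path
```

A key that carries some other scheme would be classified as a path here, so a real application would extend the check to match the gateways it actually uses.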
Meta Collections vs. Universal Collections
Information Server transparently creates a special meta collection when you use the Indexing Manager (the HTML interface to the Verity spider). A meta collection is a group of collections defined in a collection map file. The special meta collection generated by the Indexing Manager includes a separate collection for each document type. Along with the meta collection structure, the Indexing Manager produces a collection map file, which identifies the collections in the meta collection structure to Information Server.
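The grouping idea behind a meta collection can be sketched as a mapping from document type to the documents in that type's collection. This is a conceptual model only: it does not reproduce the collection map file format, the function name build_collection_map is hypothetical, and deriving the document type from the file extension is an assumption of the example.

```python
import posixpath


def build_collection_map(document_keys):
    """Group document keys into one sub-collection per document type.

    The document type is taken from the file extension, which is an
    assumption of this sketch; keys without an extension go to "unknown".
    """
    collection_map = {}
    for key in document_keys:
        ext = posixpath.splitext(key)[1].lstrip(".").lower() or "unknown"
        collection_map.setdefault(ext, []).append(key)
    return collection_map


keys = [
    "http://www.example.com/index.html",
    "http://www.example.com/guide.pdf",
    "http://www.example.com/faq.html",
]
meta = build_collection_map(keys)
print(sorted(meta))       # ['html', 'pdf']
print(len(meta["html"]))  # 2
```

By contrast, a universal collection would correspond to a single collection holding all three documents, with no per-type grouping.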
Unlike the Indexing Manager, which produces meta collections, the command-line spider and the mkvdk command-line tool both produce universal collections.
NOTE: In previous releases, the command-line spider produced meta collections identical in format to the meta collection built by the Indexing Manager. The command-line spider that shipped with Information Server V3.1 Service Pack 2 and Service Pack 3, called Verity Spider V3.5, generated universal collections.
In this release, a collection built by the Indexing Manager cannot be updated directly using the command-line spider. An indexing task submitted to the command-line spider produces a single universal collection, not a meta collection, and the collection schema generated for meta collections is not identical to the schema for universal collections.
If you want to use the command-line spider to upgrade collections built with Information Server V3.1 and Verity spider V3.1, you need to follow an upgrade procedure before you can make changes to these collections with Information Server V3.6 and Verity spider V3.6. For more information, see "Meta Collections" in Chapter 4, Indexing Web Sites.
Copyright © 1998, Verity, Inc. All rights reserved.