Submitting an Indexing Task
You can use the Indexing Manager to submit a new indexing task. To do so, follow these steps:
1. Click Indexing Manager on the menu bar.
2. Click New Indexing Task on the Indexing Manager menu. Information Server displays the following page.
Source
The source for the indexing task can be either a new URL or one you have indexed previously.
Note that what you can enter in "URLs to Index" and "Restrict indexing to" depends on your licensing. For more information on licensing, see "Licensing Options" above.
- Tasks - The names of indexing tasks you have submitted previously. Select the one you want to run.
- Edit - Allows you to edit the settings specified in previous indexing tasks.
- Name - Specify the name for a new indexing task. You can enter a descriptive name, or the URL or file path you're going to index. This value appears in the Tasks field under Source in the New Indexing Task page, and in the Name field in the Current Indexing Tasks page.
- URLs to Index - Enter the URL or file path for the indexing task. The syntax for a URL is http://www.<host>.com:<port>/<path>, where the port number is optional. The syntax for a file path is file://<path>, where <path> is c:\path for Windows file systems and /path for UNIX file systems. For UNIX, remember to include the leading slash, as in /usr/docs, which gives a URL of file:///usr/docs. Note that a URL doesn't have to begin with www. Each indexing task can cover any single domain. For example, you might index www.yourcompany.com, dev.yourcompany.com:8080, and admin.yourcompany.com/hr. However, you can't add a different domain such as www.verity.com to the same indexing task.
- Restrict indexing to: - You can select either Domain or Host. Domain restricts indexing to the domain of the starting URL. Host restricts indexing to the host machine of the starting URL. If you want your indexing to go outside the domain of the starting URL, you need to use the command-line spider with the appropriate license. See the Verity Spider User's Guide.
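The difference between the Domain and Host restrictions can be illustrated with a short Python sketch. This is not how the Verity spider is implemented; it is only an approximation under stated assumptions. In particular, the function names are hypothetical, and the "domain" is naively assumed to be the last two labels of the host name, which matches the yourcompany.com example above but would not hold for every naming scheme.

```python
from urllib.parse import urlsplit


def host_of(url: str) -> str:
    """Return the lowercase host name of a URL, without the port."""
    return (urlsplit(url).hostname or "").lower()


def domain_of(url: str) -> str:
    """Return the last two labels of the host name.

    Assumption for illustration only: the real spider's notion of
    "domain" is not documented here.
    """
    labels = host_of(url).split(".")
    return ".".join(labels[-2:])


def in_scope(start_url: str, candidate: str, restrict_to: str = "domain") -> bool:
    """Check whether candidate falls within the restriction of start_url."""
    if restrict_to == "host":
        return host_of(candidate) == host_of(start_url)
    return domain_of(candidate) == domain_of(start_url)


start = "http://www.yourcompany.com/"
for url in (
    "http://dev.yourcompany.com:8080/",
    "http://admin.yourcompany.com/hr",
    "http://www.verity.com/",
):
    # Print the URL, then whether it is in scope under Domain and under Host.
    print(url, in_scope(start, url, "domain"), in_scope(start, url, "host"))
```

Under these assumptions, the two yourcompany.com URLs are in scope for Domain but not for Host, and www.verity.com is out of scope for both.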
Destination
The destination is the collection to which you want the index written. To update an existing collection, select the collection you want to use from the list. To create a new collection, click the New... button.
Creating a New Collection
When you click the New Collection button, Information Server displays the following page.
To create a new collection, enter the following information:
Click Create to create the collection.
Advanced Options
The Advanced Options section allows you to specify some additional settings. When you click the Advanced button, Information Server displays the following additional options.
MIME Type
Select the document types you want indexed. The following document types can be indexed:
Filename
By default, nothing prevents the Verity GUI spider from following links (during web crawling) or walking through directory structures (during file walking). Web crawling starts at a specified URL and follows the links anywhere allowed by the "Restrict indexing to" option, including to a location "above" the starting directory. File walking starts at a named directory and walks through any subdirectories it finds.
For example, from this starting URL:
- http://www.some.web.site/region2/sales/
the GUI spider can follow links to http://www.some.web.site/, if they exist.
To limit the scope of spidering to the starting directory, you can specify an include pattern in the "Include only URLs matching the pattern(s)" text box. For example, the following pattern:
- */region2/sales/*
restricts spidering to URLs that include the string /region2/sales/, that is, the sales directory and any directories below it. Links to files in directories above this level, even region2, are not followed.
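To see which URLs a pattern such as */region2/sales/* would admit, you can try it against sample URLs. The Python sketch below uses the standard fnmatch module as a stand-in; it is an assumption that the spider's wildcard matching behaves like shell-style globbing, and the function name and the file names in the sample URLs are hypothetical.

```python
from fnmatch import fnmatchcase

# Include pattern from the example above.
INCLUDE_PATTERN = "*/region2/sales/*"


def is_included(url: str, pattern: str = INCLUDE_PATTERN) -> bool:
    """Return True if the URL matches the include pattern.

    fnmatchcase treats "*" as "any run of characters", including "/",
    and compares case-sensitively on every platform.
    """
    return fnmatchcase(url, pattern)


samples = [
    "http://www.some.web.site/region2/sales/",
    "http://www.some.web.site/region2/sales/q1/report.html",
    "http://www.some.web.site/region2/",
    "http://www.some.web.site/",
]
for url in samples:
    print(url, is_included(url))
```

Only the first two sample URLs match, which corresponds to the sales directory and anything below it.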
Networks Options
Submitting the Indexing Task
When you have specified all of the desired settings, click the Submit button to begin indexing. Information Server allows you to view the status of the indexing task as it proceeds.
Managing robots.txt Files
The robots.txt file is used on many web sites to specify which parts of the site indexers from outside the site should avoid. The Indexing Manager always honors all robots.txt files. In addition, if you are reindexing a site and robots.txt has changed, the indexer deletes documents that have been added to robots.txt. If you wish to ignore robots.txt files, you must use the command-line spider. See the Verity Spider User's Guide for details.
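The effect of a changed robots.txt on reindexing can be sketched in Python with the standard urllib.robotparser module. This only illustrates the general robots.txt convention, not the indexer's actual code; the robots.txt contents, the URLs, and the function name are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that now excludes one directory.
ROBOTS_TXT = """\
User-agent: *
Disallow: /region2/internal/
"""


def disallowed(indexed_urls, robots_txt, agent="*"):
    """Return the previously indexed URLs that robots.txt now excludes."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in indexed_urls if not parser.can_fetch(agent, url)]


previously_indexed = [
    "http://www.some.web.site/region2/sales/index.html",
    "http://www.some.web.site/region2/internal/plan.html",
]
# Documents in this list are the ones a reindex would drop.
print(disallowed(previously_indexed, ROBOTS_TXT))
```

Here only the document under /region2/internal/ is reported, since it is the only one covered by the new Disallow rule.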
Copyright © 1998, Verity, Inc. All rights reserved.