Clustering


Clustering is used to automatically find the groups of similar documents in a set of documents returned by the search engine.

Document clustering is a technique for automatically discovering the subtopics in a set of documents and grouping the documents by those subtopics. Organizing documents by subtopic can help you get a sense of the major subject areas covered in the document set and can help you find the documents of interest more quickly. For example, by clustering the results of a search, you can get an overview of the major subtopics relating to your search topic in the collection. If only one or two of the subtopics are of interest, you can quickly focus on the groups of interesting documents without wasting time scanning the rest. The subtopic overview can also help guide and improve searches by revealing the distribution of concepts in the collection and suggesting more specific ways of expressing your information need. The ultimate goal of clustering is to help you find relevant information more efficiently in the face of information overload.

In this section, you'll find the following:

Clustering Search Form

A sample form with clustering implemented is shown below.

Clustering Search Form Source

The SEARCHScript source for the clustering items in the sample form is shown below. The complete source for the sample form can be found in installdir/s97is/locale/english/samples/forms/cluster.htm.

Label
SEARCHScript
Form action
<!-- (NT CGI) <form method="POST" action="/search97cgi/s97_cgi.exe"> -->
<!-- (CGI) <form method="POST" action="/search97cgi/s97_cgi"> -->
<!-- (ISAPI) <form method="POST" action="/search97cgi/s97is.dll"> -->
<!-- (NSAPI) <form method="POST" action="/search97cgi/s97is_ns2.vts"> -->
For the protocol you are using, copy the matching line and remove the <!--, (protocol), and --> markers so that only the <FORM ... > tag remains.
Max Clusters
<SELECT NAME="ClusterCount">
<OPTION SELECTED VALUE="0">Default
<OPTION VALUE="5">5
<OPTION VALUE="10">10
<OPTION VALUE="15">15
<OPTION VALUE="20">20
</SELECT>
Effort
<SELECT NAME="ClusterEffort">
<OPTION SELECTED VALUE="Default">Default
<OPTION VALUE="Precision">Precision
<OPTION VALUE="Average">Average
<OPTION VALUE="Quick">Quick
</SELECT>
Order
<SELECT NAME="ClusterOrder">
<OPTION SELECTED VALUE="Relevance">Default
<OPTION VALUE="Relevance">Relevance
<OPTION VALUE="Center">Center
</SELECT>
Style
<SELECT NAME="ClusterStyle">
<OPTION SELECTED VALUE="Fixed">Default
<OPTION VALUE="Fixed">Fixed
<OPTION VALUE="Coarse">Coarse
<OPTION VALUE="Medium">Medium
<OPTION VALUE="Fine">Fine
</SELECT>

There is no single "correct" clustering for a given set of documents. For example, sometimes there are documents which just don't fit into a single cluster. There may also be situations where you search across collections built with a previous version of Information Server, in which case no clustering information will be available.

Information Server allows you to work around these anomalies with a combination of built-in techniques and SEARCHScript. For collections built with versions of Information Server prior to V3.0, and therefore without clustering information, a special cluster with a score of negative one (Score:-1) is automatically provided in addition to the maximum number of clusters set with ClusterCount. This cluster provides access to documents which satisfy your query, but could not be logically clustered. For documents which logically fit into more than one cluster, a second special cluster, with a score of zero (Score:0), is also automatically provided in addition to the maximum number of clusters set with ClusterCount.

With SEARCHScript, you can refine and customize how documents are clustered. With ClusterCount, you specify the maximum number of clusters to generate, typically far fewer than the number of documents being clustered. The maximum number of clusters you can request is equal to the number of documents. In practice, if you ask for ClusterCount=#documents, the actual number of clusters returned may be somewhat less than that number. Remember that in addition to this number, you may see either one or both of the special clusters.
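The score convention for the special clusters and the limit on the cluster count can be summarized in a short Python sketch. It is illustrative only and is not Information Server code: the function names `cluster_kind` and `max_clusters_shown` are hypothetical, and the sketch assumes that every regular cluster carries a score other than -1 or 0.

```python
def cluster_kind(score):
    """Classify a cluster by its score, following the convention in the text."""
    if score == -1:
        return "no-cluster-info"    # collection built before V3.0
    if score == 0:
        return "multiple-clusters"  # documents fit more than one cluster
    return "regular"                # assumption: any other score is a regular cluster


def max_clusters_shown(cluster_count, num_documents):
    """Upper bound on the clusters a results page may list.

    Assumes cluster_count is an explicit positive request (not the sample
    form's 0, which selects the default). The request cannot usefully exceed
    the number of documents, and up to two special clusters (Score:-1 and
    Score:0) may appear in addition.
    """
    regular = min(cluster_count, num_documents)
    return regular + 2
```

For example, a request for 5 clusters over 40 documents can list at most 7 clusters: 5 regular ones plus the two special ones. The actual number may be lower.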

Although there is no single "correct" clustering for a given set of documents, spending more processing time can produce "better" clusters. ClusterEffort controls this trade-off: Precision spends the most time, Quick spends the least, and Average falls somewhere in between.

ClusterOrder specifies the order in which to return the clusters and the documents within the clusters. Relevance returns documents in the same relative order in which they occur in the set submitted for clustering. For example, if cluster 1 contains the first, third, and seventh document submitted, they will be given in that relative order within the cluster. Center will return documents in order of their similarity to the cluster center.
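The two orderings can be illustrated with a small Python sketch. This is not how Information Server implements ClusterOrder; the `ClusteredDoc` type, the `order_documents` function, and the `center_similarity` values are hypothetical and exist only to show the difference between the two settings.

```python
from typing import NamedTuple


class ClusteredDoc(NamedTuple):
    title: str
    submitted_rank: int       # 1-based position in the set submitted for clustering
    center_similarity: float  # hypothetical similarity to the cluster center (higher is closer)


def order_documents(docs, cluster_order="Relevance"):
    """Return the documents of one cluster in the requested order."""
    if cluster_order == "Relevance":
        # Keep the relative order of the submitted set.
        return sorted(docs, key=lambda d: d.submitted_rank)
    if cluster_order == "Center":
        # Most similar to the cluster center first.
        return sorted(docs, key=lambda d: d.center_similarity, reverse=True)
    raise ValueError(f"unknown ClusterOrder: {cluster_order!r}")


# Cluster 1 from the example above: the first, third, and seventh documents.
cluster_1 = [
    ClusteredDoc("seventh", 7, 0.91),
    ClusteredDoc("first", 1, 0.40),
    ClusteredDoc("third", 3, 0.75),
]

by_relevance = [d.title for d in order_documents(cluster_1, "Relevance")]
by_center = [d.title for d in order_documents(cluster_1, "Center")]
```

With these made-up similarity values, `by_relevance` is first, third, seventh, while `by_center` is seventh, third, first.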

Finally, ClusterStyle specifies the style of cluster analysis to be used. With Fixed, clustering tries to produce the exact number of clusters specified with ClusterCount. However, since there is no optimal "natural" number of clusters, a further refinement can produce better results. Use Coarse to cluster documents into fewer, larger groups. Use Fine to cluster documents into many small groups. Use Medium for something in between.

Try experimenting with these options to get different clusters for a given search.
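When experimenting from a script rather than from the sample form, the four clustering fields can be assembled as ordinary form data. The Python sketch below only builds and validates the field values listed in the sample form; the helper name `clustering_fields` is hypothetical, and a real request would also need the query and any other fields of the complete form, which are not shown in this section.

```python
from urllib.parse import urlencode

# Values offered by the sample form for each clustering field.
ALLOWED = {
    "ClusterEffort": {"Default", "Precision", "Average", "Quick"},
    "ClusterOrder": {"Relevance", "Center"},
    "ClusterStyle": {"Fixed", "Coarse", "Medium", "Fine"},
}


def clustering_fields(count=0, effort="Default", order="Relevance", style="Fixed"):
    """Return the clustering form fields as a dict, rejecting unknown values."""
    if not isinstance(count, int) or count < 0:
        raise ValueError("ClusterCount must be a non-negative integer")
    fields = {
        "ClusterCount": str(count),  # 0 selects the default, as in the sample form
        "ClusterEffort": effort,
        "ClusterOrder": order,
        "ClusterStyle": style,
    }
    for name, allowed in ALLOWED.items():
        if fields[name] not in allowed:
            raise ValueError(f"{name} must be one of {sorted(allowed)}")
    return fields


# Clustering part of a POST body: 5 clusters, precision effort, coarse style.
body = urlencode(clustering_fields(count=5, effort="Precision", style="Coarse"))
```

Changing one argument at a time and comparing the resulting clusters is a simple way to see what each option does for a given search.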

Clustering Results Page

When clustering options are selected, the results page will look similar to the following. For this example, the clustering options were 5 clusters, 10 keywords, precision effort, relevance order, and coarse style. You can also customize the number of documents within each cluster using the ClusterPageSize property for the document object, as described in the SEARCHScript Reference Guide.

Clustering Results Template Source

The SEARCHScript source for the sample results page can be found in installdir/s97is/locale/english/samples/template/cluster.hts.

The following is an excerpt of the HTML and SEARCHScript source used in cluster.hts. It is only a portion of the file and will not process correctly on its own; the complete cluster.hts is required.


<%-- Begin case for cluster-type header --%>
<%-- For S97IS 3.0 collections --%>
<% if cluster.score > 0 then %>
<TR>
<TD VALIGN=TOP><B>Cluster $$cluster.index </B></TD>
<TD VALIGN=TOP COLSPAN=3><B>Score:</B> $$cluster.score</TD>
</TR>
<TR>
<TD VALIGN=TOP></TD>
<TD VALIGN=TOP COLSPAN=3><B>Keywords:</B>
<% first = 1 %>
<% foreach keyword in cluster.keywords %>
<% first ? keyword : ", " + keyword %>
<% first = 0 %>
<% endfor %>
</TD>
</TR>
<%-- For S97IS 3.0 collections where docs fit more than one cluster --%>
<% elseif cluster.score == 0 then %>
<TR>
<TD VALIGN=TOP><B>Cluster</B></TD>
<TD VALIGN=TOP COLSPAN=3><B>Miscellaneous - Documents fit more than one cluster.</B></TD>
</TR>
<TR>
<TD VALIGN=TOP></TD>
<TD VALIGN=TOP COLSPAN=3><B>Keywords:</B>
<% first = 1 %>
<% foreach keyword in cluster.keywords %>
<% first ? keyword : ", " + keyword %>
<% first = 0 %>
<% endfor %>
</TD>
</TR>
<% else %>
<%-- For non-S97IS 3.0 collections where no cluster information exists --%>
<TR>
<TD VALIGN=TOP><B>Cluster</B></TD>
<TD VALIGN=TOP COLSPAN=3><B>No cluster information for document. </B></TD>
</TR>
<% endif %>
<%-- End case for cluster-type header --%>
<% offset = 1 %>
<% foreach doc in cluster.documents %>
<TR>
<TD></TD>
<TD VALIGN=TOP>$$doc.score / $$doc.clusterScore </TD>
<TD><A HREF="$$doc.URL_HTML">
<% if exists( doc.title ) then %>
$$doc.title
<% else %>
$$doc.VdkVgwKey
<% endif %>
</A>
</TD>
</TR>
<% offset = offset + 1 %>
<% endfor %>
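Two idioms in the excerpt above are easier to follow when paraphrased in Python: the `first` flag that puts commas between keywords, and the fallback from a document's title to its VdkVgwKey. The sketch below is only a paraphrase for readers unfamiliar with SEARCHScript. It is not part of cluster.hts, the function names are hypothetical, and a document is modeled as a plain dictionary.

```python
def keyword_line(keywords):
    """Join keywords with ", " as the first-flag loop in the template does."""
    return "Keywords: " + ", ".join(keywords)


def link_text(doc):
    """Use the document title when the field exists, else the VdkVgwKey."""
    return doc["title"] if "title" in doc else doc["VdkVgwKey"]


line = keyword_line(["search", "cluster", "index"])
titled = link_text({"title": "Clustering Overview", "VdkVgwKey": "/docs/cluster.htm"})
untitled = link_text({"VdkVgwKey": "/docs/readme.txt"})
```

Here `line` is "Keywords: search, cluster, index", `titled` is the title, and `untitled` falls back to the key.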

For more information about SEARCHScript, see the SEARCHScript Reference Guide.





Copyright © 1998, Verity, Inc. All rights reserved.