WASD Web Services - Environment Overview

5 - Document Searching

5.1 - Plain-Text Search
5.2 - HTML Search
5.3 - Search Syntax
5.3.1 - "ISINDEX" Search
5.3.2 - Standard Search Form
5.3.3 - Forms-Based Search
5.3.4 - Search Options
5.3.5 - Example Search Form

The query and extract scripts provide real-time searching of plain-text and HTML documents, and document retrieval. The search is a simple-string search, not a GREP-style search. It is designed to provide a useful mechanism for locating documents containing a keyword, not for document analysis. For plain-text documents it also allows the selective extraction of only the portion of the document near a hit.

Only files with a plain-text or HTML MIME data type (see 2 - Document Access and Specification) will be searched. Other files may be specified, or be selected by a wildcard file specification, but their contents will not actually be searched.

Directory specifications may include a wildcard ellipsis (allowing a directory tree to be traversed) and/or file name wildcards. In other words, anything acceptable as VMS file system syntax is allowed (expressed in URL format, of course). See the examples in 5.3.2 - Standard Search Form.

5.1 - Plain-Text Search

A search of a plain-text file is straightforward. Each line in the file is searched for the required string. The first occurrence on a line is considered a hit; the line is not searched for any further occurrences.
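
The per-line behaviour described above can be sketched roughly as follows. This is only an illustration in Python, not the query script's actual code; the function name `find_hits` and its return format are assumptions made for the example. The case-insensitive default follows 5.3.4 - Search Options.

```python
def find_hits(lines, needle, case_sensitive=False):
    """Return (line_number, column) for the first occurrence on each matching line.

    Hypothetical helper: the name and return shape are illustrative only.
    """
    if not case_sensitive:
        needle = needle.lower()
    hits = []
    for number, line in enumerate(lines, start=1):
        haystack = line if case_sensitive else line.lower()
        column = haystack.find(needle)  # first occurrence only
        if column != -1:
            # One hit per line; later occurrences on the line are not examined.
            hits.append((number, column))
    return hits


if __name__ == "__main__":
    sample = ["The HTTP protocol", "nothing here", "protocol and Protocol again"]
    for number, column in find_hits(sample, "protocol"):
        print(f"line {number}, column {column}")
```

Note that the third sample line contains the string twice but is reported once, which is the behaviour the paragraph above describes.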

Searches of plain-text files allow the subsequent selection of partial documents (i.e. the retrieval of only a number of lines around any actual hit). This allows the user to selectively extract a portion of a document, avoiding the need to explicitly scan through to the section of interest.
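
A minimal sketch of partial retrieval, assuming the extract is simply a fixed number of lines either side of the hit. The function name `extract_around` and the `context` parameter are hypothetical; how the extract script actually chooses its window is not specified here.

```python
def extract_around(lines, hit_line, context=5):
    """Return (first_line_number, extracted_lines) centred on a 1-based hit line.

    Hypothetical helper: 'context' is an assumed window size, clamped to the file.
    """
    start = max(1, hit_line - context)
    end = min(len(lines), hit_line + context)
    return start, lines[start - 1:end]


if __name__ == "__main__":
    document = [f"line {n}" for n in range(1, 21)]
    first, portion = extract_around(document, hit_line=10, context=2)
    for offset, text in enumerate(portion):
        print(first + offset, text)
```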

5.2 - HTML Search

A search of an HTML file is a little more complex. As might be expected, only text presented in the document is searched; markup is ignored. That is, all text not part of an HTML tag construct is extracted and searched. For example, out of the following HTML fragment

<!-- an example HTML document -->
<P>
The document entitled <A TARGET="_blank" HREF="example.html">"Example Document"</A>
provides only an <I>overview</I> of the full capabilities of HTML.

only the following text would actually be searched

The document entitled "Example Document" provides only an overview
of the full capabilities of HTML.
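
The same extraction can be approximated with Python's standard `html.parser` module. This is a sketch of the idea only, not the query script's own parser; the class name `TextExtractor` and the function `searchable_text` are made up for the example, and collapsing whitespace is an added convenience.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect only character data; tags and comments are dropped."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def searchable_text(html):
    """Return the text outside markup, with runs of whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return " ".join("".join(parser.chunks).split())


if __name__ == "__main__":
    fragment = (
        '<!-- an example HTML document -->\n<P>\n'
        'The document entitled <A TARGET="_blank" HREF="example.html">'
        '"Example Document"</A>\n'
        'provides only an <I>overview</I> of the full capabilities of HTML.\n'
    )
    print(searchable_text(fragment))
```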

The mechanism for partial document retrieval available with plain-text files is not present with HTML documents. HTML files generally must be treated as a whole, with the formatting of current sections often very dependent on the formatting of previous sections. This makes extracting a subsection perilous without extensive syntactical analysis. On the positive side, HTML documents tend to be already divided into meaningful subdocuments (files), making retrieval of a hit naturally more-or-less within context.

Instead of partial document retrieval, the document is processed to place anchors for each hit, making it possible to jump directly to a particular section of interest. Generally this works well but may occasionally distort the presentation of a document.

5.3 - Search Syntax

A search may be initiated in three basic ways:

- Appending a question-mark and search string to a file specification (the simple syntax of "ISINDEX"-style searching). This is standard HTTP, and of course must conform to HTTP syntax.
- Providing the name of the query script followed by the directory path to be searched. The script then returns a standard search form.
- Forms-based search, which allows the format and mechanism of the search to be controlled.

5.3.1 - "ISINDEX" Search

Placing the HTML tag "<ISINDEX>" within a document's text is sufficient to inform the browser that searching is available for that document. The browser will inform the user of this and allow a search of that document to be initiated at any time. Note that it is limited to the one document.

Using the keyword search syntax explicitly is another method of initiating a search, and additionally can use a wildcard in the document specification. For example:

/wasd_root/doc/env/*.*?formatted

The following link provides an online demonstration search using the above syntax. Note the difference in the way plain-text file hits are presented compared with those of HTML files.

/wasd_root/doc/env/*.*?formatted

5.3.2 - Standard Search Form

Using the "QUERY" script name followed by a URL-format path specifying the directory to be searched returns a standard, script-generated search form.

The following link provides an online demonstration of the standard search form.

/cgi-bin/query/wasd_root/doc/env/

As with all search specifications, the directory specification may include a wildcard ellipsis (allowing a directory tree to be traversed) and/or file name wildcards. In other words, anything acceptable as VMS file system syntax is allowed (expressed in URL format, of course). See the following examples.

/cgi-bin/query/wasd_root/doc/env/*.html
/cgi-bin/query/wasd_root/doc/.../
/cgi-bin/query/wasd_root/doc/.../*.html
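
To make the wildcard semantics concrete, the sketch below matches URL-format paths against a specification. It assumes that "..." stands for zero or more directory levels and that "*" matches within a single path segment. It only covers specifications that end in a file name pattern, and it is not how the server itself resolves file specifications. The function names and the sample paths are invented for the illustration.

```python
import re


def spec_to_regex(spec):
    """Compile a URL-format wildcard specification into a regular expression."""
    segments = spec.split("/")
    pieces = []
    for index, segment in enumerate(segments):
        if segment == "...":
            # Ellipsis (assumed meaning): zero or more intermediate directories.
            pieces.append("(?:[^/]+/)*")
            continue
        # "*" matches any run of characters inside one path segment.
        pieces.append(re.escape(segment).replace(r"\*", "[^/]*"))
        if index < len(segments) - 1:
            pieces.append("/")
    return re.compile("".join(pieces) + r"\Z")


def select(paths, spec):
    """Return the paths that match the wildcard specification."""
    pattern = spec_to_regex(spec)
    return [path for path in paths if pattern.match(path)]


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    paths = ["/doc/index.html", "/doc/env/search.html", "/doc/env/notes.txt"]
    print(select(paths, "/doc/*.html"))
    print(select(paths, "/doc/.../*.html"))
```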

5.3.3 - Forms-Based Search

A "forms-based" search is initiated by the server receiving a file specification, which of course may contain wildcards, followed by a search parameter. This is a typical HTML forms format URL. For example:

*.txt?search=SIMPLE
/web/.../*.*?search=THIS
sub_directory/*.*?search=THAT
../sibling_directory/*.HTML?search=OTHER

The following link provides an online demonstration search using the forms-based syntax.

/wasd_root/doc/env/*.*?search=formatted
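
The difference between the keyword ("ISINDEX") syntax of 5.3.1 and the forms-based syntax can be illustrated by splitting such a URL into its parts. The sketch uses Python's standard `urllib.parse`; the function name `parse_search_url` and the rule "a query containing '=' is forms-based" are assumptions for the example rather than a description of the server's own parsing.

```python
from urllib.parse import parse_qs, unquote_plus, urlsplit


def parse_search_url(url):
    """Return (file_specification, search_string, style) for a search URL.

    Hypothetical helper: 'style' is "form" or "keyword".
    """
    parts = urlsplit(url)
    if "=" in parts.query:
        # Forms-based: name=value pairs, the string is in the "search" field.
        fields = parse_qs(parts.query)
        return parts.path, fields.get("search", [""])[0], "form"
    # Keyword style: the whole query component is the search string.
    return parts.path, unquote_plus(parts.query), "keyword"


if __name__ == "__main__":
    print(parse_search_url("/wasd_root/doc/env/*.*?search=formatted"))
    print(parse_search_url("/wasd_root/doc/env/*.*?formatted"))
```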

5.3.4 - Search Options

Additional URI components may be appended after the initial "search=" parameter. These are appended with intervening "&" characters.

Case-Sensitivity. An optional URI component of "case=yes" or "case=no" makes the search case-sensitive or case-insensitive (the default). The following example illustrates the use of this syntax:
```
/web/html/.../*.html?search=Protocol&case=yes

/web/html/.../*.html?search=PrOtOcOl&case=no
```
The following provides an online demonstration using the two links shown above:
- Case-sensitive search for "Protocol"
- Case-insensitive search for "PrOtOcOl"
Hits. An optional URI component of "hits=document" or "hits=line" makes the search results be presented by document (file) or line by line (the default). The following example illustrates the use of this syntax:
```
/web/html/.../*.html?search=protocol&hits=document

/web/html/.../*.html?search=protocol&hits=line
```
The following provides an online demonstration using the two links shown above:
- Search result granularity by document.
- Search result granularity by line (the default).
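
A client could assemble such URLs programmatically. The following sketch builds the query component with Python's standard `urllib.parse.urlencode`; the function name `build_search_url` is hypothetical, and the file specification is deliberately left unencoded so that the wildcard characters pass through as written.

```python
from urllib.parse import urlencode


def build_search_url(file_spec, search, case=None, hits=None):
    """Append "search=" plus any optional components to a file specification.

    Hypothetical helper: options left as None are simply omitted.
    """
    if case not in (None, "yes", "no"):
        raise ValueError("case must be 'yes' or 'no'")
    if hits not in (None, "document", "line"):
        raise ValueError("hits must be 'document' or 'line'")
    params = {"search": search}  # "search=" comes first
    if case is not None:
        params["case"] = case
    if hits is not None:
        params["hits"] = hits
    return f"{file_spec}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_search_url("/web/html/.../*.html", "Protocol", case="yes"))
    print(build_search_url("/web/html/.../*.html", "protocol", hits="document"))
```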

5.3.5 - Example Search Form

To allow the client to enter a search string and submit a search to the server, an HTML level 2 form construct can be used. Here is an example:

<FORM ACTION="/web/html/.../*.html">
Search HTML documents for: 
<INPUT TYPE=text NAME="search">
<INPUT TYPE=submit VALUE="[execute]">
</FORM>

Bells and Whistles

A form providing all the options referred to in 5.3.4 - Search Options is shown below (some additional white-space introduced for clarity):

<FORM ACTION="/web/html/.../*.html">

Search HTML documents for: 
<INPUT TYPE=text NAME="search">
<INPUT TYPE=submit VALUE="[execute]">

<BR><TT><A TARGET="_blank" HREF="/query/-/aboutquery.html">About</A> this search.</TT>

<BR><TT>Output By:
line <INPUT TYPE=radio NAME="hits" VALUE="line" CHECKED>
document <INPUT TYPE=radio NAME="hits" VALUE="document"></TT>

<BR><TT>Case Sensitive:
no <INPUT TYPE=radio NAME="case" VALUE="no" CHECKED>
yes <INPUT TYPE=radio NAME="case" VALUE="yes"></TT>

</FORM>
