
The truth about Verity

by Eron Cohen and Michael Smith

TeraTech http://www.teratech.com/

Introduction to indexing queries with Verity

The Verity text search engine is one of the best features of ColdFusion, but is often ignored altogether by developers.  This article will tell you how to use Verity to add the power of an internet-style search engine to the front door of your web application with little effort and complete control in CF code.  We cover all the basics of using the Verity search engine, with a focus specifically on indexing SQL queries.  If you are interested in also searching documents or other text such as CFHTTP result sets with Verity you will also benefit from this article, as all the same principles apply.

 

What can you index with Verity?
- Databases
- Web pages
- CFHTTP results
- Other documents such as PDFs, Microsoft Word files, etc.

What Is Verity?

Verity is an add-on that comes free with ColdFusion Server (Pro and Enterprise versions).  It provides a full-text search service that can be directed at web pages, documents, database queries and even your own custom text sources such as CFHTTP result sets.  The award-winning Verity search engine was created by Verity Inc. and has been around for several years.  Its technology is widely used on Internet sites and in non-web products (for example, the search in the ColdFusion Studio documentation is also based on the Verity search engine).  Since it is designed to do nothing but text searches, it is extremely fast and efficient at doing so.  Not only that, but it allows your users to compose their own advanced searches (AND, OR, NEAR, etc.) like those you may have used on AltaVista.  Verity provides flexibility and functionality that would take a lot of effort to provide using only ColdFusion and SQL code, and it can run hundreds of times faster to boot.  Imagine how useful it would be to easily implement a full-text search on your website.  Think of all the uses you could find for indexing databases and searching for records without having to rely on SQL queries to do so.

 

Getting started with Verity

The first time you use the Verity search engine, expect to spend a little time getting started.  There are a few concepts that you need to grasp, but the basics are quite easy to pick up.  There are really only three tags that you need to know how to use.  Listed in the order that we’ll use them, they are CFCOLLECTION, CFINDEX and CFSEARCH.  You’ll also need to get your mind around the idea of a Verity collection and understand how to handle the results of a search so that you can display them to the site’s user or do some other useful things with them.

 

The tag you’ll probably use the most is also the easiest of the three mentioned above: CFSEARCH.  If you already have a collection to work with, all you need to do is use this tag to extract search results.  Say you wanted to find instances of the word “ColdFusion” in your company’s knowledgebase.  Provided that someone had previously set up a “collection”, or index, of the knowledgebase database, the syntax would simply be:

 

<CFSEARCH NAME="knowledgebase_search"
    COLLECTION="Corporate_Knowledgebase"
    CRITERIA="ColdFusion">

 

This causes ColdFusion to ask Verity to look through its index of the corporate knowledgebase database for records containing the word “ColdFusion”.  If it finds any, it returns them in a query called “knowledgebase_search”, which can be referenced in much the same way as any SQL query.  You can use CFOUTPUT to output the results of the search, and Verity even assigns relevance scores that you can use to rank the results.  This is all discussed in further detail later on.  Meanwhile, let’s get started looking at the nuts and bolts of how to make Verity work for you.

 

CFCOLLECTION

A “collection” is another word for an index.  Before you can begin to work with Verity, you’ll first need to create a collection that Verity will use to index your text.  Not surprisingly, the CFCOLLECTION tag is used to create, maintain and delete Verity collections.  Strictly speaking, using this tag is optional, because the ColdFusion Administrator provides the same functionality.  Having said that, there are some very good reasons to use CFCOLLECTION to create your collections programmatically.  Aside from allowing you to create them on the fly, it also lets you create indexes on servers that you don’t have direct access to, for instance if you use a hosting company.  Furthermore, CFCOLLECTION can be used to optimize and repair ailing indexes, similar to the maintenance that any database index needs.

 

The syntax for CFCOLLECTION is really quite simple: 

<CFCOLLECTION ACTION="CREATE, DELETE, OPTIMIZE, REPAIR"
    COLLECTION="NAME_OF_COLLECTION"
    PATH="DIRECTORY WHERE YOU WANT TO CREATE YOUR COLLECTION"
    LANGUAGE="ENGLISH, ETC">

 

When you’re starting out, you mainly need to know how to create an index.  Of course the other actions are important too, but we won’t cover those here.  Suppose we want to create an index called “Customer_Database” in the C:\VERITY_COLLECTIONS directory.  We’d simply use the command:

 

<CFCOLLECTION ACTION="CREATE" COLLECTION="Customer_Database" PATH="C:\VERITY_COLLECTIONS">

 

Whether you create your index with CFCOLLECTION or in the ColdFusion Administrator, it MUST be done before you can do anything else.  This makes sense—you have to define the index before you can use it to hold data or search it.   Also, the directory that you specify in the path= parameter must already exist—otherwise ColdFusion will throw an error.
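Because a missing directory raises an error, one defensive pattern is to check for the directory before creating the collection.  The following is only a minimal sketch, not part of the original templates.  It assumes the ColdFusion service is allowed to create directories on the server, and the variable name collection_path is our own:

```cfml
<!--- Hypothetical variable holding the collection directory --->
<CFSET collection_path = "C:\VERITY_COLLECTIONS">

<!--- Create the directory first if it is not there yet --->
<CFIF NOT DirectoryExists(collection_path)>
    <CFDIRECTORY ACTION="CREATE" DIRECTORY="#collection_path#">
</CFIF>

<!--- Now the collection itself can be created --->
<CFCOLLECTION ACTION="CREATE"
    COLLECTION="Customer_Database"
    PATH="#collection_path#">
```

Run a template like this only once per collection, since the collection must not already exist when it is created.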

 

CFINDEX

CFINDEX is probably the trickiest of the three Verity tags to use.  Basically, this tag is used to index your website files or your database.  Since indexing a database is the more difficult of the two, we’ll cover that here.  In a nutshell, you first query your database and then use CFINDEX to index the fields in that query.  Depending on the circumstances, you can either index one record at a time or index the entire database in one go.  Often the former is the natural choice.  For instance, in the case of our customer database, it’s much more practical to index the records as they are added to the database.  Not only is this a “less expensive” operation, but it also means that the record is searchable immediately after it is entered into your database.  Furthermore, it should be noted that indexing data is extremely CPU intensive.  If you are indexing a large amount of data, you should at least do it during off-hours using the scheduler in the ColdFusion Administrator.  For large or clustered sites, you may want to set up a special machine whose sole purpose is to index data.

 

The syntax needed to index a database is as follows (see Allaire’s documentation for further syntax information):

 

<CFINDEX COLLECTION="The name of your previously created collection"
    ACTION="DELETE, OPTIMIZE, PURGE, REFRESH, UPDATE"
    TYPE="Type of data being indexed--for databases use CUSTOM"
    TITLE="A query column or text that is used to identify the data"
    KEY="A UNIQUE field to identify your DB record for later retrieval"
    BODY="Comma delimited list of fields from your database"
    CUSTOM1="May contain any text you want"
    CUSTOM2="May contain any text you want"
    QUERY="Name of the query being indexed">
 

 

Let’s go over how to make use of this in “real life”:

 

First, you must query your database to return the records you wish to index: 

 

<CFQUERY name="get_all_customers" datasource="my_database">
Select *
From my_customers_table, my_customer_addresses
Where my_customers_table.customer_id = my_customer_addresses.this_customer_id
</CFQUERY>

 

Now we’ll index the query in the collection we created earlier.  Notice that we do not use # signs around our column names.

 

        <CFINDEX

                COLLECTION="Customer_Database"

                ACTION="UPDATE"

                TYPE="CUSTOM" 

                BODY="first_name, last_name, address1, address2, city, state, country, email" 

                KEY="Customer_ID" 

                TITLE="last_name"

                QUERY="get_all_customers">

 

When indexing a database, the TYPE= field is ALWAYS custom.  Body contains the names of fields that are in your database which are returned by your query.  The KEY= field is the field that contains the name of the Unique ID for your data that will allow you to retrieve the record if it is found in a search.  The QUERY= field specifies the name of the query that was executed earlier that you’d like to index.

 

It is VERY important that you make sure your KEY field is unique and does not appear more than once in your query.  This is one of the pitfalls of using Verity.  In the example above, if, for instance, a customer had two addresses, we couldn’t use just the customer_id field for the KEY field, because her customer ID would come up more than once in the query.  A duplicate key confuses the indexing engine and causes it to fail with no error message displayed; it just stops indexing.  If you have reason to think that this problem could exist for you, you’ll have to be a bit creative with your SQL.  You can use SQL to concatenate a couple of fields together to make sure that the KEY field is unique.  For example:

 

<!--- The & operator concatenates strings in MS Access; other databases may use a different operator --->
<CFQUERY name="get_all_customers" datasource="my_database">
Select my_customers_table.customer_id & '~' & my_customer_addresses.customer_address_id as New_Unique_ID, *
From my_customers_table, my_customer_addresses
Where my_customers_table.customer_id = my_customer_addresses.this_customer_id
</CFQUERY>

 

This would create a field in our SQL query called “New_Unique_ID”, which combines the customer_id, a tilde and then the customer_address_id.  So if customer 234 had address ID 1533, it would generate an ID of 234~1533.  You would then name New_Unique_ID in the KEY attribute of CFINDEX.  When you do a search on your data, you’ll have to break the ID back down so you can get the record you were looking for.  This can easily be done using the ColdFusion ListFirst() and ListLast() functions with “~” as the delimiter.
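As a small illustration of splitting such a key, the sketch below treats the composite ID as a two-item list delimited by a tilde.  The variable names are ours, and the key value is hard-coded here only for demonstration; in practice it would come from the KEY column of a search result:

```cfml
<!--- Example composite key, as it would come back from a search --->
<CFSET composite_key = "234~1533">

<!--- Treat the key as a list delimited by "~" --->
<CFSET customer_id = ListFirst(composite_key, "~")>
<CFSET customer_address_id = ListLast(composite_key, "~")>

<!--- Displays: Customer 234, address 1533 --->
<CFOUTPUT>Customer #customer_id#, address #customer_address_id#</CFOUTPUT>
```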

 

 

CFSEARCH

The CFSEARCH tag is the one that actually lets you search through your collection to find the items that match your search criteria.  

<CFSEARCH NAME="Name of your search query"
    COLLECTION="Name of the collection (or collections!) to search"
    TYPE="Simple or Explicit (explained later)"
    CRITERIA="The actual search criteria"
    MAXROWS="Number of rows to return"
    STARTROW="Row number to start with (especially useful for returning x number of results at a time)">
 

It is important to note that you can list more than one collection to search in the Collection parameter of CFSEARCH.  This is especially useful because you may have more than one index of information you might want to search simultaneously for the same keywords.  

 

Here is an example of a working CFSEARCH.  It searches two collections (Collection1 and Collection2) for any records that contain the word Bethesda and a word starting with the letters C-O-H but that does NOT contain the word Eron:

 

<CFSEARCH name="this_search" Collection="Collection1, Collection2" Type="Simple"
            Criteria="Coh* and Bethesda NOT Eron">

 

Naturally, in the real world, you would probably have a form field or some other mechanism for obtaining the search criteria.  If the CFSEARCH tag finds any records, it returns for each one the value that you specified in the KEY attribute when you ran CFINDEX.  After the CFSEARCH tag has run, these values are available in a column called KEY.  So, for instance, if I wanted the record IDs returned from the search above, I could access #this_search.key#.  In other words, after you run CFSEARCH, the results are treated the same as those of any database query.  You could do something like this to output all the values from the KEY column:

 

<CFOUTPUT query="this_search">
#key#
</CFOUTPUT>

 

Or better yet, you could just access them all at once using the valuelist() function:

 

<CFSET my_query_results=valuelist(this_search.key)>

 

These values could then be used to recall the actual records from the database using the SQL “IN” command:

 

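<!--- my_query_results is a comma-delimited list of numeric keys, so it can be placed in the IN clause as-is --->
<CFQUERY name="the_results" datasource="my_database">
Select *
From my_customers_table
Where customer_id IN (#my_query_results#)
</CFQUERY>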

 

You will need to develop some sort of appropriate interface to handle and display the results of the search.  The CFSEARCH tag will return a few other useful fields to help you with this:

·        KEY: The value of the KEY attribute defined in the CFINDEX tag used to populate the collection. In our case, the customer ID.

·        TITLE: The value of the TITLE attribute defined by the <TITLE> HTML tag in any HTML or ColdFusion application page file that was indexed by CFINDEX. If the collection was TYPE=CUSTOM, TITLE returns the value of the TITLE attribute defined by the CFINDEX tag. If the collection was TYPE=FILE, TITLE also returns the value of the TITLE attribute defined by the CFINDEX tag.

·        SCORE: The relevancy score of the document based on the search criteria, from 0 to 100.

·        URL: The value of the URLPATH attribute defined in the CFINDEX tag used to populate the collection.

·        SUMMARY: The best three sentences or 500 characters of documents returned by a search.

·        CUSTOM1, CUSTOM2: User-defined fields. These can be extremely important in helping you determine where the information being returned came from.

·        CURRENTROW: The current row of the query being processed.

There are also fields for the whole query:

·        RECORDCOUNT: The total number of records returned by the query.

·        RECORDSSEARCHED: The total number of records in the index that were searched.
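To make these fields concrete, here is a minimal sketch of a results listing.  It assumes the Customer_Database collection from earlier has already been created and indexed; the search word is only an example:

```cfml
<CFSEARCH NAME="this_search"
    COLLECTION="Customer_Database"
    TYPE="SIMPLE"
    CRITERIA="Bethesda">

<!--- Fields that describe the whole result set --->
<CFOUTPUT>
#this_search.RecordCount# of #this_search.RecordsSearched# indexed records matched.<BR>
</CFOUTPUT>

<!--- Fields that describe each hit --->
<CFOUTPUT QUERY="this_search">
#CurrentRow#. #Title# (key: #Key#, score: #Score#)<BR>
</CFOUTPUT>
```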

 

Special Techniques

Adding one record at a time to your index

As stated earlier, you would usually want to index records as they are added to your database instead of en masse.  In summary, the way to accomplish this is to first add your record to the database and then query for that record and index the result.  If you are using autonumber key fields, you will have to take a couple of extra steps: first add your data to the database, then use the MAX() function to get the ID of the item you just added, and then run a third query to retrieve that record for indexing.  Use ACTION="UPDATE" in CFINDEX to indicate that you wish to update the index.
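The three-query approach described above might look roughly like the sketch below.  The form field names and query names are assumptions for illustration.  Wrapping the insert and the MAX() lookup in CFTRANSACTION is our own suggestion, intended to reduce the chance of picking up another user's new record:

```cfml
<CFTRANSACTION>
    <!--- 1. Add the new record (form field names are hypothetical) --->
    <CFQUERY NAME="add_customer" DATASOURCE="my_database">
        INSERT INTO my_customers_table (first_name, last_name, city)
        VALUES ('#Form.first_name#', '#Form.last_name#', '#Form.city#')
    </CFQUERY>

    <!--- 2. Find the autonumber ID that was just assigned --->
    <CFQUERY NAME="get_new_id" DATASOURCE="my_database">
        SELECT MAX(customer_id) AS new_id
        FROM my_customers_table
    </CFQUERY>
</CFTRANSACTION>

<!--- 3. Retrieve just that record --->
<CFQUERY NAME="get_new_customer" DATASOURCE="my_database">
    SELECT *
    FROM my_customers_table
    WHERE customer_id = #get_new_id.new_id#
</CFQUERY>

<!--- 4. Add the single record to the existing collection --->
<CFINDEX COLLECTION="Customer_Database"
    ACTION="UPDATE"
    TYPE="CUSTOM"
    QUERY="get_new_customer"
    KEY="customer_id"
    TITLE="last_name"
    BODY="first_name,last_name,address1,address2,city,state,country,email">
```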

 

Updating records in your index

The process is similar to adding a new record to the index.  In order to update a SINGLE record you simply need to specify an ID that has already been used for a previously indexed item for your KEY= field.   In other words, when you are indexing data, if the KEY= field is a new id, the Verity search engine simply adds the record to the collection.  If the KEY= field contains an ID that has been used previously it will update the data in the collection.  When doing this, you’ll typically use the ACTION=”UPDATE” directive.  However, if you’re redoing your entire index anyway, you’ll want to use ACTION=”REFRESH” which basically erases your entire index and then recreates it.
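As a rough sketch of updating one indexed record, the code below changes a customer's email address and then re-indexes that single row under its existing key.  The form fields customer_id and email are assumed to be posted by a hypothetical edit form:

```cfml
<!--- Change the record in the database --->
<CFQUERY NAME="update_customer" DATASOURCE="my_database">
    UPDATE my_customers_table
    SET email = '#Form.email#'
    WHERE customer_id = #Val(Form.customer_id)#
</CFQUERY>

<!--- Fetch the changed record --->
<CFQUERY NAME="get_changed_customer" DATASOURCE="my_database">
    SELECT *
    FROM my_customers_table
    WHERE customer_id = #Val(Form.customer_id)#
</CFQUERY>

<!--- Same KEY value as before, so Verity replaces the old entry --->
<CFINDEX COLLECTION="Customer_Database"
    ACTION="UPDATE"
    TYPE="CUSTOM"
    QUERY="get_changed_customer"
    KEY="customer_id"
    TITLE="last_name"
    BODY="first_name,last_name,address1,address2,city,state,country,email">
```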

 

Dealing with CFHTTP

CFHTTP is a special case.  If you need to index information on an external website, you can do so, but it’s a little bit tricky.  First, you use CFHTTP to retrieve the page you want to index.  Next you create a “simulated” query using ColdFusion’s QueryNew, QueryAddRow and QuerySetCell functions.  Finally, you use CFINDEX as you would with any query—you’ll need to put something in the CUSTOM1 field to indicate where the information was retrieved from.  Normally, you would put the URL that it originated from so that when there’s a hit on that record, you’ll be able to point the searcher to the correct webpage.
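The steps above can be sketched as follows.  The URL, the page title, the column names and the collection name External_Pages are all placeholders of our own, and the collection is assumed to have been created beforehand:

```cfml
<!--- Placeholder URL; substitute the page you want to index --->
<CFSET page_url = "http://www.example.com/">

<!--- 1. Retrieve the page; its text arrives in CFHTTP.FileContent --->
<CFHTTP URL="#page_url#" METHOD="GET"></CFHTTP>

<!--- 2. Build a one-row "simulated" query --->
<CFSET page_query = QueryNew("page_key,page_title,page_body,page_url")>
<CFSET temp = QueryAddRow(page_query)>
<CFSET temp = QuerySetCell(page_query, "page_key", page_url)>
<CFSET temp = QuerySetCell(page_query, "page_title", "Example home page")>
<CFSET temp = QuerySetCell(page_query, "page_body", CFHTTP.FileContent)>
<CFSET temp = QuerySetCell(page_query, "page_url", page_url)>

<!--- 3. Index it like any other query, keeping the source URL in CUSTOM1 --->
<CFINDEX COLLECTION="External_Pages"
    ACTION="UPDATE"
    TYPE="CUSTOM"
    QUERY="page_query"
    KEY="page_key"
    TITLE="page_title"
    BODY="page_body"
    CUSTOM1="page_url">
```

When a search later hits this record, the CUSTOM1 column of the result holds the URL to send the searcher to.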

 

Verity Search Query Language

As mentioned earlier, you can do more than just search for single words using the CRITERIA attribute of CFSEARCH.  This is where some of the real power of the Verity search engine kicks in.  Make sure you have a good help page drawn up to let your users know what they can do with Verity.  Let’s have a look at a few of the best of these features:

 

You can enter comma-delimited strings and use wildcard characters (regular expressions). By default, a simple query searches for words, not strings. For example, entering the word "all" will find documents containing the word "all" but not "allegorical." You can use wildcards, however, to broaden the scope of the search. "all*" will return documents containing both "all" and "alliterate." Case is ignored, but only when (as above) the search string is all lowercase or all uppercase.  If the criteria are mixed case ("All"), only the same case will match (only "All", not "all" or "ALL"). To get around this behavior, you can simply wrap the search criteria in ColdFusion’s LCase() or UCase() function to convert them to all lowercase or all uppercase.
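For instance, a case-insensitive search could be sketched like this.  Form.Criteria is assumed to come from a search form, and the variable name search_criteria is ours:

```cfml
<!--- Lowercase the user's input so that Verity ignores case --->
<CFSET search_criteria = LCase(Form.Criteria)>

<CFSEARCH NAME="this_search"
    COLLECTION="Customer_Database"
    TYPE="SIMPLE"
    CRITERIA="#search_criteria#">
```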

 

You can enter multiple words separated by commas: software, Microsoft, Oracle. The comma in a Simple query expression is treated like a logical OR. If you omit the commas, the query expression is treated as a phrase, so documents would be searched for the phrase "software Microsoft Oracle."

 

You can use the AND, OR, and NOT operators in a simple query: software AND (Microsoft OR Oracle). To include an operator in a search, you surround it with double quotation marks: software "and" Microsoft. This expression searches for the phrase "software and Microsoft."
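The following sketch shows the three forms side by side, using the knowledgebase collection from the beginning of the article.  The query names are ours.  The last tag uses single quotes around the attribute value so that the double quotes reach Verity intact:

```cfml
<!--- Commas act as OR: matches any one of the three words --->
<CFSEARCH NAME="any_word"
    COLLECTION="Corporate_Knowledgebase"
    TYPE="SIMPLE"
    CRITERIA="software, Microsoft, Oracle">

<!--- Explicit operators with grouping --->
<CFSEARCH NAME="software_vendor"
    COLLECTION="Corporate_Knowledgebase"
    TYPE="SIMPLE"
    CRITERIA="software AND (Microsoft OR Oracle)">

<!--- A quoted operator is searched for as an ordinary word --->
<CFSEARCH NAME="literal_and"
    COLLECTION="Corporate_Knowledgebase"
    TYPE="SIMPLE"
    CRITERIA='software "and" Microsoft'>
```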

 

A simple query employs the STEM operator and the MANY modifier. STEM searches for words that derive from those entered in the query expression, so that entering "find" will return documents that contain "find," "finding," "finds," etc. The MANY modifier forces the documents returned in the search to be presented in a list based on a relevancy score.

 

For a full list of Verity operators, see the online help page on our knowledge base at http://www.teratech.com/knowledgebase/. You may also want to poke around the Verity Inc. website and look at their documentation.  There are also a few known undocumented features that can be used, such as the <THESAURUS> modifier.  When used before a search word, it makes Verity also find words that are synonyms of that search word. <SOUNDEX> is another undocumented Verity keyword. To get the SOUNDEX operator to work, you need to modify the style.prm file. I've tested both operators and they both work great!

 

You will want to be aware of a file called “vdk20.stp” in your CFUSION\VERITY\COMMON\ENGLISH\ directory.  It’s the list of words that Verity will ignore when a search is performed.  These are generally “human garbage” words that you’d want to normally throw away when you search.  For example “A, An, Another, is, are, my, etc.”  If you would like to, you can add your own words to this list or take out some of the ones Verity ignores by default.

 

 

What do I have to watch out for?

Using Verity as part of ColdFusion is not without its foibles.  There are definitely some things a developer needs to be aware of in order to steer clear of potential problems.  First of all, Verity indexes tend to get bloated after a while.  It’s best to plan ahead for this and make it easy to regularly drop and reindex your collections.  If it’s not possible to delete and recreate your index, then you should at least OPTIMIZE it on a regular basis.  You can handle this automatically by using the ColdFusion scheduler along with the CFCOLLECTION tag to set up a regular time to do this. Note that no other process can be using a collection while it is being updated. If one is, you will get memory leaks on your CF server and it will eventually crash. This means you should put an exclusive named lock around the code that updates the collection (such as CFCOLLECTION), and a read-only lock with the same name around every CFSEARCH that reads it.
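A minimal sketch of this locking pattern follows.  The lock name and the timeout values are arbitrary choices of ours; what matters is that the updating template and the searching template use the same NAME:

```cfml
<!--- In the scheduled maintenance template: nothing else may touch the collection --->
<CFLOCK NAME="verity_customer_database" TYPE="EXCLUSIVE" TIMEOUT="60">
    <CFCOLLECTION ACTION="OPTIMIZE" COLLECTION="Customer_Database">
</CFLOCK>

<!--- In the search template: many searches may run at once, but never during an update --->
<CFLOCK NAME="verity_customer_database" TYPE="READONLY" TIMEOUT="30">
    <CFSEARCH NAME="this_search"
        COLLECTION="Customer_Database"
        TYPE="SIMPLE"
        CRITERIA="Bethesda">
</CFLOCK>
```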

 

Next, you need to make sure that your collections don’t get too large.  Smaller collections speed up your searches, and according to the Allaire Knowledgebase there is a limit of approximately 100 megabytes on the size of a collection.  To prevent oversized collections, you may need to divide your indexes into more than one collection.  Luckily, Allaire makes it very easy to search multiple indexes simultaneously.  As shown earlier, you do this by using a comma-separated list of collection names in the COLLECTION attribute of your CFSEARCH tag.  Finally, we are aware of one combination of search criteria that causes a problem in CFSEARCH: if you type two ANDs in a row (for example, cohen AND AND eron), Verity will throw an error.  You can either strip the double AND out of the search phrase using the Replace() function before the CFSEARCH, or just use CFTRY/CFCATCH to catch any errors and display an error message.
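Both defenses can be combined, as in the sketch below.  We use ReplaceNoCase() rather than Replace() so that "and and" is caught as well; that choice, the variable name and the error text are our own.  A single pass does not collapse three or more ANDs in a row, which is one reason to keep the CFTRY around the search:

```cfml
<!--- Collapse a doubled AND into a single one (Form.Criteria is assumed to come from a search form) --->
<CFSET clean_criteria = ReplaceNoCase(Form.Criteria, " AND AND ", " AND ", "ALL")>

<CFTRY>
    <CFSEARCH NAME="this_search"
        COLLECTION="Customer_Database"
        TYPE="SIMPLE"
        CRITERIA="#clean_criteria#">

    <!--- Any remaining problem with the search expression ends up here --->
    <CFCATCH TYPE="Any">
        Sorry, we could not understand your search. Please check your keywords and try again.
    </CFCATCH>
</CFTRY>
```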

 

Putting it all together (code example)

Here are some simple ColdFusion templates to help you get started.  You will need to either download and install the MS Access database example or create a database table called “my_customers_table” in an available datasource.  It will need the following fields along with a few rows of example data: customer_id, first_name, last_name, address1, address2, city, state, country, email.  You should save these templates to a directory on your server and then run them in the order that they are listed here.

 

The example code is based on code generated by the Verity Wizard that comes with ColdFusion Studio.  To access the wizard, choose FILE>NEW inside ColdFusion Studio, then click on the CFML tab.  The Verity wizard will most likely be the last one in your list.  The wizard is actually designed for generating pages that index and search static HTML web pages, but its pages give a good start for other kinds of work and are very easy to recycle for other types of Verity searches.  If you own ColdFusion Studio, they are definitely worth examining!

 

 

 

 

Create_collection.cfm

 

 

<!---

File Name: create_collection.cfm

Purpose: Create the Verity collection used by the other templates.

 

Created by: Eron Cohen

On: Sunday, September 03, 2000

 

Comments: Make sure you have a directory on your server called c:\verity_collections.  Otherwise, change the path to something that does exist!

 --->

 

 

 <CFCOLLECTION ACTION="CREATE" COLLECTION="Customer_Database" Path="C:\VERITY_COLLECTIONS">

 

 

Index_query.cfm

 

<!---

File Name: index_query.cfm

Purpose: Index every record in the customer table.

 

Created by: Eron Cohen

On: Sunday, September 03, 2000

 

Comments: For this to work, you need to make sure you have the database available!  You can download the example database or create a table called my_customers_table in your own datasource.  It will need the following fields along with a few rows of example data: customer_id, first_name, last_name, address1, address2, city, state, country, email.

 

This will index the entire database table.

 --->

 

 

<!--- First we need to query the database  --->

<CFQUERY name="get_all_customers" datasource="my_database">

Select *

From my_customers_table

Order by last_name

</CFQUERY>

 

 

 

<!---  Now we can index the query results --->

<CFINDEX

                COLLECTION="Customer_Database"

                ACTION="UPDATE"

                TYPE="CUSTOM" 

                BODY="first_name, last_name, address1, address2, city, state, country, email" 

                KEY="Customer_ID" 

                TITLE="last_name"

                QUERY="get_all_customers">

 

 

 

Search_form.cfm

 

<!---

File Name: search_form.cfm

Purpose: Gather input for search.

 

Created by: Eron Cohen

On:  Sunday, September 03, 2000

Comments:

 

This was borrowed from the ColdFusion Studio Verity wizard.  To access the wizard choose FILE>NEW and then click the CFML tab.  It is mainly used to generate code for indexing websites not queries!

 

 

 --->

 

 

 <FORM action="search_collection.cfm" method="post">

            <INPUT type="hidden" name="StartRow" value="1">

 

            <TABLE>

 

                        <TR>

                                    <TD>Keywords:</TD>

                                    <TD><INPUT type="text" name="Criteria" size="30"></TD>

                        </TR>

 

                        <TR>

                                    <TD>Max Rows:</TD>

                                    <TD><SELECT name="MaxRows"> <OPTION> 10 <OPTION> 25 <OPTION> 100 </SELECT></TD>

                        </TR>

 

                        <TR>

                                    <TD colspan=2><INPUT type="submit" value="   Search &gt; &gt; "></TD>

                        </TR>

 

            </TABLE>

 

</FORM>

 

 

search_collection.cfm

 

<!---

File Name: search_collection.cfm

Purpose: Run the search and display the matching records.

 

Created by: Eron Cohen

On:  Sunday, September 03, 2000

Comments:

 

This was borrowed from the ColdFusion Studio Verity wizard.  To access the wizard choose FILE>NEW and then click the CFML tab.  It is mainly used to generate code for indexing websites not queries!

 

 

 --->

 

 

 

<!---

 Use CFSEARCH to get all "hits" on this search criteria.  Up to the specified Max. number of rows.

 --->

 

<CFSEARCH Name="this_search"

                          Collection="Customer_Database"

                          Type="Simple"

                          Criteria = "#Form.Criteria#"

                          MaxRows = "#Evaluate(Form.MaxRows + 1)#"

                          StartRow = "#Form.StartRow#">

                         

 

<!--- Now query the database to return those records, unless recordcount was 0!       --->      

 

 

<CFIF this_search.recordcount is not 0>

<CFQUERY name="get_search_results_records_query" datasource="my_database">

 

            Select *

            From my_customers_table

            where customer_id in (#valuelist(this_search.key)#)

 

</CFQUERY>                       

 

<!--- Now display the results to the searcher:   --->

 

Here are the records that were found:<p>

 

<CFOUTPUT query="get_search_results_records_query">

#currentrow#: #first_name#, #last_name#, #address1#, #address2#, #city#, #state#<br>

</CFOUTPUT>

 

<CFELSE>

 

Sorry, no records match your search criteria--please try <A HREF="javascript: history.go(-1);">again!</A>

 

</CFIF>

 

Summary

Verity is a great text searching tool and is worth adding to your CF skill set. In this article we have shown you how to use CFCOLLECTION, CFINDEX and CFSEARCH to create, populate and search Verity collections built from database queries. We also covered creating custom Verity indexes of other text you have in a CF page, such as CFHTTP results, and various pitfalls of using Verity in CF.

 

Resources

ColdFusion Documentation "Indexing Data with Verity"

 

MDCFUG article "Verity Free Text Search" by Michael Smith

 

Web Techniques "Hot Searches with ColdFusion" by Sanjay Patel and Charles Linville

 

Chapter 29 of "The ColdFusion 4.0 Web Application Construction Kit" by Ben Forta

 

CNET's "ColdFusion Tips: Avoid Problems With Large Verity Collections" by Rob Bilson and Mike Van Hoozer

 

CNET's "Understanding Verity Collections in ColdFusion" by Jeremy Petersen

 

TeraTech ColdCuts tips on Verity

 

Verity’s website: http://www.verity.com/

 

 

 

Bio

Eron Cohen is a ColdFusion programmer, MDCFUG speaker and author. Michael Smith is president of TeraTech http://www.teratech.com/ , an 11-year-old Rockville, Maryland based consulting company that specializes in ColdFusion, Database and Visual Basic development.  Michael runs the MDCFUG and recently organized the two-day, Washington, DC-based CFUN-2k conference, which attracted more than 750 participants. You can reach Michael at michael@teratech.com or 301-424-3903.



Copyright © 1997-2017, Maryland Cold Fusion User Group. All rights reserved.