Wednesday, May 5, 2010

Match synonym values via custom thesaurus

Introduction

Sometimes different values have same or similar meaning making it more difficult to find duplicates with exact or even approximate string matching.

To help you to find more duplicate records a functionality that supports thesaurus files was implemented.  Thesaurus files are the text files with groups of synonyms at each line separated by a specified character:

...
alex,alexandr,sasha
bob,robert
...

Similar approach was used by Oracle in search functionality for AquaLogic.

To get synonym from the thesaurus file for specific value the Synonym(string, string) should be used with following parameters:
  • name - name of the thesaurus file (see Figure 1);
  • value - string value for which synonym needs to be found in thesaurus.

Adding thesaurus functionality to the schema

1. Create thesaurus file(s) for titles, forenames, countries, etc. (see examples: Nickname, Titles)
2. Add thesaurus file to the schema. 
Before using Synonym function you will need to add thesaurus file to the schema.
  • Open "Thesaurus Files" form through main menu Options->Thesauruses;
  • Specify thesaurus logical/alias name, file path and value delimiter (see Figure 1).
Thesaurus Files Form
Figure 1. Thesaurus Files Form

3. Create calculated field.
After the thesaurus file is added to the schema, add new calculated field that will contain results of Synonym() function.

Calculated Field Details
Figure 2. Calculated Field Details

4. Check computed field in the unique constraint
  • add Unique constraint to the primary key of the file;
  • double click on the constraint node and check the fields to be included in the data rule.