Friday, May 28, 2010

Thesaurus Class (VB.Net)

Introduction
In some cases two values with similar meaning should be matched. I have created a Thesaurus class based on Hashtable to realise this functionality.

Source Code(vb.net)
Download thesaurus classes here.

Thesaurus File
The thesaurus source file is text file that contains a group of synonyms in each line separated by the specified char (see example file for nicknames with > used as value separator here).

Using the Code
Here is an example in which I compare two forenames using the Nicknames thesaurus file:
Public Sub test_thesaurus() As Boolean
 Dim t As New Thesaurus("Nicknames","c:\nicknames.txt", ">"c)
 t.Load()
 Dim s1 As String = "Robert"
 Dim s2 As String = "Bob"
 Return  (t.GetKey(s2) = t.GetKey(s1))
End Sub

Similarity Distance
One thing that can be also useful is distance between synonyms. Now you can specify distance value enclosed in square brackets in the thesaurus file. This functionality can be extended for score based matching.

Wednesday, May 26, 2010

AnyValue and NullValue

I came a cross a difficulty with matching duplicate contact records due to the issue with special values in some of the fields. For example Dr in the Titles field should match any other value - Mr/Sir/Mrs/Ms. On the other hand if Unknown was found in the address line it should not be matched to any other value including itself.

To overcome this problem I have added functionality that supports following parameters in Column node of the Unique constraint:
1. AnyValue - value that equals any other element --- if this parameter is set, then value in the field that equals to parameter's value will match values in all other records. By setting this parameter you will get more matches and therefore can increase the number of false matches.
2. NullValue - value that does not equal any element including itself --- if this parameter is set, then row where field equals to parameter's value will not match any other row in the file. By setting this parameter you will get more unique records and therefore less false matches.  

Schema example for contact details matching:

<?xml version="1.0" encoding="ISO-8859-1"?>
<?xml-stylesheet type='text/xsl' href='http://www.flat-file.net/schema/file-set.xslt'?>
<Files ErrorLog="data.log" ExecutionLog="execution.log">
  <Synonyms>
    <Synonym Name="Nicknames" File="C:\match\nicknames.txt" Delimiter=">" />
    <Synonym Name="Titles" File="C:\match\titles.txt" Delimiter=">" />
  </Synonyms>
  <File Alias="Contacts" 
Name="contacts.csv" 
HasHeader="True" 
Path="C:\match\" 
StringQualifier='"' 
Delimiter=",">
    <Columns>
      <Column Name="ContactKey" Index="0">
        <Unique>
          <Action>
            <DupesReport Path="C:\match\duplicates.csv" />
          </Action>
          <Column Name="Address_Line1" />
          <Column Name="Postcode" Index="Hash" />
          <Column Name="NickName" NullValue='' />
          <Column Name="SoundSurname" />
          <Column Name="Gender" AnyValue='U' />
          <Column Name="Initials" />
        </Unique>
      </Column>
      <Column Name="Title" />
      <Column Name="Forenames" />
      <Column Name="Initials" />
      <Column Name="Surname" />
      <Column Name="Address_Line1" />
      <Column Name="Postcode" />
    </Columns>
    <Expressions>
      <Expression Name="NickName" Value="[Synonym('Nicknames',Field('Forenames'))]" />
      <Expression Name="SoundSurname" Value="[Soundex(Field('Surname'))]" />
      <Expression Name="Gender" Value="[Synonym('Titles',Field('Title'))]" />
    </Expressions>
  </File>
</Files>

The the source file for the schema above must contain the following fields in the header:
ContactKey,Title,Forenames,Initials, Surname, Address_Line1, Postcode


You can use both AnyValue and NullValue in the same field.

If you want to define more then one value of special type, then check  thesaurus functionality which can be used in combination with AnyValue/NullValue to achieve this.

You can try thesaurus files that were used in the sample schema:

Wednesday, May 5, 2010

Match synonym values via custom thesaurus

Introduction

Sometimes different values have same or similar meaning making it more difficult to find duplicates with exact or even approximate string matching.

To help you to find more duplicate records a functionality that supports thesaurus files was implemented.  Thesaurus files are the text files with groups of synonyms at each line separated by a specified character:

...
alex,alexandr,sasha
bob,robert
...

Similar approach was used by Oracle in search functionality for AquaLogic.

To get synonym from the thesaurus file for specific value the Synonym(string, string) should be used with following parameters:
  • name - name of the thesaurus file (see Figure 1);
  • value - string value for which synonym needs to be found in thesaurus.

Adding thesaurus functionality to the schema

1. Create thesaurus file(s) for titles, forenames, countries, etc. (see examples: Nickname, Titles)
2. Add thesaurus file to the schema. 
Before using Synonym function you will need to add thesaurus file to the schema.
  • Open "Thesaurus Files" form through main menu Options->Thesauruses;
  • Specify thesaurus logical/alias name, file path and value delimiter (see Figure 1).
Thesaurus Files Form
Figure 1. Thesaurus Files Form

3. Create calculated field.
After the thesaurus file is added to the schema, add new calculated field that will contain results of Synonym() function.

Calculated Field Details
Figure 2. Calculated Field Details

4. Check computed field in the unique constraint
  • add Unique constraint to the primary key of the file;
  • double click on the constraint node and check the fields to be included in the data rule.