Wednesday, May 26, 2010

AnyValue and NullValue

I came a cross a difficulty with matching duplicate contact records due to the issue with special values in some of the fields. For example Dr in the Titles field should match any other value - Mr/Sir/Mrs/Ms. On the other hand if Unknown was found in the address line it should not be matched to any other value including itself.

To overcome this problem I have added functionality that supports following parameters in Column node of the Unique constraint:
1. AnyValue - value that equals any other element --- if this parameter is set, then value in the field that equals to parameter's value will match values in all other records. By setting this parameter you will get more matches and therefore can increase the number of false matches.
2. NullValue - value that does not equal any element including itself --- if this parameter is set, then row where field equals to parameter's value will not match any other row in the file. By setting this parameter you will get more unique records and therefore less false matches.  

Schema example for contact details matching:

<?xml version="1.0" encoding="ISO-8859-1"?>
<?xml-stylesheet type='text/xsl' href='http://www.flat-file.net/schema/file-set.xslt'?>
<Files ErrorLog="data.log" ExecutionLog="execution.log">
  <Synonyms>
    <Synonym Name="Nicknames" File="C:\match\nicknames.txt" Delimiter=">" />
    <Synonym Name="Titles" File="C:\match\titles.txt" Delimiter=">" />
  </Synonyms>
  <File Alias="Contacts" 
Name="contacts.csv" 
HasHeader="True" 
Path="C:\match\" 
StringQualifier='"' 
Delimiter=",">
    <Columns>
      <Column Name="ContactKey" Index="0">
        <Unique>
          <Action>
            <DupesReport Path="C:\match\duplicates.csv" />
          </Action>
          <Column Name="Address_Line1" />
          <Column Name="Postcode" Index="Hash" />
          <Column Name="NickName" NullValue='' />
          <Column Name="SoundSurname" />
          <Column Name="Gender" AnyValue='U' />
          <Column Name="Initials" />
        </Unique>
      </Column>
      <Column Name="Title" />
      <Column Name="Forenames" />
      <Column Name="Initials" />
      <Column Name="Surname" />
      <Column Name="Address_Line1" />
      <Column Name="Postcode" />
    </Columns>
    <Expressions>
      <Expression Name="NickName" Value="[Synonym('Nicknames',Field('Forenames'))]" />
      <Expression Name="SoundSurname" Value="[Soundex(Field('Surname'))]" />
      <Expression Name="Gender" Value="[Synonym('Titles',Field('Title'))]" />
    </Expressions>
  </File>
</Files>

The the source file for the schema above must contain the following fields in the header:
ContactKey,Title,Forenames,Initials, Surname, Address_Line1, Postcode


You can use both AnyValue and NullValue in the same field.

If you want to define more then one value of special type, then check  thesaurus functionality which can be used in combination with AnyValue/NullValue to achieve this.

You can try thesaurus files that were used in the sample schema: