To overcome this problem I have added functionality that supports following parameters in Column node of the Unique constraint:
1. AnyValue - value that equals any other element --- if this parameter is set, then value in the field that equals to parameter's value will match values in all other records. By setting this parameter you will get more matches and therefore can increase the number of false matches.
2. NullValue - value that does not equal any element including itself --- if this parameter is set, then row where field equals to parameter's value will not match any other row in the file. By setting this parameter you will get more unique records and therefore less false matches.
Schema example for contact details matching:
<?xml version="1.0" encoding="ISO-8859-1"?>
<?xml-stylesheet type='text/xsl' href='http://www.flat-file.net/schema/file-set.xslt'?>
<Files ErrorLog="data.log" ExecutionLog="execution.log">
<Synonyms>
<Synonym Name="Nicknames" File="C:\match\nicknames.txt" Delimiter=">" />
<Synonym Name="Titles" File="C:\match\titles.txt" Delimiter=">" />
</Synonyms>
<File Alias="Contacts"
Name="contacts.csv"
HasHeader="True"
Path="C:\match\"
StringQualifier='"'
Delimiter=",">
<Columns>
<Column Name="ContactKey" Index="0">
<Unique>
<Action>
<DupesReport Path="C:\match\duplicates.csv" />
</Action>
<Column Name="Address_Line1" />
<Column Name="Postcode" Index="Hash" />
<Column Name="NickName" NullValue='' />
<Column Name="SoundSurname" />
<Column Name="Gender" AnyValue='U' />
<Column Name="Initials" />
</Unique>
</Column>
<Column Name="Title" />
<Column Name="Forenames" />
<Column Name="Initials" />
<Column Name="Surname" />
<Column Name="Address_Line1" />
<Column Name="Postcode" />
</Columns>
<Expressions>
<Expression Name="NickName" Value="[Synonym('Nicknames',Field('Forenames'))]" />
<Expression Name="SoundSurname" Value="[Soundex(Field('Surname'))]" />
<Expression Name="Gender" Value="[Synonym('Titles',Field('Title'))]" />
</Expressions>
</File>
</Files>
The the source file for the schema above must contain the following fields in the header:
ContactKey,Title,Forenames,Initials, Surname, Address_Line1, Postcode
You can use both AnyValue and NullValue in the same field.
If you want to define more then one value of special type, then check thesaurus functionality which can be used in combination with AnyValue/NullValue to achieve this.
You can try thesaurus files that were used in the sample schema: