Remove Noise Words
Learn about removing Noise Words
Noise words are generic terms often found in attributes that reduce the clarity of meaningful values during the matching process. Removing them helps improve matching accuracy.
Noise words for organizations and addresses include, for example:
- Organizations: "Corp," "LLC," and "Inc."
- Addresses: "St," "Street," "Avenue," and "Ave."
It's often useful to ignore these words when generating tokens and comparing values.
Noise-word removal runs only when the input value is within these limits:
- Organization names: up to two words (for example, Rodriguez Co).
- Addresses: up to three words (for example, 123 Main St).
If the value exceeds these limits (for instance, Rodriguez Repairs Co or Rodriguez Mendes Repairs Co), noise words are not stripped.
This behavior applies to the following classes:
BasicTokenizedOrganizationNameComparator
OrganizationNameMatchToken
AddressLineComparator
- Noise-word removal can also be used with numbers, for example, phone numbers.
- Word-count limits also apply when the input consists solely of numbers; a single numeric "word" counts toward the overall limit.
For your convenience, Reltio provides an out-of-the-box noise-word removal function and a predefined set of noise words for organizations and addresses.
This behavior is built into both BasicTokenizedOrganizationNameComparator and AddressLineComparator. Each comparator, along with its companion match-token class, is subject to the word-count limits described above.
Each of these classes uses a built-in list of noise words that you can download. For example, to download the list for addressLine, use this operation:
GET https://{{env}}.reltio.com/reltio/api/{{tenantId}}/configuration/noiseDictionaries/addressLine
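The request above could be issued from a script along the following lines. This is a minimal sketch, not an official client: the function names are hypothetical, the bearer-token Authorization header is an assumption about how your tenant authenticates, and the response is returned as raw text because its format is not described here.

```python
import urllib.request


def noise_dictionary_url(env: str, tenant_id: str, dictionary: str = "addressLine") -> str:
    """Build the download URL from the pattern shown above."""
    return (
        f"https://{env}.reltio.com/reltio/api/{tenant_id}"
        f"/configuration/noiseDictionaries/{dictionary}"
    )


def download_noise_dictionary(env: str, tenant_id: str, access_token: str,
                              dictionary: str = "addressLine") -> str:
    """Send the GET request and return the response body as text.

    Assumption: the API accepts an OAuth-style bearer token.
    """
    request = urllib.request.Request(
        noise_dictionary_url(env, tenant_id, dictionary),
        headers={"Authorization": f"Bearer {access_token}"},
        method="GET",
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

Replace the env, tenant_id, and access_token arguments with the values for your own environment.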
To develop and use your own list of noise words:
- Create a text file (for example, myNoiseWords.txt) in which each line contains a single noise word. The list is applied in a case-insensitive manner, but we recommend using lowercase as a best practice. For example:
  inc
  co
  corp
  corps
  corporation
  corporate
  company
  service
  services
- Submit the list to Reltio. For details, see Contact a support person. Attach your text file and request the task Add file for noise words removal. You will receive the full path name of the file.
- Create a custom comparator class and specify the full path and file name in the appropriate parameter field of the custom class. For more information, see Comparator Classes.
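Before submitting a custom list, it can help to check locally that the file follows the one-word-per-line format. The sketch below is a hypothetical helper, not part of any Reltio tooling: it reads the file, ignores blank lines, and lowercases each entry in line with the recommendation above.

```python
from pathlib import Path


def load_noise_words(path: str) -> set[str]:
    """Read a noise-word file with one word per line.

    Blank lines are skipped, and entries are lowercased because the
    list is applied case-insensitively.
    """
    words = set()
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        word = line.strip().lower()
        if word:
            words.add(word)
    return words
```

For example, load_noise_words("myNoiseWords.txt") returns the distinct entries in the file, which makes duplicates and stray whitespace easy to spot.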
Impact of word-count limits on noise word removal
- For organization names:
Comparators: BasicTokenizedOrganizationNameComparator and OrganizationNameMatchToken

| Input value | Output | Reason |
| --- | --- | --- |
| Rodriguez Co | Rodriguez | Within the two-word limit: cleansed |
| Rodriguez Repairs Co | Rodriguez Repairs Co | Exceeds the two-word limit: noise word retained |

- For address fields:
Comparator: AddressLineComparator

| Input value | Output | Reason |
| --- | --- | --- |
| 123 Main St | 123 Main | Within the three-word limit: cleansed |
| 456 Oak Park Street | 456 Oak Park Street | Exceeds the three-word limit: noise word retained |
| 555-123-4567 x100 | 555-123-4567 x100 | Numeric string counts as one "word": no noise word present |
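The word-count rule described in this section can be approximated with the short sketch below. It is an illustration only, not Reltio's implementation: the function name is hypothetical, the two word sets are small samples rather than the built-in dictionaries, and splitting on whitespace is an assumption about how words are counted.

```python
# Illustrative samples only; the built-in dictionaries are larger.
ORG_NOISE = {"corp", "llc", "inc", "co"}
ADDRESS_NOISE = {"st", "street", "avenue", "ave"}


def remove_noise_words(value: str, noise_words: set[str], max_words: int) -> str:
    """Strip noise words only when the value is within the word-count limit."""
    words = value.split()
    if len(words) > max_words:
        # Over the limit: the value is returned unchanged.
        return value
    # Compare case-insensitively, ignoring trailing periods and commas.
    kept = [w for w in words if w.strip(".,").lower() not in noise_words]
    return " ".join(kept)


print(remove_noise_words("Rodriguez Co", ORG_NOISE, 2))             # Rodriguez
print(remove_noise_words("Rodriguez Repairs Co", ORG_NOISE, 2))     # Rodriguez Repairs Co
print(remove_noise_words("123 Main St", ADDRESS_NOISE, 3))          # 123 Main
print(remove_noise_words("456 Oak Park Street", ADDRESS_NOISE, 3))  # 456 Oak Park Street
```

The limits of two and three words are passed as arguments so that the same function covers both organization names and address lines.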