Remove Noise Words
Learn about removing noise words.
Noise words (or garbage words) are generic words often found in attributes that weaken the effectiveness of meaningful values during the matching process. Removing them improves matching performance.
Noise words for organizations and addresses include, for example:
- Organizations: "Corp," "LLC," and "Inc."
- Addresses: "St," "Street," "Avenue," and "Ave."
It's often desirable to ignore these words when generating tokens and doing comparisons.
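As a rough illustration of the idea (not Reltio's actual implementation), the sketch below drops noise words from a value before it is tokenized or compared. The function name `remove_noise_words` and the small word set are hypothetical, built only from the examples listed above.

```python
import re

# Illustrative noise words taken from the examples above.
# This is an assumption for the sketch, not Reltio's built-in dictionary.
NOISE_WORDS = {"corp", "llc", "inc", "st", "street", "avenue", "ave"}


def remove_noise_words(value, noise_words=NOISE_WORDS):
    """Return the words of value, lowercased and in order, without noise words."""
    # Split on anything that is not a letter or digit so that
    # "Inc." and "Corp," are reduced to "inc" and "corp".
    words = [w for w in re.split(r"[^0-9a-z]+", value.lower()) if w]
    return [w for w in words if w not in noise_words]
```

With this sketch, "Acme Corp, Inc." and "ACME LLC" both reduce to the single word "acme", so only the meaningful part of the name takes part in the comparison.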
For your convenience, Reltio provides an out-of-the-box noise word removal function and a predefined set of noise words for Organizations and Addresses. The function is built into the BasicTokenizedOrganizationNameComparator and AddressLineComparator comparators and their companion match token classes.
Each of these classes uses a built-in list of noise words that you can download. For example, to download the addressLine list, use this operation:
GET https://{{env}}.reltio.com/reltio/api/{{tenantId}}/configuration/noiseDictionaries/addressLine
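The request above could be prepared from a script along these lines. This is a minimal sketch: the helper name `build_noise_dictionary_request` is hypothetical, and sending the access token as a Bearer `Authorization` header is an assumption that this page does not confirm.

```python
import urllib.request


def build_noise_dictionary_request(env, tenant_id, access_token, dictionary="addressLine"):
    """Build (but do not send) the GET request for a built-in noise dictionary."""
    url = (
        f"https://{env}.reltio.com/reltio/api/{tenant_id}"
        f"/configuration/noiseDictionaries/{dictionary}"
    )
    # Assumption: the API accepts an OAuth-style Bearer token in this header.
    headers = {"Authorization": f"Bearer {access_token}"}
    return urllib.request.Request(url, headers=headers, method="GET")


# To send the request, pass it to urllib.request.urlopen(...) and read the body.
# The response format is not described here, so this sketch does not parse it.
```

Building the request separately from sending it makes it easy to inspect the URL and headers before any call goes to the tenant.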
To create and use your own list of noise words:
- Create a text file (for example, myNoiseWords.txt) with one noise word per line, like this:
inc
co
corp
corps
corporation
corporate
company
service
services
The list is applied in a case-insensitive manner, but as a best practice we recommend using lower case.
- Submit the list to Reltio. For details, see Contact a supportive person. Attach your text file and request the task Add file for noise words removal. You will receive the full path name of the file.
- Create a custom comparator class and specify the full path and file name in the appropriate parameter of that class. For more information, see Comparator Classes.
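Before you submit a file, it can help to normalize it so that it follows the lower-case recommendation and contains no blanks or duplicates. The sketch below is one way to do that. The helper names are hypothetical, and the one-word-per-line layout is inferred from the example list above rather than from a published file specification.

```python
def normalize_noise_words(words):
    """Lowercase and trim each word, dropping blanks and duplicates (first occurrence wins)."""
    cleaned = []
    for word in words:
        candidate = word.strip().lower()
        if candidate and candidate not in cleaned:
            cleaned.append(candidate)
    return cleaned


def write_noise_word_file(words, path="myNoiseWords.txt"):
    """Write the normalized words to path, one per line, and return them."""
    cleaned = normalize_noise_words(words)
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(cleaned) + "\n")
    return cleaned
```

For example, `normalize_noise_words(["Inc", " co ", "INC", ""])` keeps only "inc" and "co", in that order.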