Remove Noise Words
Learn about removing Noise Words
Noise words (also called garbage words) are generic words that commonly appear in attribute values and dilute the effect of the more meaningful values in the match process. Removing them improves match quality. For Organizations, example noise words are: Corp, LLC, and Inc. For Addresses, example noise words are: St, Street, Avenue, and Ave. It is often desirable to ignore these words when generating tokens and doing comparisons.
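To make the dilution effect concrete, here is a minimal sketch. It assumes a simple token-overlap (Jaccard) score as a stand-in for a real comparator; it is not how Reltio's comparators score matches. The names `ORG_NOISE_WORDS`, `tokenize`, and `jaccard` are hypothetical and exist only for illustration.

```python
import re

# Illustrative noise words taken from the examples above;
# this is NOT Reltio's built-in list.
ORG_NOISE_WORDS = frozenset({"corp", "llc", "inc"})


def tokenize(name, noise_words=frozenset()):
    """Lower-case the name, split on non-alphanumerics, drop noise words."""
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    return {t for t in tokens if t not in noise_words}


def jaccard(a, b):
    """Fraction of distinct tokens shared by both sets (0.0 to 1.0)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


left, right = "Acme Corp", "Zenith Corp"

# With noise words kept, the shared token "corp" creates overlap.
raw = jaccard(tokenize(left), tokenize(right))

# With noise words removed, only the meaningful tokens are compared.
clean = jaccard(tokenize(left, ORG_NOISE_WORDS), tokenize(right, ORG_NOISE_WORDS))

print(raw, clean)
```

In this sketch the two unrelated organizations share one of three distinct tokens while "corp" is kept, and share nothing once it is removed.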
Reltio provides a noise words removal function and a predefined set of noise words for Organizations and Addresses. It is built into the BasicTokenizedOrganizationNameComparator and AddressLineComparator, and their companion match token classes.
Each of these classes uses a built-in list of noise words that you can download.
If you wish to develop and use your own list of noise words, you can:
- Create a text file (for example, myNoiseWords.txt) with one noise word per line, like this:
inc
co
corp
corps
corporation
corporate
company
service
services
The list is applied in a case-insensitive manner, but we recommend using lower case as a best practice.
- Submit the list to Reltio by filing a support ticket at support@reltio.com. Attach your text file and request the task Add file for noise words removal. You will receive the full path name of the file.
- Create a custom comparator class and specify the full path and file name in the appropriate parameter field of the custom class. For more information, see Comparator Classes.
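Before submitting a list, it can help to check locally what it would strip from sample values. The sketch below assumes only the file format described above (one word per line, matched case-insensitively). The functions `load_noise_words` and `remove_noise_words` are hypothetical helpers, not part of any Reltio API, and the punctuation handling is a simplifying assumption.

```python
import tempfile
from pathlib import Path


def load_noise_words(path):
    """Read one noise word per line; skip blank lines and fold to lower case."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return {line.strip().lower() for line in lines if line.strip()}


def remove_noise_words(value, noise_words):
    """Return the whitespace-separated tokens of value that are not noise words.

    Assumption: trailing '.' and ',' are ignored when checking a token.
    """
    return [
        token
        for token in value.split()
        if token.lower().strip(".,") not in noise_words
    ]


# Write a small sample list to a temporary file, then apply it.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "myNoiseWords.txt"
    path.write_text("inc\nco\ncorp\ncompany\nservices\n", encoding="utf-8")
    noise = load_noise_words(path)
    kept = remove_noise_words("Acme Services Inc.", noise)

print(kept)
```

Running a handful of representative organization names or address lines through a check like this is a quick way to spot words that are too aggressive to treat as noise.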