Remove Noise Words

Learn about removing Noise Words

Noise words (or garbage words) are generic words often found in attributes that weaken the effectiveness of meaningful values during the matching process. Removing them improves matching performance.

Noise words for organizations and addresses include, for example:

  • Organizations: "Corp," "LLC," and "Inc."
  • Addresses: "St," "Street," "Avenue," and "Ave."

It's often desirable to ignore these words when generating tokens and doing comparisons.

Noise word removal isn't available as a standalone cleanser. It can be invoked only within the context of a comparator and its match token class.
Tip: Noise words don't have to be words. You can also use numbers as noise words, for example, when matching phone numbers.
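
To illustrate the idea of ignoring noise words when generating tokens and doing comparisons, here is a minimal Python sketch. It is not Reltio's implementation: the sample word list, the tokenization rule, and the function names tokenize_without_noise and same_after_noise_removal are all assumptions made for illustration.

```python
import re

# Hypothetical sample list, taken from the examples above.
# The actual built-in dictionaries are maintained by Reltio.
NOISE_WORDS = {"corp", "llc", "inc", "st", "street", "avenue", "ave"}


def tokenize_without_noise(value, noise_words=NOISE_WORDS):
    """Split a value into lower-case alphanumeric tokens and drop noise words."""
    tokens = re.findall(r"[a-z0-9]+", value.lower())
    return [token for token in tokens if token not in noise_words]


def same_after_noise_removal(left, right):
    """Compare two values on their remaining (meaningful) tokens only."""
    return tokenize_without_noise(left) == tokenize_without_noise(right)
```

With this sketch, "Acme Corp." and "Acme, Inc." both reduce to the single token "acme", so the comparison is driven by the meaningful part of the name rather than by the legal suffix.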

For your convenience, Reltio provides an out-of-the-box noise words removal function and a predefined set of noise words for Organizations and Addresses. The function is built into the BasicTokenizedOrganizationNameComparator and AddressLineComparator comparators and their companion match token classes.

Each of these classes uses a built-in list of noise words that you can download. For example, to download the addressLine list, use this operation:

GET https://{{env}}.reltio.com/reltio/api/{{tenantId}}/configuration/noiseDictionaries/addressLine
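
The request above could be issued from a script along these lines. This is a sketch under stated assumptions: the function names are hypothetical, the Bearer token header is an assumption about how your tenant authenticates API calls, and the response is returned as raw text because its format is not described here.

```python
import urllib.request


def noise_dictionary_url(env, tenant_id, dictionary="addressLine"):
    """Build the download URL from the operation shown above."""
    return (
        f"https://{env}.reltio.com/reltio/api/{tenant_id}"
        f"/configuration/noiseDictionaries/{dictionary}"
    )


def download_noise_dictionary(env, tenant_id, access_token, dictionary="addressLine"):
    """Send the GET request and return the response body as text.

    Assumption: the API accepts an OAuth-style Bearer token in the
    Authorization header. Adjust this to match your environment.
    """
    request = urllib.request.Request(
        noise_dictionary_url(env, tenant_id, dictionary),
        headers={"Authorization": f"Bearer {access_token}"},
        method="GET",
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

Here env and tenant_id stand in for the {{env}} and {{tenantId}} placeholders in the operation.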

To create and use your own list of noise words:

  1. Create a text file (for example, myNoiseWords.txt) that looks simply like this:

    inc
    co
    corp
    corps
    corporation
    corporate
    company
    service
    services

    The list is applied in a case-insensitive manner, but as a best practice we recommend that you use lower case.
  2. Submit the list to Reltio. For details, see Contact a supportive person. Attach your text file, and request the task Add file for noise words removal. You will receive a full path name to the file.
  3. Create a custom comparator class and specify the full path and file name in the appropriate parameter field of the custom class. For more information, see Comparator Classes.
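
If you maintain the word list in a script, step 1 can be prepared with a small helper like the following. This is only a sketch: the function names are hypothetical, and the one-word-per-line layout is inferred from the sample file above. It lower-cases each entry, as recommended, and removes blanks and duplicates before writing the file.

```python
def normalize_noise_words(words):
    """Trim and lower-case entries, dropping blanks and duplicates.

    The first-seen order of the words is preserved.
    """
    seen = set()
    result = []
    for word in words:
        cleaned = word.strip().lower()
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            result.append(cleaned)
    return result


def write_noise_words_file(words, path="myNoiseWords.txt"):
    """Write the normalized list to a text file, one noise word per line."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(normalize_noise_words(words)) + "\n")
```

For example, write_noise_words_file(["Inc", "Co", "Corp", "inc"]) produces a file containing inc, co, and corp on separate lines, ready to attach to your request.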