Remove Noise Words

Learn about removing Noise Words

Noise words are generic terms often found in attributes that reduce the clarity of meaningful values during the matching process. Removing them helps improve matching accuracy.

Noise words for organizations and addresses include, for example:

  • Organizations: "Corp," "LLC," and "Inc."
  • Addresses: "St," "Street," "Avenue," and "Ave."

It's often useful to ignore these words when generating tokens and comparing values.

This capability isn't available as a standalone cleanser; it can be invoked only within the context of a comparator and token class.

Important:

Noise-word removal runs only when the input value is within these limits:

  • Organization names — up to two words (for example, Rodriguez Co).
  • Addresses — up to three words (for example, 123 Main St).

If the value exceeds these limits (for example, Rodriguez Repairs Co or Rodriguez Mendes Repairs Co), noise words are not stripped.

This behavior applies to the following classes:

  • BasicTokenizedOrganizationNameComparator
  • OrganizationNameMatchToken
  • AddressLineComparator

Tip:

  • You can also specify numbers, such as phone numbers, as noise words.
  • Word-count limits also apply when the input consists solely of numbers; a single numeric "word" counts toward the overall limit.
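
The word-count rule described above can be sketched in Python. This is only an illustration of the documented behavior, not Reltio's implementation: the function name strip_noise_words, the small noise-word sets, and the assumption that words are separated by whitespace are all hypothetical.

```python
# Hypothetical, shortened noise-word sets; the built-in dictionaries are larger.
ORG_NOISE = {"co", "corp", "llc", "inc"}
ADDRESS_NOISE = {"st", "street", "avenue", "ave"}


def strip_noise_words(value, noise_words, max_words):
    """Remove noise words only if the value has at most max_words words.

    Assumption: words are whitespace-separated, and trailing "." or ","
    is ignored when comparing against the noise-word set.
    """
    words = value.split()
    if len(words) > max_words:
        # Over the limit: the value is returned untouched.
        return value
    kept = [w for w in words if w.lower().rstrip(".,") not in noise_words]
    return " ".join(kept)


# Organization names use a two-word limit, addresses a three-word limit.
print(strip_noise_words("Rodriguez Co", ORG_NOISE, 2))            # Rodriguez
print(strip_noise_words("Rodriguez Repairs Co", ORG_NOISE, 2))    # Rodriguez Repairs Co
print(strip_noise_words("123 Main St", ADDRESS_NOISE, 3))         # 123 Main
print(strip_noise_words("456 Oak Park Street", ADDRESS_NOISE, 3)) # 456 Oak Park Street
```

A numeric string such as 555-123-4567 counts as one word in this sketch, so "555-123-4567 x100" is within the address limit but is returned unchanged because it contains no noise word.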

For your convenience, Reltio provides an out-of-the-box noise-word removal function and a predefined set of noise words for Organizations and Addresses.

This behavior is built into both the BasicTokenizedOrganizationNameComparator and the AddressLineComparator. Each comparator, along with its companion match-token class, is subject to the word-count limits described above.

Each of these classes uses a built-in list of noise words that you can download. For example, to download the list for addressLine, use this operation:

GET https://{{env}}.reltio.com/reltio/api/{{tenantId}}/configuration/noiseDictionaries/addressLine
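
As a rough sketch, the GET operation above could be issued from Python with the standard library. The helper names, the placeholder values, and the use of a Bearer access token in the Authorization header are assumptions made for illustration; check the authentication scheme for your tenant. The response is returned as raw text because its format is not described here.

```python
import urllib.request


def build_noise_dictionary_request(env, tenant_id, dictionary, access_token):
    """Build the GET request for a built-in noise dictionary.

    Assumption: the API accepts an OAuth-style Bearer token.
    """
    url = (
        f"https://{env}.reltio.com/reltio/api/{tenant_id}"
        f"/configuration/noiseDictionaries/{dictionary}"
    )
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {access_token}"},
        method="GET",
    )


def download_noise_dictionary(env, tenant_id, dictionary, access_token):
    """Send the request and return the response body as text."""
    request = build_noise_dictionary_request(env, tenant_id, dictionary, access_token)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

For example, download_noise_dictionary("dev", "myTenantId", "addressLine", token) would request the addressLine list, where "dev", "myTenantId", and token are placeholders for your own environment, tenant ID, and access token.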

If you wish to develop and leverage your own list of noise words, you can:

  1. Create a text file (for example, myNoiseWords.txt) where each line contains a single noise word, like this:
    inc
    co
    corp
    corps
    corporation
    corporate
    company
    service
    services
    The list is matched case-insensitively, but as a best practice we recommend using lowercase.
  2. Submit the list to Reltio. For details, see Contact a support person. Attach your text file and request the task Add file for noise words removal. You will receive the full path name of the file.
  3. Create a custom comparator class and specify the full path and file name in the appropriate parameter field of the custom class. For more information, see Comparator Classes.
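
Step 1 of the procedure above can be scripted. The following is a minimal sketch, assuming you start from a Python list of candidate words; the function name write_noise_words_file is hypothetical. It lowercases each entry, drops blanks and duplicates, rejects entries containing more than one word, and writes one word per line.

```python
from pathlib import Path


def write_noise_words_file(words, path):
    """Write a noise-word file with one lowercase word per line.

    Returns the cleaned list in the order it was written.
    """
    seen = set()
    cleaned = []
    for word in words:
        word = word.strip().lower()
        if not word:
            continue  # skip blank entries
        if len(word.split()) > 1:
            # Each line must contain a single noise word.
            raise ValueError(f"not a single word: {word!r}")
        if word not in seen:
            seen.add(word)
            cleaned.append(word)
    Path(path).write_text("\n".join(cleaned) + "\n", encoding="utf-8")
    return cleaned
```

Calling write_noise_words_file(["Inc", "co", "Corp", "inc"], "myNoiseWords.txt") would produce a file containing inc, co, and corp on separate lines, ready to attach to the support request.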

Impact of word-count limits on noise word removal

For Organization names:

Classes: BasicTokenizedOrganizationNameComparator and OrganizationNameMatchToken

Input value           | Output                | Reason
Rodriguez Co          | Rodriguez             | Within the two-word limit (cleansed)
Rodriguez Repairs Co  | Rodriguez Repairs Co  | Exceeds two-word limit (noise word retained)

For Address fields:

Comparator: AddressLineComparator

Input value          | Output               | Reason
123 Main St          | 123 Main             | Within the three-word limit (cleansed)
456 Oak Park Street  | 456 Oak Park Street  | Exceeds three-word limit (noise word retained)
555-123-4567 x100    | 555-123-4567 x100    | Numeric string counts as one "word" (no noise word present)