String cleansing in Match FAQ

Learn about string cleansing in Match through frequently asked questions.

How does the sequence of entries in the dictionary file impact the output?

Entries in the dictionary are processed from top to bottom. The order shouldn't have an impact on the output unless the file contains duplicate source values. In such cases, the first matching source in the sequence is triggered. For more details, see the topic String Replacement Cleanser.

If a source term matches multiple entries, how will this be processed?

It's not recommended to have multiple entries with the same source value but different replacements. In this scenario, the first replacement overwrites the value. Subsequent entries no longer see the original value because it has already been modified; they operate on the modified value instead.

Example:

st => street
st => str
Outcome: strreet (“st” becomes “street”, and the second entry then replaces the “st” inside “street” with “str”)
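The outcome above can be reproduced with a minimal Python sketch. It assumes the cleanser applies each entry as a plain substring replacement, in file order, to the running value. That behavior is inferred from the example and is not taken from the Reltio implementation; apply_rules is a hypothetical helper.

```python
def apply_rules(value, rules):
    """Apply (source, replacement) pairs in order to the running value."""
    for source, replacement in rules:
        # Assumption: plain substring replacement, processed top to bottom.
        value = value.replace(source, replacement)
    return value


# Two entries share the source "st" but have different replacements.
rules = [("st", "street"), ("st", "str")]
result = apply_rules("st", rules)
print(result)  # strreet
```

Under this assumption, the second entry never sees the original "st"; it matches the "st" inside the already modified value "street".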

If a modified search term from one line matches subsequent entries in the file, how is this handled?

It's not recommended to use a replacement value that is also the source of another entry. In such cases, values are overwritten from top to bottom in the sequence: the output of one entry becomes the input of the entries below it.

Example:

street => str
str => st
Outcome: st
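Both discouraged patterns (duplicate sources, and a replacement that equals another source) can be caught before a dictionary is uploaded. The following is a minimal sketch of such a check; find_conflicts is a hypothetical helper, and it only detects exact equality between a replacement and a source, not partial overlaps.

```python
def find_conflicts(rules):
    """Return (duplicate_sources, chained) for a list of (source, replacement) pairs."""
    seen = set()
    duplicates = []
    for source, _ in rules:
        if source in seen and source not in duplicates:
            duplicates.append(source)
        seen.add(source)
    # A chain exists when an entry's replacement is itself the source of another entry.
    chained = [(s, r) for s, r in rules if r in seen and r != s]
    return duplicates, chained


rules = [("street", "str"), ("str", "st")]
duplicates, chained = find_conflicts(rules)
print(duplicates)  # []
print(chained)     # [('street', 'str')]
```

The check reports a chain regardless of where the two entries sit in the file, which is stricter than top-to-bottom processing requires but matches the recommendation above.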

Should punctuation rules like periods, commas, exclamation points, etc., be at the start or end of the file?

It's not mandatory to add punctuation rules. If required, you may include them based on your specific needs.

What are the recommendations for handling spaces?

If you want to retain spaces, include them explicitly in the source and replacement strings.

Example:

avenue => ave

The following example shows how entries with the same source are applied in sequence:

A => AB
A => ABC
Input: “A”
Output: ABCB
Explanation: “A” is replaced with “AB”, and then the “A” in “AB” is replaced with “ABC”.
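To illustrate why explicit spaces matter, the sketch below compares an entry written without spaces to the same entry with a space on each side. It assumes plain substring replacement, which is inferred from the examples in this FAQ rather than confirmed; apply_rules and the sample address are hypothetical.

```python
def apply_rules(value, rules):
    """Apply (source, replacement) pairs in order to the running value."""
    for source, replacement in rules:
        # Assumption: plain substring replacement, processed top to bottom.
        value = value.replace(source, replacement)
    return value


address = "12 first st apt"

# Without spaces, "st" also matches inside the word "first".
without_spaces = apply_rules(address, [("st", "street")])
# With explicit spaces, only the standalone term matches and the spaces are retained.
with_spaces = apply_rules(address, [(" st ", " street ")])

print(without_spaces)  # 12 firstreet street apt
print(with_spaces)     # 12 first street apt
```

Note that in this sketch an entry such as " st " would not match a term at the very start or end of the value, because no space surrounds it there.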

If translating to standard terms, what are the implications of going to larger vs smaller target terms?

The dictionary file should not exceed 10 MB. Performance may be impacted if the file contains more than 1,000 lines, so keep the file within both limits.
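The two limits can be checked locally before a dictionary file is used. This is a minimal sketch: check_dictionary_file is a hypothetical helper, the file is assumed to be UTF-8 text, and 10 MB is assumed to mean 10 × 1024 × 1024 bytes.

```python
import os

MAX_BYTES = 10 * 1024 * 1024  # 10 MB; assumes binary megabytes
MAX_LINES = 1000              # above this, performance may be impacted


def check_dictionary_file(path):
    """Return a list of warnings for a dictionary file; empty if within limits."""
    warnings = []
    size = os.path.getsize(path)
    if size > MAX_BYTES:
        warnings.append(f"file size {size} bytes exceeds {MAX_BYTES} bytes")
    # Assumption: the dictionary is a UTF-8 text file with one entry per line.
    with open(path, encoding="utf-8") as handle:
        line_count = sum(1 for _ in handle)
    if line_count > MAX_LINES:
        warnings.append(f"{line_count} lines exceeds {MAX_LINES}; performance may be impacted")
    return warnings
```

Calling check_dictionary_file with the path to a dictionary file returns an empty list when the file is within both limits.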

Are there any special considerations for using "class": "com.reltio.match.comparator.DynamicDamerauLevenshteinDistance"?

The DynamicDamerauLevenshteinDistance comparator evaluates the similarity between strings by considering transpositions (for example, swapping adjacent characters), insertions, deletions, and substitutions. This makes it well suited to use cases where small typographical errors are common. Before configuring this comparator, confirm that your use case benefits from its flexible matching, because it may introduce performance overhead on large datasets. For details about comparator classes, see the topic Comparator Classes.
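To make the four edit operations concrete, the sketch below computes the restricted Damerau-Levenshtein distance (also called optimal string alignment). It is a textbook illustration only and is not the Reltio comparator: how the "Dynamic" variant sets its match thresholds is not covered here, and osa_distance is a hypothetical name.

```python
def osa_distance(a, b):
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # Transposition of two adjacent characters.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(osa_distance("jonh", "john"))       # 1 (one transposition)
print(osa_distance("street", "stret"))    # 1 (one deletion)
print(osa_distance("kitten", "sitting"))  # 3
```

The table has one cell per pair of character positions, so the cost grows with the product of the two string lengths, which is one reason a comparator of this kind can add overhead on large datasets.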
