String cleansing in Match FAQ
Learn about string cleansing in Match through frequently asked questions.
How does the sequence of entries in the dictionary file impact the output?
Entries in the dictionary are processed from top to bottom. The order shouldn't have an impact on the output unless the same source value appears more than once. In such cases, the first source encountered in the sequence is triggered. For more details, see the String Replacement Cleanser topic.
If a source term matches multiple entries, how will this be processed?
It's not recommended to have multiple entries with the same source value but different replacements. In this scenario, the first replacement overwrites the value, and subsequent entries then operate on the modified value rather than on the original.
Example:
st => street
st => str
Outcome: strreet
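The outcome above is consistent with a simple model in which each entry is applied, in file order, to the output of the previous entry. The following Python sketch illustrates that model only. The helper name apply_rules is hypothetical, and plain substring replacement is an assumption, not a description of how the cleanser is actually implemented.

```python
def apply_rules(text, rules):
    """Apply (source, replacement) pairs in order, top to bottom.

    Assumption: each rule is a plain substring replacement applied to the
    result of the previous rule. This is an illustrative model only.
    """
    for source, replacement in rules:
        text = text.replace(source, replacement)
    return text


rules = [
    ("st", "street"),  # first entry rewrites "st" to "street"
    ("st", "str"),     # second entry then matches the "st" inside "street"
]
print(apply_rules("st", rules))  # strreet
```

Under this model, the second entry never sees the original "st"; it only sees "street", which is why the result is neither "street" nor "str".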
If a modified search term from one line matches subsequent entries in the file, how is this handled?
It's not recommended to have a replacement value that is equal to the source of another entry. In such cases, values are overwritten from top to bottom in the sequence, so a later entry rewrites the output of an earlier one.
Example:
street => str
str => st
Outcome: st
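Both situations that are not recommended (a repeated source, and a replacement equal to a later source) can be detected before a dictionary is uploaded. The following Python sketch shows one possible check. The function name find_risky_entries and the representation of entries as (source, replacement) pairs are assumptions made for illustration; this is not a tool provided by the product.

```python
def find_risky_entries(rules):
    """Return warnings for entries that may interact with each other.

    `rules` is a list of (source, replacement) pairs in file order.
    Hypothetical helper: it checks exact equality only, not substrings.
    """
    warnings = []

    # Check 1: the same source value appears in more than one entry.
    first_seen = {}  # source -> 1-based position of its first entry
    for position, (source, _replacement) in enumerate(rules, start=1):
        if source in first_seen:
            warnings.append(
                f"entry {position}: source '{source}' repeats entry {first_seen[source]}"
            )
        else:
            first_seen[source] = position

    # Check 2: a replacement equals the source of a later entry, so the
    # later entry would rewrite it during top-to-bottom processing.
    for position, (_source, replacement) in enumerate(rules, start=1):
        for later_position in range(position + 1, len(rules) + 1):
            later_source = rules[later_position - 1][0]
            if replacement == later_source:
                warnings.append(
                    f"entry {position}: replacement '{replacement}' "
                    f"is the source of entry {later_position}"
                )
    return warnings


for warning in find_risky_entries([("street", "str"), ("str", "st")]):
    print(warning)
```

For the example above, the check reports that the replacement of entry 1 is the source of entry 2.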
Should rules for punctuation, such as periods, commas, and exclamation points, be placed at the start or end of the file?
It's not mandatory to add punctuation rules. If required, you may include them based on your specific needs.
What are the recommendations for handling spaces?
If you want to retain spaces, include them explicitly in the source and replacement strings.
Example:
avenue => ave
Example with a repeated source:
A => AB
A => ABC
Input: “A”
Output: ABCB
Explanation: The first entry replaces “A” with “AB”. The second entry then replaces the “A” in “AB” with “ABC”, which gives “ABCB”.
If translating to standard terms, what are the implications of going to larger vs smaller target terms?
The file size should not exceed 10 MB. If the number of lines exceeds 1,000, performance may be impacted. Ensure the file stays within these limits to avoid performance issues.
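The two limits above can be checked locally before a dictionary file is uploaded. The following Python sketch is one way to do that. The function name check_dictionary_file is hypothetical, and it assumes that 1 MB means 1,048,576 bytes and that the file is UTF-8 text with one entry per line; none of these details are stated in this FAQ.

```python
import os

MAX_BYTES = 10 * 1024 * 1024  # 10 MB limit (assumes 1 MB = 1,048,576 bytes)
MAX_LINES = 1000              # above this, performance may be impacted


def check_dictionary_file(path):
    """Return a list of problems found; an empty list means within limits."""
    problems = []

    size = os.path.getsize(path)
    if size > MAX_BYTES:
        problems.append(f"file is {size} bytes, above the {MAX_BYTES}-byte limit")

    # Assumption: one dictionary entry per line, UTF-8 encoded.
    with open(path, "r", encoding="utf-8") as handle:
        line_count = sum(1 for _ in handle)
    if line_count > MAX_LINES:
        problems.append(f"file has {line_count} lines, above {MAX_LINES}")

    return problems
```

Calling check_dictionary_file with the path of a dictionary file returns an empty list when the file is within both limits.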
Any special considerations for using "class": "com.reltio.match.comparator.DynamicDamerauLevenshteinDistance"?
The DynamicDamerauLevenshteinDistance comparator evaluates the similarity between strings by considering transpositions (for example, swapping adjacent characters), insertions, deletions, and substitutions. This makes it ideal for use cases where small typographical errors are common. When configuring this comparator, ensure that your use case benefits from its flexible matching capabilities, as it may introduce performance overhead for large datasets. For details about comparator classes, see topic Comparator Classes.
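To make the four edit operations concrete, the following Python sketch computes the optimal string alignment distance, a common restricted form of the Damerau-Levenshtein distance. It is a generic textbook illustration, not the comparator's implementation. Which variant the comparator uses, and how it decides whether a given distance counts as a match, are not covered here.

```python
def osa_distance(a, b):
    """Optimal string alignment distance (restricted Damerau-Levenshtein).

    Counts insertions, deletions, substitutions, and swaps of two
    adjacent characters, each with a cost of 1.
    """
    rows, cols = len(a) + 1, len(b) + 1
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i  # delete all i characters of a
    for j in range(cols):
        dist[0][j] = j  # insert all j characters of b

    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # substitution or match
            )
            # Transposition of two adjacent characters.
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                dist[i][j] = min(dist[i][j], dist[i - 2][j - 2] + 1)
    return dist[-1][-1]


print(osa_distance("street", "steret"))  # 1 (one adjacent swap)
print(osa_distance("street", "stret"))   # 1 (one deletion)
```

The table has one cell per pair of character positions, so the cost grows with the product of the two string lengths. That is one reason a comparator of this kind can add overhead on large datasets.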