String cleansing in Match FAQ
Learn about string cleansing in Match through frequently asked questions.
How does the sequence of entries in the dictionary file impact the output?
The dictionary is processed from top to bottom. The order affects the output only when the same source value appears more than once. In that case, the first entry with that source value in the sequence is applied. For more details, see the String Replacement Cleanser topic.
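To check a dictionary for repeated source values before you upload it, a small script along these lines can help. This is a minimal sketch, not a Reltio utility: the function name `find_shadowed_entries` is hypothetical, and it assumes each entry uses the `source=>replacement` form, one entry per line.

```python
def find_shadowed_entries(lines):
    """Return (line_number, source, replacement) for every entry whose
    source value already appeared earlier in the dictionary.

    Assumption: entries look like 'source=>replacement', one per line.
    """
    first_seen = {}   # source value -> line number of its first occurrence
    shadowed = []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if "=>" not in line:
            continue  # skip blank or malformed lines
        source, replacement = line.split("=>", 1)
        if source in first_seen:
            # A later duplicate: the earlier entry is the one that applies.
            shadowed.append((number, source, replacement))
        else:
            first_seen[source] = number
    return shadowed


entries = ["street=>St", "str=>st", "street=>Street"]
print(find_shadowed_entries(entries))  # [(3, 'street', 'Street')]
```

Entries that the script reports never take effect under the top-to-bottom rule described above, so you can remove or reorder them.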
If a source term matches multiple entries, how will this be processed?
The dictionary is processed top to bottom, and the cleanser applies the first match it finds. That first replacement changes the source text, so later entries with the same source value typically won't match. The only time you'll see multiple replacements take effect is when the replacement from an earlier line becomes a source match on a later line (a chained replacement). For example, if your dictionary contains street => str and later str => st, then 101 Main Street is first replaced with 101 Main str and then replaced again with 101 Main st.
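The chained behavior can be modeled with a short simulation. This is a simplified sketch and not the cleanser's actual implementation: the function name `cleanse` is hypothetical, and whole-word, case-insensitive matching is an assumption made for illustration only.

```python
import re


def cleanse(text, entries):
    """Apply (source, replacement) pairs in order, top to bottom.

    Assumptions for this sketch: whole-word, case-insensitive matching.
    """
    for source, replacement in entries:
        pattern = r"\b" + re.escape(source) + r"\b"
        # Each entry sees the text as modified by the entries before it,
        # which is what makes chained replacements possible.
        text = re.sub(pattern, lambda m: replacement, text, flags=re.IGNORECASE)
    return text


chained = [("street", "str"), ("str", "st")]
print(cleanse("101 Main Street", chained))    # 101 Main st

reversed_order = [("str", "st"), ("street", "str")]
print(cleanse("101 Main Street", reversed_order))  # 101 Main str
```

In the second call, str => st runs before street => str produces str, so no chain occurs. Under this model, a chain depends on the entry that produces the intermediate value appearing above the entry that consumes it.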
For example, consider these dictionary entries:
street=>St
stree=>st
str=>st
stret=>st
mains=>main
Input records:
Record 1: 101 Mains Street
Record 2: 101 Main St
Output records:
Record 1: 101 Main St
Record 2: 101 Main St
Should punctuation rules like periods, commas, exclamation points, etc., be at the start or end of the file?
Punctuation rules aren't mandatory. Include them only if your use case requires them.
What are the recommendations for handling spaces?
Spaces are interpreted and processed based on their position within a dictionary entry and whether they appear in the pattern or replacement. In patterns, leading spaces are removed unless explicitly handled using a word boundary. Spaces between characters and after characters are matched as written.
In replacements, spaces before and between characters are preserved exactly as defined, while trailing spaces are removed during processing. If a trailing space must be retained, use an explicit two-step dictionary configuration. To retain spaces, include them explicitly in both the pattern and replacement strings.
For detailed behavior and examples, see Space processing in pattern and replacement strings.
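As a rough illustration of the trimming rules above, the following sketch shows which spaces survive in an entry. It is only a model of the described behavior: the function name `effective_entry` is hypothetical, and the word-boundary exception for leading spaces in patterns is not modeled.

```python
def effective_entry(pattern, replacement):
    """Return the (pattern, replacement) pair as described in this FAQ.

    Modeled rules (assumptions based on the description above):
    - leading spaces in the pattern are removed
    - trailing spaces in the replacement are removed
    - all other spaces are kept as written
    """
    return pattern.lstrip(" "), replacement.rstrip(" ")


# Leading pattern spaces and trailing replacement spaces are dropped.
print(effective_entry("  st ", " street  "))  # ('st ', ' street')
```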
If translating to standard terms, what are the implications of going to larger vs smaller target terms?
Reltio doesn't enforce a specific target-term length. Choose the target term length based on your use case:
- Use larger target terms when you want better readability (for example, spelling out words).
- Use smaller target terms when you want the cleansed values to take less space (for example, abbreviations).
Are there any special considerations for using "class": "com.reltio.match.comparator.DynamicDamerauLevenshteinDistance"?
The DynamicDamerauLevenshteinDistance comparator evaluates the similarity between strings by counting insertions, deletions, substitutions, and transpositions (swaps of adjacent characters). This makes it well suited to use cases where small typographical errors are common. Because this flexible matching may introduce performance overhead on large datasets, confirm that your use case benefits from it before you configure the comparator. For details about comparator classes, see the Comparator Classes topic.
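To show what counting these four edit types looks like, here is a minimal sketch of the optimal string alignment variant of Damerau-Levenshtein distance. It is a generic textbook version, not Reltio's comparator: the function name `osa_distance` is hypothetical, and how the comparator turns a distance into a match decision is not covered here.

```python
def osa_distance(a, b):
    """Edit distance counting insertions, deletions, substitutions,
    and transpositions of adjacent characters (optimal string alignment)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters of a
    for j in range(cols):
        d[0][j] = j  # insert all j characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent characters swapped: count as a single edit.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(osa_distance("main", "mian"))     # 1 (one transposition)
print(osa_distance("street", "stret"))  # 1 (one deletion)
```

The table has one cell per pair of character positions, so the work grows with the product of the two string lengths. That cost per comparison is one reason this kind of comparator can add overhead on large datasets.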