Explicit NULL marking
Learn why it's important to explicitly mark NULL values in MDM systems, and how it affects survivorship and golden record creation in Reltio.
Why MDM systems require explicit NULL marking
In Master Data Management (MDM) systems, it's important to clearly distinguish between different meanings of "nothing". A blank field might mean different things depending on the context:
- "We don’t know this yet" (e.g., a customer didn’t give their phone number).
- "This doesn’t apply" (e.g., a company doesn’t have a middle name).
- "We want to remove the old value" (e.g., deleting outdated info).
Computers treat an empty string (""
) and a null value differently:
""
= A blank value. It is a blank string, but it exists.NULL
= The absence of a value. The value does not exist.
When combining records, for example in merging duplicates, it’s important to:
- Keep correct values
- Know when to override or ignore missing data
- Provide the best version of every attribute
In general, the MDM golden record is additive. If there is 1 value in one source record and 0 in another, we should provide the value that exists. This means that there is a need to provide a distinction if you want to treat the absence of a value as a value to be able to be survived.
Survivorship in MDM
Another key concept in MDM is survivorship, which refers to the rules or algorithms that determine which attribute value is retained when multiple records are merged. These algorithms always value having a value over not having a value.
Survivorship is typically applied at the individual attribute level. Each field in a master record may have its own rule to decide the winning value.
One commonly used strategy is Recency. This means that when conflicting data exists:
- The value from the most recent value is selected as the survivor.
This approach ensures that the master data reflects the most current information available across all systems. Systems in the MDM environment typically do not have authorship control over other source data. This means that system A cannot directly update data from system B.
There is no effective way to determine the reasoning for the removal of an attribute value. The platform then always assumes that a NULL means that the value is missing and it should prioritize other values in all cases. This can be managed through manual pinning in the platform or by using a replacement value to designate that the user wants to include this missing value in the survivorship consideration.
Example 1: Merging customer records
In this example, the Email and Phone Number fields follow the Recency survivorship strategy. That means the system will prefer the most recent value available. Only records with a value are considered to compete in the survivorship.
Case A: Identifying intentional NULL with a designated value “N/A”
Customer ID | First Name | Last Name | Phone Number | Last Updated | |
---|---|---|---|---|---|
12345 | John | Smith | john.smith@email.com | (555) 123-4567 | 2024-12-01 |
Customer ID | First Name | Last Name | Phone Number | Last Updated | |
---|---|---|---|---|---|
12345 | John | Smith | N/A | 2025-02-15 |
Field | A Value | B Value | Action Taken |
---|---|---|---|
First Name | John | John | Keep "John" |
Last Name | Smith | Smith | Keep "Smith" |
john.smith@email.com | Keep existing email | ||
Phone Number | (555) 123-4567 | N/A | Change to "N/A" |
- Email: Source B didn't provide a value so the system retained the most recent non-null value from Source A.
- Phone Number: The value "N/A" was interpreted as an explicit overwrite, replacing the original number.
Case B: Without explicit marking
Customer ID | First Name | Last Name | Phone Number | Last Updated | |
---|---|---|---|---|---|
12345 | John | Smith | 2025-02-15 |
Field | A Value | B Value | Action Taken |
---|---|---|---|
First Name | John | John | Keep "John" |
Last Name | Smith | Smith | Keep "Smith" |
john.smith@email.com | Keep existing email | ||
Phone Number | (555) 123-4567 | Keep phone number |
- Email: Source B had the field blank, which was treated as missing and not an instruction to delete. The system kept the non-null value from Source A.
- Phone Number: Same as above. The blank was interpreted as "no update," so the system retained the existing value.
Example 2: Updating product information
In this example, the Color, Weight, and Warranty fields also follow the "most recent survives" survivorship strategy. When newer records provide explicit NULLs, the system will remove the older values; when fields are blank, the system retains the previous values.
Case A: With explicit NULL marking
SKU | Product Name | Color | Weight (kg) | Warranty | Last Updated |
---|---|---|---|---|---|
9087-XP | UltraWidget Pro 500 | Red | 1.2 | 2 years | 2024-10-01 |
SKU | Product Name | Color | Weight (kg) | Warranty | Last Updated |
---|---|---|---|---|---|
9087-XP | UltraWidget Pro 500 | N/A | N/A | 2025-01-10 |
Field | A Value | B Value | Final MDM Value |
---|---|---|---|
Product Name | UltraWidget Pro 500 | Same | UltraWidget Pro 500 |
Color | Red | N/A | Change to "N/A" |
Weight | 1.2 | 1.2 | |
Warranty | 2 years | N/A | Change to "N/A" |
- Color: Source B's use of "N/A" was interpreted as an explicit override. The original "Red" value was replaced.
- Weight: Source B provided no value, so the system retained the non-null value from Source A.
- Warranty: "N/A" from Source B signaled an intentional removal or overwrite, replacing the previous warranty period.
Case B: Without explicit NULL marking
Field | A Value | B Value | Final MDM Value |
---|---|---|---|
Product Name | UltraWidget Pro 500 | Same | UltraWidget Pro 500 |
Color | Red | Red (assumed retain) | |
Weight | 1.2 | 1.2 | |
Warranty | 2 years | 2 yrs (potentially outdated) |
- Color: Source B didn't provide a value. The system retained the existing value “Red” from Source A.
- Weight: Since Source B didn't supply a new value, the system preserved the “1.2” value from Source A.
- Warranty: With no updated value from Source B, the system retained the existing warranty period from Source A. This value may be outdated if not intentionally preserved.
Summary and application in the Reltio platform
- Avoid guesswork to differentiate what “missing” values you want to be considered for survivorship vs which ones you don’t
- Maintain clean, reliable data
- Make accurate decisions when merging or updating records
The concept of survivorship further strengthens data quality by ensuring the best and most accurate value is selected for each field. Using the Recency strategy allows the MDM system to reflect the most current and relevant data, especially when values conflict or change over time.
Since all source data is maintained in Reltio, if the goal is to simply return the newest crosswalk. That is, to always return the most recent crosswalk update, then you can do that as well by looking at the individual crosswalk values.