
Recommendations for Data Profiling and Cleanse

Consider the following recommendations for data profiling and cleansing.

Using Statistically Representative Data Sample for Profiling

For large data sets, use a statistically representative sample of the data. Here are some guidelines for selecting such samples (a sampling sketch follows the list):

  • The sample must be large enough. Typically, it must contain at least 10% of the records or 500,000 records, whichever is greater.
  • The sample must be composed of records selected randomly throughout the dataset, not just from the beginning of the file or table. An easy way to do this is to select every 10th record; this approach is fairly random and yields a 10% sample.
Note: Selecting all data from a particular state or zip code may not produce a sample that is representative of the entire source. For example, CA has a large number of HCPs, while HCPs in Northern Florida often hold multiple state licenses because they practice in multiple states.
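
The following is a minimal sketch of such sampling in Python with pandas, assuming the source is a delimited file; the file name is illustrative, and the 10%/500,000 thresholds come from the guidelines above:

    import pandas as pd

    # Illustrative file name; replace with the actual source extract.
    df = pd.read_csv("hcp_source_extract.csv")

    # Systematic sampling: every 10th record across the whole file,
    # which is fairly random and yields a 10% sample.
    sample = df.iloc[::10]

    # Guideline check: at least 10% of records or 500,000, whichever is greater.
    target_size = max(int(len(df) * 0.10), 500_000)
    if len(sample) < target_size:
        # Fall back to a random sample of the required size (capped at the
        # dataset size), with a fixed seed for reproducibility.
        sample = df.sample(n=min(target_size, len(df)), random_state=42)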

Data Profiling Metrics for Source Attributes

Capture data profiling metrics for every source attribute. The following metrics must be captured (a profiling sketch follows the list):

  • Number of unique values
  • Number of null values
  • Type of the attribute (Number, Character, Boolean, Date, and so on)
  • Maximum and minimum length of each attribute
  • The average length of each attribute
  • Modal (most frequent) values
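
As a minimal sketch, assuming the source attributes are loaded into a pandas DataFrame, the metrics above can be computed per attribute as follows (the function and output column names are illustrative):

    import pandas as pd

    def profile_attributes(df: pd.DataFrame) -> pd.DataFrame:
        # Compute the profiling metrics listed above for each source attribute.
        rows = []
        for col in df.columns:
            s = df[col]
            lengths = s.dropna().astype(str).str.len()
            modes = s.mode()
            rows.append({
                "attribute": col,
                "unique_values": s.nunique(dropna=True),
                "null_values": int(s.isna().sum()),
                "type": str(s.dtype),
                "max_length": int(lengths.max()) if not lengths.empty else 0,
                "min_length": int(lengths.min()) if not lengths.empty else 0,
                "avg_length": float(lengths.mean()) if not lengths.empty else 0.0,
                "modal_value": modes.iloc[0] if not modes.empty else None,
            })
        return pd.DataFrame(rows)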

Typical Data Quality Issues and Cleansing Recommendations

This section describes typical data quality issues and cleansing recommendations. A consolidated cleansing sketch follows the list.

  • Generic issues
    • Alpha characters in numeric attributes must be stripped.
    • Non-printable characters must be stripped.
    • Special characters in alphabetic attributes (for example, Name) must be stripped.
  • Phone numbers
    • Phone numbers must be formatted consistently, especially international phone numbers.
    • Check for the following types of numbers and, if necessary, remove them:
      • High-frequency phone numbers that may be default values, for example, 999-999-9999.
      • Placeholder numbers entered habitually during data entry, such as the salesperson’s phone number or the main phone number of the department or company.
  • Organization Names - These may contain noise words such as CORP, LTD, ASSOCIATION, and so on. Such words must either be standardized or removed. For example, CORP must be standardized to CORPORATION or removed entirely to make the name attribute more consistent.
    • The preferred way to do this is in the ETL layer, as every source has its own noise words and patterns.
    • The less preferred method is to configure match rules to ignore noise words based on a customized dictionary.
  • Person Names
    • Person names sometimes include a Title (for example, Mr., Dr., and so on) and a Suffix (for example, Jr, Sr, and so on). These words must be removed from the name and stored in separate attributes.
    • First names commonly include nicknames (for example, Bill Smith for William Smith). Reltio's match engine uses a dictionary to substitute full names for nicknames in flight, without changing the original data. Configure the NameDictionary cleanser to enable this behavior.
  • Email
    • A basic verification of email format is to check whether the email address conforms to the pattern email@domain.com. The preferred place to do this is the ETL layer, where source-specific cleansing rules can be applied.
    • This check does not guarantee that the email address is valid, but it confirms that the address is in the expected format.
  • Address
    • Reltio’s built-in address cleansing engine can interpret and standardize poor quality source addresses. However, if the data contains anomalous patterns that are source-specific (for example, uncommon abbreviations or keywords), the engine may not be able to parse the address correctly.
    • The recommended approach to improve address parsing and quality is to apply pre-processing quality rules:
      • Noise words such as UNKNOWN, NONE, or N/A must be stripped to improve data quality and to prevent bad matches. Address Line 1 is considered the minimum mandatory field; if such noise values appear in key attributes such as Address Line 1, City, or State, the value of the record is questionable and it must be filtered out and not loaded into Reltio.
      • Address Line 1 must contain premise and street information. Additional information such as Contact Name, Department, and so on, must be removed from Address Line 1.
      • Fix basic transposition errors, for example, the street address appearing in Address Line 2 and the suite in Address Line 1.
      • The country code is key to cleansing international addresses and must always be provided.
      • Providing the five-digit ZIP code (Zip5) is recommended for better responses.
  • Dates
    • Check that all dates follow the same format, or adjust them to be consistent.
    • Check high-frequency dates to determine whether they are default values, such as 1/1/1900. Such default values are not meaningful for record matching; they must be removed or flagged for exclusion from the match rules.
  • Reference Data
    • Harmonization of reference data such as Specialty Codes, SIC codes, and so on, across sources can be a difficult exercise. This must be started as early in the project as possible.
    • Reltio provides RDM (Reference Data Management) capabilities that ease the work of defining reference data and mappings, and that apply the mappings automatically when data is loaded into Reltio.
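
The following is a consolidated sketch of ETL-layer cleansing rules of the kind described above, assuming Python is used in the ETL layer; the noise-word dictionary, default phone numbers, and function names are illustrative and must be tailored per source:

    import re

    # Illustrative per-source dictionaries; extend these from profiling results.
    ORG_NOISE_WORDS = {"CORP": "CORPORATION", "LTD": "LIMITED"}
    ADDRESS_NOISE_VALUES = {"UNKNOWN", "NONE", "N/A"}
    DEFAULT_PHONE_NUMBERS = {"9999999999"}
    EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

    def strip_nonprintable(value: str) -> str:
        # Generic issue: remove non-printable characters.
        return "".join(ch for ch in value if ch.isprintable())

    def cleanse_numeric(value: str) -> str:
        # Generic issue: strip alpha characters from numeric attributes.
        return re.sub(r"[^0-9]", "", value)

    def cleanse_phone(value: str):
        # Normalize to digits, then drop known default or placeholder numbers.
        digits = re.sub(r"\D", "", value)
        return None if digits in DEFAULT_PHONE_NUMBERS else digits

    def standardize_org_name(value: str) -> str:
        # Standardize noise words, for example, CORP -> CORPORATION.
        tokens = [ORG_NOISE_WORDS.get(t, t) for t in value.upper().split()]
        return " ".join(tokens)

    def cleanse_address_line(value: str):
        # Filter out noise values that would otherwise cause bad matches.
        cleaned = value.strip()
        return None if cleaned.upper() in ADDRESS_NOISE_VALUES else cleaned

    def is_expected_email_format(value: str) -> bool:
        # Basic format check only; it does not prove the address is valid.
        return bool(EMAIL_PATTERN.match(value))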

Optimizing Match Rules as per Data Profiling Results

This section describes ways to optimize match rules based on the data profiling results obtained. A sketch that triages attributes by cardinality follows the list.

  • High Cardinality (high degree of uniqueness) attributes such as identifiers are the best candidates for match attributes.
  • Medium cardinality attributes such as Name and Address are the next best.
  • Low cardinality attributes such as Gender and State Code must only be used in conjunction with a high or medium-cardinality attribute. This helps to partition or filter the match population.
  • Sparse attributes (those with a high percentage of nulls) must be treated as low cardinality attributes.
  • Data profiling can highlight medium-cardinality attributes that contain frequently repeating values. Such values must be avoided because they:
    • Create hot spots (that is, subsets of data that match heavily against one another)
    • Cause poor match performance and high numbers of potential matches
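
As an illustration, assuming profiling results are available in a pandas DataFrame, attributes can be triaged into match-rule candidates by cardinality and sparsity; the thresholds below are illustrative, not Reltio defaults:

    import pandas as pd

    def classify_match_candidates(df: pd.DataFrame) -> dict:
        # Triage attributes by cardinality and sparsity for match-rule design.
        labels = {}
        total = max(len(df), 1)
        for col in df.columns:
            null_ratio = df[col].isna().mean()
            cardinality_ratio = df[col].nunique(dropna=True) / total
            if null_ratio > 0.5:
                # Sparse attributes must be treated as low cardinality.
                labels[col] = "sparse: treat as low cardinality"
            elif cardinality_ratio > 0.9:
                labels[col] = "high cardinality: best match candidate"
            elif cardinality_ratio > 0.1:
                labels[col] = "medium cardinality: combine with others"
            else:
                labels[col] = "low cardinality: partition/filter only"
        return labels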

Optimizing Survivorship Rules as per Data Profiling Results

You can optimize survivorship rules based on the data profiling results obtained. Business users usually have a good idea of the quality and priority of their sources. Data profiling metrics for each source can augment that knowledge and improve the design of survivorship rules.