With a schema, you can validate data. When you use the VALIDATE statement type, two datasets are created: valid and invalid. You can pass the valid dataset on to other statements. For the invalid dataset, you can save it to a file, pass it to a Lingk event (for event-based operations), or do nothing.
For more information on schemas, see the PRINTSCHEMA statement.
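The valid/invalid split can be pictured with a small plain-Python sketch. This is not Lingk recipe syntax: the `SCHEMA` mapping, the `split_by_schema` helper, and the sample rows are all hypothetical, and the check is limited to "field is present and has the expected type".

```python
# Hypothetical schema: field name -> expected Python type.
SCHEMA = {"id": int, "email": str}


def split_by_schema(rows, schema):
    """Partition rows into (valid, invalid) lists according to the schema."""
    valid, invalid = [], []
    for row in rows:
        # A row is valid only if every schema field is present with the right type.
        ok = all(
            isinstance(row.get(field), expected)
            for field, expected in schema.items()
        )
        (valid if ok else invalid).append(row)
    return valid, invalid


rows = [
    {"id": 1, "email": "a@example.com"},
    {"id": "2", "email": "b@example.com"},  # id has the wrong type
    {"id": 3},                              # email is missing
]

valid_rows, invalid_rows = split_by_schema(rows, SCHEMA)
print(len(valid_rows), "valid;", len(invalid_rows), "invalid")
```

The valid rows would continue through later steps, while the invalid rows could be written to a file or handed to an event handler.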
When using a recipe, the full power of SQL is available to you. You can therefore use a SQL GROUP BY statement with aggregate functions (like last()). With this approach, it's easy to get a clean dataset when duplicates are present in the data.
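As a rough illustration of what grouping with last() achieves, the plain-Python sketch below keeps the last row seen for each key. The `dedupe_keep_last` helper, the field names, and the sample rows are hypothetical, and the sketch assumes the rows already arrive in the order you care about. The SQL in the comment is only an assumed shape for such a query, not taken from a Lingk recipe.

```python
# Assumed SQL shape (hypothetical table and column names):
#   SELECT id, last(email) AS email FROM contacts GROUP BY id


def dedupe_keep_last(rows, key):
    """Return one row per key value, keeping the last row seen for each key."""
    latest = {}
    for row in rows:
        # Later rows overwrite earlier rows that share the same key.
        latest[row[key]] = row
    return list(latest.values())


contacts = [
    {"id": 1, "email": "old@example.com"},
    {"id": 2, "email": "bob@example.com"},
    {"id": 1, "email": "new@example.com"},  # duplicate id, newer value
]

clean = dedupe_keep_last(contacts, "id")
print(clean)
```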
Recipes run in Apache Spark using Lingk's SIRE engine. There is no need for staging tables, as all processing happens in memory and data can be pulled from systems on demand. You can therefore use SQL principles to match two or more datasets.
For an example of Standard and Fuzzy Matching, see the following recipe: Fuzzy Matching with Soundex and Levenshtein.
For standard matching, you can use an INNER JOIN between two datasets. You can match the two datasets on one or many fields with different operators (including "=" and "LIKE"). This handles exact matches easily. If you need more sophisticated matching, see Fuzzy Matching below.
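A standard match of this kind can be sketched as follows. The example uses Python's built-in sqlite3 module as a stand-in for the recipe's SQL engine, so small dialect differences are possible; the `crm` and `sis` tables, their columns, and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE crm (id INTEGER, email TEXT, last_name TEXT);
    CREATE TABLE sis (student_id INTEGER, email TEXT, last_name TEXT);
    INSERT INTO crm VALUES
        (1, 'ann@example.com', 'Lee'),
        (2, 'bob@example.com', 'Stone');
    INSERT INTO sis VALUES
        (10, 'ann@example.com', 'Lee-Smith'),
        (11, 'carl@example.com', 'Stone');
    """
)

# Match on two fields: exact equality on email, prefix match on last name.
matches = conn.execute(
    """
    SELECT crm.id, sis.student_id
    FROM crm
    INNER JOIN sis
        ON crm.email = sis.email
       AND sis.last_name LIKE crm.last_name || '%'
    ORDER BY crm.id
    """
).fetchall()

print(matches)
conn.close()
```

Only the row whose email matches exactly and whose last name starts with the same text is returned; the two 'Stone' rows share a last name but not an email, so they do not join.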
Fuzzy matching helps when you have to handle the following use cases:
- Typos (Michael vs Micheal)
- Nicknames (Michael vs Mike)
- Phonetic similarities (Michael vs Mikael)
- Differences in storage conventions, such as a name stored as First Name, Last Name in one system and as Last Name, First Name in another
Using the built-in functions, you can build SQL queries that use fuzzy matching for joins.
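A fuzzy join along these lines might look like the sketch below. It assumes the functions in question are Soundex and Levenshtein, as named in the recipe title above. Because the sketch runs on Python's sqlite3 rather than on Spark, it defines its own `soundex` and `levenshtein` helpers and registers them as SQL functions; the tables, sample names, and the distance threshold of 2 are hypothetical.

```python
import sqlite3

# Letter-to-digit codes for American Soundex.
_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(name):
    """Four-character Soundex code, e.g. 'Robert' -> 'R163'."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = _CODES.get(ch, "")
        if code and code != prev:
            result += code
        if ch not in "HW":  # H and W do not separate repeated codes
            prev = code
    return (result + "000")[:4]


def levenshtein(a, b):
    """Edit distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


conn = sqlite3.connect(":memory:")
conn.create_function("soundex", 1, soundex)
conn.create_function("levenshtein", 2, levenshtein)
conn.executescript(
    """
    CREATE TABLE source_a (name TEXT);
    CREATE TABLE source_b (name TEXT);
    INSERT INTO source_a VALUES ('Michael'), ('Sarah');
    INSERT INTO source_b VALUES ('Micheal'), ('Mikael'), ('Sara'), ('David');
    """
)

# Join on matching phonetic codes, limited to names within two edits.
fuzzy_matches = conn.execute(
    """
    SELECT a.name, b.name
    FROM source_a a
    INNER JOIN source_b b
        ON soundex(a.name) = soundex(b.name)
       AND levenshtein(a.name, b.name) <= 2
    ORDER BY a.name, b.name
    """
).fetchall()

print(fuzzy_matches)
conn.close()
```

Combining a phonetic check with an edit-distance limit is one way to catch typos and phonetic variants while keeping unrelated names apart. Nicknames such as Mike for Michael have different Soundex codes and would need a separate lookup.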