When configuring the data diff postprocessor, give it the same name as the value of the postprocessor attribute on the input provider, and set its type to datadiff.

You must also specify the primary key field(s) of your input data, comma delimited if there is more than one. Primary keys are required because the rest of each row is stored as a hash, so the key is what identifies a record from one run to the next.
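To make the role of the primary key concrete, here is a minimal Python sketch of the general idea: each row is reduced to its key plus a hash of the remaining fields. This is an illustration only, not Lingk's implementation; the snapshot function, the choice of SHA-256, and the sample field names are all assumptions made for the example.

```python
import hashlib
import json

def snapshot(rows, key_fields):
    """Map each row's primary-key tuple to a hash of its remaining fields.

    Hypothetical helper for illustration; the real postprocessor may use a
    different hash function and storage layout.
    """
    result = {}
    for row in rows:
        key = tuple(row[f] for f in key_fields)
        # Everything that is not part of the key is collapsed into one hash.
        rest = {k: v for k, v in row.items() if k not in key_fields}
        payload = json.dumps(rest, sort_keys=True, default=str)
        result[key] = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return result

rows = [
    {"id": 1, "name": "Ada", "email": "ada@example.com"},
    {"id": 2, "name": "Grace", "email": "grace@example.com"},
]
stored = snapshot(rows, ["id"])
for key, digest in stored.items():
    print(key, digest[:12])
```

Because only the hash of the non-key fields is kept, the key is the one part of the row that can still be read back and compared directly on the next run.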

The statements section of the recipe is where you write the statements that show the differences between runs of the recipe. The postprocessor automatically adds a __change field to the data it stores, indicating whether a record was added (a), updated (u), or deleted (d). Writing statements for a data diff is similar to writing statements for any other recipe: use a where clause on the __change field in your Spark SQL statements to select the records that differ from the last run of the recipe.
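The meaning of the three __change values can be sketched in plain Python. The code below is an assumption-laden illustration of how a hash-based diff could classify records, not the Lingk engine's actual logic; row_hash, diff, and the sample data are hypothetical names invented here. The final list comprehension plays the role that a where clause on __change plays in a Spark SQL statement.

```python
import hashlib
import json

def row_hash(row, key_fields):
    """Hash every non-key field of a row (SHA-256 is an assumption)."""
    rest = {k: v for k, v in row.items() if k not in key_fields}
    payload = json.dumps(rest, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def diff(previous, current_rows, key_fields):
    """Tag rows with __change: 'a' added, 'u' updated, 'd' deleted.

    previous maps a primary-key tuple to the hash stored by the last run.
    Unchanged rows are not returned.
    """
    changes = []
    seen = set()
    for row in current_rows:
        key = tuple(row[f] for f in key_fields)
        seen.add(key)
        if key not in previous:
            changes.append({**row, "__change": "a"})
        elif previous[key] != row_hash(row, key_fields):
            changes.append({**row, "__change": "u"})
    # Keys stored last time but missing now are deletions; only the key
    # fields are known for them, because the rest was stored as a hash.
    for key in previous:
        if key not in seen:
            changes.append({**dict(zip(key_fields, key)), "__change": "d"})
    return changes

first_run = [
    {"id": 1, "name": "Ada"},
    {"id": 2, "name": "Grace"},
    {"id": 3, "name": "Alan"},
]
second_run = [
    {"id": 1, "name": "Ada"},       # unchanged
    {"id": 2, "name": "Grace H."},  # edited
    {"id": 4, "name": "Edsger"},    # new
]

previous = {(r["id"],): row_hash(r, ["id"]) for r in first_run}
changes = diff(previous, second_run, ["id"])

# Comparable in spirit to filtering with: where __change in ('a', 'u')
added_or_updated = [r for r in changes if r["__change"] in ("a", "u")]
print(added_or_updated)
```

In this sketch a deleted record carries only its key fields; whether the real postprocessor returns more than that for deleted rows is not stated in this documentation.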

The S3Bucket type of StorageBucket specifies that the hashed data for the diff will be stored in a secure Lingk-hosted S3 bucket.

Example Recipe

Run the Data Diff Simplified recipe.

The Data Diff Simplified recipe explained...

The JSON provider above shows the data that serves as input to the data diff postprocessor. To see the diff at work, run the example once, then add a new JSON row, edit an existing row within the JSON object, and re-run the recipe.

Note the postprocessor property within this provider. Its value must be used as the name of the data diff postprocessor.