DATA Step: the UPDATE Statements

When combining two datasets with the MERGE statement, any variables with the same name will take the values from the latter datasetThis happens even when there are missing values at the latter dataset. However, in a more common data processing scenarios, we would like to keep the former data values when the associated latter data values are missing. 

In such scenarios, you may use the UPDATE statement, instead of MERGE. The UPDATE statement allows you to update existing observations in a master dataset with corresponding observations from a transaction dataset. In the process, if any values are missing in the transaction, it will keep the values of the master dataset. 

To illustrate, let's consider the two datasets created by the following DATA steps:

From the first DATA step instructions, the master_dataset contains three variables: var_a, var_b, and var_c. Similarly, the transact_dataset includes four variables named var_a, var_b, var_e and var_c. So, the two datasets have three common variables: var_a, var_b, and var_c. Here, observe that the transact_dataset contains some missing values for var_a, var_b, and var_c.

Performing MERGE statement will result in:

We see that the values from the transact_dataset take precedence, regardless of missing. On the other hand, the UPDATE statement will keep the master_dataset value if the transact_dataset is missing. For example:

Note that to execute the UPDATE statement, the two datasets must be appropriately sorted by the PROC SORT. Also note that the common variables in the two datasets must have the same data types.

Post a Comment