DATA Step: Combining Two Datasets

In a data analysis project, we commonly encounter a situation where combining two or more datasets are required. For example, when there are multiple sources of information about a subject, such as sales data from different region or demographic data from various surveys, combining datasets becomes essential for comprehensive analysis. Or when we are working on a project with an ongoing observations, such as panel data analysis or any other repeated measures, we often want to concatenate or update previous observations.

All these different types of dataset combinations help in gaining a holistic view and drawing more accurate insights. As a robust data management tool, SAS provides a set of functionalities to achieve this. In this blog post, we will explore how to combine multiple datasets using the SET and MERGE statements with some examples based on the following three datasets. Let's get started!

The SET Statements

Vertical Concatenation of the Two Datasets

In a SAT DATA step, the SET statement is primarily used to bring observations from the source dataset. Taking advantage of this, this statement can also be leveraged to concatenate two or more datasets. When multiple datasets are specified in a single SET statement, SAS vertically concatenates them in the order specified. For example:

In this example DATA step, SAS fetched each observation from dataset_one. Once it reached the last observation of the dataset, it started to bring each observation from the second data source specification, which is also dataset_one. So, the resulting output has in total 8 observations with the two variables var_a and var_b.

Note that the SET statement do not match any observation and tries its best to contain all the data values from its dataset sources. That is, even for the variables that is not common in both data sources, SAS will retain them with missing values in the output. For example:

In this example, the two separate datasets (dataset_one with variables var_a and var_b, and dataset_two with variables var_c and var_d), have no common variable. Thus, the variables where it cannot find any matches are missing in the output dataset.

Interleaving Observations

In the example datasets, we see that dataset_one has two variables, var_a and var_b, while dataset_three has three variables, var_a, var_b, and var_e. Notably, there are two common variables between them. Thus, if we concatenate these two datasets using the SET statement, we can expect there will be three variables in the output. 

By default, the SET statement does not sort the observations by common variables. To sort the observations and interleave values from both datasets, you should first sort the two data sources using PROC SORT. Then, after the SET statement, specify the variable by which you want the output to be sorted. For example:

One-to-one Data Reading

When there are two SET statements in a single DATA step, SAS first reads all the observations from the first SET statement and stores them in a temporal space called PDV. This temporal dataset has dimension exactly the same as the dataset specified in the first SET statement. Subsequently, it moves onto the second SET statement to bring the observations from the next dataset. 

When fetching observations from the second dataset, for each observation in the PDV, SAS replaces any existing values for the common variables. If there is any variable from the second dataset that does not exist in the first dataset, SAS appends the variable to the PDV. For example:

The MERGE Statements

When it encounters a MERGE statement with two datasets in a DATA step, SAS simultaneously read each observation from the two datasets. Then it combines the two observations in a horizontal manner. For example:

Match Merge

In a more common scenario, we would like to merge the two data sets based on the common variable values. By specifying variables in the BY statement, you can instruct SAS to match observations from different datasets based on the shared values of the variable specification. For example:

Note that when using BY statements with MERGE in a SAS DATA step, the two datasets must be sorted beforehand. To sort the datasets, you may use the PROC SORT.

Post a Comment