DATA Step: Simplifying SAS Programs

 

In day-to-day SAS programming, you may find yourself needing to perform the same operation on many variables. For example, you might want to take the logarithm of every numeric variable. While you could do this with a series of assignment statements, handling many variables this way makes a program lengthy and hard to maintain. In this blog post, we'll explore some strategies to simplify and shorten a SAS program and make it easier to read. Let's get started!

 

The ARRAY Statement

In computer programming, an array is an ordered group of elements that share the same data type. You can think of an array as a shelf in a bakery. The shelf holds multiple items, just as an array stores multiple values, and it carries a label that tells you what kind of bread it holds (e.g., sourdough, croissants, bagels), just as an array has a name that identifies the data it stores. In this analogy, each loaf of bread on the shelf has a specific position. You can grab the third baguette from the left or the one at the very back. Likewise, each value in an array has a unique position, or index, that lets you access it directly.

In SAS programming, the ARRAY statement groups a set of variables with the same data type into an array. As long as they are all numeric or all character, the group can consist of any variables you choose; they may be ones that already exist in your dataset or new variables that you're creating in a DATA step. The ARRAY statement follows this general form:

ARRAY name (n) $ variable-list;

Where:

  • name: Name you give to the array.
  • n: Number of variables in the array.
  • $: Indicates that the array consists of character variables. Omit it for numeric variables.
  • variable-list: List of the variables you want to include in the array.
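To make the general form concrete, here is a minimal sketch of one numeric and one character array. The dataset name and all variable names (example, mon, tue, wed, tag1, tag2) are made up for illustration, and the 12 after the $ is an optional length for the new character variables.

```sas
data example;
    /* Numeric array of three new variables (hypothetical names) */
    array sales (3) mon tue wed;

    /* Character array of two new variables, each up to 12 characters long */
    array tag (2) $ 12 tag1 tag2;

    sales(1) = 10;      /* same effect as: mon = 10;     */
    tag(2)   = 'rye';   /* same effect as: tag2 = 'rye'; */
run;
```

Because none of these variables exist beforehand, the ARRAY statements create them; referring to an element by its index is just another way of naming the underlying variable.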

For example, let's consider the DATA step shown above. After importing raw data, line 13 defines an array named revenue with 5 numeric elements. The array does not copy any values; each element is simply another name for one of the variables baguette, boule_200g, etc., in the current observation. Then, line 14 starts a loop that iterates 5 times (once for each element in the array). Inside the loop, revenue(i) = revenue(i) * 0.85; calculates the revenue for each bread type by multiplying the number of units sold (held in the variable that revenue(i) refers to) by the price per unit (all breads have the same price of 0.85 in this example) and writes the result back to that variable.
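A DATA step along these lines might look like the sketch below. It is a reconstruction under assumptions, not the original listing: the dataset name bakery, the variables day, bagel, and sourdough, and the data values are all hypothetical, and the line positions will not match the line numbers mentioned in the text.

```sas
data bakery;
    /* Units sold per bread type; day, bagel and sourdough are assumed names */
    input day $ baguette boule_200g bagel sourdough croissant;

    /* Group the five numeric variables under one array name */
    array revenue (5) baguette boule_200g bagel sourdough croissant;

    /* Convert units sold to revenue at 0.85 per unit, one element at a time */
    do i = 1 to 5;
        revenue(i) = revenue(i) * 0.85;
    end;

    drop i;   /* the loop counter is not needed in the output dataset */
    datalines;
Mon 10 4 6 3 12
Tue 8 5 7 2 9
;
run;
```

Without the array, the same step would need five nearly identical assignment statements. With it, adding a sixth bread type only means extending the variable list and the loop bound.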


Using Shortcuts for Lists of Variable Names

When a SAS program involves many variables, you can take advantage of shortcuts for lists of variable names. Typing out every variable name, for instance to create an array, makes your code harder to maintain and invites typos. In such scenarios, abbreviated variable lists reduce the code length and keep your program maintainable. For example:


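In the example shown above, baguette -- croissant refers to all five variables from baguette to croissant, in the order in which they are defined in the dataset. So, for example, SUM(OF baguette -- croissant); passes those variables to the SUM function. The OF keyword is required here; without it, SAS evaluates the argument as an arithmetic expression rather than a variable list. In addition to this, SAS also reserves some special name lists: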

  • _NUMERIC_: All numeric variables in a dataset.
  • _CHARACTER_: All character variables in a dataset.
  • _ALL_: All variables in a dataset.
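As a rough illustration, the sketch below uses both a name range list and _NUMERIC_ in one DATA step. The dataset name totals, the variables day, bagel, and sourdough, and the data values are hypothetical.

```sas
data totals;
    input day $ baguette boule_200g bagel sourdough croissant;

    /* _NUMERIC_ expands to the numeric variables defined so far:
       the five bread counts (day is character and is skipped) */
    array nums (*) _numeric_;

    /* Sum over every element of the array */
    total_all = sum(of nums(*));

    /* Same total through a name range list; note the OF keyword */
    total_range = sum(of baguette -- croissant);
    datalines;
Mon 10 4 6 3 12
Tue 8 5 7 2 9
;
run;
```

The position of the ARRAY statement matters: total_all and total_range are created after it, so they are not part of nums.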

You can also restrict a name range to a single data type by placing the keyword NUMERIC or CHARACTER between the two names:

  • baguette-NUMERIC-croissant: All numeric variables from baguette to croissant.
  • baguette-CHARACTER-croissant: All character variables from baguette to croissant.
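The following sketch shows the typed range lists on a dataset that mixes numeric and character variables. The dataset name mixed and the variables note and flag are invented for this example.

```sas
data mixed;
    /* Numeric and character variables interleaved (note and flag are assumed names) */
    input baguette note $ boule_200g flag $ croissant;

    /* Numeric variables between baguette and croissant:
       baguette, boule_200g, croissant */
    array counts (*) baguette-numeric-croissant;

    /* Character variables in the same range: note, flag */
    array texts (*) $ baguette-character-croissant;

    n_counts = dim(counts);   /* number of elements in counts */
    n_texts  = dim(texts);    /* number of elements in texts  */
    datalines;
12 fresh 4 y 9
;
run;
```

Here DIM returns 3 for counts and 2 for texts, since the endpoints only mark the range and each list keeps just the variables of the requested type.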

Note that these abbreviated variable lists are not limited to the DATA step. You can also use them in a PROC step, for example in a VAR statement, to keep your code concise.
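For instance, a PROC step might use the same shortcuts as in this sketch. The dataset bakery_sales and the variables day, bagel, and sourdough are hypothetical; PROC MEANS and PROC PRINT are chosen only as familiar examples.

```sas
/* Small hypothetical dataset so the PROC steps have something to read */
data bakery_sales;
    input day $ baguette boule_200g bagel sourdough croissant;
    datalines;
Mon 10 4 6 3 12
Tue 8 5 7 2 9
;
run;

/* Name range list in a VAR statement: all variables from baguette to croissant */
proc means data=bakery_sales sum mean;
    var baguette -- croissant;
run;

/* Special name list: print only the numeric variables */
proc print data=bakery_sales;
    var _numeric_;
run;
```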
