PROC Step: Sorting Your Data

As its name implies, PROC SORT arranges the observations in a SAS dataset. Sorting is required before BY-group processing: if you want another procedure to process the dataset separately for each unique value of a variable, you must first sort the observations by that variable.

PROC SORT is not limited to preparing a dataset for another step. You may also use it to organize your data for a report or before combining two datasets in a DATA step. Here's the basic syntax of PROC SORT:

PROC SORT DATA=sas_dataset;
BY variable-list;
RUN;

In PROC SORT, the BY statement specifies the variables whose values SAS uses to arrange the observations, so it is required for the procedure. When there is more than one variable in the BY statement, SAS sorts observations by the first variable, then by the second variable within each value of the first, and so on. For example, let's consider a SAS dataset shown below:
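
The original listing is not reproduced here, so the DATA step below builds a small stand-in. Only the variable names title and tenure_months come from the text; the dataset name employees, the variable name, and every value are invented for illustration.

```sas
/* Hypothetical sample data: dataset name, "name" variable, and all values are assumptions */
data employees;
    input name $ title $ tenure_months;
    datalines;
Avery Manager 48
Blake Analyst 12
Casey Analyst 30
Devon Manager 18
Ellis Clerk 24
;
run;

/* Show the observations in their original (unsorted) order */
proc print data=employees;
run;
```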


Now, suppose that we want to print out this dataset, rearranging observations by title and tenure_months. You may achieve this as follows:
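
A minimal sketch of that sort, assuming a hypothetical employees dataset (the dataset name, the name variable, and all values are invented; only title and tenure_months come from the text):

```sas
/* Hypothetical sample data, repeated here so the example runs on its own */
data employees;
    input name $ title $ tenure_months;
    datalines;
Avery Manager 48
Blake Analyst 12
Casey Analyst 30
Devon Manager 18
Ellis Clerk 24
;
run;

/* Sort by title first, then by tenure_months within each title */
proc sort data=employees;
    by title tenure_months;
run;

/* Print the sorted observations */
proc print data=employees;
run;
```

With this sample, the two Analyst rows come first (12 months, then 30), followed by the Clerk row and then the two Manager rows (18 months, then 48).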

By default, PROC SORT replaces the original dataset with the sorted version. To keep the original intact, add the OUT= option, which names a separate dataset to hold the sorted observations. The NODUPKEY option tells SAS to keep only the first observation for each combination of BY variable values and to delete the others. If you also specify the DUPOUT= option, SAS writes the deleted observations to the dataset you name there. For example:
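
The sketch below combines the three options. The dataset names employees, one_per_title, and removed_rows, along with the sample values, are hypothetical.

```sas
/* Hypothetical sample data, repeated here so the example runs on its own */
data employees;
    input name $ title $ tenure_months;
    datalines;
Avery Manager 48
Blake Analyst 12
Casey Analyst 30
Devon Manager 18
Ellis Clerk 24
;
run;

/* OUT=     writes the sorted result to one_per_title; employees is left unchanged */
/* NODUPKEY keeps one observation per distinct value of title                      */
/* DUPOUT=  collects the observations that NODUPKEY deleted                        */
proc sort data=employees out=one_per_title nodupkey dupout=removed_rows;
    by title;
run;

proc print data=one_per_title;
run;

proc print data=removed_rows;
run;
```

Because the sample has three distinct titles among five rows, one_per_title ends up with three observations and removed_rows with two.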

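By default, SAS sorts observations in ascending order. To reverse the order for a variable, put the keyword DESCENDING before that variable's name in the BY statement. The keyword applies only to the variable that immediately follows it, so repeat it for each variable you want in descending order. For example: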
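
A short sketch, again using an invented employees dataset (all names other than title and tenure_months are assumptions):

```sas
/* Hypothetical sample data, repeated here so the example runs on its own */
data employees;
    input name $ title $ tenure_months;
    datalines;
Avery Manager 48
Blake Analyst 12
Casey Analyst 30
Devon Manager 18
Ellis Clerk 24
;
run;

/* title is sorted ascending (the default);              */
/* DESCENDING affects only tenure_months, which follows it */
proc sort data=employees out=by_tenure_desc;
    by title descending tenure_months;
run;

proc print data=by_tenure_desc;
run;
```

Here the titles still run from Analyst to Manager, but within each title the longest tenure comes first.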

When sorting observations by character variables, the default collating sequence depends on the operating environment you're using. In the z/OS operating environment, it follows the EBCDIC sequence, in which blanks come first, followed by lowercase letters, uppercase letters, and then numerals. In all other operating environments, the default collating sequence is ASCII, in which blanks come first, followed by numerals, uppercase letters, and then lowercase letters.

You can, however, explicitly specify which collating sequence to use with the SORTSEQ= option. For example:
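
The following sketch sorts the same values under both sequences so the difference is visible. The dataset codes, the variable code, and the values are hypothetical.

```sas
/* Hypothetical sample data: one numeral string, one uppercase-initial word, one lowercase word */
data codes;
    input code $;
    datalines;
apple
Banana
42
;
run;

/* ASCII sequence: numerals, then uppercase, then lowercase */
proc sort data=codes out=codes_ascii sortseq=ascii;
    by code;
run;

/* EBCDIC sequence: lowercase, then uppercase, then numerals */
proc sort data=codes out=codes_ebcdic sortseq=ebcdic;
    by code;
run;

proc print data=codes_ascii;
run;

proc print data=codes_ebcdic;
run;
```

With these values, the ASCII result is 42, Banana, apple, and the EBCDIC result is apple, Banana, 42.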


As mentioned earlier, both ASCII and EBCDIC distinguish between uppercase and lowercase letters when sorting. Depending on your purpose, you may want a case-insensitive sort instead. Specifying SORTSEQ=LINGUISTIC with the STRENGTH=PRIMARY suboption, written as SORTSEQ=LINGUISTIC(STRENGTH=PRIMARY), tells SAS to ignore case. For example:
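
A minimal sketch of a case-insensitive sort, using an invented fruits dataset:

```sas
/* Hypothetical sample data with mixed upper- and lowercase spellings */
data fruits;
    input fruit $;
    datalines;
banana
Apple
cherry
apple
Banana
;
run;

/* STRENGTH=PRIMARY compares base letters only, so "Apple" and "apple" sort together */
proc sort data=fruits out=fruits_nocase sortseq=linguistic(strength=primary);
    by fruit;
run;

proc print data=fruits_nocase;
run;
```

In the output, both spellings of apple appear before both spellings of banana, with cherry last.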

Occasionally, numeric values are stored in character variables. PROC SORT sorts such values as character strings, so the value "10", for example, comes before "2". The NUMERIC_COLLATION=ON suboption of SORTSEQ=LINGUISTIC tells SAS to order numerals by their numeric value instead. For example:
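
The sketch below contrasts the default character sort with numeric collation. The dataset items and the variable item_no are hypothetical.

```sas
/* Hypothetical sample data: numbers stored in a character variable */
data items;
    input item_no $;
    datalines;
10
2
1
;
run;

/* Default character sort: compares strings left to right, giving 1, 10, 2 */
proc sort data=items out=items_char;
    by item_no;
run;

/* Numeric collation: numerals are ordered by value, giving 1, 2, 10 */
proc sort data=items out=items_numeric sortseq=linguistic(numeric_collation=on);
    by item_no;
run;

proc print data=items_char;
run;

proc print data=items_numeric;
run;
```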

This suboption is particularly useful when you want to sort observations by a variable whose values mix numerals and letters. For example:
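
As an illustration, assume a hypothetical rooms dataset whose room variable holds labels such as "Room 10". Column input is used so that the embedded blank is kept as part of the value.

```sas
/* Hypothetical sample data: text labels with an embedded number */
data rooms;
    input room $ 1-10;   /* read columns 1-10 so "Room 10" stays one value */
    datalines;
Room 10
Room 2
Room 1
;
run;

/* With numeric collation, the embedded numbers are compared by value */
proc sort data=rooms out=rooms_sorted sortseq=linguistic(numeric_collation=on);
    by room;
run;

proc print data=rooms_sorted;
run;
```

With numeric collation on, the rows should come out as Room 1, Room 2, Room 10 rather than Room 1, Room 10, Room 2.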
