Crafting new variables from raw data is often essential. In many cases, this practice can significantly improve the performance of statistical models and machine learning algorithms. Depending on the data type and analysis purpose, various techniques can be employed to derive new variables. When working with tabular data, commonly employed techniques include:

**Transformations**: Combining multiple variables through mathematical operations performed on existing variables (e.g., calculating body mass index using height and weight.)**Binning**: Grouping continuous variables into categories (e.g., creating age ranges from a continuous age values.)**Encoding**: Converting categorical variables into numerical representations (e.g., one-hot encoding for colors.)**Standardizing**: Scaling variables to a common range, often between 0 and 1 or with a mean of 0 and standard deviation of 1.

In a SAS DATA step, you can easily define a new variable as follows:

NewVariable=Expression;

On the left side of the equal sign is the variable name, which can be either a new one, to append one more variable to the data set, or existing one, to re-define it by the expression. On the right side of the equal sign can be a constant, an existing variable, or mathematical expressions combining them. Here are available mathematical expressions:

Operation | Description | Example |
---|---|---|

Numeric constant | Assigns the value 10 to the variable x. | x = 10; |

Character constant | Assigns the string "John" to the character variable name. | name = "John"; |

New variable | Assigns the sum of num1 and num2 to the new variable total. | total = num1 + num2; |

Addition | Adds the values of num1 and num2 and assigns the result to the variable total | total = num1 + num2; |

Subtraction | Subtracts the value of num2 from num1 and assigns the result to the variable difference. | difference = num1 - num2; |

Multiplication | Multiplies the values of num1 and num2 and assigns the result to the variable product. | product = num1 * num2; |

Division | Divides the value of num1 by num2 and assigns the result to the variable quotient. | quotient = num1 / num2; |

Exponentiation | Raises the value of base to the power of exponent and assigns the result to the variable result. | result = base ** exponent; |

SAS follows the standard mathematical rules of precedence: exponentiation takes precedence, followed by multiplication and division, lastly addition and subtraction. Parentheses can be used to override this order.

Let's consider an example scenario. According to fitness experts, the one-repetition maximum (one-rep max or 1RM) in weight training refers to the maximum weight an individual can possibly lift for a single repetition. Based on the past workout logs, there are several different ways to calculate 1RM:

The following DATA steps perform different 1RM calculations using Reps and Weights:

DATAWorkoutLogs;INPUTTimestamp:DATETIME. Excercise$Reps Sets Weights;DATALINES;27MAR2023:08:00:00 Squats 10 3 10027MAR2023:08:15:00 Push-ups 15 3 027MAR2023:08:30:00 Bench-press 8 4 12027MAR2023:08:45:00 Deadlifts 12 3 15027MAR2023:09:00:00 Lunges 10 3 8027MAR2023:09:15:00 Pull-ups 8 4 027MAR2023:09:30:00 Bicep-curls 12 3 4027MAR2023:09:45:00 Planks . . .27MAR2023:10:00:00 Shoulder-press 10 3 50;RUN;DATAOneRepMax;SETWorkoutLogs;Epley = Weights * (1+0.0333* Reps);Brzycki = Weights / (1.0278-0.0278* Reps);Lombardi = Weights * Reps **0.1;OConner = Weights * (1+0.025* Reps);Wathan = Weights *100/ (48.8+53.8*EXP(-0.075* Reps));RUN;

In addition to the arithmetic operations, SAS provides some built-in mathematical functions:

Function | Description | Example |
---|---|---|

ROUND(x, int) | Rounds a numeric value to a specific number of decimals. | ROUND(3.141592, 2) returns 3.14. |

INT(x) | Returns the integer part of x, by truncating any fractional digits. | INT(3.75) returns 3 INT(-2.1) returns -2. |

ABS(x) | Returns the absolute value of x. | ABS(-5) returns 5. |

CEIL(x) | Returns the smallest integer that is greater than or equal to x. | CEIL(4.3) returns 5 CEIL(-1.5) returns -1. |

FLOOR(x) | Returns the largest integer that is less than or equal to x. | FLOOR(4.8) returns 4 FLOOR(-1.5) returns -2. |

LOG(x) | Returns the natural logarithm of x (base-e logarithm). | LOG(10) returns 2.302585. |

EXP(x) | Returns the base-e raised to the power of x. | EXP(1) returns 2.71828. |

SQRT(x) | Returns the square root of x. | SQRT(9) returns 3. |

MAX(arg1, arg2, ...) | Returns the maximum value among the arguments. | MAX(5, 10, 3) returns 10. |

MIN(arg1, arg2, ...) | Returns the minimum value among the arguments. | MIN(5, 10, 3) returns 3. |

SUM(arg1, arg2, ...) | Calculates the sum of the arguments (arg1, arg2, ...) | SUM(5, 10, 3) returns 18. |

MEAN(arg1, arg2, ...) | Calculates the sum of the arguments (arg1, arg2, ...) | MEAN(5, 10, 3) returns 6. |

It is worth noting that the functions listed above perform element-wise calculations ** across variables within an observation, not across observations within a variable**. For example:

DATAOneRepMax;SETWorkoutLogs;Epley = Weights * (1+0.0333* Reps);Brzycki = Weights / (1.0278-0.0278* Reps);Lombardi = Weights * Reps **0.1;OConner = Weights * (1+0.025* Reps);Wathan = Weights *100/ (48.8+53.8*EXP(-0.075* Reps));MaxRM =MAX(Epley, Brzycki, Lombardi, OConner, Wathan);MinRM =MIN(Epley, Brzycki, Lombardi, OConner, Wathan);RUN;PROC PRINTDATA=OneRepMax;RUN;

Let's consider another example. In the following raw data file, Month, Day, and Year variables are delimited by spaces.

When importing this file into a SAS data set, we want to craft a variable of SAS date values, combining these three pieces of information. You can achieve this through the **MDY( month, date, year)** function as follows:

DATACarSales;INFILE'/home/u63368964/source/car-sales.dat';INPUTSalesMonth SalesDate SalesYear Model $ Color $ Quantity Price;SasDate =MDY(SalesMonth, SalesDate, SalesYear);RUN;PROC PRINTDATA=CarSales;RUN;

## Using IF-THEN Statements to Craft New Variables

Instead of applying an expression to all observations, we often want to use it only for those that meet certain criteria. This is called conditional logic, and you do it with **IF-THEN** statement in SAS programming.

IFconditionTHENaction;

If *condition* evaluated to be true, then SAS *executes* action. Here are operators that can be used in the *condition* part:

Symbolic | Mnemonic | Description |
---|---|---|

= | EQ | Checks if two values are equal. |

~= | NE | Checks if two values are not equal. |

^= | NE | Checks if two values are not equal. |

> | GT | Checks if the left operand is greater than the right operand. |

>= | GE | Checks if the left operand is greater than or equal to the right operand. |

< | LT | Checks if the left operand is less than the right operand. |

<= | LE | Checks if the left operand is less than or equal to the right operand. |

& | AND | All comparisons must be true. |

| | OR | Only one comparison must be true. |

For example, in the CarSales data set, let's say that we also want to mark SalesQuarter based on SalesMonth variable. In a DATA step, you can do this by applying **YYQ( year, quarter)** function in IF-THEN statements:

DATACarSales;INFILE'/home/u63368964/source/car-sales.dat';INPUTSalesMonth SalesDate SalesYear Model$Color$Quantity Price;IFSalesMonth LE3THENSalesQuarter =YYQ(SalesYear,1);IFSalesMonth GT3ANDSalesMonth LE6THENSalesQuarter =YYQ(SalesYear,2);IFSalesMonth GT6ANDSalesMonth LE9 THENSalesQuarter =YYQ(SalesYear,3);IFSalesMonth GT9THENSalesQuarter =YYQ(SalesYear,4);RUN;PROC PRINTDATA=CarSales;RUN;

Note that the created SAS date will be the first day of the quarter.

The **IN** operator checks if there is any matches in the list of values. For example, within an observation, if Model matches any of the list values ('Sedan', 'SUV', 'Hatchback'), then assigns 'Passenger' to CarType:

DATACarSales;LENGTHModel $10. CarType $10.;INFILE'/home/u63368964/source/car-sales.dat';INPUTSalesMonth SalesDate SalesYear Model$Color$Quantity Price;IFModel IN ('Sedan', 'SUV', 'Hatchback')THENCarType = 'Passenger';ELSECarType = 'Commercial';RUN;PROC PRINTDATA=CarSales;RUN;

** Note that a single IF-THEN statement can only have one action**.
However, by incorporating

**DO-END**block, you can execute more than one action. For example:

DATACarSales;LENGTHModel $10. CarType $10.;INFILE'/home/u63368964/source/car-sales.dat';INPUTSalesMonth SalesDate SalesYear Model$Color$Quantity Price;IFModelNOTIN ('Sedan', 'SUV', 'Hatchback')THENDO;CarType = 'Commercial';SalesTax = Price *0.06;ELSEDO;CarType = 'Passenger';SalesTax = Price *0.1;RUN;PROC PRINTDATA=CarSales;RUN;

### Subsetting IFs

Sometimes, it may be necessary to remove observations from a data set based on certain conditions. In such situations, you can use the DELETE statement along with the IF-THEN/ELSE statements. For example:

DATAPassengerVehicleSales;INFILE'/home/u63368964/source/car-sales.dat';INPUTSalesMonth SalesDate SalesYear Model$Color$Quantity Price;IFModel = 'Truck'THENDELETE;RUN;PROC PRINTDATA=PassengerVehicleSales;RUN;

## Working with SAS Dates

In SAS, ** a variable can hold either character or numeric values**. Any date values are represented as the number of days since January 1, 1960. However, this representation can be confusing, particularly when performing calculations involving time intervals like months or years. For example, the varying number of days in different months complicates tasks such as adding or subtracting a month from a date value. Similarly, determining the number of intervals between two dates is also challenging. To address these issues, you can employee the following functions:

**INTNX(**: This function adds*'interval'*,*from*,*n*)*n-intervals*to the dates specified at*from*argument.**INTCK(**: This function calculates how many*'interval'*,*from*,*to*)*intervals*between*from*and*to*dates.

For example, let's consider the following DATA step:

DATAEmployees;INFILE'/home/u63368964/source/employees.csv'FIRSTOBS=2DLM=','TRUNCOVER;/* Specifying date value representations */FORMATDob DATE10. StartDate DATE10. EndDate DATE10. TenYearsAnniversary WORDDATE18.;INPUTEmployeeID $4. GivenName:$15. SurName:$15. Dob DDMMYY10. JobTitle:$25. StartDate:DATE10. EndDate?:DATE10.; Age =INTCK('Year', Dob,TODAY());IFEndDate = '.'THEN DO; YearsInOffice =INTCK('Year', StartDate,TODAY()); TenYearsAnniversary =INTNX('Year', StartDate,10); NumDaysToAnniversary =INTCK('Day', StartDate, TenYearsAnniversary);END;ELSETenureMonths =INTCK('Month', StartDate, EndDate);DROPEmployeeID JobTitle;RUN;PROC PRINTDATA=Employees;RUN;

In the DATA steps above, the INTCK() functions are employed to calculate number of years and days between the two SAS date variables. Here, **TODAY()** function retrieves the current date from the system. For example, Age = INTCK('Year', Dob, TODAY()); counts the number of years between the date values stored in Dob and system date. On the other hand, the INTNX() function is used to calculate the SAS date that is 10 years after the StartDate variable.

By default, the first year of a hundred-year span is set to be 1960. However, you can change this setting with the **YEARCUTOFF=** option. The YEARCUTOFF= option is helpful when the raw date values are stored with two-digits. For example, if a date value is provided by 07/04/76, it is ambiguous whether it means 1976, 2076, or possibly 1776, and so on. To avoid this issue:

OPTIONS YEARCUTOFF=1950;

This statement ensures that the two-digit dates are between 1950 and 2049, and thereby 07/04/76 represents July 4th, 1976.

## Manipulating Character Variables

In SAS, characters refer to any alphanumeric values enclosed by a pair of quotation marks. They represent categorical or factor variables in SAS data sets.

When processing these character variables, the standard arithmetic operators and functions typically cannot be applied directly. Even if an operator can be applied to the variables, the results could be tricky.

### Concatenation Operator

One of the most common operator to manipulate character variables is the **concatenation operator** (** ||**). This operator concatenates two character variables on its left and right-hand sides into a single variable. For example, let's consider the following DATA step:

DATAEmployees;INFILE'/home/u63368964/source/employees.csv'FIRSTOBS=2DLM=','TRUNCOVER;INPUTEmployeeID $4. GivenName:$15. SurName:$15. Dob DDMMYY10. JobTitle:$25. StartDate:DATE10. TerminationDate?:DATE10.; EmployeeName =TRIM(SurName) || ', ' || GivenName;KEEPSurName GivenName EmployeeName;RUN;

In the code lines above, concatenation operators combine SurName, ',', and GivenName. Here, to remove any trailing blank spaces in SurName, the TRIM function is applied.

### Applying Comparison Operators on Character Variables

The comparison operators, such as > (GT), = (EQ), or ^= (NE), is not limited to numeric values; they can also be applied to compare two character variables. When comparing two character variables, the evaluation is based on either ASCII or EBCDIC collating sequence order. While IBM mainframes employ EBCDIC, most PCs, including SAS Studio environments, use ASCII, where blank spaces come first, followed by numbers, uppercase letters, lowercase letters, and non-Latin alphabets. For example:

DATAEmployees;INFILE'/home/u63368964/source/employees.csv'FIRSTOBS=2DLM=','TRUNCOVER;INPUTEmployeeID $4. GivenName:$15. SurName:$15. Dob DDMMYY10. JobTitle:$25. StartDate:DATE10. TerminationDate?:DATE10.; EmployeeName =TRIM(SurName) || ', ' || GivenName;IFJobTitle >= 'Area Sales Manager'THENDELETE;KEEPEmployeeName JobTitle;RUN;

This DATA step subsets raw data by comparing character strings. In the raw data file, there are 3,000 rows in total. However, after subsetting, the returning output contains only 66 observations.

### Selected Functions for Character Variables

In addition to the two operators mentioned earlier, SAS also provides some built-in functions for character manipulations:

Function Name | Description | Example | Result |
---|---|---|---|

ANYALNUM(arg, start) | Returns position of first occurrence of any alphabetic character or numeral at or after optional start position. | a='123 E St, #2 '; x=ANYALNUM(a); y=ANYALNUM(a,10); | x=1 y=12 |

ANYALPHA(arg, start) | Returns position of first occurrence of any alphabetic character at or after optional start position. | a='123 E St, #2 '; x=ANYALPHA(a); y=ANYALPHA(a,10) | x=5 y=0 |

ANYDIGIT(arg, start) | Returns position of first occurrence of any numeral at or after optional start position. | a='123 E St, #2 '; x=ANYDIGIT(a); y=ANYDIGIT(a,10) | x=1 y=12 |

ANYSPACE(arg, start) | Returns position of first occurrence of a white space character at or after optional start position. | a='123 E St, #2 '; x=ANYSPACE(a); y=ANYSPACE(a,10) | x=4 y=10 |

CAT(arg-1, arg-2, ... arg-n) | Concatenates two or more character strings together . leaving leading and trailing blanks | a=' cat'; b='dog '; x=CAT(a,b); y=CAT(b,a); | x=' catdog ' y='dog cat' |

CATS(arg-1, arg-2, ... arg-n) | Concatenates two or more character strings together . stripping leading and trailing blanks | a=' cat'; b='dog '; x=CATS(a,b); y=CATS(b,a); | x='catdog' y='dogcat' |

CATX('separator-string', arg-1, arg-2, ... arg-n) | Concatenates two or more character strings together and stripping leading and trailing blanks.inserting a separator string between arguments | a=' cat'; b='dog '; x=CATX('&',a,b); | x='cat&dog' |

COMPRESS(arg, 'char') | Removes spaces or optional characters from argument. | a=' cat & dog '; x=COMPRESS(a); y=COMPRESS(a,'&'); | x='cat&dog' y=' cat dog ' |

INDEX(arg, 'string') | Returns starting position for string of characters. | a='123 E St, #2 '; x=INDEX(a,'#') | x=11 |

LEFT(arg) | Left aligns a SAS character expression. | a=' cat'; x=LEFT(a); | x='cat ' |

LENGTH(arg) | Returns the length of an argument not counting trailing blanks (). missing values have a length of 1 | a='my cat'; b=' my cat '; x=LENGTH(a); x=LENGTH(b); | x=6 y=7 |

PROPCASE(arg) | Converts first character in word to uppercase and remaining characters to lowercase. | a='MyCat'; b='TIGER'; x=PROPCASE(a); y=PROPCASE(b); | x='Mycat' y='Tiger' |

SUBSTR(arg, position, n) | Extracts a substring from an argument starting at for position characters or until end if no n. n | a='(916)734-6281'; x=SUBSTR(a,2,3); | x='916' |

TRANSLATE(source, to-1, from-1, ... to-n, from-n) | Replaces characters in from with to characters (source - you can't replace one character with two, for example). one to one replacement only | a='6/16/99'; x=TRANSLATE (a,'-','/'); | x=6-16/99 |

TRANWRD(source, from, to) | Replaces character string in from with source character string. to | a='Main Street'; x=TRANWRD (a,'Street','St.'); | x='Main St.' |

TRIM(arg) | Removes trailing blanks from character expression. | a='My '; b='Cat'; x=TRIM(a)||b; | x='MyCat' |

UPCASE(arg) | Converts all letters in argument to uppercase. | a='MyCat'; x=UPCASE(a) | x='MYCAT' |

## 0 Comments