SAS users tend to think of indexing SAS data sets to either improve query performance or as a method to avoid dataset sorting. Another use case for using SAS indexes is to provide direct access to a specific observation in a dataset.
Python pandas is numpy + dictionary of indexing. The index is used for labeling DataFrame rows and columns, as a Python dictionary.
Return Rows and Columns by Label
Return Rows and Columns by Position

Let’s get started.
Comparision with SAS Indexing
Without index, to retrieve an observation from variable “Location” whose value is “New York”, SAS performs a sequential read starting with the first observation in the dataset. SAS normally begins reading at observation one reading the dataset in sequence for Location = “New York” until all observations are read.
With an index on the variable Location , SAS may use the index to directly access those observations containing these values without performing a sequential read of the dataset. A SAS index stores values in ascending order for a specific variable or variables and manages information on how to locate a given observation(s) in the dataset.
In python numpy, array indexing is based on sequential numerical order. For example, array[1,0], the same way Matlab does. At DataFrame creation time, the RangeIndex object is created as the default index. similar to the automatic _n_ variable SAS establishes at SAS dataset creation time.
For Series and DataFrame objects, pandas automatically create an index structure at DataFrame creation time for both rows/columns, where the RangeIndex object used as the default row index. These index objects are responsible for holding the axis labels and other metadata like integer-based location identifiers, axis name, etc. RangeIndex is a memory-saving special case of Int64Index limited to representing monotonic ranges.
One or more row/columns in DataFrame can be used to define an index. Assigning more than one row/column as an index creates a MultiIndex object discussed in a separate post. New users to pandas often get confused about the role of an index, since most of their prior associations consider an index to be an auxiliary structure for columns.
In a pandas DataFrame, the values from a column or columns may be used as an index to supply values as row labels augmenting the default integer values assigned by the RangeIndex object. Just as you are able to return SAS dataset observations using the automatic variable _n_, a DataFrame’s default RangeIndex is used to return rows using a zero-based offset.
By explicitly setting a DataFrame index from column or multiple column values, you can return rows using these column values in addition to returning rows using the RangeIndex object.
Create a pandas Index
When a DataFrame is assigned an index, the rows remain accessible by supplying a collection of integer values as well as accessible by the row labels defined by the index.
In the following example, Create DataFrame Index illustrates the set_index method using the N column. In this example, the values from the ID column supply labels for the DataFrame rows.
import pandas as pd
df = pd.DataFrame([['0071', 'Patton' , 17, 27],
... ['1001', 'Joyner' , 13, 22],
... ['0091', 'Williams', 111, 121],
... ['0110', 'Jurat' , 51, 55]],
... columns = ['ID', 'Name', 'Before', 'After'])
print(df)
ID Name Before After
0 0071 Patton 17 27
1 1001 Joyner 13 22
2 0091 Williams 111 121
3 0110 Jurat 51 55
df.set_index('ID', inplace=True)
print(df)
Name Before After
ID
0071 Patton 17 27
1001 Joyner 13 22
0091 Williams 111 121
0110 Jurat 51 55
The DataFrame df is constructed using the DataFrame constructor method. The first print function returns all of the rows labeled with the default RangeIndex object labeling rows starting with the integer 0 to length (axis – 1).
The syntax:
df.set_index('ID', inplace = True)
selects the ID column as the index and inplace = True updates the index in place rather than creating a new DataFrame. The default value for the set_index method is inplace = False. In the case where the inplace=argument is False, you must assign the results to a new DataFrame.
For example,
df_idx = df.set_index('ID', inplace=False)
creates the df_idx DataFrame with the ID column as its index. The original df DataFrame remains unaltered.
The second print function illustrates how the DataFrame rows are labeled with values from the ID column. The overhead for creating and dropping indexes is minimal and it is not unusual to do so repetitively in a single Python program.
Next, we consider sub-setting DataFrames. Sub-setting data by rows and/or columns is an essential task for any form of data analysis. Panda DataFrames offers three choice for sub-setting operations. They are:
-
[ ] operator enables selection by columns or by rows.
-
.loc indexer uses row and column labels for sub-setting. A column label is the column Name and row labels are assigned with an index (either with the index=) parameter at DataFrame creation time or with the df.set_index method.
If no index is explicitly assigned, then the integer-based RangeIndex object is the default. If no names are assigned to columns, then the RangeIndex object labels the columns with the first column as 0, the second column as 1, and so on.
- .iloc indexer uses integer positions (from 0 to length-1 of the axis) for sub-setting rows and columns. This method remains available even if a user-defined index or MultiIndex is defined. MultiIndexes, or hierarchical indexes are discussed later in this chapter.
Both the .loc and .iloc indexers accept Boolean logic to perform complex sub-setting. The [ ] operator and the .iloc indexers can access rows using the default RangeIndex object, that is, integer values indicating a position along the index. The .loc indexer requires a user-defined index for creating row labels in order to operate.
Return Columns by Position
Consider the example below, DataFrame Default Indexes. This example constructs the i DataFrame using a single print function to display the DataFrame values, default row index and the default column index.
The \n syntax inserts a new line to display the desired results.
i = pd.DataFrame([['Patton' , 17, 27],
... ['Joyner' , 13, 22],
... ['Williams' , 111, 121],
... ['Jurat' , 51, 55],
... ['Aden' , 71, 70]])
print(i)
0 1 2
0 Patton 17 27
1 Joyner 13 22
2 Williams 111 121
3 Jurat 51 55
4 Aden 71 70
print(' Row Index: ', i.index, '\n', 'Column Index:', i.columns)
Row Index: RangeIndex(start=0, stop=5, step=1)
Column Index: RangeIndex(start=0, stop=3, step=1)
We begin with sub-setting using the [ ] operator. Observe that both columns and rows in DataFrame i use the default RangeIndex as their labels.
The default RangeIndex is used to select rows or columns using integers to locate their positions along the index. Consider in the following examples, DataFrame Returns Column 0.
i[0]
0 Patton
1 Joyner
2 Williams
3 Jurat
4 Aden
The call to [ ] operator returns the first column (0) from the DataFrame i. In most cases, DataFrame columns will have labels to return the columns of interest.
The [ ] operator also accepts a Python list of columns to return. Recall that a Python list is a mutable data structure for holding a collection of items. List literals are written within square brackets [ ] with commas , to indicate multiple items in the list.
Consider the example below, Create df DataFrame. Each of the values supplied to the DataFrame constructor method is a Python list as is the columns= argument.
df = pd.DataFrame([['I','North', 'Patton', 17, 27],
... ['I', 'South','Joyner', 13, 22],
... ['I', 'East', 'Williams', 111, 121],
... ['I', 'West', 'Jurat', 51, 55],
... ['II','North', 'Aden', 71, 70],
... ['II', 'South', 'Tanner', 113, 122],
... ['II', 'East', 'Jenkins', 99, 99],
... ['II', 'West', 'Milner', 15, 65],
... ['III','North', 'Chang', 69, 101],
... ['III', 'South','Gupta', 11, 22],
... ['III', 'East', 'Haskins', 45, 41],
... ['III', 'West', 'LeMay', 35, 69],
... ['III', 'West', 'LeMay', 35, 69]],
... columns=['District', 'Sector', 'Name', 'Before', 'After'])
df[['Name', 'After']].head(4)
Name After
0 Patton 27
1 Joyner 22
2 Williams 121
3 Jurat 55
In this example the syntax:
df[['Name', 'After']].head(4)
is a column sub-setting operation returning the columns Name and After.
Notice how the Python list with [’ Name’, ‘After’] inside the DataFrame slice operator results in a pair of square brackets [[ ]]. The outer pair is the syntax for the DataFrame [ ] operator while the inner pair hold the literal values to form the Python list of column names.
Clearly, using a list of column names rather than a list of column integer index positions is a more convenient method for sub-setting. The equivalent SAS program is displayed in the following example, Create df SAS dataset. It is used in subsequent examples in this chapter.
4 data df;
5 length region $ 6
6 name $ 8
7 district $ 3;
8 infile cards dlm=',';
8 input district $
10 region $
11 name $
12 before
13 after;
14 list;
15 datalines;
RULE: ----+----1----+----2----+----3----+----4----+----5
16 I, North, Patton, 17, 27
17 I, South, Joyner, 13, 22
18 I, East, Williams, 111, 121
19 I, West, Jurat, 51, 55
20 II, North, Aden, 71, 70
21 II, South, Tanner, 113, 122
22 II, East, Jenkins, 99, 99
23 II, West, Milner, 15, 65
24 III, North, Chang, 69, 101
25 III, South, Gupta, 11, 22
26 III, East, Haskins, 45, 41
27 III, West, LeMay, 35, 69
28 III, West, LeMay, 35, 69
NOTE: The data set WORK.DF has 13 observations and 5 variables.
29 ;;;;
30 proc print data = df(obs=4);
31 var name after;
32 run;
NOTE: There were 4 observations read from the data set WORK.DF
The output from PROC PRINT with data=df(obs=4).
Return Rows by Position
The general syntax for DataFrame row slicing (sub-setting rows) using the [ ] operator is:
df:[start : stop : step]
The start position is included in the output and the stop position is not included in the output. Consider the following example, DataFrame Row Slicing, Example 1.
df[:3]
District Sector Name Before After
0 I North Patton 17 27
1 I South Joyner 13 22
2 I East Williams 111 121
This example returns the first three row from the df DataFrame. A null value for the start position defaults to start position zero (0). The value following the colon :indicates the stop position and goes up to but does not include the row in the slicing operation.
In the following example below, DataFrame Row Slicing, Example 2, illustrates returning every other row from the df DataFrame.
df[::2]
District Sector Name Before After
0 I North Patton 17 27
2 I East Williams 111 121
4 II North Aden 71 70
6 II East Jenkins 99 99
8 III North Chang 69 101
10 III East Haskins 45 41
12 III West LeMay 35 69
The start and stop positions are null causing the slice df[::2] to default to the first and last row respectively in the DataFrame. The value of two (2) for the step position returns every other row. This same logic is displayed in the example below, SELECT Every Other Row.
4 data df1;
5 set df;
6 if mod(_n_, 2) ^= 0 then output;
7 run;
NOTE: There were 13 observations read from the data set WORK.DF. NOTE: The data set WORK.DF1 has 7 observations and 5 variables.
The example creates the df1 dataset with a sub-setting IF statement to perform modulo division by 2 on the automatic SAS _n_variable assigned to each observation.
Modulo division by 2 on even integers returns 0 (zero). By using the IF statement’s inequality evaluation of ^= 0 every odd _n_ value (1, 3, 5, etc.) evaluates true and is written to the output df1 dataset.
Return Rows and Columns by Label
The .loc indexer is a method primarily used for returning rows and columns using labels. Allowed inputs to .loc are:
• A single label such as 12 or ‘Name’. Note that 12 is interpreted as the row label and not as the integer location along the index.
• A Python list of labels [‘A’, ‘B’, ‘C’]
• A slice object with labels ‘a’ : ‘z’. Both the start, in this case ‘a’, and the stop, ‘z’ is included when present in the index.
• Conditional evaluations Each of these methods are illustrated. Up to this point, the index for the df DataFrame relies on the default RangeIndex object for returning rows by an integer position along the index. In order to retrieve rows from the df DataFrame by labels the Name column is set as the DataFrame index.
This action assigns the values from the Name column as labels for the DataFrame rows. Said another way, a DataFrame index maps columns values onto rows as labels.
The syntax and default values for the set_index method is:
df.set_index(keys, drop=True, append=False, inplace=False, verify_integrity=False)
In the folowing example, Add Index to DataFrame, illustrates defining an index for an existing DataFrame.
print(df)
District Sector Name Before After
0 I North Patton 17 27
1 I South Joyner 13 22
2 I East Williams 111 121
3 I West Jurat 51 55
4 II North Aden 71 70
5 II South Tanner 113 122
6 II East Jenkins 99 99
7 II West Milner 15 65
8 III North Chang 69 101
9 III South Gupta 11 22
10 III East Haskins 45 41
11 III West LeMay 35 69
12 III West LeMay 35 69
print(df.index)
RangeIndex(start=0, stop=13, step=1)
df.set_index('Name', inplace = True, drop = True)
print(df.head(4))
District Sector Before After
Name
Patton I North 17 27
Joyner I South 13 22
Williams I East 111 121
Jurat I West 51 55
print(df.index)
Index(['Patton', 'Joyner', 'Williams', 'Jurat', 'Aden', 'Tanner', 'Jenkins','Milner', 'Chang', 'Gupta', 'Haskins', 'LeMay', 'LeMay'],
dtype='object', name='Name')
In this example, the first print function displays all rows from the df DataFrame.
The syntax:
print(df.index)
returns the default RangeIndex in use for the rows with integer values between 0 and 13. The syntax:
df.set_index('Name', inplace=True, drop=True)
selects the values from the Name column label as row labels.
The argument inplace=True updates the df DataFrame in place and the drop=True argument drops the Name column for the DataFrame.
With this defined index, the .loc indexer uses the Name column values to return rows rather than using row position.
Notice how the third print() function displays the values for the Name column as row labels in the left-most column of the output.
The syntax:
print(df.index)
returns the index values as a Python list.
An index may have non-unique values which we illustrate with this example. Some DataFrame operations require the index keys be in sorted order while others may require unique values.
We cover the details for sorting in Chapter 5, pandas Data Management. Chapter 9, Time Series Analysis covers details for unique index values. With an index in place as row labels we can slice rows using the .loc indexer. As well, columns can be sliced with the .loc indexer since they have labels (i.e. names).
The syntax for the .loc indexer is:
df.loc[row selection, column selection]
The row selection is listed first, separated by a comma , followed by the column selection. For both the row or column selection a colon : is used to request a range of items.
df.loc['Patton': 'Aden', ]
District Sector Before After
Name
Patton I North 17 27
Joyner I South 13 22
Williams I East 111 121
Jurat I West 51 55
Aden II North 71 70
This example slices rows beginning with the row labeled Patton and ending with the row labeled Aden inclusive. The empty value for the column selector, following the comma , implies all columns.
The same DataFrame can be returned by stating the column selector explicitly with the syntax:
df.loc['Patton' : 'Aden', 'District' : 'After']
The following example, Return Row and Column Slices, illustrates supplying a single label to the row selector followed by a Python list of labels as the column selector.
df.loc['LeMay', ['Before','After']]
Before After
Name
LeMay 35 69
LeMay 35 69
Notice how the row labels are not unique.
Conditionals
In the example below, Return Rows Conditionally, illustrates returning rows and columns based on a Boolean comparison.
df.loc[(df['Sector'] == 'West') & (df['Before'] > 20)]
District Sector Before After
Name
Jurat I West 51 55
LeMay III West 35 69
LeMay III West 35 69
The Boolean comparisons are enclosed with parentheses ( ) and utilize any of the Comparision operators. .
In this example the Boolean comparisons contain two predicates; the first is (df[‘Sector’] == ‘West’) and the second is (df[‘Before’] > 20). The Boolean operator & (and) joins the predicates and therefore both must return True in order to meet the row selection criteria.
Note the sytax differences between Return Row and Column Slices and Return Rows Conditionally. In the former rows are sliced based on labels. The latter uses the df[‘Sector’] and df[‘Before’] to designate column names for the conditional expression.
Suppose we wish to sub-set rows based on the last letter of the value for the Name column ending with the letter ‘r’. In the following example, Conditionally Return Rows with String Manipulation combines the .loc() indexer with the .str.endswith attribute to satisfy the request.
df.loc[df['Name'].str.endswith("r"), ['District', 'Sector']]
KeyError: 'the label [Name] is not in the [index]'
Unfortunately, this example raises a KeyError since the column Name was dropped when the df.index was initially created in the following example, Add Index to DataFrame.
Note this error message is truncated here. One remedy for the KeyError is to ‘return’ the Name column using the .reset_index function illustrated in the example below, Drop DataFrame Index.
df.reset_index(inplace = True)
df.loc[df['Name'].str.endswith("r"), ['Name', 'Before', 'After']]
Name Before After
1 Joyner 13 22
5 Tanner 113 122
7 Milner 15 65
In this example the syntax:
df.reset_index(inplace = True)
calls the .reset_index method to ‘drop’ the index and return the <span class="coding">Name </span>column as one of the columns on the df DataFrame.
The inplace = true argument performs the operation in-place. The second line in the program chains the .str.endswith(“r”) attribute to the Name column and returns True for any value whose last letter in the sequence is ‘r’.
The purpose of this example is to simply illustrate resetting an index with the reset_index method. The more Pythonic remedy for the KeyError illustrated in the example below is:
df.loc[df.index.str.endswith('r'), ['District', 'Sector']]
The analog SAS program is shown in the following example, Conditionally Return Observations with String Manipulation.
4 proc sql;
5 select name
6 ,before
7 ,after
8 from df
9 where substr(reverse(trim(name)),1,1) = 'r';
10 quit;
The nested functions in the WHERE clause work from the inside out by:
- Calling the TRIM function to remove trailing blanks
- Calling the REVERSE function to reverse the letters in the variable name
- Calling the SUBSTR (left of =) function to test if the first letter is ‘r’.
Another method for conditional testing is to combine the .loc indexer with the isin attribute. The isin attribute returns a Boolean indicating if elements in the DataFrame column are contained in a Python list of values.
df.set_index('Name', inplace=True)
df.loc[df['Sector'].isin(['North', 'South'])]
District Sector Before After
Name
Patton I North 17 27
Joyner I South 13 22
Aden II North 71 70
Tanner II South 113 122
Chang III North 69 101
Gupta III South 11 22
Because the index was ‘dropped’ in the example below Drop DataFrame Index, the index is set again, this time with the syntax:
df.set_index('Name', inplace=True)
to enable slicing with labels using the .loc indexer.
In the following example, SAS IN Operator illustrates the same capability using the IN operator. The IN operator performs an implied truth test by including values from a list that match values from the sector variable. The IN operator is also valid with an IF statement in a Data Step.
4 proc sql;
5 select *
6 from df
7 where sector in ('North', 'South');
8 quit;
Updating
The .loc indexer can update or set values (the term used with pandas documentation). Consider the example below, Set Values Matching a List of Labels.
df.loc[['Patton', 'Jurat', 'Gupta'], 'After']
Name
Patton 27
Jurat 55
Gupta 22
Name: After, dtype: int64
df.loc[['Patton', 'Jurat', 'Gupta'], ['After']] = 100
df.loc[['Patton', 'Jurat', 'Gupta'], 'After']
Name
Patton 100
Jurat 100
Gupta 100
Name: After dtype: int64
The first call to the .loc indexer supplies a Python List of Name labels for three individuals along with their corresponding After values and returns a Series.
Recall that a Series is analogous to a single DataFrame column. The second call to the .loc indexer sets (updates) the After column for each of the labels in the Python list:
['Patton', 'Jurat', 'Gupta']
The SAS analog is illustatred in the example below, IN Operator Conditionally Select Rows.
4 data df;
5 set df;
6 if _n_ = 1 then put
7 'Name After';
8 if name in ('Patton', 'Jurat', 'Gupta') then do;
9 after = 100;
10 put @1 name @10 after;
11 end;
12 run;
OUTPUT:
Name After
Patton 100
Jurat 100
Gupta 100
NOTE: There were 13 observations read from the data set WORK.DF. NOTE: The data set WORK.DF has 13 observations and 5 variables.
This example uses the IN operator with an IF/THEN DO/END block updating the after variable conditionally.
Setting values for an entire DataFrame column is illustated in the following example, Set Values for a Column.
df.loc[: , 'After'] = 100
print(df.head(5))
District Sector Before After
Name
Patton I North 17 100
Joyner I South 13 100
Williams I East 111 100
Jurat I West 51 100
Aden II North 71 100
The call to the .loc indexer slices all rows from the df DataFrame since no start and stop values are supplied indicated by a colon :. The column slice After is set to 100.
8. Return Rows and Columns by Position
The .iloc indexer uses integer positions (from 0 to length-1 of the axis) for slicing rows and columns. Allowed inputs to .iloc are: • An integer, e.g. 12
• A Python list of integers [4, 2, 22]
• A slice object with integers 2 : 22. The start, in this case 2 is inclusive and the stop position 22 is exclusive.
The stop position for the .iloc indexer is exclusive, meaning not included. This is in contrast to the .loc indexer which is inclusive. The syntax for the .iloc indexer is:
df.iloc[row selection, column selection]
A comma (,) is used to separate the request of row slices from column slices. A colon (:) is used to request a range of items. The absence of either a column or row selector is an implicit request for all columns or rows, respectively.
These features are illustrated below. In the following example, Return df First and Last Row, return the introduces the .iloc indexer.
df.iloc[[0, -1]]
District Sector Name Before After
Name
Patton I North Patton 17 100
LeMay III West LeMay 35 69
In this example a Python list of row values based on their position is passed to the .iloc indexer. Row 0 is the first row and row -1 is the last row in the df DataFrame.
The same logic for SAS is illustrated in the example below, Return First and Last Observation.
4 data df1;
5 set df end = last;
6
7 if name in ('Patton', 'Jurat', 'Gupta') then after = 100;
8 if _n_ = 1 then output;
9 if last = 1 then output;
10 run;
NOTE: There were 12 observations read from the data set WORK.DF. NOTE: The data set WORK.DF1 has 2 observations and 5 variables.
11 proc print data = df1 noobs;
12 run;
The input dataset df uses the end = dataset option to detect the last observation reading the df dataset. The end = dataset option initializes a variable’s value to 0 and is set to 1 when the last observation is read.
Sub-setting IF statements are used to output the first and last observation to the output dataset df1. The output dataset is displayed in the example, First and Last Observation. The noobs option for PROC PRINT supresses the display of the SAS observation number contained in the automatic SAS variable n.
The .iloc indexer accomodates a Python list of integers as well as a slice object to define row and column selections. In the following example, .iloc Using List and Slice Object, illustrates combining these selectors.
df.reset_index(inplace = True)
df.iloc[[2, 10, 12], :2]
Name District
2 Williams I
10 Haskins III
12 LeMay III
While it is possible to call the .iloc indexer with an index preset, in order to understand the effect, the Name index is dropped with:
df.reset_index(inplace = True)
The row selector uses a Python list of integers selecting rows 2, 10, and 12 followed by a comma (,) to define the column slicer. The column slicer 0:2 selects two columns (column 0 and column 1). Remember, for the .iloc indexer, the stop values used as slicers for row and column values goes up to but does not include the stop value.
The syntax for selecting the first three columns and all rows in the DataFrame is:
df.iloc[ : , :3]
If the column slicer stop position exceeds the number of columns in the DataFrame then all columns are returned.
The .iloc indexer accepts the value -1 value to indicate the last object in a sequence, -2 as second to last, and so on.
df.iloc[-1, -1]
100
This example returns the last row from the last column in DataFrame df.