Indexing

SAS users tend to think of indexing SAS data sets to either improve query performance or as a method to avoid dataset sorting. Another use case for using SAS indexes is to provide direct access to a specific observation in a dataset.

Python pandas is numpy + dictionary of indexing. The index is used for labeling DataFrame rows and columns, as a Python dictionary.

Comparision with SAS Indexing

Create a pandas Index

Return Columns by Position

Return Rows by Position

Return Rows and Columns by Label

Conditionals

Updating

Return Rows and Columns by Position

Let’s get started.

Comparision with SAS Indexing

Without index, to retrieve an observation from variable “Location” whose value is “New York”, SAS performs a sequential read starting with the first observation in the dataset. SAS normally begins reading at observation one reading the dataset in sequence for Location = “New York” until all observations are read.

With an index on the variable Location , SAS may use the index to directly access those observations containing these values without performing a sequential read of the dataset. A SAS index stores values in ascending order for a specific variable or variables and manages information on how to locate a given observation(s) in the dataset.

In python numpy, array indexing is based on sequential numerical order. For example, array[1,0], the same way Matlab does. At DataFrame creation time, the RangeIndex object is created as the default index. similar to the automatic _n_ variable SAS establishes at SAS dataset creation time.

For Series and DataFrame objects, pandas automatically create an index structure at DataFrame creation time for both rows/columns, where the RangeIndex object used as the default row index. These index objects are responsible for holding the axis labels and other metadata like integer-based location identifiers, axis name, etc. RangeIndex is a memory-saving special case of Int64Index limited to representing monotonic ranges.

One or more row/columns in DataFrame can be used to define an index. Assigning more than one row/column as an index creates a MultiIndex object discussed in a separate post. New users to pandas often get confused about the role of an index, since most of their prior associations consider an index to be an auxiliary structure for columns.

In a pandas DataFrame, the values from a column or columns may be used as an index to supply values as row labels augmenting the default integer values assigned by the RangeIndex object. Just as you are able to return SAS dataset observations using the automatic variable _n_, a DataFrame’s default RangeIndex is used to return rows using a zero-based offset.

By explicitly setting a DataFrame index from column or multiple column values, you can return rows using these column values in addition to returning rows using the RangeIndex object.

Create a pandas Index

When a DataFrame is assigned an index, the rows remain accessible by supplying a collection of integer values as well as accessible by the row labels defined by the index.

In the following example, Create DataFrame Index illustrates the set_index method using the N column. In this example, the values from the ID column supply labels for the DataFrame rows.

code Create DataFrame Index.py

 import pandas as pd
 df = pd.DataFrame([['0071', 'Patton'  , 17,  27],
...                    ['1001', 'Joyner'  , 13,  22],
...                    ['0091', 'Williams', 111, 121],
...                    ['0110', 'Jurat'   , 51,  55]],
...          columns = ['ID',   'Name', 'Before', 'After'])
 print(df)
     ID      Name  Before  After
0  0071    Patton      17     27
1  1001    Joyner      13     22
2  0091  Williams     111    121
3  0110     Jurat      51     55
 df.set_index('ID', inplace=True)
 print(df)
          Name  Before  After
ID
0071    Patton      17     27
1001    Joyner      13     22
0091  Williams     111    121
0110     Jurat      51     55

The DataFrame df is constructed using the DataFrame constructor method. The first print function returns all of the rows labeled with the default RangeIndex object labeling rows starting with the integer 0 to length (axis – 1).

The syntax:

df.set_index('ID', inplace = True)

selects the ID column as the index and inplace = True updates the index in place rather than creating a new DataFrame. The default value for the set_index method is inplace = False. In the case where the inplace=argument is False, you must assign the results to a new DataFrame.

For example,

df_idx = df.set_index('ID', inplace=False)

creates the df_idx DataFrame with the ID column as its index. The original df DataFrame remains unaltered.

The second print function illustrates how the DataFrame rows are labeled with values from the ID column. The overhead for creating and dropping indexes is minimal and it is not unusual to do so repetitively in a single Python program.

Next, we consider sub-setting DataFrames. Sub-setting data by rows and/or columns is an essential task for any form of data analysis. Panda DataFrames offers three choice for sub-setting operations. They are:

[ ] operator enables selection by columns or by rows.
.loc indexer uses row and column labels for sub-setting. A column label is the column Name and row labels are assigned with an index (either with the index=) parameter at DataFrame creation time or with the df.set_index method.

If no index is explicitly assigned, then the integer-based RangeIndex object is the default. If no names are assigned to columns, then the RangeIndex object labels the columns with the first column as 0, the second column as 1, and so on.

.iloc indexer uses integer positions (from 0 to length-1 of the axis) for sub-setting rows and columns. This method remains available even if a user-defined index or MultiIndex is defined. MultiIndexes, or hierarchical indexes are discussed later in this chapter.

Both the .loc and .iloc indexers accept Boolean logic to perform complex sub-setting. The [ ] operator and the .iloc indexers can access rows using the default RangeIndex object, that is, integer values indicating a position along the index. The .loc indexer requires a user-defined index for creating row labels in order to operate.

Return Columns by Position

Consider the example below, DataFrame Default Indexes. This example constructs the i DataFrame using a single print function to display the DataFrame values, default row index and the default column index.

The \n syntax inserts a new line to display the desired results.

code DataFrame Default Indexes.py

 i = pd.DataFrame([['Patton'   , 17,  27],
...                   ['Joyner'   , 13,  22],
...                   ['Williams' , 111, 121],
...                   ['Jurat'    , 51,  55],
...                   ['Aden'     , 71,  70]])
 print(i)
          0    1    2
0    Patton   17   27
1    Joyner   13   22
2  Williams  111  121
3     Jurat   51   55
4      Aden   71   70

 print(' Row Index:   ',    i.index, '\n', 'Column Index:', i.columns)
 Row Index:    RangeIndex(start=0, stop=5, step=1)
 Column Index: RangeIndex(start=0, stop=3, step=1)

We begin with sub-setting using the [ ] operator. Observe that both columns and rows in DataFrame i use the default RangeIndex as their labels.

The default RangeIndex is used to select rows or columns using integers to locate their positions along the index. Consider in the following examples, DataFrame Returns Column 0.

code DataFrame Return Column or Row.py

 i[0]
0      Patton
1      Joyner
2    Williams
3       Jurat
4        Aden

The call to [ ] operator returns the first column (0) from the DataFrame i. In most cases, DataFrame columns will have labels to return the columns of interest.

The [ ] operator also accepts a Python list of columns to return. Recall that a Python list is a mutable data structure for holding a collection of items. List literals are written within square brackets [ ] with commas , to indicate multiple items in the list.

Consider the example below, Create df DataFrame. Each of the values supplied to the DataFrame constructor method is a Python list as is the columns= argument.

code Create DataFrame df.py

 df = pd.DataFrame([['I','North', 'Patton', 17, 27],
...                    ['I', 'South','Joyner', 13, 22],
...                    ['I', 'East', 'Williams', 111, 121],
...                    ['I', 'West', 'Jurat', 51, 55],
...                    ['II','North', 'Aden', 71, 70],
...                    ['II', 'South', 'Tanner', 113, 122],
...                    ['II', 'East', 'Jenkins', 99, 99],
...                    ['II', 'West', 'Milner', 15, 65],
...                    ['III','North', 'Chang', 69, 101],
...                    ['III', 'South','Gupta', 11, 22],
...                    ['III', 'East', 'Haskins', 45, 41],
...                    ['III', 'West', 'LeMay', 35, 69],
...                    ['III', 'West', 'LeMay', 35, 69]],
...                    columns=['District', 'Sector', 'Name', 'Before', 'After'])

 df[['Name', 'After']].head(4)
       Name  After
0    Patton     27
1    Joyner     22
2  Williams    121
3     Jurat     55

In this example the syntax:

df[['Name', 'After']].head(4)

is a column sub-setting operation returning the columns Name and After.

Notice how the Python list with [’ Name’, ‘After’] inside the DataFrame slice operator results in a pair of square brackets [[ ]]. The outer pair is the syntax for the DataFrame [ ] operator while the inner pair hold the literal values to form the Python list of column names.

Clearly, using a list of column names rather than a list of column integer index positions is a more convenient method for sub-setting. The equivalent SAS program is displayed in the following example, Create df SAS dataset. It is used in subsequent examples in this chapter.

code Create df SAS dataset.sas

4 data df;
5 length region $ 6
6        name $ 8
7        district $ 3;
8 infile cards dlm=',';
8 input district $
10       region $
11       name $
12       before
13       after;
14 list;
15 datalines;

RULE:      ----+----1----+----2----+----3----+----4----+----5
16        I,   North, Patton,   17,  27
17        I,   South, Joyner,   13,  22
18        I,   East,  Williams, 111, 121
19        I,   West,  Jurat,    51,  55
20        II,  North, Aden,     71,  70
21        II,  South, Tanner,   113, 122
22        II,  East,  Jenkins,  99,  99
23        II,  West,  Milner,   15,  65
24        III, North, Chang,    69,  101
25        III, South, Gupta,    11,  22
26        III, East,  Haskins,  45,  41
27        III, West,  LeMay,    35,  69
28        III, West,  LeMay,    35,  69
NOTE: The data set WORK.DF has 13 observations and 5 variables.

29 ;;;;
30 proc print data = df(obs=4);
31    var name after;
32 run;

NOTE: There were 4 observations read from the data set WORK.DF

The output from PROC PRINT with data=df(obs=4).

Return Rows by Position

The general syntax for DataFrame row slicing (sub-setting rows) using the [ ] operator is:

df:[start : stop : step]

The start position is included in the output and the stop position is not included in the output. Consider the following example, DataFrame Row Slicing, Example 1.

code DataFrame Row Slicing, Example 1.py

 df[:3]
  District Sector      Name  Before  After
0        I  North    Patton      17     27
1        I  South    Joyner      13     22
2        I   East  Williams     111    121

This example returns the first three row from the df DataFrame. A null value for the start position defaults to start position zero (0). The value following the colon :indicates the stop position and goes up to but does not include the row in the slicing operation.

In the following example below, DataFrame Row Slicing, Example 2, illustrates returning every other row from the df DataFrame.

code DataFrame Row Slicing, Example 2.py

 df[::2]
   District Sector      Name  Before  After
0         I  North    Patton      17     27
2         I   East  Williams     111    121
4        II  North      Aden      71     70
6        II   East   Jenkins      99     99
8       III  North     Chang      69    101
10      III   East   Haskins      45     41
12      III   West     LeMay      35     69

The start and stop positions are null causing the slice df[::2] to default to the first and last row respectively in the DataFrame. The value of two (2) for the step position returns every other row. This same logic is displayed in the example below, SELECT Every Other Row.

code SELECT Every Other Row.sas

4 data df1;
5    set df;
6    if mod(_n_, 2) ^= 0 then output;
7 run;

NOTE: There were 13 observations read from the data set WORK.DF. NOTE: The data set WORK.DF1 has 7 observations and 5 variables.

The example creates the df1 dataset with a sub-setting IF statement to perform modulo division by 2 on the automatic SAS _n_variable assigned to each observation.

Modulo division by 2 on even integers returns 0 (zero). By using the IF statement’s inequality evaluation of ^= 0 every odd _n_ value (1, 3, 5, etc.) evaluates true and is written to the output df1 dataset.

Return Rows and Columns by Label

The .loc indexer is a method primarily used for returning rows and columns using labels. Allowed inputs to .loc are:

• A single label such as 12 or ‘Name’. Note that 12 is interpreted as the row label and not as the integer location along the index.

• A Python list of labels [‘A’, ‘B’, ‘C’]

• A slice object with labels ‘a’ : ‘z’. Both the start, in this case ‘a’, and the stop, ‘z’ is included when present in the index.

• Conditional evaluations Each of these methods are illustrated. Up to this point, the index for the df DataFrame relies on the default RangeIndex object for returning rows by an integer position along the index. In order to retrieve rows from the df DataFrame by labels the Name column is set as the DataFrame index.

This action assigns the values from the Name column as labels for the DataFrame rows. Said another way, a DataFrame index maps columns values onto rows as labels.
The syntax and default values for the set_index method is:

df.set_index(keys, drop=True, append=False, inplace=False, verify_integrity=False)

In the folowing example, Add Index to DataFrame, illustrates defining an index for an existing DataFrame.

code Add Index to DataFrame.py

 print(df)
   District Sector      Name  Before  After
0         I  North    Patton      17     27
1         I  South    Joyner      13     22
2         I   East  Williams     111    121
3         I   West     Jurat      51     55
4        II  North      Aden      71     70
5        II  South    Tanner     113    122
6        II   East   Jenkins      99     99
7        II   West    Milner      15     65
8       III  North     Chang      69    101
9       III  South     Gupta      11     22
10      III   East   Haskins      45     41
11      III   West     LeMay      35     69
12      III   West     LeMay      35     69
 print(df.index)
RangeIndex(start=0, stop=13, step=1)

 df.set_index('Name', inplace = True, drop = True)
 print(df.head(4))
         District Sector  Before  After
Name
Patton          I  North      17     27
Joyner          I  South      13     22
Williams        I   East     111    121
Jurat           I   West      51     55
 print(df.index)
Index(['Patton', 'Joyner', 'Williams', 'Jurat', 'Aden', 'Tanner', 'Jenkins','Milner', 'Chang', 'Gupta', 'Haskins', 'LeMay', 'LeMay'],
dtype='object', name='Name')

In this example, the first print function displays all rows from the df DataFrame.

The syntax:

print(df.index)

returns the default RangeIndex in use for the rows with integer values between 0 and 13. The syntax:

df.set_index('Name', inplace=True, drop=True)

selects the values from the Name column label as row labels.

The argument inplace=True updates the df DataFrame in place and the drop=True argument drops the Name column for the DataFrame.

With this defined index, the .loc indexer uses the Name column values to return rows rather than using row position.

Notice how the third print() function displays the values for the Name column as row labels in the left-most column of the output.

The syntax:

print(df.index)

returns the index values as a Python list.

An index may have non-unique values which we illustrate with this example. Some DataFrame operations require the index keys be in sorted order while others may require unique values.

We cover the details for sorting in Chapter 5, pandas Data Management. Chapter 9, Time Series Analysis covers details for unique index values. With an index in place as row labels we can slice rows using the .loc indexer. As well, columns can be sliced with the .loc indexer since they have labels (i.e. names).

The syntax for the .loc indexer is:

df.loc[row selection, column selection]

The row selection is listed first, separated by a comma , followed by the column selection. For both the row or column selection a colon : is used to request a range of items.

codeReturn Row Slices.py

 df.loc['Patton': 'Aden', ]
         District Sector  Before  After
Name
Patton          I  North      17     27
Joyner          I  South      13     22
Williams        I   East     111    121
Jurat           I   West      51     55
Aden           II  North      71     70

This example slices rows beginning with the row labeled Patton and ending with the row labeled Aden inclusive. The empty value for the column selector, following the comma , implies all columns.

The same DataFrame can be returned by stating the column selector explicitly with the syntax:

df.loc['Patton' : 'Aden', 'District' : 'After']

The following example, Return Row and Column Slices, illustrates supplying a single label to the row selector followed by a Python list of labels as the column selector.

code Return Row and Column Slices.py

 df.loc['LeMay', ['Before','After']]
       Before  After
Name
LeMay      35     69
LeMay      35     69

Notice how the row labels are not unique.

Conditionals

In the example below, Return Rows Conditionally, illustrates returning rows and columns based on a Boolean comparison.

codeReturn Rows Conditionally.py

 df.loc[(df['Sector'] == 'West') & (df['Before'] > 20)]
      District Sector  Before  After
Name
Jurat        I   West      51     55
LeMay      III   West      35     69
LeMay      III   West      35     69

The Boolean comparisons are enclosed with parentheses ( ) and utilize any of the Comparision operators. .

In this example the Boolean comparisons contain two predicates; the first is (df[‘Sector’] == ‘West’) and the second is (df[‘Before’] > 20). The Boolean operator & (and) joins the predicates and therefore both must return True in order to meet the row selection criteria.

Note the sytax differences between Return Row and Column Slices and Return Rows Conditionally. In the former rows are sliced based on labels. The latter uses the df[‘Sector’] and df[‘Before’] to designate column names for the conditional expression.

Suppose we wish to sub-set rows based on the last letter of the value for the Name column ending with the letter ‘r’. In the following example, Conditionally Return Rows with String Manipulation combines the .loc() indexer with the .str.endswith attribute to satisfy the request.

code Conditionally Return Rows with String Manipulation.py

df.loc[df['Name'].str.endswith("r"), ['District', 'Sector']]
KeyError: 'the label [Name] is not in the [index]'

Unfortunately, this example raises a KeyError since the column Name was dropped when the df.index was initially created in the following example, Add Index to DataFrame.

Note this error message is truncated here. One remedy for the KeyError is to ‘return’ the Name column using the .reset_index function illustrated in the example below, Drop DataFrame Index.

code Drop DataFrame Index.py

 df.reset_index(inplace = True)
 df.loc[df['Name'].str.endswith("r"), ['Name', 'Before', 'After']]
     Name Before  After
1  Joyner      13     22
5  Tanner     113    122
7  Milner      15     65

In this example the syntax:

df.reset_index(inplace = True)
calls the .reset_index method to ‘drop’ the index and return the <span class="coding">Name </span>column as one of the columns on the df DataFrame.

The inplace = true argument performs the operation in-place. The second line in the program chains the .str.endswith(“r”) attribute to the Name column and returns True for any value whose last letter in the sequence is ‘r’.

The purpose of this example is to simply illustrate resetting an index with the reset_index method. The more Pythonic remedy for the KeyError illustrated in the example below is:

df.loc[df.index.str.endswith('r'), ['District', 'Sector']]

The analog SAS program is shown in the following example, Conditionally Return Observations with String Manipulation.

code Conditionally Return Observations with String Manipulation.py

4  proc sql;
5    select name
6           ,before
7           ,after
8    from df
9    where substr(reverse(trim(name)),1,1) = 'r';
10 quit;

The nested functions in the WHERE clause work from the inside out by:

Calling the TRIM function to remove trailing blanks
Calling the REVERSE function to reverse the letters in the variable name
Calling the SUBSTR (left of =) function to test if the first letter is ‘r’.

Another method for conditional testing is to combine the .loc indexer with the isin attribute. The isin attribute returns a Boolean indicating if elements in the DataFrame column are contained in a Python list of values.

code Select Rows with isin List of Values.py

 df.set_index('Name', inplace=True)
 df.loc[df['Sector'].isin(['North', 'South'])]
       District Sector  Before  After
Name
Patton        I  North      17     27
Joyner        I  South      13     22
Aden         II  North      71     70
Tanner       II  South     113    122
Chang       III  North      69    101
Gupta       III  South      11     22

Because the index was ‘dropped’ in the example below Drop DataFrame Index, the index is set again, this time with the syntax:

df.set_index('Name', inplace=True)

to enable slicing with labels using the .loc indexer.

In the following example, SAS IN Operator illustrates the same capability using the IN operator. The IN operator performs an implied truth test by including values from a list that match values from the sector variable. The IN operator is also valid with an IF statement in a Data Step.

code SAS IN Operator.sas

4 proc sql;
5   select *
6   from df
7 where sector in ('North', 'South');
8 quit;

Updating

The .loc indexer can update or set values (the term used with pandas documentation). Consider the example below, Set Values Matching a List of Labels.

code Set Values Matching a List of Labels.py

 df.loc[['Patton', 'Jurat', 'Gupta'], 'After']
Name
Patton    27
Jurat     55
Gupta     22
Name: After, dtype: int64
 df.loc[['Patton', 'Jurat', 'Gupta'], ['After']] = 100
 df.loc[['Patton', 'Jurat', 'Gupta'], 'After']
Name
Patton    100
Jurat     100
Gupta     100
Name: After dtype: int64

The first call to the .loc indexer supplies a Python List of Name labels for three individuals along with their corresponding After values and returns a Series.

Recall that a Series is analogous to a single DataFrame column. The second call to the .loc indexer sets (updates) the After column for each of the labels in the Python list:

 ['Patton', 'Jurat', 'Gupta']

The SAS analog is illustatred in the example below, IN Operator Conditionally Select Rows.

code IN Operator Conditionally Select Rows.sas

4  data df;
5      set df;
6  if _n_ = 1 then put
7     'Name     After';
8  if name in ('Patton', 'Jurat', 'Gupta') then do;
9     after = 100;
10     put @1 name @10 after;
11     end;
12  run;

OUTPUT:

Name     After
Patton   100
Jurat    100
Gupta    100

NOTE: There were 13 observations read from the data set WORK.DF. NOTE: The data set WORK.DF has 13 observations and 5 variables.

This example uses the IN operator with an IF/THEN DO/END block updating the after variable conditionally.

Setting values for an entire DataFrame column is illustated in the following example, Set Values for a Column.

codeSet Values for a Column.py

 df.loc[: , 'After'] = 100
 print(df.head(5))
         District Sector  Before  After
Name
Patton          I  North      17    100
Joyner          I  South      13    100
Williams        I   East     111    100
Jurat           I   West      51    100
Aden           II  North      71    100

The call to the .loc indexer slices all rows from the df DataFrame since no start and stop values are supplied indicated by a colon :. The column slice After is set to 100.

8. Return Rows and Columns by Position

The .iloc indexer uses integer positions (from 0 to length-1 of the axis) for slicing rows and columns. Allowed inputs to .iloc are: • An integer, e.g. 12

• A Python list of integers [4, 2, 22]

• A slice object with integers 2 : 22. The start, in this case 2 is inclusive and the stop position 22 is exclusive.

The stop position for the .iloc indexer is exclusive, meaning not included. This is in contrast to the .loc indexer which is inclusive. The syntax for the .iloc indexer is:

df.iloc[row selection, column selection]

A comma (,) is used to separate the request of row slices from column slices. A colon (:) is used to request a range of items. The absence of either a column or row selector is an implicit request for all columns or rows, respectively.

These features are illustrated below. In the following example, Return df First and Last Row, return the introduces the .iloc indexer.

code Return df First and Last Row .py

 df.iloc[[0, -1]]
       District Sector    Name  Before  After
Name
Patton        I  North  Patton      17    100
LeMay       III   West   LeMay      35     69

In this example a Python list of row values based on their position is passed to the .iloc indexer. Row 0 is the first row and row -1 is the last row in the df DataFrame.

The same logic for SAS is illustrated in the example below, Return First and Last Observation.

code Return First and Last Observation.sas

4  data df1;
5    set df end = last;
6
7  if name in ('Patton', 'Jurat', 'Gupta') then after = 100;
8  if _n_  = 1 then output;
9  if last = 1 then output;
10 run;

NOTE: There were 12 observations read from the data set WORK.DF. NOTE: The data set WORK.DF1 has 2 observations and 5 variables.

11 proc print data = df1 noobs;  
12 run;

The input dataset df uses the end = dataset option to detect the last observation reading the df dataset. The end = dataset option initializes a variable’s value to 0 and is set to 1 when the last observation is read.

Sub-setting IF statements are used to output the first and last observation to the output dataset df1. The output dataset is displayed in the example, First and Last Observation. The noobs option for PROC PRINT supresses the display of the SAS observation number contained in the automatic SAS variable n.

The .iloc indexer accomodates a Python list of integers as well as a slice object to define row and column selections. In the following example, .iloc Using List and Slice Object, illustrates combining these selectors.

code Using List and Slice Object.py

 df.reset_index(inplace = True)
 df.iloc[[2, 10, 12], :2]
        Name District
2   Williams        I
10   Haskins      III
12     LeMay      III

While it is possible to call the .iloc indexer with an index preset, in order to understand the effect, the Name index is dropped with:

df.reset_index(inplace = True)

The row selector uses a Python list of integers selecting rows 2, 10, and 12 followed by a comma (,) to define the column slicer. The column slicer 0:2 selects two columns (column 0 and column 1). Remember, for the .iloc indexer, the stop values used as slicers for row and column values goes up to but does not include the stop value.

The syntax for selecting the first three columns and all rows in the DataFrame is:

df.iloc[ : , :3]

If the column slicer stop position exceeds the number of columns in the DataFrame then all columns are returned.

The .iloc indexer accepts the value -1 value to indicate the last object in a sequence, -2 as second to last, and so on.

code Return Last Row From Last Column.py

 df.iloc[-1, -1]
100

This example returns the last row from the last column in DataFrame df.