Groupby

Learning Python and SAS | 19 December 2020

Photo by Ji Biduan

The concepts behind SAS PROC SQL, Excel pivot tables, and pandas pivot_table / df.groupby are similar: get summaries on a two-way table where, in SQL terms, the rows are the group-by columns and the columns are the select.

I will not get into useful SAS procedures such as PROC MEANS, PROC SUMMARY, etc., even though the concepts are similar.

Columns: select

Rows: groupby (also need to be in the select statement)

Columns

Below is a simple select statement, selecting all the columns and using a where statement to filter.

codeselect.sas
SELECT * 
FROM data WHERE date > '29APR2022'd and temperature < 0;

In Python,

data[(data.date > pd.Timestamp('2022-04-29')) & (data.temperature < 0)]

Rows and columns

codegroupby.py
import numpy as np
import seaborn as sns

tips = sns.load_dataset('tips')
tips.groupby(['smoker', 'day']).agg({'tip': [np.size, np.mean]})
Out: 
               tip
              size  mean
smoker day
Yes    Thur 17.000 3.030
       Fri  15.000 2.714
       Sat  42.000 2.875
       Sun  19.000 3.517
No     Thur 45.000 2.674
       Fri   4.000 2.812
       Sat  45.000 3.103
       Sun  57.000 3.168
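A side note: in newer pandas versions, passing NumPy functions such as np.size and np.mean to .agg may emit deprecation warnings; the string aliases 'size' and 'mean' produce the same table. A minimal sketch on a hand-made stand-in for the tips data (the values here are made up for illustration):

```python
import pandas as pd

# A tiny hand-made stand-in for the tips dataset (values are made up)
tips = pd.DataFrame({
    'smoker': ['Yes', 'Yes', 'No', 'No', 'No'],
    'day':    ['Thur', 'Fri', 'Thur', 'Thur', 'Fri'],
    'tip':    [3.00, 2.50, 2.75, 3.25, 2.00],
})

# String aliases 'size' and 'mean' give the same two-way summary
summary = tips.groupby(['smoker', 'day']).agg({'tip': ['size', 'mean']})
print(summary)
```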

Let’s look at an example from a SAS community Q&A.

codeprepare_data.sas
data battlerecord ;                                                  
  input year $ com_a $ com_b $ result_of_a $ ;                       
  cards ;                                                            
2015 INTEL AMD DEFEAT                                                
2015 INTEL AMD WIN                                                   
2015 INTEL SAMSUNG WIN                                               
2016 INTEL AMD DRAW                                                  
2016 AMD SAMSUNG LOSE                                                
2016 AMD INTEL LOSE                                                  
2017 INTEL QUALCOMM WIN                                              
run ;  

Observation on the data: all three columns that are involved in the query are categorical.
Goal: we are going to group by 2 of the 3 columns and pivot the 3rd column, and get the counts of this tabulation.

  year com_a com_b result_of_a
0 2015 INTEL AMD DEFEAT
1 2015 INTEL AMD WIN
2 2015 INTEL SAMSUNG WIN
3 2016 INTEL AMD DRAW
4 2016 AMD SAMSUNG LOSE
5 2016 AMD INTEL LOSE
6 2017 INTEL QUALCOMM WIN

SAS

To summarize it using SAS PROC SQL, the logic is as straightforward as in plain SQL. The tedious part is repeating sum(result_of_a in ("WIN")) as ... for every category.

codeproc sql.sas
option validvarname = any ;                                          
                                                                     
proc sql ;                                                           
  create table need as                                               
  select com_a as "(com_a)"n                                         
       , year as  "(year)"n                                          
       , sum (result_of_a in ("WIN")          ) as "(# of 'WIN's)"n  
       , sum (result_of_a in ("DEFEAT","LOSE")) as "(# of 'LOSE's)"n 
       , sum (result_of_a in ("DRAW")         ) as "(# of 'DRAW's)"n 
  from   battlerecord                                                
  group  by 1, 2                                                     
  order  by 1 desc, 2                                                
  ;                                                                  
quit ;    

Python

This can be achieved in pandas easily with groupby or pd.pivot_table. They can produce the same summary.

The difference is that in pandas we initially include the column to be summarized in the groupby, and then unstack it into the columns.

To keep missing values as a group, use dropna=False.

codegroupby.py
import pandas as pd

df = pd.read_excel(r".\python_SAS\Python-for-SAS-Users\data\sql_data.xlsx")
summarized = df.groupby(['com_a', 'year', 'result_of_a'], dropna=False).count() \
    .unstack().fillna(0).sort_index(ascending=[False, True])
summarized

Although the column names are not exactly what the questioner asked for, this accomplishes most of the task.

            com_b
result_of_a DEFEAT DRAW LOSE WIN
com_a year
INTEL 2015       1    0    0   2
      2016       0    1    0   0
      2017       0    0    0   1
AMD   2016       0    0    2   0

Note that the levels of indices start from the outer layer: column indices start from the top and row indices start from the left.

To remove the outer layer of a row or column index, use .droplevel(0); to remove the innermost layer, use .droplevel(-1).

# to drop the top level of column multiindex
summarized.columns=summarized.columns.droplevel(0)
# to get row index information as columns
summarized.reset_index(inplace=True)
  com_a year DEFEAT DRAW LOSE WIN
0 INTEL 2015 1 0 0 2
1 INTEL 2016 0 1 0 0
2 INTEL 2017 0 0 0 1
3 AMD 2016 0 0 2 0
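As an aside, the same summary can be reached without counting on a helper column at all: .size() counts rows per group directly, so no fillna or droplevel is needed afterwards. A sketch with the battle-record data built inline, so it runs without the Excel file:

```python
import pandas as pd

# The battle-record data built inline, so the sketch runs without the Excel file
df = pd.DataFrame({
    'com_a':       ['INTEL', 'INTEL', 'INTEL', 'INTEL', 'AMD', 'AMD', 'INTEL'],
    'year':        [2015, 2015, 2015, 2016, 2016, 2016, 2017],
    'com_b':       ['AMD', 'AMD', 'SAMSUNG', 'AMD', 'SAMSUNG', 'INTEL', 'QUALCOMM'],
    'result_of_a': ['DEFEAT', 'WIN', 'WIN', 'DRAW', 'LOSE', 'LOSE', 'WIN'],
})

# .size() counts rows per group, so no helper column, fillna, or droplevel is needed
summarized = (df.groupby(['com_a', 'year', 'result_of_a'])
                .size()
                .unstack(fill_value=0)
                .sort_index(ascending=[False, True]))
print(summarized)
```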

Another way in Python is to use pd.pivot_table.

This method is closer to the SQL logic in that only the group-by columns go into the index; the column to be pivoted is assigned directly to columns.

Compared with df.groupby, it is more straightforward and does not require the "trick" of unstacking a MultiIndex.

codepd.pivot_table.py
df = pd.read_excel(r".\python_SAS\Python-for-SAS-Users\data\sql_data.xlsx")
pd.pivot_table(df, columns='result_of_a', index=['com_a', 'year'], aggfunc='size',
               fill_value=0).sort_index(ascending=[False, True])
result_of_a  DEFEAT  DRAW  LOSE  WIN
com_a year
INTEL 2015        1     0     0    2
      2016        0     1     0    0
      2017        0     0     0    1
AMD   2016        0     0     2    0

Again, if we are not satisfied with how the index looks, we can use reset_index() to move the information out of the index.
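For pure counting, pd.crosstab is a closely related shortcut worth knowing; it tabulates frequencies directly and fills empty cells with 0. A sketch with the same data built inline, so it runs without the Excel file:

```python
import pandas as pd

# Same battle-record data, built inline
df = pd.DataFrame({
    'com_a':       ['INTEL', 'INTEL', 'INTEL', 'INTEL', 'AMD', 'AMD', 'INTEL'],
    'year':        [2015, 2015, 2015, 2016, 2016, 2016, 2017],
    'result_of_a': ['DEFEAT', 'WIN', 'WIN', 'DRAW', 'LOSE', 'LOSE', 'WIN'],
})

# Rows are (com_a, year), columns are the results; missing cells default to 0
table = pd.crosstab([df['com_a'], df['year']], df['result_of_a'])
print(table)
```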

Conclusion: for this problem, I think pd.pivot_table has the simplest solution because it takes only one line of code:

pd.pivot_table(df, columns='result_of_a', index=['com_a', 'year'], aggfunc='size',
               fill_value=0).sort_index(ascending=[False, True]).reset_index()

However, it is not enough just to get a solution. Understanding how these methods are related to each other, and the general kind of problems that they can solve, can help us solve many more problems.

Treatment of missing

Filtering rows

Task                           SAS PROC SQL                             python pandas
want rows with missing         SELECT * FROM df WHERE col IS NULL       df[df.col.isna()]
don't want rows with missing   SELECT * FROM df WHERE col IS NOT NULL   df[df.col.notna()]
keep missing as a group        group by includes missing automatically  use dropna=False
codeselect missing.sas
SELECT * 
FROM data WHERE date IS NULL;
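The pandas side of the table above, as a runnable sketch (the data frame here is hypothetical):

```python
import pandas as pd

# A hypothetical frame with a missing date
data = pd.DataFrame({
    'date': [pd.Timestamp('2022-04-29'), pd.NaT, pd.Timestamp('2022-05-01')],
    'temperature': [-3, 5, -1],
})

missing = data[data['date'].isna()]    # SELECT * FROM data WHERE date IS NULL
present = data[data['date'].notna()]   # SELECT * FROM data WHERE date IS NOT NULL
print(len(missing), len(present))      # 1 2
```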

In group by

SAS PROC SQL treats missing as a group unless you exclude it, for example with where col is not missing. This is a good feature.

Whereas in Python, passing dropna=False to pd.pivot_table or df.groupby keeps the missing values as a group.
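A minimal sketch of the difference, on toy data with one missing group key:

```python
import pandas as pd

df = pd.DataFrame({'grp': ['a', 'a', None, 'b'], 'val': [1, 2, 3, 4]})

# Default: the row with a missing group key is silently dropped
by_default = df.groupby('grp')['val'].sum()                  # groups: a, b
# dropna=False keeps NaN as its own group, matching SAS PROC SQL's behavior
keep_missing = df.groupby('grp', dropna=False)['val'].sum()  # groups: a, b, NaN
print(by_default)
print(keep_missing)
```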

Use case in financial analysis

The advantage of using pd.pivot_table or df.groupby is not limited to summary tables.

One of the use cases is to leverage datetime index in the methods.

For example, say we have a portfolio of loans to a group of customers. After pd.pivot_table(index=datetime_column, columns=customer, aggfunc='size'), we can immediately follow up with pandas resample or rolling methods to compute additional statistics.

pd.pivot_table(index=datetime_column) + resample
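A sketch of that pattern with made-up loan data (the frame, column names, and customers here are all illustrative, not from a real dataset):

```python
import pandas as pd

# Hypothetical loan events: one row per loan issued (all names illustrative)
loans = pd.DataFrame({
    'date':     pd.to_datetime(['2023-01-05', '2023-01-20', '2023-02-03',
                                '2023-02-15', '2023-03-01']),
    'customer': ['A', 'B', 'A', 'A', 'B'],
})

# Daily loan counts per customer; the datetime index enables resampling
counts = pd.pivot_table(loans, index='date', columns='customer',
                        aggfunc='size', fill_value=0)

# Roll the daily counts up to monthly totals ('MS' = month start)
monthly = counts.resample('MS').sum()
print(monthly)
```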

Joins

A trick that I use to remember the SQL syntax is: S F J O W G H O, which means

  • select
  • from
  • join
  • on
  • where
  • group by
  • having
  • order by
codejoin.sas
PROC SQL;
SELECT *
FROM df1
INNER JOIN df2
  ON df1.key = df2.key;
QUIT;
codejoin.py
pd.merge(df1, df2, on='key')
codeleft join.sas
PROC SQL;
SELECT *
FROM df1
LEFT OUTER JOIN df2
  ON df1.key = df2.key;
QUIT;
codeleft join.py
pd.merge(df1, df2, on='key', how='left')
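A toy sketch contrasting the two joins (df1 and df2 here are made up):

```python
import pandas as pd

# Made-up frames: key 'c' only in df1, key 'd' only in df2
df1 = pd.DataFrame({'key': ['a', 'b', 'c'], 'x': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['a', 'b', 'd'], 'y': [10, 20, 40]})

inner = pd.merge(df1, df2, on='key')              # keys in both frames: a, b
left = pd.merge(df1, df2, on='key', how='left')   # all df1 keys; 'c' gets NaN for y
print(inner)
print(left)
```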

Union

Union stacks one set of data on top of another, lining up the same columns.

Union all stacks everything regardless of duplicates, whereas union removes duplicates.

union all

codeunion.sas
PROC SQL;
SELECT city, rank
FROM df1
UNION ALL
SELECT city, rank
FROM df2;
QUIT;
codeunion.py
pd.concat([df1, df2])
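pd.concat corresponds to UNION ALL; chaining .drop_duplicates() approximates UNION. A sketch with made-up data:

```python
import pandas as pd

# Made-up frames sharing one identical row (NYC, 1)
df1 = pd.DataFrame({'city': ['NYC', 'LA'], 'rank': [1, 2]})
df2 = pd.DataFrame({'city': ['NYC', 'SF'], 'rank': [1, 3]})

union_all = pd.concat([df1, df2], ignore_index=True)      # UNION ALL: keeps the duplicate
union = union_all.drop_duplicates(ignore_index=True)      # UNION: duplicate removed
print(len(union_all), len(union))                         # 4 3
```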