
The concepts behind SAS PROC SQL, Excel pivot tables, and pandas' pd.pivot_table and df.groupby are similar: summarize data into a two-way table where, in SQL terms, the rows come from the group-by keys and the columns from the select.
I will not get into useful SAS procedures such as PROC MEANS, PROC SUMMARY, etc., even though the concepts are similar.
- Columns: select
- Rows: group by (these columns also need to appear in the select statement)
Columns
Below is a simple select statement that selects all the columns, with a where clause to filter rows.
SELECT *
FROM data
WHERE date > '29APR2022'd AND temperature < 0;
In Python,
data[(data.date > pd.Timestamp('2022-04-29')) & (data.temperature < 0)]
Rows and columns
import seaborn as sns

tips = sns.load_dataset('tips')
tips.groupby(['smoker', 'day']).agg({'tip': ['size', 'mean']})
Out:
                 tip
                size   mean
smoker day
Yes    Thur   17.000  3.030
       Fri    15.000  2.714
       Sat    42.000  2.875
       Sun    19.000  3.517
No     Thur   45.000  2.674
       Fri     4.000  2.812
       Sat    45.000  3.103
       Sun    57.000  3.168
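The same summary can also come straight from pd.pivot_table. A minimal equivalent sketch, using 'count' in place of size (identical here since tip has no missing values):

import pandas as pd

# rows: the group-by keys; values: the column being aggregated
pd.pivot_table(tips, values='tip', index=['smoker', 'day'],
               aggfunc=['count', 'mean'])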
Let’s look at an example from a SAS community Q&A.
data battlerecord;
  input year $ com_a $ com_b $ result_of_a $;
  cards;
2015 INTEL AMD DEFEAT
2015 INTEL AMD WIN
2015 INTEL SAMSUNG WIN
2016 INTEL AMD DRAW
2016 AMD SAMSUNG LOSE
2016 AMD INTEL LOSE
2017 INTEL QUALCOMM WIN
run;
Observation on the data: all three columns involved in the query are categorical.
Goal: group by two of the three columns, pivot the third, and count the rows in each cell of the tabulation.
| | year | com_a | com_b | result_of_a |
|---|---|---|---|---|
| 0 | 2015 | INTEL | AMD | DEFEAT |
| 1 | 2015 | INTEL | AMD | WIN |
| 2 | 2015 | INTEL | SAMSUNG | WIN |
| 3 | 2016 | INTEL | AMD | DRAW |
| 4 | 2016 | AMD | SAMSUNG | LOSE |
| 5 | 2016 | AMD | INTEL | LOSE |
| 6 | 2017 | INTEL | QUALCOMM | WIN |
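If you don't have the data in a file handy, a minimal sketch that builds the same battlerecord data directly in pandas:

import pandas as pd

# the battlerecord data from the SAS data step above
df = pd.DataFrame({
    'year':        [2015, 2015, 2015, 2016, 2016, 2016, 2017],
    'com_a':       ['INTEL', 'INTEL', 'INTEL', 'INTEL', 'AMD', 'AMD', 'INTEL'],
    'com_b':       ['AMD', 'AMD', 'SAMSUNG', 'AMD', 'SAMSUNG', 'INTEL', 'QUALCOMM'],
    'result_of_a': ['DEFEAT', 'WIN', 'WIN', 'DRAW', 'LOSE', 'LOSE', 'WIN'],
})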
SAS
To summarize it with SAS PROC SQL, the logic is straightforward, plain SQL. The tedious part is writing sum(result_of_a in ("WIN")) as ... over and over.
options validvarname=any;
proc sql;
  create table need as
  select com_a as "(com_a)"n
       , year as "(year)"n
       , sum(result_of_a in ("WIN"))           as "(# of 'WIN's)"n
       , sum(result_of_a in ("DEFEAT","LOSE")) as "(# of 'LOSE's)"n
       , sum(result_of_a in ("DRAW"))          as "(# of 'DRAW's)"n
  from battlerecord
  group by 1, 2
  order by 1 desc, 2
  ;
quit;
Python
This can be achieved easily in pandas with df.groupby or pd.pivot_table; both produce the same summary.
The difference is that with groupby we initially include the column to be summarized among the group-by keys, and then unstack it into the columns.
To keep missing values as a group, use dropna=False.
import pandas as pd

df = pd.read_excel(r".\Python-for-SAS-Users\data\battlerecord.xlsx")
summarized = (df.groupby(['com_a', 'year', 'result_of_a'], dropna=False)
                .count()
                .unstack()
                .fillna(0)
                .sort_index(ascending=[False, True]))
summarized
Although the column names are not exactly what the questioner asked for, we have accomplished most of what was asked.
| | ('com_b', 'DEFEAT') | ('com_b', 'DRAW') | ('com_b', 'LOSE') | ('com_b', 'WIN') |
|---|---|---|---|---|
| ('INTEL', 2015) | 1 | 0 | 0 | 2 |
| ('INTEL', 2016) | 0 | 1 | 0 | 0 |
| ('INTEL', 2017) | 0 | 0 | 0 | 1 |
| ('AMD', 2016) | 0 | 0 | 2 | 0 |
Note that index levels are counted from the outer layer: column index levels start from the top, and row index levels start from the left.
To remove the outermost layer of a row or column index, use .droplevel(0); to remove the innermost layer, use .droplevel(-1).
# to drop the top level of column multiindex
summarized.columns=summarized.columns.droplevel(0)
# to get row index information as columns
summarized.reset_index(inplace=True)
| | com_a | year | DEFEAT | DRAW | LOSE | WIN |
|---|---|---|---|---|---|---|
| 0 | INTEL | 2015 | 1 | 0 | 0 | 2 |
| 1 | INTEL | 2016 | 0 | 1 | 0 | 0 |
| 2 | INTEL | 2017 | 0 | 0 | 0 | 1 |
| 3 | AMD | 2016 | 0 | 0 | 2 | 0 |
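If we do want to match the PROC SQL output exactly, a sketch that folds DEFEAT into LOSE (as the sum over ("DEFEAT","LOSE") does) and mimics the requested column names:

# combine DEFEAT and LOSE, as the PROC SQL step counts them together
summarized['LOSE'] = summarized['LOSE'] + summarized['DEFEAT']
summarized = summarized.drop(columns='DEFEAT').rename(
    columns={'WIN': "# of 'WIN's", 'LOSE': "# of 'LOSE's", 'DRAW': "# of 'DRAW's"})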
Another way in Python is to use pd.pivot_table.
This method is similar to the SQL logic in that only the group-by columns go into the index; the column to be pivoted is assigned directly to the columns argument.
Compared with df.groupby, it is more straightforward and does not require the "trick" of unstacking a MultiIndex.
df = pd.read_excel(r".\python_SAS\Python-for-SAS-Users\data\sql_data.xlsx")
pd.pivot_table(df, columns='result_of_a', index=['com_a', 'year'],
               aggfunc='size', fill_value=0).sort_index(ascending=[False, True])
| | DEFEAT | DRAW | LOSE | WIN |
|---|---|---|---|---|
| ('INTEL', 2015) | 1 | 0 | 0 | 2 |
| ('INTEL', 2016) | 0 | 1 | 0 | 0 |
| ('INTEL', 2017) | 0 | 0 | 0 | 1 |
| ('AMD', 2016) | 0 | 0 | 2 | 0 |
Again, if we are not satisfied with how the index looks, we can use reset_index() to move the information out of the index into ordinary columns.
Conclusion: for this problem, I think pd.pivot_table has the simplest solution because it takes only one line of code:
pd.pivot_table(df, columns='result_of_a', index=['com_a', 'year'], aggfunc='size',
               fill_value=0).sort_index(ascending=[False, True]).reset_index()
(Note the plain reset_index() at the end: reset_index(inplace=True) returns None, so it cannot be chained.)
However, it is not enough just to get a solution. Understanding how these methods are related to each other, and the general kind of problems that they can solve, can help us solve many more problems.
Treatment of missing values
Filtering rows
| Task | SAS PROC SQL | Python pandas |
|---|---|---|
| want rows with missing | SELECT * FROM df WHERE col IS NULL | df[df.col.isna()] |
| don't want rows with missing | SELECT * FROM df WHERE col IS NOT NULL | df[df.col.notna()] |
| group by | automatically includes missing as a group | use dropna=False |
SELECT *
FROM data WHERE date IS NULL;
In group by
SAS PROC SQL treats missing as a group unless you exclude it explicitly (e.g., where col is not missing). This is a good feature.
In pandas, by contrast, missing values are dropped by default; pass dropna=False to pd.pivot_table or df.groupby to keep them as a group.
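A minimal sketch of the contrast, with hypothetical data:

import numpy as np
import pandas as pd

d = pd.DataFrame({'grp': ['a', 'a', np.nan, 'b'], 'val': [1, 2, 3, 4]})

d.groupby('grp').sum()                # the NaN row is silently dropped
d.groupby('grp', dropna=False).sum()  # NaN is kept as its own group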
Use case in financial analysis
The advantage of using pd.pivot_table or df.groupby is not limited to summary tables.
One of the use cases is to leverage datetime index in the methods.
For example, say we have a portfolio of loans to a group of customers. After pd.pivot_table(df, index=datetime_column, columns=customer_column, aggfunc='size'), we can immediately follow up with pandas resampling or rolling methods to compute whatever additional statistics are desired, as sketched below.
pd.pivot_table(index=datetime_column) + resampling
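A sketch of that pattern; the loans DataFrame and its column names here are hypothetical:

import pandas as pd

# hypothetical loan originations: one row per loan
loans = pd.DataFrame({
    'date': pd.to_datetime(['2022-01-03', '2022-01-15', '2022-02-02', '2022-02-20']),
    'customer': ['A', 'B', 'A', 'C'],
})

# loans per customer per day, with the datetime column as the index
counts = pd.pivot_table(loans, index='date', columns='customer',
                        aggfunc='size', fill_value=0)
# roll the daily counts up to month-end totals
monthly = counts.resample('M').sum()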
Joins
A trick that I use to remember the order of SQL clauses is S F J O W G H O, which means
- select
- from
- join
- on
- where
- group by
- having
- order by
PROC SQL;
SELECT *
FROM df1
INNER JOIN df2
ON df1.key = df2.key;
QUIT;
pd.merge(df1, df2, on='key')
PROC SQL;
SELECT *
FROM df1
LEFT OUTER JOIN df2
ON df1.key = df2.key;
QUIT;
pd.merge(df1, df2, on='key', how='left')
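A quick contrast with toy frames (hypothetical data):

import pandas as pd

df1 = pd.DataFrame({'key': ['a', 'b', 'c'], 'x': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['a', 'b', 'd'], 'y': [10, 20, 40]})

pd.merge(df1, df2, on='key')              # inner join: keys a and b only
pd.merge(df1, df2, on='key', how='left')  # left join: a, b, c (y is NaN for c)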
Union
Union is stacking one set of data on top of another, lining up the same columns.
Union all stacks everything regardless of duplicates, whereas union removes duplicates.
Union all
PROC SQL;
SELECT city, rank
FROM df1
UNION ALL
SELECT city, rank
FROM df2;
QUIT;
pd.concat([df1, df2])
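Note that pd.concat by itself corresponds to UNION ALL. To get UNION semantics, drop duplicates after stacking:

# UNION ALL: keep every row, duplicates included
pd.concat([df1, df2], ignore_index=True)

# UNION: remove duplicate rows after stacking
pd.concat([df1, df2], ignore_index=True).drop_duplicates()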