Categories

Python for SAS

Read PDFs

A lot of times we are given PDFs to read. Wouldn’t it be nice if we could read the tables and analyze in Python? This post is specifically about reading tables in PDFs. Note: pip install tabula-pyRead all documentcoderead all.pyimport tabula# Specify the path to your PDF filepdf_path = r"C:/Users/sache/Downloads/Citi-2023-Annual-Report.pdf"# Read tables from the PDFtables = tabula.read_pdf(pdf_path, pages='all')# Iterate through extracted tablesfor table in tables: print(table)But often we just need a specific table on a page. For example, the document we are reading is a nearly 400 page Citigroup 2023...

Cleaning financial data

This is a long overdue post.Cleaning namesRemove suffixcodeplot model driver.pyCleaning valuesFinancial data are often shown as strings in Python, although they are meant to be numbers. Remove common characters such as “$”, “,”, “()”, which is used to express indicate negative number Then use pd.to_numeric() .rstrip or .lstrip: Both will remove strings given from right or left, respectfully. If nothing is given, then remove white space. Note: All the string methods used below are not in-place operations. Therefore, do not use lstrip(“$”) in example below. Because the operations are not...

Match company names

Plotting all the (important) drivers in one figure ReferenceCleaning namescodeplot model driver.pyimport Levenshteindef match_names(column1, column2): matches = {} for name1 in column1: min_distance = float('inf') match = None for name2 in column2: distance = Levenshtein.distance(name1, name2) if distance < min_distance: min_distance = distance match = name2 matches[name1] = match return matches# Example usage:column1 = ["John", "Jane", "Michael"]column2 = ["Jon", "Janet", "Michele"]matches = match_names(column1, column2)for name1, match in matches.items(): print(f"{name1} matched with {match}")!(multi facet line plot)[https://seaborn.pydata.org/_images/faceted_lineplot.png]codeplot model driver.py>>> color_nz = ‘#2CA02C’>>> color_cp = ‘#D62728’>>> temp = df[drivers + [‘x’, ‘cate’]].melt(id_vars=+ [‘x’,...

Multiple facet plot for driver analysis

Plotting all the (important) drivers in one figure ReferencePlotting all the (important) drivers in one figureDoes my model result make sense? We check, and usually we check both the largest (most important) companies, and we check randomly. Let’s start with following a simple exmaple from seaborn. As my models are on credit ratings forecast based on various point in time scenarios, I am focusing on line plots.!(multi facet line plot)[https://seaborn.pydata.org/_images/faceted_lineplot.png]The “dots” dataset is a typical seabon type “long” dataset, where one of the columns is used for paneling. If our...

lambda functions

What is lambda function What is lambda function in Python list of lambda functions and list comprehension Use with 2 arguments in list comprehension Higher order function Loan balance amortization ReferenceWhat is lambda functionI never heard of lambda function when working in SAS, for a good reason: lambda are nothing but functions. However, there are often used in Python and some other languages such as Matlab, Java, and even Excel.A little bit of history and very high level:Alanzo Church, an American mathematician, invented the lambda function to address the question...

Search words in all files

Your very disorganized colleague needs help finding where certain data came from. There are too many files and too many worksheets to manually go through.You want to help by using a little code.We use glob.glob(“*.xlsx”) to include all Excel files. If the files contain some letters or numbers, we can add that to glob.glob(“something.xlsx”) We append all sheets, except those in the no_sheet list, from all Exel files together in one single DataFrame We define a helper function findWord, which finds words from DataFrame column by column because we don’t...

SAS macro variable and Python arguments

Python functions are a lot like SAS macros. What is the analogy to SAS macro variable in Python? 1. When defining a function Positional arguments fixed number of positional arguments Unlimited positional arguments Keyword arguments 2. When calling a function 3. Scan or loop through Summary Appendix: SAS macro reviewThe general idea of entering arguments to a Python function is similar to SAS macro variable for SAS functions, although the details are different. We see “args” and “kwargs” a lot in Python code. They are naming convention representing * and...

Python R SAS Data Input and Output

Work in Progress. Load data that comes with IDE or libraries Loading pre-packaged data into Python Load prepackaged data to R Load SASHelp data External file Import external file to Python Overview read() with open() pandas.read_ Reading larger files in Python Import external file to R read.table data.table Import data in SAS Data from SQL Servers Pyodbc RODBC Load data that comes with IDE or librariesLoading data that comes with IDE or libraries allows me to test code and fast prototype.Loading pre-packaged data into PythonWorking with Python, I use Ipython...

Python R SAS Keyboard Shortcuts

Keyboard shortcuts VSCode Ipython shell commands RStudio Add custom snippetsKeyboard shortcutsKeyboard shortcuts are essential because: when we are used to writing in a language, switching to another one can make use feel slow and dumb. Using keyboard shortcuts will allow us to pick up speed and feel at home.VSCodeAlthough Spyder and Ipython (notebook and shell) both are great, VSCode is especially useful when we are writing packages or modules. I use VSCode for all three languages. There are suggested add-ons for each of the languages. See keyboard shortcuts pdf for...

introduction to configparser

When we work on a project in SAS, we often like to have a separate program for defining libnames, path, constant parameters, some macro variables, and filenames. In Python, we like to do the same, except that we call it “configuration” or “config”. Of course, these concepts are not unique to Python or SAS. Configuration files can be used for storing applications settings or even operating system’s settings.The Python configparser library can be used for creating and interacting with configuration files, which are often done via .ini, .json, or a...

Python class

Only in a Python class did I see clearly that the thinking process is different from SAS. The design of a class is the essence of object programming and the essence of what makes Python thought process different from SAS. It is when writing a class that I finally say to myself “Aha, that’s kind of new to me!” a class in Python Class attributes and methods Naming inputs and functions Parent class and child classes Dunder (magic) methods Other notable dunders Decorators # reference a class in PythonA class...

Reusable Code

I often hear some people saying that the reason why we need to switch from SAS to Python is because only in Python we can write reusable code. That is totally wrong! You can and should write reusable code in just about any language. People switch to SAS for other reasons, expenses being one of them.In SAS, besides the numerous PROCs (procedures) that are reusable, writing macros (or macro functions) is the de facto method.The analogy to SAS macro in Python is functions. Both SAS macros and Python function take...

Python R SAS Data Analysis Lookup

Work in Progress. summary Custom Snippets Run external code summary Purpose SAS Python R   input PROC IMPORT DATAFILE = “ “ OUT=df DBMS=CSV REPLACE; GUESSINGROWS=10000; RUN; pd.read_csv(“”) # encoding=”cp1252”,encoding=”ISO-8859-1” read.table(file, as.is=TRUE)       pd.read_sas(““,encoding=”latin-1” ) read.csv(“”, header=TRUE)         load() # load data written with save   output PROC EXPORT DATA= df OUTFILE= “” DBMS=csv REPLACE; RUN; df.to_csv(index=False) save()   content proc contents data = df array.ndim, .shape, .size, .dtype, .itemsize, .nbytes str(df)     out = dsList (keep=memname memlabel name label nobs varnum) noprint;run;  ...

Groupby

Photo by Ji BiduanThe concepts between SAS PROC SQL, Excel pivot table, and pandas.pivot_table, df.groupby are similar: to get summaries on a two-way table, where the rows are the group-by and the columns are the select, using SQL language.I will not get into useful SAS procedures such as PROC MEANS, PROC SUMMARY, etc., even though the concepts are similar.Columns: selectRows: groupby (also need to be in the select statement)ColumnsBelow is a simple select statement,selecting all the columns, using a where statement to filter.codeselect.sasSELECT * FROM data WHERE date > '2022-04-29'd...

SAS PROC SQL and python df.groupby and pivot_table, and simple joins

The concepts between SAS PROC SQL, Excel pivot table, and pandas.pivot_table, df.groupby are similar: to get summaries on a two-way table, where the rows are the group-by and the columns are the select, using SQL language.I will not get into useful SAS procedures such as PROC MEANS, PROC SUMMARY, etc., even though the concepts are similar.Columns: selectRows: groupby (also need to be in the select statement)ColumnsBelow is a simple select statement,selecting all the columns, using a where statement to filter.codeselect.sasSELECT * FROM data WHERE date > '2022-04-29'd and tempreture <...

Timeseries Processing 4-Expanding Windows

This post consists of a few timeseries examples from my upcoming book on statistical and machine learning using Python, sequal to my co-authored book Python for SAS UserExpanding windowsExpanding window statistics are useful for cumulative statistics like month-to-date moving sum, YTD moving average, etc. For each row, it computes statistics with all the data available up to that point in time. It is called “expanding window” probably because we are using increasing larger window as we go down the rows. These can be useful when we want to use all...

Timeseries processing 3-Rolling Window Statistics

This post consists of a few timeseries examples from my upcoming book on statistical and machine learning using Python, sequal to my co-authored book Python for SAS UserCompare rolling with resamplingRolling window and resampling are two important time series processing methods. The first key difference between them is whether frequency changes or not. Their second key difference is whether restrict to datetime indices.In windowing, statistics are calculated from the windowed rows when “expanding” through each row and frequencies are not changed.Whereas resampling changes frequencies of the data via up sampling...

Timeseries processing 2-shifting and rate of return

This post consists of a few timeseries examples from my upcoming book on statistical and machine learning using Python, also to be published by Apress,as my co-authored book Python for SAS UserShifting Time seriesShifting timeseries is needed when we process data across time. For example, when we compute daily percent change, we are comparing each time point with the one from the day before, diff(1)/shift(1). On the other hand, when comparing with something from tomorrow, we would need diff(-1)/shift(-1), where -1 means to go forward into the future. Oh well.In...

Subplots and Uses

Often we need to plot a few things in subplots for comparison. An simple example of subplots. Compare Stocks in Two Sectors - Sarah ChenRows and ColumnsLet’s automate the number of rows and column once and for all. The key is to use quotient i//width and remainder i%width.codeautomate rows and columns for subplots.pythonimport matplotlib.pyplot as plttickers = [ 'AAPL', 'ADBE', 'AMD', # Apple, Adobe, AMD 'AMZN', 'CSCO', 'EBAY', # Amazon, Cisco, eBay 'FB', 'GOOGL', 'IBM', # Facebook, Google, IBM 'INTC', 'MSFT', 'NVDA', # Intel, Microsoft, NVidia 'PYPL', 'TSLA', 'VRSN' #...

Timeseries Processing 5-Resampling and Interpolation

This post consists of a few timeseries examples from my upcoming book on statistical and machine learning using Python, sequal to my co-authored book Python for SAS UserExpanding windowsExpanding window statistics are useful for cumulative statistics like month-to-date moving sum, YTD moving average, etc. For each row, it computes statistics with all the data available up to that point in time. It is called “expanding window” probably because we are using increasing larger window as we go down the rows. These can be useful when we want to use all...

Indexing

SAS users tend to think of indexing SAS data sets to either improve query performance or as a method to avoid dataset sorting. Another use case for using SAS indexes is to provide direct access to a specific observation in a dataset.Python pandas is numpy + dictionary of indexing. The index is used for labeling DataFrame rows and columns, as a Python dictionary.Comparision with SAS IndexingCreate a pandas IndexReturn Columns by PositionReturn Rows by PositionReturn Rows and Columns by LabelConditionalsUpdatingReturn Rows and Columns by Position Photo by Ruvim MiksanskiyLet’s get...

Working with Timeseries in Python and SAS

This post covers miscellaneous tips for working with datatime in Python and SAS. 1. Comparing dates 2. Assembling3. Parsing Photo by arindam sahaComparing Dates Comparing dates• To slice data using Date column by comparing against a date in Python when the date column is not the indexdf[df.Date.dt.year>2010] #comparing year# or when you need to compare against a specific dateimport datetimedf[df.Date.dt.date > datetime.date(2010,12,30)]• To slice data by comparing against a date in SASif year(Date) > 2010;/*or, to compare against a specific date*/if Date > '30Dec2010'd;/* or where Date > '30Dec2010'd;*/ •...

Shifting Time Series

Photo by paul-gilmorIn SAS there are multiple ways to shift time series data, i.e. to create leads and lags, including the LAG function or the more powerful one is PROC EXPAND. In Python pandas, shifting is also called “sliding window”. There are two main shifting methods in pandas: shift() and tshift(). These two methods take the same parameters, where the defaults are: periods=1, freq=None, axis=0.The difference between them is that shift() shifts the data whereas tshift() shift the index. Both can take positive or negative integers to specify number of...

Constructing, Assembling, Parsing

In this post, you will learn:CreatingAssemblingParsingLet’s get started. Photo by leonard-cotteIn practice, we often need to perform the following in working with time series or panel data before we start to manipulate them: Concstruct a time series from the ground up Assembling from multiple columns Convert or parse strings to desired date, time or datetime formats Convert number to desired date, time or datetime objectsCreatingAs we have seen so far, it is easy to create a series of date, time or datetime with pandas. Using the pd.date_range(), pd.period_range() pd.timedelta_range() function,...

Time Series Processing 1

This post consists of a few timeseries examples from my upcoming book on statistical and machine learning using Python, sequal to my co-authored book Python for SAS UserParsing and Normalizing DatesFor most of the use cases we have encountered, pandas.to_datetime will parse date/time in string or number format to datetime format correctly.However, when we want to be sure that parsing is done according to specification, we should provide directives, which are proceeded by the % sign. Just note that lower-case m represents month whereas upper caseM represents minute.When using with...

Time Series Indexing

While we are postponing, life speeds by.The most fundamental measures of time are point in time timestamp and intervals (fixed or variable), and the difference between them timedelta. These objects provide building blocks for comprehensive timeseries data processes.Recall a pandas index is simply a way to label rows. One of the main uses of Timestamp and Period objects is indexing. Lists of Timestamp or Period objects are automatically coerced to DatetimeIndex and PeriodIndex respectively when used as index. Python pandas is powerful and convenient in handling timeseries. But there are...

MultiIndexing

This post is about multiindexing in pandas and is partially based on Chapter 4 “Indexing and groupBy” of my co-authored book (from page 131) Python for SAS Users.In this post we will cover:- MultiIndexing- Basic Sub-sets with MultiIndexes- Advanced Indexing with MultiIndexes Photo by Tan KaninthanondMultiIndexingThis section introduces MultiIndexing, also known as hierarchical indexing. The main advantage of MultiIndexing is to be able drill down data without the need of using groupby.While many SAS users have not used indexing or composite indexing, all SAS users have encountered what looks like Python...

Timestamp, Period and Timedelta

TimestampPeriodTimedeltaLet’s get started. Photo by Sean OThe most fundamental measures of time are point in time (time stamp) and intervals (fixed or variable), and the difference between them, timedelta. These objects provide building blocks for comprehensive time series data processes.The most fundamental measures of time are point in time timestamp and intervals (fixed or variable), and the difference between them timedelta. These objects provide building blocks for comprehensive time series data processes. here.TimestampPandas Timestamp is pandas’ equivalent to the Python’s native datetime object and in many cases a pandas Timestamp...

Mastering directories

Say we are doing some data analysis. And our project directory has two folders that contains code and inputs: code and data.We want to read data from the data folder, run code from the code folder, and output our analysis results with plots in .png and analysis in .txt files. Name In/Out code the file data input images output analysis output Mastering paths allows us to make our code portable and less error prone.What portable means that whoever receives our code and files can run it on their computer without...

numpy outer product

One of the advantage of Python over base SAS (of course SAS has other advantages over Python) is matrix operations. This short post is about a few ways to do product and outer product using Numpy.productsnp.multiply and * accomplish the same thing. Both multiply element-wise.np.dot as the name implies, gives a dot product. The result is a single number.reshapeAn useful reshape trick is -1, which is a placeholder that refers to the size of the remaining axis that is to be inferred.For example, we begin with a 1-d array of...

credit risk

Muni Bonds

Bonds vs. Dividend Stocks Types of Muni Bonds Who invest in munis Historical perspectives Current situation and outlook Risks in Munis Muni bond insurance Climate risk in Muni bonds Interest rate sensitivities Compute bond price Bond duration Convexity ReferenceBonds vs. Dividend StocksDividend stocks and bonds are two different types of investments that serve different purposes in an investor’s portfolio. Dividend Stocks: Dividend stocks are shares of companies that distribute a portion of their earnings to shareholders in the form of dividends. These dividends are typically paid regularly, such as quarterly...

Credit Risk Tools

Private credit in the oil and gas industry refers to financing provided by private credit funds, private equity firms, or other non-bank lenders to companies operating within the oil and gas sector. This type of financing can take various forms, including debt financing, mezzanine financing, or structured finance solutions. Private credit has become increasingly popular in the oil and gas industry, particularly in times when traditional bank lending may be limited or when companies seek alternative sources of capital.Here are some key aspects of private credit in the oil and...

Credit Risk Tools

Wholesale credit analysis is a very labor intensive process. Although there are many quantitative models, so far real intelligence analyzing the financials is still the most effective method. This is because there are so many different things that can affect a company differently from another one in a different industry, or even in the same industry.Definitions Term Calculation Operating income Sales - COGS (including despreciation) - SG&A - other operating expense + other operating income EBIT Income before taxes + interest expense EBITDA Net i ncome before taxes + interest...

Commerical real estate and risks

Common risk factorsCommercial real estate risks vary greatly by location and property type, and many idiosyncratic factors determine building-specific risks: if it is on the waterfront, amenities and updates, management, etc. But there are some common factors. LTV\(LTV=loan/property value\)The higher the LTV, the higher the risk. Once LTV is bigger than 1, it becomes very risky. DSCR\(Net operating income/debt service coverage ratio\) The higher the DSCR, the better the deal, and smaller the default risk. Sometimes even when LTV is larger than 1, but NOI is higher than interest payments,...

Retail Deposit and Attrition Model

The attrition model is about how many retail customers will leave the bank at a given point in time. The target variable can be the number of accounts closed each month. Whereas the deposit model is about what is the average balance of customers for the leaving customers. The target variable can be the monthly growth rate of balances. The two things together tell us how much $ leaving the institution at a given time.The key drivers are: rate driven difference of bank rate and benchmark rate (e.g. average rates...

Timeseries Modeling

This post consists of a few timeseries regression examples from my upcoming book on statistical and machine learning using Python, also to be published by Apress,as my co-authored book Python for SAS User. There are two main categories of models for time series data: Various variations of OLS type of regression y = a + b*x. To account for residual serial correlation, Newey-West standard errors may be used. Time series model, such as ARIMAX, SARIMAXWe will begin with some data analysis and then get into modeling.Data VisualizationWhen we look at...

Data Processing Lessons

General principles Don’t create (too many) unnecessary columns Move all data format correction code in block Data aggregation Functions Passing arguments Plotting Other tips Cartesian product Although processes are iterative, it should still be organized, at least periodically. If it becomes too chaotic, it becomes inefficient.For example, some have multiple places of unnecessary indicator variable creation manually. Whereas it could and should be all be done in one shot after we have finalized the set the variables! Need to be disciplined and organized! General principles Don’t create (too many) unnecessary...

Un-credible Credit Scores

Photo by paul-gilmorFrom my experience both as a technical expert on credit risk as well as personal experience, credit score is an outdated system. Credit score is a one-dimensional view of a person’s financial health. The gap between what it does and is used for creates opportunities for lenders and other institutions to up-skill their data and analytics for offering better credit services for our time.Although most of my experience in credit risk is more in Commerical lending rather than personal, I had researched and developed retail credit risk models...

Failures and Risks

Photo by Ji BiduanTwo of my favorite investors, Sam Zell and Howard Marks, both have said that understanding risk is the most important thing in their investment strategies. Sam Zell even put it this way: you need to fear, to be respectful of risk.Here is what Sam Zell said in a podcast:You had to be realistic. You had to be able to look at it and say, “What could go wrong?” If I learned anything from [Jay Pritzker], I learned that if everything went too well, you could survive. The...

Credit Scoring Platforms and In-House Solutions

This post is a survey on some of the credit scoring platforms and credit scoring in-house solutions out there and what they do. There are many opportunities for moving forward credit scoring but the competition is fierce: large banks with lots of resources, existing proprietary analytic software companies, new analytic analytics-as-a-service companies leveraging open source and cloud computing, old powerhouse credit scoring companies, and new credit scoring platforms.In less regulated countries like China there has been heavy use of machine learning in credit scoring and social scoring (Ali Pay, Baidu...

Credit Risk Overview

The credit industry Consumer Commercial Credit cycles: the best of the time, the worst of time; vice versa Managing credit risk Qualitative insights Data Asymmetry of information Data we can capture Analytics:At any given moment, there are hundreds of trillions of debt or credit. This fantastic visualization (even though outdated) provides a sense of the scale and the relative scale of credit: All of the World’s Money and Markets in One Visualization Credit risk is the uncertainty associated with credit, specifically about whether the borrower will pay back the money...

education

Model Relationship Using Networks

We often want to visualize relationships: between models, people, or anything.I learned a lot from a user’s fun example:How to customize a networkx graph?codegraph representation 1.pyimport randomimport networkx as nxG = nx.petersen_graph()pdot = nx.drawing.nx_pydot.to_pydot(G)shapes = ['box', 'polygon', 'ellipse', 'oval', 'circle', 'egg', 'triangle', 'exagon', 'star', ]colors = ['blue', 'black', 'red', '#db8625', 'green', 'gray', 'cyan', '#ed125b']styles = ['filled', 'rounded', 'rounded, filled', 'dashed', 'dotted, bold']for i, node in enumerate(pdot.get_nodes()): node.set_label("n%d" % i) node.set_shape(shapes[random.randrange(len(shapes))]) node.set_fontcolor(colors[random.randrange(len(colors))]) node.set_fillcolor(colors[random.randrange(len(colors))]) node.set_style(styles[random.randrange(len(styles))]) node.set_color(colors[random.randrange(len(colors))])for i, edge in enumerate(pdot.get_edges()): edge.set_label("e%d" % i) edge.set_fontcolor(colors[random.randrange(len(colors))]) edge.set_style(styles[random.randrange(len(styles))]) edge.set_color(colors[random.randrange(len(colors))])png_path = "test.png"pdot.write_png(png_path)Here is a super simple...

Python and the Office applications

Introduction Excel Outlook PowerPoint Create dashboard Copy from Excel to PowerPoint Run VBA code from PythonIntroductionMost of the corporate office work is still immersed in MS Office: Excel, Outlook emails, PowerPoint and Word documents. To reduce Office fatigue, we can use Python.Python standard libary ctypes is a foreign function library. It provides C compatible data types, and allows calling functions in DLLs or shared libraries. It can be used to wrap these libraries in pure Python.The win32com library and pywin32 library enable use and publish our own COM (Component Object...

Merge Sort and Recursion

This post is a rewrite of an older post back in July, 2022. Most of the addition are my edited transcript from the last few minutes of Professor David Malen’s CS50 class on algorithm that was live streamed on September 22, 2022. I know it is kind of strange to edit a transcript and put it in a post. But the explanation is just that good. Better than anything I have seen. This particular class is just one of the best movies that I have ever watched (yes I call...

All Elements in 2 BSTs

Problem Brute force method O(n log(n)) O(n) solution ReferenceIn this post, we deep dive into this problem Leetcode 1305. All Elements in Two Binary Search Trees.Please pardon me if you find the notes very basic.ProblemProblem is from Leetcode 1305. All Elements in Two Binary Search TreesGiven two binary search trees root1 and root2, return a list containing all the integers from both trees sorted in ascending order.Example 1:Input: root1 = [2,1,4], root2 = [1,0,3],Output: [0,1,1,2,3,4]Example 2:Input: root1 = [1,null,8], root2 = [8,1],Output: [1,1,8,8]Brute force method O(n log(n))n 代稱兩棵樹的節點總數把兩棵樹探訪過一次並將所有數字放到一個新的陣列後再排序,也就是 O(n log...

Find Mode in BST

Problem 1: 最長連續字母 Three-variable solution for arrays Problem 2: find Mode in Binary Search Tree Solution: reduce tree problem to array problem Other ways to get the list from tree traversal Work in progress ReferenceIn this post, we explore problems on counting the most frequent consecutive occurence of a number or string.Please pardon me if you find the notes messy (still working on it) and send me a message if you see anything that should be corrected.Problem 1: 最長連續字母題目大意:輸入有多列,給定一字串(只包含小寫字母)。求該字串中連續出現最多次的字母為何?如果有多個連續出現最多次的,則請找到最早出現的字母。範例輸入:abbcccciiiiiiiixxxxxxxxxxxxguuuuuuufugpccccccc範例輸出:b 2x 12解題思維:用一個整數變數 C (fast) 當作計數器(一開始設為 1)、一個整數變數 M (slow) 當作最大值,以及一個字元變數 T 儲存最大值發生時的字元(即所求)。跳過字串第一個字元從第二個字元開始,對於第...

BST to and from Preorder Traversal

Problem 1: 中序以及後序探訪转成前序探訪 Problem 2. Postorder from inorder and preorder Brute force: Back to Tree Then Convert to Postorder Problem 3: Binary Tree Preorder Traversal Problem 3 LeetCode - 1008. Construct BST from Preorder Traversal Problem 2: find Mode in Binary Search Tree Solution: reduce tree problem to array problem Other ways to get the list from tree traversal Work in progress ReferenceIn this post, we explore problems on binary search tree preorder traversal and the reverse: construct BST from preorder traversal. 先複習一下二元樹的前序、中序、後序探訪,它們的遞迴式依序是:前序:根節點 → 左子樹 → 右子樹中序: 左子樹 → 根節點...

Maximum Subarray

Maximum profit from buying and selling stocks Leetcode problem Brute force method 1. \(O(n^3)\) 2. \(O(n^2)\) Kadane’s algorithm (Fast and Slow method) \(O(n)\) Computing the best subarray’s position Maxhere and maxsofar Divide-and-conquer recursion approach: \(O(n^2)\) and \(O(n*log(n))\) Memorizing method (DP) ReferenceThe maximum subarray is a well-known problem in computer science. We discussed its brute force and dynamic programming solutions in Dynamic programming. In this post, we deep dive into this problem using Leetcode 53. Maximum Subarray as a test example.Please pardon me if you find the notes messy and send...

Index of the First Occurrence in a String

Problem O(n*m) solution More efficient O(n*m) solution Use Python in & .index ReferenceProblemProblem is from Leetcode 28. Find the Index of the First Occurrence in a String.Given two strings needle and haystack, return the index of the first occurrence of needle in haystack, or -1 if needle is not part of haystack.回傳字串 needle 第一次在字串 haystack 中出現的索引值。如果 needle 不是 haystack 的一部分則回傳 -1Example 1:Input: haystack = “sadbutsad”, needle = “sad”Output: 0Explanation: “sad” occurs at index 0 and 6.The first occurrence is at index 0, so we return 0.Example 2:Input: haystack = “leetcode”,...

Assigning and Swapping

Problem Bruter force Counter O(1) solution: assigning O(1) solution: swapping instead of assigning ReferenceProblemProblem is from Leetcode 27. Remove Element.Given an integer array nums and an integer val, remove all occurrences of val in nums in-place. The relative order of the elements may be changed.Since it is impossible to change the length of the array in some languages, you must instead have the result be placed in the first part of the array nums. More formally, if there are k elements after removing the duplicates, then the first k elements...

Number of identical pairs of numbers in an array

Problem Counter defaultdict DictProblemProblem is from Leetcode 1512. Number of Good Pairs.A pair is identical if A[i] == A[j] & i < j. Input: nums = [1,2,3,1,1,3]Output: 4Explanation: There are 4 good pairs (0,3), (0,4), (3,4), (2,5) 0-indexed.Use collections.Counter, creates a dictionary of frequencies.Note that because the keys of the dictionary are integers, we cannot use for key, value in Counter(A).All my solutions here are brute force.CounterInstead, we can use for value in Counter(A).values().from collections import Counterdef countSame(A): res = 0 for v in Counter(A).values(): if v >= 2: res...

Rearrange Array Elements by Sign

Problem My brute forece solution Simple O(n) solution Clever O(1) solution but kind of complicated ReferenceProblemThe first problem is Leetcode 2149. Rearrange Array Elements by Sign. You are given a 0-indexed integer array nums of even length consisting of an equal number of positive and negative integers.You should rearrange the elements of nums such that the modified array follows the given conditions:Every consecutive pair of integers have opposite signs.For all integers with the same sign, the order in which they were present in nums is preserved.The rearranged array begins with...

Remove Duplicates from Sorted Array

Problem 1 My brute force O(n) solution Clever O(1) solution Problem 2 My brute force O(n) solution Clever O(1) solution ReferenceProblem 1The first problem is Leetcode 26. Remove Duplicates from Sorted Array. Given an integer array nums sorted in non-decreasing order, remove the duplicates in-place such that each unique element appears only once. The relative order of the elements should be kept the same. Return k after placing the final result in the first k slots of nums.給定一個已排序陣列 nums ,原地(In-Place)將重複的元素清除掉,也就是使每個元素只出現一次。並回傳新的陣列長度。請勿另開空間給另一個陣列,你只應藉由 O(1) 的額外空間並原地(In-Place)地更改輸入陣列。Example 1:Input: nums = [1,1,2]Output: 2, nums = [1,2,_]Explanation:...

Target Array in the Given Order

1389. Create Target Array in the Given Order - super easy1389. Create Target Array in the Given Order - super easy Summary: this problem is super easy. Just follow instruction and use list.insert(where, what). Using indexing to assign values will not work because indices are repeated. Use .insert(). ProblemInitially target array is empty.From left to right read nums[i] and index[i], insert at index index[i] the value nums[i] in target array.Repeat the previous step until there are no elements to read in nums and index.nums = [0,1,2,3,4]index = [0,1,2,2,1][0,4,1,3,2]Explanation:nums index target0...

Max sum of subarrays

Leetcode 1672. Max sum of subarrays 1389. Create Target Array in the Given Order - super easyLeetcode 1672. Max sum of subarraysYou are given an m x n integer grid accounts where accounts[i][j] is the amount of money the i​​​​​​​​​​​th​​​​ customer has in the j​​​​​​​​​​​th​​​​ bank. Return the wealth that the richest customer has.A customer’s wealth is the amount of money they have in all their bank accounts. The richest customer is the customer that has the maximum wealth. Example 1:Input: accounts = [[1,2,3],[3,2,1]]Output: 6 Example 2:Input: accounts = [[1,5],[7,3],[3,5]]Output:...

Kids With the Greatest Number of Candies

Leetcode 1672. Max sum of subarrays 1389. Create Target Array in the Given Order - super easyThis post is for my young sweet child, who likes candies and sweets.This problem is Leetcode 1431. There are n kids with candies. You are given an integer array candies, where each \(candies[i]\) represents the number of candies the \(ith\) kid has, and an integer extraCandies, denoting the number of extra candies that you have.Return a boolean array result of length n, where result[i] is true if, after giving the ith kid all the...

Richest Customer Wealth- Max sum of subarrays

Leetcode 1672. Richest Customer Wealth- Max sum of subarraysThe problem comes from Leetcode 1672. Richest Customer Wealth- Max sum of subarrays.You are given an m x n integer grid accounts where accounts[i][j] is the amount of money the i​​​​​​​​​​​th​​​​ customer has in the j​​​​​​​​​​​th​​​​ bank. Return the wealth that the richest customer has.A customer’s wealth is the amount of money they have in all their bank accounts. The richest customer is the customer that has the maximum wealth. Example 1:Input: accounts = [[1,2,3],[3,2,1]]Output: 6 Example 2:Input: accounts = [[1,5],[7,3],[3,5]]Output: 10...

Running Sum of 1d Array

Leetcode 1480. Running Sum of 1d Array Leetcode 2011. Final Value of Variable After Performing OperationsLeetcode 1480. Running Sum of 1d ArrayExample 1:Input: nums = [1,2,3,4]Output: [1,3,6,10]Example 2:Input: nums = [1,1,1,1,1]Output: [1,2,3,4,5]Example 3:Input: nums = [3,1,2,10,1]Output: [3,4,6,16,17]Constraints:1 <= nums.length <= 1000-10^6 <= nums[i] <= 10^6 Solution 1: Brute forceThis is a very easy problem. But more complicated problems may rely on this kind of simple steps. Since it is a running sum, we need to provide a starter array, an empty list, to collect the cumulative sums. To accumulate sum,...

Course schedule II

Compare with Floyd’s algorithm for topological sort AppendixLeetcode 210. Course Schedule IIThere are a total of numCourses courses you have to take, labeled from 0 to numCourses - 1. You are given an array prerequisites where prerequisites[i] = [ai, bi] indicates that you must take course bi first if you want to take course ai.For example, the pair [0, 1], indicates that to take course 0 you have to first take course 1.Return the ordering of courses you should take to finish all courses. If there are many valid answers,...

Build array from permutation

Intuitive solution O(1) space complexity solution ReferenceThis problem came from Leetcode 1920. Build Array from Permutation. This is a very easy problem. But more complicated problems may rely on this kind of simple steps. Given a zero-based permutation A (0-indexed), build an array A of the same length where \(A[i] = A[A[i]]\) and return it.Example 1:Input: nums = [0,2,1,5,3,4]Output: [0,1,2,4,5,3]Intuitive solutionThe intuitive solution is to follow exactly what it says says in \(A[i] = A[A[i]]\) and build a new array:0th place of the new array \(A\): get the key at...

If can make non-decreasing Array

My simple brute force approach A clever not not efficient approach Compare 2 solutions ReferenceRules or actions sometimes change not only those keys but also the relationships between the keys.This problem comes from Leecode 665. Non-decreasing Array:Given an array nums with n integers, check if it could become non-decreasing by modifying at most one element. Example 1: Input: nums = [4,2,3] Output: true Explanation: You could modify the first 4 to 1 to get a non-decreasing array. Example 2: Input: nums = [4,2,1] Output: falseMy simple brute force approachThe idea...

Course schedule I

Leetcode 207. Course Schedule Appendix ReferenceLeetcode 207. Course ScheduleProblem: There are a total of numCourses courses you have to take, labeled from 0 to numCourses - 1. You are given an array prerequisites where \(prerequisites[i] = [a_i, b_i]\) indicates that you must take course \(b_i\) first if you want to take course \(a_i\). For example, the pair [0, 1], indicates that to take course 0 you have to first take course 1. Return true if you can finish all courses. Otherwise, return false. Example 1:Input: numCourses = 2, prerequisites =...

Steps to make array monotonic

Leetcode 2289. Steps to Make Array Non-decreasing problem: Given a 0-indexed integer array \(A\). In one step, remove all elements \(A[i]\) where \(A[i]\) is less than its left neighbor, return the number of steps performed until nums becomes a non-decreasing array. Input: \([5,3,4,4,7,3,6,11,8,5,11]\) Output: 3Explanation: The following are the steps performed: Step 1: \([5,3,4,4,7,3,6,11,8,5,11]\) ==> \([5,4,4,7,6,11,11]\) Step 2: \([5,4,4,7,6,11,11]\) ==> \([5,4,7,11,11]\) Step 3: \([5,4,7,11,11]\) ==> \([5,7,11,11]\) \([5,7,11,11]\) is a non-decreasing array. Therefore, we return 3.My solutionMy solution is very straightforward. The steps are:Number of steps ct=:0; record the indice of...

The DNF

Introduction Ukrainian national flag problem The Dutch national flag problem The counting approach DNF 3-way sort for qucksort Two passes on the array One single pass Variations of the UNF and the DNF Reference Introduction The famous Dutch National Flag problem was proposed as a colored ball problem by Edsger Dijkstra as follows: Given N balls of colour red, white or blue arranged in a line in random order. You have to arrange all the balls such that the balls with the same colours are adjacent with the order of...

arrays

Array as an abstract data type Python list Slicing List comprehension Implement the array ADT in PythonArray as an abstract data typeAs an abstract data type, an array is a list of similar data (sounds like circular definition). An array of an array is a 2-dimensional array, i.e. matrix. name type: the type of data is stored, and all have to be the same within one array size: it is fixed once it is defined, and cannot be changed once created Access: random access use index as all elements are...

Topological sort and cycle detection

Problem Topological sort DFS Time complexity Khan’s algorithm DFS vs. Khan’s for topological sort Cycle detection Compare with Floyd’s algorithm for topological sort Appendix Know your library graphlib.TopologicalSorter networkX Reference Problem Given a directed hierchical structure, how do we flatten it to an array so that the directions are preserved? Recall my problem of trying to fly to Beijing, China, while Zero-Covid policy is going on in China, say I have combed through many many flights (edges) and stops (nodes) and have come up with the following possible path.How do...

System design basics

What is system design Meta data systems Google data systems Large bank data systems Scalability, reliability and maintainability Scalability Reliability Availability Reliability vs availability Maintainability or managebility Efficiency ReferenceWhat is system designNo places that talks about system design bothers to give a definition of what is system design. But guessing from what I have read, it seems that system design means how to design big data web-based data systems and applications such as Facebook, Instagram, Google, Twitter, and so on. Before diving into data system buzzword, let’s see take a...

Cheapest flight with at most k stops

My problemWhat is the cheapest total price from one place to another with at most k stops? The flights to China now are very expensive, 10 times of pre-Covid prices. Not only they are very expensive, but also the length of time and number of stops are horendous. Before Covid the non-stop flight to China was a little over 13 hours. Now, a $20,000 economy ticket with 5 stops with total travel time of over 50 hours is not uncommon. This’s a practical problem, not just a puzzle.Before quoting any...

Cycle detection

Problem statement: Space \(O(n)\) solution Space \(O(1)\) solution: Floyd’s tortoise-hare Problem statement: Given a linked list, if there is a cycle, return the start of the cycle. The start of cycle is defined as the place where 2 nodes meet.If there is no cycle, return -1.Space \(O(n)\) solutionThe natural solution of finding if there is a cycle is to go down the list using .next and store each visited in a hashtable or set. A cycle exists if and only if we reach a node that is already visited.Space \(O(1)\)...

Word Ladder

Quote from Wikipedia “Word ladder (also known asdoublets, word-links, change-the-word puzzles, paragrams, laddergrams, or word golf) is a word game invented by Lewis Carroll in the 19th centuryAnd from tears to smile:tears → sears → spars → spare → spire → spile → smileProblemThe problem is Leetcode 127. Word Ladder.Given two words, beginWord and endWord, and a wordList, return the number of words in the shortest word ladder from beginWord to endWord, or 0 if no such sequence exists.Example 1:Input: beginWord = “hit”, endWord = “cog”, wordList = [“hot”,”dot”,”dog”,”lot”,”log”,”cog”]Output: 5Example...

array puzzles comparisons

Introduction Problems that ask for permutation or rearrangement 1. Leetcode 1389 create a target array in given order. 2. Leetcode 1920. Build Array from Permutation 4. Passing Yearbooks - number of permutations Problems that ask for a number Leetcode 1480. Running Sum of 1d Array Leetcode 2011. Final Value of Variable After Performing Operations Leetcode 1672. Richest Customer Wealth- Max sum of subarrays Problems that ask for returning index Two Sum IntroductionMany array-based puzzles ask for some rearrangment of the input array. The key to solve them is to read...

Morse code

Quote from Wikipedia “Morse code is a method used in telecommunication to encode text characters as standardized sequences of two different signal durations, called dots and dashes, or dits and dahs. Morse code is named after Samuel Morse, one of the inventors of the telegraph.”. By the way, Samuel Morse was an American artist.Problem: Unique Morse Code WordsInternational Morse Code defines a standard encoding where each letter is mapped to a series of dots and dashes, as follows:‘a’ maps to “.-“,‘b’ maps to “-…”,‘c’ maps to “-.-.”, and so on.For...

Know yourself

The behavioral portion of the interview allows the team to better understand how you work and what motivates you professionally.Know yourself. Take the time to review your own resume as your interviewer will almost certainly ask questions about key events in your work history.Be honest. Not every project is a runaway success and we may not always interact perfectly with our peers. Being transparent in these situations won’t be counted against you in the interview. In fact, sharing and discussing how you learned, improved and grew from your past experiences...

Dynamic programming

What is dynamic programming Problem with naive recursion How dynamic programming solves the problem Top-down DP Bottom-up DP Run time Subarray with the greatest sum Number of subarrays Brute force method Memorizing method (DP) Shortest path Recursive thinking Bottom-up thinking Reference Dynamic programming (DP) is used to eliminate overlapping sub-problems in recursions.In comparison with recursion, it reduces time complexity from exponential to polynomial. But increases space complexity.What is dynamic programmingDynamic programming is both a mathematical optimization method and a computer programming method. The method was developed by Richard Bellman in...

Bit-wise operations

What is bit Binary vs decimal number system Bit-wise operations and or xor not shifting Examples count bits Count odd number occurance Reverse integer hex to RGB Use in set operations ReferenceBit-wise operations should be taken very literally: bit-wise. We talked about element-wise operations. Here we talk about bit-wise operations. Set a bit (where n is the bit number): unsigned char a |= (1 « n); It sets the nth bit to 1. If n is 1, then it sets the second digit to 1. 4 1«1 is 6. Because 4...

Dijkstra algorithm shortest path between

IntroductionObviously to be able to find the shortest distance or cost between any two nodes in a graph is a very useful thing.Dijkstra’s algorithm to find the shortest path between two points. It picks the unvisited vertex with the lowest distance, calculates the distance through it to each unvisited neighbor, and updates the neighbor’s distance if smaller.I find it more intuitive if I think in terms of driving from one place to another. I look at the map, and compare the different routes in distance and traffic (plus toll). I...

Heap sort

Introduction Max heap data structrue Heapsort 堆排序 Cost The heapq module Sort by the nth element Customize the dunder method lt convert the original list of lists into a list of tuples. use heapq for max heap k largest problems k longest strings in a stream When to use heap Appendix: Implement max heap in Python Introduction This post is about the heap data structure and heap sort.Max heap data structrue The heap data structure is a fusion of array and tree.It has \(log(n)\) push and pop, and constant time...

Minimum spanning tree

“To see a World in a Grain of SandAnd a Heaven in a Wild FlowerHold Infinity in the palm of your hand”-Willam Blake Introduction scipy.sparse.csgraph.minimum_spanning_tree Greedy algorithm Prim’s algorithm Keep track of the tree Appendix Customize less than Use cases MST with pygame Maybe off topic Reference book Videos Introduction A tree is a connected, acyclic (no cycles) graph.Minimum spanning tree is a subset of the edges of a connected, edge-weighted graph that connects all the nodes together, without any cycles and with the minimum possible total edge weight. Spanning...

DFS

Introduction DFS Recursive Iterative SWPIA DFS edge classfication DFS traversals Topological sort Compare iterative DFS and BFS Know your library [graphlib] source code IntroductionIn the figure above, if we follow numbers 1, 2, through 12 in the graph below, we complete a depth-first search (DFS) traversal. It goes as deep as possible before bouncing back to span the left sub-tree, and then to the next child of the root node and plunge all the way again, and then bounce back to span the right sub-tree.DFSRecursiveIt is quite instructive to read...

BFS

Introduction BFS QWPIA Complexity Compare iterative DFS and BFS Appendix NetworkX IntroductionBreadth-first search (BFS) is one of my favorite computer algorithms. It is also one of the simpliest algorithms for searching a graph. Prim’s minimum spanning tree and Dijkstra’s single-source shortest-path use ideas similar to BFS.A BFS starts at some arbitrary node and explores its neighbors first before moving to the next level of neighbors, in a layer by layer fashion.BFS is a way to search graph broadly. The algorithm discovers all nodes at distance $$k$& from source node before...

Common problems solved by graphs

Introduction Problems Shortest path problem Negative cycles Strongly connected components Traveling salesman problem (TSP) Minimum spanning tree (MST) Max flow (MST) Bridge Algorithms DFS DFS edge classfication Topological sort BFS Compare iterative DFS and BFS IntroductionGraphs, or networks, are useful for representing relationships. In this post, we learn common problems that are solved using graph theory, and a few of the most well-known algorithms.ProblemsShortest path problemGiven a weighted graph, find the shorted path from node A to node B with minimal cost (or weight).For example, Amazon delivery routes where heavy...

Graphs and networks

Introduction Types of graphs Representations Mathematical set representation adjacency list representation Adjacency matrix representation Convert set representation to adjacency matrix Convert matrix to adjacency list Directed graphs Comparisons of representations Implementation undirected graph IntroductionGraphs, or networks, are useful for representing relationships from social network,maps, the internet, and the universe. In this post, we start with the humble beginning of graph theory from the problem of proving it is not possible to cross each of the 7 bridges of Konigsberg once and only once, and go over its three presentations, and...

tuples reviewed

tuples tuples are like lists Use with dictionary Return a tuple of values in a function namedtupletuplesSome notes below are adapted from the book Think Python.tuples are like listsTuples are like lists, except tuples are immutable, and they don’t use square brackets.Note that tuples are immutable meaning that you cannot modify it by assignment such as x[0] = 100. But you can add elements to it just like you do to a list. Note that it is like list means that you can do things like:x = (1, 2)x +=...

Sorting algorithms 2

Introduction Quicksort Worst case time complexity Quicksort implementations 1. First shot 2. quicksort using list comprehension Preventing worst case Heapsort 堆排序 Max heap data structrue Implement max heap in Python The heapq module Bucket sort (Radix sort) Reference QuickSort Introduction This is a second post on sorting algorithms. In this post we talk about four common algorithms that have average time complexity \(O(n*log(n))\), which is a lot better than \(O(n^2)\).Most of the sorting methods in this post are by comparion, except bucket sort.Volume 3 of Don Knuth’s book series “The...

Sorting algorithms-1

Introduction Insertion sort Select sort Compare Select sort and Insertion sort Bubble sort Introduction Sorting algorithms are very important. In the Women in Mathematics summer program that I participated in Washington DC 2006 during my college junior year, the NSA director of mathematics came and gave us (16 women math college students from around the country) a talk on sorting algorithms. The fact that he chose to talk about sorting made me realize it is importance. When things are sorted, you can find what your target much faster.The binary search...

hash table

Hash tables can be seen as indexed arrays via hash functions. Hash tables are slightly fancier version of arrays. hash table the basics hash function one-to-one or many-to-one Deterministic hashable Hash tables in Python defaultdict Counter set Hash table examples Find anagram If letter can be constructed hash table the basicsA hash table is a data structure that is like a Python dictionary. It is designed to store keys (and optionally) with associated values in an array.A key is stored in the array locations based on its hash code, which...

Learn recursion

What is recursion Recursion is like brocolli (or cauliflower).Recursion is a function that is defined with itself. \(f(n) = \text{some combination of }f(n-1)\). What does that supposed to mean?The simplest example I think is the natural number sequence \(1, 2, 3, 4, 5, 6, ...\)An important tip: don’t be intimated by recursions. Insert as many print statements as you need to see what the program is doing.Natural numberThe natural numbers can be expressed as:\(f(n)=f(n-1)+1\), with the initial condition that the first number is \(0\) (when \(n=0\)).The initial values must be...

binary search

By halfing the search range each time, the binary search very quickly zoom in on the search range. Binary search the basics Genearalize the binary search The bisect library Variations of binary search Binary search trivial problems First and last position in sorted array Position or insertion position Check if an integer is a perfect square Binary search the basicsGiven a sorted array of integers \(A\), we want to find a number, which we call “target”, or \(x\) The code should return the index of target, and return \(-1\) if...

binary search trees

Binary search tree basics Keeping tree balanced Binary search trees uses Tree traversing/walking Depth first search (DFS) Libraries Binary search tree implementation in Python Binary search tree basicsA Binary search tree (often abbreviated as “BST”) is an abstract data structure. It is also called an ordered/sorted binary tree.The general property of the BST is: at each node, it is greater or equal to its left child but is less or equal to its right child. It can be summarized in graph below:Its setup is very intuitive from the mathematical perspective:...

Binary tree implementation

Tree Traversal Tree height breadth First Traversal Count number of leave nodes Binary trees and arrays Root to Leaf Paths Tree traversal: Implementing tree traversals: ReferenceTo implement binary tree, we only need to define the node itself, its value, and left and right attributes. We can define methods after this.code treeNode.pyclass Node(object): def __init__(self, val = None): self.left = None self.right = None self.val = valroot = Node(1)root.left = Node(2)root.right = Node(3)root.left.left = Node(4)root.left.right = Node(5)root.right.left = Node(6)root.right.right = Node(7)''' for visualization only 1 / \ 2 3 / \...

Tree traversal

Tree Traversal Tree representing arithmetic expression Implementing tree traversals: Preorder (DFS) breadth First Traversal ReferenceTree TraversalFrom Wikipedia: Unlike linear data structures (Array, Linked List, Queues, Stacks, etc) which have only one logical way to traverse them, trees can be traversed in different ways. The following are the generally used ways for traversing trees. Inorder 中序 (left, data, right) Preorder 前序 (data, left, right) Postorder 後序 (left, right, data)But that’s not exactly what it is. The left means recursively left. The right means recursively right.Inorder: 4, 2, 5, 1, 6, 3,...

Trees

Applications of trees Tree storage Binary Trees Types of binary trees: Binary Search Trees (BST) Rooted tress ReferenceA tree is a data structure similar to a linked list but instead of each node pointing simply to the next node in a linear fashion, each node points to a number of nodes. Tree is an example of non-linear data structures. A tree structure is a way of representing the hierarchical nature of a structure in a graphical form.A tree is an undirected graph without any cycles. Each tree has exactly \(n\)...

Palindrome

Quote from Wikipedia “In computer science, the longest palindromic substring or longest symmetric factor problem is the problem of finding a maximum-length contiguous substring of a given string that is also a palindrome. For example, the longest palindromic substring of “bananas” is “anana”.Given a string, we want to find the longest palindrome from the string. Brute force Clever ways Remove symbols and white spaces reverse vs. reversed()Brute forceWhat came to my mind first is a brute force method.The method is like solving a very simple math problem. You first understand...

Merge 2 Sorted Lists

ProblemThe problem is Leetcode 21. Merge Two Sorted Lists.You are given the heads of two sorted linked lists list1 and list2.Merge the two lists in a one sorted list. The list should be made by splicing together the nodes of the first two lists.Return the head of the merged linked list.合併兩個已排序之連結串列(Linked List)l1 、 l2 並回傳一個新的已排序串列。新的串列應當由兩個串列之節點拼接在一起而形成。範例:輸入: 1->2->4, 1->3->4輸出: 1->1->2->3->4->4Brute Force Stupid MethodThe solution is from a Leetcode user. It goes out of the way to do lots of extra work and it did not take any advantage of the fact that...

Linked list

Arrays are stored in contiguous blocks of memory in a compuater. In C, array sizes are fixed. This creates the need to have a data structure that allows data being stored in different places of a computer and are connected via pointers. That is linked list.Linked lists were developed in 1955–1956, by Allen Newell, Cliff Shaw and Herbert A. Simon at RAND Corporation as the primary data structure for their Information Processing Language. IPL was used by the authors to develop several early artificial intelligence programs.A linked list is a...

O

In working in financial services, the goal was to get insight, build models, and predict/forecast.However, in the machine learning era, the goals and capacities of analytics have far expanded beyond statistics and modeling. Modelers may need to implement the models themselves. In order to implement efficient models, it is necessary to understand computer science fundamentals. This post is about how we measure efficiency. The Big O metric Time complexity Space complexity Example: in-place sort integer listThe Big O metricIn order to measure run time efficiency, the big \(O\) metric is...

Basic data structures

Array Stacks and queues Linked List Graphs Tree ReferenceIn the machine learning and artificial intelligence era, it is necessary to understand basic abstract data structures, and why we need them.Data structures are defined and built according to varied needs of different data types and algorithms.ArrayArrays are lists of similar data. An array of an array is a 2-dimensional array, i.e. matrix. name type: the type of data is stored, and all have to be the same within one array size: it is fixed once it is defined, and cannot be...

String operations

Leecode 2114. Max number of words in sentences 1528. Rearragne String Convert string to integer Convert integer to string Appendix Reference Future reading listLeecode 2114. Max number of words in sentences Problem:Given a list of sentences, find the maximum number of words of all the sentences. Constraints:all lower case and no white space Solution 1: Each sentence is an array of words separated with a space. The number of words is the number of space + 1. Complexity: def MaxWords1(sentences): # i represents a sentence in the list of sentences...

write well

Beautiful is better than ugly.Explicit is better than implicit.Simple is better than complex.Complex is better than complicated.Sometimes the principles are conflicting: explicit names can be long and make code complicated, and ugly looking. So we each have to decide on the right balance.Write well (consistently) to avoid mistakesIt is important to be consistent. There are many ways to name objects. There are no right or wrong ways. But being consistent has many benefits.One mistake that I sometimes make is confusing myself by names for index and parameters, I decide to...

Tips on Learning

How to work from the time & place dimensionsTo me, there are three modes of work: Solitude: anyting that requires quiet thinking, such as reading, coding. Work with one or two partners. Best explained by All I Really Need to Know about Pair Programming I Learned In Kindergarten Work with a team and meetingsFor solitude work, it is best when it is dark outside when everyone is asleep. Since staying up at night will result in bad long term health, getting up before 4:30 am is the only option. The...

Learn by Doing

Children are smart and tough. They don’t need adults driving them to kids’ golf, yoga, or ballet all week long. And they surely know what are the most important things in life: the difference between needs and wants, be thankful, respectful, loving and helpful.How helpful and capable are they? They can assemble treadmill, computers and furniture. They enjoy the action more than Lego or computer games because of the usefulness in the accomplishments.The ability to think on one’s own and the toughness to try to learn to figure things out...

other risks

Basics in financial statements for non-FIs

This post covers the very basics of the three main financial statements and their connections.Why do I write this post? I am writing the post because I need to use the financials in climate transition risk modeling. Climate transition risk does not work like CCAR models where you model the relationship of macroeconomic variables such as GDP, housing price index, unemployment rate with default rate, LGD, EAD using same time or up to a few quarters’ lag data. As climate risk and opportunities are more strategic, the scenarios are quite...

Analyzing risk profile of banks

Double leverageRefco,very succesffully, 3 months collaped. 4 billion dollar hole on the BS. At the time they had 17X double leverage. There was no real capital in it.Cannot measure CME. Parent company does not need to meet capital adequacy requirements. If it were subject to Basel, it will not be allowed to do it.It is a problem for both holdco and sub creditors.Borrow at the parent, then shuffle down the subs as equity.Double leverage ratio = investment in subs/ equity of parentHoldco debt serves as contingent capital for subs.According to...

Inflation Data Analysis

Too much inflation can cause many problems, and can even bring down a society/nation.Milton Frieman said that inflation is a disease, and that inflation is always everywhere a monetary phenomena, the result of too much money. It is more complicated than just about money. Other factors such as innovation, natural disaster, forces of nature, wars, and so on can all play important roles.We had lived in a world with low inflation in the US coming out of a decade of less than 2.5% YoY inflation rate. China inflation rate is...

Agent Based Model

Why Agent Based Model Deterministic approach Non-deterministic approach Agent Based ModelWhy Agent Based ModelSuppose we have an investment fund currently valued at $1,000,000, 100% invested in the S&P 500 ETF.We plan to take out the money in 30 years. How much do we expect to have in the account at that time?Deterministic approachThe deterministc approach is the simplest (also unrealistic). We assume an annual rate of return of 8% for the next 30 years.codedeterministic_value.pypv = 1000000i = 0.08time_horizon = 30balance_t = 0print("{:10s} {:15s} ".format("Year", "Ending Balance"))print("-"*25)for yr in range(1, time_horizon...

Accounting Treatments and Reclassification

Accounting classifications of bank assets Why do banks own securities besides loans Why banks own bonds in their asset Reclassifying Unanswered questions ReferenceAccounting is a complicated topic. This post tries to summarize the types of accounting classifications on bank assets and impacts on reclassification.Images of this post are from New York Fed Quarterly Trends for Consolidated U.S. Banking Organizations, based on onsolidated financial statistics for the U.S. commercial banking industry, including both bank holding companies (BHCs) and banks. Statistics are based on quarterly regulatory filings. Statistics are inclusive of BHCs’...

CAP theorem

The CAP (Consistency, Availability and Partition Tolerance) Theorem states that a distributed system cannot be strictly consistent, highly available and fault tolerant at the same time. The system designers MUST choose at most two out of three guarantees in the system.ConsistencyConsistency means all users have the same view of data at any given time. If there are multiple replicas and there is an update being processed, all users see the update go live at the same time even if they are reading from different replicas. Systems that do not guarantee...

A Study on Bank Balance sheet composition

This post is a study on the bank balance sheet composition based on “Quarterly Trends for Consolidated U.S. Banking Organizations, First Quarter 2022” published by the Federal Reserve Bank of New York. The study can be used for benchmark for your own bank, and for study of what is happening to the banking industry as a whole and a glimpse of the overall US economy.Consolidated financial statistics for the U.S. commercial banking industry, including both bank holding companies (BHCs) and banks. Statistics are based on quarterly regulatory filings. Statistics are...

Rates macros and inflation data analysis June 30 2022

Introduction Money Suppy M2 Money velocity Oil Global commodities Monthly data Price CPI PPI cpi and ppi wti and commodities and cpi HPI HPI historical extremes RATES Fed funds rate 10-year treasury note yield IntroductionWhen big changes that affect the world happen, risk (uncertainties) rise.Both rising rates and inflation can impact businesses and people’s lives, as well as countries.Without referring to any specific hyperinflation country current or in history, it is worth nothing that one of the worst inflation that affected the US and most of the world is the...

Regulation W

What is Regulation WRegulation W, also known as “Affiliate Transactions” is a U.S. FRB regulation. Its purpose is to help ensure the safty and soundness of the U.S. banking system by prohibiting (transfering bad assets from one subsidiary to another, lending to affiliate without collateral) and/or limiting certain transactions between depository institutions, such as a bank and their affiliates to prevent abuse from its own non-bank affiliates.In my view, the purpose is to prevent conflict of interest between client side of business and proprietary side (corporate) side of business.Similar concepts...

US Government Interest Rates

Under this mighty topic of interest rate, in this post I explore interest rates from 1970s and try to understand them in more details in exact dates and actions in history.One of the motivation is to get a bit of crystal ball on inflation and recession. If I have longer data, I would like to study on wars as well.We had lived in a world with low inflation in the US coming out of a decade of less than 2.5% YoY inflation rate. China inflation rate is less than 4%,...

Inflation Data Analysis

Too much inflation can cause many problems, and can even bring down a society/nation.Milton Frieman said that inflation is a disease, and that inflation is always everywhere a monetary phenomena, the result of too much money. It is more complicated than just about money. Other factors such as innovation, natural disaster, forces of nature, wars, and so on can all play important roles.We had lived in a world with low inflation in the US coming out of a decade of less than 2.5% YoY inflation rate. China inflation rate is...

Enterprise Risk Management

Enterprise Risk Management was a key part of actuarial fellowship exam years ago when I took my fellowship exams.What differentiate enterprise risk management from non-ERM is the following:ERM perspective is at the company level, outside of individual units. If we may compare a company as a tree (a big company a big tree), ERM looks at the tree as a whole, including the leaves, branches, the trunk, and the root system. Because of this perspective, you may see problems that individual business units do not.Communicating and coordinating between different business...

Mobile Payment Risk

Mobile wallet 移动钱包 are digital forms of wallet that people carry (or used to carry) in their pockets.As we do not tend to carry large amounts of money in wallets, mobile wallets are convenient for small payments (as opposed to payments in larger businesses).They hold digital information about payments including credit and/or debit card, bank account, pre-paid card, virtual currency information, coupons and loyalty membership, and wallet holder identifications.A mobile wallet is a software application (app) that does the following: Secure enrollment of the holder (application download, identification) Securely store...

machine learning

Kernels

Briefly speaking, a kernel is a shortcut that helps us do certain calculation faster which otherwise would involve computations in higher dimensional space.Mathematical definition: K(x, y) = <f(x), f(y)>. Here K is the kernel function, x, y are n dimensional inputs. f is a map from n-dimension to m-dimension space. < x,y> denotes the dot product. usually m is much larger than n.Intuition: normally calculating <f(x), f(y)> requires us to calculate f(x), f(y) first, and then do the dot product. These two computation steps can be quite expensive as they...

Embeddings

An embedding is a low-dimensional translation of a high-dimensional vector. PCA (principal component analysis) is the most well known one.Embedding is the process of converting high-dimensional data to low-dimensional data in the form of a vector in such a way that the two are semantically similar.In its literal sense, “embedding” refers to an extract (portion) of anything. Generally, embeddings improve the efficiency and usability of machine learning models and can be utilised with other types of models as well. When dealing with massive amounts of data to train, building machine...

AIs

A few years ago, then HSBC US CRO Rhydian Cox had a conversation with an AI expert. After the long talk on AI given by the expert, Rhydian, with his usual humor and smily face, said “You have so many smart people working for you. You don’t need artificial intelligence. You’ve got real intelligence.”The term AI is used very loosely to mean many things, including many traditional statistical methods and ideas. This post is a start on a few less heard of AIs.Static AITechniques manually trained offline, or whose parameters...

Xi Correlation in Python and R

When we want to know relationships between two things, by “things” we mean numerical variables, Pearson, Spearman, Kendall are commonly used correlation measures.Gini, and Gini variations also the relationship between 2 things.The \(\xi\) correlation is a candidate for detecting whether one thing is a function of the other, i.e. whether one is dependent on the other. Because it does not assume linearity,it can detect non-monotonic functional relationships when other correlation measures fail, such as a sine or cosine.The \(\xi\) correlation is actually very simple. In my view, is a variation...

A Brief Adventure in Stocks

Compare Stocks in Two SectorsWe often need to compare stock performances within and across sectors. In the following example, we plot six major stocks from technology and energy sectors. The top ones are the energy stocks, representing Exxon, BP and Chevron. The bottom panel contains the technology stocks, representing Apple, Amazon and Google.codeCompare Multiple Stocks in Two Sectors.py tickers = ["XOM","BP", "CVX","AAPL","AMZN","GOOG"] stocks = [pdr.get_data_yahoo(i, start=start, end=date.today())['Adj Close'].rename(i) for i in tickers] stocks_df = pd.DataFrame(stocks).T tickers= np.array(tickers).reshape(2,3) fig, ax = plt.subplots(2,3, figsize=(12,6)) for i in range(2): for j in range(3):...

Z Score and Standardization

Z-scores are linearly transformed data values having a mean of zero and a standard deviation of 1.Their names are related to the standard normal tables. Remember, less than 60 years ago, that’s all the technology that 99.9% of mathematicians, scientists and statisticians had.They are scores with a common standard. This standard is a mean of zero and a standard deviation of 1.Z-scores measure the distance of a data point from the mean in terms of the standard deviation, and retains the shape properties of the original data set (i.e. same...

Interpretable AI XAI

There is a trade off between accuracy and interpretability. High accuracy models have low interpretability and potential problems. Explainable AI (XAI) is to have your cake and eat it too. In linear regression, there has been well-established theory and diagnostics on how a model works, such as the model statement y = a + bX, confidence interval, p-value (assuming x is not relevant what’s the chance of having this kind of relationship to y) and etc.In linear regression and situations where linear regression are used (in neural network as well)...

Linear Regression II Categorical Data Preparation

Most datasets outside of image recognition have categorical variables. In linear regression, they are often converted to dummy variables.In SAS, PROC GLM, PROC GENMOD, PROC GLMMIX and other regression related procedures handle categorical variables are by the CLASS statement, which implicitly converts them into dummies. Only when we use PRO REG that we would have to explicitly recode them ourselves.As the mathematics behind linear regression is linear algebra, categorical variables are generally converted to dummy variables.For non-regularized linear regression, perfect multicollinearity must be avoided. That’s why categorical column is encoded...

Linear Regression I - OLS

In many business contexts, models not only need to be reasonable accurate but also must be interpretable and intuitive.Linear models are sometimes preferred to more complex models, or at the minimum used as benchmark, for its strength in interpretability and reasonable performance.In fact, linear regression is at the core of machine learning model interpretability.There are many variations of linear regression models: ordinary least squares (aka “OLS”) linear regression, weighted least square regression, regularized OLS linear regression fitted by minimizing a penalized version of the least squares cost function, generalized linear...

T Tests - Signal to Noise

Century old t-tests formulated to ensure beer quality was formally adopted in statistics and, now, AB testing The t-test was developed in early 1900s as an economical (small samples) way to check for differences in mean quality of batches of Guinness beer that were small in sample sizes. William Gosset , the Head Brewer of Guinness and pioneer of modern statistics empirically, by trial and error, found a formula for a t-distributed random variable.Gosset was a friend of both Karl Pearson and Ronald Fisher.It is called the t-test because the...

market risk

Why commodity prices rise and fall

Commodity prices are driven by supply and demand, of course. What drives supply and demand then? And what else impact commodity prices? Weather, climate change, geopolitical events, technology, behavior change, regulation, just to name a few. However, it is important to keep in mind that markets have very quick ways to adjust itself. A shortage can quickly turn into an oversupply.Inflation and deflationInflation and deflation affect prices of almost all goods, including commodities. Take oil for example, oil prices are higher in an inflationary enviroment since dollar is the currency...

Financial concepts revisit

Languages we use can help or hinder our thinking.When we are working in different worlds, we need to simplify the words used without 1 million jargons. We simplily cannot work with 1 million jargons in banking,1 million jargons in derivatives, 1 million jargons in insurance,1 million jargons in computer algorithm, and another million of jargons from statistics and machine learning. Oh, what a mouthful!Euler was able to make leaps and bounds in mathematics to a certain degree because of his use of symbols and simple notations.Mathematics relies on simplifications.BetsMost financial...

Why Treasury yield rise

Why 10 year Treasury yield riseTen year Treasury (aka ‘T10’) is one of the most watched metric for financial institutions. It has been rising lately after a long period of staying very low.Yield rise is because prices are down.Prices are down because demand has droppoed, and more are selling than buying.Why are people (and ‘people’) selling bonds(T10)?Patrick Pancoast from Intuition Learning explains the reason.We all know that inflation has been rising.People look at inflation, and they anticipate the Fed will be raising rates (and the Fed has been).They do not...

Introdution to Counterparty Credit Risk

What is counterparty credit riskCredit risk from settlementSettlement exposureTradingPayment comes after we agree the price.Pre-settlement exposureDerivative. The counterparty defaults, and owes the bank (us) money from derivatives.Client is obligated to give us 100 MM pounds, we are obligated to giveIn reality, the hedge is not a CP. It is my entire book ofTo hedge each CP is inefficient.Suppose sometime during the 6 months, client goes bankcrupt. What happens if fugure forward price is higher, lower?If we have lots of lots deals, we will talk about netting agreements. In this case,...

SOFR vs LIBOR

For decades, LIBOR (London interbank offered rate), the cost of large global banks can borrow from each other on wholesale markets, was the de facto benchmark for floating-rate borrowing (LIBOR + spread) around the world. It’s calculated from a daily survey of more than 15 large banks that estimate the price to borrow from each other without collateral.I will skip the reasons why LIBOR is being replaced. The reason should be apparent in the comparsion table.From New York Fed’s A User’s Guide to SOFR by The Alternative Reference Rates CommitteeSOFR...

Introdution to Market Risk 市场风险

What is market riskThe Federal Reserve defines market risk as “Market risk encompasses the risk of financial loss resulting from movements in market prices.”The first factor listed by the Fed is “The sensitivity of the financial institution’s earnings or the economic value of its capital to adverse changes in interest rates 利率风险, foreign exchanges rates 汇率风险, commodity prices, or equity prices 股市风险.” A more appropriate name should be “financial market risk”. In more plain terms: the risk that financial assets may lose value.Think of it this way, if you are...

climate risk

CBAM

Context: EU Emission RegulationThe EU is in the forefront on climate regulation. Its home page has a ticking countdown clock to 2050 Net Zero. It has the European Climate Law, which writes into law the goal set out in the European Green Deal for Europe’s economy and society to become climate-neutral by 2050. The law also sets the intermediate target of reducing net greenhouse gas emissions by at least 55% by 2030, compared to 1990 levels.The EU ETS is a cornerstone of the EU’s policy to combat climate change and...

High Tunnel 1

Lately I am looking into high tunnels (高隧道). High tunnel is similar to greenhouse, but it is not greenhouses or low tunnel systems. 可避免如強風、豪雨、冰雹、雪霜與乾旱等極端氣候現象造成的作物災損。此外,其優勢在於可拉長作物的產季、提高精品作物的生產量、延長貨架壽命(shelf life)等,以滿足廣大的高端消費市場。Tips on Using High Tunnel from an NRCS Field OfficerBelow tips are from our wonderful field officer, who had practiced farming growing up working with his mom in his grandmother’s farm. Till & level. Learn how to level. Remove weed/brush roots, especially those rhyzones that look like thick and white veins. Build high tunnel structure. Mow around it or use landscape fabric for weed control. Wood...

Net Zero: Costs and Opportunities

GHG emission has been increasing in a very fast pace since the industrial revolution. Net Zero means the steady state that GHG in atmosphere is no longer increasing. In other words, emission released and emission absorbed cancel out each other.The focus of GHG is CO2. CO2 in atmosphere is like a blanket, keeping the Earth warm. When CO2 increases, the temperature increases. Most of the energy on the Earth is absorbed by waters (oceans, lakes and rivers). When temperature warms up, the waters expand in volume. Melting of the ice...