Cleaning financial data

data cleaning, strings | 16 March 2024

This is a long overdue post.

Cleaning names

Remove suffix

codeplot model driver.py



Cleaning values

Financial data are often shown as strings in Python, although they are meant to be numbers.

  1. Remove common characters such as “$”, “,”, “()”, which is used to express indicate negative number
  2. Then use pd.to_numeric() .rstrip or .lstrip: Both will remove strings given from right or left, respectfully. If nothing is given, then remove white space.

Note: All the string methods used below are not in-place operations. Therefore, do not use lstrip(“$”) in example below. Because the operations are not in place, it does not see “$” in the leftmost position.

.replace(“,”, “”) by default will remove “,” no matter how many commas there are in the string. If the optional argument count is given, only the first count occurrences are replaced.

codeclean values.py
>>> ss = "($123,456)"
>>> ss.replace("(", "-").rstrip(")").replace("$", "").replace(",", "")
Out: '-123456'
>>> ss = "($123,456,7,8,,,,9)"
>>> pd.to_numeric(ss.replace("(", "-").rstrip(")").replace("$", "").replace(",", ""))
Out: -123456789

When we have a lot of data to convert, we may or may not want to do it in a hurry. We can hurry up and convert everything in one go if we know the data well. But the fact is that you should never assume you know the data well, unless you built that data.

codeclean data 1.py

def clean_data_1(data, col_list, string):
    """
    attacking a particular string that's in your otherwise numeric data
    """
    for i in col_list:
        if data[i].dtype == 'object':
            if any("%s"%string in value for value in data[i] ):
                print("%s in %s"%(string,i))
            return
            else: data[col] = pd.to_numeric(data[col])
        else:
            continue

We can do it more thoroughly as shown below:

codeclean_financial_data.py
def clean_financials(ss):
    if pd.isna(ss):
        return ss
    ss = str(ss).strip() # remove white space
    if ss.startswidth("(") and ss.endswidth(")"): # negative number expressed as in parenthesis
        ss = ss[1:-1] # remove first and last character
        ss = ss.replace("$", "").replace('%', '').replace('x', ''):
        ss = "-" + ss # add negative sign
    else:
        ss = ss.replace("$", "").replace('%', '').replace('x', ''):
    return ss

def clean_fiancial_data(data, col_list):
    for col in col_list:
        data[col] = data[col].apply(clean_financial_data)
        data[col] = pd.to_numeric(data[col])  # can add errors='coerce' to force remaining non-numbers to na
    return data

We can look at another example. It creates a list of lambda functions, with the possible intention of a list of functions as [lamba x: x + 1, lambda x: x + 2, …, lambda x: x + 9]. But unfortunately, in this case, lambda function holds on to the expression of \(i\) and won’t undate it until the end. So we end up with a list of lambda functions, all of them are lamba x: x + 9

incremental = [lambda x: x + i for i in range(10)]
print(incremental[0](2))
# 11
print(incremental[-1](2))
# 11

Reference

Line plots on multiple facets

Python FAQ Why do lambdas defined in a loop with different values all return the same result?

python accumulate official documentation

Excel lambda-the-ultimatae-excel-worksheet-function

import Levenshtein

def match_names(column1, column2): matches = {} for name1 in column1: min_distance = float(‘inf’) match = None for name2 in column2: distance = Levenshtein.distance(name1, name2) if distance < min_distance: min_distance = distance match = name2 matches[name1] = match return matches

Example usage:

column1 = [“John”, “Jane”, “Michael”] column2 = [“Jon”, “Janet”, “Michele”] matches = match_names(column1, column2)

for name1, match in matches.items(): print(f”{name1} matched with {match}”)