This is a long overdue post.
Cleaning names
Remove suffix
Cleaning values
Financial data are often shown as strings in Python, although they are meant to be numbers.
- Remove common characters such as “$”, “,”, “()”, which is used to express indicate negative number
- Then use pd.to_numeric() .rstrip or .lstrip: Both will remove strings given from right or left, respectfully. If nothing is given, then remove white space.
Note: All the string methods used below are not in-place operations. Therefore, do not use lstrip(“$”) in example below. Because the operations are not in place, it does not see “$” in the leftmost position.
.replace(“,”, “”) by default will remove “,” no matter how many commas there are in the string. If the optional argument count is given, only the first count occurrences are replaced.
>>> ss = "($123,456)"
>>> ss.replace("(", "-").rstrip(")").replace("$", "").replace(",", "")
Out: '-123456'
>>> ss = "($123,456,7,8,,,,9)"
>>> pd.to_numeric(ss.replace("(", "-").rstrip(")").replace("$", "").replace(",", ""))
Out: -123456789
When we have a lot of data to convert, we may or may not want to do it in a hurry. We can hurry up and convert everything in one go if we know the data well. But the fact is that you should never assume you know the data well, unless you built that data.
def clean_data_1(data, col_list, string):
"""
attacking a particular string that's in your otherwise numeric data
"""
for i in col_list:
if data[i].dtype == 'object':
if any("%s"%string in value for value in data[i] ):
print("%s in %s"%(string,i))
return
else: data[col] = pd.to_numeric(data[col])
else:
continue
We can do it more thoroughly as shown below:
def clean_financials(ss):
if pd.isna(ss):
return ss
ss = str(ss).strip() # remove white space
if ss.startswidth("(") and ss.endswidth(")"): # negative number expressed as in parenthesis
ss = ss[1:-1] # remove first and last character
ss = ss.replace("$", "").replace('%', '').replace('x', ''):
ss = "-" + ss # add negative sign
else:
ss = ss.replace("$", "").replace('%', '').replace('x', ''):
return ss
def clean_fiancial_data(data, col_list):
for col in col_list:
data[col] = data[col].apply(clean_financial_data)
data[col] = pd.to_numeric(data[col]) # can add errors='coerce' to force remaining non-numbers to na
return data
We can look at another example. It creates a list of lambda functions, with the possible intention of a list of functions as [lamba x: x + 1, lambda x: x + 2, …, lambda x: x + 9]. But unfortunately, in this case, lambda function holds on to the expression of \(i\) and won’t undate it until the end. So we end up with a list of lambda functions, all of them are lamba x: x + 9
incremental = [lambda x: x + i for i in range(10)]
print(incremental[0](2))
# 11
print(incremental[-1](2))
# 11
Reference
Python FAQ Why do lambdas defined in a loop with different values all return the same result?
python accumulate official documentation
Excel lambda-the-ultimatae-excel-worksheet-function
import Levenshtein
def match_names(column1, column2): matches = {} for name1 in column1: min_distance = float(‘inf’) match = None for name2 in column2: distance = Levenshtein.distance(name1, name2) if distance < min_distance: min_distance = distance match = name2 matches[name1] = match return matches
Example usage:
column1 = [“John”, “Jane”, “Michael”] column2 = [“Jon”, “Janet”, “Michele”] matches = match_names(column1, column2)
for name1, match in matches.items(): print(f”{name1} matched with {match}”)