Work in Progress.
Load data that comes with IDE or libraries
Loading data that comes with the IDE or with libraries lets me test code and prototype quickly.
Loading pre-packaged data into Python
Working with Python, I use the IPython shell and VSCode. Neither comes with any datasets. To load some well-known datasets quickly, I first need to import one of the libraries that ship with them.
# variable and target together
In [1]: import seaborn as sns
...: iris = sns.load_dataset("iris")
...: iris.head()
sepal_length sepal_width petal_length petal_width species
0 5.100 3.500 1.400 0.200 setosa
1 4.900 3.000 1.400 0.200 setosa
2 4.700 3.200 1.300 0.200 setosa
3 4.600 3.100 1.500 0.200 setosa
4 5.000 3.600 1.400 0.200 setosa
# as an array
In [2]: from sklearn.datasets import load_iris
...: data = load_iris()
...: data.target[[10, 25, 50]]
...:
...: list(data.target_names)
...:
Out[2]: ['setosa', 'versicolor', 'virginica']
# as a pandas DataFrame
In [5]: X,y=load_iris(return_X_y=True, as_frame=True)
In [6]: type(X)
Out[6]: pandas.core.frame.DataFrame
In [7]: X.head()
Out[7]:
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)
0 5.100 3.500 1.400 0.200
1 4.900 3.000 1.400 0.200
2 4.700 3.200 1.300 0.200
3 4.600 3.100 1.500 0.200
4 5.000 3.600 1.400 0.200
Load pre-packaged data into R
data() can be used to load any dataset that comes with R's built-in datasets package (available in RStudio or any R session). This makes testing code fast and easy. For example, data("mtcars") loads the mtcars dataset, and data("iris") loads the iris dataset.
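A minimal sketch in an R session (the datasets package is attached by default):
data()           # list the datasets available in attached packages
data("mtcars")   # load mtcars into the workspace
head(mtcars)     # inspect the first six rows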
Load SASHelp data
There are many well-known datasets that come with SAS. See [Sashelp Data Sets - SAS Support](https://support.sas.com/documentation/tools/sashelpug.pdf). Referencing data=sashelp.iris in a procedure automatically loads the data.
proc contents data=sashelp.iris varnum;
ods select position;
run;
title "The First Five Observations";
proc print data=sashelp.iris(obs=5) noobs;
run;
External file
Import external files into Python
Overview
There are many ways to import data into Python. Below is an overview; it is by no means exhaustive:
- Python built-in functions read(), readline(), and readlines() for small files.
- The csv library, which I have never used.
- The pandas library, which I use all the time.
- dask.dataframe splits a big pandas DataFrame into many smaller ones along the index.
- datatable for big 2D data frames.
read() with open()
The basic syntax is open('file', 'mode'). To use the read() function, the file first needs to be opened by calling the built-in open() function, which takes two arguments: the path to the file and an optional mode such as 'r' (for reading) or 'w' (for writing).
with open('df4.csv', 'r') as reader:
    # Read & print the entire file
    print(reader.read())
# ,event,trial,freq
# 0,no,control,0
# 0,no,control,0
# ...
# 0,no,control,0
with open('df4.csv', 'r') as reader:
    # Read & print the first line
    print(reader.readline())
# Out: ,event,trial,freq
A file opened with a `with` block is closed automatically when the block exits; an explicit close() is only needed when the file is opened without `with`, as in the examples below.
f = open('new.txt', 'w')  # write mode; creates the file if it does not exist
f.write('hello')
f.close()
f=open('new.txt','r')
f.read()
Out: 'hello'
f=open("new.txt", "a+")^M
for i in range(2):^M
f.write("\nAppended line %d" % (i+1))
f.close()
f=open("new.txt","r")
print(f.read())
# hello
# Appended line 1
# Appended line 2
pandas.read_
Most of the time, I use the pandas library to import data. See the pandas Input/output documentation for a comprehensive list of functions covering a wide variety of data formats. read_csv and the other pandas read functions take many parameters; those that I have used are listed below:
pandas.read_csv(filepath, index_col=None, usecols=None, dtype=None, engine=None, skipinitialspace=False, skiprows=None, skipfooter=0, nrows=None, na_values=None, keep_default_na=True, skip_blank_lines=True, parse_dates=False, infer_datetime_format=False, keep_date_col=False, date_parser=None, dayfirst=False, thousands=None, decimal='.', encoding=None, low_memory=True, float_precision=None)
pandas.read_excel(io, sheet_name=0, header=0, names=None, index_col=None, usecols=None, dtype=None, true_values=None, false_values=None, skiprows=None, nrows=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False, parse_dates=False, date_parser=None, thousands=None, comment=None, skipfooter=0, convert_float=None, mangle_dupe_cols=True, storage_options=None)
pandas.read_sas(filepath_or_buffer, format=None, index=None, encoding=None, chunksize=None, iterator=False)
The first example below reads one record out of every 100 rows. We can change n for a higher or lower sampling rate. rename(columns=lambda x: x.strip()) removes leading and trailing blanks from the column names.
# Large file sample of every 100 rows
keep = ['CUST_ID', 'ZIP_CODE', 'CUST_NAME', 'Country','Segment1','Segment2','DATE','LGD']
n=100
per_raw_head = pd.read_csv(inputFolder+'/filename.csv', header=0, skiprows=lambda i:i%n!=0, engine='python').rename(columns=lambda x: x.strip())
list(per_raw_head)
# full data
df = pd.read_csv('./data/filename.csv', low_memory=False, usecols=keep, encoding='latin-1').rename(columns=lambda x: x.strip())
# SAS file
df0 = pd.read_sas(inputFolder+'/filename.sas7bdat', encoding='latin-1')
Reading larger files in Python
Python mostly works with data in memory. To work with large files, we need to split them up one way or another: divide and conquer.
Using the chunksize option, pandas reads the file in pieces, and we can reunite the pieces after they are imported. Note that concatenating all chunks still materializes the full DataFrame; the memory saving comes from filtering or aggregating each chunk before combining the results.
chunks = pd.read_csv(input_file, chunksize=100000)
data = pd.concat(chunks)
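If only part of each chunk is needed, filtering before concatenating is where the real saving comes from. A minimal sketch, assuming the LGD column from the earlier example and a hypothetical threshold:
import pandas as pd
chunks = pd.read_csv(input_file, chunksize=100000)
filtered = [chunk[chunk['LGD'] > 0] for chunk in chunks]   # keep only the needed rows from each chunk
data = pd.concat(filtered, ignore_index=True)              # much smaller than the full file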
dask.dataframe: A Dask DataFrame is a large parallel DataFrame composed of many smaller pandas DataFrames, split along the index. These pandas DataFrames may live on disk for larger-than-memory computing on a single machine, or on many different machines in a cluster. One Dask DataFrame operation triggers many operations on the constituent pandas DataFrames.
Advantage: most functions used with pandas can also be used with dask. See some examples here.
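A minimal dask sketch, assuming the filename.csv and LGD column from the earlier pandas example:
import dask.dataframe as dd
ddf = dd.read_csv('./data/filename.csv')   # lazily builds many pandas partitions
result = ddf['LGD'].mean().compute()       # .compute() triggers the actual computation across partitions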
datatable is made for big 2D data frames (up to about 100 GB). It supports out-of-memory datasets. The Python datatable tries to mimic R's data.table, but does not have all the functions of its R sibling yet.
The datatable.fread() method reads much like its counterpart in R. It has a concise and cleaner syntax than pandas: there is no need to type the file extension such as ".csv", nor accessors such as .iloc or .loc. But because it is extra work to remember another syntax, for now I prefer to stay with pandas unless I have to.
import datatable as dt
In [10]: df = dt.fread('iris') # don't include file extension
In [11]: df
Out[11]:
| sepal_length sepal_width petal_length petal_width species
| float64 float64 float64 float64 str32
--- + ------------ ----------- ------------ ----------- ---------
0 | 5.1 3.5 1.4 0.2 setosa
… | … … … … …
149 | 5.9 3 5.1 1.8 virginica
[150 rows x 5 columns]
In [12]: df[:2,:]
Out[12]:
| sepal_length sepal_width petal_length petal_width species
| float64 float64 float64 float64 str32
-- + ------------ ----------- ------------ ----------- -------
0 | 5.1 3.5 1.4 0.2 setosa
1 | 4.9 3 1.4 0.2 setosa
[2 rows x 5 columns]
In [14]: from datatable import f
In [18]: df[f.sepal_length>7.5,:] # cannot omit ":"
Out[18]:
| sepal_length sepal_width petal_length petal_width species
| float64 float64 float64 float64 str32
-- + ------------ ----------- ------------ ----------- ---------
0 | 7.6 3 6.6 2.1 virginica
1 | 7.7 3.8 6.7 2.2 virginica
Import external files into R
R has different dialects. It is versatile but also fragmented. I mainly use two methods:
read.table
read.table is the principal means of reading tabular data into R.
read.table(file, header = FALSE, sep = "", quote = "\"'",
dec = ".", numerals = c("allow.loss", "warn.loss", "no.loss"),
row.names, col.names, as.is = !stringsAsFactors,
na.strings = "NA", colClasses = NA, nrows = -1,
skip = 0, check.names = TRUE, fill = !blank.lines.skip,
strip.white = FALSE, blank.lines.skip = TRUE,
comment.char = "#",
allowEscapes = FALSE, flush = FALSE,
stringsAsFactors = default.stringsAsFactors(),
fileEncoding = "", encoding = "unknown", text, skipNul = FALSE)
read.table(file, header=TRUE) # the default separator sep="" matches any whitespace
read.table(file, as.is=TRUE) # as.is=TRUE prevents string values from being converted to factors
read.csv("file", ,header=TRUE) # specifically for .csv files
load("myData.rdata") # load the dataset written with save
write.table(myData, file= "c:/documents/data/myData.csv", sep=',', row.names=F)
data(x) # loads specific dataset
read_feather(path, columns=NULL) # from the feather/arrow package
write_feather(x, path)
data.table
fread stands for "fast read" and comes from the data.table package. The speed gain becomes more obvious as the data size gets larger.
library(data.table)
set.seed(1)
N <- 1000000
df <- data.frame(matrix(runif(N), nrow=N))
write.csv(df, 'df.csv', row.names=F)
system.time({df <- read.csv('df.csv')}) # time base R read.csv
system.time({df <- fread('df.csv')})    # time data.table::fread
The imported data is stored as a data.table object, which is, by inheritance, a data.frame object as well.
Conversely, a data.frame object can be converted to data.table by:
- data.table(df), or as.data.table(df)
- setDT(df) converts in place.
data.table does not store rownames. We need to separately preserve the row names.
data("mtcars")
mtcars$carname <- rownames(mtcars)
mtcars_dt <- as.data.table(mtcars)
class(mtcars_dt)
"data.table" "data.frame"
Import data in SAS
For data already in SAS format, all you have to do is use a LIBNAME statement to reference it. For other formats, PROC IMPORT imports external data. Inline data can be created with a DATA step. That covers about 99% of the cases.
PROC IMPORT DATAFILE='/folders/myfolders/DATA/LoanStats3a.csv'
OUT=df
DBMS=CSV
REPLACE;
GUESSINGROWS=50000;
RUN;
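For data already in SAS format, a minimal LIBNAME sketch (the folder path and dataset name below are placeholders):
libname mylib '/folders/myfolders/DATA';  /* point a libref at the folder holding .sas7bdat files */
proc print data=mylib.mydataset(obs=5);   /* any dataset in that folder is now available as mylib.name */
run;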
Data from SQL Servers
Pyodbc
pyodbc is an open-source Python module to which Microsoft contributes. Both the Microsoft site and GitHub have user guides.
import pyodbc
import pandas as pd  # needed for pd.read_sql below
# Some example server values are
# server = 'localhost\sqlexpress' # for a named instance
# server = 'myserver,port' # to specify an alternate port
server = 'tcp:myserver.database.windows.net'
database = 'mydb'
username = 'myusername'
password = 'mypassword'
mytablename ='mytable'
cnxn = pyodbc.connect('DRIVER={ODBC Driver 17 for SQL Server};'
                      'SERVER=' + server + ';'
                      'DATABASE=' + database + ';'
                      'UID=' + username + ';'
                      'PWD=' + password + ';'
                      'Trusted_Connection=yes;')
# note: with Trusted_Connection=yes the driver uses Windows authentication and ignores UID/PWD
cursor = cnxn.cursor()
# print columns of table, if desired
for row in cursor.columns(table=mytablename):
print(row.column_name)
sql_command = 'SELECT ' + ', '.join(mycolumns) + ' FROM ' + mytablename  # mycolumns is a list of column names to select
extract = pd.read_sql(sql_command, cnxn)
The snippet below loops through a list of database names, gets their corresponding table names, column names, and metadata, and writes the information to Excel files.
import pyodbc
import pandas as pd
dblist = ['db1',...,'dbn']
for dbdata in dblist:
    print("database: %s" % dbdata)
    print("pulling info from DB...")
    # connect to each database in turn; server and credentials as defined above
    cnxn = pyodbc.connect('DRIVER={ODBC Driver 17 for SQL Server};'
                          'SERVER=' + server + ';'
                          'DATABASE=' + dbdata + ';'
                          'UID=' + username + ';'
                          'PWD=' + password + ';')
    cursor = cnxn.cursor()
    tblist = []
    for row in cursor.tables():               # all tables in the current database
        tblist.append(row.table_name)
    clmnlist = []
    for tb in tblist:
        for row in cursor.columns(table=tb):  # column-level metadata for each table
            clmnlist.append(row)
    pd.DataFrame(clmnlist).to_excel("sql_server_%s_table_column.xlsx" % dbdata, index=False)
RODBC
There are multiple ways to get data from a SQL server in R. RODBC is one of them. Its functions are listed below:
Function | Description |
---|---|
odbcConnect(dsn, uid="", pwd="") | Open a connection to an ODBC database |
sqlFetch(channel, sqtable) | Read a table from an ODBC database into a data frame |
sqlQuery(channel, query) | Submit a query to an ODBC database and return the results |
sqlSave(channel, mydf, tablename = sqtable, append = FALSE) | Write or update (append=TRUE) a data frame to a table in the ODBC database |
sqlDrop(channel, sqtable) | Remove a table from the ODBC database |
close(channel) | Close the connection |
library(RODBC)
cnxn <- odbcConnect("mydsn")            # "mydsn" is a placeholder data source name
extract <- sqlQuery(cnxn, "select ...") # placeholder query; returns a data frame
# RODBC Example
# import 2 tables into R data frames
library(RODBC)
cnxn <-odbcConnect("mydatabase", uid="myusername", pwd="mypassword")
mytable1 <- sqlFetch(cnxn, "mytablename1")
mytable2 <- sqlQuery(cnxn, "select * from mytablename2")
close(cnxn)