This set of commands lays out the process I used to explore the training dataset for the project. I start by importing our trusty friends pandas, numpy and matplotlib and set some default parameters for matplotlib.
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
matplotlib.rcParams['figure.figsize'] = (12, 8)
plt.style.use('ggplot')
pd.set_option('display.max_rows', 100)
%matplotlib inline
I started by reading in the training set from the project submission repo and taking a look at it.
# Let's read in the training data set
df = pd.read_csv('~/rouest/project-submissions/data/train.csv')
# And take a basic look. Already we can see a number of object variables that may need to be converted.
df.info()
And let's just confirm that we aren't going to need to impute any values or remove any incomplete records.
df.isnull().sum()
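If we wanted that check to fail loudly rather than just eyeballing the output, we could wrap it in an assertion. A minimal sketch, using a tiny made-up frame in place of the real df:

```python
import pandas as pd

# Hypothetical mini-frame standing in for df, just to show the pattern
df = pd.DataFrame({'step': [1, 2, 3], 'amount': [10.0, 20.0, 30.0]})

# Count missing values per column, then assert the whole frame is complete
missing = df.isnull().sum()
assert missing.sum() == 0, f"Found missing values:\n{missing[missing > 0]}"
```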
Below I've listed the descriptions of each part of our data set from the project repo, indexed by position on the original read-in.
Remember that isFraud at index 9 is our target variable!
Let's see what the data actually looks like:
# Looking at the data
df.head(20)
# Since I don't see any fraudulent records in the first head() call, let's make sure they are there!
df[df.isFraud == 1].head(10)
# Let's also take a look at isFraud = 1 as a percentage of our overall data set
df[df.isFraud == 1].info()
print('The percentage of our dataset that has isFraud = 1 is %.4f percent.' % (df[df.isFraud == 1].shape[0] / df.shape[0] * 100))
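The same figure can also be read straight off value_counts with normalize=True. A small sketch with made-up data (not the project's actual fraud rate):

```python
import pandas as pd

# Toy stand-in for the real isFraud column: 2 fraud records out of 10
df = pd.DataFrame({'isFraud': [0, 0, 0, 1, 0, 0, 0, 0, 0, 1]})

# normalize=True returns proportions instead of raw counts
fraud_share = df.isFraud.value_counts(normalize=True)
print(fraud_share.loc[1] * 100)  # → 20.0
```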
In the following section, I am going to try to work methodically through each of our features to see any obvious delineations for our target variable. Let's start with step:
# So we don't have to include the logic on each call of df, let's make an isFraud = 1 df called dfraud
dfraud = df.loc[df.isFraud == 1]
Below I wanted to compare density-normalized histograms for df and dfraud to see any comparative differences. I used density = True (matplotlib's replacement for the deprecated normed argument) in order to overlay them and actually be able to see the isFraud set.
steps = df.step.plot.hist(label = 'All Records', alpha = 0.8,
range = (df.step.min(), df.step.max()), bins = 20, density = True)
dfraud.step.plot.hist(ax = steps, label = 'Fraud', alpha = 0.5, bins = 20, density = True)
steps.legend();
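The same overlay can be reproduced with pyplot directly. A self-contained sketch with made-up step values (the random data, ranges, and Agg backend here are assumptions, not the project's data):

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
all_steps = rng.integers(1, 500, size=1000)   # stand-in for df.step
fraud_steps = rng.integers(1, 500, size=50)   # stand-in for dfraud.step

# density=True rescales each histogram to integrate to 1, so the tiny
# fraud sample is visible next to the full data set
counts, edges, _ = plt.hist(all_steps, bins=20, density=True,
                            alpha=0.8, label='All Records')
plt.hist(fraud_steps, bins=20, density=True, alpha=0.5, label='Fraud')
plt.legend()
```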
Looking at the records with step > 400, it almost appears as though isFraud follows the same distribution; let's take a closer look.
# We'll start with a non-fraud dataframe
dfnot = df.loc[df.isFraud == 0]
# And then run the same plot but with the range set to dfraud.step.max()
steps2 = dfnot.step.plot.hist(label = 'Not Fraud', alpha = 0.8,
range = (dfnot.step.min(), dfraud.step.max()), bins = 20, density = True)
dfraud.step.plot.hist(ax = steps2, label = 'Fraud', alpha = 0.5, bins = 20, density = True)
steps2.legend();
Dang, I half expected all records with steps > 400 to be accounted for by isFraud records. Of course it wasn't going to be that easy. Onward and upward!
Below we are going to explore the distribution of our transaction types.
# Starting with the whole set
df['type'].value_counts()
# and then isFraud
dfraud['type'].value_counts()
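Comparing the two sets of raw counts by eye works, but a per-type fraud rate makes the comparison direct. A sketch with toy rows (the type labels and counts are invented for illustration):

```python
import pandas as pd

# Toy stand-in: a few transaction types with an isFraud flag
df = pd.DataFrame({
    'type': ['TRANSFER', 'CASH_OUT', 'PAYMENT', 'TRANSFER', 'CASH_OUT', 'PAYMENT'],
    'isFraud': [1, 1, 0, 0, 0, 0],
})

# The mean of a 0/1 flag within each group is the fraud rate for that type
fraud_rate = df.groupby('type')['isFraud'].mean()
print(fraud_rate)
```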
Interesting! This tells us with some confidence that isFraud = 1 is likely isolated to the transaction types that appear in dfraud's counts.
Below we are going to start working through our floats, including amount, oldbalanceOrg, newbalanceOrig, oldbalanceDest, and newbalanceDest. We aren't including ID as that is not going to be in our feature space.
df.amount.describe()
dfraud.amount.describe()
We can see a significant difference in mean between the df and dfraud amounts, but let's try to visualize them.
df.amount.plot.hist();
Well that's not helpful. Looks like the amounts are so skewed we'll need to transform to visualize. Let's see if we can kill two birds with one stone using numpy's log1p.
amount = np.log1p(df.amount).plot.hist(label = 'Not Fraud', alpha = 0.8,
bins = 20, density = True)
np.log1p(dfraud.amount).plot.hist(ax = amount, label = 'Fraud', alpha = 0.5, bins = 20, density = True)
amount.legend();
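Why log1p rather than a plain log? log1p(x) computes log(1 + x), so the zero balances that litter this data set map to 0 instead of negative infinity. A quick sketch with made-up amounts:

```python
import numpy as np

amounts = np.array([0.0, 9.0, 99.0, 9999.0])

# log1p(x) = log(1 + x): zeros map cleanly to 0 rather than -inf,
# and large values are compressed onto a readable scale
transformed = np.log1p(amounts)
print(transformed)
```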
Not only are we successful in actually seeing the full scope of our data, we now know that the log1p transformation gets us much closer to a normal distribution. Let's try the same thing with our other floats.
df.oldbalanceOrg.describe()
dfraud.oldbalanceOrg.describe()
oldbalanceOrg = np.log1p(df.oldbalanceOrg).plot.hist(label = 'Not Fraud', alpha = 0.8,
bins = 20, density = True)
np.log1p(dfraud.oldbalanceOrg).plot.hist(ax = oldbalanceOrg, label = 'Fraud', alpha = 0.5, bins = 20, density = True)
oldbalanceOrg.legend();
Looks like very few of our isFraud cases have a beginning balance from the source account of 0 when compared to all records.
df.newbalanceOrig.describe()
dfraud.newbalanceOrig.describe()
newbalanceOrig = np.log1p(df.newbalanceOrig).plot.hist(label = 'Not Fraud', alpha = 0.8,
bins = 20, density = True)
np.log1p(dfraud.newbalanceOrig).plot.hist(ax = newbalanceOrig, label = 'Fraud', alpha = 0.5, bins = 20, density = True)
newbalanceOrig.legend();
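An observation like "nearly always 0" can be quantified directly rather than read off a histogram. A sketch with a toy stand-in Series (the values are invented; the real share would come from dfraud.newbalanceOrig):

```python
import pandas as pd

# Toy stand-in for dfraud.newbalanceOrig
fraud_newbalance = pd.Series([0.0, 0.0, 0.0, 150.0, 0.0])

# The mean of a boolean Series is the share of True values
zero_share = (fraud_newbalance == 0).mean()
print(zero_share)  # → 0.8
```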
Wow! For the isFraud cases, it looks like the updated balance from the original account is (nearly) always 0!
df.oldbalanceDest.describe()
dfraud.oldbalanceDest.describe()
oldbalanceDest = np.log1p(df.oldbalanceDest).plot.hist(label = 'Not Fraud', alpha = 0.8,
range = (np.log1p(df.oldbalanceDest).min(), np.log1p(df.oldbalanceDest).max()),
bins = 20, density = True)
np.log1p(dfraud.oldbalanceDest).plot.hist(ax = oldbalanceDest, label = 'Fraud', alpha = 0.5,
range = (np.log1p(df.oldbalanceDest).min(), np.log1p(df.oldbalanceDest).max()),
bins = 20, density = True)
oldbalanceDest.legend();
df.newbalanceDest.describe()
dfraud.newbalanceDest.describe()
newbalanceDest = np.log1p(df.newbalanceDest).plot.hist(label = 'Not Fraud', alpha = 0.8,
range = (np.log1p(df.newbalanceDest).min(), np.log1p(df.newbalanceDest).max()),
bins = 20, density = True)
np.log1p(dfraud.newbalanceDest).plot.hist(ax = newbalanceDest, label = 'Fraud', alpha = 0.5,
range = (np.log1p(df.newbalanceDest).min(), np.log1p(df.newbalanceDest).max()),
bins = 20, density = True)
newbalanceDest.legend();
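The same overlay pattern has now appeared for every float column, so it could be folded into a small helper. A sketch, exercised with toy data since the real df and dfraud aren't available here (the helper name and signature are my own, not from the project):

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt

def plot_log_hist(df, dfraud, col, bins=20):
    """Overlay density histograms of log1p(col) for all records vs fraud."""
    full = np.log1p(df[col])
    fraud = np.log1p(dfraud[col])
    lo, hi = full.min(), full.max()
    # Plot both on the same Axes with a shared range so the bins line up
    ax = full.plot.hist(label='Not Fraud', alpha=0.8, range=(lo, hi),
                        bins=bins, density=True)
    fraud.plot.hist(ax=ax, label='Fraud', alpha=0.5, range=(lo, hi),
                    bins=bins, density=True)
    ax.legend()
    return ax

# Toy data just to exercise the helper
df = pd.DataFrame({'amount': [0.0, 10.0, 100.0, 1000.0], 'isFraud': [0, 0, 1, 1]})
ax = plot_log_hist(df, df[df.isFraud == 1], 'amount')
```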
df.info()
So far, we've worked through step, type, amount, oldbalanceOrg, newbalanceOrig, oldbalanceDest and newbalanceDest. All that's left are nameOrig and nameDest. Given that these are basically unique account numbers, there may not be much to glean. But wait! We can distinguish what type of account (merchant or consumer) by the leading character.
# Let's create a lambda function and a new variable to analyze the account types
df['nameOrigCat'] = df.nameOrig.apply(lambda x: x[0])
df.nameOrigCat.value_counts()
Hmm, looks like all transactions started from consumer accounts regardless of isFraud status. Let's try nameDest:
df['nameDestCat'] = df.nameDest.apply(lambda x: x[0])
df.nameDestCat.value_counts()
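For what it's worth, pandas' string accessor does the same thing as the lambda without a Python-level loop. A small sketch with made-up account names, using the same leading-character convention as above:

```python
import pandas as pd

# Toy account names: 'C…' for consumer, 'M…' for merchant
names = pd.Series(['C12345', 'M98765', 'C55555'])

# .str[0] is the vectorized equivalent of .apply(lambda x: x[0])
prefixes = names.str[0]
print(prefixes.value_counts())
```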
Huzzah! Looks like there's some distinction there.
pd.crosstab(df.isFraud, df.nameDestCat) # cross-tabulate isFraud against destination account type
Well, not as informative as I'd hoped, but at least it's good to know that no fraudulent transactions went to merchants!
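Passing normalize='index' to crosstab would make that pattern jump out as row percentages. A sketch with toy flags (the values are invented; only the shape of the call matters):

```python
import pandas as pd

# Toy flags: isFraud against a destination-account category
df = pd.DataFrame({'isFraud':     [0, 0, 1, 1, 0, 0],
                   'nameDestCat': ['M', 'C', 'C', 'C', 'M', 'C']})

# normalize='index' turns counts into row proportions, so "no fraud
# went to merchants" reads directly as a 0.0 in the fraud row
tab = pd.crosstab(df.isFraud, df.nameDestCat, normalize='index')
print(tab)
```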