Filtering the Data from the Aggregated Player Data
1 Activating Python Environment
This code block must be run to activate the Python virtual environment for the org session "SESSION_1". All following Python code blocks are run in "SESSION_1", where the virtual environment is then active.
(pyvenv-activate "~/gitRepos/TTK28-project/venv/")
2 Importing dependencies
The filtering of the data uses pandas and numpy for data storage and manipulation. Additionally, numpy's RandomState is used for sampling the data. The data is standardized with sklearn's StandardScaler, and sklearn's OneHotEncoder is used for the player position data.
import pandas as pd
import numpy as np
from numpy.random import RandomState
from icecream import ic
from sklearn.preprocessing import StandardScaler, OneHotEncoder
3 Hyperparameters
In order to more easily configure the model and keep track of the settings, these hyperparameters are defined here and used throughout the code where needed.
NAME = "full"
DATA_BALANCE_PERCENTAGE = 0.8
TRAINING_DATA_FRACTION = 0.8
4 Helper Function
This function determines the number of non-HoF samples to extract, based on a percentage describing how imbalanced the dataset should be.
def num_sample(percentage, hof_df):
    hof_length = len(hof_df['G_all'])
    return int((percentage * hof_length)/(1 - percentage))
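As a quick worked example, assume 225 HoF players (consistent with the 180 + 45 split printed in section 8.7): with percentage = 0.8, the function returns the number of non-HoF samples that makes them 80% of the combined set.

# Worked example: with percentage = 0.8 and 225 HoF players, the sampled
# non-HoF players make up 900 / (900 + 225) = 80% of the combined set.
n = int((0.8 * 225) / (1 - 0.8))
print(n)  # 900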
5 Importing and Cleaning the Data
df = pd.read_csv('../data/final_data.csv')
sys:1: DtypeWarning: Columns (48) have mixed types.Specify dtype option on import or set low_memory=False.
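As the warning message itself suggests, this could be silenced by passing low_memory=False (or an explicit dtype) to read_csv; a minimal sketch:

# Sketch: let pandas scan the whole file before inferring column dtypes,
# which avoids the mixed-type warning for column 48.
df = pd.read_csv('../data/final_data.csv', low_memory=False)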
Converting the last-game-played date into a float32 year.
# Splitting the finalGame date and keeping the year as a string
finalGameYear = df['finalGame'].str.split(pat="-", expand=True)[0]
df['finalGame'] = finalGameYear
# Removing NaN fields from the dataset
df = df[finalGameYear.notnull()]
# Adding back to the dataset as float32
df['finalGame'] = df['finalGame'].astype('float32')
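An equivalent sketch, if used instead of the block above, would let pandas parse the dates directly (assuming finalGame holds parseable YYYY-MM-DD strings):

# Alternative sketch: parse the dates instead of string-splitting.
# errors='coerce' turns unparseable entries into NaT, whose year becomes NaN.
finalGameYear = pd.to_datetime(df['finalGame'], errors='coerce').dt.year
df = df[finalGameYear.notnull()]
df['finalGame'] = finalGameYear.astype('float32')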
Filling in NaN values for columns that cannot contain NaN.
# Replacing NaN in each column with 0 and casting to float32
award_names = ['Most Valuable Player', 'World Series MVP', 'AS_games',
               'Gold Glove', 'Rookie of the Year', 'Silver Slugger',
               'G', 'AB', 'R', 'H', '2B', '3B', 'HR', 'RBI', 'SB', 'BB',
               'SO', 'IBB', 'HBP', 'SH', 'SF', 'BB-A', 'H-A', 'IPouts',
               'SO-A', 'IBB-A', 'PO', 'A', 'E', 'DP', 'ERA']
for col in award_names:
    df[col] = df[col].fillna(0)
    df[col] = df[col].astype('float32')
6 Creating Additional Statistics
The raw data is aggregated into stats which express the skill of a batter or pitcher. It would be possible to use all the features individually, but with a small and imbalanced dataset it is preferable not to include more features than necessary. The batting stats are 1B, SLG, OBS and OPS; the pitching stats are WHIP and KperBB.
# Adding the Batting Stats
df['1B'] = df['H'] - df['2B'] - df['3B'] - df['HR']
df['SLG'] = (df['1B'] + 2*df['2B'] + 3*df['3B'] + 4*df['HR'])/df['AB']
df['OBS'] = (df['H'] + df['BB'] + df['HBP'])/(df['AB'] + df['BB'] + df['HBP'] + df['SF'])
df['OPS'] = df['SLG'] + df['OBS']
# Adding the Pitching Stats
df['WHIP'] = round(3*(df['BB-A'] + df['H-A'])/df['IPouts'], 2)
df['KperBB'] = round(df['SO-A']/(df['BB-A'] - df['IBB-A']), 2)

new_stats = ['1B', 'SLG', 'OBS', 'OPS', 'WHIP', 'KperBB']
for col in new_stats:
    df[col] = df[col].fillna(0)
    df[col] = df[col].astype('float32')
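As a sanity check of the batting formulas, here is a worked example with made-up numbers (the player is hypothetical, not taken from the dataset):

# Hypothetical player: 100 H, 20 2B, 5 3B, 10 HR, 300 AB, 30 BB, 5 HBP, 5 SF.
# 1B  = 100 - 20 - 5 - 10 = 65
# SLG = (65 + 2*20 + 3*5 + 4*10) / 300 = 160/300
# OBS = (100 + 30 + 5) / (300 + 30 + 5 + 5) = 135/340
slg = (65 + 2*20 + 3*5 + 4*10) / 300
obs = (100 + 30 + 5) / (300 + 30 + 5 + 5)
print(round(slg, 3), round(obs, 3), round(slg + obs, 3))  # 0.533 0.397 0.93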
7 Fulfilling criteria for HoF eligibility
Removing players who have played fewer than 10 years, as such players are not eligible for the MLB Hall of Fame.
df = df[df['Years_Played'] >= 10] # 10 years
8 Preparing to Feed Data into Model
8.1 Selecting Variables to Use in Training
hof_label_df = df[['playerID', 'HoF', 'POS']]
df_inf_containing = df[['WHIP', 'KperBB']]
df = df[['G_all', 'finalGame', 'OPS', 'SB', 'HR', 'Years_Played',
         'Most Valuable Player', 'AS_games', 'Gold Glove',
         'Rookie of the Year', 'World Series MVP', 'Silver Slugger',
         'WHIP', 'ERA', 'KperBB', 'PO', 'A', 'E', 'DP']]

df_inf = np.isinf(df_inf_containing)

# Removing rows with infinity values from both the data to be
# standardized and the label data
df = df[~df_inf['WHIP']]
df = df[~df_inf['KperBB']]
hof_label_df = hof_label_df[~df_inf['WHIP']]
hof_label_df = hof_label_df[~df_inf['KperBB']]
/tmp/babel-aDvIBM/python-PmDC6f:12: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
  df = df[~df_inf['KperBB']]
/tmp/babel-aDvIBM/python-PmDC6f:14: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
  hof_label_df = hof_label_df[~df_inf['KperBB']]
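The warning appears because the second boolean mask still carries the full pre-filter index, so pandas has to reindex it against the already-filtered frame. A sketch that would avoid it, if used instead of the four filtering lines above, combines both masks into a single filter:

# Sketch: build one combined mask over the original index and filter each
# DataFrame exactly once, so no boolean key needs to be reindexed.
finite_mask = ~(df_inf['WHIP'] | df_inf['KperBB'])
df = df[finite_mask]
hof_label_df = hof_label_df[finite_mask]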
8.2 One Hot Encoding for Player Position Data
# Creating the one hot encoder object
onehotencoder = OneHotEncoder()
# fit_transform expects a 2-D array, so POS is passed as a one-column DataFrame
X = onehotencoder.fit_transform(hof_label_df[['POS']]).toarray()
# To add this back into the original dataframe
dfOneHot = pd.DataFrame(X, columns=["POS_"+str(int(i)) for i in range(X.shape[1])])
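If descriptive column names were wanted instead of numeric suffixes, the fitted encoder's categories_ attribute holds the actual position labels; a possible sketch:

# Sketch: name the dummy columns after the encoded positions themselves.
pos_labels = onehotencoder.categories_[0]
dfOneHot = pd.DataFrame(X, columns=["POS_" + str(label) for label in pos_labels])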
8.3 Standardizing the Data
df = pd.DataFrame(StandardScaler().fit_transform(df), columns=df.columns)
# Joining the one hot encoded dummy variables to the standardized data set
df = df.join(dfOneHot)
# Adding back the HoF data
df.insert(df.shape[1], 'HoF', hof_label_df['HoF'].to_numpy())
8.4 Converting HoF back to strings
# The values are plain floats/NaN, so no regex matching is needed
df['HoF'] = df['HoF'].replace(1.0, 'Y')
df['HoF'] = df['HoF'].replace(np.nan, 'N')
df['HoF'] = df['HoF'].replace(0.0, 'N')
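The same conversion can be expressed in one pass with map, treating everything that is not 1.0 (i.e. 0.0 or NaN) as 'N'; a sketch:

# Sketch: map 1.0 -> 'Y'; every unmapped value (0.0 or NaN) becomes NaN,
# which fillna then turns into 'N'.
df['HoF'] = df['HoF'].map({1.0: 'Y'}).fillna('N')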
8.5 Splitting dataset by HoF players
reg_df = df[df['HoF'] == 'N']
hof_df = df[df['HoF'] == 'Y']
8.6 Under-sampling the non-HoF players
Currently, the non-HoF players are not under-sampled; the sampling call is left commented out below.
reg_seeded_random = RandomState(1)
# sampled_reg_df = reg_df.sample(n=num_sample(DATA_BALANCE_PERCENTAGE, hof_df), random_state=reg_seeded_random)
sampled_reg_df = reg_df
8.7 Splitting reg and HoF into train and test
sep_seeded_random = RandomState(1)
train_reg_df = sampled_reg_df.sample(frac=TRAINING_DATA_FRACTION, random_state=sep_seeded_random)
test_reg_df = sampled_reg_df.drop(train_reg_df.index)
train_hof_df = hof_df.sample(frac=TRAINING_DATA_FRACTION, random_state=sep_seeded_random)
test_hof_df = hof_df.drop(train_hof_df.index)

print('length of train_reg_df: ', len(train_reg_df))
print('length of test_reg_df: ', len(test_reg_df))
print('length of train_hof_df: ', len(train_hof_df))
print('length of test_hof_df: ', len(test_hof_df))
length of train_reg_df:  2537
length of test_reg_df:  634
length of train_hof_df:  180
length of test_hof_df:  45
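Since the under-sampling is disabled, these counts show how imbalanced the training set is; a quick check from the printed numbers:

# HoF fraction of the training set, computed from the counts above:
# 180 / (180 + 2537) ≈ 0.066, i.e. only about 6.6% HoF players.
print(round(180 / (180 + 2537), 3))  # 0.066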
8.8 Merging the test and training datasets
train_df = pd.concat([train_hof_df, train_reg_df])
test_df = pd.concat([test_hof_df, test_reg_df])
# Shuffling the data
train_df = train_df.sample(frac=1, random_state=sep_seeded_random)
test_df = test_df.sample(frac=1, random_state=sep_seeded_random)
9 Saving the Training and Test Data Sets
train_df.to_csv('../data/train_data_' + NAME + '.csv', index=False)
test_df.to_csv('../data/test_data_' + NAME + '.csv', index=False)
10 Links to Other Files in Project
- extract_data: Extracting data from Lahman's raw data.
- filter_data: Data manipulation, feature creation and feature selection.
- hof_model: Creation of the model and training of the neural network.
- grid_search_results: Storing some of the results for the best Grid Searches.