Filtering the Data from the Aggregated Player Data

1 Activating Python Environment

This code block must be run to activate the Python virtual environment for the org session "SESSION_1". All following Python code blocks run in "SESSION_1", where the virtual environment is then active.

(pyvenv-activate "~/gitRepos/TTK28-project/venv/")
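As a quick sanity check (a minimal sketch, nothing in the pipeline depends on it), the interpreter prefix can be printed from a Python block in "SESSION_1"; it should point into the venv directory activated above.

import sys
# If the activation above worked, the prefix points into
# ~/gitRepos/TTK28-project/venv; otherwise the blocks below
# run against the system Python.
print(sys.prefix)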

2 Importing dependencies

The filtering uses pandas and NumPy for data storage and manipulation. Additionally, NumPy's RandomState is used for reproducible sampling of the data. The data is standardized with scikit-learn's StandardScaler, and OneHotEncoder encodes the player positions.

import pandas as pd
import numpy as np
from numpy.random import RandomState
from icecream import ic
from sklearn.preprocessing import StandardScaler, OneHotEncoder

3 Hyperparameters

To make the model easier to configure and to keep track of the settings, these hyperparameters are defined here and used throughout the code where needed.

NAME = "full"
DATA_BALANCE_PERCENTAGE = 0.8
TRAINING_DATA_FRACTION = 0.8

4 Helper Function

This function determines the number of non-HoF samples to extract so that they make up the given percentage of the balanced dataset.

def num_sample(percentage, hof_df):
    hof_length = len(hof_df['G_all'])
    return int((percentage * hof_length)/(1 - percentage))
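As a worked example (the HoF count of 225 is hypothetical): with DATA_BALANCE_PERCENTAGE = 0.8, the function asks for enough non-HoF samples that they make up 80% of the balanced set.

# int((0.8 * 225) / (1 - 0.8)) = 900 non-HoF samples,
# and 900 / (900 + 225) = 0.8 as intended.
print(int((0.8 * 225) / (1 - 0.8)))  # -> 900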

5 Importing and Cleaning the Data

# low_memory=False avoids a DtypeWarning about mixed types in one column
df = pd.read_csv('../data/final_data.csv', low_memory=False)

Converting the last game played date to a float32 year.

# Splitting out the year from the final game date (YYYY-MM-DD)
finalGameYear = df['finalGame'].str.split(pat="-", expand=True)[0]
df['finalGame'] = finalGameYear
# Removing rows where the final game date is missing
df = df[finalGameYear.notnull()]
# Adding the year back to the dataset as float32
df['finalGame'] = df['finalGame'].astype('float32')

Filling in columns that cannot contain NaN.

# Replacing NaN in each column with 0 and casting to float32
award_names = ['Most Valuable Player', 'World Series MVP', 'AS_games', 'Gold Glove',
               'Rookie of the Year', 'Silver Slugger', 'G', 'AB', 'R', 'H', '2B', '3B',
               'HR', 'RBI', 'SB', 'BB', 'SO', 'IBB', 'HBP', 'SH', 'SF', 'BB-A', 'H-A',
               'IPouts', 'SO-A', 'IBB-A', 'PO', 'A', 'E', 'DP', 'ERA']
for col in award_names:
    df[col] = df[col].fillna(0)
    df[col] = df[col].astype('float32')
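A small check (a sketch, not part of the original pipeline) confirms the fill and cast took effect:

# No NaN should remain in the filled columns, and all should be float32
assert not df[award_names].isna().any().any()
assert all(df[col].dtype == 'float32' for col in award_names)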

6 Creating Additional Statistics

The raw features are aggregated into stats that are commonly used to express the skill of a batter or pitcher. It would be possible to use all the features individually, but with a small and imbalanced dataset it is preferable not to include more features than necessary. The batting stats are 1B (singles), SLG (slugging), OBS (on-base) and OPS (on-base plus slugging); the pitching stats are WHIP and KperBB.

# Adding the Batting Stats
df['1B'] = df['H'] - df['2B'] - df['3B'] - df['HR']
df['SLG'] = (df['1B'] + 2*df['2B'] + 3*df['3B'] + 4*df['HR'])/df['AB']
df['OBS'] = (df['H'] + df['BB'] + df['HBP'])/(df['AB'] + df['BB']+ df['HBP']+ df['SF'])
df['OPS'] = df['SLG'] + df['OBS']

# Adding the Pitching Stats
df['WHIP'] = round(3*(df['BB-A'] + df['H-A'])/df['IPouts'], 2)
df['KperBB'] = round(df['SO-A']/(df['BB-A'] - df['IBB-A']), 2)

new_stats = ['1B', 'SLG', 'OBS', 'OPS', 'WHIP', 'KperBB']
for col in new_stats:
    df[col] = df[col].fillna(0)
    df[col] = df[col].astype('float32')
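To illustrate the batting formulas with a hypothetical stat line (150 hits, of which 30 doubles, 5 triples and 20 home runs, in 500 at-bats with 60 walks, 5 hit-by-pitches and 5 sacrifice flies):

singles = 150 - 30 - 5 - 20                # 95 singles
slg = (singles + 2*30 + 3*5 + 4*20) / 500  # total bases per at-bat: 0.500
obs = (150 + 60 + 5) / (500 + 60 + 5 + 5)  # times on base per chance: ~0.377
print(round(slg + obs, 3))                 # OPS: ~0.877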

7 Fulfilling criteria for HoF eligibility

Removing players who have played fewer than 10 seasons, as these players are not eligible for the MLB Hall of Fame.

df = df[df['Years_Played'] >= 10] # 10 years

8 Preparing to Feed Data into Model

8.1 Selecting Variables to use in training

hof_label_df = df[['playerID', 'HoF', 'POS']]
df_inf_containing = df[['WHIP', 'KperBB']]
df = df[['G_all', 'finalGame', 'OPS', 'SB', 'HR',
         'Years_Played', 'Most Valuable Player', 'AS_games',
         'Gold Glove', 'Rookie of the Year', 'World Series MVP', 'Silver Slugger',
         'WHIP', 'ERA', 'KperBB', 'PO', 'A', 'E', 'DP']]

df_inf = np.isinf(df_inf_containing)
# Removing rows with infinite values (from division by zero in WHIP and
# KperBB) from both the data to be standardized and the label data.
# A single combined mask filters both frames in one step and avoids the
# pandas warning about reindexed boolean keys.
finite_mask = ~(df_inf['WHIP'] | df_inf['KperBB'])
df = df[finite_mask]
hof_label_df = hof_label_df[finite_mask]

8.2 One Hot Encoding for Player Position Data

# Creating the one hot encoder object
onehotencoder = OneHotEncoder()
# fit_transform expects 2-D input, so the POS column is passed as a DataFrame
X = onehotencoder.fit_transform(hof_label_df[['POS']]).toarray()
# Collecting the dummy variables in a DataFrame so they can be joined back in
dfOneHot = pd.DataFrame(X, columns=["POS_" + str(i) for i in range(X.shape[1])])
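A possible variant (a sketch; the column order of X follows the encoder's learned categories) is to name the dummy columns after the actual position labels instead of numeric indices:

# categories_[0] holds the position labels in the column order of X
pos_labels = onehotencoder.categories_[0]
dfOneHot = pd.DataFrame(X, columns=["POS_" + str(label) for label in pos_labels])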

8.3 Standardizing the Data

df = pd.DataFrame(StandardScaler().fit_transform(df), columns=df.columns)
# Joining the one hot encoded dummy variables to standardized data set
df = df.join(dfOneHot)
# Adding back the HoF data
df.insert(df.shape[1], 'HoF', hof_label_df['HoF'].to_numpy())
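As a sanity check (a sketch), the standardized columns should now have mean close to 0 and standard deviation close to 1:

# Means should be ~0 and standard deviations ~1 after StandardScaler
# (pandas uses ddof=1, so the std is only approximately 1)
print(round(df['OPS'].mean(), 3), round(df['OPS'].std(), 3))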

8.4 Converting HoF back to strings

# HoF is stored as 1.0 / 0.0 / NaN; map it to 'Y' / 'N' labels
df['HoF'] = df['HoF'].map({1.0: 'Y', 0.0: 'N'}).fillna('N')

8.5 Splitting dataset by HoF players

reg_df = df[df['HoF'] == 'N']
hof_df = df[df['HoF'] == 'Y']

8.6 Under-sampling the non-HoF players

Currently, the non-HoF players are not under-sampled; the sampling call is left commented out so a balanced dataset can be produced by re-enabling it.

reg_seeded_random = RandomState(1)
# sampled_reg_df = reg_df.sample(n = num_sample(DATA_BALANCE_PERCENTAGE, hof_df), random_state=reg_seeded_random)
sampled_reg_df = reg_df

8.7 Splitting reg and HoF into train and test

sep_seeded_random = RandomState(1)
train_reg_df = sampled_reg_df.sample(frac=TRAINING_DATA_FRACTION, random_state=sep_seeded_random)
test_reg_df = sampled_reg_df.drop(train_reg_df.index)
train_hof_df = hof_df.sample(frac=TRAINING_DATA_FRACTION, random_state=sep_seeded_random)
test_hof_df = hof_df.drop(train_hof_df.index)

print('length of train_reg_df: ', len(train_reg_df))
print('length of test_reg_df: ', len(test_reg_df))
print('length of train_hof_df: ', len(train_hof_df))
print('length of test_hof_df: ', len(test_hof_df))
length of train_reg_df:  2537
length of test_reg_df:  634
length of train_hof_df:  180
length of test_hof_df:  45

8.8 Merging the test and training datasets

train_df = pd.concat([train_hof_df, train_reg_df])
test_df = pd.concat([test_hof_df, test_reg_df])
# Shuffling the data
train_df = train_df.sample(frac = 1, random_state=sep_seeded_random)
test_df = test_df.sample(frac = 1, random_state=sep_seeded_random)
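Before saving, the class balance of the merged sets can be inspected (a sketch; with under-sampling disabled the 'N' class dominates):

# Counts of HoF vs non-HoF rows in each merged set
print(train_df['HoF'].value_counts())
print(test_df['HoF'].value_counts())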

9 Saving the training and test data sets

train_df.to_csv('../data/train_data_' + NAME + '.csv', index=False)
test_df.to_csv('../data/test_data_' + NAME + '.csv', index=False)
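A quick round-trip check (a sketch) confirms the files were written as expected:

# Re-reading the training file; with index=False the shape should survive
check_df = pd.read_csv('../data/train_data_' + NAME + '.csv')
assert check_df.shape == train_df.shape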

10 Links to Other Files in Project

  1. extract_data: Extracting data from Lahman's raw data.
  2. filter_data: Data manipulation, feature creation and feature selection.
  3. hof_model: Creation of the model and training of the neural network.
  4. grid_search_results: Storing some of the results for the best Grid Searches.

Author: Olav Landmark Pedersen

Created: 2021-08-15 Sun 22:15
