My Personal Website
COMMENT Handy-Templates
TODO Original Template
First heading within the post
- This post will be exported to the "main" menu (~EXPORT_HUGO_MENU: :menu "main"~).
- LETS SEE IF THIS HAPPENS Its title will be "Writing Hugo blog in Org".
- It will have hugo and org tags and emacs as category.
- The menu item weight and post weight are auto-calculated.
- The menu item identifier is auto-set.
- The lastmod property in the front-matter is set automatically to the time of export.
- It will be exported as content/posts/writing-hugo-blog-in-org-subtree-export.md.
A sub-heading under that heading
- Its draft state will be marked as ~true~ as the subtree has the todo state set to TODO.
With the point anywhere in this /Writing Hugo blog in Org/ post subtree, do ~C-c C-e H H~ to export just this post.
TODO Useful Template
The exported Markdown has a little comment footer as set in the Local Variables section below.
Spacemacs
Creating a Literate Config for Spacemacs
All posts in here will have the category set to spacemacs.
Why create a "literate configuration"?
In a previous post, I explained the concept and origin of literate programming. A literate configuration is simply applying the literate programming concept to configuration files. This is useful because configuration files typically require more context around the settings and why you chose to do something in a particular manner. Sometimes this could include links to where you found the configuration settings or 'work in progress' settings. Another reason literate configurations are useful is that configuration files are not necessarily modified every day. When jumping back into your config to fix, add, or modify a feature, it can be useful to have a short explanation to quickly catch yourself up.
For this particular post the goal is to create a literate configuration for Spacemacs, a heavily modded distribution of Emacs. This is mainly because I use Spacemacs myself and find having a literate configuration for it very useful. The literate configuration itself is an ~.org~ file where the configuration code is defined in code blocks. Using org-mode headings also makes it much easier to navigate your configuration file. Extracting the code from the literate configuration, or /tangling/ it, will yield an ~elisp~ file.
Spacemacs is also well suited for a literate configuration since it by default loads a single ~.spacemacs~ file, which tends to steadily grow as it increasingly takes over your life. At the time of writing, my current 'tangled' ~spacemacs.el~ configuration (just the code) is 1500 lines long.
Correctly loading Spacemacs literate config file
When I was attempting to create a literate Spacemacs config, there were several helpful links explaining how to tangle and generate output files from an ~.org~ file, but I struggled to figure out the best way to load the file itself. There are two default locations for your Spacemacs configuration, with the following loading priority:
- ~~/.spacemacs~: A single dotfile in the home directory
- ~~/.spacemacs.d/init.el~: A dotfile directory in the home directory
For a literate configuration we will need multiple files, therefore I find it more orderly to go with the second option and create a ~~/.spacemacs.d~ directory.
Initially, it seemed to make sense to create a ~spacemacs.org~ literate configuration file and tangle it directly to either the ~.spacemacs~ or ~init.el~ file. It could be done this way, but Spacemacs adds custom content at the end of the ~init.el~ (or ~~/.spacemacs~) file in the following function:
#+begin_example elisp
(defun dotspacemacs/emacs-custom-settings ()
  "Emacs custom settings.
This is an auto-generated function, do not modify its content directly, use
Emacs customize menu instead.
This function is called at the very end of Spacemacs initialization."
#+end_example
This is where Emacs stores ~custom-set-variables~ and ~custom-set-faces~ generated during runtime. If we were to /tangle/ our literate configuration file directly to the ~init.el~ file, it would overwrite this function and these variables every time it was generated. Therefore, the literate configuration can instead be tangled to an intermediate file, for example ~spacemacs.el~, and then the ~init.el~ file can load its contents:
- ~spacemacs.org~: The literate configuration file
- ~spacemacs.el~: The tangled configuration file
- ~init.el~: The file that Spacemacs loads as a default configuration file.
An overview of how this works is shown in figure fig:spacemacs-config-overview.
#+CAPTION: An overview of the concept of how to make a literate config for Spacemacs that doesn't overwrite the auto-generated settings from Spacemacs. Created in Krita.
#+NAME: fig:spacemacs-config-overview
#+ATTR_HTML: :class center
../static/img_literate_spacemacs_config/literate_spacemacs_config_crop.png
The 'spacemacs.org' file
This is the literate configuration file itself. When creating this you have to ensure every part of your configuration file is copied over in the code snippets. Below is the start of my configuration file. Most of these settings are not particularly important for generating the configuration file, but they have served me well thus far.
#+begin_example org
#+TITLE: Spacemacs Literate User Configuration
#+STARTUP: headlines
#+STARTUP: nohideblocks
#+STARTUP: noindent
#+OPTIONS: toc:4 h:4
#+PROPERTY: header-args:emacs-lisp :comments link
#+end_example
An example of a code block from part of my configuration is shown below. The important part is that the configuration is inside the code blocks.
** Function start and default settings
This is just the start of some configuration options copied directly over from the default configuration.
#+begin_src emacs-lisp :tangle spacemacs.el
(defun dotspacemacs/layers ()
  (setq-default
   dotspacemacs-distribution 'spacemacs
   dotspacemacs-enable-lazy-installation 'unused
   dotspacemacs-ask-for-lazy-installation t
   dotspacemacs-configuration-layer-path '()
#+end_src
What defines the code block:
#+begin_example
#+begin_src emacs-lisp :tangle spacemacs.el
#+end_src
#+end_example
The ~emacs-lisp~ option indicates what language the code block is written in. The ~:tangle spacemacs.el~ header argument sets the output target file to ~spacemacs.el~ when the file is tangled. My complete ~spacemacs.org~, with all its warts, can be found in my dotfiles repository on GitHub.
The 'spacemacs.el' file
In order to generate the tangled file, ~spacemacs.el~, the function ~org-babel-tangle~ has to be run on the literate configuration file. For this function to be run every time you save the file, 'local variables' along the lines of the following can be added to the bottom of ~spacemacs.org~. This will add a function to the ~after-save-hook~; a hook is a variable that holds a list of functions to be run at a specific time (in this case, after saving).
#+begin_example org
* Local Variables :ARCHIVE:
# Local Variables:
# eval: (add-hook 'after-save-hook #'org-babel-tangle nil t)
# End:
#+end_example
The 'init.el' file
This is the file Spacemacs loads as the configuration. You only have to add the following line to the file.
#+begin_example elisp
(load-file "~/.spacemacs.d/spacemacs.el")
#+end_example
Spacemacs should then load your configuration and you should be up and running! As you continue to use Spacemacs, it will populate the ~init.el~ file with something like:
#+begin_example elisp
(load-file "~/.spacemacs.d/spacemacs.el")
(defun dotspacemacs/emacs-custom-settings ()
  "Emacs custom settings.
This is an auto-generated function, do not modify its content directly, use
Emacs customize menu instead.
This function is called at the very end of Spacemacs initialization."
  (custom-set-variables
   ;; custom-set-variables was added by Custom.
   ;; If you edit it by hand, you could mess it up, so be careful.
   ;; Your init file should contain only one such instance.
   ;; If there is more than one, they won't work right.
   '(evil-want-Y-yank-to-eol nil)
   ...
   ...
#+end_example
That's it!
Now when you change your ~spacemacs.org~ file and tangle the output to ~spacemacs.el~, it will not overwrite the variables generated by Spacemacs found in ~init.el~. It is not that complicated, but I hope this helps someone fairly new to ~org-babel~ or Spacemacs.
TODO Quickly Accessing files with Emacs Bookmarks
General Linux
TODO Managing your dotfiles on Github using GNU Stow
Machine Learning
TODO Using a Neural Network to Determine MLB Hall of Fame Players
Background
The project discussed in this post was originally created for a master's degree course, TTK28 - Modeling with Neural Networks. It was heavily inspired by Rui Hu's (rh2835) project Baseball-Analytics, and the data used is from the amazingly comprehensive dataset, Sean Lahman's Baseball Database. The Baseball-Analytics project had a part which involved using Linear Regression to determine which Major League Baseball (MLB) players would end up in the MLB Hall of Fame (HoF). Since the master's course I was taking was primarily concerned with using Neural Networks for Data Science, it seemed fair to reuse some of Rui Hu's code for data preparation and cleaning.
The Objective:
Determine if an eligible Major League Baseball (MLB) Player will make it into the National Baseball Hall of Fame (HoF) based on career statistics using a Neural Network.
Statistical analysis within baseball has been popular for a long time. This is partially due to the wealth of data and discrete events that occur during the game [cite:@Koseler2017]. The use of statistics has always been popular, but Bill James popularized the use of more complex analysis in the 1980's [cite:@Koseler2017]. Recently, machine learning has been used more and more in baseball. A literature review found that Support Vector Machines (SVM) were the most popular methods used for binary classification, while Bayesian Inference mixed with Linear Regression was the most used method for Regression tasks [cite:@Koseler2017]. Since the objective is to determine if a player will reach the Hall of Fame or not, it is a binary classification task.
Hall of Fame Eligibility and Selection
Why is this problem suitable for a machine learning technique, such as a neural network? The players are voted in by the Baseball Writers' Association of America (BBWAA) or Era Committees (formerly known as the Veterans Committee) [cite:@NationalBaseballHallofFame,@NationalBaseballHallofFamea]. The rules for which players are eligible for the HoF have changed throughout its existence. One of the lasting requirements is that they must have played in the league for a certain number of seasons; currently, this criterion is having played at least 10 seasons [cite:@NationalBaseballHallofFame]. The BBWAA requires that the player has been active during one of the 15 seasons prior to the election and has not played within the last five seasons [cite:@NationalBaseballHallofFame]. In contrast, the Era Committees were established to vote in players from different eras of baseball who were previously not voted in but deserve a spot in the HoF.
Since players are voted in mainly by sportswriters who follow the sport, with no specific statistical criteria, the process is somewhat arbitrary. There are several other factors which impact the sportswriters, such as character and sportsmanship. Nevertheless, this project will attempt to determine Hall of Fame induction based on a collection of career statistics.
Major Project Steps
The player data in the database goes all the way back to 1871, and the 2020 database version contains data through the 2019 season [cite:@Lahman]. The data was fragmented over several files. The ~extract_data.py~ or ~extract_data.org~ file aggregates relevant data and stores it in ~final.csv~. The data underwent further cleaning, manipulation and feature creation through the ~filter_data.py~ or ~filter_data.org~ file. After finishing the preparation of the data into a ~test~ and ~train~ dataset, the data is imported in ~hof_model.py~ or ~hof_model.org~. This file handles inputting the data into various classifiers, training the classifiers, and the final testing of the classifiers with different metrics.
The source code for this project can be found in the GitHub repository, machine-learning-MLB-HoF. Part of the goal of this project was to examine Org-mode's (in Emacs) usage in reproducible research. Therefore, it was possible to render these ~.org~ files to the HTML pages linked below. While this was fun to test out, it is not the main focus of this post.
- [[https://www.olavpedersen.com/standalone_hof/extract_data.html][extract_data]]: Extracting data from Lahman's raw data.
- [[https://www.olavpedersen.com/standalone_hof/filter_data.html][filter_data]]: Data manipulation, feature creation and feature selection.
- [[https://www.olavpedersen.com/standalone_hof/grid_search_results.html][grid_search_results]]: Storing some of the results for the best Grid Searches.
- [[https://www.olavpedersen.com/standalone_hof/hall_of_fame_model.html][hof_model]]: Creation of the model and training of the neural network.
Feature Selection
There are several specialized positions in baseball which require different skills. Pitchers often do not participate in hitting (when the team is on offense). For first basemen and catchers, their hitting ability is less important than their defensive abilities. Since these positions perform different tasks, they are evaluated using different metrics. Additionally, the skill of individuals in these positions is not necessarily easily measured by a single statistic. The quality of these metrics is quite variable:
- Good for batters (OPS, SLG, OBS, etc.)
- Okay for pitchers (WHIP, K/BB, K/9, etc.)
- Poor for fielders (FFRA, UZR, etc.)
The variability in quality of position metrics led me to include other types of statistics, namely player accolades together with some general statistics. Initially, the following eleven features were selected for this project:
| Position | Feature or Statistic              |
|----------+-----------------------------------|
| All      | Number of games played            |
| All      | Date of last game played          |
| All      | Years Played                      |
| All      | Player Position                   |
| Batter   | OPS (On-base Plus Slugging)       |
| Batter   | Silver Slugger Award              |
| All      | Most-Valuable Player (MVP) Awards |
| All      | All-Star Games Played             |
| Fielder  | Gold Glove Awards                 |
| All      | Rookie of the Year Award          |
| All      | World Series MVP Award            |
Most of these stats are fairly self-explanatory, though some of them are more cryptic. The Gold Glove Award, or simply Gold Glove, is given out to the players who exhibit overall fielding excellence regardless of position [cite:@Rawlings]. The Silver Slugger recognizes the best offensive players at each position in both the American League and National League [cite:@MLB]. Since these and other awards have not always been around, it seemed reasonable to include the 'Date of last game played' feature, which should hopefully assist in weighting those award features.
The OPS consists of two other metrics, On-Base-Percentage (OBP) and Slugging Percentage (SLG) as shown below:
$$OPS = OBP + SLG$$
The OBP refers to how frequently a batter reaches a base per plate appearance [cite:@MLBa]. As seen in the equation below, it includes hits, walks, and hit-by-pitches, but not errors, fielder's choice, or dropped third strikes [cite:@MLBa]. There is a summary table of abbreviations below.
$$OBP = \frac{H+BB+HBP}{AB+BB+SF+HBP}$$
The SLG is the total number of bases reached per at-bat (AB) [cite:@MLBb].
$$SLG = \frac{B+2B\cdot 2 + 3B\cdot 3 + HR\cdot 4}{AB} = \frac{TB}{AB}$$
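As a minimal sketch (not the project's actual code), the OBP, SLG, and OPS formulas above could be computed from raw counting stats like this, assuming a pandas DataFrame with hypothetical Lahman-style columns ~H~, ~BB~, ~HBP~, ~AB~, ~SF~, ~2B~, ~3B~, and ~HR~:
#+begin_src python
import pandas as pd

def add_batting_stats(df: pd.DataFrame) -> pd.DataFrame:
    """Add OBP, SLG and OPS columns derived from raw counting stats."""
    # Singles are hits that are not doubles, triples or home runs.
    singles = df["H"] - df["2B"] - df["3B"] - df["HR"]
    total_bases = singles + 2 * df["2B"] + 3 * df["3B"] + 4 * df["HR"]
    df["OBP"] = (df["H"] + df["BB"] + df["HBP"]) / (df["AB"] + df["BB"] + df["SF"] + df["HBP"])
    df["SLG"] = total_bases / df["AB"]
    df["OPS"] = df["OBP"] + df["SLG"]
    return df
#+end_src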
The above eleven features were used in the course project due to a time crunch. If we are being honest, they are kind of uninteresting features. They include only a single "baseball" stat (OPS), while the others are dates, numbers of games played, or awards. After the project was finished, I wanted to experiment with adding more features and stats to test out the neural network properly.
After some more feature engineering, the following features were used in addition to those above, giving a total of 20 features:
| Position | Feature or Statistic                     |
|----------+------------------------------------------|
| Batter   | SB (Stolen Bases)                        |
| Batter   | HR (Home Runs)                           |
| Batter   | A (Assists)                              |
| Pitcher  | WHIP (Walks and Hits per Inning Pitched) |
| Pitcher  | ERA (Earned Run Average)                 |
| Pitcher  | K/BB (Strikeout-to-walk Ratio)           |
| Fielder  | PO (Putout)                              |
| Fielder  | E (Errors)                               |
| Fielder  | DP (Double Plays)                        |
These were an attempt to add more stats for batters, pitchers, and fielders to better cover the different player positions. Of these, WHIP, ERA, and K/BB are pitcher stats; their formulas are defined below. Ideally, a pitcher wants a lower WHIP and ERA, but a higher K/BB.
$$WHIP = \frac{BB+H}{IP}$$
$$ERA = 9 \cdot \frac{ER}{IP}$$
$$K/BB = \frac{K}{BB}$$
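A similar sketch for the pitcher stats, again with hypothetical column names (~IPouts~ being outs recorded, so innings pitched is ~IPouts~ divided by 3; ~SO~ being strikeouts):
#+begin_src python
import pandas as pd

def add_pitching_stats(df: pd.DataFrame) -> pd.DataFrame:
    """Add WHIP, ERA and K/BB columns derived from raw pitching stats."""
    innings = df["IPouts"] / 3  # three outs per inning pitched
    df["WHIP"] = (df["BB"] + df["H"]) / innings
    df["ERA"] = 9 * df["ER"] / innings
    df["K_BB"] = df["SO"] / df["BB"]
    return df
#+end_src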
The other statistics are directly recorded from game play and not derived through formulas. Just to ensure that the average reader can still follow along, I will briefly explain some of them. A Stolen Base is when a base runner advances by taking a base to which he isn't entitled [cite:@MLBd]. A Home Run is when a player hits the ball, allowing for a four-base run [cite:@Chase2016]. A Putout is when a fielder records the act of completing an /out/, either by a force-out, tagging a runner, catching a third strike, or catching a batted ball [cite:@MLBe]. An Error is when, in the judgement of the official scorer, a fielder fails to convert an out on a play that an average fielder should have made [cite:@MLBf]. A Double Play is when two offensive players are ruled out within the same play [cite:@MLBg].
Overview of abbreviations [cite:@MLBc]:
| Abbreviation | Meaning                           |
|--------------+-----------------------------------|
| OPS          | On-Base Plus Slugging             |
| OBP          | On-Base Percentage                |
| SLG          | Slugging Percentage               |
| H            | Hit                               |
| BB           | Walk                              |
| HBP          | Hit-By-Pitch                      |
| AB           | At-Bats                           |
| SF           | Sacrifice Fly Balls               |
| 1B or B      | Single (getting to first base)    |
| 2B           | Double (getting to second base)   |
| 3B           | Triple (getting to third base)    |
| HR           | Home Run                          |
| TB           | Total Bases                       |
| SB           | Stolen Bases                      |
| WHIP         | Walks and Hits per Inning Pitched |
| ERA          | Earned Run Average                |
| K/BB         | Strikeout-to-walk Ratio           |
| PO           | Putout                            |
| A            | Assist                            |
| E            | Error                             |
| DP           | Double Play                       |
Feature Scaling and One Hot Encoding
The features above have different data ranges and can be categorical, discrete numeric, or continuous numeric values.
Let's consider the numeric features first. Dates can be large numbers depending on how they are converted to numeric values, while SLG can only range from $0.00$ to $4.00$. If these numbers were fed directly into a neural network, the numeric operations for each node could affect the weighting disproportionately for each input feature [cite:@Stottner2019]. In an attempt to minimize this difference between features, the data is typically either normalized or standardized. Since most of the features used in this project are not bound by a numeric range, standardization is more reasonable than normalization [cite:@Chadha2021]. Certain features also contain outliers; these strongly affect normalization of the data, while standardization is affected to a smaller degree [cite:@Chadha2021]. Considering this, all the numeric features were standardized.
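A minimal sketch of this standardization step with scikit-learn, using made-up numbers purely for illustration; the important detail is that the scaler is fit on the training data only:
#+begin_src python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric features (e.g. games played and OPS) for a few players.
X_train = np.array([[3000, 0.81], [1200, 0.65], [2400, 0.92], [800, 0.70]])
X_test = np.array([[1500, 0.75]])

scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)  # learn mean and std from the training data
X_test_std = scaler.transform(X_test)        # apply the same statistics to the test data
#+end_src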
While most of the features are non-categorical, being either discrete or continuous numeric values, the player position is a categorical feature. The seven unique values in the player position column (~POS~) are: ~['C' '1B' 'OF' '2B' '3B' 'P' 'SS']~. Popular methods for dealing with categorical data include label encoding, one-hot encoding, and binary encoding, to name a few [cite:@Pathak2020]. Label encoding would involve assigning a digit to each of the seven different positions. This can cause issues, as higher numeric values are given to seemingly equal categorical values. One-hot encoding gets around this by converting the single position column into seven different columns, one column for each position. For a given position, for example pitcher (~P~), that column would contain a 1 while all the other position columns would contain 0. Since the current number of features was 20, it seemed reasonable to increase the size of the input features to 26 by converting the ~POS~ column into seven columns.
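A small sketch of the one-hot encoding of the ~POS~ column, here using ~pandas.get_dummies~ (the actual project may well have used a different function):
#+begin_src python
import pandas as pd

# The seven position codes from the POS column.
positions = pd.DataFrame({"POS": ["C", "1B", "OF", "2B", "3B", "P", "SS"]})

# One column per position: 1 for the player's position, 0 for the rest.
one_hot = pd.get_dummies(positions["POS"], prefix="POS")
print(one_hot)
#+end_src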
Test Data and Training Data
As you might have picked up on from the description above, this is a supervised learning project, meaning there is input data, /x/, with a corresponding labelled output result, /y/. Knowing the result /y/ for a given input /x/ allows a model to be trained to predict /y/ for a given /x/ [cite:@VanEngelen2020].
In supervised learning, one typically has a finite amount of data to create the model with. It is normal to split the data set into a /training/ data set, which is used to train the model, and a /test/ data set, where you examine the model's ability to predict the output /y/ given the input /x/. If the whole data set was used for training, you would have no way of knowing how well your classifier performs. If you used training data to test the performance of your model, you have successfully muddied the score metrics of your classifiers [cite:@Brownlee2017]. It introduces bias in your results, since you have 'shown' the model your testing data points and used them to improve its performance.
For this project, 80% of the data was used as the training data, while 20% was held out as the test set.
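A sketch of the 80/20 split with scikit-learn, on a tiny synthetic stand-in for the real data; the ~stratify~ argument is my own addition here and keeps the HoF ratio similar in both sets:
#+begin_src python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 players, 26 standardized features, 7 HoF labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 26))
y = np.array([1] * 7 + [0] * 93)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
#+end_src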
Imbalanced Data
One of the bigger issues with the data set is the imbalance of the two categories! Of all the 3396 eligible players for the MLB Hall of Fame, there are only 225 players in the Hall of Fame (with valid feature data!). This means that only about 6.63 % of eligible players make it into the Hall of Fame.
$$\frac{225 \text{ HoF}}{3396 \text{ Eligible}} = 0.06625 \approx 6.63 \%$$
This means the database is imbalanced with a ratio of:
$$225 \text{ HoF} : (3396 \text{ Eligible} - 225 \text{ HoF})$$
$$= 225 \text{ HoF} : 3171 \text{ non-HoF, but eligible}$$
$$\approx 1 : 14.09$$
This raises a few considerations:
Classifier performance metric
Typically, accuracy is the metric used for determining a classifier's performance. It is the number of correctly predicted data points out of all of the data points [cite:@Chicco2020]. Let's consider a classifier which ALWAYS predicts that a player is not a Hall of Fame player. The accuracy of that classifier would then be:
$$100 \% - 6.63 \% = 93.37 \%$$
In an evenly split data set this would be a fantastic result! However, in this application it is quite useless.
Let's consider other metrics. This article on Towards Data Science is fantastic at explaining different metrics and the handling of imbalanced data in data science. Introducing the confusion matrix is probably a good place to start. It consists of the following matrix:
|                 | Predicted Positive  | Predicted Negative  |
|-----------------+---------------------+---------------------|
| Actual Positive | True Positive (TP)  | False Negative (FN) |
| Actual Negative | False Positive (FP) | True Negative (TN)  |
It represents a matrix of the binary predictions in the columns and the true, or actual, values in the rows. There are a few different metrics which are derived from this table, namely accuracy, precision, recall, and F1 score. Each is briefly explained below [cite:@Chicco2020,@Rocca2019].
- Accuracy, mentioned previously, is essentially how many predictions you got correct out of all of the predictions that were made.
$$Accuracy = \frac{TP+TN}{(TP + FP)+(FN+TN)}$$
- Precision is how many of the positive predictions that were made were correct.
$$Precision = \frac{TP}{TP + FP}$$
- Recall is how many of the true positives were correctly predicted as positive.
$$Recall = \frac{TP}{TP + FN}$$
- F1 is the harmonic mean of the precision and recall. It is a type of mean which is often used in determining rates of change and percentages [cite:@Ferger1931]. Basically, just consider it to be an average of precision and recall.
$$F1 = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall} = \frac{TP}{TP+\frac{1}{2}(FP+FN)}$$
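As a quick illustration of these definitions (with made-up predictions for eight players), scikit-learn can compute all four metrics directly from the labels:
#+begin_src python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, confusion_matrix)

# Hypothetical labels: 1 = HoF, 0 = not HoF.
y_true = [1, 0, 0, 1, 0, 0, 0, 1]
y_pred = [1, 0, 0, 0, 0, 1, 0, 1]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, tn, fp, fn)                   # 2 4 1 1
print(accuracy_score(y_true, y_pred))   # (TP + TN) / all = 6/8
print(precision_score(y_true, y_pred))  # TP / (TP + FP) = 2/3
print(recall_score(y_true, y_pred))     # TP / (TP + FN) = 2/3
print(f1_score(y_true, y_pred))         # harmonic mean of the two = 2/3
#+end_src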
While these are good metrics, they do not capture the entire picture of what is happening. Precision, recall, and the F1 score do not take true negatives (TN) into account! Consider them in the application of designing a classifier for determining MLB Hall of Fame players. Ideally, the classifier should be equally good at distinguishing Hall of Fame players from the others. However, if we had to choose between precision and recall, it would make more sense to care more about recall, since it takes into account the false negatives, i.e. Hall of Fame players incorrectly predicted as regular players.
Now to the balanced metric, the Matthews Correlation Coefficient (MCC). This treats the true and predicted classes as two variables and computes the correlation coefficient between them [cite:@Chicco2020,@Shmueli].
$$MCC = \frac{TP\cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$$
This metric is symmetric, as it weights all values of the confusion matrix equally [cite:@Chicco2020]. The metric has a range of $[-1, 1]$, where 1 indicates that both classes are predicted perfectly. Because of these properties, the MCC was the metric used in this project.
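A small sketch of how the MCC punishes the useless 'always predict non-HoF' classifier from above, using ~matthews_corrcoef~ from scikit-learn on made-up labels:
#+begin_src python
from sklearn.metrics import matthews_corrcoef

# Hypothetical labels: 2 HoF players out of 15.
y_true          = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
always_negative = [0] * 15
better          = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1]

print(matthews_corrcoef(y_true, always_negative))  # 0.0 -- no better than chance
print(matthews_corrcoef(y_true, better))           # positive, but penalized for the false positive
#+end_src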
Training the Model
After finding a better metric to determine the performance of the classifier, the issue of training, or fitting, the classifier remains. If one were to train the classifiers without taking into account the imbalance of the two categories, all errors would be weighted equally. This would cause the classifier to become really good at predicting non-HoF players, while rarely being corrected for incorrectly labelling Hall of Fame players.
There are a few different ways one can go about training, or fitting, on imbalanced data. These are explained well in the article mentioned previously. Two of the methods discussed are undersampling and oversampling.
Undersampling would entail picking out fewer data points from the over-represented group, in our case the non-HoF players, so that the two groups are equally represented. In this problem we would end up with a data set consisting of 225 HoF and 225 non-HoF players. After separating the data into 80% training data and 20% testing data, this would leave only *360 data points* in the training data.
$$225 \text{ HoF} + 225 \text{ non-HoF} = 450 \text{ all data points}$$
$$450 \text{ all data points} \cdot 80\% = 360 \text{ training data points}$$
Considering that we have 26 input features, 360 data points might be sufficient to train or fit the model, but more data would be better.
Oversampling would involve picking out more (or re-picking) data from the under-represented group [cite:@Rocca2019]. This would involve picking out multiples of the same HoF players in order to balance out the data set. I am sure this is a fair option, but intuitively it seems risky to oversample when the ratio between the two groups is approximately 1:14. Keep in mind this is just a personal thought, and I am sure taking advantage of both over- and undersampling to smaller degrees is probably a good solution. However, I wanted to use the data 'as is', and attempt to use another technique to deal with the data imbalance.
Generating synthetic data is another data manipulation option, but I feel like this is poorly suited for this application. It might involve figuring out which features are the most important and finding a range between feature values and creating new data points based on this. This option seems more fitting for computer vision tasks where the synthetic data would involve slightly rotating images to be classified. In this project it seems to be more involved.
The methods above involve manipulations at the level of the data, but there are other methods either in the classifier or afterwards. A possible method for afterwards is through probability thresholds. It can be used when the classifier outputs a probability of being within a particular group [cite:@Rocca2019]. This involves training the model as normal, and then applying thresholds afterwards which adjust the likelihood based on the representation of the group within the data [cite:@Rocca2019].
The method used in this project was class reweighting. This typically involves altering the machine learning algorithms currently in use. Different classifiers are trained based on different principles. Cost-sensitive algorithms train the classifier based on a penalty, or /cost/, associated with an incorrect prediction [cite:@Brownlee2020a]. With our imbalanced data set, it is important to weight the cost associated with the minority group of Hall of Fame players much higher than that of the majority group of non-HoF players [cite:@Brownlee2020a]. Most algorithms are not created for cost-sensitive learning, but can be extended or modified to consider these costs [cite:@Brownlee2020a].
- Cost-sensitive algorithms can involve modifying existing algorithms to use costs as a penalty for misclassification [cite:@Brownlee2020a].
- This can be used for iteratively trained algorithms such as logistic regression and neural networks.
- The scikit-learn library allows a cost-sensitive extension via the ~class_weight~ argument, specifically for Logistic Regression; for a neural network, the cost-sensitive augmentation goes into the ~fit()~ function [cite:@Brownlee2020a]. A small sketch of this is shown after the cost-matrix example below.
TODO Add in a point about class weights
When explaining the cost matrix in this article, the author notes that a good heuristic for a cost matrix is to start with one that represents the ratio of the dataset [cite:@Brownlee2020a]. If the ratio of the data looks like ~1:100~, then the cost matrix could look like this [cite:@Brownlee2020a]:
|                    | Actual Negative | Actual Positive |
|--------------------+-----------------+-----------------|
| Predicted Negative | 0               | 100             |
| Predicted Positive | 1               | 0               |
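As mentioned in the list above, here is a small sketch of class reweighting in practice. It uses scikit-learn's ~compute_class_weight~ with the 'balanced' option, which roughly reproduces the ratio heuristic, and passes the result to a ~class_weight~-aware classifier; the Keras-style ~fit()~ call is only indicated in a comment, since the exact model is not shown here:
#+begin_src python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical training set mirroring the ~1:14 HoF imbalance.
y_train = np.array([1] * 225 + [0] * 3171)
X_train = np.random.default_rng(0).normal(size=(len(y_train), 26))

# 'balanced' weights each class inversely to its frequency.
weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=y_train)
print(dict(zip([0, 1], weights)))  # roughly {0: 0.54, 1: 7.5}

# scikit-learn: pass the weights (or simply 'balanced') to the classifier.
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_train, y_train)

# Keras-style neural network: the same idea goes into the fit() call, e.g.
# model.fit(X_train, y_train, class_weight={0: weights[0], 1: weights[1]})
#+end_src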
The Classifiers
To determine the performance of the neural network in this project, it should be compared to a baseline of other classifiers [cite:@Brownlee2014]. For structured data, Logistic Regression is the common baseline model [cite:@Ameisen]. /Decision Tree Classifiers/ are also common classifiers for binary classification tasks [cite:@Koseler2017]. Therefore, both of these classifiers were used in this project to compare against the neural network's performance.
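To make the comparison concrete, a sketch of such a baseline comparison could look like the following. The data here is purely synthetic (so the scores themselves are meaningless), and ~MLPClassifier~ merely stands in for the project's neural network:
#+begin_src python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import matthews_corrcoef
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the project's 26 features and ~1:14 imbalance.
rng = np.random.default_rng(1)
X = rng.normal(size=(3396, 26))
y = np.array([1] * 225 + [0] * 3171)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

models = {
    "logistic regression": LogisticRegression(class_weight="balanced", max_iter=1000),
    "decision tree": DecisionTreeClassifier(class_weight="balanced", random_state=0),
    # MLPClassifier has no class_weight argument, one reason a Keras model
    # with class weights may be preferable for the real neural network.
    "neural network (MLP)": MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=500, random_state=0),
}

for name, model in models.items():
    model.fit(X_tr, y_tr)
    print(name, "MCC:", round(matthews_corrcoef(y_te, model.predict(X_te)), 3))
#+end_src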
Electronics
TODO ESP32 RGB LED Programmable Controller
Perspectives
Literate Programming, Reproducible Research, and "Clean Code" + Docstrings
What is "Literate Programming"?
Donald Knuth formally introduced the idea of "Literate Programming" in 1984 cite:Knuth1992. In his book he suggested the following change in attitude:
#+begin_quote
Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want the computer to do.
#+end_quote
At Stanford University the idea of literate programming was realized through the ~WEB~ system. It involves combining a document formatting language with a programming language cite:Knuth1992. In the original report submitted to The Computer Journal in 1983, the example presented of the ~WEB~ system consisted of the combination of ~TeX~ and ~PASCAL~. This choice is not surprising, as Donald Knuth created ~TeX~. The code was written in a single ~.WEB~ file. A reason Knuth was happy with the term ~WEB~ is that he considered a complex piece of code as a web of simple materials delicately pieced together cite:Knuth1992. The 'weaving' process involves separating out the document formatting language into a separate file, while the 'tangling' entails extracting the programming language to produce machine-executable code cite:Knuth1992. This process is depicted in figure fig:WEB-overview cite:Knuth1992.
#+CAPTION: Adapted image from the original paper showing the dual usage of a ~.WEB~ file. The 'weaving' separates out the ~TeX~ from the ~.WEB~ file and compiles it to an output document, while the 'tangling' extracts the ~PASCAL~ code to produce a machine-executable program.
#+NAME: fig:WEB-overview
#+ATTR_HTML: :class center
../static/img_literate_programming_reproducible_research_clean_code/literate-programming-2.png
If you are considering using a tool for literate programming, their website has a long list of tools which can be used for writing code with literate programming cite:literate_programming_tools. While it is comprehensive, it does not list a fantastic tool which will be mentioned later. There are also several other projects online which seem to be popular, such as Zachary Yedidia's Literate project.
Data Science and Reproducible Research
The term "data science" likely first appeared in 1974 and initially defined by the following quote cite:Cao2017.
#+begin_quote
The science of dealing with data, once they have been established, while the relation of the data to what they represent is delegated to other fields of data.
#+end_quote
The term has increased in popularity since 2012 cite:Cao2017. This increase in interest is likely due to the emergence of data mining, big data, and the advances in Artificial Intelligence (AI) and Machine Learning (ML). Due to the accumulation of large amounts of data, interdisciplinary tools are being combined to form a "new" field. A more modern and comprehensive definition of data science is as follows cite:Cao2017.
#+begin_quote
From the disciplinary perspective, data science is a new interdisciplinary field that synthesizes and builds on statistics, informatics, computing, communication, management, and sociology to study data and its environments (including domains and other contextual aspects, such as organizational and social aspects) in order to transform data to insights and decisions by following a data-to-knowledge-to-wisdom thinking and methodology.
#+end_quote
# Reproducible research is a useful tool to be able to explain the context
# surrounding the code. This assists in creating reproducibility both
# benefiting the researcher and the scientific community.
A central point in literate programming is the intent of 'weaving' and 'tangling' the original document into two separate files. The original /web/ document was not necessarily intended to be used as the final result. However, the modern use of "reproducible research" implies using a similar literate programming /web/ document as the final result cite:Goodman2018.
#+CAPTION: Example of a Jupyter Notebook showing the mix of rich text and code blocks, along with output.
#+NAME: fig:jupyter-web
../static/img_literate_programming_reproducible_research_clean_code/jupyter_notebook_example.png
The computer scientist Jon Claerbout coined the term "reproducible research" and described it as a reader of a document being able to see the entire process from the raw data to the code producing the tables and figures cite:Goodman2018. The term can be slightly confusing, as reproducibility has long been an important part of research and the scientific method.
One of the issues with the use of software in scientific fields generally is the reproducibility of the research being conducted cite:Hinsen2013. Therefore, the greater context surrounding the code is important and possibly not self-explanatory through the code alone. Even experts within the field of exploratory data science struggle to keep track of the experiments they conduct, which can lead to confusion about which experiments have been performed and how the results were obtained cite:Hinsen2013. Since digitalization and digital tools have become a more important part of many fields, reproducible research has become popular in fields such as epidemiology, computational biology, economics, and clinical trials cite:Goodman2018.
While literate programming and reproducible research are closely related in form, they were intended for slightly different use cases. Literate programming is not specifically for research, but is a broader approach to programming, while reproducible research focuses on providing the full context of an experiment or study rather than making the code itself more legible.
Some of the more popular tools for reproducible research are Jupyter Notebooks for programming in Python (example in figure fig:jupyter-web), and the integration of ~Sweave~ and ~knitr~ into R Markdown for programming in the statistical language R cite:Kery2018. Jupyter Notebooks is an open-source spin-off from the IPython (Interactive Python) project. Python is considered a fairly easy language to learn, and there are several Python projects for data manipulation as well as support for different machine learning frameworks such as TensorFlow and PyTorch. This makes Jupyter Notebooks particularly well suited for data science.
Readable and Clean Code
Knuth's attitude change mentioned above was primarily about making code more readable for humans. Literate programming is one method of accomplishing this goal, but it is not the only avenue to take. What if the code were written so well that it was completely self-explanatory?
Robert C. Martin, affectionately called "Uncle Bob", is a software engineer and the author of the bestselling book Clean Code: A Handbook of Agile Software Craftsmanship cite:martin2009clean. He has an interesting talk on clean code where he explains that code should read like "well written prose". In his book he writes cite:martin2009clean:
#+begin_quote Indeed, the ratio of time spent reading versus writing is well over 10 to 1. We are constantly reading old code as part of the effort to write new code. ...[Therefore,] making it easy to read makes it easier to write. #+end_quote
While writing readable code is something to always strive towards, it will likely never be completely self-explanatory. Therefore, it would likely be beneficial to write good documentation to complement the code.
Self-Documentation with Docstrings
Emacs has a nice method of self-documentation which is very helpful for understanding the editor's source code written in ~elisp~. It uses documentation strings (or docstrings), which are comments written directly into the code. Writing comments in code to describe functions is not at all uncommon. However, traditional in-file documentation, such as Javadoc, is stripped from the program and is not available at runtime. Docstrings, on the other hand, are maintained throughout the runtime of the code in which they are written, so the programmer can inspect them interactively while the program is running.
Some notable languages which support docstrings include (a short Python illustration follows this list):
- Lisp/Elisp
- Python
- Elixir
- Clojure
- Haskell
- Gherkin
- Julia
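As a minimal illustration of why this is useful, a Python docstring (one of the languages listed above) survives into the running program and can be inspected interactively, much like ~describe-function~ in Emacs:
#+begin_src python
def greet(name: str) -> str:
    """Return a short greeting for NAME.

    The first line is a summary; the rest can describe arguments and
    behaviour, similar to an Emacs docstring.
    """
    return f"Hello, {name}!"

# The docstring is kept at runtime and can be inspected interactively.
print(greet.__doc__)
help(greet)
#+end_src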
Emacs is almost like working in a ~lisp~ terminal. The editor is largely written in a variant of ~lisp~ called ~elisp~. For example, in order to split a window vertically, the function ~split-window-below~ is called. The code snippet below is the start of the source code for this function.
#+begin_src elisp
(defun split-window-below (&optional size)
  "Split the selected window into two windows, one above the other.
The selected window is above.  The newly split-off window is below
and displays the same buffer.  Return the new window.

If optional argument SIZE is omitted or nil, both windows get the
same height, or close to it.  If SIZE is positive, the upper
\(selected) window gets SIZE lines.  If SIZE is negative, the
lower (new) window gets -SIZE lines.

If the variable `split-window-keep-point' is non-nil, both windows
get the same value of point as the selected window.  Otherwise, the
window starts are chosen so as to minimize the amount of redisplay;
this is convenient on slow terminals."
  (interactive "P")
  (let ((old-window (selected-window))
        (old-point (window-point))
        (size (and size (prefix-numeric-value size)))
        moved-by-window-height moved new-window bottom)
    (when (and size (< size 0) (< (- size) window-min-height))
      ;; `split-window' would not signal an error here.
      (error "Size of new window too small"))
    ;; ...
    ;; ...
    ;; ...
#+end_src
When calling the function ~describe-function~ and inputting the function name, "split-window-below", it generates the following output in a mini-buffer (a kind of little buffer). This documentation is created dynamically at runtime.
#+begin_quote
split-window-below is an interactive compiled Lisp function in ‘window.el’.
It is bound to SPC w -, SPC w s, M-m w -, M-m w s, C-x 2,
(split-window-below &optional SIZE)
Probably introduced at or before Emacs version 24.1.
Split the selected window into two windows, one above the other. The selected window is above. The newly split-off window is below and displays the same buffer. Return the new window.
If optional argument SIZE is omitted or nil, both windows get the same height, or close to it. If SIZE is positive, the upper (selected) window gets SIZE lines. If SIZE is negative, the lower (new) window gets -SIZE lines.
If the variable ‘split-window-keep-point’ is non-nil, both windows get the same value of point as the selected window. Otherwise, the window starts are chosen so as to minimize the amount of redisplay; this is convenient on slow terminals.
#+end_quote
Comparing the docstring in the source code with the output of ~describe-function~, there is more information added to the output documentation. This was a slight aside, but using the ~describe-*~ functions is probably one of the most useful and helpful ways to learn Emacs. In any case, it shows one benefit of how docstrings are used in Emacs.
Org-Mode for Literate Programming and Reproducible Research
While Jupyter Notebooks are a fantastic way of writing reproducible research, they are not a method of literate programming, as they are not intended to be 'weaved' and 'tangled'. The original tool, the ~WEB~ system for literate programming, does not allow for compiling embedded code interactively. Org-mode, however, was the first tool to provide full support for both reproducible research and literate programming cite:Schulte2012.
Org-mode is something called a major mode in Emacs. An ~.org~ file is essentially a plain text markup language, but there are so many things that can be done in org-mode it is mind-boggling. With some configuration it can be used as almost anything, including an advanced agenda system, calendar, financial ledger, zettelkasten note-taking system (org-roam), and an exporter into almost any text format. Even that is an inadequate list, but for those interested this is a more complete list.
#+CAPTION: An example of a ~.org~ file with code blocks in ~emacs-lisp~ which will be tangled to a file called ~spacemacs.el~.
#+NAME: fig:org-babel
../static/img_literate_programming_reproducible_research_clean_code/org-bable.png
Using a package called ~org-babel~, an ~.org~ file allows for both 'weaving', using org-mode's built-in export functionality, and 'tangling', using ~org-babel-tangle~ on the entire file (see figure fig:org-babel). As with other markup languages, ~.org~ files already support rich text formatting of code blocks. With simple tags and commands, the embedded code blocks can be run dynamically by using certain ~session~ tags. Running embedded code blocks in this manner, through the same or multiple sessions in the same file, is important for reproducible research. More details on using org-mode for both literate programming and reproducible research can be found in the paper by cite:Schulte2012.
A blog post which explains and demonstrates how to use ~org-babel~ can be found here.
Closing thoughts
While literate programming is a useful tool for providing more context to programming projects, in many cases writing self-explanatory code might be the best avenue for pure software engineering. Reproducible research is better suited for explaining a study or an experiment, which is particularly useful in scientific practice today as the use of digital tools continues to grow. Contrary to software engineering, these use cases aren't well suited for simply writing self-explanatory code. Literate programming and reproducible research approach mixed natural-language and computational-language documents from different ends cite:Schulte2012: while literate programming introduces natural language into programming files, reproducible research adds embedded executable code to natural language documents cite:Schulte2012.
If humans were capable of easily writing in machine code, there would be no need for all the different types of programming languages. The reason we use programming languages is to be able to interpret and understand what we are instructing the computer to do. Therefore, making the code understandable should be a top priority. Even though Knuth's literate programming might not be suitable in all settings, I think his attitude about telling the reader what we are instructing the machine to perform is valuable in any programming situation. This is especially true for collaboration and long term complexity of coding projects.
While some of the differences between literate programming and reproducible research have been mentioned, particular examples of literate programming were not discussed. Configuration files are ideal for literate programming. They typically require explanations of the context of the settings or an explanation of the setting itself. After writing the literate configuration, it can be tangled to generate the actual configuration. I hope to write a post soon showing how I created a literate configuration for Spacemacs.