My Personal Website
COMMENT Handy-Templates
TODO Original Template
First heading within the post
- This post will be exported to the "main" menu (~EXPORT_HUGO_MENU: :menu "main"~).
- LETS SEE IF THIS HAPPENS Its title will be "Writing Hugo blog in Org".
- It will have hugo and org tags and emacs as category.
- The menu item weight and post weight are auto-calculated.
- The menu item identifier is auto-set.
- The lastmod property in the front-matter is set automatically to the time of export.
- It will be exported as content/posts/writing-hugo-blog-in-org-subtree-export.md.
A sub-heading under that heading
- Its draft state will be marked as ~true~ as the subtree has the todo state set to TODO.
With the point anywhere in this /Writing Hugo blog in Org/ post subtree, do ~C-c C-e H H~ to export just this post.
TODO Useful Template
The exported Markdown has a little comment footer as set in the Local Variables section below.
Spacemacs
Creating a Literate Config for Spacemacs
All posts in here will have the category set to spacemacs.
Why create a "literate configuration"?
In a previous post, I explained the concept and origin of literate programming. A literate configuration is simply applying the literate programming concept to configuration files. This is useful because configuration files typically require more context around the settings and why you chose to do something in a particular manner. Sometimes this could include links to where you found the configuration settings or 'work in progress' settings. Another reason literate configurations are useful is that configuration files are not necessarily modified every day. When jumping back into your config to fix, add, or modify a feature, it can be useful to have a short explanation to quickly catch yourself up.
For this particular post the goal is to create a literate configuration for Spacemacs, a heavily modded distribution of Emacs. This is mainly because I use Spacemacs myself and find having a literate configuration for it very useful. The literate configuration itself is an ~.org~ file where the configuration code is defined in code blocks. Using org-mode headings also makes it much easier to navigate your configuration file. Extracting the code from the literate configuration, or /tangling/ it, will yield an ~elisp~ file.
Spacemacs is also well suited for a literate configuration since it by default loads a single ~.spacemacs~ file, which tends to steadily grow as it increasingly takes over your life. At the time of writing, my current 'tangled' ~spacemacs.el~ configuration (just the code) is 1500 lines long.
Correctly loading Spacemacs literate config file
When I was attempting to create a literate Spacemacs config, there were several helpful links explaining how to tangle and generate output files from an ~.org~ file, but I struggled to figure out the best way to load the file itself. There are two default locations for your Spacemacs configuration, with the following loading priority:
- ~~/.spacemacs~: A single dotfile in the home directory
- ~~/.spacemacs.d/init.el~: A dotfile directory in the home directory
For a literate configuration we will need multiple files, therefore I find it more orderly to go with the second option and create a ~~/.spacemacs.d~ directory.
Initially, it seemed to make sense to create a ~spacemacs.org~ literate configuration file and tangle it directly to either the ~.spacemacs~ or ~init.el~ file. It could be done this way, but Spacemacs adds custom content at the end of the ~init.el~ (or ~~/.spacemacs~) file in the following function:
#+begin_example elisp
(defun dotspacemacs/emacs-custom-settings ()
  "Emacs custom settings.
This is an auto-generated function, do not modify its content directly, use
Emacs customize menu instead.
This function is called at the very end of Spacemacs initialization."
#+end_example
This is where Emacs stores ~custom-set-variables~ and ~custom-set-faces~ generated during runtime. If we were to /tangle/ our literate configuration file directly to the ~init.el~ file, it would overwrite this function and these variables every time it was generated. Therefore, the literate configuration can instead be tangled to an intermediate file, for example ~spacemacs.el~, and then the ~init.el~ file can load its contents:
- ~spacemacs.org~: The literate configuration file
- ~spacemacs.el~: The tangled configuration file
- ~init.el~: The file that Spacemacs loads as a default configuration file.
An overview of how this works is shown in figure fig:spacemacs-config-overview.
#+CAPTION: An overview of the concept of how to make a literate config for Spacemacs that doesn't overwrite the auto-generated settings from Spacemacs. Created in Krita.
#+NAME: fig:spacemacs-config-overview
#+ATTR_HTML: :class center
../static/img_literate_spacemacs_config/literate_spacemacs_config_crop.png
The 'spacemacs.org' file
This is the literate configuration file itself. When creating this you have to ensure every part of your configuration file is copied over in the code snippets. Below is the start of my configuration file. Most of these settings are not particularly important for generating the configuration file, but they have served me well thus far.
#+begin_example org
#+TITLE: Spacemacs Literate User Configuration
#+STARTUP: headlines
#+STARTUP: nohideblocks
#+STARTUP: noindent
#+OPTIONS: toc:4 h:4
#+PROPERTY: header-args:emacs-lisp :comments link
#+end_example
An example of a code block from part of my configuration is shown below. The important part is that the configuration is inside the code blocks.
** Function start and default settings
This is just the start of some configuration options copied directly over from the default configuration.
#+begin_src emacs-lisp :tangle spacemacs.el
(defun dotspacemacs/layers ()
  (setq-default
   dotspacemacs-distribution 'spacemacs
   dotspacemacs-enable-lazy-installation 'unused
   dotspacemacs-ask-for-lazy-installation t
   dotspacemacs-configuration-layer-path '()
#+end_src
What defines the code block:
#+begin_example
#+begin_src emacs-lisp :tangle spacemacs.el
#+end_src
#+end_example
The ~emacs-lisp~ option indicates what language the code block is written in. The ~:tangle spacemacs.el~ header argument sets the output target file to ~spacemacs.el~ when the file is tangled. My complete ~spacemacs.org~, with all its warts, can be found in my dotfiles repository on GitHub.
The 'spacemacs.el' file
In order to generate the tangled file, ~spacemacs.el~, the function ~org-babel-tangle~ has to be run on the literate configuration file. For this function to be run every time you save the file, 'local variables' along the lines of the following can be added to the bottom of ~spacemacs.org~. This will add a function to the ~after-save-hook~; a hook is a variable that holds a list of functions to be run at a specific time (in this case, after saving).
#+begin_example org
* Local Variables :ARCHIVE:
# Local Variables:
# eval: (add-hook 'after-save-hook #'org-babel-tangle nil t)
# End:
#+end_example
The 'init.el' file
This is the file Spacemacs loads as the configuration. You only have to add the following line to the file.
#+begin_example elisp
(load-file "~/.spacemacs.d/spacemacs.el")
#+end_example
Spacemacs should then load your configuration and you should be up and running! As you continue to use Spacemacs, it will populate the ~init.el~ file with something like:
#+begin_example elisp
(load-file "~/.spacemacs.d/spacemacs.el")
(defun dotspacemacs/emacs-custom-settings ()
  "Emacs custom settings.
This is an auto-generated function, do not modify its content directly, use
Emacs customize menu instead.
This function is called at the very end of Spacemacs initialization."
  (custom-set-variables
   ;; custom-set-variables was added by Custom.
   ;; If you edit it by hand, you could mess it up, so be careful.
   ;; Your init file should contain only one such instance.
   ;; If there is more than one, they won't work right.
   '(evil-want-Y-yank-to-eol nil)
   ...
   ...
#+end_example
That's it!
Now when you change your ~spacemacs.org~ file and tangle the output to ~spacemacs.el~, it will not overwrite the variables generated by Spacemacs found in ~init.el~. It is not that complicated, but I hope this helps someone fairly new to ~org-babel~ or Spacemacs.
TODO Quickly Accessing files with Emacs Bookmarks
General Linux
TODO Managing your dotfiles on Github using GNU Stow
Machine Learning
TODO Using a Neural Network to Determine MLB Hall of Fame Players
Background
The project discussed in this post was originally created for a master's degree course, TTK28 - Modeling with Neural Networks. It was heavily inspired by Rui Hu's (rh2835) project Baseball-Analytics, and the data used is from the amazingly comprehensive dataset, Sean Lahman's Baseball Database. The Baseball-Analytics project had a part which involved using Linear Regression to determine which Major League Baseball (MLB) players would end up in the MLB Hall of Fame (HoF). Since the master's course I was taking was primarily concerned with using Neural Networks for Data Science, it seemed fair to reuse some of Rui Hu's code for data preparation and cleaning.
The Objective:
Determine if an eligible Major League Baseball (MLB) Player will make it into the National Baseball Hall of Fame (HoF) based on career statistics using a Neural Network.
Statistical analysis within baseball has been popular for a long time. This is partially due to the wealth of data and discrete events that occur during the game [cite:@Koseler2017]. The use of statistics has always been popular, but Bill James popularized the use of more complex analysis in the 1980's [cite:@Koseler2017]. Recently, machine learning has been used more and more in baseball. A literature review found that Support Vector Machines (SVM) were the most popular methods used for binary classification, while Bayesian Inference mixed with Linear Regression was the most used method for Regression tasks [cite:@Koseler2017]. Since the objective is to determine if a player will reach the Hall of Fame or not, it is a binary classification task.
Hall of Fame Eligibility and Selection
Why is this problem suitable for a machine learning technique, such as a neural network? The players are voted in by the Baseball Writers' Association of America (BBWAA) or Era Committees (formerly known as the Veterans Committee) [cite:@NationalBaseballHallofFame,@NationalBaseballHallofFamea]. The rules for which players are eligible for the HoF have changed throughout its existence. One of the lasting requirements is that they must have played in the league for a certain number of seasons; currently, this criterion is having played at least 10 seasons [cite:@NationalBaseballHallofFame]. The BBWAA requires that the player has been active during one of the 15 seasons prior to the election and has not played within the last five seasons [cite:@NationalBaseballHallofFame]. In contrast, the Era Committees were established to vote in players from different eras of baseball who were previously not voted in but deserve a spot in the HoF.
Since players are voted in mainly by sportswriters who follow the sport, with no specific statistical criteria, the process is somewhat arbitrary. There are several other factors which impact the sportswriters, such as character and sportsmanship. Nevertheless, this project will attempt to determine Hall of Fame induction based on a collection of career statistics.
Major Project Steps
The player data in the database goes all the way back to 1871, and the 2020 database version contains data through the 2019 season [cite:@Lahman]. The data was fragmented over several files. The ~extract_data.py~ or ~extract_data.org~ file aggregates relevant data and stores it in ~final.csv~. The data underwent further cleaning, manipulation and feature creation through the ~filter_data.py~ or ~filter_data.org~ file. After finishing the preparation of the data into a ~test~ and ~train~ dataset, the data is imported in ~hof_model.py~ or ~hof_model.org~. This file handles inputting the data into various classifiers, training the classifiers, and the final testing of the classifiers with different metrics.
The source code for this project can be found in the GitHub repository, machine-learning-MLB-HoF. Part of the goal of this project was to examine Org-mode's (in Emacs) usage in reproducible research. Therefore, it was possible to render these ~.org~ files to the HTML pages linked below. While this was fun to test out, it is not the main focus of this post.
- [[https://www.olavpedersen.com/standalone_hof/extract_data.html][extract_data]]: Extracting data from Lahman's raw data.
- [[https://www.olavpedersen.com/standalone_hof/filter_data.html][filter_data]]: Data manipulation, feature creation and feature selection.
- [[https://www.olavpedersen.com/standalone_hof/grid_search_results.html][grid_search_results]]: Storing some of the results for the best Grid Searches.
- [[https://www.olavpedersen.com/standalone_hof/hall_of_fame_model.html][hof_model]]: Creation of the model and training of the neural network.
Feature Selection
There are several specialized positions in baseball which require different skills. Pitchers often do not participate in hitting (when the team is on offense). For first basemen and catchers, their hitting ability is less important than their defensive abilities. Since these positions perform different tasks, they are evaluated using different metrics. Additionally, the skill of individuals in these positions is not necessarily easily measured by a single statistic. The quality of these metrics is quite variable:
- Good for batters (OPS, SLG, OBS, etc.)
- Okay for pitchers (WHIP, K/BB, K/9, etc.)
- Poor for fielders (FFRA, UZR, etc.)
The variability in quality of position metrics led me to include other types of statistics, namely player accolades together with some general statistics. Initially, the following eleven features were selected for this project:
| Position | Feature or Statistic              |
|----------+-----------------------------------|
| All      | Number of games played            |
| All      | Date of last game played          |
| All      | Years Played                      |
| All      | Player Position                   |
| Batter   | OPS (On-base Plus Slugging)       |
| Batter   | Silver Slugger Award              |
| All      | Most-Valuable Player (MVP) Awards |
| All      | All-Star Games Played             |
| Fielder  | Gold Glove Awards                 |
| All      | Rookie of the Year Award          |
| All      | World Series MVP Award            |
Most of these stats are fairly self-explanatory, though some of them are more cryptic. The Gold Glove Award, or simply Gold Glove, is given out to the players who exhibit overall fielding excellence regardless of position [cite:@Rawlings]. The Silver Slugger recognizes the best offensive players at each position in both the American League and National League [cite:@MLB]. Since these and other awards have not always been around, it seemed reasonable to include the 'Date of last game played' feature, which should hopefully assist in weighting those award features.
The OPS consists of two other metrics, On-Base-Percentage (OBP) and Slugging Percentage (SLG) as shown below:
$$OPS = OBP + SLG$$
The OBP refers to how frequently a batter reaches a base per plate appearance [cite:@MLBa]. As seen in the equation below, it includes hits, walks, and hit-by-pitches, but not errors, fielder's choice, or dropped third strikes [cite:@MLBa]. There is a summary table of abbreviations below.
$$OBP = \frac{H+BB+HBP}{AB+BB+SF+HBP}$$
The SLG is the total number of bases reached per at-bat (AB) [cite:@MLBb].
$$SLG = \frac{B+2B\cdot 2 + 3B\cdot 3 + HR\cdot 4}{AB} = \frac{TB}{AB}$$
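As a minimal sketch (not the project's actual code), the OBP, SLG, and OPS formulas above could be computed from raw counting stats like this, assuming a pandas DataFrame with hypothetical Lahman-style columns ~H~, ~BB~, ~HBP~, ~AB~, ~SF~, ~2B~, ~3B~, and ~HR~:
#+begin_src python
import pandas as pd

def add_batting_stats(df: pd.DataFrame) -> pd.DataFrame:
    """Add OBP, SLG and OPS columns derived from raw counting stats."""
    # Singles are hits that are not doubles, triples or home runs.
    singles = df["H"] - df["2B"] - df["3B"] - df["HR"]
    total_bases = singles + 2 * df["2B"] + 3 * df["3B"] + 4 * df["HR"]
    df["OBP"] = (df["H"] + df["BB"] + df["HBP"]) / (df["AB"] + df["BB"] + df["SF"] + df["HBP"])
    df["SLG"] = total_bases / df["AB"]
    df["OPS"] = df["OBP"] + df["SLG"]
    return df
#+end_src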
The above eleven features were used in the course project due to a time crunch. If we are being honest, they are kind of uninteresting features. They include only a single "baseball" stat (OPS), while the others are dates, numbers of games played, or awards. After the project was finished, I wanted to experiment with adding more features and stats to test out the neural network properly.
After some more feature engineering, the following features were used in addition to those above, giving a total of 20 features:
| Position | Feature or Statistic                     |
|----------+------------------------------------------|
| Batter   | SB (Stolen Bases)                        |
| Batter   | HR (Home Runs)                           |
| Batter   | A (Assists)                              |
| Pitcher  | WHIP (Walks and Hits per Inning Pitched) |
| Pitcher  | ERA (Earned Run Average)                 |
| Pitcher  | K/BB (Strikeout-to-walk Ratio)           |
| Fielder  | PO (Putout)                              |
| Fielder  | E (Errors)                               |
| Fielder  | DP (Double Plays)                        |
These were an attempt to add more stats for batters, pitchers, and fielders to better cover the different player positions. Of these, WHIP, ERA, and K/BB are pitcher stats; their formulas are defined below. Ideally, a pitcher wants a lower WHIP and ERA, but a higher K/BB.
$$WHIP = \frac{BB+H}{IP}$$
$$ERA = 9 \cdot \frac{ER}{IP}$$
$$K/BB = \frac{K}{BB}$$
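A similar sketch for the pitcher stats, again with hypothetical column names (~IPouts~ being outs recorded, so innings pitched is ~IPouts~ divided by 3; ~SO~ being strikeouts):
#+begin_src python
import pandas as pd

def add_pitching_stats(df: pd.DataFrame) -> pd.DataFrame:
    """Add WHIP, ERA and K/BB columns derived from raw pitching stats."""
    innings = df["IPouts"] / 3  # three outs per inning pitched
    df["WHIP"] = (df["BB"] + df["H"]) / innings
    df["ERA"] = 9 * df["ER"] / innings
    df["K_BB"] = df["SO"] / df["BB"]
    return df
#+end_src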
The other statistics are directly recorded from game play and not derived through formulas. Just to ensure that the average reader can still follow along, I will briefly explain some of them. A Stolen Base is when a base runner advances by taking a base to which he isn't entitled [cite:@MLBd]. A Home Run is when a player hits the ball, allowing for a four-base run [cite:@Chase2016]. A Putout is when a fielder records the act of completing an /out/, either by a force-out, tagging a runner, catching a third strike, or catching a batted ball [cite:@MLBe]. An Error is when, in the judgement of the official scorer, a fielder fails to convert an out on a play that an average fielder should have made [cite:@MLBf]. A Double Play is when two offensive players are ruled out within the same play [cite:@MLBg].
Overview of abbreviations [cite:@MLBc]:
| Abbreviation | Meaning                           |
|--------------+-----------------------------------|
| OPS          | On-Base Plus Slugging             |
| OBP          | On-Base Percentage                |
| SLG          | Slugging Percentage               |
| H            | Hit                               |
| BB           | Walk                              |
| HBP          | Hit-By-Pitch                      |
| AB           | At-Bats                           |
| SF           | Sacrifice Fly Balls               |
| 1B or B      | Single (getting to first base)    |
| 2B           | Double (getting to second base)   |
| 3B           | Triple (getting to third base)    |
| HR           | Home Run                          |
| TB           | Total Bases                       |
| SB           | Stolen Bases                      |
| WHIP         | Walks and Hits per Inning Pitched |
| ERA          | Earned Run Average                |
| K/BB         | Strikeout-to-walk Ratio           |
| PO           | Putout                            |
| A            | Assist                            |
| E            | Error                             |
| DP           | Double Play                       |
Feature Scaling and One Hot Encoding
The features above have different data ranges and can be categorical, discrete numeric, or continuous numeric values.
Let's consider the numeric features first. Dates can be large numbers depending on how they are converted to numeric values, while SLG can only range from $0.00$ to $4.00$. If these numbers were fed directly into a neural network, the numeric operations for each node could affect the weighting disproportionately for each input feature [cite:@Stottner2019]. In an attempt to minimize this difference between features, the data is typically either normalized or standardized. Since most of the features used in this project are not bound by a numeric range, standardization is more reasonable than normalization [cite:@Chadha2021]. Certain features also contain outliers; these strongly affect normalization of the data, while standardization is affected to a smaller degree [cite:@Chadha2021]. Considering this, all the numeric features were standardized.
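A minimal sketch of this standardization step with scikit-learn, using made-up numbers purely for illustration; the important detail is that the scaler is fit on the training data only:
#+begin_src python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric features (e.g. games played and OPS) for a few players.
X_train = np.array([[3000, 0.81], [1200, 0.65], [2400, 0.92], [800, 0.70]])
X_test = np.array([[1500, 0.75]])

scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)  # learn mean and std from the training data
X_test_std = scaler.transform(X_test)        # apply the same statistics to the test data
#+end_src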
While most of the features are non-categorical, being either discrete or continuous numeric values, the player position is a categorical feature. The seven unique values in the player position column (~POS~) are: ~['C' '1B' 'OF' '2B' '3B' 'P' 'SS']~. Popular methods for dealing with categorical data include label encoding, one-hot encoding, and binary encoding, to name a few [cite:@Pathak2020]. Label encoding would involve assigning a digit to each of the seven different positions. This can cause issues, as higher numeric values are given to seemingly equal categorical values. One-hot encoding gets around this by converting the single position column into seven different columns, one column for each position. For a given position, for example pitcher (~P~), that column would contain a 1 while all the other position columns would contain 0. Since the current number of features was 20, it seemed reasonable to increase the size of the input features to 26 by converting the ~POS~ column into seven columns.
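A small sketch of the one-hot encoding of the ~POS~ column, here using ~pandas.get_dummies~ (the actual project may well have used a different function):
#+begin_src python
import pandas as pd

# The seven position codes from the POS column.
positions = pd.DataFrame({"POS": ["C", "1B", "OF", "2B", "3B", "P", "SS"]})

# One column per position: 1 for the player's position, 0 for the rest.
one_hot = pd.get_dummies(positions["POS"], prefix="POS")
print(one_hot)
#+end_src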
Test Data and Training Data
As you might have picked up on from the description above, this is a supervised learning project, meaning there is input data, /x/, with a corresponding labelled output result, /y/. Knowing the result /y/ for a given input /x/ allows a model to be trained to predict /y/ for a given /x/ [cite:@VanEngelen2020].
In supervised learning, one typically has a finite amount of data to create the model with. It is normal to split the data set into a /training/ data set, which is used to train the model, and a /test/ data set, where you examine the model's ability to predict the output /y/ given the input /x/. If the whole data set was used for training, you would have no way of knowing how well your classifier performs. If you used training data to test the performance of your model, you have successfully muddied the score metrics of your classifiers [cite:@Brownlee2017]. It introduces bias in your results, since you have 'shown' the model your testing data points and used them to improve its performance.
For this project, 80% of the data was used as the training data, while 20% was held out as the test set.
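A sketch of the 80/20 split with scikit-learn, on a tiny synthetic stand-in for the real data; the ~stratify~ argument is my own addition here and keeps the HoF ratio similar in both sets:
#+begin_src python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 players, 26 standardized features, 7 HoF labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 26))
y = np.array([1] * 7 + [0] * 93)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
#+end_src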
Imbalanced Data
One of the bigger issues with the data set is the imbalance of the two categories! Of all the 3396 eligible players for the MLB Hall of Fame, there are only 225 players in the Hall of Fame (with valid feature data!). This means that only about 6.63 % of eligible players make it into the Hall of Fame.
$$\frac{225 \text{ HoF}}{3396 \text{ Eligible}} = 0.06625 \approx 6.63 \%$$
This means the database is imbalanced with a ratio of:
$$225 \text{ HoF} : (3396 \text{ Eligible} - 225 \text{ HoF})$$
$$= 225 \text{ HoF} : 3171 \text{ non-HoF, but eligible}$$
$$\approx 1 : 14.09$$
This raises a few considerations:
Classifier performance metric
Typically, accuracy is the metric used for determining a classifier's performance. It is the number of correctly predicted data points out of all of the data points [cite:@Chicco2020]. Let's consider a classifier which ALWAYS predicts that a player is not a Hall of Fame player. The accuracy of that classifier would then be:
$$100 \% - 6.63 \% = 93.37 \%$$
In an evenly split data set this would be a fantastic result! However, in this application it is quite useless.
Let's consider other metrics. This article on Towards Data Science is fantastic at explaining different metrics and the handling of imbalanced data in data science. Introducing the confusion matrix is probably a good place to start. It consists of the following matrix:
|                 | Predicted Positive  | Predicted Negative  |
|-----------------+---------------------+---------------------|
| Actual Positive | True Positive (TP)  | False Negative (FN) |
| Actual Negative | False Positive (FP) | True Negative (TN)  |
It represents a matrix of the binary predictions in the columns and the true, or actual, values in the rows. There are a few different metrics which are derived from this table, namely accuracy, precision, recall, and F1 score. Each is briefly explained below [cite:@Chicco2020,@Rocca2019].
- Accuracy, mentioned previously, is essentially how many predictions you got correct out of all of the predictions that were made.
$$Accuracy = \frac{TP+TN}{(TP + FP)+(FN+TN)}$$
- Precision is how many of the positive predictions that were made were correct.
$$Precision = \frac{TP}{TP + FP}$$
- Recall is how many of the true positives were correctly predicted as positive.
$$Recall = \frac{TP}{TP + FN}$$
- F1 is the harmonic mean of the precision and recall. It is a type of mean which is often used in determining rates of change and percentages [cite:@Ferger1931]. Basically, just consider it to be an average of precision and recall.
$$F1 = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall} = \frac{TP}{TP+\frac{1}{2}(FP+FN)}$$
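As a quick illustration of these definitions (with made-up predictions for eight players), scikit-learn can compute all four metrics directly from the labels:
#+begin_src python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, confusion_matrix)

# Hypothetical labels: 1 = HoF, 0 = not HoF.
y_true = [1, 0, 0, 1, 0, 0, 0, 1]
y_pred = [1, 0, 0, 0, 0, 1, 0, 1]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, tn, fp, fn)                   # 2 4 1 1
print(accuracy_score(y_true, y_pred))   # (TP + TN) / all = 6/8
print(precision_score(y_true, y_pred))  # TP / (TP + FP) = 2/3
print(recall_score(y_true, y_pred))     # TP / (TP + FN) = 2/3
print(f1_score(y_true, y_pred))         # harmonic mean of the two = 2/3
#+end_src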
While these are good metrics, they do not capture the entire picture of what is happening. Precision, recall, and the F1 score do not take true negatives (TN) into account! Consider them in the application of designing a classifier for determining MLB Hall of Fame players. Ideally, the classifier should be equally good at distinguishing Hall of Fame players from the others. However, if we had to choose between precision and recall, it would make more sense to care more about recall, since it takes into account the false negatives, i.e. Hall of Fame players incorrectly predicted as regular players.
Now to the balanced metric, the Matthews Correlation Coefficient (MCC). This treats the true and predicted classes as two variables and computes the correlation coefficient between them [cite:@Chicco2020,@Shmueli].
$$MCC = \frac{TP\cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$$
This metric is symmetric, as it weights all values of the confusion matrix equally [cite:@Chicco2020]. The metric has a range of $[-1, 1]$, where 1 indicates that both classes are predicted perfectly. Because of these properties, the MCC was the metric used in this project.
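A small sketch of how the MCC punishes the useless 'always predict non-HoF' classifier from above, using ~matthews_corrcoef~ from scikit-learn on made-up labels:
#+begin_src python
from sklearn.metrics import matthews_corrcoef

# Hypothetical labels: 2 HoF players out of 15.
y_true          = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
always_negative = [0] * 15
better          = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1]

print(matthews_corrcoef(y_true, always_negative))  # 0.0 -- no better than chance
print(matthews_corrcoef(y_true, better))           # positive, but penalized for the false positive
#+end_src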
Training the Model
After finding a better metric to determine the performance of the classifier, the issue of training, or fitting, the classifier remains. If one were to train the classifiers without taking into account the imbalance of the two categories, all errors would be weighted equally. This would cause the classifier to become really good at predicting non-HoF players, while rarely being corrected for incorrectly labelling Hall of Fame players.
There are a few different ways one can go about training, or fitting, on imbalanced data. These are explained well in the article mentioned previously. Two of the methods discussed are undersampling and oversampling.
Undersampling would entail picking out fewer data points from the over-represented group, in our case the non-HoF players, so that the two groups are equally represented. In this problem we would end up with a data set consisting of 225 HoF and 225 non-HoF players. After separating the data into 80% training data and 20% testing data, this would leave only *360 data points* in the training data.
$$225 \text{ HoF} + 225 \text{ non-HoF} = 450 \text{ all data points}$$
$$450 \text{ all data points} \cdot 80\% = 360 \text{ training data points}$$
Considering that we have 26 input features, 360 data points might be sufficient to train or fit the model, but more data would be better.
Oversampling would involve picking out more (or re-picking) data from the under-represented group [cite:@Rocca2019]. This would involve picking out multiples of the same HoF players in order to balance out the data set. I am sure this is a fair option, but intuitively it seems risky to oversample when the ratio between the two groups is approximately 1:14. Keep in mind this is just a personal thought, and I am sure taking advantage of both over- and undersampling to smaller degrees is probably a good solution. However, I wanted to use the data 'as is', and attempt to use another technique to deal with the data imbalance.
Generating synthetic data is another data manipulation option, but I feel like this is poorly suited for this application. It might involve figuring out which features are the most important and finding a range between feature values and creating new data points based on this. This option seems more fitting for computer vision tasks where the synthetic data would involve slightly rotating images to be classified. In this project it seems to be more involved.
The methods above involve manipulations at the level of the data, but there are other methods either in the classifier or afterwards. A possible method for afterwards is through probability thresholds. It can be used when the classifier outputs a probability of being within a particular group [cite:@Rocca2019]. This involves training the model as normal, and then applying thresholds afterwards which adjust the likelihood based on the representation of the group within the data [cite:@Rocca2019].
The method used in this project was class reweighting. This typically involves altering the machine learning algorithms currently in use. Different classifiers are trained based on different principles. Cost-sensitive algorithms train the classifier based on a penalty, or /cost/, associated with an incorrect prediction [cite:@Brownlee2020a]. With our imbalanced data set, it is important to weight the cost associated with the minority group of Hall of Fame players much higher than that of the majority group of non-HoF players [cite:@Brownlee2020a]. Most algorithms are not created for cost-sensitive learning, but can be extended or modified to consider these costs [cite:@Brownlee2020a].
- Cost-sensitive algorithms can involve modifying existing algorithms to use costs as a penalty for misclassification [cite:@Brownlee2020a].
- This can be used for iteratively trained algorithms such as logistic regression and neural networks.
- The scikit-learn library allows a cost-sensitive extension via the ~class_weight~ argument, specifically for Logistic Regression; for a neural network, the cost-sensitive augmentation goes into the ~fit()~ function [cite:@Brownlee2020a]. A small sketch of this is shown after the cost-matrix example below.
TODO Add in a point about class weights
When explaining the cost matrix in this article, the author notes that a good heuristic for a cost matrix is to start with one that represents the ratio of the dataset [cite:@Brownlee2020a]. If the ratio of the data looks like ~1:100~, then the cost matrix could look like this [cite:@Brownlee2020a]:
|                    | Actual Negative | Actual Positive |
|--------------------+-----------------+-----------------|
| Predicted Negative | 0               | 100             |
| Predicted Positive | 1               | 0               |
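As mentioned in the list above, here is a small sketch of class reweighting in practice. It uses scikit-learn's ~compute_class_weight~ with the 'balanced' option, which roughly reproduces the ratio heuristic, and passes the result to a ~class_weight~-aware classifier; the Keras-style ~fit()~ call is only indicated in a comment, since the exact model is not shown here:
#+begin_src python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical training set mirroring the ~1:14 HoF imbalance.
y_train = np.array([1] * 225 + [0] * 3171)
X_train = np.random.default_rng(0).normal(size=(len(y_train), 26))

# 'balanced' weights each class inversely to its frequency.
weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=y_train)
print(dict(zip([0, 1], weights)))  # roughly {0: 0.54, 1: 7.5}

# scikit-learn: pass the weights (or simply 'balanced') to the classifier.
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_train, y_train)

# Keras-style neural network: the same idea goes into the fit() call, e.g.
# model.fit(X_train, y_train, class_weight={0: weights[0], 1: weights[1]})
#+end_src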
The Classifiers
To determine the performance of the neural network in this project, it should be compared to a baseline of other classifiers [cite:@Brownlee2014]. For structured data, Logistic Regression is the common baseline model [cite:@Ameisen]. /Decision Tree Classifiers/ are also common classifiers for binary classification tasks [cite:@Koseler2017]. Therefore, both of these classifiers were used in this project to compare against the neural network's performance.
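To make the comparison concrete, a sketch of such a baseline comparison could look like the following. The data here is purely synthetic (so the scores themselves are meaningless), and ~MLPClassifier~ merely stands in for the project's neural network:
#+begin_src python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import matthews_corrcoef
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the project's 26 features and ~1:14 imbalance.
rng = np.random.default_rng(1)
X = rng.normal(size=(3396, 26))
y = np.array([1] * 225 + [0] * 3171)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

models = {
    "logistic regression": LogisticRegression(class_weight="balanced", max_iter=1000),
    "decision tree": DecisionTreeClassifier(class_weight="balanced", random_state=0),
    # MLPClassifier has no class_weight argument, one reason a Keras model
    # with class weights may be preferable for the real neural network.
    "neural network (MLP)": MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=500, random_state=0),
}

for name, model in models.items():
    model.fit(X_tr, y_tr)
    print(name, "MCC:", round(matthews_corrcoef(y_te, model.predict(X_te)), 3))
#+end_src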
Electronics
TODO ESP32 RGB LED Programmable Controller
Perspectives
Literate Programming, Reproducible Research, and "Clean Code" + Docstrings
What is "Literate Programming"?
Donald Knuth formally introduced the idea of "Literate Programming" in 1984 cite:Knuth1992. In his book he suggested the following change in attitude:
#+begin_quote
Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want the computer to do.
#+end_quote
At Stanford University the idea of literate programming was realized through the ~WEB~ system. It involves combining a document formatting language with a programming language cite:Knuth1992. In the original report submitted to The Computer Journal in 1983, the example presented of the ~WEB~ system consisted of the combination of ~TeX~ and ~PASCAL~. This choice is not surprising, as Donald Knuth created ~TeX~. The code was written in a single ~.WEB~ file. A reason Knuth was happy with the term ~WEB~ is that he considered a complex piece of code as a web of simple materials delicately pieced together cite:Knuth1992. The 'weaving' process involves separating out the document formatting language into a separate file, while the 'tangling' entails extracting the programming language to produce machine-executable code cite:Knuth1992. This process is depicted in figure fig:WEB-overview cite:Knuth1992.
#+CAPTION: Adapted image from the original paper showing the dual usage of a ~.WEB~ file. The 'weaving' separates out the ~TeX~ from the ~.WEB~ file and compiles it to an output document, while the 'tangling' extracts the ~PASCAL~ code to produce a machine-executable program.
#+NAME: fig:WEB-overview
#+ATTR_HTML: :class center
../static/img_literate_programming_reproducible_research_clean_code/literate-programming-2.png
If you are considering using a tool for literate programming, their website has a long list of tools which can be used for writing code with literate programming cite:literate_programming_tools. While it is comprehensive, it does not list a fantastic tool which will be mentioned later. There are also several other projects online which seem to be popular, such as Zachary Yedidia's Literate project.
Data Science and Reproducible Research
The term "data science" likely first appeared in 1974 and initially defined by the following quote cite:Cao2017.
#+begin_quote
The science of dealing with data, once they have been established, while the relation of the data to what they represent is delegated to other fields of data.
#+end_quote
The term has increased in popularity since 2012 cite:Cao2017. This increase in interest is likely due to the emergence of data mining, big data, and the advances in Artificial Intelligence (AI) and Machine Learning (ML). Due to the accumulation of large amounts of data, interdisciplinary tools are being combined to form a "new" field. A more modern and comprehensive definition of data science is as follows cite:Cao2017.
#+begin_quote
From the disciplinary perspective, data science is a new interdisciplinary field that synthesizes and builds on statistics, informatics, computing, communication, management, and sociology to study data and its environments (including domains and other contextual aspects, such as organizational and social aspects) in order to transform data to insights and decisions by following a data-to-knowledge-to-wisdom thinking and methodology.
#+end_quote
# Reproducible research is a useful tool to be able to explain the context
# surrounding the code. This assists in creating reproducibility both
# benefiting the researcher and the scientific community.
A central point in literate programming is the intent of 'weaving' and 'tangling' the original document into two separate files. The original /web/ document was not necessarily intended to be used as the final result. However, the modern use of "reproducible research" implies using a similar literate programming /web/ document as the final result cite:Goodman2018.
#+CAPTION: Example of a Jupyter Notebook showing the mix of rich text and code blocks, along with output.
#+NAME: fig:jupyter-web
../static/img_literate_programming_reproducible_research_clean_code/jupyter_notebook_example.png
The computer scientist Jon Claerbout coined the term "reproducible research" and described it as a reader of a document being able to see the entire process from the raw data to the code producing the tables and figures cite:Goodman2018. The term can be slightly confusing, as reproducibility has long been an important part of research and the scientific method.
One of the issues with the use of software in scientific fields generally is the reproducibility of the research being conducted cite:Hinsen2013. Therefore, the greater context surrounding the code is important and possibly not self-explanatory through the code alone. Even experts within the field of exploratory data science struggle to keep track of the experiments they conduct, which can lead to confusion about which experiments have been performed and how the results were obtained cite:Hinsen2013. Since digitalization and digital tools have become a more important part of many fields, reproducible research has become popular in fields such as epidemiology, computational biology, economics, and clinical trials cite:Goodman2018.
While literate programming and reproducible research are closely related in form, they were intended for slightly different use cases. Literate programming is not specifically for research, but is a broader approach to programming, while reproducible research focuses on providing the full context of an experiment or study rather than making the code itself more legible.
Some of the more popular tools for reproducible research are Jupyter Notebooks for programming in Python (example in figure fig:jupyter-web), and the integration of ~Sweave~ and ~knitr~ into R Markdown for programming in the statistical language R cite:Kery2018. Jupyter Notebooks is an open-source spin-off from the IPython (Interactive Python) project. Python is considered a fairly easy language to learn, and there are several Python projects for data manipulation as well as support for different machine learning frameworks such as TensorFlow and PyTorch. This makes Jupyter Notebooks particularly well suited for data science.
Readable and Clean Code
Knuth's attitude change mentioned above was primarily about making code more readable for humans. Literate programming is one method of accomplishing this goal, but it is not the only avenue to take. What if the code were written so well that it was completely self-explanatory?
Robert C. Martin, affectionately called "Uncle Bob", is a software engineer and the author of the bestselling book Clean Code: A Handbook of Agile Software Craftsmanship cite:martin2009clean. He has an interesting talk on clean code where he explains that code should read like "well written prose". In his book he writes cite:martin2009clean:
#+begin_quote Indeed, the ratio of time spent reading versus writing is well over 10 to 1. We are constantly reading old code as part of the effort to write new code. ...[Therefore,] making it easy to read makes it easier to write. #+end_quote
While writing readable code is something to always strive towards, it will likely never be completely self-explanatory. Therefore, it would likely be beneficial to write good documentation to complement the code.
Self-Documentation with Docstrings
Emacs has a nice method of self-documentation which is very helpful for understanding the editor's source code written in ~elisp~. It uses documentation strings (or docstrings), which are comments written directly into the code. Writing comments in code to describe functions is not at all uncommon. However, traditional in-file documentation, such as Javadoc, is stripped from the program and is not available at runtime. Docstrings, on the other hand, are maintained throughout the runtime of the code in which they are written, so the programmer can inspect them interactively while the program is running.
Some notable languages which support docstrings include (a short Python illustration follows this list):
- Lisp/Elisp
- Python
- Elixir
- Clojure
- Haskell
- Gherkin
- Julia
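As a minimal illustration of why this is useful, a Python docstring (one of the languages listed above) survives into the running program and can be inspected interactively, much like ~describe-function~ in Emacs:
#+begin_src python
def greet(name: str) -> str:
    """Return a short greeting for NAME.

    The first line is a summary; the rest can describe arguments and
    behaviour, similar to an Emacs docstring.
    """
    return f"Hello, {name}!"

# The docstring is kept at runtime and can be inspected interactively.
print(greet.__doc__)
help(greet)
#+end_src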
Emacs is almost like working in a ~lisp~ terminal. The editor is largely written in a variant of ~lisp~ called ~elisp~. For example, in order to split a window vertically, the function ~split-window-below~ is called. The code snippet below is the start of the source code for this function.
#+begin_src elisp
(defun split-window-below (&optional size)
  "Split the selected window into two windows, one above the other.
The selected window is above.  The newly split-off window is below
and displays the same buffer.  Return the new window.

If optional argument SIZE is omitted or nil, both windows get the
same height, or close to it.  If SIZE is positive, the upper
\(selected) window gets SIZE lines.  If SIZE is negative, the
lower (new) window gets -SIZE lines.

If the variable `split-window-keep-point' is non-nil, both windows
get the same value of point as the selected window.  Otherwise, the
window starts are chosen so as to minimize the amount of redisplay;
this is convenient on slow terminals."
  (interactive "P")
  (let ((old-window (selected-window))
        (old-point (window-point))
        (size (and size (prefix-numeric-value size)))
        moved-by-window-height moved new-window bottom)
    (when (and size (< size 0) (< (- size) window-min-height))
      ;; `split-window' would not signal an error here.
      (error "Size of new window too small"))
    ;; ...
    ;; ...
    ;; ...
#+end_src
When calling the function ~describe-function~ and inputting the function name, "split-window-below", it generates the following output in a mini-buffer (a kind of little buffer). This documentation is created dynamically at runtime.
#+begin_quote
split-window-below is an interactive compiled Lisp function in ‘window.el’.
It is bound to SPC w -, SPC w s, M-m w -, M-m w s, C-x 2,
(split-window-below &optional SIZE)
Probably introduced at or before Emacs version 24.1.
Split the selected window into two windows, one above the other. The selected window is above. The newly split-off window is below and displays the same buffer. Return the new window.
If optional argument SIZE is omitted or nil, both windows get the same height, or close to it. If SIZE is positive, the upper (selected) window gets SIZE lines. If SIZE is negative, the lower (new) window gets -SIZE lines.
If the variable ‘split-window-keep-point’ is non-nil, both windows get the same value of point as the selected window. Otherwise, the window starts are chosen so as to minimize the amount of redisplay; this is convenient on slow terminals.
#+end_quote
Comparing the docstring in the source code with the output of ~describe-function~, there is more information added to the output documentation. This was a slight aside, but using the ~describe-*~ functions is probably one of the most useful and helpful ways to learn Emacs. In any case, it shows one benefit of how docstrings are used in Emacs.
Org-Mode for Literate Programming and Reproducible Research
While Jupyter Notebooks are a fantastic way of writing reproducible research, they are not a method of literate programming, as they are not intended to be 'weaved' and 'tangled'. The original tool, the ~WEB~ system for literate programming, does not allow for compiling embedded code interactively. Org-mode, however, was the first tool to provide full support for both reproducible research and literate programming cite:Schulte2012.
Org-mode is something called a major mode in Emacs. An ~.org~ file is essentially a plain text markup language, but there are so many things that can be done in org-mode it is mind-boggling. With some configuration it can be used as almost anything, including an advanced agenda system, calendar, financial ledger, zettelkasten note-taking system (org-roam), and an exporter into almost any text format. Even that is an inadequate list, but for those interested this is a more complete list.
#+CAPTION: An example of a ~.org~ file with code blocks in ~emacs-lisp~ which will be tangled to a file called ~spacemacs.el~.
#+NAME: fig:org-babel
../static/img_literate_programming_reproducible_research_clean_code/org-bable.png
Using a package called ~org-babel~, an ~.org~ file allows for both 'weaving', using org-mode's built-in export functionality, and 'tangling', using ~org-babel-tangle~ on the entire file (see figure fig:org-babel). As with other markup languages, ~.org~ files already support rich text formatting of code blocks. With simple tags and commands, the embedded code blocks can be run dynamically by using certain ~session~ tags. Running embedded code blocks in this manner, through the same or multiple sessions in the same file, is important for reproducible research. More details on using org-mode for both literate programming and reproducible research can be found in the paper by cite:Schulte2012.
A blog post which explains and demonstrates how to use ~org-babel~ can be found here.
Closing thoughts
While literate programming is a useful tool for providing more context to programming projects, in many cases writing self-explanatory code might be the best avenue for pure software engineering. Reproducible research is better suited for explaining a study or an experiment, which is particularly useful in scientific practice today as the use of digital tools continues to grow. Contrary to software engineering, these use cases aren't well suited for simply writing self-explanatory code. Literate programming and reproducible research approach mixed natural-language and computational-language documents from different ends cite:Schulte2012: while literate programming introduces natural language into programming files, reproducible research adds embedded executable code to natural language documents cite:Schulte2012.
If humans were capable of easily writing in machine code, there would be no need for all the different types of programming languages. The reason we use programming languages is to be able to interpret and understand what we are instructing the computer to do. Therefore, making the code understandable should be a top priority. Even though Knuth's literate programming might not be suitable in all settings, I think his attitude about telling the reader what we are instructing the machine to perform is valuable in any programming situation. This is especially true for collaboration and long term complexity of coding projects.
While some of the differences between literate programming and reproducible research have been mentioned, particular examples of literate programming were not discussed. Configuration files are ideal for literate programming. They typically require explanations of the context of the settings or an explanation of the setting itself. After writing the literate configuration, it can be tangled to generate the actual configuration. I hope to write a post soon showing how I created a literate configuration for Spacemacs.