submitted

This commit is contained in:
andy 2021-05-04 15:24:59 +01:00
parent 10baaad312
commit 35943e8aad
2 changed files with 314 additions and 80 deletions


@ -19,3 +19,51 @@
year = {2013}
}
@misc{cmabridge-cnns,
author = {Angermueller, Christof and Kendall, Alex},
organization = {University of Cambridge},
title = {Convolutional Neural Networks},
url = {https://cbl.eng.cam.ac.uk/pub/Intranet/MLG/ReadingGroup/cnn_basics.pdf},
urldate = {2021-05-02},
year = {2015}
}
@misc{tds-alexnet,
author = {Alake, Richmond},
organization = {Towards Data Science},
title = {What AlexNet Brought To The World Of Deep Learning},
url = {https://towardsdatascience.com/what-alexnet-brought-to-the-world-of-deep-learning-46c7974b46fc},
urldate = {2021-05-02},
year = {2020}
}
@misc{learnopencv-alexnet,
author = {Nayak, Sunita},
month = jun,
organization = {Learn OpenCV},
title = {Understanding AlexNet},
url = {https://learnopencv.com/understanding-alexnet},
urldate = {2021-05-02},
year = {2018}
}
@misc{tds-lr-schedules,
author = {Lau, Suki},
month = jul,
organization = {Towards Data Science},
title = {Learning Rate Schedules and Adaptive Learning Rate Methods for Deep Learning},
url = {https://towardsdatascience.com/learning-rate-schedules-and-adaptive-learning-rate-methods-for-deep-learning-2c8f433990d1},
urldate = {2021-05-02},
year = {2017}
}
@misc{glassbox-train-props,
author = {Draelos, Rachel},
month = sep,
organization = {Glass Box},
title = {Best Use of Train/Val/Test Splits, with Tips for Medical Data},
url = {https://glassboxmedicine.com/2019/09/15/best-use-of-train-val-test-splits-with-tips-for-medical-data},
urldate = {2021-05-02},
year = {2019}
}


@ -185,7 +185,26 @@ University of Surrey
\end_layout
\begin_layout Abstract
Investigations are made into three broad influences on a convolutional neural
 network's performance: the subject dataset, hyper-parameters and
 architecture.
These investigations were conducted using the Stanford Cars dataset and
the seminal AlexNet architecture.
The proportions of the dataset dedicated to training, validation and testing
 were varied, with higher accuracy obtained by heavily biasing towards training.
Offline data augmentation was investigated by expanding the training dataset
using rotations and horizontal flips.
This significantly increased performance.
A peak in accuracy was identified when varying the number of epochs with
overfitting occurring beyond this critical epoch value.
Various learning rate schedules were investigated with dynamic learning
rates throughout the training period far out-performing a fixed learning
rate.
Finally, the architecture of the network was investigated by varying the
dimensions of the final dense layers, the kernel size of the convolutional
layers and by including new layers.
All of these investigations were able to report higher accuracy than the
standard AlexNet.
\end_layout
\begin_layout Standard
@ -195,10 +214,6 @@ LatexCommand tableofcontents
\end_inset
\end_layout
\begin_layout Standard
@ -282,17 +297,15 @@ Introduction
\begin_layout Standard
Although much of the theory for convolutional neural networks (CNNs) was
developed throughout the 20th century, their importance to the field of
computer vision was not widely appreciated until the early 2010s
\begin_inset CommandInset citation
LatexCommand cite
key "alexnet"
literal "false"
\end_inset
.
\end_layout
@ -381,7 +394,16 @@ Prior to more in-depth investigations, how the dataset is divided into training,
validation and test data was investigated in order to identify a suitable
proportion for later work.
With a fixed-size dataset, a balance must be struck between how much is reserved
for training the network and how much should be used to evaluate the network
\begin_inset CommandInset citation
LatexCommand cite
key "glassbox-train-props"
literal "false"
\end_inset
.
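\end_layout
\begin_layout Standard
As a minimal sketch only (not the code used for the experiments), partitioning a fixed-size dataset by proportion could be implemented as follows; the proportions shown and the use of a seeded shuffle are illustrative assumptions.
\end_layout
\begin_layout LyX-Code
import random

def split_dataset(items, train=0.7, val=0.1, test=0.2, seed=0):
    """Partition a list of samples into train/val/test subsets by proportion."""
    assert abs(train + val + test - 1.0) < 1e-9, "proportions must sum to 1"
    items = list(items)
    random.Random(seed).shuffle(items)          # deterministic shuffle
    n_train = int(len(items) * train)
    n_val = int(len(items) * val)
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])

# Example: an 80/10/10 split of 1,000 sample identifiers
train_set, val_set, test_set = split_dataset(range(1000), 0.8, 0.1, 0.1)
\end_layout
\begin_layout Standard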
Throughout this paper, the term
\emph on
split
@ -393,7 +415,15 @@ split
\begin_layout Standard
Although the dataset is of a fixed size, there are methods to artificially
expand the set of training data by performing image manipulations such
as rotations and zooms
\begin_inset CommandInset citation
LatexCommand cite
key "tds-alexnet,learnopencv-alexnet"
literal "false"
\end_inset
.
This aims to teach the network invariance to such transforms during classification.
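\end_layout
\begin_layout Standard
As a rough illustration only (this is not the report's own script), offline augmentation of this kind could be written with Pillow as below; the rotation range of up to 10 degrees, the horizontal flip and the output naming are assumptions.
\end_layout
\begin_layout LyX-Code
import random
from pathlib import Path
from PIL import Image, ImageOps

def augment_offline(src_dir, dst_dir, max_angle=10, seed=0):
    """Write the original, a slightly rotated copy and a horizontally
    flipped copy of every training image to an expanded dataset."""
    rng = random.Random(seed)
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    for path in sorted(Path(src_dir).glob("*.jpg")):
        img = Image.open(path)
        img.save(dst / path.name)                         # keep the original
        angle = rng.uniform(-max_angle, max_angle)        # small random rotation
        img.rotate(angle).save(dst / ("rot_" + path.name))
        ImageOps.mirror(img).save(dst / ("flip_" + path.name))   # horizontal flip
\end_layout
\begin_layout Standard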
A Python script was written to take a training dataset and perform a range
@ -414,12 +444,21 @@ Meta-Parameters
\begin_layout Standard
The number of epochs that a network is trained for is important for balancing
the fit to the training set.
Too few and the CNN will be underfitted whereas too many and the network
will be too specific to the training set.
\end_layout
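\begin_layout Standard
The critical number of epochs can be read off a validation history; the figures in the short sketch below are invented purely to illustrate the idea.
\end_layout
\begin_layout LyX-Code
# Hypothetical (epoch, validation accuracy) pairs recorded during training
history = [(10, 0.21), (30, 0.35), (50, 0.44), (70, 0.47), (90, 0.45), (110, 0.41)]

# The peak marks the critical epoch; training past it only overfits further.
best_epoch, best_acc = max(history, key=lambda pair: pair[1])
print(f"stop around epoch {best_epoch} (validation accuracy {best_acc:.2f})")
\end_layout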
\begin_layout Standard
The learning rate of a CNN is critical for attaining high-performance results
\begin_inset CommandInset citation
LatexCommand cite
key "tds-lr-schedules"
literal "false"
\end_inset
.
The value and how it changes over the range of training epochs or the
\emph on
learning schedule
@ -447,7 +486,15 @@ Convolutional Layers
\begin_layout Standard
The convolutional layers of AlexNet are responsible for applying subsequent
image manipulations by convolving the sample with a kernel of learned parameters
\begin_inset CommandInset citation
LatexCommand cite
key "cmabridge-cnns"
literal "false"
\end_inset
.
The kernel size of each layer was varied in order to visualise performance.
\end_layout
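\begin_layout Standard
The report does not state the framework used; purely as an assumed PyTorch sketch, varying the kernel size of a single convolutional layer while leaving the rest of the network unchanged might look as follows. The channel counts and stride follow the standard AlexNet layer 1, while the padding rule is an assumption made only to keep the output sizes comparable.
\end_layout
\begin_layout LyX-Code
import torch
import torch.nn as nn

def first_conv(kernel_size):
    """AlexNet-style first convolution with a configurable kernel size."""
    return nn.Conv2d(3, 96, kernel_size=kernel_size,
                     stride=4, padding=kernel_size // 2)

for k in (3, 7, 11, 15):                                 # 11 is the AlexNet default
    out = first_conv(k)(torch.zeros(1, 3, 227, 227))     # dummy batch
    print(k, tuple(out.shape))
\end_layout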
@ -463,13 +510,21 @@ Following the convolutional stages there are three dense or fully-connected
output.
The second is as a traditional multi-layer perceptron classifier, taking
the high-level visual insights of the later convolutional layers and reasoning
these into a final classification
\begin_inset CommandInset citation
LatexCommand cite
key "learnopencv-alexnet"
literal "false"
\end_inset
.
\end_layout
\begin_layout Standard
When treated as an MLP, these can instead be considered as 2 hidden layers
and a single output layer for AlexNet.
As the last layer is of a fixed number of nodes equal to the number of
classes and is required to form the one-hot vector output, it is treated
separately from the others.
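\end_layout
\begin_layout Standard
Again assuming PyTorch purely for illustration (and omitting dropout for brevity), the two hidden dense layers can be parameterised by a single width while the output layer stays fixed at the number of classes, 196 for Stanford Cars.
\end_layout
\begin_layout LyX-Code
import torch.nn as nn

def dense_head(in_features, hidden_nodes, num_classes=196):
    """Two hidden dense layers of a chosen width plus the fixed-size
    output layer, mirroring AlexNet's fully-connected section."""
    return nn.Sequential(
        nn.Linear(in_features, hidden_nodes), nn.ReLU(),
        nn.Linear(hidden_nodes, hidden_nodes), nn.ReLU(),
        nn.Linear(hidden_nodes, num_classes),
    )

head = dense_head(256 * 6 * 6, hidden_nodes=4096)   # 4096 is the AlexNet default
\end_layout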
@ -482,10 +537,18 @@ New Layers
\end_layout
\begin_layout Standard
It has been shown that the early layers (~1-3 in AlexNet) of CNNs are responsible
 for identifying low-level features such as edges while the latter layers
 (~3-5) perform higher level reasoning including texture
\begin_inset CommandInset citation
LatexCommand cite
key "cmabridge-cnns"
literal "false"
\end_inset
.
The addition of a new layer in both of these regions of the network was
investigated.
Reasonable values for kernel sizes and number of layers were selected considering the values from the neighbouring layers.
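\end_layout
\begin_layout Standard
As a hedged sketch of what such an insertion could look like (again assuming PyTorch, and omitting the pooling and normalisation stages), a new layer 1.5 takes a filter count that interpolates its neighbours, 96 and 256 in the standard AlexNet; the exact value used in the report is not shown here.
\end_layout
\begin_layout LyX-Code
import torch.nn as nn

# Standard AlexNet: conv 1 has 96 filters, conv 2 has 256.
conv1   = nn.Conv2d(3, 96, kernel_size=11, stride=4, padding=2)
conv1_5 = nn.Conv2d(96, 176, kernel_size=3, padding=1)    # interpolated filter count
conv2   = nn.Conv2d(176, 256, kernel_size=5, padding=2)   # input channels updated
\end_layout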
@ -732,7 +795,7 @@ noprefix "false"
\end_inset
, the batch size was set to 128, the default value and one used for the
unaugmented experiment for comparison later.
Figure
\begin_inset CommandInset ref
LatexCommand ref
@ -811,19 +874,13 @@ noprefix "false"
\end_inset
), augmenting the dataset more than doubled the accuracy.
Rotation performed better than flipping the images while the described
\emph on
full
\emph default
combination performed best.
\end_layout
\begin_layout Standard
@ -839,8 +896,8 @@ noprefix "false"
), data augmentation still performed better than the unaugmented dataset;
 however, the performance was not as high as with a constant batch size.
Full processing performed worse than either flipping or rotating in this
case but still performed better than the unaugmented control.
\end_layout
\begin_layout Standard
@ -1108,8 +1165,8 @@ noprefix "false"
\end_inset
.
More epochs can be seen to dramatically increase performance until ~70
 epochs, after which the accuracy gradually declines.
The opposite trend can be seen in the loss of figure
\begin_inset CommandInset ref
LatexCommand ref
@ -1259,7 +1316,8 @@ noprefix "false"
.
For a fixed learning rate, values between 0.01 and 0.001 gave the best accuracy
with values either larger or smaller giving a top-1 accuracy of less than 10%.
The highest accuracies over 50 and 100 epochs were comparable, at around 33%
 and 35% respectively.
\begin_inset Note Comment
status open
@ -1419,17 +1477,6 @@ noprefix "false"
.
Over both 50 and 100 epochs, the step-down scale factor can be seen to
have little effect on test accuracy.
\end_layout
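\begin_layout Standard
The exact step-down definition is not reproduced here; the sketch below assumes the common form in which the rate is multiplied by the scale factor at fixed epoch intervals, with the interval length chosen arbitrarily.
\end_layout
\begin_layout LyX-Code
def step_down_lr(epoch, base_lr=0.01, scale=0.5, interval=20):
    """Assumed step-down schedule: multiply the rate by `scale`
    every `interval` epochs."""
    return base_lr * scale ** (epoch // interval)
\end_layout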
\begin_layout Subsubsection
@ -1450,8 +1497,13 @@ noprefix "false"
.
From these results, a slow decay rate can be seen to give the best results;
 values between 0.95 and 0.99 gave the highest accuracies over both training
 periods.
Over 50 epochs, the performance drops faster than over 100 epochs as
\begin_inset Formula $\lambda$
\end_inset
decreases.
\end_layout
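\begin_layout Standard
The reported behaviour is consistent with a per-epoch multiplicative decay, which is assumed in the sketch below; with lambda between 0.95 and 0.99 the rate decays slowly, and smaller values decay it faster.
\end_layout
\begin_layout LyX-Code
def exponential_lr(epoch, base_lr=0.01, lam=0.97):
    """Assumed exponential decay: the rate shrinks by a factor lambda each epoch."""
    return base_lr * lam ** epoch

print(exponential_lr(50, lam=0.99), exponential_lr(50, lam=0.95))   # slow vs fast decay
\end_layout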
\begin_layout Standard
@ -1587,7 +1639,11 @@ Reasonable values of gamma for the sigmoid function were selected
\uwave off
\noun off
\color none
up to 0.4, as
\begin_inset Formula $\gamma$
\end_inset
increases beyond 0.5 the profile tends towards a step action.
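\end_layout
\begin_layout Standard
Assuming a sigmoidal profile of the form sketched below (the midpoint epoch is not given and is an assumption), a larger gamma sharpens the transition, which is why values much beyond 0.5 approach a step change.
\end_layout
\begin_layout LyX-Code
import math

def sigmoid_lr(epoch, base_lr=0.01, gamma=0.2, midpoint=50):
    """Assumed sigmoid schedule: gentle decay for small gamma,
    an almost step-like drop around the midpoint for large gamma."""
    return base_lr / (1.0 + math.exp(gamma * (epoch - midpoint)))
\end_layout
\begin_layout Standard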
Accuracies over 50 and 100 epochs were evaluated and can be seen in figure
\begin_inset CommandInset ref
@ -1903,10 +1959,15 @@ noprefix "false"
\end_inset
with the standard kernel sizes for AlexNet also marked.
Only one kernel size was changed at a time; the network is a standard AlexNet
 apart from the subject layer.
In general, varying the kernel size of the earlier layers (1 and 2) had
 little effect on the accuracy; a kernel size of 3 for layer 1 performed
 particularly badly, with a ~7% lower accuracy.
Higher gains were made in the later layers, where a size of 5 or 7 tended
to perform better than the standard 3.
Layer 3 showed both the highest gain with a +6% from the original 3x3 to
5x5 and the highest loss with -10% from 3x3 to 11x11.
\end_layout
\begin_layout Subsubsection
@ -1974,6 +2035,7 @@ noprefix "false"
Each number of layers shows a peak with a steep ascent and a more gradual
descent; as the number of layers increases, the number of nodes associated
 with the peak also increases.
The highest performance was for the standard 2 layers with 512 nodes.
\end_layout
\begin_layout Subsubsection
@ -2016,10 +2078,6 @@ name "fig:new-layer"
\end_inset
\end_layout
\end_inset
@ -2028,16 +2086,23 @@ name "fig:new-layer"
\end_layout
\begin_layout Standard
Test accuracies for varied kernel sizes in additional convolutional layers
can be seen in figure
\begin_inset CommandInset ref
LatexCommand ref
reference "fig:new-layer"
plural "false"
caps "false"
noprefix "false"
\end_inset
.
A fixed number of filters was selected to interpolate the values of neighbouring layers.
A new layer between conv 1 and conv 2, layer 1.5, performed best with a
 3x3 kernel (54%); increasing the size resulted in decreased accuracy.
 For layer 3.5, a 5x5 kernel performed best, giving a top-1 accuracy of 52%.
\end_layout
\begin_layout Subsubsection
@ -2046,7 +2111,7 @@ Summary
\begin_layout Standard
A comparison of the best reported accuracies for the investigated architecture
alterations can be seen in figure
\begin_inset CommandInset ref
LatexCommand ref
reference "fig:architecture-best-barh"
@ -2059,7 +2124,7 @@ noprefix "false"
.
Each of the investigated architecture changes was able to outperform AlexNet.
The largest increase was achieved by reducing the number of nodes in the
2 hidden dense layers from 4,096 to 512 for a ~10% increase to 57%.
\end_layout
\begin_layout Standard
@ -2125,17 +2190,70 @@ name "sec:Discussion"
Dataset
\end_layout
\begin_layout Standard
A high training proportion was found to increase the accuracy of the network.
This is for two major reasons.
Increasing the number of images for training effectively increases the
length of training as more images are propagated each epoch.
Alongside this, increasing the training proportion also provides a more
complete view of the dataset.
Higher proportions will allow the network to see more of each class; with
 a significantly lower proportion it would be possible that few, if any,
 examples of a class are present in the training dataset.
The same can be argued for the test sets, however.
As the test set is reduced in complement to the training set's increase,
 the breadth of qualities being evaluated narrows as fewer examples of each
 class remain.
This is the core of the balancing act in conducting both comprehensive
training and testing.
\end_layout
\begin_layout Standard
Offline augmentation of the training data proved to be an effective way
of increasing the accuracy of the evaluated networks.
When using rotation, small angles were the most effective; in practice
 a random angle between 0 and 10 could be used.
Data augmentation increases performance as it presents the network with
different perspectives of the same images.
As such, the network can learn invariance to factors such as which way
the car is facing in the image.
\end_layout
\begin_layout Standard
Scaling the batch size in line with the training set growth was conducted
 in an effort to control for the amount of extra training being performed.
When comparing data augmentation methods, difficulty arises in comparing
processing methods which expand the training set by different amounts.
Synthetically larger datasets not only present the network with new perspectives
 of the images but also train the network for longer, and as such it is
hard to define how much should be attributed to the
\emph on
quality
\emph default
of the synthetic data.
Scaling the batch size so as to maintain the number of network updates
reduced the accuracy as would be expected when attempting to control for
more training; however, the full processing (
\begin_inset Formula $E=6$
\end_inset
) was reduced further than the
\begin_inset Formula $E=2$
\end_inset
processing methods.
This does not follow the hypothesis, as it might be expected that more perspectives
 of the training data would improve over a single rotation or flip.
This suggests that scaling the batch size as described was not a sufficient
method to control for the longer training periods.
\end_layout
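\begin_layout Standard
The intent of the batch-size scaling can be made concrete with some simple arithmetic: if the training set is expanded by a factor E and the batch size is scaled by the same factor, the number of weight updates per epoch is unchanged. The figures below are illustrative only, using the batch size of 128 from the experiments and the size of the standard Stanford Cars training split.
\end_layout
\begin_layout LyX-Code
base_images, base_batch = 8144, 128      # standard Cars training split, default batch

for E in (1, 2, 6):                      # expansion factors of the processing methods
    images = base_images * E
    batch = base_batch * E               # batch size scaled with the dataset
    print(f"E={E}: {images} images, batch {batch}, {images // batch} updates/epoch")
\end_layout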
\begin_layout Standard
A method to better control for this in the future could be to define a constant
expansion factor across processing methods and then compose this extra
training data of different proportions of augmentations (rotations of varying
angles and flips) or to use online augmentation such that the images are
manipulated as they are presented to the network.
\end_layout
\begin_layout Subsection
@ -2146,6 +2264,30 @@ Meta-Parameters
As presented, it can be seen that training a network beyond a threshold
number of epochs leads to diminishing performance as the network overfits
to the training set.
This reduces the network's ability to generalise as it effectively learns
\emph on
too much
\emph default
about the training data.
\end_layout
\begin_layout Standard
From the comparisons of different learning rate schedules, the similar
 performances show that the specific function employed was not as important
 as the need to decay the learning rate itself.
 This is demonstrated by the 10% performance gain when using a dynamic learning
 rate.
Considering the error surface with local minima that the weight set is
navigating, initially it is important to make large steps across the surface.
Towards the end of the training period, however, the network will ideally
 be close to or within a minimum.
 At this point, large steps will reduce the performance of the network as
 oscillating jumps over the minimum are made instead of settling within it.
 By decaying the learning rate, both aims can be met: initially taking large
 steps to find the deepest possible minimum before reducing the size of the
 movements in order to converge into it rather than circling it.
\end_layout
\begin_layout Subsection
@ -2159,9 +2301,28 @@ From the reported results each investigation outperformed the standard AlexNet
It would be inaccurate from these results to suggest that these derivative
architectures are better than AlexNet as the performance is a function
of the dataset, the specific dataset split used, the learning rate schedule
and number of training epochs.
 Instead, what is being stated is that, for the specific values of those,
 a better-performing architecture than the standard AlexNet was found.
\end_layout
\begin_layout Standard
Looking at the dense layer shape investigations, each number of layers has
 a similar profile in the described steep rise and gradual descent.
As the number of layers increases, the number of hidden nodes required
to achieve the same performance increases.
This implies a required relation between the depth and width of the dense
 layers to attain acceptable performance, as neither a deep MLP section of
 few nodes nor a shallow MLP of many nodes will be sufficient.
\end_layout
\begin_layout Standard
The larger increase in performance from adding layer 1.5 rather than layer
 3.5 would suggest that additional low-level feature learning capacity was
 more effective for this dataset than higher-level capacity.
 Both were higher than the best reported accuracy from varying AlexNet's
 kernel sizes, which would suggest that the existing convolutional stages
 were well suited to the dataset as standard.
\end_layout
\begin_layout Section
@ -2175,6 +2336,31 @@ name "sec:Conclusions"
\end_layout
\begin_layout Standard
Investigations into the factors affecting convolutional neural network
 performance have been presented.
The effect of balancing the proportion of data to be partitioned between
training and testing was investigated.
Increasing the amount of training data as much as possible was shown to
increase the accuracy.
Offline data augmentation using a selection of image rotations and flips
was shown to more than double the test accuracy.
\end_layout
\begin_layout Standard
A dynamic learning rate schedule was shown to be important in achieving
high-performance accuracy as opposed to a fixed value.
The choice of decay function did not significantly affect the best reported
accuracy.
\end_layout
\begin_layout Standard
Derivative architectures of AlexNet were shown to increase performance when
altering the shape of the dense layers, the convolutional kernel sizes and when
including additional convolutional layers.
\end_layout
\begin_layout Standard
\begin_inset Newpage newpage
\end_inset