Non-reference-based DNA sequence compression using machine learning techniques

Mathieu Hinderyckx

Supervisors: Prof. dr. Wesley De Neve, Prof. dr. ir. Joni Dambre
Counsellors: Ruben Verhack, Tom Paridaens, Lionel Pigou

Master's dissertation submitted in order to obtain the academic degree of Master of Science in Computer Science Engineering

Department of Electronics and Information Systems
Chair: Prof. dr. ir. Rik Van de Walle
Department of Environmental Technology, Food Technology and Molecular Biotechnology
Chair: Prof. dr. Jozef Vercruysse

Faculty of Engineering and Architecture
Academic year 2015-2016
Preface
Acknowledgments
This work forms the culmination of everything I (should) have learned during my time and education as an engineer in Computer Sciences at Ghent University. It has definitely been a challenging task, and a few people were essential in completing it. First of all, I would like to thank, in one breath, Ghent University in general and my parents, for offering me a quality education, a great time as a student, and the opportunity for this research as my final project. More specifically, I would like to thank Lionel Pigou of the Reservoir Lab and both Ruben Verhack and Tom Paridaens from the MMlab at the university for their role as counsellors and their day-to-day assistance and oversight of this project. While it has not always been the easiest of times, their help was very welcome and essential in approaching this, at times, daunting topic. Lastly, I would give many thanks to my friends and colleagues: both to the great students, who finish everything on time and still found some moments to help and advise on this work, and to the possibly even greater ones who also found a challenging task in completing their own projects, and went through this long project together with me.
Permission for usage
The author gives permission to make this master dissertation available for consultation and to copy parts of this master dissertation for personal use. In the case of any other use, the copyright terms have to be respected, in particular with regard to the obligation to state expressly the source when quoting results from this master dissertation.
Mathieu Hinderyckx,
August 8, 2016
Non-reference-based DNA sequence compression using machine learning techniques

Mathieu Hinderyckx
Supervisors: Prof. dr. Wesley De Neve, Prof. dr. ir. Joni Dambre
Counsellors: Ruben Verhack, Tom Paridaens, Lionel Pigou
Abstract—With the continuous development of the medical and technological world, a high interest in the analysis of DNA data has emerged in the last decades, leading to an explosive growth in the gathering of sequence data. This growth outpaces developments in electronic storage technology, so the need arises for a solution to deal with this abundance of data. One of the approaches is data compression, which has seen wide use in all kinds of electronic data processing. In this work, a data compression scheme is explored for lossless compression of sequence data. The scheme is built around an Auto-Encoder, a machine learning technique. Several models are constructed and evaluated, based on recent developments in deep learning. The best model achieves a reconstruction accuracy of over 95% in selected scenarios. After selection of this convolutional Auto-Encoder architecture, the model is trained on chromosomes from the human genome, and used to compress sequence data from an alternative human genome in various configurations. The compression scheme is shown to be able to compress the sequence to 60%-70% of its uncompressed file size. While functional, this proof-of-concept implementation does not outperform general-purpose compression techniques, currently the most widely used approach for sequence data. However, the combination of these techniques in tandem offers a competitive solution, outperforming most of the existing solutions. Future work could fine-tune and improve this proof of concept, or explore compression using predictive coding by applying LSTM neural networks.
Keywords—Convolutional Auto-Encoder, DNA sequence, compression, deep learning
I. INTRODUCTION
The genome of an organism contains its genetic material: the DNA within that organism, composed of nucleotides (bases) A (Adenine), C (Cytosine), G (Guanine) and T (Thymine). This biochemical component of every living organism contains a treasure of largely untapped information about the organism. With the establishment and development of advanced technology in the bioinformatics discipline, attempts are made to unravel the information contained within. This requires an electronic representation of this biochemical material, ranging from a few to a few hundred gigabytes per human genome. The decreasing costs of and growing interest in bioinformatics have led to an explosive growth in DNA sequencing, significantly outpacing the required developments in electronic storage technology. Several approaches are being investigated in order to deal with this data abundance. Among these are a blind growth in storage expenses, the triage of collected data, the storage of physical samples instead, and most prominently the use of
electronic data compression. As (sequenced and aligned) DNA in its raw representation is a long string of characters, compression using generic approaches is possible. However, compression of sequence data might take advantage of the natural and biological characteristics of DNA material, notably the repeated content and the often very close relationships between existing reference sequences. Both lossless and lossy compression techniques exist, the latter sometimes offering a user-specified trade-off between compression ratio and information loss. Many compression techniques have been investigated, and are currently being developed, to address the compression of sequencing data. This is one of the most promising tracks for facing the issue of high storage requirements, as it does not necessarily involve data loss. This paper explores the option of data compression for non-reference-based genome sequences. It applies Auto-Encoders, a technique from the neural network field in machine learning which has been shown in previous applications to be a viable compression scheme.
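Since a raw sequence is just a long character string, generic compressors apply directly. A toy illustration with the standard-library gzip module; the random bases are a hypothetical stand-in for real genome data, so the repeated biological structure mentioned above is deliberately absent:

```python
import gzip
import random

random.seed(0)
# A toy 100 kB "sequence" of uniformly random bases: a worst case for a
# generic compressor, since there is no repeated structure to exploit.
sequence = "".join(random.choice("ACGT") for _ in range(100_000)).encode("ascii")

compressed = gzip.compress(sequence)
ratio = len(compressed) / len(sequence)
# Uniform bases carry 2 bits of information each, so even an ideal coder
# cannot go below 0.25 of the 8-bit ASCII size; gzip lands near that bound.
print(f"compressed to {ratio:.2%} of original size")
```

On real genome text, repeats and skewed base statistics let compressors do somewhat better than this entropy-limited toy case.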
II. RELATED WORK
A lot of research has been published both on compression algorithms for DNA and on machine learning, neural networks and deep learning. Deep learning in particular is currently an extremely active area of research with many applications in several domains.
A. DNA Compression

Sequence data can be stored using several formats. A major distinction is the difference between reads and sequences. Reads are the result of the sequencing process: files containing short, overlapping fragments read from the physical sample. All of these reads are then aligned and assembled to form one contiguous sequence. A second distinction is reference-based versus non-reference-based storage. Reference-based operation indicates that the data is stored as a differential from a gold-standard genome, while non-reference-based systems store the files to be used standalone. Compression algorithms have been devised for combinations of these two scenarios. The lack of a benchmark dataset, the closed-source nature of some tools and the incompatible feature sets of the various solutions make comparing them very hard. No solution is guaranteed to work well, and most of the time general-purpose compression is applied to the sequences, which does achieve reasonable performance. A survey of the techniques in use is offered by [1] and [2].
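The reference-based idea can be sketched in a few lines: store only the positions where a target sequence deviates from a shared reference. Real tools handle alignment, insertions and deletions; this hypothetical sketch assumes equal-length, pre-aligned sequences with substitutions only:

```python
def diff_encode(reference: str, target: str):
    """Store only the (position, base) pairs where target deviates from
    the reference. Assumes aligned sequences of equal length."""
    return [(i, b) for i, (a, b) in enumerate(zip(reference, target)) if a != b]

def diff_decode(reference: str, diffs):
    """Rebuild the target by applying the stored substitutions."""
    seq = list(reference)
    for i, base in diffs:
        seq[i] = base
    return "".join(seq)

reference = "ACGTACGTACGT"
target    = "ACGAACGTACTT"  # two substitutions relative to the reference
diffs = diff_encode(reference, target)
print(diffs)  # → [(3, 'A'), (10, 'T')]
assert diff_decode(reference, diffs) == target  # lossless round trip
```

The closer the target is to the reference, the smaller the diff list, which is exactly the property reference-based tools exploit; non-reference-based schemes such as the one in this work cannot rely on it.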
B. Image processing with machine learning

Artificial Neural Networks (ANNs) have proven to be extremely successful and have led to the development of the deep learning discipline. They have been tried on a variety of problems, and have proven superior to traditional methods in many situations.
III. METHODOLOGY
In this work, an Auto-Encoder is trained to construct a compression method. An input file is encoded to a bitstream suitable for transfer, and decoded again. The actual encoding and decoding is done by an Artificial Neural Network. Compression is achieved by introducing a bottleneck in the ANN architecture. A large input sequence is represented as a compressed, coded representation learned by the network, plus a residue necessary to correct erroneous reconstructions. By training the network on DNA data, the aim is to let it learn the implicit structure of the input data, so it can create a good coded representation and effectively perform reconstruction. The parameters of the encoder and decoder modules (i.e. the weights of the ANN) are not included in the bitstream to be transferred. They are fixed (either to a hardcoded set of values or to a reprogrammable set) during operation. As large ANNs often contain a huge number of parameters, including these in the bitstream for network transfer might not be efficient. Having a fixed set of parameters also opens the possibility of efficient hardware implementations (e.g. FPGAs, ASICs and coprocessors, as currently used for image processing). The encoder and decoder processes thus share the network weights and architecture.
Fig. 1: Overview of encoding/decoding process
Figure 1 shows a block diagram of the encoding and decoding process. As a first step, the source file is read and preprocessed to a format suitable for the encoder to accept as input. This input is fed to the encoding part of the AE. This Encoder block is a simplified representation here (and is discussed below). The encoding block generates a compressed representation of the input, which is used as the codeword. The codeword is fed to the decoder, which tries to reconstruct the input, possibly making an incorrect reconstruction. This reconstruction is compared with the original input, and the differences are stored in a residue. The codeword and residue are combined into the bitstream, which is the output of the encoding process. The decoder performs the same operation as the final half of the encoding process in order to reconstruct the input file.
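The codeword-plus-residue construction can be sketched independently of any particular network. The "encoder" and "decoder" below are deliberately crude stand-ins for the trained CAE; the point of the sketch is that the residue makes the round trip lossless regardless of how imperfect the decoder is:

```python
def decode_codeword(codeword: str, length: int):
    # Stand-in decoder: repeat each kept base; wrong wherever bases change.
    out = []
    for b in codeword:
        out.extend([b, b])
    return "".join(out[:length])

def encode(sequence: str):
    # Stand-in encoder: keep every other base as the "codeword".
    # (The real scheme uses the learned, much smaller CAE representation.)
    codeword = sequence[::2]
    reconstruction = decode_codeword(codeword, len(sequence))
    # Residue: the positions (and true bases) the decoder gets wrong.
    residue = [(i, b) for i, b in enumerate(sequence) if reconstruction[i] != b]
    return codeword, residue

def decode(codeword: str, residue, length: int):
    seq = list(decode_codeword(codeword, length))
    for i, b in residue:
        seq[i] = b  # the residue corrects every erroneous position
    return "".join(seq)

original = "AACCGGTTACGT"
codeword, residue = encode(original)
print(residue)  # → [(9, 'C'), (11, 'T')]
assert decode(codeword, residue, len(original)) == original  # lossless
```

The better the (real) decoder reconstructs, the smaller the residue, and the smaller the total bitstream; a perfect decoder would leave an empty residue.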
IV. RESULTS
A. Machine learning model
During the machine learning part of the project, an Auto-Encoder (AE) architecture was constructed to perform the compression task. Starting from a traditional AE with a single fully-connected hidden layer which encodes a single base, and ending with a deep (batch-normalized) convolutional AE encoding a sequence of hundreds of bases into two values, the end result is a network capable of achieving a very good compression rate with a correspondingly good reconstruction accuracy. The traditional AEs with only fully-connected layers are unable to perform the task and show an underfitting problem. Only when moving to (variations on) convolutional AEs does the performance become very good, and these are the models of interest. This section concludes with a comparison of these CAE variations.
Several architectural variations of CAEs are now trained for 50000 updates, each having the same settings for input sequence length and encoding units. Figure 2 plots the resulting comparison of the reconstruction accuracy and cross-entropy loss for the validation set. A first conclusion is that the model using the ReLu activation function (BN-CAE(ReLu)) does not work. The cross-entropy quickly starts to diverge and the corresponding accuracy drops to under 50%. This leaves the CAEs using sigmoid activations. The batch-normalized variation (BN-CAE) initially performs best. However, even before 10000 updates the model shows signs of overfitting. The loss function increases after a minimum at around 7500 updates, and the accuracy drops well before that, ending with a score which does outperform the deep CAE on which it is based, but does not match the shallow CAE. The desired regularization effect of Batch Normalization is not immediately observed in this case. The deep CAE shows well-known learning behaviour: the accuracy rises and the loss function decreases, up until the overfitting stage occurs. At 10000-15000 updates, clear signs of overfitting show up, and the performance gets worse from there on. The model is outperformed by both its shallow predecessor and its batch-normalized successor. The clear winner here is the shallow CAE, also displaying the well-known learning behaviour. The loss function decreases and at about 30000 updates starts to increase due to overfitting, ending with a score similar to the BN-CAE. The accuracy rises to over 96%, and after overfitting occurs ends up slightly below that number. It outperforms all
of its successors, and consequently, this is the model to applyin the next step.
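The pattern described above, a validation loss that reaches a minimum and then rises, is the classic motivation for early stopping, i.e. keeping the checkpoint at the validation minimum. A framework-agnostic sketch, with a hypothetical stand-in loss curve instead of an actual training loop:

```python
def train_with_early_stopping(validation_losses, patience=3):
    """Stop once the validation loss has not improved for `patience`
    consecutive checks, and report the best checkpoint seen.
    `validation_losses` stands in for evaluating the model on the
    validation set after each batch of updates."""
    best_loss, best_step, waited = float("inf"), -1, 0
    for step, loss in enumerate(validation_losses):
        if loss < best_loss:
            best_loss, best_step, waited = loss, step, 0
        else:
            waited += 1
            if waited >= patience:
                break  # stop training; keep the checkpoint from best_step
    return best_step, best_loss

# Toy curve mimicking the overfitting shape: minimum at checkpoint 3,
# then divergence (values are illustrative, not measured).
losses = [0.60, 0.45, 0.38, 0.35, 0.37, 0.41, 0.48, 0.55]
step, loss = train_with_early_stopping(losses)
print(step, loss)  # → 3 0.35
```

In the experiments above, the reported scores are what the curves end with after 50000 updates; stopping at the validation minimum would report each model at its best instead.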
Fig. 2: Evaluation of best-performing networks
B. Compression scheme

Having determined a successful model in the previous section, it is used to implement a compression scheme by applying the CAE to the DNA sequences.
V. CONCLUSION
The purpose of this work has been to explore the possibility of constructing a compression scheme for DNA sequence data using a deep learning approach. A first data analysis showed that there are few obvious features and patterns present in the source data; hence an unsupervised technique was chosen. One particular technique, the Auto-Encoder, was selected, as it has been shown to be capable of being used for data compression. After the data analysis, a suitable AE architecture was investigated. This ranged from a traditional single-hidden-layer AE, to the more powerful convolutional variants, and ended with a trial of a batch-normalized deep convolutional AE inspired by state-of-the-art research in deep learning. The resulting best-performing model turned out to be the shallow convolutional AE, demonstrating the ability to compress and reconstruct the sequence with an accuracy of over 90% while having an architecture that offers a good compression rate. The next part of this research involved implementing a compression scheme using this CAE to investigate whether this approach yields a working setup. Several scenarios for evaluating the scheme have been investigated. The
Fig. 3: Compression statistics of bitstream encodings followed by gzip compression.
evaluation involved a look at the encoded sequence and the structure of the error residue, as these make up the encoded bitstream. Even this basic implementation, which still leaves decent room for improvement, has been shown to be capable of lossless compression of sequence data, with a resulting compression to 60%-70% in most cases. Further application of a general-purpose compression scheme such as gzip is shown to lead to an even larger achievable compression rate, mitigating the simplistic approach to storing the residue files. The combined steps of the AE compression and gzip lead to a bpb of around 1, which outperforms most existing algorithms.
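The bits-per-base (bpb) figure follows directly from the compressed size; a quick computation with hypothetical numbers chosen to land at the quoted value of around 1:

```python
def bits_per_base(compressed_bytes: int, num_bases: int) -> float:
    """Bits per base: total compressed size in bits over sequence length."""
    return compressed_bytes * 8 / num_bases

# Hypothetical example: a 240 Mbase chromosome compressed to 30 MB.
bpb = bits_per_base(30_000_000, 240_000_000)
print(f"{bpb:.2f} bpb")  # → 1.00 bpb
```

For reference, plain one-byte-per-character text is 8 bpb and a naive 2-bit packing is 2 bpb, so a bpb near 1 means the scheme captures structure beyond the raw symbol entropy.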
ACKNOWLEDGMENT
The author would like to thank Ghent University for offering me, first of all, a quality education, and finally the opportunity for this research. More specifically, he would like to thank Lionel Pigou of the Reservoir Lab and both Ruben Verhack and Tom Paridaens from the MMlab for their role as counsellors and their day-to-day assistance and oversight of this project. Lastly, many thanks to my friends and colleagues who offered help, advice and a welcome break at times.
REFERENCES
[1] T. Snyder, "Overview and Comparison of Genome Compression Algorithms," University of Minnesota, 2012.
[2] S. Wandelt et al., "Trends in Genome Compression," Current Bioinformatics, vol. 9, no. 5, pp. 1-24, 2013.
Contents
Preface iv
1 Introduction 1
1.1 Problem statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Content of this work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2 Related work 8
2.1 DNA compression algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2 Image manipulation with machine learning . . . . . . . . . . . . . . . . . . . . . 15
2.3 Auto-Encoders as feature learning models . . . . . . . . . . . . . . . . . . . . . . 17
3 Theory 21
3.1 Brief introduction to machine learning . . . . . . . . . . . . . . . . . . . . . . . . 21
3.2 Introduction to the concepts of (Artificial) Neural Networks . . . . . . . . . . . . 26
3.2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.2.2 Relevant techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.3 Selected technique: Auto-Encoders . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.3.1 Denoising Auto-Encoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.3.2 Convolutional Auto-Encoder . . . . . . . . . . . . . . . . . . . . . . . . . 37
4 Methodology 39
4.1 Structural overview of the proposed compression scheme . . . . . . . . . . . . . . 39
4.1.1 Encoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.1.2 Decoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.1.3 Network view . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.2 Technical aspects & design decisions . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.2.1 Data acquisition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.2.2 Data preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.2.3 Network training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.2.4 Compression-Accuracy trade-off . . . . . . . . . . . . . . . . . . . . . . . . 45
4.2.5 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.2.6 Software & Hardware . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
5 Results 51
5.1 Data Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5.2 Baseline comparison: state of the art . . . . . . . . . . . . . . . . . . . . . . . . . 54
5.3 Auto-Encoder model construction . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
5.3.1 Shallow fully-connected Auto-Encoder . . . . . . . . . . . . . . . . . . . . 57
5.3.2 Deep fully-connected Auto-Encoder . . . . . . . . . . . . . . . . . . . . . 60
5.3.3 Shallow Convolutional Auto-Encoder . . . . . . . . . . . . . . . . . . . . . 60
5.3.4 Deep Convolutional Auto-Encoder . . . . . . . . . . . . . . . . . . . . . . 64
5.3.5 Batch-Normalized ReLu CAE . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.3.6 Model comparison, selection and discussion . . . . . . . . . . . . . . . . . 67
5.4 Compression scheme implementation . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.4.1 Scenario 1: chromosome Ref-A on chromosome Ref-B . . . . . . . . . . . 71
5.4.2 Scenario 2: chromosome Ref-A on chromosome Alt-A . . . . . . . . . . . . 72
5.4.3 Additional general-purpose compression . . . . . . . . . . . . . . . . . . . 73
6 Conclusion 78
6.1 Discussion of the results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
6.2 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
References 81
List of Figures
3.1 Illustration of under- and overfitting: curve (polynomial function) fitting to a
data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.2 Artificial neuron . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.3 Neural Network structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.4 Comparison of the input-output relation of some activation functions used in
ANNs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.5 Auto-Encoder as a constraint on a neural net architecture . . . . . . . . . . . . 34
4.1 Block diagram of encoding/decoding process . . . . . . . . . . . . . . . . . . . . . 40
4.2 Base sequence representation as 5-channel cubes. . . . . . . . . . . . . . . . . . . 44
4.3 Schematic display of three different evaluation scenarios. The grey shaded part
is the data used to train the model, and the arrows point to the (test) data which
is compressed using the trained model. . . . . . . . . . . . . . . . . . . . . . . . . 47
5.1 Occurrence of single bases in the full human reference genome. Error bars indicate
the occurrence minima and maxima in separate chromosomes. . . . . . . . . . . 54
5.2 Occurrence of base pairs in the full human reference genome. Error bars indicate
the occurrence minima and maxima in separate chromosomes. . . . . . . . . . . 55
5.3 Occurrence of codons (base triplets) in the full human reference genome. Error
bars indicate the occurrence minima and maxima in separate chromosomes.
Codon labels are omitted for clarity. . . . . . . . . . . . . . . . . . . . . . . . . . 55
5.4 Occurrence of groups of 2 bases in the reference genome, compared with their
expected frequency. The bars show the actual frequency of the group, the mark
shows the frequency which is expected based on the single-base frequencies. . . . 56
5.5 Occurrence of groups of 3 bases in the reference genome, compared with their
expected frequency. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
5.6 Shallow traditional fully-connected AE with single-base input and two encoding
neurons. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
5.7 Reconstruction accuracy and loss function of shallow fully-connected AE with
single-base input and two encoding units. Above: independent weights, below:
tied weights. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
5.8 Variation on the shallow traditional fully-connected AE with 100 bases as input
and two encoding units. (1) indicates an increase in input length. (2) indicates
an increase in encoding units. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
5.9 Reconstruction accuracy and loss function of shallow fully-connected AE with
input sequence length of 100 bases and two encoding units. Above: independent
weights, below: tied weights. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5.10 Deep fully-connected AE with a sequence input length of 100 bases and two
hidden units in the encoding layer. . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5.11 Reconstruction accuracy and loss function of deep fully-connected AE with
input sequence length of 100 bases and two encoding units. Above: independent
weights, below: tied weights. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
5.12 Shallow convolutional AE structure. . . . . . . . . . . . . . . . . . . . . . . . . . 64
5.13 Reconstruction accuracy and loss function of shallow convolutional AE with
input sequence length of 400 bases and two encoding units. Above: independent
weights, below: tied weights. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
5.14 Deep CAE architecture. Symmetrical decoding part omitted for readability. . . 66
5.15 Reconstruction accuracy and loss function of deep CAE with input sequence
length of 400 bases and two encoding units trained with tied weights. . . . . . . . 67
5.16 Reconstruction accuracy and loss function of Batch Normalized deep CAE with
input sequence length of 400 bases and two encoding units trained with tied
weights. Above: ReLu activations, below: Sigmoid activations. . . . . . . . . . . 68
5.17 Performance comparison of variations on the CAE with input sequence length of
400 bases and two encoding units, all trained with tied weights. . . . . . . . . . . 70
List of Tables
5.1 Chromosomes and their file content in the reference genome. The entropy is cal-
culated on the contained sequence, and not on the data stream which includes
metadata. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
5.2 General-purpose compression software on chromosome FASTA files of the human
reference sequence. The Compression column is the 7-Zip filesize compared to
the uncompressed file, and the bpb column is the bits per base achieved by the
7-Zip compression. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
5.3 Overview of the Auto-Encoders considered with some key characteristics. (tied)
indicates the weights of the decoder and encoder parts are shared. The compression
column gives the ratio of the encoding units to the input sequence (at 3 bits
per base). It does not include the residue at this point. . . . . . . . . . . . . . . . 71
5.4 File sizes after encoding using the shallow CAE trained on chromosome 1 of the
reference genome. The compression column shows the ratio of the bitstream
filesize to the original FASTA. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.5 Residue (form 1) analysis after encoding using the shallow CAE trained on chro-
mosome 1 of the reference genome. . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.6 File sizes after encoding using the shallow CAE. On each row a chromosome from
the reference genome is used for training, and its counterpart in the alternative
genome (whose name is given in this table) is encoded. . . . . . . . . . . . . . . 75
5.7 Residue (form 1) analysis after encoding using the shallow CAE. On each row a
chromosome from the reference genome is used for training, and its counterpart
in the alternative genome (whose name is given in this table) is encoded. . . . . 76
5.8 Compression results after gzip is applied on top of the bitstreams resulting from
the encodings in scenario 1. Compression ratio is the filesize fraction compared
to the original FASTA file. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
Chapter 1
Introduction
1.1 Problem statement
Context The genome of an organism contains its genetic material: the DNA (Deoxyribonucleic acid) within that organism. This DNA is composed of nucleotides (bases), represented by the characters A (Adenine), C (Cytosine), G (Guanine) and T (Thymine). This biochemical component of every living organism contains a treasure of largely untapped information about the organism. With the establishment and development of advanced technology in the bioinformatics discipline, attempts are made to unravel the information contained within. This requires an electronic representation of this biochemical material. The process of converting the biochemical, physical DNA to a data file is called sequencing. The human genome is on average about 3 GB of uncompressed data, but a single uncompressed genome generated by sequencing institutes can take up as much as nearly 300 GB.
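The ~3 GB figure follows from simple arithmetic: roughly 3.2 billion bases (an approximate figure), stored as one ASCII character each. Packing at the 2-bit entropy bound for four symbols shows how much slack the plain-text representation leaves:

```python
BASES = 3_200_000_000          # approximate length of the human genome

ascii_bytes = BASES * 1        # one 8-bit character per base (plain text)
packed_bytes = BASES * 2 // 8  # 2 bits per base: the bound for 4 symbols

print(ascii_bytes / 1e9)   # → 3.2  (GB, matching the ~3 GB quoted above)
print(packed_bytes / 1e9)  # → 0.8  (GB, a 4x reduction before any modelling)
```

The near-300 GB figure for sequencer output is larger still because it includes raw reads with heavy redundancy and per-base metadata such as quality scores, not just the assembled sequence.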
Problem The cost of sequencing a single human genome has dropped in the past decade from tens of millions of dollars to about ten thousand dollars. The technological improvements and maturity of the second-generation sequencing platforms (which are currently in use) have led to equipment costing well under one million dollars and available in many scientific institutes. The upcoming third generation will be even cheaper, making personalized medicine available for the masses. These technologies are responsible for an ever-increasing number of genomes being sequenced, both from new species and from more individuals of a given species. Looking at the data produced by these institutes, the numbers are growing rapidly. The world's biggest sequencing institute, operating over 180 sequencers, produces a total amount of data on the order of 10 PB per year and growing. The storage requirements for the output
of high-throughput sequencing instruments fall in the range of 50-100 PB per year. Viewing the genomic data growth in the first decade of the century, it is observed that progress in computer hardware lags behind: the need for storage outpaces the growth in storage capacity technology. At the same time, many programs aimed at collecting large amounts of sequence data are being introduced. For example, the Million Veteran Program (led by the US Department of Veterans Affairs) will produce a total of about 250 PB of raw data over the span of 5-7 years. As these large-scale projects emerge, the storage concerns become an increasingly high priority. Several approaches, none of them mutually exclusive, can be taken in dealing with this growth in the creation of sequence data (plus derived data and metadata).
A first option is simply to add storage capacity. Prior to 2005, the increase rate of sequencing capacity closely followed the rate of increase in storage capacity; both doubled around every 18 months. With a stable budget, production sequencing facilities and archival databases could match the storage hardware requirements. However, the current trend is that the cost of sequencing a single base halves roughly every 8 months, while the cost of hard disk space has been halving every 25 months for the last few years. Even when applying general-purpose compression algorithms, the increasing rate of sequencing significantly outpaces the storage growth. This mismatch between technologies means that either a reduction in stored sequence data must occur, or a progressive increase in storage costs is required, the latter seeming unlikely and unattractive. Looking at the technological trends of the last decade, one can observe a growth in the availability of distributed and high-capacity computing facilities. Given the requirements of this task, such data centers are much better suited for storage than local infrastructure, as keeping even a small number of fully-sequenced, uncompressed genomes on the same machine is unrealistic. Next to storage capacity, bandwidth is a concern as well: network capacity plays a significant role in these applications, and transferring sequences between machines over a small-bandwidth link is out of the question for real-life applications. Among the largest data centers and cloud computing facilities currently available are Amazon S3 and Amazon EC2 (data of 2013, [1]). While the cost of sequencing a human genome in January 2013 was about 5700 dollars, the cost of one year of storage and 15 data downloads was about 1500 dollars. These numbers clearly show that the costs of data storage and transfer will become comparable to the costs of the actual sequencing in the near future. With a growing interest in personalized medicine, these costs - besides the
performance limitations of current technology - will be a significant obstacle if not faced properly.
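The mismatch between the two halving rates quoted above compounds quickly; a back-of-the-envelope projection over five years, using the 8-month and 25-month figures (the five-year horizon is an illustrative choice):

```python
def cost_multiplier(months: int, halving_period: float) -> float:
    """Remaining cost fraction after `months`, for a cost that halves
    every `halving_period` months."""
    return 0.5 ** (months / halving_period)

MONTHS = 60  # five years
seq_cost = cost_multiplier(MONTHS, 8)    # sequencing: halves every ~8 months
disk_cost = cost_multiplier(MONTHS, 25)  # disk space: halves every ~25 months

# Per unit of budget, sequencing output grows this many times faster than
# the storage the same budget buys:
gap = disk_cost / seq_cost
print(f"{gap:.0f}x")  # → 34x
```

So a facility whose sequencing budget keeps pace with falling sequencing costs would need roughly an order-of-magnitude-plus more storage spending within five years just to keep archiving everything, which is why the text calls a blind increase in storage costs unattractive.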
A second option is to throw away some data, known as triage. Some proposals regarding this approach have been put forward. These include storing the physical sample instead of the digitally sequenced version, discarding old data, discarding data which could be regenerated, and not storing raw data but rather limiting storage to analysis results. Common ground in these suggestions is the implied ability to regenerate data from samples at any point. With large-scale, internationally distributed sequencing projects this is infeasible, however, so the need arises to store the data electronically. These large volumes of data must be available for analysis during at least the project runtime (often several years) and preferably for possible follow-up projects. Given the exponential decrease in storage costs, the total cost of storing data is dominated by its early years. Observing the continuous large investments in variation and cancer sequencing, it also seems inappropriate to limit the possibilities of re-analysis for the sake of these storage costs; compromising possible medical breakthroughs due to a storage cost limitation is rather difficult to accept. The feasibility of storing and re-acquiring clinical samples is a concern as well. Many samples have low actual DNA content and cannot be distributed freely in the future. Some rare samples (which are often of the most interest) are non-renewable, and their availability thus depends solely on electronic archiving. Even renewable samples can be very cost-intensive to reproduce, and due to some inherent randomness in the sequencing process, reproducing the exact same raw data is nearly impossible. Without reproducibility - one of the main principles of the modern scientific method - the approach of keeping only the physical samples raises strong concerns. Finally, worldwide cooperation in this field can be severely limited by the complexity of the long-term operation of physical storage, distribution and end-point sequencing. So, selective (physical) storage and discarding old data is a rather radical approach, raising major methodological doubts from a scientific and research point of view.
A third option is to compress the electronic data. As (sequenced and aligned) DNA in its
raw representation is a long string of characters, compression using generic approaches is
possible. However, compression of sequence data can also take advantage of the natural and
biological characteristics of DNA material, notably its repeated content and the often very
close relationships between existing reference sequences. Both lossless and lossy compression
techniques exist, the latter sometimes offering a user-specified trade-off between compression
ratio and information loss. A controlled loss of precision can be acceptable, depending on the
scientific and application requirements. A key feature of reference-based compression is that
performance can increase with the growth in sequencing technology and projects, as it can
exploit the growing redundancy therein. Many compression techniques have been investigated,
and are currently being developed, to address the compression of sequencing data. This is one
of the most promising tracks for dealing with high storage requirements, as it does not
necessarily involve data loss.
DNA data
As this is a research work on DNA data, this subsection provides a very concise discussion
of DNA and sequencing techniques. The biochemical aspects are not relevant to this work and
are therefore omitted; only the necessary technical concepts are touched upon here.
DNA is a biochemical molecule that carries the biological and genetic information about a
living organism, used for its development, reproduction... DNA molecules in their physical form
consist of two strands coiled around each other into a double helix structure. These strands
are made up of nucleotides. Each nucleotide consists of (among other elements) a nucleobase:
cytosine (C), guanine (G), adenine (A), or thymine (T). DNA has several uses nowadays in
biochemical and medical technology. An important use is genetic engineering: the application
of man-made (recombinant) DNA which is extracted from an organism and modified to create
another organism, such as disease-resistant agricultural products or medical products. DNA
profiling (used in the forensics domain) is a method of identifying an individual from a small
amount of biological material, useful for identifying criminal perpetrators from evidence,
identifying victims of mass-scale incidents, and determining paternity relations. Interesting
in the light of this work is the recent research into using DNA as an archival storage system,
where the extremely dense structure of DNA (up to 1 exabyte per cubic millimeter) is used to
archive digital data in a key-value store ([2]).
The interdisciplinary field of research regarding techniques for storing, data mining and
manipulating biological data is called bio-informatics. The specific characteristics of this
data have led to advances in general computer science, machine learning, database technology,
string searching and matching algorithms... By aligning DNA sequences with other sequences,
specific mutations and distinctions can be discovered, e.g. the presence or the likelihood of
developing a certain - possibly hereditary - disease. Gene-finding algorithms search for
patterns in the data, which can in turn be related to specific functions in organisms,
evolutionary development... Since the first method for determining DNA sequences in 1970, a
wide variety of advanced techniques has been developed.
One major distinction in these technologies worth discussing here is the difference between
reads and sequences. When sequencing a DNA sample, the result is a collection of millions of
short DNA sequences called reads, ranging from 30 to 30000 bases long. These reads are short,
repeated and shifted parts of the sequence. Reads require sequence assembly before most actual
genome analysis occurs.1 The assembly, alignment and sequence reconstruction of these reads is
an extremely challenging computational task for even the most advanced approaches. Several
assembler algorithms have been developed for this purpose, making use of greedy algorithms and
dynamic programming methods. The output of these assemblers is one contiguous, aligned
sequence, which is then called the full genome or sequence. The domain of bio-informatics is a
field of highly active research: new and improved methods are being developed for the data
extraction of physical samples, alignment procedures, and data mining of the information
contained in the material - both in reads and in sequences.
A second major distinction can be made between reference-based and non-reference-based
storage of DNA material. When sequences are stored and processed independently of other
material, the system is called non-reference-based: each sequence contains the full information
on the content of its DNA material. The other possibility is storing the file relative to a
reference sequence. As many DNA sequences show very high amounts of similarity, this can be
exploited by storing only the differences and modifications of a particular strain compared to
a well-known reference sequence. This reference serves as the gold standard of a DNA strain
for a particular species. When a large set of sequences must be stored, reference-based storage
can greatly reduce storage requirements by omitting the redundant part of the DNA shared by
all of them. A disadvantage of this approach is the need to agree upon the gold standard,
which often changes with technological improvements and thus requires precise version control.
1One analogy for this situation: the reads resemble the result of shredding multiple copies of a book. They might contain only part of a sentence, or possibly an entire paragraph. Furthermore, many of the sentences will be found repeatedly among the reads, without any notion of their position in the original text. With the limitations of current technology, some words will be mangled by the extraction process as well. Some parts could be unrecognizable, and even some parts from another book could end up in there. The task at hand is then the reconstruction of the one contiguous source text.
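The reference-based idea can be illustrated with a minimal sketch. Everything here is a simplification for illustration: it assumes two already-aligned sequences of equal length and substitutions only, whereas real schemes must also handle insertions and deletions.

```python
def diff_against_reference(seq, ref):
    """Keep only the (position, base) pairs where seq differs from ref."""
    return [(i, b) for i, (b, r) in enumerate(zip(seq, ref)) if b != r]


def apply_diff(ref, diffs):
    """Reconstruct the original sequence from the reference plus the diffs."""
    out = list(ref)
    for i, b in diffs:
        out[i] = b
    return "".join(out)
```

For two genomes sharing over 99% of their content, such a diff list is orders of magnitude smaller than the sequence itself, which is the essence of the storage gain.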
1.2 Content of this work
This work follows the track of compressing the data. It uses fully-aligned human genome
data as source material. The possibility is explored of applying a machine learning
technique - specifically neural network architectures - to create a lossless compression method
for this material. The scheme will be non-reference-based: compression works on the raw
content of the DNA sequence independently. These techniques have been shown to be a valid
method for lossy compression of visual material, but at the time of writing, there are no known
applications of these techniques to DNA data. On the other hand, several traditional techniques
for compression of DNA material exist and are being developed, however without making use of
the potentially powerful advantages of machine learning. This work tries to transfer the
successful compression performance of machine learning to the human genome.
The reason this approach might prove to work is twofold. Firstly, the core purpose of
machine learning models is performing (complex) pattern recognition, and this is where they
excel compared to traditional techniques. Due to the existence of useful patterns in
audiovisual material, they are extremely successful in visual computing applications. Machine
learning succeeds in (automatically) finding relevant patterns in data which are not trivially
discovered by humans. Such models might thus be able to find - and exploit - meaningful
patterns in DNA data which are currently not known.
A second reason is the specific content of this data. Visual data often contains a lot of
redundancy, which makes it a good candidate for compression. As DNA data files contain
redundancy as well - similar strains shared among species and individuals, and repeated
patterns - they might prove to be good source material for certain machine learning algorithms
where traditional algorithms fail.
Several characteristics of both the application scenario and the source material must be
taken into account when choosing or creating a particular compression method. The application
requires that data can be stored or transferred over a network, and data should preferably
be accessible in real-time (with either a sequential or random access pattern). When the data
sets are stored off-premises in a data center, transferring files comes with an associated
cost. Additionally, even with high-speed access, transferring these data sets takes a long
time due to their large size. The goal of applying compression here is to reduce this cost by
reducing the size, either for network transfer or for storage purposes. With a reasonably fast
scheme and a good compression ratio, the processing time plus the reduced transfer time can be
significantly lower than the full transfer of raw data. Easy transfer of this data furthermore
facilitates cooperation between research institutes, which helps advance scientific progress.
With the ongoing surge of large-scale sequencing projects, compression algorithms currently
focus on compression ratio rather than speed.
The structure of this work is as follows. Chapter 2 contains a survey of relevant techniques
and developments in the domains of bio-informatics and machine learning. Chapter 3 introduces
some necessary theoretical concepts from the domain of machine learning used in this work.
Then, in chapter 4, the methodology of the approach taken here is explained, together with some
technical aspects and design decisions. After that, chapter 5 discusses the results obtained by
implementing this solution. Finally, chapter 6 concludes with a discussion of the achieved
results and looks at open questions and directions for future research on this topic.
Chapter 2
Related work
This chapter provides a survey of some relevant research in the field. The first section
focuses on (some of the) existing techniques developed and in use today for the manipulation
of sequence data. The second section covers some successful applications of machine learning
techniques to image manipulation. The last section discusses Auto-Encoders, a specific method
from the machine learning portfolio which will be applied throughout this work.
2.1 DNA compression algorithms
This section offers a brief overview of current compression techniques, starting with general-
purpose techniques, and then discussing some techniques designed specifically for sequence data.
General-purpose compression algorithms
First of all, a discussion of some general compression techniques and their effectiveness on
genome data follows. Several higher-level DNA-specialized algorithms make use of these tech-
niques or similar concepts which follow the same underlying reasoning. Here specifically, some
techniques for lossless compression of strings are discussed.
Lempel-Ziv
One well-known technique is the Lempel-Ziv (LZ) algorithm [3], known for performing well on
repetitive text. LZ starts with a dictionary of all possible characters. Running over the input
one character at a time, LZ looks for matching substrings in the dictionary; if a substring is
not found, it is added to the dictionary. Repeated strings in the file are replaced in the
output by their indices in this dictionary, so repeated patterns can be compressed efficiently.
To decompress a file, the dictionary is rebuilt on the fly by executing the compression
algorithm in reverse, which means the dictionary does not need to be sent along. When
introduced in 1977, it was the best compression algorithm known, and the popular 7-zip
compression scheme is still based on the principles of LZ. The original LZ-77 compression was
improved upon by LZ-78, and later extended to Lempel-Ziv-Welch (LZW) in 1984. Lempel-Ziv
compression is widely used in several algorithms: the first DNA-focused compression algorithm,
BioCompress, was heavily based on LZ, and many recent technologies still build upon the same
principles.
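The dictionary-building scheme described above can be sketched as a minimal LZ78-style parser. This is an illustrative toy, not a production coder: real implementations bound the dictionary size and emit bit-level output rather than Python tuples.

```python
def lz78_compress(text):
    """LZ78-style parse: emit (dictionary_index, next_char) pairs.

    Index 0 denotes the empty prefix; every emitted pair adds a new
    phrase to the dictionary, so the decoder can rebuild it on the fly.
    """
    dictionary = {"": 0}
    output, phrase = [], ""
    for ch in text:
        if phrase + ch in dictionary:
            phrase += ch          # keep extending the current match
        else:
            output.append((dictionary[phrase], ch))
            dictionary[phrase + ch] = len(dictionary)
            phrase = ""
    if phrase:                    # flush the trailing phrase
        output.append((dictionary[phrase[:-1]], phrase[-1]))
    return output


def lz78_decompress(pairs):
    """Rebuild the phrase dictionary while decoding, mirroring the encoder."""
    phrases, out = [""], []
    for idx, ch in pairs:
        phrase = phrases[idx] + ch
        phrases.append(phrase)
        out.append(phrase)
    return "".join(out)
```

On repetitive input such as a DNA fragment, phrases grow longer as the parse proceeds, so the number of emitted pairs is smaller than the number of input characters.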
Arithmetic coding
A second widely used technique in lossless data compression is arithmetic coding [4].
Arithmetic coding works by taking the probability that a character will occur and fitting the
probabilities into the range [0,1). A string is read character by character, and the
probability distribution of each character is used to update (shorten) the range. After a
number of characters has been read, the shortest decimal in the final range is chosen to
represent that series of characters. This technique leads to about 2 bits per character for
sequence data. With 2 bits, 4 states can be represented, matching the 4 nucleobases occurring
in DNA, so 2 bits per character serves as a baseline for other compression algorithms.
Arithmetic coding can also be assisted by a Markov Model: a model that predicts the next state
based on the current state or, in an order-n Markov Model, on the n previous states. The
probabilities of moving to a new state (i.e. to a new character) can then be fed into an
arithmetic coder.
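The interval-narrowing process can be illustrated with a toy coder using exact fractions. This is a sketch only: real arithmetic coders use fixed-precision integer arithmetic with renormalization, and the fixed symbol probabilities below (made up for illustration) could be replaced by context-dependent probabilities from a Markov Model.

```python
from fractions import Fraction


def cumulative(model):
    """Turn {symbol: probability} into cumulative [low, high) sub-ranges of [0, 1)."""
    ranges, low = {}, Fraction(0)
    for sym, p in model.items():
        ranges[sym] = (low, low + p)
        low += p
    return ranges


def encode(text, model):
    """Narrow [low, high) once per symbol; any number in the final
    interval identifies the whole string, so return the midpoint."""
    ranges = cumulative(model)
    low, high = Fraction(0), Fraction(1)
    for ch in text:
        width = high - low
        s_low, s_high = ranges[ch]
        low, high = low + width * s_low, low + width * s_high
    return (low + high) / 2


def decode(code, n, model):
    """Invert encode: locate the sub-range containing the code, emit
    its symbol, rescale, and repeat n times."""
    ranges = cumulative(model)
    out = []
    for _ in range(n):
        for sym, (s_low, s_high) in ranges.items():
            if s_low <= code < s_high:
                out.append(sym)
                code = (code - s_low) / (s_high - s_low)
                break
    return "".join(out)
```

Because the interval width after encoding equals the product of the symbol probabilities, frequent symbols cost fewer bits than rare ones, which is exactly what a uniform 2-bit code cannot exploit.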
Raw sequencing data
DNA Compression schemes
Kuruppu (2011, [5]) uses an algorithm based on the general-purpose LZ-77 algorithm, but
specifically modified for sequence data: optimized Relative Lempel-Ziv (opt-RLZ), to compress
genomes which are stored reference-based. Each sequence in a collection is encoded as a series
of factors (substrings) occurring in the reference. Factors are encoded as pairs containing a
lookup position and a length (which is further encoded with a Golomb code); the reference
sequence itself is stored uncompressed. Several optimizations improve on the greedy LZ-77
parsing used to match substrings in a sequence. One of these is to look ahead h characters
when encoding a sequence; the algorithm uses a variation on this concept to search for the
longest factor in a region. A second method used is the matching statistics algorithm to
encode the (position, length) pairs of factors. Factors that are so short that a literal
encoding is cheaper than a lookup pair are encoded literally, again using a Golomb code.
Finally, it is observed that the matched long factors form an increasing subsequence; these
longest increasing subsequence (LISS) factors are encoded differentially with a Golomb code.
The positions of these LISS factors can sometimes be predicted and then need not be encoded,
leading to a further compression boost. This opt-RLZ algorithm is compared, on a dataset of
four human genomes with a reference, to three other compression algorithms: COMRAD, RLCSA and
XMcompress, and is shown to be the best known compression algorithm at the time of writing.
Depending on the genome tested, opt-RLZ outperformed COMRAD and matched XM in compression
ratio. Encodings as low as 0.15 bits per base pair (bpb) and 0.48 bpb are achieved - about
half the bpb achieved by standard RLZ. The authors furthermore note that the uncompressed
reference genome takes up most of the space, so even better results are possible as additional
genomes are added to a dataset. The execution speed of the algorithm is very fast, memory
requirements are low, and the algorithm allows rapid random access to substrings of the
sequence.
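The factor decomposition at the heart of RLZ-style schemes can be sketched as follows. This is a deliberately naive greedy parser for illustration: production implementations use a suffix array over the reference for the longest-match search, and the literal fallback shown here is a simplification of the short-factor handling described above.

```python
def rlz_factorize(sequence, reference):
    """Greedy relative-LZ parse: encode `sequence` as (ref_position, length)
    factors pointing into `reference`, with a (char, 0) literal fallback
    when no match exists (e.g. a symbol absent from the reference)."""
    factors, i = [], 0
    while i < len(sequence):
        best_pos, best_len = -1, 0
        # naive O(n*m) longest-match search; real RLZ uses a suffix array
        for j in range(len(reference)):
            k = 0
            while (j + k < len(reference) and i + k < len(sequence)
                   and reference[j + k] == sequence[i + k]):
                k += 1
            if k > best_len:
                best_pos, best_len = j, k
        if best_len == 0:
            factors.append((sequence[i], 0))   # literal character
            i += 1
        else:
            factors.append((best_pos, best_len))
            i += best_len
    return factors


def rlz_expand(factors, reference):
    """Decompress by copying factor substrings back out of the reference."""
    out = []
    for pos, length in factors:
        out.append(reference[pos:pos + length] if length else pos)
    return "".join(out)
```

Since only the factor pairs need to be stored per sequence, the cost of the (uncompressed) reference is amortized over every genome in the collection.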
One of the many file formats for storing raw reads is the FASTQ format, containing the reads
together with associated metadata (e.g. read ID, base calls, quality score...). When compressed
with the general-purpose Gzip compressor, a 3-fold size reduction can be obtained. Using a
combination of Huffman coding and a scheme similar to Lempel-Ziv, the dedicated FASTQ
compressor DSRC (introduced by Deorowicz in 2011 [6], and improved upon in 2014 by DSRC2)
can obtain a compression ratio of 5. DSRC is fast and handles variable-length reads with an
alphabet size beyond five. Improvements upon DSRC use an additional arithmetic encoder,
group reads with a recognized overlap together, or use a preprocessor (SCALCE) to further
improve the compression ratio obtained with Gzip.
When the choice is made to go with lossy compression, the quality scores of a read are a
natural candidate for allowing some loss of information; some tools ignore this data, so it can
be omitted entirely for certain purposes. Another approach is to filter out reads which do not
meet a required quality level. SeqDB [7] is another FASTQ-dedicated compression scheme which
focuses on speed, with speedups of over 10 times compared to DSRC, but with compression
ratios at best at Gzip's level. A more recent algorithm is Quip [8], using higher-order
modelling, arithmetic coding and an assembly-based approach: the idea is to use the first few
(million) reads as a reference for the following ones. Depending on the algorithm mode used,
ratios are significantly better than DSRC at a slightly lower speed. A somewhat unique feature
of Quip is its ability to work with both aligned and non-aligned reads, with a reference or
standalone.

One of the latest (2015) schemes developed for FASTQ files is LFQC [9], a new lossless
non-reference-based compression algorithm outperforming existing big data compression
techniques such as Gzip, Fastqz, Quip and DSRC2 on selected datasets.
(Reference) Aligned reads
While techniques for raw reads are being developed as well, reads are usually assembled and
aligned to a reference genome. A common file format used for this is the SAM format, which
augments the read data with additional quality information, leading to files about twice the
size of FASTQ files. Another format, BAM, is a Gzip-like compressed equivalent of the textual
SAM format. Some compression schemes can handle both aligned and unaligned reads. One of the
compressors used for reference-based reads is the CRAM toolkit1, a framework and specification
developed by the European Bioinformatics Institute, achieving a 40%-50% size reduction
compared to BAM files. For aligned reads, the mapping coordinates and differences from the
reference are stored. For unaligned reads, an artificial reference is constructed with the
sole purpose of compression. The tool allows for lossless and lossy compression, with several
options to define its behaviour. Other similar tools are SlimGene [10] and SAMZIP [11]. The
previously mentioned Quip can operate on SAM/BAM formats using a reference as well, and
performs better than CRAM. Another available program is tabix, a generic tool to perform
indexing, searching and compression ([12]): a textual file is sorted and split into blocks
which are compressed using Gzip, and an index is built to allow random-access queries.
One algorithm to compress reads using a reference sequence is discussed by Fritz et al.
(2011, [13]), taking advantage of the fact that most reads in a run match the reference near-
perfectly. The algorithm takes a mapping of the reads to the reference and efficiently stores
that mapping plus any deviations, using an artificially constructed reference. First, the
lookup position of each read on the reference is stored; the length of each read is Huffman
encoded. The reads are then ordered by lookup position, allowing an efficient relative
encoding of the differentials between successive values using a Golomb code. Variations from
the reference are stored as an offset relative to the lookup position of the read, together
with the base identities or lengths, depending on the type of variation. These offsets are
again Golomb encoded. The read pair information is also stored relative to each other and
Golomb coded. On varying data sets, this technique results in a storage requirement of between
0.02 and 0.66 bits per base pair for the aligned portions of the sequence data. These numbers
compare favorably to general-purpose bzip2 compression of DNA (1 bpb) and are considerably
more efficient than BAM compression: a 5 to 54-fold ratio compared to compressed FASTA or BAM
is achieved. Next to aligned sequence data, however, usually 10% up to 70% of the reads are
unmapped to a reference, often dominating storage costs. For this purpose, an artificial
reference is assembled from similar experiments (e.g. similar species). Using a third human
sequence and a database of bacterial and viral sequences, 17% of the unmapped reads could be
mapped to these artificial references, with a resulting compression performance of 0.026
bits/base. A large 83% of the unmapped reads, however, could not be mapped to any of the used
references; these reads are compressed using general-purpose techniques. This combination of
techniques leads to a significantly better compression performance than traditional approaches
on real data, with a 10 to 30-fold better ratio. The authors note this compression scheme
performs better with longer read lengths, which newer sequencing technology offers.
1http://www.ebi.ac.uk/ena/software/cram-toolkit
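Golomb codes recur above for encoding position differentials. A sketch of the power-of-two special case, the Rice code, shows the idea: small gaps between sorted read positions get short codewords. The parameter k is chosen to match the gap distribution; the values below are illustrative only.

```python
def rice_encode(n, k):
    """Rice code (Golomb with m = 2**k) of a non-negative integer n:
    unary-coded quotient, a '0' terminator, then the k-bit remainder."""
    q, r = n >> k, n & ((1 << k) - 1)
    return "1" * q + "0" + (format(r, f"0{k}b") if k else "")


def rice_decode(bits, k):
    """Inverse of rice_encode; returns (value, number_of_bits_consumed)."""
    q = 0
    while bits[q] == "1":
        q += 1
    r = int(bits[q + 1:q + 1 + k], 2) if k else 0
    return (q << k) + r, q + 1 + k
```

Applied to sorted read positions such as [100, 104, 115], one stores the first position and then the gaps 4 and 11, which are small and therefore cheap under a Rice code with small k.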
Full genome sequences
Single genome compression
Raw sequencing data of a single genome poses the greatest challenge for storage and transfer.
The genome sequence of a single individual is very hard to compress due to the lack of
well-understood structure and patterns. When only the symbols A, C, G and T are used, the
simplistic general-purpose encoding using 2 bits per symbol often outperforms 'smart'
general-purpose compressors like Gzip. Nonetheless, specialized DNA compressors are being
developed in the hope of improving upon this baseline. The highly acclaimed XM achieves
compression ratios of up to 5, but at an impractically low speed. Other notable compressors
are dna3 and FCM-M.
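The 2 bits per symbol baseline is straightforward to implement. The sketch below assumes a pure {A, C, G, T} alphabet; real data also contains ambiguity codes such as N, which is one reason more general compressors remain necessary.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASE = "ACGT"


def pack(seq):
    """Pack a pure-ACGT sequence at 2 bits per base; returns (bytes, length)."""
    out, byte, nbits = bytearray(), 0, 0
    for ch in seq:
        byte = (byte << 2) | CODE[ch]
        nbits += 2
        if nbits == 8:
            out.append(byte)
            byte, nbits = 0, 0
    if nbits:
        out.append(byte << (8 - nbits))   # left-align the final partial byte
    return bytes(out), len(seq)


def unpack(data, n):
    """Recover the n-base sequence from its 2-bit packed representation."""
    seq = []
    for i in range(n):
        byte = data[i // 4]
        shift = 6 - 2 * (i % 4)           # bases are stored high bits first
        seq.append(BASE[(byte >> shift) & 3])
    return "".join(seq)
```

The sequence length is stored alongside the bytes because the final byte may be padded; this fixed 4x reduction over 8-bit characters is the baseline any specialized DNA compressor must beat.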
XM [14] is an Expert Model algorithm using arithmetic coding. Its unique property is how it
determines the probabilities of the characters. The algorithm uses a collection of experts, an
expert being anything that gives a good probability distribution for a position in the
sequence. Examples are the previously mentioned Markov Model, or a copy expert, which
determines whether something is likely a copy of a known block. After obtaining probabilities
for the characters, XM combines these and feeds them into an arithmetic coder. Additionally,
the experts are weighted based on their past accuracy. In comparison with other genome-specific
algorithms (BioCompress, GenCompress, DNACompress, DNAPact), XM performs better than the other
algorithms on average and on most genomes. XM achieves a compression well under 2 bpb, clearly
showing that a specialized algorithm is able to perform better than generic compression
algorithms (such as simple arithmetic coding).
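The expert-combination step can be sketched as a simplified weighted mixture with exponential discounting. The decay constant and update rule below are illustrative assumptions, not the actual XM weighting scheme, which differs in detail.

```python
def combine_experts(predictions, weights):
    """Weighted mixture of expert distributions over the alphabet {A, C, G, T}."""
    total = sum(weights)
    mix = {b: 0.0 for b in "ACGT"}
    for dist, w in zip(predictions, weights):
        for b in "ACGT":
            mix[b] += (w / total) * dist[b]
    return mix


def update_weights(weights, predictions, actual, decay=0.95):
    """Reward each expert in proportion to the probability it assigned to
    the symbol that actually occurred, discounting older performance."""
    return [w * decay + (1 - decay) * dist[actual]
            for w, dist in zip(weights, predictions)]
```

The mixture is itself a valid probability distribution, so it can be fed directly into an arithmetic coder; experts that keep predicting well gradually dominate the mixture.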
Tabus & Korodi proposed a sophisticated DNA sequence compression method based on
normalized maximum likelihood (NML) discrete regression for approximate block matching
(2003, [15]). T&K first breaks the sequence into equally-sized blocks. Each block is compressed
using three different methods, with the most efficient one selected for storage. The first
method uses a Markov Model, the second method is simple arithmetic coding, and the third
method looks for matches in the previous blocks and compresses the block using differences.
The third method works in a few subsequent steps. First, a previous block with approximately
equal content is searched for. The positions of equality are stored in a string, and for the
differences, distances from the reference block are arithmetically encoded. A probability
distribution is constructed to apply a form of arithmetic coding to the distances. From that
distribution a second probability distribution is created, called a universal code, which
creates prefixes in a way that performs well on all possible source distributions. Compression
results on chromosomes of the human genome lie between 1.449 and 1.616 bpb. No comparisons
have been made to other algorithms, however, so the effectiveness of the T&K method is
somewhat hard to evaluate. The T&K method was further developed in 2005 into GeNML [16].
Genome collections
When databases of many individual genomes of the same species are considered, the scenario
is significantly different, as more knowledge of the combined genomes can be exploited. These
genomes are very likely highly similar, sometimes sharing over 99% of their content, and as
such a collection can be compressed more efficiently than standalone material. Several
algorithms working on referenced genomes show improved performance with a bigger dataset of
genomes. General-purpose algorithms (Gzip, rar) are usually not applicable here, since
repetitions in the data can be gigabytes apart. Variations on a reference genome consist of
Single Nucleotide Polymorphisms (SNPs) and indels, insertions or deletions of multiple
nucleotides. Assuming these variations to a reference are known, plus a readily available SNP
database, a single human genome has been compressed to about 4 Mbytes. Recent techniques have
been able to compress collections to an extent of a few hundred kilobytes per individual
(2009, [17]). Standard compression techniques such as Huffman coding and Lempel-Ziv have been
used here as well, but do not achieve the performance of specialized compression schemes. The
GDC2 compressor achieves a compression ratio of 9500 for relatively encoded genomes in a large
collection, with fast execution time [18]. Some specialized compressors such as GDC and LZ-End
allow access to an arbitrary string in the collection, at the expense of the compression ratio
achieved. An advanced general-purpose LZ compressor, 7zip, is able to achieve competitive
results as well, provided the data is in a specific order.
COMRAD (Compression using Redundancy of DNA sets) ([19], [20]) generates a dictionary
of substrings of length L and keeps track of their frequencies. It counts the (non-overlapping)
occurrences of the most frequent substring, replaces each occurrence with a new symbol, and
stores the replacement rule. This process of counting and replacing is repeated until the
counts fall below a threshold. The output is a string plus the set of replacements that were
made, allowing the sequence to be decompressed by applying the replacements in reverse. Using
a dataset of human genomes, bacteria and viruses, COMRAD is compared to RLZ, RLCSA and
arithmetic coding. On average, arithmetic coding leads to 2.06 bpb, while COMRAD achieves a
relatively efficient 1.10 bpb. As the compared algorithms are not DNA-specific and do not
perform better than simple arithmetic coding, they are a poor choice for comparison, and a
comparison with better-performing methods should be done.
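The substitution loop described above can be sketched as follows. This is a toy version under simplifying assumptions: it counts sliding-window occurrences where COMRAD counts non-overlapping ones, and the fresh single-character symbols are assumed absent from the input alphabet.

```python
from collections import Counter


def comrad_compress(seq, L=4, threshold=2):
    """Repeatedly replace the most frequent length-L substring with a
    fresh one-character symbol, recording the replacement rules."""
    rules = {}
    symbols = iter("0123456789")   # fresh symbols, assumed not to occur in seq
    while len(seq) >= L:
        counts = Counter(seq[i:i + L] for i in range(len(seq) - L + 1))
        best, n = counts.most_common(1)[0]
        if n < threshold:
            break
        sym = next(symbols)
        rules[sym] = best
        seq = seq.replace(best, sym)   # non-overlapping replacement
    return seq, rules


def comrad_expand(seq, rules):
    """Decompress by applying the replacement rules in reverse order,
    since later rules may reference symbols introduced by earlier ones."""
    for sym, sub in reversed(list(rules.items())):
        seq = seq.replace(sym, sub)
    return seq
```

Each pass shrinks the string by L-1 characters per replaced occurrence, so highly repetitive collections (the COMRAD use case) shrink quickly while the rule set stays small.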
Note While several existing compression schemes for a variety of scenarios are mentioned in
the previous section, it should be noted that this survey is by no means exhaustive, and many
more techniques exist. In evaluating compression schemes, some difficulties arise when trying
to perform experimental tests and comparisons. Many tools are limited in their functionality,
as they are often designed with a specific scenario in mind, each having its own constraints
and characteristics. They might only accept sequences with a limited alphabet, fixed-width
reads, assume specific ID formats, disregard metadata... This makes an honest comparison
rather hard to perform. Some existing tools show significant problems when trying to run them.
While open source code is sometimes available, some tools are proprietary, do not disclose the
settings leading to their published results, and are therefore very hard to evaluate. The
output formats of some of these tools are not always compatible, with features either not
supported or impossible to turn off, which often makes a comparison not entirely fair. The
large variety of file formats for DNA material poses a difficulty as well. Lastly, the lack
of a benchmark dataset is probably the largest hurdle in making fair comparisons possible.
Each research group uses its own dataset to present numbers which are thus often not
comparable in a transparent way, and with the years and advances in technology, these datasets
differ significantly. This specific problem has been addressed by the machine learning
community in the form of MNIST, ImageNet and other publicly available benchmark datasets. A
similar concept to allow compression method benchmarking would certainly improve the
transparency of many of the published research efforts.
2.2 Image manipulation with machine learning
This section focuses on machine learning successes in computer vision applications, where
these techniques have proven to be extremely successful and have led to the development of the
deep learning discipline. The technique of interest is the concept of Artificial Neural
Networks (ANNs). They have been tried on a variety of problems and have proven superior to
traditional methods in many situations. This section discusses a few works applying ANNs
specifically to image manipulation, classification and compression.
Classification
As ANNs are extremely successful in computer vision applications, a large body of research
papers has been published over the years on (convolutional) neural networks for image
classification. The gold standard of machine learning classification has long been the MNIST
dataset, which consists of labeled grayscale pictures of handwritten digits. For the purpose
of comparing the performance of new networks, a new dataset was later constructed:
CIFAR-10/100, consisting of tiny labeled colour images. Since 2010, the most popular benchmark
used in object detection and image classification research has been the ImageNet Large Scale
Visual Recognition Challenge (ILSVRC). This section shortly describes some winning competitors
of these competitions.2
Krizhevsky ([22]) discusses the application of ANNs to image classification using the
ImageNet dataset, a set of 15 million high-resolution labeled images used in the
ILSVRC-2010/2012 competitions. They use a large convolutional network (often abbreviated CNN
or ConvNet) with 5 convolutional and 3 fully-connected layers, implemented on GPUs, to achieve
a result by far outscoring any result ever reported on this dataset. A few novel techniques
are discussed to prevent overfitting the model and to speed up training: they introduce
Rectified Linear Units (ReLUs) as the nonlinearity in the CNN, apply GPU-specific architectural
decisions, and add a normalization method and overlapping pooling in the pooling layers of the
network. Overfitting is reduced by using data augmentation and dropout. Compared with existing
results in the competitions, they achieve an error rate which is more than 10% lower than
existing state-of-the-art solutions. Their results show that a large, deep convolutional
neural network is capable of achieving record-breaking results on a highly challenging dataset
using purely supervised learning.
Simonyan and Zisserman ([21]) improve upon these state-of-the-art convolutional neural
nets. They perform a thorough investigation of the architecture of these networks, specifically
aimed at the depth and the convolution filter sizes. By stacking a large number of convolutional
layers with small filters, they arrive at significantly more accurate ConvNet architectures,
leading to the winning submissions of the ImageNet Challenge 2014 in both image classification
and localisation. Combining several of their deep models into an ensemble leads to a record
performance of 24.4% and 7.1% for the top-1 and top-5 error rates on ILSVRC. While still
adhering to the original ConvNet architecture, by deepening the configuration they outperformed
all winning networks from previous years. The conclusion of the work is that very deep
ConvNets (sometimes having over 20 layers) outperform existing architectures, confirming the
general adage of "the deeper, the better".
2For a more complete overview of image classification with ILSVRC, see [21] and the very useful http://rodrigob.github.io/are_we_there_yet/build/classification_datasets_results.html, which ranks the published results on the most widely used benchmark datasets and where possible links to the papers.
Compression
Traditional techniques for image compression include predictive coding, transform coding and
vector quantization. Several standards regarding compression of (audio)visual material are in
widespread use today. While not widely used in practice, there has been some research on image
compression using neural networks. As ANNs perform well with incomplete or noisy data,
they can be expected to perform well on images and visual data. ANNs (and machine learning
in general) can process input patterns into a simpler pattern with fewer components/coefficients
as an internal representation. This internal pattern, stored in the neurons of a hidden layer,
represents the external input information in a more compact way, leading to compression. Deep
learning has led to breakthroughs in learning hierarchical representations from images.
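The bottleneck idea described above can be illustrated with a minimal sketch: a linear auto-encoder trained with gradient descent to squeeze 4-dimensional inputs through a 2-unit hidden layer and reconstruct them. The toy data, sizes and learning rate below are invented for illustration and do not come from any of the cited works.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 four-dimensional points that really live on a 2-D subspace,
# so a 2-unit bottleneck can represent them compactly.
latent = rng.normal(size=(200, 2))
mix = rng.normal(size=(2, 4))
X = latent @ mix

# Linear auto-encoder: encode 4 -> 2 (the hidden layer), decode 2 -> 4.
W_enc = rng.normal(scale=0.3, size=(4, 2))
W_dec = rng.normal(scale=0.3, size=(2, 4))

lr = 0.02
for _ in range(3000):
    code = X @ W_enc           # compact internal representation
    X_hat = code @ W_dec       # reconstruction from the code
    err = X_hat - X
    # Gradients of the mean squared reconstruction error.
    grad_dec = code.T @ err / len(X)
    grad_enc = X.T @ (err @ W_dec.T) / len(X)
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc

mse = float(np.mean((X @ W_enc @ W_dec - X) ** 2))
```

After training, the 2-D `code` carries most of the information of the 4-D input, which is exactly the "internal pattern with fewer components" that makes compression possible.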
[23] is an early work discussing a Direct Solution Method (DSM) based neural network with
an auto-encoder architecture. The DSM operates contrary to most approaches, which use iterative
training and Error BackPropagation (EBP). A multi-layer perceptron with a single hidden layer
is used, where the output layer neuron values are expressed as a linear system of equations.
The weights of the network are found by solving this system using a Modified Gram-Schmidt
method. This method is chosen over traditional iterative approaches because it is stable, faster,
and requires less computing power. Comparing compression and reconstruction of the DSM
and EBP methods, a compression ratio of 3:1 was found, with both methods performing well and
DSM being the faster one.
Dutta et al. (2009, [24]) use a multi-layer feed-forward neural net with sigmoid activation
functions to perform image compression. Some standard pictures are compressed, and
median filtering and histogram equalization are applied to the reconstructed grayscale images.
Compression rates of around 85% are achieved. While the PSNR is used as a quality measure,
only a visual display shows the results, and unfortunately no comparisons are made with
existing compression codecs.
2.3 Auto-Encoders as feature learning models
While most of the published research on ANNs handles image classification and object recog-
nition, which fall under the supervised learning category of machine learning techniques, there
has been active research on applying ANNs in unsupervised learning as well (albeit with far
less attention), bridging from their good performance in computer vision tasks to other similar
applications. The majority of this research focuses on complementing the methods: using an
un- or semi-supervised learning stage before applying a supervised classification task. As the
auto-encoder will be used in the methodology of this work, this section provides a closer look
at research on this particular technique.
Kulkarni et al. (2015, [25]) construct a Deep Convolutional Inverse Graphics Network (DC-
IGN) in order to learn image features. They then use the network to manipulate these features
in order to render an (object in an) image with a different viewpoint, lighting condition, etc. The
architecture is a variational auto-encoder, with the encoder part consisting of multiple layers
of convolution & max-pooling, and the decoder having symmetrical inverse equivalents. This
leads to a middle hidden layer holding semantically-interpretable graphics features. By recreat-
ing a modified version of an image, they discuss the model's efficacy at learning a 3D rendering
engine. By using specific batch training datasets, they train the model to express specific fea-
tures with specific neurons in a hidden layer. They conclude that by utilizing a deep convolution
and de-convolution architecture within a variational auto-encoder formulation, it is possible to
train a deep convolutional inverse graphics network with a graphics code layer representation
from static images.
Radford et al. (2015, [26]) introduce deep convolutional generative adversarial networks
(DCGANs), a class of CNNs with certain architectural constraints. The aim is to
automatically learn a feature representation from unlabeled data, which can then be reused
in supervised learning tasks. The DCGANs are shown to learn a hierarchy of representations,
from object parts to entire scenes, by training on various image datasets. By using the fea-
ture representation as input vector to supervised image classification models, they are able to
outperform traditional techniques (K-means, Exemplar CNN) on the CIFAR dataset. The con-
clusion is a set of stable architectures for generative adversarial networks and evidence that
these networks can learn good representation vectors for further use.
Zhao (2015, [27]) presents a novel architecture called the stacked what-where auto-encoder
(SWWAE). The architecture consists of a convolutional net as discriminative model followed by
a deconvolutional net as generative model. By selectively enabling only parts of the architecture,
it provides a unified approach to supervised, unsupervised and semi-supervised
learning. The novelty they introduce is an adaptation of the pooling stage (with an associated
modified loss function). Their idea is that whenever a layer implements a many-to-one mapping
(as pooling does), they compute a set of complementary variables to improve reconstruction
ability. Each pooling stage outputs a what value, which is the content fed to the next
layer. Additionally, a where variable is saved, informing the corresponding decoding step
about the location of certain features. Each convolution + pooling stage and its corresponding
decoding stage then forms a stacked auto-encoder. Their results on existing image classification
sets show comparably good accuracy, yet do not improve on the state of the art.
One application of auto-encoders is the denoising Auto-Encoder (dAE), introduced by Vin-
cent (2010, [28]). In order to force an auto-encoder to learn useful features instead of the
identity function, the auto-encoder is locally trained to reconstruct its input from a corrupted
version. By stacking layers of dAEs, a deep network is formed that is shown to be able to
learn useful features from natural images or digit images (MNIST). Experiments using the
denoising objective as unsupervised training criterion in auto-encoders, complemented with ex-
isting supervised classification methods, show a significant improvement. The conclusion is the
clear establishment of the denoising criterion as a valuable unsupervised objective to guide the
learning of useful higher-order data representations.
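The corrupt-then-reconstruct objective can be sketched in a few lines of NumPy. The toy 8-pixel "images", network sizes, corruption level and learning rate below are invented for illustration and are not taken from [28]; the essential point is only that the target of the reconstruction is the clean input, while the network only ever sees the corrupted one.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy 8-pixel "images": each sample lights up either the left or the right half.
left = np.tile([1, 1, 1, 1, 0, 0, 0, 0], (150, 1)).astype(float)
right = np.tile([0, 0, 0, 0, 1, 1, 1, 1], (150, 1)).astype(float)
X = np.vstack([left, right])

# Corruption: randomly zero out roughly 30% of each input's pixels.
X_noisy = X * (rng.random(X.shape) > 0.3)

W1 = rng.normal(scale=0.5, size=(8, 4))   # encoder weights (8 -> 4)
W2 = rng.normal(scale=0.5, size=(4, 8))   # decoder weights (4 -> 8)

def reconstruct(X_in):
    return sigmoid(sigmoid(X_in @ W1) @ W2)

loss_before = float(np.mean((reconstruct(X_noisy) - X) ** 2))

lr = 1.0
for _ in range(2000):
    H = sigmoid(X_noisy @ W1)
    X_hat = sigmoid(H @ W2)
    # Denoising objective: reconstruct the CLEAN X from the corrupted input.
    d_out = (X_hat - X) * X_hat * (1 - X_hat)
    d_hid = d_out @ W2.T * H * (1 - H)
    W2 -= lr * H.T @ d_out / len(X)
    W1 -= lr * X_noisy.T @ d_hid / len(X)

loss_after = float(np.mean((reconstruct(X_noisy) - X) ** 2))
```

Because the identity function cannot map a corrupted input to its clean original, the hidden layer is pushed towards features (here: "which half is lit?") rather than a trivial copy.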
Masci (2011, [29]) introduces a novel convolutional auto-encoder (CAE) for unsupervised
learning. By stacking these and adding pooling layers, a ConvNet is formed, trained using
traditional methods. The CAE serves as a scalable hierarchical unsupervised feature extractor,
learning good ConvNet initializations and avoiding the problem of getting stuck in local minima
of highly non-convex objective functions (arising in virtually all deep learning problems). The
(novel) convolutional variant of the auto-encoder exploits the 2D image structure of the data,
leading to parameter sharing among locations in the image. It thus preserves spatial locality,
and the discovery of localized, repeated features in data is the property that made ConvNets
excel in object recognition tasks. A stack of CAEs is trained and used as initialisation for a
ConvNet to perform classification on MNIST and CIFAR. They conclude that pre-trained CNNs
slightly but consistently outperform randomly initialized nets.
Building on top of denoising auto-encoders, Rasmus et al. (2015, [30]) build a Ladder
Network consisting of a stacked auto-encoder architecture with skip connections between en-
coder/decoder pairs, where each layer of the network is trained separately. The Ladder
Network is basically a collection of nested denoising auto-encoders which share parts of the de-
noising machinery with each other. The idea behind skip connections is to alleviate the need for
the model to capture details in the encoding step, as the decoder can recover these discarded
details through the direct connections. Their approach to unsupervised learning is compatible
with existing supervised feed-forward networks, scalable and computationally efficient. By
reaching state-of-the-art performance on supervised tasks (MNIST & CIFAR), they show how a
simultaneous unsupervised learning stage improves the performance of existing neural nets.
Chapter 3
Theory
This chapter gives a short overview of the theory and principles used throughout
this work. It starts with a general overview of the concepts of machine learning, then discusses
one particular technique (Artificial Neural Networks or ANNs), and finally discusses a special
application of these ANNs which is applied in this work, namely Auto-Encoders (AEs).
3.1 Brief introduction to machine learning
This section will sketch a brief overview of the theory of machine learning and some important
concepts and distinct techniques.
Machine Learning (ML) is currently a highly active area in computer science, rooted in pat-
tern recognition and learning theory in Artificial Intelligence, and closely related to computational
statistics and mathematical optimization. In machine learning, algorithms learn to perform
tasks without explicit programming by hand; there is an emphasis on automatic methods. The
goal is to devise learning algorithms that learn effectively and automatically, without human in-
tervention. The motivation is that code is expensive and specialized to a particular
application, while data is often abundant and increasingly becoming cheaper, objective and thus
usable for multiple purposes. Contrary to traditional programming, where a program is written
allowing a computer to solve a task directly, in ML a method is sought whereby a computer
comes up with its own program, based on example data provided as input. Advanced statistics
are used to create usable programs from template algorithms and structures. By having simple
templates but with a large number of parameters, these techniques are general-purpose, yet
able to model complex relationships and programs.
Machine learning is a data-driven methodology. Often, the relationships in the data have a
complexity too high for humans to comprehend or program in a traditional way.
By adjusting numerous parameters in a template model based on training samples, the goal in
applying an ML algorithm is to build a program which automatically constructs its logic and
rulesets based on the data.
The goal of an ML approach is to find an unknown target function f that expresses the
input/output relationship

f : X → T (3.1)

with the input domain X consisting of vectors x_i of dimension M, and the output domain
T of target vectors t_i of dimension N. The targets represent either a class label (N = 1)
or one or more real-valued numbers (N ≥ 1).

x_i = (x_1, ..., x_M)^T, t_i = (t_1, ..., t_N)^T (3.2)
The first phase is the training phase. Suppose a set of training samples D_N is given, compris-
ing N data points, each consisting of an input vector x_i with an associated output (target)
value t_i.

D_N = {(x_1, t_1), ..., (x_N, t_N)} (3.3)
A model is then constructed through a learning algorithm that tries to map the data
points of this training set to their known target values. The learning algorithm constructs
a parametrized function h_θ(x) out of a hypothesis set H of candidate mapping functions. This
function should be a good approximation of the actual target function f.

h_θ : X → T, h_θ ∈ H, h_θ ≈ f (3.4)
The learning often uses an iterative procedure and is performed by defining an error measure
E(h, f). This misfit between the predicted target and the actual known target during the
training phase is fed back to the learning algorithm, which can update
the model parameter set θ accordingly to try to reduce this error.
E(h_θ, f) = e(h_θ(x), f(x)) (3.5)
Additionally1, a probability distribution P on X can be applied to the input and target values.
This probability distribution accounts for noisy data, and allows a degree of certainty
to be provided for a predicted target value.
By repeatedly evaluating samples from the training set and adjusting the model
parameters to reduce the error measure, the model will hopefully learn a function h_θ which
approximates f to some sufficient precision.
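The loop just described can be sketched concretely for the simplest possible hypothesis set, a linear h_θ with a squared-error misfit; the target function, data and learning rate below are invented for this illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Unknown target function f the learner must approximate.
def f(x):
    return 3.0 * x + 1.0

# Training set D_N = {(x_1, t_1), ..., (x_N, t_N)} with noisy targets.
N = 100
x = rng.uniform(-1, 1, size=N)
t = f(x) + rng.normal(scale=0.1, size=N)

# Hypothesis h_theta(x) = theta[1] * x + theta[0].
theta = np.zeros(2)

def h(theta, x):
    return theta[1] * x + theta[0]

def E(theta):
    # Error measure: mean squared misfit between prediction and known target.
    return float(np.mean((h(theta, x) - t) ** 2))

# Iterative training: feed the misfit back to update the parameter set theta.
lr = 0.1
for _ in range(500):
    err = h(theta, x) - t
    grad = np.array([np.mean(err), np.mean(err * x)])
    theta -= lr * grad
```

After the loop, `theta` is close to (1, 3), i.e. h_θ ≈ f up to the noise in the targets.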
The goal of the algorithm is to build a model that performs this mapping task on new,
unknown input data (the test set) during the testing (evaluation) phase. In order to perform
this task well, the proposed function must closely approximate the unknown function f that
represents the real input-output relation. The model then applies this function to new input x
in order to produce a prediction t̂ of the unknown target value t.

h_θ(x) = t̂, t̂ ≈ t (3.6)
An important feature of machine learning is the ability to generalize: performing good
predictions for unknown data. As input samples often comprise only a fraction of the input
space, generalization is a central goal in pattern recognition. Care must be taken not to tai-
lor the model too closely to the specific subset of input samples. This phenomenon is known as
overfitting and leads to poor performance on unseen data. Several techniques are used to avoid
overfitting in machine learning algorithms. Another important step in the process is called
feature extraction. Often, input data is of a high dimensionality. This data is preprocessed in
order to extract relevant features before feeding it to the model. Transformation of the data to
a new variable space can make the pattern recognition easier. Feature extraction can preserve
only the useful dimensions and discard non-discriminatory information, which allows for a
speedup in the processing.
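A minimal sketch of such a dimensionality-reducing preprocessing step is Principal Component Analysis via the SVD; the toy data below (5-D inputs that mostly vary along 2 directions) is invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)

# 100 samples of 5-D input data whose variation lies mostly along 2 directions.
basis = rng.normal(size=(2, 5))
X = rng.normal(size=(100, 2)) @ basis + rng.normal(scale=0.05, size=(100, 5))

# PCA via the SVD: keep only the directions carrying the most variance.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
features = Xc @ Vt[:2].T           # 5-D inputs reduced to 2-D features

# Fraction of the data's variance preserved by the 2 kept dimensions.
explained = float((S[:2] ** 2).sum() / (S ** 2).sum())
```

The discarded three dimensions carry almost only noise here, so the 2-D `features` keep nearly all the discriminatory information while shrinking the input a model has to process.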
Depending on the type of problem, a few distinctions can be made between ML algorithms.
Applications where the training data consists of input vectors with associated target vectors
are known as supervised learning. The data is labelled in that case, which allows for easy
1Often, but not necessarily available, depending on the ML technique used.
verification of the model during testing. The goal of the problem is to assign new data points a
label (or real-valued number). In other scenarios, the training data lacks any corresponding target
values. The goal of these so-called unsupervised learning problems is not to predict a variable
based on the input features. Rather, the model tries to discover groups and similarities in the
samples, or it tries to find a probability distribution of the data. This is done in order to find
the general structure in the shape of the dataset (for example to reduce the dimensionality
of the data). Lastly, there are reinforcement learning techniques, where a set of actions must
be determined in order to maximize a reward. Here, the concept of an environment is used,
and the form of the reward must be defined. Some ML techniques are parametric models.
These models perform training, but do not need to remember the data for evaluation. Other,
non-parametric models use the entire dataset when performing predictions, typically without
the need for much training. Some problems are convex, where optimization leads to a global
minimum; the best model for the dataset can be achieved. Other techniques have a non-convex
loss function and do not have a guaranteed single best stationary solution.
Over the years, a large set of techniques has been developed to address a wide range of
problems, each having different requirements in available training data, speed (of training or
evaluation), performance, stability and other features. Some well-known techniques are Logis-
tic Regression, Naïve Bayes, k-Nearest Neighbours, Random Forests, Support Vector Machines,
(Convolutional/Artificial) Neural Networks, Linear Discriminant Analysis, Principal Compo-
nent Analysis, and Bagging & Boosting. Some reinforcement learning techniques are Q-learning
and SARSA.
Underfitting and overfitting
The goal in applying ML to a problem is finding patterns in data. One constructs a model
that tries to approximate the data by a mathematical function, the target function f(x). It
does so by looking at known training data and learns an approximation of the data's behaviour
and characteristics. Assuming the training examples are a representative set
of the new and unknown data the model will have to process after training, it will be able to
work well with unseen test data based on similarities between these data sets. Two phenomena
are very important because they threaten a successful application:
underfitting and overfitting. This subsection therefore discusses these problems
in some more detail.
When constructing a model, there is the choice of working with a very simple or a com-
plex model. Simple models are limited in their expressive power and modelling capacity, while
complex models are the opposite: they are powerful, but often require a very large amount of
computing power. With high-dimensional data, the complexity of a model can very soon become
unwieldy to work with, a phenomenon known as the curse of dimensionality. A complex
model will be able to learn very specific behaviour of the training data. The risk, however, is that
the approximation will result in a target function so specific to those training samples that
the model will perform very poorly on new data that differs slightly from them.
The model is then said to overfit and not generalize well. On the other hand, a simple
model might not be able to capture the necessary characteristics of the data it is working with.
No matter how much new data is available for training, the result will be a function that lacks
useful expressive power and is of not much interest; e.g. the output could always be the average
point of a data set. This end of the spectrum is known as underfitting: while simple, the model
is too general to be of use. One mostly aims to strike a balance between these extremes: a
model should be sufficiently powerful to find the important characteristics of the data and still
generalize well to new data.
Figure 3.1 illustrates these phenomena. The blue dots are data points sampled from a source
with a sine distribution and some random noise added. A machine learning model will try to
fit a curve f(x) to these points, which is the target function describing the behaviour of the
data. The function here is a polynomial of a chosen degree, and curve fitting is done
through the least-squares error method. Figure 3.1a uses a 1st-degree function. It is clear this is
not adequate to describe the data. The model is not expressive enough to find patterns in the
data. This is a case of underfitting. Figure 3.1b shows a 3rd-degree polynomial, fitting the data
quite well. When new data comes in from the sine source (indicated by the red points), the
curve fits them well too. The model has found the pattern in the data, and generalises well
to new data. This is the desirable case. Figure 3.1c shows the case of overfitting. A 10th-degree
polynomial is fitted to the blue data. This fits the blue data exactly. However, the new red
data are not at all represented in a good way by the approximation. The curve is specialized
too much to the known training data, and does not generalise well to new data. This model
will thus perform suboptimally during its eventual application on new data.
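The experiment behind Figure 3.1 can be reproduced in a few lines with NumPy's least-squares polynomial fit; the sample counts and noise level below are invented and need not match the figure exactly.

```python
import numpy as np

rng = np.random.default_rng(3)

# 15 noisy training samples from a sine source, plus clean "new" test data.
x_train = np.sort(rng.uniform(0, 2 * np.pi, 15))
t_train = np.sin(x_train) + rng.normal(scale=0.15, size=x_train.size)
x_test = np.linspace(0, 2 * np.pi, 100)
t_test = np.sin(x_test)

def fit_and_score(degree):
    coeffs = np.polyfit(x_train, t_train, degree)   # least-squares curve fit
    train_mse = float(np.mean((np.polyval(coeffs, x_train) - t_train) ** 2))
    test_mse = float(np.mean((np.polyval(coeffs, x_test) - t_test) ** 2))
    return train_mse, test_mse

underfit = fit_and_score(1)    # too simple: misses the sine pattern entirely
good = fit_and_score(3)        # captures the overall shape of the data
overfit = fit_and_score(10)    # flexible enough to chase the training noise
```

The training error can only go down as the degree rises, while the error on the unseen test points tells the real story: the straight line generalises worst, and the high-degree fit tends to wiggle between the training points.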
Figure 3.1: Illustration of under- and overfitting: polynomial curves of degree 1 (a), 3 (b) and 10 (c) fitted to a dataset
During the development of machine learning techniques, some techniques and
modifications have been specifically designed to address the problem of overfitting, known as
regularization techniques. When using regression and finding the parameters of a polynomial
curve, there are L1 and L2 regularization, also known as Lasso regression and Ridge regression.
Lasso leads to sparse solutions, while Ridge avoids large coefficients; both favour smooth
solutions. Early stopping is regularization in time when working with iterative training: the
model performance is monitored during training, and training stops as soon as performance
starts to degrade. Ensemble models are a form of regularization, as they combine the results
of multiple, possibly overfitting, models. Specifically for neural networks, one form of
regularization is a technique called Dropout (explained later).
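L2 (Ridge) regularization has a closed form for polynomial regression, which makes its coefficient-shrinking effect easy to demonstrate; the data, degree and penalty strength below are invented for this sketch.

```python
import numpy as np

rng = np.random.default_rng(4)

# Overfitting-prone setup: few noisy points, high-degree polynomial features.
x = rng.uniform(-1, 1, 12)
t = np.sin(np.pi * x) + rng.normal(scale=0.2, size=x.size)
Phi = np.vander(x, 10)                       # degree-9 polynomial features

def ridge_fit(lam):
    # Minimize ||Phi w - t||^2 + lam * ||w||^2  (L2 / Ridge regression).
    A = Phi.T @ Phi + lam * np.eye(Phi.shape[1])
    return np.linalg.solve(A, Phi.T @ t)

w_unreg = ridge_fit(0.0)    # plain least squares: free to use huge coefficients
w_ridge = ridge_fit(0.1)    # penalized: coefficients are shrunk towards zero
```

The penalty term `lam * ||w||^2` trades a slightly worse fit on the training points for smaller coefficients, and hence a smoother, better-generalising curve.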
3.2 Introduction to the concepts of (Artificial) Neural Networks
Artificial Neural Networks (ANNs) are a kind of machine learning technique whose com-
putational model is inspired by (neurons in) the human brain. While the concept of neural
nets has been around for quite some time, their power and applications have long been under-
estimated. With recent developments in the theory and easy access to large amounts of
computational power, neural networks have been rediscovered and have seen a surge in interest
during the last decade. ANNs are applied in many domains and applications, nowadays
encompass the most advanced models in ML and AI, and outperform many previous state-of-
the-art algorithms.
Figure 3.2: Artificial neuron (inputs x_i, weights w_i, bias b, summation Σ, activation function f, output y)
3.2.1 Introduction
Architecture
A NN consists of a layered network of computational units, called artificial neurons. The
first neural networks were developed by Rosenblatt starting in the 1950s and were based on
perceptrons. Figure 3.2 shows one such artificial unit (note this is a general model and not
necessarily a perceptron unit).
Each unit takes a certain number of inputs x_i. A weighted sum of these inputs is passed to the
activation function f. The output of this activation function is the output h_{W,b}(x) of the unit,
and this can be passed as an input to other units. Several variations of units are possible: inputs
can be binary or real-valued, and often a bias input is added to the unit. A step function can be
taken as activation function, but more commonly a continuous function is chosen. A
popular choice is the sigmoid (logistic) function or the hyperbolic tangent, both having desirable
properties regarding their derivatives. This single neuron can thus be interpreted as a model
that makes a decision based on features of its input, using the formula

h_{W,b}(x) = f(Σ_{i=1}^{n} W_i x_i + b) = f(W^T x) (3.7)

By adjusting the weights of the summation, the decision behaviour of the model can be
changed.
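Equation (3.7) translates directly into code; the weights, bias and input vector below are arbitrary example values, and the sigmoid is one possible choice of f.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, W, b):
    # h_{W,b}(x) = f(sum_i W_i * x_i + b), here with a sigmoid activation f.
    return sigmoid(np.dot(W, x) + b)

x = np.array([1.0, 0.5, -1.0])   # inputs
W = np.array([0.8, -0.4, 0.2])   # weights
b = 0.1                          # bias
y = neuron(x, W, b)              # output in (0, 1)
```

Changing `W` and `b` changes where the weighted sum crosses zero, and thus the decision behaviour of the unit.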
When multiple of these units are chained together, a neural net is formed, for example the
simple network displayed in Figure 3.3. While a single artificial unit has limited computational
power, this does not hold for multi-layer networks. Construction of NOT/AND/OR operators
is possible, meaning artificial neural nets are functionally complete: they can express all possible
truth tables for a collection of input features. Furthermore, the theorem known as the universal
Figure 3.3: Neural Network structure (input layer, hidden layer, output layer)
approximation theorem for neural nets states that an ANN can be used to approximate any
continuous function to arbitrary precision. Their expressive power is thus as large as that of
any other computational device.
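The claim about logic operators is easy to verify with threshold units: fixed weights and biases, no training needed. The particular weight values below are one possible choice.

```python
import numpy as np

def step(z):
    # Threshold activation: fire (1) when the weighted sum is non-negative.
    return (z >= 0).astype(int)

def unit(x, W, b):
    return step(np.dot(x, W) + b)

inputs = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

AND = unit(inputs, np.array([1, 1]), -1.5)   # fires only when both inputs are 1
OR = unit(inputs, np.array([1, 1]), -0.5)    # fires when at least one input is 1
NOT = step(-np.array([0, 1]) + 0.5)          # inverts a single binary input
```

With NOT, AND and OR expressible by single units, any truth table can be built by wiring such units into layers, which is what makes multi-layer nets functionally complete.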
The net in Figure 3.3 is a simple single-layer feedforward neural net. The first layer is
the input layer, which takes the input features. Then a layer of computational units
processes these features; this is called the hidden layer. The final layer, giving the output of
the network, is the output layer. Several other architectural patterns are possible. Often more
than one hidden layer is used, which is then commonly called deep learning. The multiple
hidden layers can have different sizes and perform different operations on their input (e.g.
to reduce overfitting). Convolutional networks are networks where the input is transformed
through a variety of operations in early layers before any decision making takes place.
Recurrent networks are networks with loops, introduced by feeding a neuron's output back to
previous neurons and layers in the net, which gives these nets some form of memory.
The main difficulty in successfully applying neural networks to any problem is finding a suitable
architecture. At the time of writing, there are no known, definitively good 'recipes' for
constructing good networks.
Training and evaluation
Training a neural net is done by adapting the weights of the connections between layers. When
constructing a network, the weights are often initialized randomly. Using a labeled set of examples,
a prediction error for these samples is calculated using a feedforward pass. The weights are
then updated in order to minimize the prediction error following a loss function. This updating
is commonly done using backpropagation in conjunction with gradient descent. The output
error of the network is fed backwards through each layer. Backpropagation
is a mathematical procedure allowing the efficient computation of partial derivatives (requiring an
activation function with good derivative properties) of the chosen cost function. These partial
derivatives are then used to update the weights in all layers. This forward-backward cycle is
repeated until the network converges to a desired performance. The training phase of a neural
network is extremely computationally expensive; the backpropagation algorithm and access to
massive computing power and datasets have therefore been paramount in the recent success and
rediscovery of neural networks.
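The forward-backward cycle can be sketched for a one-hidden-layer net on the classic XOR task; the layer sizes, learning rate and iteration count are arbitrary choices for this illustration.

```python
import numpy as np

rng = np.random.default_rng(5)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# XOR: a task a single unit cannot solve, but a small hidden layer can.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t = np.array([[0], [1], [1], [0]], dtype=float)

W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)   # input -> hidden
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)   # hidden -> output

def forward():
    H = sigmoid(X @ W1 + b1)        # feedforward pass: hidden activations
    return H, sigmoid(H @ W2 + b2)  # ... and network output

_, y0 = forward()
loss_before = float(np.mean((y0 - t) ** 2))

lr = 1.0
for _ in range(5000):
    H, y = forward()
    # Backward pass: propagate the output error layer by layer via the
    # partial derivatives of the squared-error cost (sigmoid derivative
    # = a * (1 - a), one reason smooth activations are convenient).
    d_out = (y - t) * y * (1 - y)
    d_hid = d_out @ W2.T * H * (1 - H)
    W2 -= lr * H.T @ d_out
    b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_hid
    b1 -= lr * d_hid.sum(axis=0)

_, y = forward()
loss_after = float(np.mean((y - t) ** 2))
```

Repeating the cycle drives the loss down; with this setup the net typically ends up predicting XOR correctly, though the exact outcome depends on the random initialization.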
Training can be influenced by adapting the learning rate of the model and by using different
methods of sampling training examples. Using Gradient Descent, the complete set of training
samples is run through in order to perform one parameter update iteration (therefore also called
Batch GD). With large datasets, each step in the process of weight updates is very costly and
time-consuming. A variation of this is Stochastic Gradient Descent (SGD, also called Iterative
or On-line GD), which updates the weights after each training sample. The term stochastic
indicates that this gradient, based on a single sample, is an approximation of the true gradient,
and due to this stochastic nature, the path to the final solution may zig-zag rather than take
the direct way of GD. SGD has a higher variance due to single-sample picking and uses a lower
learning rate, but almost surely converges to the global cost minimum if the cost function is
convex. The computational cost per iteration is also lower for SGD, as only a single gradient
has to be computed. A compromise between speed and computational cost is Mini-Batch
Gradient Descent (MB-GD), where the gradient is computed on a group of samples in each
iteration. Algorithmically speaking, using larger mini-batches reduces the variance of the
updates (by taking the average of the gradients in the mini-batch), which allows bigger
step-sizes, meaning the optimization algorithm will progress faster. MB-GD converges in
fewer iterations than GD because the weights are updated more frequently; moreover, MB-GD
can utilize vectorized implementations, which typically results in a computational performance
gain over SGD.
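The three sampling schemes differ only in how many samples feed each update, which a single parametrized loop makes explicit; the one-parameter regression task, learning rate and epoch count below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)

# Trivial regression task: learn w in t = 2 * x from 256 noiseless samples.
x = rng.uniform(-1, 1, 256)
t = 2.0 * x

def grad(w, xb, tb):
    # Gradient (up to a constant factor) of the squared error on one batch.
    return np.mean((w * xb - tb) * xb)

def train(batch_size, lr=0.5, epochs=30):
    w = 0.0
    for _ in range(epochs):
        order = rng.permutation(x.size)           # shuffle each epoch
        for start in range(0, x.size, batch_size):
            idx = order[start:start + batch_size]
            w -= lr * grad(w, x[idx], t[idx])     # one parameter update
    return w

w_batch = train(batch_size=x.size)  # Batch GD: one update per epoch
w_sgd = train(batch_size=1)         # SGD: one (noisy) update per sample
w_mini = train(batch_size=32)       # MB-GD: averaged gradient per mini-batch
```

All three reach w ≈ 2 here; what differs in practice is the cost per update, the noise in each step, and how well the per-update arithmetic vectorizes.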
Once a network is trained, evaluation is done by applying the input features to the input
layer and subsequently calculating each layer's output from front to back, eventually resulting
in the output of the last (output) layer. This procedure, called the feedforward pass, is
straightforward and fast.
Deep Learning
A very important part of traditional machine learning is feature engineering: finding useful
patterns in the data to base decisions on. Feature learning is (automatically) finding common
patterns in data to use in classification or regression problems. The term deep learning origi-
nated from methods and strategies for designing deep hierarchies of non-linear features. The
first rise in popularity since the early neural networks of the 1960s came with the backpropagation
algorithm applied to neural net training in 1985. Since about 2010, the use of GPUs and
the rectifier activation function has led to practical applications of these powerful architectures
and solved some of the issues holding back success. Since 2013, LSTM networks have
been rapidly gaining ground in problems involving non-linear time dependencies in sequential data,
and together with convolutional nets they are the two major success stories of deep learning. Deep
learning models often achieve exceptional performance on a wide range of problems, and their
capabilities are still being discovered in applications across all domains of computing. Research
in deep learning has been accelerating rapidly since 2012-2014, when Google, Facebook and
Microsoft started to show high interest in the field.
Applications
Neural networks are nowadays used in several applications, often grouped under the deep learning
name. They have proven to be unmatched in object recognition in visual data (e.g. face
recognition in photos, handwriting reading), anomaly detection (spam processing), autonomous
systems such as self-driving cars, natural language processing and digital assistants. Deep
learning is nowadays seen as, and expected to be, a solution to all sorts of problems, and most
technology companies are heavily investing in the technology, both in the form of dedicated
hardware solutions (GPUs, dedicated processors) and software libraries and toolboxes.
3.2.2 Relevant techniques
This subsection will very briefly discuss some specific techniques regarding neural networks that
are used in and are relevant to this work.
Activation function
Each unit in a neural net applies a function to a combination (e.g. a weighted sum) of its inputs.
Several activation functions are possible and have been used throughout the history of neural
nets. Figure 3.4 shows a visual comparison between the three most widely used choices in neural
networks and deep learning.
The sigmoid function (also referred to as the logistic function) has seen frequent use historically. It
has a range of [0, 1], which means its output can be interpreted as a probability and is easy to
understand. Its smooth shape is attractive because of the importance of the gradient in
the backpropagation algorithm for efficiently training nets. A problem, however, is that the sigmoid
saturates: a large input leads to a gradient close to zero, known as the vanishing gradient problem.
Another, smaller issue is that its output is not zero-centered (which is desirable for gradient-based
training); the hyperbolic tangent solves this. The tangent is a scaled version of the sigmoid and has
largely replaced it in practice.
With the popularity of convolutional nets, the activation function of choice has shifted to
the rectifier. The rectifier is arguably more representative of the biological neuron than the
probability theory-inspired sigmoid or hyperbolic tangent. Neurons using the rectifier (or an
approximation of it) are called Rectified Linear Units (ReLUs). The ReLU has some advantages
compared to its predecessors for neural nets. It does not suffer from the vanishing gradient
problem, it leads to sparse solutions, and it can be used for training nets efficiently without
pre-training. It is also fast, as it involves no exponential computation, and nets using it have
proven to converge to a solution much faster than nets using the tanh activation function.
Introduced as an activation function in 2011, as of 2015 the rectifier is the most used activation
function for deep neural nets.
(a) Sigmoid: 1/(1 + e^−x)   (b) Tanh: tanh(x)   (c) Rectifier: max(0, x)
Figure 3.4: Comparison of the input-output relation of some activation functions used in ANNs.
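As an illustration, the three activation functions compared in Figure 3.4 can be sketched in a few lines of Python with NumPy (an illustrative sketch, not the implementation used in this work):

```python
import numpy as np

def sigmoid(x):
    # Logistic function: output in (0, 1), saturates for large |x|
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # Zero-centred, scaled variant of the sigmoid: range (-1, 1)
    return np.tanh(x)

def relu(x):
    # Rectifier: identity for positive inputs, zero otherwise
    return np.maximum(0.0, x)

x = np.array([-2.0, 0.0, 2.0])
relu_out = relu(x)        # array([0., 0., 2.])
sigmoid_out = sigmoid(x)  # tends towards 0 and 1 for large |x|
tanh_out = tanh(x)        # zero-centred, range (-1, 1)
```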
Layer types
This subsection will provide a short explanation of some different layers often encountered in
ANNs and deep learning.2
Fully Connected / Dense The fully-connected layer is the original ingredient of ANNs and
performs the high-level reasoning in the network. The layer consists of a set of units which are
all connected to all neurons in the previous layer. The activation of this layer can be computed
with a matrix multiplication.
Convolutional Convolution is a mathematical operation that mixes two pieces of information. In
the case of ConvNets, the input data (called a feature map) is mixed with the convolution kernel
to form one or more transformed feature maps. The operation is often interpreted as a filter,
with kernels filtering the feature map for a certain kind of information (e.g. edges, color...).
Convolution can be described as a cross-correlation relationship: the output of a convolutional
filter is high if the filter feature is present in the input. In deep learning, convolutional layers
are exceptionally good at finding good features in images, feeding the next layer to form a
hierarchy of nonlinear features that grow in complexity and abstraction. They bridge the spatial/-
time and frequency domains through the convolution theorem, exploit locality and parameter
sharing, and can be implemented extremely efficiently through Fourier transforms on current GPUs.
Max Pooling A convolutional layer is mostly followed by a pooling layer which performs
subsampling in convolutional nets. Information is funneled to the next (often another convolu-
tional) layer. Pooling provides some invariance for rotations and translation. The pooling area
2For a very good discussion of Convolutional Neural Networks, see http://cs231n.github.io/ for course notes from the Stanford CS course.
can vary in size, retaining more or less detailed information; larger pooling areas reduce
dimensionality further and help the resulting networks fit in GPU memory.
Dropout One effective way of improving model performance and countering overfitting is
combining multiple models. However, with computationally intensive ANNs, this approach
quickly becomes unreasonable. Introduced by [31], Dropout is the most effective
technique for addressing overfitting in ANNs. A dropout layer does not perform explicit
computations, but will simply enable or disable (drop out) neurons. At training time, a
neuron will be present and connect to the next layer with a certain probability (usually 0.5).
This essentially results in sampling a thinned network from the full one, giving 2^n possible models
from a network with n neurons. Training a neural network with dropout can be seen as training a
collection of thinned networks (with extensive weight sharing), each thinned model trained on very
few training samples. At test time, the single full network is used without dropout, each neuron
using a scaled set of weights. This means the expected output of a neuron is used at test time,
and each of the 2^n thinned networks is combined in that single network.
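The train/test behaviour described above can be sketched as follows (an illustrative sketch with NumPy, not the implementation used in this work; the keep probability of 0.5 matches the usual value mentioned above):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(activations, p_keep=0.5, train=True):
    # At training time each neuron is kept with probability p_keep
    # (sampling a thinned network); at test time the full network is
    # used with activations scaled by p_keep, so the expected output
    # of each neuron matches its training-time expectation.
    if train:
        mask = rng.random(activations.shape) < p_keep
        return activations * mask
    return activations * p_keep

a = np.ones((4, 6))
train_out = dropout_forward(a, 0.5, train=True)   # roughly half the units zeroed
test_out = dropout_forward(a, 0.5, train=False)   # all units present, scaled by 0.5
```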
3.3 Selected technique: Auto-Encoders
The approach in this work makes use of Auto-Encoders (AE). As such, the following section
will discuss this particular paradigm in slightly more detail, as well as some techniques tailored
to AEs.
The concept of AEs is relatively new: they were introduced by Hinton and Salakhutdinov
in 2006 as a 'data dimensionality reduction technique using neural networks' ([32]), which
captures the concept extremely well. Auto-Encoders are ANNs used to learn efficient codings of
data. In general, they are multi-layer neural networks consisting of an input and output layer,
with one or more hidden layers in between, as illustrated by figure 3.5. Compared to the general
model of a neural network, an autoencoder has some architectural differences and constraints.
The output layer of the autoencoder has the same form as the input layer. Instead of predict-
ing a value for a given input, an AE is trained to reconstruct its input. The purpose of the
autoencoder is to do some form of dimensionality reduction in its hidden layers, and to learn an
encoded form of representing the data. Therefore, the architecture of the network commonly
possesses a layer with a reduced size acting as a bottleneck. Input data is compressed to that
Figure 3.5: Auto-Encoder as a constraint on a neural net architecture
lower dimensionality using the part of the network acting as encoder. This part of the AE
thus acts as a dimensionality reduction technique.3 From the coded representation, the data is
expanded again by the part of the network used as decoder to reconstruct the input as well as
possible. The network takes as input the data itself, not a set of features based on the data, and
automatically tries to learn the required characteristics of the data. The input data need not
be labeled; therefore, AEs are unsupervised learning models. They are most commonly used as
feature learners in tandem with a supervised classification technique.
Traditional Auto-Encoder
An Auto-Encoder takes an input x ∈ Rd and first maps it to a latent representation h ∈ Rd′
(where commonly d′ < d) using a deterministic mapping. This mapping has the typical form of
an affine mapping followed by a non-linearity:
h = fθ(x) = σ(Wx + b)   (3.8)
The function fθ has parameters θ = {W, b}, where W is a d′ × d weight matrix and b is
a bias vector of dimensionality d′. The non-linearity σ is normally chosen to be one of the
common activation functions in conventional neural networks. The deterministic mapping fθ
is commonly called the encoder. This resulting latent representation (or code) is then used to
reconstruct the input by a reverse mapping gθ′ . This mapping is again an affine transformation
3One well-known related method for dimensionality reduction is PCA; AEs are a non-linear generalization of PCA, operate automatically, and are not restricted to the application of pure dimensionality reduction.
optionally followed by a non-linearity, and is called the decoder:
y = gθ′(h) = σ(W′h + b′)   (3.9)
The inverse mapping has parameter set θ′ = {W′, b′}, each appropriately sized. The two
parameter sets are often, but not necessarily, constrained to be of the form W ′ = W T : the same
weights are used to encode the input and to decode the latent representation. In a feedforward
pass of the network each sample pattern of the training set xi is mapped to its code hi and its
reconstructed version yi. In general, the reconstructed y is not to be interpreted as an exact
reconstruction of the input, but in probabilistic terms as inputs/parameters to a distribution
p(X|Y = y = gθ′(h); θ,θ′) that generates X with high probability. The reconstruction error
to be optimized is
L(x,y) ∝ −log(p(x|y)) (3.10)
Working with classification (or one-hot encoded values), x is a binary vector, x ∈ {0, 1}^d,
so a choice for p(x|y) is x|y ∼ β(y).4 The decoder produces a y ∈ [0, 1]^d. The loss function
associated with this setup is then

L(x, y) = L_H(x, y)   (3.11)
        = −∑_j [x_j log y_j + (1 − x_j) log(1 − y_j)]   (3.12)
        = H(β(x) || β(y))   (3.13)
This last term is called the cross-entropy loss, and it can be seen as the cross-entropy between
two independent multivariate Bernoullis with means x and y. The parameter sets of the
AE are optimized by minimizing this error (also called cost) function over the training set of n
(input, target) pairs St = {(x0, t0), ..., (xn, tn)}.
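A feedforward pass through such an AE, with tied weights W′ = W^T and the cross-entropy loss of Equation 3.12, can be sketched as follows (an illustrative sketch with NumPy; the dimensions d = 8 and d′ = 3 are assumed values for the example only):

```python
import numpy as np

rng = np.random.default_rng(1)
d, d_hidden = 8, 3  # illustrative sizes, with d' < d as a bottleneck

W = rng.normal(scale=0.1, size=(d_hidden, d))  # encoder weights (d' x d)
b = np.zeros(d_hidden)
W_prime = W.T                                  # tied weights: W' = W^T
b_prime = np.zeros(d)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def encode(x):
    # h = f_theta(x) = sigma(Wx + b)   (Equation 3.8)
    return sigmoid(W @ x + b)

def decode(h):
    # y = g_theta'(h) = sigma(W'h + b')   (Equation 3.9)
    return sigmoid(W_prime @ h + b_prime)

def cross_entropy(x, y):
    # L_H(x, y) = -sum_j [x_j log y_j + (1 - x_j) log(1 - y_j)]  (Eq. 3.12)
    return -np.sum(x * np.log(y) + (1 - x) * np.log(1 - y))

x = (rng.random(d) < 0.5).astype(float)  # a binary input vector
y = decode(encode(x))                    # reconstruction in [0, 1]^d
loss = cross_entropy(x, y)
```

Training would then minimize this loss over all pairs in the training set by gradient descent.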
If a latent representation h allows for good reconstruction of its input, it means that it has
retained much of the useful information that was present in the input. However, the criterion of
retaining information alone is not a useful one. By setting h = x, or using an AE where h has the
4Or equally: xj|y ∼ β(yj). β(a) represents the Bernoulli distribution with mean a. Extended to vector variables: x ∼ β(a) means ∀j, xj ∼ β(aj).
same dimensionality as x and learning the identity mapping, there are no useful representations
discovered besides the input itself. Additional constraints are necessary, naturally leading to a
non-zero reconstruction error. The most used approach is to introduce a bottleneck to produce
an under-complete latent representation where d′ < d. This representation is then a lossy,
compressed representation of x. Another possibility is to use an over-complete but sparse
representation, which achieves compression by a large fraction of zeros rather than its explicit
lower dimensionality.
3.3.1 Denoising Auto-Encoder
Traditional Auto-Encoders will learn the identity mapping without additional constraints. This
problem can be circumvented by using a probabilistic RBM approach, sparse coding, or de-
noising auto-encoders (dAEs) trying to reconstruct noisy inputs. Two underlying ideas inspire
this approach: a high-level representation should be robust and stable under limited input cor-
ruption, and performing a denoising task well requires extracting features that capture useful
structure in data. Training a dAE5 involves trying to reconstruct a clean input from a partially
destroyed version of it. The input can be corrupted by adding a variable amount of noise to it
(binomial noise, uncorrelated Gaussian noise, ...). By doing so, the dAE is trained to denoise
the input by using a slightly adapted version of Formulas 3.8 and 3.9, applied to the corrupted
input x̃:

h = fθ(x̃) = σ(Wx̃ + b)   (3.14)

y = gθ′(h) = σ(W′h + b′)   (3.15)
The parameter sets are now trained to minimize the reconstruction error by having y as
close as possible to the uncorrupted input x. The key difference is that now y is a deterministic
function of the corrupted input x̃ rather than of x. Each time a training sample is applied, a
different corrupted version is used by the application of noise. There is no change in the loss
function. By applying a deterministic mapping to a corrupted input, the network is forced to
learn more clever features and mappings (that prove to be useful in a task such as denoising,
rather than providing the identity mapping).
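The corruption step can be sketched as follows (an illustrative sketch using binomial masking noise; the noise level of 0.3 is an assumed value, not taken from this work):

```python
import numpy as np

rng = np.random.default_rng(2)

def corrupt(x, noise_level=0.3):
    # Binomial (masking) noise: each component of the input is set to
    # zero with probability noise_level, giving the partially destroyed
    # version x_tilde from which the dAE must reconstruct the clean x.
    mask = rng.random(x.shape) >= noise_level
    return x * mask

x = np.ones(10)                          # clean input
x_tilde = corrupt(x, noise_level=0.3)    # corrupted version fed to the encoder
# the loss is still computed against the clean x, not against x_tilde
```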
By stacking dAEs into a deep architecture (with a supervised classification part at the end),
5Note the goal is not the denoising task in itself. Denoising is merely used as a training criterion in order to extract useful features and form a high-level representation.
these architectures perform systematically better at several computer vision tasks than ordinary
AEs or Deep Belief Networks. They are also shown to correspond to a generative model. Lastly,
dAEs work well with data with missing values or multi-modal data, as they are trained on data
with missing parts (when corruption randomly hides parts of the values). ([33])
3.3.2 Convolutional Auto-Encoder
Both the conventional fully-connected Auto-Encoder and the dAE ignore the 2D structure of
data (commonly occurring as images). However, the most successful models in object recognition
try to discover localized features that repeat themselves all over the input. Convolutional Auto-
Encoders (CAEs) differ from conventional AEs in that their weights are shared among all locations
in the input, preserving spatial locality. The reconstruction is done through a linear combination
of basic image patches based on the latent code. The architecture of a CAE is built upon the
dAE, but applies weight sharing on feature maps. The latent representation of the kth feature
map of a mono-channel input x is given by
h^k = σ(x ∗ W^k + b^k)   (3.16)

where ∗ denotes the 2D convolution operator. Each latent feature map has its own bias b^k.
Reconstruction of a convoluted latent representation is then obtained by
y = σ( ∑_{k∈H} h^k ∗ W̃^k + c )   (3.17)

where H is the group of feature maps; W̃ is the flipped (over both dimensions) version of
the weights, and c is the bias (one per input channel). The CAE is trained just like normal
networks using a backpropagation algorithm which computes the gradient of an error function
with respect to the parameter sets. Assuming the Mean Squared Error (MSE) cost function
E(θ) = (1 / 2n) ∑_{i=1}^{n} (x_i − y_i)²   (3.18)
this gradient can be obtained by using convolution operators with the formula
∂E(θ) / ∂W^k = x ∗ ∂h^k + h̃^k ∗ ∂y   (3.19)
where ∂h and ∂y are the deltas of the hidden state and the reconstruction, respectively. Using
this, the parameters can be updated using (variations of) stochastic gradient descent.
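The forward computation of a single latent feature map (Equation 3.16) can be sketched with a naive 'valid' 2D convolution (an illustrative sketch; the 8×8 input and 3×3 kernel sizes are assumed values for the example only):

```python
import numpy as np

def conv2d_valid(x, w):
    # Naive 'valid' 2D convolution: the kernel is flipped over both
    # dimensions, matching the * operator in Equation 3.16.
    wf = w[::-1, ::-1]
    rows = x.shape[0] - w.shape[0] + 1
    cols = x.shape[1] - w.shape[1] + 1
    out = np.empty((rows, cols))
    for i in range(rows):
        for j in range(cols):
            out[i, j] = np.sum(x[i:i + w.shape[0], j:j + w.shape[1]] * wf)
    return out

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(3)
x = rng.random((8, 8))           # mono-channel input
W_k = rng.normal(size=(3, 3))    # weights of the k-th feature map, shared everywhere
b_k = 0.0
h_k = sigmoid(conv2d_valid(x, W_k) + b_k)  # Equation 3.16: one latent feature map
# h_k has shape (6, 6): the same 3x3 kernel is applied at every location
```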
Chapter 4
Methodology
In this chapter, the approach this work will take to the problem is explained and discussed.
First, a high-level overview of the compression scheme is offered; then some technical aspects
and design decisions are discussed.
4.1 Structural overview of the proposed compression scheme
In this work an Auto-Encoder will be trained for constructing a compression method. An input
file will be encoded to a bitstream suitable for transfer, and decoded again. The actual encoding
and decoding is done by an ANN. Compression is achieved by introducing a bottleneck in the
ANN network architecture. A large input sequence is represented as a compressed, coded repre-
sentation learned by the network plus a residue necessary to correct erroneous reconstructions.
By training the network using DNA data, the aim is to let it learn the implicit structure of the
input data, so it can create a good coded representation and effectively perform reconstruction.
The parameters (i.e. the weights of the ANN) of the encoder and decoder modules are not included
in the bitstream to be transferred. They are fixed (either to a hardcoded set of values or to
a reprogrammable set) during operation. As large ANNs often contain a large number of
parameters, including these in the bitstream for network transfer might not be efficient. Having
a fixed set of parameters also opens the possibility of efficient hardware implementations (e.g.
FPGAs, ASICs and coprocessors, as currently used for image processing). The encoder and
decoder processes thus share the network weights and architecture.
Figure 4.1: Block diagram of encoding/decoding process
4.1.1 Encoder
Figure 4.1 shows a block diagram of the encoding & decoding process. As a first step, the source
file is read and preprocessed to a format suitable for the encoder to accept as input. This input
is fed to the encoding part of the AE. This Encoder block is a simplified representation here (and
is discussed below). The encoding block will generate a compressed representation of the input,
which is used as the codeword. The codeword is fed to the decoder, which tries to reconstruct
the input, possibly making an incorrect reconstruction. This reconstruction is compared with
the original input, and the differences are stored in a residue. The codeword and residue are
combined in the bitstream, which is the output of the encoding process.
4.1.2 Decoder
The lower half of figure 4.1 shows the block diagram of the decoding process. The decoder takes
the compressed bitstream and starts by splitting off the codeword and residue. The codeword
is fed to the decoding part of the AE, which will try to reconstruct the desired output. The
decoding part of the AE is the same part as used in the encoding process, and the errors
that will be made by the reconstruction are thus known beforehand by the encoding process.
Therefore, the decoder now adds the provided residue to the reconstructed input to get the
faultless reconstruction of the original input. This reconstruction is eventually converted back
into the original source format.
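The interplay between codeword and residue described above can be sketched end-to-end with toy stand-ins for the trained AE halves (the subsampling "encoder" and repeating "decoder" below are purely hypothetical placeholders; a real system would use the trained network):

```python
import numpy as np

rng = np.random.default_rng(5)

def encode(seq):
    # Hypothetical stand-in for the AE encoder: keep every 4th base.
    return seq[::4]

def decode(code):
    # Hypothetical stand-in for the AE decoder: repeat each kept base,
    # so reconstruction errors occur (as with a real lossy bottleneck).
    return np.repeat(code, 4)

original = rng.integers(0, 5, size=100)     # integer-encoded bases (A..N -> 0..4)

# --- Encoding process: codeword + residue form the bitstream ---
codeword = encode(original)
reconstruction = decode(codeword)
residue = (original - reconstruction) % 5   # corrections, known at encode time

# --- Decoding process: same decoder, residue restores losslessness ---
restored = (decode(codeword) + residue) % 5
assert np.array_equal(restored, original)   # faultless reconstruction
```

The key point the sketch illustrates: because the encoder runs the decoder itself, the reconstruction errors are known in advance, and shipping the residue alongside the codeword makes the scheme lossless.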
4.1.3 Network view
The ENCODER and DECODER blocks in the previous block diagrams are actually both neural
networks. Figure 5.11 shows an example of what the compression scheme looks like from a neural
network focused perspective. The network takes an input representation and feedforwards it
through its hidden layers to output a coded representation. In order to achieve a compression
scheme, it is essential that the network's central hidden layer is of a lower dimensionality than
the input, forming some sort of bottleneck. After this bottleneck layer, a set of layers
(symmetrical to the encoding part) forms the decoding part of the network, where the inverse process
is done: the compressed representation is expanded to create the input reconstruction. The
network architecture (amount of layers, type of layers, size...) and implementation can be
varied, but will always adhere to this encoding-bottleneck-decoding structure.
4.2 Technical aspects & design decisions
4.2.1 Data acquisition
The dataset used throughout this work consists of the chromosomes of two human genome
sequences. These chromosomes are fully aligned and assembled from scaffolds. The genomes
available are one reference genome and one alternative genome. This data is freely available
online1 from the National Center for Biotechnology Information and is formatted
in the FASTA format (the downloadable files are Gzip compressed). The reference genome is
subdivided into 22 numbered chromosomes, plus the chromosomes X and Y. The mean filesize of
a single chromosome is about 128 MB. The content of these files is nearly purely a continuous
sequence of nucleobases (excluding a few lines of descriptive metadata in the file): for example,
the first chromosome of the reference genome contains a sequence of about 235 million bases.
1ftp://ftp.ncbi.nih.gov/genomes/Homo_sapiens/ARCHIVE/ANNOTATION_RELEASE.106/Assembled_
chromosomes/seq/
4.2.2 Data preprocessing
One of the desirable properties of deep learning techniques is the fact that interesting properties
of the input data can be learned automatically without human specification of the features of
interest. This happens by adapting the weights in the hidden layers, filters and neurons. As the
goal is for the network to automatically discover useful features, a first note is that no manually
engineered features (e.g. base averages, counts, frequencies ...) besides the raw source data are
used.
Starting from the FASTA file on the filesystem, the sequence first needs to be loaded into
program memory, into a data structure suitable for processing and input to the neural
network. The FASTA file is read into memory entirely. The descriptive text is stripped away,
all lines and parts contained in the file are chained together, and any non-frequent characters
are replaced by the letter N, as they will not be part of the compression scheme. Depending on
the requirements of the network, this base sequence is padded: 'horizontally' up to a certain
required multiple of bases, and 'vertically' to match the batch size.
The task this ANN has to perform is eventually the prediction of the correct base (one
letter out of a constrained alphabet) in the genome sequence. This ML problem falls under the
category of classification: there are several possible choices (class labels) as output out of which
one must be selected. The most common approach to this is by encoding the labels as a one-hot
matrix. When the output of the network contains a softmax function over the output matrix,
this output can be interpreted as class-probabilities for each sample. The predicted class (base)
is then the one on the index having the highest value in the output vector. Equation 4.1 shows
the conversion of bases to a 2D one-hot encoded matrix. This matrix is the target of the network.
· · · ACGTNA · · ·  ⇒

A: 1 0 0 0 0
C: 0 1 0 0 0
G: 0 0 1 0 0
T: 0 0 0 1 0
N: 0 0 0 0 1
A: 1 0 0 0 0
                                     (4.1)
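The encoding of Equation 4.1 can be sketched in a few lines (an illustrative sketch; the alphabet ACGTN follows the five-letter alphabet used in this work):

```python
import numpy as np

ALPHABET = "ACGTN"  # the constrained five-letter base alphabet

def one_hot(sequence):
    # One-hot encode a base string into a (len, 5) matrix as in Eq. 4.1:
    # each row has a single 1 in the column of the corresponding base.
    index = {base: i for i, base in enumerate(ALPHABET)}
    m = np.zeros((len(sequence), len(ALPHABET)), dtype=np.uint8)
    for row, base in enumerate(sequence):
        m[row, index[base]] = 1
    return m

m = one_hot("ACGTNA")  # the example sequence of Equation 4.1; shape (6, 5)
```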
CHAPTER 4. METHODOLOGY 43
The basic approach is thus to one-hot encode each letter and feed these vectors as input
to the network. By doing this however, the network will not learn any information from the
sequence and structure in the input, as it will basically learn just the probability for each class.
Therefore, multiple one-hot encoded letters can be chained together as input to the network.
Equation 4.2 shows the representation of a sequence of bases as a chained one-hot encoded matrix.
This way, the network can learn to recognize patterns, or repeating sequences, in the data. Note that
the output of the network will be reshaped to always be a single-base-wide one-hot encoded
matrix followed by a softmax: the network will be judged on the reconstruction of every base,
and not on the reconstruction of the entire input sequence.
A: 1 0 0 0 0
C: 0 1 0 0 0
G: 0 0 1 0 0
T: 0 0 0 1 0
N: 0 0 0 0 1
A: 1 0 0 0 0

⇒

ACG: 1 0 0 0 0 | 0 1 0 0 0 | 0 0 1 0 0
TNA: 0 0 0 1 0 | 0 0 0 0 1 | 1 0 0 0 0
                                     (4.2)
When moving to a convolutional network, a different kind of formatting is required. As
convolutional layers are mostly used on multi-channel images, the sequence data will be
transformed into a similar structure in order to mimic this successful setup. Color images contain
a channel for R(ed), G(reen) and B(lue). In this case, the sequence data is one-hot encoded,
so each position in the bit pattern can be interpreted as a channel for a certain base
in the alphabet (instead of channels R, G and B, there are now channels A, C, G, T and N).
Having five distinct classes in the data, an input sequence is therefore transformed into a cube
of 5-channel patterns. Figure 4.2 shows this data formatting.
4.2.3 Network training
During the construction of a suitable network architecture, the chosen scenario is to train
an AE on a certain chromosome (e.g. chromosome 1) in order to perform compression on the
same chromosome number. As there are two genomes available, one genome provides the
Figure 4.2: Base sequence representation as 5-channel cubes.
chromosome for training and validation, and the other genome provides the test data. From
the first chromosome, 66% is used as the training set; the remaining third is used as the validation
set. The test data will depend on the evaluation scenario (see later).
Training is done using the minibatch approach. Each training step (often called an epoch)
consists of 10 updates with minibatches of 50 base sequences each. These batches
are contiguous base sequences starting at an index randomly picked from the training data set
(with repeated selection possible). The network will try to optimize the categorical cross-entropy
as loss function between its reconstructed output and the targets. Updating the weights in the
layers happens using the Nesterov momentum method - with a learning rate of 0.01 and a
momentum of 0.9 - for the traditional fully-connected AE, and using the Adam method with
a learning rate of 0.001 for the convolutional AE.2 As an AE normally has a symmetrical
architecture around its bottleneck layer, one can choose to tie the weights of the symmetrical
counterparts for each layer or to train them independently. There is currently no rule for this other
than simply experimenting to find out what works, so both approaches are tested.
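The minibatch sampling described above can be sketched as follows (an illustrative sketch; the window length of 100 bases is an assumed value, since the input sequence length varies per experiment):

```python
import numpy as np

rng = np.random.default_rng(4)

def sample_batch(encoded_seq, batch_size=50, window=100):
    # Draw a minibatch of contiguous windows whose start indices are
    # picked uniformly at random from the training portion, with
    # repeated selection possible (sampling with replacement).
    starts = rng.integers(0, len(encoded_seq) - window, size=batch_size)
    return np.stack([encoded_seq[s:s + window] for s in starts])

data = rng.integers(0, 5, size=10_000)      # toy integer-encoded chromosome
batch = sample_batch(data, batch_size=50)   # shape (50, 100)
```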
In the following parts (unless stated otherwise), chromosome 1 of the reference genome is
used as the training and validation genome (hs_ref_GRCh38_chr1.fa). This is one of the largest
chromosomes in the set, but there is no specific reason for this choice other than it containing
a lot of data.
2For an overview of gradient descent optimization algorithms, see http://sebastianruder.com/optimizing-gradient-descent/index.html
4.2.4 Compression-Accuracy trade-off
The AE represents an input sequence as one or more values from neurons in the hidden
bottleneck layer. These values will be the codeword in the compression scheme. By packing
multiple bases (i.e. a long sequence) into these values, compression can be achieved. There
are two ways of influencing this compression ratio: either improve compression by
packing more bases into the same number of bottleneck neurons, or try to improve reconstruction
performance by increasing the number of bottleneck units. There is a balance to be struck
between these options. Each base can be represented by 3 bits (offering 2^3 = 8 possibilities
while five are required). A neuron value is represented by a 32-bit floating point number. The
resulting compression rate of the AE can thus crudely be estimated as
resulting compression rate of the AE can thus crudely be estimated as
X bases · 3 bitbase
Y hidden units · 32 bithidden unit
=3X
32 Y= size reduction (4.3)
As an example, if the input is a sequence of 100 bases and the bottleneck contains two
hidden units, the size reduction would be
3 · 100
32 · 2= 4.6875 (4.4)
The filesize of the data would thus be reduced by a factor of about 4.7. However, this
assumes that perfect reconstruction is achieved by the decoding process. In reality, this is not
the case. Errors made in the reconstruction have to be corrected, and this error residue must
be stored together with the codeword, reducing the overall compression rate. For this reason,
the reconstruction performance is not the only metric of interest: the entropy of the residue is
important as well, as low-entropy data can in general be encoded and compressed efficiently by
arithmetic coding.
When varying the input size or the number of hidden units, one is thus balancing these two things: on
the one hand, the compression rate increases with larger input sequence length, but the accuracy
drops. On the other hand, more hidden units mean a smaller compression rate for the codeword,
but better reconstruction performance and a smaller (or lower-entropy) residue
could be possible.
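Equation 4.3 translates directly into a small helper (an illustrative sketch; note it ignores the residue, so it is an upper bound on the achievable ratio):

```python
def size_reduction(n_bases, n_hidden, bits_per_base=3, bits_per_unit=32):
    # Crude size-reduction estimate from Equation 4.3: input bases at
    # 3 bits each versus bottleneck units stored as 32-bit floats.
    # The residue is not accounted for, so this is an upper bound.
    return (n_bases * bits_per_base) / (n_hidden * bits_per_unit)

ratio = size_reduction(100, 2)  # the worked example of Eq. 4.4: 300 / 64 = 4.6875
```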
4.2.5 Evaluation
Scenarios
The compression networks can be applied in several scenarios. The data from the human
genome consists of 23 distinct chromosomes. A first approach is to train a network on a single
chromosome. This network is then evaluated on the same chromosome from a second human
genome. This scenario would benefit from the similarities between the same chromosome in
different individuals from the same species. A second approach is to train a network on a
chromosome from one genome, and then use the network on other chromosomes from that same
genome specimen. This could expose similarities between chromosomes, and possibly reveal
some form of classes of chromosomes with high similarity. Lastly, a genome in its entirety
can be used to train an encoder, which is then used to compress an entire genome without
distinguishing between individual chromosomes. This would average out characteristics of specific
chromosomes and look at the general structure of DNA as a whole. Figure 4.3 shows these three
scenarios.
Metrics
Reconstruction accuracy The task the AE will be trained to perform is to reconstruct
its input from the coded sequence. The input in this case is a sequence of bases from the
genome. The first important metric in order to evaluate the effectiveness of the model is the
percentage of correctly reconstructed bases. The higher this reconstruction accuracy is, the
less additional information must be stored in the residue to correct the reconstruction.
In a ML model using multi-class classification (choosing a certain class, which here means a
certain base), the architectural setup normally used for this kind of problem is the softmax
plus categorical accuracy. The network will output a matrix where each row represents one
input sample to be classified. The matrix has as many columns as there are different classes to
choose from. The final layer of the network will apply the softmax function to each row of this
matrix.3 After this, each column value in the row can be interpreted as the probability of that
sample of belonging to the class of that column. Determining the predicted class is thus finding
the index of the maximum value of this row, and comparing this with the class of the one-hot
3Certain implementations (and currently the implementation in the Lasagne framework) of the softmax nonlinearity are not numerically stable. This can lead to NaN appearing as loss when the network comes close to the desired targets. This is solved by using a modified version of softmax, LogSoftmax, and an associated modified categorical accuracy. This modified version is used here, but the principles stay the same.
[Figure 4.3: (a) between different chromosomes of the same sequence; (b) between the same chromosomes in different sequences; (c) between entire sequences]
Figure 4.3: Schematic display of three different evaluation scenarios. The grey shaded part is the data used to train the model, and the arrows point to the (test) data which is compressed using the trained model.
encoded input row. The metric giving the fraction of correctly classified samples is called the
categorical accuracy.
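The softmax-plus-argmax setup described above can be sketched as follows (an illustrative sketch with NumPy; subtracting the row maximum before exponentiation is the usual stabilization trick, in the spirit of the LogSoftmax workaround mentioned in the footnote):

```python
import numpy as np

def softmax(z):
    # Row-wise softmax; subtracting the row max improves numerical stability
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def categorical_accuracy(logits, one_hot_targets):
    # Fraction of rows whose arg-max class matches the target class
    predicted = softmax(logits).argmax(axis=1)
    actual = one_hot_targets.argmax(axis=1)
    return float(np.mean(predicted == actual))

logits = np.array([[2.0, 0.1, 0.1, 0.1, 0.1],   # predicts class 0 (base A)
                   [0.1, 0.1, 3.0, 0.1, 0.1]])  # predicts class 2 (base G)
targets = np.array([[1, 0, 0, 0, 0],            # true class 0
                    [0, 0, 0, 0, 1]])           # true class 4
acc = categorical_accuracy(logits, targets)     # 0.5: one of two correct
```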
Output cross-entropy After the bases are reconstructed by the model, the reconstruction is compared to the original input. When an error has been made, it has to be corrected (as lossless compression is required). These corrections are stored in the residue. With the compression purpose in mind, a useful property to consider is the structure (and more specifically the entropy) of this residue. This determines whether or not arithmetic coding can effectively be applied to the residue in a next step, leading to a compact representation of the residue. The
entropy of an array of probabilities is defined as

H(X) = −Σ_{i=1}^{n} P(x_i) · log_b P(x_i).    (4.5)
The maximum entropy is reached for a uniform distribution, where each event is equiprobable. A large entropy indicates there is no significant difference in the occurrence of the elements of the alphabet, which leads to a lot of uncertainty about the next character in a sequence. This means there is little gain to be found in arithmetic coding, as there is no inherent structure present in the stream of events. The aim is thus for this metric to be minimal.
This is also the metric used for updating and training the network. As an example, with an alphabet of 5 letters and a uniform distribution (P(x_i) = 1/5), the maximum entropy is calculated as shown in Equation 4.6, using the natural logarithm (b = e):

H(X) = −Σ_{i=1}^{5} (1/5) · log(1/5)
     = −5 · (1/5) · log(1/5)
     = −log(1/5)
     = log 5 ≈ 1.6094    (4.6)
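This maximum-entropy value is easy to verify numerically; a small sketch using the natural logarithm, as in Equation 4.6:

```python
import math
from collections import Counter

def entropy(sequence, base=math.e):
    """Shannon entropy H(X) = -sum p_i * log_b(p_i) of a symbol stream."""
    counts = Counter(sequence)
    total = len(sequence)
    return -sum((c / total) * math.log(c / total, base) for c in counts.values())

# A uniform 5-letter alphabet reaches the maximum entropy log(5) ~ 1.6094.
uniform = "ACGTN" * 100
print(round(entropy(uniform), 4))  # -> 1.6094
```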
The (categorical) cross-entropy is a widely used loss function in machine learning, as it allows the gradient calculations which are required by iterative learning algorithms. Rather than telling how much of the time the network is right (accuracy), this metric provides information on how right you are, which is far more useful in determining the desired update to the network.
Residue representation The last metric of interest concerns the error residue, holding the information to correct a faulty reconstruction. While the aim is for this component to be as small as possible, it can make a significant addition to the compressed filesize if the network is unable to perform well. Three ways of storing the residue are considered. Assume a sequence of ten items long, which is reconstructed and ends up having three errors in its reconstruction. Also assume the labels in this case are integer numbers.⁴
A first way of representing the residue for this situation is as one list containing for each
position either a symbol meaning the prediction was correct (here the number 0 is used), or in
case of an error, the correction to be made:
residue1 = [0, 0, 0, 7, 0, 0, 15, 0, 9, 0] (4.7)
The second way of representing the residue stores only the occurrences of errors. This residue is a list where each item contains the index of an error, and the correction to be made:
residue2 = [(3, 7), (6, 15), (8, 9)] (4.8)
The first residue representation is useful when there are a lot of errors. A single number in the array contains all the information required. However, this list is as long as the entire sequence, no matter how few errors are made. The second representation is useful when the residue is sparse: the information on correctly reconstructed items can be omitted; only storing the errors is required. Without sparsity, however, each error is stored as a pair of numbers, meaning this representation would be larger than the naive array storage of residue1, where the index is implicitly given by its position in the array.
The third representation is somewhat of a hybrid form of these two. It contains two lists, the first simply being a binary vector (often called a bitmask) indicating whether or not each position is correctly reconstructed. The second list contains the desired values in case an error was made. This representation is a compromise between the two predecessors, and works well for a lot of applications if no assumptions on the error frequency are available. Both of the arrays can be
⁴ In practice, most labels are encoded as integer numbers in machine learning implementations, as it allows for efficient computations.
efficiently coded (using e.g. arithmetic coding, differential coding, Huffman coding...) on their
own.
residue3 = [0, 0, 0, 1, 0, 0, 1, 0, 1, 0], [7, 15, 9] (4.9)
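The three representations can be derived side by side from the original and reconstructed sequences; a sketch reproducing the ten-item example (the concrete integer labels are invented so that the outputs match Equations 4.7-4.9):

```python
def build_residues(original, reconstruction):
    """Build the three residue representations for a reconstructed sequence.

    Assumes integer labels where 0 is reserved as the 'correct' marker
    for the first representation.
    """
    pairs = list(zip(original, reconstruction))
    # 1) Full-length list: 0 where correct, the correction otherwise.
    residue1 = [o if o != r else 0 for o, r in pairs]
    # 2) Sparse list of (index, correction) pairs.
    residue2 = [(i, o) for i, (o, r) in enumerate(pairs) if o != r]
    # 3) Hybrid: a bitmask plus the list of corrections.
    bitmask = [1 if o != r else 0 for o, r in pairs]
    corrections = [o for o, r in pairs if o != r]
    return residue1, residue2, (bitmask, corrections)

original       = [1, 2, 3, 7, 4, 2, 15, 1, 9, 3]
reconstruction = [1, 2, 3, 4, 4, 2, 2, 1, 1, 3]  # errors at positions 3, 6 and 8
r1, r2, r3 = build_residues(original, reconstruction)
# r1 == [0, 0, 0, 7, 0, 0, 15, 0, 9, 0]
# r2 == [(3, 7), (6, 15), (8, 9)]
# r3 == ([0, 0, 0, 1, 0, 0, 1, 0, 1, 0], [7, 15, 9])
```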
4.2.6 Software & Hardware
The implementation of this compression scheme is done in the Python programming language. For the neural network implementation, the frameworks Theano⁵ and Lasagne⁶ are used. Theano ([34]) is a framework allowing to define, optimize and evaluate mathematical expressions involving multi-dimensional arrays efficiently. Since its introduction in 2008, multiple frameworks have been built on top of it and it has been used to produce many state-of-the-art machine learning models. It is a widely used CPU and GPU compiler in the ML community and allows for very efficient computation using GPU acceleration. One of the frameworks built on top of it is Lasagne, a light-weight open source framework started by Sander Dieleman in September 2014. Lasagne offers a more high-level abstraction of Theano expressions, offering several useful constructs for well-known neural network components. Lastly, some modules of the scikit-learn⁷ framework are used for preprocessing the data. Scikit-learn offers a variety of tools for creating ML programs, not restricted to neural networks. All of these frameworks build on top of NumPy⁸, a package for scientific computing and efficient array manipulation in Python.
The hardware used for this project consists of a set of servers from the UGent MMlab, each having two 10-core Intel(R) Xeon(R) E5-2650v3 CPUs operating at 2.30 GHz with 128 GB of RAM. For working on convolutional networks, GPU acceleration is harnessed by working on the GPU-enabled workstations of the UGent Reservoir Lab. The specific GPUs used for this work are the Nvidia Tesla K40c and the Nvidia Titan X, both having 12 GB of GPU memory available, in a workstation with an Intel(R) Core(TM) i7-3930K CPU operating at 3.20 GHz and having 32 GB of RAM.
⁵ http://deeplearning.net/software/theano/
⁶ http://lasagne.readthedocs.io/en/latest/index.html
⁷ http://scikit-learn.org/stable/index.html
⁸ http://www.numpy.org/
Chapter 5
Results
This chapter discusses the results of constructing and applying AEs to DNA sequences. It starts (as all ML problems do) with an analysis of the source data the program has to work with. Then a section discusses the machine learning approach to the problem, where several variations of AEs with increasing complexity are implemented and their performance evaluated. After this, having compared and selected a well-performing AE architecture, a last section discusses an implementation of the compression scheme using this AE following the three scenarios discussed.
5.1 Data Analysis
File content
Each FASTA-formatted file has its content structured as shown in Listing 5.1. The file contains multiple scaffolds (parts of a sequence), each having some descriptive metadata. A descriptor line, starting with the '>' character, indicates the properties of the following scaffold. This descriptor is followed by the nucleobases in the chromosome. Multiple (descriptor, base sequence) pairs can be present in each file.
CHAPTER 5. RESULTS 52
>gi|568815364|ref|NT_077402.3| Homo sapiens chromosome 1 genomic scaffold, GRCh38.p2 Primary Assembly HSCHR1_CTG1
TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAAC
CCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCAACCCTAACCCTAACCCTAACCCTAACCCTAA
...
ATGCAAAGATAAATATAAAAACTGATACTCCATCCAGTTACCAGAAAACATTTAGGTATGTGTGAGACAA
CTTGGGTATGTGAACCTACCTTTTCAATGTAAATTCAGTGAAATCTAAGTACAGAT
>gi|568815363|ref|NT_187170.1| Homo sapiens chromosome 1 genomic scaffold, GRCh38.p2 Primary Assembly HSCHR1_CTG1_1
GATTCATGGCTGAAATCGTGTTTGACCAGCTATGTGTGTCTCTCAATCCGATCAAGTAGATGTCTAAAAT
TAACCGTCAGAATATTTATGCCTGATTCATGGCTGAAATTGTGTTTGACCAGCTATGTGTGTCTCTTAAT
...
Listing 5.1: Example of a FASTA file
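A file with this layout can be parsed in a few lines of Python; a minimal sketch (not the exact parser used in this work):

```python
import io

def read_fasta(handle):
    """Yield (descriptor, sequence) pairs from a FASTA stream."""
    descriptor, parts = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if descriptor is not None:       # flush the previous scaffold
                yield descriptor, "".join(parts)
            descriptor, parts = line[1:], []
        elif line:
            parts.append(line)               # sequence lines are concatenated
    if descriptor is not None:
        yield descriptor, "".join(parts)

example = io.StringIO(">scaffold_1 example\nTAACCC\nTAACCC\n>scaffold_2 example\nGATTCA\n")
records = list(read_fasta(example))
# records == [('scaffold_1 example', 'TAACCCTAACCC'), ('scaffold_2 example', 'GATTCA')]
```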
The bases occurring in a genome sequence are A, C, G, and T. One other letter is considered as well: N, for unspecified or unknown. However, on rare occasions, other characters are encountered in the file, such as K, S, Y, M, W, R, B, D, H, V, and the - symbol. These represent wildcards or a set of possibilities that is not fully determined. It should be noted that these do not occur in the physical strand, but rather are a way of expressing uncertainties due to the limitations of current sequencing technology. Table 5.1 shows the count of these anomalies for each chromosome. On average, this occurs only 4 times per chromosome, meaning they are a negligible fraction of the content. Due to the extremely low probability of these characters, including them in an encoding scheme would reduce the efficiency of the scheme significantly. A common approach for this kind of situation is to include them as raw, unencoded characters in the output stream, sent 'as is'. In this work, these letters are replaced by N characters, and the compression models are only aware of the five main characters in use in the sequence.
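The replacement step can be sketched as a simple filter over the alphabet (an illustration, not the exact preprocessing code used here):

```python
# Replace every IUPAC ambiguity code (K, S, Y, M, W, R, B, D, H, V, '-')
# by the generic unknown symbol N, so that downstream models only ever
# see the five-letter alphabet {A, C, G, T, N}.
KEEP = set("ACGTN")

def normalise(sequence):
    return "".join(c if c in KEEP else "N" for c in sequence)

print(normalise("ACGTKSYACGT-N"))  # -> ACGTNNNACGTNN
```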
Bases, pairs & triplets statistics
In works discussing DNA data from a biological point of view, bases are often grouped together per three, where they are called triplets or codons. This grouping has a biological purpose. From a compression point of view, if the occurrence of a (group of) base(s) differs significantly from its expected frequency, this could be a characteristic of the data to exploit. If it turns out that, for example, only a subset of all possible codons (out of the 64 possibilities) occurs, this can prove to be valuable information in designing a compression scheme.
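Gathering such frequency statistics amounts to counting overlapping k-mers; a sketch with a sliding window:

```python
from collections import Counter

def kmer_frequencies(sequence, k):
    """Relative frequency of every overlapping k-mer in the sequence."""
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}

seq = "TAACCCTAACCCTAACCC"
triplets = kmer_frequencies(seq, 3)
# With a 5-symbol alphabet there are at most 5**3 = 125 distinct triplets;
# a strongly skewed distribution would be the kind of structure worth exploiting.
```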
filename              Filesize (B)   replacements   characters      Entropy
GRCh38.p2.chr1.fa     239,324,742    2              2.359 · 10^8    1.38
GRCh38.p2.chr10.fa    137,217,937    36             1.353 · 10^8    1.387
GRCh38.p2.chr11.fa    139,240,135    0              1.373 · 10^8    1.377
GRCh38.p2.chr12.fa    138,387,841    3              1.364 · 10^8    1.37
GRCh38.p2.chr13.fa    101,198,223    3              9.977 · 10^7    1.364
GRCh38.p2.chr14.fa    96,655,086     0              9.529 · 10^7    1.385
GRCh38.p2.chr15.fa    99,469,680     0              9.807 · 10^7    1.404
GRCh38.p2.chr16.fa    88,410,877     1              8.716 · 10^7    1.404
GRCh38.p2.chr17.fa    93,966,740     12             9.264 · 10^7    1.389
GRCh38.p2.chr18.fa    83,020,886     0              8.185 · 10^7    1.374
GRCh38.p2.chr19.fa    74,829,450     0              7.377 · 10^7    1.493
GRCh38.p2.chr2.fa     247,160,296    9              2.437 · 10^8    1.371
GRCh38.p2.chr20.fa    65,455,748     0              6.453 · 10^7    1.388
GRCh38.p2.chr21.fa    41,579,794     3              4.099 · 10^7    1.377
GRCh38.p2.chr22.fa    42,778,725     5              4.217 · 10^7    1.392
GRCh38.p2.chr3.fa     205,139,933    7              2.022 · 10^8    1.367
GRCh38.p2.chr4.fa     195,958,575    0              1.932 · 10^8    1.363
GRCh38.p2.chr5.fa     188,701,111    0              1.860 · 10^8    1.373
GRCh38.p2.chr6.fa     208,639,177    1              2.057 · 10^8    1.454
GRCh38.p2.chr7.fa     163,942,113    4              1.616 · 10^8    1.374
GRCh38.p2.chr8.fa     151,504,740    0              1.494 · 10^8    1.367
GRCh38.p2.chr9.fa     125,257,656    3              1.235 · 10^8    1.38
GRCh38.p2.chrX.fa     158,296,928    5              1.561 · 10^8    1.381
Table 5.1: Chromosomes and their file content in the reference genome. The entropy is calculated on the contained sequence, and not on the data stream which includes metadata.
Figure 5.1 shows the frequency of the five letters in the full reference genome sequence after replacements, with the minimum and maximum occurrences shown by the error bars. As a point of reference, the 25% line is marked by the dashed line, indicating the expected frequency under a uniform distribution. From this figure, it is shown that the distribution of the characters is rather even. No base has an exceptionally high or low occurrence compared to the others, and the differences between genomes are modest. Of note is the low frequency of the N character. Still, this character is included in the alphabet, as a placeholder for 'none of the others' is necessary, and its occurrence is high enough to justify its presence. It should be noted that this is merely a limitation of current technology, and the unknown part of the sequence will diminish further with advances in sequencing equipment.
Figures 5.2 and 5.3 show a similar frequency analysis for base pairs and triplets, each time with a dashed line marking the expected frequency in case of a uniform distribution. What would be useful to exploit is if only a small subset of all possible combinations occurred in the sequences. These groups could then be encoded as a single code symbol, and the limited amount
Figure 5.1: Occurrence of single bases in the full human reference genome. Error bars indicate the occurrence minima and maxima in separate chromosomes.
of possibilities could lead to an efficient encoding. Unfortunately, from these figures it is clear that this is not the case. Nearly all combinations (of the 25 pairs or 125 triplets) occur in reality, meaning encoding using a fixed set of groups is not a viable option.
From this frequency analysis, it is concluded that no information or calculated statistics are of immediate use in devising a compression scheme. Only the raw content of the sequence will be used as input to the AE, which will automatically learn features, instead of performing manual feature engineering.
5.2 Baseline comparison: state of the art
For DNA sequence data, there are unfortunately no advanced compression models available that make for an easy comparison. The most used methods (because of their generally good compression rate) are general-purpose compression schemes such as Zip & 7-Zip. Table 5.2 shows each chromosome of the reference genome used in this work compressed with 3 popular variants of this method: Zip, 7-Zip, and the proprietary RAR5 format. All methods were configured to use their best compression levels. The best compression rate is achieved by 7-Zip, with a rate averaging 20%-30%. The resulting compressed file corresponds to a rate of somewhat under 2 bpb. Note this is compression on the whole FASTA file, including the descriptive sequence metadata which is not considered in this work. However, these descriptors are only a minuscule fraction of the file content and thus do not influence the results significantly.
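A comparable baseline can be reproduced with the DEFLATE implementation from the Python standard library (the algorithm behind Zip; this does not reproduce the exact 7-Zip figures of Table 5.2):

```python
import zlib

def bits_per_base(sequence):
    """Bits per base achieved by DEFLATE (Zip's algorithm) at its best
    compression level, as a rough general-purpose baseline."""
    compressed = zlib.compress(sequence.encode("ascii"), 9)
    return 8 * len(compressed) / len(sequence)

# A repetitive telomeric stretch compresses far below 2 bits per base;
# typical genomic sequence stays a little under 2 bpb, as in Table 5.2.
ratio = bits_per_base("TAACCC" * 10000)
```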
Figure 5.2: Occurrence of base pairs in the full human reference genome. Error bars indicate the occurrence minima and maxima in separate chromosomes.
Figure 5.3: Occurrence of codons (base triplets) in the full human reference genome. Error bars indicate the occurrence minima and maxima in separate chromosomes. Codon labels are omitted for clarity.
Figure 5.4: Occurrence of groups of 2 bases in the reference genome, compared with their expected frequency. The bars show the actual frequency of the group, the mark shows the frequency which is expected based on the single-base frequencies.
Figure 5.5: Occurrence of groups of 3 bases in the reference genome, compared with their expected frequency.
filename              filesize (MiB)  Zip   RAR5  7-Zip  Compression  bpb
GRCh38.p2.chr1.fa     228             65    63    54     0.24         1.94
GRCh38.p2.chr10.fa    130             37    36    31     0.24         1.97
GRCh38.p2.chr11.fa    132             37    36    31     0.23         1.94
GRCh38.p2.chr12.fa    131             37    36    31     0.24         1.95
GRCh38.p2.chr13.fa    95              27    26    23     0.24         1.95
GRCh38.p2.chr14.fa    92              26    25    21     0.23         1.89
GRCh38.p2.chr15.fa    94              26    25    20     0.21         1.79
GRCh38.p2.chr16.fa    84              23    22    19     0.23         1.88
GRCh38.p2.chr17.fa    89              24    23    20     0.22         1.84
GRCh38.p2.chr18.fa    79              21    21    18     0.23         1.9
GRCh38.p2.chr19.fa    71              18    16    13     0.18         1.5
GRCh38.p2.chr2.fa     235             68    66    57     0.24         1.99
GRCh38.p2.chr20.fa    62              17    16    14     0.23         1.92
GRCh38.p2.chr21.fa    39              11    10    9      0.23         1.87
GRCh38.p2.chr22.fa    40              11    10    9      0.23         1.8
GRCh38.p2.chr3.fa     195             56    54    47     0.24         1.98
GRCh38.p2.chr4.fa     186             54    52    45     0.24         1.96
GRCh38.p2.chr5.fa     179             51    49    43     0.24         1.95
GRCh38.p2.chr6.fa     198             56    54    43     0.22         1.77
GRCh38.p2.chr7.fa     156             44    42    37     0.24         1.93
GRCh38.p2.chr8.fa     144             41    40    35     0.24         1.97
GRCh38.p2.chr9.fa     119             34    33    27     0.23         1.9
GRCh38.p2.chrX.fa     150             42    40    34     0.23         1.87
Table 5.2: General-purpose compression software on chromosome FASTA files of the human reference sequence. The Compression column is the 7-Zip filesize compared to the uncompressed file, and the bpb column is the bits per base achieved by the 7-Zip compression.
5.3 Auto-Encoder model construction
In this section, several Auto-Encoder architectures of increasing complexity are implemented to explore the feasibility of applying this method to compress genome data. Several network architectures are tried out and improved upon to form a suitable AE setup.
5.3.1 Shallow fully-connected Auto-Encoder
The first network constructed is a traditional fully-connected AE consisting of five layers. The first is the input layer taking the matrix of one-hot encoded bases into the network. After that a hidden fully-connected layer follows. This layer serves as the bottleneck layer of the network (in the context of Auto-Encoders often called the encoding layer). Then the reconstruction (decoding layer) fully-connected layer follows. The output of this layer is reshaped to conform to the target dimensions and a softmax nonlinearity is applied subsequently. The nonlinearity used in the neurons of these dense layers is the hyperbolic tangent. Figure 5.6 shows this network structure with an encoding size of 2 neurons. This single-hidden-layer architecture has been used as an encoder in lossy image compression and is for most AE approaches the baseline to start from. The network is implemented once having shared weights between the fully-connected layers, and once where each layer has an independently trained set of parameters. This small network has only 17 trainable parameters in the case of weight sharing, or 27 in the independent case.
Figure 5.6: Shallow traditional fully-connected AE with single-base input and two encoding neurons.
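The forward pass of this tied-weight network can be written out in a few lines of NumPy (a sketch of the computation, not the actual Theano implementation). Note that the shared 5×2 weight matrix plus the two bias vectors account exactly for the 17 trainable parameters mentioned above:

```python
import numpy as np

rng = np.random.default_rng(0)

n_in, n_code = 5, 2                      # one one-hot base in, 2 encoding units
W = rng.normal(0, 0.1, (n_in, n_code))   # shared between encoder and decoder
b_enc = np.zeros(n_code)
b_dec = np.zeros(n_in)                   # 10 + 2 + 5 = 17 parameters in total

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def forward(x):
    """Tied-weight shallow AE: tanh encoder, decoder reusing W transposed,
    softmax over the reconstructed base probabilities."""
    code = np.tanh(x @ W + b_enc)        # bottleneck of 2 values
    return softmax(code @ W.T + b_dec)   # one probability per base symbol

batch = np.eye(5)                        # the five one-hot bases as a batch
reconstruction = forward(batch)          # shape (5, 5), each row sums to 1
```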
The result of the experiment is shown in Figure 5.7. From this figure it can be seen that, as a start, the network is functional; it quickly learns to perform the reconstruction. The loss function shows a continuous decrease as well, indicating a stable learning behaviour. However, this network configuration is practically useless from a compression point of view. Each single base, which can be represented by 3 bits, is now represented by a set of two 32-bit numbers. Essentially the representation is blown up, and as an over-representation achieves the contrary of compression. Reaching an accuracy of 100% is therefore not really meaningful besides making sure the network functions correctly. When comparing the tied weights and the independent weights, the tied version seems to converge more quickly, but as both go to 100% accuracy, it is hard to discuss any performance differences.
Architectural variations
In order to achieve an information bottleneck in the AE, two variations are made to the single-base AE: increasing the input to a sequence of multiple bases and varying the number of encoding units. Figure 5.8 indicates this modified architecture. The network now takes in a sequence of bases, but note that the accuracy is still considered per base. By varying these two parameters, a trade-off can be made between codeword size and reconstruction performance.
Figure 5.7: Reconstruction accuracy and loss function of the shallow fully-connected AE with single-base input and two encoding units. Above: independent weights, below: tied weights.
Figure 5.8: Variation on the shallow traditional fully-connected AE with 100 bases as input and two encoding units. (1) indicates an increase in input length. (2) indicates an increase in encoding units.
Figure 5.9 shows the performance of this shallow AE with a configuration of 100 bases as input sequence and two encoding units, leading to a 4.5-fold size reduction. The network in this configuration has 1502 trainable parameters with weight sharing, and 2502 without. These results show the network is not able to reach a good performance. While the behaviour does show a limited form of learning, with an accuracy starting at 25% and ending up at 30%, the performance is not good. The corresponding loss functions decrease slightly from 1.60 to 1.43, but neither is this a good score. Both the accuracy and loss curves stagnate after a short initial movement, and the network stops learning. There is no discernible difference between the tied and independent weight setups. The network seems to fail in learning meaningful structure and features in the data, which looks like an underfitting problem: the model is not sufficiently powerful to express the required complexity in modelling the data.
5.3.2 Deep fully-connected Auto-Encoder
When approaching a problem using ANNs, improving a result is often as simple as extending the architecture to a deeper network, which will outperform a shallow architecture most of the time. With this in mind, the shallow AE is now extended with two extra layers at each side. Between the input and the bottleneck layer, two fully-connected dense hidden layers are added, having twenty and five times the amount of neurons of the bottleneck layer, resulting in a funneled architecture. The decoder is adapted likewise to preserve symmetry. Figure 5.10 shows this architecture with two hidden units as encoding size and an input sequence length of 100 bases. The same set of parameters (100 input bases, 2 encoding units) is selected, and the network is trained once having independent weights and once having tied weights in the symmetrical parts. The results are shown in Figure 5.11. The results are very alike compared to the shallow dense AE. The accuracy never reaches higher than 30%. The loss function, while decreasing in a stable way, does so only in a minor way. It appears improving this architecture in the brute-force way by simply adding more layers and neurons does not help; the network still suffers from underfitting.
5.3.3 Shallow Convolutional Auto-Encoder
Even a deep architecture of fully-connected layers does not seem to work well on this problem. Adding more neurons in a layer or more layers in the network does not improve
0 500 1,000 1,500 2,000 2,5000
0.2
0.4
0.6
0.8
1
Number of updates
Rec
onst
ruct
ion
acc
ura
cy
trainingvalidation
(a) Reconstruction accuracy
0 500 1,000 1,500 2,000 2,500
1.5
1.55
1.6
Number of updates
Cate
gori
cal
cros
sentr
opy
trainingvalidation
(b) Cross-entropy loss
0 500 1,000 1,500 2,000 2,5000
0.2
0.4
0.6
0.8
1
Number of updates
Rec
onst
ruct
ion
accu
racy
trainingvalidation
(c) Reconstruction accuracy
0 500 1,000 1,500 2,000 2,500
1.5
1.55
1.6
Number of updates
Cat
egor
ical
cros
sentr
opy
trainingvalidation
(d) Cross-entropy loss
Figure 5.9: Reconstruction accuracy and loss function of shallow fully-connected AE with inputsequence length of 100 bases and two encoding units. Above: independant weights, below: tiedweights.
Figure 5.10: Deep fully-connected AE with a sequence input length of 100 bases and two hidden units in the encoding layer.
0 500 1,000 1,500 2,000 2,5000
0.2
0.4
0.6
0.8
1
Number of updates
Rec
on
stru
ctio
nac
cura
cy
trainingvalidation
(a) Reconstruction accuracy
0 500 1,000 1,500 2,000 2,500
1.45
1.5
1.55
1.6
Number of updates
Cate
gori
cal
cros
sentr
opy
trainingvalidation
(b) Cross-entropy loss
0 500 1,000 1,500 2,000 2,5000
0.2
0.4
0.6
0.8
1
Number of updates
Rec
onst
ruct
ion
accu
racy
trainingvalidation
(c) Reconstruction accuracy
0 500 1,000 1,500 2,000 2,500
1.45
1.5
1.55
1.6
Number of updates
Cat
egor
ical
cros
sentr
opy
trainingvalidation
(d) Cross-entropy loss
Figure 5.11: Reconstruction accuracy and loss function of deep fully-connected AE with inputsequence length of 100 bases and two encoding units. Above: independant weights, below: tiedweights.
performance in a significant way. The next logical step therefore is to have a look at convolutional networks. They have proven to be exceptional in object recognition tasks and are a powerful addition to a neural network structure.
The Convolutional AE (CAE) is built starting from the shallow fully-connected AE with a single hidden layer. In front of the dense encoding layer a single convolutional set is placed, consisting of one convolutional layer followed by a max-pooling layer. A flatten operation is required to link the feature maps to the dense encoding layer. Between the input layer and the convolutional layer a set of operations is applied to the 2D input matrix in order to transform it to the multi-channel structure (as shown in Figure 4.2) required for convolutional operation. After the bottleneck layer, the inverse operations are applied to keep the symmetrical architecture: an upsampling and deconvolution, followed by a shape transformation and a softmax in order to constrain the network to the single-base output. The inverse part of the network is implemented once using shared weights and once independently. Both the convolutional and fully-connected layers have the sigmoid as non-linearity applied on their output and are initialized with He weight initialization ([35]). For the gradient learning, the Adam update method ([36]) is applied with a learning rate of 0.001. The network is run on GPUs with a batch size of 50 sequences per weight update.
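The He initialization used here draws weights from a zero-mean Gaussian whose standard deviation scales with the fan-in; a sketch (the kernel shape below is illustrative):

```python
import numpy as np

def he_init(fan_in, fan_out, rng=None):
    """He et al. initialization: zero-mean Gaussian with
    std = sqrt(2 / fan_in), keeping activation variance stable
    through the layers."""
    rng = np.random.default_rng(0) if rng is None else rng
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))

# e.g. a dense layer fed by 3x3 kernels over 32 input channels
W = he_init(3 * 3 * 32, 64)
```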
As the input is transformed into multi-channeled squares (see subsection 4.2.2), the input sequence length is chosen somewhat differently from the previous traditional AEs. Figure 5.12 shows the network structure working on an input sequence of 400 bases (squares of 20 × 20). The formatting of the data into cubes (and back to single-base) is implemented through a few transformation steps using layers in the network; these operations are shown on the figure compacted into a single cube-formatting layer for brevity. This particular network has 18279 trainable parameters with independent weight training, or 10183 parameters using tied weights.
Figure 5.13 displays the outcome of the training process with 400 bases as input length and two encoding units. The training shows a stable and desirable learning behaviour. The accuracy of the reconstruction gradually rises and eventually reaches over 90%. The loss function decreases gradually as desired. The transition from a traditional AE to a CAE has seemingly led to a significant performance gain. From a reconstruction performance of slightly better
Figure 5.12: Shallow convolutional AE structure.
than random¹, the network has jumped to around 95% correct reconstruction accuracy. The loss function also displays smooth decreasing behaviour and ends up around 1.
When comparing the weight sharing with the independent weight training, there is a noticeable advantage for the weight-sharing variant: the accuracy ends up about 5% higher and reaches the 90% mark far quicker. The cross-entropy loss quickly drops under 1, while the weight-independent variant does not go below the 1 border at all. As these are exactly the same conditions for the network with only the weight sharing as a difference, this leads to the conclusion that in this setup there is a clear advantage in using tied weights for the encoding and decoding parts of the CAE. Furthermore, this CAE is a possibly viable candidate for a compression scheme.
5.3.4 Deep Convolutional Auto-Encoder
The previous subsection has shown that the Convolutional Auto-Encoder performs very well on the task. Several modifications are now made to the previous CAE in order to try to further improve upon the results. As a first step, an extra fully-connected hidden layer is added before & after the encoding layer. Regarding the convolutional stage(s) of the network, one additional convolutional-pooling stage is added after the initial max-pooling layer. Each stage is individually extended and now consists of two convolutions before max-pooling is applied. The convolution kernels are made smaller, while the number of feature maps is doubled from the first to the second stage, leading to a configuration of 32@3×3 and 64@3×3; these smaller kernels are shown to have a regularizing effect. These architectural designs are inspired by recent research on ConvNets ([21], [37]).
The resulting architecture of the net is displayed in Figure 5.14. Only the layers up to the bottleneck
¹ As the large majority of the sequence is A, C, G, or T, and each letter has a comparable frequency in the sequence, randomly guessing the base would have a 25% success ratio, which the previous network at 30% does not improve upon by much.
Figure 5.13: Reconstruction accuracy and loss function of the shallow convolutional AE with input sequence length of 400 bases and two encoding units. Above: independent weights, below: tied weights.
layer are shown due to space constraints. The omitted decoding part is fully symmetrical to the first half of the network, and followed by a reshape and softmax output just as in the previous networks. As the previous CAE has shown that a setup with tied weights significantly outperforms the weight-independent option, the decoder is tied to the encoder weights and the non-tied version is dropped from the experiment. The two convolutional sets (Conv-Conv-MaxPool) with decreasing sizes allow for a hierarchical feature learning. This network ends up with 67047 trainable parameters using tied weights.
Figure 5.14: Deep CAE architecture. Symmetrical decoding part omitted for readability.
With the same settings for input and encoding size as the previous CAE, the network training is shown in Figure 5.15. When comparing this to the shallow CAE, it seems the network does not immediately gain from the adapted architecture. While it works and shows a stable learning behaviour, this network is seemingly outperformed by the shallow CAE on both accuracy and cross-entropy loss.
5.3.5 Batch-Normalized ReLu CAE
Recent research ([38], 2015) introduced the concept of Batch Normalization for ANN layers.
By normalizing the output of a fully-connected or convolutional layer before applying its non-
linearity, it has been shown to improve the performance of a model, reduce overfitting, and
significantly speed up the training of the ANN. It has also proven an effective combination
with the ReLU activation function, which is not frequently used in Auto-Encoder
architectures because it does not seem to perform well with the inverse operations often used by
AEs. This technique is now tried out with the previous deep CAE architecture. Every dense or
convolutional layer (and its inverse counterpart) is followed by a Batch Normalization layer.
The tied-weight setup is kept due to its clear advantage in the previous experiments. As the
technique shows the ability to speed up learning, the learning rate is tripled to 0.003, still using
the Adam gradient optimization method. This network is constructed in two otherwise equal
variations: once using traditional sigmoids, and once using ReLU activations. The
input length and encoding size parameters are kept. The networks have 69,379 parameters.

Figure 5.15: Reconstruction accuracy and loss function of the deep CAE with an input sequence length of 400 bases and two encoding units, trained with tied weights. (a) Reconstruction accuracy and (b) categorical cross-entropy loss, plotted against the number of updates for the training and validation sets.
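The batch-normalize-then-activate pattern described above can be sketched as follows (a minimal training-time NumPy sketch with illustrative sizes; gamma and beta stand in for the learned scale and shift parameters):

```python
import numpy as np

def batch_norm(pre_act, gamma, beta, eps=1e-5):
    # Normalize each feature over the batch, then apply a learned
    # scale (gamma) and shift (beta) -- all before the nonlinearity.
    mean = pre_act.mean(axis=0)
    var = pre_act.var(axis=0)
    normed = (pre_act - mean) / np.sqrt(var + eps)
    return gamma * normed + beta

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(1)
x = rng.normal(size=(8, 4))          # batch of 8 pre-activations, 4 features
gamma, beta = np.ones(4), np.zeros(4)

# Nonlinearity is applied after normalization, as in the BN-CAE layers.
h = sigmoid(batch_norm(x, gamma, beta))

# With gamma=1 and beta=0, the normalized pre-activations have
# (approximately) zero mean and unit variance per feature.
normed = batch_norm(x, gamma, beta)
assert np.allclose(normed.mean(axis=0), 0.0, atol=1e-7)
assert np.allclose(normed.std(axis=0), 1.0, atol=1e-2)
```

At inference time a real implementation would use running averages of the batch statistics instead of per-batch values.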
Figure 5.16 shows the training process of these BN networks. The first thing to discuss is
the ReLU activation function trial, shown in the upper half of the figure. While it initially
seems to perform well and do the required learning, it starts to deteriorate rather quickly. The
accuracy decreases and the loss function starts to rise again. This diverging trend continues
(not shown on the graph) when the network is trained further to 50,000 updates; the
accuracy keeps dropping to under 50% - for both the training and validation curves - and the loss
ends up at 1.12. From this experiment, it seems that the ReLU is not a useful activation function
for this AE, even in tandem with BN.
The lower half of the figure shows the variant using traditional sigmoids. Compared with
the previous deep CAE (which has an identical architecture bar the BN layers), this figure shows a clear
improvement accompanied by the faster learning that was 'promised'. Its performance is comparable with
the shallow CAE: the accuracy rivals the previously best shallow CAE by reaching around 95%,
and the loss function drops below 0.5. Further training continues this smooth learning behaviour.
Figure 5.16: Reconstruction accuracy and loss function of the Batch Normalized deep CAE with an input sequence length of 400 bases and two encoding units, trained with tied weights. Above: ReLU activations ((a) reconstruction accuracy, (b) cross-entropy loss); below: sigmoid activations ((c) reconstruction accuracy, (d) cross-entropy loss). All curves are plotted against the number of updates for the training and validation sets.

5.3.6 Model comparison, selection and discussion

In the previous subsections, an AE architecture was constructed to perform the compression
task. Starting from a traditional AE with a single fully-connected hidden layer encoding a
single base, and ending up with a deep (batch-normalized) convolutional AE encoding
a sequence of hundreds of bases into two values, the end result is a network capable of achieving
a very good compression rate with an associated good reconstruction accuracy. The traditional
AEs with only fully-connected layers are unable to perform the task and show an underfitting
problem. Only when moving to (variations on) convolutional AEs does the performance become very
good, and these are the models of interest. This section concludes with a comparison of these
CAE variations.
The CAEs are now trained for 50,000 updates, each with the same settings for input se-
quence length and encoding units. Their other (hyper)parameters remain as
discussed in the previous subsections. Figure 5.17 plots the resulting comparison of the re-
construction accuracy and cross-entropy loss on the validation set. A first conclusion to draw
here is that the model using the ReLU activation function (BN-CAE(ReLU)) does not work. The
cross-entropy quickly starts to diverge and the corresponding accuracy drops to under 50%. This
leaves the CAEs using sigmoid activations. The Batch Normalized variation (BN-CAE) initially
performs best. However, even before 10,000 updates the model shows signs of overfitting.
The loss function increases after a minimum at around 7,500 updates, and the accuracy has
dropped well before that, ending up with a score which does outperform the deep CAE
on which it is based, but does not match the shallow CAE. The desired regularization effect
of the Batch Normalization is not observed in this case. The deep CAE shows
a familiar learning behaviour: the accuracy rises and the loss function decreases, up until
overfitting occurs. At 10,000-15,000 updates, clear signs of overfitting show up, and the
performance gets worse from there on. The model is outperformed by both its shallow pre-
decessor and its batch-normalized successor. The clear winner here is the shallow CAE, also
displaying the familiar learning behaviour. The loss function decreases and at about 30,000
updates starts to increase due to overfitting, ending up with a similar score to the BN-CAE.
The accuracy rises to over 96%, and after overfitting occurs ends up slightly below that number.
It outperforms all of its successors, and consequently, this is the model applied in the next step.
Table 5.3 offers a summary of the models considered with some key figures. The distinction
between traditional and convolutional AEs is clear. Also note that the best performing
model (Conv AE (tied)) has a low number of parameters due to its shallow architecture and
the concept of weight sharing.
Figure 5.17: Performance comparison of variations on the CAE with an input sequence length of 400 bases and two encoding units, all trained with tied weights. (a) Reconstruction accuracy and (b) categorical cross-entropy loss on the validation set, plotted against the number of updates (up to 50,000) for the CAE, deep CAE, BN-CAE, and BN-CAE(ReLU) models.
Model | Compression | Accuracy (%) | Parameters
Single-base AE (tied) | 2,133 | 99.87 | 17
Shallow AE | 21.33 | 29.77 | 2,502
Shallow AE (tied) | 21.33 | 29.75 | 1,502
Deep AE | 21.33 | 29.76 | 41,442
Deep AE (tied) | 21.33 | 29.76 | 21,022
Conv AE | 5.33 | 90.72 | 18,279
Conv AE (tied) | 5.33 | 95.35 | 10,183
Deep Conv AE (tied) | 5.33 | 91.84 | 67,047
Batch Norm deep AE (tied) | 5.33 | 94.4 | 68,213
Batch Norm deep AE (ReLU & tied) | 5.33 | 47.57 | 68,213

Table 5.3: Overview of the Auto-Encoders considered with some key characteristics. (tied) indicates the weights of the decoder and encoder parts are shared. The compression column gives the ratio of the encoding units to the input sequence (at 3 bits per base). It does not include the residue at this point.
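As a sanity check on the compression column, one plausible reading (an assumption about the exact formula, since only "3 bits per base" is stated) is the codeword size relative to the raw input, with each encoding unit stored as a 32-bit float and expressed as a percentage:

```python
# Assumed interpretation of the compression column: size of the codeword
# (encoding units stored as 32-bit floats) relative to the raw input
# (3 bits per base), expressed as a percentage.
def compression_pct(encoding_units, input_bases, bits_per_unit=32, bits_per_base=3):
    return 100.0 * (encoding_units * bits_per_unit) / (input_bases * bits_per_base)

# Convolutional AEs: 400-base input, 2 encoding units.
conv = round(compression_pct(2, 400), 2)
print(conv)  # 5.33 -- matches the Conv AE rows of table 5.3
```

Under the same reading, the single-base AE's codeword is far larger than its input, consistent with the first row of the table.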
5.4 Compression scheme implementation

Having determined a successful model in the previous section, this section implements a
compression scheme by applying the CAE to the DNA sequences. The evaluation schemes and
the metrics considered are the ones discussed in the methodology chapter (section 4.2.5).
5.4.1 Scenario 1: chromosome Ref-A on chromosome Ref-B
For this first scenario, the CAE is trained using chromosome 1 of the reference genome. The
full sequence is used as the train set in this case (as no validation data needs to be held out). The
model is trained for 30,000 updates, as this has previously been shown to lead to good results. This
model is then used to compress the other chromosomes of the same reference genome, following
the scheme of figure 4.3. A first aspect to investigate is the achieved compression. Two data
structures are of interest for this: the encoded genome and the residue. The encoding - or the
codeword - is the output produced by the AE encoding layer. This is an array with a number
of rows equal to the number of input sequences, and a number of columns equal
to the encoding size (the number of neurons in the encoding layer). Each element is a neuron
value, thus a 32-bit floating point number, and this array is saved to disk using NumPy's save()
method, resulting in a binary .npy file. For the residue there are three
choices. The first form of the residue will always be of the same size as the input. The size
of the second and third forms of the residue will depend on the accuracy of the AE, as they
contain elements for errors. The second residue will contain a bitmask of the same form as
the input, and an array of varying size with an element for each error. The third residue will
contain only pairs for errors, and will thus have the smallest filesize in this implementation, as a
good reconstruction is assumed from the results of the previous section. The combination of the
codeword and this residue makes up the bitstream, and the ratio of this bitstream to the
input (i.e. the FASTA file of the original chromosome) determines the achieved compression
ratio. Note again that the compressed sequence is stripped of any metadata originally contained in the
sequence file.
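The third residue form can be sketched as follows (a toy example with made-up labels, not the thesis code; the base-to-integer mapping is assumed):

```python
import io
import numpy as np

# Hypothetical base labels: A=0, C=1, G=2, T=3.
original      = np.array([0, 1, 2, 3, 0, 1, 2, 3])
reconstructed = np.array([0, 1, 2, 0, 0, 1, 3, 3])  # two wrong bases

# Residue form 3: only (position, true label) pairs where the AE erred.
errors = np.flatnonzero(original != reconstructed)
residue3 = np.stack([errors, original[errors]], axis=1)
print(residue3.tolist())  # [[3, 3], [6, 2]]

# Saved alongside the codeword; np.save produces the binary .npy
# files used as the (naive) on-disk format in this chapter.
buf = io.BytesIO()
np.save(buf, residue3)

# Lossless reconstruction: apply the residue on top of the AE output.
restored = reconstructed.copy()
restored[residue3[:, 0]] = residue3[:, 1]
assert np.array_equal(restored, original)
```

The better the reconstruction accuracy, the fewer pairs this residue contains, which is why form 3 is the smallest of the three.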
Table 5.4 contains the compression results of this setup.
These values show first of all that this method has been successful in achieving compression.
The average size compared to the FASTA file is 60%-70%. One chromosome can barely
be compressed, but none of them end up growing in size. When the bits per base (bpb) are
considered, values around 5 are achieved. Unfortunately, this does not outperform some other
(general-purpose) compression methods, nor does it do better than the 2 bpb baseline option.
Another remark from this data is that the residue is a major factor in the filesize compared to the
codeword. One should note however that this is a naive implementation (e.g. data stored as NumPy
structures) and by no means the limit of compression performance. The codeword probably
does not leave much room for more efficient storage, as it is already simply the array
of 32-bit neuron values. The residue, however, can probably be stored much more efficiently with the right
implementation and data structure.
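The two headline metrics can be reproduced as follows (the FASTA size of 130 MB and base count of 129.3 million are illustrative assumptions chosen to roughly match the chr10 row of table 5.4, not recorded values):

```python
# bitstream = codeword file + residue file, compared against the FASTA input.
def compression_ratio(codeword_mb, residue_mb, fasta_mb):
    return (codeword_mb + residue_mb) / fasta_mb

def bits_per_base(codeword_mb, residue_mb, n_bases):
    total_bits = (codeword_mb + residue_mb) * 1e6 * 8
    return total_bits / n_bases

# chr10 row of table 5.4: 2.59 MB codeword, 83.69 MB residue (form 3).
ratio = compression_ratio(2.59, 83.69, 130.0)   # assumed ~130 MB FASTA file
bpb = bits_per_base(2.59, 83.69, 129.3e6)       # assumed ~129.3M bases
print(round(ratio, 2), round(bpb, 2))  # 0.66 5.34
```

This also makes the residue's dominance explicit: the 83.69 MB residue accounts for nearly all of the bitstream.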
With the aim of a realistic compression scenario, the next step would be a more
intelligent storage of the bitstream. One frequently used option is to apply arithmetic
coding. In order to apply this successfully, the entropy of the material to be encoded should be
as low as possible. This makes both the number of errors made and the
pattern of these errors important. These two metrics are extracted from the first form of the residue and
shown in table 5.5. The entropy of the residue is consistently around 0.25, and the reconstruction
accuracy slightly over 95%, which is in line with the expected performance from the results
of the model construction.
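The entropy figures can be checked against the error rates directly: treating the form-1 residue bitmask as a Bernoulli source, the Shannon entropy at a 4% error rate is close to the reported ~0.25 (a sketch; the thesis values will differ slightly because the underlying error rates are rounded in the table).

```python
import math

def binary_entropy(p):
    # Shannon entropy (bits/symbol) of a Bernoulli source, e.g. the
    # per-position error bitmask of residue form 1.
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

h = binary_entropy(0.04)
print(round(h, 2))  # 0.24
```

A low entropy like this is exactly what makes a follow-up entropy coder (arithmetic coding, or the gzip pass in section 5.4.3) effective.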
5.4.2 Scenario 2: chromosome Ref-A on chromosome Alt-A
The second scenario investigates the result of training the CAE on a chromosome from the
reference genome and encoding the same chromosome from the alternative genome. This means
there are 23 CAEs trained, and each is used once for testing.

Chromosome | Encoded (MB) | Residue 3 (MB) | Comp. ratio | bpb
hs_ref_GRCh38.p2_chr10.fa | 2.59 | 83.69 | 0.66 | 5.34
hs_ref_GRCh38.p2_chr11.fa | 2.62 | 83.46 | 0.65 | 5.25
hs_ref_GRCh38.p2_chr12.fa | 2.61 | 81.77 | 0.64 | 5.17
hs_ref_GRCh38.p2_chr13.fa | 1.91 | 58.81 | 0.63 | 5.09
hs_ref_GRCh38.p2_chr14.fa | 1.82 | 59.76 | 0.67 | 5.4
hs_ref_GRCh38.p2_chr15.fa | 1.88 | 64.38 | 0.7 | 5.65
hs_ref_GRCh38.p2_chr16.fa | 1.66 | 55.16 | 0.67 | 5.47
hs_ref_GRCh38.p2_chr17.fa | 1.77 | 58.45 | 0.67 | 5.44
hs_ref_GRCh38.p2_chr18.fa | 1.56 | 48.24 | 0.63 | 5.1
hs_ref_GRCh38.p2_chr19.fa | 1.41 | 67.55 | 0.97 | 7.82
hs_ref_GRCh38.p2_chr20.fa | 1.24 | 41.51 | 0.68 | 5.53
hs_ref_GRCh38.p2_chr21.fa | 0.79 | 26.41 | 0.69 | 5.54
hs_ref_GRCh38.p2_chr22.fa | 0.81 | 29.05 | 0.73 | 5.91
hs_ref_GRCh38.p2_chr2.fa | 4.65 | 143.11 | 0.63 | 5.08
hs_ref_GRCh38.p2_chr3.fa | 3.86 | 116.08 | 0.61 | 4.97
hs_ref_GRCh38.p2_chr4.fa | 3.69 | 109.37 | 0.6 | 4.91
hs_ref_GRCh38.p2_chr5.fa | 3.56 | 110.3 | 0.63 | 5.12
hs_ref_GRCh38.p2_chr6.fa | 3.93 | 157.45 | 0.81 | 6.57
hs_ref_GRCh38.p2_chr7.fa | 3.09 | 97.4 | 0.64 | 5.2
hs_ref_GRCh38.p2_chr8.fa | 2.85 | 87.67 | 0.63 | 5.08
hs_ref_GRCh38.p2_chr9.fa | 2.36 | 73.57 | 0.64 | 5.15
hs_ref_GRCh38.p2_chrX.fa | 2.98 | 95.44 | 0.65 | 5.28

Table 5.4: File sizes after encoding using the shallow CAE trained on chromosome 1 of the reference genome. The compression column shows the ratio of the bitstream filesize to the original FASTA file.

The same metrics are used as in the previous subsection, and tables 5.6 and 5.7 contain the
outcome of these tests. Compared to the previous setup, the compression achieved is slightly less
successful. The ratio averages around 70%, and the bpb is on average noticeably larger than in the
previous scenario.
5.4.3 Additional general-purpose compression
The scheme implemented here should be interpreted as a prototype and proof of concept. It
suffices to illustrate the achieved results and concepts, but the implementa-
tion has some flaws to consider. First of all, the input sequence is stripped
of its metadata and some rare characters are replaced by the unspecified symbol. From there,
the compression works on the sequence data, so technically this is not entirely lossless compres-
sion when viewed from an end-to-end position. A second remark is the
fact that working with NumPy arrays to store the data is a rather naive implementation. More
efficient and better approaches are possible; this implementation is chosen for its ease of use and
straightforward compatibility with the frameworks used for the machine learning aspect of the
implementation.

Chromosome | Entropy | Error rate
hs_ref_GRCh38.p2_chr10.fa | 0.25 | 0.04
hs_ref_GRCh38.p2_chr11.fa | 0.25 | 0.04
hs_ref_GRCh38.p2_chr12.fa | 0.25 | 0.04
hs_ref_GRCh38.p2_chr13.fa | 0.24 | 0.04
hs_ref_GRCh38.p2_chr14.fa | 0.26 | 0.04
hs_ref_GRCh38.p2_chr15.fa | 0.27 | 0.04
hs_ref_GRCh38.p2_chr16.fa | 0.26 | 0.04
hs_ref_GRCh38.p2_chr17.fa | 0.26 | 0.04
hs_ref_GRCh38.p2_chr18.fa | 0.24 | 0.04
hs_ref_GRCh38.p2_chr19.fa | 0.35 | 0.06
hs_ref_GRCh38.p2_chr20.fa | 0.26 | 0.04
hs_ref_GRCh38.p2_chr21.fa | 0.26 | 0.04
hs_ref_GRCh38.p2_chr22.fa | 0.27 | 0.04
hs_ref_GRCh38.p2_chr2.fa | 0.24 | 0.04
hs_ref_GRCh38.p2_chr3.fa | 0.24 | 0.04
hs_ref_GRCh38.p2_chr4.fa | 0.23 | 0.04
hs_ref_GRCh38.p2_chr5.fa | 0.24 | 0.04
hs_ref_GRCh38.p2_chr6.fa | 0.3 | 0.05
hs_ref_GRCh38.p2_chr7.fa | 0.25 | 0.04
hs_ref_GRCh38.p2_chr8.fa | 0.24 | 0.04
hs_ref_GRCh38.p2_chr9.fa | 0.24 | 0.04
hs_ref_GRCh38.p2_chrX.fa | 0.25 | 0.04

Table 5.5: Residue (form 1) analysis after encoding using the shallow CAE trained on chromosome 1 of the reference genome.

As an example, when converting the textual FASTA file with bases to numeric
labels, the NumPy data structure holding these labels is stored on disk with a filesize of two
to ten times that of the FASTA file. The compression numbers could thus be even better
when these flaws are handled and a suitable encoding and set of data formats is chosen.
For the sake of completeness, the general-purpose compression algorithm gzip with its
default settings is applied on top of the bitstreams from scenario 1. This additional layer of
compression might be a simple fix for the lack of an intelligent choice of data structures for
the residue, as the gzip format will create an efficient representation (it is here that the low entropy
is of use). Table 5.8 shows for these gzipped bitstreams the fraction of the size compared
to the original FASTA file, and the bits per base.
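The effect can be reproduced in miniature (a standalone sketch with a synthetic bitmask, not the actual residue files; the 4% error rate mirrors table 5.5):

```python
import gzip
import numpy as np

# Illustrative stand-in for a residue file: a mostly-zero error structure
# compresses well under gzip precisely because of its low entropy.
rng = np.random.default_rng(2)
bitmask = (rng.random(1_000_000) < 0.04).astype(np.uint8)  # ~4% errors

raw = bitmask.tobytes()
packed = gzip.compress(raw)  # default settings, as in the experiment
print(len(raw), len(packed))
```

Because the bitmask entropy is roughly 0.24 bits per position, the gzipped file ends up at a small fraction of the raw byte array, which is the mechanism behind the gains in table 5.8.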
From these end results, a significant gain is found compared to the output of the machine
learning compression scheme alone. The numbers indicate that with a combined application of
the AE and the general-purpose compression technique, a compression rate in the order of 10:1
is possible. The bpb demonstrated here (just below 1) are very good; they are half of
what the 2 bpb baseline offers, beating many comparable existing compression solutions.
Chromosome | Encoded (MB) | Residue 3 (MB) | Comp. ratio | bpb
hs_alt_CHM1_1.1_chr1.fa | 4.78 | 303.17 | 1.27 | 10.3
hs_alt_CHM1_1.1_chr19.fa | 1.07 | 36.65 | 0.7 | 5.65
hs_alt_CHM1_1.1_chr10.fa | 2.52 | 82.51 | 0.67 | 5.4
hs_alt_CHM1_1.1_chr12.fa | 2.49 | 82.13 | 0.67 | 5.44
hs_alt_CHM1_1.1_chr17.fa | 1.5 | 52.33 | 0.71 | 5.76
hs_alt_CHM1_1.1_chr15.fa | 1.57 | 55.28 | 0.72 | 5.79
hs_alt_CHM1_1.1_chr21.fa | 0.68 | 24.61 | 0.74 | 5.96
hs_alt_CHM1_1.1_chr8.fa | 2.73 | 93.66 | 0.7 | 5.65
hs_alt_CHM1_1.1_chr5.fa | 3.38 | 105.95 | 0.64 | 5.18
hs_alt_CHM1_1.1_chr7.fa | 2.98 | 101.01 | 0.69 | 5.59
hs_alt_CHM1_1.1_chr2.fa | 4.55 | 158.83 | 0.71 | 5.74
hs_alt_CHM1_1.1_chr16.fa | 1.54 | 54.71 | 0.72 | 5.84
hs_alt_CHM1_1.1_chr11.fa | 2.51 | 82.75 | 0.67 | 5.44
hs_alt_CHM1_1.1_chrX.fa | 2.89 | 100.11 | 0.7 | 5.7
hs_alt_CHM1_1.1_chr9.fa | 2.31 | 88.47 | 0.78 | 6.28
hs_alt_CHM1_1.1_chr18.fa | 1.43 | 46.69 | 0.66 | 5.37
hs_alt_CHM1_1.1_chr13.fa | 1.82 | 55.83 | 0.62 | 5.06
hs_alt_CHM1_1.1_chr22.fa | 0.67 | 25.25 | 0.77 | 6.18
hs_alt_CHM1_1.1_chr4.fa | 3.59 | 117.45 | 0.67 | 5.4
hs_alt_CHM1_1.1_chr3.fa | 3.72 | 116.99 | 0.64 | 5.19
hs_alt_CHM1_1.1_chr20.fa | 1.14 | 37.71 | 0.67 | 5.47
hs_alt_CHM1_1.1_chr14.fa | 1.69 | 53.52 | 0.65 | 5.24
hs_alt_CHM1_1.1_chr6.fa | 3.21 | 105.62 | 0.67 | 5.42

Table 5.6: File sizes after encoding using the shallow CAE. On each row a chromosome from the reference genome is used for training, and its counterpart in the alternative genome (whose name is given in this table) is encoded.
Chromosome | Entropy | Error rate
hs_alt_CHM1_1.1_chr1.fa | 0.42 | 0.08
hs_alt_CHM1_1.1_chr19.fa | 0.26 | 0.04
hs_alt_CHM1_1.1_chr10.fa | 0.25 | 0.04
hs_alt_CHM1_1.1_chr12.fa | 0.26 | 0.04
hs_alt_CHM1_1.1_chr17.fa | 0.27 | 0.04
hs_alt_CHM1_1.1_chr15.fa | 0.27 | 0.04
hs_alt_CHM1_1.1_chr21.fa | 0.28 | 0.05
hs_alt_CHM1_1.1_chr8.fa | 0.26 | 0.04
hs_alt_CHM1_1.1_chr5.fa | 0.25 | 0.04
hs_alt_CHM1_1.1_chr7.fa | 0.26 | 0.04
hs_alt_CHM1_1.1_chr2.fa | 0.27 | 0.04
hs_alt_CHM1_1.1_chr16.fa | 0.27 | 0.04
hs_alt_CHM1_1.1_chr11.fa | 0.26 | 0.04
hs_alt_CHM1_1.1_chrX.fa | 0.27 | 0.04
hs_alt_CHM1_1.1_chr9.fa | 0.29 | 0.05
hs_alt_CHM1_1.1_chr18.fa | 0.25 | 0.04
hs_alt_CHM1_1.1_chr13.fa | 0.24 | 0.04
hs_alt_CHM1_1.1_chr22.fa | 0.28 | 0.05
hs_alt_CHM1_1.1_chr4.fa | 0.25 | 0.04
hs_alt_CHM1_1.1_chr3.fa | 0.25 | 0.04
hs_alt_CHM1_1.1_chr20.fa | 0.26 | 0.04
hs_alt_CHM1_1.1_chr14.fa | 0.25 | 0.04
hs_alt_CHM1_1.1_chr6.fa | 0.26 | 0.04

Table 5.7: Residue (form 1) analysis after encoding using the shallow CAE. On each row a chromosome from the reference genome is used for training, and its counterpart in the alternative genome (whose name is given in this table) is encoded.
Chromosome | Comp. ratio | bpb
hs_ref_GRCh38.p2_chr10.fa | 0.1 | 0.85
hs_ref_GRCh38.p2_chr11.fa | 0.1 | 0.84
hs_ref_GRCh38.p2_chr12.fa | 0.1 | 0.83
hs_ref_GRCh38.p2_chr13.fa | 0.1 | 0.82
hs_ref_GRCh38.p2_chr14.fa | 0.11 | 0.85
hs_ref_GRCh38.p2_chr15.fa | 0.11 | 0.88
hs_ref_GRCh38.p2_chr16.fa | 0.11 | 0.86
hs_ref_GRCh38.p2_chr17.fa | 0.11 | 0.87
hs_ref_GRCh38.p2_chr18.fa | 0.1 | 0.82
hs_ref_GRCh38.p2_chr19.fa | 0.14 | 1.1
hs_ref_GRCh38.p2_chr20.fa | 0.11 | 0.87
hs_ref_GRCh38.p2_chr21.fa | 0.11 | 0.87
hs_ref_GRCh38.p2_chr22.fa | 0.11 | 0.92
hs_ref_GRCh38.p2_chr2.fa | 0.1 | 0.82
hs_ref_GRCh38.p2_chr3.fa | 0.1 | 0.8
hs_ref_GRCh38.p2_chr4.fa | 0.1 | 0.8
hs_ref_GRCh38.p2_chr5.fa | 0.1 | 0.82
hs_ref_GRCh38.p2_chr6.fa | 0.12 | 0.96
hs_ref_GRCh38.p2_chr7.fa | 0.1 | 0.83
hs_ref_GRCh38.p2_chr8.fa | 0.1 | 0.82
hs_ref_GRCh38.p2_chr9.fa | 0.1 | 0.83
hs_ref_GRCh38.p2_chrX.fa | 0.1 | 0.84

Table 5.8: Compression results after gzip is applied on top of the bitstreams resulting from the encodings in scenario 1. The compression ratio is the filesize fraction compared to the original FASTA file.
Chapter 6
Conclusion
This final chapter provides a short discussion of the results obtained in this work and notes
some opportunities for future research.
6.1 Discussion of the results
The purpose of this work has been to explore the possibility of constructing a compression
scheme for DNA sequence data using a deep learning approach. A first data analysis has shown
there are few obvious features and patterns present in the source data, hence
an unsupervised technique has been chosen. One particular technique, the Auto-Encoder,
has been selected, as it has been shown capable of being used for data compression.
After the data analysis, a suitable AE architecture has been investigated. This ranged from
a traditional single-hidden-layer AE, to the more powerful convolutional variants, and ended
with a trial of a batch-normalized deep convolutional AE inspired by state-of-the-art research in
deep learning. The resulting best performing model turned out to be the shallow convolutional
AE, demonstrating the ability to compress and reconstruct the sequence with an accuracy of
over 90% while having an architecture that offers a good compression rate. The next part
of this research involved implementing a compression scheme using this CAE to investigate whether
this approach yields a working setup. Several scenarios for evaluating the scheme have been
investigated. The evaluation involved a look at the encoded sequence and the structure of
the error residue, as these make up the encoded bitstream. Even this basic implementation,
with still decent room for improvement, has shown to be capable of lossless compression of
sequence data with a resulting size of 60%-70% of the original in most cases. Further application of a
general-purpose compression scheme such as gzip is shown to lead to an even larger achievable
compression rate, mitigating the simplistic approach to storing the residue files. The combined
steps of the AE compression and gzip lead to a bpb of around 1, which outperforms most of
the existing algorithms.
6.2 Future work
This work can be seen as a feasibility study of applying machine learning techniques to achieve
compression of DNA data. The chosen technique was the (convolutional) Auto-
Encoder. A first option for further research could be to finetune this approach (which is then
an optimization question rather than the exploratory look and prototype of this work).
There are some very recent developments and adaptations of AEs which could be tried out in
order to find the optimal architecture and parameter set for this problem.
Another approach could be to perform feature engineering. A good set of features is still
one of the most important aspects of a successful machine learning application to a problem.1
One attractive property of the successful neural network-based methods is their ability to do
automatic feature learning. There has thus been somewhat of a divide between opposing
schools of thought regarding the importance of human feature engineering. In this work, no
manual work on features is done; the genome sequence is used directly as input to the model.
Future work might investigate whether preprocessing of the data could lead to useful data
characteristics for machine learning.
Lastly, there is another neural network technique which has shown recent success, most
notably in natural language processing: Recurrent Neural Networks (RNNs). RNNs are quite a
recent development, are hard to train and apply successfully, and their workings are not entirely
well understood at this moment. They add a concept of temporal awareness and memory to
neural nets, allowing them to learn and recognize languages, where a particular sequence of
words is important (besides just the set of words). RNNs - and their specific architectural
variant, LSTMs (Long Short-Term Memory networks, [39]) - can be used to automatically
construct sentences or perform text translation. In the latter case, this is done by analyzing
the source text into an internal representation and subsequently expanding this representation
with a network trained on another language, ending up with the translated text. The process
shows similarities to the AE process of compacting to a representation and reconstructing.
1This can be observed in the various Kaggle (machine learning) competition winning submissions and the often public explanations of their approaches.
When interpreting DNA strands as containing some biochemical language, with its own
words, structures and patterns, RNNs might prove to be a successful approach to modelling
DNA.
References
[1] S. Deorowicz and S. Grabowski, “Data compression for sequencing data.”, Algorithms for
molecular biology: AMB, vol. 8, no. 1, p. 25, 2013, issn: 1748-7188. doi: 10.1186/1748-7188-8-25.
[Online]. Available: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3868316&tool=pmcentrez&rendertype=abstract.
[2] J. Bornholt, R. Lopez, D. M. Carmean, and L. Ceze, "A dna-based archival storage sys-
tem", in ASPLOS 2016 (International Conference on Architectural Support for Program-
ming Languages and Operating Systems) - to appear, ACM – Association for Comput-
ing Machinery, Apr. 2016. [Online]. Available: https://www.microsoft.com/en-us/research/publication/dna-based-archival-storage-system/.
[3] J. Ziv and A. Lempel, “A universal algorithm for sequential data compression”, IEEE
Transactions on Information Theory, vol. 23, no. 3, pp. 337–343, May 1977, issn: 0018-
9448. doi: 10.1109/TIT.1977.1055714.
[4] P. G. Howard and J. S. Vitter, “Arithmetic coding for data compression”, Proceedings of
the IEEE, vol. 82, no. 6, pp. 857–865, Jun. 1994, issn: 0018-9219. doi: 10.1109/5.286189.
[5] S. Kuruppu, S. J. Puglisi, and J. Zobel, “Optimized relative lempel-ziv compression of
genomes”, Conferences in Research and Practice in Information Technology Series, vol.
113, no. Acsc, pp. 91–98, 2011, issn: 14451336. doi: 10.1007/978-3-642-16321-0_20.
[6] S. Deorowicz and S. Grabowski, “Compression of dna sequence reads in fastq format”,
Bioinformatics, vol. 27, no. 6, pp. 860–862, 2011. doi: 10.1093/bioinformatics/btr014.
eprint: http://bioinformatics.oxfordjournals.org/content/27/6/860.full.pdf+
html. [Online]. Available: http://bioinformatics.oxfordjournals.org/content/27/
6/860.abstract.
[7] M. Howison, “High-throughput compression of fastq data with seqdb”, IEEE/ACM Trans-
actions on Computational Biology and Bioinformatics, vol. 10, no. 1, pp. 213–218, Jan.
2013, issn: 1545-5963. doi: 10.1109/TCBB.2012.160.
[8] D. C. Jones, W. L. Ruzzo, X. Peng, and M. G. Katze, “Compression of next-generation
sequencing reads aided by highly efficient de novo assembly”, CoRR, vol. abs/1207.2424,
2012. [Online]. Available: http://arxiv.org/abs/1207.2424.
[9] M. Nicolae, S. Pathak, and S. Rajasekaran, “Lfqc: A lossless compression algorithm
for fastq files”, Bioinformatics, vol. 31, no. 20, pp. 3276–3281, 2015. doi: 10 . 1093 /
bioinformatics / btv384. eprint: http : / / bioinformatics . oxfordjournals . org /
content/31/20/3276.full.pdf+html. [Online]. Available: http://bioinformatics.
oxfordjournals.org/content/31/20/3276.abstract.
[10] C. Kozanitis, C. Saunders, S. Kruglyak, V. Bafna, and G. Varghese, “Compressing ge-
nomic sequence fragments using slimgene”, in Proceedings of the 14th Annual International
Conference on Research in Computational Molecular Biology, ser. RECOMB’10, Lisbon,
Portugal: Springer-Verlag, 2010, pp. 310–324, isbn: 3-642-12682-0, 978-3-642-12682-6. doi:
10.1007/978-3-642-12683-3_20. [Online]. Available: http://dx.doi.org/10.1007/
978-3-642-12683-3_20.
[11] M. N. Sakib, J. Tang, W. J. Zheng, and C.-T. Huang, “Improving transmission efficiency
of large sequence alignment/map (sam) files”, PLoS ONE, vol. 6, no. 12, pp. 1–4, Dec.
2011. doi: 10.1371/journal.pone.0028251. [Online]. Available: http://dx.doi.org/
10.1371/journal.pone.0028251.
[12] H. Li, “Tabix: Fast retrieval of sequence features from generic tab-delimited files”, Bioin-
formatics, vol. 27, no. 5, pp. 718–719, 2011. doi: 10.1093/bioinformatics/btq671.
eprint: http://bioinformatics.oxfordjournals.org/content/27/5/718.full.pdf+
html. [Online]. Available: http://bioinformatics.oxfordjournals.org/content/27/
5/718.abstract.
[13] M. H.-Y. Fritz, R. Leinonen, G. Cochrane, and E. Birney, "Efficient storage of high
throughput dna sequencing data using reference-based compression", Genome Research,
vol. 21, pp. 734–740, 2011. doi: 10.1101/gr.114819.110.
[14] M. D. Cao, T. I. Dix, L. Allison, and C. Mears, “A simple statistical algorithm for biolog-
ical sequence compression”, Data Compression Conference Proceedings, pp. 43–52, 2007,
issn: 10680314. doi: 10.1109/DCC.2007.7.
[15] I. Tabus, G. Korodi, and J. Rissanen, “Dna sequence compression using the normalized
maximum likelihood model for discrete regression”, in Data Compression Conference,
2003. Proceedings. DCC 2003, Mar. 2003, pp. 253–262. doi: 10.1109/DCC.2003.1194016.
[16] G. Korodi and I. Tabus, “An efficient normalized maximum likelihood algorithm for dna
sequence compression”, ACM Trans. Inf. Syst., vol. 23, no. 1, pp. 3–34, Jan. 2005, issn:
1046-8188. doi: 10.1145/1055709.1055711. [Online]. Available: http://doi.acm.org/
10.1145/1055709.1055711.
[17] S. Christley, Y. Lu, C. Li, and X. Xie, “Human genomes as email attachments”, Bioinfor-
matics, vol. 25, no. 2, pp. 274–275, 2009. doi: 10.1093/bioinformatics/btn582. eprint:
http://bioinformatics.oxfordjournals.org/content/25/2/274.full.pdf+html.
[Online]. Available: http://bioinformatics.oxfordjournals.org/content/25/2/
274.abstract.
[18] S. Deorowicz, A. Danek, and M. Niemiec, “GDC 2: Compression of large collections of
genomes”, CoRR, vol. abs/1503.01624, 2015. [Online]. Available: http://arxiv.org/
abs/1503.01624.
[19] S. Kuruppu, B. Beresford-Smith, T. Conway, and J. Zobel, “Iterative dictionary con-
struction for compression of large dna data sets”, IEEE/ACM Transactions on Computa-
tional Biology and Bioinformatics, vol. 9, no. 1, pp. 137–149, 2012, issn: 15455963. doi:
10.1109/TCBB.2011.82.
[20] B. Christopher Leela, M. Manu K, V. Vineetha, K. Satheesh Kumar, Vijayakumar, and
A. S. Nair, “Compression of large genomic datasets using comrad on parallel computing
platform”, Bioinformation, vol. 11, no. 5, pp. 267–271, 2015.
[21] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image
recognition”, CoRR, vol. abs/1409.1556, 2014. [Online]. Available: http://arxiv.org/
abs/1409.1556.
[22] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep con-
volutional neural networks”, in Advances in Neural Information Processing Systems 25:
26th Annual Conference on Neural Information Processing Systems 2012. Proceedings of
a meeting held December 3-6, 2012, Lake Tahoe, Nevada, United States., 2012, pp. 1106–
1114. [Online]. Available: http://papers.nips.cc/paper/4824-imagenet-classification-
with-deep-convolutional-neural-networks.
[23] B. Verma, M. Blumenstein, and S. Kulkarni, “A new compression technique using an
artificial neural network”, Journal of Intelligent Systems, vol. 9, no. 1, Jan. 1999. doi:
10.1515/jisys.1999.9.1.39. [Online]. Available: http://dx.doi.org/10.1515/
jisys.1999.9.1.39.
[24] D. P. Dutta, S. D. Choudhury, M. A. Hussain, and S. Majumder, “Digital image com-
pression using neural networks”, in Advances in Computing, Control, Telecommunication
Technologies, 2009. ACT ’09. International Conference on, Dec. 2009, pp. 116–120. doi:
10.1109/ACT.2009.38.
[25] T. D. Kulkarni, W. Whitney, P. Kohli, and J. B. Tenenbaum, “Deep convolutional inverse
graphics network”, CoRR, vol. abs/1503.03167, 2015. [Online]. Available: http://arxiv.
org/abs/1503.03167.
[26] A. Radford, L. Metz, and S. Chintala, “Unsupervised representation learning with deep
convolutional generative adversarial networks”, CoRR, vol. abs/1511.06434, 2015. [On-
line]. Available: http://arxiv.org/abs/1511.06434.
[27] J. Zhao, M. Mathieu, R. Goroshin, and Y. Lecun, “Stacked what-where auto-encoders”,
ArXiv, vol. 1506.0235, 2015.
[28] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol, “Stacked denoising
autoencoders: Learning useful representations in a deep network with a local denoising
criterion”, J. Mach. Learn. Res., vol. 11, pp. 3371–3408, Dec. 2010, issn: 1532-4435.
[Online]. Available: http://dl.acm.org/citation.cfm?id=1756006.1953039.
[29] J. Masci, U. Meier, D. Ciresan, and J. Schmidhuber, “Stacked convolutional auto-encoders
for hierarchical feature extraction”, in Artificial Neural Networks and Machine Learning
– ICANN 2011: 21st International Conference on Artificial Neural Networks, Espoo, Finland,
June 14-17, 2011, Proceedings, Part I, T. Honkela, W. Duch, M. Girolami, and
S. Kaski, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2011, pp. 52–59, isbn:
978-3-642-21735-7. doi: 10.1007/978-3-642-21735-7_7. [Online]. Available: http://dx.doi.org/10.1007/978-3-642-21735-7_7.
[30] A. Rasmus, H. Valpola, M. Honkala, M. Berglund, and T. Raiko, “Semi-supervised learning
with ladder networks”, CoRR, vol. abs/1507.02672, 2015. [Online]. Available: http://arxiv.org/abs/1507.02672.
[31] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout:
A simple way to prevent neural networks from overfitting”, J. Mach. Learn. Res., vol. 15,
no. 1, pp. 1929–1958, Jan. 2014, issn: 1532-4435. [Online]. Available: http://dl.acm.org/citation.cfm?id=2627435.2670313.
[32] G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of data with neural
networks”, Science, vol. 313, no. 5786, pp. 504–507, 2006, issn: 0036-8075. doi: 10.1126/science.1127647.
eprint: http://science.sciencemag.org/content/313/5786/504.full.pdf.
[Online]. Available: http://science.sciencemag.org/content/313/5786/504.
[33] Y. Bengio, “Learning deep architectures for AI”, Found. Trends Mach. Learn., vol. 2,
no. 1, pp. 1–127, Jan. 2009, issn: 1935-8237. doi: 10.1561/2200000006. [Online]. Available:
http://dx.doi.org/10.1561/2200000006.
[34] R. Al-Rfou, G. Alain, A. Almahairi, et al., “Theano: A Python framework for fast
computation of mathematical expressions”, CoRR, vol. abs/1605.02688, 2016. [Online].
Available: http://arxiv.org/abs/1605.02688.
[35] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level
performance on ImageNet classification”, CoRR, vol. abs/1502.01852, 2015. [Online].
Available: http://arxiv.org/abs/1502.01852.
[36] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization”, CoRR, vol.
abs/1412.6980, 2014. [Online]. Available: http://arxiv.org/abs/1412.6980.
[37] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception
architecture for computer vision”, CoRR, vol. abs/1512.00567, 2015. [Online]. Available:
http://arxiv.org/abs/1512.00567.
[38] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by
reducing internal covariate shift”, CoRR, vol. abs/1502.03167, 2015. [Online]. Available:
http://arxiv.org/abs/1502.03167.
[39] S. Hochreiter and J. Schmidhuber, “Long short-term memory”, Neural Comput., vol. 9,
no. 8, pp. 1735–1780, Nov. 1997, issn: 0899-7667. doi: 10.1162/neco.1997.9.8.1735.
[Online]. Available: http://dx.doi.org/10.1162/neco.1997.9.8.1735.