Non-reference-based DNA sequence compression using machine learning techniques

Mathieu Hinderyckx

Supervisors: Prof. dr. Wesley De Neve, Prof. dr. ir. Joni Dambre
Counsellors: Ruben Verhack, Tom Paridaens, Lionel Pigou

Master's dissertation submitted in order to obtain the academic degree of Master of Science in Computer Science Engineering

Department of Electronics and Information Systems
Chair: Prof. dr. ir. Rik Van de Walle
Department of Environmental Technology, Food Technology and Molecular Biotechnology
Chair: Prof. dr. Jozef Vercruysse

Faculty of Engineering and Architecture
Academic year 2015-2016
Preface
Acknowledgments
This work forms the culmination of everything I (should) have learned during my time and education as an engineer in Computer Sciences at Ghent University. It has definitely been a challenging task, and a few people were essential in completing it. First of all, I would like to thank, in one breath, Ghent University in general and my parents, for offering me a quality education, a great time as a student, and the opportunity for this research as my final project. More specifically, I would like to thank Lionel Pigou of the Reservoir Lab and both Ruben Verhack and Tom Paridaens from the MMlab at the university for their role as counsellors and their day-to-day assistance and oversight of this project. While it has not always been the easiest of times, their help was very welcome and essential in approaching this, at times, daunting topic. Lastly, I would give many thanks to my friends and colleagues: both to the great students, who finish everything on time and still found some moments to help and advise on this work, and to the possibly even greater ones who also found a challenging task in completing their own projects, and went through this long project together with me.
Permission for usage
The author gives permission to make this master dissertation available for consultation and to copy parts of this master dissertation for personal use. In the case of any other use, the copyright terms have to be respected, in particular with regard to the obligation to state expressly the source when quoting results from this master dissertation.
Mathieu Hinderyckx,
August 8, 2016
Non-reference-based DNA sequence compression using machine learning techniques

Mathieu Hinderyckx
Supervisors: Prof. dr. Wesley De Neve, Prof. dr. ir. Joni Dambre
Counsellors: Ruben Verhack, Tom Paridaens, Lionel Pigou
Abstract—With the continuous development of the medical and technological world, a high interest in the analysis of DNA data has emerged in the last decades, leading to an explosive growth in the gathering of sequence data. This growth outpaces developments in electronic storage technology, so the need arises for a solution to deal with this abundance of data. One of the approaches is data compression, which has seen wide use in all kinds of electronic data processing. In this work, a data compression scheme is explored for lossless compression of sequence data. The scheme is built around an Auto-Encoder, a machine learning technique. Several models are constructed and evaluated, based on recent developments in deep learning. The best model achieves a reconstruction accuracy of over 95% in selected scenarios. After selection of this convolutional Auto-Encoder architecture, the model is trained on chromosomes from the human genome, and used to compress sequence data from an alternative human genome in various configurations. The compression scheme is shown to be able to compress the sequence to 60%-70% of its uncompressed file size. While functional, this proof-of-concept implementation does not outperform general-purpose compression techniques, currently the most widely used approach for sequence data. However, the combination of these techniques in tandem offers a competitive solution, outperforming most of the existing solutions. Future work could fine-tune and improve this proof of concept, or explore compression using predictive coding by applying LSTM neural networks.
Keywords—Convolutional Auto-Encoder, DNA sequence, compression, deep learning
I. INTRODUCTION
The genome of an organism contains its genetic material: the DNA within that organism, composed of nucleotides (bases) A (Adenine), C (Cytosine), G (Guanine) and T (Thymine). This biochemical component of every living organism contains a treasure of largely untapped information about the organism. With the establishment and development of advanced technology in the bioinformatics discipline, attempts are made to unravel the information contained within. This requires an electronic representation of this biochemical material, ranging from a few to a few hundred gigabytes per human genome. The decreasing costs of and growing interest in bioinformatics have led to an explosive growth in DNA sequencing, significantly outpacing the required developments in electronic storage technology. Several approaches are being investigated in order to deal with this data abundance. Among these are a blind growth in storage expenses, the triage of collected data, the storage of physical samples instead, and most prominently the use of
electronic data compression. As (sequenced and aligned) DNA in its raw representation is a long string of characters, compression using generic approaches is possible. However, compression of sequence data might take advantage of the natural and biological characteristics of DNA material, notably the repeated content and the often very close relationships between existing reference sequences. Both lossless and lossy compression techniques exist, the latter sometimes offering a user-specified trade-off between compression ratio and information loss. Many compression techniques have been investigated, and are currently being developed, to address the compression of sequencing data. This is one of the most promising tracks for facing the issue of high storage requirements, as it does not necessarily involve data loss. This paper explores the option of data compression for non-reference-based genome sequences. It applies Auto-Encoders, a technique from the neural network field in machine learning which has been shown in previous applications to be a viable compression scheme.
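Since a raw sequence is just a long character string, generic compressors apply directly. A toy illustration with the standard-library gzip module; the random bases are a hypothetical stand-in for real genome data, so the repeated biological structure mentioned above is deliberately absent:

```python
import gzip
import random

random.seed(0)
# A toy 100 kB "sequence" of uniformly random bases: a worst case for a
# generic compressor, since there is no repeated structure to exploit.
sequence = "".join(random.choice("ACGT") for _ in range(100_000)).encode("ascii")

compressed = gzip.compress(sequence)
ratio = len(compressed) / len(sequence)
# Uniform bases carry 2 bits of information each, so even an ideal coder
# cannot go below 0.25 of the 8-bit ASCII size; gzip lands near that bound.
print(f"compressed to {ratio:.2%} of original size")
```

On real genome text, repeats and skewed base statistics let compressors do somewhat better than this entropy-limited toy case.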
II. RELATED WORK
A lot of research has been published both on compression algorithms for DNA and on machine learning, neural networks and deep learning. Deep learning in particular is currently an extremely active area of research with many applications in several domains.
A. DNA Compression

Sequence data can be stored using several formats. A major distinction is the difference between reads and sequences. Reads are the result of the sequencing process: files containing short, overlapping fragments read from the physical sample. All of these reads are then aligned and assembled to form one contiguous sequence. A second distinction is reference-based versus non-reference-based storage. Reference-based operation indicates that the data is stored as a differential from a gold-standard genome, while non-reference-based systems store the files to be used standalone. Compression algorithms have been devised for combinations of these two scenarios. The lack of a benchmark dataset, the closed-source nature of some tools and the incompatible feature sets of the various solutions make comparing them very hard. No solution is guaranteed to work well, and most of the time general-purpose compression is applied to the sequences, which does achieve reasonable performance. A survey of the techniques in use is offered by [1] and [2].
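The reference-based idea can be sketched in a few lines: store only the positions where a target sequence deviates from a shared reference. Real tools handle alignment, insertions and deletions; this hypothetical sketch assumes equal-length, pre-aligned sequences with substitutions only:

```python
def diff_encode(reference: str, target: str):
    """Store only the (position, base) pairs where target deviates from
    the reference. Assumes aligned sequences of equal length."""
    return [(i, b) for i, (a, b) in enumerate(zip(reference, target)) if a != b]

def diff_decode(reference: str, diffs):
    """Rebuild the target by applying the stored substitutions."""
    seq = list(reference)
    for i, base in diffs:
        seq[i] = base
    return "".join(seq)

reference = "ACGTACGTACGT"
target    = "ACGAACGTACTT"  # two substitutions relative to the reference
diffs = diff_encode(reference, target)
print(diffs)  # → [(3, 'A'), (10, 'T')]
assert diff_decode(reference, diffs) == target  # lossless round trip
```

The closer the target is to the reference, the smaller the diff list, which is exactly the property reference-based tools exploit; non-reference-based schemes such as the one in this work cannot rely on it.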
B. Image processing with machine learning

Artificial Neural Networks (ANNs) have proven to be extremely successful and have led to the development of the deep learning discipline. They have been tried on a variety of problems, and have proven superior to traditional methods in many situations.
III. METHODOLOGY
In this work, an Auto-Encoder is trained to construct a compression method. An input file is encoded to a bitstream suitable for transfer, and decoded again. The actual encoding and decoding is done by an Artificial Neural Network. Compression is achieved by introducing a bottleneck in the ANN architecture. A large input sequence is represented as a compressed, coded representation learned by the network, plus a residue necessary to correct erroneous reconstructions. By training the network on DNA data, the aim is to let it learn the implicit structure of the input data, so it can create a good coded representation and effectively perform reconstruction. The parameters of the encoder and decoder modules (i.e. the weights of the ANN) are not included in the bitstream to be transferred. They are fixed (either to a hardcoded set of values or to a reprogrammable set) during operation. As large ANNs often contain a huge number of parameters, including these in the bitstream for network transfer might not be efficient. Having a fixed set of parameters also opens the possibility of efficient hardware implementations (e.g. FPGAs, ASICs and coprocessors, as currently used for image processing). The encoder and decoder processes thus share the network weights and architecture.
Fig. 1: Overview of encoding/decoding process
Figure 1 shows a block diagram of the encoding and decoding process. As a first step, the source file is read and preprocessed to a format suitable for the encoder to accept as input. This input is fed to the encoding part of the AE. This Encoder block is a simplified representation here (and is discussed below). The encoding block generates a compressed representation of the input, which is used as the codeword. The codeword is fed to the decoder, which tries to reconstruct the input, possibly making an incorrect reconstruction. This reconstruction is compared with the original input, and the differences are stored in a residue. The codeword and residue are combined into the bitstream, which is the output of the encoding process. The decoder performs the same operation as the final half of the encoding process in order to reconstruct the input file.
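The codeword-plus-residue construction can be sketched independently of any particular network. The "encoder" and "decoder" below are deliberately crude stand-ins for the trained CAE; the point of the sketch is that the residue makes the round trip lossless regardless of how imperfect the decoder is:

```python
def decode_codeword(codeword: str, length: int):
    # Stand-in decoder: repeat each kept base; wrong wherever bases change.
    out = []
    for b in codeword:
        out.extend([b, b])
    return "".join(out[:length])

def encode(sequence: str):
    # Stand-in encoder: keep every other base as the "codeword".
    # (The real scheme uses the learned, much smaller CAE representation.)
    codeword = sequence[::2]
    reconstruction = decode_codeword(codeword, len(sequence))
    # Residue: the positions (and true bases) the decoder gets wrong.
    residue = [(i, b) for i, b in enumerate(sequence) if reconstruction[i] != b]
    return codeword, residue

def decode(codeword: str, residue, length: int):
    seq = list(decode_codeword(codeword, length))
    for i, b in residue:
        seq[i] = b  # the residue corrects every erroneous position
    return "".join(seq)

original = "AACCGGTTACGT"
codeword, residue = encode(original)
print(residue)  # → [(9, 'C'), (11, 'T')]
assert decode(codeword, residue, len(original)) == original  # lossless
```

The better the (real) decoder reconstructs, the smaller the residue, and the smaller the total bitstream; a perfect decoder would leave an empty residue.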
IV. RESULTS
A. Machine learning model
During the machine learning part of the project, an Auto-Encoder (AE) architecture was constructed to perform the compression task. Starting from a traditional AE with a single fully-connected hidden layer which encodes a single base, and ending with a deep (batch-normalized) convolutional AE encoding a sequence of hundreds of bases into two values, the end result is a network capable of achieving a very good compression rate with a correspondingly good reconstruction accuracy. The traditional AEs with only fully-connected layers are unable to perform the task and show an underfitting problem. Only when moving to (variations on) convolutional AEs does the performance become very good, and these are the models of interest. This section concludes with a comparison of these CAE variations.
Several architectural variations of CAEs are now trained for 50000 updates, each having the same settings for input sequence length and encoding units. Figure 2 plots the resulting comparison of the reconstruction accuracy and cross-entropy loss for the validation set. A first conclusion is that the model using the ReLu activation function (BN-CAE(ReLu)) does not work. The cross-entropy quickly starts to diverge and the corresponding accuracy drops to under 50%. This leaves the CAEs using sigmoid activations. The batch-normalized variation (BN-CAE) initially performs best. However, even before 10000 updates the model shows signs of overfitting. The loss function increases after a minimum at around 7500 updates, and the accuracy drops well before that, ending with a score which does outperform the deep CAE on which it is based, but does not match the shallow CAE. The desired regularization effect of Batch Normalization is not immediately observed in this case. The deep CAE shows well-known learning behaviour: the accuracy rises and the loss function decreases, up until the overfitting stage occurs. At 10000-15000 updates, clear signs of overfitting show up, and the performance gets worse from there on. The model is outperformed by both its shallow predecessor and its batch-normalized successor. The clear winner here is the shallow CAE, also displaying the well-known learning behaviour. The loss function decreases and at about 30000 updates starts to increase due to overfitting, ending with a score similar to the BN-CAE. The accuracy rises to over 96%, and after overfitting occurs ends up slightly below that number. It outperforms all
of its successors, and consequently, this is the model to applyin the next step.
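The pattern described above, a validation loss that reaches a minimum and then rises, is the classic motivation for early stopping, i.e. keeping the checkpoint at the validation minimum. A framework-agnostic sketch, with a hypothetical stand-in loss curve instead of an actual training loop:

```python
def train_with_early_stopping(validation_losses, patience=3):
    """Stop once the validation loss has not improved for `patience`
    consecutive checks, and report the best checkpoint seen.
    `validation_losses` stands in for evaluating the model on the
    validation set after each batch of updates."""
    best_loss, best_step, waited = float("inf"), -1, 0
    for step, loss in enumerate(validation_losses):
        if loss < best_loss:
            best_loss, best_step, waited = loss, step, 0
        else:
            waited += 1
            if waited >= patience:
                break  # stop training; keep the checkpoint from best_step
    return best_step, best_loss

# Toy curve mimicking the overfitting shape: minimum at checkpoint 3,
# then divergence (values are illustrative, not measured).
losses = [0.60, 0.45, 0.38, 0.35, 0.37, 0.41, 0.48, 0.55]
step, loss = train_with_early_stopping(losses)
print(step, loss)  # → 3 0.35
```

In the experiments above, the reported scores are what the curves end with after 50000 updates; stopping at the validation minimum would report each model at its best instead.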
Fig. 2: Evaluation of best-performing networks
B. Compression scheme

Having determined a successful model in the previous section, it is used to implement a compression scheme by applying the CAE to the DNA sequences.
V. CONCLUSION
The purpose of this work has been to explore the possibility of constructing a compression scheme for DNA sequence data using a deep learning approach. A first data analysis showed that there are few obvious features and patterns present in the source data; hence an unsupervised technique was chosen. One particular technique, the Auto-Encoder, was selected, as it has been shown to be capable of being used for data compression. After the data analysis, a suitable AE architecture was investigated. This ranged from a traditional single-hidden-layer AE, to the more powerful convolutional variants, and ended with a trial of a batch-normalized deep convolutional AE inspired by state-of-the-art research in deep learning. The resulting best-performing model turned out to be the shallow convolutional AE, demonstrating the ability to compress and reconstruct the sequence with an accuracy of over 90% while having an architecture that offers a good compression rate. The next part of this research involved implementing a compression scheme using this CAE to investigate whether this approach yields a working setup. Several scenarios for evaluating the scheme have been investigated. The
Fig. 3: Compression statistics of bitstream encodings followed by gzip compression.
evaluation involved a look at the encoded sequence and the structure of the error residue, as these make up the encoded bitstream. Even this basic implementation, which still leaves decent room for improvement, has been shown to be capable of lossless compression of sequence data, with a resulting compression to 60%-70% in most cases. Further application of a general-purpose compression scheme such as gzip is shown to lead to an even larger achievable compression rate, mitigating the simplistic approach to storing the residue files. The combined steps of the AE compression and gzip lead to a bpb of around 1, which outperforms most existing algorithms.
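The bits-per-base (bpb) figure follows directly from the compressed size; a quick computation with hypothetical numbers chosen to land at the quoted value of around 1:

```python
def bits_per_base(compressed_bytes: int, num_bases: int) -> float:
    """Bits per base: total compressed size in bits over sequence length."""
    return compressed_bytes * 8 / num_bases

# Hypothetical example: a 240 Mbase chromosome compressed to 30 MB.
bpb = bits_per_base(30_000_000, 240_000_000)
print(f"{bpb:.2f} bpb")  # → 1.00 bpb
```

For reference, plain one-byte-per-character text is 8 bpb and a naive 2-bit packing is 2 bpb, so a bpb near 1 means the scheme captures structure beyond the raw symbol entropy.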
ACKNOWLEDGMENT
The author would like to thank Ghent University for offering me, first of all, a quality education, and finally the opportunity for this research. More specifically, he would like to thank Lionel Pigou of the Reservoir Lab and both Ruben Verhack and Tom Paridaens from the MMlab for their role as counsellors and their day-to-day assistance and oversight of this project. Lastly, many thanks to my friends and colleagues who offered help, advice and a welcome break at times.
REFERENCES
[1] T. Snyder, "Overview and Comparison of Genome Compression Algorithms," University of Minnesota, 2012.
[2] S. Wandelt et al., "Trends in Genome Compression," Current Bioinformatics, vol. 9, no. 5, pp. 1-24, 2013.
Contents
Preface iv
1 Introduction 1
1.1 Problem statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Content of this work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2 Related work 8
2.1 DNA compression algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2 Image manipulation with machine learning . . . . . . . . . . . . . . . . . . . . . 15
2.3 Auto-Encoders as feature learning models . . . . . . . . . . . . . . . . . . . . . . 17
3 Theory 21
3.1 Brief introduction to machine learning . . . . . . . . . . . . . . . . . . . . . . . . 21
3.2 Introduction to the concepts of (Artificial) Neural Networks . . . . . . . . . . . . 26
3.2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.2.2 Relevant techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.3 Selected technique: Auto-Encoders . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.3.1 Denoising Auto-Encoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.3.2 Convolutional Auto-Encoder . . . . . . . . . . . . . . . . . . . . . . . . . 37
4 Methodology 39
4.1 Structural overview of the proposed compression scheme . . . . . . . . . . . . . . 39
4.1.1 Encoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.1.2 Decoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.1.3 Network view . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.2 Technical aspects & design decisions . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.2.1 Data acquisition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.2.2 Data preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.2.3 Network training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.2.4 Compression-Accuracy trade-off . . . . . . . . . . . . . . . . . . . . . . . . 45
4.2.5 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.2.6 Software & Hardware . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
5 Results 51
5.1 Data Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5.2 Baseline comparison: state of the art . . . . . . . . . . . . . . . . . . . . . . . . . 54
5.3 Auto-Encoder model construction . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
5.3.1 Shallow fully-connected Auto-Encoder . . . . . . . . . . . . . . . . . . . . 57
5.3.2 Deep fully-connected Auto-Encoder . . . . . . . . . . . . . . . . . . . . . 60
5.3.3 Shallow Convolutional Auto-Encoder . . . . . . . . . . . . . . . . . . . . . 60
5.3.4 Deep Convolutional Auto-Encoder . . . . . . . . . . . . . . . . . . . . . . 64
5.3.5 Batch-Normalized ReLu CAE . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.3.6 Model comparison, selection and discussion . . . . . . . . . . . . . . . . . 67
5.4 Compression scheme implementation . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.4.1 Scenario 1: chromosome Ref-A on chromosome Ref-B . . . . . . . . . . . 71
5.4.2 Scenario 2: chromosome Ref-A on chromosome Alt-A . . . . . . . . . . . . 72
5.4.3 Additional general-purpose compression . . . . . . . . . . . . . . . . . . . 73
6 Conclusion 78
6.1 Discussion of the results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
6.2 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
References 81
List of Figures
3.1 Illustration of under- and overfitting: curve (polynomial function) fitting to a
data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.2 Artificial neuron . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.3 Neural Network structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.4 Comparison of the input-output relation of some activation functions used in
ANNs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.5 Auto-Encoder as a constraint on a neural net architecture . . . . . . . . . . . . 34
4.1 Block diagram of encoding/decoding process . . . . . . . . . . . . . . . . . . . . . 40
4.2 Base sequence representation as 5-channel cubes. . . . . . . . . . . . . . . . . . . 44
4.3 Schematic display of three different evaluation scenarios. The grey shaded part
is the data used to train the model, and the arrows point to the (test) data which
is compressed using the trained model. . . . . . . . . . . . . . . . . . . . . . . . . 47
5.1 Occurrence of single bases in the full human reference genome. Error bars indicate
the occurrence minima and maxima in separate chromosomes. . . . . . . . . . . 54
5.2 Occurrence of base pairs in the full human reference genome. Error bars indicate
the occurrence minima and maxima in separate chromosomes. . . . . . . . . . . 55
5.3 Occurrence of codons (base triplets) in the full human reference genome. Error
bars indicate the occurrence minima and maxima in separate chromosomes.
Codon labels are omitted for clarity. . . . . . . . . . . . . . . . . . . . . . . . . . 55
5.4 Occurrence of groups of 2 bases in the reference genome, compared with their
expected frequency. The bars show the actual frequency of the group, the mark
shows the frequency which is expected based on the single-base frequencies. . . . 56
5.5 Occurrence of groups of 3 bases in the reference genome, compared with their
expected frequency. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
5.6 Shallow traditional fully-connected AE with single-base input and two encoding
neurons. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
5.7 Reconstruction accuracy and loss function of shallow fully-connected AE with
single-base input and two encoding units. Above: independent weights, below:
tied weights. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
5.8 Variation on the shallow traditional fully-connected AE with 100 bases as input
and two encoding units. (1) indicates an increase in input length. (2) indicates
an increase in encoding units. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
5.9 Reconstruction accuracy and loss function of shallow fully-connected AE with
input sequence length of 100 bases and two encoding units. Above: independent
weights, below: tied weights. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5.10 Deep fully-connected AE with a sequence input length of 100 bases and two
hidden units in the encoding layer. . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5.11 Reconstruction accuracy and loss function of deep fully-connected AE with
input sequence length of 100 bases and two encoding units. Above: independent
weights, below: tied weights. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
5.12 Shallow convolutional AE structure. . . . . . . . . . . . . . . . . . . . . . . . . . 64
5.13 Reconstruction accuracy and loss function of shallow convolutional AE with
input sequence length of 400 bases and two encoding units. Above: independent
weights, below: tied weights. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
5.14 Deep CAE architecture. Symmetrical decoding part omitted for readability. . . 66
5.15 Reconstruction accuracy and loss function of deep CAE with input sequence
length of 400 bases and two encoding units trained with tied weights. . . . . . . . 67
5.16 Reconstruction accuracy and loss function of Batch Normalized deep CAE with
input sequence length of 400 bases and two encoding units trained with tied
weights. Above: ReLu activations, below: Sigmoid activations. . . . . . . . . . . 68
5.17 Performance comparison of variations on the CAE with input sequence length of
400 bases and two encoding units, all trained with tied weights. . . . . . . . . . . 70
List of Tables
5.1 Chromosomes and their file content in the reference genome. The entropy is cal-
culated on the contained sequence, and not on the data stream which includes
metadata. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
5.2 General-purpose compression software on chromosome FASTA files of the human
reference sequence. The Compression column is the 7-Zip filesize compared to
the uncompressed file, and the bpb column is the bits per base achieved by the
7-Zip compression. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
5.3 Overview of the Auto-Encoders considered with some key characteristics. (tied)
indicates the weights of the decoder and encoder parts are shared. The compression
column gives the ratio of the encoding units to the input sequence (at 3 bits
per base). It does not include the residue at this point. . . . . . . . . . . . . . . . 71
5.4 File sizes after encoding using the shallow CAE trained on chromosome 1 of the
reference genome. The compression column shows the ratio of the bitstream
filesize to the original FASTA. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.5 Residue (form 1) analysis after encoding using the shallow CAE trained on chro-
mosome 1 of the reference genome. . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.6 File sizes after encoding using the shallow CAE. On each row a chromosome from
the reference genome is used for training, and its counterpart in the alternative
genome (whose name is given in this table) is encoded. . . . . . . . . . . . . . . 75
5.7 Residue (form 1) analysis after encoding using the shallow CAE. On each row a
chromosome from the reference genome is used for training, and its counterpart
in the alternative genome (whose name is given in this table) is encoded. . . . . 76
5.8 Compression results after gzip is applied on top of the bitstreams resulting from
the encodings in scenario 1. Compression ratio is the filesize fraction compared
to the original FASTA file. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
Chapter 1
Introduction
1.1 Problem statement
Context The genome of an organism contains its genetic material: the DNA (Deoxyribonucleic acid) within that organism. This DNA is composed of nucleotides (bases), represented by the characters A (Adenine), C (Cytosine), G (Guanine) and T (Thymine). This biochemical component of every living organism contains a treasure of largely untapped information about the organism. With the establishment and development of advanced technology in the bioinformatics discipline, attempts are made to unravel the information contained within. This requires an electronic representation of this biochemical material. The process of converting the biochemical, physical DNA to a data file is called sequencing. The human genome is on average about 3 GB of uncompressed data, but a single uncompressed genome generated by sequencing institutes can take up as much as nearly 300 GB.
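The ~3 GB figure follows from simple arithmetic: roughly 3.2 billion bases (an approximate figure), stored as one ASCII character each. Packing at the 2-bit entropy bound for four symbols shows how much slack the plain-text representation leaves:

```python
BASES = 3_200_000_000          # approximate length of the human genome

ascii_bytes = BASES * 1        # one 8-bit character per base (plain text)
packed_bytes = BASES * 2 // 8  # 2 bits per base: the bound for 4 symbols

print(ascii_bytes / 1e9)   # → 3.2  (GB, matching the ~3 GB quoted above)
print(packed_bytes / 1e9)  # → 0.8  (GB, a 4x reduction before any modelling)
```

The near-300 GB figure for sequencer output is larger still because it includes raw reads with heavy redundancy and per-base metadata such as quality scores, not just the assembled sequence.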
Problem The cost of sequencing a single human genome has dropped in the past decade from tens of millions of dollars to about ten thousand dollars. The technological improvements and maturity of the second-generation sequencing platforms (which are currently in use) have led to equipment costing well under one million dollars and available in many scientific institutes. The upcoming third generation will be even cheaper, making personalized medicine available for the masses. These technologies are responsible for an ever-increasing number of genomes being sequenced, both from new species and from more individuals of a given species. Looking at the data produced by these institutes, the numbers are growing rapidly. The world's biggest sequencing institute, operating over 180 sequencers, produces a total amount of data on the order of 10 PB per year and growing. The storage requirements for the output
of high-throughput sequencing instruments fall in the range of 50-100 PB per year. Viewing the genomic data growth in the first decade of the century, it is observed that progress in computer hardware lags behind: the need for storage outpaces the growth in storage capacity technology. At the same time, many programs aimed at collecting large amounts of sequence data are being introduced. For example, the Million Veteran Program (led by the US Department of Veterans Affairs) will produce a total of about 250 PB of raw data over the span of 5-7 years. As these large-scale projects emerge, the storage concerns become an increasingly high priority. Several approaches, none of them mutually exclusive, can be taken in dealing with this growth in the creation of sequence data (plus derived data and metadata).
A first option is simply to add storage capacity. Prior to 2005, the increase rate of sequencing capacity closely followed the rate of increase in storage capacity; both doubled around every 18 months. With a stable budget, production sequencing facilities and archival databases could match the storage hardware requirements. However, the current trend is that the cost of sequencing a single base halves roughly every 8 months, while the cost of hard disk space has been halving every 25 months for the last few years. Even when applying general-purpose compression algorithms, the increasing rate of sequencing significantly outpaces the storage growth. This mismatch between technologies means that either a reduction in stored sequence data must occur, or a progressive increase in storage costs is required, the latter seeming unlikely and unattractive. Looking at the technological trends of the last decade, one can observe a growth in the availability of distributed and high-capacity computing facilities. Given the requirements of this task, such data centers are much better suited for storage than local infrastructure, as keeping even a small number of fully-sequenced, uncompressed genomes on the same machine is unrealistic. Next to storage capacity, bandwidth is a concern as well: network capacity plays a significant role in these applications, and transferring sequences between machines over a small-bandwidth link is out of the question for real-life applications. Among the largest data centers and cloud computing facilities currently available are Amazon S3 and Amazon EC2 (data of 2013, [1]). While the cost of sequencing a human genome in January 2013 was about 5700 dollars, the cost of one year of storage and 15 data downloads was about 1500 dollars. These numbers clearly show that the costs of data storage and transfer will become comparable to the costs of the actual sequencing in the near future. With a growing interest in personalized medicine, these costs - besides the
performance limitations of current technology - will be a significant obstacle if not faced properly.
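The mismatch between the two halving rates quoted above compounds quickly; a back-of-the-envelope projection over five years, using the 8-month and 25-month figures (the five-year horizon is an illustrative choice):

```python
def cost_multiplier(months: int, halving_period: float) -> float:
    """Remaining cost fraction after `months`, for a cost that halves
    every `halving_period` months."""
    return 0.5 ** (months / halving_period)

MONTHS = 60  # five years
seq_cost = cost_multiplier(MONTHS, 8)    # sequencing: halves every ~8 months
disk_cost = cost_multiplier(MONTHS, 25)  # disk space: halves every ~25 months

# Per unit of budget, sequencing output grows this many times faster than
# the storage the same budget buys:
gap = disk_cost / seq_cost
print(f"{gap:.0f}x")  # → 34x
```

So a facility whose sequencing budget keeps pace with falling sequencing costs would need roughly an order-of-magnitude-plus more storage spending within five years just to keep archiving everything, which is why the text calls a blind increase in storage costs unattractive.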
A second option is to throw away some data, known as triage. Some proposals regarding this approach have been put forward. These include storing the physical sample instead of the digitally sequenced version, discarding old data, discarding data which could be regenerated, and not storing raw data but rather limiting storage to analysis results. Common ground in these suggestions is the implied ability to regenerate data from samples at any point. With large-scale, internationally distributed sequencing projects this is infeasible, however, so the need arises to store the data electronically. These large volumes of data must be available for analysis during at least the project runtime (often several years) and preferably for possible follow-up projects. Given the exponential decrease in storage costs, the total cost of storing data is dominated by its early years. Observing the continuous large investments in variation and cancer sequencing, it also seems inappropriate to limit the possibilities of re-analysis for the sake of these storage costs; compromising possible medical breakthroughs due to a storage cost limitation is rather difficult to accept. The feasibility of storing and re-acquiring clinical samples is a concern as well. Many samples have low actual DNA content and cannot be distributed freely in the future. Some rare samples (which are often of the most interest) are non-renewable, and their availability thus depends solely on electronic archiving. Even renewable samples can be very cost-intensive to reproduce, and due to some inherent randomness in the sequencing process, reproducing the exact same raw data is nearly impossible. Without reproducibility - one of the main principles of the modern scientific method - the approach of keeping only the physical samples raises strong concerns. Finally, worldwide cooperation in this field can be severely limited by the complexity of the long-term operation of physical storage, distribution and end-point sequencing. So, selective (physical) storage and discarding old data is a rather radical approach, raising major methodological doubts from a scientific and research point of view.
A third option is to compress the electronic data. As (sequenced and aligned) DNA in its
raw representation is a long string of characters, compression using generic approaches is
possible. However, compression of sequence data can also take advantage of the natural and
biological characteristics of DNA material, notably its repeated content and the often very
close relationships between existing reference sequences. Both lossless and lossy compression
techniques exist, the latter sometimes offering a user-specified trade-off between compression
ratio and information loss. A controlled loss of precision can be acceptable, depending on the
scientific and application requirements. A key feature of reference-based compression is that
performance can increase with the growth in sequencing technology and projects, as it can
exploit the growing redundancy therein. Many compression techniques have been investigated,
and are currently being developed, to address the compression of sequencing data. This is one
of the most promising tracks for dealing with high storage requirements, as it does not
necessarily involve data loss.
DNA data
As this is a research work on DNA data, this subsection provides a very concise discussion
of DNA and sequencing techniques. The biochemical aspects are not relevant to this work and
are therefore omitted; only the necessary technical concepts are touched upon here.
DNA is a biochemical molecule that carries the biological and genetic information about a
living organism, used for its development, reproduction... DNA molecules in their physical form
consist of two strands coiled around each other into a double helix structure. These strands
are made up of nucleotides. Each nucleotide consists of (among other elements) a nucleobase:
cytosine (C), guanine (G), adenine (A), or thymine (T). DNA has several uses nowadays in
biochemical and medical technology. An important use is genetic engineering: the application
of man-made (recombinant) DNA which is extracted from an organism and modified to create
another organism, such as disease-resistant agricultural products or medical products. DNA
profiling (used in the forensics domain) is a method of identifying an individual from a small
amount of biological material, useful for identifying criminal perpetrators from evidence,
identifying victims of mass-scale incidents, and determining paternity relations. Interesting
in the light of this work is the recent research into using DNA as an archival storage system,
where the extremely dense structure of DNA (up to 1 exabyte per cubic millimeter) is used to
archive digital data in a key-value store ([2]).
The interdisciplinary field of research regarding techniques for storing, data mining and
manipulating biological data is called bio-informatics. The specific characteristics of this
data have led to advances in general computer science, machine learning, database technology,
string searching and matching algorithms... By aligning DNA sequences with other sequences,
specific mutations and distinctions can be discovered, e.g. the presence or the likelihood of
developing a certain - possibly hereditary - disease. Gene-finding algorithms search for
patterns in the data, which can in turn be related to specific functions in organisms,
evolutionary development... Since the first method for determining DNA sequences in 1970, a
wide variety of advanced techniques has been developed.
One major distinction in these technologies worth discussing here is the difference between
reads and sequences. When sequencing a DNA sample, the result is a collection of millions of
short DNA sequences called reads, ranging from 30 to 30000 bases long. These reads are short,
repeated and shifted parts of the sequence. Reads require sequence assembly before most actual
genome analysis occurs.1 The assembly, alignment and sequence reconstruction of these reads is
an extremely challenging computational task for even the most advanced approaches. Several
assembler algorithms have been developed for this purpose, making use of greedy algorithms and
dynamic programming methods. The output of these assemblers is one contiguous, aligned
sequence, which is then called the full genome or sequence. The domain of bio-informatics is a
field of highly active research: new and improved methods are being developed for the data
extraction of physical samples, alignment procedures, and data mining of the information
contained in the material - both in reads and in sequences.
A second major distinction can be made between reference-based and non-reference-based
storage of DNA material. When sequences are stored and processed independently of other
material, the system is called non-reference-based: each sequence contains the full information
on the content of its DNA material. The other possibility is storing the file relative to a
reference sequence. As many DNA sequences show very high amounts of similarity, this can be
exploited by storing only the differences and modifications of a particular strain compared to
a well-known reference sequence. This reference serves as the gold standard of a DNA strain
for a particular species. When a large set of sequences must be stored, reference-based storage
can greatly reduce storage requirements by omitting the redundant part of the DNA shared by
all of them. A disadvantage of this approach is the need to agree upon the gold standard,
which often changes with technological improvements and thus requires precise version control.
1One analogy for this situation: the reads resemble the result of shredding multiple copies of a book. They might contain only part of a sentence, or possibly an entire paragraph. Furthermore, many of the sentences will be found repeatedly among the reads, without any notion of their position in the original text. With the limitations of current technology, some words will be mangled by the extraction process as well. Some parts could be unrecognizable, and even some parts from another book could end up in there. The task at hand is then the reconstruction of the one contiguous source text.
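The reference-based idea can be illustrated with a minimal sketch. Everything here is a simplification for illustration: it assumes two already-aligned sequences of equal length and substitutions only, whereas real schemes must also handle insertions and deletions.

```python
def diff_against_reference(seq, ref):
    """Keep only the (position, base) pairs where seq differs from ref."""
    return [(i, b) for i, (b, r) in enumerate(zip(seq, ref)) if b != r]


def apply_diff(ref, diffs):
    """Reconstruct the original sequence from the reference plus the diffs."""
    out = list(ref)
    for i, b in diffs:
        out[i] = b
    return "".join(out)
```

For two genomes sharing over 99% of their content, such a diff list is orders of magnitude smaller than the sequence itself, which is the essence of the storage gain.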
1.2 Content of this work
This work follows the track of compressing the data. It uses fully-aligned human genome
data as source material. The possibility is explored of applying a machine learning
technique - specifically neural network architectures - to create a lossless compression method
for this material. The scheme will be non-reference-based: compression works on the raw
content of the DNA sequence independently. These techniques have been shown to be a valid
method for lossy compression of visual material, but at the time of writing, there are no known
applications of these techniques to DNA data. On the other hand, several traditional techniques
for compression of DNA material exist and are being developed, however without making use of
the potentially powerful advantages of machine learning. This work tries to transfer the
successful compression performance of machine learning to the human genome.
The reason this approach might prove to work is twofold. Firstly, the core purpose of
machine learning models is performing (complex) pattern recognition, and this is where they
excel compared to traditional techniques. Due to the existence of useful patterns in
audiovisual material, they are extremely successful in visual computing applications. Machine
learning succeeds in (automatically) finding relevant patterns in data which are not trivially
discovered by humans. Such models might thus be able to find - and exploit - meaningful
patterns in DNA data which are currently not known.
A second reason is the specific content of this data. Visual data often contains a lot of
redundancy, which makes it a good candidate for compression. As DNA data files contain
redundancy as well - similar strains shared among species and individuals, and repeated
patterns - they might prove to be good source material for certain machine learning algorithms
where traditional algorithms fail.
Several characteristics of both the application scenario and the source material must be
taken into account when choosing or creating a particular compression method. The application
requires that data can be stored or transferred over a network, and data should preferably
be accessible in real-time (with either a sequential or random access pattern). When the data
sets are stored off-premises in a data center, transferring files comes with an associated
cost. Additionally, even with high-speed access, transferring these data sets takes a long
time due to their large size. The goal of applying compression here is to reduce this cost by
reducing the size, either for network transfer or for storage purposes. With a reasonably fast
scheme and a good compression ratio, the processing time plus the reduced transfer time can be
significantly lower than the full transfer of raw data. Easy transfer of this data furthermore
facilitates cooperation between research institutes, which helps advance scientific progress.
With the ongoing surge of large-scale sequencing projects, compression algorithms currently
focus on compression ratio rather than speed.
The structure of this work is as follows. Chapter 2 contains a survey of relevant techniques
and developments in the domains of bio-informatics and machine learning. Chapter 3 introduces
some necessary theoretical concepts from the domain of machine learning used in this work.
Then, in chapter 4, the methodology of the approach taken here is explained, together with some
technical aspects and design decisions. After that, chapter 5 discusses the results obtained by
implementing this solution. Finally, chapter 6 concludes with a discussion of the achieved
results and looks at open questions and directions for future research on this topic.
Chapter 2
Related work
This chapter provides a survey of some relevant research in the field. The first section
focuses on (some of the) existing techniques developed and in use today for the manipulation
of sequence data. The second section covers some successful applications of machine learning
techniques to image manipulation. The last section discusses Auto-Encoders, a specific method
from the machine learning portfolio which will be applied throughout this work.
2.1 DNA compression algorithms
This section offers a brief overview of current compression techniques, starting with general-
purpose techniques, and then discussing some techniques designed specifically for sequence data.
General-purpose compression algorithms
First of all, a discussion of some general compression techniques and their effectiveness on
genome data follows. Several higher-level DNA-specialized algorithms make use of these tech-
niques or similar concepts which follow the same underlying reasoning. Here specifically, some
techniques for lossless compression of strings are discussed.
Lempel-Ziv
One well-known technique is the Lempel-Ziv (LZ) algorithm [3], known for performing well on
repetitive text. LZ starts with a dictionary of all possible characters. Running over the input
one character at a time, LZ looks for matching substrings in the dictionary; if a substring is
not found, it is added to the dictionary. Repeated strings in the file are replaced in the
output by their indices in this dictionary, so repeated patterns can be compressed efficiently.
To decompress a file, the dictionary is rebuilt on the fly by executing the compression
algorithm in reverse, which means the dictionary does not need to be sent along. When
introduced in 1977, it was the best compression algorithm known, and the popular 7-zip
compression scheme is still based on the principles of LZ. The original LZ-77 compression was
improved upon by LZ-78, and later extended to Lempel-Ziv-Welch (LZW) in 1984. Lempel-Ziv
compression is widely used in several algorithms: the first DNA-focused compression algorithm,
BioCompress, was heavily based on LZ, and many recent technologies still build upon the same
principles.
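The dictionary-building scheme described above can be sketched as a minimal LZ78-style parser. This is an illustrative toy, not a production coder: real implementations bound the dictionary size and emit bit-level output rather than Python tuples.

```python
def lz78_compress(text):
    """LZ78-style parse: emit (dictionary_index, next_char) pairs.

    Index 0 denotes the empty prefix; every emitted pair adds a new
    phrase to the dictionary, so the decoder can rebuild it on the fly.
    """
    dictionary = {"": 0}
    output, phrase = [], ""
    for ch in text:
        if phrase + ch in dictionary:
            phrase += ch          # keep extending the current match
        else:
            output.append((dictionary[phrase], ch))
            dictionary[phrase + ch] = len(dictionary)
            phrase = ""
    if phrase:                    # flush the trailing phrase
        output.append((dictionary[phrase[:-1]], phrase[-1]))
    return output


def lz78_decompress(pairs):
    """Rebuild the phrase dictionary while decoding, mirroring the encoder."""
    phrases, out = [""], []
    for idx, ch in pairs:
        phrase = phrases[idx] + ch
        phrases.append(phrase)
        out.append(phrase)
    return "".join(out)
```

On repetitive input such as a DNA fragment, phrases grow longer as the parse proceeds, so the number of emitted pairs is smaller than the number of input characters.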
Arithmetic coding
A second widely used technique in lossless data compression is arithmetic coding [4].
Arithmetic coding works by taking the probability that a character will occur and fitting the
probabilities into the range [0,1). A string is read character by character, and the
probability distribution of each character is used to update (shorten) the range. After a
number of characters has been read, the shortest decimal in the final range is chosen to
represent that series of characters. This technique leads to about 2 bits per character for
sequence data. With 2 bits, 4 states can be represented, matching the 4 nucleobases occurring
in DNA, so 2 bits per character serves as a baseline for other compression algorithms.
Arithmetic coding can also be assisted by a Markov Model: a model that predicts the next state
based on the current state or, in an order-n Markov Model, on the n previous states. The
probabilities of moving to a new state (i.e. to a new character) can then be fed into an
arithmetic coder.
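The interval-narrowing process can be illustrated with a toy coder using exact fractions. This is a sketch only: real arithmetic coders use fixed-precision integer arithmetic with renormalization, and the fixed symbol probabilities below (made up for illustration) could be replaced by context-dependent probabilities from a Markov Model.

```python
from fractions import Fraction


def cumulative(model):
    """Turn {symbol: probability} into cumulative [low, high) sub-ranges of [0, 1)."""
    ranges, low = {}, Fraction(0)
    for sym, p in model.items():
        ranges[sym] = (low, low + p)
        low += p
    return ranges


def encode(text, model):
    """Narrow [low, high) once per symbol; any number in the final
    interval identifies the whole string, so return the midpoint."""
    ranges = cumulative(model)
    low, high = Fraction(0), Fraction(1)
    for ch in text:
        width = high - low
        s_low, s_high = ranges[ch]
        low, high = low + width * s_low, low + width * s_high
    return (low + high) / 2


def decode(code, n, model):
    """Invert encode: locate the sub-range containing the code, emit
    its symbol, rescale, and repeat n times."""
    ranges = cumulative(model)
    out = []
    for _ in range(n):
        for sym, (s_low, s_high) in ranges.items():
            if s_low <= code < s_high:
                out.append(sym)
                code = (code - s_low) / (s_high - s_low)
                break
    return "".join(out)
```

Because the interval width after encoding equals the product of the symbol probabilities, frequent symbols cost fewer bits than rare ones, which is exactly what a uniform 2-bit code cannot exploit.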
Raw sequencing data
DNA Compression schemes
Kuruppu (2011, [5]) uses an algorithm based on the general-purpose LZ-77 algorithm, but
specifically modified for sequence data: optimized Relative Lempel-Ziv (opt-RLZ), to compress
genomes which are stored reference-based. Each sequence in a collection is encoded as a series
of factors (substrings) occurring in the reference. Factors are encoded as pairs containing a
lookup position and a length (which is further encoded with a Golomb code); the reference
sequence itself is stored uncompressed. Several optimizations improve on the greedy LZ-77
parsing used to match substrings in a sequence. One of these is to look ahead h characters
when encoding a sequence; the algorithm uses a variation on this concept to search for the
longest factor in a region. A second method used is the matching statistics algorithm to
encode the (position, length) pairs of factors. Factors that are so short that a literal
encoding is cheaper than a lookup pair are encoded literally, again using a Golomb code.
Finally, it is observed that the matched long factors form an increasing subsequence; these
longest increasing subsequence (LISS) factors are encoded differentially with a Golomb code.
The positions of these LISS factors can sometimes be predicted and then need not be encoded,
leading to a further compression boost. This opt-RLZ algorithm is compared, on a dataset of
four human genomes with a reference, to three other compression algorithms: COMRAD, RLCSA and
XMcompress, and is shown to be the best known compression algorithm at the time of writing.
Depending on the genome tested, opt-RLZ outperformed COMRAD and matched XM in compression
ratio. Encodings as low as 0.15 bits per base pair (bpb) and 0.48 bpb are achieved - about
half the bpb achieved by standard RLZ. The authors furthermore note that the uncompressed
reference genome takes up most of the space, so even better results are possible as additional
genomes are added to a dataset. The execution speed of the algorithm is very fast, memory
requirements are low, and the algorithm allows rapid random access to substrings of the
sequence.
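The factor decomposition at the heart of RLZ-style schemes can be sketched as follows. This is a deliberately naive greedy parser for illustration: production implementations use a suffix array over the reference for the longest-match search, and the literal fallback shown here is a simplification of the short-factor handling described above.

```python
def rlz_factorize(sequence, reference):
    """Greedy relative-LZ parse: encode `sequence` as (ref_position, length)
    factors pointing into `reference`, with a (char, 0) literal fallback
    when no match exists (e.g. a symbol absent from the reference)."""
    factors, i = [], 0
    while i < len(sequence):
        best_pos, best_len = -1, 0
        # naive O(n*m) longest-match search; real RLZ uses a suffix array
        for j in range(len(reference)):
            k = 0
            while (j + k < len(reference) and i + k < len(sequence)
                   and reference[j + k] == sequence[i + k]):
                k += 1
            if k > best_len:
                best_pos, best_len = j, k
        if best_len == 0:
            factors.append((sequence[i], 0))   # literal character
            i += 1
        else:
            factors.append((best_pos, best_len))
            i += best_len
    return factors


def rlz_expand(factors, reference):
    """Decompress by copying factor substrings back out of the reference."""
    out = []
    for pos, length in factors:
        out.append(reference[pos:pos + length] if length else pos)
    return "".join(out)
```

Since only the factor pairs need to be stored per sequence, the cost of the (uncompressed) reference is amortized over every genome in the collection.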
One of the many file formats for storing raw reads is the FASTQ format, containing the reads
together with associated metadata (e.g. read ID, base calls, quality score...). When compressed
with the general-purpose Gzip compressor, a 3-fold size reduction can be obtained. Using a
combination of Huffman coding and a scheme similar to Lempel-Ziv, the dedicated FASTQ
compressor DSRC (introduced by Deorowicz in 2011 [6], and improved upon in 2014 by DSRC2)
can obtain a compression ratio of 5. DSRC is fast and handles variable-length reads with an
alphabet size beyond five. Improvements upon DSRC use an additional arithmetic encoder,
group reads with a recognized overlap together, or use a preprocessor (SCALCE) to further
improve the compression ratio obtained with Gzip.
When the choice is made to go with lossy compression, the quality scores of a read are a
natural candidate for allowing some loss of information; some tools ignore this data, so it can
be omitted entirely for certain purposes. Another approach is to filter out reads which do not
meet a required quality level. SeqDB [7] is another FASTQ-dedicated compression scheme which
focuses on speed, with speedups of over 10 times compared to DSRC, but with compression
ratios at best at Gzip's level. A more recent algorithm is Quip [8], using higher-order
modelling, arithmetic coding and an assembly-based approach: the idea is to use the first few
(million) reads as a reference for the following ones. Depending on the algorithm mode used,
ratios are significantly better than DSRC at a slightly lower speed. A somewhat unique feature
of Quip is its ability to work with both aligned and non-aligned reads, with a reference or
standalone.

One of the latest (2015) schemes developed for FASTQ files is LFQC [9], a new lossless
non-reference-based compression algorithm outperforming existing big data compression
techniques such as Gzip, Fastqz, Quip and DSRC2 on selected datasets.
(Reference) Aligned reads
While techniques for raw reads are being developed as well, reads are usually assembled and
aligned to a reference genome. A common file format used for this is the SAM format, which
augments the read data with additional quality information, leading to files about twice the
size of FASTQ files. Another format, BAM, is a Gzip-like compressed equivalent of the textual
SAM format. Some compression schemes can handle both aligned and unaligned reads. One of the
compressors used for reference-based reads is the CRAM toolkit1, a framework and specification
developed by the European Bioinformatics Institute, achieving a 40%-50% size reduction
compared to BAM files. For aligned reads, the mapping coordinates and differences from the
reference are stored. For unaligned reads, an artificial reference is constructed with the
sole purpose of compression. The tool allows for lossless and lossy compression, with several
options to define its behaviour. Other similar tools are SlimGene [10] and SAMZIP [11]. The
previously mentioned Quip can operate on SAM/BAM formats using a reference as well, and
performs better than CRAM. Another available program is tabix, a generic tool to perform
indexing, searching and compression ([12]): a textual file is sorted and split into blocks
which are compressed using Gzip, and an index is built to allow random-access queries.
One algorithm to compress reads using a reference sequence is discussed by Fritz et al.
(2011, [13]), taking advantage of the fact that most reads in a run match the reference near-
perfectly. The algorithm takes a mapping of the reads to the reference and efficiently stores
that mapping plus any deviations, using an artificially constructed reference. First, the
lookup position of each read on the reference is stored; the length of each read is Huffman
encoded. The reads are then ordered by lookup position, allowing an efficient relative
encoding of the differentials between successive values using a Golomb code. Variations from
the reference are stored as an offset relative to the lookup position of the read, together
with the base identities or lengths, depending on the type of variation. These offsets are
again Golomb encoded. The read pair information is also stored relative to each other and
Golomb coded. On varying data sets, this technique results in a storage requirement of between
0.02 and 0.66 bits per base pair for the aligned portions of the sequence data. These numbers
compare favorably to general-purpose bzip2 compression of DNA (1 bpb) and are considerably
more efficient than BAM compression: a 5 to 54-fold ratio compared to compressed FASTA or BAM
is achieved. Next to aligned sequence data, however, usually 10% up to 70% of the reads are
unmapped to a reference, often dominating storage costs. For this purpose, an artificial
reference is assembled from similar experiments (e.g. similar species). Using a third human
sequence and a database of bacterial and viral sequences, 17% of the unmapped reads could be
mapped to these artificial references, with a resulting compression performance of 0.026
bits/base. A large 83% of the unmapped reads, however, could not be mapped to any of the used
references; these reads are compressed using general-purpose techniques. This combination of
techniques leads to a significantly better compression performance than traditional approaches
on real data, with a 10 to 30-fold better ratio. The authors note this compression scheme
performs better with longer read lengths, which newer sequencing technology offers.
1http://www.ebi.ac.uk/ena/software/cram-toolkit
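Golomb codes recur above for encoding position differentials. A sketch of the power-of-two special case, the Rice code, shows the idea: small gaps between sorted read positions get short codewords. The parameter k is chosen to match the gap distribution; the values below are illustrative only.

```python
def rice_encode(n, k):
    """Rice code (Golomb with m = 2**k) of a non-negative integer n:
    unary-coded quotient, a '0' terminator, then the k-bit remainder."""
    q, r = n >> k, n & ((1 << k) - 1)
    return "1" * q + "0" + (format(r, f"0{k}b") if k else "")


def rice_decode(bits, k):
    """Inverse of rice_encode; returns (value, number_of_bits_consumed)."""
    q = 0
    while bits[q] == "1":
        q += 1
    r = int(bits[q + 1:q + 1 + k], 2) if k else 0
    return (q << k) + r, q + 1 + k
```

Applied to sorted read positions such as [100, 104, 115], one stores the first position and then the gaps 4 and 11, which are small and therefore cheap under a Rice code with small k.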
Full genome sequences
Single genome compression
Raw sequencing data of a single genome poses the greatest challenge for storage and transfer.
The genome sequence of a single individual is very hard to compress due to the lack of
well-understood structure and patterns. When only the symbols A, C, G and T are used, the
simplistic general-purpose encoding using 2 bits per symbol often outperforms 'smart'
general-purpose compressors like Gzip. Nonetheless, specialized DNA compressors are being
developed in the hope of improving upon this baseline. The highly acclaimed XM achieves
compression ratios of up to 5, but at an impractically low speed. Other notable compressors
are dna3 and FCM-M.
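The 2 bits per symbol baseline is straightforward to implement. The sketch below assumes a pure {A, C, G, T} alphabet; real data also contains ambiguity codes such as N, which is one reason more general compressors remain necessary.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASE = "ACGT"


def pack(seq):
    """Pack a pure-ACGT sequence at 2 bits per base; returns (bytes, length)."""
    out, byte, nbits = bytearray(), 0, 0
    for ch in seq:
        byte = (byte << 2) | CODE[ch]
        nbits += 2
        if nbits == 8:
            out.append(byte)
            byte, nbits = 0, 0
    if nbits:
        out.append(byte << (8 - nbits))   # left-align the final partial byte
    return bytes(out), len(seq)


def unpack(data, n):
    """Recover the n-base sequence from its 2-bit packed representation."""
    seq = []
    for i in range(n):
        byte = data[i // 4]
        shift = 6 - 2 * (i % 4)           # bases are stored high bits first
        seq.append(BASE[(byte >> shift) & 3])
    return "".join(seq)
```

The sequence length is stored alongside the bytes because the final byte may be padded; this fixed 4x reduction over 8-bit characters is the baseline any specialized DNA compressor must beat.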
XM [14] is an Expert Model algorithm using arithmetic coding. Its unique property is how it
determines the probabilities of the characters. The algorithm uses a collection of experts, an
expert being anything that gives a good probability distribution for a position in the
sequence. Examples are the previously mentioned Markov Model, or a copy expert, which
determines whether something is likely a copy of a known block. After obtaining probabilities
for the characters, XM combines these and feeds them into an arithmetic coder. Additionally,
the experts are weighted based on their past accuracy. In comparison with other genome-specific
algorithms (BioCompress, GenCompress, DNACompress, DNAPact), XM performs better than the other
algorithms on average and on most genomes. XM achieves a compression well under 2 bpb, clearly
showing that a specialized algorithm is able to perform better than generic compression
algorithms (such as simple arithmetic coding).
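The expert-combination step can be sketched as a simplified weighted mixture with exponential discounting. The decay constant and update rule below are illustrative assumptions, not the actual XM weighting scheme, which differs in detail.

```python
def combine_experts(predictions, weights):
    """Weighted mixture of expert distributions over the alphabet {A, C, G, T}."""
    total = sum(weights)
    mix = {b: 0.0 for b in "ACGT"}
    for dist, w in zip(predictions, weights):
        for b in "ACGT":
            mix[b] += (w / total) * dist[b]
    return mix


def update_weights(weights, predictions, actual, decay=0.95):
    """Reward each expert in proportion to the probability it assigned to
    the symbol that actually occurred, discounting older performance."""
    return [w * decay + (1 - decay) * dist[actual]
            for w, dist in zip(weights, predictions)]
```

The mixture is itself a valid probability distribution, so it can be fed directly into an arithmetic coder; experts that keep predicting well gradually dominate the mixture.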
Tabus & Korodi proposed a sophisticated DNA sequence compression method based on
normalized maximum likelihood (NML) discrete regression for approximate block matching
(2003, [15]). T&K first breaks the sequence into equally-sized blocks. Each block is compressed
using three different methods, with the most efficient one selected for storage. The first
method uses a Markov Model, the second method is simple arithmetic coding, and the third
method looks for matches in the previous blocks and compresses the block using differences.
The third method works in a few subsequent steps. First, a previous block with approximately
equal content is searched for. The positions of equality are stored in a string, and for the
differences, distances from the reference block are arithmetically encoded. A probability
distribution is constructed to apply a form of arithmetic coding to the distances. From that
distribution a second probability distribution is created, called a universal code, which
creates prefixes in a way that performs well on all possible source distributions. Compression
results on chromosomes of the human genome lie between 1.449 and 1.616 bpb. No comparisons
have been made to other algorithms, however, so the effectiveness of the T&K method is
somewhat hard to evaluate. The T&K method was further developed in 2005 into GeNML [16].
Genome collections
When databases of many individual genomes of the same species are considered, the scenario
is significantly different, as more knowledge of the combined genomes can be exploited. These
genomes are very likely highly similar, sometimes sharing over 99% of their content, and as
such a collection can be compressed more efficiently than standalone material. Several
algorithms working on referenced genomes show improved performance with a bigger dataset of
genomes. General-purpose algorithms (Gzip, rar) are usually not applicable here, since
repetitions in the data can be gigabytes apart. Variations on a reference genome consist of
Single Nucleotide Polymorphisms (SNPs) and indels, insertions or deletions of multiple
nucleotides. Assuming these variations to a reference are known, plus a readily available SNP
database, a single human genome has been compressed to about 4 Mbytes. Recent techniques have
been able to compress collections to an extent of a few hundred kilobytes per individual
(2009, [17]). Standard compression techniques such as Huffman coding and Lempel-Ziv have been
used here as well, but do not achieve the performance of specialized compression schemes. The
GDC2 compressor achieves a compression ratio of 9500 for relatively encoded genomes in a large
collection, with fast execution time [18]. Some specialized compressors such as GDC and LZ-End
allow access to an arbitrary string in the collection, at the expense of the compression ratio
achieved. An advanced general-purpose LZ compressor, 7zip, is able to achieve competitive
results as well, provided the data is in a specific order.
COMRAD (Compression using Redundancy of DNA sets) ([19], [20]) generates a dictionary
of substrings of length L and keeps track of their frequencies. It counts the (non-overlapping)
occurrences of the most frequent substring, replaces each occurrence with a new symbol, and
stores the replacement rule. This process of counting and replacing is repeated until the
counts fall below a threshold. The output is a string plus the set of replacements that were
made, allowing the sequence to be decompressed by applying the replacements in reverse. Using
a dataset of human genomes, bacteria and viruses, COMRAD is compared to RLZ, RLCSA and
arithmetic coding. On average, arithmetic coding leads to 2.06 bpb, while COMRAD achieves a
relatively efficient 1.10 bpb. As the compared algorithms are not DNA-specific and do not
perform better than simple arithmetic coding, they are a poor choice for comparison, and a
comparison with better-performing methods should be done.
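The substitution loop described above can be sketched as follows. This is a toy version under simplifying assumptions: it counts sliding-window occurrences where COMRAD counts non-overlapping ones, and the fresh single-character symbols are assumed absent from the input alphabet.

```python
from collections import Counter


def comrad_compress(seq, L=4, threshold=2):
    """Repeatedly replace the most frequent length-L substring with a
    fresh one-character symbol, recording the replacement rules."""
    rules = {}
    symbols = iter("0123456789")   # fresh symbols, assumed not to occur in seq
    while len(seq) >= L:
        counts = Counter(seq[i:i + L] for i in range(len(seq) - L + 1))
        best, n = counts.most_common(1)[0]
        if n < threshold:
            break
        sym = next(symbols)
        rules[sym] = best
        seq = seq.replace(best, sym)   # non-overlapping replacement
    return seq, rules


def comrad_expand(seq, rules):
    """Decompress by applying the replacement rules in reverse order,
    since later rules may reference symbols introduced by earlier ones."""
    for sym, sub in reversed(list(rules.items())):
        seq = seq.replace(sym, sub)
    return seq
```

Each pass shrinks the string by L-1 characters per replaced occurrence, so highly repetitive collections (the COMRAD use case) shrink quickly while the rule set stays small.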
Note While several existing compression schemes for a variety of scenarios are mentioned in
the previous section, it should be noted that this survey is by no means exhaustive, and many
more techniques exist. In evaluating compression schemes, some difficulties arise when trying
to perform experimental tests and comparisons. Many tools are limited in their functionality,
as they are often designed with a specific scenario in mind, each having its own constraints
and characteristics. They might only accept sequences with a limited alphabet, fixed-width
reads, assume specific ID formats, disregard metadata... This makes an honest comparison
rather hard to perform. Some existing tools show significant problems when trying to run them.
While open source code is sometimes available, some tools are proprietary, do not disclose the
settings leading to their published results, and are therefore very hard to evaluate. The
output formats of some of these tools are not always compatible, with features either not
supported or impossible to turn off, which often makes a comparison not entirely fair. The
large variety of file formats for DNA material poses a difficulty as well. Lastly, the lack
of a benchmark dataset is probably the largest hurdle in making fair comparisons possible.
Each research group uses its own dataset to present numbers which are thus often not
comparable in a transparent way, and with the years and advances in technology, these datasets
differ significantly. This specific problem has been addressed by the machine learning
community in the form of MNIST, ImageNet and other publicly available benchmark datasets. A
similar concept to allow compression method benchmarking would certainly improve the
transparency of many of the published research efforts.
2.2 Image manipulation with machine learning
This section focuses on machine learning successes in computer vision applications, where
these techniques have proven to be extremely successful and have led to the development of the
deep learning discipline. The technique of interest is the concept of Artificial Neural
Networks (ANNs). They have been tried on a variety of problems and have proven superior to
traditional methods in many situations. This section discusses a few works applying ANNs
specifically to image manipulation, classification and compression.
Classification
As ANNs are extremely successful in computer vision applications, a large body of research
papers has been published over the years on (convolutional) neural networks for image
classification. The gold standard of machine learning classification has long been the MNIST
dataset, which consists of labeled grayscale pictures of handwritten digits. For the purpose
of comparing the performance of new networks, a new dataset was later constructed:
CIFAR-10/100, consisting of tiny labeled colour images. Since 2010, the most popular benchmark
used in object detection and image classification research has been the ImageNet Large Scale
Visual Recognition Challenge (ILSVRC). This section shortly describes some winning competitors
of these competitions.2
Krizhevsky ([22]) discusses the application of ANNs to image classification using the
ImageNet dataset, a set of 15 million high-resolution labeled images used in the
ILSVRC-2010/2012 competitions. They use a large convolutional network (often abbreviated CNN
or ConvNet) with 5 convolutional and 3 fully-connected layers, implemented on GPUs, to achieve
a result by far outscoring any result ever reported on this dataset. A few novel techniques
are discussed to prevent overfitting the model and to speed up training: they introduce
Rectified Linear Units (ReLUs) as the nonlinearity in the CNN, apply GPU-specific architectural
decisions, and add a normalization method and overlapping pooling in the pooling layers of the
network. Overfitting is reduced by using data augmentation and dropout. Compared with existing
results in the competitions, they achieve an error rate which is more than 10% lower than
existing state-of-the-art solutions. Their results show that a large, deep convolutional
neural network is capable of achieving record-breaking results on a highly challenging dataset
using purely supervised learning.
Simonyan and Zisserman ([21]) improve upon these state-of-the-art convolutional neural
nets. They perform a thorough investigation of the architecture of these networks, specifically
aimed at the depth and the convolution filter sizes. By stacking a large number of convolutional
layers with small filters, they arrive at significantly more accurate ConvNet architectures,
leading to the winning submissions of the ImageNet Challenge 2014 in both image classification
and localisation. Combining several of their deep models into an ensemble leads to a record
performance of 24.4% and 7.1% for the top-1 and top-5 error rates on ILSVRC. While still
adhering to the original ConvNet architecture, by deepening the configuration they outperformed
all winning networks from previous years. The conclusion of the work is that very deep
ConvNets (sometimes having over 20 layers) outperform existing architectures, confirming the
general adage of "the deeper, the better".
2For a more complete overview of image classification with ILSVRC, see [21] and the very useful http://rodrigob.github.io/are_we_there_yet/build/classification_datasets_results.html, which ranks the published results on the most widely used benchmark datasets and where possible links to the papers.
Compression
Traditional techniques for image compression include predictive coding, transform coding and
vector quantization. Several standards regarding compression of (audio)visual material are in
widespread use today. While not widely used in practice, there has been some research on image
compression using neural networks. As ANNs perform well with incomplete or noisy data,
they can be expected to perform well on images and visual data. ANNs (and machine learning
in general) can process input patterns into a simpler pattern with fewer components/coefficients
as an internal representation. This internal pattern, stored in the neurons of a hidden layer,
represents the external input information in a more compact way, leading to compression. Deep
learning has led to breakthroughs in learning hierarchical representations from images.
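The bottleneck idea described above can be illustrated with a minimal sketch: a linear auto-encoder trained with gradient descent to squeeze 4-dimensional inputs through a 2-unit hidden layer and reconstruct them. The toy data, sizes and learning rate below are invented for illustration and do not come from any of the cited works.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 four-dimensional points that really live on a 2-D subspace,
# so a 2-unit bottleneck can represent them compactly.
latent = rng.normal(size=(200, 2))
mix = rng.normal(size=(2, 4))
X = latent @ mix

# Linear auto-encoder: encode 4 -> 2 (the hidden layer), decode 2 -> 4.
W_enc = rng.normal(scale=0.3, size=(4, 2))
W_dec = rng.normal(scale=0.3, size=(2, 4))

lr = 0.02
for _ in range(3000):
    code = X @ W_enc           # compact internal representation
    X_hat = code @ W_dec       # reconstruction from the code
    err = X_hat - X
    # Gradients of the mean squared reconstruction error.
    grad_dec = code.T @ err / len(X)
    grad_enc = X.T @ (err @ W_dec.T) / len(X)
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc

mse = float(np.mean((X @ W_enc @ W_dec - X) ** 2))
```

After training, the 2-D `code` carries most of the information of the 4-D input, which is exactly the "internal pattern with fewer components" that makes compression possible.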
[23] is an early work discussing a Direct Solution Method (DSM) based neural network with
an auto-encoder architecture. The DSM operates contrary to most approaches, which use iterative
training and Error BackPropagation (EBP). A multi-layer perceptron with a single hidden layer
is used, where the output layer neuron values are expressed as a linear system of equations.
The weights of the network are found by solving this system using a Modified Gram-Schmidt
method. This method is chosen over traditional iterative approaches because it is stable, faster,
and requires less computing power. Comparing compression and reconstruction of the DSM
and EBP methods, a compression ratio of 3:1 was found, with both methods performing well and
DSM being the faster one.
Dutta et al. (2009, [24]) use a multi-layer feed-forward neural net with sigmoid activation
functions to perform image compression. Some standard pictures are compressed, and
median filtering and histogram equalization are applied to the reconstructed grayscale images.
Compression rates of around 85% are achieved. While the PSNR is used as a quality measure,
only a visual display shows the results, and unfortunately no comparisons are made with
existing compression codecs.
2.3 Auto-Encoders as feature learning models
While most of the published research on ANNs handles image classification and object recog-
nition, which fall under the supervised learning category of machine learning techniques, there
has been active research on applying ANNs in unsupervised learning as well (albeit with far
less attention), bridging from their good performance in computer vision tasks to other similar
applications. The majority of this research focuses on complementing the methods: using an
un- or semi-supervised learning stage before applying a supervised classification task. As the
auto-encoder will be used in the methodology of this work, this section provides a closer look
at research on this particular technique.
Kulkarni et al. (2015, [25]) construct a Deep Convolutional Inverse Graphics Network (DC-
IGN) in order to learn image features. They then use the network to manipulate these features
in order to render an (object in an) image with a different viewpoint, lighting condition, etc. The
architecture is a variational auto-encoder, with the encoder part consisting of multiple layers
of convolution & max-pooling, and the decoder having symmetrical inverse equivalents. This
leads to a middle hidden layer holding semantically-interpretable graphics features. By recreat-
ing a modified version of an image, they discuss the model's efficacy at learning a 3D rendering
engine. By using specific batch training datasets, they train the model to express specific fea-
tures with specific neurons in a hidden layer. They conclude that by utilizing a deep convolution
and de-convolution architecture within a variational auto-encoder formulation, it is possible to
train a deep convolutional inverse graphics network with a graphics code layer representation
from static images.
Radford et al. (2015, [26]) introduce deep convolutional generative adversarial networks
(DCGANs), a class of CNNs with certain architectural constraints. The aim is to
automatically learn a feature representation from unlabeled data, which can then be reused
in supervised learning tasks. The DCGANs are shown to learn a hierarchy of representations,
from object parts to entire scenes, by training on various image datasets. By using the fea-
ture representation as input vector to supervised image classification models, they are able to
outperform traditional techniques (K-means, Exemplar CNN) on the CIFAR dataset. The con-
clusion is a set of stable architectures for generative adversarial networks and evidence that
these networks can learn good representation vectors for further use.
Zhao (2015, [27]) presents a novel architecture called the stacked what-where auto-encoder
(SWWAE). The architecture consists of a convolutional net as discriminative model followed by
a deconvolutional net as generative model. By selectively enabling only parts of the architecture,
it provides a unified approach to supervised, unsupervised and semi-supervised
learning. The novelty they introduce is an adaptation of the pooling stage (with an associated
modified loss function). Their idea is that whenever a layer implements a many-to-one mapping
(as pooling does), they compute a set of complementary variables to improve reconstruction
ability. Each pooling stage outputs a what value, which is the content fed to the next
layer. Additionally, a where variable is saved, informing the corresponding decoding step
about the location of certain features. Each convolution + pooling stage and its corresponding
decoding stage then forms a stacked auto-encoder. Their results on existing image classification
sets show comparably good accuracy, yet do not improve on the state of the art.
One application of auto-encoders is the denoising Auto-Encoder (dAE), introduced by Vin-
cent (2010, [28]). In order to force an auto-encoder to learn useful features instead of the
identity function, the auto-encoder is locally trained to reconstruct its input from a corrupted
version. By stacking layers of dAEs, a deep network is formed that is shown to be able to
learn useful features from natural images or digit images (MNIST). Experiments using the
denoising objective as unsupervised training criterion in auto-encoders, complemented with ex-
isting supervised classification methods, show a significant improvement. The conclusion is the
clear establishment of the denoising criterion as a valuable unsupervised objective to guide the
learning of useful higher-order data representations.
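The corrupt-then-reconstruct objective can be sketched in a few lines of NumPy. The toy 8-pixel "images", network sizes, corruption level and learning rate below are invented for illustration and are not taken from [28]; the essential point is only that the target of the reconstruction is the clean input, while the network only ever sees the corrupted one.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy 8-pixel "images": each sample lights up either the left or the right half.
left = np.tile([1, 1, 1, 1, 0, 0, 0, 0], (150, 1)).astype(float)
right = np.tile([0, 0, 0, 0, 1, 1, 1, 1], (150, 1)).astype(float)
X = np.vstack([left, right])

# Corruption: randomly zero out roughly 30% of each input's pixels.
X_noisy = X * (rng.random(X.shape) > 0.3)

W1 = rng.normal(scale=0.5, size=(8, 4))   # encoder weights (8 -> 4)
W2 = rng.normal(scale=0.5, size=(4, 8))   # decoder weights (4 -> 8)

def reconstruct(X_in):
    return sigmoid(sigmoid(X_in @ W1) @ W2)

loss_before = float(np.mean((reconstruct(X_noisy) - X) ** 2))

lr = 1.0
for _ in range(2000):
    H = sigmoid(X_noisy @ W1)
    X_hat = sigmoid(H @ W2)
    # Denoising objective: reconstruct the CLEAN X from the corrupted input.
    d_out = (X_hat - X) * X_hat * (1 - X_hat)
    d_hid = d_out @ W2.T * H * (1 - H)
    W2 -= lr * H.T @ d_out / len(X)
    W1 -= lr * X_noisy.T @ d_hid / len(X)

loss_after = float(np.mean((reconstruct(X_noisy) - X) ** 2))
```

Because the identity function cannot map a corrupted input to its clean original, the hidden layer is pushed towards features (here: "which half is lit?") rather than a trivial copy.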
Masci (2011, [29]) introduces a novel convolutional auto-encoder (CAE) for unsupervised
learning. By stacking these and adding pooling layers, a ConvNet is formed, trained using
traditional methods. The CAE serves as a scalable hierarchical unsupervised feature extractor,
learning good ConvNet initializations and avoiding the problem of getting stuck in local minima
of highly non-convex objective functions (arising in virtually all deep learning problems). The
(novel) convolutional variant of the auto-encoder exploits the 2D image structure of the data,
leading to parameter sharing among locations in the image. It thus preserves spatial locality,
and the discovery of localized, repeated features in data is the property that made ConvNets
excel in object recognition tasks. A stack of CAEs is trained and used as initialisation for a
ConvNet to perform classification on MNIST and CIFAR. They conclude that pre-trained CNNs
slightly but consistently outperform randomly initialized nets.
Building on top of denoising auto-encoders, Rasmus et al. (2015, [30]) build a Ladder
Network consisting of a stacked auto-encoder architecture with skip connections between en-
coder/decoder pairs, where each layer of the network is trained separately. The Ladder
Network is basically a collection of nested denoising auto-encoders which share parts of the de-
noising machinery with each other. The idea behind skip connections is to alleviate the need for
the model to capture details in the encoding step, as the decoder can recover these discarded
details through the direct connections. Their approach to unsupervised learning is compatible
with existing supervised feed-forward networks, scalable and computationally efficient. By
reaching state-of-the-art performance on supervised tasks (MNIST & CIFAR), they show how a
simultaneous unsupervised learning stage improves the performance of existing neural nets.
Chapter 3
Theory
This chapter gives a short overview of the theory and principles used throughout
this work. It starts with a general overview of the concepts of machine learning, then discusses
one particular technique (Artificial Neural Networks or ANNs), and finally discusses a special
application of these ANNs which is applied in this work, namely Auto-Encoders (AEs).
3.1 Brief introduction to machine learning
This section will sketch a brief overview of the theory of machine learning and some important
concepts and distinct techniques.
Machine Learning (ML) is currently a highly active area in computer science, rooted in pat-
tern recognition and learning theory in Artificial Intelligence, and closely related to computational
statistics and mathematical optimization. In machine learning, algorithms learn to perform
tasks without explicit programming by hand; there is an emphasis on automatic methods. The
goal is to devise learning algorithms that learn effectively and automatically, without human in-
tervention. The motivation is that code is expensive and specialized to a particular
application, while data is often abundant and increasingly becoming cheaper, objective and thus
usable for multiple purposes. Contrary to traditional programming, where a program is written
allowing a computer to solve a task directly, in ML a method is sought whereby a computer
comes up with its own program, based on example data provided as input. Advanced statistics
are used to create usable programs from template algorithms and structures. By having simple
templates but with a large number of parameters, these techniques are general-purpose, yet
able to model complex relationships and programs.
Machine learning is a data-driven methodology. Often, the relationships in the data have a
complexity too high for humans to comprehend or program in a traditional way.
By adjusting numerous parameters in a template model based on training samples, the goal in
applying an ML algorithm is to build a program which automatically constructs its logic and
rulesets based on the data.
The goal of an ML approach is to find an unknown target function f that expresses the
input/output relationship

f : X → T (3.1)

with the input domain X consisting of vectors x_i of dimension M, and the output domain
T of target vectors t_i of dimension N. The targets represent either a class label (N = 1)
or one or more real-valued numbers (N ≥ 1).

x_i = (x_1, ..., x_M)^T, t_i = (t_1, ..., t_N)^T (3.2)
The first phase is the training phase. Suppose a set of training samples D_N is given, compris-
ing N data points, each consisting of an input vector x_i with an associated output (target)
value t_i.

D_N = {(x_1, t_1), ..., (x_N, t_N)} (3.3)
A model is then constructed through a learning algorithm that tries to map the data
points of this training set to their known target values. The learning algorithm constructs
a parametrized function h_θ(x) out of a hypothesis set H of candidate mapping functions. This
function should be a good approximation of the actual target function f.

h_θ : X → T, h_θ ∈ H, h_θ ≈ f (3.4)
The learning often uses an iterative procedure and is performed by defining an error measure
E(h, f). This misfit between the predicted target and the actual known target during the
training phase is fed back to the learning algorithm, which can update
the model parameter set θ accordingly to try to reduce this error.
E(h_θ, f) = e(h_θ(x), f(x)) (3.5)
Additionally1, a probability distribution P on X can be applied to the input and target values.
This probability distribution accounts for noisy data, and allows a degree of certainty
to be provided for a predicted target value.
By repeatedly evaluating samples from the training set and adjusting the model
parameters to reduce the error measure, the model will hopefully learn a function h_θ which
approximates f to some sufficient precision.
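The loop just described can be sketched concretely for the simplest possible hypothesis set, a linear h_θ with a squared-error misfit; the target function, data and learning rate below are invented for this illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Unknown target function f the learner must approximate.
def f(x):
    return 3.0 * x + 1.0

# Training set D_N = {(x_1, t_1), ..., (x_N, t_N)} with noisy targets.
N = 100
x = rng.uniform(-1, 1, size=N)
t = f(x) + rng.normal(scale=0.1, size=N)

# Hypothesis h_theta(x) = theta[1] * x + theta[0].
theta = np.zeros(2)

def h(theta, x):
    return theta[1] * x + theta[0]

def E(theta):
    # Error measure: mean squared misfit between prediction and known target.
    return float(np.mean((h(theta, x) - t) ** 2))

# Iterative training: feed the misfit back to update the parameter set theta.
lr = 0.1
for _ in range(500):
    err = h(theta, x) - t
    grad = np.array([np.mean(err), np.mean(err * x)])
    theta -= lr * grad
```

After the loop, `theta` is close to (1, 3), i.e. h_θ ≈ f up to the noise in the targets.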
The goal of the algorithm is to build a model that performs this mapping task on new,
unknown input data (the test set) during the testing (evaluation) phase. In order to perform
this task well, the proposed function must closely approximate the unknown function f that
represents the real input-output relation. The model then applies this function to new input x
in order to produce a prediction t̂ of the unknown target value t.

h_θ(x) = t̂, t̂ ≈ t (3.6)
An important feature of machine learning is the ability to generalize: performing good
predictions for unknown data. As input samples often comprise only a fraction of the input
space, generalization is a central goal in pattern recognition. Care must be taken not to tai-
lor the model too closely to the specific subset of input samples. This phenomenon is known as
overfitting and leads to poor performance on unseen data. Several techniques are used to avoid
overfitting in machine learning algorithms. Another important step in the process is called
feature extraction. Often, input data is of a high dimensionality. This data is preprocessed in
order to extract relevant features before feeding it to the model. Transformation of the data to
a new variable space can make the pattern recognition easier. Feature extraction can preserve
only the useful dimensions and discard non-discriminatory information, which allows for a
speedup in the processing.
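A minimal sketch of such a dimensionality-reducing preprocessing step is Principal Component Analysis via the SVD; the toy data below (5-D inputs that mostly vary along 2 directions) is invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)

# 100 samples of 5-D input data whose variation lies mostly along 2 directions.
basis = rng.normal(size=(2, 5))
X = rng.normal(size=(100, 2)) @ basis + rng.normal(scale=0.05, size=(100, 5))

# PCA via the SVD: keep only the directions carrying the most variance.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
features = Xc @ Vt[:2].T           # 5-D inputs reduced to 2-D features

# Fraction of the data's variance preserved by the 2 kept dimensions.
explained = float((S[:2] ** 2).sum() / (S ** 2).sum())
```

The discarded three dimensions carry almost only noise here, so the 2-D `features` keep nearly all the discriminatory information while shrinking the input a model has to process.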
Depending on the type of problem, a few distinctions can be made between ML algorithms.
Applications where the training data consists of input vectors with associated target vectors
are known as supervised learning. The data is labelled in that case, which allows for easy
1Often, but not necessarily available, depending on the ML technique used.
verification of the model during testing. The goal of the problem is to assign new data points a
label (or real-valued number). In other scenarios, the training data lacks any corresponding target
values. The goal of these so-called unsupervised learning problems is not to predict a variable
based on the input features. Rather, the model tries to discover groups and similarities in the
samples, or it tries to find a probability distribution of the data. This is done in order to find
the general structure in the shape of the dataset (for example to reduce the dimensionality
of the data). Lastly, there are reinforcement learning techniques, where a set of actions must
be determined in order to maximize a reward. Here, the concept of an environment is used,
and the form of the reward must be defined. Some ML techniques are parametric models.
These models perform training, but do not need to remember the data for evaluation. Other,
non-parametric models use the entire dataset when performing predictions, typically without
the need for much training. Some problems are convex, where optimization leads to a global
minimum; the best model for the dataset can be achieved. Other techniques have a non-convex
loss function and do not have a guaranteed single best stationary solution.
Over the years, a large set of techniques has been developed to address a wide range of
problems, each having different requirements in available training data, speed (of training or
evaluation), performance, stability and other features. Some well-known techniques are Logis-
tic Regression, Naïve Bayes, k-Nearest Neighbours, Random Forests, Support Vector Machines,
(Convolutional/Artificial) Neural Networks, Linear Discriminant Analysis, Principal Compo-
nent Analysis, and Bagging & Boosting. Some reinforcement learning techniques are Q-learning
and SARSA.
Underfitting and overfitting
The goal in applying ML to a problem is finding patterns in data. One constructs a model
that tries to approximate the data by a mathematical function, the target function f(x). It
does so by looking at known training data and learns an approximation of the data's behaviour
and characteristics. Assuming the training examples are a representative set
of the new and unknown data the model will have to process after training, it will be able to
work well with unseen test data based on similarities between these data sets. Two phenomena
are very important because they threaten a successful application:
underfitting and overfitting. This subsection therefore discusses these problems
in some more detail.
When constructing a model, there is the choice of working with a very simple or a com-
plex model. Simple models are limited in their expressive power and modelling capacity, while
complex models are the opposite: they are powerful, but often require a very large amount of
computing power. With high-dimensional data, the complexity of a model can very soon become
unwieldy to work with, a phenomenon known as the curse of dimensionality. A complex
model will be able to learn very specific behaviour of the training data. The risk, however, is that
the approximation will result in a target function so specific to those training samples that
the model will perform very poorly on new data that differs slightly from them.
The model is then said to overfit and not generalize well. On the other hand, a simple
model might not be able to capture the necessary characteristics of the data it is working with.
No matter how much new data is available for training, the result will be a function that lacks
useful expressive power and is of not much interest; e.g. the output could always be the average
point of a data set. This end of the spectrum is known as underfitting: while simple, the model
is too general to be of use. One mostly aims to strike a balance between these extremes: a
model should be sufficiently powerful to find the important characteristics of the data and still
generalize well to new data.
Figure 3.1 illustrates these phenomena. The blue dots are data points sampled from a source
with a sine distribution and some random noise added. A machine learning model will try to
fit a curve f(x) to these points, which is the target function describing the behaviour of the
data. The function here is a polynomial of a chosen degree, and curve fitting is done
through the least-squares error method. Figure 3.1a uses a 1st-degree function. It is clear this is
not adequate to describe the data. The model is not expressive enough to find patterns in the
data. This is a case of underfitting. Figure 3.1b shows a 3rd-degree polynomial, fitting the data
quite well. When new data comes in from the sine source (indicated by the red points), the
curve fits them well too. The model has found the pattern in the data, and generalises well
to new data. This is the desirable case. Figure 3.1c shows the case of overfitting. A 10th-degree
polynomial is fitted to the blue data. This fits the blue data exactly. However, the new red
data are not at all represented in a good way by the approximation. The curve is specialized
too much to the known training data, and does not generalise well to new data. This model
will thus perform suboptimally during its eventual application on new data.
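The experiment behind Figure 3.1 can be reproduced in a few lines with NumPy's least-squares polynomial fit; the sample counts and noise level below are invented and need not match the figure exactly.

```python
import numpy as np

rng = np.random.default_rng(3)

# 15 noisy training samples from a sine source, plus clean "new" test data.
x_train = np.sort(rng.uniform(0, 2 * np.pi, 15))
t_train = np.sin(x_train) + rng.normal(scale=0.15, size=x_train.size)
x_test = np.linspace(0, 2 * np.pi, 100)
t_test = np.sin(x_test)

def fit_and_score(degree):
    coeffs = np.polyfit(x_train, t_train, degree)   # least-squares curve fit
    train_mse = float(np.mean((np.polyval(coeffs, x_train) - t_train) ** 2))
    test_mse = float(np.mean((np.polyval(coeffs, x_test) - t_test) ** 2))
    return train_mse, test_mse

underfit = fit_and_score(1)    # too simple: misses the sine pattern entirely
good = fit_and_score(3)        # captures the overall shape of the data
overfit = fit_and_score(10)    # flexible enough to chase the training noise
```

The training error can only go down as the degree rises, while the error on the unseen test points tells the real story: the straight line generalises worst, and the high-degree fit tends to wiggle between the training points.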
Figure 3.1: Illustration of under- and overfitting: polynomial curves of degree 1 (a), 3 (b) and 10 (c) fitted to a dataset
During the development of machine learning techniques, some techniques and
modifications have been specifically designed to address the problem of overfitting, known as
regularization techniques. When using regression and finding the parameters of a polynomial
curve, there are L1 and L2 regularization, also known as Lasso regression and Ridge regression.
Lasso leads to sparse solutions, while Ridge avoids large coefficients; both favour smooth
solutions. Early stopping is regularization in time when working with iterative training: the
model performance is monitored during training, and training stops as soon as performance
starts to degrade. Ensemble models are a form of regularization, as they combine the results
of multiple, possibly overfitting, models. Specifically for neural networks, one form of
regularization is a technique called Dropout (explained later).
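L2 (Ridge) regularization has a closed form for polynomial regression, which makes its coefficient-shrinking effect easy to demonstrate; the data, degree and penalty strength below are invented for this sketch.

```python
import numpy as np

rng = np.random.default_rng(4)

# Overfitting-prone setup: few noisy points, high-degree polynomial features.
x = rng.uniform(-1, 1, 12)
t = np.sin(np.pi * x) + rng.normal(scale=0.2, size=x.size)
Phi = np.vander(x, 10)                       # degree-9 polynomial features

def ridge_fit(lam):
    # Minimize ||Phi w - t||^2 + lam * ||w||^2  (L2 / Ridge regression).
    A = Phi.T @ Phi + lam * np.eye(Phi.shape[1])
    return np.linalg.solve(A, Phi.T @ t)

w_unreg = ridge_fit(0.0)    # plain least squares: free to use huge coefficients
w_ridge = ridge_fit(0.1)    # penalized: coefficients are shrunk towards zero
```

The penalty term `lam * ||w||^2` trades a slightly worse fit on the training points for smaller coefficients, and hence a smoother, better-generalising curve.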
3.2 Introduction to the concepts of (Artificial) Neural Networks
Artificial Neural Networks (ANNs) are a kind of machine learning technique whose com-
putational model is inspired by (neurons in) the human brain. While the concept of neural
nets has been around for quite some time, their power and applications have long been under-
estimated. With recent developments in the theory and easy access to large amounts of
computational power, neural networks have been rediscovered and have seen a surge in interest
during the last decade. ANNs are applied in many domains and applications, nowadays
encompass the most advanced models in ML and AI, and outperform many previous state-of-
the-art algorithms.
Figure 3.2: Artificial neuron (inputs x_i, weights w_i, bias b, summation Σ, activation function f, output y)
3.2.1 Introduction
Architecture
A NN consists of a layered network of computational units, called artificial neurons. The
first neural networks were developed by Rosenblatt starting in the 1950s and were based on
perceptrons. Figure 3.2 shows one such artificial unit (note this is a general model and not
necessarily a perceptron unit).
Each unit takes a certain number of inputs x_i. A weighted sum of these inputs is passed to the
activation function f. The output of this activation function is the output h_{W,b}(x) of the unit,
and this can be passed as an input to other units. Several variations of units are possible: inputs
can be binary or real-valued, and often a bias input is added to the unit. A step function can be
taken as activation function, but more commonly a continuous function is chosen. A
popular choice is the sigmoid (logistic) function or the hyperbolic tangent, both having desirable
properties regarding their derivatives. This single neuron can thus be interpreted as a model
that makes a decision based on features of its input, using the formula

h_{W,b}(x) = f(Σ_{i=1}^{n} W_i x_i + b) = f(W^T x) (3.7)

By adjusting the weights of the summation, the decision behaviour of the model can be
changed.
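Equation (3.7) translates directly into code; the weights, bias and input vector below are arbitrary example values, and the sigmoid is one possible choice of f.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, W, b):
    # h_{W,b}(x) = f(sum_i W_i * x_i + b), here with a sigmoid activation f.
    return sigmoid(np.dot(W, x) + b)

x = np.array([1.0, 0.5, -1.0])   # inputs
W = np.array([0.8, -0.4, 0.2])   # weights
b = 0.1                          # bias
y = neuron(x, W, b)              # output in (0, 1)
```

Changing `W` and `b` changes where the weighted sum crosses zero, and thus the decision behaviour of the unit.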
When multiple of these units are chained together, a neural net is formed, for example the
simple network displayed in Figure 3.3. While a single artificial unit has limited computational
power, this does not hold for multi-layer networks. Construction of NOT/AND/OR operators
is possible, meaning artificial neural nets are functionally complete: they can express all possible
truth tables for a collection of input features. Furthermore, the theorem known as the universal
Figure 3.3: Neural Network structure (input layer, hidden layer, output layer)
approximation theorem for neural nets states that an ANN can be used to approximate any
continuous function to arbitrary precision. Their expressive power is thus as large as that of
any other computational device.
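The claim about logic operators is easy to verify with threshold units: fixed weights and biases, no training needed. The particular weight values below are one possible choice.

```python
import numpy as np

def step(z):
    # Threshold activation: fire (1) when the weighted sum is non-negative.
    return (z >= 0).astype(int)

def unit(x, W, b):
    return step(np.dot(x, W) + b)

inputs = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

AND = unit(inputs, np.array([1, 1]), -1.5)   # fires only when both inputs are 1
OR = unit(inputs, np.array([1, 1]), -0.5)    # fires when at least one input is 1
NOT = step(-np.array([0, 1]) + 0.5)          # inverts a single binary input
```

With NOT, AND and OR expressible by single units, any truth table can be built by wiring such units into layers, which is what makes multi-layer nets functionally complete.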
The net in Figure 3.3 is a simple single-layer feedforward neural net. The first layer is
the input layer, which takes the input features. Then a layer of computational units
processes these features; this is called the hidden layer. The final layer, giving the output of
the network, is the output layer. Several other architectural patterns are possible. Often more
than one hidden layer is used, which is then commonly called deep learning. The multiple
hidden layers can have different sizes and perform different operations on their input (e.g.
to reduce overfitting). Convolutional networks are networks where the input is transformed
through a variety of operations in early layers before any decision making takes place.
Recurrent networks are networks with loops, introduced by feeding a neuron's output back to
previous neurons and layers in the net, which gives these nets some form of memory.
The main difficulty in successfully applying neural networks to any problem is finding a suitable
architecture. At the time of writing, there are no known, definitively good 'recipes' for
constructing good networks.
Training and evaluation
Training a neural net is done by adapting the weights of the connections between layers. When
constructing a network, the weights are often initialized randomly. Using a labeled set of examples,
a prediction error for these samples is calculated using a feedforward pass. The weights are
then updated in order to minimize the prediction error following a loss function. This updating
is commonly done using backpropagation in conjunction with gradient descent. The output
error of the network is fed backwards through each layer. Backpropagation
is a mathematical procedure allowing the efficient computation of partial derivatives (requiring an
activation function with good derivative properties) of the chosen cost function. These partial
derivatives are then used to update the weights in all layers. This forward-backward cycle is
repeated until the network converges to a desired performance. The training phase of a neural
network is extremely computationally expensive; the backpropagation algorithm and access to
massive computing power and datasets have therefore been paramount in the recent success and
rediscovery of neural networks.
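The forward-backward cycle can be sketched for a one-hidden-layer net on the classic XOR task; the layer sizes, learning rate and iteration count are arbitrary choices for this illustration.

```python
import numpy as np

rng = np.random.default_rng(5)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# XOR: a task a single unit cannot solve, but a small hidden layer can.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t = np.array([[0], [1], [1], [0]], dtype=float)

W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)   # input -> hidden
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)   # hidden -> output

def forward():
    H = sigmoid(X @ W1 + b1)        # feedforward pass: hidden activations
    return H, sigmoid(H @ W2 + b2)  # ... and network output

_, y0 = forward()
loss_before = float(np.mean((y0 - t) ** 2))

lr = 1.0
for _ in range(5000):
    H, y = forward()
    # Backward pass: propagate the output error layer by layer via the
    # partial derivatives of the squared-error cost (sigmoid derivative
    # = a * (1 - a), one reason smooth activations are convenient).
    d_out = (y - t) * y * (1 - y)
    d_hid = d_out @ W2.T * H * (1 - H)
    W2 -= lr * H.T @ d_out
    b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_hid
    b1 -= lr * d_hid.sum(axis=0)

_, y = forward()
loss_after = float(np.mean((y - t) ** 2))
```

Repeating the cycle drives the loss down; with this setup the net typically ends up predicting XOR correctly, though the exact outcome depends on the random initialization.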
Training can be influenced by adapting the learning rate of the model and by using different
methods of sampling training examples. Using Gradient Descent, the complete set of training
samples is run through in order to perform one parameter update iteration (therefore also called
Batch GD). With large datasets, each step in the process of weight updates is very costly and
time-consuming. A variation of this is Stochastic Gradient Descent (SGD, also called Iterative
or On-line GD), which updates the weights after each training sample. The term stochastic
indicates that this gradient, based on a single sample, is an approximation of the true gradient,
and due to this stochastic nature, the path to the final solution may zig-zag rather than take
the direct way of GD. SGD has a higher variance due to single-sample picking and uses a lower
learning rate, but almost surely converges to the global cost minimum if the cost function is
convex. The computational cost per iteration is also lower for SGD, as only a single gradient
has to be computed. A compromise between speed and computational cost is Mini-Batch
Gradient Descent (MB-GD), where the gradient is computed on a group of samples in each
iteration. Algorithmically speaking, using larger mini-batches reduces the variance of the
updates (by taking the average of the gradients in the mini-batch), which allows bigger
step-sizes, meaning the optimization algorithm will progress faster. MB-GD converges in
fewer iterations than GD because the weights are updated more frequently; moreover, MB-GD
can utilize vectorized implementations, which typically results in a computational performance
gain over SGD.
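The three sampling schemes differ only in how many samples feed each update, which a single parametrized loop makes explicit; the one-parameter regression task, learning rate and epoch count below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)

# Trivial regression task: learn w in t = 2 * x from 256 noiseless samples.
x = rng.uniform(-1, 1, 256)
t = 2.0 * x

def grad(w, xb, tb):
    # Gradient (up to a constant factor) of the squared error on one batch.
    return np.mean((w * xb - tb) * xb)

def train(batch_size, lr=0.5, epochs=30):
    w = 0.0
    for _ in range(epochs):
        order = rng.permutation(x.size)           # shuffle each epoch
        for start in range(0, x.size, batch_size):
            idx = order[start:start + batch_size]
            w -= lr * grad(w, x[idx], t[idx])     # one parameter update
    return w

w_batch = train(batch_size=x.size)  # Batch GD: one update per epoch
w_sgd = train(batch_size=1)         # SGD: one (noisy) update per sample
w_mini = train(batch_size=32)       # MB-GD: averaged gradient per mini-batch
```

All three reach w ≈ 2 here; what differs in practice is the cost per update, the noise in each step, and how well the per-update arithmetic vectorizes.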
Once a network is trained, evaluation is done by applying the input features to the input
layer and subsequently calculating each layer's output from front to back, eventually resulting
in the output of the last (output) layer. This procedure, called the feedforward pass, is
straightforward and fast.
Deep Learning
A very important part of traditional machine learning is feature engineering: finding useful
patterns in the data to base decisions on. Feature learning is (automatically) finding common
patterns in data to use in classification or regression problems. The term deep learning origi-
nated from methods and strategies for designing deep hierarchies of non-linear features. The
first rise in popularity since the early neural networks of the 1960s came with the backpropagation
algorithm applied to neural net training in 1985. Since about 2010, the use of GPUs and
the rectifier activation function has led to practical applications of these powerful architectures
and solved some of the issues holding back success. Since 2013, LSTM networks have
been rapidly gaining ground in problems involving non-linear time dependencies in sequential data,
and together with convolutional nets they are the two major success stories of deep learning. Deep
learning models often achieve exceptional performance on a wide range of problems, and their
capabilities are still being discovered in applications across all domains of computing. Research
in deep learning has been accelerating rapidly since 2012-2014, when Google, Facebook and
Microsoft started to show high interest in the field.
Applications
Neural networks are nowadays used in several applications, often grouped under the deep learning
name. They have proven to be unmatched in object recognition in visual data (e.g. face
recognition in photos, handwriting reading), anomaly detection (spam processing), autonomous
systems such as self-driving cars, natural language processing and digital assistants. Deep
learning is nowadays seen as, and expected to be, a solution to all sorts of problems, and most
technology companies are heavily investing in the technology, both in the form of dedicated
hardware solutions (GPUs, dedicated processors) and software libraries and toolboxes.
3.2.2 Relevant techniques
This subsection will very briefly discuss some specific techniques regarding neural networks that
are used in and are relevant to this work.
Activation function
Each unit in a neural net applies a function to a combination (e.g. a weighted sum) of its inputs.
Several activation functions are possible and have been used throughout the history of neural
nets. Figure 3.4 shows a visual comparison between the three most widely used choices in neural
networks and deep learning.
The sigmoid function (also referred to as the logistic function) has seen frequent use historically. It
has a range of [0, 1], which means its output can be interpreted as a probability and is easy to
understand. Its smooth shape is attractive because of the importance of the gradient in
the backpropagation algorithm for efficiently training nets. A problem, however, is that the sigmoid
saturates: a large input leads to a gradient close to zero, known as the vanishing gradient problem.
Another, smaller issue is that its output is not zero-centered (which is desirable for gradient-based
training); the hyperbolic tangent solves this. The tangent is a scaled version of the sigmoid and has
largely replaced it in practice.
With the popularity of convolutional nets, the activation function of choice has shifted to
the rectifier. The rectifier is arguably more representative of the biological neuron than the
probability theory-inspired sigmoid or hyperbolic tangent. Neurons using the rectifier (or an
approximation of it) are called Rectified Linear Units (ReLUs). The ReLU has some advantages
compared to its predecessors for neural nets. It does not suffer from the vanishing gradient
problem, it leads to sparse solutions, and it can be used for training nets efficiently without
pre-training. It is also fast, as it involves no exponential computation, and nets using it have
proven to converge to a solution much faster than nets using the tanh activation function.
Introduced as an activation function in 2011, as of 2015 the rectifier is the most used activation
function for deep neural nets.
(a) Sigmoid: 1/(1 + e^−x)   (b) Tanh: tanh(x)   (c) Rectifier: max(0, x)
Figure 3.4: Comparison of the input-output relation of some activation functions used in ANNs.
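As an illustration, the three activation functions compared in Figure 3.4 can be sketched in a few lines of Python with NumPy (an illustrative sketch, not the implementation used in this work):

```python
import numpy as np

def sigmoid(x):
    # Logistic function: output in (0, 1), saturates for large |x|
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # Zero-centred, scaled variant of the sigmoid: range (-1, 1)
    return np.tanh(x)

def relu(x):
    # Rectifier: identity for positive inputs, zero otherwise
    return np.maximum(0.0, x)

x = np.array([-2.0, 0.0, 2.0])
relu_out = relu(x)        # array([0., 0., 2.])
sigmoid_out = sigmoid(x)  # tends towards 0 and 1 for large |x|
tanh_out = tanh(x)        # zero-centred, range (-1, 1)
```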
Layer types
This subsection will provide a short explanation of some different layers often encountered in
ANNs and deep learning.2
Fully Connected / Dense The fully-connected layer is the original ingredient of ANNs and
performs the high-level reasoning in the network. The layer consists of a set of units which are
all connected to all neurons in the previous layer. The activation of this layer can be computed
with a matrix multiplication.
Convolutional Convolution is a mathematical operation that mixes two pieces of information. In
the case of ConvNets, the input data (called a feature map) is mixed with the convolution kernel
to form one or more transformed feature maps. The operation is often interpreted as a filter,
with kernels filtering the feature map for a certain kind of information (e.g. edges, color...).
Convolution can be described as a cross-correlation relationship: the output of a convolutional
filter is high if the filter feature is present in the input. In deep learning, convolutional layers
are exceptionally good at finding good features in images, feeding the next layer to form a
hierarchy of nonlinear features that grow in complexity and abstraction. They bridge the spatial/-
time and frequency domains through the convolution theorem, exploit locality and parameter
sharing, and can be implemented extremely efficiently through Fourier transforms on current GPUs.
Max Pooling A convolutional layer is mostly followed by a pooling layer which performs
subsampling in convolutional nets. Information is funneled to the next (often another convolu-
tional) layer. Pooling provides some invariance for rotations and translation. The pooling area
2For a very good discussion of Convolutional Neural Networks, see http://cs231n.github.io/ for course notes from the Stanford CS course.
can vary in size, retaining more or less detailed information; larger pooling areas reduce
dimensionality further and help the resulting networks fit in GPU memory.
Dropout One effective way of improving model performance and countering overfitting is
combining multiple models. However, with computationally intensive ANNs, this approach
quickly becomes unreasonable. Introduced by [31], Dropout is the most effective
technique for addressing overfitting in ANNs. A dropout layer does not perform explicit
computations, but will simply enable or disable (drop out) neurons. At training time, a
neuron will be present and connect to the next layer with a certain probability (usually 0.5).
This essentially results in sampling a thinned network from the full one, giving 2^n possible models
from a network with n neurons. Training a neural network with dropout can be seen as training a
collection of thinned networks (with extensive weight sharing), each thinned model trained on very
few training samples. At test time, the single full network is used without dropout, each neuron
using a scaled set of weights. This means the expected output of a neuron is used at test time,
and each of the 2^n thinned networks is combined in that single network.
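The train/test behaviour described above can be sketched as follows (an illustrative sketch with NumPy, not the implementation used in this work; the keep probability of 0.5 matches the usual value mentioned above):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(activations, p_keep=0.5, train=True):
    # At training time each neuron is kept with probability p_keep
    # (sampling a thinned network); at test time the full network is
    # used with activations scaled by p_keep, so the expected output
    # of each neuron matches its training-time expectation.
    if train:
        mask = rng.random(activations.shape) < p_keep
        return activations * mask
    return activations * p_keep

a = np.ones((4, 6))
train_out = dropout_forward(a, 0.5, train=True)   # roughly half the units zeroed
test_out = dropout_forward(a, 0.5, train=False)   # all units present, scaled by 0.5
```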
3.3 Selected technique: Auto-Encoders
The approach in this work makes use of Auto-Encoders (AE). As such, the following section
will discuss this particular paradigm in slightly more detail, as well as some techniques tailored
to AEs.
The concept of AEs is relatively new: they were introduced by Hinton and Salakhutdinov
in 2006 as a 'data dimensionality reduction technique using neural networks' ([32]), which
captures the concept extremely well. Auto-Encoders are ANNs used to learn efficient codings of
data. In general, they are multi-layer neural networks consisting of an input and output layer,
with one or more hidden layers in between, as illustrated by figure 3.5. Compared to the general
model of a neural network, an autoencoder has some architectural differences and constraints.
The output layer of the autoencoder has the same form as the input layer. Instead of predict-
ing a value for a given input, an AE is trained to reconstruct its input. The purpose of the
autoencoder is to do some form of dimensionality reduction in its hidden layers, and to learn an
encoded form of representing the data. Therefore, the architecture of the network commonly
possesses a layer with a reduced size acting as a bottleneck. Input data is compressed to that
Figure 3.5: Auto-Encoder as a constraint on a neural net architecture
lower dimensionality using the part of the network acting as encoder. This part of the AE
thus acts as a dimensionality reduction technique.3 From the coded representation, the data is
expanded again by the part of the network used as decoder to reconstruct the input as well as
possible. The network takes as input the data itself, not a set of features based on the data, and
automatically tries to learn the required characteristics of the data. The input data need not
be labeled; therefore, AEs are unsupervised learning models. They are most commonly used as
feature learners in tandem with a supervised classification technique.
Traditional Auto-Encoder
An Auto-Encoder takes an input x ∈ Rd and first maps it to a latent representation h ∈ Rd′
(where commonly d′ < d) using a deterministic mapping. This mapping has the typical form of
an affine mapping followed by a non-linearity:
h = fθ(x) = σ(Wx + b)   (3.8)
The function fθ has parameters θ = {W, b}, where W is a d′ × d weight matrix and b is
a bias vector of dimensionality d′. The non-linearity σ is normally chosen to be one of the
common activation functions in conventional neural networks. The deterministic mapping fθ
is commonly called the encoder. This resulting latent representation (or code) is then used to
reconstruct the input by a reverse mapping gθ′ . This mapping is again an affine transformation
3One well-known related method for dimensionality reduction is PCA; AEs are a non-linear generalization of PCA, operate automatically, and are not restricted to the application of pure dimensionality reduction.
optionally followed by a non-linearity, and is called the decoder:
y = gθ′(h) = σ(W′h + b′)   (3.9)
The inverse mapping has parameter set θ′ = {W′, b′}, each appropriately sized. The two
parameter sets are often, but not necessarily, constrained to be of the form W ′ = W T : the same
weights are used to encode the input and to decode the latent representation. In a feedforward
pass of the network each sample pattern of the training set xi is mapped to its code hi and its
reconstructed version yi. In general, the reconstructed y is not to be interpreted as an exact
reconstruction of the input, but in probabilistic terms as inputs/parameters to a distribution
p(X|Y = y = gθ′(h); θ,θ′) that generates X with high probability. The reconstruction error
to be optimized is
L(x,y) ∝ −log(p(x|y)) (3.10)
Working with classification (or one-hot encoded values), x is a binary vector, x ∈ {0, 1}^d,
so a choice for p(x|y) is x|y ∼ β(y).4 The decoder produces a y ∈ [0, 1]^d. The loss function
associated with this setup is then

L(x, y) = L_H(x, y)   (3.11)
        = −∑_j [x_j log y_j + (1 − x_j) log(1 − y_j)]   (3.12)
        = H(β(x) || β(y))   (3.13)
This last term is called the cross-entropy loss, and it can be seen as the cross-entropy between
two independent multivariate Bernoullis with means x and y. The parameter sets of the
AE are optimized by minimizing this error (also called cost) function over the training set of n
(input, target) pairs St = {(x0, t0), ..., (xn, tn)}.
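A feedforward pass through such an AE, with tied weights W′ = W^T and the cross-entropy loss of Equation 3.12, can be sketched as follows (an illustrative sketch with NumPy; the dimensions d = 8 and d′ = 3 are assumed values for the example only):

```python
import numpy as np

rng = np.random.default_rng(1)
d, d_hidden = 8, 3  # illustrative sizes, with d' < d as a bottleneck

W = rng.normal(scale=0.1, size=(d_hidden, d))  # encoder weights (d' x d)
b = np.zeros(d_hidden)
W_prime = W.T                                  # tied weights: W' = W^T
b_prime = np.zeros(d)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def encode(x):
    # h = f_theta(x) = sigma(Wx + b)   (Equation 3.8)
    return sigmoid(W @ x + b)

def decode(h):
    # y = g_theta'(h) = sigma(W'h + b')   (Equation 3.9)
    return sigmoid(W_prime @ h + b_prime)

def cross_entropy(x, y):
    # L_H(x, y) = -sum_j [x_j log y_j + (1 - x_j) log(1 - y_j)]  (Eq. 3.12)
    return -np.sum(x * np.log(y) + (1 - x) * np.log(1 - y))

x = (rng.random(d) < 0.5).astype(float)  # a binary input vector
y = decode(encode(x))                    # reconstruction in [0, 1]^d
loss = cross_entropy(x, y)
```

Training would then minimize this loss over all pairs in the training set by gradient descent.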
If a latent representation h allows for good reconstruction of its input, it means that it has
retained much of the useful information that was present in the input. However, the criterion of
retaining information alone is not a useful one. By setting h = x, or using an AE where h has the
4Or equally: xj|y ∼ β(yj). β(a) represents the Bernoulli distribution with mean a. Extended to vector variables: x ∼ β(a) means ∀j, xj ∼ β(aj).
same dimensionality as x and learning the identity mapping, there are no useful representations
discovered besides the input itself. Additional constraints are necessary, naturally leading to a
non-zero reconstruction error. The most used approach is to introduce a bottleneck to produce
an under-complete latent representation where d′ < d. This representation is then a lossy,
compressed representation of x. Another possibility is to use an over-complete but sparse
representation, which achieves compression by a large fraction of zeros rather than its explicit
lower dimensionality.
3.3.1 Denoising Auto-Encoder
Traditional Auto-Encoders will learn the identity mapping without additional constraints. This
problem can be circumvented by using a probabilistic RBM approach, sparse coding, or de-
noising auto-encoders (dAEs) trying to reconstruct noisy inputs. Two underlying ideas inspire
this approach: a high-level representation should be robust and stable under limited input cor-
ruption, and performing a denoising task well requires extracting features that capture useful
structure in data. Training a dAE5 involves trying to reconstruct a clean input from a partially
destroyed version of it. The input can be corrupted by adding a variable amount of noise to it
(binomial noise, uncorrelated Gaussian noise, ...). By doing so, the dAE is trained to denoise
the input by using a slightly adapted version of Formulas 3.8 and 3.9, applied to the corrupted
input x̃:

h = fθ(x̃) = σ(Wx̃ + b)   (3.14)

y = gθ′(h) = σ(W′h + b′)   (3.15)
The parameter sets are now trained to minimize the reconstruction error by having y as
close as possible to the uncorrupted input x. The key difference is that now y is a deterministic
function of the corrupted input x̃ rather than of x. Each time a training sample is applied, a
different corrupted version is used by the application of noise. There is no change in the loss
function. By applying a deterministic mapping to a corrupted input, the network is forced to
learn more clever features and mappings (that prove to be useful in a task such as denoising,
rather than providing the identity mapping).
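The corruption step can be sketched as follows (an illustrative sketch using binomial masking noise; the noise level of 0.3 is an assumed value, not taken from this work):

```python
import numpy as np

rng = np.random.default_rng(2)

def corrupt(x, noise_level=0.3):
    # Binomial (masking) noise: each component of the input is set to
    # zero with probability noise_level, giving the partially destroyed
    # version x_tilde from which the dAE must reconstruct the clean x.
    mask = rng.random(x.shape) >= noise_level
    return x * mask

x = np.ones(10)                          # clean input
x_tilde = corrupt(x, noise_level=0.3)    # corrupted version fed to the encoder
# the loss is still computed against the clean x, not against x_tilde
```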
By stacking dAEs into a deep architecture (with a supervised classification part at the end),
5Note the goal is not the denoising task in itself. Denoising is merely used as a training criterion in order to extract useful features and form a high-level representation.
these architectures perform systematically better at several computer vision tasks than ordinary
AEs or Deep Belief Networks. They are also shown to correspond to a generative model. Lastly,
dAEs work well with data with missing values or multi-modal data, as they are trained on data
with missing parts (when corruption randomly hides parts of the values). ([33])
3.3.2 Convolutional Auto-Encoder
Both the conventional fully-connected Auto-Encoder and the dAE ignore the 2D structure of
data (commonly occurring as images). However, the most successful models in object recognition
try to discover localized features that repeat themselves all over the input. Convolutional Auto-
Encoders (CAEs) differ from conventional AEs in that their weights are shared among all locations
in the input, preserving spatial locality. The reconstruction is done through a linear combination
of basic image patches based on the latent code. The architecture of a CAE is built upon the
dAE, but applies weight sharing on feature maps. The latent representation of the kth feature
map of a mono-channel input x is given by
h^k = σ(x ∗ W^k + b^k)   (3.16)

where ∗ denotes the 2D convolution operator. Each latent feature map has its own bias b^k.
Reconstruction of a convoluted latent representation is then obtained by
y = σ( ∑_{k∈H} h^k ∗ W̃^k + c )   (3.17)

where H is the group of feature maps; W̃ is the flipped (over both dimensions) version of
the weights, and c is the bias (one per input channel). The CAE is trained just like normal
networks using a backpropagation algorithm which computes the gradient of an error function
with respect to the parameter sets. Assuming the Mean Squared Error (MSE) cost function
E(θ) = (1 / 2n) ∑_{i=1}^{n} (x_i − y_i)²   (3.18)
this gradient can be obtained by using convolution operators with the formula
∂E(θ) / ∂W^k = x ∗ ∂h^k + h̃^k ∗ ∂y   (3.19)
where ∂h and ∂y are the deltas of the hidden state and the reconstruction, respectively. Using
this, the parameters can be updated using (variations of) stochastic gradient descent.
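The forward computation of a single latent feature map (Equation 3.16) can be sketched with a naive 'valid' 2D convolution (an illustrative sketch; the 8×8 input and 3×3 kernel sizes are assumed values for the example only):

```python
import numpy as np

def conv2d_valid(x, w):
    # Naive 'valid' 2D convolution: the kernel is flipped over both
    # dimensions, matching the * operator in Equation 3.16.
    wf = w[::-1, ::-1]
    rows = x.shape[0] - w.shape[0] + 1
    cols = x.shape[1] - w.shape[1] + 1
    out = np.empty((rows, cols))
    for i in range(rows):
        for j in range(cols):
            out[i, j] = np.sum(x[i:i + w.shape[0], j:j + w.shape[1]] * wf)
    return out

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(3)
x = rng.random((8, 8))           # mono-channel input
W_k = rng.normal(size=(3, 3))    # weights of the k-th feature map, shared everywhere
b_k = 0.0
h_k = sigmoid(conv2d_valid(x, W_k) + b_k)  # Equation 3.16: one latent feature map
# h_k has shape (6, 6): the same 3x3 kernel is applied at every location
```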
Chapter 4
Methodology
In this chapter, the approach this work will take to the problem is explained and discussed.
First, a high-level overview of the compression scheme is offered; then some technical aspects
and design decisions are discussed.
4.1 Structural overview of the proposed compression scheme
In this work an Auto-Encoder will be trained for constructing a compression method. An input
file will be encoded to a bitstream suitable for transfer, and decoded again. The actual encoding
and decoding is done by an ANN. Compression is achieved by introducing a bottleneck in the
ANN network architecture. A large input sequence is represented as a compressed, coded repre-
sentation learned by the network plus a residue necessary to correct erroneous reconstructions.
By training the network using DNA data, the aim is to let it learn the implicit structure of the
input data, so it can create a good coded representation and effectively perform reconstruction.
The parameters (i.e. the weights of the ANN) of the encoder and decoder modules are not included
in the bitstream to be transferred. They are fixed (either to a hardcoded set of values or to
a reprogrammable set) during operation. As large ANNs often contain a large number of
parameters, including these in the bitstream for network transfer might not be efficient. Having
a fixed set of parameters also opens the possibility of efficient hardware implementations (e.g.
FPGAs, ASICs and coprocessors, as currently used for image processing). The encoder and
decoder processes thus share the network weights and architecture.
Figure 4.1: Block diagram of encoding/decoding process
4.1.1 Encoder
Figure 4.1 shows a block diagram of the encoding & decoding process. As a first step, the source
file is read and preprocessed to a format suitable for the encoder to accept as input. This input
is fed to the encoding part of the AE. This Encoder block is a simplified representation here (and
is discussed below). The encoding block will generate a compressed representation of the input,
which is used as the codeword. The codeword is fed to the decoder, which tries to reconstruct
the input, possibly making an incorrect reconstruction. This reconstruction is compared with
the original input, and the differences are stored in a residue. The codeword and residue are
combined in the bitstream, which is the output of the encoding process.
4.1.2 Decoder
The lower half of figure 4.1 shows the block diagram of the decoding process. The decoder takes
the compressed bitstream and starts by splitting off the codeword and residue. The codeword
is fed to the decoding part of the AE, which will try to reconstruct the desired output. The
decoding part of the AE is the same part as used in the encoding process, and the errors
that will be made by the reconstruction are thus known beforehand by the encoding process.
Therefore, the decoder now adds the provided residue to the reconstructed input to get the
faultless reconstruction of the original input. This reconstruction is eventually converted back
into the original source format.
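The interplay between codeword and residue described above can be sketched end-to-end with toy stand-ins for the trained AE halves (the subsampling "encoder" and repeating "decoder" below are purely hypothetical placeholders; a real system would use the trained network):

```python
import numpy as np

rng = np.random.default_rng(5)

def encode(seq):
    # Hypothetical stand-in for the AE encoder: keep every 4th base.
    return seq[::4]

def decode(code):
    # Hypothetical stand-in for the AE decoder: repeat each kept base,
    # so reconstruction errors occur (as with a real lossy bottleneck).
    return np.repeat(code, 4)

original = rng.integers(0, 5, size=100)     # integer-encoded bases (A..N -> 0..4)

# --- Encoding process: codeword + residue form the bitstream ---
codeword = encode(original)
reconstruction = decode(codeword)
residue = (original - reconstruction) % 5   # corrections, known at encode time

# --- Decoding process: same decoder, residue restores losslessness ---
restored = (decode(codeword) + residue) % 5
assert np.array_equal(restored, original)   # faultless reconstruction
```

The key point the sketch illustrates: because the encoder runs the decoder itself, the reconstruction errors are known in advance, and shipping the residue alongside the codeword makes the scheme lossless.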
4.1.3 Network view
The ENCODER and DECODER blocks in the previous block diagrams are actually both neural
networks. Figure 5.11 shows an example of what the compression scheme looks like from a neural
network focused perspective. The network takes an input representation and feedforwards it
through its hidden layers to output a coded representation. In order to achieve a compression
scheme, it is essential that the network's central hidden layer is of a lower dimensionality than
the input, forming some sort of bottleneck. After this bottleneck layer, a set of layers
(symmetrical to the encoding part) forms the decoding part of the network, where the inverse process
is done: the compressed representation is expanded to create the input reconstruction. The
network architecture (amount of layers, type of layers, size...) and implementation can be
varied, but will always adhere to this encoding-bottleneck-decoding structure.
4.2 Technical aspects & design decisions
4.2.1 Data acquisition
The dataset used throughout this work consists of the chromosomes of two human genome
sequences. These chromosomes are fully aligned and assembled from scaffolds. The genomes
available are one reference genome and one alternative genome. This data is freely available
online1 from the National Center for Biotechnology Information and is formatted
in the FASTA format (the downloadable files are Gzip compressed). The reference genome is
subdivided into 22 numbered chromosomes, plus the chromosomes X and Y. The mean filesize of
a single chromosome is about 128 MB. The content of these files is nearly purely a continuous
sequence of nucleobases (excluding a few lines of descriptive metadata in the file): for example,
the first chromosome of the reference genome contains a sequence of about 235 million bases.
1ftp://ftp.ncbi.nih.gov/genomes/Homo_sapiens/ARCHIVE/ANNOTATION_RELEASE.106/Assembled_
chromosomes/seq/
4.2.2 Data preprocessing
One of the desirable properties of deep learning techniques is the fact that interesting properties
of the input data can be learned automatically without human specification of the features of
interest. This happens by adapting the weights in the hidden layers, filters and neurons. As the
goal is for the network to automatically discover useful features, a first note is that no manually
engineered features (e.g. base averages, counts, frequencies ...) besides the raw source data are
used.
Starting from the FASTA file on the filesystem, the sequence first needs to be loaded into
program memory, into a data structure suitable for processing and input to the neural
network. The FASTA file is read into memory entirely. The descriptive text is stripped away,
all lines and parts contained in the file are chained together, and any non-frequent characters
are replaced by the letter N, as they will not be part of the compression scheme. Depending on
the requirements of the network, this base sequence is padded: 'horizontally' up to a certain
required multiple of bases, and 'vertically' to match the batch size.
The task this ANN has to perform is eventually the prediction of the correct base (one
letter out of a constrained alphabet) in the genome sequence. This ML problem falls under the
category of classification: there are several possible choices (class labels) as output out of which
one must be selected. The most common approach to this is by encoding the labels as a one-hot
matrix. When the output of the network contains a softmax function over the output matrix,
this output can be interpreted as class-probabilities for each sample. The predicted class (base)
is then the one on the index having the highest value in the output vector. Equation 4.1 shows
the conversion of bases to a 2D one-hot encoded matrix. This matrix is the target of the network.
· · · ACGTNA · · ·  ⇒

A: 1 0 0 0 0
C: 0 1 0 0 0
G: 0 0 1 0 0
T: 0 0 0 1 0
N: 0 0 0 0 1
A: 1 0 0 0 0
                                     (4.1)
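The encoding of Equation 4.1 can be sketched in a few lines (an illustrative sketch; the alphabet ACGTN follows the five-letter alphabet used in this work):

```python
import numpy as np

ALPHABET = "ACGTN"  # the constrained five-letter base alphabet

def one_hot(sequence):
    # One-hot encode a base string into a (len, 5) matrix as in Eq. 4.1:
    # each row has a single 1 in the column of the corresponding base.
    index = {base: i for i, base in enumerate(ALPHABET)}
    m = np.zeros((len(sequence), len(ALPHABET)), dtype=np.uint8)
    for row, base in enumerate(sequence):
        m[row, index[base]] = 1
    return m

m = one_hot("ACGTNA")  # the example sequence of Equation 4.1; shape (6, 5)
```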
CHAPTER 4. METHODOLOGY 43
The basic approach is thus to one-hot encode each letter and feed these vectors as input
to the network. By doing this however, the network will not learn any information from the
sequence and structure in the input, as it will basically learn just the probability for each class.
Therefore, multiple one-hot encoded letters can be chained together as input to the network.
Equation 4.2 shows the representation of a sequence of bases as a chained one-hot encoded matrix.
This way, the network can learn to recognize patterns, or repeating sequences, in the data. Note that
the output of the network will be reshaped to always be a single-base-wide one-hot encoded
matrix followed by a softmax: the network will be judged on the reconstruction of every base,
and not on the reconstruction of the entire input sequence.
A: 1 0 0 0 0
C: 0 1 0 0 0
G: 0 0 1 0 0
T: 0 0 0 1 0
N: 0 0 0 0 1
A: 1 0 0 0 0

⇒

ACG: 1 0 0 0 0 | 0 1 0 0 0 | 0 0 1 0 0
TNA: 0 0 0 1 0 | 0 0 0 0 1 | 1 0 0 0 0
                                     (4.2)
When moving to a convolutional network, a different kind of formatting is required. As
convolutional layers are mostly used on multi-channel images, the sequence data will be
transformed into a similar structure in order to mimic this successful setup. Color images contain
a channel for R(ed), G(reen) and B(lue). In this case, the sequence data is one-hot encoded,
so each position in the bit pattern can be interpreted as a channel for a certain base
in the alphabet (instead of channels R, G and B, there are now channels A, C, G, T and N).
Having five distinct classes in the data, an input sequence is therefore transformed into a cube
of 5-channel patterns. Figure 4.2 shows this data formatting.
4.2.3 Network training
During the construction of a suitable network architecture, the chosen scenario is to train
an AE on a certain chromosome (e.g. chromosome 1) in order to perform compression on the
same chromosome number. As there are two genomes available, one genome provides the
Figure 4.2: Base sequence representation as 5-channel cubes.
chromosome for training and validation, and the other genome provides the test data. From
the first chromosome, 66% is used as the training set; the remaining third is used as the validation
set. The test data will depend on the evaluation scenario (see later).
Training is done using the minibatch approach. Each training step (often called an epoch)
consists of 10 updates with minibatches of 50 base sequences each. These batches
are contiguous base sequences starting at an index randomly picked from the training data set
(with repeated selection possible). The network will try to optimize the categorical cross-entropy
as loss function between its reconstructed output and the targets. Updating the weights in the
layers happens using the Nesterov momentum method - with a learning rate of 0.01 and a
momentum of 0.9 - for the traditional fully-connected AE, and using the Adam method with
a learning rate of 0.001 for the convolutional AE.2 As an AE normally has a symmetrical
architecture around its bottleneck layer, one can choose to tie the weights of the symmetrical
counterparts for each layer or to train them independently. There is currently no rule for this other
than simply experimenting to find out what works, so both approaches are tested.
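The minibatch sampling described above can be sketched as follows (an illustrative sketch; the window length of 100 bases is an assumed value, since the input sequence length varies per experiment):

```python
import numpy as np

rng = np.random.default_rng(4)

def sample_batch(encoded_seq, batch_size=50, window=100):
    # Draw a minibatch of contiguous windows whose start indices are
    # picked uniformly at random from the training portion, with
    # repeated selection possible (sampling with replacement).
    starts = rng.integers(0, len(encoded_seq) - window, size=batch_size)
    return np.stack([encoded_seq[s:s + window] for s in starts])

data = rng.integers(0, 5, size=10_000)      # toy integer-encoded chromosome
batch = sample_batch(data, batch_size=50)   # shape (50, 100)
```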
In the following parts (unless stated otherwise), chromosome 1 of the reference genome is
used as the training and validation genome (hs_ref_GRCh38_chr1.fa). This is one of the largest
chromosomes in the set, but there is no specific reason for this choice other than it containing
a lot of data.
2For an overview of gradient descent optimization algorithms, see http://sebastianruder.com/optimizing-gradient-descent/index.html
4.2.4 Compression-Accuracy trade-off
The AE represents an input sequence as one or more values from neurons in the hidden
bottleneck layer. These values will be the codeword in the compression scheme. By packing
multiple bases (i.e. a long sequence) into these values, compression can be achieved. There
are two ways of influencing this compression ratio: either improve compression by
packing more bases into the same number of bottleneck neurons, or try to improve reconstruction
performance by increasing the number of bottleneck units. There is a balance to be struck
between these options. Each base can be represented by 3 bits (offering 2^3 = 8 possibilities
while five are required). A neuron value is represented by a 32-bit floating point number. The
resulting compression rate of the AE can thus crudely be estimated as
resulting compression rate of the AE can thus crudely be estimated as
X bases · 3 bitbase
Y hidden units · 32 bithidden unit
=3X
32 Y= size reduction (4.3)
As an example, if the input is a sequence of 100 bases and the bottleneck contains two
hidden units, the size reduction would be
3 · 100
32 · 2= 4.6875 (4.4)
The filesize of the data would thus be reduced by a factor of about 4.7. However, this
assumes that perfect reconstruction is achieved by the decoding process. In reality, this is not
the case. Errors made in the reconstruction have to be corrected, and this error residue must
be stored together with the codeword, reducing the overall compression rate. For this reason,
the reconstruction performance is not the only metric of interest: the entropy of the residue is
important as well, as low-entropy data can in general be encoded and compressed efficiently by
arithmetic coding.
When varying the input size or the number of hidden units, one is thus balancing these two things: on
the one hand, the compression rate increases with larger input sequence length, but the accuracy
drops. On the other hand, more hidden units mean a smaller compression rate for the codeword,
but better reconstruction performance and a smaller (or lower-entropy) residue
could be possible.
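Equation 4.3 translates directly into a small helper (an illustrative sketch; note it ignores the residue, so it is an upper bound on the achievable ratio):

```python
def size_reduction(n_bases, n_hidden, bits_per_base=3, bits_per_unit=32):
    # Crude size-reduction estimate from Equation 4.3: input bases at
    # 3 bits each versus bottleneck units stored as 32-bit floats.
    # The residue is not accounted for, so this is an upper bound.
    return (n_bases * bits_per_base) / (n_hidden * bits_per_unit)

ratio = size_reduction(100, 2)  # the worked example of Eq. 4.4: 300 / 64 = 4.6875
```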
4.2.5 Evaluation
Scenarios
The compression networks can be applied in several scenarios. The data from the human
genome consists of 23 distinct chromosomes. A first approach is to train a network on a single
chromosome. This network is then evaluated on the same chromosome from a second human
genome. This scenario would benefit from the similarities between the same chromosome in
different individuals from the same species. A second approach is to train a network on a
chromosome from one genome, and then use the network on other chromosomes from that same
genome specimen. This could expose similarities between chromosomes, and possibly reveal
some form of classes of chromosomes with high similarity. Lastly, a genome in its entirety
can be used to train an encoder, which is then used to compress an entire genome without
distinguishing between individual chromosomes. This would average out characteristics of specific
chromosomes and look at the general structure of DNA as a whole. Figure 4.3 shows these three
scenarios.
Metrics
Reconstruction accuracy The task the AE will be trained to perform is to reconstruct
its input from the coded sequence. The input in this case is a sequence of bases from the
genome. The first important metric in order to evaluate the effectiveness of the model is the
percentage of correctly reconstructed bases. The higher this reconstruction accuracy is, the
less additional information must be stored in the residue to correct the reconstruction.
In a ML model using multi-class classification (choosing a certain class, which here means a
certain base), the architectural setup normally used for this kind of problem is the softmax
plus categorical accuracy. The network will output a matrix where each row represents one
input sample to be classified. The matrix has as many columns as there are different classes to
choose from. The final layer of the network will apply the softmax function to each row of this
matrix.3 After this, each column value in the row can be interpreted as the probability of that
sample of belonging to the class of that column. Determining the predicted class is thus finding
the index of the maximum value of this row, and comparing this with the class of the one-hot
3Certain implementations (and currently the implementation in the Lasagne framework) of the softmax nonlinearity are not numerically stable. This can lead to NaN appearing as loss when the network comes close to the desired targets. This is solved by using a modified version of softmax, LogSoftmax, and an associated modified categorical accuracy. This modified version is used here, but the principles stay the same.
[Figure 4.3: (a) between different chromosomes of the same sequence; (b) between the same chromosomes in different sequences; (c) between entire sequences]
Figure 4.3: Schematic display of three different evaluation scenarios. The grey shaded part is the data used to train the model, and the arrows point to the (test) data which is compressed using the trained model.
encoded input row. The metric giving the fraction of correctly classified samples is called the
categorical accuracy.
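The softmax-plus-argmax setup described above can be sketched as follows (an illustrative sketch with NumPy; subtracting the row maximum before exponentiation is the usual stabilization trick, in the spirit of the LogSoftmax workaround mentioned in the footnote):

```python
import numpy as np

def softmax(z):
    # Row-wise softmax; subtracting the row max improves numerical stability
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def categorical_accuracy(logits, one_hot_targets):
    # Fraction of rows whose arg-max class matches the target class
    predicted = softmax(logits).argmax(axis=1)
    actual = one_hot_targets.argmax(axis=1)
    return float(np.mean(predicted == actual))

logits = np.array([[2.0, 0.1, 0.1, 0.1, 0.1],   # predicts class 0 (base A)
                   [0.1, 0.1, 3.0, 0.1, 0.1]])  # predicts class 2 (base G)
targets = np.array([[1, 0, 0, 0, 0],            # true class 0
                    [0, 0, 0, 0, 1]])           # true class 4
acc = categorical_accuracy(logits, targets)     # 0.5: one of two correct
```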
Output cross-entropy After the bases are reconstructed by the model, the reconstruction is compared to the original input. When an error has been made, it has to be corrected (as lossless compression is required). These corrections are stored in the residue. With the compression purpose in mind, a useful property to consider is the structure (and more specifically the entropy) of this residue. This determines whether or not arithmetic coding can effectively be applied to the residue in a next step, leading to a compact representation of the residue. The
entropy of an array of probabilities is defined as

H(X) = −Σ_{i=1}^{n} P(x_i) · log_b P(x_i).    (4.5)
The maximum entropy is reached for a uniform distribution, where each event is equiprobable. A large entropy indicates there is no significant difference in the occurrence of the elements of the alphabet, which leads to a lot of uncertainty about the next character in a sequence. This means there is little gain to be found in arithmetic coding, as there is no inherent structure present in the stream of events. The aim is thus for this metric to be minimal.
This is also the metric used for updating and training the network. As an example, with an alphabet of 5 letters and a uniform distribution (P(x_i) = 1/5), the maximum entropy is calculated as shown in Equation 4.6, using the natural logarithm (b = e):

H(X) = −Σ_{i=1}^{5} (1/5) · log(1/5)
     = −5 · (1/5) · log(1/5)
     = −log(1/5)
     = log 5 ≈ 1.6094    (4.6)
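This maximum-entropy value is easy to verify numerically; a small sketch using the natural logarithm, as in Equation 4.6:

```python
import math
from collections import Counter

def entropy(sequence, base=math.e):
    """Shannon entropy H(X) = -sum p_i * log_b(p_i) of a symbol stream."""
    counts = Counter(sequence)
    total = len(sequence)
    return -sum((c / total) * math.log(c / total, base) for c in counts.values())

# A uniform 5-letter alphabet reaches the maximum entropy log(5) ~ 1.6094.
uniform = "ACGTN" * 100
print(round(entropy(uniform), 4))  # -> 1.6094
```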
The (categorical) cross-entropy is a widely used loss function in machine learning, as it allows the gradient calculations which are required by iterative learning algorithms. Rather than telling how much of the time the network is right (accuracy), this metric provides information on how right you are, which is far more useful in determining the desired update to the network.
Residue representation The last metric of interest concerns the error residue, holding the information to correct a faulty reconstruction. While the aim is for this component to be as small as possible, it can make a significant addition to the compressed filesize if the network is unable to perform well. Three ways of storing the residue are considered. Assume a sequence of ten items long, which is reconstructed and ends up having three errors in its reconstruction. Also assume the labels in this case are integer numbers.⁴
A first way of representing the residue for this situation is as one list containing for each
position either a symbol meaning the prediction was correct (here the number 0 is used), or in
case of an error, the correction to be made:
residue1 = [0, 0, 0, 7, 0, 0, 15, 0, 9, 0] (4.7)
The second way of representing the residue stores only the occurrences of errors. This residue is a list where each item contains the index of an error, and the correction to be made:
residue2 = [(3, 7), (6, 15), (8, 9)] (4.8)
The first residue representation is useful when there are a lot of errors. A single number in the array contains all the information required. However, this list is as long as the entire sequence, no matter how few errors are made. The second representation is useful when the residue is sparse: the information on correctly reconstructed items can be omitted; only storing the errors is required. Without sparsity, however, each error is stored as a pair of numbers, meaning this representation would be larger than the naive array storage of residue1, where the index is implicitly given by its position in the array.
The third representation is somewhat of a hybrid form of these two. It contains two lists, the first simply being a binary vector (often called a bitmask) indicating whether or not each position is correctly reconstructed. The second list contains the desired values in case an error was made. This representation is a compromise between the two predecessors, and works well for a lot of applications if no assumptions on the error frequency are available. Both of the arrays can be
⁴ In practice, most labels are encoded as integer numbers in machine learning implementations, as it allows for efficient computations.
efficiently coded (using e.g. arithmetic coding, differential coding, Huffman coding...) on their
own.
residue3 = [0, 0, 0, 1, 0, 0, 1, 0, 1, 0], [7, 15, 9] (4.9)
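The three representations can be derived side by side from the original and reconstructed sequences; a sketch reproducing the ten-item example (the concrete integer labels are invented so that the outputs match Equations 4.7-4.9):

```python
def build_residues(original, reconstruction):
    """Build the three residue representations for a reconstructed sequence.

    Assumes integer labels where 0 is reserved as the 'correct' marker
    for the first representation.
    """
    pairs = list(zip(original, reconstruction))
    # 1) Full-length list: 0 where correct, the correction otherwise.
    residue1 = [o if o != r else 0 for o, r in pairs]
    # 2) Sparse list of (index, correction) pairs.
    residue2 = [(i, o) for i, (o, r) in enumerate(pairs) if o != r]
    # 3) Hybrid: a bitmask plus the list of corrections.
    bitmask = [1 if o != r else 0 for o, r in pairs]
    corrections = [o for o, r in pairs if o != r]
    return residue1, residue2, (bitmask, corrections)

original       = [1, 2, 3, 7, 4, 2, 15, 1, 9, 3]
reconstruction = [1, 2, 3, 4, 4, 2, 2, 1, 1, 3]  # errors at positions 3, 6 and 8
r1, r2, r3 = build_residues(original, reconstruction)
# r1 == [0, 0, 0, 7, 0, 0, 15, 0, 9, 0]
# r2 == [(3, 7), (6, 15), (8, 9)]
# r3 == ([0, 0, 0, 1, 0, 0, 1, 0, 1, 0], [7, 15, 9])
```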
4.2.6 Software & Hardware
The implementation of this compression scheme is done in the Python programming language. For the neural network implementation, the frameworks Theano⁵ and Lasagne⁶ are used. Theano ([34]) is a framework allowing to define, optimize and evaluate mathematical expressions involving multi-dimensional arrays efficiently. Since its introduction in 2008, multiple frameworks have been built on top of it and it has been used to produce many state-of-the-art machine learning models. It is a widely used CPU and GPU compiler in the ML community and allows for very efficient computation using GPU acceleration. One of the frameworks built on top of it is Lasagne, a light-weight open source framework started by Sander Dieleman in September 2014. Lasagne offers a more high-level abstraction of Theano expressions, offering several useful constructs for well-known neural network components. Lastly, some modules of the scikit-learn⁷ framework are used for preprocessing the data. Scikit-learn offers a variety of tools for creating ML programs, not restricted to neural networks. All of these frameworks build on top of NumPy⁸, a package for scientific computing and efficient array manipulation in Python.
The hardware used for this project consists of a set of servers from the UGent MMlab, each having two 10-core Intel(R) Xeon(R) E5-2650v3 CPUs operating at 2.30 GHz with 128 GB of RAM. For working on convolutional networks, GPU acceleration is harnessed by working on the GPU-enabled workstations of the UGent Reservoir Lab. The specific GPUs used for this work are the Nvidia Tesla K40c and the Nvidia Titan X, both having 12 GB of GPU memory available, in a workstation with an Intel(R) Core(TM) i7-3930K CPU operating at 3.20 GHz and having 32 GB of RAM.
⁵ http://deeplearning.net/software/theano/
⁶ http://lasagne.readthedocs.io/en/latest/index.html
⁷ http://scikit-learn.org/stable/index.html
⁸ http://www.numpy.org/
Chapter 5
Results
This chapter discusses the results of constructing and applying AEs to DNA sequences. It starts (as all ML problems do) with an analysis of the source data the program has to work with. Then a section discusses the machine learning approach to the problem, where several variations of AEs with increasing complexity are implemented and their performance evaluated. After this, having compared and selected a well-performing AE architecture, a last section discusses an implementation of the compression scheme using this AE following the three scenarios discussed.
5.1 Data Analysis
File content
Each FASTA-formatted file has its content structured as shown in Listing 5.1. The file contains multiple scaffolds (parts of a sequence), each having some descriptive metadata. A descriptor line, starting with the '>' character, indicates the properties of the following scaffold. This descriptor is followed by the nucleobases in the chromosome. Multiple (descriptor, base sequence) pairs can be present in each file.
CHAPTER 5. RESULTS 52
>gi|568815364|ref|NT_077402.3| Homo sapiens chromosome 1 genomic scaffold, GRCh38.p2 Primary Assembly HSCHR1_CTG1
TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAAC
CCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCAACCCTAACCCTAACCCTAACCCTAACCCTAA
...
ATGCAAAGATAAATATAAAAACTGATACTCCATCCAGTTACCAGAAAACATTTAGGTATGTGTGAGACAA
CTTGGGTATGTGAACCTACCTTTTCAATGTAAATTCAGTGAAATCTAAGTACAGAT
>gi|568815363|ref|NT_187170.1| Homo sapiens chromosome 1 genomic scaffold, GRCh38.p2 Primary Assembly HSCHR1_CTG1_1
GATTCATGGCTGAAATCGTGTTTGACCAGCTATGTGTGTCTCTCAATCCGATCAAGTAGATGTCTAAAAT
TAACCGTCAGAATATTTATGCCTGATTCATGGCTGAAATTGTGTTTGACCAGCTATGTGTGTCTCTTAAT
...
Listing 5.1: Example of a FASTA file
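A file with this layout can be parsed in a few lines of Python; a minimal sketch (not the exact parser used in this work):

```python
import io

def read_fasta(handle):
    """Yield (descriptor, sequence) pairs from a FASTA stream."""
    descriptor, parts = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if descriptor is not None:       # flush the previous scaffold
                yield descriptor, "".join(parts)
            descriptor, parts = line[1:], []
        elif line:
            parts.append(line)               # sequence lines are concatenated
    if descriptor is not None:
        yield descriptor, "".join(parts)

example = io.StringIO(">scaffold_1 example\nTAACCC\nTAACCC\n>scaffold_2 example\nGATTCA\n")
records = list(read_fasta(example))
# records == [('scaffold_1 example', 'TAACCCTAACCC'), ('scaffold_2 example', 'GATTCA')]
```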
The bases occurring in a genome sequence are A, C, G, and T. One other letter is considered as well: N, for unspecified or unknown. However, on rare occasions, other characters are encountered in the file, such as K, S, Y, M, W, R, B, D, H, V, and the - symbol. These represent wildcards or a set of possibilities that is not fully determined. It should be noted that these do not occur in the physical strand, but rather are a way of expressing uncertainties due to the limitations of current sequencing technology. Table 5.1 shows the count of these anomalies for each chromosome. On average, this occurs only 4 times per chromosome, meaning they are a negligible fraction of the content. Due to the extremely low probability of these characters, including them in an encoding scheme would reduce the efficiency of the scheme significantly. A common approach for this kind of situation is to include them as raw, unencoded characters in the output stream, sent 'as is'. In this work, these letters are replaced by N characters, and the compression models are only aware of the five main characters in use in the sequence.
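The replacement step can be sketched as a simple filter over the alphabet (an illustration, not the exact preprocessing code used here):

```python
# Replace every IUPAC ambiguity code (K, S, Y, M, W, R, B, D, H, V, '-')
# by the generic unknown symbol N, so that downstream models only ever
# see the five-letter alphabet {A, C, G, T, N}.
KEEP = set("ACGTN")

def normalise(sequence):
    return "".join(c if c in KEEP else "N" for c in sequence)

print(normalise("ACGTKSYACGT-N"))  # -> ACGTNNNACGTNN
```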
Bases, pairs & triplets statistics
In works discussing DNA data from a biological point of view, bases are often grouped together per three, where they are called triplets or codons. This grouping has a biological purpose. From a compression point of view, if the occurrence of a (group of) base(s) differs significantly from its expected frequency, this could be a characteristic of the data to exploit. If it turns out that, for example, only a subset of all possible codons (out of the 64 possibilities) occurs, this can prove to be valuable information in designing a compression scheme.
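Gathering such frequency statistics amounts to counting overlapping k-mers; a sketch with a sliding window:

```python
from collections import Counter

def kmer_frequencies(sequence, k):
    """Relative frequency of every overlapping k-mer in the sequence."""
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}

seq = "TAACCCTAACCCTAACCC"
triplets = kmer_frequencies(seq, 3)
# With a 5-symbol alphabet there are at most 5**3 = 125 distinct triplets;
# a strongly skewed distribution would be the kind of structure worth exploiting.
```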
filename              Filesize (B)   replacements   characters      Entropy
GRCh38.p2.chr1.fa     239,324,742    2              2.359 · 10^8    1.38
GRCh38.p2.chr10.fa    137,217,937    36             1.353 · 10^8    1.387
GRCh38.p2.chr11.fa    139,240,135    0              1.373 · 10^8    1.377
GRCh38.p2.chr12.fa    138,387,841    3              1.364 · 10^8    1.37
GRCh38.p2.chr13.fa    101,198,223    3              9.977 · 10^7    1.364
GRCh38.p2.chr14.fa    96,655,086     0              9.529 · 10^7    1.385
GRCh38.p2.chr15.fa    99,469,680     0              9.807 · 10^7    1.404
GRCh38.p2.chr16.fa    88,410,877     1              8.716 · 10^7    1.404
GRCh38.p2.chr17.fa    93,966,740     12             9.264 · 10^7    1.389
GRCh38.p2.chr18.fa    83,020,886     0              8.185 · 10^7    1.374
GRCh38.p2.chr19.fa    74,829,450     0              7.377 · 10^7    1.493
GRCh38.p2.chr2.fa     247,160,296    9              2.437 · 10^8    1.371
GRCh38.p2.chr20.fa    65,455,748     0              6.453 · 10^7    1.388
GRCh38.p2.chr21.fa    41,579,794     3              4.099 · 10^7    1.377
GRCh38.p2.chr22.fa    42,778,725     5              4.217 · 10^7    1.392
GRCh38.p2.chr3.fa     205,139,933    7              2.022 · 10^8    1.367
GRCh38.p2.chr4.fa     195,958,575    0              1.932 · 10^8    1.363
GRCh38.p2.chr5.fa     188,701,111    0              1.860 · 10^8    1.373
GRCh38.p2.chr6.fa     208,639,177    1              2.057 · 10^8    1.454
GRCh38.p2.chr7.fa     163,942,113    4              1.616 · 10^8    1.374
GRCh38.p2.chr8.fa     151,504,740    0              1.494 · 10^8    1.367
GRCh38.p2.chr9.fa     125,257,656    3              1.235 · 10^8    1.38
GRCh38.p2.chrX.fa     158,296,928    5              1.561 · 10^8    1.381
Table 5.1: Chromosomes and their file content in the reference genome. The entropy is calculated on the contained sequence, and not on the data stream which includes metadata.
Figure 5.1 shows the frequency of the five letters in the full reference genome sequence after replacements, with the minimum and maximum occurrences shown by the error bars. As a point of reference, the 25% line is marked by the dashed line, indicating the expected frequency under a uniform distribution. From this figure, it is shown that the distribution of the characters is rather even. No base has an exceptionally high or low occurrence compared to the others, and the differences between genomes are modest. Of note is the low frequency of the N character. Still, this character is included in the alphabet, as a placeholder for 'none of the others' is necessary, and its occurrence is high enough to justify its presence. It should be noted that this is merely a limitation of current technology, and the unknown part of the sequence will diminish further with advances in sequencing equipment.
Figures 5.2 and 5.3 show a similar frequency analysis for base pairs and triplets, each time with a dashed line marking the expected frequency in case of a uniform distribution. What would be useful to exploit is if only a small subset of all possible combinations occurred in the sequences. These groups could then be encoded as a single code symbol, and the limited amount
Figure 5.1: Occurrence of single bases in the full human reference genome. Error bars indicate the occurrence minima and maxima in separate chromosomes.
of possibilities could lead to an efficient encoding. Unfortunately, from these figures it is clear that this is not the case. Nearly all combinations (of the 25 pairs or 125 triplets) occur in reality, meaning encoding using a fixed set of groups is not a viable option.
From this frequency analysis, it is concluded that no information or calculated statistics are of immediate use in devising a compression scheme. Only the raw content of the sequence will be used as input to the AE, which will automatically learn features, instead of performing manual feature engineering.
5.2 Baseline comparison: state of the art
For DNA sequence data, there are unfortunately no advanced compression models available that make for an easy comparison. The most used methods (because of their generally good compression rate) are general-purpose compression schemes such as Zip & 7-Zip. Table 5.2 shows each chromosome of the reference genome used in this work compressed with 3 popular variants of this method: Zip, 7-Zip, and the proprietary RAR5 format. All methods were configured to use their best compression levels. The best compression rate is achieved by 7-Zip, with a rate averaging 20%-30%. The resulting compressed file corresponds to a rate of somewhat under 2 bpb. Note this is compression on the whole FASTA file, including the descriptive sequence metadata which is not considered in this work. However, these descriptors are only a minuscule fraction of the file content and thus do not influence the results significantly.
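A comparable baseline can be reproduced with the DEFLATE implementation from the Python standard library (the algorithm behind Zip; this does not reproduce the exact 7-Zip figures of Table 5.2):

```python
import zlib

def bits_per_base(sequence):
    """Bits per base achieved by DEFLATE (Zip's algorithm) at its best
    compression level, as a rough general-purpose baseline."""
    compressed = zlib.compress(sequence.encode("ascii"), 9)
    return 8 * len(compressed) / len(sequence)

# A repetitive telomeric stretch compresses far below 2 bits per base;
# typical genomic sequence stays a little under 2 bpb, as in Table 5.2.
ratio = bits_per_base("TAACCC" * 10000)
```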
Figure 5.2: Occurrence of base pairs in the full human reference genome. Error bars indicate the occurrence minima and maxima in separate chromosomes.
Figure 5.3: Occurrence of codons (base triplets) in the full human reference genome. Error bars indicate the occurrence minima and maxima in separate chromosomes. Codon labels are omitted for clarity.
Figure 5.4: Occurrence of groups of 2 bases in the reference genome, compared with their expected frequency. The bars show the actual frequency of the group, the mark shows the frequency which is expected based on the single-base frequencies.
Figure 5.5: Occurrence of groups of 3 bases in the reference genome, compared with their expected frequency.
filename              filesize (MiB)  Zip   RAR5  7-Zip  Compression  bpb
GRCh38.p2.chr1.fa     228             65    63    54     0.24         1.94
GRCh38.p2.chr10.fa    130             37    36    31     0.24         1.97
GRCh38.p2.chr11.fa    132             37    36    31     0.23         1.94
GRCh38.p2.chr12.fa    131             37    36    31     0.24         1.95
GRCh38.p2.chr13.fa    95              27    26    23     0.24         1.95
GRCh38.p2.chr14.fa    92              26    25    21     0.23         1.89
GRCh38.p2.chr15.fa    94              26    25    20     0.21         1.79
GRCh38.p2.chr16.fa    84              23    22    19     0.23         1.88
GRCh38.p2.chr17.fa    89              24    23    20     0.22         1.84
GRCh38.p2.chr18.fa    79              21    21    18     0.23         1.9
GRCh38.p2.chr19.fa    71              18    16    13     0.18         1.5
GRCh38.p2.chr2.fa     235             68    66    57     0.24         1.99
GRCh38.p2.chr20.fa    62              17    16    14     0.23         1.92
GRCh38.p2.chr21.fa    39              11    10    9      0.23         1.87
GRCh38.p2.chr22.fa    40              11    10    9      0.23         1.8
GRCh38.p2.chr3.fa     195             56    54    47     0.24         1.98
GRCh38.p2.chr4.fa     186             54    52    45     0.24         1.96
GRCh38.p2.chr5.fa     179             51    49    43     0.24         1.95
GRCh38.p2.chr6.fa     198             56    54    43     0.22         1.77
GRCh38.p2.chr7.fa     156             44    42    37     0.24         1.93
GRCh38.p2.chr8.fa     144             41    40    35     0.24         1.97
GRCh38.p2.chr9.fa     119             34    33    27     0.23         1.9
GRCh38.p2.chrX.fa     150             42    40    34     0.23         1.87
Table 5.2: General-purpose compression software on chromosome FASTA files of the human reference sequence. The Compression column is the 7-Zip filesize compared to the uncompressed file, and the bpb column is the bits per base achieved by the 7-Zip compression.
5.3 Auto-Encoder model construction
In this section, several Auto-Encoder architectures of increasing complexity are implemented to explore the feasibility of applying this method to compress genome data. Several network architectures are tried out and improved upon to form a suitable AE setup.
5.3.1 Shallow fully-connected Auto-Encoder
The first network constructed is a traditional fully-connected AE consisting of five layers. The first is the input layer taking the matrix of one-hot encoded bases into the network. After that a hidden fully-connected layer follows. This layer serves as the bottleneck layer of the network (in the context of Auto-Encoders often called the encoding layer). Then the reconstruction (decoding layer) fully-connected layer follows. The output of this layer is reshaped to conform to the target dimensions and a softmax nonlinearity is applied subsequently. The nonlinearity used in the neurons of these dense layers is the hyperbolic tangent. Figure 5.6 shows this network structure with an encoding size of 2 neurons. This single-hidden-layer architecture has been used as an encoder in lossy image compression and is for most AE approaches the baseline to start from. The network is implemented once having shared weights between the fully-connected layers, and once where each layer has an independently trained set of parameters. This small network has only 17 trainable parameters in the case of weight sharing, or 27 in the independent case.
Figure 5.6: Shallow traditional fully-connected AE with single-base input and two encoding neurons.
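The forward pass of this tied-weight network can be written out in a few lines of NumPy (a sketch of the computation, not the actual Theano implementation). Note that the shared 5×2 weight matrix plus the two bias vectors account exactly for the 17 trainable parameters mentioned above:

```python
import numpy as np

rng = np.random.default_rng(0)

n_in, n_code = 5, 2                      # one one-hot base in, 2 encoding units
W = rng.normal(0, 0.1, (n_in, n_code))   # shared between encoder and decoder
b_enc = np.zeros(n_code)
b_dec = np.zeros(n_in)                   # 10 + 2 + 5 = 17 parameters in total

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def forward(x):
    """Tied-weight shallow AE: tanh encoder, decoder reusing W transposed,
    softmax over the reconstructed base probabilities."""
    code = np.tanh(x @ W + b_enc)        # bottleneck of 2 values
    return softmax(code @ W.T + b_dec)   # one probability per base symbol

batch = np.eye(5)                        # the five one-hot bases as a batch
reconstruction = forward(batch)          # shape (5, 5), each row sums to 1
```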
The result of the experiment is shown in Figure 5.7. From this figure it can be seen that, as a start, the network is functional; it quickly learns to perform the reconstruction. The loss function shows a continuous decrease as well, indicating a stable learning behaviour. However, this network configuration is practically useless from a compression point of view. Each single base, which can be represented by 3 bits, is now represented by a set of two 32-bit numbers. Essentially the representation is blown up, and as an over-representation achieves the contrary of compression. Reaching an accuracy of 100% is therefore not really meaningful besides making sure the network functions correctly. When comparing the tied weights and the independent weights, the tied version seems to converge more quickly, but as both go to 100% accuracy, it is hard to discuss any performance differences.
Architectural variations
In order to achieve an information bottleneck in the AE, two variations are made to the single-base AE: increasing the input to a sequence of multiple bases and varying the number of encoding units. Figure 5.8 indicates this modified architecture. The network now takes in a sequence of bases, but note that the accuracy is still considered per base. By varying these two parameters, a trade-off can be made between codeword size and reconstruction performance.
Figure 5.7: Reconstruction accuracy and loss function of the shallow fully-connected AE with single-base input and two encoding units. Above: independent weights, below: tied weights.
Figure 5.8: Variation on the shallow traditional fully-connected AE with 100 bases as input and two encoding units. (1) indicates an increase in input length. (2) indicates an increase in encoding units.
Figure 5.9 shows the performance of this shallow AE with a configuration of 100 bases as input sequence and two encoding units, leading to a 4.5-fold size reduction. The network in this configuration has 1502 trainable parameters with weight sharing, and 2502 without. These results show the network is not able to reach a good performance. While the behaviour does show a limited form of learning, with an accuracy starting at 25% and ending up at 30%, the performance is not good. The corresponding loss functions decrease slightly from 1.60 to 1.43, but neither is this a good score. Both the accuracy and loss curves stagnate after a short initial movement, and the network stops learning. There is no discernible difference between the tied and independent weight setups. The network seems to fail in learning meaningful structure and features in the data, which looks like an underfitting problem: the model is not sufficiently powerful to express the required complexity in modelling the data.
5.3.2 Deep fully-connected Auto-Encoder
When approaching a problem using ANNs, improving a result is often as simple as extending the architecture to a deeper network, which will outperform a shallow architecture most of the time. With this in mind, the shallow AE is now extended with two extra layers at each side. Between the input and the bottleneck layer, two fully-connected dense hidden layers are added, having twenty and five times the amount of neurons of the bottleneck layer, resulting in a funneled architecture. The decoder is adapted likewise to preserve symmetry. Figure 5.10 shows this architecture with two hidden units as encoding size and an input sequence length of 100 bases. The same set of parameters (100 input bases, 2 encoding units) is selected, and the network is trained once having independent weights and once having tied weights in the symmetrical parts. The results are shown in Figure 5.11. The results are very alike compared to the shallow dense AE. The accuracy never reaches higher than 30%. The loss function, while decreasing in a stable way, does so only in a minor way. It appears improving this architecture in the brute-force way by simply adding more layers and neurons does not help; the network still suffers from underfitting.
5.3.3 Shallow Convolutional Auto-Encoder
Even a deep architecture of fully-connected layers does not seem to work well on this problem. Adding more neurons in a layer or more layers in the network does not improve
0 500 1,000 1,500 2,000 2,5000
0.2
0.4
0.6
0.8
1
Number of updates
Rec
onst
ruct
ion
acc
ura
cy
trainingvalidation
(a) Reconstruction accuracy
0 500 1,000 1,500 2,000 2,500
1.5
1.55
1.6
Number of updates
Cate
gori
cal
cros
sentr
opy
trainingvalidation
(b) Cross-entropy loss
0 500 1,000 1,500 2,000 2,5000
0.2
0.4
0.6
0.8
1
Number of updates
Rec
onst
ruct
ion
accu
racy
trainingvalidation
(c) Reconstruction accuracy
0 500 1,000 1,500 2,000 2,500
1.5
1.55
1.6
Number of updates
Cat
egor
ical
cros
sentr
opy
trainingvalidation
(d) Cross-entropy loss
Figure 5.9: Reconstruction accuracy and loss function of shallow fully-connected AE with inputsequence length of 100 bases and two encoding units. Above: independant weights, below: tiedweights.
Figure 5.10: Deep fully-connected AE with a sequence input length of 100 bases and two hidden units in the encoding layer.
0 500 1,000 1,500 2,000 2,5000
0.2
0.4
0.6
0.8
1
Number of updates
Rec
on
stru
ctio
nac
cura
cy
trainingvalidation
(a) Reconstruction accuracy
0 500 1,000 1,500 2,000 2,500
1.45
1.5
1.55
1.6
Number of updates
Cate
gori
cal
cros
sentr
opy
trainingvalidation
(b) Cross-entropy loss
0 500 1,000 1,500 2,000 2,5000
0.2
0.4
0.6
0.8
1
Number of updates
Rec
onst
ruct
ion
accu
racy
trainingvalidation
(c) Reconstruction accuracy
0 500 1,000 1,500 2,000 2,500
1.45
1.5
1.55
1.6
Number of updates
Cat
egor
ical
cros
sentr
opy
trainingvalidation
(d) Cross-entropy loss
Figure 5.11: Reconstruction accuracy and loss function of deep fully-connected AE with inputsequence length of 100 bases and two encoding units. Above: independant weights, below: tiedweights.
performance in a significant way. The next logical step therefore is to have a look at convolutional networks. They have proven to be exceptional in object recognition tasks and are a powerful addition to a neural network structure.
The Convolutional AE (CAE) is built starting from the shallow fully-connected AE with a single hidden layer. In front of the dense encoding layer a single convolutional set is placed, consisting of one convolutional layer followed by a max-pooling layer. A flatten operation is required to link the feature maps to the dense encoding layer. Between the input layer and the convolutional layer a set of operations is applied to the 2D input matrix in order to transform it to the multi-channel structure (as shown in Figure 4.2) required for convolutional operation. After the bottleneck layer, the inverse operations are applied to keep the symmetrical architecture: an upsampling and deconvolution, followed by a shape transformation and a softmax in order to constrain the network to the single-base output. The inverse part of the network is implemented once using shared weights and once independently. Both the convolutional and fully-connected layers have the sigmoid as non-linearity applied on their output and are initialized with He weight initialization ([35]). For the gradient learning, the Adam update method ([36]) is applied with a learning rate of 0.001. The network is run on GPUs with a batch size of 50 sequences per weight update.
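The He initialization used here draws weights from a zero-mean Gaussian whose standard deviation scales with the fan-in; a sketch (the kernel shape below is illustrative):

```python
import numpy as np

def he_init(fan_in, fan_out, rng=None):
    """He et al. initialization: zero-mean Gaussian with
    std = sqrt(2 / fan_in), keeping activation variance stable
    through the layers."""
    rng = np.random.default_rng(0) if rng is None else rng
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))

# e.g. a dense layer fed by 3x3 kernels over 32 input channels
W = he_init(3 * 3 * 32, 64)
```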
As the input is transformed into multi-channeled squares (see subsection 4.2.2), the input sequence length is chosen somewhat differently from the previous traditional AEs. Figure 5.12 shows the network structure working on an input sequence of 400 bases (squares of 20 × 20). The formatting of the data into cubes (and back to single-base) is implemented through a few transformation steps using layers in the network; these operations are shown on the figure compacted into a single cube-formatting layer for brevity. This particular network has 18279 trainable parameters with independent weight training, or 10183 parameters using tied weights.
Figure 5.13 displays the outcome of the training process with 400 bases as input length and two encoding units. The training shows a stable and desirable learning behaviour. The accuracy of the reconstruction gradually rises and eventually reaches over 90%. The loss function decreases gradually as desired. The transition from a traditional AE to a CAE has seemingly led to a significant performance gain. From a reconstruction performance of slightly better
Figure 5.12: Shallow convolutional AE structure.
than random¹, the network has jumped to around 95% correct reconstruction accuracy. The loss function also displays smooth decreasing behaviour and ends up around 1.
When comparing the weight sharing with the independent weight training, there is a noticeable advantage for the weight-sharing variant: the accuracy ends up about 5% higher and reaches the 90% mark far quicker. The cross-entropy loss quickly drops under 1, while the weight-independent variant does not go below the 1 border at all. As these are exactly the same conditions for the network with only the weight sharing as a difference, this leads to the conclusion that in this setup there is a clear advantage in using tied weights for the encoding and decoding parts of the CAE. Furthermore, this CAE is a possibly viable candidate for a compression scheme.
5.3.4 Deep Convolutional Auto-Encoder
The previous subsection has shown that the Convolutional Auto-Encoder performs very well on the task. Several modifications are now made to the previous CAE in order to try to further improve upon the results. As a first step, an extra fully-connected hidden layer is added before & after the encoding layer. Regarding the convolutional stage(s) of the network, one additional convolutional-pooling stage is added after the initial max-pooling layer. Each stage is individually extended and now consists of two convolutions before max-pooling is applied. The convolution kernels are made smaller, while the number of feature maps is doubled from the first to the second stage, leading to a configuration of 32@3×3 and 64@3×3; these smaller kernels are shown to have a regularizing effect. These architectural designs are inspired by recent research on ConvNets ([21], [37]).
The resulting architecture of the net is displayed in Figure 5.14. Only the layers up to the bottleneck
¹ As the large majority of the sequence is A, C, G, or T, and each letter has a comparable frequency in the sequence, randomly guessing the base would have a 25% success ratio, which the previous network at 30% does not improve upon by much.
Figure 5.13: Reconstruction accuracy and loss function of the shallow convolutional AE with input sequence length of 400 bases and two encoding units. Above: independent weights, below: tied weights.
layer are shown due to space constraints. The omitted decoding part is fully symmetrical to the first half of the network, and followed by a reshape and softmax output just as in the previous networks. As the previous CAE has shown that a setup with tied weights significantly outperforms the weight-independent option, the decoder is tied to the encoder weights and the non-tied version is dropped from the experiment. The two convolutional sets (Conv-Conv-MaxPool) with decreasing sizes allow for a hierarchical feature learning. This network ends up with 67047 trainable parameters using tied weights.
Figure 5.14: Deep CAE architecture. Symmetrical decoding part omitted for readability.
With the same settings for input and encoding size as the previous CAE, the network training is shown in Figure 5.15. When comparing this to the shallow CAE, it seems the network does not immediately gain from the adapted architecture. While it works and shows a stable learning behaviour, this network is seemingly outperformed by the shallow CAE on both accuracy and cross-entropy loss.
5.3.5 Batch-Normalized ReLu CAE
Recent research ([38], 2015) introduced the concept of Batch Normalization for ANN layers.
By normalizing the output of a fully-connected or convolutional layer before applying its non-
linearity, it has been shown to improve the performance of a model, reduce overfitting, and
significantly speed up the training of the ANN. It has also proven an effective combination
with the ReLU activation function, which is not frequently used in Auto-Encoder
architectures because it does not seem to perform well with the inverse operations often used by
AEs. This technique is now tried out with the previous deep CAE architecture. Every dense or
convolutional layer (and its inverse counterpart) is followed by a Batch Normalization layer.
The tied-weight setup is kept due to its clear advantage in the previous experiments. As the
technique shows the ability to speed up learning, the learning rate is tripled to 0.003, still using
the Adam gradient optimization method. This network is constructed in two otherwise equal
variations: once using traditional sigmoids, and once using ReLU activations. The
input length and encoding size parameters are kept. The networks have 69,379 parameters.

Figure 5.15: Reconstruction accuracy and loss function of the deep CAE with an input sequence length of 400 bases and two encoding units, trained with tied weights. (a) Reconstruction accuracy and (b) categorical cross-entropy loss, plotted against the number of updates for the training and validation sets.
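The batch-normalize-then-activate pattern described above can be sketched as follows (a minimal training-time NumPy sketch with illustrative sizes; gamma and beta stand in for the learned scale and shift parameters):

```python
import numpy as np

def batch_norm(pre_act, gamma, beta, eps=1e-5):
    # Normalize each feature over the batch, then apply a learned
    # scale (gamma) and shift (beta) -- all before the nonlinearity.
    mean = pre_act.mean(axis=0)
    var = pre_act.var(axis=0)
    normed = (pre_act - mean) / np.sqrt(var + eps)
    return gamma * normed + beta

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(1)
x = rng.normal(size=(8, 4))          # batch of 8 pre-activations, 4 features
gamma, beta = np.ones(4), np.zeros(4)

# Nonlinearity is applied after normalization, as in the BN-CAE layers.
h = sigmoid(batch_norm(x, gamma, beta))

# With gamma=1 and beta=0, the normalized pre-activations have
# (approximately) zero mean and unit variance per feature.
normed = batch_norm(x, gamma, beta)
assert np.allclose(normed.mean(axis=0), 0.0, atol=1e-7)
assert np.allclose(normed.std(axis=0), 1.0, atol=1e-2)
```

At inference time a real implementation would use running averages of the batch statistics instead of per-batch values.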
Figure 5.16 shows the training process of these BN networks. The first thing to discuss is
the ReLU activation function trial, shown in the upper half of the figure. While it initially
seems to perform well and do the required learning, it starts to deteriorate rather quickly. The
accuracy decreases and the loss function starts to rise again. This diverging trend continues
(not shown on the graph) when the network is trained further to 50,000 updates; the
accuracy keeps dropping to under 50% - for both the training and validation curves - and the loss
ends up at 1.12. From this experiment, it seems that the ReLU is not a useful activation function
for this AE, even in tandem with BN.
The lower half of the figure shows the variant using traditional sigmoids. Compared with
the previous deep CAE (which has an identical architecture bar the BN layers), this figure shows a clear
improvement accompanied by the faster learning that was 'promised'. Its performance is comparable with
the shallow CAE: the accuracy rivals the previously best shallow CAE by reaching around 95%,
and the loss function drops below 0.5. Further training continues this smooth learning behaviour.
Figure 5.16: Reconstruction accuracy and loss function of the Batch Normalized deep CAE with an input sequence length of 400 bases and two encoding units, trained with tied weights. Above: ReLU activations ((a) reconstruction accuracy, (b) cross-entropy loss); below: sigmoid activations ((c) reconstruction accuracy, (d) cross-entropy loss). All curves are plotted against the number of updates for the training and validation sets.

5.3.6 Model comparison, selection and discussion

In the previous subsections, an AE architecture was constructed to perform the compression
task. Starting from a traditional AE with a single fully-connected hidden layer encoding a
single base, and ending up with a deep (batch-normalized) convolutional AE encoding
a sequence of hundreds of bases into two values, the end result is a network capable of achieving
a very good compression rate with an associated good reconstruction accuracy. The traditional
AEs with only fully-connected layers are unable to perform the task and show an underfitting
problem. Only when moving to (variations on) convolutional AEs does the performance become very
good, and these are the models of interest. This section concludes with a comparison of these
CAE variations.
The CAEs are now trained for 50,000 updates, each with the same settings for input se-
quence length and encoding units. Their other (hyper)parameters remain as
discussed in the previous subsections. Figure 5.17 plots the resulting comparison of the re-
construction accuracy and cross-entropy loss on the validation set. A first conclusion to draw
here is that the model using the ReLU activation function (BN-CAE(ReLU)) does not work. The
cross-entropy quickly starts to diverge and the corresponding accuracy drops to under 50%. This
leaves the CAEs using sigmoid activations. The Batch Normalized variation (BN-CAE) initially
performs best. However, even before 10,000 updates the model shows signs of overfitting.
The loss function increases after a minimum at around 7,500 updates, and the accuracy has
dropped well before that, ending up with a score which does outperform the deep CAE
on which it is based, but does not match the shallow CAE. The desired regularization effect
of the Batch Normalization is not observed in this case. The deep CAE shows
a familiar learning behaviour: the accuracy rises and the loss function decreases, up until
overfitting occurs. At 10,000-15,000 updates, clear signs of overfitting show up, and the
performance gets worse from there on. The model is outperformed by both its shallow pre-
decessor and its batch-normalized successor. The clear winner here is the shallow CAE, also
displaying the familiar learning behaviour. The loss function decreases and at about 30,000
updates starts to increase due to overfitting, ending up with a similar score to the BN-CAE.
The accuracy rises to over 96%, and after overfitting occurs ends up slightly below that number.
It outperforms all of its successors, and consequently, this is the model applied in the next step.
Table 5.3 offers a summary of the models considered with some key figures. The distinction
between traditional and convolutional AEs is clear. Also note that the best performing
model (Conv AE (tied)) has a low number of parameters due to its shallow architecture and
the concept of weight sharing.
Figure 5.17: Performance comparison of variations on the CAE with an input sequence length of 400 bases and two encoding units, all trained with tied weights. (a) Reconstruction accuracy and (b) categorical cross-entropy loss on the validation set, plotted against the number of updates (up to 50,000) for the CAE, deep CAE, BN-CAE, and BN-CAE(ReLU) models.
Model | Compression | Accuracy (%) | Parameters
Single-base AE (tied) | 2,133 | 99.87 | 17
Shallow AE | 21.33 | 29.77 | 2,502
Shallow AE (tied) | 21.33 | 29.75 | 1,502
Deep AE | 21.33 | 29.76 | 41,442
Deep AE (tied) | 21.33 | 29.76 | 21,022
Conv AE | 5.33 | 90.72 | 18,279
Conv AE (tied) | 5.33 | 95.35 | 10,183
Deep Conv AE (tied) | 5.33 | 91.84 | 67,047
Batch Norm deep AE (tied) | 5.33 | 94.4 | 68,213
Batch Norm deep AE (ReLU & tied) | 5.33 | 47.57 | 68,213

Table 5.3: Overview of the Auto-Encoders considered with some key characteristics. (tied) indicates the weights of the decoder and encoder parts are shared. The compression column gives the ratio of the encoding units to the input sequence (at 3 bits per base). It does not include the residue at this point.
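As a sanity check on the compression column, one plausible reading (an assumption about the exact formula, since only "3 bits per base" is stated) is the codeword size relative to the raw input, with each encoding unit stored as a 32-bit float and expressed as a percentage:

```python
# Assumed interpretation of the compression column: size of the codeword
# (encoding units stored as 32-bit floats) relative to the raw input
# (3 bits per base), expressed as a percentage.
def compression_pct(encoding_units, input_bases, bits_per_unit=32, bits_per_base=3):
    return 100.0 * (encoding_units * bits_per_unit) / (input_bases * bits_per_base)

# Convolutional AEs: 400-base input, 2 encoding units.
conv = round(compression_pct(2, 400), 2)
print(conv)  # 5.33 -- matches the Conv AE rows of table 5.3
```

Under the same reading, the single-base AE's codeword is far larger than its input, consistent with the first row of the table.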
5.4 Compression scheme implementation

Having determined a successful model in the previous section, this section implements a
compression scheme by applying the CAE to the DNA sequences. The evaluation schemes and
the metrics considered are the ones discussed in the methodology chapter (section 4.2.5).
5.4.1 Scenario 1: chromosome Ref-A on chromosome Ref-B
For this first scenario, the CAE is trained using chromosome 1 of the reference genome. The
full sequence is used as the train set in this case (as no validation data needs to be held out). The
model is trained for 30,000 updates, as this has previously been shown to lead to good results. This
model is then used to compress the other chromosomes of the same reference genome, following
the scheme of figure 4.3. A first aspect to investigate is the achieved compression. Two data
structures are of interest for this: the encoded genome and the residue. The encoding - or the
codeword - is the output produced by the AE encoding layer. This is an array with a number
of rows equal to the number of input sequences, and a number of columns equal
to the encoding size (the number of neurons in the encoding layer). Each element is a neuron
value, thus a 32-bit floating point number, and this array is saved to disk using NumPy's save()
method, resulting in a binary .npy file. For the residue there are three
choices. The first form of the residue will always be of the same size as the input. The size
of the second and third forms of the residue will depend on the accuracy of the AE, as they
contain elements for errors. The second residue will contain a bitmask of the same form as
the input, and an array of varying size with an element for each error. The third residue will
contain only pairs for errors, and will thus have the smallest filesize in this implementation, as a
good reconstruction is assumed from the results of the previous section. The combination of the
codeword and this residue makes up the bitstream, and the ratio of this bitstream to the
input (i.e. the FASTA file of the original chromosome) determines the achieved compression
ratio. Note again that the compressed sequence is stripped of any metadata originally contained in the
sequence file.
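The third residue form can be sketched as follows (a toy example with made-up labels, not the thesis code; the base-to-integer mapping is assumed):

```python
import io
import numpy as np

# Hypothetical base labels: A=0, C=1, G=2, T=3.
original      = np.array([0, 1, 2, 3, 0, 1, 2, 3])
reconstructed = np.array([0, 1, 2, 0, 0, 1, 3, 3])  # two wrong bases

# Residue form 3: only (position, true label) pairs where the AE erred.
errors = np.flatnonzero(original != reconstructed)
residue3 = np.stack([errors, original[errors]], axis=1)
print(residue3.tolist())  # [[3, 3], [6, 2]]

# Saved alongside the codeword; np.save produces the binary .npy
# files used as the (naive) on-disk format in this chapter.
buf = io.BytesIO()
np.save(buf, residue3)

# Lossless reconstruction: apply the residue on top of the AE output.
restored = reconstructed.copy()
restored[residue3[:, 0]] = residue3[:, 1]
assert np.array_equal(restored, original)
```

The better the reconstruction accuracy, the fewer pairs this residue contains, which is why form 3 is the smallest of the three.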
Table 5.4 contains the compression results of this setup.
These values show first of all that this method has been successful in achieving compression.
The average size compared to the FASTA file is 60%-70%. One chromosome can barely
be compressed, but none of them end up growing in size. When the bits per base (bpb) are
considered, values around 5 are achieved. Unfortunately, this does not outperform some other
(general-purpose) compression methods, nor does it do better than the 2 bpb baseline option.
Another remark from this data is that the residue is a major factor in the filesize compared to the
codeword. One should note however that this is a naive implementation (e.g. data stored as NumPy
structures) and by no means the limit of compression performance. The codeword probably
does not leave much room for more efficient storage, as it is already simply the array
of 32-bit neuron values. The residue, however, can probably be stored much more efficiently with the right
implementation and data structure.
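The two headline metrics can be reproduced as follows (the FASTA size of 130 MB and base count of 129.3 million are illustrative assumptions chosen to roughly match the chr10 row of table 5.4, not recorded values):

```python
# bitstream = codeword file + residue file, compared against the FASTA input.
def compression_ratio(codeword_mb, residue_mb, fasta_mb):
    return (codeword_mb + residue_mb) / fasta_mb

def bits_per_base(codeword_mb, residue_mb, n_bases):
    total_bits = (codeword_mb + residue_mb) * 1e6 * 8
    return total_bits / n_bases

# chr10 row of table 5.4: 2.59 MB codeword, 83.69 MB residue (form 3).
ratio = compression_ratio(2.59, 83.69, 130.0)   # assumed ~130 MB FASTA file
bpb = bits_per_base(2.59, 83.69, 129.3e6)       # assumed ~129.3M bases
print(round(ratio, 2), round(bpb, 2))  # 0.66 5.34
```

This also makes the residue's dominance explicit: the 83.69 MB residue accounts for nearly all of the bitstream.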
With the aim of a realistic compression scenario, the next step would be a more
intelligent storage of the bitstream. One frequently used option is to apply arithmetic
coding. In order to apply this successfully, the entropy of the material to be encoded should be
as low as possible. This makes both the number of errors made and the
pattern of these errors important. These two metrics are extracted from the first form of the residue and
shown in table 5.5. The entropy of the residue is consistently around 0.25, and the reconstruction
accuracy slightly over 95%, which is in line with the expected performance from the results
of the model construction.
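The entropy figures can be checked against the error rates directly: treating the form-1 residue bitmask as a Bernoulli source, the Shannon entropy at a 4% error rate is close to the reported ~0.25 (a sketch; the thesis values will differ slightly because the underlying error rates are rounded in the table).

```python
import math

def binary_entropy(p):
    # Shannon entropy (bits/symbol) of a Bernoulli source, e.g. the
    # per-position error bitmask of residue form 1.
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

h = binary_entropy(0.04)
print(round(h, 2))  # 0.24
```

A low entropy like this is exactly what makes a follow-up entropy coder (arithmetic coding, or the gzip pass in section 5.4.3) effective.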
5.4.2 Scenario 2: chromosome Ref-A on chromosome Alt-A
The second scenario investigates the result of training the CAE on a chromosome from the
reference genome and encoding the same chromosome from the alternative genome. This means
there are 23 CAEs trained, and each is used once for testing.

Chromosome | Encoded (MB) | Residue 3 (MB) | Comp. ratio | bpb
hs_ref_GRCh38.p2_chr10.fa | 2.59 | 83.69 | 0.66 | 5.34
hs_ref_GRCh38.p2_chr11.fa | 2.62 | 83.46 | 0.65 | 5.25
hs_ref_GRCh38.p2_chr12.fa | 2.61 | 81.77 | 0.64 | 5.17
hs_ref_GRCh38.p2_chr13.fa | 1.91 | 58.81 | 0.63 | 5.09
hs_ref_GRCh38.p2_chr14.fa | 1.82 | 59.76 | 0.67 | 5.4
hs_ref_GRCh38.p2_chr15.fa | 1.88 | 64.38 | 0.7 | 5.65
hs_ref_GRCh38.p2_chr16.fa | 1.66 | 55.16 | 0.67 | 5.47
hs_ref_GRCh38.p2_chr17.fa | 1.77 | 58.45 | 0.67 | 5.44
hs_ref_GRCh38.p2_chr18.fa | 1.56 | 48.24 | 0.63 | 5.1
hs_ref_GRCh38.p2_chr19.fa | 1.41 | 67.55 | 0.97 | 7.82
hs_ref_GRCh38.p2_chr20.fa | 1.24 | 41.51 | 0.68 | 5.53
hs_ref_GRCh38.p2_chr21.fa | 0.79 | 26.41 | 0.69 | 5.54
hs_ref_GRCh38.p2_chr22.fa | 0.81 | 29.05 | 0.73 | 5.91
hs_ref_GRCh38.p2_chr2.fa | 4.65 | 143.11 | 0.63 | 5.08
hs_ref_GRCh38.p2_chr3.fa | 3.86 | 116.08 | 0.61 | 4.97
hs_ref_GRCh38.p2_chr4.fa | 3.69 | 109.37 | 0.6 | 4.91
hs_ref_GRCh38.p2_chr5.fa | 3.56 | 110.3 | 0.63 | 5.12
hs_ref_GRCh38.p2_chr6.fa | 3.93 | 157.45 | 0.81 | 6.57
hs_ref_GRCh38.p2_chr7.fa | 3.09 | 97.4 | 0.64 | 5.2
hs_ref_GRCh38.p2_chr8.fa | 2.85 | 87.67 | 0.63 | 5.08
hs_ref_GRCh38.p2_chr9.fa | 2.36 | 73.57 | 0.64 | 5.15
hs_ref_GRCh38.p2_chrX.fa | 2.98 | 95.44 | 0.65 | 5.28

Table 5.4: File sizes after encoding using the shallow CAE trained on chromosome 1 of the reference genome. The compression column shows the ratio of the bitstream filesize to the original FASTA file.

The same metrics are used as in the previous subsection, and tables 5.6 and 5.7 contain the
outcome of these tests. Compared to the previous setup, the compression achieved is slightly less
successful. The ratio averages around 70%, and the bpb is on average noticeably larger than in the
previous scenario.
5.4.3 Additional general-purpose compression
The scheme implemented here should be interpreted as a prototype and proof of concept. It
suffices to illustrate the achieved results and concepts, but the implementa-
tion has some flaws to consider. First of all, the input sequence is stripped
of its metadata and some rare characters are replaced by the unspecified symbol. From there,
the compression works on the sequence data, so technically this is not entirely lossless compres-
sion when viewed from an end-to-end position. A second remark is the
fact that working with NumPy arrays to store the data is a rather naive implementation. More
efficient and better approaches are possible; this implementation is chosen for its ease of use and
straightforward compatibility with the frameworks used for the machine learning aspect of the
implementation.

Chromosome | Entropy | Error rate
hs_ref_GRCh38.p2_chr10.fa | 0.25 | 0.04
hs_ref_GRCh38.p2_chr11.fa | 0.25 | 0.04
hs_ref_GRCh38.p2_chr12.fa | 0.25 | 0.04
hs_ref_GRCh38.p2_chr13.fa | 0.24 | 0.04
hs_ref_GRCh38.p2_chr14.fa | 0.26 | 0.04
hs_ref_GRCh38.p2_chr15.fa | 0.27 | 0.04
hs_ref_GRCh38.p2_chr16.fa | 0.26 | 0.04
hs_ref_GRCh38.p2_chr17.fa | 0.26 | 0.04
hs_ref_GRCh38.p2_chr18.fa | 0.24 | 0.04
hs_ref_GRCh38.p2_chr19.fa | 0.35 | 0.06
hs_ref_GRCh38.p2_chr20.fa | 0.26 | 0.04
hs_ref_GRCh38.p2_chr21.fa | 0.26 | 0.04
hs_ref_GRCh38.p2_chr22.fa | 0.27 | 0.04
hs_ref_GRCh38.p2_chr2.fa | 0.24 | 0.04
hs_ref_GRCh38.p2_chr3.fa | 0.24 | 0.04
hs_ref_GRCh38.p2_chr4.fa | 0.23 | 0.04
hs_ref_GRCh38.p2_chr5.fa | 0.24 | 0.04
hs_ref_GRCh38.p2_chr6.fa | 0.3 | 0.05
hs_ref_GRCh38.p2_chr7.fa | 0.25 | 0.04
hs_ref_GRCh38.p2_chr8.fa | 0.24 | 0.04
hs_ref_GRCh38.p2_chr9.fa | 0.24 | 0.04
hs_ref_GRCh38.p2_chrX.fa | 0.25 | 0.04

Table 5.5: Residue (form 1) analysis after encoding using the shallow CAE trained on chromosome 1 of the reference genome.

As an example, when converting the textual FASTA file with bases to numeric
labels, the NumPy data structure holding these labels is stored on disk with a filesize of two
to ten times that of the FASTA file. The compression numbers could thus be even better
when these flaws are handled and a suitable encoding and set of data formats is chosen.
For the sake of completeness, the general-purpose compression algorithm gzip with its
default settings is applied on top of the bitstreams from scenario 1. This additional layer of
compression might be a simple fix for the lack of an intelligent choice of data structures for
the residue, as the gzip format will create an efficient representation (it is here that the low entropy
is of use). Table 5.8 shows for these gzipped bitstreams the fraction of the size compared
to the original FASTA file, and the bits per base.
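The effect can be reproduced in miniature (a standalone sketch with a synthetic bitmask, not the actual residue files; the 4% error rate mirrors table 5.5):

```python
import gzip
import numpy as np

# Illustrative stand-in for a residue file: a mostly-zero error structure
# compresses well under gzip precisely because of its low entropy.
rng = np.random.default_rng(2)
bitmask = (rng.random(1_000_000) < 0.04).astype(np.uint8)  # ~4% errors

raw = bitmask.tobytes()
packed = gzip.compress(raw)  # default settings, as in the experiment
print(len(raw), len(packed))
```

Because the bitmask entropy is roughly 0.24 bits per position, the gzipped file ends up at a small fraction of the raw byte array, which is the mechanism behind the gains in table 5.8.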
From these end results, a significant gain is found compared to the output of the machine
learning compression scheme alone. The numbers indicate that with a combined application of
the AE and the general-purpose compression technique, a compression rate in the order of 10:1
is possible. The bpb demonstrated here (just below 1) are very good; they are half of
what the 2 bpb baseline offers, beating many comparable existing compression solutions.
Chromosome | Encoded (MB) | Residue 3 (MB) | Comp. ratio | bpb
hs_alt_CHM1_1.1_chr1.fa | 4.78 | 303.17 | 1.27 | 10.3
hs_alt_CHM1_1.1_chr19.fa | 1.07 | 36.65 | 0.7 | 5.65
hs_alt_CHM1_1.1_chr10.fa | 2.52 | 82.51 | 0.67 | 5.4
hs_alt_CHM1_1.1_chr12.fa | 2.49 | 82.13 | 0.67 | 5.44
hs_alt_CHM1_1.1_chr17.fa | 1.5 | 52.33 | 0.71 | 5.76
hs_alt_CHM1_1.1_chr15.fa | 1.57 | 55.28 | 0.72 | 5.79
hs_alt_CHM1_1.1_chr21.fa | 0.68 | 24.61 | 0.74 | 5.96
hs_alt_CHM1_1.1_chr8.fa | 2.73 | 93.66 | 0.7 | 5.65
hs_alt_CHM1_1.1_chr5.fa | 3.38 | 105.95 | 0.64 | 5.18
hs_alt_CHM1_1.1_chr7.fa | 2.98 | 101.01 | 0.69 | 5.59
hs_alt_CHM1_1.1_chr2.fa | 4.55 | 158.83 | 0.71 | 5.74
hs_alt_CHM1_1.1_chr16.fa | 1.54 | 54.71 | 0.72 | 5.84
hs_alt_CHM1_1.1_chr11.fa | 2.51 | 82.75 | 0.67 | 5.44
hs_alt_CHM1_1.1_chrX.fa | 2.89 | 100.11 | 0.7 | 5.7
hs_alt_CHM1_1.1_chr9.fa | 2.31 | 88.47 | 0.78 | 6.28
hs_alt_CHM1_1.1_chr18.fa | 1.43 | 46.69 | 0.66 | 5.37
hs_alt_CHM1_1.1_chr13.fa | 1.82 | 55.83 | 0.62 | 5.06
hs_alt_CHM1_1.1_chr22.fa | 0.67 | 25.25 | 0.77 | 6.18
hs_alt_CHM1_1.1_chr4.fa | 3.59 | 117.45 | 0.67 | 5.4
hs_alt_CHM1_1.1_chr3.fa | 3.72 | 116.99 | 0.64 | 5.19
hs_alt_CHM1_1.1_chr20.fa | 1.14 | 37.71 | 0.67 | 5.47
hs_alt_CHM1_1.1_chr14.fa | 1.69 | 53.52 | 0.65 | 5.24
hs_alt_CHM1_1.1_chr6.fa | 3.21 | 105.62 | 0.67 | 5.42

Table 5.6: File sizes after encoding using the shallow CAE. On each row a chromosome from the reference genome is used for training, and its counterpart in the alternative genome (whose name is given in this table) is encoded.
Chromosome | Entropy | Error rate
hs_alt_CHM1_1.1_chr1.fa | 0.42 | 0.08
hs_alt_CHM1_1.1_chr19.fa | 0.26 | 0.04
hs_alt_CHM1_1.1_chr10.fa | 0.25 | 0.04
hs_alt_CHM1_1.1_chr12.fa | 0.26 | 0.04
hs_alt_CHM1_1.1_chr17.fa | 0.27 | 0.04
hs_alt_CHM1_1.1_chr15.fa | 0.27 | 0.04
hs_alt_CHM1_1.1_chr21.fa | 0.28 | 0.05
hs_alt_CHM1_1.1_chr8.fa | 0.26 | 0.04
hs_alt_CHM1_1.1_chr5.fa | 0.25 | 0.04
hs_alt_CHM1_1.1_chr7.fa | 0.26 | 0.04
hs_alt_CHM1_1.1_chr2.fa | 0.27 | 0.04
hs_alt_CHM1_1.1_chr16.fa | 0.27 | 0.04
hs_alt_CHM1_1.1_chr11.fa | 0.26 | 0.04
hs_alt_CHM1_1.1_chrX.fa | 0.27 | 0.04
hs_alt_CHM1_1.1_chr9.fa | 0.29 | 0.05
hs_alt_CHM1_1.1_chr18.fa | 0.25 | 0.04
hs_alt_CHM1_1.1_chr13.fa | 0.24 | 0.04
hs_alt_CHM1_1.1_chr22.fa | 0.28 | 0.05
hs_alt_CHM1_1.1_chr4.fa | 0.25 | 0.04
hs_alt_CHM1_1.1_chr3.fa | 0.25 | 0.04
hs_alt_CHM1_1.1_chr20.fa | 0.26 | 0.04
hs_alt_CHM1_1.1_chr14.fa | 0.25 | 0.04
hs_alt_CHM1_1.1_chr6.fa | 0.26 | 0.04

Table 5.7: Residue (form 1) analysis after encoding using the shallow CAE. On each row a chromosome from the reference genome is used for training, and its counterpart in the alternative genome (whose name is given in this table) is encoded.
Chromosome | Comp. ratio | bpb
hs_ref_GRCh38.p2_chr10.fa | 0.1 | 0.85
hs_ref_GRCh38.p2_chr11.fa | 0.1 | 0.84
hs_ref_GRCh38.p2_chr12.fa | 0.1 | 0.83
hs_ref_GRCh38.p2_chr13.fa | 0.1 | 0.82
hs_ref_GRCh38.p2_chr14.fa | 0.11 | 0.85
hs_ref_GRCh38.p2_chr15.fa | 0.11 | 0.88
hs_ref_GRCh38.p2_chr16.fa | 0.11 | 0.86
hs_ref_GRCh38.p2_chr17.fa | 0.11 | 0.87
hs_ref_GRCh38.p2_chr18.fa | 0.1 | 0.82
hs_ref_GRCh38.p2_chr19.fa | 0.14 | 1.1
hs_ref_GRCh38.p2_chr20.fa | 0.11 | 0.87
hs_ref_GRCh38.p2_chr21.fa | 0.11 | 0.87
hs_ref_GRCh38.p2_chr22.fa | 0.11 | 0.92
hs_ref_GRCh38.p2_chr2.fa | 0.1 | 0.82
hs_ref_GRCh38.p2_chr3.fa | 0.1 | 0.8
hs_ref_GRCh38.p2_chr4.fa | 0.1 | 0.8
hs_ref_GRCh38.p2_chr5.fa | 0.1 | 0.82
hs_ref_GRCh38.p2_chr6.fa | 0.12 | 0.96
hs_ref_GRCh38.p2_chr7.fa | 0.1 | 0.83
hs_ref_GRCh38.p2_chr8.fa | 0.1 | 0.82
hs_ref_GRCh38.p2_chr9.fa | 0.1 | 0.83
hs_ref_GRCh38.p2_chrX.fa | 0.1 | 0.84

Table 5.8: Compression results after gzip is applied on top of the bitstreams resulting from the encodings in scenario 1. The compression ratio is the filesize fraction compared to the original FASTA file.
Chapter 6
Conclusion
This final chapter provides a short discussion of the results obtained in this work and notes
some opportunities for future research.
6.1 Discussion of the results
The purpose of this work has been to explore the possibility of constructing a compression
scheme for DNA sequence data using a deep learning approach. A first data analysis has shown
there are few obvious features and patterns present in the source data, hence
an unsupervised technique has been chosen. One particular technique, the Auto-Encoder,
has been selected, as it has been shown capable of being used for data compression.
After the data analysis, a suitable AE architecture has been investigated. This ranged from
a traditional single-hidden-layer AE, to the more powerful convolutional variants, and ended
with a trial of a batch-normalized deep convolutional AE inspired by state-of-the-art research in
deep learning. The resulting best performing model turned out to be the shallow convolutional
AE, demonstrating the ability to compress and reconstruct the sequence with an accuracy of
over 90% while having an architecture that offers a good compression rate. The next part
of this research involved implementing a compression scheme using this CAE to investigate whether
this approach yields a working setup. Several scenarios for evaluating the scheme have been
investigated. The evaluation involved a look at the encoded sequence and the structure of
the error residue, as these make up the encoded bitstream. Even this basic implementation,
with still decent room for improvement, has shown to be capable of lossless compression of
sequence data with a resulting size of 60%-70% of the original in most cases. Further application of a
general-purpose compression scheme such as gzip is shown to lead to an even larger achievable
compression rate, mitigating the simplistic approach to storing the residue files. The combined
steps of the AE compression and gzip lead to a bpb of around 1, which outperforms most of
the existing algorithms.
6.2 Future work
This work can be seen as a feasibility study of applying machine learning techniques to achieve
compression of DNA data. The chosen technique was the (convolutional) Auto-
Encoder. A first option for further research could be to finetune this approach (which is then
an optimization question rather than the exploratory look and prototype of this work).
There are some very recent developments and adaptations of AEs which could be tried out in
order to find the optimal architecture and parameter set for this problem.
Another approach could be to perform feature engineering. A good set of features is still
one of the most important aspects of a successful machine learning application to a problem.1
One attractive property of the successful neural network-based methods is their ability to do
automatic feature learning. There has thus been somewhat of a divide between opposing
schools of thought regarding the importance of human feature engineering. In this work, no
manual work on features is done; the genome sequence is used directly as input to the model.
Future work might investigate whether preprocessing of the data could lead to useful data
characteristics for machine learning.
Lastly, there is another neural network technique which has shown recent success, most
notably in natural language processing: Recurrent Neural Networks (RNNs). RNNs are quite a
recent development, are hard to train and apply successfully, and their workings are not entirely
well understood at this moment. They add a concept of temporal awareness and memory to
neural nets, allowing them to learn and recognize languages, where a particular sequence of
words is important (besides just the set of words). RNNs - and their specific architectural
variant, LSTMs (Long Short-Term Memory networks, [39]) - can be used to automatically
construct sentences or perform text translation. In the latter case, this is done by analyzing
the source text into an internal representation and subsequently expanding this representation
with a network trained on another language, ending up with the translated text. The process
shows similarities to the AE process of compacting to a representation and reconstructing.
1This can be observed in the various Kaggle (machine learning) competition winning submissions and the often public explanations of their approaches.
When interpreting DNA strands as containing some biochemical language, with its own
words, structures and patterns, RNNs might prove to be a successful approach to modelling
DNA.
References
[1] S. Deorowicz and S. Grabowski, “Data compression for sequencing data.”, Algorithms for
molecular biology: AMB, vol. 8, no. 1, p. 25, 2013, issn: 1748-7188. doi: 10.1186/1748-7188-8-25.
[Online]. Available: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3868316&tool=pmcentrez&rendertype=abstract.
[2] J. Bornholt, R. Lopez, D. M. Carmean, and L. Ceze, "A dna-based archival storage sys-
tem", in ASPLOS 2016 (International Conference on Architectural Support for Program-
ming Languages and Operating Systems) - to appear, ACM – Association for Comput-
ing Machinery, Apr. 2016. [Online]. Available: https://www.microsoft.com/en-us/research/publication/dna-based-archival-storage-system/.
[3] J. Ziv and A. Lempel, “A universal algorithm for sequential data compression”, IEEE
Transactions on Information Theory, vol. 23, no. 3, pp. 337–343, May 1977, issn: 0018-
9448. doi: 10.1109/TIT.1977.1055714.
[4] P. G. Howard and J. S. Vitter, “Arithmetic coding for data compression”, Proceedings of
the IEEE, vol. 82, no. 6, pp. 857–865, Jun. 1994, issn: 0018-9219. doi: 10.1109/5.286189.
[5] S. Kuruppu, S. J. Puglisi, and J. Zobel, “Optimized relative lempel-ziv compression of
genomes”, Conferences in Research and Practice in Information Technology Series, vol.
113, no. Acsc, pp. 91–98, 2011, issn: 14451336. doi: 10.1007/978-3-642-16321-0_20.
[6] S. Deorowicz and S. Grabowski, “Compression of dna sequence reads in fastq format”,
Bioinformatics, vol. 27, no. 6, pp. 860–862, 2011. doi: 10.1093/bioinformatics/btr014.
eprint: http://bioinformatics.oxfordjournals.org/content/27/6/860.full.pdf+
html. [Online]. Available: http://bioinformatics.oxfordjournals.org/content/27/
6/860.abstract.
[7] M. Howison, “High-throughput compression of fastq data with seqdb”, IEEE/ACM Trans-
actions on Computational Biology and Bioinformatics, vol. 10, no. 1, pp. 213–218, Jan.
2013, issn: 1545-5963. doi: 10.1109/TCBB.2012.160.
[8] D. C. Jones, W. L. Ruzzo, X. Peng, and M. G. Katze, “Compression of next-generation
sequencing reads aided by highly efficient de novo assembly”, CoRR, vol. abs/1207.2424,
2012. [Online]. Available: http://arxiv.org/abs/1207.2424.
[9] M. Nicolae, S. Pathak, and S. Rajasekaran, “Lfqc: A lossless compression algorithm
for fastq files”, Bioinformatics, vol. 31, no. 20, pp. 3276–3281, 2015. doi: 10 . 1093 /
bioinformatics / btv384. eprint: http : / / bioinformatics . oxfordjournals . org /
content/31/20/3276.full.pdf+html. [Online]. Available: http://bioinformatics.
oxfordjournals.org/content/31/20/3276.abstract.
[10] C. Kozanitis, C. Saunders, S. Kruglyak, V. Bafna, and G. Varghese, “Compressing ge-
nomic sequence fragments using slimgene”, in Proceedings of the 14th Annual International
Conference on Research in Computational Molecular Biology, ser. RECOMB’10, Lisbon,
Portugal: Springer-Verlag, 2010, pp. 310–324, isbn: 3-642-12682-0, 978-3-642-12682-6. doi:
10.1007/978-3-642-12683-3_20. [Online]. Available: http://dx.doi.org/10.1007/
978-3-642-12683-3_20.
[11] M. N. Sakib, J. Tang, W. J. Zheng, and C.-T. Huang, “Improving transmission efficiency
of large sequence alignment/map (sam) files”, PLoS ONE, vol. 6, no. 12, pp. 1–4, Dec.
2011. doi: 10.1371/journal.pone.0028251. [Online]. Available: http://dx.doi.org/
10.1371/journal.pone.0028251.
[12] H. Li, “Tabix: Fast retrieval of sequence features from generic tab-delimited files”, Bioin-
formatics, vol. 27, no. 5, pp. 718–719, 2011. doi: 10.1093/bioinformatics/btq671.
eprint: http://bioinformatics.oxfordjournals.org/content/27/5/718.full.pdf+
html. [Online]. Available: http://bioinformatics.oxfordjournals.org/content/27/
5/718.abstract.
[13] M. H.-Y. Fritz, R. Leinonen, G. Cochrane, and E. Birney, "Efficient storage of high
throughput dna sequencing data using reference-based compression", Genome Research,
vol. 21, pp. 734–740, 2011. doi: 10.1101/gr.114819.110.
[14] M. D. Cao, T. I. Dix, L. Allison, and C. Mears, “A simple statistical algorithm for biolog-
ical sequence compression”, Data Compression Conference Proceedings, pp. 43–52, 2007,
issn: 10680314. doi: 10.1109/DCC.2007.7.
[15] I. Tabus, G. Korodi, and J. Rissanen, “Dna sequence compression using the normalized
maximum likelihood model for discrete regression”, in Data Compression Conference,
2003. Proceedings. DCC 2003, Mar. 2003, pp. 253–262. doi: 10.1109/DCC.2003.1194016.
[16] G. Korodi and I. Tabus, “An efficient normalized maximum likelihood algorithm for dna
sequence compression”, ACM Trans. Inf. Syst., vol. 23, no. 1, pp. 3–34, Jan. 2005, issn:
1046-8188. doi: 10.1145/1055709.1055711. [Online]. Available: http://doi.acm.org/
10.1145/1055709.1055711.
[17] S. Christley, Y. Lu, C. Li, and X. Xie, “Human genomes as email attachments”, Bioinfor-
matics, vol. 25, no. 2, pp. 274–275, 2009. doi: 10.1093/bioinformatics/btn582. eprint:
http://bioinformatics.oxfordjournals.org/content/25/2/274.full.pdf+html.
[Online]. Available: http://bioinformatics.oxfordjournals.org/content/25/2/
274.abstract.
[18] S. Deorowicz, A. Danek, and M. Niemiec, “GDC 2: Compression of large collections of
genomes”, CoRR, vol. abs/1503.01624, 2015. [Online]. Available: http://arxiv.org/
abs/1503.01624.
[19] S. Kuruppu, B. Beresford-Smith, T. Conway, and J. Zobel, “Iterative dictionary con-
struction for compression of large dna data sets”, IEEE/ACM Transactions on Computa-
tional Biology and Bioinformatics, vol. 9, no. 1, pp. 137–149, 2012, issn: 15455963. doi:
10.1109/TCBB.2011.82.
[20] B. Christopher Leela, M. Manu K, V. Vineetha, K. Satheesh Kumar, Vijayakumar, and
A. S. Nair, “Compression of large genomic datasets using comrad on parallel computing
platform”, Bioinformation, vol. 11, no. 5, pp. 267–271, 2015.
[21] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image
recognition”, CoRR, vol. abs/1409.1556, 2014. [Online]. Available: http://arxiv.org/
abs/1409.1556.
[22] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep con-
volutional neural networks”, in Advances in Neural Information Processing Systems 25:
26th Annual Conference on Neural Information Processing Systems 2012. Proceedings of
a meeting held December 3-6, 2012, Lake Tahoe, Nevada, United States., 2012, pp. 1106–
1114. [Online]. Available: http://papers.nips.cc/paper/4824-imagenet-classification-
with-deep-convolutional-neural-networks.
[23] B. Verma, M. Blumenstein, and S. Kulkarni, “A new compression technique using an
artificial neural network”, Journal of Intelligent Systems, vol. 9, no. 1, Jan. 1999. doi:
10.1515/jisys.1999.9.1.39. [Online]. Available: http://dx.doi.org/10.1515/
jisys.1999.9.1.39.
[24] D. P. Dutta, S. D. Choudhury, M. A. Hussain, and S. Majumder, “Digital image com-
pression using neural networks”, in Advances in Computing, Control, Telecommunication
Technologies, 2009. ACT ’09. International Conference on, Dec. 2009, pp. 116–120. doi:
10.1109/ACT.2009.38.
[25] T. D. Kulkarni, W. Whitney, P. Kohli, and J. B. Tenenbaum, “Deep convolutional inverse
graphics network”, CoRR, vol. abs/1503.03167, 2015. [Online]. Available: http://arxiv.
org/abs/1503.03167.
[26] A. Radford, L. Metz, and S. Chintala, “Unsupervised representation learning with deep
convolutional generative adversarial networks”, CoRR, vol. abs/1511.06434, 2015. [On-
line]. Available: http://arxiv.org/abs/1511.06434.
[27] J. Zhao, M. Mathieu, R. Goroshin, and Y. Lecun, “Stacked what-where auto-encoders”,
ArXiv, vol. 1506.0235, 2015.
[28] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol, “Stacked denoising
autoencoders: Learning useful representations in a deep network with a local denoising
criterion”, J. Mach. Learn. Res., vol. 11, pp. 3371–3408, Dec. 2010, issn: 1532-4435.
[Online]. Available: http://dl.acm.org/citation.cfm?id=1756006.1953039.
[29] J. Masci, U. Meier, D. Ciresan, and J. Schmidhuber, “Stacked convolutional auto-encoders
for hierarchical feature extraction”, in Artificial Neural Networks and Machine Learning
– ICANN 2011: 21st International Conference on Artificial Neural Networks, Espoo, Finland,
June 14-17, 2011, Proceedings, Part I, T. Honkela, W. Duch, M. Girolami, and
S. Kaski, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2011, pp. 52–59, isbn:
978-3-642-21735-7. doi: 10.1007/978-3-642-21735-7_7. [Online]. Available: http://dx.doi.org/10.1007/978-3-642-21735-7_7.
[30] A. Rasmus, H. Valpola, M. Honkala, M. Berglund, and T. Raiko, “Semi-supervised learning
with ladder networks”, CoRR, vol. abs/1507.02672, 2015. [Online]. Available: http://arxiv.org/abs/1507.02672.
[31] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout:
A simple way to prevent neural networks from overfitting”, J. Mach. Learn. Res., vol. 15,
no. 1, pp. 1929–1958, Jan. 2014, issn: 1532-4435. [Online]. Available: http://dl.acm.org/citation.cfm?id=2627435.2670313.
[32] G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of data with neural
networks”, Science, vol. 313, no. 5786, pp. 504–507, 2006, issn: 0036-8075. doi: 10.1126/science.1127647.
eprint: http://science.sciencemag.org/content/313/5786/504.full.pdf.
[Online]. Available: http://science.sciencemag.org/content/313/5786/504.
[33] Y. Bengio, “Learning deep architectures for AI”, Found. Trends Mach. Learn., vol. 2,
no. 1, pp. 1–127, Jan. 2009, issn: 1935-8237. doi: 10.1561/2200000006. [Online]. Available:
http://dx.doi.org/10.1561/2200000006.
[34] R. Al-Rfou, G. Alain, A. Almahairi, et al., “Theano: A Python framework for fast
computation of mathematical expressions”, CoRR, vol. abs/1605.02688, 2016. [Online].
Available: http://arxiv.org/abs/1605.02688.
[35] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level
performance on ImageNet classification”, CoRR, vol. abs/1502.01852, 2015. [Online].
Available: http://arxiv.org/abs/1502.01852.
[36] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization”, CoRR, vol.
abs/1412.6980, 2014. [Online]. Available: http://arxiv.org/abs/1412.6980.
[37] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception
architecture for computer vision”, CoRR, vol. abs/1512.00567, 2015. [Online]. Available:
http://arxiv.org/abs/1512.00567.
[38] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by
reducing internal covariate shift”, CoRR, vol. abs/1502.03167, 2015. [Online]. Available:
http://arxiv.org/abs/1502.03167.
[39] S. Hochreiter and J. Schmidhuber, “Long short-term memory”, Neural Comput., vol. 9,
no. 8, pp. 1735–1780, Nov. 1997, issn: 0899-7667. doi: 10.1162/neco.1997.9.8.1735.
[Online]. Available: http://dx.doi.org/10.1162/neco.1997.9.8.1735.