
GENOMIC DATA COMPRESSION AND PROCESSING:

THEORY, MODELS, ALGORITHMS, AND EXPERIMENTS

A DISSERTATION

SUBMITTED TO THE DEPARTMENT OF ELECTRICAL

ENGINEERING

AND THE COMMITTEE ON GRADUATE STUDIES

OF STANFORD UNIVERSITY

IN PARTIAL FULFILLMENT OF THE REQUIREMENTS

FOR THE DEGREE OF

DOCTOR OF PHILOSOPHY

Idoia Ochoa-Alvarez

August 2016


© 2016 by Idoia Ochoa-Alvarez. All Rights Reserved. Re-distributed by Stanford University under license with the author.

This work is licensed under a Creative Commons Attribution-Noncommercial 3.0 United States License.
http://creativecommons.org/licenses/by-nc/3.0/us/

This dissertation is online at: http://purl.stanford.edu/st247bt3117


I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Tsachy Weissman, Primary Adviser

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Andrea Goldsmith

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

David Tse

Approved for the Stanford University Committee on Graduate Studies.

Patricia J. Gumport, Vice Provost for Graduate Education

This signature page was generated electronically upon submission of this dissertation in electronic format. An original signed hard copy of the signature page is on file in University Archives.


Abstract

Recently, there has been growing interest in genome sequencing, driven by advancements in the sequencing technology. Although early sequencing technologies required several years to capture a 3 billion nucleotide genome, genomes as large as 22 billion nucleotides are now being sequenced within days using next-generation sequencing technologies. Further, the cost of sequencing a whole human genome has dropped from billions of dollars to merely $1000 within the past 15 years.

These developments in efficiency and affordability have allowed many to envision whole-genome sequencing as an invaluable tool to be used in both personalized medical care and public health. As a result, increasingly large and ubiquitous genomic datasets are being generated. These datasets need to be stored, transmitted, and analyzed, which poses significant challenges.

In the first part of the thesis, we investigate methods and algorithms to ease the storage and distribution of these data sets. In particular, we present lossless compression schemes tailored to the raw sequencing data, which significantly decrease the size of the files, allowing for both storage and transmission savings. In addition, we show that lossy compression can be applied to some of the genomic data, boosting the compression performance beyond the lossless limit while maintaining similar – and sometimes superior – performance in downstream analyses. These results are possible due to the inherent noise present in the genomic data. However, lossy compressors are not explicitly designed to reduce the noise present in the data. With that in mind, we introduce a denoising scheme tailored to these data, and demonstrate that it can result in better inference. Moreover, we show that reducing the noise leads to smaller entropy, and thus a significant boost in compression is also achieved.


In the second part of the thesis, we investigate methods to facilitate the access to genomic data on databases. Specifically, we study the problem of compressing a database so that similarity queries can still be performed in the compressed domain. Compressing the database allows it to be replicated in several locations, thus providing easier and faster access to the data, and reducing the time needed to execute a query.


Acknowledgments

My life at Stanford during these years wouldn’t have been the same without several remarkable people who have accompanied me along the way.

First and foremost, I am deeply grateful to my advisor Tsachy Weissman, for believing in me and letting me pursue research under his guidance. He has been all I could hope for and more, both from an academic and personal perspective. It has been a privilege being his student, and he will always be a role model for me.

I am very thankful to Andrea Goldsmith for her support, her energy for teaching, and for serving as a reader of my thesis and a committee member of my oral defense. I would also like to thank David Tse, for many interesting discussions on genomic-related problems, and for serving as a reader of my thesis and a committee member of my oral defense. Special thanks to Ayfer Ozgur, for her kindness and interesting discussions throughout the years. I am also thankful to Olivier Gevaert, for fruitful collaborations, guidance, and for serving as a committee member of my oral defense. I would also like to thank Euan Ashley, for helpful discussions and collaborations. Thanks also to Golan Yona, for being my co-advisor and sharing his knowledge in genomics, and to Anshul Kundaje, for serving as the chair in my oral defense. I would also like to dedicate some lines to the late Tom Cover, a one-of-a-kind professor that I was lucky to interact with for more than a year. I still remember with nostalgia his weekly group meetings where we would solve numerous math puzzles. Last but not least, I would like to thank my advisor from Spain, Pedro Crespo, who initiated me into the world of research, and who gave me the initial idea and motivation to pursue a PhD in the States.

I am happy to acknowledge my collaborators, as well as my fellows with whom I have enjoyed many interesting interactions: Mikel, Alexandros, Kartik, Mainak, Bobbie, Albert, Kedar, Heyji, Gowtham, Himanshu, Amir, Rachel, Shirin, Dinesh, Greg, Milind, Khartik, Tom, Yair, Dmitri, Irena, Vinith, Jiantao, Yanjun, Peng, and Ritesh.

I would also like to thank Amy, Teresa, Meo, and especially Doug, for their amazing administrative work and for making my life at Stanford easier. Also to Angelica, for her happiness.

I gratefully acknowledge financial support from Fundacion La Caixa, which supported me during my first two years of PhD, the Basque Government, the Stanford Graduate Fellowships Program in Science and Engineering, and the Center for Science of Information (CSoI).

On a more personal note, I would like to thank Alexandros, Kartik, and Bobbie, with whom I have shared innumerable unforgettable moments; Mainak and Nima, who have always let me into their office, even if it was to distract them; the donostiarrak Ane and Ivan, for making me feel closer to home; Nadine, Carlos, Adrian, Felix, Yiannis, Sotiria, Borja, Gemma, Christina, and Dani, for memorable moments and for being my family abroad.

I am deeply grateful to my parents and my sister, for their love throughout all my life, and for making this possible.

I cannot end without thanking Mikel, to whom I owe everything, including a beautiful daughter. This thesis is dedicated to them.


To Mikel and Naroa


Contents

Abstract

Acknowledgments

1 Introduction
   1.1 Easing the storage and distribution of genomic data
   1.2 Facilitating access to the genomic data
   1.3 Outline and Contributions
   1.4 Previously Published Material

2 Compression of assembled genomes
   2.1 Survey of compression schemes for assembled genomes
   2.2 Proposed method
      2.2.1 iDoComp
      2.2.2 Data
      2.2.3 Machine specifications
   2.3 Results
   2.4 Discussion
   2.5 Conclusion

3 Compression of aligned reads
   3.1 Survey of lossless compressors for aligned data
      3.1.1 The lossless data compression problem
      3.1.2 Data modeling in aligned data compressors
   3.2 Proposed Compression Scheme
      3.2.1 Data
      3.2.2 Machine specifications
   3.3 Results
      3.3.1 Low coverage data sets
      3.3.2 High coverage data sets
   3.4 Discussion
      3.4.1 Low Coverage Data Sets
      3.4.2 High Coverage Data Sets
   3.5 Conclusion

4 Lossy compression of quality scores
   4.1 Survey of lossy compressors for quality scores
   4.2 Proposed Methods
      4.2.1 QualComp
      4.2.2 QVZ
   4.3 Results
      4.3.1 Data
      4.3.2 Machine Specifications
      4.3.3 Analysis
   4.4 Discussion
   4.5 Conclusions

5 Effect of lossy compression of quality scores
   5.1 Methodology for variant calling
      5.1.1 SNP calling
      5.1.2 INDEL detection
      5.1.3 Datasets for SNP calling
      5.1.4 Datasets for INDEL detection
      5.1.5 Performance metrics
   5.2 Results
      5.2.1 SNP calling
      5.2.2 INDEL detection
   5.3 Discussion
   5.4 Conclusion

6 Denoising of quality scores
   6.1 Proposed method
      6.1.1 Problem Setting
      6.1.2 Denoising Scheme
      6.1.3 Evaluation Criteria
   6.2 Results and Discussion
      6.2.1 SNP calling
      6.2.2 INDEL detection
   6.3 Conclusion

7 Compression schemes for similarity queries
   7.1 Problem Formulation and Fundamental Limits
      7.1.1 Problem Description
      7.1.2 Fundamental limits
   7.2 Proposed schemes
      7.2.1 The LC scheme
      7.2.2 The TC scheme
   7.3 Simulation results
      7.3.1 Binary symmetric sources and Hamming distortion
      7.3.2 General binary sources and Hamming distortion
      7.3.3 q-ary sources and Hamming distortion
   7.4 Conclusion

8 Conclusions

Bibliography

List of Tables

2.1 Genomic sequence datasets used for the pair-wise compression evaluation. Each row specifies the species, the number of chromosomes they contain, and the target and the reference assemblies with the corresponding locations from which they were retrieved. T. and R. stand for the Target and the Reference, respectively.

2.2 Compression results for the pairwise compression. C. time and D. time stand for compression and decompression time [seconds], respectively. The results in bold correspond to the best compression performance among the different algorithms. We use the International System of Units for the prefixes, that is, 1 MB and 1 KB stand for 10^6 and 10^3 Bytes, respectively. * denotes the cases where GRS outperformed GReEn. In these cases, i.e., L. pneumophila and S. cerevisiae, the compression achieved by GReEn is 495 KB and 304.2 KB, respectively. † denotes the cases where GDC-advanced outperformed GDC-normal.

3.1 Data sets used for the assessment of the proposed algorithm. The alignment program used to generate the SAM files is Bowtie2. The data sets are divided in two ensembles, low coverage data sets and high coverage data sets. The size corresponds to the SAM file. a M stands for millions.

3.2 Compression results for the high coverage ensemble. The results in bold show the compression gain obtained by the proposed method with respect to SamComp. We use the International System of Units for the prefixes, that is, 1 MB stands for 10^6 Bytes. C.T. stands for compression time. a Raw size refers solely to the size of the mapped reads (1 Byte per base pair).

4.1 Illumina’s proposed 8 level mapping.

4.2 Lossless results of the different algorithms for the NA12878 data set.

5.1 Sensitivity, precision, f-score and compression ratio for the 30× and 15× coverage datasets for the GATK pipeline, using the NIST ground truth.

5.2 Sensitivity for INDEL detection by the Dindel pipeline with various compression approaches for 4 simulated datasets.


List of Figures

1.1 Overview of the drop in the sequencing cost. As can be observed, the rate of this price drop is surpassing Moore’s law. Source: www.genome.gov/sequencingcostsdata/.

1.2 Next Generation Sequencing technologies require a library preparation of the DNA sample, which includes cutting the DNA into small fragments. These are then used as input into the sequencing machine, which performs the sequencing in parallel. The output data is stored in a FASTQ file.

1.3 The “reads” output by the sequencing machine correspond to fragments of the genome. The coverage indicates the number of reads on average that were generated from a random location of the genome.

1.4 Example of a FASTQ file entry corresponding to a read of length 28 and quality scores in the scale Phred + 33.

1.5 Example of the information contained in a SAM file for a single read. The information in blue represents the data contained in the FASTQ file (i.e., the read identifier, the read itself, and the quality scores), and that in orange the one pertaining to the alignment. In this example, the read maps to chromosome 1, at position 50, with no mismatches. For more details on the extra and optional fields, we refer the reader to [1].

1.6 Example of two variants contained in a VCF file. For example, the first one corresponds to a SNP in chromosome 20, position 150. The reference sequence contains a “T” at that location, whereas the sequenced genome contains a “C”.

1.7 Typical pipeline of genome sequencing, with the corresponding generated files at each step. Sizes are approximate, and correspond to sequencing a human genome at 200× coverage (200×).

2.1 Diagram of the main steps of the proposed algorithm iDoComp.

2.2 Flowchart of the post-processing of the sequence of matches M to generate the set S.

3.1 Performance of the proposed method and the previously proposed algorithms when compressing aligned reads of low coverage data sets.

4.1 Example of the boundary points and reconstruction points found by a Lloyd-Max quantizer, for M = 3.

4.2 Our temporal Markov model.

4.3 Rate-Distortion performance of QualComp for MSE distortion, and the M. Musculus dataset. cX stands for QualComp run with X number of clusters.

4.4 Rate-Distortion curves of PBlock, RBlock, QualComp and QVZ, for MSE, L1 and Lorentzian distortions. In QVZ, c1, c3 and c5 denote 1, 3 and 5 clusters (when using k-means), respectively. QualComp was run with 3 clusters.

4.5 Rate-Distortion curves of QVZ for the H. Sapiens dataset, when the clustering step is performed with k-means (c3 and c10), and with the Mixture of Markov Model approach (K3 and K10). In both cases we used 3 and 10 clusters.

5.1 Difference between the GIAB NIST “ground truth” and the one from Illumina, for (a) chromosome 11 and (b) chromosome 20.

5.2 Average sensitivity, precision and f-score of the four considered datasets using the NIST ground truth. Different colors represent different pipelines, and different points within an algorithm represent different rates. Q40 denotes the case of setting all the quality scores to 40.

5.3 Average sensitivity, precision and f-score of the four considered datasets using the Illumina ground truth. Different colors represent different pipelines, and different points within an algorithm represent different rates. Q40 denotes the case of setting all the quality scores to 40.

5.4 Comparison of the average sensitivity, precision and f-score of the four considered datasets and two ground truths, when QVZ is used with 3 clusters computed with k-means (QVZ-Mc3) and Mixture of Markov Models (QVZ-MMM3). Different colors represent different pipelines, and different points within an algorithm represent different rates.

5.5 Box plot of f-score differences between the lossless case and six lossy compression algorithms for 24 simulations (4 datasets, 3 pipelines and 2 ground truths). The x-axis shows the compression rate achieved by the algorithm. The three left-most boxes correspond to QVZ-Mc3 with parameters 0.9, 0.8 and 0.6, while the three right-most boxes correspond to RBlock with parameters 30, 20 and 10. The blue line indicates the mean value, and the red one the median.

5.6 ROC curve of chromosome 11 (ERR262996) with the NIST ground truth and the GATK pipeline with the VQSR filter. The ROC curve was generated with respect to the VQSLOD field. The results are for the original quality scores (uncompressed), and those generated by QVZ-Mc3 (MSE distortion and 3 clusters), PBlock (p = 8) and RBlock (r = 25).

5.7 Average (of four simulated datasets) sensitivity, precision and f-score for INDEL detection pipelines. Different colors represent different pipelines, and different points within an algorithm represent different rates.

6.1 Outline of the proposed denoising scheme.

6.2 Reduction in size achieved by the denoiser when compared to the original data (when losslessly compressed).

6.3 Denoiser performance on the GATK pipeline (30x dataset, chr. 20). Different points of the same color correspond to running the lossy compressor with different parameters.

6.4 Denoiser performance on the GATK and hstlib.org pipelines (15x dataset, chr. 11).

6.5 Improvement achieved by applying the post-processing operation. The x-axis represents the performance in sensitivity, precision and f-score achieved by solely applying lossy compression, and the y-axis represents the same but when the post-processing operation is applied after the lossy compressor. The grey line corresponds to x = y, and thus all the points above it correspond to an improved performance.

7.1 Answering queries from signatures: a user first makes a query to the compressed database, and upon receiving the indexes of the sequences that may possibly be similar, discards the false positives by retrieving the sequences from the original database.

7.2 Signature assignment of the LC scheme for each sequence x in the database.

7.3 Binary sources and Hamming distortion: if P_X = P_Y = Bern(0.5), R_ID^LC(D) = R_ID^TC(D) = R_ID(D), whereas if P_X = P_Y = Bern(0.7), R_ID^LC(D) > R_ID^TC(D) = R_ID(D).

7.4 Binary symmetric sequences and similarity threshold D = 0.2: (a) performance of the proposed architecture with quantized distortion; (b) comparison with LSH for rate R = 0.3.

7.5 Performance of the proposed schemes for sequences of length n = 512, similarity thresholds D = {0.05, 0.1}, and P_X = P_Y = Bern(0.7) and Bern(0.8).

7.6 Performance of the LC scheme for D = {0.1, 0.2} applied to two databases composed of 4-ary sequences: one generated uniformly i.i.d. and the other comprised of real DNA sequences from [2].


Chapter 1

Introduction

In the year 2000, US president Bill Clinton declared the success of the Human Genome Project [3], calling it “the most important scientific discovery of the 20th century” (although it wasn’t until 2003 that the human genome assembly was finally completed). It was the end of a project that took almost 13 years to complete and cost 3 billion dollars (around $1 per base pair).

Fortunately, sequencing cost has drastically decreased in recent years. While in 2004 the cost of sequencing a whole human genome was around $20 million, in 2008 it dropped to a million, and in 2015 to a mere $1000 (see Fig. 1.1). As a result of this decrease in sequencing cost, as well as advancements in sequencing technology, massive amounts of genomic data are being generated. At the current rate of growth (sequencing data is doubling approximately every seven months), more than an exabyte of sequencing data per year will be produced, approaching the zettabytes by 2025 [4]. As an example, the sequencing data generated by the 1000 Genomes Project¹ in the first 6 months exceeded the sequence data accumulated during 21 years in the NCBI GenBank database [5].

Often, these data are unique, in that the samples are not available for re-sequencing, from organisms and ever-changing ecosystems. Moreover, the tools that are used to process and analyze the data improve over time, and thus it will likely be beneficial to revisit and re-analyze the data in the future. For these, among other reasons, long-term storage with convenient retrieval is required.

¹ www.1000genoms.org


Figure 1.1: Overview of the drop in the sequencing cost. As can be observed, the rate of this price drop is surpassing Moore’s law. Source: www.genome.gov/sequencingcostsdata/.

In addition, the acquisition of the data is highly distributed, which demands a large bandwidth to transmit and access these large quantities of information through the network. This situation calls for state-of-the-art, efficient compressed representations of massive biological datasets, that can not only alleviate the storage requirements, but also facilitate the exchange and dissemination of these data. This undertaking is of paramount importance, as the storage and acquisition of the data are becoming the major bottleneck, as evidenced by the recent flourishing of cloud-based solutions enabling processing the data directly on the cloud. For example, companies such as DNAnexus², GenoSpace³, Genome Cloud⁴, and Google Genomics⁵, to name a few, offer solutions to perform genome analysis in the cloud. In addition, the ultimate goal of genome sequencing is to analyze the data so as to advance the understanding of development, behavior, evolution, and disease. Consequently, substantial effort is put into designing accurate methods for this analysis. In particular, the field of personalized medicine is rapidly developing, enabling the design of individualized paths to help patients mitigate risks, prevent disease and treat it effectively when it occurs.

² http://dnanexus.com
³ http://www.genospace.com
⁴ http://www.genome-cloud.com
⁵ https://cloud.google.com/genomics/


The main objective of this thesis is to develop new methods and algorithms to ease the storage and distribution of the genomic data, and to facilitate its access and analysis.

1.1 Easing the storage and distribution of genomic data

Most of the genomic data being stored and analyzed to date are comprised of sequencing data produced by Next Generation Sequencing (NGS) technologies. Unfortunately, these technologies are not capable of providing the whole genome sequence. As a reference, the human genome sequence contains 3 billion base-pairs. The reason is that all currently available NGS sequencing platforms require some level of DNA pre-processing into a library suitable for sequencing. In general, these steps involve fragmenting the DNA into an appropriate platform-specific size range, and ligating specialized adapters to both fragment ends. The different sequence platforms have devised different strategies to prepare the sequence libraries into suitable templates as well as to detect the signal and ultimately read the DNA sequence (see [6] for a detailed review).⁶ The method employed by the different NGS technologies for the readout of the sequencing signal is known as base calling. Fig. 1.2 provides an overview of the sequencing process.

Figure 1.2: Next Generation Sequencing technologies require a library preparation of the DNA sample, which includes cutting the DNA into small fragments. These are then used as input into the sequencing machine, which performs the sequencing in parallel. The output data is stored in a FASTQ file.

⁶ To increase the signal-to-noise ratio on the reading of each base-pair, some technologies perform a local clonal amplification of the prepared fragment.


As a consequence of this process, the NGS technologies produce, instead of the whole genome sequence, a collection of millions of small fragments (corresponding to those generated during the library preparation). These fragments are called “reads”, and can be thought of as strings randomly sampled from the genome that was sequenced. For most technologies, the length of these reads is a few hundred base-pairs, which is generally significantly smaller than the genome itself (recall that a human genome is composed of around 3 × 10^9 base-pairs). Fig. 1.3 illustrates this concept.

Figure 1.3: The “reads” output by the sequencing machine correspond to fragments of the genome. The coverage indicates the number of reads on average that were generated from a random location of the genome.

The base calling process may be affected by various factors, which may lead to a wrong readout of the sequencing signal. In order to assess the probability of base calling mistakes, the sequencers generate, in addition to the reads (i.e., the nucleotide sequences), quality scores that reflect the level of confidence in the readout of each base. Thus each of the reads is accompanied by a sequence of quality scores, of the same length, that indicate the level of confidence of each of the nucleotides of the read. The higher the quality score, the higher the reliability of the corresponding base. Specifically, the quality score Q is the integer mapping of P (the probability that the corresponding base call is incorrect) and it is usually represented in one of the following scales/standards:

• Sanger or Phred scale [7]: Q = −10 log10(P).

• Solexa scale: Q = −10 log10(P / (1 − P)).

Different NGS technologies use different scales, Phred + 33, Phred + 64 and Solexa + 64 being the most common ones. For example, Phred + 33 corresponds to values of Q in the range [33 : 73]. Note that the size of the alphabet of the quality scores is around 40.
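To make the Phred mapping concrete, the following minimal Python sketch (an illustration, not code from the thesis) converts a base-call error probability into an integer Phred quality score and into the ASCII character that would represent it under the Phred + 33 convention described above.

import math

def phred_quality(p_error: float) -> int:
    """Map a base-call error probability to an integer Phred quality score."""
    return round(-10 * math.log10(p_error))

def to_phred33_char(q: int) -> str:
    """Encode a quality score as the ASCII character used in Phred+33 FASTQ files."""
    return chr(q + 33)

# Example: an error probability of 0.001 gives Q = 30, written as '?' in Phred+33.
q = phred_quality(0.001)
print(q, to_phred33_char(q))   # 30 ?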

Quality scores are important and very useful in many “downstream applications” (i.e., applications that operate on the sequencing data) such as trimming (used to remove untrusted regions) [8, 9], alignment [10, 11, 12, 13] or Single Nucleotide Polymorphism (SNP) detection [14, 15], among others.

The raw sequencing data is therefore mainly composed of the reads and the quality scores. This information is stored in the FASTQ format, which is widely accepted as a standard for storing sequencing data. FASTQ files consist of separate entries for each read, each consisting of four lines. The first one is for the header line, which begins with the ‘@’ character and is followed by a sequence identifier and an optional description. The second one contains the nucleotide sequence. The third one starts with the ‘+’ character and can be followed by the same information stored in the first line. Finally, the fourth line is for the quality scores associated with the bases that appear in the nucleotide sequence of line two (both lines must contain the same number of symbols). The quality scores are represented with ASCII characters in the FASTQ file. Fig. 1.4 shows an example of a FASTQ entry.

@SRR0626347 13976/1

TGGAATCAGATGGAATCATCGAATGGTC

+

IIIHIIHABBBAA=2))!!!(!!!((!!

Figure 1.4: Example of a FASTQ file entry corresponding to a read of length 28 and quality scores in the scale Phred + 33.
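As an illustration of the four-line structure just described, the sketch below reads FASTQ entries and recovers, for each read, its identifier, nucleotide sequence, and quality scores as integers, assuming the Phred + 33 scale of Fig. 1.4. It is a simplified reader, and the file name in the usage note is hypothetical.

def read_fastq(path):
    """Yield (identifier, sequence, quality_scores) tuples from a FASTQ file."""
    with open(path) as f:
        while True:
            header = f.readline().rstrip()
            if not header:          # end of file
                break
            sequence = f.readline().rstrip()
            f.readline()            # '+' separator line (may repeat the header)
            qualities = f.readline().rstrip()
            # Each quality character encodes Q + 33 in the Phred+33 scale.
            scores = [ord(c) - 33 for c in qualities]
            yield header[1:], sequence, scores  # drop the leading '@'

# Hypothetical usage:
# for name, seq, quals in read_fastq("reads.fastq"):
#     print(name, len(seq), quals[:5])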

The number of reads present in the raw sequencing data depends on the coverage (i.e., the expected number of times a specific nucleotide of the genome is sequenced). As an example, sequencing a human genome with 200× coverage (Illumina’s current technology⁷) will generate around 6 billion reads (assuming a typical read length of 100). Thus the resulting FASTQ files are very large (typically on the order of hundreds of gigabytes or larger).

⁷ http://www.illumina.com/platinumgenomes/
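The read-count figure above follows directly from the definition of coverage; a small sketch of the arithmetic, using the approximate numbers from the text:

# coverage ≈ (number of reads × read length) / genome length,
# so the expected number of reads is coverage × genome length / read length.
genome_length = 3e9      # approximate human genome size in base pairs
read_length = 100        # typical read length assumed in the text
coverage = 200           # 200x coverage

num_reads = coverage * genome_length / read_length
print(f"{num_reads:.0e} reads")   # ~6e+09, i.e., about 6 billion reads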


Once the raw sequencing data is generated, the next step is typically to align the reads to a reference sequence. In brief, the alignment process consists of determining, for each of the reads contained in the FASTQ file, the corresponding location in the reference sequence from which the read was generated (or that no such region exists). This is achieved by comparing the sequence of the read to that of the reference sequence. A mapping algorithm will try to locate a (hopefully unique) location in the reference sequence that matches the read, while tolerating a certain amount of mismatches. Recall that the reads were generated from a genome that most likely differs from that used as a reference sequence, and thus variations between the reads and the reference sequence are to be expected. For each read, the alignment program provides the location where the read is mapped in the reference, as well as the mismatching information, if any, together with some extra fields. This is denoted as the alignment information. This information is stored in the standard SAM format [1], which also contains the original reads and the quality scores. These files, which are heavily used by most downstream applications, are extremely large (typically in the terabytes or larger). Fig. 1.5 shows an example of a SAM file entry for a single read.

Figure 1.5: Example of the information contained in a SAM file for a single read. The information in blue represents the data contained in the FASTQ file (i.e., the read identifier, the read itself, and the quality scores), and that in orange the one pertaining to the alignment. In this example, the read maps to chromosome 1, at position 50, with no mismatches. For more details on the extra and optional fields, we refer the reader to [1].
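The sketch below extracts a few of the mandatory SAM fields mentioned above (read name, reference/chromosome name, 1-based mapping position, CIGAR string, the read itself, and the quality scores) from a single tab-separated alignment line; column positions follow the SAM specification [1], and the example record is a hypothetical one built in the spirit of Fig. 1.5.

def parse_sam_line(line: str) -> dict:
    """Extract the main mandatory fields of one SAM alignment record."""
    fields = line.rstrip("\n").split("\t")
    return {
        "qname": fields[0],        # read identifier
        "rname": fields[2],        # reference sequence (e.g., chromosome) name
        "pos": int(fields[3]),     # 1-based leftmost mapping position
        "cigar": fields[5],        # alignment/mismatch description
        "seq": fields[9],          # the read itself
        "qual": fields[10],        # quality scores (Phred+33 ASCII)
    }

# Hypothetical record: a read mapped to chr1 at position 50 with no mismatches.
record = "read1\t0\tchr1\t50\t60\t28M\t*\t0\t0\tTGGAATCAGATGGAATCATCGAATGGTC\tIIIHIIHABBBAA=2))!!!(!!!((!!"
print(parse_sam_line(record)["pos"])   # 50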

Once the alignment concludes and the SAM file is generated, the next step consists of analyzing the discrepancies between the reads and the reference sequence used for alignment. This process is called variant calling, and it is the main downstream application in practice. The variants (discrepancies) between the original genome and the reference sequence are normally due to Single Nucleotide Polymorphisms (SNPs) (i.e., a single nucleotide variation) and insertions and deletions (denoted by INDELs). The variant calling algorithm analyzes the aligned data, contained in the SAM file, and finds the biologically relevant variants corresponding to the sequenced genome. The set of called variants, together with some extra information such as the quality of the call, are stored in a VCF file [16] (see Fig. 1.6). For human genomes, a VCF file may contain around 3 million variants, most of them due to SNPs (note that two human genomes are about 99.9% identical). The size of this file is in the order of a gigabyte. The variants contained in the VCF file can be later used towards reconstructing the original genome (i.e., the genome from which the reads were generated).

#CHROM  POS  REF  ALT  <Extra Fields>
20      150  T    C    –
20      175  A    T    –

Figure 1.6: Example of two variants contained in a VCF file. For example, the first one corresponds to a SNP in chromosome 20, position 150. The reference sequence contains a “T” at that location, whereas the sequenced genome contains a “C”.
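To illustrate how entries like those in Fig. 1.6 can be used towards reconstructing the sequenced genome, here is a minimal sketch that applies SNP substitutions (only SNPs, not INDELs) at given 1-based positions of a reference sequence. The reference string and variants are toy examples, not real data, and real assembly involves considerably more bookkeeping.

def apply_snps(reference: str, snps):
    """Apply SNP substitutions (1-based position, ref base, alt base) to a reference string."""
    seq = list(reference)
    for pos, ref, alt in snps:
        assert seq[pos - 1] == ref, f"reference mismatch at position {pos}"
        seq[pos - 1] = alt
    return "".join(seq)

# Toy example in the spirit of Fig. 1.6 (positions scaled down):
reference = "ACGTTACGTA"
variants = [(5, "T", "C"), (8, "G", "T")]
print(apply_snps(reference, variants))  # ACGTCACTTA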

Fig. 1.7 illustrates the typical pipeline after sequencing a genome, which is the one that we focus on in this thesis. Summarizing, the sequencing process generates a FASTQ file that contains millions of reads that are generated from the genome. This FASTQ file is then used by an alignment program, which generates a SAM file containing the mapping/alignment information of each of the reads to a reference genome. Finally, a variant caller analyzes the information on the SAM file and finds the variants between the original genome and the one used as reference. The called variants are stored in a VCF file. As mentioned above, with the information contained in the VCF file the original genome can be potentially reconstructed (this process is called “assembly”), yielding the assembled genome. Thus an assembled genome contains the nucleotide sequences of each of the chromosomes that comprise the genome.

Ideally, after all the steps of the pipeline are completed, one would need to store just the VCF file, as it contains all the relevant information regarding the genome that was sequenced. This would eliminate the need for storing the intermediate FASTQ and SAM files, which are generally prohibitively large. Unfortunately, this is not yet the case.


Figure 1.7: Typical pipeline of genome sequencing, with the corresponding generated files at each step. Sizes are approximate, and correspond to sequencing a human genome at 200× coverage (200×).

As already outlined above, there are several reasons why the FASTQ and SAM files need to be stored. For example, the alignment tools and variant calling programs keep improving over time, and thus re-analyzing the raw data can yield new variants that were initially undetected. Thus there is a pressing need for compression of these files, that can ease the storage and transmission of the data.

Although there exist general-purpose compression schemes like gzip, bzip2 and 7zip (www.gzip.org, www.bzip.org and www.7zip.org, respectively) that can be directly applied to any type of genomic data, they do not exploit the particularities of these data, yielding relatively low compression gains [17, 18, 19]. With this gap in compression ratios in mind, several specialized compression algorithms have been proposed in the last few years. These algorithms can be classified into two main categories: i) compression of assembled genomes and ii) compression of raw NGS data (namely FASTQ and SAM files).

Note that even though the main bottleneck in the storage and transmission of genomic data is due to the raw sequencing data (FASTQ, SAM), compression of assembled genomes is also important. For example, whereas an uncompressed human genome occupies around 3 GB, its equivalent compressed form is in general smaller than 10 MB, thus easing the transfer and download of genomes. This means that a human genome could simply be attached to an email. Moreover, as we advance towards personalized medicine, increasing amounts of assembled genomes are expected to be generated.

Compression of the raw sequencing data, that is, FASTQ and SAM files, pertains mainly to the compression of the identifiers, the reads, and the quality scores. However, there has been more interest in the compression of the reads and the quality scores, as they take most of the space, and carry the most relevant information.

Much effort has been put into designing compression schemes for the reads, both for the raw reads and the aligned reads. Better compression ratios can be achieved when considering aligned reads, as in that case only the differences with respect to the sub-sequence of the reference where they aligned to need to be stored (see [17]). Note that those differences, together with the reference sequence, suffice to reconstruct the read exactly.

Quality scores, on the other hand, have been proven to be more difficult to compress than the reads, due in part to their higher entropy and larger alphabet. When losslessly compressed, quality scores typically comprise more than 70% of the compressed file [17]. In addition, there is evidence that quality scores are inherently noisy, and downstream applications that use them do so in varying heuristic manners. As a result, lossy compression of quality scores (as opposed to lossless) has emerged as a candidate for boosting compression performance, at a cost of introducing some distortion (i.e., the reconstructed quality scores may differ from the original ones).

Traditionally, lossy compressors have been analyzed in terms of their rate-distortion performance. Such analysis provides a yardstick for comparison of lossy compressors of quality scores that is oblivious to the multitude of downstream applications, which use the quality scores in different ways. However, the compressed data is used for biological inference. Researchers are thus interested in understanding the effect that the distortion introduced in the quality scores has on the subsequent analysis performed on the data, rather than a more generic measure of distortion such as rate-distortion.

To date, there is no standard practice on how this analysis should be performed. Proof of this is the variety of analyses presented in the literature when a new lossy compressor for quality scores is introduced (see [20, 21, 22, 23] and references therein). Moreover, it is not yet well understood how lossy compression of quality scores affects the downstream analysis performed on the data. This can be explained not only by the lack of a standard practice, but also by the variety of applications that exist and the different manners in which quality scores are used. In addition, the fact that lossy compressors can work at different rates and be optimized for several distortion metrics makes the analysis more challenging. However, such an analysis is important if lossy compression is to become a viable mode for coping with the surging requirements of genomic data storage.

With that in mind, there has been recent effort and interest in obtaining a methodology to analyze how lossy compression of quality scores affects the output of one of the most widely used downstream applications: variant calling, which comprises Single Nucleotide Polymorphism (SNP) and Insertion and Deletion (INDEL) calling. Not surprisingly, recent studies have shown that lossy compression can significantly alleviate storage requirements while maintaining variant-calling performance comparable – and sometimes superior – to the performance achieved using the uncompressed data (see [24] and references therein). This phenomenon can be explained by the fact that the data is noisy, and the current variant callers do not use the data in an optimal manner.

These results suggest that the quality scores can be denoised, i.e., structured noise can be removed to improve the quality of the data. While the proposed lossy compressors for the quality scores address the problem of storage, they are not explicitly designed to also reduce the noise present in the quality scores. Reducing the noise is of importance since perturbations in the data may significantly degrade the subsequent analysis performed on it. Moreover, reducing the noise of the quality scores leads to quality scores with smaller entropy, and consequently to higher compression ratios than those obtained with the original file. Thus denoising schemes for the quality scores can reduce the noise of the genomic data while easing its storage and dissemination, which can significantly advance the processing of genomic data.


1.2 Facilitating access to the genomic data

The generation of new databases and the amount of data contained in existing ones is growing exponentially. For example, the Sequence Read Archive (SRA) contains more than 3.6 petabases of genomic data⁸, the biological database GenBank contains almost 2000 million DNA sequences⁹, and the database BIOZON contains well over 100 million records [2]. As a result, executing queries on them is becoming a time-consuming and challenging task. To facilitate this effort, there has been recent interest in compressing sequences in a database such that similarity queries can still be performed on the compressed database. Compressing the database allows it to be replicated in several locations, thus providing easier and faster access, and potentially reducing the time needed to execute a query.

Given a database consisting of many sequences, similarity queries refer to queries of the form: “which sequences in the database are similar to a given sequence y?”. This kind of query is of practical interest in genomics, such as in molecular phylogenetics, where relationships among species are established by the similarity between their respective DNA sequences. However, note that these queries are of practical interest not only in genomics, where it is important to find sequences that are similar, but in many other applications as well. One example is forensics, where fingerprints are run through various local, state and national fingerprint databases for potential matches. Thus such schemes will soon become necessary in other fields where databases are rapidly growing in size.
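As a point of reference for what such a query computes, the sketch below answers a similarity query naively on the uncompressed database, returning the indices of all sequences within a normalized Hamming distance D of the query. The names and toy data are illustrative; this is the baseline operation, not one of the compressed-domain schemes proposed in Chapter 7.

def hamming_distance(x: str, y: str) -> float:
    """Normalized Hamming distance between two equal-length sequences."""
    return sum(a != b for a, b in zip(x, y)) / len(x)

def similar_sequences(database, query, threshold):
    """Return the indices of database sequences within distance `threshold` of `query`."""
    return [i for i, x in enumerate(database) if hamming_distance(x, query) <= threshold]

# Illustrative example with 4-ary (DNA-like) sequences and D = 0.25:
db = ["ACGTACGT", "ACGTACGA", "TTTTACGT", "GGGGGGGG"]
print(similar_sequences(db, "ACGTACGT", 0.25))  # [0, 1]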

The fundamental limits of this problem characterize the tradeoff between compression rate and the reliability of the queries performed on the compressed data. While those asymptotic limits have been studied and characterized in past work (see for example [25, 26, 27]), how to approach these limits in practice has remained largely unexplored.

1.3 Outline and Contributions

In the first part of the thesis, we investigate and develop methods to ease the distribution and storage of the genomic data.

⁸ http://www.ncbi.nlm.nih.gov/sra
⁹ http://www.ncbi.nlm.nih.gov/genbank


Specifically, we first do an overview of the state-of-the-art compression schemes for the genomic data, and then describe new compression algorithms tailored to assembled genomes and the sequencing data (e.g., the reads and the quality scores). In addition, we define a methodology to analyze the effect that lossy compression of the quality scores has on variant calling, and we use the proposed methodology to evaluate the performance of existing lossy compressors, which constitutes the first in-depth comparison available in the literature. Finally, we introduce the first denoising scheme for the quality scores, and demonstrate improved inference using the denoised data.

In the second part of the thesis, we investigate methods to facilitate the access to genomic data. Specifically, we study the problem of compressing sequences in a database so that similarity queries can still be performed on the compressed database. We propose practical schemes for this task, and show that their performance is close to the fundamental limits under statistical models of relevance.

Our contributions per chapter are as follows:

• In Chapter 2 we focus on compression of assembled genomes. We describe a method that assumes the existence of a reference genome for compression, which is a realistic assumption in practice. Having such a reference boosts the compression performance due to the high similarity between genomes of the same species. The presented algorithm is one of the most competitive to date, both in running time and compression performance. The main underlying insight is that we can efficiently compute the mismatches between the genome and the reference, using suffix arrays, and process these mismatches to reduce their entropy, and thus their compressed representation.

• Chapter 3 describes a new method for compression of aligned reads. The proposed method achieves a considerable improvement over the best previously achieved compression ratios (particularly for high coverage datasets). These results broke the previously perceived experimental Pareto-optimal barrier for compression rate and speed [17]. Furthermore, the proposed algorithm is amenable for conducting operations in the compressed domain, which can speed up the running time of downstream applications.

• In Chapter 4 we focus on the quality scores, proposing two different lossy compression methods that are able to significantly boost the compression performance. The first method, QualComp, uses the Singular Value Decomposition (SVD) to transform the quality scores into Gaussian distributed values. This approach allows us to use theory from rate distortion to allocate the bits in an optimal manner. The second method, QVZ, assumes that the quality scores are generated by an order-1 Markov source. The algorithm computes the empirical distribution and uses it to design the optimal quantizers in order to achieve a specific rate (number of bits per quality score), while optimizing for a given distortion (a toy sketch of this quantizer-design step is given after this list). Moreover, QVZ can also perform lossless compression. We also show that clustering the quality scores prior to compression, using a Markov mixture model, can improve the performance. In addition, we provide a complete rate-distortion analysis that includes previously proposed methods.

• In Chapter 5 we present a methodology to analyze the effect of lossy compression of quality scores on variant calling. Having a defined methodology for comparison is crucial in this multidisciplinary area, where researchers working on lossy compression are not familiar with the variant calling pipelines used in practice. In addition, we use the proposed methodology to evaluate the performance of existing lossy compressors, which constitutes the first in-depth comparison available in the literature. The results presented in this chapter show that lossy compression can significantly alleviate storage requirements while maintaining variant-calling performance comparable – and sometimes superior – to the performance achieved using the uncompressed data.

• In Chapter 6 we build upon the results presented in the previous chapter, which suggest that the quality scores can be denoised. In particular, we propose the first denoiser for quality scores, and demonstrate improved inference on variant calling using the denoised data. Moreover, a consequence of the denoiser is that the entropy of the produced quality scores is smaller, which leads to higher compression ratios than those obtained with the original file. Reducing the noise of genomic data while easing its storage and dissemination can significantly advance the processing of genomic data, and the results presented in this chapter provide a promising baseline for future research in this direction.


• In Chapter 7 we study the problem of compressing sequences in a database so that similarity queries can still be performed on the compressed database. We propose two practical schemes for this task, and show that their performance is close to the fundamental limits under statistical models of relevance. Moreover, we apply the aforementioned schemes to a database containing genomic sequences, as well as to a database containing simulated data, and show that it is possible to achieve significant compression while still being able to perform queries of the form described above.

• To conclude, Chapter 8 contains final remarks and outlines several future researchdirections.

More background on these topics and previous work in these areas is provided in eachindividual chapter.

1.4 Previously Published Material

Part of this dissertation has appeared in the following manuscripts:

• Publication [28]: Idoia Ochoa, Mikel Hernaez, and Tsachy Weissman, “iDoComp: acompression scheme for assembled genomes”, Bioinformatics, btu698, 2014.

• Publication [29]: Idoia Ochoa, Mikel Hernaez, and Tsachy Weissman, “Alignedgenomic data compression via improved modeling”, Journal of bioinformatics andcomputational biology, Vol. 12, no. 06, 2014.

• Publication [20]: Idoia Ochoa, Himanshu Asnani, Dinesh Bharadia, Mainak Chowd-hury, Tsachy Weissman, and Golan Yona, “QualComp: a new lossy compressor forquality scores based on rate distortion theory”, BMC bioinformatics, Vol. 14, no. 1,2013.

• Publication [22]: Greg Malysa, Mikel Hernaez, Idoia Ochoa, Milind Rao, KarthikGanesan, and Tsachy Weissman, “QVZ: lossy compression of quality values”, Bioin-formatics, btv330, 2015.

Page 32: GENOMIC DATA COMPRESSION AND PROCESSING: THEORY, MODELS, ALGORITHMS…idoia.ece.illinois.edu/papers/genomic-data-compression_v... · 2018. 3. 8. · genomic data compression and processing:

CHAPTER 1. INTRODUCTION 15

• Publication [30]: Mikel Hernaez, Idoia Ochoa, Rachel Goldfeder, Tsachy Weissman,and Euan Ashley, “A cluster-based approach to compression of Quality Scores”, DataCompression Conference (DCC), 2015.

• Publication [24]: Idoia Ochoa, Mikel Hernaez, Rachel Goldfeder, Tsachy Weissman,and Euan Ashley, “Effect of lossy compression of quality scores on variant calling”,Briefings in Bioinformatics, doi: 10.1093/bib/bbw011, 2016.

• Publication [31]: Idoia Ochoa, Mikel Hernaez, Rachel Goldfeder, Tsachy Weissman,and Euan Ashley, “Denoising of Quality Scores for Boosted Inference and ReducedStorage”, Data Compression Conference (DCC), 2015.

• Publication [32]: Idoia Ochoa, Amir Ingber, and Tsachy Weissman, “Efficient simi-larity queries via lossy compression”, 51st Annual Allerton Conference on Commu-nication, Control, and Computing, 2013.

• Publication [33], Idoia Ochoa, Amir Ingber, and Tsachy Weissman, “Compressionschemes for similarity queries”, Data Compression Conference (DCC), 2014.

Page 33: GENOMIC DATA COMPRESSION AND PROCESSING: THEORY, MODELS, ALGORITHMS…idoia.ece.illinois.edu/papers/genomic-data-compression_v... · 2018. 3. 8. · genomic data compression and processing:

Chapter 2

Compression of assembled genomes

As the sequencing technologies advance, more genomes are expected to be sequenced andassembled in the near future. Thus there is a need for compression of genomes guaranteedto perform well simultaneously on different species, from simple bacteria to humans, thatcan ease their transmission, dissemination and analysis.

In this chapter we introduce iDoComp, a compressor of assembled genomes that com-presses an individual genome using a reference genome for both the compression and thedecompression. Note that most of the genomes to be compressed correspond to individualsof a species from which a reference already exists on the database. Thus, it is natural toassume and exploit the availability of such references.

In terms of compression efficiency, iDoComp outperforms previously proposed algo-rithms in most of the studied cases, with comparable or better running time. For example,we observe compression gains of up to 60% in several cases, including H. Sapiens data,when comparing to the best compression performance among the previously proposed al-gorithms.

2.1 Survey of compression schemes for assembled genomes

Several compression algorithms for assembled genomes have been proposed in the last twodecades. On the one hand, dictionary based algorithms as BioCompress2 [34] and DNA-Compress [35] compress the genome by identifying low complexity sub-strings as repeats

16

Page 34: GENOMIC DATA COMPRESSION AND PROCESSING: THEORY, MODELS, ALGORITHMS…idoia.ece.illinois.edu/papers/genomic-data-compression_v... · 2018. 3. 8. · genomic data compression and processing:

CHAPTER 2. COMPRESSION OF ASSEMBLED GENOMES 17

or palindromes and replacing them by the corresponding codeword from the codebook. Onthe other hand, statistics-based algorithms such as XM [36] generate a statistical model ofthe genome and then use entropy coding that relies on the previously computed probabili-ties.

Although the aforementioned algorithms perform very well over data of relatively smallsize, such as mitochondrial DNA, they are impractical for larger sequences (recall that thesize of the human genome is on the order of several gigabytes). Further, they focus oncompressing a single genome without the use of a reference sequence, and thus do notexploit the similarities across genomes of the same species.

It was in 2009 that interest in reference-based compression started to rise with the publi-cation of DNAzip [37] and the proposal from [38]. In DNAzip, the authors compressed thegenome of James Watson (JW) to a mere 4MB based on the mapping from the JW genometo a human reference and using a public database of the most common Single NucleotidePolymorphisms (SNPs) existing in humans. [39] further improved the DNAzip approachby performing a parametric fitting of the distribution of the mapping integers. The mainlimitations of these two proposals is that they rely on a database of SNPs available only forhumans and further assume that the mapping from the target to the reference is given. Thus,while they set a high performance benchmark for whole human genome compression, theyare currently not applicable beyond this specific setting.

[40] proposed the RLZ algorithm for reference-based compression of a set of genomes.The authors improved the algorithm in a subsequent publication, yielding the RLZ-optalgorithm [41]. The RLZ algorithms are based on parsing the target genome into the ref-erence sequence in order to find longest matches. While in RLZ the parsing is done in agreedy manner (i.e., always selecting the longest match), in the optimized version, RLZ-opt, the authors proposed a non-greedy parsing technique that improved the performanceof the previous version. Each of the matches is composed of two values: the position ofthe reference where the match starts (a.k.a. offset) and the length of the match. Once theset of matches is found, some heuristics are used to reduce the size of the set. For example,short matches may be more efficiently stored as a run of base-pairs (a.k.a. literals) than asa match (i.e., a position and a length). Finally, the remaining set together with the set ofliterals is entropy encoded.

Page 35: GENOMIC DATA COMPRESSION AND PROCESSING: THEORY, MODELS, ALGORITHMS…idoia.ece.illinois.edu/papers/genomic-data-compression_v... · 2018. 3. 8. · genomic data compression and processing:

CHAPTER 2. COMPRESSION OF ASSEMBLED GENOMES 18

[42] proposed the GDC algorithm, which is based on the RLZ-opt. One of the differ-ences between the two is that GDC performs the non-greedy parsing of the target into thereference by hashing rather than using a suffix array. It also performs different heuristicsfor reducing the size of the set of matches, such as allowing for partial matches and allow-ing or denying large “jumps” in the reference for subsequent matches. GDC offers severalvariants, some optimized for compression of large collections of genomes, e.g., the ultra

variant. Finally, [42] showed in their paper that GDC outperforms (in terms of compressionratio) RLZ-opt and, consequently, all the previous algorithms proposed in the literature. Itis worth mentioning that the authors of RLZ did create an improved version of RLZ-optwhose performance is similar to that of GDC. However, [42] showed that it was very slowand that it could not handle a data set with 70 human genomes.

At the same time, two other algorithms, namely GRS and GReEn, [43] and [44], re-spectively, were proposed. The main difference between the aforementioned ones and GRSand GReEN is that in the later two the authors only consider the compression of a singlegenome based on a reference, rather than a set of genomes. Moreover, they assume thatthe reference is available and need not be compressed. It was shown in [42, 44] that GRSonly considers pairs of targets and references that are very similar. [44] proposed an al-gorithm based on arithmetic coding. They use the reference to generate the statistics andthen they perform the compression of the target using arithmetic coding, which uses thepreviously computed statistics. They showed clearly that GReEn was superior to both GRSand the non-optimized RLZ. However, [18] showed in their review paper that there weresome cases where GRS clearly outperformed GReEn in both compression ratio and speed.Interestingly, this phenomenon was observed only in cases of bacteria and yeast, whichhave genomes of relatively small size.

In 2012, another compression algorithm was presented in [45], where they showedsome improved compression results with respect to GReEn. However, the algorithm theyproposed has relatively high running time when applied to big datasets and it does not workin several cases. Also in 2012, [46] proposed a compression algorithm for single genomesthat trades off compression time and space requirements while achieving comparable com-pression rates to that of GDC. The algorithm divides the reference sequence into blocks offixed size, and constructs a suffix tree for each of the blocks, which are later used to parse

Page 36: GENOMIC DATA COMPRESSION AND PROCESSING: THEORY, MODELS, ALGORITHMS…idoia.ece.illinois.edu/papers/genomic-data-compression_v... · 2018. 3. 8. · genomic data compression and processing:

CHAPTER 2. COMPRESSION OF ASSEMBLED GENOMES 19

the target into the reference.Although it is straightforward to adapt GReEn to the database scenario in order to

compare it with the other state-of-the-art algorithm GDC, in the review paper by [18] theydid not perform any comparison between them. Moreover, the algorithm introduced in[46] was not mentioned. On the other hand, in the review paper by [19] the authors didcompare all the algorithms stating that GDC and [46] achieved the highest compressionratios. However, no empirical evidence in support of that statement was shown in thearticle. Finally, in [47] they showed that GDC achieved better compression ratios than [46]in the considered data sets.

After having examined all the available comparisons in the literature, we considerGReEN and GDC to be the state-of-the-art algorithms in reference-based genomic datacompression. Thus, we use these algorithms as benchmarks1. However, we also add GRSto the comparison base in the cases where [18] showed that GRS outperformed GReEn.

Although in this work we do not focus on compression of collections of genomes,for completeness we introduce the main algorithms designed for this task. As mentionedabove, the version GDC-Ultra introduced in [42] specializes in compression of a collectionof genomes. In 2013, a new algorithm designed for the same purpose, FRESCO, was pre-sented in [48]. The main innovations of FRESCO include a method for reference selectionand reference rewriting, and an implementation of a second-order compression. FRESCOoffers lower running times than GDC-Ultra, with comparable compression ratios. Finally,in [47] they showed that in this scenario a boost in compression ratio is possible if oneconsiders the genomes are given as variations with respect to a reference, in VCF format[16], and the similarity of the variations across genomes is exploited.

In the next section we present the proposed algorithm iDoComp. It is based on a com-bination of ideas proposed in [37], [38] and [45].

2.2 Proposed method

In this subsection we start by describing the proposed algorithm iDoComp, whose goal is tocompress an individual genome assuming a reference is available both for the compression

1We do not use the algorithm proposed in [46] because we were unable to run the algorithm.

Page 37: GENOMIC DATA COMPRESSION AND PROCESSING: THEORY, MODELS, ALGORITHMS…idoia.ece.illinois.edu/papers/genomic-data-compression_v... · 2018. 3. 8. · genomic data compression and processing:

CHAPTER 2. COMPRESSION OF ASSEMBLED GENOMES 20

T 2 ⌃t

S 2 ⌃s

su�x tree

mappinggenerator

post processof M

MM, S, I

binary fileentropyencoder

Figure 2.1: Diagram of the main steps of the proposed algorithm iDoComp.

and the decompression. We then present the data used to compare the performance of thedifferent algorithms, and the machine specifications where the simulations were conducted.

2.2.1 iDoComp

The input to the algorithm is a target string T œ �t of length t over the alphabet �, and areference string S œ �s of length s over the same alphabet. Note that, in contrast to [41],the algorithm does not impose the condition that the specific pair of target and referencecontain the same characters, e.g., the target T may contain the character N even if it is notpresent in the reference S. As outlined above, the goal of iDoComp is to compress thetarget sequence T using only the reference sequence S, i.e., no other aid is provided to thealgorithm.

iDoComp is composed of mainly three steps (see Figure 2.1): i) the mapping genera-tion, whose aim is to express the target genome T in terms of the reference genome S, ii)the post-processing of the mapping, geared toward decreasing the prospective size of themapping, and iii) the entropy encoder, that further compresses the mapping and generatesthe compressed file. Next, we describe these steps in more detail.

1) Mapping generationThe goal of this step is to create the parsing of the sequence T relative to S. A parsing of

T relative to S is defined as a sequence (Ê1, Ê2, . . . , ÊN

) of sub-strings of S that when con-catenated together in order yield the target sequence T . For reasons that will become clearlater, we slightly modify the above definition and re-define the parsing as (Ê1, Ê2, . . . , Ê

N

),where Ê

i

= (Êi

, Ci

), Êi

being a sub-string of S and Ci

œ � a mismatched character thatappears after Ê

i

in T but not in S. Note that the concatenation of the Ê, i.e., both the

Page 38: GENOMIC DATA COMPRESSION AND PROCESSING: THEORY, MODELS, ALGORITHMS…idoia.ece.illinois.edu/papers/genomic-data-compression_v... · 2018. 3. 8. · genomic data compression and processing:

CHAPTER 2. COMPRESSION OF ASSEMBLED GENOMES 21

sub-strings and mismatched characters, should still yield the target sequence T .A very useful way of expressing the sub-string Ê

i

is as the triplet mi

= (pi

, li

, Ci

), withp

i

, li

œ (1, . . . , s), where pi

is the position in the reference where Êi

starts and li

is thelength of Ê

i

. If there is a letter X in the target that does not appear in the reference, then thematch (p

i≠1 + li≠1, 0, X) will be generated, where p

i≠1 and li≠1 are the position and length

of the previous match, respectively2. In addition, note that if Êi

appears in more than oneplace in the reference, any of the starting positions is a valid choice.

With this notation, the parsing of T relative to S can be defined as the sequence ofmatches M = {m

i

= (pi

, li

, Ci

)}N

i=1.In this work, we propose the use of suffix arrays to parse the target into the reference due

to its attractive memory requirements, especially when compared to other index structuressuch as suffix trees [49]. This makes the compression and decompression of a humangenome doable on a computer with a mere 2 GB of RAM. Also, the use of suffix arrays isonly needed for compression, i.e., no suffix arrays are used for the decompression. Finally,we assume throughout the chapter that the suffix array of the reference is already pre-computed and stored in the hard-drive.

Once the suffix array of the reference is loaded into memory, we perform a greedy pars-ing of the target as previously described to obtain the sequence of matches M = {m

i

}N

i=1.[42] and [41] showed that a greedy parsing leads to suboptimal results. However, we arenot performing the greedy parsing as described in [40], since every time a mismatch isfound we record the mismatched letter and advance one position in the target. Since mostof the variations between genomes of different individuals within the same species areSNPs (substitutions), recording the mismatch character leads to a more efficient “greedy”mapping.

Moreover, note that at this stage the sequence of matches M suffices to reconstruct thetarget sequence T given the reference sequence S. However, in the next step we performsome post-processing over M in order to reduce its prospective size, which will translateto better compression ratios. This is similar to the heuristic used by [42] and [41] for theirnon-greedy mapping.

2We assume p0 = l0 = 0

Page 39: GENOMIC DATA COMPRESSION AND PROCESSING: THEORY, MODELS, ALGORITHMS…idoia.ece.illinois.edu/papers/genomic-data-compression_v... · 2018. 3. 8. · genomic data compression and processing:

CHAPTER 2. COMPRESSION OF ASSEMBLED GENOMES 22

2) Post-Processing of the sequence of matches MAfter the sequence of matches M is computed, a post-processing is performed on them.

The goal is to reduce the total number of elements that will be later compressed by theentropy encoder. Recall that each of the matches m

i

contained in M is composed of twointegers in the range (1, . . . , s) and a character on the alphabet �. Since |�| π s, thenumber of unique integers that appear in M will be in general larger than |�|. Thus,the compression of the integers will require in general more bits than those needed tocompress the characters. Therefore, the aim of this step is mainly to reduce the numberof different integers needed to represent T as a parse of S, which will translate to improvedcompression ratios.

Specifically, in the post-processing step we look for consecutive matches mi≠1 and

mi

that can be merged together and converted into an approximate match. By doing thiswe reduce the cardinality of M at a cost of storing the divergences of the approximatematchings with regards to the exact matchings. We classify these divergences as eitherSNPs (substitutions) or insertions, forming the new sets S and I, respectively.

For the case of the SNPs, if we find two consecutive matches mi≠1 and m

i

that canbe merged at the cost of recording a SNP that occurs between them, we add to the set San element of the form s

i

= (pi

, Ci

), where pi

is the position of the target where the SNPoccurs, with T [p

i

] = Ci

. Then we merge matches mi≠1 and m

i

together into a new matchm Ω (p

i≠1, li≠1 + l

i

+ 1, Ci

). Hence, with this simple process we have reduced the numberof integers from 4 to 3.

We constrain the insertions to be of length one; that is, we do not explicitly store shortruns of literals (together with its position and length). This is in contrast to the argumentof [42] and [41] stating that storing short runs of literals is more efficient than storing theirrespective matching. However, as we show next, we store them as a concatenation of SNPs.Although this might seem inefficient, the motivation behind it is that storing short runs ofliterals will in general add new unique integers, which incurs a high cost, since the entropyencoder (based on arithmetic coding) will assign to them a larger amount of bits. We foundthat encoding them as SNPs and then storing the difference between consecutive positionsof SNPs is more efficient. This process is explained next in more detail.

As pointed out by [41], the majority of the matches mi

belong to the Longest Increasing

Page 40: GENOMIC DATA COMPRESSION AND PROCESSING: THEORY, MODELS, ALGORITHMS…idoia.ece.illinois.edu/papers/genomic-data-compression_v... · 2018. 3. 8. · genomic data compression and processing:

CHAPTER 2. COMPRESSION OF ASSEMBLED GENOMES 23

Sub-Sequence (LISS) of the pi

. In other words, most of the consecutive pi

’s satisfy pi

Æp

i+1 Æ . . . Æ pj

, for i < j, and thus they belong to the LISS. From the mi

’s whose pi

value does not belong to the LISS, we examine those whose length li

is less than a givenparameter L and whose gap to their contiguous instruction is more than �. Among them,those whose number of SNPs is less than a given parameter fl, or are short enough (< �),are classified as several SNPs.

That is, if the match mi

fulfills any of the above conditions, we merge the instructionsm

i≠1 and mi

as described above. Note that the match mi

was pointing to the length li

sub-string starting at position pi

of the reference, whereas now (after merging it to mi≠1) it

points to the one that starts at position pi≠1 + l

i≠1 + 1. Therefore, we need to add to the setS as many SNPs as differences between these two sub-strings.

Note that this operation gets rid of the small-length matches whose pi

’s are far apartfrom their “natural” position in the LISS. These particular matches will harm our compres-sion scheme as they generate new large integers in most of the cases. On the other hand,if the matches were either long, or with several SNP’s, and/or extremely close to their con-tiguous matchings, then storing them as SNP’s would not be beneficial. Therefore, thevalues of L, �, fl and � are chosen such that the expected size of the new generated sub-setof SNPs is less than that of the match m

i

under consideration. This procedure is similar tothe heuristic used by [42] to allow or deny “jumps” in the reference while computing theparsing.

The flowchart of this part of the post-processing is depicted in Figure 2.2.We perform an analogous procedure to find insertions in the sequence of matches M.

Since we only consider length-1 insertions, each insertion in the set I is of the form (p, C),where p indicates the position in the target where the insertion occurs, and C the characterto be inserted. As mentioned before, the short runs of literals have been taken care of in thelast step of the SNP set generation.

After the post-processing described above is concluded, the sequence of matches Mand the two sets S and I are fed to the entropy encoder.

3) Entropy encoder

Page 41: GENOMIC DATA COMPRESSION AND PROCESSING: THEORY, MODELS, ALGORITHMS…idoia.ece.illinois.edu/papers/genomic-data-compression_v... · 2018. 3. 8. · genomic data compression and processing:

CHAPTER 2. COMPRESSION OF ASSEMBLED GENOMES 24

start

i == |M|?

mi is now the merger of mi�1

and mi. Store the SNP in S.

Compute number of

SNPs if

pi pi�1 + li�1 + 1

Are there few SNPsor the length of mi is

very small?

no

yes

no

yes

no

yes

Store mi�1 in M

Read match m1

and set i to 2

Read match mi from M

yesno

stop

Is mi short andaway from mi�1?

Is there asubstitution

between mi�1 andmi?

Store the previously mergedinstructions in M

mi is now the merger ofmi�1 and mi.

Store all the SNPs in S.

Store mi�1 in M

i++

Figure 2.2: Flowchart of the post-processing of the sequence of matches M to generate theset S.

The goal of the entropy encoder is to compress the instructions contained in the se-quence of matches M and the sets S, I generated in the two previous steps. Recall that theelements in M, S and I are given by integers and/or characters, which will be compressedin a different manner. Specifically, we first create two vectors ⇡ and � containing all theintegers and characters, respectively, from M, S and I. In order to be able to determineto which instruction each integer and character belongs, at the beginning of ⇡ we add thecardinalities of M, S and I, as well as the number of chromosomes.

Page 42: GENOMIC DATA COMPRESSION AND PROCESSING: THEORY, MODELS, ALGORITHMS…idoia.ece.illinois.edu/papers/genomic-data-compression_v... · 2018. 3. 8. · genomic data compression and processing:

CHAPTER 2. COMPRESSION OF ASSEMBLED GENOMES 25

To store the integers, first note that all the positions pi

in S and I are ordered in as-cending order, thus we can freely store the p

i

’s as pi

Ω pi

≠ pi≠1, for i Ø 2; that is, as the

difference between successive ones. We perform a similar computation with the pi

’s of M.Specifically, we store each p

i

as pi

Ω |pi

≠(pi≠1+l

i≠1)|, for i Ø 2. However, since some ofthe matches may not belong to the LISS, there will be cases where p

i≠1 + li≠1 > p

i

. Hence,a sign vector s is needed in this case to save the sign of the newly computed positions inM. Finally, the lengths l

i

œ M are also stored in ⇡.Once the vector ⇡ is constructed, it is encoded by a byte-based adaptive arithmetic

encoder, yielding the binary stream Afi

(⇡). Specifically, we represent each integer withfour bytes3, and encode each of the bytes independently, i.e., with a different model. Thisavoids the need for having to store the alphabet, which can be a large overhead in somecases. Moreover, the statistics of each of the bytes are updated sequentially (adaptively),and thus they do not need to be previously computed.

The vector � is constructed by storing all the characters belonging to M, S and I.First, note that since the reference is available at both the encoder and the decoder, theyboth can access any position of the reference. Thus, for each of the characters C

i

œ �

we can access its corresponding mismatched character in the reference, that we denote asR

i

. Thus, we generate a tuple of the form (R, C) for all the mismatched characters of theparsing. We then employ a different model for the adaptive Arithmetic encoder for each ofthe different R’s, and encode each C

i

œ � using the model associated to its correspondingR

i

. Note that by doing this, one or two bits per letter can be saved in comparison with thetraditional one-code-for-all approach.

Finally, the binary output stream is the concatenation of Afi

(⇡), s and A“

(�) (thearithmetic-compressed vector �).

2.2.2 Data

In order to asses the performance of the proposed algorithm iDoComp, we consider pair-wise compression applied to different datasets. Specifically, we consider the scenario wherea reference genome is used to compress another individual of the same species (the target

3We chose four bytes as it is the least number of bytes needed to represent all possible integers.

Page 43: GENOMIC DATA COMPRESSION AND PROCESSING: THEORY, MODELS, ALGORITHMS…idoia.ece.illinois.edu/papers/genomic-data-compression_v... · 2018. 3. 8. · genomic data compression and processing:

CHAPTER 2. COMPRESSION OF ASSEMBLED GENOMES 26

genome)4. This would be the case when there is already a database of sequences from aparticular species, and a new genome of the same species is assembled and thus needs tobe stored (and therefore compressed).

The data used for the pair-wise compression are summarized in Table 2.1. This scenariowas already considered by [37, 38, 43, 42, 44, 18] for assessing the performance of theiralgorithms, and thus the data presented in Table 2.1 include the ensemble of all the datasetsused in the previously mentioned papers.

As evident in Table 2.1, we include in our simulations datasets from a variety of species.These datasets are also of very different characteristics in terms for example of the alphabetsize, the number of chromosomes they include, and the total length of the target genomethat needs to be compressed

2.2.3 Machine specifications

The machine used to perform the experiments and estimate the running time of the differ-ent algorithms has the following specifications: 39 GB RAM, Intel Core i7-930 CPU at2.80GHz ◊ 8 and Ubuntu 12.04 LTS.

2.3 Results

Next we show the performance of the proposed algorithm iDoComp in terms of both com-pression and running time, and compare the results with the previously proposed compres-sion algorithms.

As mentioned, we consider pair-wise compression for assessing the performance of theproposed algorithm. Specifically, we consider the compression of a single genome (targetgenome) given a reference genome available both at the compression and the decompres-sion. We use the target and reference pairs introduced in Table 2.1 to assess the performanceof the algorithm. Although in all the simulations the target and the reference belong to thesame species, note that this is not a requirement of iDoComp, which also works for thecase where the genomes are from different species.

4The target and the reference do not necessarily need to belong to the same species, although bettercompression ratios are achieved if the target and the reference are highly similar.

Page 44: GENOMIC DATA COMPRESSION AND PROCESSING: THEORY, MODELS, ALGORITHMS…idoia.ece.illinois.edu/papers/genomic-data-compression_v... · 2018. 3. 8. · genomic data compression and processing:

CHAPTER 2. COMPRESSION OF ASSEMBLED GENOMES 27

Species Chr. Assembly Retrieved from

L. pneumohilia 1 T: NC 017526.1 ncbi.nlm.nih.govR: NC 017525.1

E. coli 1 T: NC 017652.1 ncbi.nlm.nih.govR: NC 017651.1

S. cerevisiae 17 T: sacCer3 ucsc.eduR: sacCer2

C. elegans 7 T: ce10 genome.ucsc.eduR: ce6

A. thaliana 7 T: TAIR10 arabidopsis.orgR: TAIR9

Oryza sativa 12 T: TIGR6.0 rice.plantbiology.msu.eduR: TIGR5.0

D. melanogaster 6 T: dmelr41 fruitfly.orgR: dmelr31

H. sapiens 1 25 T: hg19 ncbi.nlm.nih.govR: hg18

H. sapiens 2 25 T: KOREF 20090224 koreangenome.orgR: KOREF 20090131

H. sapiens 3 25 T: YH yh.genomics.org.cnR: hg18 ncbi.nlm.nih.gov

H. sapiens 4 25 T: hg18 ncbi.nlm.nih.govR: YH yh.genomics.org.cn

H. sapiens 5 25 T: YH yh.genomics.org.cnR: hg19 ncbi.nlm.nih.gov

H. sapiens 6 25 T: hg18 ncbi.nlm.nih.govR: hg19

Table 2.1: Genomic sequence datasets used for the pair-wise compression evaluation. Eachrow specifies the species, the number of chromosomes they contain and the target and thereference assemblies with the corresponding locations from which they were retrieved.T. and R. stand for the Target and the Reference, respectively.

Page 45: GENOMIC DATA COMPRESSION AND PROCESSING: THEORY, MODELS, ALGORITHMS…idoia.ece.illinois.edu/papers/genomic-data-compression_v... · 2018. 3. 8. · genomic data compression and processing:

CHAPTER 2. COMPRESSION OF ASSEMBLED GENOMES 28

To evaluate the performance of the different algorithms, we look at the compressionratio, as well as at the running time of both the compression and the decompression. Wecompare the performance of iDoComp with those of GDC, GReEn and GRS.

When performing the simulations, we run both GReEn and GRS with the default pa-rameters. The results presented for GDC correspond to the best compression among theadvanced and the normal variant configurations, as specified in the supplementary datapresented in [42]. Note that the parameter configuration for the H. sapiens differs from thatof the other species. We modify it accordingly for the different datasets. Regarding iDo-Comp, all the simulations are performed with the same parameters (default parameters),which are hard-coded in the code.

For ease of exhibition, for each simulation we only show the results of iDoComp, GDCand the best among GReEn and GRS. The results are summarized in Table 2.2. For eachspecies, the target and the reference are as specified in Table 2.1. To be fair across thedifferent algorithms, especially when comparing the results obtained in the small datasets,we do not include the cost of storing the headers in the computation of the final size for anyof the algorithms. The last two columns show the gain obtained by our algorithm iDoCompwith respect to the performance of GReEn/GRS and GDC. For example, a reduction from100KB to 80KB represents a 20% gain (improvement). Note that with this metric a 0%gain means the file size remains the same, a 100% improvement is not possible, as this willmean the new file is of size 0, and a negative gain means that the new file is of bigger size.

As seen in Table 2.2, the proposed algorithm outperforms in compression ratio the pre-viously proposed algorithms in all cases. Moreover, we observe that whereas GReEn/GRSseem to achieve better compression in those datasets that are small and GDC in the H.

sapiens datasets, iDoComp achieves good compression ratios in all the datasets, regardlessof their size, the alphabet and/or the species under consideration.

In cases of bacteria (small size DNA), the proposed algorithm obtains compressiongains that vary from 30% against GReEn/GRS to up to 64% when compared with GDC. Forthe S. cerevisae dataset, also of small size, iDoComp does 55% (61%) better in compressionratio than GRS (GDC). For the case of medium size DNA (C. elegans, A. thaliana, O. sativa

and D. melanogaster) we observe similar results. iDoComp again outperforms the otheralgorithms in terms of compression, with gains up to 92%.

Page 46: GENOMIC DATA COMPRESSION AND PROCESSING: THEORY, MODELS, ALGORITHMS…idoia.ece.illinois.edu/papers/genomic-data-compression_v... · 2018. 3. 8. · genomic data compression and processing:

CHAPTER 2. COMPRESSION OF ASSEMBLED GENOMES 29

Finally, for the H. sapiens datasets, we observe that iDoComp consistently performsbetter than GReEn, with gains above 50% in four out of the six datasets considered, and upto 91%. With respect to GDC, we also observe that iDoComp produces better compressionresults, with gains that vary from 3% to 63%.

Based on these results, we can conclude that GDC and the proposed algorithm iDo-Comp are the ones that produce better compression results on the H. Sapiens genomes. Inorder to get more insight into the compression capabilities of both algorithms when deal-ing with Human genomes, we performed twenty extra pair-wise compressions (results notshown), and found that on average iDoComp employs 7.7 MB per genome, whereas GDCemploys 8.4 MB. Moreover, the gain of iDoComp with respect to GDC for the consideredcases is on average 9.92%.

Regarding the running time, we observe that the compression and the decompressiontime employed by iDoComp is linearly dependent on the size of the target to be com-pressed. This is not the case of GReEn, for example, whose running times are highlyvariable. In GDC we also observe some variability in the time needed for compression.However, the decompression time is more consistent among the different datasets (in termsof the size), and it is in general the smallest among all the algorithms we considered. iDo-Comp and GReEn take approximately the same time to compress and decompress. Overall,iDoComp’s running time is seen to be highly competitive with that of the existing meth-ods. However, note that the time needed to generate the suffix array is not counted in thecompression time of iDoComp, whereas the compression time of the other algorithms mayinclude construction of index structures, like in the case of GDC.

Finally, we briefly discuss the memory consumption of the different algorithms. Wefocus on the compression and decompression of the Homo Sapiens datasets, as they repre-sent the larger files and thus the memory consumption in those cases is the most significant.iDoComp employs around 1.2 GB for compression, and around 2 GB for decompression.GReEn consumes around 1.7 GB both for compression and decompression. Finally, thealgorithm GDC employs 0.9 GB for compression and 0.7 GB for decompression.

Page 47: GENOMIC DATA COMPRESSION AND PROCESSING: THEORY, MODELS, ALGORITHMS…idoia.ece.illinois.edu/papers/genomic-data-compression_v... · 2018. 3. 8. · genomic data compression and processing:

CHAPTER 2. COMPRESSION OF ASSEMBLED GENOMES 30

2.4 Discussion

Inspection of the empirical results of the previous section shows the superior performanceof the proposed scheme across a wide range of datasets, from simple bacteria to the morecomplex humans, without the need for adjusting any parameters. This is a clear advantageover algorithms like GDC, where the configuration must be modified depending on thespecies being compressed.

Although iDoComp has some internal parameters, namely, L, �, � and fl,5 the defaultvalues that are hard-coded in the code perform very well for all the datasets, as we haveshown in the previous section. However, the user could modify these parameters data-dependently and achieve better compression ratios. Future work may include exploring theextent of the performance gain (which we believe will be substantial) due to optimizing forthese parameters.

We believe that the improved compression ratios achieved by iDoComp are due largelyto the post-processing step of the algorithm, which modifies the set of instructions in a waythat is beneficial to the entropy encoder. In other words, we modify the elements containedin the sets so as to facilitate their compression by the arithmetic encoder.

Moreover, the proposed scheme is universal in the sense that it works regardless of thealphabet used by the FASTA files containing the genomic data. This is also the case withGDC and GReEn, but not with previous algorithms like GRS or RLZ-opt which only workwith A, C, G, T and N as the alphabet.

It is also worth mentioning that the reconstructed files of both iDoComp and GDC areexactly the original files, whereas the reconstructed files under GReEn do not include theheader and the sequence is expressed in a single line (instead of several lines).

Another advantage of the proposed algorithm is that the scheme employed for com-pression is very intuitive, in the sense that the compression consists mainly of generatinginstructions composed of the sequence of matches M and the two sets S and I that suf-fice to reconstruct the target genome given the reference genome. This information byitself can be beneficial for researchers and gives insight into how two genomes are relatedto each other. Moreover, the list of SNPs generated by our algorithm could be compared

5See the Post-Processing step in Section 2 for more details.

Page 48: GENOMIC DATA COMPRESSION AND PROCESSING: THEORY, MODELS, ALGORITHMS…idoia.ece.illinois.edu/papers/genomic-data-compression_v... · 2018. 3. 8. · genomic data compression and processing:

CHAPTER 2. COMPRESSION OF ASSEMBLED GENOMES 31

with available datasets of known SNPs. For example, the NCBI dbSNP database containsknown SNPs of the H. sapiens species.

Finally, regarding iDoComp, note that we have not included in Table 2.2 the timeneeded to generate the suffix array of the reference, only that needed to load it into memory,which is already included in the total compression time. The reason is that we devise thesealgorithms based on pair-wise compression as the perfect tool for compressing several in-dividuals of the same species. In this scenario, one can always use the same reference forcompression, and thus the suffix array can be reused as many times as the number of newgenomes that need to be compressed.

Regarding compression of sets, any pair-wise compression algorithm can be triviallyused to compress a set of genomes. One has merely to choose a reference and then com-press each genome in the set against the chosen reference. However, as was shown in [42]with their GDC-ultra version of the algorithm, and as can be expected, an intelligent selec-tion of the references can lead to significant boosts in the compression ratios. Therefore,in order to obtain high compression ratios in sets it is of ultimate importance to providethe pair-wise compression algorithms with a good reference-finding method. This couldbe thought of as an add-on that could be applied to any pair-wise compression algorithm.For example, one could first analyze the genomes contained in the set to detect similarityamong genomes, and then use this information to boost the final compression6. The latteris a different problem that needs to be addressed separately.

2.5 Conclusion

We have introduced iDoComp, an algorithm for compression of assembled genomes. Specif-ically, the algorithm considers pair-wise compression, i.e., compression of a target genomegiven that a reference genome is available both for the compression and the decompression.This algorithm is universal in the sense of being applicable for any dataset, from simplebacteria to more complex organisms such as humans, and accepts genomic sequences ofan extended alphabet. We show that the proposed algorithm achieves better compressionthan the previously proposed algorithms in the literature, in most cases. These gains are up

6A similar approach is used in [48]

Page 49: GENOMIC DATA COMPRESSION AND PROCESSING: THEORY, MODELS, ALGORITHMS…idoia.ece.illinois.edu/papers/genomic-data-compression_v... · 2018. 3. 8. · genomic data compression and processing:

CHAPTER 2. COMPRESSION OF ASSEMBLED GENOMES 32

to 92% in medium size DNA and up to 91% in humans when compared with GReEn andGRS. When compared with GDC, the gains are up to 73% and 63% in medium size DNAand humans, respectively.

Page 50: GENOMIC DATA COMPRESSION AND PROCESSING: THEORY, MODELS, ALGORITHMS…idoia.ece.illinois.edu/papers/genomic-data-compression_v... · 2018. 3. 8. · genomic data compression and processing:

CHAPTER 2. COMPRESSION OF ASSEMBLED GENOMES 33

Spec

ies

Raw

Size

GR

eEn/

GR

SúG

DC

iDoC

omp

Gai

n

[MB

]Si

ze[K

B]

C.t

ime

D.t

ime

Size

[KB

]C

.tim

eD

.tim

eSi

ze[K

B]

C.t

ime

D.t

ime

GR

.G

DC

L.pn

eum

ohili

a2.

70.

122ú

1ú1ú

0.22

9†0.

30.

030.

084

0.1

0.1

31%

63%

E.co

li5.

10.

119

10.

60.

242†

0.6

0.06

0.08

60.

20.

230

%64

%S.

cere

visi

ae12

.45.

65ú

5ú5ú

6.47

†1

12.

530.

40.

455

%61

%C

.ele

gans

102.

317

045

4748

.71

113

.33

492

%73

%A.

thal

iana

121.

26.

654

566.

3221

22.

094

568

%67

%O

ryza

sativ

a37

8.5

125.

514

014

612

8.6†

806

105.

411

1516

%18

%D

.mel

anog

aste

r12

0.7

390.

651

243

3.7

231

364.

44

47%

16%

Hom

oSa

pien

s1

3,10

011

,200

1687

1701

2,77

026

3667

1,02

595

130

91%

63%

Hom

oSa

pien

s2

3,10

018

,000

652

721

11,9

7351

178

7,24

712

012

660

%39

%H

omo

Sapi

ens

33,

100

10,3

0043

449

56,

840

141

656,

290

118

125

39%

8%H

omo

Sapi

ens

43,

100

6,50

035

237

26,

426

191

705,

767

115

130

11%

10%

Hom

oSa

pien

s5

3,10

035

,500

1761

1846

11,8

73†

249

6211

,560

122

130

67%

3%H

omo

Sapi

ens

63,

100

10,5

6016

8617

756,

939

2348

705,

241

100

120

50%

24%

Tabl

e2.

2:C

ompr

essi

onre

sults

fort

hepa

irwis

eco

mpr

essi

on.C

.tim

ean

dD

.tim

est

and

forc

ompr

essi

onan

dde

com

pres

sion

time

[sec

onds

],re

spec

tivel

y.Th

ere

sults

inbo

ldco

rres

pond

toth

ebe

stco

mpr

essi

onpe

rfor

man

ceam

ong

the

diff

eren

tal

gorit

hms.

We

use

the

Inte

rnat

iona

lSys

tem

ofU

nits

fort

hepr

efixe

s,th

atis

,1M

Ban

d1K

Bst

ands

for1

06an

d10

3B

ytes

,re

spec

tivel

y.ú

deno

test

heca

sesw

here

GR

Sou

tper

form

edG

ReE

n.In

thes

eca

ses,

i.e.,

L.pn

eum

ohili

aan

dS.

cere

visi

ae,t

heco

mpr

essi

onac

hiev

edby

GR

eEn

is49

5KB

and

304.

2KB

,res

pect

ivel

y.†

deno

tes

the

case

sw

here

GD

C-a

dvan

ced

outp

erfo

rmed

GD

C-n

orm

al.

Page 51: GENOMIC DATA COMPRESSION AND PROCESSING: THEORY, MODELS, ALGORITHMS…idoia.ece.illinois.edu/papers/genomic-data-compression_v... · 2018. 3. 8. · genomic data compression and processing:

Chapter 3

Compression of aligned reads

As outlined in the introduction, NGS data can be classified into two main categories: i)Raw NGS data –which is usually stored as a FASTQ file and contains the raw output ofthe sequencing machine; and ii) aligned raw NGS data –which besides the raw data itadditionally contains its alignment to a reference. The latter is usually stored as a SAM file[1].

In this chapter we focus on the aligned (reference-based) scenario, since this data isboth the largest (up to terabytes) and the one generally used in downstream applications.Moreover, we believe that the size of these files conveys the major bottleneck for the trans-mission and handling of NGS data among researchers and institutions and, therefore, goodcompression algorithms are of primal need.

We demonstrate the benefit of data modeling for compressing aligned reads. Specif-ically, we show that, by working with data models designed for the aligned data, we canimprove considerably over the best compression ratio achieved by previously proposed al-gorithms. The results obtained by the proposed method indicate that the pareto-optimalbarrier for compression rate and speed claimed by [17] does not apply for high coveragealigned data. Furthermore, our improved compression ratio is achieved by splitting thedata in a manner conducive to operations in the compressed domain by downstream appli-cations.

34

Page 52: GENOMIC DATA COMPRESSION AND PROCESSING: THEORY, MODELS, ALGORITHMS…idoia.ece.illinois.edu/papers/genomic-data-compression_v... · 2018. 3. 8. · genomic data compression and processing:

CHAPTER 3. COMPRESSION OF ALIGNED READS 35

3.1 Survey of lossless compressors for aligned data

The first compression algorithm proposed for SAM files was BAM [1]. BAM was releasedat the same time as its uncompressed counterpart SAM. However, more than a special-ized compression algorithm, it is a binarization of the SAM file that uses general-purposecompression schemes as its compression engine.

Concerned by the exponentially increasing size of the BAM files, in [50] they proposedCRAM, a new compression scheme for aligned data. CRAM is presented as a toolkit, in thesense that it combines a reference-based compression of sequence data with a data formatthat is directly available for computational use in downstream applications.

In 2013 Goby was presented as an alternative to CRAM [51]. Goby is also presented asa toolkit that performs reference-based compression with a data format that allows down-stream applications to work over the goby files. The authors also provide software to per-form some common NGS tasks as differential expression analysis or DNA methylationanalysis over the goby-compressed files.

Alternatively, more compression-oriented proposals have been published recently. Theseworks do not focus on creating a compression scheme that aids downstream applications towork over the compressed data, but on maximizing the compression ratio.

In 2012, Quip, a lightweight dependency-free compressor of high-throughput sequenc-ing data was proposed [52]. The main strength of Quip is that it accepts both non-alignedand aligned data, in contrast to both CRAM and GOBY1. In the case of non-aligned data,if a reference is available, it performs its own lightweight and fast assembly and then com-presses the result. Since here we focus on aligned data, in the rest of the chapter we considerQuip exclusively on its SAM file compressor mode.

Finally, in [17] they proposed SamComp, which is, to our knowledge, the best com-pression algorithm for aligned data. Two versions of the algorithm were proposed, namely,SamComp1 and SamComp2. Throughout the chapter we consider SamComp1, as it pro-vides better compression results and it is the one recommended by the authors [17]. Sam-Comp is more restrictive than Quip in the sense that it only accepts SAM and/or BAM filesas input files. However, the same authors also proposed other methods that accept FASTQ

1It is in fact the only algorithm appearing in all the categories when comparing compression algorithmsby data type in the review paper of Bonfield et al. (2013) [17]

Page 53: GENOMIC DATA COMPRESSION AND PROCESSING: THEORY, MODELS, ALGORITHMS…idoia.ece.illinois.edu/papers/genomic-data-compression_v... · 2018. 3. 8. · genomic data compression and processing:

CHAPTER 3. COMPRESSION OF ALIGNED READS 36

files and a reference (optional) as input. If a reference is available they perform their ownfast alignment before compression [17]. Note that the compression methods proposed in[17] do not compress the SAM file itself, but only the necessary information to be able torecover the corresponding FASTQ file.

The compression of NGS aligned raw data can be divided in several sub-problems ofdifferent nature; namely, the compression of the reads, the compression of the header and/orthe identifiers, the compression of the quality scores, and the compression of the fields re-lated to the alignment. Although all these sub-problems are addressed by the above algo-rithms2, in this chapter we focus exclusively on compressing the necessary information toreconstruct the reads. The reason being that they carry most – if not all – of the informa-tion used by downstream applications and thus their compression is of primal importance.Moreover, as we will see in the following chapters, the compression of quality scores canbe drastically decreased by the use of lossy compression methods without compromisingthe performance of the downstream applications (see [20, 21]), while the identifiers and theheaders are generally ignored.

In the context of compressing aligned reads, all the aforementioned algorithms, exceptSamComp, perform similarly in terms of compression ratio. Moreover, each of them out-performs the others in at least one data set (see the results tables of [17, 52, 51, 50]). Onthe other hand, SamComp clearly outperforms the rest of the algorithms, as shown in [17].It is important to mention that the comparison is not strictly fair as SamComp is a special-ized SAM compressor whereas the other algorithms are oriented to provide a stable toolkit,which sometimes includes random access capabilities. However, SamComp shows that ju-dicious design of models for the data yields a significant improvement in compression ratioand thus similar compression techniques should be applied in toolkit oriented programssuch as Goby or CRAM.

Following the approach of SamComp, the main contribution of this chapter is not toprovide a complete compression scheme for aligned data, as done in CRAM or Goby, butto demonstrate the importance of data modeling when compressing aligned reads. That is,the purpose of the proposed method is to give insight into the potential gains that can be

2To the exception of [17] where they do not reconstruct a SAM file but a FASTQ file.

Page 54: GENOMIC DATA COMPRESSION AND PROCESSING: THEORY, MODELS, ALGORITHMS…idoia.ece.illinois.edu/papers/genomic-data-compression_v... · 2018. 3. 8. · genomic data compression and processing:

CHAPTER 3. COMPRESSION OF ALIGNED READS 37

achieved by an improved data model. We believe that the techniques described in this chap-ter can be applied in the future to more complete compression toolkits. We show that, byconstructing effective models adapted to the particularities of the genomic data, significantimprovements in compression ratio are possible. We will show that these improvementsbecome considerable in high coverage data sets.

Next, we introduce the lossless data compression problem, and examine the approach(and assumptions) made by the previously proposed algorithms when modeling the data.

3.1.1 The lossless data compression problem

Information theory states that given a sequence s = [s1, s2, . . . , sn

] drawn from a proba-bility distribution PS(s), the minimum number of bits required on average to represent thesequences s is essentially (ignoring a negligible additive constant term) given by H(S) =E[≠ log(PS(S))] bits3, where H(·) is the Shannon entropy, which only depends on theprobability distribution PS.

Thus, the lossless compression problem comprises two sub-problems: i) data modeling,that is, selection of the probability distribution (or model) PS of the sequence, and ii) codeword assignment, that is, given a PS that models the data, find effective means of compress-ing close to the “ideal” ≠ log(PS(s)) number of bits for any given sequence s, such that theexpected length of the compressed sequences can achieve the entropy (that is, the optimumcompression). Arithmetic coding provides an effective mean to sequentially assign a code-word of length close to ≠ log(PS(s)) to s. There are other codes that achieve the “idealcode length” for particular models, such as Golomb codes for geometric distributions andHuffman codes for D-adic distributions.

The lossless compression problem can therefore equivalently be thought of as one offinding good models for the data to be compressed. The model for a sequence s can bedecomposed as the product of the models for each individual symbol s

t

, where at each in-stant t, the model for s

t

is computed using the sequence of previous symbols [st≠1, . . . , s1]

as context. This characteristic makes it possible to generate the models sequentially, whichis particularly relevant for the aligned data compression problem, as the files can be of

3All the logarithms used in this chapter are base-2.

Page 55: GENOMIC DATA COMPRESSION AND PROCESSING: THEORY, MODELS, ALGORITHMS…idoia.ece.illinois.edu/papers/genomic-data-compression_v... · 2018. 3. 8. · genomic data compression and processing:

CHAPTER 3. COMPRESSION OF ALIGNED READS 38

remarkable size, and more than one pass over the data could be prohibitive. Note thatthe memory needed to store all the different contexts grows exponentially with the con-text length and becomes prohibitively large very quickly. To solve this issue, the simpleapproach is to constrain the context lengths to at most m symbols, while more advancedtechniques rely on variable context lengths [53].

In general, a context within a sequence may refer to entities more general than sub-strings preceding a given symbol. In this work we show how the symbols appearing inthe aligned file can be estimated using previously compressed symbols. This modeling ofthe data ultimately leads to considerable improvements over the state of the art algorithmsproposed so far in the literature.

3.1.2 Data modeling in aligned data compressors

Following a similar classification as in [51], we show different approaches for data mod-eling in the context of aligned data, and then introduce the ones employed by the abovementioned algorithms. Throughout the chapter we assume the SAM files are ordered byposition, as this is also the assumption made by Goby, CRAM and SamComp; moreover,as shown in [17], better compression ratios can be achieved.

The most general approach on data modeling is performed by general compressionalgorithms, which perform a serialization of the whole file and compute a byte-based modelover the serialized file. However, since the SAM file is divided in different fields, the firstintuitive approach is to treat each of the fields separately, as each of them is expected to beruled by a different model. A further improvement can be done by considering cross-fieldcorrelation. In this case, the value of some fields can be modeled (or estimated) using thevalue of other fields, thus achieving better compression ratios. Finally, as shown in Ref.[51], other modeling techniques could be used, as for example, generating a template for afield and then compressing the difference between the actual value and the template.

In this context, CRAM encodes each field separately and extensively uses Golomb cod-ing. By using these codes CRAM implicitly models the data as independent and identicallydistributed by a Geometric distribution. The main reason for using Golomb codes is theirsimplicity and speed. However, we believe that the computational overhead carried by the

Page 56: GENOMIC DATA COMPRESSION AND PROCESSING: THEORY, MODELS, ALGORITHMS…idoia.ece.illinois.edu/papers/genomic-data-compression_v... · 2018. 3. 8. · genomic data compression and processing:

CHAPTER 3. COMPRESSION OF ALIGNED READS 39

use of arithmetic codes is negligible, whereas the compression penalty paid by assuminggeometrically distributed fields may be more significant.

Goby offers several compression modes. In the fastest one, general compression meth-ods are used over the serialized data, yielding the worst compression ratios. In the slowermodes a compression technique denoted by the authors as Arithmetic Coding and Template(ACT) is used. In this case all the fields are converted to a list of integers and each list isencoded independently using arithmetic codes. To our knowledge, no context is used whencompressing these lists. Moreover, we believe that the conversion to integer list, althoughit makes the algorithm more scalable and schema-independent, it damages the compressionas it does not exploit possible models or correlations between neighboring integers. In theslowest mode, some inter-list modeling is performed over the integer lists to aid compres-sion. The improvement upon CRAM is that no assumptions are made about the distributionof the fields. Instead, the code learns the distribution of the integers within each list as itcompresses.

As a more data specific compressor, Quip uses a different arithmetic coding over eachfield. However, they do not use any context when compressing the data (thus, implicitlyassuming independent and identically distributed data) and they treat each field indepen-dently. SamComp, on the other hand, performs an extensive use of contexts; thus, theydo not assume the independence of the data. The results of [17] show that the use of con-texts aids compression significantly. The aforementioned proposals –with the exception ofSamComp– lack accurate data models for the reads and make extensive use of generic com-pression methods (e.g., byte-oriented compressors), relying on models that do not assumeor exploit the possible dependencies across different fields.

In the present chapter, we show the importance of data modeling when compressing aligned reads. We show that, by generating data models that are specifically designed for the aligned data, we can improve the compression ratios achieved by the previous algorithms. Moreover, we show that the Pareto-optimal barrier for compression rate and speed claimed in [17] does not hold in general. This fact becomes more notable when compressing aligned reads from high coverage data sets.

Finally, as the proposed scheme is envisaged as being part of a more general toolkit, it is important to aim for a compression scheme that aids the future utilization of the compressed data in downstream applications, as proposed in [50] and [51]. Thus, relevant information, such as the number of Single Nucleotide Polymorphisms (SNPs) and their positions within the read, should be easily accessible from the compressed data. Quip and SamComp lack this feature, as they perform, as part of the read compression, a base-pair by base-pair modeling and compression. The main drawback of this approach is that, in order to find the variations of a specific read and the positions in the reference where they occur, one must first reconstruct the entire read and then extract the information. This can be computationally intensive for downstream applications. In this chapter, we show that the data can be compressed in a way that enables downstream applications to rapidly extract relevant information from the compressed files, not only without compromising the compression ratio, but actually improving it.

3.2 Proposed Compression Scheme

As one of the aims of the proposed compression scheme is to facilitate downstream applications working over the compressed data, we decompose the aligned information regarding each read into different lists, and then compress each of these lists using specific models. All operations (splitting the read into the different lists, computing the corresponding models, and compressing the values in the lists) are done read by read, thus performing only one pass over the data and generating the compressed file as we read the data. Next we describe the different lists into which the read information is split and introduce the models used to compress each of them.

For each read in the SAM file, we generate the following lists:

• List F: We store a single bit that indicates the strand of the read.

• List P: We store an integer that indicates the position where the read maps in the reference.

• List M: We store a single bit that indicates whether or not the read maps perfectly to the reference. We extract this information from both the CIGAR field and the MD auxiliary field⁴. If the latter is not available, the corresponding mismatch information can be extracted directly by loading the reference and comparing the read to it.

• List S: In case of a non-perfect match, if no indels (insertions and/or deletions) occur, we store the number of SNPs. We store a 0 otherwise.

• List I: If at least one indel occurs, we store the number of insertions, deletions, and substitutions (in the following we denote them as variations) occurring within the read.

• List V: For each of the variations (note that each read may have multiple variations), we store an integer that indicates the position where it occurs within the read.

• List C: Finally, for each insertion and substitution, we store the corresponding new base pair, together with the reference base pair in the case of a substitution.

The information contained in the above lists suffices to reconstruct the reads, assuming the reference is available for decompression.

Note that the amount of information we store per read depends on the quality of the mapping. For example,

- If a read maps perfectly to the reference we store just a 1 in M.

- If it only has SNPs, we store a 0 in M, the number of SNPs in S, and their positions and base pairs in V and C, respectively.

- If it has insertions and/or deletions, we store a 0 in M, a 0 in S, the number of insertions, deletions and SNPs in I, and their respective positions and base pairs in V and C.

⁴The CIGAR is a string that contains information regarding the mismatches between the read and the region of the reference to which it maps. The MD field is a string used to indicate the different SNPs that occur in the read.


It can be verified that this transformation of the data is information lossless, and thus no compression capability is lost.
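To make the decomposition concrete, the following sketch (hypothetical field names and a simplified record format, not the actual implementation, which parses the SAM FLAG, POS, CIGAR and MD fields) shows how a single aligned read could be split into the lists above.

# Sketch: split one simplified aligned record into the lists F, P, M, S, I, V, C.
# 'variations' is assumed to be a list of (offset, type, ref_base, new_base) tuples.
def decompose_read(read, lists):
    lists['F'].append(1 if read['strand'] == '-' else 0)        # strand bit
    lists['P'].append(read['pos'])                              # mapping position
    variations = read['variations']
    if not variations:                                          # perfect match
        lists['M'].append(1)
        return
    lists['M'].append(0)
    has_indel = any(v[1] in ('ins', 'del') for v in variations)
    n_snp = sum(1 for v in variations if v[1] == 'snp')
    lists['S'].append(0 if has_indel else n_snp)
    if has_indel:
        n_ins = sum(1 for v in variations if v[1] == 'ins')
        n_del = sum(1 for v in variations if v[1] == 'del')
        lists['I'].append((n_ins, n_del, n_snp))
    for offset, vtype, ref, new in variations:
        lists['V'].append(offset)                               # position within the read
        if vtype == 'ins':
            lists['C'].append(new)                              # inserted base
        elif vtype == 'snp':
            lists['C'].append((ref, new))                       # reference base and new base

lists = {k: [] for k in 'FPMSIVC'}
decompose_read({'strand': '+', 'pos': 10467,
                'variations': [(12, 'snp', 'A', 'G')]}, lists)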

Next we describe the model used for each of these lists, which is then fed to the arithmetic encoder. Recall that every list is compressed in a sequential manner; that is, at every step we compress the next symbol using the model computed from the previous symbols. The computation of the model for a symbol that has not appeared yet is the well-known zero-frequency problem. We use a different solution to this problem for each of the lists.

1) Modeling of F

We use a binary model with no context, as empirical calculations suggest that the directions of the strands are almost independent and identically distributed.

2) Modeling of P

The main challenge in compressing this list is that the alphabet of the data is unknown, very large, and almost uniformly distributed. To address this challenge, since the positions are ordered, we first transform the original positions into gaps between consecutive positions (delta encoding). This transformation considerably reduces the size of the alphabet and reshapes the distribution, strongly biasing it towards low numbers. This technique is also used in the previously proposed algorithms.

Regarding the unknown-alphabet problem, the only information available beforehand is that the size of the alphabet is upper bounded by the length of the largest chromosome, which for humans is on the order of 2.5 · 10^8. Therefore, a uniform initialization of the model to avoid the zero-probability problem is prohibitive. Several solutions have been proposed to address this problem, such as the use of Golomb codes, which model the data with a geometric distribution, or the use of byte-oriented arithmetic encoders, which use an independent model for each of the 4 bytes that form the integer. However, since the distribution is not truly geometric and a byte-oriented arithmetic encoder carries a relevant overhead, we propose a different approach.

We use a modified order-0 Prediction by Partial Match (PPM) [53] scheme to model the data. At the beginning, the alphabet of the model is composed solely of an escape symbol e. When a new symbol appears, the encoder looks into the model and, if it cannot find the symbol, it emits an escape symbol. The new symbol is then stored in a separate list, and the model is updated to contain it. In this way, the encoder works well both on low coverage data sets, where many new symbols are expected, and on high coverage data sets, where the alphabet is reduced dramatically.

We also use this list to indicate changes of chromosome and the end of the file.
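The following sketch illustrates the two ideas just described: delta encoding of the sorted mapping positions, and an adaptive order-0 model whose alphabet initially contains only the escape symbol. It is a hypothetical simplification; in the actual codec the returned counts would drive an arithmetic coder rather than being returned as tokens.

ESC = 'esc'

def delta_encode(positions):
    """Turn sorted mapping positions into gaps between consecutive positions."""
    prev = 0
    for p in positions:
        yield p - prev
        prev = p

class AdaptiveGapModel:
    """Order-0 adaptive model with an escape symbol for never-seen gap values."""
    def __init__(self):
        self.counts = {ESC: 1}          # alphabet starts with only the escape symbol

    def encode(self, gap):
        if gap in self.counts:
            self.counts[gap] += 1
            return (gap, None)          # coded directly with the current counts
        # unseen gap: emit the escape symbol, store the raw value in a separate list,
        # and extend the alphabet so future occurrences are coded directly
        self.counts[ESC] += 1
        self.counts[gap] = 1
        return (ESC, gap)

model = AdaptiveGapModel()
tokens = [model.encode(g) for g in delta_encode([100, 100, 135, 135, 170])]
# the gaps 0 and 35 are coded directly on their second occurrence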

3) Modeling of M

To create the model we use the mapping position of the previous read (stored in P) and its match value (stored in M) as context. The rationale behind this approach is that we expect each region of the reference to be covered by several reads (especially for high coverage data), and thus if the previous read mapped perfectly, it is likely that the next read will also map perfectly if they start at close-by positions. The same intuition holds for reads that do not map perfectly. Finally, the match bits are compressed using an arithmetic encoder over a binary alphabet.

4) Compression of S

To model this list, we use the previously processed number of SNPs seen per read to continuously update the model. However, due to the non-stationarity of the data, we update the statistics often enough that the local behavior of the data is reflected in the model.

5) Modeling of I

The modeling and compression of this list is done analogously to the previous list.

6) Modeling of V

Together with the mapping position list (P), the positions of the variations form the least compressible list, mainly due to the large number of elements it contains. In order to model it we use several contexts. First, we generate a global vector that indicates, for each position of the genome, whether a variation has previously occurred there. Since we know in which region of the reference the read maps, we can use the information stored in this vector to know which variations have previously occurred in that region, and use it as context to estimate the position of the next variation. We also use as contexts the position of the previous variation within the read (a 0 is used if there is no previous variation) and the direction of the strand of the read (stored in F). Finally, note that if the read is of length L and the previous variation occurred at position L − 10, the current variation must occur between L − 9 and L. Therefore, the model updates its probabilities accordingly, assigning a non-zero value only to these numbers.

The purpose of this model is to use the information from the previous variations to estimate where the current variations are going to occur. This works especially well for high coverage data, as one can expect several reads mapping to the same region of the reference and thus having similar, if not the same, variations.

7) Compression of C

Finally, for the compression of the base pairs, we distinguish between insertions and SNPs. Moreover, within the SNPs, we use a different model depending on the base pair of the reference that is to be modified. That is, given a SNP, we use the model associated with the base pair in the reference. This choice of model comes from the observation that the probability of having a SNP between non-conjugated base pairs is higher than between conjugated base pairs.
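A toy sketch of this modeling choice follows (hypothetical code; the real compressor feeds the selected counts to its arithmetic coder): one adaptive count table is kept per reference base, and the substituted base of a SNP is coded with the table selected by the reference base it replaces.

BASES = 'ACGT'

# one adaptive count table per reference base; models[r][b] counts SNPs r -> b
models = {r: {b: 1 for b in BASES if b != r} for r in BASES}

def code_snp(ref_base, new_base):
    table = models[ref_base]       # model selected by the base pair in the reference
    counts = dict(table)           # counts an arithmetic coder would use at this point
    table[new_base] += 1           # adaptive update after coding
    return counts

code_snp('A', 'G')   # each table adapts to the substitutions most frequent for its reference base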

The models described above for the different lists use previously seen information (not restricted to the same list) to estimate the next values. These models rely on the assumption that the reads contain redundant information, a fact particularly apparent in high coverage data sets. We show in the simulation results that the compression ratio improvements with respect to the previously proposed algorithms increase with the coverage of the data, as one may have expected. This is an advantage of the proposed algorithm in light of the continued drop in sequencing cost and the increased throughput of NGS technologies, which will boost the generation of high coverage data sets.

3.2.1 Data

In order to assess the performance of the proposed algorithm and compare it with the previously proposed ones, we consider the raw sequencing data shown in Table 3.1.

Low Coverage Data Sets

Name          Reference    Mapped Reads^a   Read Length   Coverage   Size [GB]
SRR062634_1   hg19         23.3 M           100           0.26×      6.9
SRR027520_1   hg19         20.4 M           76            0.17×      5.1
SRR027520_2   hg19         20.4 M           76            0.17×      5.1
SRR032209     mm GRCm38    13.5 M           36            0.17×      2.7
SRR043366_2   hg19         13.5 M           76            0.11×      3.4
SRR013951_2   hg19         9.5 M            76            0.08×      2.4
SRR005729_1   hg19         9.4 M            76            0.08×      2.5

High Coverage Data Sets

Name          Reference    Mapped Reads^a   Read Length   Coverage   Size [GB]
SRR065390_1   ce ws235     31.6 M           100           32×        9.5
SRR065390_2   ce ws235     31.2 M           100           32×        9.4
ERR262997_2   hg19         410.5 M          101           14×        122
ERR262996_1   hg19         272.8 M          101           10×        81

Table 3.1: Data sets used for the assessment of the proposed algorithm. The alignment program used to generate the SAM files is Bowtie2. The data sets are divided into two ensembles, low coverage data sets and high coverage data sets. The size corresponds to the SAM file.
^a M stands for millions.

To generate the aligned data (SAM files), we used the alignment program Bowtie2⁵ [54]. For each of the data sets we specify the reference used to perform the alignment (which belongs to the same species as the data set under consideration), the number of reads that mapped to the reference after the alignment, the read length, the coverage, and the size of the SAM file. Recall that the SAM files contain only the reads that mapped to the reference. The references hg19, mm GRCm38 and ce ws235 belong to the H. Sapiens, M. Musculus and C. Elegans species, respectively.

As shown in the table, we have divided the data sets into two ensembles, the low coverage data ensemble and the high coverage data ensemble.

⁵Note that any other alignment program could have been used for this purpose. We chose Bowtie2 because it was the one employed by [17] to perform the alignments.


The low coverage data ensemble is formed by human (H. Sapiens) and mouse (M. Musculus) sequencing data of coverage less than 1×. We chose these human data sets because they were used in the previously proposed algorithms to assess their performance. Although these data sets are easy to handle and fast to compress due to their small size (on the order of a few GB), we believe that they do not represent the majority of the sequencing data used by researchers and institutions. Furthermore, in order to analyze the performance of the different compression algorithms, it is important to consider data sets with different characteristics. With this in mind, in this chapter we also consider data sets of coverage up to 32×, which are presented next.

The high coverage data ensemble is composed of the sequencing data of two H. Sapiens and two C. Elegans data sets (the ERR and SRR accessions, respectively). The C. Elegans data sets were also used in previous publications. However, note that the total size of these files is still on the order of a few GB, as the C. Elegans genome is significantly smaller than that of H. Sapiens.

All the data sets have been retrieved from the European Nucleotide Archive⁶.

⁶http://www.ebi.ac.uk/ena/

3.2.2 Machine specifications

The machine used to perform the experiments has the following specifications: 39 GB of RAM, an Intel Core i7-930 CPU at 2.80 GHz × 8, and Ubuntu 12.04 LTS.

3.3 Results

Next we show the performance of the proposed compression method when applied to the data sets shown in Table 3.1, and compare it with the previously proposed algorithms. As mentioned above, we divide the results into two categories, namely, the low coverage data sets and the high coverage ones. In the following, we express the gain in compression ratio of an algorithm A with respect to another algorithm B as gain = 1 − size(A)/size(B). For example, a reduction from 100 MB to 80 MB represents a 20% gain (improvement). Note that with this metric a 0% gain means the file size remains the same, a 100% improvement is not possible, as it would mean the new file has size 0, and a negative value means that the new file is bigger.

3.3.1 Low coverage data sets

We start by considering the low coverage data sets introduced in Table 3.1 to assess the performance of the proposed method. Figure 3.1 shows the compression ratio, in bits per base pair, of the different algorithms proposed in the literature. For completeness, we also show the performance of Fastqz when it uses a reference [17]. However, note that the comparison is not strictly fair, as Fastqz performs its own fast alignment before compression, instead of using the alignment information provided in the SAM file. The data sets shown in the figure are ordered, from left to right, from the lowest coverage (SRR005729_1, with a coverage of 0.08×) to the highest (SRR062634_1, with a coverage of 0.26×).

Note that the results shown in the figure refer to the compression of the reads. For Fastqz, Quip and SamComp this value can be computed exactly, whereas for Goby and CRAM only an approximate value can be computed.

The results show that the compression ratios of Goby and Quip are similar, with each outperforming the other on at least one data set, but worse than that of CRAM. SamComp, on the other hand, outperforms the other algorithms in the cases considered. Moreover, as shown in the figure, the proposed method outperforms all the previously proposed algorithms on all the data sets. The gain with respect to SamComp varies from 0.3% (SRR013951_2) to 8.6%. These gains become larger when the performance is compared with that of the other algorithms. It is important to remark that these results show the compression ratio (shown in the figure as bits/bp) of the aligned reads only; they do not include the quality values or the identifiers.

Finally, we observe a somewhat expected relation between the compression ratio and the coverage of each data set (see Table 3.1): the higher the coverage, the more compressible the reads.

Regarding the running time, SamComp offers the best compression times, ranging from 45 to 160 seconds, followed by the proposed method, which takes between 124 and 163 seconds.

[Figure 3.1 plots the compression ratio (bits/bp, from 0 to 1.6) for each SRR data set, with one series per algorithm: Proposed Method, SamComp, CRAM, Goby, Quip and Fastqz.]

Figure 3.1: Performance of the proposed method and the previously proposed algorithms when compressing aligned reads of low coverage data sets.

Goby, CRAM and Quip need more time for compression; however, note that they compress the whole SAM file. The slowest is Fastqz, mainly because it performs its own alignment. Regarding the decompression time, all the algorithms except Fastqz take similar times, ranging from 70 to 400 seconds.

Note that since all the algorithms are sequential (that is, they compress and decompress read by read), their memory consumption is independent of the number of reads to compress and/or decompress. For example, for the largest human data set of the low coverage ensemble, SRR062634_1, the algorithm with the lowest memory consumption during compression is the proposed method with 0.3 GB, followed by SamComp with 0.5 GB. Fastqz and Quip use around 1.3 GB, Goby 3.2 GB, and CRAM 4 GB. On the other hand, for decompression, SamComp has the lowest memory consumption (300 MB), followed by the proposed method with 500 MB. Fastqz uses 700 MB, Quip 1.5 GB, and Goby 4 GB⁷. Note that this comparison is not fully fair, as the other algorithms compress and decompress the whole SAM file, while the proposed algorithm focuses on the reads.

3.3.2 High coverage data sets

For this ensemble we set Fastqz, Quip, Goby and CRAM aside, as they are outperformed by SamComp and the proposed method in terms of compression ratio, and focus exclusively on the latter two.

The performance of SamComp and the proposed method on the high coverage data sets introduced in Table 3.1 is summarized in Table 3.2. As for the low coverage data sets, the proposed method outperforms SamComp in all the cases considered. Moreover, we observe that the gain in compression ratio with respect to SamComp increases as the coverage of the data sets increases. For example, for the 10× coverage data set (ERR262996_1) we obtain a compression gain of 10%, whereas for the 32× coverage data set (SRR065390_2) this gain is boosted up to 17%. Note that this gain translates directly into savings in the MB needed to store the files.

Regarding the running time, we observe that both algorithms take a similar time for compression, around 2 minutes for the C. Elegans data sets and between 20 and 30 minutes for the H. Sapiens data sets. The decompression times are similar to the compression times.

⁷We do not specify the memory consumption of CRAM during decompression because we were unable to run it.

3.4 Discussion

Inspection of the empirical results of the previous section shows the superior performance of the proposed scheme across a wide range of data sets, from very low coverage to high coverage. Next we discuss these results in more detail.

3.4.1 Low Coverage Data Sets

As shown in Figure 3.1, SamComp and the proposed method clearly outperform the previously proposed algorithms in all cases. However, we should emphasize that while the aim of SamComp and the proposed method is to achieve the maximum possible compression, the aim of CRAM, Goby and Quip is to provide a more general and robust platform (or toolkit) for compression. Moreover, as mentioned before, the compression scheme of Goby and CRAM facilitates the manipulation of the compressed data by downstream applications.

In this context, we believe that the compression method performed by SamComp and Quip does not allow downstream applications to rapidly extract important information, as they perform a base-by-base compression. Thus, in order to find variations in the data, one must first reconstruct the whole read and then find the variations. In contrast, we propose a compression method that can considerably facilitate downstream applications working in the compressed domain.

Regarding the compression ratio, the authors of [17] mention that they believe a Pareto-optimal region was reached after the SeqSqueeze (http://www.sequencesqueeze.org) competition of 2012, in which SamComp was the winner in the SAM file compression category. As shown in Figure 3.1, although our algorithm performs strictly better than SamComp on all the data sets of this ensemble, the gain is very small. This fact could validate the Pareto-optimality statement made in [17]. However, as we discuss in the next section, for high coverage data sets the Pareto-optimal curve is not reached, as significant improvements in compression ratio are possible. We believe the reason is that, in low coverage data sets, little information can be inferred from previously seen reads when modeling the subsequent reads, as the overlap between reads is in general small or nonexistent. This, however, is far from the case in high coverage data sets.

3.4.2 High Coverage Data Sets

As outlined before, the proposed method demonstrates that significant improvements in compression ratio are possible for high coverage data sets. Specifically, we showed in Table 3.2 performance gains varying from 10% to 17% with respect to SamComp, which achieves the best compression ratio among the previously proposed algorithms.

These improvements in compression ratio are significant as, in the high coverage scenario, small gains in compression ratio can translate into large savings in storage space. For example, a 10% improvement (e.g., from 0.18 bits/bp to 0.16 bits/bp) over human data with very large coverage (around 200×) corresponds to a saving of approximately 12 GB per file, with a corresponding reduction in the time required to transfer the data. Thus, in the compression of several large data sets of high coverage, such improvements would lead to several petabytes of storage savings.

Furthermore, as mentioned before, the proposed scheme generates compressed files from which important information can be easily extracted, potentially allowing downstream applications to work over the compressed data. This is an important feature: as the size of the sequencing data grows, the burden of compressing and decompressing the files in order to manipulate and/or analyze specific parts of them becomes increasingly acute.

3.5 Conclusion

We are currently in the $1000 genome era and, as such, a significant increase in the sequencing data being generated is expected in the near future. These files are also expected to grow in size as the different Next Generation Sequencing (NGS) technologies improve. Therefore, there is a growing need for honing the capabilities of compressors for aligned data.


Further, compression that facilitates downstream applications working in the compressed domain is becoming of primary importance.

With this in mind, we developed a new compression method for aligned reads that outperforms, in compression ratio, the previously proposed algorithms. Specifically, we show that applying effective models to the aligned data can boost the compression, especially when high coverage data sets are considered. These gains in compression ratio would translate into large savings in storage, thus also facilitating the transmission of genomic data among researchers and institutions. Furthermore, we compress the data in a manner that allows downstream applications to work in the compressed domain, as relevant information can easily be extracted from specific locations along the compressed files.

Finally, we envisage the methods presented in this chapter being useful in the construction of future compression programs that consider the compression not only of the aligned reads, but also of the quality scores and the identifiers.


Data Set      Coverage   Raw Size^a [MB]   SamComp                          Proposed Method                  Gain
                                           Size [MB]  bits/bp  C.T. [sec]   Size [MB]  bits/bp  C.T. [sec]
ERR262996_1   10×        27500             608.6      0.18     1,210        548.7      0.16     1,092        10%
ERR262997_2   14×        41460             935        0.18     1,781        827.3      0.16     1,585        11.5%
SRR065390_1   32×        3160              47.7       0.12     148          40.5       0.10     104          15%
SRR065390_2   32×        3110              55.2       0.14     140          45.9       0.12     108          17%

Table 3.2: Compression results for the high coverage ensemble. The last column shows the compression gain obtained by the proposed method with respect to SamComp. We use the International System of Units for the prefixes, that is, 1 MB stands for 10^6 Bytes. C.T. stands for compression time.
^a Raw size refers solely to the size of the mapped reads (1 Byte per base pair).


Chapter 4

Lossy compression of quality scores

In this chapter we focus on compression of the quality scores present in the raw sequencing data (i.e., FASTQ and SAM files). As already mentioned in the introduction, quality scores have proven to be more difficult to compress than the reads. There is also evidence that quality scores are corrupted by some amount of noise introduced during sequencing [17]. These features are well explained by imperfections in the base-calling algorithms, which estimate the probability that the corresponding nucleotide in the read is in error [55]. Further, applications that operate on reads often make use of the quality scores in a heuristic manner. This is particularly true for sequence alignment algorithms [11, 12] and variant calling [56, 57]. Based on these observations, lossy (as opposed to lossless) compression of quality scores emerges as a natural candidate for significantly reducing storage requirements while maintaining adequate performance of downstream applications.

In this chapter we introduce two different methods for lossy compression of the quality scores. The first method, QualComp, transforms the quality scores into Gaussian distributed values and then uses results from rate-distortion theory to allocate the available bits. The second method, QVZ, assumes the quality scores are generated by a Markov model, and generates optimal quantizers based on the empirical distribution of the quality scores to be compressed. We describe both methods in detail and analyze their performance in terms of rate-distortion.¹

¹We refer the reader to Chapter 5 for an extensive analysis of the effect that lossy compression of the quality scores has on variant calling.


4.1 Survey of lossy compressors for quality scores

Lossy compression of quality scores has recently started to be explored. Slimgene [58] fits fixed-order Markov encodings for the differences between adjacent quality scores and compresses the prediction using a Huffman code (ignoring whether or not there are prediction errors). Q-Scores Archiver [59] quantizes quality scores via several steps of transformations, and then compresses the lossy data using an entropy encoder.

Fastqz [17] uses a fixed-length code that represents quality scores above 30 using a specific byte pattern and quantizes all lower quality scores to 2. Scalce [60] first calculates the frequencies of the different quality scores in a subset of the reads of a FASTQ file. Then the quality scores that achieve local maxima in frequency are determined. Any time these local-maximum values appear in the FASTQ file, the neighboring values are shifted to within a small offset of the local maximum, thereby reducing the variance of the quality scores. The result is compressed using an arithmetic encoder.

BEETL [61] first applies the Burrows-Wheeler Transform (BWT) to the reads and uses the same transformation on the quality scores. Then, the nucleotide suffixes generated by the BWT are scanned, and groups of suffixes that start with the same k bases while also sharing a prefix of at least k bases are found. All of the quality scores in a group are converted to a mean quality value, taken within the group or across all the groups. RQS/QUARTZ [23, 62] first generates, offline, a dictionary of commonly occurring k-mers from a population-sized read dataset of the species under consideration. It then computes the divergence of the k-mers within each read from the dictionary, and uses that information to decide whether to preserve or discard the corresponding quality scores.

PBlock [21] allows the user to determine a threshold for the maximum per-symbol distortion. The first quality score in the file is chosen as the first 'representative'. Quality scores are then quantized symbol by symbol to the representative if the resulting distortion falls within the threshold. If the threshold is exceeded, the new quality score takes the place of the representative and the process continues. The algorithm keeps track of the representatives and run-lengths, which are compressed losslessly at the end. RBlock [21] uses the same process, but the threshold instead sets the maximum allowable ratio of any quality score to its representative, as well as the maximum value of the reciprocal of this ratio. [21] also compared the performance of existing lossy compression schemes under different distortion measures.

Finally, Illumina proposed a new binning scheme that reduces the alphabet size of the quality scores by applying an 8-level mapping (see Table 4.1). This binning scheme has been implemented in the state-of-the-art compression tools CRAM [50] and DSRC2 [63].

Quality Score Bins    New Quality Score
N (no call)           N (no call)
2-9                   6
10-19                 15
20-24                 22
25-29                 27
30-34                 33
35-39                 37
≥ 40                  40

Table 4.1: Illumina's proposed 8-level mapping.
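As a small illustration of Table 4.1, the mapping can be written as follows (hypothetical helper; the 'N' / no-call symbol is passed through unchanged):

def illumina_bin(q):
    """Map a Phred quality score to its representative under Illumina's 8-level binning."""
    if q == 'N':                     # a no call is kept as a no call
        return 'N'
    for upper, representative in [(9, 6), (19, 15), (24, 22), (29, 27), (34, 33), (39, 37)]:
        if q <= upper:
            return representative
    return 40                        # scores of 40 and above

assert illumina_bin(23) == 22 and illumina_bin(41) == 40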

4.2 Proposed Methods

In this section we first formalize the problem of lossy compression of quality scores, and then describe the proposed methods, QualComp and QVZ.

We consider the compression of the quality score sequences present in the genomic data. Each sequence consists of ASCII characters representing the scores, belonging to an alphabet Q, for example Q = [33 : 73]. These quality score sequences are extracted from the genomic file (e.g., FASTQ and SAM files) prior to compression. We denote the total number of sequences by N, and assume all the sequences are of the same length n. The quality score sequences are then denoted by {Q_i}_{i=1}^N, where Q_i = [Q_{i,1}, ..., Q_{i,n}].

The goal is to design an encoder-decoder pair that describes the quality score vectors using only some amount of bits, while minimizing a given distortion D between the original vectors {Q_i}_{i=1}^N and the reconstructed vectors {Q̂_i}_{i=1}^N. More specifically, we consider that each Q_i is compressed using at most nR bits, where R denotes the rate (bits per quality score), and that the distortion D is computed as the average distortion over the vectors, i.e., D = (1/N) Σ_{i=1}^N D(i). Further, we consider a distortion function d : (Q, Q̂) → R^+ that operates symbol by symbol (as opposed to block by block), so that D(i) = d(Q_i, Q̂_i) = (1/n) Σ_{j=1}^n d(Q_{i,j}, Q̂_{i,j}). Thus we can model the encoder-decoder pair as a rate-distortion scheme of rate R, where the encoder is described by the mapping f_n : Q_i → {1, 2, ..., 2^{nR}}, which represents the compressed version of the vector Q_i of length n using nR bits, and the decoder is described by the mapping g_n : {1, 2, ..., 2^{nR}} → Q̂_i, where Q̂_i = g_n(f_n(Q_i)) denotes the reconstructed sequence.
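For concreteness, the average distortion defined above can be computed as in the following sketch, where the symbol-level distortion d is taken to be the squared error (names are hypothetical):

# Sketch: D = (1/N) * sum_i (1/n) * sum_j d(Q_ij, Qhat_ij), with d the squared error.
def average_distortion(Q, Q_hat):
    n = len(Q[0])
    per_vector = [sum((a - b) ** 2 for a, b in zip(q, q_hat)) / n
                  for q, q_hat in zip(Q, Q_hat)]
    return sum(per_vector) / len(Q)

average_distortion([[40, 38, 12], [35, 35, 2]],
                   [[40, 37, 10], [33, 35, 2]])   # -> ((1 + 4)/3 + 4/3) / 2 = 1.5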

4.2.1 QualComp

The proposed lossy compressor QualComp aims to minimize the Mean Squared Error (MSE) between the original quality scores and the reconstructed ones, for a given rate R specified by the user. That is, D(i) = d(Q_i, Q̂_i) = (1/n) Σ_{j=1}^n d(Q_{i,j}, Q̂_{i,j}) = (1/n) Σ_{j=1}^n (Q_{i,j} − Q̂_{i,j})^2. The design of QualComp is guided by some results from rate-distortion theory. For a detailed description of rate-distortion theory and proofs, please refer to [64]. We are interested in the following result:

Theorem 1: For an i.i.d. Gaussian vector source X ∼ N(μ_X, Σ_X), with Σ_X = diag[σ_1^2, ..., σ_n^2] (i.e., independent components), the optimal allocation of nR bits that minimizes the MSE is given as the solution to the following optimization problem:

\min_{\rho = [\rho_1, \cdots, \rho_n]} \; \frac{1}{n} \sum_{j=1}^{n} \sigma_j^2\, 2^{-2\rho_j}    (4.1)

\text{s.t.} \; \sum_{j=1}^{n} \rho_j \le nR,    (4.2)

where ρ_j denotes the number of bits allocated to the jth component of X.

Next we describe how we use this result in the design of QualComp. In real data, quality scores take integer values in a finite alphabet Q, but for the purpose of modeling we assume Q = R (the set of real numbers). Although the quality scores of different reads may be correlated, we model correlations only within a read, and consider quality scores across different reads to be independent. Thus we assume that each quality score vector Q_i is independent and identically distributed (i.i.d.) as P_Q.


To the best of our knowledge, there are no known statistics of the quality score vectors. However, given a vector source with a particular covariance matrix, the multivariate Gaussian is the least compressible. Furthermore, compression/coding schemes designed on the basis of the Gaussian assumption, i.e., the worst distribution for compression, will also be good for non-Gaussian sources, as long as the mean and the covariance matrix remain unchanged [65]. Guided by this observation, we model the quality scores as being jointly Gaussian with the same mean and covariance matrix, i.e., P_Q ∼ N(μ_Q, Σ_Q), where μ_Q and Σ_Q are computed empirically from the set of vectors {Q_i}_{i=1}^N. Due to the correlation of quality scores within a read, Σ_Q is not in general a diagonal matrix. Thus, to apply the theorem we need to decorrelate the quality score vectors.

In order to decorrelate the quality score vectors, we first perform the singular value decomposition (SVD) of the matrix Σ_Q. This allows us to express Σ_Q as Σ_Q = V S V^T, where V is a unitary matrix satisfying V V^T = I and S is a diagonal matrix whose diagonal entries s_jj, for j ∈ [1 : n], are known as the singular values of Σ_Q. We then generate a new set of vectors {Q'_i}_{i=1}^N by performing the operation Q'_i = V^T (Q_i − μ_Q) for all i. This transformation, due to the Gaussian nature of the quality score vectors, makes the components of each Q'_i independent and distributed as N(0, s_jj), for j ∈ [1 : n], since Q'_i ∼ N(0, S). This property allows us to use the result of Theorem 1. The number of bits allotted per quality score vector, nR, is a user-specified parameter. Thus we can formulate the bit allocation problem for minimizing the MSE as a convex optimization problem and solve it exactly. That is, given a budget of nR bits per vector, we allocate the bits by first transforming each Q_i into Q'_i, for i ∈ [1 : N], and then allocating bits to the independent components of Q'_i in order to minimize the MSE, by solving the following optimization problem:

\min_{\rho = [\rho_1, \cdots, \rho_n]} \; \frac{1}{n} \sum_{j=1}^{n} \sigma_j^2\, 2^{-2\rho_j}    (4.3)

\text{s.t.} \; \sum_{j=1}^{n} \rho_j \le nR,    (4.4)

where ρ_j represents the number of bits allocated to the jth position of Q'_i, for i ∈ [1 : N]; i.e., the allocation of bits is the same for all the quality score vectors, and thus the optimization problem has to be solved only once. Ideally, this allocation would be done by vector quantization, i.e., by applying a vector quantizer with Nρ_j bits to {Q'_{i,j}}_{i=1}^N, for j ∈ [1 : n]. However, due to ease of implementation and negligible performance loss, we use a scalar quantizer. Thus, for all i ∈ [1 : N], each component Q'_{i,j}, for j ∈ [1 : n], is normalized to a unit-variance Gaussian and then mapped to decision regions representable in ρ_j bits. For this we need ρ_j to be an integer. However, this will not be the case in general, so we randomly map each ρ_j to ρ'_j, given by either the closest integer from below or from above, so that ρ'_j equals ρ_j on average. To ensure the decoder obtains the same value of ρ'_j, the same pseudorandom generator is used in both functions. The decision regions that minimize the MSE for different values of ρ, and their representative values, are found offline by a Lloyd-Max procedure [66] on a scalar Gaussian distribution with mean zero and variance one. For example, for ρ = 1 we have 2^1 decision regions, which correspond to values below zero (decision region 0) and above zero (decision region 1), with corresponding representative values −0.7978 and +0.7978. Therefore, if we were to encode the value −0.344 with one bit, we would encode it as a '0', and the decoder would decode it as −0.7978. The decoder, to reconstruct the new quality scores {Q̂_i}_{i=1}^N, performs the operations complementary to those done by the encoder: it constructs round(V Q̂' + μ_Q) and replaces the quality scores corresponding to an unknown base pair (given by the character 'N') by the least reliable quality score. The steps followed by the encoder are the following:

Encoding the quality scores of a FASTQ file using nR bits per sequence

Precompute:
1. Extract the quality score vectors {Q_i}_{i=1}^N of length n from the FASTQ file.
2. Compute μ_Q and Σ_Q empirically from {Q_i}_{i=1}^N.
3. Compute the SVD: Σ_Q = V S V^T.
4. Given S and the parameter nR, solve for the optimal ρ = [ρ_1, ..., ρ_n] that minimizes the MSE.

For i = 1 to N:
1. Q'_i = V^T (Q_i − μ_Q).
2. For j = 1 to n:
   2.1. Q''_{i,j} = Q'_{i,j} / √(s_jj).
   2.2. Randomly generate the integer ρ'_j from ρ_j.
   2.3. Map Q''_{i,j} into its corresponding decision region.
   2.4. Encode the decision region using ρ'_j bits and write it to the file.
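As a rough illustration of the bit-allocation step above (the solution to (4.3)-(4.4) is a reverse water-filling over the variances), the following sketch computes a real-valued allocation ρ from the singular values. It is a hypothetical, simplified implementation: it omits the randomized rounding of ρ_j to integers and assumes strictly positive singular values.

# Sketch of the reverse water-filling allocation for
#   min (1/n) * sum_j s_j * 2^(-2*rho_j)   s.t.  sum_j rho_j <= n*R,  rho_j >= 0,
# where s_j are the (positive) singular values of Sigma_Q.
import math

def allocate_bits(s, R, iters=60):
    budget = len(s) * R

    def rates(lam):
        # active components get 0.5*log2(s_j / lam) bits, inactive ones get 0
        return [max(0.0, 0.5 * math.log2(sj / lam)) for sj in s]

    lo, hi = 1e-12, max(s)            # the water level lambda lies in (0, max s_j]
    for _ in range(iters):            # bisection on lambda
        lam = 0.5 * (lo + hi)
        if sum(rates(lam)) > budget:
            lo = lam                  # spending too many bits: raise the water level
        else:
            hi = lam
    return rates(hi)

rho = allocate_bits([9.0, 4.0, 1.0, 0.25], R=1.0)   # larger variances receive more bits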

Notice that the final size is given by nRN bits plus an overhead needed to specify the mean and covariance of the quality scores, the length n and the number of sequences N. This can be further reduced by performing lossless compression with a standard universal entropy code.

Clustering

Since the algorithm is based on the statistics of the quality scores, better statistics would give lower distortion. With that in mind, and to capture possible correlations between the reads, we allow the user to first cluster the quality score vectors and then perform the lossy compression within each cluster separately.

The clustering is based on the k-means algorithm [67] and is performed as follows. For each of the clusters, we initialize a mean vector V of length n with the same value at each position. The values are chosen to be equally spaced between the minimum quality score and the maximum. For example, if the quality scores go from 33 to 73 and there are 3 clusters, the mean vectors are initialized as all 33's, all 53's, and all 73's. Then, each quality score vector is assigned to the cluster that minimizes the MSE with respect to its mean vector V, i.e., to the cluster that minimizes (1/n) Σ_{i=1}^n (Q(i) − V(i))^2. After assigning each quality score vector to a cluster, the mean vectors are updated by computing the empirical mean of the quality score vectors assigned to the cluster. This process is repeated until none of the quality score vectors is assigned to a different cluster, or until a maximum number of iterations is reached.

Finally, notice that R = 0 is not the same as discarding the quality scores, since the decoder will not assign the same value to all the reconstructed quality scores. Instead, the reconstructed quality score vectors within a cluster will be the same, and equal to the empirical mean of the original quality score vectors within the cluster, but each quality score within the vector will in general be different.

4.2.2 QVZ

For the design of QVZ, we model the quality score sequence Q = [Q_1, Q_2, ..., Q_n] by a Markov chain of order one: we assume the probability that Q_j takes a particular value depends on previous values only through the value of Q_{j−1}. We further assume that the quality score sequences are independent and identically distributed (i.i.d.). We use a Markov model based on the observation that quality scores are highly correlated with their neighbors within a single sequence, and we refrain from using a higher-order Markov model to avoid the increased overhead and complexity this would produce within our algorithm.

The Markov model is defined by its transition probabilities P(Q_j | Q_{j−1}), for j ∈ {1, 2, ..., n}, where P(Q_1 | Q_0) = P(Q_1). QVZ finds these probabilities empirically from the entire data set to be compressed and uses them to design a codebook. The codebook is a set of quantizers indexed by position and previously quantized value (the context). These quantizers are constructed using a variant of the Lloyd-Max algorithm [66], and are capable of minimizing any quasi-convex distortion chosen by the user (i.e., not necessarily the MSE). After quantization, a lossless, adaptive arithmetic encoder is applied to achieve entropy-rate compression.

In summary, the steps taken by QVZ are:

1. Compute the empirical transition probabilities of a Markov-1 Model from the data.

2. Construct a codebook (section 4.2.2) using the Lloyd-Max algorithm (section 4.2.2).

3. Quantize the input using the codebook and run the arithmetic encoder over the result (section 4.2.2).

Lloyd-Max quantizer

Given a random variable X governed by the probability mass function P(·) over the alphabet X of size K, let D ∈ R^{K×K} be a distortion matrix where each entry D_{x,y} = d(x, y) is the penalty for reconstructing symbol x as y. We further define Y to be the alphabet of the quantized values, of size M ≤ K.

Page 79: GENOMIC DATA COMPRESSION AND PROCESSING: THEORY, MODELS, ALGORITHMS…idoia.ece.illinois.edu/papers/genomic-data-compression_v... · 2018. 3. 8. · genomic data compression and processing:

CHAPTER 4. LOSSY COMPRESSION OF QUALITY SCORES 62

Thus, a Lloyd-Max quantizer, denoted hereafter as LM(·), is a mapping X → Y that minimizes an expected distortion. Specifically, the Lloyd-Max quantizer seeks to find a collection of boundary points b_k ∈ X and reconstruction points y_k ∈ Y, where k ∈ {1, 2, ..., M}, such that the quantized value of a symbol x ∈ X is given by the reconstruction point of the region to which it belongs (see Fig. 4.1). For region k, any x ∈ {b_{k−1}, ..., b_k − 1} is mapped to y_k, with b_0 being the lowest score in the quality alphabet and b_M the highest score plus one. Thus the Lloyd-Max quantizer aims to minimize the expected distortion by solving

\{b_k, y_k\}_{k=1}^{M} = \arg\min_{b_k, y_k} \sum_{j=1}^{M} \sum_{x=b_{j-1}}^{b_j - 1} P(x)\, d(x, y_j).    (4.5)

[Figure 4.1 sketches a pmf P(X) over X, with boundary points b_0, b_1, b_2, b_3 and reconstruction points y_1, y_2, y_3.]

Figure 4.1: Example of the boundary points and reconstruction points found by a Lloyd-Max quantizer, for M = 3.

In order to approximately solve Eq. (4.5), which is an integer programming problem, we employ an algorithm that is initialized with uniformly spaced boundary values and reconstruction points taken at the midpoints of these bins. For an arbitrary D and P(·), this problem requires an exhaustive search. We assume that the distortion measure d(x, y) is quasi-convex over y with a minimum at y = x, i.e., when x ≤ y_1 ≤ y_2 or y_2 ≤ y_1 ≤ x, d(x, y_1) ≤ d(x, y_2). If the distortion measure is quasi-convex, an exchange argument suffices to show the optimality of contiguous quantization bins with a reconstruction point inside each bin. The following steps are iterated until convergence:


1. Solving for y_k: We first minimize Eq. (4.5) partially over the reconstruction points, given the boundary values. The reconstruction points are obtained as

y_k = \arg\min_{y \in \{b_{k-1}, \ldots, b_k - 1\}} \sum_{x=b_{k-1}}^{b_k - 1} P(x)\, d(x, y), \quad \forall\, k = 1, 2, \ldots, M.    (4.6)

2. Solving for b_k: This step minimizes Eq. (4.5) partially over the boundary values, given the reconstruction points. b_k can range over {y_k + 1, ..., y_{k+1}} and is chosen as the largest point for which the distortion with respect to the previous reconstruction value y_k is smaller than the distortion with respect to the next reconstruction value y_{k+1}, i.e.,

b_k = \max\{ x \in \{y_k + 1, \ldots, y_{k+1}\} : P(x)\, d(x, y_k) \le P(x)\, d(x, y_{k+1}) \}, \quad \forall\, k = 1, 2, \ldots, M - 1.    (4.7)

Note that this algorithm, which is a variant of the Lloyd-Max quantizer, converges in at most K steps.
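The iteration just described can be sketched in a few lines of code. The version below is a hypothetical simplification (integer alphabet {0, ..., K−1}, squared error as the default quasi-convex distortion) rather than the actual QVZ implementation.

# Sketch of the Lloyd-Max variant above: alternate between (1) the best
# reconstruction point inside each region and (2) the best boundary between
# consecutive reconstruction points. P is a pmf over the alphabet {0,...,K-1};
# region k covers symbols b[k] .. b[k+1]-1 and is reconstructed as y[k].
def lloyd_max(P, M, d=lambda x, y: (x - y) ** 2, iters=50):
    K = len(P)
    b = [round(k * K / M) for k in range(M + 1)]             # uniform initial boundaries
    y = [(b[k] + b[k + 1] - 1) // 2 for k in range(M)]       # midpoints as initial points
    for _ in range(iters):
        # step 1: best reconstruction point within each region, boundaries fixed
        y = [min(range(b[k], b[k + 1]),
                 key=lambda v, k=k: sum(P[x] * d(x, v) for x in range(b[k], b[k + 1])))
             for k in range(M)]
        # step 2: best boundaries, reconstruction points fixed: symbols between y[k]
        # and y[k+1] go to whichever point gives them the smaller distortion
        new_b = list(b)
        for k in range(M - 1):
            cut = y[k] + 1
            while cut <= y[k + 1] and d(cut, y[k]) <= d(cut, y[k + 1]):
                cut += 1
            new_b[k + 1] = cut
        if new_b == b:
            break
        b = new_b
    return b, y

b, y = lloyd_max([0.05] * 8 + [0.15] * 4, M=3)   # toy 12-symbol pmf, 3 regions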

Given a distortion matrix D, the Lloyd-Max quantizer just defined depends on the number of regions M and the input probability mass function P(·). Therefore we denote the Lloyd-Max quantizer with M regions as LM^P_M(·), and the quantized value of a symbol x ∈ X as LM^P_M(x).

An ideal lossless compressor applied to the quantized values can achieve a rate equal to the entropy of LM^P_M(X), which we denote by H(LM^P_M(X)). For a fixed probability mass function P(·), the only varying parameter is the number of regions M. Since M needs to be an integer, not all rates are achievable. Because we are interested in achieving an arbitrary rate R, we define an extended version of the LM quantizer, denoted LME. The extended quantizer consists of two LM quantizers with numbers of regions ρ and ρ + 1, used with probability 1 − r and r, respectively (where 0 ≤ r ≤ 1). Specifically, ρ is given by the maximum number of regions such that H(LM^P_ρ(X)) < R (which implies H(LM^P_{ρ+1}(X)) > R). Then, the probability r is chosen such that the average entropy (and hence the rate) equals the desired rate R. More formally,


\mathrm{LME}^P_R(x) = \begin{cases} \mathrm{LM}^P_{\rho}(x), & \text{w.p. } 1 - r, \\ \mathrm{LM}^P_{\rho+1}(x), & \text{w.p. } r, \end{cases}

\rho = \max\{x \in \{1, \ldots, K\} : H(\mathrm{LM}^P_x(X)) \le R\},

r = \frac{R - H(\mathrm{LM}^P_{\rho}(X))}{H(\mathrm{LM}^P_{\rho+1}(X)) - H(\mathrm{LM}^P_{\rho}(X))}.    (4.8)
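Continuing the previous sketch, the parameters ρ and r of the extended quantizer can be obtained as follows (hypothetical code, reusing the lloyd_max function sketched earlier and assuming, as is typical, that the output entropy grows with the number of regions):

import math

def quantized_entropy(P, b):
    """Entropy in bits of the quantizer output defined by boundaries b over pmf P."""
    region_probs = [sum(P[b[k]:b[k + 1]]) for k in range(len(b) - 1)]
    return -sum(p * math.log2(p) for p in region_probs if p > 0)

def lme_parameters(P, R):
    """Largest rho with H(LM_rho(X)) <= R, and the mixing probability r of Eq. (4.8)."""
    rho, H_rho = 1, 0.0                       # a single region always has zero entropy
    for M in range(2, len(P) + 1):
        H_M = quantized_entropy(P, lloyd_max(P, M)[0])
        if H_M > R:
            return rho, (R - H_rho) / (H_M - H_rho)
        rho, H_rho = M, H_M
    return rho, 0.0                           # target rate is above the lossless entropy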

Codebook generation

Because we assume the data follow a Markov-1 model, for a given position j ∈ {1, ..., n} we design as many quantizers Q̂^j_q as there are unique possible quantized values q in the previous context j − 1. This collection of quantizers forms the codebook for QVZ. For an unquantized quality score Q_j we denote the quantized version as Q̂_j, so Q̂ = [Q̂_1, Q̂_2, ..., Q̂_n] is the random vector representing a quantized sequence. The quantizers are defined as

\hat{\mathcal{Q}}^1 = \mathrm{LME}^{P(Q_1)}_{\alpha H(Q_1)}    (4.9)

\hat{\mathcal{Q}}^j_q = \mathrm{LME}^{P(Q_j \mid \hat{Q}_{j-1} = q)}_{\alpha H(Q_j \mid \hat{Q}_{j-1} = q)}, \quad \text{for } j = 2, \ldots, n,    (4.10)

where α ∈ [0, 1] is the desired compression factor: α = 0 corresponds to zero-rate encoding, α = 1 to lossless compression, and any value in between scales the input file size by that amount. Note that the entropies can be computed directly from the corresponding empirical probabilities.

Next we show how the probabilities needed for the LMEs are computed. In order to compute the quantizers defined above, we require P(Q_{j+1} | Q̂_j), which must be computed from the empirical statistics P(Q_{j+1} | Q_j) found earlier. The first step is to calculate P(Q̂_j | Q_j) recursively, and then to apply Bayes' rule and the Markov chain property to find the desired probability:


P(\hat{Q}_j \mid Q_j) = \sum_{\hat{Q}_{j-1}} P(\hat{Q}_j, \hat{Q}_{j-1} \mid Q_j)
                      = \sum_{\hat{Q}_{j-1}} P(\hat{Q}_j \mid Q_j, \hat{Q}_{j-1}) \sum_{Q_{j-1}} P(\hat{Q}_{j-1}, Q_{j-1} \mid Q_j)
                      = \sum_{\hat{Q}_{j-1}} P(\hat{Q}_j \mid Q_j, \hat{Q}_{j-1}) \sum_{Q_{j-1}} P(\hat{Q}_{j-1} \mid Q_{j-1}, Q_j)\, P(Q_{j-1} \mid Q_j)
                      = \sum_{\hat{Q}_{j-1}} P(\hat{Q}_j \mid Q_j, \hat{Q}_{j-1}) \sum_{Q_{j-1}} P(\hat{Q}_{j-1} \mid Q_{j-1})\, P(Q_{j-1} \mid Q_j)    (4.11)

Eq. (4.11) follows from the fact that Q̂_{j−1} ↔ Q_{j−1} ↔ Q_j form a Markov chain. Additionally, P(Q̂_j | Q_j, Q̂_{j−1} = q) = P(Q̂^j_q(Q_j) = Q̂_j), which is the probability that a specific quantizer produces Q̂_j given the previous context q. This can be found directly from r (defined in Eq. (4.8)) and the possible values for q. We now proceed to compute the required conditional probability as

P(Q_{j+1} \mid \hat{Q}_j) = \sum_{Q_j} P(Q_j \mid \hat{Q}_j)\, P(Q_{j+1} \mid Q_j, \hat{Q}_j)
                          = \sum_{Q_j} P(Q_j \mid \hat{Q}_j)\, P(Q_{j+1} \mid Q_j)    (4.12)
                          = \frac{1}{P(\hat{Q}_j)} \sum_{Q_j} P(\hat{Q}_j \mid Q_j)\, P(Q_j, Q_{j+1}),    (4.13)

where Eq. (4.12) follows from the same Markov chain as before. The terms in Eq. (4.13) are: i) P(Q_j, Q_{j+1}), the joint pmf, computed empirically from the data; ii) P(Q̂_j | Q_j), computed in Eq. (4.11); and iii) P(Q̂_j), a normalizing constant given by

P(\hat{Q}_j = q) = \sum_{Q_j} P(\hat{Q}_j = q \mid Q_j)\, P(Q_j).

The steps necessary to compute the codebook are summarized in Algorithm 1. Note that support(Q) denotes the support of the random variable Q, i.e., the set of values that Q takes with non-zero probability.


Algorithm 1 Generate codebook
Input: Transition probabilities P(Q_j | Q_{j−1}), compression factor α
Output: Codebook: collection of quantizers {Q̂^j_q}
  P ← P(Q_1)
  Compute and store Q̂^1 based on P using Eq. (4.9)
  for all columns j = 2 to n do
    Compute P(Q̂_{j−1} | Q_{j−1} = q)  ∀ q ∈ support(Q_{j−1})
    Compute P(Q_j | Q̂_{j−1})  ∀ q ∈ support(Q̂_{j−1})
    for all q ∈ support(Q̂_{j−1}) do
      P ← P(Q_j | Q̂_{j−1} = q)
      Compute and store Q̂^j_q based on P using Eq. (4.10)
    end for
  end for

Encoding

The encoding process is summarized in Algorithm 2. First, we generate the codebook and its quantizers. For each read, we quantize all scores sequentially, with each value forming the left context for the next value. As they are quantized, the scores are passed to an adaptive arithmetic encoder, which uses a separate model for each position and context.

Algorithm 2 Encoding of quality scores
Input: Set of N reads {Q_i}_{i=1}^N
Output: Set of quantizers {Q̂^j_q} (codebook) and compressed representation of the reads
  Compute empirical statistics of the input reads
  Compute the codebook {Q̂^j_q} according to Algorithm 1
  for all i = 1 to N do
    [Q_1, ..., Q_n] ← Q_i
    Q̂_1 ← Q̂^1(Q_1)
    for all j = 2 to n do
      Q̂_j ← Q̂^j_{Q̂_{j−1}}(Q_j)
    end for
    Pass [Q̂_1, ..., Q̂_n] to the arithmetic encoder
  end for
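As a minimal illustration of the quantization pass in Algorithm 2 (the arithmetic coding stage is omitted and the data structures are hypothetical), the codebook can be held as a dictionary keyed by position and previous quantized value:

# Sketch of the quantization loop of Algorithm 2 (arithmetic coding omitted).
# codebook[(j, q)] is the quantizer for position j given previous quantized value q;
# codebook[(0, None)] handles the first position of each read.
def quantize_read(read, codebook):
    out, prev = [], None
    for j, score in enumerate(read):
        q = codebook[(j, prev)](score)   # quantizer selected by position and context
        out.append(q)
        prev = q
    return out

# toy codebook: round every score to the nearest multiple of 5, ignoring the context
toy = {(j, q): (lambda s: 5 * round(s / 5))
       for j in range(4) for q in [None] + list(range(0, 45, 5))}
quantize_read([33, 36, 38, 40], toy)     # -> [35, 35, 40, 40]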


Clustering

The performance of the compression algorithm depends on the conditional entropy of each quality score given its predecessor. Earlier we assumed that the data are all i.i.d., but it is more effective to allow each read to be independently selected from one of several distributions. If we first cluster the reads into C clusters, then the variability within each cluster may be smaller. In turn, the conditional entropy would decrease and fewer bits would be required to encode Q_j at a given distortion level, assuming that an individual codebook is available for each cluster.

Thus QVZ has the option of clustering the data prior to compression. We have exploredtwo approaches for performing the clustering.

1. K-means:The first approach uses the K-means algorithm [67], initialized using C quality valuesequences chosen at random from the data. It assigns each sequence to a clusterby means of Euclidean distance. Then, the centroid of each cluster is computed asthe mean vector of the sequences assigned to it. Due to the lack of convergenceguarantees, we have incorporated a stop criterion that avoids further iterations oncethe centroids of the clusters have moved less than U units (in Euclidean distance).The parameter U is set to 4 by default, but it can be modified by the user. Finally,storing which cluster each read belongs to incurs a rate penalty of at most log2(C)/L

bits per symbol, which allows QVZ to reconstruct the series of reads in the sameorder as they were in the uncompressed input file.

2. Mixture of Markov Models:

Given our assumption that the quality score sequences are generated by an order-1 Markov source, we can express the probability of a given sequence $Q_i$ as

$$P(Q_i) = \prod_{j=1}^{n} P(Q_{i,j} \mid Q_{i,j-1}, \ldots, Q_{i,1}) = P(Q_{i,1}) \prod_{j=2}^{n} P(Q_{i,j} \mid Q_{i,j-1}), \qquad (4.14)$$

where the last equality comes from the Markov assumption.
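As an illustration of Eq. (4.14), a minimal sketch (parameter names hypothetical) that evaluates the log-probability of a single quality-score sequence under an order-1 Markov model is:

```python
def markov_log_prob(seq, log_pi, log_A):
    """Sketch of Eq. (4.14): log-probability of one quality-score sequence
    under an order-1 Markov model.

    seq    : sequence Q_{i,1}, ..., Q_{i,n} encoded as integers
    log_pi : log prior state probabilities, log_pi[s] = log P(Q_{i,1} = s)
    log_A  : log transition matrix indexed as in the text,
             log_A[m, s] = log P(Q_{i,j} = m | Q_{i,j-1} = s)
    """
    logp = log_pi[seq[0]]
    for prev, cur in zip(seq[:-1], seq[1:]):
        logp += log_A[cur, prev]
    return logp
```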


A discrete Markov source can be fully determined by its transition matrix $A$, where $A_{ms} = P(Q_{i,j} = m \mid Q_{i,j-1} = s)$ is the probability of going from state $s$ to state $m$, $\forall j$, and the prior state probability $\pi_s = P(Q_{i,1} = s)$, which is the probability of starting at state $s$. We further denote the model parameters as $\theta = \{A, \pi\}$. With this notation we can rewrite (4.14) as

$$P(Q_i; \theta) = \prod_{s=1}^{|Q|} (\pi_s)^{\mathbb{1}[Q_{i,1}=s]} \prod_{j=2}^{n} \prod_{m=1}^{|Q|} \prod_{s=1}^{|Q|} (A_{ms})^{\mathbb{1}[Q_{i,j}=m,\, Q_{i,j-1}=s]}. \qquad (4.15)$$

Note that with the above definitions we have assumed that the stochastic process that generates the quality score sequences is time invariant, i.e., that the value of $A_{ms}$ is independent of the time (position) $j$. However, strong correlations exist between adjacent quality scores, as well as a trend that the quality scores degrade as a read progresses.

Figure 4.2: Our temporal Markov model.

In order to take into consideration the temporal behavior of the quality score sequences, we increase the number of states from $|Q|$ to $|Q| \times n$, one for each possible value of $Q$ and $j$. To represent the temporal dimension, we redefine the transition matrix as a three-dimensional matrix, where the first dimension represents the current value of the quality score, the second one the previous value of the quality score, and the third one the time $j$ within the sequence. That is, $A_{msj} = P(Q_{i,j} = m \mid Q_{i,j-1} = s)$ is the probability of transitioning from state $s$ to state $m$ at time $j$.
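As a small illustration (dimensions and values below are hypothetical), the time-varying transition probabilities can be stored as a three-dimensional array indexed as in $A_{msj}$:

```python
import numpy as np

# A[m, s, j] = P(Q_{i,j} = m | Q_{i,j-1} = s), for j = 2, ..., n
alphabet_size = 41          # e.g., number of distinct quality-score values |Q|
n = 100                     # read length (positions are the third index)
A = np.full((alphabet_size, alphabet_size, n + 1), 1.0 / alphabet_size)

# every column over m must be a valid pmf for each (s, j):
assert np.allclose(A.sum(axis=0), 1.0)

# probability of moving from score s at position j-1 to score m at position j:
m, s, j = 30, 32, 57
p = A[m, s, j]
```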

For the clustering step, we further assume that the quality score sequences have been generated independently by one of $K$ underlying Markov models, such that the whole set of quality score sequences is generated by a mixture of Markov models. With some abuse of notation we now define $\theta = \{\pi^{(k)}, A^{(k)}\}_{k=1}^{K}$ to be the parameters of the $K$ Markov models, and $\theta_k = \{\pi^{(k)}, A^{(k)}\}$ to be the parameters of the $k$th Markov model. We further define $Z_i$ to be the latent random variable that specifies the identity of the mixture component for the $i$th sequence. Thus, the set of quality score sequences that has been generated by the sequencing machine is distributed as

$$\mathbf{Q} \sim P(\mathbf{Q}; \theta) = \prod_{i=1}^{N} P(Q_i; \theta) = \prod_{i=1}^{N} \sum_{k=1}^{K} P(Q_i \mid Z_i = k; \theta)\, \mu_k, \qquad (4.16)$$

where $\mu_k \triangleq P(Z_i = k)$, and $P(Q_i \mid Z_i = k; \theta)$ is the probability that the sequence $Q_i$ has been generated by the $k$th Markov model. Substituting (4.15) in (4.16) we get that the likelihood of the data is given by

$$P(\mathbf{Q}; \theta) = \prod_{i=1}^{N} \sum_{k=1}^{K} \mu_k \left( \prod_{s=1}^{|Q|} (\pi^{(k)}_s)^{\mathbb{1}[Q_{i,1}=s]} \prod_{j=2}^{n} \prod_{m=1}^{|Q|} \prod_{s=1}^{|Q|} (A^{(k)}_{msj})^{\mathbb{1}[Q_{i,j}=m,\, Q_{i,j-1}=s]} \right). \qquad (4.17)$$

The goal of the clustering step is to assign each sequence to the most probable model that has generated it. However, since the parameters of the models are unknown, the clustering step first computes the maximum likelihood estimate of the parameters $\{A^{(k)}, \pi^{(k)}, \mu_k\}$ of each of the Markov models. Since the log likelihood $\ell(\theta) \triangleq \log P(\mathbf{Q}; \theta)$ is intractable due to the summation appearing in (4.17), this operation is done by using the well-known Expectation-Maximization (EM) algorithm [68]. The EM algorithm iteratively maximizes the function

$$g(\theta, \theta^{(l-1)}) \triangleq \mathbb{E}_{Z \mid \mathbf{Q}, \theta^{(l-1)}}\left[ \sum_i \log P(Q_i, Z_i; \theta) \right],$$

which is the expectation of the complete log likelihood with respect to the conditional distribution of $Z$ given $\mathbf{Q}$ and the current estimated parameters. It can be shown [68] that for any mixture model this function is given by

$$g(\theta, \theta^{(l-1)}) = \sum_i \sum_k r_{ik} \log \mu_k + \sum_i \sum_k r_{ik} \log P(Q_i; \theta_k), \qquad (4.18)$$


where $r_{ik} \triangleq P(Z_i = k \mid Q_i, \theta^{(l-1)})$ is the responsibility that cluster $k$ takes for the quality sequence $i$. In particular, for the case of a mixture of Markov models, the previous equation is given by

\begin{align*}
g(\theta, \theta^{(l-1)}) ={}& \sum_{i=1}^{N} \sum_{k=1}^{K} r_{ik} \log \mu_k + \sum_{i=1}^{N} \sum_{k=1}^{K} r_{ik} \sum_{s=1}^{|Q|} \mathbb{1}[Q_{i,1} = s] \log(\pi^{(k)}_s) \\
&+ \sum_{i=1}^{N} \sum_{k=1}^{K} r_{ik} \sum_{j=2}^{n} \sum_{m=1}^{|Q|} \sum_{s=1}^{|Q|} \log(A^{(k)}_{msj})\, \mathbb{1}[Q_{i,j} = m,\, Q_{i,j-1} = s], \qquad (4.19)
\end{align*}

where the expansion of $\log P(Q_i; \theta_k)$ is obtained by taking the log of (4.15).

The initialization of the EM algorithm is performed by randomly selecting the parameters $\{\hat{A}^{(k)}, \pi^{(k)}, \mu_k\}$. Then, the algorithm proceeds iteratively as follows. In the E-step it computes $r_{ik}$, which for the case of Markov mixture models is given by

$$r_{ik} \propto \mu_k\, \pi^{(k)}(Q_{i,1})\, \hat{A}^{(k)}_2(Q_{i,1}, Q_{i,2})\, \hat{A}^{(k)}_3(Q_{i,2}, Q_{i,3}) \cdots \hat{A}^{(k)}_n(Q_{i,n-1}, Q_{i,n}), \qquad (4.20)$$

where
$$\hat{A}^{(k)}_j(Q_{i,j-1}, Q_{i,j}) = \prod_{m=1}^{|Q|} \prod_{s=1}^{|Q|} (\hat{A}^{(k)}_{msj})^{\mathbb{1}[Q_{i,j}=m,\, Q_{i,j-1}=s]}.$$

In the M-step it computes the parameters $\theta$ that maximize $g(\theta, \theta^{(l-1)})$. In the case of a mixture of Markov models, these parameters can be computed using the Lagrange multipliers method on $g(\theta, \theta^{(l-1)})$, where the constraints are that all the rows of $A^{(k)}$ and the vectors $\pi^{(k)}$ and $\mu$ must sum to one. For the case under consideration, it can be shown that the maximizing parameters computed in the M-step are given by

\begin{align*}
\mu_k &= \frac{1}{N} \sum_{i=1}^{N} r_{ik} \qquad (4.21)\\
\pi^{(k)}_s &= \frac{\sum_{i=1}^{N} r_{ik}\, \mathbb{1}[Q_{i,1} = s]}{\sum_{s'=1}^{|Q|} \sum_{i=1}^{N} r_{ik}\, \mathbb{1}[Q_{i,1} = s']} \qquad (4.22)\\
\hat{A}^{(k)}_{msj} &= \frac{\sum_{i=1}^{N} r_{ik}\, \mathbb{1}[Q_{i,j} = m,\, Q_{i,j-1} = s]}{\sum_{m'=1}^{|Q|} \sum_{i=1}^{N} r_{ik}\, \mathbb{1}[Q_{i,j} = m',\, Q_{i,j-1} = s]}, \qquad (4.23)
\end{align*}

with $j = 2, \ldots, n$, $s = 1, \ldots, |Q|$ and $m = 1, \ldots, |Q|$.

Furthermore, the EM algorithm guarantees that choosing $\theta$ to improve $g(\theta, \theta^{(l-1)})$ beyond $g(\theta^{(l-1)}, \theta^{(l-1)})$ will improve $\ell(\theta)$ beyond $\ell(\theta^{(l-1)})$, which yields a non-decreasing log likelihood per iteration. We stop the algorithm once the change in the value of $g(\theta^{(l-1)}, \theta^{(l-1)})$ is small enough, or after a fixed number of iterations.

Once the EM algorithm terminates, the value of $r_{ik}$ tells us the responsibility of each mixture component $k$ over the sequence $i$. The clustering step uses this information to perform the clustering. Specifically, each sequence $i$ is assigned to the cluster $C_k$ with $k$ such that $r_{ik} \ge r_{ik'}$, $\forall k' \ne k$.

4.3 Results

In order to assess the rate-distortion performance of the proposed algorithms QualComp and QVZ, we compare them with the state-of-the-art lossy compression algorithms PBlock and RBlock [21], which provide the best rate-distortion performance among existing lossy compression algorithms for quality scores that do not use any extra information for compression. We also consider CRAM [50] and DSRC2 [63], which apply Illumina's binning scheme. For completeness, we also compare the lossless performance of QVZ with that of CRAM, DSRC2 (in their lossless mode), and gzip.

4.3.1 Data

The datasets used for our analysis are NA12878.HiSeq.WGS.bwa.cleaned.recal.hg19.20.bam, which corresponds to chromosome 20 of an H. Sapiens individual, and a ChIP-Seq dataset from an M. Musculus (SRR032209). The H. Sapiens dataset was downloaded from the GATK bundle.2 We generated the SAM file from the BAM file and then extracted the quality score sequences from it. The dataset contains 51,585,658 sequences, each of length 101. The M. Musculus dataset was downloaded from the DNA Data Bank of Japan (DDBJ),3 and contains 18,828,274 sequences, each of length 36.

2 https://www.broadinstitute.org/gatk/
3 http://trace.ddbj.nig.ac.jp/DRASearch/run?acc=%20SRR032209


4.3.2 Machine Specifications

The machine used to perform the experiments has the following specifications: 39 GB RAM, Intel Core i7-930 CPU at 2.80GHz x 8, and Ubuntu 12.04 LTS.

4.3.3 Analysis

First, we describe the options used to run each algorithm. QVZ was run with the default parameters, multiple rates, and different numbers of clusters (for both clustering approaches, that is, k-means and Mixture of Markov Models). Similarly, QualComp was run with different numbers of clusters and multiple rates. PBlock and RBlock were run with different thresholds (that is, different values of p and r, respectively). CRAM and DSRC2 were run with the lossy mode that implements Illumina's proposed binning scheme. Finally, we also ran each of the mentioned algorithms in lossless mode, except QualComp, since it does not support lossless compression (see results below).

We start by looking at the performance of QualComp. Fig. 4.3 shows its performance when applied to the M. Musculus dataset, with one, two, and three clusters, and several rates. As expected, increasing the rate for a given number of clusters decreases the MSE, especially for small rates. Similarly, increasing the number of clusters for the same rate decreases the MSE. As can be observed, the decrease in the MSE is mainly noticeable for small rates, having less effect for moderate to high rates.

It is worth noting that QualComp is not able to achieve lossless compression. This is due to the transformation performed on the quality score sequences prior to compression, and the assumption made by QualComp that Q = R. As a result, QualComp may perform worse than other lossy compressors for rates close to those of lossless compression (see results below).

Regarding the performance of QVZ, recall that it can minimize any quasi-convex distortion, as long as the distortion matrix is provided. In addition, for ease of usage, the implementation contains three built-in distortion metrics: i) the average Mean Squared Error (MSE), where d(x, y) = |x − y|^2; ii) the average L1 distortion, where d(x, y) = |x − y|; and iii) the average Lorentzian distortion, where d(x, y) = log2(1 + |x − y|). Hereafter we refer to each of them as QVZ-M, QVZ-A and QVZ-L, respectively.
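As an illustration, the three built-in distortion matrices could be generated as follows; the function name and interface are hypothetical, not QVZ's actual API:

```python
import numpy as np

def distortion_matrix(alphabet_size, kind="mse"):
    """Sketch of the three distortion matrices d(x, y) described above."""
    x = np.arange(alphabet_size)
    diff = np.abs(x[:, None] - x[None, :]).astype(float)
    if kind == "mse":          # QVZ-M: d(x, y) = |x - y|^2
        return diff ** 2
    if kind == "l1":           # QVZ-A: d(x, y) = |x - y|
        return diff
    if kind == "lorentzian":   # QVZ-L: d(x, y) = log2(1 + |x - y|)
        return np.log2(1.0 + diff)
    raise ValueError(kind)
```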


Figure 4.3: Rate-distortion performance of QualComp for MSE distortion on the M. Musculus dataset. cX stands for QualComp run with X clusters.

Fig. 4.4 shows the performance of QVZ for the H. Sapiens dataset, when k-means is employed to perform the clustering step. We also show in the same plot the performance of QualComp (when run with 3 clusters), as well as that of the previously proposed algorithms. As can be seen, QVZ outperforms QualComp and the previously proposed algorithms for all three choices of distortion metric. The lossy modes of CRAM and DSRC2 can each achieve only one rate-distortion point, and both are outperformed by QVZ. We further observe that although QualComp outperforms RBlock and PBlock for low rates (in all three distortions), the latter two achieve a smaller distortion for higher rates. This is in line with our intuition that QualComp does not provide a competitive performance for moderate to high rates. QVZ, however, outperforms all previously proposed algorithms at both low and high rates. QVZ's advantage becomes especially apparent for distortions other than MSE.

It is also significant that QVZ achieves zero distortion at a rate at which the other lossy algorithms exhibit positive distortion. In other words, QVZ reaches lossless compression at a lower rate than RBlock or PBlock. Moreover, QVZ also outperforms the lossless compressors CRAM and gzip, and achieves similar performance to that of DSRC2 (see Table 4.2).

Finally, we observe that applying the k-means clustering prior to compression in QVZ is especially beneficial at low rates. For higher rates, the performance of 1, 3 and 5 clusters is almost identical. Note that these results are in line with those obtained with QualComp.


Figure 4.4: Rate-distortion curves of PBlock, RBlock, QualComp and QVZ, for MSE, L1 and Lorentzian distortions. In QVZ, c1, c3 and c5 denote 1, 3 and 5 clusters (when using k-means), respectively. QualComp was run with 3 clusters.


            QVZ (3 clusters)   PBlock / RBlock   DSRC2   CRAM    gzip
Size [MB]              1,632             3,229   1,625   2,000   1,999

Table 4.2: Lossless results of the different algorithms for the NA12878 data set.

Next we analyze the Mixture of Markov Models approach for clustering the quality scores prior to compression with QVZ, and compare it with the k-means approach. Fig. 4.5 shows the rate-distortion performance of both approaches when applied to the H. Sapiens dataset, for the three considered distortion metrics. As can be observed, the Mixture of Markov Models clustering approach exhibits superior performance under all considered metrics, when compared to the k-means approach. For example, for a rate of 1 bit per quality score, the Mixture of Markov Models clustering approach achieves half the MSE distortion incurred by the k-means approach.

Figure 4.5: Rate-distortion curves of QVZ for the H. Sapiens dataset, when the clustering step is performed with k-means (c3 and c10), and with the Mixture of Markov Models approach (K3 and K10). In both cases we used 3 and 10 clusters.

Regarding the running time, QualComp takes around 90 minutes to compute the necessary statistics, and 20 minutes to finally compress the quality scores present in the H. Sapiens dataset. The decompression is done in 15 minutes. QVZ, on the other hand, requires approximately 13 minutes to compress the same dataset, and 12 minutes to decompress it. These numbers were computed without performing the clustering step. DSRC2 requires 20 minutes to compress and decompress, whereas CRAM employs 14 minutes to compress and 4 minutes to decompress. Finally, both PBlock and RBlock take around 4 minutes to compress and decompress, making them the fastest among the algorithms that we analyzed. The running times of gzip to compress and decompress are 7 and 30 minutes, respectively.

In terms of memory usage, QVZ uses 5.7 GB to compress the analyzed dataset and less than 1 MB to decompress, whereas QualComp employs less than 1 MB for both operations. PBlock and RBlock use more memory than QualComp, but still below 40 MB to compress and decompress. DSRC2 uses 3 GB to compress and 5 GB to decompress, whereas CRAM employs 2 GB to compress and 3 GB to decompress. Finally, gzip uses less than 1 MB for both operations.

4.4 Discussion

From the results presented in the previous section, we can conclude that QualComp offers competitive performance for small rates, whereas it is outperformed for moderate to high rates. Due to the transformation performed by QualComp on the quality scores, lossless compression cannot be achieved, even if the rate is set to a very high value. This is part of the reason why QualComp performs worse for rates close to those of lossless compression. On the other hand, QVZ is able to achieve lossless compression. In addition, it outperforms the previously proposed lossy compressors for all rates.

Another advantage of QVZ with respect to previously proposed methods is that it can minimize any quasi-convex distortion metric. This feature is important for lossy compressors of quality scores, since the criterion under which the goodness of the reconstruction should be assessed is still not clear. It makes sense to pick a distortion measure by examining how different distortion measures affect the performance of downstream applications, but the abundance of applications and variations in how quality scores are used makes this choice too dependent on the specifics of the applications considered. These trade-offs suggest that an ideal lossy compressor for quality scores should not only provide the best possible compression and accommodate downstream applications, but it should provide flexibility to allow a user to pick a desired distortion measure and/or rate.


Based on the results presented in the previous section, it becomes apparent that performing clustering prior to compression can improve the rate-distortion performance, especially for small rates. k-means is a valid approach for performing the clustering step and, as we have seen, improves the performance for both QualComp and QVZ. However, for QVZ, the clustering approach based on a Mixture of Markov Models outperforms that of k-means. This suggests that the clustering step should take into account the statistics employed by the lossy compressor, so as to boost the performance.

Finally, we have demonstrated through simulation that the binning scheme proposed by Illumina can be outperformed by lossy compressors that take the statistics of the quality scores into account for designing the compression scheme.

4.5 Conclusions

To partially tackle the problem of storage and dissemination of genomic data, in this chapter we have developed QualComp and QVZ, two new lossy compressors for quality scores. One advantage of the proposed methods with respect to previously proposed lossy compressors is that they allow the user to specify the rate prior to compression. Whereas QualComp aims to minimize the Mean Squared Error (MSE), QVZ can work with several distortion metrics, including any quasi-convex distortion metric provided by the user, a feature not supported by the previously proposed algorithms. Moreover, QVZ also allows for lossless compression, and a seamless transition from lossy to lossless compression with increasing rate. QualComp exhibits better rate-distortion performance for small rates than previously proposed methods, whereas QVZ exhibits better rate-distortion performance for all rates. The results presented in this chapter demonstrate the significant savings in storage that can be achieved by lossy compression of quality scores.


Chapter 5

Effect of lossy compression of quality scores on variant calling

In this chapter we evaluate the effect of lossy compression of quality scores on variant calling (SNP and INDEL detection). To that end, we first propose a methodology for the analysis, and then use it to compare the performance of the recently proposed lossy compressors for quality scores. Specifically, we investigate how the output of the variant caller when using the original data differs from that obtained when quality scores are replaced by those generated by a lossy compressor.

We demonstrate that lossy compression can significantly alleviate the storage requirements while maintaining variant calling performance comparable to that with the original data. Further, in some cases lossy compression can lead to variant calling performance that is superior to that of using the original file. We envisage our findings and framework serving as a benchmark in future development and analyses of lossy genomic data compressors.

5.1 Methodology for variant calling

In this section we describe the proposed methodology to test the effect of lossy compressors of quality scores on variant calling. The methodologies suggested for SNPs and INDELs differ, and thus we introduce each of them separately.


5.1.1 SNP calling

Based on the most recent literature that compares different SNP calling pipelines ([69, 70, 71, 72, 73]) we have selected three pipelines for our study. Specifically, we propose the use of: i) the SNP calling pipeline suggested by the Broad Institute, which uses the Genome Analysis Toolkit (GATK) software package [14, 57, 74]; ii) the pipeline presented in the High Throughput Sequencing LIBrary (htslib.org), which uses the Samtools suite developed by The Wellcome Trust Sanger Institute [1]; and iii) the recently proposed variant caller named Platypus developed by Oxford University [75]. In the following we refer to these pipelines as GATK,1 htslib.org2 and Platypus, respectively.

In all pipelines we use BWA-mem [76] to align the FASTQ files to the reference (NCBI build 37, in our case), as stated in all best practices.

Regarding the GATK pipeline, we note that the best practices recommend further filtering the variants found by the Haplotype Caller by applying either the Variant Quality Score Recalibration (VQSR) or the Hard Filter. The VQSR filter is only recommended if the dataset is big enough (more than 100K variants), since otherwise one of the steps of the VQSR, the Gaussian mixture model, may be inaccurate. Therefore, in our analysis we consider the use of both the VQSR and the Hard Filter after the Haplotype Caller, both as specified in the best practices.

5.1.2 INDEL detection

To evaluate the effect of lossy compression of base quality scores on INDEL calling, we employ popular INDEL detection pipelines: Dindel [77], Unified Genotyper, Haplotype Caller [14, 57, 74] and Freebayes [78]. First, reads were aligned to the reference genome, NCBI build 37, with BWA-mem [76]. We replaced the quality scores of the corresponding SAM/BAM file by those obtained after applying various lossy compressors, and then we performed the INDEL calling with each of the four tools. Note that several of these pipelines can be used to call both SNPs and INDELs, but the commands or parameters are different for each variant type.

1 https://www.broadinstitute.org/gatk/guide/best-practices
2 More commonly referred to as samtools. http://www.htslib.org/workflow


5.1.3 Datasets for SNP calling

A crucial part of the analysis is selecting a dataset for which a consensus of SNPs exists (hereafter referred to as "ground truth"), as it serves as the baseline for comparing the performance of the different lossy compressors against the lossless case. Thus, for the SNP calling analysis, we consider datasets from the H. Sapiens individual NA12878, for which two "ground truths" of SNPs have been released. In particular, we consider the datasets ERR174324 and ERR262997, which correspond to a 15x-coverage paired-end WGS dataset and a 30x-coverage paired-end WGS dataset, respectively. For each of them we extracted chromosomes 11 and 20. The decision to extract only some chromosomes was made to speed up the computations. We chose chromosome 20 because it is the one normally used for assessment,3 and chromosome 11 as a representative of a longer chromosome. Regarding the two "ground truths", they are the one released by the Genome in a Bottle consortium (GIAB) [79], which has been adapted by the National Institute of Standards and Technology (NIST), and the ground truth released by Illumina as part of the Platinum Genomes project.4 Fig. 5.1 summarizes the differences between the two. As can be observed, most of the SNPs contained in the NIST ground truth are also included in Illumina's ground truth, for both chromosomes. Note also that the number of SNPs on chromosome 20 is almost half of that on chromosome 11, for both "ground truths".5

5.1.4 Datasets for INDEL detection

To evaluate the effect of lossy compression on INDEL detection, we simulated four datasets. Each dataset is composed of one chromosome with approximately 3000 homozygous INDELs. To mimic biologically realistic variants, we generated distributions of INDEL sizes and frequencies, and insertion-to-deletion ratios, all conditioned on location (coding vs non-coding) using the Mills and 1000Genomes INDELs provided in the GATK bundle. We drew from these distributions to create our simulated data.

3 http://gatkforums.broadinstitute.org/discussion/1213/whats-in-the-resource-bundle-and-how-can-i-get-it
4 http://www.illumina.com/platinumgenomes
5 As is clear from the discussion in this subsection, the term ground truth should be taken with a grain of salt and as such should appear in quotation marks throughout. We omit these marks henceforth for simplicity.


Figure 5.1: Difference between the GIAB NIST "ground truth" and the one from Illumina, for (a) chromosome 11 and (b) chromosome 20.

We generated approximately 30× coverage of the chromosome with 100bp paired-end sequencing reads (using an Illumina-like error profile) for these simulated datasets using ART [80].

5.1.5 Performance metrics

The output of each of the pipelines is a VCF file [16], which contains the set of called variants. We can compare these variants with those contained in the ground truth. True Positives (T.P.) refer to those variants contained both in the VCF file under consideration and in the ground truth (a match in both position and genotype must occur for the call to be declared a T.P. for SNPs, while for INDELs the criteria were more lenient: any INDEL within 10bp of the true location was considered a T.P., similar to the methods of [72]); False Positives (F.P.) refer to variants contained in the VCF file but not in the ground truth; and False Negatives (F.N.) correspond to variants that are present in the ground truth dataset but not in the VCF file under consideration. The more T.P. (or equivalently the fewer F.N.) and the fewer F.P. the better. To evaluate the impact of lossy compression on variant calling, we compare the number of T.P. and F.P. from various lossy compression approaches to the number of T.P. and F.P. obtained from lossless compression. Ideally, we would like to apply a lossy compressor to the quality scores such that the resulting file is smaller than the losslessly compressed one, while obtaining a similar number of T.P. and F.P. We will show that not only is this possible, but that in some cases we can simultaneously obtain more T.P. and fewer F.P. than with the original data.


To analyze the performance of the lossy compressors on the proposed pipelines, we employ the widely used metrics sensitivity and precision, which include in their calculation the true positives, false positives and false negatives, as described below:

• Sensitivity: measures the proportion of all the positives that are correctly called, computed as T.P./(T.P. + F.N.).

• Precision: measures the proportion of called positives that are true, computed as T.P./(T.P. + F.P.).

Depending on the application, one may be inclined to boost the sensitivity at the cost of slightly reducing the precision, in order to be able to find as many T.P. as possible. Of course, there are also applications where it is more natural to optimize for precision than sensitivity. In an attempt to provide a measure that combines the previous two, we also calculate the f-score:

• F-score: corresponds to the harmonic mean of the sensitivity and precision, computed as (2 × Sensitivity × Precision)/(Sensitivity + Precision).
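For concreteness, a minimal sketch of these three metrics (the counts in the usage comment are hypothetical) is:

```python
def variant_calling_metrics(tp, fp, fn):
    """Sensitivity, precision and f-score as defined above (sketch)."""
    sensitivity = tp / (tp + fn)
    precision = tp / (tp + fp)
    f_score = 2 * sensitivity * precision / (sensitivity + precision)
    return sensitivity, precision, f_score

# e.g., hypothetical counts of 3000 T.P., 150 F.P. and 90 F.N.:
# variant_calling_metrics(3000, 150, 90) -> approximately (0.971, 0.952, 0.962)
```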

In the discussion above we have considered all the variants contained in a VCF file to be positive calls. However, another approach is to consider only the subset of variants in the VCF file that satisfy a given constraint to be positive calls. In general, this constraint consists of having the value of one of the parameters associated with a variant above a certain threshold. This approach is used to construct the well-known Receiver Operating Characteristic (ROC) curves. In the case under consideration, the ROC curve shows the performance of the variant caller as a classification problem. That is, it shows how well the variant caller differentiates between true and false variants when filtered by a certain parameter. Specifically, it plots the False Positive Rate (F.P.R.) versus the True Positive Rate (T.P.R.) (also denoted as sensitivity) for all thresholding values. Given an ROC plot with several curves, a common method for comparing them is to calculate the area under the curve (AUC) of each, such that larger AUCs are better.
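A minimal sketch of this thresholding procedure follows; names are hypothetical, and the treatment of ground-truth variants absent from the VCF is simplified to the total ground-truth count passed in by the caller:

```python
import numpy as np

def roc_points(param, in_ground_truth, n_truth_total, thresholds):
    """Sketch of the thresholding procedure described above.

    param           : per-call value of the chosen parameter (e.g., VQSLOD)
    in_ground_truth : per-call boolean, True if the call matches the ground truth
    n_truth_total   : total number of ground-truth variants (sets the F.N. count)
    """
    param = np.asarray(param, dtype=float)
    truth = np.asarray(in_ground_truth, dtype=bool)
    pts = []
    for t in sorted(thresholds):
        called = param >= t                       # calls kept at this threshold
        tp = int(np.sum(called & truth))
        fp = int(np.sum(called & ~truth))
        tpr = tp / n_truth_total                  # sensitivity
        fpr = fp / max(int(np.sum(~truth)), 1)    # fraction of false calls kept
        pts.append((fpr, tpr))
    return pts                                    # AUC via, e.g., np.trapz
```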

There are several drawbacks with this approach. The main one, in our opinion, relates to how to compare the AUC of different VCF files. Note that, in general, different VCF files contain a different number of calls. Thus, it is not informative to compute the ROC curve of each VCF file independently and then compare the respective AUCs. A more rigorous comparison can be performed by forcing all the VCF files under consideration to contain the same number of calls. This can be achieved by computing the union of all the calls contained in the VCF files, and adding to each VCF file the missing ones, such that they all contain the same number of calls. The authors of [62] followed this approach to perform pair-wise comparisons. However, this does not scale very well for a large number of VCF files. Moreover, after performing the analysis, if one more VCF file is generated, all the AUCs must be re-computed (assuming the new VCF file contains at least one call not included in the previous ones). The other main drawback that we encountered relates to the selection of the thresholding parameter. For instance, in SNP calling, when using the GATK pipeline, the QD (Quality by Depth) field is as valid a parameter as the QUAL field, but each of them results in different AUCs.

Given the above discussion, we believe that this approach is mainly suitable to analyze the VCF files that contain a clear thresholding parameter, like those obtained by the GATK pipeline after applying the VQSR filter, since in this case there is a clear parameter to be selected, namely the VQSLOD.

5.2 Results

We analyze the output of the variant caller (i.e., the VCF file) for each of the introduced pipelines when the quality scores are replaced by those generated by a lossy compressor. We focus on those lossy compressors that use only the quality scores for compression, as it would be too difficult to draw conclusions about the underlying source that generates the quality scores from analyzing algorithms like [62], where the lossy compression is done mainly using the information from the reads. To our knowledge, and based on the results presented in the previous chapter, RBlock, PBlock [21] and QVZ are the algorithms that perform best among the existing lossy compressors that solely use the quality scores to compress. Therefore, those are the algorithms that we consider for our study. In addition,


we consider Illumina's proposed binning,6 which is implemented by both DSRC2 [63] and CRAM [50]. Hereafter we report its performance as implemented in DSRC2.

There are some important differences between the lossy compressors selected for our study. For example, the compression scheme of Illumina's proposed binning does not depend on the statistics of the quality scores, whereas QVZ and P/R-Block do. Also, in both Illumina's proposed binning and P/R-Block the maximum absolute distance between a quality score and its reconstruction (after decompression) can be controlled by the user, whereas in QVZ this is not the case. The reason is that QVZ designs the quantizers to minimize a given average distortion based on a rate constraint, and thus even though on average the distortion is small, some specific quality scores may have a reconstructed value that is far from the true one. Also, note that whereas Illumina's proposed binning applies more precision to high quality scores, R-Block does the opposite, and P-Block treats all quality scores equally. Finally, in Illumina's proposed binning and P/R-Block the user cannot estimate the size of the compressed file in advance, whereas this is possible in QVZ.

Due to the extensive number of simulations performed, we only show the results for QVZ with MSE distortion and three clusters, RBlock, and Illumina's proposed binning. We selected these as they are good representatives of the overall results.

5.2.1 SNP calling

Figures 5.2 and 5.3 show the average sensitivity, precision and f-score, together with the compression ratio (in bits per quality score), over the 4 datasets and for the three pipelines when the gold standard is that of NIST and Illumina, respectively. For ease of visualization, we only show the results obtained with the losslessly compressed data, and with the data lossily compressed with QVZ (applied with MSE distortion and 3 clusters as computed with k-means, denoted as QVZ-Mc3), RBlock, and Illumina's proposed binning. We chose to show the results for these algorithms because we found them to be very representative. The lossless compression rate is computed using QVZ in lossless mode.

6 http://www.illumina.com/documents/products/whitepapers/whitepaper_datacompression.pdf


Figure 5.2: Average sensitivity, precision and f-score of the four considered datasets using the NIST ground truth. Different colors represent different pipelines, and different points within an algorithm represent different rates. Q40 denotes the case of setting all the quality scores to 40.

When reading the results, it is important to note the ground truth that was used for the evaluation, as the choice of ground truth can directly affect the results. Recall that, as shown in Fig. 5.1, Illumina's ground truth contains the majority of the SNPs contained in the NIST-GIAB one, plus some more. Thus, assuming both ground truths are largely correct, a SNP caller is likely to achieve a higher sensitivity with the NIST ground truth, while the precision will probably be lower. The opposite holds when comparing the output of a SNP caller against the Illumina ground truth: we will probably obtain a lower sensitivity and a higher precision.

We further define the variability observed in the output of the different SNP calling pipelines as the methodological variability, and the variability introduced by the lossy compressor within a pipeline as the lossy variability. We show that the lossy variability is orders of magnitude smaller than the methodological variability; this indicates that the changes in calling accuracy introduced by lossy compressing the quality scores are negligible.

As shown in the figures, the variability obtained between different variant callers (methodological variability) is significantly larger than the variability introduced by the lossy compressors (for most rates), i.e., the lossy variability. Specifically, for rates larger than 1 bit per quality score, we observe that the effect that lossy compressors have on SNP calling is several orders of magnitude smaller than the variability that already exists within the different variant calling pipelines.


Figure 5.3: Average sensitivity, precision and f-score of the four considered datasets using the Illumina ground truth. Different colors represent different pipelines, and different points within an algorithm represent different rates. Q40 denotes the case of setting all the quality scores to 40.

For smaller rates, we observe a degradation in performance when using QVZ, and the lossy variability becomes more noticeable in this case. Recall that QVZ minimizes the average distortion, and thus at very small rates some of the quality scores may be highly distorted. If the highly distorted quality scores happen to play an important role in calling a specific variant, the overall performance may be affected. On the other hand, RBlock permits the user to specify the maximum allowed individual distortion, and less degradation is obtained in general for small rates. Note also that for rates higher than 1 bit per quality score the performance of both QVZ and RBlock is similar. Illumina's proposed binning achieves around 1 bit per quality score on average, and achieves a performance comparable to that of QVZ and RBlock. Finally, we found that swapping the original quality scores with ones generated uniformly at random,7 or with all of them set to a fixed value (Q40 in the figure), significantly degraded the performance.

7 Results not shown.


These observations demonstrate that the quality scores are actively used in all the pipelines when calling variants, and thus discarding them is not a viable option.

Regarding the selection of the ground truth, we observe a higher sensitivity with the NIST ground truth, and a higher precision with the Illumina ground truth. Note that these results are in line with the above discussion regarding the choice of ground truth.

The performance of QVZ can be further improved if clustering is performed by means of the Mixture of Markov Models rather than k-means. Fig. 5.4 compares the performance of both methods when 3 clusters are used. As can be observed, the choice of clustering method has little effect for rates above 1 bit per quality score. However, for small rates, using the Mixture of Markov Models for clustering significantly improves the performance, especially for the GATK and Platypus pipelines.

Figure 5.4: Comparison of the average sensitivity, precision and f-score of the four considered datasets and two ground truths, when QVZ is used with 3 clusters computed with k-means (QVZ-Mc3) and Mixture of Markov Models (QVZ-MMM3). Different colors represent different pipelines, and different points within an algorithm represent different rates.

To gain insight into the possible benefits of using lossy compression, we show the distribution of the f-score difference between the lossy and lossless cases for different lossy compressors and rates (thus a positive number indicates an improvement over the lossless case). The distribution is computed by averaging over all simulations (24 values in total; 4 datasets, 3 pipelines and 2 ground truths). Fig. 5.5 shows the box-plot and the mean value of the f-score difference for six different compression rates. Since QVZ performs better for high rates, we show the results for QVZ-Mc3 with parameters 0.9, 0.8 and 0.6 (left-most side of the figure). Analogously, for high compression ratios we show the results of RBlock with parameters 30, 20, and 10 (right-most side of the figure).

Figure 5.5: Box plot of f-score differences between the lossless case and six lossy compression algorithms for 24 simulations (4 datasets, 3 pipelines and 2 ground truths). The x-axis shows the compression rate achieved by the algorithm. The three left-most boxes correspond to QVZ-Mc3 with parameters 0.9, 0.8 and 0.6, while the three right-most boxes correspond to RBlock with parameters 30, 20 and 10. The blue line indicates the mean value, and the red one the median.


Remarkably, for all the rates the median is positive, which indicates that in at least 50% of the cases lossy compression improved upon the uncompressed quality scores. Moreover, the mean is also positive, except for the point with the highest compression. This suggests that lossy compression may be used to reduce the size of the quality scores without compromising the performance of the SNP calling.

The above reported results show that lossy compression of quality scores (up to a certain threshold on the rate) does not affect the performance of variant calling. Moreover, the box plot of Fig. 5.5 indicates that in some cases an improvement with respect to the original data can be obtained. We now look into these results in more detail, by focusing on the individual performance of each of the variant calling pipelines.

We choose to show the results using tables, as they help visualize which lossy compressors and/or parameters work better for a specific setting. We color in red (shown as a shaded cell) the values of the sensitivity, precision and f-score that improve upon the uncompressed case.

Table 5.1 shows the results for the algorithms RBlock, QVZ-Mc3 (MSE distortion criterion and 3 clusters) and Illumina binning-DSRC2 for the GATK with hard filtering pipeline when using the NIST ground truth. The two column groups refer to the average results of chromosomes 11 and 20 of the ERR262996 and ERR174310 datasets, respectively. For ease of exposition, we omit the results of QVZ using other distortions and rates, and those of PBlock, as well as the results of individual chromosomes. Similarly, we omit the results for the htslib.org and Platypus pipelines, but we comment on the results.

It is worth noting that with the GATK pipeline, several compression approaches simultaneously improve the sensitivity, precision, and f-score when compared to the uncompressed (original) quality scores. For example, in the 30× coverage dataset, RBlock improves the performance while reducing the size by more than 76%. In the 15× coverage dataset QVZ improves upon the uncompressed and reduces its size by 20%. With the htslib.org pipeline, it is interesting to see that most of the points improve the sensitivity, meaning that they are able to find more T.P. than with the uncompressed quality scores. Finally, with the Platypus pipeline, the parameters that improve in general are the precision and the f-score, which indicates that a bigger percentage of the calls are T.P. rather than F.P. Some points also improve upon the uncompressed case.


Table 5.1: Sensitivity, precision, f-score and compression ratio for the 30× and 15× coverage datasets for the GATK pipeline, using the NIST ground truth.

                          ERR262996 (30×): Chr11, Chr20               ERR174310 (15×): Chr11, Chr20
GATK                Sensitivity  Precision  F-Score  Compression (%)  Sensitivity  Precision  F-Score  Compression (%)
Lossless                 0.9829     0.7422   0.8456        0               0.9699     0.7332   0.8351        0
RBlock       3           0.9828     0.7424   0.8457      -16.7             0.97       0.7331   0.835       -16.9
             8           0.983      0.7421   0.8456       31               0.9708     0.733    0.8353       31.6
             10          0.983      0.7422   0.8457       42               0.9708     0.733    0.8353       43
             20          0.9831     0.7422   0.8457       67.4             0.971      0.7328   0.8352       70.7
             30          0.9833     0.7422   0.8458       76.4             0.9709     0.7326   0.8351       78.9
QVZ-Mc3      0.9         0.9829     0.7421   0.8456       10.2             0.9699     0.7332   0.8351       10
             0.8         0.983      0.7425   0.8459       14.7             0.9699     0.7333   0.8351       19
             0.6         0.9828     0.7424   0.8457       37.6             0.97       0.7326   0.8347       38.3
             0.4         0.9833     0.7422   0.8458       57.5             0.9691     0.7321   0.8341       58.6
             0.2         0.9831     0.7418   0.8454       77.7             0.9679     0.7309   0.8328       77.9
Illumina-DSRC2           0.9827     0.7424   0.8457       54.78            0.9694     0.7345   0.8357       55.66


When the ground truth is provided by Illumina, with the GATK pipeline, R/P-Block improves mainly the sensitivity and f-score, with PBlock improving the precision as well in the 30x coverage dataset. QVZ seems to perform better in this case, improving upon the uncompressed for several rates. It also achieves a performance better than that of Illumina's proposed binning for a similar compression rate. With the htslib.org pipeline R/P-Block improves mainly the sensitivity, while QVZ improves the precision and the f-score (in the 30x coverage dataset). The performance on Platypus is similar to the one obtained when the NIST ground truth is used instead.

In summary, in terms of the distortion metric that QVZ aims to minimize, MSE works significantly better for small rates (in most of the cases), whereas for higher rates the three analyzed distortions offer a similar performance. Thus the compression rate seems to matter much more for the variability in the performance than the choice of distortion criterion. RBlock offers in general better performance than PBlock for similar compression rates. Finally, in most of the analyzed cases, Illumina's binning is outperformed by at least one other lossy compressor, while offering a similar compression rate. Overall, for high compression ratios (30%-70%), RBlock seems to perform the best, whereas QVZ is preferred for lower compression rates (>70%).

In the previously analyzed cases we have assumed that all the SNPs contained in the VCF file are positive calls, since the pipelines already follow their "best practices" to generate the corresponding VCF file. As discussed in the Methodology, another possibility is to select a parameter and consider as positive calls only those whose parameter is above a certain threshold. Varying the threshold results in the ROC curve. We believe this approach is of interest for analyzing the VCF files generated by the GATK pipeline followed by the VQSR filter, with the thresholding parameter given by the VQSLOD field, and thus we present the results for this case.

Fig. 5.6 shows the ROC curve of chromosome 11 of the 30x coverage dataset (ERR262996), with the NIST ground truth. The results correspond to those obtained when the quality scores are the original ones (lossless), and the ones generated by QVZ-Mc3 (MSE distortion and 3 clusters), PBlock with parameter 8, RBlock with parameter 25, and the Illumina binning (as the result of applying the DSRC2 algorithm). As shown in the figure, each of these algorithms outperforms the rest in at least one point of the curve. This is not the


case for the Illumina binning, as it is outperformed by at least one other algorithm at all points. Moreover, the AUCs of all the lossy compressors except the Illumina binning exceed that of the lossless case.

Figure 5.6: ROC curve of chromosome 11 (ERR262996) with the NIST ground truth and the GATK pipeline with the VQSR filter. The ROC curve was generated with respect to the VQSLOD field. The results are for the original quality scores (uncompressed), and those generated by QVZ-Mc3 (MSE distortion and 3 clusters), PBlock (p = 8) and RBlock (r = 25). AUCs: QVZ-Mc3 [θ = 0.4]: 0.6662; PBlock [p = 8]: 0.66667; RBlock [r = 25]: 0.66548; DSRC2-Illumina binning: 0.66428; Lossless: 0.6649.


5.2.2 INDEL detection

We show that lossy compression of quality scores leads to smaller files while enabling INDEL detection algorithms to achieve accuracies similar to those obtained with data that has been compressed losslessly.

We simulated four datasets that each consisted of the CEU major alleles for chromosome 22 [81] with approximately 3000 homozygous INDELs that were biologically realistic in length, location, and insertion-to-deletion ratio.

Fig. 5.7 shows the sensitivity, precision, and f-score achieved by each INDEL detection pipeline using input data from the aforementioned compression approaches, together with the compression ratio in bits per quality score. Note that the figure displays the means across the four simulated datasets. In terms of sensitivity, all four INDEL detection pipelines (HaplotypeCaller, UnifiedGenotyper, Dindel, and Freebayes) resulted in a lossy variability, as described above, that does not exceed the methodological variability. All compression algorithm and INDEL detection pipeline combinations had high precision (all but one obtained precision > 0.995). Except for the DSRC2 compression approach applied to HaplotypeCaller, lossy compression did not result in variability in precision.

Figure 5.7: Average (over the four simulated datasets) sensitivity, precision and f-score for INDEL detection pipelines. Different colors represent different pipelines, and different points within an algorithm represent different rates.

Table 5.2 displays the sensitivity for an example INDEL detection pipeline, Dindel; results are shown for each compression approach for each simulated dataset individually,


along with the mean and standard deviation across datasets. The mean sensitivity for thelossless compression was 0.9796. Interestingly, RBlock (with r parameter set to 8 or 10)achieves a slightly higher average sensitivity of 0.9798. The remaining pipelines have meansensitivities ranging from 0.9644 to 0.9796. The standard deviation across pipelines waslow, ranging from 0.0016 to 0.0033.

5.3 Discussion

We have shown that lossy compressors can reduce file size at a minimal cost, or even a benefit, to sensitivity and precision in SNP and INDEL detection.

We have analyzed several lossy compressors introduced recently in the literature that do not use any biological information (such as the reads) for compression. The main difference among them relates to the way they use the statistics of the quality scores for compression. For example, Illumina's proposed binning is a fixed mapping that does not use the underlying properties of the quality scores. In contrast, algorithms like QVZ rely fully on the statistics of the quality scores to design the corresponding quantizers for each case.

Based on the results shown in the previous section, we conclude that in many cases lossy compression can significantly reduce genomic file sizes (with respect to the losslessly compressed ones) without compromising variant calling performance. Specifically, we observe that the variability in the calls output by different existing SNP and INDEL callers is generally orders of magnitude larger than the variability introduced by lossy compressing the quality scores, especially for moderate to high rates. For small rates (below about 0.5 bits per quality score), lossy compressors that minimize the average distortion, such as QVZ, suffer a degradation in performance. This is due to some of the quality scores getting highly distorted. At high rates, the analyzed lossy compressors perform similarly, except for Illumina's proposed binning, which is generally outperformed by the other lossy compressors. This suggests that using the statistics of the quality scores for compression is beneficial, and that not all datasets should be treated in the same way.

The degradation in performance observed when setting the quality scores to a random value, or all to the maximum, demonstrates that the quality scores do matter; thus, in our opinion, discarding them is not a viable option. We recommend applying lossy compression with moderate to high rates to ensure the quality scores are not highly distorted. In algorithms such as PBlock and RBlock, the user can directly specify the maximum allowed distortion. In algorithms that minimize an average distortion, such as QVZ, we recommend employing at least about one bit per quality score.

Finally, in several cases we have observed that lossy compression actually leads to superior results compared to lossless compression, i.e., it generates more true positives and fewer false positives than the original quality scores when compared to the corresponding ground truth. One important remark is that none of the analyzed lossy compressors makes use of biological information for compression, in contrast to other algorithms such as the one introduced in [62]. We believe this is important, as one could argue that the latter algorithms are tailored toward variant calling, and thus their results should be read with care. The fact that we are able to show improved variant calling performance in some cases with algorithms that do not use any biological information further shows the potential of lossy compression of quality scores to improve on downstream applications.

Our findings, put together with the fact that, when losslessly compressed, quality scores comprise more than 50% of the compressed file [17], seem to indicate that lossy compression of quality scores could become an acceptable practice in the future for boosting compression performance or when operating in bandwidth-constrained environments. The main challenge in such a mode may be deciding which lossy compressor and/or rate to use in each case. Part of this is due to the fact that the results presented so far are experimental, and we have yet to develop theory that will guide the construction or choice of compressors geared toward improved inference.

5.4 Conclusion

Recently there has been a growing interest in lossy compression of quality scores as a way to reduce raw genomic data storage costs. However, the genomic data under consideration is used for biological inference, and thus it is important to first understand the effect that lossy compression has on the subsequent analysis performed on it. To date, there is no clear methodology to do so, as can be inferred from the variety of analyses performed in the literature when new lossy compressors are introduced. To alleviate this issue, in this chapter we have described a methodology to analyze the effect that lossy compression of quality scores has on variant calling, one of the most widely used downstream applications in practice. We hope the described methodology will be of use in the future when analyzing new lossy compressors and/or new datasets.

Specifically, the proposed methodology considers the use of different pipelines for SNP calling and INDEL calling, and datasets for which true variants exist ("ground truth"). We have used this methodology to analyze the behavior of the state-of-the-art lossy compressors, which to our knowledge constitutes the most complete analysis to date. The results demonstrate the potential of lossy compression as a means to reduce the storage requirements while obtaining performance close to that based on the original data. Moreover, in many cases we have shown that it is possible to improve upon the original data.

Our findings and the growing need for reducing storage requirements suggest that lossy compression may be a viable mode for storing quality scores. However, further research should be performed to better understand the statistical properties of the quality scores, to enable the principled design of lossy compressors tailored to them. Moreover, methodologies for analyzing the effect on other important downstream applications should be developed.


Table 5.2: Sensitivity for INDEL detection by the Dindel pipeline with various compression approaches for 4 simulated datasets.

Compression approach | Dataset 1 | Dataset 2 | Dataset 3 | Dataset 4 | Mean   | Standard deviation
Lossless             | 0.9817    | 0.9788    | 0.9805    | 0.9775    | 0.9796 | 0.0019
Illumina-DSRC2       | 0.9661    | 0.9662    | 0.9666    | 0.9621    | 0.9652 | 0.0021
Mc3-0.3              | 0.9776    | 0.9737    | 0.9747    | 0.9696    | 0.9739 | 0.0033
Mc3-0.7              | 0.9817    | 0.9775    | 0.9799    | 0.9758    | 0.9787 | 0.0026
Mc3-0.9              | 0.9817    | 0.9788    | 0.9805    | 0.9764    | 0.9794 | 0.0023
Pblock-2             | 0.9817    | 0.9788    | 0.9805    | 0.9775    | 0.9796 | 0.0019
Pblock-8             | 0.9810    | 0.9778    | 0.9802    | 0.9775    | 0.9791 | 0.0017
Pblock-16            | 0.9654    | 0.9662    | 0.9652    | 0.9607    | 0.9644 | 0.0025
Rblock-3             | 0.9817    | 0.9788    | 0.9805    | 0.9775    | 0.9796 | 0.0019
Rblock-8             | 0.9817    | 0.9788    | 0.9805    | 0.9781    | 0.9798 | 0.0016
Rblock-10            | 0.9817    | 0.9788    | 0.9805    | 0.9781    | 0.9798 | 0.0016


Chapter 6

Denoising of quality scores

The results presented in the previous chapter, which show that lossy compression of the quality scores can lead to variant calling performance that improves upon the uncompressed data, suggest that denoising of the quality scores is possible and of potential benefit. However, reducing the noise in the quality scores has remained largely unexplored.

With that in mind, in this chapter we propose a denoising scheme to reduce the noise present in the quality scores, and demonstrate improved inference with the denoised data. Specifically, we show that replacing the quality scores with those generated by the proposed denoiser results in more accurate variant calling in general. Moreover, we show that reducing the noise leads to a smaller entropy than that of the original quality scores, and thus a significant boost in compression is also achieved. Thus the angle of the present work is denoising for improved inference, with boosted compression performance as an important benefit stemming from data processing principles. Such schemes to reduce the noise of genomic data while easing its storage and dissemination can significantly benefit the field of genomics.

6.1 Proposed method

We first formalize the problem of denoising quality scores, and then describe the proposed denoising scheme in detail. We conclude this section with a description of the evaluation criteria.


6.1.1 Problem Setting

Let $X_i = [X_{i,1}, X_{i,2}, \ldots, X_{i,n}]$ be a sequence of true quality scores of length $n$, and $X = \{X_i\}_{i=1}^{N}$ a set of quality score sequences. We further let $Q = \{Q_i\}_{i=1}^{N}$ be the set of noisy quality score sequences that we observe and want to denoise¹, where $Q_{i,j} = X_{i,j} + Z_{i,j}$ and $Q_i = [Q_{i,1}, Q_{i,2}, \ldots, Q_{i,n}]$. Note that $\{Z_{i,j} : 1 \le i \le N,\, 1 \le j \le n\}$ represents the noise added during the sequencing process. This noise comes from different sources of generally unknown statistics, some of which are not reflected in the mathematical models used to estimate the quality scores [82, 55].

Our goal is to denoise the noisy quality scores $Q$ to obtain a version closer to the true underlying quality score sequences $X$. We further denote the output of the denoiser by $\hat{X} = \{\hat{X}_i\}_{i=1}^{N}$, with $\hat{X}_i = [\hat{X}_{i,1}, \hat{X}_{i,2}, \ldots, \hat{X}_{i,n}]$.

6.1.2 Denoising Scheme

The suggested denoising scheme is depicted in Fig. 6.1. It consists of a lossy compressor applied to the noisy quality scores $Q$, the corresponding decompressor, and a post-processing operation that uses both the reconstructed quality scores $\hat{Q}$ and the original ones. The output of the denoiser is the sequence of noiseless quality scores $\hat{X}$. In order to compute the final storage size, a lossless compressor for quality scores is applied to the denoised signal $\hat{X}$. Note that we cannot simply store the output of the lossy compressor and use that as the final size, since the post-processing operation also needs access to the original quality scores. That is, the denoiser needs to perform both the lossy compression and the decompression, and incorporate the original (uncompressed) noisy data when computing its final output.

The proposed denoiser is based on the one outlined in [83], which is universally optimal in the limit of large amounts of data when applied to a stationary ergodic source corrupted by additive white noise. Specifically, consider a stationary ergodic source $X^n$ and its noisy corrupted version $Y^n$ given by

$$Y_i = X_i + Z_i,$$

¹ For example, the quality score sequences found in a FASTQ or SAM file.


[Figure 6.1 is a block diagram: the noisy quality scores Q pass through a lossy compressor and the corresponding decompressor, producing the reconstruction Q̂; the post-processing operation combines Q̂ with the original Q to produce X̂, which is then losslessly compressed to compute the final size.]

Figure 6.1: Outline of the proposed denoising scheme.

for $1 \le i \le n$, where $Z^n$ is an additive white noise process. Then, the first step towards recovering the noiseless signal $X^n$ consists of applying a lossy compressor under the distortion measure $\rho : \mathcal{Y} \times \hat{\mathcal{Y}} \to \mathbb{R}^{+}$ given by

$$\rho(y, \hat{y}) \triangleq \log \frac{1}{P_Z(y - \hat{y})}, \qquad (6.1)$$

where $P_Z(\cdot)$ is the probability mass function of the random variable $Z$. Moreover, the lossy compressor should be tuned to distortion level $H(Z)$, that is, the entropy of the noise.

In the case of the quality scores, the statistics of the noise are unknown, and thus we cannot set the right distortion measure and level at the lossy compressor. In the presence of such uncertainty, one could make the worst-case assumption that the noise is Gaussian [84] of unknown variance, which would translate into a distortion measure given by the square of the error (based on Eq. (6.1)). However, even with this assumption, the correct distortion level depends on the unknown variance. Thus, instead, we take advantage of the extensive work performed on lossy compressors for quality scores in the past, and use them for the lossy compression step. Since we do not know the right distortion level to set at the lossy compressor, we apply each of them with different distortion levels. This decision, although lacking theoretical guarantees, works in practice, as demonstrated in the following section. Moreover, it makes use of the lossy compressors that have already been proposed and tested.
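To spell out that step, the following short derivation (my own addition, assuming the noise is modeled as $Z \sim \mathcal{N}(0, \sigma^2)$ and using its density in place of the pmf in Eq. (6.1)) shows how the Gaussian assumption turns the distortion measure into a squared-error measure:

$$\rho(y, \hat{y}) = \log \frac{1}{P_Z(y - \hat{y})} = \frac{(y - \hat{y})^2}{2\sigma^2} + \frac{1}{2}\log\left(2\pi\sigma^2\right),$$

so, up to an additive constant and a positive scaling that do not affect the minimization, the measure reduces to the squared error $(y - \hat{y})^2$; the appropriate distortion level, however, still depends on the unknown variance $\sigma^2$.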

The second step consists of performing a post-processing operation based on the noisy signal $Y^n$ and the reconstructed sequence $\hat{Y}^n$. For a given integer $m = 2m_0 + 1 > 0$, $y^m \in \mathcal{Y}^m$ and $\hat{y} \in \hat{\mathcal{Y}}$, define the joint empirical distribution as

$$p^{(m)}_{Y^n \hat{Y}^n}(y^m, \hat{y}) \triangleq \frac{\left|\{m_0 + 1 \le i \le n - m_0 : (Y_{i-m_0}^{i+m_0}, \hat{Y}_i) = (y^m, \hat{y})\}\right|}{n - m + 1}. \qquad (6.2)$$

Thus, Eq. (6.2) represents the fraction of times $Y_{i-m_0}^{i+m_0} = y^m$ while $\hat{Y}_i = \hat{y}$, over all $i$. Once the joint empirical distribution is computed, the denoiser generates its output as

$$\hat{X}_i = \operatorname*{argmin}_{\hat{x} \in \hat{\mathcal{X}}} \sum_{\hat{y} \in \hat{\mathcal{Y}}} p^{(m)}_{Y^n \hat{Y}^n}\!\left(Y_{i-m_0}^{i+m_0}, \hat{y}\right) d(\hat{x}, \hat{y}), \qquad (6.3)$$

for $1 \le i \le n$. Note that $d : \hat{\mathcal{X}} \times \mathcal{X} \to \mathbb{R}^{+}$ is the original loss function under which the performance of the denoiser is to be measured, and $\hat{\mathcal{X}}$ is the alphabet of the denoised sequence $\hat{X}^n$.

For the case of the quality scores, the joint empirical distribution can be computed mostly as described in Eq. (6.2). However, since now we have a set of quality score sequences, we redefine it as

$$p^{(m)}_{Q, \hat{Q}}(q^m, \hat{q}) \triangleq \frac{\left|\{(i, j) : (Q_{i,j-m_0}, \ldots, Q_{i,j+m_0}, \hat{Q}_{i,j}) = (q^m, \hat{q})\}\right|}{nN}, \qquad (6.4)$$

where $Q_{i,j} = 0$ for $j < 1$ and $j > n$. Finally, the output of the denoiser is given by

$$\hat{X}_{i,j} = \operatorname*{argmin}_{\hat{x} \in \hat{\mathcal{X}}} \sum_{\hat{q} \in \hat{\mathcal{Q}}} p^{(m)}_{Q, \hat{Q}}\!\left(Q_{i,j-m_0}, \ldots, Q_{i,j+m_0}, \hat{q}\right) d(\hat{x}, \hat{q}), \qquad (6.5)$$

for $1 \le i \le N$ and $1 \le j \le n$, with $d$ being the squared-error distortion. Note also that the alphabets of the original, reconstructed and denoised quality scores are the same, i.e., $\mathcal{Q} = \hat{\mathcal{Q}} = \hat{\mathcal{X}}$.

Finally, as mentioned above, we apply a lossless compressor to the output of the denoiser to compute the final size.
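To make the post-processing operation concrete, the following Python sketch (my own illustration, not code from this work; the function name is hypothetical) implements Eqs. (6.4) and (6.5) directly: it counts joint occurrences of noisy contexts and reconstructed symbols over all sequences, and then outputs, for each position, the symbol minimizing the expected squared error under the resulting empirical distribution.

from collections import defaultdict

def denoise_quality_scores(Q, Q_hat, m0, alphabet):
    """Post-processing step of Eqs. (6.4)-(6.5), sketched for small alphabets.

    Q, Q_hat: lists of equal-length integer sequences (noisy and lossy-reconstructed
    quality scores); m0 sets the context length m = 2*m0 + 1; alphabet lists the
    possible quality-score values."""
    n = len(Q[0])
    pad = lambda row: [0] * m0 + list(row) + [0] * m0   # Q_{i,j} = 0 outside [1, n]

    # Joint empirical counts of (noisy context, reconstructed symbol), Eq. (6.4);
    # normalization by nN is omitted since it does not change the argmin.
    counts = defaultdict(lambda: defaultdict(int))
    for q_row, qh_row in zip(Q, Q_hat):
        qp = pad(q_row)
        for j in range(n):
            counts[tuple(qp[j:j + 2 * m0 + 1])][qh_row[j]] += 1

    # Denoiser output, Eq. (6.5): minimize the expected squared error over the
    # empirical distribution of the reconstruction given the observed context.
    X_hat = []
    for q_row in Q:
        qp = pad(q_row)
        row_out = []
        for j in range(n):
            dist = counts[tuple(qp[j:j + 2 * m0 + 1])]
            row_out.append(min(alphabet,
                               key=lambda x: sum(c * (x - q) ** 2
                                                 for q, c in dist.items())))
        X_hat.append(row_out)
    return X_hat

The experiments reported later in this chapter use m = 3, i.e., m0 = 1, which keeps the number of contexts manageable given the large quality-score alphabet.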

As outlined in [85], the intuition behind the proposed scheme is as follows. First, note that adding noise to a signal always increases its entropy, since

$$I(X^n + Z^n; Z^n) = H(X^n + Z^n) - H(X^n + Z^n \mid Z^n) = H(X^n + Z^n) - H(X^n) \ge 0, \qquad (6.6)$$

which implies $H(Y^n) \ge H(X^n)$, with $Y^n = X^n + Z^n$. Also, lossy compression of $Y^n$ at distortion level $D$ can be done by searching among all reconstruction sequences within radius $D$ of $Y^n$ and choosing the most compressible one. Thus, if the distortion level is set appropriately, a reasonable candidate for the reconstruction sequence is the noiseless sequence $X^n$. The role of the lossy compressor is to partially remove the noise and to learn the source statistics in the process, so that the post-processing operation can be thought of as performing Bayesian denoising. Therefore, we also expect the denoised quality scores to be more compressible than the original ones, due to the reduced entropy.

6.1.3 Evaluation Criteria

To measure the quality of the denoiser we cannot compare the set of denoised sequences $\hat{X}$ to the true sequences $X$, as the latter are unavailable. Instead, we analyze the effect on variant calling when the original quality scores are replaced by the denoised ones. For the analysis, we follow the methodology described in Chapter 5. Recall that it consists of several pipelines and datasets specific to SNP calling and INDEL detection.

In brief, the considered pipelines for SNP calling are GATK [14, 57, 74], htslib.org [1] and Platypus [75], and for INDEL detection we used Dindel [77], Unified Genotyper, Haplotype Caller [14, 57, 74] and Freebayes [78]. The datasets used for SNP calling correspond to the H. sapiens individual NA12878; in particular, chromosomes 11 and 20 of the paired-end whole genome sequencing datasets ERR174324 (15x) and ERR262997 (30x). For these data two SNP consensus sets are available, one released by the GIAB (Genome In A Bottle) consortium [79] and one released by Illumina. The dataset used for INDEL detection corresponds to a chromosome containing 3000 heterozygous INDELs, from which 100bp paired-end sequencing reads were generated with ART [80]. All datasets in this study have a consensus sequence, making it possible to analyze the accuracy of the variant calls. We expect that using the denoised data in lieu of the original data would yield higher sensitivity, precision and f-score.


6.2 Results and Discussion

We analyze the performance of the proposed denoiser for both SNP calling and INDEL detection. For the lossy compressor block we used the algorithms RBlock [21], PBlock [21], QVZ, and Illumina's proposed binning. Since the right distortion level at which they should operate is unknown, we run each of them with different parameters (i.e., different distortion levels)². Regarding the post-processing operation, we set m in Eq. (6.4) equal to three in all the simulations. This choice was made to reduce the running time and complexity, because of the large alphabet of the quality scores. As the entropy encoder we applied QVZ in lossless mode, which offers competitive performance (see Chapter 4).

[Figure 6.2 plots the compressed size in MB (from 250 to 600) against the MSE distortion at the lossy compressor (from 0 to 25), for the original data and for the denoised data.]

Figure 6.2: Reduction in size achieved by the denoiser when compared to the original data (when losslessly compressed).

Due to the extensive number of simulations, here we focus on the most representative results. For completeness, in addition to comparing the performance of the proposed denoiser with that of the original data, we also compared it with the performance obtained

² Except for Illumina's proposed binning, which generates only one point in the rate-distortion plane.


when only lossy compression is applied to the data (i.e., without the post-processing operation). We observed that the post-processing operation improves the performance beyond that achieved by applying only lossy compression in most cases. Moreover, the denoised data occupies less space than the original data, corroborating our expectation that the denoiser reduces the noise of the quality scores and thus the entropy, consistent with the data processing principle. As an example, Fig. 6.2 compares the size of chromosome 20 of the dataset ERR262997 (used for the analysis of SNP calling) with that generated by the denoiser with different lossy compressors targeting different distortion levels (x-axis). As can be observed, for all distortion levels above 4, the reduction in size is between 30% and 44%. Interestingly, similar results were obtained with all the tested datasets, which suggests that more than 30% of the entropy (of the original data) is due to noise.

In the following we focus on the performance of the denoiser in terms of its effect on SNP calling and INDEL detection.

6.2.1 SNP calling

We observe that the results for chromosomes 11 and 20 of the 30x coverage dataset are very similar for all the considered pipelines, and thus for ease of exposition we restrict our attention to chromosome 20. Regarding the 15x coverage dataset, we focus on chromosome 11 and the SNP consensus produced by Illumina (similar results were obtained with the GIAB consensus).

Fig. 6.3 shows the results for the 30x coverage dataset on the GATK pipeline. As can be observed, for MSE distortion levels between 0 and approximately 20, and for any lossy compressor, the denoiser improves all three metrics: f-score, sensitivity, and precision. Among the analyzed pipelines, GATK is the most consistent and the one offering the best results. This suggests that the GATK pipeline uses the quality scores in the most informative way.

For htslib.org and Platypus, we also observe that the points that improve upon the original one generally exhibit an MSE distortion of less than 20. However, in this case the lossy compressors perform differently. For example, QVZ improves the precision and f-score with the htslib.org pipeline and the sensitivity with the Platypus one. On the other hand, Pblock and Rblock achieve the best results in terms of precision and f-score with the Platypus pipeline, and in terms of sensitivity with htslib.org.


[Figure 6.3 shows three panels (sensitivity, precision, and F-score) as a function of the MSE distortion level at the lossy compressor (0 to 60), for the original data and for the denoiser combined with Rblock, Pblock, Illumina binning, and QVZ with 1 cluster.]

Figure 6.3: Denoiser performance on the GATK pipeline (30x dataset, chr. 20). Different points of the same color correspond to running the lossy compressor with different parameters.

With the 15x dataset, the denoiser generally achieves better performance when using the lossy compressors Rblock and Pblock. For example, Fig. 6.4 shows the f-score for the GATK and htslib.org pipelines. As can be observed, the denoised data improves upon the uncompressed data in both cases with Rblock and with Illumina's proposed binning, and with Pblock when the distortion level is below 20. With QVZ the denoiser achieves better precision with the GATK and htslib.org pipelines, and better sensitivity with Platypus.

Finally, it is worth noting the potential of the post-processing operation to improve upon the performance obtained by applying only lossy compression. We observed that this is true for all four considered datasets (chromosomes 11 and 20, and coverages 15x and 30x) and the three pipelines (GATK, htslib.org and Platypus). To give some concrete examples, with the Platypus pipeline the post-processing operation boosts the sensitivity when applying any lossy compressor, for all datasets. The general improvement is more noticeable for the 15x coverage datasets, where all metrics improve in most of the cases.


[Figure 6.4 shows the F-score as a function of the MSE distortion level at the lossy compressor (0 to 60) for the GATK pipeline (left panel) and the htslib.org pipeline (right panel), for the original data and for the denoiser combined with Rblock, Pblock, Illumina binning, and QVZ with 1 cluster.]

Figure 6.4: Denoiser performance on the GATK and htslib.org pipelines (15x dataset, chr. 11).

6.2.2 INDEL detection

Among the analyzed pipelines, the denoiser exhibits the best performance on the Haplotype Caller pipeline. For example, in terms of f-score, we observe that the proposed scheme with Illumina's binning and Rblock as the lossy compressor achieves better performance than the original data. QVZ and Pblock also improve for the points with smaller distortions. Similar results are obtained for the sensitivity and precision. Moreover, in this case the potential of applying the post-processing operation after any of the considered lossy compressors becomes particularly apparent, as the performance always improves (see Fig. 6.5).

We also observe improved performance using Freebayes with QVZ in terms of sensitivity, precision and f-score, and an improved precision with the remaining lossy compressors. With the GATK-UG and Dindel pipelines, Rblock achieves the best performance, improving upon the original data under all three performance metrics.


[Figure 6.5 is a scatter plot of sensitivity, precision, and f-score points comparing the lossy compressed data (x-axis) with the denoised data (y-axis).]

Figure 6.5: Improvement achieved by applying the post-processing operation. The x-axis represents the sensitivity, precision, and f-score achieved by solely applying lossy compression, and the y-axis represents the same metrics when the post-processing operation is applied after the lossy compressor. The grey line corresponds to x = y, and thus all points above it correspond to an improved performance.

6.3 Conclusion

In this chapter we have proposed a denoising scheme for quality scores. The proposed scheme is composed of a lossy compressor, followed by the corresponding decompressor and a post-processing operation. Experimentation on real data suggests that the proposed scheme has the potential to improve the quality of the data, as measured by its effect on downstream inferential applications, while at the same time significantly reducing the storage requirements.

Further study of denoising of quality scores is merited, as it seems to hold the potential to enhance the quality of the data while at the same time easing its storage requirements. We hope the promising results presented in this chapter serve as a baseline for future research in this direction. Further research should include improved modeling of the statistics of the noise, construction of denoisers tuned to such models, and more experimentation on real data and with additional downstream applications.


Chapter 7

Compression schemes for similarity queries

In this chapter we study the problem of compressing sequences in a database so that similarity queries can still be performed efficiently in the compressed domain. Specifically, we focus on queries of the form: "which sequences in the database are similar to a given sequence y?", which are of practical interest in genomics.

More formally, we consider schemes that generate, for each sequence x in the database, a short fixed-length signature, denoted by T(x), that is stored in the compressed database. Then, given a query sequence y, we answer the question of whether x and y are similar based only on the signature T(x), rather than on the original sequence x.

When answering a query, there are two types of errors that can be made: a false positive, when a sequence is misidentified as similar to the query sequence; and a false negative, when a similar sequence remains undetected. We impose the restriction that false negatives are not permitted, as even a small probability of a false negative translates into a substantial probability of misdetection of some sequences in the large database, which is unacceptable in many applications. On the other hand, false positives do not cause an error per se, as the precise level of similarity is assessed upon retrieval of the full sequence from the large database. However, they introduce a computational burden due to the need for further verification (retrieval), so we would like to reduce their probability as much as possible. Fig. 7.1 shows a typical usage case.


Figure 7.1: Answering queries from signatures: a user first makes a query to the compressed database, and upon receiving the indexes of the sequences that may possibly be similar, discards the false positives by retrieving those sequences from the original database.

This problem has been studied from an information-theoretic perspective in [25], [86] for discrete sources, and in [27] for Gaussian sources. These papers analyze the fundamental tradeoff between compression rate, sequence length, and reliability of queries performed on the compressed data. Although these limits provide a bound on the optimal performance of any given scheme, the achievability proofs are non-constructive, which raises the question of how to design such schemes in practice.

With that in mind, in this chapter we propose two schemes for this task, based in part on existing lossy compression algorithms. We show that these schemes achieve the fundamental limits in some statistical models of relevance. In addition, the proposed schemes are easy to analyze and implement, and they can work with any discrete database and any similarity measure satisfying the triangle inequality.

Variants of this problem have been previously considered in the literature. For example, the Bloom filter [87], which is restricted to exact matches, enables membership queries from compressed data. Another related notion is that of Locality-Sensitive Hashing (LSH) [88, Chapter 3], which is a framework for the nearest neighbor search (NNS) problem. The key idea of LSH is to hash points in such a way that the probability of collision is higher for points that are similar than for those that are far apart. Other methods for NNS include vector approximation files (VA-File) [89], which employ scalar quantization. An extension of this method is the so-called compression/clustering based search [90], which performs vector quantization implemented through clustering. While these techniques trade off accuracy with computational complexity and space, and false negatives are allowed, in our setting false negatives are not allowed, yet significant compression can still be achieved.

7.1 Problem Formulation and Fundamental Limits

7.1.1 Problem Description

Given two sequences $x$ and $y$ of length $n$, we measure their similarity by computing the distortion $d(x, y)$ given by $\frac{1}{n}\sum_{i=1}^{n} \rho(x_i, y_i)$, where $\rho : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}^{+}$ is an arbitrary distortion measure. We say that two sequences $x$ and $y$ are $D$-similar (or simply similar when clear from the context) when $d(x, y) \le D$.

We consider databases consisting of $M$ discrete sequences of length $n$, i.e., $\{x_i\}_{i=1}^{M}$, with $x_i = [x_{i,1}, \ldots, x_{i,n}]$. The proposed architecture generates, for each sequence $x$, a signature $T(x)$, so that the compressed database is $\{T(x_i)\}_{i=1}^{M}$. Then, given a query sequence $y$, the scheme makes the decision of whether $x$ is $D$-similar to $y$ based only on its compressed version $T(x)$, rather than on the original sequence $x$. Note that a scheme is completely defined given its signature assignment and the corresponding decision rule.

More formally, a rate-$R$ identification system $(T, g)$ consists of a signature assignment $T : \mathcal{X}^n \to [1 : 2^{nR}]$ and a decision function $g : [1 : 2^{nR}] \times \mathcal{Y}^n \to \{\mathrm{no}, \mathrm{maybe}\}$. We use the notation $\{\mathrm{no}, \mathrm{maybe}\}$ instead of $\{\mathrm{no}, \mathrm{yes}\}$ to reflect the fact that false positives are permitted, while false negatives are not. This is formalized next. A system is said to be $D$-admissible if

$$g(T(x), y) = \mathrm{maybe} \quad \forall\, x, y \ \text{s.t.}\ d(x, y) \le D. \qquad (7.1)$$

Since a $D$-admissible scheme does not produce false negatives, a natural figure of merit is the frequency at which false positives occur, which we wish to minimize.

We recall next the fundamental limits on performance in this problem, as we will refer to them in the following sections when assessing the performance of the proposed schemes.


7.1.2 Fundamental limits

Let X and Y be random vectors of length $n$, representing the sequence from the database and the query sequence, respectively. We assume X and Y are independent, with entries drawn independently from $P_X$ and $P_Y$, respectively. Define the false positive event as $\mathrm{fp} = \{g(T(\mathbf{X}), \mathbf{Y}) = \mathrm{maybe} \mid d(\mathbf{X}, \mathbf{Y}) > D\}$. For a $D$-admissible scheme,

$$P\left(g(T(\mathbf{X}), \mathbf{Y}) = \mathrm{maybe}\right) = P\left(d(\mathbf{X}, \mathbf{Y}) \le D\right) + P(\mathrm{fp})\, P\left(d(\mathbf{X}, \mathbf{Y}) > D\right). \qquad (7.2)$$

Note that $P(\mathrm{fp})$ is the only term that depends on the scheme used, as the other terms depend strictly on the probability distributions of X and Y. Hence minimizing $P(\mathrm{fp})$ over all $D$-admissible schemes is equivalent to minimizing $P(g(T(\mathbf{X}), \mathbf{Y}) = \mathrm{maybe})$. Thus, for a given $D$, the fundamental limits characterize the trade-off between the compression rate $R$ and $P(g(T(\mathbf{X}), \mathbf{Y}) = \mathrm{maybe})$.

Note that as $n \to \infty$, $P(d(\mathbf{X}, \mathbf{Y}) \le D)$ goes to one or to zero (according to whether $D$ is above or below the expected level of similarity between X and Y). The problem is non-trivial only when the event of similarity is atypical, the case on which we focus. In this case, as is evident from Eq. (7.2), $P(\mathrm{maybe}) \to 0$ iff $P(\mathrm{fp}) \to 0$.

Definition 1. For given distributions $P_X$, $P_Y$ and similarity threshold $D$, a rate $R$ is said to be $D$-achievable if there exists a sequence of rate-$R$ admissible schemes $(T^{(n)}, g^{(n)})$ such that

$$\lim_{n \to \infty} P\left(g^{(n)}\left(T^{(n)}(\mathbf{X}), \mathbf{Y}\right) = \mathrm{maybe}\right) = 0.$$

Definition 2. For a similarity threshold $D$, the identification rate $R_{\mathrm{ID}}(D)$ is the infimum of $D$-achievable rates. That is, $R_{\mathrm{ID}}(D) \triangleq \inf\{R : R \text{ is } D\text{-achievable}\}$.

For the case considered in this chapter, namely discrete sources, fixed-length signature assignment and zero false negatives, the identification rate is characterized in [86, Theorem 1] as

$$R_{\mathrm{ID}}(D) = \min_{P_{U|X} :\; \sum_{u \in \mathcal{U}} P_U(u)\, \rho\left(P_{X|U}(\cdot|u),\, P_Y\right) \ge D} I(X; U), \qquad (7.3)$$

where $U$ is any random variable with finite alphabet $\mathcal{U}$ ($|\mathcal{U}| = |\mathcal{X}| + 2$ suffices to obtain the true value of $R_{\mathrm{ID}}(D)$) that is independent of $Y$. Here $\rho(P_X, P_Y) = \min E[\rho(X, Y)]$ is a distance between distributions, with $\rho$ being the distortion under which similarity is measured, and where the minimization is with respect to all jointly distributed random variables $X, Y$ with marginal distributions $P_X$ and $P_Y$, respectively.

Finally, we define $D_{\mathrm{ID}}(R)$ as the inverse function of $R_{\mathrm{ID}}(D)$, i.e., the similarity threshold below which any similarity level can be achieved at a given rate $R$.

Characterizing the identification rate and exponent is a hard problem in general. In [25], where the variable-length coding equivalent of our setting was considered, the authors present an achievable rate (they do not consider the converse to the identification rate problem). The results of [25] for the identification exponent rely on an auxiliary random variable of unbounded cardinality, thus making the quantities uncomputable in general. For the quadratic-Gaussian case the identification rate and exponent were found in [26], [27], and for discrete binary sources and Hamming distortion they were found in [91].

7.2 Proposed schemes

We propose two practical schemes that achieve the limits introduced above in some cases. The first scheme is based on Lossy Compressors (LC) and the second one on a Type Covering lemma (TC), and they both use a decision rule based on the triangle inequality (Δ). Based on this, hereafter we refer to them as the LC-Δ and TC-Δ schemes, respectively. Note that whereas a scheme based on lossy compressors (the LC-Δ scheme) is straightforward to implement, implementing the type-covering-lemma-based scheme (the TC-Δ scheme) in practice is more challenging.

Next we introduce both schemes and analyze their optimality.

7.2.1 The LC-Δ scheme

Description

The signature of the LC-Δ scheme is based on fixed-length lossy compression algorithms. These are characterized by an encoding function $f_n : \mathcal{X}^n \to [1 : 2^{nR'}]$ and a decoding function $g_n : [1 : 2^{nR'}] \to \hat{\mathcal{X}}^n$, where $\hat{x} = g_n(f_n(x))$ denotes the reconstructed sequence. Specifically, the signature of a sequence $x$ is composed of the output $i \in [1 : 2^{nR'}]$ of the lossy compressor and the distortion between $x$ and $\hat{x}$, i.e., $T(x) = \{i, d(x, \hat{x})\}$ (see Fig. 7.2). The total rate of the system is $R = R' + \Delta R$, where $R'$ is the rate of the lossy compressor and $\Delta R$ represents the extra rate needed to represent and store the distortion value $d(x, \hat{x})$.

[Figure 7.2 is a block diagram: the sequence x is encoded by f_n(x) into an index i in [1 : 2^{nR'}], decoded by g_n(i) into x̂, and the distortion d(x, x̂) is computed; the resulting signature is T(x) = {i, d(x, x̂)}.]

Figure 7.2: Signature assignment of the LC-Δ scheme for each sequence x in the database.

Regarding the decision function $g : [1 : 2^{nR}] \times \mathcal{Y}^n \to \{\mathrm{no}, \mathrm{maybe}\}$, recall that it must satisfy Eq. (7.1). Given the signature assignment described above, the decision rule for sequence $x$ and query sequence $y$ is based on the tuple $(T(x), y) = (\{i, d(x, \hat{x})\}, y)$. Notice that $\hat{x}$ can be recovered from the signature as $g_n(i)$. The decision rule is given by

$$g(T(x), y) = \begin{cases} \mathrm{maybe}, & d(x, \hat{x}) - D \le d(\hat{x}, y) \le d(x, \hat{x}) + D; \\ \mathrm{no}, & \text{otherwise}, \end{cases} \qquad (7.4)$$

which satisfies Eq. (7.1) for any distortion measure satisfying the triangle inequality.

In an attempt to reduce the rate of the system (e.g., decrease the size of the compressed database) without affecting the performance, one can decrease the value of $\Delta R$ by quantizing the distortion $d(x, \hat{x})$. In that case, assuming $d_0 \le d(x, \hat{x}) \le d_1$,

$$g(T(x), y) = \begin{cases} \mathrm{maybe}, & d_0 - D \le d(\hat{x}, y) \le d_1 + D; \\ \mathrm{no}, & \text{otherwise}, \end{cases} \qquad (7.5)$$

which preserves the admissibility of the scheme. While $\Delta R$ can be made arbitrarily small (for $n \to \infty$), for finite $n$ there is a trade-off between its value and $P(\mathrm{maybe})$. This will become relevant in the simulations.
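To make the scheme concrete, the following Python sketch (my own illustration, not code from this work; all function names are hypothetical) shows the signature assignment and the decision rule of Eq. (7.5) for normalized Hamming distortion, with a generic fixed-length lossy compressor supplied as an encode/decode pair.

def hamming(a, b):
    """Normalized Hamming distortion between two equal-length sequences."""
    return sum(u != v for u, v in zip(a, b)) / len(a)

def make_signature(x, encode, decode, quantize=None):
    """Signature T(x) = {i, d(x, x_hat)} of the LC-Delta scheme.

    encode/decode implement a fixed-length lossy compressor; quantize, if given,
    maps the distortion to the bounds [d0, d1] of its quantization region."""
    i = encode(x)
    x_hat = decode(i)
    d = hamming(x, x_hat)
    d0, d1 = quantize(d) if quantize else (d, d)
    return {"index": i, "d0": d0, "d1": d1}

def query(signature, y, D, decode):
    """Decision rule of Eq. (7.5): a D-similar sequence can never be missed."""
    x_hat = decode(signature["index"])
    d_hat_y = hamming(x_hat, y)
    if signature["d0"] - D <= d_hat_y <= signature["d1"] + D:
        return "maybe"
    return "no"

Because the rule only widens the interval around the stored distortion by D, the triangle inequality guarantees that any sequence that is truly D-similar to the query is answered with "maybe", in line with Eq. (7.1).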


Asymptotic analysis

Recall from rate distortion theory [64] that an optimal lossy compressor with rate $R$ attains, for long enough sequences and with high probability, a distortion between $x$ and $\hat{x}$ arbitrarily close to the distortion-rate function $D(R)$. Consider also the looser decision rule $g(T(x), y) = \mathrm{no}$ if $d(\hat{x}, y) > d(x, \hat{x}) + D$. Note that the scheme is still admissible (zero false negatives) with this decision rule. Under these premises, as shown in [86], an LC-Δ scheme of rate $R$ can attain any similarity threshold below $D_{\mathrm{ID}}^{\mathrm{LC}-\Delta}(R)$, with

$$D_{\mathrm{ID}}^{\mathrm{LC}-\Delta}(R) \triangleq E[\rho(\hat{X}, Y)] - E[\rho(\hat{X}, X)] = E[\rho(\hat{X}, Y)] - D(R), \qquad (7.6)$$

where $E[\rho(\hat{X}, Y)]$ is completely determined by $P_{\hat{X}}$ (induced by the lossy compressor) and $P_Y$. Finally, let $R_{\mathrm{ID}}^{\mathrm{LC}-\Delta}(D)$ be the inverse function of $D_{\mathrm{ID}}^{\mathrm{LC}-\Delta}(R)$, i.e., the compression rate achieved for a similarity threshold $D$.

As shown in [86], for binary symmetric sources and Hamming distortion, $R_{\mathrm{ID}}(D) = R_{\mathrm{ID}}^{\mathrm{LC}-\Delta}(D)$, i.e., the scheme achieves the fundamental limit. However, the scheme is suboptimal in general, in the sense that $R_{\mathrm{ID}}(D) < R_{\mathrm{ID}}^{\mathrm{LC}-\Delta}(D)$.

7.2.2 The TC-Δ scheme

Motivation

A closer look at Eq. (7.6) suggests the following intuitive idea: in the distortion-rate case, we wish to minimize the distortion with a constraint on the mutual information. The optimization is with respect to the transition probability $P_{\hat{X}|X}$. This is in agreement with Eq. (7.6), as we also want to minimize $E[\rho(\hat{X}, X)]$. However, the quantity $E[\rho(\hat{X}, Y)]$ also depends on $P_{\hat{X}}$ (determined by $P_{\hat{X}|X}$ and $P_X$). This suggests optimizing both terms together. As shown in [86], this is possible, and the key is to use a type covering lemma (TC) to generate $\hat{x}$ (and not just the one that minimizes the distortion between $X$ and $\hat{X}$). Specifically, any similarity threshold below $D_{\mathrm{ID}}^{\mathrm{TC}-\Delta}(R)$ can be attained by a TC-Δ scheme of rate $R$, where

$$D_{\mathrm{ID}}^{\mathrm{TC}-\Delta}(R) \triangleq \max_{P_{\hat{X}|X} :\; I(X; \hat{X}) \le R} \; E[\rho(\hat{X}, Y)] - E[\rho(\hat{X}, X)]. \qquad (7.7)$$


[Figure 7.3 shows two panels of rate R (bits) versus similarity threshold D, plotting $R_{\mathrm{ID}}^{\mathrm{LC}-\Delta}(D)$, $R_{\mathrm{ID}}^{\mathrm{TC}-\Delta}(D)$, $R_{\mathrm{ID}}(D)$, and the source entropy, for X, Y ∼ Bern(0.5) (left) and X, Y ∼ Bern(0.7) (right).]

Figure 7.3: Binary sources and Hamming distortion: if $P_X = P_Y = \mathrm{Bern}(0.5)$, $R_{\mathrm{ID}}^{\mathrm{LC}-\Delta}(D) = R_{\mathrm{ID}}^{\mathrm{TC}-\Delta}(D) = R_{\mathrm{ID}}(D)$, whereas if $P_X = P_Y = \mathrm{Bern}(0.7)$, $R_{\mathrm{ID}}^{\mathrm{LC}-\Delta}(D) > R_{\mathrm{ID}}^{\mathrm{TC}-\Delta}(D) = R_{\mathrm{ID}}(D)$.

As in the previous case, we denote by $R_{\mathrm{ID}}^{\mathrm{TC}-\Delta}(D)$ the inverse function of $D_{\mathrm{ID}}^{\mathrm{TC}-\Delta}(R)$. It is easy to see that $R_{\mathrm{ID}}^{\mathrm{TC}-\Delta}(D) \le R_{\mathrm{ID}}^{\mathrm{LC}-\Delta}(D)$. Furthermore, for memoryless binary sources and Hamming distortion, $R_{\mathrm{ID}}^{\mathrm{TC}-\Delta}(D) = R_{\mathrm{ID}}(D)$, and both are strictly lower than $R_{\mathrm{ID}}^{\mathrm{LC}-\Delta}(D)$ for non-symmetric sources, the difference being particularly pronounced at low distortion, as shown in [86] (see Fig. 7.3).

The question now is how to create a practical TC-Δ scheme that achieves $R_{\mathrm{ID}}^{\mathrm{TC}-\Delta}(D)$, which will imply that the scheme achieves a smaller compression rate than an LC-Δ scheme, and that it is optimal for general binary sources and Hamming distortion. While creating a practical scheme that achieves $R_{\mathrm{ID}}^{\mathrm{LC}-\Delta}(D)$ is straightforward, how to implement a TC-Δ scheme is not clear in general. We propose a valid TC-Δ scheme, which we introduce next.

Description

Based on the previous results, for each sequence $x$ in the database we want to generate a signature assignment from which we can reconstruct a sequence $\hat{x}$ such that the joint empirical distribution of $x$ and $\hat{x}$ is equal to the one associated with the solution to the optimization problem shown in Eq. (7.7). This will imply that the scheme attains $R_{\mathrm{ID}}^{\mathrm{TC}-\Delta}(D)$, which is better than $R_{\mathrm{ID}}^{\mathrm{LC}-\Delta}(D)$, and even optimal for memoryless binary sources and Hamming distortion.


We propose a practical scheme for this task based on lossy compression algorithms. Specifically, we show that the desired distribution can be achieved by carefully choosing the distortion measure to be applied by the lossy compressor. In other words, if

$$P^*_{\hat{X}|X} = \operatorname*{argmax}_{P_{\hat{X}|X} :\; I(X; \hat{X}) \le R} \; E[\rho(\hat{X}, Y)] - E[\rho(\hat{X}, X)], \qquad (7.8)$$

we are seeking a distortion measure $\rho^*(X, \hat{X})$ such that

$$P^*_{\hat{X}|X} = \operatorname*{argmin}_{P_{\hat{X}|X} :\; I(X; \hat{X}) \le R} \; E[\rho^*(X, \hat{X})], \qquad (7.9)$$

i.e., the conditional probability induced by the lossy compressor is equal to $P^*_{\hat{X}|X}$.

We show that Eq. (7.9) holds if $\rho^*(X, \hat{X}) = \log \frac{1}{P^*_{X|\hat{X}}(X|\hat{X})}$, where $P^*_{X|\hat{X}}$ is induced from $P^*_{\hat{X}|X}$ and $P_X$, and $I(X; \hat{X}) = R$. Note that $\rho^*(X, \hat{X})$ is reminiscent of logarithmic loss [92]. This is based on the following lemma¹:

Lemma 1. Let $X \sim P_X$, and let $P_X(x) > 0$ for all $x \in \mathcal{X}$. For a channel $P_{\hat{X}|X}$, let $P_{X|\hat{X}}$ be the reversed channel, and consider a rate distortion problem with distortion measure

$$\rho(x, u) = \log \frac{1}{P_{X|\hat{X}}(x|u)}. \qquad (7.10)$$

Then, for the rate constraint $I(X; U) \le I(X; \hat{X})$, the optimal test channel $P^*_{U|X}$ is equal to $P_{\hat{X}|X}$.²

¹ Private communication, Thomas Courtade.
² Note that here $U$ denotes the reconstruction symbol, $P_{\hat{X}|X}$ the desired distribution, and the optimization is done over $P_{U|X}$, whereas in Eqs. (7.8) and (7.9) $P^*_{\hat{X}|X}$ denotes the optimal distribution and the optimization is done over $P_{\hat{X}|X}$.

Proof. First, note that

$$E[\rho(X, U)] = \sum_{x, u} P_{X,U}(x, u) \log \frac{1}{P_{X|\hat{X}}(x|u)} \qquad (7.11)$$

$$\qquad\qquad\;\; = \sum_{u} P_U(u)\, D\!\left(P_{X|U}(\cdot|u)\,\middle\|\,P_{X|\hat{X}}(\cdot|u)\right) + H(X|U), \qquad (7.12)$$

and that the rate constraint implies $H(X|U) \ge H(X|\hat{X})$. Therefore,

$$E[\rho(X, U)] \ge \sum_{u} P_U(u)\, D\!\left(P_{X|U}(\cdot|u)\,\middle\|\,P_{X|\hat{X}}(\cdot|u)\right) + H(X|\hat{X}). \qquad (7.13)$$

Thus,

$$\min_{P_{U|X} :\; I(X; U) \le I(X; \hat{X})} E[\rho(X, U)] = H(X|\hat{X}), \qquad (7.14)$$

and the minimum is attained if and only if $P_{X|U} = P_{X|\hat{X}}$.

Going back to our setting, note that the optimization problem shown in Eq. (7.7) that solves for $R_{\mathrm{ID}}^{\mathrm{TC}-\Delta}(D)$ has the constraint $I(X; \hat{X}) \le R$. The maximizing probability in Eq. (7.8) will in general achieve $I(X; \hat{X}) = R$, and thus we can apply the lemma.

Therefore, the proposed TC-Δ scheme effectively employs for the signature assignment a good lossy compressor for the distortion measure $\rho(x, \hat{x}) = \log \frac{1}{P^*_{X|\hat{X}}(x|\hat{x})}$, where $P^*_{X|\hat{X}}$ is induced by $P^*_{\hat{X}|X}$, given by Eq. (7.8), and $P_X$. With an optimal lossy compressor, and assuming $I(X; \hat{X}) = R$, the joint type of the sequences $x$ and $\hat{x}$ will be close to $P^*_{\hat{X}|X}$, which achieves $R_{\mathrm{ID}}^{\mathrm{TC}-\Delta}$ and is optimal for the case of general binary sources and Hamming distortion. In the next section we show that the performance of the proposed scheme approaches the fundamental performance limit and is notably better than that of the LC-Δ scheme.
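As an illustration of how this distortion measure would be derived in practice, the following sketch (my own, with hypothetical function names; it assumes the maximizer of Eq. (7.8) has already been computed separately) obtains the reversed channel $P^*_{X|\hat{X}}$ from $P^*_{\hat{X}|X}$ and $P_X$ via Bayes' rule and builds the corresponding log-loss distortion matrix $\rho^*(x, \hat{x})$.

import numpy as np

def log_loss_distortion(P_X, P_Xhat_given_X):
    """Build the distortion matrix rho*(x, x_hat) = -log P*_{X|Xhat}(x | x_hat)
    used by the TC-Delta scheme, from the source pmf P_X (shape (|X|,)) and a
    channel P_Xhat_given_X (shape (|X|, |Xhat|)) assumed to maximize Eq. (7.8)."""
    # Joint distribution P(x, x_hat) and marginal of the reconstruction.
    joint = P_X[:, None] * P_Xhat_given_X
    P_Xhat = joint.sum(axis=0)
    # Reversed channel P*_{X|Xhat}; columns with zero mass are left at zero.
    P_X_given_Xhat = np.divide(joint, P_Xhat[None, :],
                               out=np.zeros_like(joint), where=P_Xhat[None, :] > 0)
    # Log-loss distortion; zero-probability entries receive infinite distortion.
    with np.errstate(divide="ignore"):
        return -np.log(P_X_given_Xhat)

# Illustrative values only: a Bern(0.7) source and a hypothetical test channel.
P_X = np.array([0.3, 0.7])
P_Xhat_given_X = np.array([[0.9, 0.1],
                           [0.2, 0.8]])
rho_star = log_loss_distortion(P_X, P_Xhat_given_X)

The design choice mirrors Lemma 1: under this distortion measure, an optimal lossy compressor operating at rate R induces (approximately) the desired conditional type $P^*_{\hat{X}|X}$.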

7.3 Simulation results

In this section we examine the performance of both the LC-Δ and the TC-Δ schemes. We consider datasets composed of $M$ binary sequences of length $n$, and Hamming distortion for computing the similarity between sequences. We generate the sequences in the database as $\mathbf{X} \sim \prod_{i=1}^{n} P_X(x_i)$, with $P_X = \mathrm{Bern}(p)$. These sequences are independent of the query sequences, generated as $\mathbf{Y} \sim \prod_{i=1}^{n} P_Y(y_i)$, with $P_Y = \mathrm{Bern}(q)$. With these assumptions, for each sequence $x_i$ in the database, $i \in [1 : M]$, given its signature $T(x_i)$, we can compute the probability that $g(T(x_i), y) = \mathrm{maybe}$ (for a similarity threshold $D$),


denoted by $P(\mathrm{maybe} \mid T(x_i))$, analytically, with the following formula:

$$P(\mathrm{maybe} \mid T(x_i)) = \sum_{d = \lceil n(d_0 - D) \rceil}^{\lfloor n(d_1 + D) \rfloor} \; \sum_{i=0}^{d} \binom{n_0}{i} \binom{n - n_0}{d - i}\, q^{\,n - n_0 - d + 2i}\, (1 - q)^{\,n_0 + d - 2i}, \qquad (7.15)$$

where $n_0$ denotes the number of zeros of $\hat{x}_i$, and $d_0$ and $d_1$ are the delimiters of the decision region to which $d(x_i, \hat{x}_i)$ belongs. If no quantization is applied, $d_0 = d_1 = d(x_i, \hat{x}_i)$. Finally, we compute the probability of maybe for the database as the average over all the sequences it contains, i.e., $P(\mathrm{maybe}) = \frac{1}{M} \sum_{i=1}^{M} P(\mathrm{maybe} \mid T(x_i))$. Note that we want this probability to be as small as possible.
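As a sanity check on Eq. (7.15), the following Python sketch (an illustration of my own, not code accompanying this work) evaluates $P(\mathrm{maybe} \mid T(x_i))$ directly from the formula, given the reconstruction $\hat{x}_i$, the decision region $[d_0, d_1]$, the similarity threshold $D$, and the query parameter $q$.

from math import comb, ceil, floor

def prob_maybe(x_hat, d0, d1, D, q):
    """Evaluate Eq. (7.15): probability that an i.i.d. Bern(q) query triggers
    'maybe' for a database sequence whose reconstruction is x_hat and whose
    stored (normalized Hamming) distortion lies in the region [d0, d1]."""
    n = len(x_hat)
    n0 = sum(1 for s in x_hat if s == 0)      # number of zeros of x_hat
    lo = max(0, ceil(n * (d0 - D)))
    hi = min(n, floor(n * (d1 + D)))
    total = 0.0
    for d in range(lo, hi + 1):               # Hamming distance between x_hat and y
        for i in range(d + 1):                # disagreements located on zero positions
            if i <= n0 and d - i <= n - n0:
                total += (comb(n0, i) * comb(n - n0, d - i)
                          * q ** (n - n0 - d + 2 * i) * (1 - q) ** (n0 + d - 2 * i))
    return total

Averaging this quantity over the M signatures of the database gives the reported P(maybe).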

Regarding the quantization of $d(x, \hat{x})$, we approximate the distribution of $d(X, \hat{X})$ as a Gaussian $\mathcal{N}(\mu, \sigma^2)$, where $\mu$ and $\sigma^2$ are computed empirically (for each rate). We then use the k-means algorithm to find the $2^k$ decision regions ($\Delta R = k/n$, i.e., $k$ bits are allocated for the description of the quantized distortion). Thus, for each distortion, we store only the decision region to which it belongs.

Finally, we also analyze the performance of the LC-Δ scheme when applied to q-ary sources.

7.3.1 Binary symmetric sources and Hamming distortion

Note that the performances of the LC-Δ and the TC-Δ schemes are equivalent in this case. For the analysis, we consider a dataset composed of M = 1000 binary sequences of length n = 512, with p = q = 0.5. As the fixed-length lossy compression algorithm, we use a binary-Hamming version of the successive refinement compression scheme [93].

Regarding the quantization of $d(x, \hat{x})$, there exists a tradeoff between the quantization level and the probability of maybe. Fig. 7.4(a) shows the results for different quantization levels (denoted by k) and a similarity threshold D = 0.20 (i.e., 80% similarity). As expected, no single value of k performs better than the others for every overall compression rate R. Therefore, in the subsequent figures the presented results correspond to the best value of k for each rate. As can be observed, we can reduce the size of the database by 76% (R = 0.24) and retrieve on average 1% of the sequences per query. With a 70% reduction we can get a $P(\mathrm{maybe})$ of $10^{-4}$ (on average one sequence in every 10,000 is retrieved). One can get even more compression with the same $P(\mathrm{maybe})$ for lower values of D. For example, 95% compression with a 1% average retrieval rate is achieved for D = 0.05.

[Figure 7.4(a) plots P(maybe) versus rate R (bits) for no quantization and quantization levels k = 1 through 8, with the identification rate marked; Figure 7.4(b) plots P(fp) versus P(fn) for the LSH AND-OR and OR-AND schemes and the LC-Δ scheme.]

Figure 7.4: Binary symmetric sequences and similarity threshold D = 0.2: (a) performance of the proposed architecture with quantized distortion; (b) comparison with LSH for rate R = 0.3.

Finally, we include a comparison with LSH [88]. We use the accepted family of functions $\mathcal{H} = \{h_i, i \in [1 : n]\}$, with $h_i(x) = x(i)$, the $i$th coordinate of $x$, and consider both the AND-OR and the OR-AND constructions described in [88, Chapter 3]. Note that the comparison is not completely fair, as LSH allows false negatives (fn's), compresses the query sequence, and its design is not optimized for the problem considered in this chapter. This is reflected in Fig. 7.4(b), where we show the achievable probabilities of fn's and false positives (fp's) for both LSH constructions, considering the database introduced above and rate R = 0.3. As can be observed, with LSH it is not possible to have both probabilities go to zero at the same time, whereas the proposed scheme achieves, for the same rate, a P(fp) close to $10^{-4}$ with zero fn's.

7.3.2 General binary sources and Hamming distortion

We compare the performance of the TC-Δ and LC-Δ schemes, assuming $P_X = P_Y = \mathrm{Bern}(p)$, with $p \ne 0.5$. For a fair comparison, we simulate both schemes with the lossy compressor presented in [94], which allows us to specify the distortion measure to be used.

= Bern(p), with p ”= 0.5. For a fair comparison, we simulate both schemes with thelossy compressor presented in [94], that allows us to specify the distortion to be used. The


[Figure 7.5 shows four panels of P(maybe) versus rate R (bits): X, Y ∼ Bern(0.7) with D = 0.05 and D = 0.10, and X, Y ∼ Bern(0.8) with D = 0.05 and D = 0.10. Each panel compares the LC-Δ and TC-Δ schemes and their analytical approximations.]

Figure 7.5: Performance of the proposed schemes for sequences of length n = 512, similarity thresholds D = {0.05, 0.1} and $P_X = P_Y = \mathrm{Bern}(0.7)$ and $\mathrm{Bern}(0.8)$.

The LC-Δ scheme uses Hamming distortion, whereas the TC-Δ scheme uses the distortion measure given by Eq. (7.10), with $P_{X|\hat{X}}$ computed from $P_X$ and $P_{\hat{X}|X}$ as defined in Eq. (7.8) (for each rate). Note that this distortion measure is used only by the lossy compressor. The decision rule $g(T(x), y)$ in both schemes still uses Hamming distortion to measure similarity between sequences, and the triangle inequality property for computing the decision threshold.

We show simulation results in Fig. 7.5 for a dataset composed of M = 1000 sequences of length n = 512, and $P_X = P_Y = \mathrm{Bern}(0.7)$ and $\mathrm{Bern}(0.8)$. We also plot the three rates ($R_{\mathrm{ID}} = R_{\mathrm{ID}}^{\mathrm{TC}-\Delta} < R_{\mathrm{ID}}^{\mathrm{LC}-\Delta}$) and an approximation for each scheme, computed as follows. For a given rate R, the approximation for the LC-Δ scheme assumes $P_{\hat{X}|X}$ is given by $\operatorname{argmin}_{P_{\hat{X}|X} :\, I(X;\hat{X}) \le R} E[\rho(\hat{X}, X)]$ (the rate distortion optimization problem), with $\rho$ representing Hamming distortion. On the other hand, for the TC-Δ scheme, $P_{\hat{X}|X}$ is assumed to be equal to that of Eq. (7.8). We then compute the $P(\mathrm{maybe})$ of each scheme using Eq. (7.15), with $d_0 = d_1 = E[\rho(\hat{X}, X)]$, with $\rho$ representing Hamming distortion, and $n_0 = n P_{\hat{X}}(\hat{x} = 0)$.

As can be observed, the TC-Δ scheme performs better than the LC-Δ scheme in all cases, as suggested by the theory. For example, for $X, Y \sim \mathrm{Bern}(0.7)$, D = 0.05 and R = 0.13 (87% compression), while the LC-Δ scheme achieves $P(\mathrm{maybe}) = 10^{-4}$, the TC-Δ scheme achieves $10^{-5}$. Similarly, for D = 0.1 and R = 0.2, the $P(\mathrm{maybe})$ decreases from $10^{-3}$ to $10^{-4}$, i.e., on average 1 sequence in every 10,000 is retrieved instead of 1 in every 1,000. For the case $X, Y \sim \mathrm{Bern}(0.8)$ we observe similar results. With D = 0.05 (95% similarity) and $P(\mathrm{maybe}) = 10^{-2}$, the TC-Δ scheme attains 93% compression (R = 0.07), whereas the LC-Δ scheme achieves only 84% compression (R = 0.16), i.e., a reduction in rate of 55%. Furthermore, R = 0.07 is close to $R_{\mathrm{ID}}^{\mathrm{LC}-\Delta}$. Similarly, for D = 0.1 and $P(\mathrm{maybe}) = 10^{-4}$ the decrease in rate is from 0.35 to 0.3 bits, which represents an improvement in compression of 14.2%. Finally, notice that for a given rate, the smaller the similarity threshold D, the smaller the $P(\mathrm{maybe})$.

7.3.3 q-ary sources and Hamming distortion

The LC-Δ scheme can be easily extended to the case of q-ary sources. Note that the decision rule of Eq. (7.4) still applies in this case. One important example of a source for which this scheme would be of special relevance is DNA data, where the alphabet is of size four, {A, C, G, T}.

We consider a database composed of M = 1000 i.i.d. uniform 4-ary sequences of length n = 100, and apply the proposed architecture with the lossy compression algorithm presented in [94]. To see how the scheme works on real data, we generate a database composed of 1000 DNA sequences of length 100, taken from BIOZON [2]. The empirical distribution is given by $p_A = 0.25$, $p_C = 0.23$, $p_G = 0.29$, $p_T = 0.23$. We emphasize that the proposed architecture makes the scheme D-admissible independently of the probabilistic model behind the sequences of the database, if any. We consider i.i.d. and uniformly distributed query sequences to compute the probability of maybe in both cases.

The results for both datasets are shown in Fig. 7.6. As can be observed, the performance on the simulated data and on the DNA dataset is very similar. We present some results for the DNA database. For D = 0.1, we get a probability of maybe of 0.001 with a reduction in size of 83.5% (R = 0.33). For D = 0.2 and R = 0.47, we get a probability of maybe of 0.01.


[Figure 7.6 shows Pr{maybe} (logarithmic scale, 10^{-8} to 10^{0}) versus R [bits], with one panel for D = 0.1 and one for D = 0.2; each panel contains curves for R_ID(D), the approximation, the simulated data, and the DNA data.]

Figure 7.6: Performance of the LC-β scheme for D = {0.1, 0.2} applied to two databases composed of 4-ary sequences: one generated uniformly i.i.d. and the other comprised of real DNA sequences from [2].

7.4 Conclusion

In this chapter we have investigated schemes for compressing a database so that similarity queries can be performed efficiently on the compressed database. These schemes are of practical interest in genetics, where it is important to find sequences that are similar. The fundamental limits for this problem have been characterized in past work, and they serve as the basis for performance evaluation.

Specifically, we have introduced two schemes for this task, the LC-β and the TC-β schemes, both based on lossy compression algorithms. The performance of the LC-β scheme, although close to the fundamental limits in some cases (e.g., binary symmetric sources and Hamming distortion), is suboptimal in general. The TC-β scheme builds upon the previous one and achieves a better compression rate in many cases. For example, for general memoryless binary sequences and Hamming distortion, the TC-β scheme exhibits, on simulated data, performance approaching the fundamental limits, substantially improving over the LC-β scheme.


The TC-β scheme is also based on lossy compression algorithms, but in this case the distortion measure to be applied by the lossy compressor is judiciously designed: it is not the Hamming distortion, despite the fact that similarity for the query is measured under Hamming. Finally, both schemes are applicable to any discrete database and similarity measure satisfying the triangle inequality.


Chapter 8

Conclusions

This dissertation has been motivated by the ever-growing amount of genomic data that is being generated. These data must be stored, processed, and analyzed, which poses significant challenges. To partially address this issue, in this thesis we have investigated methods to ease the storage and distribution of genomic data, as well as methods to facilitate access to these data in databases.

Specifically, we have designed lossless and lossy compression schemes for genomic data, improving upon the previously proposed compressors. Lossy compressors have traditionally been analyzed in terms of their rate-distortion performance. However, since genomic data is used for biological inference, an analysis of the effect of lossy compression on downstream analyses is necessary in this case. With that in mind, we have also presented an extensive analysis of the effect that lossy compression of genomic data has on variant calling, one of the most important analyses performed in practice. The results provided in this thesis show that lossy compression can significantly reduce the size of the data while providing a performance on variant calling that is comparable to that obtained with the losslessly compressed data. Moreover, these results show that in some cases lossy compression can lead to inference that improves upon the uncompressed data, which suggests that the data could be denoised. In that regard, we have also proposed the first denoising scheme for this type of data, and demonstrated improved inference with the denoised data. In addition, we have shown that it is possible to do so while reducing the size of the data beyond its lossless limit.


Finally, we have proposed two schemes to compress sequences in a database such that similarity queries can still be performed in the compressed domain. The proposed schemes achieve significant compression, allowing the database to be replicated in several locations and thus providing easier and faster access to the data.

There are several interesting research directions for future work related to the results presented in this thesis. We conclude by listing some of them.

• Genomic data compression: There remain several challenges in genomic data compression. In particular, the compressors may need to incorporate important features beyond end-to-end compression. For example, it may be desirable to trade off compression performance for random access capabilities. Random access may be necessary in applications where one is interested in accessing only one part of the genome, while avoiding the need to decompress the whole file. In addition, being able to perform operations in the compressed domain can speed up the analyses performed on the data, especially as the size of the data grows.

• Error correction: Current sequencing technologies are imperfect and, as a result, the reads contained in the sequencing files contain errors. Correcting these errors can improve the analyses performed on the data by downstream applications, which will generate more accurate results. For example, one could use a Bayesian framework and ideas from coding theory to model the process that generates the reads, and develop a robust approach for correcting these errors based on this model.

• Boosting downstream analyses: There are many downstream applications that use genomic data, each with a different purpose. Currently, several algorithms exist for any given downstream analysis. These algorithms rarely produce the same results when applied to the same data, which can be attributed to the lack of accurate models and the compromises made in favor of complexity reduction. An example is the set of algorithms used to identify variants, which serve important applications such as medical decision making. Improving the variant callers can therefore have significant impact in practice. It would be desirable to design algorithms for this task guided by sound theory, with performance guarantees, and to extend this line of research to other important downstream applications beyond variant calling.


• Compression schemes for similarity queries: There are several applications that may require similarity queries based on a similarity metric other than Hamming. For example, in genomics, similarity between genomes is often measured with the edit distance, which accounts for substitutions, insertions, and deletions (a standard dynamic program for this distance is sketched below). Extending these types of schemes to such metrics would therefore increase their practicality. In addition, it would be desirable to be able to perform queries directly on the compressed data, without the need to decompress the whole sequences.
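For reference, the edit (Levenshtein) distance mentioned above can be computed with the classical dynamic program; the following minimal sketch is included only to make the metric concrete and is not part of the proposed schemes.

def edit_distance(a, b):
    """Levenshtein distance between sequences a and b: the minimum number of
    substitutions, insertions, and deletions needed to transform a into b."""
    m, n = len(a), len(b)
    # dp[j] holds the distance between the current prefix of a and b[:j].
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev_diag, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1   # substitution cost
            prev_diag, dp[j] = dp[j], min(dp[j] + 1,          # deletion
                                          dp[j - 1] + 1,      # insertion
                                          prev_diag + cost)   # substitution/match
    return dp[n]

# Example: one deletion transforms "ACGT" into "AGT".
print(edit_distance("ACGT", "AGT"))  # 1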


Bibliography

[1] Heng Li, Bob Handsaker, Alec Wysoker, Tim Fennell, Jue Ruan, Nils Homer, Gabor Marth, Goncalo Abecasis, Richard Durbin, et al. The sequence alignment/map format and samtools. Bioinformatics, 25(16):2078–2079, 2009.

[2] Aaron Birkland and Golan Yona. Biozon: a system for unification, management and analysis of heterogeneous biological data. BMC bioinformatics, 7(1):1, 2006.

[3] Eric S Lander, Lauren M Linton, Bruce Birren, Chad Nusbaum, Michael C Zody, Jennifer Baldwin, Keri Devon, Ken Dewar, Michael Doyle, William FitzHugh, et al. Initial sequencing and analysis of the human genome. Nature, 409(6822):860–921, 2001.

[4] Zachary D Stephens, Skylar Y Lee, Faraz Faghri, Roy H Campbell, et al. Big data: Astronomical or genomical? PLoS Biol, 13(7):e1002195, 2015.

[5] C Re, A Ro, and A Re. Will computers crash genomics? Science, 5:1190, 2010.

[6] HPJ Buermans and JT Den Dunnen. Next generation sequencing technology: advances and applications. Biochimica et Biophysica Acta (BBA) - Molecular Basis of Disease, 1842(10):1932–1941, 2014.

[7] Peter JA Cock, Christopher J Fields, Naohisa Goto, Michael L Heuer, and Peter M Rice. The sanger fastq file format for sequences with quality scores, and the solexa/illumina fastq variants. Nucleic acids research, 38(6):1767–1771, 2010.


[8] Marc Lohse, Anthony Bolger, Axel Nagel, Alisdair R Fernie, John E Lunn, Mark Stitt, and Bjorn Usadel. Robina: a user-friendly, integrated software solution for rna-seq-based transcriptomics. Nucleic acids research, page gks540, 2012.

[9] Murray P Cox, Daniel A Peterson, and Patrick J Biggs. Solexaqa: At-a-glance quality assessment of illumina second-generation sequencing data. BMC bioinformatics, 11(1):485, 2010.

[10] Heng Li, Jue Ruan, and Richard Durbin. Mapping short dna sequencing reads and calling variants using mapping quality scores. Genome research, 18(11):1851–1858, 2008.

[11] Ben Langmead, Cole Trapnell, Mihai Pop, and Steven L Salzberg. Ultrafast and memory-efficient alignment of short dna sequences to the human genome. Genome biology, 10(3):1, 2009.

[12] Heng Li and Richard Durbin. Fast and accurate short read alignment with burrows–wheeler transform. Bioinformatics, 25(14):1754–1760, 2009.

[13] Gerton Lunter and Martin Goodson. Stampy: a statistical algorithm for sensitive and fast mapping of illumina sequence reads. Genome research, 21(6):936–939, 2011.

[14] Aaron McKenna et al. The genome analysis toolkit: a mapreduce framework for analyzing next-generation DNA sequencing data. Genome research, 2010.

[15] Jinghui Zhang, David A Wheeler, Imtiaz Yakub, Sharon Wei, Raman Sood, William Rowe, Paul P Liu, Richard A Gibbs, and Kenneth H Buetow. Snpdetector: a software tool for sensitive and accurate snp detection. PLoS Comput Biol, 1(5):e53, 2005.

[16] Petr Danecek, Adam Auton, Goncalo Abecasis, Cornelis A Albers, Eric Banks, Mark A DePristo, Robert E Handsaker, Gerton Lunter, Gabor T Marth, Stephen T Sherry, et al. The variant call format and vcftools. Bioinformatics, 27(15):2156–2158, 2011.

[17] James K Bonfield and Matthew V Mahoney. Compression of fastq and sam format sequencing data. PloS one, 8(3):e59190, 2013.


[18] Zexuan Zhu, Yongpeng Zhang, Zhen Ji, Shan He, and Xiao Yang. High-throughput dna sequence data compression. Briefings in Bioinformatics, page bbt087, 2013.

[19] Sebastian Deorowicz and Szymon Grabowski. Data compression for sequencing data. Algorithms for Molecular Biology, 8(1):1, 2013.

[20] Idoia Ochoa, Himanshu Asnani, Dinesh Bharadia, Mainak Chowdhury, Tsachy Weissman, and Golan Yona. Qualcomp: a new lossy compressor for quality scores based on rate distortion theory. BMC bioinformatics, 14(1):1, 2013.

[21] Rodrigo Canovas, Alistair Moffat, and Andrew Turpin. Lossy compression of quality scores in genomic data. Bioinformatics, 30(15):2130–2136, 2014.

[22] Greg Malysa, Mikel Hernaez, Idoia Ochoa, Milind Rao, Karthik Ganesan, and Tsachy Weissman. Qvz: lossy compression of quality values. Bioinformatics, page btv330, 2015.

[23] Y William Yu, Deniz Yorukoglu, and Bonnie Berger. Traversing the k-mer landscape of ngs read datasets for quality score sparsification. In Research in Comp. Molecular Bio., 2014.

[24] Idoia Ochoa, Mikel Hernaez, Rachel Goldfeder, Tsachy Weissman, and Euan Ashley. Effect of lossy compression of quality scores on variant calling. Briefings in Bioinformatics, page doi: 10.1093/bib/bbw011, 2016.

[25] Rudolf Ahlswede, E-h Yang, and Zhen Zhang. Identification via compressed data. IEEE Transactions on Information Theory, 43(1):48–70, 1997.

[26] Amir Ingber, Thomas Courtade, and Tsachy Weissman. Quadratic similarity queries on compressed data. In Data Compression Conference (DCC), 2013, pages 441–450. IEEE, 2013.

[27] Amir Ingber, Thomas Courtade, and Tsachy Weissman. Compression for quadratic similarity queries. IEEE Transactions on Information Theory, 61(5):2729–2747, 2015.


[28] Idoia Ochoa, Mikel Hernaez, and Tsachy Weissman. iDoComp: a compression scheme for assembled genomes. Bioinformatics, page btu698, 2014.

[29] Idoia Ochoa, Mikel Hernaez, and Tsachy Weissman. Aligned genomic data compression via improved modeling. Journal of bioinformatics and computational biology, 12(06), 2014.

[30] Mikel Hernaez, Idoia Ochoa, Rachel Goldfeder, Tsachy Weissman, and Euan Ashley. A cluster-based approach to compression of quality scores. Submitted to the Data Compression Conference (DCC), 2015.

[31] Idoia Ochoa, Mikel Hernaez, Rachel Goldfeder, Tsachy Weissman, and Euan Ashley. Denoising of quality scores for boosted inference and reduced storage. Submitted to the Data Compression Conference (DCC), 2015.

[32] Idoia Ochoa, Amir Ingber, and Tsachy Weissman. Efficient similarity queries via lossy compression. In Allerton, pages 883–889, 2013.

[33] Idoia Ochoa, Amir Ingber, and Tsachy Weissman. Compression schemes for similarity queries. In Data Compression Conference (DCC), 2014, pages 332–341. IEEE, 2014.

[34] Stephane Grumbach and Fariza Tahi. A new challenge for compression algorithms: genetic sequences. Information Processing & Management, 30(6):875–886, 1994.

[35] Xin Chen, Ming Li, Bin Ma, and John Tromp. Dnacompress: fast and effective dna sequence compression. Bioinformatics, 18(12):1696–1698, 2002.

[36] Minh Duc Cao, Trevor I Dix, Lloyd Allison, and Chris Mears. A simple statistical algorithm for biological sequence compression. In Data Compression Conference, 2007. DCC’07, pages 43–52. IEEE, 2007.

[37] Scott Christley, Yiming Lu, Chen Li, and Xiaohui Xie. Human genomes as email attachments. Bioinformatics, 25(2):274–275, 2009.


[38] Marty C Brandon, Douglas C Wallace, and Pierre Baldi. Data structures and compression algorithms for genomic sequence data. Bioinformatics, 25(14):1731–1738, 2009.

[39] Dmitri S Pavlichin, Tsachy Weissman, and Golan Yona. The human genome contracts again. Bioinformatics, page btt362, 2013.

[40] Shanika Kuruppu, Simon J Puglisi, and Justin Zobel. Relative lempel-ziv compression of genomes for large-scale storage and retrieval. In String Processing and Information Retrieval, pages 201–206. Springer, 2010.

[41] Shanika Kuruppu, Simon J Puglisi, and Justin Zobel. Optimized relative lempel-ziv compression of genomes. In Proceedings of the Thirty-Fourth Australasian Computer Science Conference - Volume 113, pages 91–98. Australian Computer Society, Inc., 2011.

[42] Sebastian Deorowicz and Szymon Grabowski. Robust relative compression of genomes with random access. Bioinformatics, 27(21):2979–2986, 2011.

[43] Congmao Wang and Dabing Zhang. A novel compression tool for efficient storage of genome resequencing data. Nucleic acids research, 39(7):e45–e45, 2011.

[44] Armando J Pinho, Diogo Pratas, and Sara P Garcia. Green: a tool for efficient compression of genome resequencing data. Nucleic acids research, 40(4):e27–e27, 2012.

[45] BG Chern, Idoia Ochoa, Alexandros Manolakos, Albert No, Kartik Venkat, and Tsachy Weissman. Reference based genome compression. In Information Theory Workshop (ITW), 2012 IEEE, pages 427–431. IEEE, 2012.

[46] Sebastian Wandelt and Ulf Leser. Adaptive efficient compression of genomes. Algorithms for Molecular Biology, 7(1):1, 2012.

[47] Sebastian Deorowicz, Agnieszka Danek, and Szymon Grabowski. Genome compression: a novel approach for large collections. Bioinformatics, page btt460, 2013.


[48] Sebastian Wandelt and Ulf Leser. Fresco: Referential compression of highly similar sequences. Computational Biology and Bioinformatics, IEEE/ACM Transactions on, 10(5):1275–1288, 2013.

[49] Dan Gusfield. Algorithms on strings, trees and sequences: computer science and computational biology. Cambridge university press, 1997.

[50] Markus Hsi-Yang Fritz, Rasko Leinonen, Guy Cochrane, and Ewan Birney. Efficient storage of high throughput dna sequencing data using reference-based compression. Genome research, 21(5):734–740, 2011.

[51] Fabien Campagne, Kevin C Dorff, Nyasha Chambwe, James T Robinson, and Jill P Mesirov. Compression of structured high-throughput sequencing data. PloS one, 8(11):e79871, 2013.

[52] Daniel C Jones, Walter L Ruzzo, Xinxia Peng, and Michael G Katze. Compression of next-generation sequencing reads aided by highly efficient de novo assembly. Nucleic acids research, 40(22):e171–e171, 2012.

[53] Khalid Sayood. Introduction to data compression. Newnes, 2012.

[54] Ben Langmead and Steven L Salzberg. Fast gapped-read alignment with bowtie 2. Nature methods, 9(4):357–359, 2012.

[55] Shreepriya Das and Haris Vikalo. Onlinecall: fast online parameter estimation and base calling for illumina’s next-generation sequencing. Bioinformatics, 28(13), 2012.

[56] Heng Li. A statistical framework for snp calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics, 27(21):2987–2993, 2011.

[57] Mark A DePristo, Eric Banks, et al. A framework for variation discovery and genotyping using next-generation dna sequencing data. Nature genetics, 43(5), 2011.

[58] Christos Kozanitis, Chris Saunders, Semyon Kruglyak, Vineet Bafna, and George Varghese. Compressing genomic sequence fragments using slimgene. Journal of Computational Biology, 18(3):401–413, 2011.


[59] Raymond Wan, Vo Ngoc Anh, and Kiyoshi Asai. Transformations for the compression of fastq quality scores of next-generation sequencing data. Bioinformatics, 28(5):628–635, 2012.

[60] Faraz Hach, Ibrahim Numanagic, Can Alkan, and S Cenk Sahinalp. Scalce: boosting sequence compression algorithms using locally consistent encoding. Bioinformatics, 28(23):3051–3057, 2012.

[61] Lilian Janin, Giovanna Rosone, and Anthony J Cox. Adaptive reference-free compression of sequence quality scores. Bioinformatics, page btt257, 2013.

[62] Y William Yu, Deniz Yorukoglu, Jian Peng, and Bonnie Berger. Quality score compression improves genotyping accuracy. Nature biotechnology, 33(3):240–243, 2015.

[63] Łukasz Roguski and Sebastian Deorowicz. Dsrc2: industry-oriented compression of fastq files. Bioinformatics, 30(15):2213–2215, 2014.

[64] Thomas M Cover and Joy A Thomas. Elements of information theory. John Wiley & Sons, 2012.

[65] Amos Lapidoth. On the role of mismatch in rate distortion theory. IEEE Trans. Inf. Theory, 43(1):38–47, 1997.

[66] Stuart Lloyd. Least squares quantization in pcm. IEEE transactions on information theory, 28(2):129–137, 1982.

[67] James MacQueen et al. Some methods for classification and analysis of multivariate observations. In Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, volume 1, pages 281–297. Oakland, CA, USA., 1967.

[68] Kevin P Murphy. Machine learning: a probabilistic perspective. MIT press, 2012.

[69] Michael D Linderman, Tracy Brandt, Lisa Edelmann, Omar Jabado, Yumi Kasai, Ruth Kornreich, Milind Mahajan, Hardik Shah, Andrew Kasarskis, and Eric E Schadt. Analytical validation of whole exome and whole genome sequencing for clinical applications. BMC medical genomics, 7(1):1, 2014.


[70] Xiangtao Liu, Shizhong Han, Zuoheng Wang, Joel Gelernter, and Bao-Zhu Yang. Variant callers for next-generation sequencing data: a comparison study. PloS one, 8(9):e75619, 2013.

[71] Xiaoqing Yu and Shuying Sun. Comparing a few snp calling algorithms using low-coverage sequencing data. BMC bioinformatics, 14(1):1, 2013.

[72] Jason O’Rawe, Tao Jiang, Guangqing Sun, Yiyang Wu, Wei Wang, Jingchu Hu, Paul Bodily, Lifeng Tian, Hakon Hakonarson, W Evan Johnson, et al. Low concordance of multiple variant-calling pipelines: practical implications for exome and genome sequencing. Genome medicine, 5(3):1, 2013.

[73] Hugo YK Lam, Michael J Clark, Rui Chen, Rong Chen, Georges Natsoulis, Maeve O’Huallachain, Frederick E Dewey, Lukas Habegger, Euan A Ashley, Mark B Gerstein, et al. Performance comparison of whole-genome sequencing platforms. Nature biotechnology, 30(1):78–82, 2012.

[74] Geraldine A Auwera, Mauricio O Carneiro, et al. From fastq data to high-confidence variant calls: the genome analysis toolkit best practices pipeline. Current Protocols in Bioinformatics, pages 11–10, 2013.

[75] Andy Rimmer, Hang Phan, Iain Mathieson, et al. Integrating mapping-, assembly- and haplotype-based approaches for calling variants in clinical sequencing applications. Nature genetics, 46(8):912–918, 2014.

[76] Heng Li. Aligning sequence reads, clone sequences and assembly contigs with bwa-mem. arXiv preprint arXiv:1303.3997, 2013.

[77] Cornelis A Albers, Gerton Lunter, et al. Dindel: accurate indel calls from short-read data. Genome research, 21(6):961–973, 2011.

[78] Erik Garrison and Gabor Marth. Haplotype-based variant detection from short-read sequencing. arXiv preprint arXiv:1207.3907, 2012.

[79] Justin M Zook, Brad Chapman, Jason Wang, David Mittelman, Oliver Hofmann, Winston Hide, and Marc Salit. Integrating human sequence data sets provides a resource of benchmark snp and indel genotype calls. Nature biotechnology, 32(3):246–251, 2014.

[80] Weichun Huang, Leping Li, Jason R Myers, and Gabor T Marth. Art: a next-generation sequencing read simulator. Bioinformatics, 28(4):593–594, 2012.

[81] Frederick E Dewey, Rong Chen, Sergio P Cordero, Kelly E Ormond, Colleen Caleshu, Konrad J Karczewski, Michelle Whirl-Carrillo, Matthew T Wheeler, Joel T Dudley, Jake K Byrnes, et al. Phased whole-genome genetic risk in a family quartet using a major allele reference sequence. PLoS Genet, 7(9):e1002280, 2011.

[82] Wei-Chun Kao, Kristian Stevens, et al. Bayescall: A model-based base-calling algorithm for high-throughput short-read sequencing. Genome research, 19(10), 2009.

[83] Tsachy Weissman and Erik Ordentlich. The empirical distribution of rate-constrained source codes. IEEE Trans. Inf. Theory, 51, 2005.

[84] Himanshu Asnani, Ilan Shomorony, et al. Network compression: Worst-case analysis. In Inf. Theory Proceedings (ISIT), 2013 IEEE Intern. Symp. on, pages 196–200, 2013.

[85] Shirin Jalali and Tsachy Weissman. Denoising via MCMC-based lossy compression. IEEE Trans. Signal Process., 60(6), 2012.

[86] Amir Ingber and Tsachy Weissman. The minimal compression rate for similarity identification. arXiv preprint arXiv:1312.2063, 2013.

[87] Burton H Bloom. Space/time trade-offs in hash coding with allowable errors. Communications of the ACM, 13(7):422–426, 1970.

[88] Anand Rajaraman and Jeffrey David Ullman. Mining of massive datasets. Cambridge University Press, 2012.

[89] Sunil Arya, David M Mount, Nathan S Netanyahu, Ruth Silverman, and Angela Y Wu. An optimal algorithm for approximate nearest neighbor searching fixed dimensions. Journal of the ACM (JACM), 45(6):891–923, 1998.


[90] Sharadh Ramaswamy and Kenneth Rose. Adaptive cluster distance bounding for high-dimensional indexing. IEEE Transactions on Knowledge and Data Engineering, 23(6):815–830, 2011.

[91] Amir Ingber, Thomas Courtade, and Tsachy Weissman. Compression for exact match identification. In Information Theory Proceedings (ISIT), 2013 IEEE International Symposium on, pages 654–658. IEEE, 2013.

[92] Thomas A Courtade and Tsachy Weissman. Multiterminal source coding under logarithmic loss. In Information Theory Proceedings (ISIT), 2012 IEEE International Symposium on, pages 761–765. IEEE, 2012.

[93] Ramji Venkataramanan, Tuhin Sarkar, and Sekhar Tatikonda. Lossy compression via sparse linear regression: Computationally efficient encoding and decoding. IEEE Transactions on Information Theory, 60(6):3265–3278, 2014.

[94] Ankit Gupta and Sergio Verdu. Nonlinear sparse-graph codes for lossy compression. IEEE Transactions on Information Theory, 55(5):1961–1975, 2009.