Computer Science Basics: Compression of Files · 2021. 1. 11.
-
Computer Science Basics
Compression of Files
Emmanuel Benoist
Fall Term 2020/2021
Berner Fachhochschule | Haute école spécialisée bernoise | Berne University of Applied Sciences 1
-
Encoding and Compression of Files
• Last Week
• Lossless compressions
    RLE Run-Length Encoding
    Huffman
    LZW - Lempel-Ziv-Welch
• Conclusion
-
Last Week
-
Character Encoding
UTF-8 is the standard now
  Used on web sites
Problem for databases
  UTF-8 characters do not all have the same length (1 to 4 bytes)
  Awkward to store in fixed-width database columns
  MySQL / MariaDB: the legacy utf8 character set stores at most 3 bytes per character
  MS SQL Server found another solution.
Legacy systems use many different encodings
  Unix / Mac / Windows
  US / latin-1 / latin-???
  International systems: UTF-16BE, UTF-16LE, UTF-32...
Job of the Medical Informatics specialist
  Make all the systems work together!
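To make the variable length of UTF-8 concrete, here is a minimal Python sketch. The helper name utf8_length and the sample characters are illustrative choices, not part of the lecture.

```python
def utf8_length(char):
    """Return the number of bytes needed to store one character in UTF-8."""
    return len(char.encode("utf-8"))

# Sample characters chosen for illustration: they need 1, 2, 3 and 4 bytes.
samples = ["A", "é", "€", "😀"]
for char in samples:
    print(repr(char),
          utf8_length(char), "byte(s) in UTF-8,",
          len(char.encode("utf-16-le")), "in UTF-16LE,",
          len(char.encode("utf-32-le")), "in UTF-32LE")
```

The last sample needs 4 bytes in UTF-8, which is one more than a column limited to 3 bytes per character can hold.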
-
Lossless compressions
-
Compression
Large data
  Texts
  Web pages
  Programs
  Documents
  Images
  Sounds
  Videos
Lossless compression
  Data are compressed without losing any information
  Texts, web pages, programs (source or executables), documents
  zip, gzip, ...
Lossy compression
  For images, sounds, or videos
  JPEG, various video codecs, various audio codecs
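The defining property of lossless compression is that decompression gives back exactly the original bytes. The following sketch checks this with Python's standard zlib module; the sample text is made up for the demonstration.

```python
import zlib

# Repetitive sample data (an assumption for the demo); repetition compresses well.
original = b"Texts, web pages, programs, documents. " * 100

compressed = zlib.compress(original)
restored = zlib.decompress(compressed)

print(len(original), "bytes before,", len(compressed), "bytes after")
print("identical after decompression:", restored == original)
```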
-
RLE Run-Length Encoding
-
RLE - Run-Length Encoding
How to compress AAAAAAAAAAAAAAAA (the letter A, 16 times)?
Solution: 16A
Examples
  AAABBBCCC becomes 3A3B3C
Less good
  “Another example” becomes 1A1n1o1t1h1e1r1SP1e1x1a1m1p1l1e (2 x larger; SP stands for the space)
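The encoding rule above can be sketched in a few lines of Python. This is one possible implementation, not necessarily the one used in any real file format; the function name rle_encode is an illustrative choice, and the space is written as a literal space rather than SP.

```python
from itertools import groupby

def rle_encode(text):
    """Encode each run of identical characters as <count><character>."""
    return "".join(f"{len(list(run))}{char}" for char, run in groupby(text))

print(rle_encode("A" * 16))           # 16A
print(rle_encode("AAABBBCCC"))        # 3A3B3C
print(rle_encode("Another example"))  # no repeated runs, so twice as long as the input
```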
-
RLE for Black and White images
If we have an image with only black and white pixels [1]
  Black pixels are denoted B
  White pixels are denoted W
The image
WWWWWWWWWWWWBWWWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWW
Becomes
12W1B14W3B23W1B11W
[1] Source: https://fr.wikipedia.org/wiki/Run-length_encoding
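Going the other way, the encoded string 12W1B14W3B23W1B11W can be expanded back into pixels. The sketch below assumes that pixel symbols are never digits, so that a regular expression can separate counts from symbols; rle_decode is a hypothetical helper name.

```python
import re

def rle_decode(encoded):
    """Expand <count><character> pairs back into the original string."""
    pairs = re.findall(r"(\d+)(\D)", encoded)
    return "".join(char * int(count) for count, char in pairs)

image = rle_decode("12W1B14W3B23W1B11W")
print(image)
print(len(image), "pixels")
```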
-
Applications
Images
  The BMP format uses compression for 1, 4 and 8 bits per pixel (black and white, 16 colors, 256 colors).
  PCX uses 8 and 24 bits per pixel (24 bits mean 3 channels with one byte per channel for RGB colors).
Fax
  Only the run lengths are sent, which saves even more space
  Every line must start with the same color.
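The fax idea (sending only the lengths) works because the two colors alternate. The sketch below assumes that every line starts with white and that a line beginning with black is given a zero-length white run first; this convention and the function names are assumptions for illustration.

```python
from itertools import groupby

def fax_encode_line(line, first_color="W"):
    """Return only the run lengths; colors alternate, starting with first_color."""
    lengths = [len(list(run)) for _, run in groupby(line)]
    if line and line[0] != first_color:
        lengths.insert(0, 0)  # empty run so the line still starts with first_color
    return lengths

def fax_decode_line(lengths, first_color="W", other_color="B"):
    """Rebuild the line by alternating the two colors."""
    colors = (first_color, other_color)
    return "".join(colors[i % 2] * n for i, n in enumerate(lengths))

print(fax_encode_line("WWWWBBWWW"))  # [4, 2, 3]
print(fax_encode_line("BBWWW"))      # [0, 2, 3]
print(fax_decode_line([0, 2, 3]))    # BBWWW
```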
-
Huffman
-
Huffman Code
Invented in 1952 by David Albert Huffman
Idea
  Same as Morse code
  Use fewer bits for characters that occur more often
  Example: in Morse code, e is coded “.”
First step: read the text and build a tree
  Elements occurring more often are high in the tree (e, s, t in English)
  Elements occurring rarely are low in the tree (k, y, z in English)
Second step: use the tree to encode characters
  We use the path between the root and the element to describe the node
  The higher a letter is, the shorter its code
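The two steps can be sketched as follows, using a priority queue to merge the two least frequent subtrees repeatedly. This is a minimal illustration; the function name huffman_codes and the tie-breaking rule are assumptions, so the exact bit patterns may differ from other implementations while the total length stays the same.

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Build a Huffman tree for text and return {symbol: bit string}."""
    frequencies = Counter(text)
    if not frequencies:
        return {}
    # Heap entries are (weight, tie_breaker, tree); a tree is a symbol or a (left, right) pair.
    heap = [(weight, index, symbol)
            for index, (symbol, weight) in enumerate(sorted(frequencies.items()))]
    heapq.heapify(heap)
    if len(heap) == 1:
        return {heap[0][2]: "0"}
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)    # the two rarest subtrees...
        w2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next_id, (left, right)))  # ...are merged
        next_id += 1
    codes = {}
    def walk(tree, path):
        if isinstance(tree, tuple):
            walk(tree[0], path + "0")  # 0 means left
            walk(tree[1], path + "1")  # 1 means right
        else:
            codes[tree] = path
    walk(heap[0][2], "")
    return codes

sentence = "this is an example of a huffman tree"
codes = huffman_codes(sentence)
total_bits = sum(len(codes[char]) for char in sentence)
print(codes)
print(total_bits, "bits instead of", 8 * len(sentence))
```

Frequent symbols such as the space end up with shorter codes than rare ones such as x.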
-
Huffman code tree
Tree for the sentence “This is an example of a huffman tree” [2]
[2] Source: https://fr.wikipedia.org/wiki/Codage_de_Huffman
-
Which tree?
Always the same tree (depending on the type of file)
  For French text
  For English text
  For German text
  For a Java program
A tree specific to one document
  The text is read once and the tree is generated
  The code is the smallest possible
  But the tree needs to be sent
Adaptive method
  One default tree is modified dynamically while reading the text
  The tree remains small
  More computation (the tree must also be modified during decoding)
-
Encoding/Decoding
Each time a byte is read
  Search for it in the tree
One writes the bits corresponding to the path in the tree
  Starting at the root
  0 means left
  1 means right
Example
  emmanuel is coded 000 0111 0111 010 0010 00111 000 11001
  It uses 31 bits, whereas 8 x 7 = 56 bits are needed in ASCII
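The example can be replayed in Python. The code table below is read off the emmanuel example and covers only those six letters; representing the tree as nested dictionaries (key "0" for left, key "1" for right) is an assumption made for this sketch.

```python
# Code table taken from the "emmanuel" example; only these six letters are covered.
CODES = {"e": "000", "m": "0111", "a": "010", "n": "0010", "u": "00111", "l": "11001"}

def build_tree(codes):
    """Turn the code table into a nested dict: key '0' is left, key '1' is right."""
    root = {}
    for symbol, bits in codes.items():
        node = root
        for bit in bits[:-1]:
            node = node.setdefault(bit, {})
        node[bits[-1]] = symbol  # a leaf holds the decoded letter
    return root

def encode(text, codes):
    """Write, for each letter, the bits of its path from the root."""
    return "".join(codes[char] for char in text)

def decode(bits, root):
    """Follow the bits from the root; on reaching a leaf, emit it and restart."""
    decoded, node = [], root
    for bit in bits:
        node = node[bit]
        if isinstance(node, str):
            decoded.append(node)
            node = root
    return "".join(decoded)

bits = encode("emmanuel", CODES)
print(bits, len(bits))
print(decode(bits, build_tree(CODES)))
```

No separator is needed between letters because no code is a prefix of another one.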
-
LZW - Lempel-Ziv-Welch
-
LZW - Lempel-Ziv-Welch
Algorithm [3]
  Based on LZ78, written by Abraham Lempel and Jacob Ziv
  Enhanced by Terry Welch
Compression based on a dictionary
  The dictionary is created and maintained during reading
  We replace common words by their position in the dictionary
  The dictionary is initialized with all the bytes (every word has one character).
If some words are repeated, compression is efficient
[3] Source: http://igm.univ-mlv.fr/~dr/XPOSE2013/La_compression_de_donnees/lzw.html
-
Principle
You read the file and construct the dictionary
  Only one pass is needed
The dictionary contains all the one-byte words by default
  For a text, all ASCII (or better, latin1) characters.
If a word exists in the dictionary, it is replaced by its rank in the dictionary
If a new word is read, it is added to the dictionary for later use
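The principle can be sketched as a short Python function. It assumes that the 256 one-byte words occupy ranks 0 to 255 and that new words are numbered from 256 upward; the name lzw_encode is illustrative, and the output is kept as a list of integers rather than packed into bits.

```python
def lzw_encode(data):
    """Return the list of dictionary ranks for the given bytes, in a single pass."""
    dictionary = {bytes([i]): i for i in range(256)}  # all one-byte words
    w = b""
    output = []
    for byte in data:
        c = bytes([byte])
        wc = w + c
        if wc in dictionary:
            w = wc                               # keep extending the known word
        else:
            output.append(dictionary[w])         # emit the rank of the known word
            dictionary[wc] = len(dictionary)     # remember the new word for later use
            w = c
    if w:
        output.append(dictionary[w])
    return output

codes = lzw_encode(b"TOBEORNOTTOBEORTOBEORNOT")
print(codes)
print(len(codes), "codes for 24 input bytes")
```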
-
Example: Encode a String
Encode TOBEORNOTTOBEORTOBEORNOT [4]
The dictionary initially contains 256 words of one byte (all the possible chars).
w     c    wc     Output   Dictionary
      T    T
T     O    TO     T        TO = 256
O     B    OB     O        OB = 257
B     E    BE     B        BE = 258
E     O    EO     E        EO = 259
O     R    OR     O        OR = 260
R     N    RN     R        RN = 261
N     O    NO     N        NO = 262
O     T    OT     O        OT = 263
T     T    TT     T        TT = 264
T     O    TO
TO    B    TOB    256      TOB = 265
B     E    BE
BE    O    BEO    258      BEO = 266
O     R    OR
OR    T    ORT    260      ORT = 267
T     O    TO
TO    B    TOB
TOB   E    TOBE   265      TOBE = 268
E     O    EO
EO    R    EOR    259      EOR = 269
R     N    RN
RN    O    RNO    261      RNO = 270
O     T    OT
OT                263
[4] https://fr.wikipedia.org/wiki/Lempel-Ziv-Welch
-
Example: Encoded output
The output of the algorithm is:
T O B E O R N O T 256 258 260 265 259 261 263
Space needed
  Normally (ASCII): 24 bytes, i.e. 192 bits
  Each code is encoded on 9 bits (the largest code is 265, which is below 512)
  16 x 9 bits = 144 bits
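The space calculation can be checked with a few lines of Python. The code list assumes that new dictionary entries are numbered from 256 upward, which gives 256, 258, 260, 265, 259, 261, 263 after the nine single letters.

```python
# The 16 output codes: nine single letters, then seven dictionary ranks.
codes = [ord(ch) for ch in "TOBEORNOT"] + [256, 258, 260, 265, 259, 261, 263]

bits_per_code = max(codes).bit_length()   # smallest width that fits the largest code
compressed_bits = len(codes) * bits_per_code
original_bits = 24 * 8                    # 24 characters, one byte each

print(bits_per_code, "bits per code")
print(compressed_bits, "bits instead of", original_bits)
```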
-
Example: Decoding the text
Input:
T O B E O R N O T 256 258 260 265 259 261 263
c     w     In    w+In[0]   Output   Dictionary
T           T     T         T
O     T     O     TO        O        TO = 256
B     O     B     OB        B        OB = 257
E     B     E     BE        E        BE = 258
O     E     O     EO        O        EO = 259
R     O     R     OR        R        OR = 260
N     R     N     RN        N        RN = 261
O     N     O     NO        O        NO = 262
T     O     T     OT        T        OT = 263
256   T     TO    TT        TO       TT = 264
258   TO    BE    TOB       BE       TOB = 265
260   BE    OR    BEO       OR       BEO = 266
265   OR    TOB   ORT       TOB      ORT = 267
259   TOB   EO    TOBE      EO       TOBE = 268
261   EO    RN    EOR       RN       EOR = 269
263   RN    OT    RNO       OT       RNO = 270
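Decoding rebuilds the same dictionary while reading the codes, so the dictionary never has to be transmitted. The sketch below assumes codes numbered from 256 upward for new words; lzw_decode is an illustrative name. It also handles the one special case where a code refers to the entry that is about to be created.

```python
def lzw_decode(codes):
    """Rebuild the original bytes from a list of LZW codes."""
    if not codes:
        return b""
    dictionary = {i: bytes([i]) for i in range(256)}  # all one-byte words
    w = dictionary[codes[0]]
    output = [w]
    for code in codes[1:]:
        if code in dictionary:
            entry = dictionary[code]
        elif code == len(dictionary):
            entry = w + w[:1]          # code for the entry being created right now
        else:
            raise ValueError(f"invalid LZW code: {code}")
        output.append(entry)
        dictionary[len(dictionary)] = w + entry[:1]   # this is w + In[0]
        w = entry
    return b"".join(output)

codes = [ord(ch) for ch in "TOBEORNOT"] + [256, 258, 260, 265, 259, 261, 263]
print(lzw_decode(codes))
```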
-
Use of LZW algorithm
For image compression
  Images in GIF, TIFF
  Audio files: MOD
Characters have 12 bits (not 8)
  A pixel is 3 bytes (RGB), i.e. 24 bits
  2 characters represent one pixel
Very good at recognizing patterns
  But now, images have millions of colors and do not have that many patterns.
Maybe we could change the image a little bit to help the compression
Lossy compression: next week
-
Conclusion
-
Conclusion
Compression is central to many medical informatics topics
  Storage of large documents
  Transfer of large documents
Algorithms
  RLE is efficient if there are a lot of repetitions (black and white images, for instance).
  Huffman generates a tree. It is efficient if the type is known or if the tree remains small relative to the size of the document.
  LZW is efficient even if nothing is known.
ZIP files
  For lossless compression of files
  Use the Deflate algorithm, based on LZ77 and Huffman
  First compresses the duplicate series of bytes (Lempel-Ziv)
  Then replaces commonly used symbols (Huffman).
  Very efficient with text files.
Lossless vs lossy compression
  Sometimes we do not need all of the information; a part of it is sufficient
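As a small illustration of the ZIP point, Python's standard zipfile module can write an archive with the Deflate method. The file name notes.txt and the sample text are made up for this sketch, and the archive is kept in memory so that nothing is written to disk.

```python
import io
import zipfile

# Repetitive sample text (an assumption for the demo).
text = ("Compression is central in a lot of topics. " * 200).encode("utf-8")

buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("notes.txt", text)

with zipfile.ZipFile(buffer) as archive:
    info = archive.getinfo("notes.txt")
    restored = archive.read("notes.txt")

print(info.file_size, "bytes stored as", info.compress_size, "bytes")
print("lossless:", restored == text)
```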
-
References
Web pages
  https://fr.wikipedia.org/wiki/Codage_de_Huffman
  https://fr.wikipedia.org/wiki/Lempel-Ziv-Welch
  http://www.journaldunet.com/developpeur/tutoriel/theo/041014-algo-compression-sans-perte.shtml
  https://www.numerama.com/tech/299921-comment-fonctionne-la-compression-de-donnees.html
  http://igm.univ-mlv.fr/~dr/XPOSE2013/La_compression_de_donnees/lzw.html
-
Exercises
1. Encode using RLE
   “Hallo Mitenand”
   “AAAAAABBBBBAAAACCCCCWWWWW”
2. Encode using LZW
   “Bonjour a vous tous”
   “GTACCTAGGTAGTAAGTATGTAC”