Computer Science Basics: Compression of Files · 2021. 1. 11.

Emmanuel Benoist, Fall Term 2020/2021
Berner Fachhochschule | Haute école spécialisée bernoise | Berne University of Applied Sciences

  • Encoding and Compression of Files

    Last Week

    Lossless compression
      • RLE (Run-Length Encoding)
      • Huffman
      • LZW (Lempel-Ziv-Welch)

    Conclusion

  • Last Week

  • Character Encoding

    UTF-8 is now the standard
      • Used on websites

    Problem for databases
      • UTF-8 characters do not all have the same length (1 to 4 bytes)
      • This makes them awkward to store in fixed-width database columns
      • MySQL / MariaDB: the legacy utf8 character set encodes characters on at most 3 bytes
      • MS SQL Server found another solution

    Legacy systems use many different encodings
      • Unix / Mac / Windows
      • US / latin-1 / latin-???
      • International systems: UTF-16BE, UTF-16LE, UTF-32...

    Job of the Medical Informatics specialist
      • Let all the systems work together!
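
    The variable length of UTF-8 characters can be checked directly. The following is a minimal Python sketch; the sample characters and the helper name utf8_length are chosen only for illustration and do not come from the slides.

```python
# Number of bytes each character needs once encoded as UTF-8.
# The sample characters are illustrative assumptions.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]  # A, e-acute, euro sign, an emoji

def utf8_length(ch: str) -> int:
    """Return how many bytes one character occupies in UTF-8."""
    return len(ch.encode("utf-8"))

for ch in samples:
    print(repr(ch), utf8_length(ch))
```

    The last sample needs 4 bytes, so a storage scheme limited to 3 bytes per character cannot hold it.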

  • Lossless compressions

  • Compression

    Large data
      • Texts
      • Web pages
      • Programs
      • Documents
      • Images
      • Sounds
      • Videos

    Lossless compression
      • Data are compressed without losing any information
      • Texts, web pages, programs (source or executables), documents
      • zip, gzip, ...

    Lossy compression
      • For images or videos
      • JPEG, various video codecs, various audio codecs
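
    What "lossless" means can be illustrated with a round trip through gzip. This is a small sketch using Python's standard gzip module; the sample text is made up for the illustration.

```python
import gzip

# Illustrative, deliberately repetitive sample data (an assumption, not from the slides).
original = ("Patient record: nothing to report. " * 50).encode("utf-8")

compressed = gzip.compress(original)
restored = gzip.decompress(compressed)

print(len(original), "->", len(compressed), "bytes")
print("identical after decompression:", restored == original)
```

    The decompressed bytes are exactly the original ones, which is what distinguishes lossless from lossy methods.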

  • RLE - Run-Length Encoding

    How to compress
      AAAAAAAAAAAAAAAA

    Solution
      16A

    Examples
      • AAABBBCCC becomes 3A3B3C

    Less good
      • "Another example" becomes 1A1n1o1t1h1e1r1SP1e1x1a1m1p1l1e (2 × larger; SP denotes the space)
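
    The encoding rule can be sketched in a few lines of Python. The function name rle_encode is an assumption for this illustration, and the sketch presumes the text contains no digits (otherwise the counts would be ambiguous).

```python
from itertools import groupby

def rle_encode(text: str) -> str:
    """Replace each run of identical characters by <count><character>."""
    return "".join(f"{len(list(run))}{char}" for char, run in groupby(text))

print(rle_encode("AAABBBCCC"))        # 3A3B3C
print(rle_encode("Another example"))  # every character gets a leading 1
```

    For text without repetitions the output is twice as long as the input, which is the "less good" case.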

  • RLE for Black and White images

    Consider an image with only black and white pixels [1]
      • Black pixels are denoted B
      • White pixels are denoted W

    The image

    WWWWWWWWWWWWBWWWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWW

    Becomes
      12W1B14W3B23W1B11W

    [1] Source: https://fr.wikipedia.org/wiki/Run-length_encoding
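
    Decoding is the reverse operation. A minimal sketch, assuming the encoded form is a sequence of <count><symbol> pairs as in the example; rle_decode is a hypothetical helper name.

```python
import re

def rle_decode(encoded: str) -> str:
    """Expand <count><symbol> pairs such as 12W1B back into the pixel row."""
    pairs = re.findall(r"(\d+)(\D)", encoded)
    return "".join(symbol * int(count) for count, symbol in pairs)

row = rle_decode("12W1B14W3B23W1B11W")
print(len(row), row.count("B"))  # 65 pixels, 5 of them black
```

    The 18 characters of the encoded form stand for a row of 65 pixels.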

  • Applications

    Images
      • The BMP format uses RLE compression for 1, 4 and 8 bits per pixel (black and white, 16 colors, 256 colors).
      • PCX uses it for 8 and 24 bits per pixel (24 bits means 3 channels, one byte per channel, for RGB colors).

    Fax
      • Only the run lengths are sent, which saves even more space
      • Every line must start with the same color
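
    The fax idea of sending only lengths can be sketched as follows. This is an illustration under stated assumptions, not the actual fax standard: every line is taken to start with white, and a line that really starts with black gets a leading run of length 0. The names run_lengths and expand are hypothetical.

```python
from itertools import groupby

def run_lengths(line: str, first: str = "W") -> list[int]:
    """Return only the run lengths of a line of W/B pixels.

    Colors alternate, so they need not be stored. If the line does not
    start with `first`, a run of length 0 is emitted for that color.
    """
    lengths = [len(list(run)) for _, run in groupby(line)]
    if line and line[0] != first:
        lengths.insert(0, 0)
    return lengths

def expand(lengths: list[int], first: str = "W", second: str = "B") -> str:
    """Rebuild the line by alternating colors, starting with `first`."""
    colors = (first, second)
    return "".join(colors[i % 2] * n for i, n in enumerate(lengths))

print(run_lengths("WWBWWW"))  # [2, 1, 3]
print(run_lengths("BBW"))     # [0, 2, 1]
```

    Because the starting color is fixed, the receiver can rebuild each line from the numbers alone.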

  • Huffman

  • Huffman Code

    Invented in 1952 by David Albert Huffman

    Idea
      • Same as Morse code
      • Use fewer bits for characters that occur more often
      • Example: in Morse, e is coded "."

    First step: read the text and build a tree
      • Elements occurring more often are high in the tree (e, s, t in English)
      • Elements occurring rarely are low in the tree (k, y, z in English)

    Second step: use the tree to encode characters
      • The path between the root and the element describes the node
      • The higher a letter is in the tree, the shorter its code
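
    The two steps can be sketched in Python with a priority queue. This is one common way to build a Huffman tree, given here as an illustration; the function name huffman_codes and the tie-breaking rule are assumptions, so the individual codes may differ from other trees for the same text, while the total length stays optimal.

```python
import heapq
from collections import Counter

def huffman_codes(text: str) -> dict[str, str]:
    """Build a Huffman tree for `text` and return {symbol: bit string}."""
    freq = Counter(text)
    # Heap entries are (weight, tie_breaker, tree); a tree is either a
    # single symbol or a (left, right) pair of subtrees.
    heap = [(w, i, sym) for i, (sym, w) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        # Merge the two least frequent subtrees: rare symbols sink lower.
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next_id, (left, right)))
        next_id += 1

    codes: dict[str, str] = {}

    def walk(tree, path: str) -> None:
        if isinstance(tree, tuple):
            walk(tree[0], path + "0")  # 0 means left
            walk(tree[1], path + "1")  # 1 means right
        else:
            codes[tree] = path or "0"  # a one-symbol text still gets one bit

    if heap:
        walk(heap[0][2], "")
    return codes

text = "this is an example of a huffman tree"
codes = huffman_codes(text)
total_bits = sum(len(codes[ch]) for ch in text)
print(total_bits)  # 135 bits, versus 36 * 8 = 288 with one byte per character
```

    Frequent symbols such as the space end up near the root and get short codes; rare ones such as x get long codes.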

  • Huffman code tree

    Tree for the sentence “This is an example of a huffman tree” [2]

    [2] Source: https://fr.wikipedia.org/wiki/Codage_de_Huffman

  • Which tree?

    Always the same tree (depending on the type of file)
      • For French text
      • For English text
      • For German text
      • For a Java program

    A tree specific to one document
      • The text is read once and the tree is generated
      • The code is the smallest possible
      • But the tree needs to be sent

    Adaptive method
      • One default tree is modified dynamically while reading the text
      • The tree remains small
      • More computation (the tree must also be modified during decoding)

  • Encoding/Decoding

    Each time a byte is read
      • Search for it in the tree
      • Write the bits corresponding to its path in the tree

    Starting at the root
      • 0 means left
      • 1 means right

    Example
      • emmanuel is coded 000 0111 0111 010 0010 00111 000 11001
      • It uses 31 bits, whereas 8 × 7 = 56 bits are needed in ASCII
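
    Encoding and decoding with a fixed code table can be sketched as follows. The table holds only the six letters of the example, with the codes read off the slide; a lookup table stands in for walking the tree, and the names CODES, encode and decode are assumptions for this sketch.

```python
# Code table taken from the "emmanuel" example (only the letters it needs).
CODES = {"e": "000", "m": "0111", "a": "010", "n": "0010", "u": "00111", "l": "11001"}

def encode(text: str) -> str:
    """Concatenate the bit string of every character."""
    return "".join(CODES[ch] for ch in text)

def decode(bits: str) -> str:
    """Read bits one by one until they match a code, then start again."""
    reverse = {code: ch for ch, code in CODES.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in reverse:  # works because no code is a prefix of another
            out.append(reverse[current])
            current = ""
    return "".join(out)

bits = encode("emmanuel")
print(bits, len(bits))  # 31 bits
print(decode(bits))     # emmanuel
```

    No separator between codes is needed: since no code is a prefix of another, the decoder always knows where a character ends.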

  • LZW - Lempel-Ziv-Welch

    Algorithm [3]
      • Based on LZ78, written by Abraham Lempel and Jacob Ziv
      • Enhanced by Terry Welch

    Compression based on a dictionary
      • The dictionary is created and maintained during reading
      • Common words are replaced by their position in the dictionary
      • The dictionary is initialized with all the bytes (every word has one character)

    If some words are repeated, compression is efficient

    [3] Source: http://igm.univ-mlv.fr/~dr/XPOSE2013/La_compression_de_donnees/lzw.html

  • Principle

    Read the file and construct the dictionary
      • Only one pass is needed

    By default, the dictionary contains all the one-byte words
      • For a text: all ASCII (or better, Latin-1) characters

    If a word exists in the dictionary, it is replaced by its rank in the dictionary

    If a new word is read, it is added to the dictionary for later use
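
    The principle can be sketched in Python. This is a minimal illustration, assuming the input only contains characters with code points below 256 (ASCII or Latin-1) and that new dictionary entries are numbered from 256 upwards; lzw_encode is a hypothetical name.

```python
def lzw_encode(text: str) -> list[int]:
    """LZW-encode `text`; the dictionary starts with the 256 one-byte words."""
    dictionary = {chr(i): i for i in range(256)}
    w = ""
    output = []
    for c in text:
        wc = w + c
        if wc in dictionary:
            w = wc                            # keep extending the known word
        else:
            output.append(dictionary[w])      # emit the rank of the known word
            dictionary[wc] = len(dictionary)  # remember the new word for later
            w = c
    if w:
        output.append(dictionary[w])          # flush the last pending word
    return output

print(lzw_encode("TOBEORNOTTOBEORTOBEORNOT"))
# [84, 79, 66, 69, 79, 82, 78, 79, 84, 256, 258, 260, 265, 259, 261, 263]
```

    The first nine numbers are plain character codes; the remaining seven refer to words added to the dictionary during the same pass.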

  • Example: Encode a String

    Encode TOBEORNOTTOBEORTOBEORNOT [4]
    The dictionary initially contains 256 one-byte words (all the possible characters), so new entries are numbered from 256.

    w     c    wc     Output   Dictionary
          T    T
    T     O    TO     T        TO = <256>
    O     B    OB     O        OB = <257>
    B     E    BE     B        BE = <258>
    E     O    EO     E        EO = <259>
    O     R    OR     O        OR = <260>
    R     N    RN     R        RN = <261>
    N     O    NO     N        NO = <262>
    O     T    OT     O        OT = <263>
    T     T    TT     T        TT = <264>
    T     O    TO
    TO    B    TOB    <256>    TOB = <265>
    B     E    BE
    BE    O    BEO    <258>    BEO = <266>
    O     R    OR
    OR    T    ORT    <260>    ORT = <267>
    T     O    TO
    TO    B    TOB
    TOB   E    TOBE   <265>    TOBE = <268>
    E     O    EO
    EO    R    EOR    <259>    EOR = <269>
    R     N    RN
    RN    O    RNO    <261>    RNO = <270>
    O     T    OT
    OT                <263>

    [4] Source: https://fr.wikipedia.org/wiki/Lempel-Ziv-Welch

  • Example: Encoded output

    The output of the algorithm is
      TOBEORNOT<256><258><260><265><259><261><263>
      (new dictionary entries are numbered from 256)

    Space needed
      • Normally (ASCII): 24 bytes, i.e. 192 bits
      • Each number is encoded on 9 bits (the largest number is 265, which is < 512)
      • 16 × 9 bits = 144 bits
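
    The size comparison can be reproduced with a few lines of Python. The code list below is an assumption consistent with a 256-entry initial dictionary: plain characters are written as their ASCII values and new entries are numbered from 256.

```python
# Output of the encoder for TOBEORNOTTOBEORTOBEORNOT (assumed numbering, see above).
codes = [84, 79, 66, 69, 79, 82, 78, 79, 84, 256, 258, 260, 265, 259, 261, 263]

bits_per_code = max(codes).bit_length()  # smallest fixed width that holds every code
compressed_bits = len(codes) * bits_per_code
original_bits = 24 * 8                   # 24 characters stored on one byte each

print(bits_per_code, compressed_bits, original_bits)  # 9 144 192
```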

  • Example: Decoding the text

    Input: TOBEORNOT<256><258><260><265><259><261><263>

    c       w     In    w+In[0]   Output   Dictionary
    T             T               T
    O       T     O     TO        O        TO = <256>
    B       O     B     OB        B        OB = <257>
    E       B     E     BE        E        BE = <258>
    O       E     O     EO        O        EO = <259>
    R       O     R     OR        R        OR = <260>
    N       R     N     RN        N        RN = <261>
    O       N     O     NO        O        NO = <262>
    T       O     T     OT        T        OT = <263>
    <256>   T     TO    TT        TO       TT = <264>
    <258>   TO    BE    TOB       BE       TOB = <265>
    <260>   BE    OR    BEO       OR       BEO = <266>
    <265>   OR    TOB   ORT       TOB      ORT = <267>
    <259>   TOB   EO    TOBE      EO       TOBE = <268>
    <261>   EO    RN    EOR       RN       EOR = <269>
    <263>   RN    OT    RNO       OT       RNO = <270>
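
    The decoding table can be reproduced with a short Python sketch. It assumes the same numbering as the encoder (entries 0 to 255 are the single bytes, new words start at 256); lzw_decode is a hypothetical name. The branch for a code that is not yet in the dictionary does not occur in this example but is needed in general.

```python
def lzw_decode(codes: list[int]) -> str:
    """Rebuild the text while reconstructing the same dictionary as the encoder."""
    if not codes:
        return ""
    dictionary = {i: chr(i) for i in range(256)}
    w = dictionary[codes[0]]
    output = [w]
    for c in codes[1:]:
        if c in dictionary:
            entry = dictionary[c]
        elif c == len(dictionary):
            entry = w + w[0]  # the code refers to the word being defined right now
        else:
            raise ValueError(f"invalid LZW code: {c}")
        output.append(entry)
        dictionary[len(dictionary)] = w + entry[0]  # the w+In[0] column
        w = entry
    return "".join(output)

codes = [84, 79, 66, 69, 79, 82, 78, 79, 84, 256, 258, 260, 265, 259, 261, 263]
print(lzw_decode(codes))  # TOBEORNOTTOBEORTOBEORNOT
```

    The dictionary is never transmitted: the decoder rebuilds it one step behind the encoder.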

  • Use of LZW algorithm

    Used for compression
      • Images in GIF and TIFF
      • MOD audio files

    Characters have 12 bits (not 8)
      • A pixel is 3 bytes (RGB), i.e. 24 bits
      • 2 characters represent one pixel

    Very good at recognizing patterns
      • But images now have millions of colors and do not contain that many patterns

    Maybe we could change the image a little to help the compression
      • Lossy compression: next week

  • Conclusion

    Compression is central to many Medical Informatics topics
      • Storage of large documents
      • Transfer of large documents

    Algorithms
      • RLE is efficient if there are a lot of repetitions (black and white images, for instance).
      • Huffman generates a tree. It is efficient if the type is known or if the tree remains small relative to the size of the document.
      • LZW is efficient even if nothing is known.

    ZIP files
      • For lossless compression of files
      • Uses the Deflate algorithm, based on LZ77 and Huffman
      • First compresses the duplicate series of bytes (Lempel-Ziv)
      • Then replaces commonly used symbols (Huffman)
      • Very efficient with text files

    Lossless vs lossy compression
      • Sometimes we do not need all of the information; a part of it is sufficient
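
    A ZIP archive using Deflate can be produced with Python's standard zipfile module. The sketch below works in memory; the sample text and the member name notes.txt are hypothetical.

```python
import io
import zipfile

# Illustrative, repetitive sample text (an assumption, not from the slides).
text = ("Compression is central in a lot of topics. " * 40).encode("utf-8")

buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("notes.txt", text)  # hypothetical member name

with zipfile.ZipFile(buffer) as archive:
    info = archive.getinfo("notes.txt")
    restored = archive.read("notes.txt")

print(info.file_size, "->", info.compress_size, "bytes")
print("lossless:", restored == text)
```

    Repetitive text shrinks considerably, and the extracted bytes are identical to the original.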

  • References

    Web pages
      • https://fr.wikipedia.org/wiki/Codage_de_Huffman
      • https://fr.wikipedia.org/wiki/Lempel-Ziv-Welch
      • http://www.journaldunet.com/developpeur/tutoriel/theo/041014-algo-compression-sans-perte.shtml
      • https://www.numerama.com/tech/299921-comment-fonctionne-la-compression-de-donnees.html
      • http://igm.univ-mlv.fr/~dr/XPOSE2013/La_compression_de_donnees/lzw.html

  • Exercises

    1. Encode using RLE
      • “Hallo Mitenand”
      • “AAAAAABBBBBAAAACCCCCWWWWW”

    2. Encode using LZW
      • “Bonjour a vous tous”
      • “GTACCTAGGTAGTAAGTATGTAC”