#Translate Nato Managing Defence Systems in the Information Age Cals003
NATO UNCLASSIFIED
NATO Consultation, Command and Control Agency
COMMUNICATIONS & INFORMATION SYSTEMS
Decreasing “Bit Pollution” through “Sequence Reduction”
Dr. Davras [email protected]

You will find this presentation and the accompanying paper at
www.nc3a.info/MCC2006
from where both can be viewed and/or downloaded
(the four other NC3A presentations can also be found at the above URL)

Terminology
• “Sequence Reduction”: originates with Peribit (~2000). The founder’s Ph.D. on genome mapping (Biomedical Informatics, Stanford University) uses the term “Molecular Sequence Reduction” (MSR)
• “Bit Pollution”: link/network pollution, i.e. repetition of redundant digital sequences over transmission media (especially significant for mobile/deployed networks/links)
• Other related terms: WAN optimizer, Application Accelerator/Optimizer or Application Controller-Optimizer, Performance Enhancement Proxies (PEP), WAN Expanders, latency (= delay) removers/compensators/mitigators, etc.
• New & dynamic field: many terms will continue to appear and coalesce; some will catch on, others will disappear

Terminology
• “Next Generation Compression”, “Bit Pollution Reduction”, “Sequence Reduction” (the latter from Peribit/Dr. Amit Singh)
• WAN Expander (WX), WAN Optimizer, WAN Optimization Controller (WOC) (Juniper/Peribit)
• Application Accelerator/Optimizer/Controller-Optimizer
• Latency Remover/Optimizer (replace “Latency” by “Delay”)
• Especially for networks with SATCOM links
• In general: use of a-priori knowledge of the data comms protocols required by the application to optimize the data input/output
• Combinations of the above
• Unfortunately, all present implementations are “proprietary”
• Unrealistic to expect “standards” soon; the technology is too new and lucrative

Why “Bit Pollution”?
• Most of us deal daily with various electronic files/information
• Taking MS Office as an example: Word, PPT, Excel, Project, HTML, Access, … files
• … and/or many other electronic files, data-bases, forms, etc.
• On many occasions we make small changes and send them back and/or forward to others
• Repetitive traffic over communication links can, in general, be classified broadly into 3 categories:
  1) Application & protocol overheads
  2) Commonly used words, phrases, strings, objects (logos, images, audio clips, etc.)
  3) Process flows (data-base updates/views, forms, templates, etc. going back & forth)

SEQUENCE REDUCTION
Next Generation Compression - Examples
• 256 Kbps satellite link
• 20 Mbytes PPT file (48 slides) sent 1st time: ~12 minutes (700 secs)
• 6 of the slides modified, file size change <0.5 Mbytes
  – Modified file sent 6 hours later, time taken: ~8 secs
  – Same modified file sent 24 hours later: ~18 secs
  – Sent 7 days later: ~24 secs
  – Original file sent 7 days later: ~14 secs
• Similar results for Word, Excel files and web pages
• Less but still significant improvement for PDF files
• Smallest improvement for zipped files (reduction by ~2.5 to 3)
• The amount of “new” files in between repetitions and the SR RAM/HD capacities have a strong effect on the duration of repeat transmissions (dynamic library updates)
• Above results based on Peribit SRs: German MOD, Syracuse University “Real World” Labs (Network Computing, Nov 2004) and NC3A
• GE MOD results based on operational traffic, others on test traffic
• Ref [6] of paper: “Record for throughput was ~60Mbps through a T1. It came about when copying 1.5GB file twice!”
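
The timings above can be put in perspective with a little arithmetic. The sketch below is illustrative only: it assumes 20 Mbytes means 20 x 10^6 bytes, ignores protocol overhead, and treats the modified file as still roughly 20 Mbytes. The function names are hypothetical and not taken from the paper.

```python
def transfer_seconds(size_bytes: float, link_bps: float) -> float:
    """Ideal time to push size_bytes over a link, ignoring protocol overhead."""
    return size_bytes * 8 / link_bps


def effective_bps(size_bytes: float, seconds: float) -> float:
    """Apparent throughput seen by the user for a transfer that took `seconds`."""
    return size_bytes * 8 / seconds


FILE_BYTES = 20e6   # 20 Mbytes PPT file (assumed to mean 20 * 10**6 bytes)
LINK_BPS = 256e3    # 256 Kbps satellite link

ideal_first_pass = transfer_seconds(FILE_BYTES, LINK_BPS)
repeat_rate = effective_bps(FILE_BYTES, 8)   # modified file re-sent in ~8 secs
speedup = repeat_rate / LINK_BPS

print(f"ideal first pass: {ideal_first_pass:.0f} s (slide reports ~700 s)")
print(f"apparent rate on repeat: {repeat_rate / 1e6:.0f} Mbps, {speedup:.0f}x the link rate")
```

The ideal first-pass time comes out somewhat below the measured ~700 secs, which is what one would expect once protocol overheads are included.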

Mobile/Tactical Comms Divergence
• Fixed communications – WANs with all users/nodes fixed
• Fiber-optic/photonic revolution: essentially unlimited capacity is now possible/available if/when a cable can be installed
• Mobile comms: networks with mobile/deployable users
• No technological revolution similar to photonic foreseen
• Radio propagation will be the limiting factor
  – Mainstay will be radio: tactical LOS tens/hundreds of Kbps, BLOS (rough terrain, long distances) a few Kbps
  – Star-wars scenarios: moving laser beams ???
• LEO satellites will provide some 100s of Kbps at a cost
• Divergence will continue
• Another factor: input into the five senses: ~100 bps (Shannon entropy)
  – For transmission redundancy: x 10 = 1 Kbps
Therefore: we must treat mobile/tactical comms differently

Deployable, Mobile, On-the-Move Communications
• At least one end of a link moving/deployed
• Networks which have nodes/users moving/deployed
• Such links/networks are essential for survivability and rapid reaction
• Will be taking on increasingly more critical tasks
• Present approach: use applications developed for fixed links/networks for deployed/mobile units
• Must consider the very different characteristics of such networks when choosing applications
• Can we measure “information”, so we can determine the performance of links/networks in terms of “information” transported, not just bits/bytes?

Can we measure “information”?
Yes we can!
• Shannon defined the concept of “Entropy”, a logarithmic measure, in the 1940s (while working on cryptography); it has stood the test of time
• The first suggestion of a log measure was by Hartley (base 10), but Shannon used the idea to develop a complete “theory of information & communication”
• Shannon preferred log base 2 and called the “unit” bits
• Base e is also sometimes used (Nats)
• The smaller the probability of occurrence of an event, the higher the “information delivered” when it occurs
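
As a minimal sketch of the logarithmic measure described above, the information delivered by an event of probability p is -log(p), with the log base fixing the unit (base 2 for bits, base e for nats, base 10 for Hartley's unit). The function name is illustrative.

```python
import math


def self_information(p: float, base: float = 2.0) -> float:
    """Information delivered by an event of probability p, in units set by `base`."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return -math.log(p) / math.log(base)


for p in (0.5, 0.1, 0.001):
    bits = self_information(p)
    nats = self_information(p, math.e)
    print(f"p = {p}: {bits:.3f} bits, {nats:.3f} nats")
```

Rarer events give larger values, which is the sense in which they "deliver more information".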

[Figure: C. E. Shannon (BSTJ 1948); discrete, countable symbol sets {Si} and {Rj}]

Entropy
• Entropy (H) in the case of two possibilities/events/symbols
• Probability of one = p, of the other q = 1 - p
• H = -(p log p + q log q)
[Figure: H versus p plotted]
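
The two-symbol entropy formula can be evaluated directly. This is a small sketch using log base 2, so H is in bits; the function name is illustrative.

```python
import math


def binary_entropy(p: float) -> float:
    """H = -(p log2 p + q log2 q) with q = 1 - p; taken as 0 at p = 0 or p = 1."""
    if p in (0.0, 1.0):
        return 0.0
    q = 1.0 - p
    return -(p * math.log2(p) + q * math.log2(q))


# A few points of the H-versus-p curve: it peaks at p = 0.5 and is symmetric.
for p in (0.0, 0.1, 0.25, 0.5, 0.75, 0.9, 1.0):
    print(f"p = {p:4.2f}  H = {binary_entropy(p):.3f} bits")
```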

Let us take a “Natural Language”, English, as an example
• English has 26 letters (characters)
• Space as a delimiter
• TOTAL 27 characters (symbols)
• One could include punctuation, special characters, etc.; for example, we could use the full 256 ASCII symbol set. The methodology is the same
• Extension to other natural languages readily made
• Extension to images also possible (same methodology)

Structure of a “Natural Language” - English
• Defined by many characteristics: grammar, semantics, etymology, usage, …, historical developments, …
• Until the early 70s there was substantial belief that “Natural Languages” and “computer programming languages” (finite automata instructions) had similarities
• Noam Chomsky’s work (Professor at MIT) completely destroyed those expectations
• Natural Languages can be studied through probabilistic (Markov) models
• Shannon’s approach (1940s, no computers; Bell Labs staff flipped through many pages of books to get the probabilities)
• He was actually working on cryptography and made important contributions in that area also

Various Markov model examples belong here; they are skipped for continuity and may be found at the end

Zipf’s Law “Principle of Least Effort”
• George Kingsley Zipf, Professor of Linguistics, Harvard (1902 – 1950)
• If the “words” in a language are ordered (“ranked”) from the most frequently used down, the probability $P_n$ of the $n$th word in this list is $P_n \approx 0.1 / n$
• Implies a maximum vocabulary size of 12366 words, since $\sum_{n=1}^{N} 1/n$ is not finite as $N \to \infty$
• For details of the above see DY, IEEE Transactions on Information Theory, September 1974
• Many other applications of “Zipf’s Law”; if interested, just make a Google/Internet search
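
The vocabulary limit quoted above can be checked numerically. The sketch below assumes the limit is taken as the largest N for which the probabilities 0.1/n still sum to no more than 1, and then evaluates the entropy per word of that truncated distribution. Function names are illustrative.

```python
import math


def zipf_vocabulary_limit(c: float = 0.1) -> int:
    """Largest N for which sum_{n=1..N} c/n does not exceed 1."""
    total, n = 0.0, 0
    while total + c / (n + 1) <= 1.0:
        n += 1
        total += c / n
    return n


def zipf_entropy_bits_per_word(n_words: int, c: float = 0.1) -> float:
    """Entropy of the distribution P_n = c/n, n = 1..n_words, in bits per word."""
    return sum((c / n) * math.log2(n / c) for n in range(1, n_words + 1))


limit = zipf_vocabulary_limit()
bits_per_word = zipf_entropy_bits_per_word(limit)
print(limit, round(bits_per_word, 2), round(bits_per_word / 4.5, 2))
```

Dividing the per-word figure by ~4.5 letters per word gives a per-letter value that can be compared with the word-based entropy quoted elsewhere in the presentation.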

[Figure: Zipf’s Law (Principle of Least Effort), ~million words, various texts. From J. R. Pierce, “Symbols, Signals & Noise”]

Entropy bits/character - English
Amazingly, it turns out to be about the same for most “Natural Languages” for which the analysis has been done (Arabic, French, German, Hebrew, Latin, Spanish, Turkish, …). These languages also follow Zipf’s Law.

Entropy of Natural Languages
• Between 1 & 2 bits per letter/character
• 1.5 bits per letter is commonly used
• English has ~4.5 letters per word on average
• 4.5 x 1.5 = 6.75 or ~7 bits per word on average
• Normal speech: 1 - 2 words per second
• Hence information per second ~5 bits
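
The chain of estimates above is simple multiplication, sketched here with the slide's own figures. Taking the full 1 to 2 bits per letter range at 1 word per second gives roughly 4.5 to 9 bits per second, so the quoted ~5 bits per second is best read as an order-of-magnitude figure. Variable names are illustrative.

```python
LETTERS_PER_WORD = 4.5   # English average, from the slide
BITS_PER_LETTER = 1.5    # commonly used value, from the slide

bits_per_word = LETTERS_PER_WORD * BITS_PER_LETTER
print(f"{bits_per_word} bits per word")

# Speech rate of 1 - 2 words per second, across the 1 - 2 bits/letter range.
for words_per_second in (1, 2):
    for bits_per_letter in (1.0, 1.5, 2.0):
        rate = words_per_second * LETTERS_PER_WORD * bits_per_letter
        print(f"{words_per_second} word/s at {bits_per_letter} bits/letter: {rate} bps")
```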

Extension to Images
• Same concept and definitions
• Letters replaced by pixels/groups of pixels, etc.
• Words could be analogous to sets of pixels, objects
• The numbers are much larger
• E.g. a 400 x 600 = 240000 pixel image with each pixel capable of taking on one of 16 brightness levels
  – $16^{240000}$ possible images
  – Assume all these images are equally likely (*): the probability of one of these images is $1/16^{240000}$ and the information provided by that image is $240000 \log_2 16 = 0.96 \times 10^6$ bits
  – A real image contains much less “information”: adjacent/nearby pixels are not independent of each other
  – Movies: frame to frame, only small/incremental changes
(*) The “equally likely” assumption is clearly not realistic
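
The image figure follows from the same logarithmic measure: with all images equally likely, the information is the pixel count times log2 of the number of brightness levels. A minimal sketch, with an illustrative function name:

```python
import math


def uniform_image_bits(width: int, height: int, levels: int) -> float:
    """Information in one image if every possible image were equally likely."""
    return width * height * math.log2(levels)


bits = uniform_image_bits(400, 600, 16)
print(f"{bits:.0f} bits = {bits / 1e6:.2f} x 10^6 bits")
```

This is an upper bound; as the slide notes, real images carry far less because neighbouring pixels are correlated.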

Speech Coding
• ~5 b/s is the irreducible information content; multiply by 10 to introduce redundancy. Therefore we should be able to communicate speech “information” at ~50 bps
• Examples of speech coding we use:
  – 64000 bps, 32000 bps PCM
  – 16000 bps CVSD; 2400 bps LPC, MELP
  – 1200, 600 bps MELP
• All of the above are “waveform” codecs; they also convey “non-measurable” (intangible) information
• Speech codec technology based on recognition at the transmitter and synthesis at the receiver could conceivably go lower than 600 bps, but would not contain the intangible component!
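
To make the gap concrete, the codec rates listed on this slide can be compared with the ~50 bps estimate. This is only a ratio calculation; the dictionary name is illustrative and the rates are the ones quoted on the slide.

```python
TARGET_BPS = 50  # ~5 b/s of information x 10 for redundancy, from the slide

CODEC_RATES_BPS = [64000, 32000, 16000, 2400, 1200, 600]

ratios = {rate: rate / TARGET_BPS for rate in CODEC_RATES_BPS}
for rate, ratio in ratios.items():
    print(f"{rate:>6} bps is {ratio:>6.0f}x the ~{TARGET_BPS} bps estimate")
```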

A QUICK REFRESHER ON CONVENTIONAL COMPRESSION
May be found at the end

SEQUENCE REDUCTION
Next Generation Compression
• Dictionary based – implements a learning algorithm
• Dynamically learns the “language” of the communications traffic and translates it into “short-hand”
• Continuously updates/improves “knowledge” of the link “language”
• Frequent patterns move up in the dictionary; infrequent patterns move down and eventually can age out
• No fixed packet or window boundaries
  – Unlike e.g. LZ, which generally uses a 2048 byte window
• Once a pattern is learned and put in the dictionary, it will be compressed wherever it appears
• Data compression is based on previously seen data
• Performance improves with time as “learning” increases
  – Very quickly at first (10 – 20 minutes) and then slowly
• When a new application comes in, SR adapts to its “language”
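
The general idea of a dictionary shared by both ends of a link can be illustrated with a toy sketch. This is not Peribit's MSR algorithm, which is proprietary: the fixed 64-byte chunks, the 4-byte reference size, the absence of aging, and all names below are simplifying assumptions made purely for illustration.

```python
import random

CHUNK = 64      # bytes per pattern; fixed boundaries are a simplification
REF_BYTES = 4   # assumed on-the-wire size of a dictionary reference


class ToyReducer:
    """One end of a link. Both ends learn from the same message sequence."""

    def __init__(self):
        self.index = {}    # chunk bytes -> reference number
        self.chunks = []   # reference number -> chunk bytes

    def _learn(self, chunk: bytes) -> None:
        self.index[chunk] = len(self.chunks)
        self.chunks.append(chunk)

    def encode(self, data: bytes) -> list:
        msgs = []
        for i in range(0, len(data), CHUNK):
            chunk = data[i:i + CHUNK]
            if chunk in self.index:
                msgs.append(("ref", self.index[chunk]))   # seen before: send token
            else:
                self._learn(chunk)
                msgs.append(("raw", chunk))               # new: send and remember
        return msgs

    def decode(self, msgs: list) -> bytes:
        out = []
        for kind, value in msgs:
            if kind == "raw":
                self._learn(value)        # mirror the sender's dictionary update
                out.append(value)
            else:
                out.append(self.chunks[value])
        return b"".join(out)


def wire_bytes(msgs: list) -> int:
    """Bytes sent: a reference, or one tag byte plus the raw chunk."""
    return sum(REF_BYTES if kind == "ref" else 1 + len(value) for kind, value in msgs)


rng = random.Random(0)
payload = bytes(rng.getrandbits(8) for _ in range(8192))
modified = payload[:100] + bytes([payload[100] ^ 0xFF]) + payload[101:]

sender, receiver = ToyReducer(), ToyReducer()
sizes, round_trips_ok = [], []
for data in (payload, modified, payload):
    msgs = sender.encode(data)
    sizes.append(wire_bytes(msgs))
    round_trips_ok.append(receiver.decode(msgs) == data)

print(sizes, round_trips_ok)
```

The first transfer is sent almost entirely raw; the slightly modified file and the repeated original are then sent mostly as references, which mirrors the behaviour described in the file-transfer examples in spirit only.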

[Figure: Relative positioning of statistical and substitutional compression algorithms, including MOLECULAR SEQUENCE REDUCTION (from Peribit, A. P. Singh)]

“Molecular Sequence Reduction”
www.peribit.com

MSR – Technology
• Origins in DNA pattern matching
• Real time, high speed, low latency
• Continuously learns and updates dictionary
• Transparently operates on all traffic (optimized for IP)
• Eliminates patterns of any size, anywhere in the stream
• Patent-pending technology

MSR – Molecular Sequence Reduction
“Next-gen dictionary-based compression”
www.peribit.com

Government/Military use examples
• Many thousands of units in use in the USA (mostly corporate but also government agencies)
• GE MOD has been using Peribit SRs for ~2 years
  – INMARSAT German Navy WAN (encrypted)
  – Links to GE Navy ships in/around South Africa
  – Satellite links to GE units in Afghanistan
  – Plans for some 64 Kbps landlines
  – GE MOD total: 300+ units
• Also other nations …, some with initial trials

Reduction rates observed (reduced by % amount given)
GE Armed Forces Results
Traffic type Version 3.0 V 4.02 V 5.0
HTTP 30 % 40 % 46 %
MAIL 61 % 67 %
NetBios 59 % 62 %
CIFS 92 % 92 %
FTP 69 % 73 %
TELNET 65 % 69 %
93 %

[Figures: four charts from German MOD, one of them a startup behavior example]
[Figure: from Peribit.com (not GE MOD data)]

[Figure: Peribit screen capture, NC3A – WAN (NL – BE)]
• No data compression & no reduction
• With data compression & reduction: effective WAN capacity increased by 2.80, data reduction by 64.34 %
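
The two figures in the screen capture are related by simple arithmetic: if a fraction r of the data is removed, the same link carries 1 / (1 - r) times as much user traffic. A small sketch with illustrative function names:

```python
def capacity_factor(reduction_percent: float) -> float:
    """Effective capacity multiplier for a given data reduction in percent."""
    return 1.0 / (1.0 - reduction_percent / 100.0)


def reduction_percent(factor: float) -> float:
    """Inverse relation: data reduction in percent for a given capacity multiplier."""
    return (1.0 - 1.0 / factor) * 100.0


factor = capacity_factor(64.34)
print(f"64.34 % reduction -> capacity x {factor:.2f}")
```

The same relation can be applied to the per-protocol reduction rates reported by the German Armed Forces; for example, a 92 % reduction corresponds to a factor of about 12.5.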

Peribit Sequence Reducers
www.peribit.com

NC3A TEST RESULT SUMMARY
Expand Model 4800 “WAN Link Accelerators”
[Figures: three result charts for a 512 Kbps satellite link with multiplexed TCP/IP (10 multiplexed TCP/IP sessions in the third chart), each comparing an un-accelerated link, a link with SCPS-TP acceleration, and a link with application accelerator & IP data compressor]

Packeteer

Industry
• New area, but many & an increasing number of companies
  – Peribit.com (now Juniper Networks)
  – Expand.com (Expand Networks)
  – Packeteer.com
  – Riverbed.com
  – Silver-peak.com
  – …
• National authorities (e.g. USA & GE) are also working with industry to incorporate SR/WX technology into national crypto devices

SEQUENCE REDUCTION
Next Generation Compression
Summary (1)
• WANs will form the backbone of Network Enabled Operation
• This technology provides significant improvements in capacity
• Dictionary based – implements a learning algorithm
• Dynamically learns the “language” of the communications traffic and translates it into “short-hand”
• Continuously updates/improves “knowledge” of the link “language”
• Frequent patterns move up in the dictionary; infrequent patterns move down and eventually can age out
• No fixed packet or window boundaries
  – Unlike conventional compression, which operates over 1-2 Kbytes
• Once a pattern is learned and put in the dictionary, it will be compressed wherever it appears
• Data compression is based on previously seen data
• Performance improves with time as “learning” increases
  – Very quickly at first (10 – 20 minutes) and then slowly
• When a new application comes in, SR adapts to its “language”

SEQUENCE REDUCTION
Next Generation Compression
Summary (2)
• Significant advantages for WANs where capacity is an issue (i.e. deployed/mobile/tactical)
• Removes redundant/repetitive transmissions
• Packet-flow acceleration (latency removal) can be easily added
• Quality of Service & Policy Based Multipath can also be implemented
• Does not impact security implementations (cryptos between SRs)
However
• Presently available from a few sources, each with its “proprietary” technology

Conclusions
• Shannon Information Theory provides tools for measuring “information” as “Entropy”
• It has formed the basis for most of the coding and data transmission/detection results since the 1950s
• The DNA/genome mapping process has also apparently benefited from it
  – In the 90s the estimate for the human genome was 20-30 years; it took 2-3 years with the computational developments of the late 90s
• A new form of compression, “Sequence Reduction”, provides significant reductions by reducing redundancies in transmitted data
  – Will provide important advantages for mobile/deployable/moving WAN link applications

Questions / Comments
This presentation & associated paper can be found at
www.nc3a.info/MCC2006

NC3A

NC3A Brussels
Visiting address:
Bâtiment Z
Avenue du Bourget 140
B-1110 Brussels
Telephone +32 (0)2 7074111
Fax +32 (0)2 7078770
Postal address:
NATO C3 Agency
Boulevard Leopold III
B-1110 Brussels - Belgium

NC3A The Hague
Visiting address:
Oude Waalsdorperweg 61
2597 AK The Hague
Telephone +31 (0)70 3743000
Fax +31 (0)70 3743239
Postal address:
NATO C3 Agency
P.O. Box 174
2501 CD The Hague
The Netherlands

Markov model examples

AZEWRTZYNSADXESYJRQY_WGECIJJ_OB
_KRBQPOZB_YMBUAWVLBTQCNIKFMP_KM
VUUGBSAXHLHSIE_MAULEXJ_NATSKI
Zeroth approximation to English (zero memory)
[Zero order Markov: equally likely letters, 27 numbers]
All logs base 2
Entropy $H = \sum_{i=1}^{27} p_i \log (1/p_i) = \log 27 = 4.75$ bits / letter (or symbol)

AI_NGAE__ITF__NR_ASAEV_OIE_BAINTHHHYRO
O_POER_SETRYGAIETRWCO__ EHDUARU_
EU_C_FT_NSREM_DIY_EESE_ F_O_SRIS_R
__UNNASHOR_CIE_AT_XEOIT_UTKLOOUL_E
First approximation to English (zero memory)
[Zero order Markov: letter probabilities, 27 numbers]
Entropy $H = \sum_{i=1}^{27} p_i \log (1/p_i) \approx 4$ bits / letter
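
The zero-memory entropy for 27 symbols can be sketched as follows. The uniform case gives log2 27; the second case estimates letter probabilities from a short sample string, which is an assumption made here for illustration only and is far too small to reproduce the ~4 bits/letter figure obtained from real English statistics.

```python
import math
from collections import Counter


def entropy_bits(probs) -> float:
    """H = sum p_i log2(1/p_i), skipping zero-probability symbols."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)


def letter_probs(text: str) -> list:
    """Relative frequencies over the 27-symbol alphabet (A-Z plus '_' for space)."""
    symbols = ["_" if c.isspace() else c
               for c in text.upper() if c.isspace() or "A" <= c <= "Z"]
    counts = Counter(symbols)
    return [n / len(symbols) for n in counts.values()]


h_uniform = entropy_bits([1 / 27] * 27)

SAMPLE = "the quick brown fox jumps over the lazy dog and the cat sat on the mat"
h_sample = entropy_bits(letter_probs(SAMPLE))

print(f"equally likely letters: {h_uniform:.2f} bits/letter")
print(f"sample letter frequencies: {h_sample:.2f} bits/letter")
```

Any non-uniform set of letter probabilities gives a value below log2 27, which is why the first approximation has lower entropy than the zeroth.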

URTESHETHING_AD_E AT_FOULE_
ITHALIORT_WACT_D_STE_MINTSAN_OLI
NS__TWID_OULY_TE_THIGHE_CO_YS_TH
_HR_ UPAVIDE_PAD_CTAVED_QUES_E
Second approximation to English (memory)
[First order Markov: e.g. prob(a|a), prob(b|a), prob(c|a), …; 27 x 27 = 729 numbers, some zero]
Entropy $H = \sum_{i,k} p_{i,k} \log (1/p_{i|k})$ over the 729 (= 27 x 27) letter pairs $\approx 3.3$ bits / letter
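
The first-order (one letter of memory) entropy can be estimated from pair counts. The sketch below is a minimal estimator: joint probabilities come from adjacent-pair frequencies and conditional probabilities from the counts of the preceding symbol. It is illustrative only; a realistic estimate needs a large text sample, and the function name is not from the paper.

```python
import math
from collections import Counter


def conditional_entropy_bits(symbols: str) -> float:
    """Estimate sum over pairs of p(k, i) * log2(1 / p(i | k)) from a symbol sequence."""
    pairs = Counter(zip(symbols, symbols[1:]))   # (previous, current) counts
    previous = Counter(symbols[:-1])             # how often each symbol is a predecessor
    total_pairs = len(symbols) - 1
    h = 0.0
    for (prev, _cur), n in pairs.items():
        p_joint = n / total_pairs
        p_cond = n / previous[prev]
        h += p_joint * math.log2(1.0 / p_cond)
    return h


# Fully predictable sequence: the next symbol is fixed by the previous one.
h_alternating = conditional_entropy_bits("AB" * 100)
# After each symbol, the next is the same or the other about half the time.
h_doubled = conditional_entropy_bits("AABB" * 50)
print(h_alternating, round(h_doubled, 3))
```

Memory lowers the entropy because part of each symbol is already predictable from its predecessor, which is the step from ~4 to ~3.3 bits per letter in the slides.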

IANKS _CAN_OU_ANG_RLER_THATTED
_OF_TO_SHOR_OF_TO_HAVEMEM_A_I_
MAND_AND_BUT_WHISSITABLY_THERV
EREER_EIGHTS_TAKILLIS_TA_KIND_AL
Third approximation to English (memory)
[Second order Markov: e.g. prob(a|aa), prob(a|ab), prob(a|ac), …, prob(z|zy), prob(z|zz); 27 x 27 x 27 = 19683 numbers, ~75% zero]
(Shannon calls these “di-gram” probabilities)
Entropy: ~3 bits / letter

JOU_MOUPLAS_DE_MONNERNAISSAIN
S_DEME_US_VREH_BRETU_DE_TOUC
HEUR_DIMMERE_LLES_MAR_ELAME_
RE_A_VER_IL_DOUVENTS_SO_FUITE
Third approximation to French
N. Abramson, “Information Theory & Coding”

ET_LIGERCUM_SITECI_LIBEMUS_AC
ERELEN_TE_VICAESCERUM_PE_NON
_SUM_MINUS_UTERNE_UT_IN_ARION
_POPOMIN_SE_INQUENEQUE_IRA
Third approximation to ????
N. Abramson, “Information Theory & Coding”

WE COULD CONTINUE THIS WITH CONDITIONAL PROBABILITIES GIVEN TRIPLETS (tri-grams), QUADRUPLETS (tetra-grams), … n-grams, etc. (i.e. m-th ORDER MARKOV SOURCES, m ≥ 3).
HOWEVER, THIS BECOMES IMPRACTICAL AS THE NUMBER OF JOINT PROBABILITIES BECOMES TOO LARGE, SO SHANNON JUMPED TO MARKOV SOURCES WITH WORDS AS SYMBOLS: the symbol set is no longer 27 characters, but thousands of words. However, an m = 1, 2 Markov model with words gives much better results than n-gram analysis as “n” is increased.

REPRESENTING AND SPEEDILY IS AN
GOOD APT OR COME CAN DIFFERENT
NATURAL HERE HE THE A IN CAME THE TO
OF TO EXPERT GRAY COME TO FURNISHES
THE LINE MESSAGE HAD BE THESE …
Fourth approximation to English
[Zero order Markov with words: e.g. probability of words, zero memory]
(Shannon 1948)
Entropy ≈ 2.2 bits / letter (using Zipf’s Law)

THE HEAD AND IN FRONTAL ATTACK ON AN
ENGLISH WRITER THAT THE CHARACTER OF
THIS POINT IS THEREFORE ANOTHER
METHOD FOR THE LETTERS THAT THE TIME
OF WHO EVER TOLD THE PROBLEM FOR AN…
Fifth approximation to English (memory)
[First order Markov with words: e.g. Probability(word_i | word_j)]
(Shannon 1948)
NATO UNCLASSIFIED
BIR ANLATTIKLARINA GÜLMECE YAZDI
YAPITLARININ ŞARAP BİÇİMLERİ BELA
GÖRÜNÜMÜ GİBİ AMA BİR ETMEK YOK
TUTULDU GELEN GİDEN YER KALMADI ...
Fifth approximation to Turkish (memory)
[First order Markov with words :
e.g. Probability (word_i | word_j)]
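The word-based approximations can be reproduced on a small scale. The Python sketch below is a minimal illustration, not Shannon's procedure: it estimates word probabilities from a made-up toy corpus (`CORPUS` is hypothetical) and then generates text either with zero memory or with first-order memory, i.e. Probability (word_i | previous word). All function names are assumptions chosen for this example.

```python
import random
from collections import Counter, defaultdict


def train(words):
    """Collect zero-memory word counts and first-order successor counts."""
    unigrams = Counter(words)
    successors = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        successors[prev][nxt] += 1
    return unigrams, successors


def pick(counter, rng):
    """Draw one word with probability proportional to its count."""
    return rng.choices(list(counter.keys()), weights=list(counter.values()), k=1)[0]


def generate(unigrams, successors, length, memory, rng):
    """memory=False: independent words; memory=True: condition on previous word."""
    out = [pick(unigrams, rng)]
    while len(out) < length:
        followers = successors.get(out[-1]) if memory else None
        # Fall back to the zero-memory distribution if the word has no successor.
        out.append(pick(followers if followers else unigrams, rng))
    return out


# Hypothetical toy corpus; a real experiment would use a large body of text.
CORPUS = ("the head of the line came to the front and the message "
          "of the expert came to the head of the problem").split()

if __name__ == "__main__":
    rng = random.Random(0)
    unigrams, successors = train(CORPUS)
    print(" ".join(generate(unigrams, successors, 12, False, rng)).upper())
    print(" ".join(generate(unigrams, successors, 12, True, rng)).upper())
```

With memory enabled, every adjacent word pair in the output is one that occurred in the training text, which is why the first-order samples read more like natural language than the zero-memory ones.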
A QUICK REFRESHER ON
CONVENTIONAL COMPRESSION
Conventional Compression
Lossy Compression
• Output is not necessarily an exact copy of the input: most audio, image, video compression algorithms are “Lossy” - our ears and eyes have resolution thresholds
Loss-less Compression
• Data integrity essential in digital data communications - Network compression must be “Loss-less”
• Two basic approaches
  • Statistical compression algorithms
  • Substitutional compression algorithms
Statistical compression : Probabilities of characters in the input
data calculated (or given) - frequently occurring characters are encoded into
fewer bits [e.g. Huffman code, Morse code]
• Static coding : Once the coding is determined in accordance with the probabilities of occurrence it does not change
• Dynamic coding : Coding changes with “context” - for example, the occurrence of “q” in English increases the probability of occurrence of “u” to nearly 1; similarly the occurrence of “th” significantly increases the probability of occurrence of “e”, etc.
• As the amount of “historical context” information increases, “dynamic coding” techniques can approach the “Shannon limit”; however, computational requirements increase exponentially, making them impractical for real-time/on-line applications
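As an illustration of static statistical coding, the sketch below builds a Huffman code from character frequencies so that frequent characters receive shorter bit strings. It is a minimal teaching example under simple assumptions (the code table is built once from the message itself and shared with the receiver); the function names are chosen for this example only.

```python
import heapq
from collections import Counter


def huffman_code(text):
    """Return {symbol: bitstring} built from the symbol frequencies in text."""
    freq = Counter(text)
    if not freq:
        return {}
    if len(freq) == 1:                      # single-symbol input: give it one bit
        return {next(iter(freq)): "0"}
    # Heap entries: (weight, unique tie-breaker, partial code table).
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)     # two least probable subtrees
        w2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]


def encode(text, code):
    """Concatenate the code words of all symbols."""
    return "".join(code[s] for s in text)


def decode(bits, code):
    """Reverse the mapping; works because Huffman codes are prefix-free."""
    inverse = {v: k for k, v in code.items()}
    out, current = [], ""
    for b in bits:
        current += b
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return "".join(out)


if __name__ == "__main__":
    message = "THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH WRITER"
    table = huffman_code(message)
    bits = encode(message, table)
    print(len(bits), "bits instead of", 8 * len(message))
    assert decode(bits, table) == message
```

Because the table never changes after it is built, this corresponds to the static coding case; a dynamic coder would additionally update its probabilities with context.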
Substitutional compression : Identifies repeated strings of
characters (longer the better) and replaces them with reference
identifiers or tokens (shorter the better) - At the receiver the tokens
are de-referenced and the reverse substitution performed
• Essentially a form of “pattern recognition” and classification
• Pattern detection/recognition generally much faster than computations needed for dynamic coding algorithms
• Most network compression techniques in use today use substitutional compression
Compression techniques can also be combined - for example
substitution based compression followed by static coding, etc.
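The substitution and de-referencing steps can be sketched in a few lines of Python. This is a deliberately simplified illustration that works on whole words rather than arbitrary byte strings; the function names `substitute` and `restore` and the token format (a small integer) are assumptions made for the example, not a description of any product.

```python
def substitute(words):
    """Sender: replace each repeated word with a short integer token."""
    table = {}                      # word -> token, built as the data flows
    stream = []
    for word in words:
        if word in table:
            stream.append(table[word])      # repeat: send the token only
        else:
            table[word] = len(table)        # first occurrence: assign next token
            stream.append(word)             # and send the word literally
    return stream


def restore(stream):
    """Receiver: rebuild the same table and de-reference the tokens."""
    table = []                      # token -> word
    words = []
    for item in stream:
        if isinstance(item, int):
            words.append(table[item])
        else:
            table.append(item)
            words.append(item)
    return words


if __name__ == "__main__":
    message = "to be or not to be that is the question to be".split()
    sent = substitute(message)
    print(sent)
    assert restore(sent) == message
```

Sender and receiver never exchange the table itself; both derive it from the literals in the stream, which is what keeps the tokens short.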
• “Substitution” based compression is the basis of almost all
network compression implementations
• Principle of all : replace repeated patterns with shorter tokens
• Different techniques for detecting/encoding repeated patterns
Two basic approaches :
• Lempel-Ziv (LZ) “stateless” window compression
  • e.g. V.42bis, fax compression, LZS (STAC)
• Predictor compression
  • Tries to predict the next input byte : the matching algorithm looks for the most recent match of any pattern rather than the best and longest match - higher speed, but it misses many significant pattern repetitions and therefore gives lower data reduction (not much used)
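A predictor-style compressor can be sketched as follows. The example is loosely modelled on the hash-and-guess-table idea used in PPP Predictor compression (RFC 1978), but the framing is an assumption made for this sketch: the original length is passed separately to `decompress`, and none of the RFC's packet headers or checksums are reproduced.

```python
TABLE_BITS = 16                               # guess table with 65,536 entries
MASK = (1 << TABLE_BITS) - 1


def _step(h, byte):
    """Fold the byte just processed into the rolling context hash."""
    return ((h << 4) ^ byte) & MASK


def compress(data):
    table = bytearray(1 << TABLE_BITS)        # predicted next byte per context
    out = bytearray()
    h = 0
    for start in range(0, len(data), 8):
        flags = 0
        literals = bytearray()
        for bit, byte in enumerate(data[start:start + 8]):
            if table[h] == byte:
                flags |= 1 << bit             # prediction correct: send nothing
            else:
                table[h] = byte               # remember only the most recent follower
                literals.append(byte)
            h = _step(h, byte)
        out.append(flags)                     # one flag byte per block of up to 8 bytes
        out += literals
    return bytes(out)


def decompress(blob, length):
    table = bytearray(1 << TABLE_BITS)
    out = bytearray()
    h = 0
    pos = 0
    while len(out) < length:
        flags = blob[pos]
        pos += 1
        for bit in range(min(8, length - len(out))):
            if flags & (1 << bit):
                byte = table[h]               # sender's guess was right
            else:
                byte = blob[pos]              # literal byte follows
                pos += 1
                table[h] = byte
            out.append(byte)
            h = _step(h, byte)
    return bytes(out)


if __name__ == "__main__":
    sample = b"NATO UNCLASSIFIED " * 40
    packed = compress(sample)
    print(len(sample), "->", len(packed), "bytes")
    assert decompress(packed, len(sample)) == sample
```

The table keeps only one guess per context and is overwritten on every miss, which makes the scheme fast but explains why it overlooks many repetitions that a longest-match search would find.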
Lempel-Ziv (LZ) “stateless” window compression
Published in 1977 (hence LZ77)
• Basis of ~all loss-less data compression implementations today
• Repeated “strings” replaced by “pointers” to the previous location where the string had occurred
• Buffer or “window” required for the “historical” information to be available for reference - typically 1000 - 2000 bytes (mostly 2048 bytes)
• All previous data outside the buffer/window is lost or “forgotten”, hence the name “stateless” or memory-less
• Can find and compress only patterns that are repeated within the window - repetitions separated by more than window size are ignored
• Poor scalability: For compression efficiency large window size is required but this increases pattern search computation significantly
• Good for “file compression” type applications
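The window limitation can be demonstrated with a small LZ77-style sketch. It uses a brute-force search for the longest match inside the window and emits either literals or (distance, length) pointers; the token format, the default parameters, and the function names are assumptions for illustration, not the encoding of any particular standard.

```python
def lz77_compress(data, window=2048, min_match=3, max_match=255):
    """Return tokens: ('lit', byte) or ('ref', distance, length)."""
    tokens = []
    i = 0
    while i < len(data):
        best_len, best_dist = 0, 0
        # Only the last `window` bytes are searched; older data is "forgotten".
        for j in range(max(0, i - window), i):
            length = 0
            while (length < max_match and i + length < len(data)
                   and data[j + length] == data[i + length]):
                length += 1
            if length > best_len:
                best_len, best_dist = length, i - j
        if best_len >= min_match:
            tokens.append(("ref", best_dist, best_len))
            i += best_len
        else:
            tokens.append(("lit", data[i]))
            i += 1
    return tokens


def lz77_decompress(tokens):
    out = bytearray()
    for tok in tokens:
        if tok[0] == "lit":
            out.append(tok[1])
        else:
            _, dist, length = tok
            for _ in range(length):
                out.append(out[-dist])    # byte by byte, so overlapping copies work
    return bytes(out)


if __name__ == "__main__":
    phrase = b"NATO UNCLASSIFIED "
    data = phrase + bytes(range(256)) + phrase   # repetition 274 bytes apart
    for size in (2048, 64):
        toks = lz77_compress(data, window=size)
        refs = sum(1 for t in toks if t[0] == "ref")
        print(f"window={size}: {len(toks)} tokens, {refs} pointers")
        assert lz77_decompress(toks) == data
```

With the 2048-byte window the second copy of the phrase is replaced by a pointer; with the 64-byte window the earlier copy has already left the buffer and the repetition goes undetected. The inner search loop also shows why enlarging the window raises the computational cost.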
Nov 1978, University of Pennsylvania, Museum Hall, Banquet in honor of Claude E. Shannon receiving H. Pender award (Prof. F. Haber & DY)