Title Slide
“Darwinian Evolution, Symmetry, Conservation Principles and a box of chocolates”
Version 1.2: 06/Nov/2017
SEMINAR
Les Hatton
Emeritus Professor of Forensic Software Engineering, Kingston University, London
Protein work done with Professor Greg Warr, NSF and MUSC.
So far, an 8-year slog during which I …
Walked > 3,500 miles in Richmond Park
Produced 2,602 plots
Analysed 100 million lines of code hundreds of times
Analysed the 27Gb European Protein Database uniprot.org hundreds of times
Parsed the entire MusicXML library and significant chunks of Project Gutenberg several times
Wrote compiler front-ends for 7 programming languages in C + around 22,000 lines of Perl + 3,300 lines of R statistics scripts and a protein parser
Got more rejections (> 8 journals) than in the rest of my career put together
Vocabulary
Component: a piece of a system made from a string of tokens.
Tokens: indivisible symbols chosen from an alphabet.
Alphabet: the unique set of symbols from which every component is made.
Examples
Computer programs. Each function or subroutine is made from an alphabet of programming language tokens.
Proteins. These are components made of amino acids chosen from a unique alphabet of 22 amino acids (those coded directly from DNA), plus an (increasingly) large number of tweaked amino acids.
Tokens in programming languages
Take an example from C:
Fixed tokens (18): void int ( ) [ ] { , ; for = >= -- <= ++ if > -
Variable tokens (8): bubble a N i j t 1 2

    void bubble( int a[], int N)
    {
        int i, j, t;
        for( i = N; i >= 1; i--) {
            for( j = 2; j <= i; j++) {
                if ( a[j-1] > a[j] ) {
                    t = a[j-1]; a[j-1] = a[j]; a[j] = t;
                }
            }
        }
    }

Unique alphabet: fixed (18) + variable (8) = 26 tokens
Total size: 94 tokens
Tokenising requires writing a compiler front-end for each language.
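The token counts for the bubble sort can be checked with a few lines of Python. This is only a sketch: a regular expression that happens to cover the tokens used in this one function stands in for a real compiler front-end, and the names KEYWORDS, TOKEN_PATTERN, tokenise and is_fixed are invented for the illustration. It reproduces the total of 94 tokens and the 8 variable tokens. If the closing brace is counted as its own symbol it finds 19 fixed tokens rather than 18, since the list on the slide does not include "}".

```python
import re

# The bubble sort from the slide, reproduced token-for-token.
SOURCE = """
void bubble( int a[], int N){ int i, j, t;
  for( i = N; i >= 1; i--) {
    for( j = 2; j <= i; j++) {
      if ( a[j-1] > a[j] ) { t = a[j-1]; a[j-1] = a[j]; a[j] = t; }
    }
  }
}
"""

# Assumption: only the C keywords that occur in this function are listed.
KEYWORDS = {"void", "int", "for", "if"}

# Identifiers, integer literals, two-character operators, then single symbols.
TOKEN_PATTERN = re.compile(r"[A-Za-z_]\w*|\d+|>=|<=|\+\+|--|[()\[\]{},;=<>+\-]")


def tokenise(source):
    """Split C-like source text into a flat list of tokens."""
    return TOKEN_PATTERN.findall(source)


def is_fixed(token):
    """Fixed tokens are keywords, operators and punctuation supplied by the language."""
    return token in KEYWORDS or not (token[0].isalnum() or token[0] == "_")


tokens = tokenise(SOURCE)
fixed = {t for t in tokens if is_fixed(t)}
variable = set(tokens) - fixed

print("total size:", len(tokens))
print("fixed alphabet:", len(fixed), sorted(fixed))
print("variable alphabet:", len(variable), sorted(variable))
```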
A singular pattern …
Some of the many software ccdfs I analysed in 2010-11
Distribution of software component lengths in programming tokens
600,000 functions from 80 million lines of C
Almost perfect power-law: a_i ~ t_i^{-1/…}
Weird pointy bit
ccdf: complementary cumulative distribution function
pdf: probability distribution function
Tokens in proteins
Protein      Sequence                                                 Unique alphabet
VG22_BPT2    KAEEEVEKNK EEAEEEAEKK IAE                                KAVENI
PHI_MYTCA    AKAKRSPRKK KAAVKKSSKS KAKKPKSPKK KKAAKKPAPKK AAKKK       KAVRSP
Strings of 22 letters directly coded from DNA, plus thousands of tweaked ones produced by post-translational modification (PTM), such as glycosylation.
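For a protein, the unique alphabet and the total size fall straight out of the sequence string. A minimal sketch using the two example sequences from this slide (the function name alphabet_and_length is invented for the illustration, and the tweaked amino acids are ignored):

```python
# The two example sequences from the slide, with their block spacing preserved.
PROTEINS = {
    "VG22_BPT2": "KAEEEVEKNK EEAEEEAEKK IAE",
    "PHI_MYTCA": "AKAKRSPRKK KAAVKKSSKS KAKKPKSPKK KKAAKKPAPKK AAKKK",
}


def alphabet_and_length(sequence):
    """Return (unique alphabet, total size) for a one-letter-code protein sequence."""
    residues = sequence.replace(" ", "")  # the spaces only group the letters into blocks
    return set(residues), len(residues)


for name, sequence in PROTEINS.items():
    alphabet, total = alphabet_and_length(sequence)
    print(name, "alphabet:", "".join(sorted(alphabet)), "size:", total)
```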
Distribution of protein lengths in amino acids
13,532,084 proteins built from 5,392,041,307 amino acids in the TrEMBL database 15-07.
Almost perfect power-law: a_i ~ t_i^{-1/…}
Weird pointy bit
Computer programs and proteins
Why are the length distributions of such disparate systems functionally identical?
Scientific things humans find it difficult to deal with
Time: Anything lasting more than a few years (a few minutes for news media). Especially natural selection and tectonic drift.
Scale: Anything much smaller than a midge and anything further away than, say, New Zealand. A grain of sand contains ~ 10^19 atoms. The universe contains ~ 10^24 stars, or ~ 10^82 atoms.
Natural Selection - time
The eye. Light-sensitive cells were finally shown to have originated in the brain by molecular fingerprinting of the brain of “a living fossil”, Platynereis dumerilii, a marine worm (EMBL 2004, Science), although the eye exists in many stages in the animal kingdom. There are, however, some things which natural selection does not explain, for example the lengths of proteins.
Emmy Noether’s amazing theorem (1918) - scale
For physical systems (close enough), every conservation principle is associated with a symmetry.
Energy -> invariance in time
Linear momentum -> invariance in displacement
Angular momentum -> invariance in direction
So are there any symmetries here?
Scale invariance - proteins
All life
All bacteria
Human
All data from TrEMBL genomic databases.
Scale invariance - software
“Universe” of 7 languages
C language
GNU C compiler
All data from open source downloads.
Emergent behaviour in 40 MSLOC
40 million lines of Ada, C, C++, Fortran, Java, Tcl-Tk from 80+ systems
Playing with beads and sniffing for conservation principles
Heterogeneous – software, proteins, music, literature
Homogeneous – atomic elements, literature
Hartley-Shannon Information is token-agnostic
Hartley-Shannon information is basically the log of the number of ways of arranging tokens without caring what they mean. For a unique alphabet A and a total size T, this is log(A^T) = T log(A).
But what happens when we build systems to conserve this (CoHSI)?
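With an alphabet of A symbols and T tokens there are A^T possible arrangements, so the information is T log(A). The sketch below computes it that way to avoid ever forming the huge number A^T. It also shows what token-agnostic means in practice: relabelling the tokens leaves the value unchanged. The function names are invented for the illustration and the natural log is an arbitrary choice of base.

```python
import math


def hartley_information(alphabet_size, total_size):
    """log(A^T), computed as T * log(A)."""
    return total_size * math.log(alphabet_size)


def information_of(component):
    """Information of a component given as a sequence of tokens of any kind."""
    tokens = list(component)
    return hartley_information(len(set(tokens)), len(tokens))


# A protein fragment, a relabelled copy, and a list of program tokens:
# only the alphabet size and the total size matter, not what the tokens mean.
print(information_of("KAEEEVEKNKEEAEEEAEKKIAE"))
print(information_of("AABC"), information_of("XXYZ"))
print(information_of(["if", "if", "(", ")"]))
```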
Statistical mechanics
My first effort (2011-14)
Conserve total beads
Conserve total Information, log(A^T)
Boltzmann’s magical statistical mechanics machine gives distributions
Enter a box of chocolates …
How many ways can we arrange a chocolate box of t_i chocolates guaranteeing a_i unique chocolates? N(t_i, a_i; a_i)
(For t_i >> a_i, this becomes a_i^{t_i} as we need, to give the observed power-law.)
This is not trivial (as I first thought) and relies on recursion and the additive compositions of numbers …
Example: N(5,2;2) = (5!)/(1!4!) + (5!)/(4!1!) + (5!)/(2!3!) + (5!)/(3!2!) = 5 + 5 + 10 + 10 = 30
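The worked example suggests one way to compute N(t, a; a): sum a multinomial coefficient over every additive composition of t into a positive parts. The sketch below does this recursively. It is a reading of the example on the slide, not necessarily the method used in the actual analysis, and the function names are invented. As a cross-check it also uses the standard inclusion-exclusion count of sequences of length t that use all a symbols, which agrees with the composition sum.

```python
from math import comb, factorial


def compositions(total, parts):
    """Yield every ordered way of writing `total` as a sum of `parts` positive integers."""
    if parts == 1:
        if total >= 1:
            yield (total,)
        return
    for first in range(1, total - parts + 2):
        for rest in compositions(total - first, parts - 1):
            yield (first,) + rest


def multinomial(counts):
    """Number of distinct orderings of a box with the given count of each chocolate."""
    result = factorial(sum(counts))
    for count in counts:
        result //= factorial(count)
    return result


def chocolate_box_count(t, a):
    """N(t, a; a): arrangements of t chocolates in which all a kinds appear."""
    if a < 1 or t < a:
        return 0
    return sum(multinomial(c) for c in compositions(t, a))


def surjection_count(t, a):
    """Cross-check by inclusion-exclusion over the kinds left out."""
    return sum((-1) ** k * comb(a, k) * (a - k) ** t for k in range(a + 1))


print(chocolate_box_count(5, 2))             # the slide's example
print(chocolate_box_count(20, 2) / 2 ** 20)  # approaches 1 when t >> a
```

The second printed value illustrates the limit quoted above: for t much larger than a, the count is dominated by a^t.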
2016: The chocolate box extension for heterogeneous systems
Figure: Heterogeneous CoHSI length distribution compared with proteins. Curves: First effort (2014) and Chocolate Box (2016-7).
Eureka!
Playing with beads
Heterogeneous – software, proteins, music, literature
Homogeneous – atomic elements, literature
2017: The homogeneous case
The resultant CoHSI pdf for homogeneous systems mutates directly into Zipf’s law at all scales (2017),
a_i ~ i^{-…}
where i is the rank order, and therefore serves as a proof of Zipf’s empirical law.
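A rank-frequency check of this kind is easy to sketch. The code below ranks tokens by how often they occur and fits a straight line to log frequency against log rank. The corpus is synthetic and built to follow the classical Zipf exponent of -1, which is an assumption made for the illustration only. The function names are also invented. Real word counts from a text would be passed in the same way.

```python
import math
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, frequency) pairs, with rank 1 for the most frequent token."""
    ordered = sorted(Counter(tokens).values(), reverse=True)
    return list(enumerate(ordered, start=1))


def loglog_slope(points):
    """Least-squares slope of log(frequency) against log(rank)."""
    xs = [math.log(rank) for rank, _ in points]
    ys = [math.log(freq) for _, freq in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Synthetic corpus: the word of rank r occurs about 1000 / r times (assumed Zipf exponent of -1).
corpus = []
for rank in range(1, 51):
    corpus.extend([f"w{rank}"] * round(1000 / rank))

slope = loglog_slope(rank_frequency(corpus))
print("fitted log-log slope:", round(slope, 3))
```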
2017: The homogeneous case (and hybrid systems)
Figure: pdf and ccdf for Three Men in a Boat and the European Constitution, shown for the homogeneous case (word freq.) and the heterogeneous case (letter freq.).
2017: The homogeneous case (and wild speculation)
Distribution of elements in universe
Distribution of elements in sea water
Dark Energy (atomic number -1 ?)
Dark Matter (atomic number 0 ?)
Structure and component size: Power-law => big components are inevitable
Figure labels: 10 decisions, 50 decisions, t_i^{-1.5}
In any software system, for every eleven 10 decision components there will on average be one 50 decision component. The bigger the system, the more accurate this becomes and you have no control over this. For proteins, the exponent is around 1.6.
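The "eleven to one" figure follows directly from the power-law: if the frequency of components of size t goes as t^(-1.5), then the ratio of 10 decision components to 50 decision components is (50/10)^1.5, about 11.2. A small sketch of the arithmetic (the function name is invented for the illustration):

```python
def relative_frequency(small, large, exponent):
    """Expected number of components of size `small` per component of size `large`,
    assuming frequency proportional to size ** (-exponent)."""
    return (large / small) ** exponent


# Software: exponent 1.5, comparing 10 decision and 50 decision components.
print(relative_frequency(10, 50, 1.5))

# The same size ratio of 5 with the protein exponent of around 1.6.
print(relative_frequency(10, 50, 1.6))
```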
Music and alphabets ?
If notes are tokens, does including duration make a difference in CoHSI ?
CoHSI predicts that consistent alphabets will be power-laws and power-laws of one another just as we observe above.
883 pieces of music, duration and no-duration alphabet
log-log duration and no-duration alphabet
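The two alphabets can be pictured as two ways of tokenising the same piece. In the sketch below a note is a (pitch, duration) pair. The duration alphabet treats each distinct pair as a symbol, while the no-duration alphabet looks at pitch only. The melody is a made-up fragment used purely for illustration, the function name is invented, and parsing of MusicXML is left out.

```python
# Hypothetical melody: (pitch, duration in beats). Not taken from the analysed corpus.
MELODY = [
    ("C4", 1), ("D4", 1), ("E4", 2), ("C4", 2),
    ("E4", 1), ("D4", 1), ("C4", 4),
]


def alphabets(notes):
    """Return (duration alphabet, no-duration alphabet, total size) for a list of notes."""
    with_duration = set(notes)                        # pitch and duration together form the token
    without_duration = {pitch for pitch, _ in notes}  # pitch alone is the token
    return with_duration, without_duration, len(notes)


with_duration, without_duration, total = alphabets(MELODY)
print("total size:", total)
print("duration alphabet:", len(with_duration))
print("no-duration alphabet:", len(without_duration))
```

The total size is the same under both tokenisations and only the alphabet changes, which is what makes the two categorisations comparable.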
Summary
The length and alphabetic properties of proteins right down to the species level and software components down to individual packages can be explained by CoHSI without recourse to natural selection or human volition.
CoHSI appears to be a deeper principle setting bounds for all discrete systems.
CoHSI implies highly conserved average component length and unusually frequent larger components exactly as observed.
CoHSI implies that all consistent categorisations are power-laws of each other.
A stunning revelation …
The occurrence rate of the various letters in the amino acid alphabet suggests that approximately one protein in the entire TrEMBL fully annotated subset SwissProt will contain the words KING and ELVIS at the same time. As it happens, there is exactly one …
Protein        Sequence
PT111_YEAST    … LQENAHIHTR KINGGEDSSL SGFNAVVDFER FEFKKKKVSH NDVYGAELVIS NSLKEGIAP …
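Searching a sequence for embedded words is a plain substring test once the block spacing is removed. A minimal sketch on the excerpt shown above (the function name contains_all is invented, and only the quoted excerpt is searched, not the full protein or the database):

```python
# The excerpt of PT111_YEAST quoted on the slide, without the leading and trailing elisions.
EXCERPT = "LQENAHIHTR KINGGEDSSL SGFNAVVDFER FEFKKKKVSH NDVYGAELVIS NSLKEGIAP"


def contains_all(sequence, words):
    """True if every word occurs somewhere in the sequence, ignoring block spacing."""
    residues = "".join(sequence.split())  # a word may straddle two blocks
    return all(word in residues for word in words)


print(contains_all(EXCERPT, ["KING", "ELVIS"]))
```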
Reference
My writing site:
http://www.leshatton.org/
Earlier results of this work appear in IEEE TSE, PLOS ONE and arXiv.
Photographs of scientific figures and the Bach chorale courtesy of Wikipedia under Creative Commons.