Basic_principles_of_design.

NON-destructive design

Lena Chuklina, Artur Zalevsky

Lecture 1. Basic principles of design

SMTB 2014

Disclaimer

• This lecture makes shameless use of:
• Ideas and examples from Robin Williams's book “The Non-Designer’s Design Book”*
• Posters by participants of the school “Modern Biology & the Future of Biotechnology”, 2013 and 2014**

*Copyright was not respected. The lecturer is very ashamed…
**These people knew what they were getting into. They submitted their posters for critique at the school. For some of them, this helped make the posters better.

An excellent plan
1. Why do we need design?
• How much algebra must one learn to achieve harmony?
2. Four principles of design
• Contrast
• Repetition
• Alignment
• Proximity
3. Examples

Design is not about beauty

• … the important part must stand out and the unimportant must be subdued …

• Jan Tschichold, 1935

Basic principles of design

Design elements
• Colors
• Shapes and lines
• Fonts
• Relative placement (+ alignment)

Proximity (also known as grouping)
• Proximity in space implies proximity in meaning
• Group elements into meaningful units

Proximity.

Before After

Proximity

Alignment

• No • Element

• Should be • Placed

• arbitrarily

• Everything • has its

• place

Types of alignment

A strong line gives support

What has changed?

This poster can be improved in three clicks

Repetition
• Repeat design elements. Repetition creates structure and has a calming effect
• What can be repeated?
• Color
• Font
• Line weight
• Sizes (of fonts, columns, images)

What is repeated here?

Modelling Leaf Shape Evolution with Gaussian Processes

N. A. Raharinirina, L. Rusaitis, H. Jackson, N. S. Jones, J. W. J. Anderson, M. Tsiantis, M. Cartolano and J. Hein‡

Department of Statistics, University of Oxford, 1 South Parks Road, OX1 3TG, United Kingdom

‡ [email protected]

Motivation

Leaf shapes display tremendous variation over their evolution, which makes them an attractive system to study. Our focus of investigation is to find ways of quantifying this leaf shape diversity and to infer the existing phylogenetic trees from sample leaf data. Although there are many techniques already available for phylogenetic inference, in our implementation we take the edges of the leaves as a 2-D function and assume that they come from a phylogenetic Gaussian process.

Varying the topology of the phylogeny that we assume the leaves come from, we intend to be able to select the correct one simply by maximum likelihood methods.

Representing Leaf Shapes

a) Olimarabidopsis pumila b) Arabidopsis neglecta

We quantify the leaves by taking a 2-D representation of them and finding the distances from the vein to the edge of the leaf, as well as using the gradient or just the very tip of the leaf to compare the effectiveness of each data type.

Gaussian Process Regression Model

We infer a Gaussian Process on our leaf data and find the mean and the covariance function of the GP.

Firstly, we analyse one leaf shape GP regression, and get covariance in space only:

$$k(x, x'; l) = \exp\left(-\frac{(x - x')^2}{2 l^2}\right) + \sigma^2 \delta(x - x').$$

Then, to do a phylogenetic inference, we introduce a covariance in evolutionary time t for the leaves u and v:

$$k(x_u, x'_v; l, t = (t_1, t_2)) = e^{-(t_1 + t_2)} \exp\left(-\frac{(x_u - x'_v)^2}{2 l^2}\right) + \sigma^2 \delta(u - v)\, \delta(x_u - x'_v).$$

Maximizing the likelihood over (l, t) we find the most likely phylogeny:

$$p(y \mid X, (l, t)) = \frac{1}{(2\pi)^{n/2} |K_y|^{1/2}} \exp\left(-\frac{1}{2} (y - \mu)^T K_y^{-1} (y - \mu)\right).$$
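As an illustration of the poster's model (not the authors' code), the space covariance and the Gaussian likelihood can be sketched in a few lines of NumPy. The function names, the noise-variance argument and the use of `slogdet` and `solve` instead of an explicit determinant and inverse are assumptions made here for the example.

```python
import numpy as np


def space_kernel(x, x2, length_scale, noise_var=0.0):
    """Squared-exponential covariance plus white noise on coincident inputs."""
    x = np.asarray(x, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    diff = x[:, None] - x2[None, :]
    k = np.exp(-diff**2 / (2.0 * length_scale**2))
    # The delta term adds noise only where the two inputs coincide.
    k += noise_var * (diff == 0.0)
    return k


def gaussian_log_likelihood(y, mean, cov):
    """Logarithm of the multivariate normal density of y given mean and cov."""
    y = np.asarray(y, dtype=float)
    resid = y - mean
    n = y.size
    # slogdet and solve avoid forming the determinant and inverse explicitly.
    _, logdet = np.linalg.slogdet(cov)
    quad = resid @ np.linalg.solve(cov, resid)
    return -0.5 * (n * np.log(2.0 * np.pi) + logdet + quad)
```

Maximizing `gaussian_log_likelihood` over the kernel parameters for each candidate tree is the kind of comparison the poster describes; the phylogenetic time factor would multiply the squared-exponential term.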

Inference on Simulated Data

Simulating 'leaves' from a GP for which we know all the relevant parameters, we can see how well we are able to recover them using our inference procedure. Most simulated data sets we tried this on gave reasonable results, and the estimate of the time between leaves was not overly sensitive to incorrect lengthscales.

[Figure: a) Proportion of correct trees against length scale, comparison with UPGMA (red). b) Estimate of total time of tree against length scale: allowed evolutionary time (red is true total time).]

The benchmark we are trying to beat is the proportion of times that the correct phylogeny was inferred using the UPGMA method, using a distance matrix given by the sum of squared distances between points on the leaves.

Simulating 4 leaves with a total of 15 possible phylogenies, we took 100 datasets. The proportion of phylogenies selected correctly by UPGMA was 0.385. At the correct lengthscale (l = 1), the proportion selected correctly by the Gaussian process inference was 0.53. So we can say with confidence that Gaussian Process regression performs better than UPGMA when we get the covariance structure correct. As our lengthscale guess gets further from the truth, though, the performance of the GP inference decreases a lot.
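The figure of 15 possible phylogenies for 4 leaves agrees with the standard count of rooted, leaf-labelled binary trees, (2n - 3)!!, assuming that is what the poster counts. A small check (the function name is chosen here for illustration):

```python
def num_rooted_binary_trees(n_leaves):
    """Count rooted, leaf-labelled binary tree topologies: (2n - 3)!!."""
    if n_leaves < 1:
        raise ValueError("need at least one leaf")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 3.
    for k in range(3, 2 * n_leaves - 2, 2):
        count *= k
    return count


print(num_rooted_binary_trees(4))  # 15
```

With the 5 real leaves used later on the poster, the same formula gives 105 candidate topologies.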

Results on Real Leaf Data

a) Original b) Polar Form c) Consensus d) Gradient e) Tip of the Leaf

The previous analysis on the simulated data showed that it is possible to use the GP to infer phylogenies. Encouraged by this, we used a general squared exponential space covariance and a simple exponential covariance in time on a real sample of 5 leaves in the Arabidopsis family. In the figures above, we present the maximum likelihood surfaces of all the different data type representations we used for the leaf shape. The green point is the maximum likelihood of the true phylogeny, so we can straightforwardly quantify the strength of our predictions.

The normal space covariance seems to give us reasonably good results for some data sets, and very poor predictions for the others. Therefore, the model is highly sensitive to the type of leaf shape representation.

Tree Comparison

[Figure: True Tree (left) against our most likely inferred Tree (right). Both trees span Olimarabidopsis pumila, Arabidopsis halleri, Arabidopsis lyrata, Arabidopsis neglecta and Arabidopsis thaliana.]

The tree has log likelihood = 67.75

Further Results and Extensions

Another area of interest was to investigate the consequences of assuming a non-homogeneous space covariance, by increasing correlation between special points on the leaves. We chose these points as the turning points in the leaf outlines by analysing the gradient, and we changed the space covariance matrix accordingly. We observed some drastic improvement in the prediction for some data types, particularly for the original, gradient and tip representations. Thus, provided we can find the right covariance structure that represents the leaf shape, we can make much better predictions.

a) A significant improvement for true tree likelihood in original leaf representation

b) Modified space-covariance matrix for leaves Halleri, Thaliana, Pumila, Neglecta

The project points to many other areas still left to investigate, from studying these modified covariance matrices and hyperparameter sensitivity in more depth to experimenting with 2-dimensional regression models and other representations of the leaf shapes. Gaussian Process regression proves to be a powerful method worthy of more investigation.

References

[1] Nick S. Jones and John Moriarty (2010). Evolutionary Inference for Functional Data: Using Gaussian Processes on Phylogenies to Study Shape Evolution.

[2] C. E. Rasmussen, C. K. I. Williams (2006). Gaussian Processes for Machine Learning, the MIT Press.

Acknowledgements

This work was carried out as part of the Oxford Summer School in Computational Biology, 2011, in conjunction with the Department of Plant Sciences, and with support from the Department of Zoology. Funding was provided by J. Hein’s PRA. We specially thank J. W. J. Anderson, N. S. Jones and J. Hein for guidance, and everyone at the Plant Sciences that made this project possible.


What here: 1. Is repeated? 2. Is aligned?

Contrast

• Avoid similar elements. If they are not the same (exactly the same), then make them really different

Color contrast

Contrast in tone, saturation and brightness

Absent :)

Present :)

Contrast in structure

Strengthening contrast in fonts

Strengthening contrast in lines

[Figure: amount of mRNA under different conditions; labels: stat, nod, FC]

Contrast that is lacking

There is not enough contrast because there are too many elements

Let's practice on kittens… (mice)

• Which principles are violated?
• In the images
• In the fonts
• In the colors

How Russians eat (slon.ru)

What is wrong with the repetition/contrast?

Which principles are followed, and which are violated?

The last one…

TSS mapping and transcript repertoire

TSS position in relation to gene is key to its function.

Promoter motif prediction

Transcription Start Site Map Of Soy Symbiont Bradyrhizobium japonicum Based On dRNA-seq

1 Moscow Institute of Physics and Technology, Dolgoprudny, Russia
2 A.A. Kharkevich Institute for Information Transmission Problems, Moscow, Russia
3 M.V. Lomonosov Moscow State University, Moscow, Russia
4 Massachusetts Institute of Technology, Boston, USA
5 Institute of Microbiology and Molecular Biology, Justus-Liebig Universität Giessen, Gießen, Germany

[email protected]

Jelena Chuklina1,2, Nikolay Lyubimov3, Maxim Imakaev4, Elena Evguenieva-Hackenberg5 and Mikhail S. Gelfand2,3


Outline
• Perform new round of machine learning with updated training set
• Update gTSS and 5’-aTSS classification
• Compare dRNA-seq data with expression array and proteome data

Acknowledgments
• Julia Hahn and Sebastian Thalmann for experimental validation of transcription start sites and promoter motifs
• Iakov Davydov and Aleksandr Chuklin for extensive advice on program development
• Cynthia Sharma, Konrad Förstner, Jorg Vogel for sequencing and read mapping
• Gabriella Pessi and Hans-Martin Fischer for nodule RNA

Summary
1. We detected 17574 peaks; after machine learning, 10071 were left as TSS.
2. We detected 3979 RpoD promoters and 485 RpoN motifs; 159 TSSes have both.
3. After re-annotation, 73 ncTSS and 682 iTSSes were re-classified as gTSSes.

Abstract

dRNA-seq was designed for selective sequencing of native transcripts originating from transcription start sites (TSS). Here we present TSSF (Transcription Start Site Finder), a software package which allows comprehensive analysis of the bacterial transcriptomic landscape. A TSS map makes it possible to assess the repertoire of small non-coding RNA, investigate promoter motifs and improve gene annotation.

In this study we use TSSF to compare the transcriptome of the soy symbiont Bradyrhizobium japonicum in liquid cultures and in root nodule populating bacteroids.

Re-annotation

TSS detection. Machine learning

The (+) library is RNA selected for primary transcripts; the (-) library is all RNA, including processed (Fig. 1). All peaks matching in the (+) and (-) libraries were treated as candidate TSS and were subjected to automated machine learning. Expert-assessed peaks served as a training set (Fig. 2 and Table 1). Machine learning was performed separately for free-living bacteria (FR) and nodules (NO). To compute support vectors, the following parameters were selected:

i. Height of (+) and (-) peak (Fig. 3)
ii. Ratio of (+) and (-) peak
iii. Average expression in 30 b.p. radius
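The three parameters listed above could be computed per candidate peak roughly as follows. This is only a sketch under stated assumptions, not TSSF's actual code: the function name, the pseudocount of 1 in the ratio, and the choice to average the (+) library coverage over the window are all hypothetical.

```python
def peak_features(plus_cov, minus_cov, pos, radius=30):
    """Return (plus_height, minus_height, ratio, mean_expression) at a peak.

    plus_cov and minus_cov are per-base read coverage lists for the (+)
    and (-) libraries; pos is the index of the candidate peak.
    """
    plus_height = plus_cov[pos]
    minus_height = minus_cov[pos]
    # Pseudocount (an assumption) keeps the ratio finite when the (-) peak is 0.
    ratio = plus_height / (minus_height + 1.0)
    # Window of +/- radius bases around the peak, clipped to the sequence ends.
    lo = max(0, pos - radius)
    hi = min(len(plus_cov), pos + radius + 1)
    window = plus_cov[lo:hi]
    mean_expression = sum(window) / len(window)
    return plus_height, minus_height, ratio, mean_expression
```

Feature tuples of this kind, labelled by the expert-assessed training set, are what a support vector classifier would be trained on.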

Fig. 3. Peak detection: RNA-seq read coverage (blue), salience function (green), peaks (red)

Fig. 5. Best-scoring patterns were used to construct Positional weight matrix (PWM). PWM threshold determination (upper): score distribution density of normal upstreams is skewed towards higher scores when compared to random sequences. Resulting logos (lower) of RpoD (σ70) and RpoN(σ54).

[Plot: RpoN, score distribution density. TSSes overexpressed in nodules. x-axis: totalScore; y-axis: density; series: normal, random]

[Diagram: allowed deviations of box1 and box2: substitution, extension, shift]

                     ISGA vs old   RAST vs old   RAST vs ISGA
matching CDSes       4749          4669          7690
matching genes       4796
re-annotated start   3050          2941          898
new genes            1351          1105          556
discarded            525           707           127

        old    ISGA   RAST
genes   8373   9197
CDS     8317   9144   8715

sRNA length assessment

A typical transcript starts with a TSS and ends with a terminator. We used 3 publicly available tools (ARNold, TransTermHP, WebGesterDB) for rho-independent terminator prediction. Only ARNold predicts terminators independently of the annotated gene end, and we used it to assess sRNA length.

Only 247 TSSes were matched with terminators; their length was usually 40-200 nt, rarely more than 400 nt.

See also: poster by Julia Hahn

Fig. 2. Expert assessment of candidate TSS for training set.

Fig. 1. dRNA-seq data. (+) library – red, (-) library – blue.

Table 1. Training set: manually assessed 0-130 kb and 1681..1920 kb (symb. island) of genome

Fig. 9. Start-codon re-annotation: change in protein lengths after re-annotation with RAST and ISGA. There is clear skew of ORFs which became shorter for both ISGA and RAST. This leads to iTSS re-classification as gTSS

5’-untranslated region length

Fig. 7. While most 5’-UTRs have a typical length of 20-40 nt, there is a considerable number of leaderless transcripts, which seems to be a common property of bacteria

Fig. 8. Re-annotation of RegR (blr0904): now the TSS №1 precedes start-codon. Old annotation is grey, new is cyan. P1, P2, P3 are predicted promoters.

Table 2. Number of genes (CDS) predicted by different AGEs

Table 3. Different B.japonicum USDA 110 annotations

Anti-sense transcript mapping

Most of the TSS classified as gTSS and aTSS belong to the 5’-UTR and often don’t intersect the corresponding anti-sense transcripts, and thus are gTSS/oTSS, transcribed divergently (as dashed arrow above). Overlap in various aTSS types is due to overlap of annotated genes.

Protein-coding genes:
• 4084 protein-coding genes have TSS
• Maximal number of TSS per gene is 4
• 873 protein-coding genes have more than one TSS

Anti-sense RNAs:
• 4013 genes have anti-sense TSS (2056 of them expressed)

Internal TSSes:
• 4167 genes have iTSSes (2368 of them are expressed)

gTSS = gene TSS
iTSS = internal TSS
oTSS = orphan TSS
aTSS_5, aTSS_i, aTSS_3 = anti-sense

Fig. 6. Different TSS type (=transcript type) distribution. Abundance of iTSS may be due to: 1) Operon intrinsic promoter; 2) RNA cleavage products misclassified as TSS. For aTSS misclassification analysis, see below.

[Figure labels: 1340; oTSS]

TSS mapping allows for correction of annotation errors, especially re-annotation of start codons. We applied automated genome annotation engines (AGE) RAST and ISGA to improve the Bradyrhizobium japonicum USDA 110 annotation.

TSS candidate upstream sequences are enriched with promoter motifs; promoters support a TSS candidate as a true TSS.

Usually promoters possess:

1. Conserved two-box sequences
2. Conserved distance to TSS
3. Conserved distance between boxes

We scanned 60 nt sequences upstream of each predicted TSS (or the subset of TSS upregulated in nodules) to find overrepresented 6-nt motifs. We allowed a 1-2 nt shift of boxes from the ideal distance, a 1-2 nt extension of the distance between boxes and 1-2 substitutions in each box, penalizing for each.
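To make the penalized two-box search concrete, here is a minimal sketch that scans one upstream sequence for two 6-nt boxes. It is not the poster's software: the function names, the unit penalties, and the decision to vary the spacer in both directions are assumptions, and the shift of the boxes relative to the TSS is not modelled.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(c1 != c2 for c1, c2 in zip(a, b))


def best_two_box_hit(upstream, box1, box2, ideal_gap, max_gap_change=2,
                     max_subs=2, gap_penalty=1.0, sub_penalty=1.0):
    """Return (penalty, box1_start, gap) of the lowest-penalty hit, or None."""
    w1, w2 = len(box1), len(box2)
    best = None
    for start in range(len(upstream) - w1 + 1):
        subs1 = hamming(upstream[start:start + w1], box1)
        if subs1 > max_subs:
            continue
        # Try every spacer length within the allowed deviation.
        for gap in range(ideal_gap - max_gap_change,
                         ideal_gap + max_gap_change + 1):
            pos2 = start + w1 + gap
            if gap < 0 or pos2 + w2 > len(upstream):
                continue
            subs2 = hamming(upstream[pos2:pos2 + w2], box2)
            if subs2 > max_subs:
                continue
            penalty = (sub_penalty * (subs1 + subs2)
                       + gap_penalty * abs(gap - ideal_gap))
            if best is None or penalty < best[0]:
                best = (penalty, start, gap)
    return best
```

For example, with the illustrative boxes `TTGACA` and `TATAAT` separated by an ideal 17-nt spacer, a perfect match scores a penalty of 0, and each substitution or spacer change adds one penalty unit.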

Fig. 4. The highest concentration of correlated positions lies in the regions around -35 and -10 nt. The illustration is based on the 5000 best patterns.

[Plot: RpoD, score distribution density. x-axis: totalScore; y-axis: density; series: normal, random]

[Figure: RegR, bll0904. Predicted promoters P3, P2, P1; start codons ATG (old) and TTG (new)]

Brightness and saturation

Sequential colors
