APIs and Synthetic Biology
2
The API, or how to make your computational collaborators love you
Uri Laserson | @laserson | [email protected] May 2014
3
The API, or how to make your computational collaborators love you, and also some perspectives on engineering biology and immunology
4
5
NCBI Sequence Read Archive (SRA)
Today…1.14 petabytes
One year ago…609 terabytes
For every “-ome” there’s a “-seq”
Genome: DNA-seq
Transcriptome: RNA-seq, FRT-seq, NET-seq
Methylome: Bisulfite-seq
Immunome: Immune-seq
Proteome: PhIP-seq, Bind-n-seq
7
Crappy academic code
counts_dict = {}
for chain in vdj.parse_VDJXML(inhandle):
    try:
        counts_dict[chain.junction] += 1
    except KeyError:
        counts_dict[chain.junction] = 1

for count in counts_dict.itervalues():
    print >>outhandle, np.int_(count)
vs.
SELECT count(*) FROM antibodies GROUP BY junction
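The contrast between the hand-rolled loop and the one-line query can be checked directly. This is a minimal sketch, assuming an in-memory SQLite table named `antibodies` with a `junction` column (names taken from the query on the slide) and made-up junction strings:

```python
import sqlite3
from collections import Counter

# Made-up junction sequences standing in for parsed immune chains.
junctions = ["TGTGCAAAA", "TGTGCGAGG", "TGTGCAAAA", "TGTGCAAAA"]

# Imperative counting, the same idea as the loop on the slide.
loop_counts = sorted(Counter(junctions).values())

# Declarative counting: load the rows into a table and let SQL group them.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE antibodies (junction TEXT)")
conn.executemany(
    "INSERT INTO antibodies (junction) VALUES (?)",
    [(j,) for j in junctions],
)
sql_counts = sorted(
    row[0]
    for row in conn.execute("SELECT count(*) FROM antibodies GROUP BY junction")
)
conn.close()

print(loop_counts == sql_counts)
```

Both routes give the same counts; the difference is that the query states what is wanted and leaves the how to the database.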
9
What is an API?
• Application Programming Interface
• Contract (between machines)
• Specifications for:
  1. Procedures and methods
  2. Data structures/messages
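The two kinds of specification can be sketched in a few lines of Python. Everything named here (`Receptor`, `ReceptorStore`, `InMemoryStore`, `count_junction`) is hypothetical and exists only to illustrate the idea: a dataclass plays the role of the message, and a `typing.Protocol` (Python 3.8+) plays the role of the procedure contract.

```python
from dataclasses import dataclass
from typing import List, Protocol


@dataclass(frozen=True)
class Receptor:
    """Data structure/message: the shape both sides agree on."""
    name: str
    junction: str


class ReceptorStore(Protocol):
    """Procedures and methods: what any implementation promises to offer."""

    def add(self, receptor: Receptor) -> None: ...

    def find_by_junction(self, junction: str) -> List[Receptor]: ...


class InMemoryStore:
    """One possible implementation; callers never depend on its internals."""

    def __init__(self) -> None:
        self._items: List[Receptor] = []

    def add(self, receptor: Receptor) -> None:
        self._items.append(receptor)

    def find_by_junction(self, junction: str) -> List[Receptor]:
        return [r for r in self._items if r.junction == junction]


def count_junction(store: ReceptorStore, junction: str) -> int:
    """Client code written against the contract, not the implementation."""
    return len(store.find_by_junction(junction))


store = InMemoryStore()
store.add(Receptor("read1", "TGTGCAAAA"))
store.add(Receptor("read2", "TGTGCAAAA"))
store.add(Receptor("read3", "TGTGCGAGG"))
print(count_junction(store, "TGTGCAAAA"))
```

Any other class with the same two methods (backed by a database or a web service, say) could be passed to `count_junction` without changing it.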
11
Stripe API
13
Java API
public interface List<E> {
    int size();
    boolean isEmpty();
    boolean contains(Object o);
    boolean add(E e);
    void add(int index, E element);
    boolean remove(Object o);
}
14
Python DB API v2.0 (PEP 249)
http://legacy.python.org/dev/peps/pep-0249/
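PEP 249 is an API in exactly this sense: drivers for different databases expose the same `connect`/`cursor`/`execute`/`fetchall` calls. The sketch below uses the standard-library `sqlite3` driver; the `antibodies` table and the `junction_counts` helper are assumptions made up for the example. Parameter placeholder style is one part that does vary between drivers (`sqlite3` uses `?`).

```python
import sqlite3


def junction_counts(conn):
    """Return (junction, count) rows using only calls defined by PEP 249."""
    cur = conn.cursor()
    try:
        cur.execute(
            "SELECT junction, count(*) FROM antibodies "
            "GROUP BY junction ORDER BY count(*) DESC, junction"
        )
        return cur.fetchall()
    finally:
        cur.close()


# Set up a throwaway in-memory database with made-up rows.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE antibodies (junction TEXT)")
cur.executemany(
    "INSERT INTO antibodies VALUES (?)",  # '?' is sqlite3's paramstyle
    [("AAA",), ("CCC",), ("AAA",)],
)
conn.commit()

rows = junction_counts(conn)
conn.close()
print(rows)
```

Because `junction_counts` touches only the shared interface, it should work unchanged with a connection from another PEP 249 driver, provided the SQL itself is portable.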
15
Why use an API?
• Encapsulation/interfaces/abstraction
• Loose-coupling of components
• Reusable services
• Service-oriented architecture
16
LinkedIn’s Loose Coupling Architecture
18
IFTTT (If This Then That): stitching APIs together
https://ifttt.com/recipes#popular
19
20
IMGT
22
IMGT’s API is an FTP site
23
IMGT does not have an API
import urllib2      # Python 2 standard library
import ClientForm   # third-party Python 2 form parser

def __initVQUESTform(self):
    # get form
    request = urllib2.Request(
        'http://imgt.cines.fr/IMGT_vquest/vquest?livret=0&Option=humanIg')
    response = urllib2.urlopen(request)
    forms = ClientForm.ParseResponse(
        response,
        form_parser_class=ClientForm.XHTMLCompatibleFormParser,
        backwards_compat=False)
    response.close()
    form = forms[0]

    # fill out base part of form - Synthesis view with no extra options - TEXT
    form['l01p01c03'] = ['inline']
    form['l01p01c07'] = ['2. Synthesis']
    form['l01p01c05'] = ['TEXT']  # may need to be 'TEXT'
    form['l01p01c09'] = ['60']
    form['l01p01c35'] = ['F+ORF+ in-frame P']
    form['l01p01c36'] = ['0']
    form['l01p01c40'] = ['1']  # ['1'] for searching with indels
    form['l01p01c25'] = ['default']
    ...
24
Haussler and genomics services
25
Google Genomics API
27
Flask/Bottle web server example
from bottle import route  # Bottle-style decorator; Flask would use @app.route

@route("/receptor/<id>")
def lookup_receptor(id):
    # get the raw read
    pass

@route("/sample/<sample_id>")
def sample_summary(sample_id):
    # impl for getting sample information; can return:
    # * summary of repertoire information
    #   (num reads, VDJ distribution, etc.)
    # * demographic info
    pass

@route("/sample/<sample_id>/common_junctions")
def common_junctions(sample_id):
    # impl for getting the most common CDR3s
    pass
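The route stubs above map URL patterns to handler functions. A dependency-free sketch of that idea follows. The `route` decorator and `dispatch` function here are small stand-ins written for this example (not Bottle's or Flask's implementation), and the `SAMPLES` data is made up.

```python
import json
import re
from collections import Counter

# Hypothetical in-memory data standing in for a real repertoire database.
SAMPLES = {"s1": ["TGTGCAAAA", "TGTGCAAAA", "TGTGCGAGG"]}

ROUTES = []  # list of (compiled pattern, handler) pairs


def route(pattern):
    """Register a handler for a path pattern such as '/sample/<sample_id>'."""
    regex = re.sub(r"<(\w+)>", r"(?P<\1>[^/]+)", pattern)

    def decorator(func):
        ROUTES.append((re.compile("^" + regex + "$"), func))
        return func

    return decorator


@route("/sample/<sample_id>")
def sample_summary(sample_id):
    # Summary of repertoire information for one sample.
    return {"sample_id": sample_id, "num_reads": len(SAMPLES[sample_id])}


@route("/sample/<sample_id>/common_junctions")
def common_junctions(sample_id):
    # Most common junctions, as [sequence, count] pairs.
    counts = Counter(SAMPLES[sample_id])
    return {"junctions": [[j, n] for j, n in counts.most_common(10)]}


def dispatch(path):
    """Find the first matching route and return its result as JSON text."""
    for regex, handler in ROUTES:
        match = regex.match(path)
        if match:
            return json.dumps(handler(**match.groupdict()), sort_keys=True)
    raise KeyError(path)


print(dispatch("/sample/s1"))
print(dispatch("/sample/s1/common_junctions"))
```

A real web framework adds the HTTP layer on top, but the contract a client sees is the same: a URL pattern in, a structured message out.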
28
Genomics ETL has converged on standards
biochemistry => .fastq => short read alignment => .bam => genotype calling => .vcf => analysis
29
VCF
##fileformat=VCFv4.1
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHR POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
20 14370 rs605 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
20 1110696 rs604 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.6 GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4
20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
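Because the format is standardized, a data line can be taken apart with very little code. The following is a minimal sketch and not a full VCF parser: `parse_info` and `parse_record` are hypothetical helpers, header lines are ignored, values are left as strings, and a missing QUAL (".") is not handled.

```python
def parse_info(info):
    """Turn 'NS=3;DP=14;DB' into {'NS': '3', 'DP': '14', 'DB': True}."""
    fields = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True  # bare keys are flags
    return fields


def parse_record(line, sample_names):
    """Parse one VCF data line into a plain dictionary."""
    cols = line.split()  # real VCF is tab-separated; split() also accepts tabs
    chrom, pos, var_id, ref, alt, qual, filt, info, fmt = cols[:9]
    keys = fmt.split(":")
    samples = {
        name: dict(zip(keys, col.split(":")))
        for name, col in zip(sample_names, cols[9:])
    }
    return {
        "chrom": chrom,
        "pos": int(pos),
        "id": var_id,
        "ref": ref,
        "alt": alt.split(","),
        "qual": float(qual),
        "filter": filt,
        "info": parse_info(info),
        "samples": samples,
    }


# First data line from the slide, with its fields as shown there.
line = ("20 14370 rs605 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 "
        "GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.")
record = parse_record(line, ["NA00001", "NA00002", "NA00003"])
print(record["pos"], record["info"]["AF"], record["samples"]["NA00002"]["GT"])
```

For real work an established VCF library is the better choice; the point here is only that a shared format makes the data self-describing enough to parse generically.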
30
What about immune data?
biochemistry => .fastq => short read alignment => .bam => genotype calling => .vcf => analysis
.fastq => immune receptor alignment => .???
31
Multiple models for same types: VDJFasta
sub new {
    my ($class) = @_;
    my $self = {};
    $self->{filename} = "";
    $self->{headers} = [];
    $self->{sequence} = [];
    $self->{germline} = [];
    $self->{nseqs} = 0;
    $self->{mids} = {};

    $self->{accVsegQstart} = {}; # example: 124
    $self->{accVsegQend} = {}; # example: 417
    $self->{accJsegQstart} = {};
    $self->{accJsegQend} = {};
    $self->{accDsegQstart} = {};
32
Multiple models for same types: vdj
class ImmuneChain(SeqRecord):
    def cdr3(self):
        return len(self.junction)

    def num_mutations(self):
        aln = self.letter_annotations['alignment']
        return aln.count('S') + aln.count('I')

    def v(self):
        return self.__getattribute__('V-REGION') \
            .qualifiers['allele'][0]

    def v_seq(self):
        return self.__getattribute__('V-REGION') \
            .extract(self.seq.tostring())
33
Interoperability/services depend on being able to communicate data
34
CSV
9 CCTG_PRCONS=IGHC1_R1_IGM unproductive Homsap IGHV5-51*01 F, or Homsap IGHV5-51*03 F Homsap IGHJ4*02 F Homsap
12 GGGG_PRCONS=IGHC3_R1_IGA productive Homsap IGHV3-11*01 F Homsap IGHJ1*01 F Homsap IGHD2-2*03 F .......
13 CTTC_PRCONS=IGHC5_R1_IGG unproductive Homsap IGHV1-2*02 F Homsap IGHJ5*02 F Homsap IGHD5-18*01 F .......
18 ACTT_PRCONS=IGHC3_R1_IGA productive Homsap IGKV3-15*01 F, or Homsap IGKV3D-15*01 F or Homsap IGKV3D-15*02 P Homsap
20 GGAC_PRCONS=IGHC5_R1_IGG productive Homsap IGHV4-61*02 F Homsap IGHJ4*02 F Homsap IGHD1-26*01 F .......
25 TCGT_PRCONS=IGHC2_R1_IGD productive Homsap IGHV3-23*01 F, or Homsap IGHV3-23*04 F or Homsap IGHV3-23D*01 F Homsap
26 GGTG_PRCONS=IGHC5_R1_IGG productive Homsap IGHV4-34*01 F, or Homsap IGHV4-34*02 F or Homsap IGHV4-34*08 F Homsap
28 GTGA_PRCONS=IGHC5_R1_IGG productive Homsap IGHV1-46*01 F, or Homsap IGHV1-46*02 F or Homsap IGHV1-46*03 F Homsap
31 ACCC_PRCONS=IGHC1_R1_IGM productive Homsap IGHV3-9*01 F, or Homsap IGHV3-9*02 F Homsap IGHJ3*02 F Homsap
36 GCAA_PRCONS=IGHC1_R1_IGM productive Homsap IGHV3-9*01 F, or Homsap IGHV3-9*02 F Homsap IGHJ2*01 F Homsap
39 GCAA_PRCONS=IGHC1_R1_IGM productive Homsap IGHV3-7*01 F Homsap IGHJ6*02 F Homsap IGHD1-7*01 F .......
40 GGGT_PRCONS=IGHC1_R1_IGM productive Homsap IGHV4-34*01 F, or Homsap IGHV4-34*02 F or Homsap IGHV4-34*08 F Homsap
42 TAGG_PRCONS=IGHC5_R1_IGG productive Homsap IGHV4-39*01 F, or Homsap IGHV4-39*05 F Homsap IGHJ4*02 F Homsap
47 CAAA_PRCONS=IGHC1_R1_IGM productive Homsap IGHV3-15*01 F, or Homsap IGHV3-15*02 F Homsap IGHJ6*02 F Homsap
48 AGAA_PRCONS=IGHC5_R1_IGG unproductive Homsap IGHV3-30*04 F, or Homsap IGHV3-30-3*01 F or Homsap IGHV3-30-3*02 F or Ho
52 GCAG_PRCONS=IGHC1_R1_IGM productive Homsap IGHV3-23*01 F, or Homsap IGHV3-23*04 F or Homsap IGHV3-23D*01 F Homsap
53 AACC_PRCONS=IGHC3_R1_IGA productive Homsap IGHV3-30*02 F Homsap IGHJ4*02 F Homsap IGHD5-18*01 F .......
35
XML
<ImmuneChain>
  <c>IGHD</c>
  <barcode>RL014</barcode>
  <j_start_idx>389</j_start_idx>
  <seq>TTGTGGCTATTTTAAA ... CTCGGACT</seq>
  <descr>003699_0091_0140</descr>
  <tag>coding</tag>
  <clone>IGHV3-43_IGHJ4|387</clone>
  <j>IGHJ4*02</j>
  <v_end_idx>314</v_end_idx>
  <v>IGHV3-43*01</v>
  <junction>TGTGCAAAAGATAATCT ... TCTTTGACTACTGG</junction>
  <d>IGHD5-24*01</d>
</ImmuneChain>
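A record like this can be read with the standard library alone. The sketch below uses a shortened copy of the record (the elided sequence fields are left out) and computes an illustrative quantity, the distance between `v_end_idx` and `j_start_idx`; the variable name `gap` is made up for the example.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the ImmuneChain record from the slide.
record = """<ImmuneChain>
  <barcode>RL014</barcode>
  <v>IGHV3-43*01</v>
  <d>IGHD5-24*01</d>
  <j>IGHJ4*02</j>
  <v_end_idx>314</v_end_idx>
  <j_start_idx>389</j_start_idx>
</ImmuneChain>"""

root = ET.fromstring(record)

# Flatten the child elements into a tag -> text dictionary.
chain = {child.tag: child.text for child in root}

# XML carries every value as text, so numeric fields need explicit conversion.
gap = int(chain["j_start_idx"]) - int(chain["v_end_idx"])

print(chain["v"], chain["j"], gap)
```

The need to convert `"314"` and `"389"` by hand is one of the costs of XML here: the schema (which fields are integers) lives in the reader's head rather than in the data.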
36
JSON
{
  "v": "IGHV4-39*02",
  "seq": "CCTATCCCCCTGTGTGCCTT ... CTCCACCAAG",
  "num_mutations": 43,
  "name": "HG2DXMN01CY8UH",
  "letter_annotations": {
    "alignment": "..............S....S....3333333333333333........S.."
  },
  "junction_nt": "GCGAGGGGCCGATGGGACTTTTATTACATGGACGTC",
  "j": "IGHJ6*03",
  "annotations": {
    "usearch_90_cluster": "6277",
    "experiment_date": "20120119",
    "donor": "17517",
    "sample_type": "memory_B_cells",
    "source": "SeqWright",
    "tags": ["revcomp", "coding"],
    "taxonomy": []
  },
  "d": "IGHD3-10*01",
  "features": [
    {
      "strand": 1,
      "type": "V-REGION",
      "location": [51, 356],
      "qualifiers": {
        "CDR_length": ["[10.7.2]"],
        "codon_start": ["1"],
        "gene": ["IGHV4-39"],
        "allele": ["IGHV4-39*02"]
      }
    },
    ...
  ]
}
http://www.json.org/
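Reading the same kind of record from JSON takes one call, and numbers, lists, and nested objects come back as native Python types. This sketch uses a shortened copy of the record above; `v_length` is a made-up name for the difference between the two `location` values.

```python
import json

# Shortened copy of the JSON record from the slide.
text = """
{
  "v": "IGHV4-39*02",
  "j": "IGHJ6*03",
  "d": "IGHD3-10*01",
  "num_mutations": 43,
  "annotations": {"donor": "17517", "tags": ["revcomp", "coding"]},
  "features": [{"type": "V-REGION", "location": [51, 356]}]
}
"""

chain = json.loads(text)

# Nested structures are ordinary dicts and lists after parsing.
v_region = next(f for f in chain["features"] if f["type"] == "V-REGION")
start, end = v_region["location"]
v_length = end - start

print(chain["v"], chain["num_mutations"], v_length)
```

Unlike the XML case, `num_mutations` and the `location` values arrive as integers with no conversion step.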
37
JSON
{ "__SeqRecord__" : true, "_id" : { "$oid" : "4f1f5525e7c6172308000000" }, "annotations" : { "D-REGION" : "IGHD3-10*01", "accessions" : "HG2DXMN01CY8
{ "__SeqRecord__" : true, "_id" : { "$oid" : "4f1f5525e7c6172308000001" }, "annotations" : { "D-REGION" : "IGHD3-9*01", "accessions" : "HG2DXMN01A3VH
{ "__SeqRecord__" : true, "_id" : { "$oid" : "4f1f5525e7c6172308000002" }, "annotations" : { "D-REGION" : "IGHD3-10*01", "accessions" : "HG2DXMN01BC6
{ "__SeqRecord__" : true, "_id" : { "$oid" : "4f1f5525e7c6172308000003" }, "annotations" : { "D-REGION" : "IGHD6-19*01", "accessions" : "HG2DXMN01DYU
{ "__SeqRecord__" : true, "_id" : { "$oid" : "4f1f5525e7c6172308000004" }, "annotations" : { "D-REGION" : "IGHD6-19*01", "accessions" : "HG2DXMN01A8F
{ "__SeqRecord__" : true, "_id" : { "$oid" : "4f1f5525e7c6172308000005" }, "annotations" : { "D-REGION" : "IGHD3-9*01", "accessions" : "HG2DXMN01BDI2
{ "__SeqRecord__" : true, "_id" : { "$oid" : "4f1f5525e7c6172308000006" }, "annotations" : { "D-REGION" : "IGHD6-19*01", "accessions" : "HG2DXMN01BS2
{ "__SeqRecord__" : true, "_id" : { "$oid" : "4f1f5525e7c6172308000007" }, "annotations" : { "D-REGION" : "IGHD6-19*01", "accessions" : "HG2DXMN01DLL
{ "__SeqRecord__" : true, "_id" : { "$oid" : "4f1f5525e7c6172308000008" }, "annotations" : { "D-REGION" : "IGHD6-25*01", "accessions" : "HG2DXMN01BLF
{ "__SeqRecord__" : true, "_id" : { "$oid" : "4f1f5525e7c6172308000009" }, "annotations" : { "D-REGION" : "IGHD3-3*01", "accessions" : "HG2DXMN01D4TL
{ "__SeqRecord__" : true, "_id" : { "$oid" : "4f1f5525e7c617230800000a" }, "annotations" : { "D-REGION" : "IGHD3-10*01", "accessions" : "HG2DXMN01BU6
{ "__SeqRecord__" : true, "_id" : { "$oid" : "4f1f5525e7c617230800000b" }, "annotations" : { "D-REGION" : "IGHD2-2*03", "accessions" : "HG2DXMN01BIMG
{ "__SeqRecord__" : true, "_id" : { "$oid" : "4f1f5525e7c617230800000c" }, "annotations" : { "D-REGION" : "IGHD3-3*01", "accessions" : "HG2DXMN01BM9Z
{ "__SeqRecord__" : true, "_id" : { "$oid" : "4f1f5525e7c617230800000d" }, "annotations" : { "D-REGION" : "IGHD2-2*03", "accessions" : "HG2DXMN01BH9Q
{ "__SeqRecord__" : true, "_id" : { "$oid" : "4f1f5525e7c617230800000e" }, "annotations" : { "D-REGION" : "IGHD6-19*01", "accessions" : "HG2DXMN01BR3
38
Binary formats
• Protobuf, Thrift, or Avro
• Flexible data model
  • All common primitive types (e.g. int, double, string)
  • Support nested types, including arrays and maps
• Efficient binary encoding
• Code generation for many languages (binary compatible)
• Support for schema evolution
• Support IDL for data types and services
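The "efficient binary encoding" point can be illustrated without any of these libraries. The sketch below uses the standard-library `struct` module, which is an assumption made for illustration only: it is not how Protobuf, Thrift, or Avro actually lay out bytes (they add field tags, variable-length integers, and schema evolution), but it shows the shared idea that field names live in a schema rather than in every record.

```python
import json
import struct

# Three integer fields borrowed from the immune receptor examples.
record = {"v_end_idx": 314, "j_start_idx": 389, "num_mutations": 43}

# The "schema": field order and types are agreed on once, outside the data.
FIELDS = ("v_end_idx", "j_start_idx", "num_mutations")
LAYOUT = struct.Struct("<iii")  # three little-endian 32-bit signed ints


def encode(rec):
    """Pack the record's values in schema order; no field names are stored."""
    return LAYOUT.pack(*(rec[name] for name in FIELDS))


def decode(payload):
    """Rebuild the dictionary by pairing unpacked values with the schema."""
    return dict(zip(FIELDS, LAYOUT.unpack(payload)))


binary = encode(record)
text = json.dumps(record).encode("utf-8")

print(len(binary), "bytes as packed binary")
print(len(text), "bytes as JSON text")
print(decode(binary) == record)
```

The packed form is 12 bytes because each of the three values takes exactly 4; the JSON form repeats every field name in every record, which adds up across millions of reads.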
39
Thrift example: Twitter
service Twitter {
  void ping();
  bool postTweet(1:Tweet tweet);
  TweetSearchResult searchTweets(1:string query);
}

struct Tweet {
  1: required i32 userId;
  2: required string userName;
  3: required string text;
  4: optional Location loc;
  16: optional string language = "english"
}
40
Thrift example: Immune receptor
cd ~/repos/kiwi
thrift --gen java kiwi-format/src/main/resources/thrift/kiwi.thrift
thrift --gen py:new_style kiwi-format/src/main/resources/thrift/kiwi.thrift
See: https://github.com/laserson/kiwi
41
Questions?
42
Biological parts specifications
• Library of parts with well-characterized input-output characteristics
• In total, similar to API spec
Canton, Nat. Biotech. 26: 787 (2008)
43
Engineering signaling pathways at inputs/outputs
Lim, Nat. Rev. Mol. Cell Biol. 11: 393 (2010)
44
Bottom-up genetic circuit design
Brophy, Nature Meth. 11: 508 (2014)
46
Predict composability of genetic elements
Kosuri, PNAS 110: 14024 (2013)
• 114 promoters x 111 RBS
“…rather than relying on prediction or standardization, we can screen synthetic libraries for desired behavior.”
47
ZFN => TALEN => CRISPR/Cas
Least addressable, most expensive to create => Most addressable, cheapest to create
48
Addressability for precision nanoscale engineering
Douglas, NAR 37: 5001 (2009)
49
Addressability for precision nanoscale engineering
Douglas, Nature 459: 414 (2009)
50
Evolution for encapsulation: an evolved electronic thermometer
http://www.genetic-programming.com/hc/thermometer.html
51
Lycopene synthesis optimization
Wang, Nature 460: 894 (2009)
52
Evolutionary encapsulation for signaling pathway engineering
Peisajovich, Science 328: 368 (2010)
54
Genetic isolation with Re.coli
Lajoie, Science 342: 357 (2013)
So far, we discussed antibody-only data analysis
Antigen-only data generation
Larman, Nat. Biotech. 29: 535 (2011)
Ben Larman
Steve Elledge
Agilent OLS array
59
Phage immunoprecipitation sequencing (PhIP-seq)
60
PhIP-seq proof-of-principle
[Figure: Patient A Replica 1 vs. Patient A Replica 2, axes in log10(-log10 P-value); labeled points: SAPK4, NOVA1, TGIF2LX]
61
‘Forward vaccinology’
62
‘Reverse vaccinology’
63
‘Immunization without vaccination’
64
Encapsulation for cancer immunotherapy through TMG processing
Tran, Science 344: 641 (2014)
65
Other examples?
66
Conclusions
• The API perspective helps organize and communicate data
• Use sane file formats if possible:
  • JSON for lightweight work
  • Thrift/Avro for heavyweight serialization/communication
• Decouple data modeling from implementation details
• Biological engineering: what abstractions are available?
• Evolution as nature’s encapsulator
67