Embulk 20150411
![Page 1: Embulk 20150411](https://reader034.fdocuments.net/reader034/viewer/2022051016/55a5ce0e1a28ab5c1b8b460e/html5/thumbnails/1.jpg)
Hiroshi Nakamura
Software Engineer
Treasure Data, K.K.
Practical Modern Java Techniques as Seen in "Embulk": Implementation Methods for a Parallel Distributed Processing System
#ccc_cd4 / #embulk
![Page 2: Embulk 20150411](https://reader034.fdocuments.net/reader034/viewer/2022051016/55a5ce0e1a28ab5c1b8b460e/html5/thumbnails/2.jpg)
Today's talk
What is Embulk?
> The difficulty of bulk data transfer
> Embulk's approach
> Architecture overview
Java implementation techniques
> Java 7 native
> Wiring components with Guice
> Extension via ServiceLoader
> Model classes with Jackson, immutability
> Netty buffer allocator, Unsafe
![Page 3: Embulk 20150411](https://reader034.fdocuments.net/reader034/viewer/2022051016/55a5ce0e1a28ab5c1b8b460e/html5/thumbnails/3.jpg)
What is Embulk? - http://embulk.org/
> An open-source bulk data transfer tool
  > Transfers records from "A" to "B"
> Plugin mechanism
  > Many combinations of "A" and "B"
> Makes data integration easier
  > One of the headaches of building systems
Storage, RDBMS, NoSQL, Cloud Service, …
broken records, error recovery, maintenance, performance, …
![Page 4: Embulk 20150411](https://reader034.fdocuments.net/reader034/viewer/2022051016/55a5ce0e1a28ab5c1b8b460e/html5/thumbnails/4.jpg)
Embulk committers
Hiroshi Nakamura @nahi
Muga Nishizawa @muga_nishizawa
Sadayuki Furuhashi @frsyuki
![Page 5: Embulk 20150411](https://reader034.fdocuments.net/reader034/viewer/2022051016/55a5ce0e1a28ab5c1b8b460e/html5/thumbnails/5.jpg)
Difficulties of bulk data transfer
> Normalizing input data
> Error handling
> Maintenance
> Performance
![Page 6: Embulk 20150411](https://reader034.fdocuments.net/reader034/viewer/2022051016/55a5ce0e1a28ab5c1b8b460e/html5/thumbnails/6.jpg)
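The difficulty of normalizing input data
Data encodings vary
> null, timestamps, floating-point numbers
> Newlines, escaping, record/column delimiters
> Character encoding, compressed or not
→ Data is normalized by trial and error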
![Page 7: Embulk 20150411](https://reader034.fdocuments.net/reader034/viewer/2022051016/55a5ce0e1a28ab5c1b8b460e/html5/thumbnails/7.jpg)
The difficulty of error handling
> Handling exceptional values
> Recovering from network errors
> Recovering from a full disk
> Avoiding duplicate data transfer
→ Data validation, retry, resume
![Page 8: Embulk 20150411](https://reader034.fdocuments.net/reader034/viewer/2022051016/55a5ce0e1a28ab5c1b8b460e/html5/thumbnails/8.jpg)
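The difficulty of maintenance
> Keeping the transfer running continuously
> Adapting to changes in data transfer requirements
→ Documentation, generalization, open-sourcing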
![Page 9: Embulk 20150411](https://reader034.fdocuments.net/reader034/viewer/2022051016/55a5ce0e1a28ab5c1b8b460e/html5/thumbnails/9.jpg)
The performance problem
> The volume of transferred data usually keeps growing
> The number of target records can grow too
→ Parallel and distributed processing
![Page 10: Embulk 20150411](https://reader034.fdocuments.net/reader034/viewer/2022051016/55a5ce0e1a28ab5c1b8b460e/html5/thumbnails/10.jpg)
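Bulk data transfer example
Load a given 10GB CSV file into PostgreSQL
> 1. Run the load command → it fails
> 2. Write a script that normalizes the data
     Convert "20150127T190500Z" → "2015-01-27 19:05:00 UTC"
     Convert "null" → "\N"
     Fix whatever you notice while looking at the source data…
> 3. Try again → the data loads, but does not match the source
     Convert "Inf" → "Infinity"
> 4. Repeat over and over
> 5. Accidentally load duplicate records…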
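The conversions named in steps 2 and 3 can be sketched in Java as follows. This is only an illustration of the per-field rules such a hand-written script accumulates, not code from Embulk: the class name `FieldNormalizer` and the rule that any unrecognized field passes through unchanged are assumptions made for the example.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

// Hypothetical helper, not part of Embulk: applies the three conversions from the slide.
final class FieldNormalizer {
    // Input format seen in the source data, e.g. 20150127T190500Z
    private static final DateTimeFormatter IN =
            DateTimeFormatter.ofPattern("yyyyMMdd'T'HHmmss'Z'");
    // Output format from the slide, e.g. 2015-01-27 19:05:00 UTC
    private static final DateTimeFormatter OUT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss 'UTC'");

    static String normalize(String field) {
        if (field.equals("null")) {
            return "\\N"; // backslash-N, the NULL marker used in the slide
        }
        if (field.equals("Inf")) {
            return "Infinity";
        }
        try {
            return LocalDateTime.parse(field, IN).format(OUT);
        } catch (DateTimeParseException e) {
            return field; // assumption: anything else passes through unchanged
        }
    }

    public static void main(String[] args) {
        for (String f : new String[] {"20150127T190500Z", "null", "Inf", "embulk"}) {
            System.out.println(f + " -> " + normalize(f));
        }
    }
}
```

Each new surprise in the data adds another branch like these, which is the trial-and-error loop the slide describes.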
![Page 11: Embulk 20150411](https://reader034.fdocuments.net/reader034/viewer/2022051016/55a5ce0e1a28ab5c1b8b460e/html5/thumbnails/11.jpg)
Bulk data transfer example
Load a given 10GB CSV file into PostgreSQL
> 6. The script is finished
> 7. Register it in cron so the bulk load runs every day
> 8. One day it fails for a different reason…
     Convert invalid UTF-8 byte sequences to U+FFFD
![Page 12: Embulk 20150411](https://reader034.fdocuments.net/reader034/viewer/2022051016/55a5ce0e1a28ab5c1b8b460e/html5/thumbnails/12.jpg)
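Bulk data transfer example
Load 730 past daily 10GB CSV files (two years' worth)
> 1. Most scripts are slow
  > No time to optimize them
  > At 1 hour per file, about 1 month even if no error occurs
> 2. Change the script to load data in parallel
> 3. One day it fails with a full disk or a network error
  > How far did the load get?
> 4. Adjust the unit of loading so it is easy to resume after a failure
> 5. Add a safe resume feature to the script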
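Steps 2 to 5 amount to loading files in parallel while remembering which ones finished. The sketch below shows one way to do that in plain Java. It is not how Embulk implements resume. `ResumableLoader`, the `done` set, and the `loadOne` callback are hypothetical names, and Java 8 lambdas are used for brevity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;

// Hypothetical sketch: one file is the unit of loading and the unit of resume.
final class ResumableLoader {
    /**
     * Loads every file not yet in `done` on a fixed-size thread pool.
     * A file is added to `done` only after its load returns normally.
     * Returns the files whose load threw in this run.
     */
    static List<String> loadAll(List<String> files, Set<String> done,
                                int threads, Consumer<String> loadOne) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<String> submitted = new ArrayList<>();
        List<Future<?>> futures = new ArrayList<>();
        for (String file : files) {
            if (done.contains(file)) {
                continue; // finished in an earlier run
            }
            submitted.add(file);
            futures.add(pool.submit(() -> {
                loadOne.accept(file);
                done.add(file); // reached only if the load did not throw
            }));
        }
        pool.shutdown();

        List<String> failed = new ArrayList<>();
        for (int i = 0; i < futures.size(); i++) {
            try {
                futures.get(i).get();
            } catch (ExecutionException e) {
                failed.add(submitted.get(i));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        Set<String> done = ConcurrentHashMap.newKeySet();
        List<String> files = Arrays.asList("day-001.csv", "day-002.csv", "day-003.csv");
        // First run: the second file fails, as if the disk had filled up.
        System.out.println("failed: " + loadAll(files, done, 2, f -> {
            if (f.equals("day-002.csv")) {
                throw new IllegalStateException("disk full");
            }
        }));
        // Second run: only the failed file is retried.
        System.out.println("failed: " + loadAll(files, done, 2, f -> { }));
    }
}
```

The sketch only avoids reloading whole files. A file that failed halfway may still have written some records, so a safe resume also needs each per-file load to be atomic, which is where transaction control comes in.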
![Page 13: Embulk 20150411](https://reader034.fdocuments.net/reader034/viewer/2022051016/55a5ce0e1a28ab5c1b8b460e/html5/thumbnails/13.jpg)
Headaches of building systems
A wide variety of data formats and data stores
> CSV, TSV, JSON, XML, MessagePack, SequenceFile, RCFile
> S3, Salesforce.com, Google Cloud Storage, BigQuery, Elasticsearch
> MySQL, PostgreSQL, Oracle, MS SQL Server, Amazon Redshift, Redis, MongoDB
![Page 14: Embulk 20150411](https://reader034.fdocuments.net/reader034/viewer/2022051016/55a5ce0e1a28ab5c1b8b460e/html5/thumbnails/14.jpg)
[Diagram: HDFS, MySQL, Amazon S3, CSV files, SequenceFile, Salesforce.com, Elasticsearch, Cassandra, Hive, and Redis linked by scripts labeled "Broken script :(", "Sometimes fails :(", and "No one can fix :(".]
![Page 15: Embulk 20150411](https://reader034.fdocuments.net/reader034/viewer/2022051016/55a5ce0e1a28ab5c1b8b460e/html5/thumbnails/15.jpg)
[Diagram: HDFS, MySQL, Amazon S3, CSV files, SequenceFile, Salesforce.com, Elasticsearch, Cassandra, Hive, and Redis linked by scripts labeled "Broken script :(", "Sometimes fails :(", and "No one can fix :(".]
N x M scripts!
> Poor error handling
> No retrying / resuming
> Low performance
> Often no maintainers
![Page 16: Embulk 20150411](https://reader034.fdocuments.net/reader034/viewer/2022051016/55a5ce0e1a28ab5c1b8b460e/html5/thumbnails/16.jpg)
Embulk's approach
> Plugin architecture
> Help with input data normalization: guess, preview
> Parallel and distributed execution
> Repeated execution
> Transaction control
![Page 17: Embulk 20150411](https://reader034.fdocuments.net/reader034/viewer/2022051016/55a5ce0e1a28ab5c1b8b460e/html5/thumbnails/17.jpg)
[Diagram, shown again: the systems HDFS through Redis linked by scripts labeled "Broken script :(", "Sometimes fails :(", and "No one can fix :(".]
![Page 18: Embulk 20150411](https://reader034.fdocuments.net/reader034/viewer/2022051016/55a5ce0e1a28ab5c1b8b460e/html5/thumbnails/18.jpg)
[Diagram: the same systems (HDFS, MySQL, Amazon S3, CSV files, SequenceFile, Salesforce.com, Elasticsearch, Cassandra, Hive, Redis), with the script labels replaced by "Reliable framework :-)".]
![Page 19: Embulk 20150411](https://reader034.fdocuments.net/reader034/viewer/2022051016/55a5ce0e1a28ab5c1b8b460e/html5/thumbnails/19.jpg)
[Diagram: the same systems (HDFS, MySQL, Amazon S3, CSV files, SequenceFile, Salesforce.com, Elasticsearch, Cassandra, Hive, Redis), with "Plugins" on both sides and the label "Reusable plugins".]
![Page 20: Embulk 20150411](https://reader034.fdocuments.net/reader034/viewer/2022051016/55a5ce0e1a28ab5c1b8b460e/html5/thumbnails/20.jpg)
Plugin architecture
Extension points define reusable components
As long as a plugin follows them, it gets the benefits of the framework
> Parallel processing, repeated execution, error handling, recovery
![Page 21: Embulk 20150411](https://reader034.fdocuments.net/reader034/viewer/2022051016/55a5ce0e1a28ab5c1b8b460e/html5/thumbnails/21.jpg)
Examples of Embulk plugins
Distributed as RubyGems - http://www.embulk.org/plugins/
> Databases
  > Oracle, MySQL, PostgreSQL, Amazon Redshift
> Search engines
  > Elasticsearch
> Cloud services
  > Salesforce.com
  > Amazon S3
  > Google Cloud Storage, Google BigQuery
> File formats
  > CSV, TSV, JSON, XML
  > pcap packet capture files
  > gzip, bzip2, zip, tar, cpio
![Page 22: Embulk 20150411](https://reader034.fdocuments.net/reader034/viewer/2022051016/55a5ce0e1a28ab5c1b8b460e/html5/thumbnails/22.jpg)
Demo
> guess and preview
> Parallel and distributed execution
> Repeated execution
> Transaction handling
> Sample plugins
![Page 23: Embulk 20150411](https://reader034.fdocuments.net/reader034/viewer/2022051016/55a5ce0e1a28ab5c1b8b460e/html5/thumbnails/23.jpg)
Embulk architecture overview
![Page 24: Embulk 20150411](https://reader034.fdocuments.net/reader034/viewer/2022051016/55a5ce0e1a28ab5c1b8b460e/html5/thumbnails/24.jpg)
[Diagram: config drives an Executor plugin (threads, MapReduce). Records flow from the InputPlugin (input, …) through Filter plugins (convert, …) to the OutputPlugin (output).]
![Page 25: Embulk 20150411](https://reader034.fdocuments.net/reader034/viewer/2022051016/55a5ce0e1a28ab5c1b8b460e/html5/thumbnails/25.jpg)
[Diagram: the InputPlugin broken down into a FileInput plugin (HDFS, S3, Riak CS, …), a Decoder plugin (gzip, bzip2, aes, …), and a Parser plugin (CSV, JSON, pcap, …), with buffers passed between them. Records then flow through Filter plugins to the OutputPlugin, all under the Executor plugin and config.]
![Page 26: Embulk 20150411](https://reader034.fdocuments.net/reader034/viewer/2022051016/55a5ce0e1a28ab5c1b8b460e/html5/thumbnails/26.jpg)
[Diagram: the InputPlugin broken down into FileInput, Decoder, and Parser plugins (HDFS, S3, Riak CS, … / gzip, bzip2, aes, … / CSV, JSON, pcap, …), and the OutputPlugin broken down into Formatter, Encoder, and FileOutput plugins. Buffers pass between the file-level plugins and records pass through the Filter plugins, all under the Executor plugin and config.]
![Page 27: Embulk 20150411](https://reader034.fdocuments.net/reader034/viewer/2022051016/55a5ce0e1a28ab5c1b8b460e/html5/thumbnails/27.jpg)
Embulk architecture overview
Four kinds of plugins and a framework that assembles them
1. Executor: execution
2. Input: reads in the records of the bulk data
   FileInput, Decoder, Parser: file handling
3. Filter: data operations on records
4. Output: writes out the records of the bulk data
   FileOutput, Encoder, Formatter: file handling
![Page 28: Embulk 20150411](https://reader034.fdocuments.net/reader034/viewer/2022051016/55a5ce0e1a28ab5c1b8b460e/html5/thumbnails/28.jpg)
[Diagram: InputPlugin, Filter plugins, and OutputPlugin under the Executor plugin, with records flowing between them and a "config diff" label.]
![Page 29: Embulk 20150411](https://reader034.fdocuments.net/reader034/viewer/2022051016/55a5ce0e1a28ab5c1b8b460e/html5/thumbnails/29.jpg)
[Diagram: InputPlugin, Filter plugins, and OutputPlugin under the Executor plugin, with records flowing between them. The plugins are annotated with task, schema, and report (the Filter plugins with task and schema only). "config" and "resume state" are also labeled.]
![Page 30: Embulk 20150411](https://reader034.fdocuments.net/reader034/viewer/2022051016/55a5ce0e1a28ab5c1b8b460e/html5/thumbnails/30.jpg)
Embulk implementation techniques
> Java 7 native
> Wiring components with Guice
> Extension via ServiceLoader
> Model classes with Jackson
> Immutable model classes
> Netty buffer allocator
> Avoiding buffer copies with sun.misc.Unsafe
![Page 31: Embulk 20150411](https://reader034.fdocuments.net/reader034/viewer/2022051016/55a5ce0e1a28ab5c1b8b460e/html5/thumbnails/31.jpg)
Java 7 native
> try-with-resources
> File operations: Files & Paths API
Note: date and time handling uses JRuby's implementation

    try (SetCurrentThreadName dontCare = new SetCurrentThreadName("transaction")) {
        return doRun(config);
    }

    Path basePath = Paths.get(".").normalize();
    Path file = basePath.resolve("relative.csv");
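The first snippet works because any `AutoCloseable` can be used as a try-with-resources resource, whether or not it wraps a file or socket. The slide does not show how `SetCurrentThreadName` is written, so the class below is a guess at the idea under a hypothetical name, `ThreadNameScope`: rename the current thread on entry and restore the old name on exit.

```java
// Hypothetical stand-in for the SetCurrentThreadName class used on the slide.
final class ThreadNameScope implements AutoCloseable {
    private final String originalName;

    ThreadNameScope(String name) {
        Thread current = Thread.currentThread();
        this.originalName = current.getName();
        current.setName(name);
    }

    // Declares no checked exception, so callers need no catch block.
    @Override
    public void close() {
        Thread.currentThread().setName(originalName);
    }

    public static void main(String[] args) {
        try (ThreadNameScope ignored = new ThreadNameScope("transaction")) {
            // Log lines and stack traces produced here carry the name "transaction".
            System.out.println(Thread.currentThread().getName());
        }
        // The original name is restored even if the body throws.
        System.out.println(Thread.currentThread().getName());
    }
}
```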
![Page 32: Embulk 20150411](https://reader034.fdocuments.net/reader034/viewer/2022051016/55a5ce0e1a28ab5c1b8b460e/html5/thumbnails/32.jpg)
Wiring components with Guice

    public class EmbulkService
    {
        protected final Injector injector;

        public EmbulkService(ConfigSource systemConfig)
        {
            ImmutableList.Builder<Module> modules = ImmutableList.builder();
            modules.add(new SystemConfigModule(systemConfig));
            modules.add(new ExecModule());
            modules.add(new ExtensionServiceLoaderModule(systemConfig));
            modules.add(new BuiltinPluginSourceModule());
            modules.add(new JRubyScriptingModule(systemConfig));
            injector = Guice.createInjector(modules.build());
        }

        public Injector getInjector()
        {
            return injector;
        }
    }
![Page 33: Embulk 20150411](https://reader034.fdocuments.net/reader034/viewer/2022051016/55a5ce0e1a28ab5c1b8b460e/html5/thumbnails/33.jpg)
    public class ExecModule
            implements Module
    {
        @Override
        public void configure(Binder binder)
        {
            ...
            binder.bind(LocalThreadExecutor.class).in(Scopes.SINGLETON);
            registerPluginTo(binder, ExecutorPlugin.class, "local", LocalExecutorPlugin.class);
![Page 34: Embulk 20150411](https://reader034.fdocuments.net/reader034/viewer/2022051016/55a5ce0e1a28ab5c1b8b460e/html5/thumbnails/34.jpg)
    public class LocalExecutorPlugin
            implements ExecutorPlugin
    {
        private final ExecutorService executor;

        @Inject
        public LocalExecutorPlugin(LocalThreadExecutor executor)
        {
            this.executor = executor.getExecutorService();
            ...

(InjectedPluginSource)

    public T newPlugin(Injector injector)
    {
        return (T) new FileInputRunner((FileInputPlugin) injector.getInstance(impl));
![Page 35: Embulk 20150411](https://reader034.fdocuments.net/reader034/viewer/2022051016/55a5ce0e1a28ab5c1b8b460e/html5/thumbnails/35.jpg)
Wiring components with Guice
A DI framework where everything is written in Java rather than XML
Combines declarative DI via annotations with dynamic DI via the injector
> Makes it easy to swap modules dynamically
![Page 36: Embulk 20150411](https://reader034.fdocuments.net/reader034/viewer/2022051016/55a5ce0e1a28ab5c1b8b460e/html5/thumbnails/36.jpg)
Extension via ServiceLoader

    public class ExtensionServiceLoaderModule
            implements Module
    {
        private final ConfigSource systemConfig;

        public ExtensionServiceLoaderModule(ConfigSource systemConfig)
        {
            this.systemConfig = systemConfig;
        }

        @Override
        public void configure(Binder binder)
        {
            ServiceLoader<Extension> serviceLoader = ServiceLoader.load(Extension.class, classLoader);
            for (Extension extension : serviceLoader) {
                for (Module module : extension.getModules(systemConfig)) {
                    module.configure(binder);
                }
            }
        }
    }
![Page 37: Embulk 20150411](https://reader034.fdocuments.net/reader034/viewer/2022051016/55a5ce0e1a28ab5c1b8b460e/html5/thumbnails/37.jpg)
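Extension via ServiceLoader
Add or replace a module just by putting a jar on the classpath
Simple, and safer than manipulating ClassLoaders directly
Used to register the standard plugins
Note: plugins themselves are loaded through a ClassLoader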
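The discovery side of `java.util.ServiceLoader` can be tried without Guice. In the sketch below, `DemoExtension` and `ExtensionDiscovery` are hypothetical names and do not reflect Embulk's actual `Extension` interface. A jar contributes an implementation by listing its class name in a `META-INF/services/DemoExtension` file; nothing in the calling code changes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

// Hypothetical extension point for illustration only.
interface DemoExtension {
    String name();
}

final class ExtensionDiscovery {
    // Collects the names of all DemoExtension implementations visible to the loader.
    static List<String> discover(ClassLoader loader) {
        List<String> names = new ArrayList<>();
        for (DemoExtension extension : ServiceLoader.load(DemoExtension.class, loader)) {
            names.add(extension.name());
        }
        return names;
    }

    public static void main(String[] args) {
        // The list stays empty until some jar on the classpath registers an
        // implementation under META-INF/services/DemoExtension.
        System.out.println(discover(ExtensionDiscovery.class.getClassLoader()));
    }
}
```

Implementations need a public no-argument constructor, because ServiceLoader instantiates them reflectively.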
![Page 38: Embulk 20150411](https://reader034.fdocuments.net/reader034/viewer/2022051016/55a5ce0e1a28ab5c1b8b460e/html5/thumbnails/38.jpg)
Model classes with Jackson (task)

    public class CsvParserPlugin
            implements ParserPlugin
    {
        public interface PluginTask
                extends Task, LineDecoder.DecoderTask, TimestampParser.ParserTask
        {
            @Config("columns")
            public SchemaConfig getSchemaConfig();

            @Config("header_line")
            @ConfigDefault("null")
            public Optional<Boolean> getHeaderLine();

            @Config("skip_header_lines")
            @ConfigDefault("0")
            public int getSkipHeaderLines();
            public void setSkipHeaderLines(int n);

            @Config("delimiter")
            @ConfigDefault("\",\"")
            public char getDelimiterChar();
![Page 39: Embulk 20150411](https://reader034.fdocuments.net/reader034/viewer/2022051016/55a5ce0e1a28ab5c1b8b460e/html5/thumbnails/39.jpg)
[Diagram: InputPlugin, Filter plugins, and OutputPlugin under the Executor plugin, with records flowing between them. The plugins are annotated with task, schema, and report (the Filter plugins with task and schema only). "config", "config diff", and "resume state" are also labeled.]
![Page 40: Embulk 20150411](https://reader034.fdocuments.net/reader034/viewer/2022051016/55a5ce0e1a28ab5c1b8b460e/html5/thumbnails/40.jpg)
Model classes with Jackson (schema)

    public class ColumnConfig
    {
        private final String name;
        private final Type type;

        @JsonCreator
        public ColumnConfig(
                @JsonProperty("name") String name,
                @JsonProperty("type") Type type)
        {
            this.name = name;
            this.type = type;
        }

        @JsonProperty("name")
        public String getName()
        {
            return name;
        }

        @JsonProperty("type")
        public Type getType()
        {
            return type;
        }
    }
![Page 41: Embulk 20150411](https://reader034.fdocuments.net/reader034/viewer/2022051016/55a5ce0e1a28ab5c1b8b460e/html5/thumbnails/41.jpg)
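Model classes with Jackson
Serialization and deserialization matter
> For parallel and distributed execution
> For exchanging data between Ruby and Java
Models are written entirely in Java rather than generated from an IDL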
![Page 42: Embulk 20150411](https://reader034.fdocuments.net/reader034/viewer/2022051016/55a5ce0e1a28ab5c1b8b460e/html5/thumbnails/42.jpg)
Immutable model classes
Share of member variables declared final in the source code
> Presto: 83% (4714 variables)
> Embulk: 72% (255 variables)
> Cassandra: 59% (2348 variables)
> Elasticsearch: 51% (6871 variables)
> Nashorn (OpenJDK 8): 43% (852 variables)
> JRuby: 40% (3154 variables)
> Hadoop: 31% (9280 variables)
> Hive: 23% (4600 variables)
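What "final member variables" buys is easiest to see in a small model class. The sketch below is Jackson-free, and both the class name `ImmutableColumn` and the `withType` method are hypothetical. Every field is final, and a change is expressed by building a new instance instead of mutating the old one.

```java
import java.util.Objects;

// Hypothetical model class; a plain String stands in for Embulk's Type.
final class ImmutableColumn {
    private final String name;
    private final String type;

    ImmutableColumn(String name, String type) {
        this.name = Objects.requireNonNull(name, "name");
        this.type = Objects.requireNonNull(type, "type");
    }

    String getName() {
        return name;
    }

    String getType() {
        return type;
    }

    // A "setter" returns a modified copy and leaves this instance untouched.
    ImmutableColumn withType(String newType) {
        return new ImmutableColumn(name, newType);
    }

    @Override
    public boolean equals(Object other) {
        if (!(other instanceof ImmutableColumn)) {
            return false;
        }
        ImmutableColumn that = (ImmutableColumn) other;
        return name.equals(that.name) && type.equals(that.type);
    }

    @Override
    public int hashCode() {
        return Objects.hash(name, type);
    }

    public static void main(String[] args) {
        ImmutableColumn guessed = new ImmutableColumn("time", "string");
        ImmutableColumn fixed = guessed.withType("timestamp");
        System.out.println(guessed.getType() + " / " + fixed.getType());
    }
}
```

Instances like this can be shared between worker threads without locking, which fits the parallel execution model described earlier.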
![Page 43: Embulk 20150411](https://reader034.fdocuments.net/reader034/viewer/2022051016/55a5ce0e1a28ab5c1b8b460e/html5/thumbnails/43.jpg)
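Netty buffer allocator
All memory used for records is managed by Embulk itself
> Detect memory exhaustion before an OutOfMemoryError occurs
> Reduce GC cost
Makes it possible to run multiple bulk load sessions concurrently inside one server process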
![Page 44: Embulk 20150411](https://reader034.fdocuments.net/reader034/viewer/2022051016/55a5ce0e1a28ab5c1b8b460e/html5/thumbnails/44.jpg)
    public Buffer allocate(int minimumCapacity)
    {
        int size = MINIMUM_BUFFER_SIZE;
        while (size < minimumCapacity) {
            size *= 2;
        }
        return new NettyByteBufBuffer(nettyBuffer.buffer(size));
    }
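The sizing loop in `allocate` rounds every request up to the minimum size times a power of two, presumably so that buffers fall into a few size classes a pooling allocator can reuse. The sketch below isolates that arithmetic without Netty. The value of `MINIMUM_BUFFER_SIZE` is an assumption, since the slide does not show it, and the overflow guard is an addition that the slide's code does not have.

```java
// Hypothetical extraction of the sizing logic; no Netty required.
final class BufferSizing {
    // Assumed value for illustration; the slide does not show the real constant.
    static final int MINIMUM_BUFFER_SIZE = 8 * 1024;

    static int roundedCapacity(int minimumCapacity) {
        if (minimumCapacity > (1 << 30)) {
            // Doubling past 2^30 would overflow int and the loop would never end.
            throw new IllegalArgumentException("capacity too large: " + minimumCapacity);
        }
        int size = MINIMUM_BUFFER_SIZE;
        while (size < minimumCapacity) {
            size *= 2;
        }
        return size;
    }

    public static void main(String[] args) {
        for (int requested : new int[] {1, 8192, 8193, 100_000}) {
            System.out.println(requested + " -> " + roundedCapacity(requested));
        }
    }
}
```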
![Page 45: Embulk 20150411](https://reader034.fdocuments.net/reader034/viewer/2022051016/55a5ce0e1a28ab5c1b8b460e/html5/thumbnails/45.jpg)
Unsafe
airlift/slice - a wrapper around the sun.misc.Unsafe API
> Direct manipulation of byte sequences (serialization and deserialization)
> Fewer copies
Reference: http://frsyuki.hatenablog.com/entry/2014/03/12/155231
![Page 46: Embulk 20150411](https://reader034.fdocuments.net/reader034/viewer/2022051016/55a5ce0e1a28ab5c1b8b460e/html5/thumbnails/46.jpg)
    public void addRecord()
    {
        // record header
        bufferSlice.setInt(position, nextVariableLengthDataOffset);
        bufferSlice.setBytes(position + 4, nullBitSet);
        count++;

        this.position += nextVariableLengthDataOffset;
        this.nextVariableLengthDataOffset = fixedRecordSize;
        Arrays.fill(nullBitSet, (byte) 0);

        // flush if next record will not fit in this buffer
        if (buffer.capacity() < position + nextVariableLengthDataOffset + stringReferenceSize) {
            flush();
        }
    }
![Page 47: Embulk 20150411](https://reader034.fdocuments.net/reader034/viewer/2022051016/55a5ce0e1a28ab5c1b8b460e/html5/thumbnails/47.jpg)
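[Diagram labels: User; "Run from the transfer management console"; Treasure cloud storage; Embulk Worker; "Access from the management console".]
Fully automates everything from S3 import through export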
![Page 48: Embulk 20150411](https://reader034.fdocuments.net/reader034/viewer/2022051016/55a5ce0e1a28ab5c1b8b460e/html5/thumbnails/48.jpg)
Contributing to the Embulk project
> Pull-requests & issues on GitHub
> Posting blogs
  > "I tried it"
  > "I read the code"
  > "What's good / what's not"
> Talking on Twitter with the word "embulk"
> Writing & releasing plugins
> Windows support
> Integration with other software
  > ETL tools, Fluentd, Hadoop, Presto, …
![Page 49: Embulk 20150411](https://reader034.fdocuments.net/reader034/viewer/2022051016/55a5ce0e1a28ab5c1b8b460e/html5/thumbnails/49.jpg)
1. Distributed Systems Engineer
2. Integration Engineer
3. Software Engineer, MPP DBMS
4. Sales Engineer
5. Technical Support Engineer
(Marunouchi, Tokyo, Japan)
https://jobs.lever.co/treasure-data
We’re hiring!
ANALYTICS INFRASTRUCTURE. SIMPLIFIED IN THE CLOUD.
![Page 50: Embulk 20150411](https://reader034.fdocuments.net/reader034/viewer/2022051016/55a5ce0e1a28ab5c1b8b460e/html5/thumbnails/50.jpg)
![Page 51: Embulk 20150411](https://reader034.fdocuments.net/reader034/viewer/2022051016/55a5ce0e1a28ab5c1b8b460e/html5/thumbnails/51.jpg)
Fluentd and Embulk
![Page 52: Embulk 20150411](https://reader034.fdocuments.net/reader034/viewer/2022051016/55a5ce0e1a28ab5c1b8b460e/html5/thumbnails/52.jpg)
This?
![Page 53: Embulk 20150411](https://reader034.fdocuments.net/reader034/viewer/2022051016/55a5ce0e1a28ab5c1b8b460e/html5/thumbnails/53.jpg)
Or this?
![Page 54: Embulk 20150411](https://reader034.fdocuments.net/reader034/viewer/2022051016/55a5ce0e1a28ab5c1b8b460e/html5/thumbnails/54.jpg)
M x N → M + N
[Diagram: a central buffer/filter/route hub. Labels on the diagram: Access logs, Apache, App logs, Frontend, Backend, System logs, syslogd, Databases, Alerting, Nagios, Analysis, MongoDB, MySQL, Hadoop, Archiving, Amazon S3.]
![Page 55: Embulk 20150411](https://reader034.fdocuments.net/reader034/viewer/2022051016/55a5ce0e1a28ab5c1b8b460e/html5/thumbnails/55.jpg)
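Fluentd and Embulk
Streaming data collector vs. bulk data loader
> Streaming data, or bulk data with a clearly defined unit of transfer
> Arbitrary data vs. data validation and normalization
> Immediate delivery vs. transactional behavior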
![Page 56: Embulk 20150411](https://reader034.fdocuments.net/reader034/viewer/2022051016/55a5ce0e1a28ab5c1b8b460e/html5/thumbnails/56.jpg)
Running Embulk
> Install
> guess
> preview
> Repeated execution
![Page 57: Embulk 20150411](https://reader034.fdocuments.net/reader034/viewer/2022051016/55a5ce0e1a28ab5c1b8b460e/html5/thumbnails/57.jpg)
Installing embulk
Embulk is released on Bintray

    # install
    $ wget https://bintray.com/artifact/download/embulk/maven/embulk-0.2.0.jar -O embulk.jar
    $ chmod 755 embulk.jar
![Page 58: Embulk 20150411](https://reader034.fdocuments.net/reader034/viewer/2022051016/55a5ce0e1a28ab5c1b8b460e/html5/thumbnails/58.jpg)
Guess format & schema

    # guess
    $ vi partial-config.yml
    $ ./embulk guess partial-config.yml -o config.yml

partial-config.yml:

    in:
      type: file
      paths: [data/examples/]
    out:
      type: example

config.yml, filled in by guess plugins:

    in:
      type: file
      paths: [data/examples/]
      decoders:
      - {type: gzip}
      parser:
        charset: UTF-8
        newline: CRLF
        type: csv
        delimiter: ','
        quote: '"'
        header_line: true
        columns:
        - name: time
          type: timestamp
          format: '%Y-%m-%d %H:%M:%S'
        - name: account
          type: long
        - name: purchase
          type: timestamp
          format: '%Y%m%d'
        - name: comment
          type: string
    out:
      type: example
![Page 59: Embulk 20150411](https://reader034.fdocuments.net/reader034/viewer/2022051016/55a5ce0e1a28ab5c1b8b460e/html5/thumbnails/59.jpg)
Preview & fix config

    # preview
    $ ./embulk preview config.yml
    $ vi config.yml # if necessary

    +-------------------------+----------+-------------+
    |          time:timestamp | uid:long | word:string |
    +-------------------------+----------+-------------+
    | 2015-01-27 19:23:49 UTC |   32,864 |      embulk |
    | 2015-01-27 19:01:23 UTC |   14,824 |       jruby |
    | 2015-01-28 02:20:02 UTC |   27,559 |      plugin |
    | 2015-01-29 11:54:36 UTC |   11,270 |     fluentd |
    +-------------------------+----------+-------------+
![Page 60: Embulk 20150411](https://reader034.fdocuments.net/reader034/viewer/2022051016/55a5ce0e1a28ab5c1b8b460e/html5/thumbnails/60.jpg)
Deterministic run

    # run
    $ ./embulk run config.yml -o config.yml

config.yml after the run:

    in:
      type: file
      paths: [data/examples/]
      decoders:
      - {type: gzip}
      parser:
        charset: UTF-8
        newline: CRLF
        type: csv
        delimiter: ','
        quote: '"'
        header_line: true
        columns:
        - name: time
          type: timestamp
          format: '%Y-%m-%d %H:%M:%S'
        - name: account
          type: long
        - name: purchase
          type: timestamp
          format: '%Y%m%d'
        - name: comment
          type: string
      last_paths: [data/examples/sample_001.csv.gz]
    out:
      type: example
![Page 61: Embulk 20150411](https://reader034.fdocuments.net/reader034/viewer/2022051016/55a5ce0e1a28ab5c1b8b460e/html5/thumbnails/61.jpg)
Repeat

    # repeat
    $ ./embulk run config.yml -o config.yml
    $ ./embulk run config.yml -o config.yml

config.yml after the next run (all other keys unchanged):

    in:
      ...
      last_paths: [data/examples/sample_002.csv.gz]
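The slides only show `last_paths` advancing from `sample_001.csv.gz` to `sample_002.csv.gz` between runs; they do not say how the next files are chosen. One plausible rule, sketched below as an assumption and not as Embulk's implementation, is to sort the candidate paths and keep only those that come after the last recorded path. The names `IncrementalFiles` and `filesAfter` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of incremental file selection driven by a "last path" marker.
final class IncrementalFiles {
    // Returns the paths that sort strictly after lastPath; all paths if lastPath is null.
    static List<String> filesAfter(List<String> allFiles, String lastPath) {
        List<String> sorted = new ArrayList<>(allFiles);
        Collections.sort(sorted);
        List<String> result = new ArrayList<>();
        for (String file : sorted) {
            if (lastPath == null || file.compareTo(lastPath) > 0) {
                result.add(file);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> files = Arrays.asList(
                "data/examples/sample_002.csv.gz",
                "data/examples/sample_001.csv.gz",
                "data/examples/sample_003.csv.gz");
        System.out.println(filesAfter(files, null));
        System.out.println(filesAfter(files, "data/examples/sample_001.csv.gz"));
    }
}
```

A rule like this only works if file names sort in arrival order, which zero-padded sequence numbers or dates in the name provide.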
![Page 62: Embulk 20150411](https://reader034.fdocuments.net/reader034/viewer/2022051016/55a5ce0e1a28ab5c1b8b460e/html5/thumbnails/62.jpg)
Writing Embulk plugins
![Page 63: Embulk 20150411](https://reader034.fdocuments.net/reader034/viewer/2022051016/55a5ce0e1a28ab5c1b8b460e/html5/thumbnails/63.jpg)
InputPlugin

    module Embulk
      class InputExample < InputPlugin
        Plugin.register_input('example', self)

        def self.transaction(config, &control)
          # read config
          task = {
            'message' => config.param('message', :string, default: nil)
          }
          threads = config.param('threads', :int, default: 2)

          columns = [
            Column.new(0, 'col0', :long),
            Column.new(1, 'col1', :double),
            Column.new(2, 'col2', :string),
          ]

          # BEGIN here

          commit_reports = yield(task, columns, threads)

          # COMMIT here
          puts "Example input finished"

          return {}
        end

        def run(task, schema, index, page_builder)
          # index and page_builder are method arguments here, so use them directly
          puts "Example input thread #{index}…"

          10.times do |i|
            page_builder.add([i, 10.0, "example"])
          end
          page_builder.finish

          commit_report = { }
          return commit_report
        end
      end
    end
![Page 64: Embulk 20150411](https://reader034.fdocuments.net/reader034/viewer/2022051016/55a5ce0e1a28ab5c1b8b460e/html5/thumbnails/64.jpg)
OutputPlugin

    module Embulk
      class OutputExample < OutputPlugin
        Plugin.register_output('example', self)

        def self.transaction(config, schema, processor_count, &control)
          # read config
          task = {
            'message' => config.param('message', :string, default: "record")
          }

          puts "Example output started."
          commit_reports = yield(task)
          puts "Example output finished. Commit reports = #{commit_reports.to_json}"

          return {}
        end

        def initialize(task, schema, index)
          puts "Example output thread #{index}..."
          super
          @message = task.prop('message', :string)
          @records = 0
        end

        def add(page)
          page.each do |record|
            hash = Hash[schema.names.zip(record)]
            puts "#{@message}: #{hash.to_json}"
            @records += 1
          end
        end

        def finish
        end

        def abort
        end

        def commit
          commit_report = {
            "records" => @records
          }
          return commit_report
        end
      end
    end
![Page 65: Embulk 20150411](https://reader034.fdocuments.net/reader034/viewer/2022051016/55a5ce0e1a28ab5c1b8b460e/html5/thumbnails/65.jpg)
GuessPlugin

    # guess_gzip.rb
    module Embulk

      class GzipGuess < GuessPlugin
        Plugin.register_guess('gzip', self)

        GZIP_HEADER = "\x1f\x8b".force_encoding('ASCII-8BIT').freeze

        def guess(config, sample_buffer)
          if sample_buffer[0,2] == GZIP_HEADER
            return {"decoders" => [{"type" => "gzip"}]}
          end
          return {}
        end
      end

    end

    # guess_
    module Embulk

      class GuessNewline < TextGuessPlugin
        Plugin.register_guess('newline', self)

        def guess_text(config, sample_text)
          cr_count = sample_text.count("\r")
          lf_count = sample_text.count("\n")
          crlf_count = sample_text.scan(/\r\n/).length
          if crlf_count > cr_count / 2 && crlf_count > lf_count / 2
            return {"parser" => {"newline" => "CRLF"}}
          elsif cr_count > lf_count / 2
            return {"parser" => {"newline" => "CR"}}
          else
            return {"parser" => {"newline" => "LF"}}
          end
        end
      end

    end
![Page 66: Embulk 20150411](https://reader034.fdocuments.net/reader034/viewer/2022051016/55a5ce0e1a28ab5c1b8b460e/html5/thumbnails/66.jpg)
Releasing to RubyGems
Examples
> embulk-plugin-postgres-json.gem
  > https://github.com/frsyuki/embulk-plugin-postgres-json
> embulk-plugin-redis.gem
  > https://github.com/komamitsu/embulk-plugin-redis
> embulk-plugin-input-sfdc-event-log-files.gem
  > https://github.com/nahi/embulk-plugin-input-sfdc-event-log-files
![Page 67: Embulk 20150411](https://reader034.fdocuments.net/reader034/viewer/2022051016/55a5ce0e1a28ab5c1b8b460e/html5/thumbnails/67.jpg)
plugin bundle
> embulk bundle <dir>
> Gemfile
![Page 68: Embulk 20150411](https://reader034.fdocuments.net/reader034/viewer/2022051016/55a5ce0e1a28ab5c1b8b460e/html5/thumbnails/68.jpg)
Attention!
[Diagram: a grid of boxes, each labeled "Presto".]
![Page 69: Embulk 20150411](https://reader034.fdocuments.net/reader034/viewer/2022051016/55a5ce0e1a28ab5c1b8b460e/html5/thumbnails/69.jpg)
Attention!
[Diagram: a grid of boxes, each labeled "Hive".]
![Page 70: Embulk 20150411](https://reader034.fdocuments.net/reader034/viewer/2022051016/55a5ce0e1a28ab5c1b8b460e/html5/thumbnails/70.jpg)
[Diagram: a grid of "Hive" boxes together with boxes labeled Presto, Prestogres, hba.conf, and PostgreSQL.]