
Transcript of I/O Bottleneck Investigation in Deep Learning Systems

[Figure: Total execution time (s), log scale, vs. number of processes (1–9216) for LMDB, LMDBIO-LMM, LMDBIO-LMM-DIO, LMDBIO-LMM-DM, LMDBIO-LMM-DIO-PROV, LMDBIO-LMM-DIO-PROV-COAL, and LMDBIO-LMM-DIO-PROV-COAL-STAG.]

I/O Bottleneck Investigation in Deep Learning Systems
Sarunya Pumma,1,2 Min Si,2 Wu-chun Feng,1 and Pavan Balaji2

    Motivation

    1Virginia Tech, 2Argonne National Laboratory

    Deep Learning & Challenges

Robotics: Asimo (Honda)

Offline & Online Data Analytics: Real-Time News Feed (Facebook)

Facial Recognition: Deep Dense Face Detector (Yahoo Labs)

[Figure: Workload regimes by network size (width and depth) vs. batch size (# samples): compute bound, communication bound, and I/O bound.]

Compute bound:
• High-dimensional input data
• Image classification
• Data Science Bowl's tumor detection from CT scans

Communication bound:
• Networks with a large number of parameters
• Unsupervised image feature extraction
• LLNL's network with 15 billion parameters

I/O bound:
• High-volume data
• Sentiment analysis: Twitter analysis, Yelp's review fraud detection
• Image classification: ImageNet's image classification

Other examples annotated in the figure: image feature extraction; tumor detection from CT scans.

    In the past decade …

• 10–20x improvement in processor speed
• 10–20x improvement in network speed
• Only 1.5x improvement in I/O performance

I/O will eventually become a bottleneck for most computations.

Deep Learning Scaling

[Figure: Overall Training Time (CIFAR10-Large-AlexNet, 512 iterations).]
[Figure: Training Time Breakdown (CIFAR10-Large-AlexNet, 512 iterations).]

    LMDB Inefficiencies (cont.)

    Caffe’s I/O Subsystem: LMDB

    Problem 1: Mmap’s Interprocess Contention

Underlying I/O in mmap relies on the CFS scheduler to wake up processes after I/O has completed:
• Processes are put to sleep while waiting for I/O to complete
• The I/O-completion interrupt is a bottom-half interrupt
• The handler has no knowledge of the specific process that triggered the I/O operation
• Every process that is waiting for I/O is marked as runnable
• Every reader is woken up each time an I/O interrupt comes in
• This causes a large number of unnecessary context switches

[Figure: Context switches (millions) vs. number of processes (1–512).]

[Figure: Read time breakdown (user time, kernel time, sleep time) vs. number of processes (1–512).]

Problem 2: Sequential Data Access Restriction
• LMDB data access is sequential in nature due to the B+-tree structure
• There is no way to randomly access a data record
• All branch nodes associated with the preceding records must be read before accessing a particular record
• When multiple processes read the data, they read extra data
• Different processes do different amounts of work, causing skew (see the sketch after the figure below)

[Figure: LMDB redundant data movement. P0 reads D0 directly; P1, P2, and P3 must each seek through all preceding records before reading D1, D2, and D3, respectively.]
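To make the restriction concrete, here is a minimal C sketch using LMDB's public cursor API: the only way to reach record k is to advance the cursor past every preceding record. `seek_to_record` is an illustrative helper, not an LMDB function, and error handling is omitted.

```c
/* Sketch: reaching record k in LMDB means walking the cursor through all
 * preceding records (MDB_NEXT); there is no random access by record index. */
#include <lmdb.h>
#include <stddef.h>

int seek_to_record(MDB_cursor *cursor, size_t k)
{
    MDB_val key, data;
    int rc = mdb_cursor_get(cursor, &key, &data, MDB_FIRST);
    for (size_t i = 0; i < k && rc == 0; i++)
        rc = mdb_cursor_get(cursor, &key, &data, MDB_NEXT); /* touches every branch node on the way */
    return rc;
}
```

This is why a process assigned records [k, k+n) still pays for the first k records' branch nodes, which is the skew shown in the figure above.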

Our Solution: LMDBIO (cont.)

LMDBIO-LMM-DM (cont.)

[Figure: Part II: Parallel I/O and in-memory sequential seek. All processes read their estimated portions of D0–D3 concurrently; P0 then seeks sequentially (in memory) and sends its cursor to P1, P1 seeks and sends it to P2, P2 to P3; each process accesses its data once it has received the cursor and finished its seek.]

    Our Solution: LMDBIO

Optimization: take into account the data access pattern of deep learning and Linux's I/O scheduling to reduce mmap's contention

[Figure: LMDBIO-LMM data path. The shared file system (shared between nodes) is read into the page cache; Process 0 maps it into its mmap buffer and copies it into MPI shared memory; Processes 0, 1, and 2 all access the shared-memory buffer.]

• Localized mmap: only one process does mmap on each node
• MPI shared memory (MPI-3) is used to share the data
• Even though LMDBIO introduces an extra copy (from the mmap buffer to shared memory), Caffe still benefits from LMDBIO (see the sketch below)
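A minimal sketch of the localized-mmap idea using standard MPI-3 shared-memory windows; the function name, buffer sizing, and single full-buffer copy are illustrative assumptions, not LMDBIO's actual code.

```c
/* Sketch of localized mmap (LMM): one process per node maps the database
 * and copies records into an MPI-3 shared-memory window that the other
 * on-node processes read directly. */
#include <mpi.h>
#include <string.h>

void lmm_share(const void *mmap_buf, size_t nbytes)
{
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);

    int node_rank;
    MPI_Comm_rank(node_comm, &node_rank);

    /* Rank 0 on each node allocates the full buffer; the others allocate
     * zero bytes and query rank 0's segment. */
    MPI_Win win;
    void *base;
    MPI_Win_allocate_shared(node_rank == 0 ? (MPI_Aint)nbytes : 0, 1,
                            MPI_INFO_NULL, node_comm, &base, &win);

    MPI_Aint size;
    int disp;
    void *shared;
    MPI_Win_shared_query(win, 0, &size, &disp, &shared);

    if (node_rank == 0)
        memcpy(shared, mmap_buf, nbytes);  /* the one extra copy noted above */
    MPI_Barrier(node_comm);
    /* ... all on-node processes now read records from 'shared' ...
     * (MPI_Win_free(&win) when the buffer is no longer needed) */
}
```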

LMDBIO-LMM

LMDB Inefficiencies (figures above: Context Switches; Read Time Breakdown)

• Uses the Lightning Memory-Mapped Database (LMDB) for accessing the dataset
• B+-tree representation of the data
• Database is mapped to memory using mmap and accessed through direct buffer arithmetic
• Virtual memory is allocated for the size of the full file
• Specific physical pages are dynamically loaded by the OS on demand

Pros: makes it easy to manipulate complex data structures (e.g., B+ trees), since LMDB can treat the database as fully in-memory.
Cons: the OS has very little knowledge of the access model and parallelism, making it hard to optimize.

    Part II: Speculative Parallel I/O

• We use history-based training for our estimation
• We correct our estimate in each iteration based on the actual data read in all of the previous iterations
• The general idea of our correction is to expand the speculative boundaries to reduce the number of missed pages
• Initial iterations might be slightly inaccurate, but we converge fairly quickly (1–2 iterations)

• Each process estimates the pages it will need and speculatively fetches those pages into memory in parallel
• Each process then sequentially seeks the location for another process and sends the cursor to the next-higher-rank process
• The expectation is that the seek can be done entirely in memory
• Once the sequential seek is done, each reader can perform the actual data access
• This adds a small amount of extra data reading, but allows parallel I/O

• The estimation of the number of pages to fetch is based on the first record's data size
• E.g., a CIFAR10-Large record is 3 KB, which is ~1 page; to read n records, a process fetches n pages
• The estimation of the read offset is performed in the same fashion
• Estimating the "approximate" start and end location for each process is important
• If the estimate is completely wrong, we end up reading up to 2x the dataset size (still better than LMDB)

Estimation of Speculative I/O
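The estimation just described can be sketched as follows; the struct layout, slack-based boundary widening, and 4 KB page size are illustrative assumptions, not the actual LMDBIO data structures.

```c
/* Sketch of the speculative I/O estimate: pages and start offset are
 * guessed from the first record's size, then the boundaries are widened
 * by a correction learned from earlier iterations. */
#include <stddef.h>

#define PAGE_SIZE 4096

typedef struct {
    size_t record_bytes; /* size of the first record, e.g., ~3 KB for CIFAR10-Large */
    size_t slack_pages;  /* correction: grown whenever a previous iteration missed pages */
} spec_state;

/* Estimate the page range process 'rank' should prefetch for its n records. */
void estimate_pages(const spec_state *s, int rank, size_t n,
                    size_t *first_page, size_t *num_pages)
{
    size_t pages_per_rec = (s->record_bytes + PAGE_SIZE - 1) / PAGE_SIZE; /* ~1 page */
    size_t start = (size_t)rank * n * pages_per_rec;

    /* Widen both speculative boundaries by the learned slack. */
    *first_page = (start > s->slack_pages) ? start - s->slack_pages : 0;
    *num_pages  = n * pages_per_rec + 2 * s->slack_pages;
}
```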

Results

Sarunya Pumma, Min Si, Wu-chun Feng, and Pavan Balaji. Towards Scalable Deep Learning via I/O Analysis and Optimization. IEEE International Conference on High Performance Computing and Communications (HPCC), Dec. 18-20, 2017, Bangkok, Thailand.

[Figure: Overall training time (s), log scale, vs. number of processes (1–9216): Caffe/LMDB vs. ideal scaling; Caffe/LMDB is 660x worse than ideal at 9216 processes.]

[Figure: Execution time breakdown vs. number of processes (1–9216): read time, transform time, total forward time, total backward time, wait time before param sync, param sync time, param calculation time, and param update time.]

Platform: Argonne's Blues
• InfiniBand QLogic QDR
• 110 TB GPFS storage
• Each node: 2 Sandy Bridge 2.6 GHz Intel Xeon processors (16 cores, hyperthreading disabled), 64 GB memory, 15 GB RAM disk
Dataset: CIFAR10-Large
Network: AlexNet
MPI: MVAPICH2-2.2

Problem 3: Mmap's Workflow Overheads
• Since mmap performs implicit I/O, the user has no control over when an I/O operation is issued
• To showcase this overhead, we developed a microbenchmark that reads a 256 GB file using a single reader on a single machine
• The mmap benchmark uses memcpy on a mmap buffer
• The POSIX I/O benchmark uses pread
• mmap's read bandwidth is approximately 2.5x lower than that of POSIX I/O (a sketch of the two read paths follows)

    Problem 4: I/O Block Size Management

[Figure: Read bandwidth (GB/s) vs. I/O request size (4 KB to 1024 MB) for mmap and POSIX I/O.]

• As the number of processes increases, the subbatch becomes smaller
• POSIX I/O benefits from a larger block size, while mmap does not
• Migrating LMDB to use direct I/O and a larger block size can give a significant performance improvement

    Problem 5: I/O Randomization

• I/O requests are typically out of order in parallel I/O
• A large number of processes divide a large file into smaller pieces, and each process accesses a part of it
• Each process issues an I/O request at the same time
• I/O requests do not arrive at the I/O server processes in any specific order, as each process is independent
• This causes the server processes to access the file in a nondeterministic fashion

[Figure: I/O randomization. Clients 1–8 issue requests at the same time; Server 1's request queue receives requests 1, 3, 5, 7 and Server 2's receives 2, 4, 6, 8, but each server accesses its file out of order (e.g., 5, 3, 7, 1 and 2, 6, 8, 4).]

Summary of LMDBIO Optimizations

Library | Optimization           | Reducing Interprocess Contention | Explicit I/O | Eliminating Sequential Seek | Managing I/O Size | Reducing I/O Randomization
LMDB    | -                      |   |   |             |   |
LMDBIO  | LMM                    | ✔ |   |             |   |
LMDBIO  | LMM-DM                 | ✔ |   | ✔ (partial) |   |
LMDBIO  | LMM-DIO                | ✔ | ✔ |             |   |
LMDBIO  | LMM-DIO-PROV           | ✔ | ✔ | ✔           |   |
LMDBIO  | LMM-DIO-PROV-COAL      | ✔ | ✔ | ✔           | ✔ |
LMDBIO  | LMM-DIO-PROV-COAL-STAG | ✔ | ✔ | ✔           | ✔ | ✔

LMDBIO-LMM-DM

Optimization: coordinate between reader processes to improve parallelism

Portable Cursor Representation
• LMDB calls the position indicator for a record within the B+ tree a "cursor"
• A cursor is not a simple offset from the start of the file: it contains the complete path of the record's parent branch nodes (multiple pointers), a pointer to the page header, and access flags
• It is not trivial to port pointers across processes, as virtual address spaces differ

• Serialize data reading and coordinate between processes
• Each process reads its data and sends the next-higher-rank process the location from which to start fetching its data
• This allows NO extra data reading; the number of bytes read is EXACT, but I/O is done sequentially

    Part I: Serializing I/O

[Figure: Part I: Sequential I/O and cursor handoff. P0 reads D0 and sends its cursor to P1; P1 reads D1 and sends the cursor to P2; P2 reads D2 and sends the cursor to P3; P3 reads D3. Reads proceed sequentially.]

Portable Cursor Representation (cont.)
• Our solution: a symmetric address space
• Every process memory-maps the database file to the same memory location
• This allows the pointers within the B+ tree to be portable across processes (a sketch follows)
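A minimal sketch of one way to realize a symmetric address space with POSIX mmap and an MPI broadcast; this is illustrative only (real code must guarantee the chosen range is free in every process, and LMDB's MDB_FIXEDMAP flag serves a related purpose).

```c
/* Sketch of the symmetric address space: rank 0 maps the database and
 * broadcasts its base address; every other process maps the same file at
 * exactly that address (MAP_FIXED), so B+-tree pointers remain valid
 * everywhere. MAP_FIXED clobbers existing mappings, so real code must
 * first verify the range is available. */
#include <fcntl.h>
#include <mpi.h>
#include <sys/mman.h>

void *map_symmetric(const char *path, size_t len, MPI_Comm comm)
{
    int rank, fd = open(path, O_RDONLY);
    MPI_Comm_rank(comm, &rank);

    void *addr = NULL;
    if (rank == 0)
        addr = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    MPI_Bcast(&addr, sizeof(addr), MPI_BYTE, 0, comm);

    if (rank != 0)
        addr = mmap(addr, len, PROT_READ, MAP_SHARED | MAP_FIXED, fd, 0);
    return addr; /* identical virtual address in every process */
}
```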

    LMDBIO-LMM-DIO

[Figure: LMDBIO-LMM-DIO timeline. P0 seeks via mmap and scatters offsets while P1 and P2 wait; then P0, P1, and P2 each read data into the shared buffer with POSIX I/O.]

Optimization: replace mmap with POSIX I/O
• To use direct I/O, we need to know the position of each data record
• The root process gets the offsets of all data samples by seeking the database using mmap
• The sequential seek is unavoidable because the offsets are not deterministic
• The other reader processes receive their offsets from the root and perform data reading using POSIX I/O
• Readers share data through the MPI shared buffer in the same way as LMM (a sketch follows)
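A minimal sketch of the offset scatter and explicit read; the function signature, `my_bytes`, and buffer handling are illustrative assumptions (the root's mmap seek that produces `all_offsets` is not shown).

```c
/* Sketch of LMM-DIO: the root scatters the per-reader byte offsets it
 * discovered via the mmap seek, and every process then reads its portion
 * with POSIX I/O into the node's shared buffer. */
#include <mpi.h>
#include <unistd.h>

void dio_read(int fd, const long *all_offsets /* valid on root only */,
              char *shared_buf, size_t my_bytes, MPI_Comm comm)
{
    long my_offset;
    MPI_Scatter(all_offsets, 1, MPI_LONG, &my_offset, 1, MPI_LONG, 0, comm);
    pread(fd, shared_buf, my_bytes, (off_t)my_offset); /* explicit I/O */
}
```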

LMDBIO-LMM-DIO-PROV

Optimization: utilize provenance information to entirely replace mmap with POSIX I/O
• Makes a case for storing data provenance information for deep learning (how the data was created)
• LMDB's database layout can be deterministic only if the information about how it was created is provided
• We can compute exactly where the data pages are located (a sketch follows)
• The sequential seek can be completely eliminated
• All I/O operations can be done via direct I/O (mmap is completely removed)
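A sketch of the arithmetic that provenance enables, assuming a hypothetical auxiliary-file layout with a uniform record size; the actual provenance format is the proposed extension described in the notes below.

```c
/* Sketch of what provenance enables: if the auxiliary file records how
 * the database was generated (uniform record size, records packed per
 * page, pages reserved for metadata), the page holding record i follows
 * from arithmetic instead of a sequential seek. */
#include <stddef.h>

typedef struct {
    size_t page_size;        /* e.g., 4096 */
    size_t records_per_page; /* from the database creation parameters */
    size_t first_data_page;  /* pages occupied by metadata/branch nodes */
} provenance; /* assumed layout, for illustration */

size_t record_offset(const provenance *p, size_t i)
{
    size_t page = p->first_data_page + i / p->records_per_page;
    return page * p->page_size; /* byte offset usable with pread/direct I/O */
}
```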

Important Notes
• Provenance information is not stored in the original LMDB format; this is an extension that we are proposing
• We use a separate auxiliary file to store this information
• This file can be created while the database is being generated, or later with a one-time read of the database
• It is much smaller than the dataset itself (a few hundred bytes)

    LMDBIO-LMM-DIO-PROV-COAL

Sarunya Pumma, Min Si, Wu-chun Feng, and Pavan Balaji. Parallel I/O Optimizations for Scalable Deep Learning. IEEE International Conference on Parallel and Distributed Systems (ICPADS), Dec. 15-17, 2017, Shenzhen, China.

Optimization: coalesce multiple batches of data to be read at once so that direct I/O benefits from a large I/O size
• We read a larger chunk of data at a time, enlarging each I/O operation to eliminate skew in I/O
• A constant amount of memory is set aside for data reading
• We read multiple batches of data at once (a sketch follows)
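A minimal sketch of the coalescing idea under a fixed memory budget; the sizes and the single-pread formulation are illustrative assumptions.

```c
/* Sketch of batch coalescing: with a fixed memory budget, several batches
 * are fetched in one large pread instead of many small ones, so direct
 * I/O sees a large request size. */
#include <unistd.h>

void coalesced_read(int fd, char *buf, size_t budget_bytes,
                    size_t batch_bytes, off_t start)
{
    size_t nbatches = budget_bytes / batch_bytes;  /* batches per I/O call */
    pread(fd, buf, nbatches * batch_bytes, start); /* one large request */
    /* ... the training loop then consumes 'nbatches' batches from 'buf' ... */
}
```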

LMDBIO-LMM-DIO-PROV-COAL-STAG

Optimization: adopt I/O staggering to reduce I/O randomization
• The I/O staggering technique orders the requests
• Readers are divided into multiple groups with the same number of members
• Only one group performs data reading at a time
• MPI_Send and MPI_Recv are used in the implementation (a sketch follows)
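A minimal sketch of staggering with the MPI_Send/MPI_Recv handoff the poster mentions; the pairwise rank-to-rank token passing is an illustrative simplification (production code would synchronize whole groups, e.g., with a per-group barrier, before releasing the next group).

```c
/* Sketch of I/O staggering: readers are split into 'ngroups' groups and a
 * token is passed with MPI_Send/MPI_Recv so a reader starts its I/O only
 * after its counterpart in the previous group has finished. */
#include <mpi.h>
#include <unistd.h>

void staggered_read(int fd, char *buf, size_t nbytes, off_t offset,
                    int ngroups, MPI_Comm comm)
{
    int rank, size, token = 0;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    int gsize = (size + ngroups - 1) / ngroups; /* members per group */
    int group = rank / gsize;

    if (group > 0) /* wait for the counterpart rank in the previous group */
        MPI_Recv(&token, 1, MPI_INT, rank - gsize, 0, comm,
                 MPI_STATUS_IGNORE);

    pread(fd, buf, nbytes, offset);

    if (group < ngroups - 1 && rank + gsize < size) /* release the next group */
        MPI_Send(&token, 1, MPI_INT, rank + gsize, 0, comm);
}
```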

• Caffe/LMDB is 660x worse than ideal for 9216 processes
• Read time takes up 90% of the total training time for 9216 processes
• The I/O bottleneck is caused by five major problems:

1. Interprocess contention -- results in an excessive number of context switches
2. Implicit I/O inefficiency -- the OS fully controls I/O
3. Sequential data access restriction -- arbitrary database access is not allowed in LMDB
4. Inefficient I/O block size -- the I/O request size is too small to be efficient
5. I/O randomization -- abundant readers participate in I/O at the same time

• We proposed six optimizations that address the five problems in the state-of-the-art I/O subsystem for deep learning

Experiment Information
Dataset: CIFAR10-Large
Network: AlexNet
Batch size: 18,432
Training iterations: 512
Framework: Caffe
Testbed: LCRC Bebop (each node: 36-core Intel Broadwell, 128 GB memory)

[Figure: Factor of improvement over LMDB in total execution time vs. number of processes for LMDBIO-LMM, LMDBIO-LMM-DIO, LMDBIO-LMM-DM, LMDBIO-LMM-DIO-PROV, LMDBIO-LMM-DIO-PROV-COAL, and LMDBIO-LMM-DIO-PROV-COAL-STAG; improvements reach up to 64.4x over LMDB at the largest scale.]