for Scientific Developing - UT Southwestern · 2019-06-27 · for Scientific Developing 1 BioHPC...
![Page 2: for Scientific Developing - UT Southwestern · 2019-06-27 · for Scientific Developing 1 BioHPC Training 10/19/2016 biohpc-help@utsouthwestern.edu](https://reader033.fdocuments.net/reader033/viewer/2022042405/5f1c5c6b46cde141dc510a3c/html5/thumbnails/2.jpg)
Python: a popular high-level language

Learning Python from scratch
http://www.codecademy.com/en/tracks/python
Free, interactive web-based tutorial. Great for new programmers.
http://learnpythonthehardway.org/book/
Free web book. Exercise based, comprehensive.

Python topics to be covered
- Virtual environments & Anaconda
- IPython Notebook
- NumPy/SciPy
- Matplotlib
- Interactive plotting using Bokeh
- Turning a Python package into a web app

Online PySCA – An ongoing case study
A published Python package by Ranganathan and Reynolds from GCSB:
- Command line based
- Nice IPython Notebook tutorial
- Detailed documentation
Challenges:
- Use from command line
- Dependencies
- Interactive presentation
- Broader impact
Solutions:
- A Python web app

Python: pros
A clean, easy to learn language
Huge number of community created packages
Booming popularity for scientific computing
Python bindings / API for a lot of other software
Open source – Free!

Python: cons
Dependency Hell
Affects all modern languages, especially interpreted ones.
Python is especially challenging:
• Huge number of third-party packages
• Rapidly changing APIs
• Scientific packages need non-Python dependencies.
Solutions: Anaconda, virtualenv, etc.

Anaconda
Download for your own machine (free):
http://continuum.io/downloads
Use on BioHPC cluster or clients:
module load python/2.7.x-anaconda
module load python/3.4.x-anaconda
Other python modules are deprecated
Manages python packages AND their non-python dependencies.
Allows creation of multiple environments, with versions you need for specific projects.

Anaconda – Default Environment
162 packages, including full scientific python stack
Spyder scientific development environment
IPython Notebook, ready to run
$ conda list
$ ipython notebook
$ spyder

Python on BioHPC

Anaconda – Create Your Own Environment
The main module installation must be stable. We won't update packages in it frequently.
The conda tool lets you create your own environments with the versions you need. The virtual environment is installed in $HOME/.conda
# Create a new environment with latest anaconda package set
conda create -n test anaconda
# See environments available
conda env list
# Start using / switch back to default
source activate test
source deactivate
# Remove environment
conda env remove --name test

Anaconda – Create Your Own Environment
# Create a minimal environment containing only the packages you name
# Won't install all of the conda package set
conda create -n test1 numpy scipy matplotlib bokeh
# Start using the environment
source activate test1
# Add ipython to this active environment
conda install ipython
# Update the numpy package to the latest version
conda update numpy
# Install a non-conda package from PyPI using pip
conda search colour
pip search colour
pip install colour
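After activating an environment it can be useful to confirm, from inside Python, which interpreter is actually running. The sketch below is one way to do that; the helper name `describe_environment` is made up for illustration, and it assumes conda has set the `CONDA_DEFAULT_ENV` variable on activation (it reports `None` otherwise).

```python
import os
import sys


def describe_environment():
    """Return basic facts about the interpreter that is currently running."""
    return {
        # Directory the running interpreter was installed into.
        "prefix": sys.prefix,
        # Set by conda when an environment is activated; None otherwise.
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),
        "python_version": "%d.%d.%d" % sys.version_info[:3],
    }


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(key, "=", value)
```

Running this before and after `source activate test1` should show the prefix switching to a directory under `$HOME/.conda`.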

SCA – Statistical coupling analysis of protein families
Characterizes the pattern of evolutionary constraints on amino acid positions. Given sequence alignments, SCA measures the functional constraint at each position and the correlations between positions.
Four steps to use the pySCA software, e.g.:
./annotate_MSA.py Inputs/PF00186_full.txt \
    -o Outputs/PF00186_full.an -a 'pfam'
./scaProcessMSA.py Inputs/PF00186_full.an \
    -s 1RX2 -c A -f 'Escherichia coli' -t
./scaCore.py PF00186_full.db -n 'frob' -l 0.03 -t 10
./scaSectorid.py PF00186_full.db -p 0.95
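For scripted use, the four commands above could be chained from Python. This is only a sketch: `build_pysca_commands` and `run_pipeline` are hypothetical helper names, the scripts are assumed to sit in the current directory, and the options are copied from the commands on this slide rather than from the pySCA documentation.

```python
import subprocess


def build_pysca_commands(family="PF00186_full"):
    """Return the four pySCA steps from the slide as argument lists."""
    return [
        ["./annotate_MSA.py", "Inputs/%s.txt" % family,
         "-o", "Outputs/%s.an" % family, "-a", "pfam"],
        ["./scaProcessMSA.py", "Inputs/%s.an" % family,
         "-s", "1RX2", "-c", "A", "-f", "Escherichia coli", "-t"],
        ["./scaCore.py", "%s.db" % family, "-n", "frob", "-l", "0.03", "-t", "10"],
        ["./scaSectorid.py", "%s.db" % family, "-p", "0.95"],
    ]


def run_pipeline(commands, runner=subprocess.check_call):
    """Run each step in order; check_call raises if a step exits non-zero."""
    for argv in commands:
        runner(argv)


if __name__ == "__main__":
    # Only print the commands here; call run_pipeline() where pySCA is installed.
    for argv in build_pysca_commands():
        print(" ".join(argv))
```

Passing argument lists instead of one shell string avoids quoting problems with values such as 'Escherichia coli'.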

IPython Notebook

NumPy
a = range(100000)
for i in range(100):
    for n in a:
        b = n**2
100 loops, elapsed 1.53494286537 sec
import numpy as np
a = np.arange(100000)
for i in range(100):
    b = a**2
100 loops, elapsed 0.0342528820038 sec
The numpy method is 44.8120793223 times faster
Much faster!
Notebook-1,2
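To reproduce the comparison on your own machine, a self-contained timing script might look like the sketch below. The function names are illustrative, the repeat count is reduced to 10 so it finishes quickly, and the measured speed-up will differ from the figures above depending on hardware and Python version.

```python
import time

import numpy as np


def time_python_loop(n, repeats):
    """Square n integers one by one in the interpreter, `repeats` times."""
    start = time.perf_counter()
    for _ in range(repeats):
        squares = [v ** 2 for v in range(n)]
    return time.perf_counter() - start, squares


def time_numpy(n, repeats):
    """Square the same n integers as a single array operation."""
    values = np.arange(n, dtype=np.int64)
    start = time.perf_counter()
    for _ in range(repeats):
        squares = values ** 2
    return time.perf_counter() - start, squares


if __name__ == "__main__":
    loop_time, loop_squares = time_python_loop(100000, 10)
    numpy_time, numpy_squares = time_numpy(100000, 10)
    assert numpy_squares.tolist() == loop_squares  # same answer either way
    print("Python loop: %.4f sec" % loop_time)
    print("NumPy:       %.4f sec" % numpy_time)
    print("Speed-up:    %.1fx" % (loop_time / numpy_time))
```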

Python: cons
Python is slooooooow…..
Trades execution speed for development speed.
Solution: Move critical portions closer to machine code.
• Directly call C code - Cython
• Use modules built on optimized, compiled code, e.g. NumPy builds on BLAS / LAPACK

NumPy
NumPy performs (multi-dimensional) array arithmetic much faster than native Python objects by using low-level contiguous arrays and compiled libraries.
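As a small illustration of what "contiguous arrays" means in practice, the sketch below inspects the memory layout of an array and then applies whole-array arithmetic. The variable names are arbitrary; whether the matrix product ends up in an optimized BLAS routine depends on how NumPy was built.

```python
import numpy as np

# A 3 x 4 array stored as one contiguous block of 64-bit floats.
grid = np.arange(12, dtype=np.float64).reshape(3, 4)

print(grid.flags["C_CONTIGUOUS"])  # True: row-major, no gaps between elements
print(grid.itemsize, grid.nbytes)  # bytes per element, bytes in total
print(grid.strides)                # bytes to step to the next row / column

# Whole-array arithmetic runs in compiled loops, not the Python interpreter.
scaled = 2.0 * grid + 1.0

# Matrix product; for float arrays NumPy typically hands this to BLAS.
product = grid @ grid.T
print(product.shape)
```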

SciPy and Matplotlib
Notebook-3,4
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import cm
from scipy.special import jn, jn_zeros  # Bessel function and its zeros

def drumhead_height(n, k, distance, angle, t):
    # Height of vibration mode (n, k) of a circular drumhead at time t
    kth_zero = jn_zeros(n, k)[-1]  # k-th positive zero of J_n
    return np.cos(t) * np.cos(n * angle) * jn(n, distance * kth_zero)

theta = np.r_[0:2 * np.pi:50j]
radius = np.r_[0:1:50j]
x = np.array([r * np.cos(theta) for r in radius])
y = np.array([r * np.sin(theta) for r in radius])
z = np.array([drumhead_height(1, 1, r, theta, 0.5) for r in radius])

fig = plt.figure()
ax = fig.add_subplot(projection='3d')  # 3D axes attached to the figure
ax.plot_surface(x, y, z, rstride=1, cstride=1, cmap=cm.jet)
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_zlabel('z')
plt.show()

Turn pySCA into a website
Tasks
- Web interface
- Multiple users
- Run analysis
- Display results to explore
Design
- Django framework
- Database model design
- Simple form interface
- Celery backend execution
- Direct call to pySCA
- Interactive plotting

Django – web server glues everything
Forms for users
Django turns forms into database entries

Celery – distributed task queue
User submits form
Django converts the form into a command and issues async Celery tasks
Celery takes the command and runs the job in its queue
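The submit-then-poll pattern behind this flow can be sketched without a Celery installation. Celery itself needs a message broker and separate worker processes, so the example below swaps in the standard library's `concurrent.futures` as a stand-in; `run_analysis` is a hypothetical placeholder for a pySCA run, not part of Django, Celery, or pySCA.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_analysis(job_id, n):
    """Hypothetical long-running job; stands in for a pySCA run."""
    time.sleep(0.1)  # pretend to work
    return {"job_id": job_id, "result": sum(i * i for i in range(n))}


# One background worker plays the role of the task queue.
queue = ThreadPoolExecutor(max_workers=1)

# "Form submitted": hand the job to the queue and return immediately.
ticket = queue.submit(run_analysis, job_id=1, n=1000)

# The web request would end here; a later request can poll the ticket.
while not ticket.done():
    time.sleep(0.05)

outcome = ticket.result()
queue.shutdown()
print(outcome)
```

The point of the design is that the web server never blocks on the analysis: it stores a handle to the job and reports the result once the worker has finished.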

Bokeh - Interactive plotting tool
Notebook – 5,6
Output can be an HTML file, a web page, or an IPython Notebook

Take home messages
- Python is easy and fun to use.
- Use a separate virtual environment for each project.
- Use compiled libraries for speed.
- Python has interactive packages for data sharing and exploring.
- Python can do a lot!

Acknowledgements / License
pySCA is a package developed by Olivier Rivoire, Kimberly Reynolds, and Rama Ranganathan. cf. Olivier Rivoire, Kimberly Reynolds, and Rama Ranganathan, Evolution-Based Functional Decomposition of Proteins, PLOS Computational Biology, 12(6): e1004817.
NumPy, SciPy, mpi4py and multiprocessing code was taken from the Texas Advanced Computing Center hpc-python course. Slides have been reformatted to UTSW style. Code was changed accordingly. Source: https://portal.tacc.utexas.edu/-/hpc-python
"HPC Python", Texas Advanced Computing Center, 2015. Available under a Creative Commons Attribution Non-Commercial 3.0 Unported License.
The Bokeh medal example was taken from bokeh.pydata.org
Further reference: www.cism.ucl.ac.be/Services/Formations//python/2015

Python: cons
Global Interpreter Lock
You can create many threads, but only one runs Python code at a time.
Solution: multiple processes
[Figure: timeline of Thread A and Thread B, each running a = list(), a.append(1), item = a.pop()]
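One way to see the effect is to time a CPU-bound function run twice in sequence and then in two threads. The sketch below assumes a standard CPython build with the GIL enabled; the function names are illustrative, and on such a build the two timings usually come out similar because the threads take turns instead of running in parallel.

```python
import threading
import time


def count_down(n):
    """CPU-bound busy loop: pure Python bytecode, so it needs the GIL."""
    while n > 0:
        n -= 1


def run_sequential(n):
    """Run the loop twice, one after the other, and return elapsed seconds."""
    start = time.perf_counter()
    count_down(n)
    count_down(n)
    return time.perf_counter() - start


def run_threaded(n):
    """Run the loop in two threads at once and return elapsed seconds."""
    threads = [threading.Thread(target=count_down, args=(n,)) for _ in range(2)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


if __name__ == "__main__":
    n = 2000000
    print("sequential: %.3f sec" % run_sequential(n))
    print("2 threads:  %.3f sec" % run_threaded(n))
```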

Multiprocessing
One way around the GIL is to use the multiprocessing module.
Convenient methods to:
• Create individual child processes executing a function
• Create and use a pool of processes
• Perform a ‘map’ from inputs to output using multiple processes
• Share data between processes using shared memory objects *
• Run a server process holding shared objects that can be manipulated by workers

Multiprocessing – Direct Creation & Management
multiproc_test.py
import multiprocessing
import os
import random

def list_append(count, out_list):
    # Appends a random number to the list 'count' number
    # of times. A CPU-heavy operation!
    print(os.getpid(), 'is working')
    for i in range(count):
        out_list.append(random.random())

if __name__ == "__main__":
    size = 10000000  # Number of random numbers to add
    procs = 2  # Number of processes to create
    # Create a list of processes and define work for each process
    process_list = []
    for i in range(0, procs):
        # Each child fills its own copy of out_list; the parent's list stays empty
        out_list = list()
        process = multiprocessing.Process(target=list_append,
                                          args=(size, out_list))
        process_list.append(process)
    # Start the processes (i.e. calculate the random number lists)
    for p in process_list:
        p.start()
    # Ensure all of the processes have finished
    for p in process_list:
        p.join()
    print("List processing complete.")

Multiprocessing – Map on an iterable object
multip.py
from multiprocessing import Pool
import time

def f(x):
    # do some tedious work
    for i in range(10000):
        a = x * x
    return a

if __name__ == '__main__':
    pool = Pool(processes=10)  # start a pool of 10 workers
    n = 10000
    results = pool.apply_async(f, (n,))  # run f(n) once in a worker, non-blocking
    print(results)  # apply_async returns an AsyncResult object immediately
    while not results.ready():
        print("results are coming ...")
        time.sleep(1)
    print("results are ready: ", results.ready())

    ts = time.time()  # we time how long it takes to run with one process
    a = list(map(f, range(n)))
    te = time.time()
    a = pool.map(f, range(n))  # use 10 pool workers to run the task
    tep = time.time()
    print("10k calls take ", te - ts, ' sec')
    print("10k multiprocessing calls take ", tep - te, ' sec')
    pool.close()
    pool.join()

MPI
An interface for parallel computation using message passing between processes
Small set of instructions, but quite complex to use

mpi4py – MPI wrappers for Python
Install the module
$ pip install mpi4py
$ module add openmpi/gcc/64/1.6.5
Run the code
$ mpirun -n 10 python hello.py
hello.py
from mpi4py import MPI
import socket

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
msg = comm.bcast('hello')  # broadcast from rank 0 (the default root)
print("rank %d of %d says %s from host %s" % (rank, comm.size, msg, socket.gethostname()))

mpi4py – Communication of Python objects
$ mpirun -n 2 python p2p.py
SLOW! Python objects must be serialized & deserialized.
Note: the lower-case send and recv methods are the ones that handle generic Python objects.
p2p.py
# send message p2p
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
a = list(range(100))
if rank == 0:
    data = a
    comm.send(data, dest=1, tag=98)
elif rank == 1:
    # Only rank 1 receives, so extra ranks do not wait forever
    data = comm.recv(source=0, tag=98)
if rank == 1:
    print(data)

mpi4py – Communication of NumPy arrays
$ mpirun -n 2 python Bcast.py
Faster: NumPy arrays can be sent / received as memory buffers, directly by the MPI layer
Note: the buffer-based methods start with an upper-case letter (Bcast, Send, Recv).
Bcast.py
# send collective message
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
if rank == 0:
    data = np.arange(100, dtype='i')
else:
    data = np.empty(100, dtype='i')  # receive buffer of matching size and type
comm.Bcast(data, root=0)
if rank == 1:
    print(data)