Outline - Uppsala University, user.it.uu.se/~carln/HPC2015_carln1.pdf
C/C++
Carl Nettelblad 2015-11-24
Outline
• Languages
• Cases:
– Printing lists
– Sorting lists
• The discussion will include:
– Templates vs. inheritance
Why is C a good language?
• Fast
• Nothing is hidden
• “Lingua franca”
– Runs everywhere
• For any type of program
• Any kind of distributed/parallel computing
– Can interact with anything
• Compiled and static typing
Why is C a bad language?
• Tedious
– Easily getting stuck in “how”, not “what”
• Long iteration times
– Rebuild after a simple bug
• Unsafe
– Bugs can be devastating
– For scientific codes:
• Complex bugs can be hidden a long time
Why is Python a good language?
• Flexible
– Different abstractions
• Concise
• Good libraries for scientific and non-scientific purposes
• Easy to use for interactive and quick prototyping
Why is Python a bad language?
• Slow
– Default version is interpreted
– Not scaling well with threading
• Flexibility can promote bad habits
– Hard to guarantee that all parts are consistently
used when changes are made
• (Indentation carrying semantic meaning)
What do we want?
• Flexible abstractions
• Good and predictable libraries
• High performance
• Easy interactivity
• Type safety
• This language could be C++!
– Or Python with a mix of C++
Python from Matlab
• Matlab R2014b (8.4) and later support immediate
Python integration
• py.module.function
• E.g. access the highly accurate summation function
fsum using py.math.fsum
• Can work straight away, flat vectors (not matrices)
automatically translated back and forth
• Any Python module built in this course might then be
accessed in Matlab
Python from the web
• IPython platform for interactive Python
• IPython Notebook, web-based interface to IPython
– Combine code, text, and figures
• Kind of like Mathematica
– Easily edit different code snippets
– Press Shift+Enter to (re)compute
C++ from the web
• Jupyter project
– IPython is separated into interactivity engine and
actual Python
cling and clang
• At CERN, the ROOT framework has existed for a long
time
– Special classes and an interpreter of a language
similar to C++
– With several oddities
– Interpreted language truly slow
• Effort to rebuild this into using “real” C++
– cling real-time compiler based on clang
– clang is the C++ compiler currently used by Apple
clang
• g++/gcc has been the de-facto standard for (open-source)
C/C++ compilers for a long time
• gcc has an archaic codebase
– Historically not easy to tie into some services
– E.g. get a parse tree
– Or add new code on the fly to an ongoing compilation
process
• Other compilers are closed source
– And also tend to lack flexible APIs
• clang is modularized (the front-end to the separate LLVM
backend) and open-sourced
Other users of clang
• In addition to cling and the Apple compilers, clang is
found in e.g.
– The Nvidia CUDA device compiler is clang-based,
no matter what host compiler you have
– The IDE Ceemple, which tries to bundle a lot of C++
libraries with a separate compiling mode with very
short latency is based on clang
• Keep the compiler loaded with all headers
between reruns
Working on an array in C
void printIntArray(int* data, int size)
{
    for (int i = 0; i < size; i++)
    {
        printf("%d\n", data[i]);
    }
}
Why is this bad?
• Adapted to one specific type of data (int)
• Size is an explicit parameter
– If size is specified incorrectly, we will read invalid
data
• The function can easily change the data
• Data pointer can be invalid
What would this look like in
Python?
def printArray(array):
    for i in array:
        print(i)
What would this look like in C++?
void printIntVector(IntVector* vector)
{
    for (int i = 0; i < vector->size(); i++)
    {
        printf("%d\n", vector->get(i));
    }
}
What would this look like in C++?
void printIntVector(vector<int>::iterator begin, vector<int>::iterator end)
{
    for (vector<int>::iterator i = begin; i != end; i++)
    {
        cout << *i << "\n";
    }
}
The inheritance abstraction
• An iterator would be a common interface or base class
• This is the case in e.g. Java
• Subclasses inherit from this base class
– Performing iteration in a specific data structure
– Virtual methods for getting next element, current
element etc.
• (Runtime) polymorphism
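The inheritance approach can be sketched in C++ roughly as follows. This is a minimal illustration, not code from the lecture: the class names follow the diagram on the next slide, while `atEnd()` and `sumAll()` are hypothetical helpers added here so the loop has a stopping condition.

```cpp
#include <cstddef>
#include <vector>

// Abstract base class: the common iterator interface.
class Iterator {
public:
    virtual ~Iterator() {}
    virtual bool atEnd() const = 0;  // assumed helper, not in the slides
    virtual void next() = 0;         // advance to the next element
    virtual int get() const = 0;     // read the current element
};

// Subclass that performs iteration over one specific data structure.
class IntVectorIterator : public Iterator {
public:
    explicit IntVectorIterator(const std::vector<int>& data)
        : data_(data), pos_(0) {}
    bool atEnd() const override { return pos_ == data_.size(); }
    void next() override { ++pos_; }
    int get() const override { return data_[pos_]; }
private:
    const std::vector<int>& data_;
    std::size_t pos_;
};

// Works on any Iterator subclass; every call is dispatched at runtime.
int sumAll(Iterator& it) {
    int total = 0;
    for (; !it.atEnd(); it.next()) {
        total += it.get();
    }
    return total;
}
```

Note that `sumAll` only knows the base class, so each `atEnd()`, `next()` and `get()` is a virtual call.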
How is the method call made?
• Each object has a table of method implementations
• The slot numbers are fixed at compilation
– Any call to an Iterator method will be “call the
method pointed to in the right slot in the vtable”
– This is an indirect jump
IntVectorIterator
next()
get()
Iterator
next()
get()
Indirect jumps
• A modern fast CPU is pipelined and out of order
– Multiple instructions “in flight” at once
– If instructions depend on each other, an out of order core starts
executing a later one
– Pipeline depth 20
• Out of order window of 224 in recent Intel CPU
– Hides waiting on memory
– Latency is the difference between real and theoretical
performance
ADD MOV CMP JNZ MOV
MOV CMP JNZ MOV …
CMP JNZ MOV … …
Branch prediction
• Out of order works fine if the instruction stream is
known
• If you have a loop or an if statement, the CPU has to
guess
– Can actually get pretty good
• A virtual method call is another branch
– In the very worst case, that instruction is not even
cached
Virtual methods in the compiler
• When you call a function directly in C, the compiler can
see everything that happens
– It can inline the function
– Move instructions around
– Do all the optimizations that make a modern
compiler fast, across the function call
• The virtual method call breaks this
– Sometimes the compiler can identify that the same
implementation is always used
The duck-typing abstraction
• Python uses the concept of duck typing
– “If it walks like a duck, swims like a duck, quacks like
a duck, it is a duck”
– “If an object has all the methods of an iterator, it is
an iterator”
• Convenient, flexible
– You can use inheritance, but you don’t rely on it to
define the contract
• Functions are looked up by name in a data structure
when they are called
– C++ vtables suddenly seem superfast
C++ templates
• Create functions and classes that can work on arbitrary
classes
• Simple motivation
– Type-safe container classes
• vector<int>
• map<int, double>
• These are done at compile-time
• Compiler error messages can be hard to track
– Templates within templates within templates
– Compare this to sudden error at runtime
Printing a list
template<typename T>
void printList(T begin, T end)
{
    for (T i = begin; i != end; i++)
    {
        printf("%d\n", static_cast<int>(*i));
    }
}
What happened here
• We are doing duck-typing in C++
• We don’t know what T is
– But begin and end are of the same type
– We can get a value with the dereference (*) operator
– That value can be cast to an int
– We can iterate to the next value with ++
• All of this is done at compile time
– Performance
– Correctness
Abstraction costs
• For a simple array, this is just as fast as the C version
– That code could only handle pointer-based int arrays
– But it can be binary trees (set), or a network stream
• For performance, you want to keep runtime costs of the
generalizations and abstractions you make at a
minimum
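To illustrate that one template covers plain arrays as well as other containers, here is a hedged sketch. It is a variant of the printList template above that writes to a `std::ostream` instead of calling `printf`; that change, and the helper `listToString`, are assumptions made here so the output can be inspected.

```cpp
#include <iostream>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Variant of printList that writes to a stream instead of using printf.
template<typename T>
void printList(T begin, T end, std::ostream& out)
{
    for (T i = begin; i != end; i++)
    {
        out << static_cast<int>(*i) << "\n";
    }
}

// Hypothetical helper: collect the printed output in a string.
template<typename T>
std::string listToString(T begin, T end)
{
    std::ostringstream out;
    printList(begin, end, out);
    return out.str();
}
```

The same source works for `int*` (a C array), `std::vector<double>::iterator` and `std::set<int>::iterator`; the compiler generates separate machine code for each.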
Printing a list
template<typename T>
void printList(T begin, T end)
{
    for (auto i = begin; i != end; i++)
    {
        printf("%d\n", static_cast<int>(*i));
    }
}
Printing a list
template<typename T>
void printList(const T& list)
{
    for (auto i : list)
    {
        printf("%d\n", static_cast<int>(i));
    }
}
Printing a list
template<typename T>
void printList(const T& list)
{
    for (auto i : list)
    {
        cout << i << "\n";
    }
}
Consequences
• auto keyword
– For local variables, you frequently don’t really care about the type, no
“contract”
– Full typename could change if you change data structures later on
– Just let the compiler figure it out
• const &
– C and C++ send all parameters by value by default
– If you would send a full vector to a function, that could imply copying
the vector
– const means “I don’t want to be able to change this object by
accident”
– & means “I want to work on the original object, not a copy”
– These are semantic differences
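The difference between passing by value and by const reference can be made visible by counting copies. The sketch below is illustrative only; the type `Tracked` and both functions are hypothetical names introduced here.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical type that counts how often it is copied.
struct Tracked {
    static int copies;
    std::vector<int> data;
    explicit Tracked(std::size_t n) : data(n, 1) {}
    Tracked(const Tracked& other) : data(other.data) { ++copies; }
};
int Tracked::copies = 0;

// By value: the caller's object is copied into the parameter.
std::size_t sizeByValue(Tracked t) { return t.data.size(); }

// By const reference: works on the original and cannot modify it.
std::size_t sizeByConstRef(const Tracked& t) { return t.data.size(); }
```

Calling `sizeByValue` with an existing object increments `Tracked::copies`, while `sizeByConstRef` leaves it unchanged.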
Consequences
• for (auto i : list)
– Simple “for each” notation
– Under the hood relying on iterators
– But you can do stuff like
for (auto x : map<int,int>{{1,2}, {3,5}}) {
    printf("%d %d\n", x.first, x.second);
}
• You simply can’t accidentally go outside the range with this syntax
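As a rough sketch of what "relying on iterators under the hood" means, the two functions below should behave the same. The function names are hypothetical, and the second loop is only an approximation of what the compiler generates for the first.

```cpp
#include <map>

// Range-based for loop, in the style of the slide.
int sumWithRangeFor(const std::map<int, int>& m) {
    int total = 0;
    for (auto x : m) {
        total += x.first * x.second;
    }
    return total;
}

// Approximately equivalent explicit iterator loop.
int sumWithIterators(const std::map<int, int>& m) {
    int total = 0;
    for (auto it = m.begin(), end = m.end(); it != end; ++it) {
        auto x = *it;  // copy of the key/value pair, like "auto x" above
        total += x.first * x.second;
    }
    return total;
}
```

In the range-based form there is no index or end iterator for the programmer to get wrong.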
Give your code a Boost
• The C++ standard library is rather thin
– It’s become larger in the last few standards
– You want to interact with the underlying tech (the
OS), not a library faking the OS
– OS libraries are rarely nice C++…
• Also lack of general algorithms and abstractions
• The Boost library (or library of libraries) changes this
Boost
• Independent project
– Started out at the end of the last millennium
– Libraries added after peer review process, focusing on
generality and “nice interface”
– Varying quality
• Far fewer, but far more stable than arbitrary Perl, Python, or
R libraries
• Great influence on the C++ standards process
– The TR1 document between C++03 and C++11 based several
new libraries on their boost counterparts
– C++11 continued this
– Added language features in C++11 based on “things Boost
could not achieve”
What do we have in Boost?
• Accumulators, Algorithm, Align, Any, Array, Asio, Assert, Assign, Atomic, Bimap,
Bind, Call Traits, Chrono, Circular Buffer, Compatibility, Compressed Pair,
Concept Check, Config, Container, Context, Conversion, Convert, Core, Coroutine,
Coroutine2, CRC, Date Time, Dynamic Bitset, Enable If, Endian, Exception,
Filesystem, Flyweight, Foreach, Format, Function, Function Types, Functional,
Fusion, Geometry, GIL, Graph, Heap, ICL, Identity Type, In Place Factory,
Integer, Interprocess, Interval, Intrusive, IO State Savers, Iostreams, Iterator,
Lambda, Lexical Cast, Local Function, Locale, Lockfree, Log, Math, Member
Function, Meta State Machine, Min-Max, MPI, MPL, Multi-Array, Multi-Index,
Multiprecision, Numeric Conversion, Odeint, Operators, Optional, Parameter,
Phoenix, Pointer Container, Polygon, Pool, Predef, Preprocessor, Program
Options, Property Map, Property Tree, Random, Range, Ratio, Rational, Ref,
Regex, Result Of, Scope Exit, Serialization, Signals, Signals2, Smart Ptr, Sort,
Spirit, Statechart, Static Assert, String Algo, Swap, System, Test, Thread,
ThrowException, Timer, Tokenizer, TR1, Tribool, TTI, Tuple, Type Index, Type
Traits, Typeof, uBLAS, Units, Unordered, Utility, Uuid, Value Initialized, Variant,
Wave, Xpressive
Python and C++
• When you integrate languages with each other, you
need to define:
– Who are you?
– Who are your users?
– Which language is extending the bridge into the
other?
– What features of the two languages need to be
maintained in the bridge?
– Do you have performance concerns?
Cython
• There are many ways to create bindings between
Python and other languages
• Cython generates C++ code from Python code
– Can call into C++ with some work
– The Python parser needs to understand C++
declarations
– The generated C++ code also needs to compile
correctly
• Do not confuse Cython with CPython (normal Python
implementation)
Performance of Cython
• Code can be annotated with exact types
– Allows more optimizations
– Tight loops can be quick
• Still plagued by some of the indirection problems of
Python
– Just as fast as C code interacting closely with
Python
– Not as fast as code in C/C++ with full control over
data structures
– Transition between C and Cython code is very quick
Cython C++ wrapping
class Rectangle {
public:
    int x0, y0, x1, y1;
    Rectangle(int x0, int y0, int x1, int y1);
    ~Rectangle();
    int getLength();
    int getHeight();
    int getArea();
    void move(int dx, int dy);
};
Wrapping to Cython
cdef extern from "Rectangle.h":
    cdef cppclass Rectangle:
        Rectangle(int, int, int, int) except +
        int x0, y0, x1, y1
        int getLength()
        int getHeight()
        int getArea()
        void move(int, int)
Wrapping to Python
cdef class PyRectangle:
    cdef Rectangle *thisptr # hold a C++ instance which we're wrapping
    def __cinit__(self, int x0, int y0, int x1, int y1):
        self.thisptr = new Rectangle(x0, y0, x1, y1)
    def __dealloc__(self):
        del self.thisptr
    def getLength(self):
        return self.thisptr.getLength()
    def getHeight(self):
        return self.thisptr.getHeight()
    def getArea(self):
        return self.thisptr.getArea()
    def move(self, dx, dy):
        self.thisptr.move(dx, dy)
Conclusion
• Interface stated three times
• One time in C++, two times in semi-Python
• Makes perfect sense if you are a Python coder
wrapping an existing C++ library
• Performance nice overall
• Wrapping is imperative in style
Boost.Python
• Far older interface (dating back to 2002!)
• Write C++ classes
• Define in C++ how these classes are mapped
Rectangle example again
BOOST_PYTHON_MODULE(shapes)
{
    class_<Rectangle>("PyRectangle", init<int,int,int,int>())
        .def("getLength", &Rectangle::getLength)
        .def("getHeight", &Rectangle::getHeight)
        .def("getArea", &Rectangle::getArea)
        .def("move", &Rectangle::move)
    ;
}
Exposing data members
• .def_readonly("x0", &Rectangle::x0)
• More relevant, exposing existing getter with property
syntax of Python
.add_property("area", &Rectangle::getArea)
• Add a third parameter to have a setter as well
More complex stories
• Define customized rules for how to map Python types
to C++ types
• Define declarative ownership rules for objects created
in C++
– That’s what the PyRectangle Cython wrapper did in
code
Sorting data
• Common task
• Quicker to keep data non-sorted and sort it later vs.
maintaining a properly sorted data structure
– I.e. keep a vector, then sort it, rather than keeping a
C++ set (which is a sorted self-balancing tree)
What is needed to sort
• Sorting itself is O(n log n)
– Using a proper algorithm, you can always sort a list
in a number of operations that is proportional to
n log n basic operations
– Since proportional, it doesn’t matter what log base
we are using
– If sorting 1,000 elements would use T operations, we
would expect 1,000,000 elements to use 2000T (not
1000T)
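The 2000T figure follows directly from the n log n model, and the arithmetic can be checked with a few lines of code. This is a sketch under the assumption that the cost is exactly proportional to n log n; the function names are hypothetical.

```cpp
#include <cmath>

// Model cost of an O(n log n) sort, up to a constant factor.
double modelCost(double n) {
    return n * std::log2(n);
}

// How many times more work the model predicts for nLarge than for nSmall.
double costRatio(double nSmall, double nLarge) {
    return modelCost(nLarge) / modelCost(nSmall);
}
```

For 1,000 versus 1,000,000 elements, n grows by a factor of 1000 and log n doubles (log 1,000,000 = 2 log 1,000), which gives the factor 2000. Because only the ratio matters, any log base gives the same result.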
Which data layout would you use for sorting? Which data
layout would you use for accessing the data later?
Indirection
El1 El2 El3 El4
Ref1 Ref2 Ref3 Ref4
El1 El3 El4 El2
Indirection
• Just keeping references might seem to make sorting
faster
– You don’t need to move the full elements
– This is kind of true
– Depends a bit on how much of the data you need to
access to do the sorting comparisons
• Overall size larger in indirect case
– Frequently overhead for each allocated element
• Remember: Current CPUs are very fast
– When they can predict what to do beforehand
– Moving a chunk of data is predictable
Indirection
• If sorting for speedy access later, “sorting” an indirected data
structure will keep actual data stored all over the place
• When is indirection used?
– Python lists, Python dictionaries
– Java ArrayList, HashMap etc
– Java arrays of non-primitive types
– General pointer-based data structures in C and C++
– Cell arrays in Matlab
• When is it not used?
– array module in Python
– numpy matrices, Matlab matrices
– C/C++ arrays, and some C++ STL containers (vector, array)
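The two layouts can be compared in a small sketch. It is illustrative only, with hypothetical function names: the direct version moves the elements themselves, while the indirect version reorders pointers and leaves the elements where they were stored.

```cpp
#include <algorithm>
#include <vector>

// Direct layout: the elements themselves are moved into sorted order.
void sortDirect(std::vector<int>& elements) {
    std::sort(elements.begin(), elements.end());
}

// Indirect layout: only the references (here, pointers) are reordered;
// the elements stay wherever they were stored.
std::vector<const int*> sortIndirect(const std::vector<int>& elements) {
    std::vector<const int*> refs;
    for (const int& e : elements) {
        refs.push_back(&e);
    }
    std::sort(refs.begin(), refs.end(),
              [](const int* a, const int* b) { return *a < *b; });
    return refs;
}
```

After `sortIndirect`, walking the pointers in order visits the elements in sorted order but jumps around in memory; after `sortDirect`, the sorted data is contiguous.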
Case in point
• Python list of integers
– Each value is really just 4 bytes
– An entry in a list will use 8 bytes on a 64-bit machine
– The minimum size of the allocated list element is 24
bytes
• Sorting will require walking over the 8-byte entries,
tracing each to the correct element, and then moving
the entries around
Caches
• Not all memory is equally fast
• Different levels of caches
• If data does not fit into cache, things become slow
• CPU does prefetching
– If memory accesses follow simple patterns
– Random indirection does NOT
• Cache-friendly code
– Good locality
• Keep using the same part of memory before moving on
– Small workset
• Keep memory usage low
– Good predictability
• Helps prefetcher
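A simple way to experiment with these effects is to compute the same sum with a predictable and an unpredictable access pattern. The sketch below only sets up the two variants; it makes no claim about the size of the timing difference, which depends on data size and hardware. All names and the fixed seed are assumptions.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Sequential pass: simple, predictable access pattern.
long long sumSequential(const std::vector<int>& data) {
    long long total = 0;
    for (int value : data) {
        total += value;
    }
    return total;
}

// Same sum, but visiting elements in the order given by `order`.
// With a shuffled order, each access jumps to an unpredictable address.
long long sumIndirect(const std::vector<int>& data,
                      const std::vector<std::size_t>& order) {
    long long total = 0;
    for (std::size_t index : order) {
        total += data[index];
    }
    return total;
}

// Hypothetical helper: a random permutation of 0..n-1 (fixed seed).
std::vector<std::size_t> shuffledOrder(std::size_t n) {
    std::vector<std::size_t> order(n);
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::mt19937 rng(12345);
    std::shuffle(order.begin(), order.end(), rng);
    return order;
}
```

Both functions return the same result; timing them on data much larger than the cache is one way to see the cost of random indirection.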
So, how do we sort?
• Do not implement a general sorting algorithm yourself
• In C:
– void qsort(void *base, size_t nitems, size_t size, int (*compar)(const void *, const void*))
• You have bare pointers to data, you need to state the size of each
element, you need a pointer to a function that can do comparisons
• We learned in the printing example that we can do better…
• template <class RandomAccessIterator> void sort (RandomAccessIterator first, RandomAccessIterator last);
• template <class RandomAccessIterator, class Compare> void sort (RandomAccessIterator first, RandomAccessIterator last, Compare comp);
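The contrast between the two interfaces can be sketched as follows. The wrapper and comparison function names are hypothetical; `std::qsort` and `std::sort` are the standard library calls listed above.

```cpp
#include <algorithm>
#include <cstdlib>
#include <vector>

// C style: the comparison function receives untyped pointers.
int compareInts(const void* a, const void* b) {
    int left = *static_cast<const int*>(a);
    int right = *static_cast<const int*>(b);
    return (left > right) - (left < right);  // negative, zero, or positive
}

// Element size and comparison function must be supplied by hand.
void sortWithQsort(std::vector<int>& data) {
    std::qsort(data.data(), data.size(), sizeof(int), compareInts);
}

// C++ style: element type and size are deduced at compile time.
void sortWithStdSort(std::vector<int>& data) {
    std::sort(data.begin(), data.end());
}
```

With `qsort`, a wrong element size or a comparison function for the wrong type compiles without complaint; with `std::sort`, the compiler checks both.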
Simple and complex sorting
• Elements with valid < operator can be sorted in increasing order by simply
– sort(vector.begin(), vector.end());
• Nice trick
– pair class, make tuples of your data where the desired order is
represented by the first element
• Harder case, implement a function which takes const (references) to the
elements
– Returning true if the first object comes before the latter
– false if it comes after or if they are equivalent (strict weak ordering)
– The bugs you can get in any language for invalid comparison code are
nasty
• The pair suggestion might not be too bad
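The pair trick might look like the sketch below. The function name and the choice of scores and names as data are hypothetical; the sketch assumes both input vectors have the same length.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Order names by an associated score.
// pair's built-in < compares .first, then .second to break ties.
std::vector<std::string> namesByScore(
    const std::vector<std::string>& names,
    const std::vector<double>& scores)
{
    std::vector<std::pair<double, std::string>> keyed;
    for (std::size_t i = 0; i < names.size(); i++) {
        keyed.push_back(std::make_pair(scores[i], names[i]));
    }
    std::sort(keyed.begin(), keyed.end());

    std::vector<std::string> result;
    for (const auto& entry : keyed) {
        result.push_back(entry.second);
    }
    return result;
}
```

No comparison code is written by hand, so there is no opportunity to violate the ordering rules.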
Functors
• A function is a very real thing in original C
– It’s a single specific piece of code (location in
memory) that can be called by the CPU
• In modern C++
– We have templates etc
• Same piece of source can result in multiple sets
of machine code
– Inlining might mean that there is no function call, not
even a block of machine instructions for the function
• A function is just code, no data
Functors
• C++ supports operator overloading
• () is another operator
• So, comparing elements can be as simple as
struct intComparator
{
    bool operator() (const int left, const int right)
    {
        return left < right;
    }
};
…
sort(data.begin(), data.end(), intComparator());
Why would you want a functor?
• Auxiliary data
• Settings
• Caching/precalculation to speed up additional function calls
• Keeping statistics
• Any case where you want to inject a piece of code inside an
algorithm or library
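As an example of a functor carrying a setting, the sketch below sorts integers by their distance to a target value stored in the functor. The names `closerTo` and `sortByDistance` are hypothetical.

```cpp
#include <algorithm>
#include <cstdlib>
#include <vector>

// Functor carrying a setting: order by distance to a target value.
struct closerTo
{
    int target;
    explicit closerTo(int targetValue) : target(targetValue) {}
    bool operator() (const int left, const int right) const
    {
        return std::abs(left - target) < std::abs(right - target);
    }
};

void sortByDistance(std::vector<int>& data, int target)
{
    std::sort(data.begin(), data.end(), closerTo(target));
}
```

A plain function pointer, as used with `qsort`, has no natural place to keep `target` other than a global variable.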
Sorting again
struct intComparator
{
    int comps;
    intComparator() : comps(0) {}
    bool operator() (const int left, const int right)
    {
        comps++; // count every comparison
        return left < right;
    }
};
…
intComparator comparer;
// sort copies its comparator, so pass a reference wrapper (from <functional>)
sort(data.begin(), data.end(), ref(comparer));
printf("%d\n", comparer.comps);
Functors
• This was nice
• But it moves the logic for the sorting away from the
place where we sort
– Logical if it is a general ordering
– But if it’s a general ordering, it should probably just
be in the < operator for the elements we sort
Lambda expressions
• You would really like to put the instructions right where
they logically belong
• Like…
This?
int comps = 0;
sort(data.begin(), data.end(), [&] (int left, int right) {
    comps++;
    return left < right;
} );
printf("%d\n", comps);
• And in the C++14 standard, you can even put “auto” for left and right there
What was that?
• Lambda expressions
– Common in functional programming
– Common in Python
• Create a functor object in place anywhere
– Put all relevant code in one place
• Local variables can be made accessible within the
functor
– “Capturing”
Lambda syntax
• [capture-list] (params) mutable -> return-type { body }
• Specifying mutable is optional
• Return type and arrow do not need to be specified, if
it can be deduced correctly
– Single return statement with evident type
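The optional parts of the syntax can be illustrated with three small lambdas. This is a sketch with hypothetical function names, written to compile as C++11.

```cpp
// Return type deduced: single return statement with evident type.
int addOne(int x) {
    auto increment = [](int v) { return v + 1; };
    return increment(x);
}

// Explicit return type: the two return statements yield int and double.
double halfIfOdd(int x) {
    auto f = [](int v) -> double {
        if (v % 2 == 0) return v;
        return v / 2.0;
    };
    return f(x);
}

// mutable: the lambda may modify its own by-value copy of `start`.
int countThreeFrom(int start) {
    auto next = [start]() mutable { return start++; };
    next();
    next();
    return next();  // value returned by the third call
}
```

Without `mutable`, the `start++` inside the last lambda would not compile, because by-value captures are read-only by default.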
Capture lists
• If you just specify [], the lambda won’t have access to
local variables
• You can also specify a list of variable names [a,b,c]
– These are then captured by value, a copy is made
– That makes them safe to access even when the
function that created the lambda has returned
• You can also specify variables with & - [&a,&b,&c]
• Shorthand, capture all variables by value [=], all
variables by reference [&]
• If you have a method in a class, you can also expose
instance variables using [this]
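The difference between the capture modes shows up when the captured variable changes after the lambda is created. The functions below are a hypothetical sketch of that behaviour.

```cpp
#include <functional>

// Capture by value: the lambda keeps its own copy of `a`.
int captureByValue() {
    int a = 1;
    auto read = [a]() { return a; };
    a = 2;            // does not affect the copy inside the lambda
    return read();
}

// Capture by reference: the lambda sees later changes to `a`.
int captureByReference() {
    int a = 1;
    auto read = [&a]() { return a; };
    a = 2;
    return read();
}

// A by-value capture stays valid after the creating function returns.
std::function<int(int)> makeAdder(int amount) {
    return [amount](int x) { return x + amount; };
}
```

Returning a lambda that captured `amount` by reference instead would leave it referring to a variable that no longer exists.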
Performance comparison
• Sorted the same numbers, the same way
– Generate 10,000,000 random numbers
– Sort them
– With functor
• With and without counting comparisons
• With and without inlining
– With lambda
• With and without counting comparisons
• Amounting to 282306119 comparisons
• On GCC 5.2, Tintin
• Repeated timings a bit, not fully accurate
Performance comparison
Type     Inline  Counting  Time (s)
Functor  X                 1.114
Lambda   X                 1.119
Functor                    1.522
Functor  X       X         1.102
Lambda   X       X         1.205
Functor          X         1.523
Performance comparison
• Functor turns out to be faster than lambda for counting case
– Probably variable capture by reference leads to two levels of
indirection (get to the lambda data, get the reference to the
comps variable, update it)
– Lambda still less optimized than normal objects
• Some uncertainty in the numbers
• Relatively huge overhead in not inlining
– Even a non-indirect function call is expensive if the operation is
simple
• Benefits of inlining can sometimes be even stronger for slightly
longer methods
– More data interactions between caller and callee that the
optimizer can work on
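A measurement of this kind could be set up roughly as below. This is not the benchmark used for the numbers above: the struct, function name, seed and value range are all assumptions, and only the lambda-with-counting variant is shown.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <random>
#include <vector>

struct SortTiming {
    long long comparisons;
    double seconds;
    bool sorted;
};

// Hypothetical harness: sort n random ints, counting comparisons
// with a lambda that captures the counter by reference.
SortTiming timeCountingSort(std::size_t n) {
    std::mt19937 rng(42);  // fixed seed, an arbitrary choice
    std::vector<int> data(n);
    for (int& value : data) {
        value = static_cast<int>(rng() % 1000000);
    }

    long long comps = 0;
    auto start = std::chrono::steady_clock::now();
    std::sort(data.begin(), data.end(), [&](int left, int right) {
        comps++;
        return left < right;
    });
    auto stop = std::chrono::steady_clock::now();

    SortTiming result;
    result.comparisons = comps;
    result.seconds = std::chrono::duration<double>(stop - start).count();
    result.sorted = std::is_sorted(data.begin(), data.end());
    return result;
}
```

As the slides note, single timings are noisy; repeating the run several times and comparing optimization levels gives a more reliable picture.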
Summary
• Lambda, auto, range for, and templates are some
examples of things that make modern C++ a much
more pleasant language to use
• The way these technologies are implemented allows
them to give the same or better performance than
equivalent C code
– Much faster than Python code
– While actual code can be similar in style
• Powerful libraries in Boost
• Interactive modern C++ is there, but not as mature as
IPython