HDK
 All Classes Namespaces Files Functions Variables Typedefs Enumerations Enumerator Friends Macros Groups Pages
benchmark.h
Go to the documentation of this file.
1 /*
2  Copyright 2017 Larry Gritz et al. All Rights Reserved.
3 
4  Redistribution and use in source and binary forms, with or without
5  modification, are permitted provided that the following conditions are
6  met:
7  * Redistributions of source code must retain the above copyright
8  notice, this list of conditions and the following disclaimer.
9  * Redistributions in binary form must reproduce the above copyright
10  notice, this list of conditions and the following disclaimer in the
11  documentation and/or other materials provided with the distribution.
12  * Neither the name of the software's owners nor the names of its
13  contributors may be used to endorse or promote products derived from
14  this software without specific prior written permission.
15  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
16  "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
17  LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
18  A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
19  OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
20  SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
21  LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
22  DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
23  THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
24  (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
25  OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
26 
27  (This is the Modified BSD License)
28 */
29 
30 // clang-format off
31 
32 #pragma once
33 
34 #include <iostream>
35 #include <vector>
36 
39 #include <OpenImageIO/strutil.h>
40 #include <OpenImageIO/timer.h>
41 
42 
44 
45 /// DoNotOptimize(val) is a helper function for timing benchmarks that fools
46 /// the compiler into thinking the the location 'val' is used and will not
47 /// optimize it away. For benchmarks only, do not use in production code!
48 /// May not work on all platforms. References:
49 /// * Chandler Carruth's CppCon 2015 talk
50 /// * Folly https://github.com/facebook/folly/blob/master/folly/Benchmark.h
51 /// * Google Benchmark https://github.com/google/benchmark/blob/master/include/benchmark/benchmark_api.h
52 
53 template <class T>
54 OIIO_FORCEINLINE T const& DoNotOptimize (T const &val);
55 
56 
57 /// clobber_all_memory() is a helper function for timing benchmarks that
58 /// fools the compiler into thinking that potentially any part of memory
59 /// has been modified, and thus serves as a barrier where the optimizer
60 /// won't assume anything about the state of memory preceding it.
61 
63 
64 
65 
66 /// A call to clobber(p) fools the compiler into thinking that p (or *p, for
67 /// the pointer version) might potentially have its memory altered. The
68 /// implementation actually does nothing, but it's in another module, so the
69 /// compiler won't know this and will be conservative about any assupmtions
70 /// of what's in p. This is helpful for benchmarking, to help erase any
71 /// preconceptions the optimizer has about what might be in a variable.
72 
73 void OIIO_API clobber (void* p);
74 OIIO_FORCEINLINE void clobber (const void* p) { clobber ((void*)p); }
75 
76 template<typename T>
77 OIIO_FORCEINLINE T& clobber (T& p) { clobber(&p); return p; }
78 
79 
80 
81 
82 /// Benchmarker is a class to assist with "micro-benchmarking".
83 /// The goal is to discern how long it takes to run a snippet of code
84 /// (function, lambda, etc). The code will be run in some number of trials,
85 /// each consisting of many iterations, yielding statistics about the run
86 /// time of the code.
87 ///
88 /// Tne number of trials is user-selectable, with a reasonable default of 10
89 /// trials. The number of iterations per trial may be set explicitly, but
90 /// the default is to automatically compute a reasonable number of
91 /// iterations based on their timing. For most use cases, it's fire and
92 /// forget.
93 ///
94 /// Generally, the most and least expesive trials will be discarded (all
95 /// sorts of things can happen to give you a few spurious results) and then
96 /// the remainder of trials will be used to compute the average, standard
97 /// deviation, range, and median value, in ns per iteration as well as
98 /// millions of executions per second. The default behavior it just to echo
99 /// the relevant statistics to the console.
100 ///
101 /// The basic use illustrated by this example in which we try to assess
102 /// the difference in speed between acos() and fast_acos():
103 ///
104 /// Benchmarker bench;
105 /// float val = 0.5f;
106 /// clobber (val); // Scrub compiler's knowledge of the value
107 /// bench ("acos", [&](){ DoNotOptimize(std::acos(val)); });
108 /// bench ("fast_acos", [&](){ // alternate indentation style
109 /// DoNotOptimize(OIIO::fast_acos(val));
110 /// });
111 ///
112 /// Which produces output like this:
113 /// acos : 4.3 ns, 230.5 M/s (10x2097152, sdev=0.4ns rng=31.2%, med=4.6)
114 /// fast_acos : 3.4 ns, 291.2 M/s (10x2097152, sdev=0.4ns rng=33.0%, med=3.4)
115 ///
116 /// Some important details:
117 ///
118 /// After declaring the Benchmarker, a number of options can be set: number
119 /// of trials to run, iterations per trial (0 means automatic detection),
120 /// verbosity, whether (or how many) outliers to exclude. You can chain them
121 /// together if you want:
122 /// bench.iterations(10000).trials(10);
123 ///
124 /// It can be VERY hard to get valid benchmarks without the compiler messing
125 /// up your results. Some tips:
126 ///
127 /// * Code that is too fast will not be reliable. Anything that appears
128 /// to take less than 1 ns actually prints "unreliable" instead of full
129 /// stats, figuring that it is likely that it has been inadvertently
130 /// optimized away.
131 ///
132 /// * Use the DoNotOptimize() call on any final results computed by your
133 /// benchmarked code, or else the compiler is likely to remove the code
134 /// that leads to any values it thinks will never be used.
135 ///
136 /// * Beware of the compiler constant folding operations in your code --
137 /// do not pass constants unless you want to benchmark its performance on
138 /// known constants, and it is probably smart to ensure that all variables
139 /// acccessed by your code should be passed to clobber() before running
140 /// the benchmark, to confuse the compiler into not assuming its value.
141 
143 public:
145 
146  // Calling Benchmarker like a function (operator()) executes the
147  // benchmark. This process runs func(args...), several trials, each
148  // trial with many iterations. The value returned is the best estimate
149  // of the average time per iteration that it takes to run func.
150  template<typename FUNC, typename... ARGS>
151  double operator()(string_view name, FUNC func, ARGS&&... args)
152  {
153  m_name = name;
154  run(func, args...);
155  if (verbose())
156  std::cout << (*this) << std::endl;
157  return avg();
158  }
159 
160  // Return the average, sample standard deviation, median, and range
161  // of per-iteration time.
162  double avg() const { return m_avg; }
163  double stddev() const { return m_stddev; }
164  double range() const { return m_range; }
165  double median() const { return m_median; }
166 
167  // Control the number of iterations per trial. The special value 0 means
168  // to determine automatically a reasonable number of iterations. That is
169  // also the default behavior.
171  {
172  m_user_iterations = val;
173  return *this;
174  }
175  size_t iterations() const { return m_iterations; }
176 
177  // Control the number of trials to perform.
179  {
180  m_trials = val;
181  return *this;
182  }
183  size_t trials() const { return m_trials; }
184 
185  // Control the number of values of work that each iteration represents.
186  // Usually you will leave this at the default of 1, but for some cases,
187  // it may be helpful. An example of where you might use this is if you
188  // are benchmarking SIMD operations. A scalar sqrt and an SIMD sqrt may
189  // run in the same amount of time, but the SIMD version is operating on
190  // 4 (or 8, etc.) times as many values. You can use the 'work' size to
191  // make the calls report Mvals/s, showing more accurately than the SIMD
192  // call is faster than the scalar call.
194  {
195  m_work = val;
196  return *this;
197  }
198  size_t work() const { return m_work; }
199 
200  // Control the exclusion of outliers. This number (default 1) of fastest
201  // and slowest trials will be excluded from the statistics, to remove
202  // the effects of spurious things happening on the system. Setting
203  // outliers to 0 will compute statistics on all trials, without any
204  // outlier exclusion.
206  {
207  m_exclude_outliers = e;
208  return *this;
209  }
210  int exclude_outliers() const { return m_exclude_outliers; }
211 
212  // Control the verbosity of the printing for each benchmark. The default
213  // is 1, which prints basic statistics. Verbosity 0 is silent and leaves
214  // it up to the app to retrieve results.
216  {
217  m_verbose = v;
218  return *this;
219  }
220  int verbose() const { return m_verbose; }
221 
222  // Control indentation in the printout -- this number of spaces will
223  // be printed before the statistics.
224  Benchmarker& indent(int spaces)
225  {
226  m_indent = spaces;
227  return *this;
228  }
229  int indent() const { return m_indent; }
230 
231  // Choices of unit to report results.
232  enum class Unit : int { autounit, ns, us, ms, s };
233 
234  // Control the units for reporting results. By default, an appropriate
235  // unit will be chosen for nice printing of each benchmark individually.
236  // But the user may also wish to request specific units like ns or ux in
237  // order to ensure that all benchmark resutls are using the same units.
239  {
240  m_units = s;
241  return *this;
242  }
243  Unit units() const { return m_units; }
244 
245  const std::string& name() const { return m_name; }
246 
247 private:
248  size_t m_iterations = 0;
249  size_t m_user_iterations = 0;
250  size_t m_trials = 10;
251  size_t m_work = 1;
252  std::string m_name;
253  std::vector<double> m_times; // times for each trial
254  double m_avg; // average time per iteration
255  double m_stddev; // standard deviation per iteration
256  double m_range; // range per iteration
257  double m_median; // median per-iteration time
258  int m_exclude_outliers = 1;
259  int m_verbose = 1;
260  int m_indent = 0;
261  Unit m_units = Unit::autounit;
262 
263  template<typename FUNC, typename... ARGS>
264  double run(FUNC func, ARGS&&... args)
265  {
266  if (m_user_iterations)
267  m_iterations = m_user_iterations;
268  else
269  m_iterations = determine_iterations(func, args...);
270  m_times.resize(m_trials);
271 
272  double overhead = iteration_overhead() * iterations();
273  for (auto& t : m_times)
274  t = std::max(0.0, do_trial(m_iterations, func, args...) - overhead);
275  compute_stats();
276  return avg();
277  }
278 
279  template<typename FUNC, typename... ARGS>
280  size_t determine_iterations(FUNC func, ARGS&&... args)
281  {
282  // We're shooting for a trial around 1/100s
283  const double target_time = 0.01;
284  size_t i = 1;
285  while (1) {
286  double t = do_trial (i, func, args...);
287  // std::cout << "Trying " << i << " iters = " << t << "\n";
288  if (t > target_time * 1.5 && i > 2)
289  return i / 2;
290  if (t > target_time * 0.75 || i > (size_t(1) << 30))
291  return i;
292  if (t < target_time / 16)
293  i *= 8;
294  else
295  i *= 2;
296  }
297  }
298 
299  template<typename FUNC, typename... ARGS>
300  double do_trial(size_t iterations, FUNC func, ARGS&&... args)
301  {
302  Timer timer;
303  while (iterations--) {
305  func(args...);
306  }
307  return timer();
308  }
309 
310  void compute_stats() { compute_stats(m_times, m_iterations); }
311  void compute_stats(std::vector<double>& times, size_t iterations);
312  double iteration_overhead();
313 
314  friend OIIO_API std::ostream& operator<<(std::ostream& out,
315  const Benchmarker& bench);
316 };
317 
318 
319 
320 /// Helper template that runs a function (or functor) n times, using a
321 /// Timer to benchmark the results, and returning the fastest trial. If
322 /// 'range' is non-NULL, the range (max-min) of the various time trials
323 /// will be stored there.
324 ///
325 /// DEPRECATED(1.8): This may be considered obsolete, probably the
326 /// Benchmarker class is a better solution.
327 template<typename FUNC>
328 double
329 time_trial(FUNC func, int ntrials = 1, int nrepeats = 1, double* range = NULL)
330 {
331  double mintime = 1.0e30, maxtime = 0.0;
332  while (ntrials-- > 0) {
333  Timer timer;
334  for (int i = 0; i < nrepeats; ++i) {
335  // Be sure that the repeated calls to func aren't optimzed away:
337  func();
338  }
339  double t = timer();
340  if (t < mintime)
341  mintime = t;
342  if (t > maxtime)
343  maxtime = t;
344  }
345  if (range)
346  *range = maxtime - mintime;
347  return mintime;
348 }
349 
350 /// Version without repeats.
351 template<typename FUNC>
352 double
353 time_trial(FUNC func, int ntrials, double* range)
354 {
355  return time_trial(func, ntrials, 1, range);
356 }
357 
358 
359 
360 // Benchmarking helper function: Time a function with various thread counts.
361 // Inputs:
362 // task(int iterations) : The function to run (which understands an
363 // iteration count or work load).
364 // pretask() : Code to run before the task threads start.
365 // posttask() : Code to run after the task threads complete.
366 // out : Stream to print results (or NULL to not print anything).
367 // maxthreads : Don't do any trials greater than this thread count,
368 // even if it's in the threadcounts[].
369 // total_iterations : Total amount of work to do. The func() will be
370 // called with total_iterations/nthreads, so that the
371 // total work for all threads stays close to constant.
372 // ntrials : The number of runs for each thread count (more will take
373 // longer, but be more accurate timing). The best case
374 // run is the one that will be reported.
375 // threadcounts[] : An span<int> giving the set of thread counts
376 // to try.
377 // Return value:
378 // A vector<double> containing the best time (of the trials) for each
379 // thread count. This can be discarded.
380 OIIO_API std::vector<double>
381 timed_thread_wedge (function_view<void(int)> task,
382  function_view<void()> pretask,
383  function_view<void()> posttask,
384  std::ostream *out,
385  int maxthreads,
386  int total_iterations, int ntrials,
387  cspan<int> threadcounts = {1,2,4,8,12,16,24,32,48,64,128});
388 
389 // Simplified timed_thread_wedge without pre- and post-tasks, using
390 // std::out for output, with a default set of thread counts, and not needing
391 // to return the vector of times.
392 OIIO_API void
393 timed_thread_wedge (function_view<void(int)> task,
394  int maxthreads, int total_iterations, int ntrials,
395  cspan<int> threadcounts = {1,2,4,8,12,16,24,32,48,64,128});
396 
397 
398 
399 
400 //////////////////////////////////////////////////////////////////////////
401 //////////////////////////////////////////////////////////////////////////
402 // Implementation details...
403 //
404 
405 
406 namespace pvt {
407 void OIIO_API use_char_ptr (char const volatile *);
408 }
409 
410 
411 #if ((OIIO_GNUC_VERSION && NDEBUG) || OIIO_CLANG_VERSION >= 30500 || OIIO_APPLE_CLANG_VERSION >= 70000 || defined(__INTEL_COMPILER)) && (defined(__x86_64__) || defined(__i386__))
412 
413 // Major non-MS compilers on x86/x86_64: use asm trick to indicate that
414 // the value is needed.
415 template <class T>
416 OIIO_FORCEINLINE T const&
417 DoNotOptimize (T const &val) {
418 #if defined(__clang__)
419  // asm volatile("" : "+rm" (const_cast<T&>(val)));
420  // Clang doesn't like the 'X' constraint on `val` and certain GCC versions
421  // don't like the 'g' constraint. Attempt to placate them both.
422  asm volatile("" : : "g"(val) : "memory");
423 #else
424  asm volatile("" : : "i,r,m"(val) : "memory");
425 #endif
426  return val;
427 }
428 
429 #elif _MSC_VER
430 
431 // Microsoft of course has its own way of turning off optimizations.
432 #pragma optimize("", off)
433 template <class T>
434 OIIO_FORCEINLINE T const & DoNotOptimize (T const &val) {
435  pvt::use_char_ptr(&reinterpret_cast<char const volatile&>(val));
436  _ReadWriteBarrier ();
437  return val;
438 }
439 #pragma optimize("", on)
440 
441 #elif __has_attribute(__optnone__)
442 
443 // If __optnone__ attribute is available: make a null function with no
444 // optimization, that's all we need.
445 template <class T>
446 inline T const & __attribute__((__optnone__))
447 DoNotOptimize (T const &val) {
448  return val;
449 }
450 
451 #else
452 
453 // Otherwise, it won't work, just make a stub.
454 template <class T>
455 OIIO_FORCEINLINE T const & DoNotOptimize (T const &val) {
456  pvt::use_char_ptr(&reinterpret_cast<char const volatile&>(val));
457  return val;
458 }
459 
460 #endif
461 
462 
463 
464 #if ((OIIO_GNUC_VERSION && NDEBUG) || OIIO_CLANG_VERSION >= 30500 || OIIO_APPLE_CLANG_VERSION >= 70000 || defined(__INTEL_COMPILER)) && (defined(__x86_64__) || defined(__i386__))
465 
466 // Special trick for x86/x86_64 and gcc-like compilers
468  asm volatile ("" : : : "memory");
469 }
470 
471 #elif _MSC_VER
472 
474  _ReadWriteBarrier ();
475 }
476 
477 #else
478 
479 // No fallback for other CPUs or compilers. Suggestions?
481 
482 #endif
483 
484 
485 
GLdouble s
Definition: glew.h:1390
vint4 max(const vint4 &a, const vint4 &b)
Definition: simd.h:4703
void OIIO_API use_char_ptr(char const volatile *)
GLenum GLint * range
Definition: glew.h:3500
Benchmarker & indent(int spaces)
Definition: benchmark.h:224
size_t trials() const
Definition: benchmark.h:183
int verbose() const
Definition: benchmark.h:220
size_t iterations() const
Definition: benchmark.h:175
GLuint const GLchar * name
Definition: glew.h:1814
Definition: timer.h:86
#define OIIO_FORCEINLINE
Definition: platform.h:249
OIIO_API std::vector< double > timed_thread_wedge(function_view< void(int)> task, function_view< void()> pretask, function_view< void()> posttask, std::ostream *out, int maxthreads, int total_iterations, int ntrials, cspan< int > threadcounts={1, 2, 4, 8, 12, 16, 24, 32, 48, 64, 128})
const Args & args
Definition: printf.h:628
GLuint const GLfloat * val
Definition: glew.h:2794
Definition: span.h:47
void OIIO_API clobber(void *p)
const GLdouble * v
Definition: glew.h:1391
String-related utilities, all in namespace Strutil.
Simple timer class.
Benchmarker & verbose(int v)
Definition: benchmark.h:215
OIIO_FORCEINLINE void clobber_all_memory()
Definition: benchmark.h:480
std::ostream & operator<<(std::ostream &ostr, const DataType &a)
Definition: DataType.h:133
int exclude_outliers() const
Definition: benchmark.h:210
const std::string & name() const
Definition: benchmark.h:245
Benchmarker & units(Unit s)
Definition: benchmark.h:238
typedef int(WINAPI *PFNWGLRELEASEPBUFFERDCARBPROC)(HPBUFFERARB hPbuffer
#define __attribute__(x)
Definition: strutil.h:94
double median() const
Definition: benchmark.h:165
double time_trial(FUNC func, int ntrials=1, int nrepeats=1, double *range=NULL)
Definition: benchmark.h:329
size_t work() const
Definition: benchmark.h:198
double avg() const
Definition: benchmark.h:162
GLfloat GLfloat p
Definition: glew.h:16321
GLsizei const GLchar *const * string
Definition: glew.h:1844
GLenum func
Definition: glcorearb.h:782
Benchmarker & work(size_t val)
Definition: benchmark.h:193
Benchmarker & iterations(size_t val)
Definition: benchmark.h:170
Benchmarker & exclude_outliers(int e)
Definition: benchmark.h:205
OIIO_NAMESPACE_BEGIN OIIO_FORCEINLINE T const & DoNotOptimize(T const &val)
Definition: benchmark.h:455
double range() const
Definition: benchmark.h:164
double stddev() const
Definition: benchmark.h:163
Unit units() const
Definition: benchmark.h:243
#define const
Definition: zconf.h:214
#define OIIO_NAMESPACE_END
Definition: oiioversion.h:66
GLdouble GLdouble t
Definition: glew.h:1398
double operator()(string_view name, FUNC func, ARGS &&...args)
Definition: benchmark.h:151
Benchmarker & trials(size_t val)
Definition: benchmark.h:178
int indent() const
Definition: benchmark.h:229
#define OIIO_NAMESPACE_BEGIN
Definition: oiioversion.h:65
#define OIIO_API
Definition: export.h:91