Criterion.rs

Criterion.rs is a statistics-driven micro-benchmarking tool. It is a Rust port of Haskell's Criterion library.

Criterion.rs benchmarks collect and store statistical information from run to run and can automatically detect performance regressions as well as measure optimizations.

Criterion.rs is free and open source. You can find the source on GitHub. Issues and feature requests can be posted on the issue tracker.

API Docs

In addition to this book, you may also wish to read the API documentation.

License

Criterion.rs is dual-licensed under the Apache 2.0 and the MIT licenses.

Debug Output

To enable debug output in Criterion.rs, define the environment variable CRITERION_DEBUG. For example (in bash):

CRITERION_DEBUG=1 cargo bench

This will enable extra debug output. Criterion.rs will also save the gnuplot scripts alongside the generated plot files. When raising issues with Criterion.rs (especially when reporting issues with the plot generation) please run your benchmarks with this option enabled and provide the additional output and relevant gnuplot scripts.

Getting Started

Step 1 - Add Dependency to Cargo.toml

To enable Criterion.rs benchmarks, add the following to your Cargo.toml file:

[dev-dependencies]
criterion = "0.2"

[[bench]]
name = "my_benchmark"
harness = false

This adds a development dependency on Criterion.rs, and declares a benchmark called my_benchmark without the standard benchmarking harness. It's important to disable the standard benchmark harness, because we'll later add our own and we don't want them to conflict.

Step 2 - Add Benchmark

As an example, we'll benchmark an implementation of the Fibonacci function. Create a benchmark file at $PROJECT/benches/my_benchmark.rs with the following contents (see the Details section below for an explanation of this code):


#[macro_use]
extern crate criterion;

use criterion::Criterion;

fn fibonacci(n: u64) -> u64 {
    match n {
        0 => 1,
        1 => 1,
        n => fibonacci(n-1) + fibonacci(n-2),
    }
}

fn criterion_benchmark(c: &mut Criterion) {
    c.bench_function("fib 20", |b| b.iter(|| fibonacci(20)));
}

criterion_group!(benches, criterion_benchmark);
criterion_main!(benches);

Step 3 - Run Benchmark

To run this benchmark, use the following command:

cargo bench

You should see output similar to this:

     Running target/release/deps/example-423eedc43b2b3a93
Benchmarking fib 20
Benchmarking fib 20: Warming up for 3.0000 s
Benchmarking fib 20: Collecting 100 samples in estimated 5.0658 s (188100 iterations)
Benchmarking fib 20: Analyzing
fib 20                  time:   [26.029 us 26.251 us 26.505 us]
Found 11 outliers among 99 measurements (11.11%)
  6 (6.06%) high mild
  5 (5.05%) high severe
slope  [26.029 us 26.505 us] R^2            [0.8745662 0.8728027]
mean   [26.106 us 26.561 us] std. dev.      [808.98 ns 1.4722 us]
median [25.733 us 25.988 us] med. abs. dev. [234.09 ns 544.07 ns]

Details

Let's go back and walk through that benchmark code in more detail.


#[macro_use]
extern crate criterion;

use criterion::Criterion;

First, we declare the criterion crate and import the Criterion type. Criterion is the main type for the Criterion.rs library. It provides methods to configure and define groups of benchmarks.


fn fibonacci(n: u64) -> u64 {
    match n {
        0 => 1,
        1 => 1,
        n => fibonacci(n-1) + fibonacci(n-2),
    }
}

Second, we define the function to benchmark. In normal usage, this would be imported from elsewhere in your crate, but for simplicity we'll just define it right here.


fn criterion_benchmark(c: &mut Criterion) {

Here we create a function to contain our benchmark code. The name of the benchmark function doesn't matter, but it should be clear and understandable.


    c.bench_function("fib 20", |b| b.iter(|| fibonacci(20)));
}

This is where the real work happens. The bench_function method defines a benchmark with a name and a closure. The name should be unique among all of the benchmarks for your project. The closure must accept one argument, a Bencher. The bencher performs the benchmark - in this case, it simply calls our fibonacci function in a loop. There are a number of other benchmark functions, including the option to benchmark with arguments, to benchmark external programs and to compare the performance of two functions. See the API documentation for details on all of the different benchmarking options.


criterion_group!(benches, criterion_benchmark);
criterion_main!(benches);

Here we invoke the criterion_group! macro to generate a benchmark group called benches, containing the criterion_benchmark function defined earlier. Finally, we invoke the criterion_main! macro to generate a main function which executes the benches group. See the API documentation for more information on these macros.

Step 4 - Optimize

This fibonacci function is quite inefficient. We can do better:


fn fibonacci(n: u64) -> u64 {
    let mut a = 0u64;
    let mut b = 1u64;
    let mut c = 0u64;

    if n == 0 {
        return 0
    }

    for _ in 0..(n+1) {
        c = a + b;
        a = b;
        b = c;
    }
    return b;
}

Running the benchmark now produces output like this:

     Running target/release/deps/example-423eedc43b2b3a93
Benchmarking fib 20
Benchmarking fib 20: Warming up for 3.0000 s
Benchmarking fib 20: Collecting 100 samples in estimated 5.0000 s (13548862800 iterations)
Benchmarking fib 20: Analyzing
fib 20                  time:   [353.59 ps 356.19 ps 359.07 ps]
                        change: [-99.999% -99.999% -99.999%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 99 measurements (6.06%)
  4 (4.04%) high mild
  2 (2.02%) high severe
slope  [353.59 ps 359.07 ps] R^2            [0.8734356 0.8722124]
mean   [356.57 ps 362.74 ps] std. dev.      [10.672 ps 20.419 ps]
median [351.57 ps 355.85 ps] med. abs. dev. [4.6479 ps 10.059 ps]

As you can see, Criterion is statistically confident that our optimization has made an improvement. If we introduce a performance regression, Criterion will instead print a message indicating this.

User Guide

This chapter covers the output produced by Criterion.rs benchmarks, both the command-line reports and the charts. It also details more advanced usages of Criterion.rs such as benchmarking external programs and comparing the performance of multiple functions.

Migrating from libtest

This page shows an example of converting a libtest or bencher benchmark to use Criterion.rs.

The Benchmark

We'll start with this benchmark as an example:


#![feature(test)]
extern crate test;
use test::Bencher;

fn fibonacci(n: u64) -> u64 {
    match n {
        0 => 1,
        1 => 1,
        n => fibonacci(n-1) + fibonacci(n-2),
    }
}

#[bench]
fn bench_fib(b: &mut Bencher) {
    b.iter(|| fibonacci(20));
}

The Migration

The first thing to do is update the Cargo.toml to disable the libtest benchmark harness:

[[bench]]
name = "example"
harness = false

We also need to add Criterion.rs to the dev-dependencies section of Cargo.toml:

[dev-dependencies]
criterion = "0.2"

The next step is to update the imports:


#[macro_use]
extern crate criterion;
use criterion::Criterion;

Then, we can change the bench_fib function. Remove the #[bench] attribute and change the argument from &mut Bencher to &mut Criterion. The contents of this function need to change as well:


fn bench_fib(c: &mut Criterion) {
    c.bench_function("fib 20", |b| b.iter(|| fibonacci(20)));
}

Finally, we need to invoke some macros to generate a main function, since we no longer have libtest to provide one:


criterion_group!(benches, bench_fib);
criterion_main!(benches);

And that's it! The complete migrated benchmark code is below:


#[macro_use]
extern crate criterion;
use criterion::Criterion;

fn fibonacci(n: u64) -> u64 {
    match n {
        0 => 1,
        1 => 1,
        n => fibonacci(n-1) + fibonacci(n-2),
    }
}

fn bench_fib(c: &mut Criterion) {
    c.bench_function("fib 20", |b| b.iter(|| fibonacci(20)));
}

criterion_group!(benches, bench_fib);
criterion_main!(benches);

Command-Line Output

The output on this page was produced by running cargo bench -- --verbose; a plain cargo bench omits some of this information. Note: If cargo bench fails with an error message about an unknown argument, see the FAQ.

Every Criterion.rs benchmark calculates statistics from the measured iterations and produces a report like this:

Benchmarking alloc
Benchmarking alloc: Warming up for 1.0000 s
Benchmarking alloc: Collecting 100 samples in estimated 13.354 s (5050 iterations)
Benchmarking alloc: Analyzing
alloc                   time:   [2.5094 ms 2.5306 ms 2.5553 ms]
                        thrpt:  [391.34 MiB/s 395.17 MiB/s 398.51 MiB/s]
                        change: [-38.292% -37.342% -36.524%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  4 (4.00%) high mild
  4 (4.00%) high severe
slope  [2.5094 ms 2.5553 ms] R^2            [0.8660614 0.8640630]
mean   [2.5142 ms 2.5557 ms] std. dev.      [62.868 us 149.50 us]
median [2.5023 ms 2.5262 ms] med. abs. dev. [40.034 us 73.259 us]

Warmup

Every Criterion.rs benchmark iterates the benchmarked function automatically for a configurable warmup period (by default, for three seconds). For Rust function benchmarks, this is to warm up the processor caches and (if applicable) file system caches. For external program benchmarks, it can also be used to warm up JIT compilers.

Collecting Samples

Criterion iterates the function to be benchmarked with a varying number of iterations to generate an estimate of the time taken by each iteration. The number of samples is configurable. It also prints an estimate of the time the sampling process will take based on the time per iteration during the warmup period.

Time

time:   [2.5094 ms 2.5306 ms 2.5553 ms]
thrpt:  [391.34 MiB/s 395.17 MiB/s 398.51 MiB/s]

This shows a confidence interval over the measured per-iteration time for this benchmark. The left and right values show the lower and upper bounds of the confidence interval respectively, while the center value shows Criterion.rs' best estimate of the time taken for each iteration of the benchmarked routine.

The confidence level is configurable. A greater confidence level (eg. 99%) will widen the interval and thus provide the user with less information about the true slope. On the other hand, a lesser confidence level (eg. 90%) will narrow the interval, but then the user is less confident that the interval contains the true slope. 95% is generally a good balance.

Criterion.rs performs bootstrap resampling to generate these confidence intervals. The number of bootstrap samples is configurable, and defaults to 100,000.
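
These analysis settings (along with the warmup and measurement times and sample count mentioned elsewhere in this book) can be changed when constructing the Criterion struct. The following is a minimal sketch, assuming the builder methods and the alternate criterion_group! form provided by Criterion.rs 0.2; the benchmark routine and the specific values are arbitrary examples:

#[macro_use]
extern crate criterion;

use criterion::Criterion;
use std::time::Duration;

fn config_demo(c: &mut Criterion) {
    c.bench_function("sum 0..100", |b| b.iter(|| (0u64..100).sum::<u64>()));
}

criterion_group!{
    name = benches;
    // The values below are arbitrary; see the API documentation for the defaults.
    config = Criterion::default()
        .confidence_level(0.99)                      // default 0.95
        .nresamples(200_000)                         // bootstrap resamples, default 100,000
        .sample_size(200)                            // samples per benchmark, default 100
        .measurement_time(Duration::from_secs(10));  // target measurement time
    targets = config_demo
}
criterion_main!(benches);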

Optionally, Criterion.rs can also report the throughput of the benchmarked code in units of bytes or elements per second.

Change

When a Criterion.rs benchmark is run, it saves statistical information in the target/criterion directory. Subsequent executions of the benchmark will load this data and compare it with the current sample to show the effects of changes in the code.

change: [-38.292% -37.342% -36.524%] (p = 0.00 < 0.05)
Performance has improved.

This shows a confidence interval over the difference between this run of the benchmark and the last one, as well as the probability that the measured difference could have occurred by chance. These lines will be omitted if no saved data could be read for this benchmark.

The second line shows a quick summary. This line will indicate that the performance has improved or regressed if Criterion.rs has strong statistical evidence that this is the case. It may also indicate that the change was within the noise threshold. Criterion.rs attempts to reduce the effects of noise as much as possible, but differences in benchmark environment (eg. different load from other processes, memory usage, etc.) can influence the results. For highly-deterministic benchmarks, Criterion.rs can be sensitive enough to detect these small fluctuations, so benchmark results that overlap the range +-noise_threshold are assumed to be noise and considered insignificant. The noise threshold is configurable, and defaults to +-2%.

Additional examples:

alloc                   time:   [1.2421 ms 1.2540 ms 1.2667 ms]
                        change: [+40.772% +43.934% +47.801%] (p = 0.00 < 0.05)
                        Performance has regressed.
alloc                   time:   [1.2508 ms 1.2630 ms 1.2756 ms]
                        change: [-1.8316% +0.9121% +3.4704%] (p = 0.52 > 0.05)
                        No change in performance detected.
benchmark               time:   [442.92 ps 453.66 ps 464.78 ps]
                        change: [-0.7479% +3.2888% +7.5451%] (p = 0.04 > 0.05)
                        Change within noise threshold.

Detecting Outliers

Found 8 outliers among 100 measurements (8.00%)
  4 (4.00%) high mild
  4 (4.00%) high severe

Criterion.rs attempts to detect unusually high or low samples and reports them as outliers. A large number of outliers suggests that the benchmark results are noisy and should be viewed with appropriate skepticism. In this case, you can see that there are some samples which took much longer than normal. This might be caused by unpredictable load on the computer running the benchmarks, thread or process scheduling, or irregularities in the time taken by the code being benchmarked.

In order to ensure reliable results, benchmarks should be run on a quiet computer and should be designed to do approximately the same amount of work for each iteration. If this is not possible, consider increasing the measurement time to reduce the influence of outliers on the results, at the cost of a longer benchmarking period. Alternately, the warmup period can be extended (to ensure that any JIT compilers or similar are warmed up) or other timing loops can be used to perform setup before each iteration to prevent that from affecting the results.

Additional Statistics

slope  [2.5094 ms 2.5553 ms] R^2            [0.8660614 0.8640630]
mean   [2.5142 ms 2.5557 ms] std. dev.      [62.868 us 149.50 us]
median [2.5023 ms 2.5262 ms] med. abs. dev. [40.034 us 73.259 us]

This shows additional confidence intervals based on other statistics.

Criterion.rs performs a linear regression to calculate the time per iteration. The first line shows the confidence interval of the slopes from the linear regressions, while the R^2 area shows the goodness-of-fit values for the lower and upper bounds of that confidence interval. If the R^2 value is low, this may indicate the benchmark isn't doing the same amount of work on each iteration. You may wish to examine the plot output and consider improving the consistency of your benchmark routine.

The second line shows confidence intervals on the mean and standard deviation of the per-iteration times (calculated naively). If std. dev. is large compared to the time values from above, the benchmarks are noisy. You may need to change your benchmark to reduce the noise.

The median/med. abs. dev. line is similar to the mean/std. dev. line, except that it uses the median and median absolute deviation. As with the std. dev., if the med. abs. dev. is large, this indicates the benchmarks are noisy.

A Note Of Caution

Criterion.rs is designed to produce robust statistics when possible, but it can't account for everything. For example, the performance improvements and regressions listed in the above examples were created just by switching my laptop between battery power and wall power rather than changing the code under test. Care must be taken to ensure that benchmarks are performed under similar conditions in order to produce meaningful results.

Command-Line Options

Criterion.rs benchmarks accept a number of custom command-line parameters. This is a list of the most common options. Run cargo bench -- -h to see a full list.

  • To filter benchmarks, use cargo bench -- <filter> where <filter> is a substring of the benchmark ID. For example, running cargo bench -- fib_20 would only run benchmarks whose ID contains the string fib_20
  • To print more detailed output, use cargo bench -- --verbose
  • To disable colored output, use cargo bench -- --color never
  • To disable plot generation, use cargo bench -- --noplot
  • To iterate each benchmark for a fixed length of time without saving, analyzing or plotting the results, use cargo bench -- --profile-time <num_seconds>. This is useful when profiling the benchmarks. It reduces the amount of unrelated clutter in the profiling results and prevents Criterion.rs' normal dynamic sampling logic from greatly increasing the runtime of the benchmarks.
  • To save a baseline, use cargo bench -- --save-baseline <name>. To compare against an existing baseline, use cargo bench -- --baseline <name>. For more on baselines, see below.
  • To test that the benchmarks run successfully without performing the measurement or analysis (eg. in a CI setting), use cargo bench -- --test.

Note:

If cargo bench fails with an error message about an unknown argument, see the FAQ.

Baselines

By default, Criterion.rs will compare the measurements against the previous run (if any). Sometimes it's useful to keep a set of measurements around for several runs. For example, you might want to make multiple changes to the code while comparing against the master branch. For this situation, Criterion.rs supports custom baselines.

  • --save-baseline <name> will compare against the named baseline, then overwrite it.
  • --baseline <name> will compare against the named baseline without overwriting it.

Using these options, you can manage multiple baseline measurements. For instance, if you want to compare against a static reference point such as the master branch, you might run:

git checkout master
cargo bench -- --save-baseline master
git checkout optimizations
cargo bench -- --baseline master

# Some optimization work here

# Measure again and compare against the stored baseline without overwriting it
cargo bench -- --baseline master

HTML Report

If gnuplot is installed, Criterion.rs can generate an HTML report displaying the results of the benchmark under target/criterion/report/index.html.

For more details on the charts and statistics displayed in the report, check the other pages of this book.

Plots & Graphs

If gnuplot is installed, Criterion.rs can generate a number of useful charts and graphs which you can check to get a better understanding of the behavior of the benchmark.

File Structure

The plots and saved data are stored under target/criterion/$BENCHMARK_NAME/. Here's an example of the folder structure:

$BENCHMARK/
├── base/
│  ├── raw.csv
│  ├── estimates.json
│  ├── sample.json
│  └── tukey.json
├── change/
│  └── estimates.json
├── new/
│  ├── raw.csv
│  ├── estimates.json
│  ├── sample.json
│  └── tukey.json
└── report/
   ├── both/
   │  ├── pdf.svg
   │  └── regression.svg
   ├── change/
   │  ├── mean.svg
   │  ├── median.svg
   │  └── t-test.svg
   ├── index.html
   ├── MAD.svg
   ├── mean.svg
   ├── median.svg
   ├── pdf.svg
   ├── pdf_small.svg
   ├── regression.svg
   ├── regression_small.svg
   ├── relative_pdf_small.svg
   ├── relative_regression_small.svg
   ├── SD.svg
   └── slope.svg

The new folder contains the statistics for the last benchmarking run, while the base folder contains those for the last run on the base baseline (see Command-Line Options for more information on baselines). The plots are in the report folder. Criterion.rs only keeps historical data for the last run. The report/both folder contains plots which show both runs on one plot, while the report/change folder contains plots showing the differences between the last two runs. This example shows the plots produced by the default bench_function benchmark method. Other methods may produce additional charts, which will be detailed in their respective pages.

MAD/Mean/Median/SD/Slope

Mean Chart

These are the simplest of the plots generated by Criterion.rs. They display the bootstrapped distributions and confidence intervals for the given statistics.

Regression

Regression Chart

The regression plot shows each data point plotted on an X-Y plane showing the number of iterations vs the time taken. It also shows the line representing Criterion.rs' best guess at the time per iteration. A good benchmark will show the data points all closely following the line. If the data points are scattered widely, this indicates that there is a lot of noise in the data and that the benchmark may not be reliable. If the data points follow a consistent trend but don't match the line (eg. if they follow a curved pattern or show several discrete line segments) this indicates that the benchmark is doing different amounts of work depending on the number of iterations, which prevents Criterion.rs from generating accurate statistics and means that the benchmark may need to be reworked.

The combined regression plot in the report/both folder shows only the regression lines and is a useful visual indicator of the difference in performance between the two runs.

PDF

PDF Chart

The PDF chart shows the probability distribution function for the samples. It also shows the ranges used to classify samples as outliers. In this example (as in the regression example above) we can see that the performance trend changes noticeably below ~35 iterations, which we may wish to investigate.

Benchmarking With Inputs

Criterion.rs can run benchmarks with multiple different input values to investigate how the performance behavior changes with different inputs.


static KB: usize = 1024;

Criterion::default()
    .bench_function_over_inputs("from_elem", |b, &&size| {
        b.iter(|| iter::repeat(0u8).take(size).collect::<Vec<_>>());
    }, &[KB, 2 * KB, 4 * KB, 8 * KB, 16 * KB]);

In this example, we're benchmarking the time it takes to collect an iterator producing a sequence of N bytes into a Vec. We use the bench_function_over_inputs method. Unlike bench_function, the lambda here takes a Bencher and a reference to a parameter, in this case size. Finally, we provide a slice of potential input values. This generates five benchmarks, named "from_elem/1024" through "from_elem/16384", which individually behave the same as any other benchmark. Criterion.rs also generates some charts in target/criterion/from_elem/report/ showing how the iteration time changes as a function of the input.

Line Chart

Here we can see that there is an approximately linear relationship between the length of an iterator and the time taken to collect it into a Vec.

Advanced Configuration

Criterion.rs provides a number of configuration options for more-complex use cases. These options are documented here.

Throughput Measurements

When benchmarking some types of code it is useful to measure the throughput as well as the iteration time, either in bytes per second or elements per second. Criterion.rs can estimate the throughput of a benchmark, but it needs to know how many bytes or elements each iteration will process.

Throughput measurements are only supported when using the Benchmark or ParameterizedBenchmark structures; they are not available when using the simpler bench_function interface.

To measure throughput, use the throughput method on Benchmark, like so:


use criterion::*;

fn decode(bytes: &[u8]) {
    // Decode the bytes
    ...
}

fn bench(c: &mut Criterion) {
    let bytes : &[u8] = ...;

    c.bench(
        "throughput-example",
        Benchmark::new(
            "decode",
            |b| b.iter(|| decode(bytes)),
        ).throughput(Throughput::Bytes(bytes.len() as u32)),
    );
}

criterion_group!(benches, bench);
criterion_main!(benches);

For parameterized benchmarks, each argument might represent a different number of elements, so the throughput function accepts a lambda instead:


use criterion::*;

type Element = ...;

fn encode(elements: &[Element]) {
    // Encode the elements
    ...
}

fn bench(c: &mut Criterion) {
    let elements_1: &[Element] = ...;
    let elements_2: &[Element] = ...;

    c.bench(
        "throughput-example",
        ParameterizedBenchmark::new(
            "encode",
            |b, elems| b.iter(|| encode(elems)),
            vec![elements_1, elements_2],
        ).throughput(|elems| Throughput::Elements(elems.len() as u32)),
    );
}

criterion_group!(benches, bench);
criterion_main!(benches);

Setting the throughput causes a throughput estimate to appear in the output:

alloc                   time:   [5.9846 ms 6.0192 ms 6.0623 ms]
                        thrpt:  [164.95 MiB/s 166.14 MiB/s 167.10 MiB/s]  

Chart Axis Scaling

By default, Criterion.rs generates plots using a linear-scale axis. When using parameterized benchmarks, it is common for the input sizes to scale exponentially in order to cover a wide range of possible inputs. In this situation, it may be easier to read the resulting plots with a logarithmic axis.

As with throughput measurements above, this option is only available when using the ParameterizedBenchmark structure.


use criterion::*;

fn do_a_thing(x: u64) {
    // Do something
    ...
}

fn bench(c: &mut Criterion) {
    let plot_config = PlotConfiguration::default()
        .summary_scale(AxisScale::Logarithmic);

    c.bench(
        "log_scale_example",
        ParameterizedBenchmark::new(
            "do_thing",
            |b, i| b.iter(|| do_a_thing(*i)),
            vec![1u64, 10u64, 100u64, 1000u64, 10000u64, 100000u64, 1000000u64],
        ).plot_config(plot_config),
    );
}

criterion_group!(benches, bench);
criterion_main!(benches);

Currently the axis scaling is the only option that can be set on the PlotConfiguration struct. More may be added in the future.

Comparing Functions

Criterion.rs can automatically benchmark multiple implementations of a function and produce summary graphs to show the differences in performance between them. First, let's create a comparison benchmark.


#[macro_use]
extern crate criterion;
use criterion::{Criterion, ParameterizedBenchmark};

fn fibonacci_slow(n: u64) -> u64 {
    match n {
        0 => 1,
        1 => 1,
        n => fibonacci_slow(n-1) + fibonacci_slow(n-2),
    }
}

fn fibonacci_fast(n: u64) -> u64 {
    let mut a = 0u64;
    let mut b = 1u64;
    let mut c = 0u64;

    if n == 0 {
        return 0
    }

    for _ in 0..(n+1) {
        c = a + b;
        a = b;
        b = c;
    }
    return b;
}

fn bench_fibs(c: &mut Criterion) {
    c.bench(
        "Fibonacci",
        ParameterizedBenchmark::new("Recursive", |b, i| b.iter(|| fibonacci_slow(*i)), vec![20u64, 21u64])
            .with_function("Iterative", |b, i| b.iter(|| fibonacci_fast(*i))),
    );
}

criterion_group!(benches, bench_fibs);
criterion_main!(benches);

These are the same two fibonacci functions from the Getting Started page. The difference here is that we import the ParameterizedBenchmark type as well.


fn bench_fibs(c: &mut Criterion) {
    c.bench(
        "Fibonacci",
        ParameterizedBenchmark::new("Recursive", |b, i| b.iter(|| fibonacci_slow(*i)), vec![2u64, 5, 10, 20])
            .with_function("Iterative", |b, i| b.iter(|| fibonacci_fast(*i))),
    );
}

Here, we define a ParameterizedBenchmark which calls the recursive implementation with several different inputs. We also add a second benchmark which calls the iterative implementation with the same inputs. This is then passed to the Criterion::bench function, which executes each benchmark with each input. Criterion will generate a report for each individual benchmark/input pair, summary reports for each benchmark (across all inputs) and for each input (across all benchmarks), and an overall summary of the whole benchmark group.

For benchmarks which do not accept a parameter, there is also the Benchmark struct, which is identical to ParameterizedBenchmark except it does not accept parameters.
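
For reference, a comparison at a single, fixed input might look like the sketch below. It assumes the fibonacci_slow and fibonacci_fast functions defined above; the group and function names are illustrative:

use criterion::{Benchmark, Criterion};

fn bench_fibs_fixed(c: &mut Criterion) {
    // Compare the two implementations with one fixed input value.
    c.bench(
        "Fibonacci-20",
        Benchmark::new("Recursive", |b| b.iter(|| fibonacci_slow(20)))
            .with_function("Iterative", |b| b.iter(|| fibonacci_fast(20))),
    );
}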

Violin Plot

Violin Plot

The Violin Plot shows the median times and the PDF of each implementation.

Line Chart

Line Chart

The line chart shows a comparison of the different functions as the input or input size increases; it is produced when using ParameterizedBenchmark.

Benchmarking External Programs

Criterion.rs has the ability to benchmark external programs (which may be written in any language) the same way that it can benchmark Rust functions. What follows is an example of how that can be done and some of the pitfalls to avoid along the way.

First, let's define our recursive Fibonacci function, only in Python this time:

def fibonacci(n):
    if n == 0 or n == 1:
        return 1
    return fibonacci(n-1) + fibonacci(n-2)

In order to benchmark this with Criterion.rs, we first need to write our own small benchmarking harness. I'll start with the complete code for this and then go over it in more detail:

import time
import sys

MILLIS = 1000
MICROS = MILLIS * 1000
NANOS = MICROS * 1000

def benchmark():
    argument = int(sys.argv[1])

    for line in sys.stdin:
        iters = int(line.strip())

        # Setup

        start = time.perf_counter()
        for x in range(iters):
            fibonacci(argument)
        end = time.perf_counter()

        # Teardown

        delta = end - start
        nanos = int(delta * NANOS)
        print("%d" % nanos)
        sys.stdout.flush()

benchmark()

The important part is the benchmark() function.

The Argument

argument = int(sys.argv[1])

This example uses the Criterion::bench_program_over_inputs function to benchmark our Python Fibonacci function with a variety of inputs. The external program receives the input value as a command-line argument appended to the command specified in the benchmark, so the very first thing our benchmark harness does is parse that argument into an integer. If we used bench_program instead, there would be no argument.

Reading from stdin

    for line in sys.stdin:
        iters = int(line.strip())

Next, our harness reads a line from stdin and parses it into an integer. Starting an external process is slow, and it would mess with our measurements if we had to do so for each iteration of the benchmark. Besides which, it would obscure the results (since we're probably more interested in the performance of the function without the process-creation overhead). Therefore, Criterion.rs starts the process once per input value or benchmark and sends the iteration counts to the external program on stdin. Your external benchmark harness must read and parse this iteration count and call the benchmarked function the appropriate number of times.

Setup

If your benchmarked code requires any setup, this is the time to do that.

Timing

        start = time.perf_counter()
        for x in range(iters):
            fibonacci(argument)
        end = time.perf_counter()

This is the heart of the external benchmark harness. We measure how long it takes to execute our Fibonacci function with the given argument in a loop, iterating the given number of times. It's important here to use the most precise timer available. We'll need to report the measurement in nanoseconds later, so if you can use a timer that returns a value in nanoseconds (eg. Java's System.nanoTime()) we can skip a bit of work later. It's OK if the timer can't measure to nanosecond precision (most PCs can't), but use the best timer you have.

Teardown

If your benchmarked code requires any teardown, this is the time to do that.

Reporting

        delta = end - start
        nanos = int(delta * NANOS)
        print("%d" % nanos)
        sys.stdout.flush()

To report the measured time, simply print the elapsed number of nanoseconds to stdout. perf_counter reports its results as a floating-point number of seconds, so we first convert it to an integer number of nanoseconds before printing it.

Beware Buffering: Criterion.rs will wait until it receives the measurement before sending the next iteration count. If your benchmarks seem to be hanging during the warmup period, it may be because your benchmark harness is buffering the output on stdout, as Python does here. In this example we explicitly force Python to flush the buffer; you may need to do the same in your benchmarks.
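
For comparison, the same stdin/stdout protocol implemented in Rust could look roughly like the sketch below. It is illustrative only (the fibonacci routine, the default argument, and the checksum trick to keep the call from being optimized away are all assumptions, not part of any Criterion.rs API):

use std::env;
use std::io::{self, BufRead, Write};
use std::time::Instant;

fn fibonacci(n: u64) -> u64 {
    match n {
        0 | 1 => 1,
        n => fibonacci(n - 1) + fibonacci(n - 2),
    }
}

fn main() {
    // The input value arrives as a command-line argument (only for the
    // *_over_inputs variants); default to 20 if none is given.
    let argument: u64 = env::args().nth(1).and_then(|s| s.parse().ok()).unwrap_or(20);

    let stdin = io::stdin();
    let stdout = io::stdout();
    let mut out = stdout.lock();
    let mut checksum = 0u64;

    // Criterion.rs sends one iteration count per line on stdin.
    for line in stdin.lock().lines() {
        let iters: u64 = line.unwrap().trim().parse().unwrap();

        // Per-sample setup would go here.

        let start = Instant::now();
        for _ in 0..iters {
            // Accumulate the result so the call cannot be optimized away.
            checksum = checksum.wrapping_add(fibonacci(argument));
        }
        let elapsed = start.elapsed();

        // Per-sample teardown would go here.

        // Report the elapsed time for the whole sample, in nanoseconds.
        let nanos = elapsed.as_secs() * 1_000_000_000 + u64::from(elapsed.subsec_nanos());
        writeln!(out, "{}", nanos).unwrap();
        out.flush().unwrap(); // avoid the buffering problem described above
    }

    eprintln!("checksum: {}", checksum);
}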

Defining the Benchmark

If you've read the earlier pages, this will be quite familiar.


use criterion::Criterion;
use std::process::Command;

fn create_command() -> Command {
    let mut command = Command::new("python3");
    command.arg("benches/external_process.py");
    command
}

fn python_fibonacci(c: &mut Criterion) {
    c.bench_program_over_inputs("fibonacci-python",
        create_command,
        &[1, 2, 4, 8, 16]);
}

As before, we create a Criterion struct and use it to define our benchmark. This time, we use the bench_program_over_inputs method. This takes a function (used to create the Command which represents our external program) and an iterable containing the inputs to test. Aside from the use of a Command rather than a closure, this behaves just like (and produces the same output as) bench_function_over_inputs.

If your benchmark doesn't require input, simply omit the input values and use bench_program instead, which behaves like bench_function.

CSV Output

Criterion.rs saves its measurements in several files, as shown below:

$BENCHMARK/
├── base/
│  ├── raw.csv
│  ├── estimates.json
│  ├── sample.json
│  └── tukey.json
├── change/
│  └── estimates.json
├── new/
│  ├── raw.csv
│  ├── estimates.json
│  ├── sample.json
│  └── tukey.json

The JSON files are all considered private implementation details of Criterion.rs, and their structure may change at any time without warning.

However, there is a need for some sort of stable and machine-readable output to enable projects like lolbench to keep historical data or perform additional analysis on the measurements. For this reason, Criterion.rs also writes the raw.csv file. The format of this file is expected to remain stable between different versions of Criterion.rs, so this file is suitable for external tools to depend on.

The format of raw.csv is as follows:

group,function,value,sample_time_nanos,iteration_count
Fibonacci,Iterative,,915000,110740
Fibonacci,Iterative,,1964000,221480
Fibonacci,Iterative,,2812000,332220
Fibonacci,Iterative,,3767000,442960
Fibonacci,Iterative,,4785000,553700
Fibonacci,Iterative,,6302000,664440
Fibonacci,Iterative,,6946000,775180
Fibonacci,Iterative,,7815000,885920
Fibonacci,Iterative,,9186000,996660
Fibonacci,Iterative,,9578000,1107400
Fibonacci,Iterative,,11206000,1218140
...

This data was taken with this benchmark code:


fn compare_fibonaccis(c: &mut Criterion) {
    let fib_slow = Fun::new("Recursive", |b, i| b.iter(|| fibonacci_slow(*i)));
    let fib_fast = Fun::new("Iterative", |b, i| b.iter(|| fibonacci_fast(*i)));

    let functions = vec![fib_slow, fib_fast];

    c.bench_functions("Fibonacci", functions, 20);
}

raw.csv contains the following columns:

  • group - This corresponds to the function group name, in this case "Fibonacci" as seen in the code above. This is the parameter given to the Criterion::bench functions.
  • function - This corresponds to the function name, in this case "Iterative". When comparing multiple functions, each function is given a different name. Otherwise, this will be the empty string.
  • value - This is the parameter passed to the benchmarked function when using parameterized benchmarks. In this case, there is no parameter so the value is the empty string.
  • iteration_count - The number of times the benchmark was iterated for this sample.
  • sample_time_nanos - The time taken by the measurement for this sample, in nanoseconds. Note that this is the time for the whole sample, not the time-per-iteration (see Analysis Process for more detail). To calculate the time-per-iteration, use sample_time_nanos/iteration_count.

As you can see, these are the raw measurements taken by the Criterion.rs benchmark process. There is one record for each sample, and one file for each benchmark.
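
Because raw.csv is the stable interface, an external tool can post-process it with nothing more than the standard library. A minimal sketch (the file path is illustrative and should point at the raw.csv of the benchmark you care about; it also assumes the group and function names contain no commas):

use std::fs::File;
use std::io::{BufRead, BufReader};

fn main() {
    // Adjust this path to the benchmark you want to inspect.
    let file = File::open("target/criterion/my_benchmark/new/raw.csv")
        .expect("run `cargo bench` first");

    // Skip the header row, then compute the per-iteration time of each sample.
    for line in BufReader::new(file).lines().skip(1) {
        let line = line.unwrap();
        // Columns: group, function, value, sample_time_nanos, iteration_count
        let fields: Vec<&str> = line.split(',').collect();
        let sample_time_nanos: f64 = fields[3].parse().unwrap();
        let iteration_count: f64 = fields[4].parse().unwrap();
        println!("{:.2} ns/iter", sample_time_nanos / iteration_count);
    }
}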

The results of Criterion.rs' analysis of these measurements are not currently available in machine-readable form. If you need access to this information, please raise an issue describing your use case.

Known Limitations

There are currently a number of limitations to the use of Criterion.rs relative to the standard benchmark harness.

First, it is necessary for Criterion.rs to provide its own main function using the criterion_main macro. This results in several limitations:

  • It is not possible to include benchmarks in code in the src/ directory as one might with the regular benchmark harness.
  • It is not possible to benchmark non-pub functions. External benchmarks, including those using Criterion.rs, are compiled as a separate crate, and non-pub functions are not visible to the benchmarks.
  • It is not possible to benchmark functions in binary crates. Binary crates cannot be dependencies of other crates, and that includes external tests and benchmarks.

Criterion.rs cannot currently solve these issues. An experimental RFC is being implemented to enable custom test and benchmarking frameworks.

Second, Criterion.rs provides a stable-compatible replacement for the black_box function provided by the standard test crate. This replacement is not as reliable as the official one, and it may allow dead-code-elimination to affect the benchmarks in some circumstances. If you're using a Nightly build of Rust, you can add the real_blackbox feature to your dependency on Criterion.rs to use the standard black_box function instead.

Example:

criterion = { version = '...', features=['real_blackbox'] }

Bencher Compatibility Layer

Criterion.rs provides a small crate which can be used as a drop-in replacement for most common usages of bencher in order to make it easy for existing bencher users to try out Criterion.rs. This page shows an example of how to use this crate.

Example

We'll start with the example benchmark from bencher:


#[macro_use]
extern crate bencher;

use bencher::Bencher;

fn a(bench: &mut Bencher) {
    bench.iter(|| {
        (0..1000).fold(0, |x, y| x + y)
    })
}

fn b(bench: &mut Bencher) {
    const N: usize = 1024;
    bench.iter(|| {
        vec![0u8; N]
    });

    bench.bytes = N as u64;
}

benchmark_group!(benches, a, b);
benchmark_main!(benches);

The first step is to edit the Cargo.toml file to replace the bencher dependency with criterion_bencher_compat:

Change:

[dev-dependencies]
bencher = "0.1"

To:

[dev-dependencies]
criterion_bencher_compat = "0.2"

Then we update the benchmark file itself to change:


#[macro_use]
extern crate bencher;

To:


#[macro_use]
extern crate criterion_bencher_compat as bencher;

That's all! Now just run cargo bench:

     Running target/release/deps/bencher_example-d865087781455bd5
a                       time:   [234.58 ps 237.68 ps 241.94 ps]
Found 9 outliers among 100 measurements (9.00%)
  4 (4.00%) high mild
  5 (5.00%) high severe

b                       time:   [23.972 ns 24.218 ns 24.474 ns]
Found 4 outliers among 100 measurements (4.00%)
  4 (4.00%) high mild

Limitations

criterion_bencher_compat does not implement the full API of the bencher crate, only the most commonly-used subset. If your benchmarks require parts of the bencher crate which are not supported, you may need to temporarily disable them while trying Criterion.rs.

criterion_bencher_compat does not provide access to most of Criterion.rs' more advanced features. If the Criterion.rs benchmarks work well for you, it is recommended to convert your benchmarks to use the Criterion.rs interface directly. See Migrating from libtest for more information on that.

Timing Loops

The Bencher structure provides a number of functions which implement different timing loops for measuring the performance of a function. This page discusses how these timing loops work and which one is appropriate for different situations.

iter

The simplest timing loop is iter. This loop should be the default for most benchmarks. iter calls the benchmark N times in a tight loop and records the elapsed time for the entire loop. Because it takes only two measurements (the time before and after the loop) and does nothing else in the loop, iter has effectively zero measurement overhead - meaning it can accurately measure the performance of functions as small as a single processor instruction.

However, iter has limitations as well. If the benchmark returns a value which implements Drop, it will be dropped inside the loop and the drop function's time will be included in the measurement. Additionally, some benchmarks need per-iteration setup. A benchmark for a sorting algorithm might require some unsorted data to operate on, but we don't want the generation of the unsorted data to affect the measurement. iter provides no way to do this.

iter_with_large_drop

iter_with_large_drop is an answer to the first problem. In this case, the values returned by the benchmark are collected into a Vec to be dropped after the measurement is complete. This introduces a small amount of measurement overhead, meaning that the measured value will be slightly higher than the true runtime of the function. This overhead is almost always negligible, but it's important to be aware that it exists. Extremely fast benchmarks (such as those in the hundreds-of-picoseconds range or smaller) or benchmarks that return very large structures may incur more overhead.
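
For example, a benchmark that builds and returns a large allocation could defer the deallocation like this minimal sketch (the benchmark name and the 1 MB size are arbitrary):

use criterion::Criterion;

fn alloc_benchmark(c: &mut Criterion) {
    c.bench_function("alloc 1MB", |b| {
        // The Vec returned by the closure is collected and dropped after the
        // measurement, so deallocation time is excluded from the timing.
        b.iter_with_large_drop(|| vec![0u8; 1024 * 1024])
    });
}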

Aside from the measurement overhead, iter_with_large_drop has its own limitations. Collecting the returned values into a Vec uses heap memory, and the amount of memory used is not under the control of the user. Rather, it depends on the iteration count which in turn depends on the benchmark settings and the runtime of the benchmarked function. It is possible that a benchmark could run out of memory while collecting the values to drop.

iter_batched/iter_batched_ref

iter_batched and iter_batched_ref are the next step up in complexity for timing loops. These timing loops take two closures rather than one. The first closure takes no arguments and returns a value of type T - this is used to generate setup data. For example, the setup function might clone a vector of unsorted data for use in benchmarking a sorting function. The second closure is the function to benchmark, and it takes a T (for iter_batched) or &mut T (for iter_batched_ref).

These two timing loops generate a batch of inputs and measure the time to execute the benchmark on all values in the batch. As with iter_with_large_drop they also collect the values returned from the benchmark into a Vec and drop it later without timing the drop. Then another batch of inputs is generated and the process is repeated until enough iterations of the benchmark have been measured. Keep in mind that this is only necessary if the benchmark modifies the input - if the input is constant then one input value can be reused and the benchmark should use iter instead.

Both timing loops accept a third parameter which controls how large a batch is. If the batch size is too large, we might run out of memory generating the inputs and collecting the outputs. If it's too small, we could introduce more measurement overhead than is necessary. For ease of use, Criterion provides three pre-defined choices of batch size, defined by the BatchSize enum - SmallInput, LargeInput and PerIteration. It is also possible (though not recommended) to set the batch size manually.

SmallInput should be the default for most benchmarks. It is tuned for benchmarks where the setup values are small (small enough that millions of values can safely be held in memory) and the output is likewise small or nonexistent. SmallInput incurs the least measurement overhead (equivalent to that of iter_with_large_drop and therefore negligible for nearly all benchmarks), but also uses the most memory.
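
As a concrete example, a sort benchmark that needs a fresh unsorted vector for every iteration could be written like the sketch below (assuming a Criterion.rs version that provides iter_batched and BatchSize; the benchmark name and data size are arbitrary):

use criterion::{BatchSize, Criterion};

fn sort_benchmark(c: &mut Criterion) {
    // Generate the unsorted data once; each iteration gets its own clone.
    let data: Vec<u64> = (0..10_000).rev().collect();

    c.bench_function("sort 10k", move |b| {
        b.iter_batched(
            || data.clone(),   // setup: runs outside the measurement
            |mut v| v.sort(),  // routine: the only part that is timed
            BatchSize::SmallInput,
        )
    });
}

Cloning in the setup closure keeps the data generation out of the measurement while still giving the sort a fresh, unsorted input each time.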

LargeInput should be used if the input or output of the benchmark is large enough that SmallInput uses too much memory. LargeInput incurs slightly more measurement overhead than SmallInput, but the overhead is still small enough to be negligible for almost all benchmarks.

PerIteration forces the batch size to one. That is, it generates a single setup input, times the execution of the function once, discards the setup and output, then repeats. This results in a great deal of measurement overhead - several orders of magnitude more than the other options. It can be enough to affect benchmarks into the hundreds-of-nanoseconds range. Using PerIteration should be avoided wherever possible. However, it is sometimes necessary if the input or output of the benchmark is extremely large or holds a limited resource like a file handle.

Although sticking to the pre-defined settings is strongly recommended, Criterion.rs does allow users to choose their own batch size if necessary. This can be done with BatchSize::NumBatches or BatchSize::NumIterations, which specify the number of batches per sample or the number of iterations per batch respectively. These options should be used only when necessary, as they require the user to tune the settings manually to get accurate results. However, they are provided as an option in case the pre-defined options are all unsuitable. NumBatches should be preferred over NumIterations as it will typically have less measurement overhead, but NumIterations provides more control over the batch size which may be necessary in some situations.

What do I do if my function's runtime is smaller than the measurement overhead?

Criterion.rs' timing loops are carefully designed to minimize the measurement overhead as much as possible. For most benchmarks the measurement overhead can safely be ignored because the true runtime of most benchmarks will be very large relative to the overhead. However, benchmarks with a runtime that is not much larger than the overhead can be difficult to measure.

If you believe that your benchmark is small compared to the measurement overhead, the first option is to adjust the timing loop to reduce the overhead. Using iter or iter_batched with SmallInput should be the first choice, as these options incur a minimum of measurement overhead. In general, using iter_batched with larger batches produces less overhead, so replacing PerIteration with NumIterations with a suitable batch size will typically reduce the overhead. It is possible for the batch size to be too large, however, which will increase (rather than decrease) overhead.

If this is not sufficient, the only recourse is to benchmark a larger function. It's tempting to do this by manually executing the routine a fixed number of times inside the benchmark, but this is equivalent to what NumIterations already does. The only difference is that Criterion.rs can account for NumIterations and show the correct runtime for one iteration of the function rather than many. Instead, consider benchmarking at a higher level.

It's important to stress that measurement overhead only matters for very fast functions which modify their input. Slower functions (roughly speaking, anything at the nanosecond level or larger, or the microsecond level for PerIteration, assuming a reasonably modern x86_64 processor and OS or equivalent) are not meaningfully affected by measurement overhead. For functions which only read their input and do not modify or consume it, one value can be shared by all iterations using the iter loop, which has effectively no overhead.

Deprecated Timing Loops

In older Criterion.rs benchmarks (pre-0.2.10), one might see two more timing loops, called iter_with_setup and iter_with_large_setup. iter_with_setup is equivalent to iter_batched with PerIteration. iter_with_large_setup is equivalent to iter_batched with NumBatches(1). Both produce much more measurement overhead than SmallInput, and iter_with_large_setup also uses much more memory. Both should be updated to use iter_batched, preferably with SmallInput. They are kept for backwards-compatibility reasons, but no longer appear in the API documentation.

Analysis Process

This page details the data collection and analysis process used by Criterion.rs. This is a bit more advanced than the user guide; it is assumed the reader is somewhat familiar with statistical concepts. In particular, the reader should know what bootstrap sampling means.

So, without further ado, let's start with a general overview. Each benchmark in Criterion.rs goes through four phases:

  • Warmup - The routine is executed repeatedly to fill the CPU and OS caches and (if applicable) give the JIT time to compile the code
  • Measurement - The routine is executed repeatedly and the execution times are recorded
  • Analysis - The recorded samples are analyzed and distilled into meaningful statistics, which are then reported to the user
  • Comparison - The performance of the current run is compared to the stored data from the last run to determine whether it has changed, and if so by how much

Warmup

The first step in the process is warmup. In this phase, the routine is executed repeatedly to give the OS, CPU and JIT time to adapt to the new workload. This helps prevent things like cold caches and JIT compilation time from throwing off the measurements later. The warmup period is controlled by the warm_up_time value in the Criterion struct.

The warmup period is quite simple. The routine is executed once, then twice, four times and so on until the total accumulated execution time is greater than the configured warm up time. The number of iterations that were completed during this period is recorded, along with the elapsed time.
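
In outline, the warmup loop looks something like the sketch below; it is a reconstruction from the description above rather than Criterion.rs' actual code, and `routine` stands in for the benchmarked function:

use std::time::{Duration, Instant};

/// Run `routine` with doubling iteration counts until at least `warm_up_time`
/// has elapsed; return the total iterations completed and the elapsed time.
fn warm_up<F: FnMut()>(mut routine: F, warm_up_time: Duration) -> (u64, Duration) {
    let start = Instant::now();
    let mut iters_this_round = 1u64;
    let mut total_iters = 0u64;

    loop {
        for _ in 0..iters_this_round {
            routine();
        }
        total_iters += iters_this_round;

        let elapsed = start.elapsed();
        if elapsed >= warm_up_time {
            return (total_iters, elapsed);
        }
        iters_this_round *= 2; // 1, 2, 4, 8, ...
    }
}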

Measurement

The measurement phase is when Criterion.rs collects the performance data that will be analyzed and used in later stages. This phase is mainly controlled by the measurement_time value in the Criterion struct.

The measurements are done in a number of samples (see the sample_size parameter). Each sample consists of one or more (typically many) iterations of the routine. The elapsed time between the beginning and the end of the iterations, divided by the number of iterations, gives an estimate of the time taken by each iteration.

As measurement progresses, the sample iteration counts are increased. Suppose that the first sample contains 10 iterations. The second sample will contain 20, the third will contain 30 and so on. More formally, the iteration counts are calculated like so:

iterations = [d, 2d, 3d, ... Nd]

Where N is the total number of samples and d is a factor, calculated from the rough estimate of iteration time measured during the warmup period, which is used to scale the number of iterations to meet the configured measurement time. Note that d cannot be less than 1, and therefore the actual measurement time may exceed the configured measurement time if the iteration time is large or the configured measurement time is small.
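
Since the samples contain d + 2d + ... + Nd = d * N * (N + 1) / 2 iterations in total, d can be derived from the warmup estimate roughly as in the sketch below. This is a reconstruction consistent with the description above, not Criterion.rs' exact code:

/// Choose the factor `d` so that N samples with iteration counts d, 2d, ..., Nd
/// take approximately `measurement_time_ns` in total, given the per-iteration
/// time estimated during the warmup.
fn iteration_factor(
    measurement_time_ns: f64,
    warmup_ns_per_iter: f64,
    num_samples: u64,
) -> u64 {
    let n = num_samples as f64;
    // Total iterations across all samples when d = 1.
    let total_iters = n * (n + 1.0) / 2.0;
    let d = measurement_time_ns / (warmup_ns_per_iter * total_iters);
    // d cannot be less than 1, so slow routines may overshoot the configured time.
    d.ceil().max(1.0) as u64
}

For example, with a 5 s measurement time, roughly 27 us per iteration, and on the order of 100 samples, this puts d in the high thirties, which is in line with the ~188,000 total iterations estimated in the Getting Started output.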

Note that Criterion.rs does not measure each individual iteration, only the complete sample. The resulting samples are stored for use in later stages. The sample data is also written to the local disk so that it can be used in the comparison phase of future benchmark runs.

Analysis

During this phase Criterion.rs calculates useful statistics from the samples collected during the measurement phase.

Outlier Classification

The first step in analysis is outlier classification. Each sample is classified using a modified version of Tukey's Method, which will be summarized here. First, the interquartile range (IQR) is calculated from the difference between the 25th and 75th percentile. In Tukey's Method, values less than (25th percentile - 1.5 * IQR) or greater than (75th percentile + 1.5 * IQR) are considered outliers. Criterion.rs creates additional fences at (25pct - 3 * IQR) and (75pct + 3 * IQR); values outside that range are considered severe outliers.
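
A simplified sketch of that classification is shown below; Criterion.rs additionally distinguishes high from low outliers and uses interpolated percentiles rather than raw indices:

/// Count mild and severe outliers in a sorted slice of per-iteration times,
/// using Tukey's fences at 1.5 * IQR and 3 * IQR.
fn classify_outliers(sorted: &[f64]) -> (usize, usize) {
    // Crude percentiles by index; good enough for illustration.
    let q1 = sorted[sorted.len() / 4];
    let q3 = sorted[(sorted.len() * 3) / 4];
    let iqr = q3 - q1;

    let (mild_lo, mild_hi) = (q1 - 1.5 * iqr, q3 + 1.5 * iqr);
    let (severe_lo, severe_hi) = (q1 - 3.0 * iqr, q3 + 3.0 * iqr);

    let mut mild = 0;
    let mut severe = 0;
    for &x in sorted {
        if x < severe_lo || x > severe_hi {
            severe += 1;
        } else if x < mild_lo || x > mild_hi {
            mild += 1;
        }
    }
    (mild, severe)
}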

Outlier classification is important because the analysis method used to estimate the average iteration time is sensitive to outliers. Thus, when Criterion.rs detects outliers, a warning is printed to inform the user that the benchmark may be less reliable. Additionally, a plot is generated showing which data points are considered outliers, where the fences are, etc.

Note, however, that outlier samples are not dropped from the data, and are used in the following analysis steps along with all other samples.

Linear Regression

The samples collected from a good benchmark should form a rough line when plotted on a chart showing the number of iterations and the time for each sample. The slope of that line gives an estimate of the time per iteration. A single estimate is difficult to interpret, however, since it contains no context. A confidence interval is generally more helpful. In order to generate a confidence interval, a large number of bootstrap samples are generated from the measured samples. A line is fitted to each of the bootstrap samples, and the result is a statistical distribution of slopes that gives a reliable confidence interval around the single estimate calculated from the measured samples.

This resampling process is repeated to generate the mean, standard deviation, median and median absolute deviation of the measured iteration times as well. All of this information is printed to the user and charts are generated. Finally, if there are saved statistics from a previous run, the two benchmark runs are compared.
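
The slope bootstrap can be sketched as follows, assuming a simple through-the-origin least-squares fit and a toy xorshift PRNG for resampling; Criterion.rs' actual regression, random sampling, and percentile handling are more careful than this:

/// Bootstrap a rough 95% confidence interval for the slope of
/// sample_time = slope * iteration_count.
fn bootstrap_slope_ci(iters: &[f64], times: &[f64], resamples: usize) -> (f64, f64) {
    // Least-squares slope of a line through the origin.
    fn slope(xs: &[f64], ys: &[f64]) -> f64 {
        let num: f64 = xs.iter().zip(ys).map(|(x, y)| x * y).sum();
        let den: f64 = xs.iter().map(|x| x * x).sum();
        num / den
    }

    let n = iters.len();
    let mut slopes = Vec::with_capacity(resamples);
    let mut state = 0x2545_F491_4F6C_DD1Du64; // xorshift64 state, for the sketch only

    for _ in 0..resamples {
        let (mut xs, mut ys) = (Vec::with_capacity(n), Vec::with_capacity(n));
        for _ in 0..n {
            // Resample the measured (iterations, time) pairs with replacement.
            state ^= state << 13;
            state ^= state >> 7;
            state ^= state << 17;
            let idx = (state % n as u64) as usize;
            xs.push(iters[idx]);
            ys.push(times[idx]);
        }
        slopes.push(slope(&xs, &ys));
    }

    slopes.sort_by(|a, b| a.partial_cmp(b).unwrap());
    // The 2.5th and 97.5th percentiles bound the 95% confidence interval.
    (slopes[resamples * 25 / 1000], slopes[resamples * 975 / 1000])
}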

Comparison

In the comparison phase, the statistics calculated from the current benchmark run are compared against those saved by the previous run to determine if the performance has changed in the meantime, and if so, by how much.

Once again, Criterion.rs generates many bootstrap samples, based on the measured samples from the two runs. The new and old bootstrap samples are compared and their T score is calculated using a T-test. The fraction of the bootstrapped T scores which are more extreme than the T score calculated by comparing the two measured samples gives the probability that the observed difference between the two sets of samples is merely by chance. Thus, if that probability is very low or zero, Criterion.rs can be confident that there is truly a difference in execution time between the two samples. In that case, the mean and median differences are bootstrapped and printed for the user, and the entire process begins again with the next benchmark.

This process can be extremely sensitive to changes, especially when combined with a small, highly deterministic benchmark routine. In these circumstances even very small changes (eg. differences in the load from background processes) can change the measurements enough that the comparison process detects an optimization or regression. Since these sorts of unpredictable fluctuations are rarely of interest while benchmarking, there is also a configurable noise threshold. Optimizations or regressions within (for example) +-1% are considered noise and ignored. It is best to benchmark on a quiet computer where possible to minimize this noise, but it is not always possible to eliminate it entirely.

Frequently Asked Questions

How Should I Run Criterion.rs Benchmarks In A CI Pipeline?

Criterion.rs benchmarks can be run as part of a CI pipeline just as they normally would on the command line - simply run cargo bench.

To compare the master branch to a pull request, you could run the benchmarks on the master branch to set a baseline, then run them again with the pull request branch. An example script for Travis-CI might be:

#!/usr/bin/env bash

if [ "${TRAVIS_PULL_REQUEST_BRANCH:-$TRAVIS_BRANCH}" != "master" ] && [ "$TRAVIS_RUST_VERSION" == "nightly" ]; then
    REMOTE_URL="$(git config --get remote.origin.url)";
    cd ${TRAVIS_BUILD_DIR}/.. && \
    git clone ${REMOTE_URL} "${TRAVIS_REPO_SLUG}-bench" && \
    cd  "${TRAVIS_REPO_SLUG}-bench" && \
    # Bench master
    git checkout master && \
    cargo bench && \
    # Bench pull request
    git checkout ${TRAVIS_COMMIT} && \
    cargo bench;
fi

(Thanks to BeachApe for the script on which this is based.)

Note that cloud CI providers like Travis-CI and Appveyor introduce a great deal of noise into the benchmarking process; for example, there may be unpredictable load on the physical hosts of their build VMs. Benchmarks measured on such services tend to be unreliable, so you should be skeptical of the results. In particular, benchmarks that detect performance regressions should not cause the build to fail, and apparent performance regressions should be verified manually before rejecting a pull request.

cargo bench Gives "Unrecognized Option" Errors for Valid Command-line Options

By default, Cargo implicitly adds a libtest benchmark harness to your crate when benchmarking, to handle any #[bench] functions, even if you have none. It compiles and runs this executable first, before any of the other benchmarks. Normally, this is fine - it detects that there are no libtest benchmarks to execute and exits, allowing Cargo to move on to the real benchmarks. Unfortunately, it checks the command-line arguments first, and panics when it finds one it doesn't understand. This causes Cargo to stop benchmarking early, and it never executes the Criterion.rs benchmarks.

This will occur when running cargo bench with any argument that Criterion.rs supports but libtest does not. For example, --verbose and --save-baseline will cause this issue, while --help will not. There are two ways to work around this at present:

You could run only your Criterion benchmark, like so:

cargo bench --bench my_benchmark -- --verbose

Note that my_benchmark here corresponds to the name of your benchmark in your Cargo.toml file.

Another option is to disable benchmarks for your lib or app crate. For example, for library crates, you could add this to your Cargo.toml file:

[lib]
bench = false

Of course, this only works if you define all of your benchmarks in the benches directory.

See Rust Issue #47241 for more details.

How Should I Benchmark Small Functions?

Exactly the same way as you would benchmark any other function.

It is sometimes suggested that benchmarks of small (nanosecond-scale) functions should iterate the function to be benchmarked many times internally to reduce the impact of measurement overhead. This is not required with Criterion.rs, and it is not recommended.

To see this, consider the following benchmark:


fn compare_small(c: &mut Criterion) {
    use criterion::black_box;
    use criterion::ParameterizedBenchmark;

    c.bench(
        "small",
        ParameterizedBenchmark::new("unlooped", |b, i| b.iter(|| i + 10), vec![10])
            .with_function("looped", |b, i| b.iter(|| {
                for _ in 0..10000 {
                    black_box(i + 10);
                }
            }))
    );
}

This benchmark simply adds two numbers - just about the smallest operation that could be benchmarked. On my computer, this produces the following output:

small/unlooped          time:   [270.00 ps 270.78 ps 271.56 ps]
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high severe
small/looped            time:   [2.7051 us 2.7142 us 2.7238 us]
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe

2.714 microseconds/10000 gives 271.4 picoseconds, or pretty much the same result. Interestingly, this is slightly more than one cycle of my 4th-gen Core i7's maximum clock frequency of 4.4 GHz, which shows how good the pipelining is on modern CPUs. Regardless, Criterion.rs is able to accurately measure functions all the way down to single instructions. See the Analysis Process page for more details on how Criterion.rs performs its measurements, or see the Timing Loops page for details on choosing a timing loop to minimize measurement overhead.

When Should I Use criterion::black_box?

black_box is a function which prevents certain compiler optimizations. Benchmarks are often slightly artificial in nature and the compiler can take advantage of that to generate faster code when compiling the benchmarks than it would in real usage. In particular, it is common for benchmarked functions to be called with constant parameters, and in some cases rustc can evaluate the function entirely at compile time and replace the function call with a constant. This can produce unnaturally fast benchmarks that don't represent how some code would perform when called normally. Therefore, it's useful to black-box the constant input to prevent this optimization.

However, you might have a function which you expect to be called with one or more constant parameters. In this case, you might want to write your benchmark to represent that scenario instead, and allow the compiler to optimize the constant parameters.

For the most part, Criterion.rs handles this for you - if you use parameterized benchmarks, the parameters are automatically black-boxed by Criterion.rs so you don't need to do anything. If you're writing an un-parameterized benchmark of a function that takes an argument, however, this may be worth considering.
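
For example, the Getting Started benchmark could black-box its constant argument like this minimal sketch:

use criterion::{black_box, Criterion};

fn fibonacci(n: u64) -> u64 {
    match n {
        0 | 1 => 1,
        n => fibonacci(n - 1) + fibonacci(n - 2),
    }
}

fn criterion_benchmark(c: &mut Criterion) {
    // black_box(20) hides the constant from the optimizer, so rustc cannot
    // specialize or pre-compute fibonacci(20) at compile time.
    c.bench_function("fib 20", |b| b.iter(|| fibonacci(black_box(20))));
}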

Cargo Prints a Warning About Explicit [[bench]] Sections in Cargo.toml

Currently, Cargo treats any *.rs file in the benches directory as a benchmark, unless there are one or more [[bench]] sections in the Cargo.toml file. In that case, the auto-discovery is disabled entirely.

In Rust 2018 edition, Cargo will be changed so that [[bench]] no longer disables the auto-discovery. If your benches directory contains source files that are not benchmarks, this could break your build when you update, as Cargo will attempt to compile them as benchmarks and fail.

There are two ways to prevent this breakage from happening. You can explicitly turn off the autodiscovery like so:

[package]
autobenches = false

The other option is to move those non-benchmark files to a subdirectory (eg. benches/benchmark_code) where they will no longer be detected as benchmarks. I would recommend the latter option.

Note that a file which contains a criterion_main! is a valid benchmark and can safely stay where it is.