This document describes two tools designed for use with programs that use the Intel® Cilk™ Plus extensions to the C and C++ languages: the Intel® Cilk™ screen race detector and the Intel® Cilk™ view scalability and performance analyzer.
These tools only work with the Intel® Cilk™ portion of the Intel Cilk Plus extensions. They do not currently interact with the C/C++ Extensions for Array Notations (CEAN) portion of Intel Cilk Plus.
The Intel® Cilk screen (Cilk screen) and Intel® Cilk view (Cilk view) tools require that one of the following be installed:
These two products include many software development components, including the Intel Cilk Plus run-time system and the Intel® C++ Compiler, which supports the Intel Cilk Plus keywords. For more information about these products, see http://www.intel.com/software/products/.
If you run the Cilk screen or Cilk view tools on a program built with an older version of the Intel Cilk Plus run-time system, the tools will report that no Intel Cilk Plus code was found.
There is a known problem running a 32-bit application under the Cilk screen tool on 64-bit Windows operating systems. This problem is being worked on.
The Cilk view and Cilk screen tools are included in the Intel® Cilk™ Plus SDK, which is available from: http://whatif.intel.com.
The Cilk screen race detector (cilkscreen command) monitors the actual operation of an Intel® Cilk™ Plus program as run with some test input. The Cilk screen tool reports all data races encountered during execution. By monitoring program execution, Cilk screen can detect races in your production binary, and can even detect races produced by third-party libraries for which you may not have source code.
To ensure a reliable parallel program, you should review and resolve all races that Cilk screen reports. You can instruct Cilk screen to ignore benign races.
Cilk screen runs your program on a single worker and monitors all memory reads and writes. When the program terminates, it displays information about read/write and write/write conflicts. A race is reported if any possible schedule of the program could produce results different from the serial program execution.
To identify races, run Cilk screen on an Intel Cilk Plus program using appropriate test input data sets, much as you would do to run regression tests. Test cases should be selected so that you have complete code coverage and, ideally, complete path coverage. Cilk screen can only analyze code that is executed, and races may only occur under certain data and execution path combinations.
Please note that Cilk screen only detects and reports races that result from parallelism created by the Intel Cilk Plus keywords. Cilk screen will NOT report races that result from conflicts between threads created using other parallel libraries or threads created explicitly using system calls.
Cilk screen recognizes and correctly manages many special cases, including:
Because Cilk screen understands the parallel structure of your program (that is, which strands can run in parallel and which always run in series), its race-detection algorithm is provably correct.
In this chapter, we explain how to use Cilk screen and how to interpret its output, document the calls that your program can make to control Cilk screen, and discuss the performance overhead that Cilk screen incurs.
To run Cilk screen, either use the command line or the Microsoft Visual Studio* Tools menu.
The cilkscreen command is:
cilkscreen [cilkscreen options] [--] your_program [program options]
Within Visual Studio*, click the "Tools" menu, and select "Run Cilkscreen Race Detector" from the "Intel Cilk Plus Tools" submenu. Cilk screen runs the program using the command arguments specified in the debugging property of the project.
You can run production or debug versions of your program under Cilk screen. Debug builds typically contain more symbolic information, while release (optimized) builds may include optimizations that make it hard to relate races back to your source code. However, binary code differences may lead to races that are seen only in one version or the other. We recommend testing with debug builds during development, and verifying that production builds are correct before distributing your Intel Cilk Plus program.
To ensure high reliability, we recommend that you run Cilk screen with a variety of input data sets, as different input data may expose different code paths in your program. Because the resolution of one race may expose or create a previously unreported race, you should run Cilk screen after any program changes until you achieve race-free operation. For performance reasons, you may choose to use smaller data sets than you might use for other regression testing. For example, we test our qsort examples for races using a small array of elements, but run performance tests with large arrays.
Two of the examples include intentional data races. You can build these examples and run them under Cilk screen to see the race reports.
The cilkscreen command recognizes the following options. These options are not available when you run Cilk screen from within Visual Studio.
You can optionally use "--" to separate the name of the program from the options.
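For example, assuming a program named myapp that itself takes a -r option (report.txt and input.dat are illustrative names), the "--" keeps the cilkscreen options separate from the program's own options:

cilkscreen -r report.txt -- myapp -r input.dat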
Unless specified otherwise, output goes to stderr (by default, the console).
| Option | Description |
|---|---|
| -r reportfile | Write output in ASCII format to reportfile. Note that Windows GUI applications do not have stderr, so output will be lost unless you use the -r option. |
| -x xmlfile | Write cilkscreen command output in XML format to xmlfile. The -x option overrides the -r option; you cannot use both. |
| -a | Report all race conditions. The same condition may be reported multiple times. Normally, a race condition is reported only once. |
| -d | Output verbose debugging information, including detailed trace information such as DLL loading. Unless the -l option is specified, the debug output is written to stderr. |
| -l logfile | Write debugging information to logfile. This file is created only if the -d option is specified. |
| -p [n] | Pause n seconds (default is one second) before starting the Cilk screen process. |
| -s | Display the command passed to PIN. PIN is the dynamic instrumentation package that Cilk screen uses to monitor instructions. |
| -h, -? | Display cilkscreen command usage and exit. |
| -v | Display version information and exit. |
Consider the following program sum.cpp:
01 #include <cilk/cilk.h>
02 #include <iostream>
03
04 int sum = 0;
05
06 void f(int arg)
07 {
08     sum += arg;
09 }
10
11 int main()
12 {
13     cilk_spawn f(1);
14     cilk_spawn f(2);
15
16     cilk_sync;
17     std::cout << "Sum is " << sum << std::endl;
18     return 0;
19 }
Build the program with debugging information and run Cilk screen:
| Windows* OS | Linux* OS |
|---|---|
| icl /Zi sum.cpp | icc -g sum.cpp |
| cilkscreen sum.exe | cilkscreen sum |
On Windows, Cilk screen generates output similar to this:
Cilkscreen Race Detector V2.0.0, Build 950 for Intel64

Race condition on location 000000014000A1A0
  write access at 00000001400017B1: (C:\sum.cpp:8, sum.exe!f+0x45)
  read access at 00000001400017AB: (C:\sum.cpp:8, sum.exe!f+0x3f)
    called by 0000000140001916: (C:\sum.cpp:14, sum.exe!main+0x158)
    called by 0000000140004367: (f:\dd\vctools\crt_bld\self_64_amd64\crt\src\crtexe.c:597, sum.exe!__tmainCRTStartup+0x197)
  Variable: 000000014000A1A0 - sum

Race condition on location 000000014000A1A0
  write access at 00000001400017B1: (C:\sum.cpp:8, sum.exe!f+0x45)
  write access at 00000001400017B1: (C:\sum.cpp:8, sum.exe!f+0x45)
    called by 0000000140001916: (C:\sum.cpp:14, sum.exe!main+0x158)
    called by 0000000140004367: (f:\dd\vctools\crt_bld\self_64_amd64\crt\src\crtexe.c:597, sum.exe!__tmainCRTStartup+0x197)
  Variable: 000000014000A1A0 - sum

Sum is 3
2 errors found by Cilkscreen
Cilkscreen suppressed 1 duplicate error messages
Here is what the output means:
The tool detected a race condition at memory location 000000014000A1A0.
The first access that participated in the race was a write access from location 00000001400017B1 in the program. This corresponds to source line 8 in sum.cpp, from the instruction at offset 0x45 in the function f() loaded from the executable binary file sum.exe. From the program listing, we see that the source code at line 8 is
sum += arg;
The second access was a read from location 00000001400017AB in the program. This also corresponds to source line 8 in sum.cpp, but from the instruction at offset 0x3f in f(). The write access occurs when storing the new value back to sum. The read access occurs when the value of sum is read from memory in order to add the value of arg to it.
Cilk screen displays a stack trace for the second memory reference involved in the race. It would be very expensive in both space and time to store the stack trace for every memory reference. It is much cheaper to record the stack trace only when a race is detected. Here we see that the caller was at line 14 in sum.cpp, called from offset 0x158 in the function main. We can see in the source code that line 14 is the cilk_spawn:
cilk_spawn f(2);
The next line of the stack trace shows that main was called from the C runtime library function __tmainCRTStartup to start the application.
The last line in the block shows us the name of the variable involved in the race. This information is not always available.
The next block of output displays the write/write race that was detected on the same memory location. This race occurs between the two attempts to update sum that occur in parallel. The output is very similar to the race we just described.
Finally, Cilk screen summarizes the results. There were two distinct races reported. A third race report was a duplicate of one of the two, so it was suppressed.
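One way to eliminate this particular race (not shown in the original example) is to replace the shared global with an Intel Cilk Plus reducer. The following is a minimal sketch, assuming the cilk/reducer_opadd.h header from the Intel Cilk Plus reducer library and its classic += / get_value() interface:

#include <cilk/cilk.h>
#include <cilk/reducer_opadd.h>
#include <iostream>

// A reducer gives each strand its own view of the accumulator, so the
// parallel updates no longer conflict and no lock is needed.
cilk::reducer_opadd<int> sum(0);

void f(int arg)
{
    sum += arg;    // each strand updates its own view
}

int main()
{
    cilk_spawn f(1);
    cilk_spawn f(2);
    cilk_sync;

    // The strand-local views have been merged by the time the sync completes.
    std::cout << "Sum is " << sum.get_value() << std::endl;
    return 0;
}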
For the advanced user, Cilk screen provides a set of macros defined in cilkscreen.h that you can use in your program. If you are using the Intel® C++ Compiler (which compiles both C and C++), these macros generate metadata to implement the requested operation. This mechanism allows you to customize Cilk screen from within your program without measurable cost when you do not run your program under Cilk screen. If you are not using the Intel C++ Compiler, these macros are compiled away.
Cilk screen begins instrumenting your program when you enter the first spawning function (a function that contains a cilk_spawn keyword). This instrumentation is expensive, and there are some cases when it could make sense to disable all instrumentation while running part of your program. Enabling instrumentation is itself very expensive, so these calls should only be used to speed up your program under Cilk screen when you have very large serial sections of code. For example, it might make sense to disable instrumentation if you have a very expensive, serial C++ initialization routine that is called within your spawning function before any parallel constructs are invoked.
__cilkscreen_disable_instrumentation();
__cilkscreen_enable_instrumentation();
For example:
#include <cilk/cilk.h>
#include <cilktools/cilkscreen.h>

int main()
{
    __cilkscreen_disable_instrumentation();
    very_expensive_initialization();
    __cilkscreen_enable_instrumentation();

    cilk_spawn parallel_function1();
    parallel_function2();

    return 0;
}
You can turn off race checking without disabling instrumentation. This mechanism is not as safe as a fake lock, but may be useful in limited circumstances. Enabling and disabling checking is much cheaper than enabling and disabling instrumentation; however, with checking disabled, Cilk screen continues to instrument your program, with all of the corresponding overhead, whereas disabling instrumentation allows your program to run at full speed. While checking is disabled, Cilk screen does not record any memory accesses.
__cilkscreen_disable_checking();
__cilkscreen_enable_checking();
For example, Cilk screen will not report a race in this program, even though there is in fact a race:
#include <cilk/cilk.h>
#include <cilktools/cilkscreen.h>
#include <iostream>

int sum = 0;

void f(int arg)
{
    sum += arg;
}

int main()
{
    __cilkscreen_disable_checking();
    cilk_spawn f(1);
    cilk_spawn f(2);
    __cilkscreen_enable_checking();

    cilk_sync;
    std::cout << "Sum is " << sum << std::endl;
    return 0;
}
Note that enabling and disabling checking is managed with a single counter that starts at 0, is decremented each time checking is disabled, and incremented each time checking is enabled. Therefore, these calls may be nested; checking is only re-enabled when the counter returns to 0. Do not enable checking if it has not been disabled; doing so is a fatal runtime error and will abort your program.
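Because the calls nest, a helper routine can safely bracket its own work with these calls even when its caller has already disabled checking. A minimal sketch of the counter behavior (the counter values in the comments simply illustrate the mechanism described above):

__cilkscreen_disable_checking();       // counter 0 -> -1: checking off
    __cilkscreen_disable_checking();   // counter -1 -> -2: still off (e.g. inside a helper routine)
    /* ... memory accesses here are not recorded ... */
    __cilkscreen_enable_checking();    // counter -2 -> -1: still off
__cilkscreen_enable_checking();        // counter -1 -> 0: checking resumes here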
Cilk screen tracks commonly used locks: critical sections on Windows* and pthread mutexes on Linux* operating systems. If you are using another type of lock, you may need to inform Cilk screen when it is acquired or released.
__cilkscreen_acquire_lock(lock)
__cilkscreen_release_lock(lock)
The acquire and release lock macros take the address or handle of your lock.
If you are confident that a race in your program is benign, you can suppress the Cilk screen race report by using a fake lock. Simply pass the address of a variable to __cilkscreen_acquire_lock and __cilkscreen_release_lock, and Cilk screen will believe that the region of code is protected by a lock.
Remember that a data race, by definition, only occurs when conflicting accesses to memory are not protected by the same lock. Use a fake lock to pretend to hold and release a lock. This will inhibit race reports with very low run-time cost, as no lock is actually acquired.
Do not forget to call __cilkscreen_release_lock in all paths through your code! C++ users may find it helpful to create a small class which calls __cilkscreen_acquire_lock when it is constructed, and __cilkscreen_release_lock when it goes out of scope and is destructed.
For example:
class fake_lock
{
    int _lock;
public:
    fake_lock()  { __cilkscreen_acquire_lock(&_lock); }
    ~fake_lock() { __cilkscreen_release_lock(&_lock); }
};
For example, Cilk screen will not report a race in this program:
#include <cilk/cilk.h>
#include <cilktools/cilkscreen.h>

class fake_lock
{
    int _lock;
public:
    fake_lock()  { __cilkscreen_acquire_lock(&_lock); }
    ~fake_lock() { __cilkscreen_release_lock(&_lock); }
};

bool isUpdated = false;

void f(int arg)
{
    fake_lock l;
    isUpdated = true;
}

int main()
{
    cilk_spawn f(1);
    f(2);
    cilk_sync;
    return 0;
}
While you can use __cilkscreen_acquire_lock and __cilkscreen_release_lock to implement your own fake locks, an implementation of the fake_lock class is provided with the Cilk Tools distribution, along with a companion class to tie the acquisition and release of the lock to the scope of the guard variable. The following program demonstrates these classes:
#include <cilk/cilk.h>
#include <cilktools/fake_mutex.h>
#include <cilktools/lock_guard.h>

// Global variable we'll race on
bool isUpdated = false;

// Lock to mediate access to the global variable
cilkscreen::fake_mutex m;

void f(int arg)
{
    // The lock "m" will be acquired in the lock_guard constructor, and
    // released when g goes out of scope
    cilkscreen::lock_guard<cilkscreen::fake_mutex> g(m);
    isUpdated = true;
}

int main()
{
    cilk_spawn f(1);
    f(2);
    cilk_sync;
    return 0;
}
You can use cilkscreen::lock_guard with your own classes that provide lock and unlock methods.
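For example, a hypothetical user-defined lock type (my_spinlock below is an illustrative name, not part of the Cilk Tools distribution) can be used with cilkscreen::lock_guard as long as it provides lock() and unlock():

#include <cilktools/lock_guard.h>

class my_spinlock
{
    // ... real locking state would go here ...
public:
    void lock()   { /* acquire the underlying lock */ }
    void unlock() { /* release the underlying lock */ }
};

my_spinlock m;

void f()
{
    // Acquired in the lock_guard constructor, released when g is destructed.
    cilkscreen::lock_guard<my_spinlock> g(m);
    // ... code protected by m ...
}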
Cilk screen will recognize locks implemented by any package that builds locks based on the underlying operating system primitives. In addition, Cilk screen includes support for TBB spinlocks. Note that you must define the TBB_USE_THREADING_TOOLS macro so that TBB will generate the ITTNOTIFY calls that tell Cilk screen about lock acquisition and release. Failure to define this macro will result in errors reporting that a lock was held across strand boundaries.
#include <cilk/cilk.h>

// Cause TBB to notify tools about lock usage
#define TBB_USE_THREADING_TOOLS 1
#include <tbb/spin_rw_mutex.h>

tbb::spin_rw_mutex countMutex;

int foo()
{
    int z = 0;
    cilk_for (int i = 0; i < 10; i++)
    {
        tbb::spin_rw_mutex::scoped_lock lock(countMutex);
        z *= i;
    }
    return z;
}

int main()
{
    foo();
    return 0;
}
Cilk screen needs to recognize when memory is allocated and freed. Otherwise, Cilk screen would report a race if you allocate a memory block, access it, free it, and then allocate another block of memory at the same address. The early accesses would appear to race with the later accesses to the same addresses, even though the memory is logically fresh.
Cilk screen recognizes memory allocation routines provided by standard system libraries, operating system calls, and Miser. If you have routines in your program that act like memory allocators (for example, look-aside lists, sub-allocators, or memory pools) then you can inform Cilk screen when memory should be considered "clean" and thus any previous accesses to that memory should be forgotten.
__cilkscreen_clean(void *begin, void *end);
Typically, your memory allocator will call __cilkscreen_clean just before returning a pointer to freshly allocated memory. For example:
void *myAllocator(size_t bytesRequested)
{
    // find memory from our local pool
    void *ptr = getPooledMemory(bytesRequested);

    if (ptr != NULL)
    {
        __cilkscreen_clean(ptr, (char *)ptr + bytesRequested);
    }

    return ptr;
}
You will notice that Cilk screen requires significantly more time and memory compared to a regular, unmonitored run of the program.
To keep track of all memory references, Cilk screen uses approximately 5 times as much memory as the program normally needs (6x for 64-bit programs). Thus, a 32-bit program that uses 100 megabytes of memory will require a total of about 600 megabytes when run under Cilk screen.
In addition, when you run your program under Cilk screen, it will slow down by a factor of about 20-40x or more. There are several factors that lead to the slowdown:
We've said that Cilk screen recognizes various aspects of your program such as memory allocation calls, program calls to Cilk screen, and the Intel Cilk Plus keywords. You may wonder whether the performance of your production program is slowed down in order to support this. In fact, Cilk screen uses a metadata mechanism that records information about the program using symbolic information stored in the executable image, but not normally loaded into memory. Thus, Cilk screen calls add virtually no overhead when your program is not run under Cilk screen.
The Cilk view scalability and performance analyzer (cilkview command) is designed to help you understand the parallel performance of your Intel Cilk Plus program. Cilk view can:
In this chapter, we will illustrate how to use Cilk view with a sample Intel Cilk Plus program, document how to control what data is collected and displayed using both API and command line options, and explain how to interpret the output.
Like Cilk screen, Cilk view monitors your program as it runs on a single worker. To estimate performance on multiple cores, Cilk view makes some assumptions:
To run Cilk view, either use the command line or the Microsoft Visual Studio* Tools menu.
Run Cilk view from the command line using the cilkview command:
cilkview [options] program [program options]
The resulting report looks something like this:
Cilkscreen Scalability Analyzer V2.0.0, Build 950

1) Parallelism Profile
   Work :                                    53,924,334 instructions
   Span :                                    16,592,885 instructions
   Burdened span :                           16,751,417 instructions
   Parallelism :                             3.25
   Burdened parallelism :                    3.22
   Number of spawns/syncs:                   100,000
   Average instructions / strand :           179
   Strands along span :                      83
   Average instructions / strand on span :   199,914
   Total number of atomic instructions :     18

2) Speedup Estimate
   2 processors:  1.31 - 2.00
   4 processors:  1.55 - 3.25
   8 processors:  1.70 - 3.25
   16 processors: 1.79 - 3.25
   32 processors: 1.84 - 3.25
The exact numbers will vary depending on the program input, which compiler and runtime are used, and the compiler options selected.
Note: The actual numbers will be different on Linux* and Windows* systems because of compiler differences. Also, the results will vary if the data is shuffled differently, changing the amount of work required by the quick sort algorithm.
The next section describes what the various numbers mean.
On Windows* OS, you can also run Cilk view from within Visual Studio*:
Cilk view results appear in a Visual Studio window after the program exits; the window displays a graph that summarizes the parallelism and predicted speedup:
View a copy of the detailed text report by right-clicking on the graph and selecting "Show Run Details" from the context menu.
Cilk view writes output into a .csv file. Select "Tools" -> "Intel Cilk Plus Tools" -> "Open Cilkview Log" to view the graph from a previously created .csv log file.
At this time, Cilk view is not fully integrated with Visual Studio. In particular, you cannot specify trial runs using the Visual Studio interface. To collect trial benchmark data, run cilkview from the command line as described above.
The cilkview command line has the form:
cilkview [options] program [program arguments]
The cilkview command options are:
| Option | Description |
|---|---|
| -trials none | Do not run any benchmark trials. This is the default. |
| -trials one [k] | Run exactly one trial with k workers (default k = number of cores detected). |
| -trials log [k] | Run trials with k, k/2, k/4, ... 1 workers (default k = number of cores detected). |
| -trials all [k] | Run trials with 1..k workers (default k = number of cores detected). |
| -no-workspan, -nw | Do not generate work and span scalability analysis. |
| -plot none | Do not run any plotting program. |
| -plot gnuplot | Run gnuplot to display the output. This is the default behavior when Cilk view is run on a Linux* system. |
| -plot cilkplot | Run cilkplot to display the output (Windows* systems only). This is the default behavior when Cilk view is run on a Windows system. Cilk Plot is distributed with the Intel® Cilk Tools. |
| -append | Append output of this run to existing data files. By default, each run creates a new set of data files. When displaying the output, cilkview selects the fastest run for each tag and thus displays the best performance seen. This helps mitigate the indeterminate effect of measuring performance while other processes may be active on your system. |
| -quiet | Do not display banners before each benchmark run. |
| -verbose | Display extra information such as the PIN command line. Primarily used for debugging. |
| -v, -version | Print version information and exit. |
| -h, -?, -help | Display cilkview command usage information and exit. |
The Cilk view report is broken into two sections, the Parallelism Profile and the Speedup Estimate.
The Parallelism Profile displays the statistics collected during a single run of the program. To get a complete picture of the parallel operation of the program, Cilk view runs the program on a single worker while tracking all spawns. If the program specifies a grain size for a cilk_for loop, the analysis is run using that grain size. For cilk_for loops that use the default grain size, the grain size for the analysis is set to 1.
In this section, we will use the term overhead to mean the overhead associated with parallelism: scheduling, stealing and synchronization.
Note that because the program is run on a single worker, new reducer views are never created or merged. Thus, the instruction counts do not include the cost of creating new views, calling the reduction function, and destroying the views.
The following statistics are shown:
| Statistic | Description |
|---|---|
| Work | The total number of instructions that were executed. Because the program is run on a single worker, no parallel overhead is included in the work count. |
| Span | The number of instructions executed on the critical path. |
| Burdened span | The number of instructions on the critical path when overhead is included. |
| Parallelism | The Work divided by the Span. This is the maximum speedup you would expect if you could run the program on an infinite number of processors with no overhead. |
| Burdened parallelism | The Work divided by the Burdened span. |
| Number of spawns/syncs | The number of spawns counted during the run of the program. |
| Average instructions / strand | The Work divided by the number of strands. A small number may indicate that the overhead of using cilk_spawn is high. |
| Strands along span | The number of strands in the critical path. |
| Average instructions / strand on span | The Span divided by the Strands along span. A small number may indicate that the scheduling overhead will be high. |
| Total number of atomic instructions | This correlates with the number of times you acquire and release locks, or use atomic instructions directly. |
Cilk view estimates the expected speedup on 2, 4, 8, 16 and 32 processor cores. The speedup estimates are displayed as ranges with lower and upper bounds.
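As a rough check against the sample report above, each upper bound appears to be capped by both the processor count and the measured parallelism of 3.25: on 2 processors the estimate tops out at 2.00, and on 4 or more processors it tops out at 3.25.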
The numbers in the profile report help you understand the parallel behavior of your program. Here are some things to look for:
These are the basic attributes of a parallel program. In a program without any parallel constructs, the work and span will be the same. Even an infinite number of workers will never execute the program more quickly than a single worker executing the span. If the two numbers are close together, there is very little computation that can be done in parallel. Look for places to add cilk_spawn and cilk_for keywords.
The parallelism is a key number, as it represents the theoretical speedup limit. In practice, we find that you usually want the parallelism to be at least 5-10 times the number of processors you will have available. If the parallelism is lower, it may be hard for the scheduler to effectively utilize all of the workers. If the parallelism is smaller than the number of processors, additional workers will remain idle.
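As an illustration with made-up numbers: a program that executes 1,000,000 instructions of serial setup and then spawns ten independent tasks of 1,000,000 instructions each has a work of about 11,000,000 instructions and a span of about 2,000,000 (the setup plus one task), so its parallelism is roughly 11,000,000 / 2,000,000 = 5.5. Even an unlimited number of workers cannot run such a program much more than 5.5 times faster than its serial execution.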
Low parallelism can result from several factors:
Burdened parallelism considers the runtime and scheduling overhead, and for typical programs, is a better estimate of the speedup that is possible in practice. If the burdened parallelism is much lower than the parallelism, the program has too much parallel overhead. Typically, this means that the amount of work spawned is too small to overcome the overhead of spawning.
To take advantage of parallelism, each strand needs to perform more work. There are several possible approaches:
This count is the average number of instructions executed between parallel control points. A small number may indicate that the overhead of using cilk_spawn is high relative to the amount of work done by the spawned function. This will typically also be reflected in the burdened parallelism number.
This count is the average number of instructions per strand on the critical path. A small number suggests that the program has a granularity problem, as explained above.
This counter is the number of atomic instructions executed by the program. This is correlated with the number of locks that are acquired, including those used in libraries and the locks implied by the direct use of atomic instructions via compiler intrinsics. If this number is high, lock contention may become a significant issue for your program.
In this section, we present a sample Intel Cilk Plus program that has a parallel performance issue, and show how Cilk view can help identify the problem.
This program uses cilk_for to perform operations over an array of elements in parallel:
static const int COUNT = 4;
static const int ITERATION = 1000000;
long arr[COUNT];
. . .
long do_work(long k)
{
    long x = 15;
    static const int nn = 87;
    for (long i = 1; i < nn; ++i)
    {
        x = x / i + k % i;
    }
    return x;
}

void repeat_work()
{
    for (int j = 0; j < ITERATION; j++)
    {
        cilk_for (int i = 0; i < COUNT; i++)
        {
            arr[i] += do_work(j * i + i + j);
        }
    }
}

int cilk_main(int argc, char* argv[])
{
    . . .
    repeat_work();
    . . .
}
This program exhibits speedup of less than 1, also known as slowdown. Running this program on 4 processors takes about 3 times longer than serial execution.
Here is the relevant Cilk view output:
1) Parallelism Profile
   . . .
   Span:                                     2,281,596,648 instructions
   Burdened span:                            32,281,638,648 instructions
   Parallelism:                              3.06
   Burdened parallelism:                     0.22
   . . .
   Average instructions / strand:            698
   Strands along span:                       5,000,006
   Average instructions / strand on span:    456

2) Speedup Estimation
   2 processors:  0.45 - 2.00
   4 processors:  0.26 - 3.06
   8 processors:  0.24 - 3.06
   16 processors: 0.23 - 3.06
   32 processors: 0.22 - 3.06
Even though the program has a parallelism measure of 3.06, the burdened parallelism is only 0.22. In other words, the overhead of running the code in parallel could eliminate the benefit obtained from parallel execution. The average instruction count per strand is less than 700 instructions. The cost of stealing can exceed that value. The tiny strand length is the problem.
Looking at the source code, we can see that the cilk_for loop is only iterating 4 times, and the amount of work per iteration is very small, as we expected from the profile statistics.
The theoretical parallelism of the program is more than 3. In principle, the 4 iterations could be split between 4 processors. However, the scheduling overhead to steal and synchronize in order to execute only a few hundred instructions overwhelms any parallel benefit.
In this program, we cannot simply convert the outer loop (over j) to a cilk_for loop because this will cause a race on arr[i]. Instead, a simple fix is to invert the loop by bringing the loop over i to the outside:
void repeat_work_revised()
{
    cilk_for (int i = 0; i < COUNT; i++)
    {
        for (int j = 0; j < ITERATION; j++)
        {
            arr[i] += do_work(j * i + i + j);
        }
    }
}
Here is the Cilk view report for the revised code:
1) Parallelism Profile
   . . .
   Span:                                     1,359,597,788 instructions
   Burdened span:                            1,359,669,788 instructions
   Parallelism:                              4.00
   Burdened parallelism:                     4.00
   . . .
   Average instructions / strand:            258,885,669
   Strands along span:                       11
   Average instructions / strand on span:    123,599,798
The revised program achieves almost perfect linear speedup on 4 processors. As Cilk view reports, the parallelism and burdened parallelism have both improved to 4.00.
By default, Cilk view reports information about a full run of the program from beginning to end. This information may include a significant amount of purely serial code such as initialization and output. Often you will want to restrict the analysis to the smaller sections of your code where you have introduced parallel constructs.
The Cilk view API allows you to control which portions of your Intel Cilk Plus program are analyzed. The API is accessed through functions and the cilkview_data_t structure defined in cilkview.h.
Here is a simple example that analyzes the whole program. When run under Cilk view, a single graph will be created. If Cilk view is not running, the functions do nothing and there is no significant performance impact on your program.
#include <cilk/cilk.h>
#include <cilkview.h>

int main()
{
    cilkview_data_t d;

    __cilkview_query(d);        // begin analyzing

    do_some_stuff();            // execute some parallel work

    __cilkview_report(&d, NULL, "main_tag", CV_REPORT_WRITE_TO_RESULTS);

    return 0;
}
In a slightly more complex example, we analyze two separate parallel regions. Cilk view will generate two graphs, labeled first_tag and second_tag.
cilkview_data_t d;

__cilkview_query(d);            // begin analyzing

do_first_stuff();               // execute some parallel work

__cilkview_report(&d, NULL, "first_tag", CV_REPORT_WRITE_TO_RESULTS);

__cilkview_query(d);            // begin analyzing second parallel region

do_different_stuff();           // execute more parallel work

__cilkview_report(&d, NULL, "second_tag", CV_REPORT_WRITE_TO_RESULTS);

You can also use the same tag repeatedly. In this case, Cilk view treats the data as independent runs of the same region, and chooses the run that finishes in the least time for the graph. The following code will analyze the fastest run of do_parallel_stuff().
for (int count = 0; count < MAX; ++count)
{
    cilkview_data_t d;

    __cilkview_query(d);
    do_parallel_stuff();
    __cilkview_report(&d, NULL, "stuff_tag", CV_REPORT_WRITE_TO_RESULTS);
}
In addition to parallel analysis, Cilk view also provides a framework for benchmarking an Intel Cilk Plus program on various numbers of processors. A benchmark run is a complete run of your Intel Cilk Plus program using a specified number of workers. Cilk view measures the total elapsed time, and plots the results. If analysis is enabled, the benchmark data is overlaid on the estimated scalability graph.
To enable benchmarking, use the -trials switch. As detailed in the reference section below, you can specify 0, 1, or multiple trial runs.
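For example, assuming an executable named qsort on an 8-core machine, a command line such as the following would run the scalability analysis and then benchmark trials on 1 through 8 workers, overlaying the measured times on the estimated scalability graph:

cilkview -trials all 8 qsort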
unsigned long long __cilkview_getticks()
Returns a counter that can be used for timing applications. The timer resolution is milliseconds. This function will always return valid values, even if the application is not running under Cilk view.
void __cilkview_puts(arg)
Writes a string to the Cilk view log. Be sure to include any required newlines.
This function does nothing if the application is not running under Cilk view, and will be compiled away if not compiled using the Intel C++ Compiler.
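As an illustration of these two calls, here is a minimal sketch (do_some_stuff is a placeholder for your own code, and the include follows the header name used in the examples above):

#include <cilkview.h>
#include <cstdio>

extern void do_some_stuff();    // placeholder for the work being timed

void time_some_stuff()
{
    char msg[64];

    unsigned long long start = __cilkview_getticks();
    do_some_stuff();
    unsigned long long ticks = __cilkview_getticks() - start;

    // The counter is in milliseconds; cast keeps the printf format portable.
    std::sprintf(msg, "do_some_stuff took %lu ms\n", (unsigned long)ticks);
    __cilkview_puts(msg);       // written to the Cilk view log
}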
void __cilkview_query(cilkview_data_t d)
Fetches the current set of Cilk view counters. If successful, d.status will be non-zero.
This function does nothing if the application is not running under Cilk view, and will be compiled away if not compiled using the Intel C++ Compiler.
void __cilkview_report(cilkview_data_t *start, cilkview_data_t *end, char *tag, unsigned int flags)
Writes information about the interval between the start and end data to either the report, the data file, or both.
| Parameter | Description |
|---|---|
| start | A pointer to a cilkview_data_t that was previously initialized by a call to __cilkview_query(). |
| end | A pointer to a cilkview_data_t that was previously initialized by a call to __cilkview_query(), or NULL. If NULL is passed, Cilk view will use the current values as the end point. |
| tag | A string representing the "tag" to be used for this report. There may not be spaces in the tag string. If multiple reports are written with the same tag, Cilk view will plot the smallest one. |
| flags | Flags to indicate whether the report is to be written to the log or the results file. Possible values are CV_REPORT_WRITE_TO_LOG and CV_REPORT_WRITE_TO_RESULTS. The values may be ORed together to write the data to both files. |
This function does nothing if the application is not running under Cilk view, and will be compiled away if not compiled using the Intel C++ Compiler.
The Intel® Cilk Plot program is provided for Windows systems only. The cilkplot command line has the form:
cilkplot TAGNAME.csv
where TAGNAME.csv is a data file produced by Cilk view. This allows you to easily view a plot of Cilk view data without re-running the measurements.
Cilk view writes TAGNAME.csv and TAGNAME.plt files that can be read by gnuplot. To display the data with gnuplot on Linux or Windows systems, run gnuplot (or wgnuplot) as:
gnuplot TAGNAME.plt
where TAGNAME.plt is the gnuplot script file produced by Cilk view. This allows you to easily view a plot of Cilk view data without re-running the measurements.
INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. UNLESS OTHERWISE AGREED IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT DESIGNED NOR INTENDED FOR ANY APPLICATION IN WHICH THE FAILURE OF THE INTEL PRODUCT COULD CREATE A SITUATION WHERE PERSONAL INJURY OR DEATH MAY OCCUR.
Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely
on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves
these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising
from future changes to them. The information here is subject to change without notice. Do not finalize a design with
this information.
The products described in this document may contain design defects or errors known as errata which may cause the
product to deviate from published specifications. Current characterized errata are available on request.
Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing
your product order.
Copies of documents which have an order number and are referenced in this document, or other Intel literature,
may be obtained by calling 1-800-548-4725, or go to: http://www.intel.com/design/literature.htm
Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families. Go to:
http://www.intel.com/products/processor_number/
This document contains information on products in the design phase of development.
BunnyPeople, Celeron, Celeron Inside, Centrino, Centrino Inside, Cilk, Core Inside, i960, Intel, the Intel logo, Intel AppUp, Intel Atom, Intel Atom Inside, Intel Core, Intel Inside, Intel Inside logo, Intel NetBurst, Intel NetMerge, Intel NetStructure, Intel SingleDriver, Intel SpeedStep, Intel Sponsors of Tomorrow., the Intel Sponsors of Tomorrow. logo, Intel StrataFlash, Intel Threading Building Blocks, Intel Viiv, Intel vPro, Intel XScale, InTru, the InTru logo, InTru soundmark, Itanium, Itanium Inside, MCS, MMX, Moblin, Pentium, Pentium Inside, skoool, the skoool logo, Sound Mark, The Journey Inside, vPro Inside, VTune, Xeon, and Xeon Inside are trademarks of Intel Corporation in the U.S. and other countries.
* Other names and brands may be claimed as the property of others.
Microsoft, Windows, Visual Studio, Visual C++, and the Windows logo are trademarks, or registered trademarks of Microsoft
Corporation in the United States and/or other countries.
Copyright (C) 2010, Intel Corporation. All rights reserved.