SdkRuntime API Reference¶

This section presents the SdkRuntime Python host API reference and associated utilities to develop kernels for the Cerebras Wafer Scale Engine.

SdkRuntime¶

Python API for SdkRuntime functions.

class cerebras.sdk.runtime.sdkruntimepybind.SdkRuntime(bindir: Union[pathlib.Path, str], **kwargs)¶

Bases: object

Manages the execution of SDK programs on the Cerebras Wafer Scale Engine (WSE) or simfabric. The constructor analyzes the WSE ELFs in the bindir and prepares the WSE or simfabric for a run. Requires CM IP address and port for WSE runs.

Parameters

bindir (Union[pathlib.Path, str]) – Path to ELF files which is compiled by cslc. The runtime collects the I/O and fabric parameters automatically, including height, width, number of channels, width of buffers,… etc.

Keyword Arguments

cmaddr (str) – 'IP_ADDRESS:PORT' string of CM. Omit this kwarg to run on simfabric.
suppress_simfab_trace (bool) – If True, suppresses generation of simfab_traces when running. Default value is False, i.e., simfab_traces are produced.
simfab_numthreads (int) – Number of threads to use if running on simfabric. Maximum value is 64. Default value is 5, i.e., the simulator uses 5 threads.
msg_level (str) – Message logging output level. Available output levels are DEBUG, INFO, WARNING, and ERROR. Default value is WARNING.

Example:

In the following example, an SdkRuntime runner object is instantiated. If args.cmaddr is non-empty, then the kernel code will run on the WSE pointed to by that address; otherwise, the kernel code will run on simfabric. The compiled kernel code in the directory args.name has exported symbols A and B pointing to arrays on the device. After loading the code and starting the run with load() and run(), data on the host stored in data is copied to A on the device, and then B on the device is copied back into data on the host.
runner = SdkRuntime(args.name, cmaddr=args.cmaddr)
symbol_A = runner.get_id("A")
symbol_B = runner.get_id("B")
runner.load()
runner.run()
runner.memcpy_h2d(symbol_A, data, px, py, w, h, l,
                  streaming=False, data_type=memcpy_dtype,
                  order=memcpy_order, nonblock=False)
runner.memcpy_d2h(data, symbol_B, px, py, w, h, l,
                  streaming=False, data_type=memcpy_dtype,
                  order=memcpy_order, nonblock=False)

coord_logical_to_physical(logical_coords: int, int)¶

Convert a logical coordinate to a physical coordinate. For a program with fabric offsets (offset_x, offset_y), and program rectangle coordinate (x, y), this function returns (offset_x + x, offset_y + y).

Parameters: logical_coords – Tuple containing logical coordinates.
Returns: physical_coords ((int, int)) – Tuple containing physical coordinates.

dump_core(corefile: str)¶

Dump the core of a simulator run, to be used for debugging with csdb. Note that the specified name of the corefile MUST be “corefile.cs1” to use with csdb, and this method can only be called after calling stop().

Parameters: corefile – Name of corefile. Must be “corefile.cs1” to use with csdb.

get_id(symbol: str)¶

Retrieve the integer representation of an exported symbol which is exported in the kernel. Possible symbols include a data tensor or a host-callable function.

Parameters: symbol (str) – The exported name of the symbol.

is_task_done(task_handle: Task) → bool¶

Query if task task_handle is complete

Parameters: task_handle (Task) – Handle to a task previously launched by SdkRuntime.
Returns: task_done (bool) – True if task is done, and False otherwise.

launch(symbol: str, *args, **kwargs) → Task ¶

Trigger a host-callable function defined in the kernel, with type checking for arguments.

Parameters: symbol (str) – The exported name of the symbol corresponding to a host-callable function.
Positional Arguments: Matches the arguments of the host-callable function. launch will perform type checking on the arguments.
Keyword Arguments: nonblock (bool) – Nonblocking if True, blocking otherwise.
Returns: task_handle (Task) – Handle to the task launched by launch.

Example:

Consider a kernel which defines a host-callable function fn_foo by:
comptime {
  @export_symbol(fn_foo);
}
The host calls fn_foo by runner.launch("fn_foo", nonblock=False).

load()¶: Load the binaries to simfabric or WSE. It may takes 80+ seconds to load the binaries onto the WSE.

memcpy_d2h(dest: numpy.ndarray, src: int, px: int, py: int, w: int, h: int, elem_per_pe: int, **kwargs) → Task ¶

Receive a host tensor to the device via either copy mode or streaming mode. The data is distributed into the region of interest (ROI) which is a bounding box starting at coordinate (px, py) with width w and height h.

Parameters

dest (numpy.ndarray) – A 3-D host tensor A[h][w][l], wrapped in a 1-D array according to keyword argument order.
src (int) – A user-defined color if keyword argument streaming=True, symbol of a device tensor otherwise.
px (int) – x-coordinate of start point of the ROI.
py (int) – y-coordinate of start point of the ROI.
w (int) – Width of the ROI.
h (int) – Height of the ROI.
elem_per_pe (int) – Number of elements per PE. The data type of an element is 16-bit and 32-bit only. If the tensor has k elements per PE, elt_per_pe is k even if the data type is 16-bit. If the data type is 16-bit, the user has to extend the tensor to a 32-bit one, with zero filled in the higher 16 bits.

Keyword Arguments

streaming (bool) – Streaming mode if True, copy mode otherwise.
data_type (MemcpyDataType) – 32-bit if MemcpyDataType.MEMCPY_32BIT or 16-bit if MemcpyDataType.MEMCPY_16BIT. Note that this argument has no effect if streaming is True, and the user must handle the data appropriately in the receiving wavelet-triggered task. Additionally, the underlying type of the tensor dest must be 32-bit. The tensor must be extended to a 32-bit one with zero filled in the higher 16 bits.
order (MemcpyOrder) – Row-major if MemcpyOrder.ROW_MAJOR or column-major if MemcpyOrder.COL_MAJOR.
nonblock (bool) – Nonblocking if True, blocking otherwise.

Returns

task_handle (Task) – Handle to the task launched by memcpy_d2h.

memcpy_h2d(dest: int, src: numpy.ndarray, px: int, py: int, w: int, h: int, elem_per_pe: int, **kwargs) → Task ¶

Send a host tensor to the device via either copy mode or streaming mode. The data is distributed into the region of interest (ROI) which is a bounding box starting at coordinate (px, py) with width w and height h.

Parameters

dest (int) – A user-defined color if keyword argument streaming=True, symbol of a device tensor otherwise.
src (numpy.ndarray) – A 3-D host tensor A[h][w][l], wrapped in a 1-D array according to parameter order.
px (int) – x-coordinate of start point of the ROI.
py (int) – y-coordinate of start point of the ROI.
w (int) – Width of the ROI.
h (int) – Height of the ROI.
elem_per_pe (int) – Number of elements per PE. The data type of an element is 16-bit and 32-bit only. If the tensor has k elements per PE, elt_per_pe is k even if the data type is 16-bit. If the data type is 16-bit, the user has to extend the tensor to a 32-bit one, with zero filled in the higher 16 bits.

Keyword Arguments

streaming (bool) – Streaming mode if True, copy mode otherwise.
data_type (MemcpyDataType) – 32-bit if MemcpyDataType.MEMCPY_32BIT or 16-bit if MemcpyDataType.MEMCPY_16BIT. Note that this argument has no effect if streaming is True, and the user must handle the data appropriately in the receiving wavelet-triggered task. Additionally, the underlying type of the tensor src must be 32-bit. The tensor must be extended to a 32-bit one with zero filled in the higher 16 bits.
order (MemcpyOrder) – Row-major if MemcpyOrder.ROW_MAJOR or column-major if MemcpyOrder.COL_MAJOR.
nonblock (bool) – Nonblocking if True, blocking otherwise.

Returns

task_handle (Task) – Handle to the task launched by memcpy_h2d.

run()¶: Start the simfabric or WSE run and wait for commands from the host runtime.

stop()¶

Wait for all pending commands (data transfers and kernel function calls) to complete and then stop simfabric or WSE. After this call is complete, no new commands will be accepted for this SdkRuntime object.

stop must be called to end a program. Otherwise, the runtime will emit an error.

task_wait(task_handle: Task)¶

Wait for the task task_handle to complete.

Parameters: task_handle (Task) – Handle to a task previously launched by SdkRuntime.

class cerebras.sdk.runtime.sdkruntimepybind.MemcpyDataType¶

Bases: Enum

Specifies the data size for transfers using memcpy_d2h and memcpy_h2d copy mode.

Values

MEMCPY_16BIT
MEMCPY_32BIT

class cerebras.sdk.runtime.sdkruntimepybind.MemcpyOrder¶

Bases: Enum

Specifies mapping of data for transfers using memcpy_d2h and memcpy_h2d.

Values

ROW_MAJOR
COL_MAJOR

class cerebras.sdk.runtime.sdkruntimepybind.Task¶: Handle to a task launched by SdkRuntime.

sdk_utils¶

Utility functions for common operations with SdkRuntime.

cerebras.sdk.sdk_utils.calculate_cycles(timestamp_buf: numpy.ndarray) → numpy.int64:¶

Converts values in timestamp_buf returned from device into a human-readable elapsed cycle count.

Parameters: timestamp_buf (numpy.ndarray) – array returned from device containing elapsed timestamp data
Returns: elapsed_cycles (numpy.int64) – Elapsed cycle count.

Example:

Consider the following CSL snippet which records timestamps and produces a single array to copy back to the host, to generate an elapsed cycle count:

// import time module and create timestamp buffers
const timestamp = @import_module("<time>");
var tsc_end_buf = @zeros([timestamp.tsc_size_words]u16);
var tsc_start_buf = @zeros([timestamp.tsc_size_words]u16);

// create elapsed timer buffer and advertise to host
var timer_buf = @zeros([3]f32);
var ptr_timer_buf: [*]f32 = &timer_buf;

timestamp.enable_tsc();
// record starting timestamp
timestamp.get_timestamp(&tsc_start_buf);

// perform some operation for which you want to calculate elapsed cycles

// record ending timestamp
timestamp.get_timestamp(&tsc_end_buf);
timestamp.disable_tsc();

var lo_: u16 = 0;
var hi_: u16 = 0;
var word: u32 = 0;

lo_ = tsc_start_buf[0];
hi_ = tsc_start_buf[1];
timer_buf[0] = @bitcast(f32, (@as(u32,hi_) << @as(u16,16)) | @as(u32, lo_) );

lo_ = tsc_start_buf[2];
hi_ = tsc_end_buf[0];
timer_buf[1] = @bitcast(f32, (@as(u32,hi_) << @as(u16,16)) | @as(u32, lo_) );

lo_ = tsc_end_buf[1];
hi_ = tsc_end_buf[2];
timer_buf[2] = @bitcast(f32, (@as(u32,hi_) << @as(u16,16)) | @as(u32, lo_) );

Then the elapsed cycles can be calculated on the host with:

# Get symbol for timer_buf on device
symbol_timer_buf = runner.get_id("timer_buf")

# Copy back timer_buf from all width x height PEs
data = np.zeros((width*height*3, 1), dtype=np.uint32)
runner.memcpy_d2h(data, symbol_timer_buf, 0, 0, width, height, 3, streaming=False,
  data_type=MemcpyDataType.MEMCPY_32BIT, order=MemcpyOrder.ROW_MAJOR, nonblock=False)
elapsed_time_hwl = data.view(np.float32).reshape((height, width, 3))

# Print elapsed cycles for each PE
for pe_x in range(width):
  for pe_y in range(height):
    cycle_cnt = sdk_utils.calculate_cycles(elapsed_time_hwl[pe_y, pe_x, :])
    print("Elapsed cycles on PE ", pe_x, ", ", pe_y, ": ", cycle_cnt)

cerebras.sdk.sdk_utils.input_array_to_u32(arr: numpy.ndarray, sentinel: Optional[int], fast_dim_sz: int) → numpy.ndarray¶

Converts a 16-bit tensor to a 32-bit tensor of type u32 for use with memcpy. The parameter sentinel distiguishes two different extensions of 16-bit data. If sentinel is None, zero-pad the upper 16 bits. If sentinel is not None, pack the index of the innermost dimension of the array into the upper 16-bits.

Parameters

arr (numpy.ndarray) – A numpy array with 2 or 4 bytes per element.
sentinel (Optional[int]) – For 16-bit input data, if this parameter is not None, pack the index of the innermost dimension into the high bits of the 32-bit wavelet. If sentinel is None, then the high bits are zeros.
fast_dim_sz (int) – If sentinel is not None, specifies size of fastest-changing dimension for generating the index.

Returns

output_view (numpy.ndarray.view) – Numpy view into arr with specified numpy data type.

cerebras.sdk.sdk_utils.memcpy_view(arr: numpy.ndarray, datatype: numpy.dtype) → numpy.ndarray.view¶

Returns a 32, 16 or 8 bit view of a 32 bit numpy array (only the lower 16 or 8 bits of each 32 bit word in the last two cases).

Parameters

arr (numpy.ndarray) – A numpy array with 4 bytes per element on which the numpy view will be created.
datatype (numpy.dtype) – The numpy data type which should be used in the output view. The itemsize must be 1, 2, or 4 bytes.

Returns

output_view (numpy.ndarray.view) – Numpy view into arr with specified numpy data type.

Example:

memcpy_view simplifies the use of various precision data types when copying between host and device. Consider the following Python host code which creates a float16 view into a numpy array. Note that this array must be 32-bit. The user can fill the array with float16 data, and copy it to an array on the device with CSL data type f16.
x_symbol = runner.get_symbol('x')
# This container array must be 32-bit
x_container = np.zeros(N, dtype=np.uint32)

x = sdk_utils.memcpy_view(x_container, np.float16)
x.fill(0.5)

runner.memcpy_h2d(x_symbol, x_container, 0, 0, 1, 1, N,
            streaming=False, data_type=MemcpyDataType.MEMCPY_16BIT,
            order=MemcpyOrder.ROW_MAJOR, nonblock=False)

debug_util¶

Utilities for parsing debug output and core files of a simulator run.

class cerebras.sdk.debug.debug_util.debug_util(bindir: Union[pathlib.Path, str])¶

Bases: object

Loads ELF files in bindir in order to dump symbols for debugging.

The user does not need to export the symbols in the kernel. debug_util dumps the core and looks for the symbols in the ELFs. If the symbol at Px.y is not found in the corresponding ELF, debug_util emits an error.

The most common errors are either: 1) a wrong coordinate passed in get_symbol(), or 2) a correct coordinate, but the symbol has been removed due to compiler optimization. One can use readelf to check if the symbol exists or not. If not, the user can export the symbol in the kernel to keep the symbol in the ELF.

The functionality of this class is only supported in the simulator.

Example:

from cerebras.sdk.debug.debug_util import debug_util

# run the app
# dirname is the path to ELFs
simulator = SdkRuntime(dirname)
simulator.load()
simulator.run()
...
simulator.stop()

# retrieve symbols after the run
debug_mod = debug_util(dirname)
# assume the core rectangle starts at P4.1, the dimension is
# width-by-height and we want to retrieve the symbol y for every PE
core_offset_x = 4
core_offset_y = 1
for py in range(height):
  for px in range(width):
    t = debug_mod.get_symbol(core_offset_x+px, core_offset_y+py, 'y', np.float32)
    print(f"At (py, px) = {py, px}, symbol y = {t}")

get_symbol(col: int, row: int, symbol: str, dtype: numpy.dtype) → numpy.ndarray¶

Read the value of symbol of given type at given PE coordinates. Note that each call to this function scans the whole fabric, so prefer get_symbol_rect over calling this in a loop.

Parameters

px (int) – x-coordinate of the PE, indexed from the northwest corner of the entire fabric (NOT the program rectangle)
py (int) – y-coordinate of the PE, indexed from the northwest corner of the entire fabric (NOT the program rectangle)
symbol (str) – Name of the symbol to be read.
dtype (numpy.dtype) – Numpy data type of values contained by symbol.

Returns

output_arr (numpy.ndarray) – Numpy array of output values read at symbol.

get_symbol_rect(rectangle: Rectangle, symbol: str, dtype: numpy.dtype) → numpy.ndarray¶

Read the value of symbol of given type for a rectangle of PEs.

Parameters

rectangle (Rectangle) – Rectangle specified as ((col, row), (width, height)), indexed from the northwest corner of the entire fabric (NOT the program rectangle)
symbol (str) – Name of the symbol to be read.
dtype (numpy.dtype) – Numpy data type of values contained by symbol.

Returns

output_arr (numpy.ndarray) – Numpy array of output values read at symbol. The first two dimensions of the returned array are PE coordinates (column, row) relative to the rectangle.

read_trace(px: int, py: int, name: str) → list¶

Parse a CSL trace buffer with name name at the given PE coordinates.

Parameters

px (int) – x-coordinate of the PE, indexed from the northwest corner of the entire fabric (NOT the program rectangle)
py (int) – y-coordinate of the PE, indexed from the northwest corner of the entire fabric (NOT the program rectangle)
name (str) – Name of the trace buffer to be read.

Returns

trace_output (list) – Heterogenous list of trace values.

Example:

Consider a device kernel which initializes a trace buffer with the CSL debug library and uses it to record values:
const debug_mod = @import_module("<debug>", .{.key = "my_trace", .buffer_size = 100});

fn foo() void {
  debug_mod.trace_timestamp();
  debug_mod.trace_string("Bar");
  debug_mod.trace_i16(1);
}
Then the trace can be read in the host code with:
trace_output = debug_mod.read_trace(4, 1, 'my_trace')
print(trace_output)
If foo was executed only once, then trace_output will be a heterogenous list containing a timestamp, the string “Bar”, and the number 1.

SDK Documentation (1.2.0)

SdkRuntime API Reference

Contents

SdkRuntime API Reference¶

SdkRuntime¶

sdk_utils¶

debug_util¶