The Taiyen system is primarily a two-layer component comprised of volatile and persistent heap managers that allows the creation and management of four heap types: volatile, archive, snapshot, and transactional. The system offers productivity and performance features that simply aren’t found in or matched by other similar components. Originally, it was conceived as a tool to be used internally to help us build end-user products, but we felt that its feature set might be interesting to other developers; and moreover, that eventually other similar products would find their way into the market place anyway. So we decided to distribute the system.

We have builds for 32 and 64-bit Windows (on Itanium) and 32 and 64-bit Unix (currently, we support Linux and Solaris, and we will be adding other Unix flavors over the coming months). The system comes in the form of static libraries, so the only functions linked into the application are the ones that are used, and there’s no need to worry about having to distribute yet another DLL. The release library’s small footprint of ~(200 KB – 500 KB), depending on the platform, lessens any advantage that would be derived by using a shared library, especially given the small odds that multiple processes running at a time will be running the same Taiyen code. Furthermore, avoiding the DLL route foregoes the versioning problems associated with DLLs and provides more flexibility in terms of the library interface. All functions in the API have C linkage with cdecl as the calling convention. We also have fastcall versions of our 32-bit Windows libraries, which use faster register-based parameter passing.

Unlike the case with competing systems, all Taiyen heaps are self-contained contiguous structures that are created at explicit addresses in the process, similar to the way private heaps are created via Windows’ HeapCreate(). To this end, the system provides functions to find available address space at both the low and high ends of the process. The Taiyen system works independently from the CRT and handles the allocation of all block sizes, even the largest multi-page blocks, internally. Other systems are often built on top of the CRT’s heap and pass the allocation and resizing of large blocks off to the virtual memory manager (VMM) (as is the case for a Windows private heap, which hands allocation requests greater than 0x7FFF8 bytes in size to the VMM). This approach, while easier to implement, simply couldn’t have been possible, given the complexity and breadth of features that we wanted to provide. We needed our heaps to be stand-alone structures wherein all internal control structures, meta data, and user data had to reside in the same contiguous span.

This yielded several advantages: One, this allowed us to precisely control the alignment of internal structures optimally with respect to cache lines and pages, thereby maximizing locality. This is critical to prevent different threads running on different processors from modifying the same cache line at the same time, a phenomenon called false sharing, which severely degrades performance as the cache coherency mechanism goes into overdrive. Two, this allowed us to leverage the same volatile heap code to create and manage persistent heaps, which definitely need to be self-contained objects. Sharing the same code reduces the code footprint, improving code locality (and therefore, at the margin, speed); and more importantly, reliability (less code needs to be tested). Three, this allowed us to implement PTE redirection, a feature to be covered below. Four, this allowed the ability to create multiple volatile and persistent heaps within a process. The capacity to create multiple volatile heaps was needed to provide system scalability on NUMA platforms where each node could be assigned its own heap. And five, having all user blocks reside within the heap space, allows the quick resolution of a block to its respective heap.

A critical feature of our allocator is its faculty to maximize availability while still maintaining exceptional speed. We define availability as a measurement of a heap's ability to accommodate allocation requests as a percentage of its ability were all of its free blocks coalesced together, and is given by equation one below:

For a series of N free blocks that are available for allocation, ordered from the largest to the smallest, (F1, F2, F3, …FN), and O is the total number of free blocks in the heap, the heap’s availability, A is as follows:

To illustrate the concept of availability, consider the four heaps in the figure below:

As evidenced by the figure, higher availability indicates increased allocation potential within a given space. Availability may be thought of as the inverse of fragmentation, which occurs because blocks can be freed in random order, as opposed to blocks being deallocated on the stack, for instance. While there is no way for the heap manager to control the order in which blocks are freed, it can control where new blocks are directed; which, of course, is what segregates one allocation algorithm from another. A variety of strategies have been devised to tackle fragmentation from aggregation strategies, where blocks of similar size are grouped together in mini heaps (which can be very fast but lead to low availability) to best-fit strategies, where the free block offering the best fit is targeted for allocation (which can be slower but yields very good availability). This second strategy, for example, would place a block one unit is size in the right most block of heap three.

The Taiyen’s allocator employs what we call a modified-best-fit strategy, which sacrifices a small amount of availability (only a couple of percent) to achieve superior speed. One key attribute of the algorithm is its robustness, which is its ability to handle a wide range of block sizes as well as a dynamically varying average block size. This is in contrast to many specialty allocators such as Hoard that are designed for a specific block range (usually 10s – 100s of bytes). The allocator is also a coalescing allocator, which means that every time a block is deleted or moved, any created free blocks are coalesced with neighboring free blocks. Some allocators avoid coalescing because it can be computationally expensive, but such a practice can lead to a substantial loss of availability.

The Taiyen system is fully scalable on SMP machines, and offers complete protection against the false sharing of cache lines, for both internal structures (as mentioned) and user allocated blocks. The system is also scalable on NUMA platforms by assigning each node its own heap. The figure below shows Taiyen performance relative to stock CRT allocators. These tests (as well as other tests referenced herein) were performed on three platforms: an Intel S450NX with four 550 MHz Xeon processors running Windows 2003 Server; an HP i2000 workstation with two 800 MHz Itanium I processors running Windows 2003 Server; and a Sun e4500 with eight 336 MHz Ultrasparc II processors running Solaris 9. The source code (TestMemoryManager.cpp) for the application used to perform these tests is available in the SDKs.

The figure shows the Taiyen system’s strong single-threaded performance and nearly linear scalability. The stock allocators don’t scale because thread access is serialized. Compounding the problem, as multiple threads attempt to engage the heap, they are blocked, resulting in those threads being immediately put into wait states, which then results in a series of expensive context switches as the thread scheduler shortcuts each thread’s time-slice. Worse yet, the clock cycles consumed in response to any cache misses will have been wasted as well.

Much work in the area of heap management has focused on this particular deficiency, but there are many other areas where there is room for improvement, as we argue below. Two specialty allocators specifically designed to provide scalability against which we have tested are Hoard (an allocator developed at the Universities of Texas and Massachusetts) and mtmalloc (Sun Microsystem’s multithreaded offering).

While Hoard did show strong performance for small block sizes, it didn’t prove to be very robust – it was only able to complete five of 36 simulations of our testing regimen (which is described in detail in our white paper) before it consumed too much paging file space too continue; and on occasion, the allocator locked up. mtmalloc is a power-of-two allocator that sacrifices availability and memory efficiency for speed. After completing 11 of 46 simulations, it failed after consuming too much disk space. Both of these allocators required special simulations to derive the performance data shown above.

The Taiyen system aligns all allocated blocks optimally with respect to cache lines and pages. This doesn’t mean that every block will be allocated along a cache line boundary, just that no block will touch extra cache lines. For example, a 16-byte block will never straddle a cache line, or a page boundary for that matter. Blocks, however, that are greater than equal to a cache line in size will be aligned along cache boundaries, and likewise, blocks that are greater than or equal to a page in size will be aligned along page boundaries. There are rare cases where the system may position a sub-page block such that it straddles a page boundary, but this should not be considered a liability, as page alignment is significantly more critical for multi-page blocks.

The aim of this strategy is to increase block access speed by maximizing cache, TLB, and page hit ratios in turn by maximizing locality. The strict alignment policy will also prevent cache unit splits where accessing a cache line's worth of data causes two cache line loads. And the outcome can be significant, especially on platforms where the ratio of processor speed to main memory access speed is high. One such machine where such locality yields noticeable performance gains is the Itanium, which has a main memory latency of 176+ clocks, an L3 cache latency of 22 clocks, an L2 latency of 6 clocks, and an L1 latency of 2 clocks. The figure below shows the difference in block access speed yielded by the Taiyen system versus that of the stock allocators on the three platforms used for testing, with an average block size of 33 bytes on the 32-bit platform and 43 bytes on the 64-bit platforms, and with a cache line size of 64-bytes on all three machines. On Itanium, the Taiyen system yielded nearly a 50% improvement in access speed over the CRT, which allowed over 99% of the allocated blocks to straddle cache lines.

Note that these alignment rules don’t necessarily allow the Taiyen allocator to be substitute for the CRT’s aligned_malloc(size_t size, size_t alignment), which forces a block allocated thereby to be aligned along the power-of-two boundary, alignment. For example, the Taiyen allocator will not guarantee that a 512-byte block will be aligned along a 512-byte boundary. It will guarantee, however, that it will touch the minimum number of cache lines. The allocator does, on the other hand, guarantee power-of-two alignments for seven critical granularities: the system word, the double-word, the quad-word, the cache line, double the cache line, the page, and the segment (with a segment being defined as 16 pages). Here, a word refers to a system word (i.e. four bytes on a 32-bit system and eight bytes on a 64-bit system).

   

Another significant difference between the Taiyen system and competing technologies is its native support for dynamic blocks, which as the name implies, are designed to contain structures such as dynamic arrays that need to be continually resized over time. To facilitate this, dynamic blocks are given built-in buffers to reduce the frequency of block relocations. When these blocks are page multiples in size, the buffers only consume virtual address space. To accelerate the relocation of multi-page dynamic blocks, the system employs a proprietary technique that we call page table entry (PTE) redirection, which allows a block to be moved in virtual address space while keeping its data stationary in physical storage. Such relocations based on redirection can be made up to two orders of magnitude faster than the byte-by-byte method used by memmove(), for example. Another important virtue of redirection is that it allows block data that have been paged to disk to be relocated without having to read that data it into main memory beforehand.

The two charts above show Taiyen performance versus stock allocator performance with respect to dynamic blocks, with the second chart focusing on multi-page block resizing speed. In the tests highlighted by this second chart, 76.4% of the operations were block resizings, with the average increase in block size being 141.07% and the average decrease, 49.94%. What’s notable is that even with Taiyen’s use of expansion buffers, with a total of 428,748 relocations, it relocated blocks 97,279 more times than did Windows; 68.84% of the Taiyen resizings required a relocation. This speaks to the power of redirection, which as the chart shows, can produce exceptional relocation speeds.

Another key difference between the way Taiyen and the stock allocators handle dynamic blocks is that the stock allocators always shrink blocks in place. We regard this as a significant deficiency, because it can lead to the worst kind of fragmentation – small blocks spread evenly across the address space. In contrast, the Taiyen system views every block resizing as an opportunity to reduce heap fragmentation, especially when blocks are contracting; only 45% of the blocks referenced in the lower chart were contracted in place. This, of course, can only be practicable with redirection.

We measure page efficiency via a spy device that scans the PTEs of the heap’s pages for dirty bits. Efficiency is then calculated by dividing the number of allocated bytes by the number of dirty page bytes. The result factors in the overhead of the heap’s meta data as well as how efficiently the allocated blocks are laid out in the heap space, with the more compact layout yielding the better efficiency. A principal aim of the system was to maintain good page efficiency while maximizing speed, and the charts below show that the Taiyen system compares well with the Windows CRT, which is a highly efficient allocator. The page efficiencies shown in the charts below are the products of heavily thrashed systems, running a worst case scenario where millions of blocks of varying sizes are repeatedly allocated, freed, and resized.

The system offers a plethora of debugging features from bounds checking, which is automatic for debug builds, to multi-threaded leak detection. Reliability is the most important feature of all and the system contains copious checking to catch both internal errors and user errors such as the freeing of a dangling pointer.

The system offers a means to compress and decompress heaps to and from buffers and files. Included are sub-features that allow heaps to be resized, relocated, and even defragmented upon decompression; and we supply ancillary functions and macros to help resolve pointer references to any moved blocks. Both volatile and persistent heaps may be compressed, and volatile heaps may optionally be made persistent during decompression (or vice versa). The compression routines are highly optimized and file compression/decompression only requires a small 16-page buffer. The following are compression data from a run of the mill 2.8 GHz PC running Windows XP using a Maxtor 6Y200PO ultra ATA-133 drive with 8 MB of cache:

Average heap compression speed to buffer (bytes per second): 543,986,452.160653
Average heap decompression speed from buffer (bytes per second): 163,992,872.510485
Average heap compression speed to file (bytes per second): 44,198,439.568796
Average heap decompression speed from file (bytes per second): 40,802,358.469421

From its inception, a primary goal of our system was heap persistence. In principle, any persistent heap should be based on solid heap algorithms, so working backwards, we developed the volatile heap manager first to provide the core functionality that persistent heaps would use to manage their data. As mentioned, most of the volatile heap code base is reused by the persistent system. In fact, leveraging existing highly optimized components is what makes our system so lightweight and yet, powerful. Persistent heaps incorporate six technologies: one, volatile heap logic to manage the efficient apportioning of virtual address space; two, memory mapped files; three, the virtual memory manager, which excels at keeping the most frequently utilized pages in memory; four, exception handling (signal handling on Unix) to keep track of written pages; five, sparse files to minimize disk consumption; and six, low-level file I/O for high-speed transactions. On top of the code needed to integrate these components are routines that provide fail-safe snapshots, multiple undo/redo, transactions, rollback, etc.

Handling persistent data has always been messy. Given a file, the first task is to get the needed data into RAM where it can be read and manipulated, and there are a variety or ways to accomplish this. The most basic way is to simply read the entire file into RAM in one shot, operate on the data, and then write the data back to disk, again, in one shot. This, of course, isn’t very efficient. What if the file is a large CAD drawing, and only one entity needs to be changed, or the file just needs to be quickly viewed?

What if the file needs to be parsed as it’s read? Can it be loaded into a giant buffer and then parsed, or will the buffer need to be so large that the system will start paging at the same time the file is being read, with the parsed data being loaded into a heap, and with that data immediately being paged back to the same disk its being read from? This can be addressed by using a small fixed-sized buffer, but this can often call for sophisticated code to handle cases where structures that need to be read as a whole either won’t fit in the buffer or will straddle sequential buffer reads; and of course, all of this code needs to be fault tolerant. Even with efficient buffering, the entire file is still being read and written at a time.

Once the data is in RAM, how should it be periodically saved? The brute force method of writing the whole file to disk with every save could be workable for moderately sized data sets but is surely not the most elegant solution and certainly not practical for larger files. The more efficient approach would obviously be to keep track of unsaved data and write only that information to disk with each save, but this may not be easy when dealing with complex data structures. And how should the stream be reconciled with the existing file? Should each save append unsaved data to the end of the file, causing the file to grow even though existing data may have only been modified versus new data being added? And this approach would still require that, at some point, these appended saves all be reconciled together. Or should each save be reconciled on the fly, which could require some rather intricate code. On top of all of this, again, what happens when there is a system failure during a disk write? Will the file be found in a consistent state, or will it contain transient data, making portions of it or even all of it indecipherable? Imagine the case where the data is in the form of a linked list and the file is found with one of the links broken.

Fortunately, most all modern operating systems offer a superior means of accessing files over standard file I/O in the form of memory mapping. The memory mapped file is a powerful construct that addresses a few of the problems noted above by using the VMM to handle the loading and saving of files at page granularities. When a file view is initially mapped, the data is only loaded virtually. Actual disk reads occur on demand only as pages are touched, but from the perspective of the user, the data may always be assumed to be completely present in memory. This is the beauty of virtual memory; it relieves the user from having to worry about where his data lies (RAM or disk).

Imagine using mapping to load in a large CAD file. The drawing loads almost instantaneously because only the data need to paint the current view of the drawing is actually read from disk. Under the surface, the processor issues a page fault for each page that is touched during the initial screen paint. Each fault, however, is trapped by the VMM, which goes to work reading the accessed pages from disk.

Updates are also efficient with respect to mapped files. Essentially, a mapped file is a named paging file. Understand that every page committed in virtual memory will be backed by some sort of file. It can either be named or anonymous, but the option to have no backing file doesn't exist. Transparently to the user, the VMM continually moves pages to and from the mapped file based on the demands of the system, just as it does with the operating system’s paging files. Because the VMM keeps track of dirty pages, file saves (or flushes) are highly efficient. Only those pages modified since the last flush are written to disk, and of course, already paged-out data are ignored. The user needn’t keep track of which pages are present in memory, which pages have been modified, etc.

If memory mapped files can do all of this, you may be asking why a persistent heap is even necessary. First off, a mapping is just a span of virtual address, and while the VMM does a great job of managing virtual address space, it only does so at the page level. Clearly, some other component must reside on top of the VMM to organize this address space, that is: allow it to be subdivided into blocks of varying size, both sub-page and multi-page. This component is what’s commonly termed a heap. But heaps typically extend the VMM layer in more ways than this through features like bounds checking, etc.

The Taiyen system allows the creation of persistent heaps through its persistent heap manager. This component inherits most of the functionality of the volatile heap manager on which it’s based, which includes features such as: high availability, optimal block alignment, native dynamic block support, PTE redirection, good page efficiency, bounds checking, leak detection, and compression/decompression.

The system allows for the creation of the three persistent heap types: archive, snapshot, and transactional. Archive heaps are designed for information retrieval. Snapshot heaps are designed for single-user desktop systems, whereas transactional heaps are designed for multi-user server systems. The one feature that is not inherited from volatile heaps is SMP scalability. The reason for this is that scalability would primarily be an issue for a server-based transactional heap, which can’t scale because it uses a transaction log that must be serialized, effectively serializing access to the heap, a subject to be expanded upon below. Scalability can be achieved, however, by creating multiple persistent heaps, with the stipulation that they operate independently from one another. Mentioning details such as the transaction log highlights some important ways in which persistent heaps greatly extend memory mapped files. At this point, these features will be explored in detail.

One weakness of the memory mapped file is that a system failure during a file flush can leave the file in an inconsistent state. The Taiyen system addresses this shortcoming by providing fail-safe snapshots. A snapshot heap is supported by two files: a backing file and a data file, which are collectively termed heap files. The backing file is mapped and the data file is completely isolated. If a failure occurs at any point, the heap will be restorable to its state as of the last successful snapshot from one of these two files.

A snapshot is essentially a double file flush, first to the backing file and secondly to the data file. The double flush is necessary to always maintain a consistent image of the heap. While the backing file is being written, that consistent image is located on the data file and vice versa. And because the backing file is constantly being updated by the VMM as it flushes dirty pages to it in the background, it can’t be relied upon to have a consistent image between snapshots. The only time it does is mid-snapshot, after all of the dirty pages in the heap have been explicitly and completely flushed. This, of course, is predicated by the restriction that during a snapshot, no pages in the heap be modified. Once the backing file contains a consistent image, those pages that were modified since the last snapshot are now written to the data file. If a failure were to occur during this data file update, the heap would be restorable from the backing file, and if a failure were to occur at any other time, the heap would be restorable from the data file.

To know which pages to flush to the data file during a snapshot, the system must keep track of dirty pages. The VMM does this under the surface by setting the modified bit in the PTE for each page that is dirtied. But the VMM clears this bit when it reads the page back into memory, which means that the PTEs representing the heap space can not be used to indicate which pages have been modified since the last snapshot. Furthermore, any scheme that was based on scanning PTEs, would require a device driver to gain access to these kernel level structures. As noted above, we did write such a driver for 32-bit Windows to measure page efficiency, but this was only meant to be a testing tool.

To remedy this, the Taiyen system maintains its own structures to keep track of dirty pages, which include both dirty segment and dirty page bit arrays, with a segment being defined as 16 pages. The system sets page protections that cause the CPU to issue an access violation (or SIGSEGV) when a particular page is modified, which is immediately trapped by the system’s exception handler, allowing the bit arrays to be updated. Including a segment array allows upwards of 8 MB of heap space to be scanned per clock cycle on 64-bit systems (64 bits * 131,072 bytes/bit = 8 MB). This comes in handy for multi-terabyte heaps where only a handful of pages at the end of the heap, opposite that where the scan is started, are dirty, making the page scan a significant portion of the overall snapshot time. Bit scanning is a critical function in the Taiyen system and most of these routines as well as other time critical functions have been hand-coded in assembly to maximize performance.

The system grants access to the dirty page bit array so that the user can monitor pages that have been modified. Those of you familiar with the Windows API may notice the similarity between this capability and the Windows function, GetWriteWatch(), which unfortunately, is not supported by NT, 2000, or later.

When a snapshot heap is created or restored without a data file, it will be configured for archival use. Functionally, an archive heap operates identically to a snapshot heap; it can even make snapshots, but they won’t be fail-safe. If a system failure occurs on an archive heap that has been modified, the heap may not be consistent and/or recoverable. A data file can be attached to an archive heap at any time to enable durable snapshots. And a transaction log may be attached as well to enable transactions.

Other technologies (beyond basic file I/O) conjured to facilitate the management of persistent data include: archives, object databases, and relational databases, with the latter providing a significant abstraction away from the raw data. While persistent heaps can’t be viewed as direct substitutes for relational databases, they are a compelling alternative to archives and object databases, and in fact, were originally conceived as such. One problem with archives like Microsoft’s CArchive (a class within the MFC framework) and object databases is that they impose strict semantic requirements. Persistent heaps, on the other hand, are data semantics and organization agnostic, which makes them easier to use and more extensible. This is sometimes referred to as orthogonal persistence, which allows the programmer to treat structures the same way regardless of whether they exist in volatile or persistent domains.

For example, to gain persistence in MFC, all objects must be derived from the CObject base class, and with respect to object databases the base class is generally, d_Object. Firstly, as soon as the word derived is mentioned, the application must be coded in an object oriented language. While this is certainly allowable for persistent heap utilization, it is not necessary, as the Taiyen libraries provide C linkage, allowing their use in procedural code. Moreover, no objects allocated in a persistent heap need to be derived from any base class. Instead of segregating objects by type, they are segregated by location, allowing the persistence means to be less meddlesome. Quite simply, if a particular object is allocated in a volatile heap, it is volatile. If the same object is allocated in a persistent heap, it is persistent.

Essentially, persistent heaps are used just as volatile heaps. In fact, both types use the same core function calls: CHeapManager_pmAlloc(), CHeapManager_pmSetDynBlockSize(), and CHeapManager_Free(). The system can tell what type of heap is being used from the heap address passed to these calls. There are, however, similar persistent heap functions that combine these calls with other actions such as making undo buffer entries. For example, CPersistentHeapManager_pmAllocEUE() allocates a block and makes the undo entries in the undo buffer (if attached) necessary to deallocate and possibly reallocate the block during an undo/redo cycle. Undo/redo will be covered in more detail below.

Another cumbersome facet of archive and object databases is that objects stored thereby need to be individually serialized. Explicitly loading objects and saving objects can be a tremendous programming burden, akin to the problems associated with basic file I/O described above. The code can get quite convoluted, especially if mechanisms have to be built to keep track of modified objects. In contrast, no objects in a persistent heap need to be individually serialized. When a persistent heap is restored, all objects therein are loaded at once, but only virtually. Only as the objects are referenced in the application, are they physically read from disk, courtesy of the VMM. And saving objects is just as easy. There’s no need to develop an algorithm to keep track of modified objects. Simply perform a snapshot, and any objects that have been modified since the last snapshot are flushed to the backing and data files, automatically and transparently to the user.

Of course, because the grunt work is being handled by the VMM, all of this is occurring at the page level; but managing data at the page level is preferred over the object level. For instance, if only a small portion of a multi-page object is being accessed frequently, the dormant part of that object can be paged to disk, allowing the consumed memory to be put to better use somewhere else. It also enables the system to make good use of I/O bandwidth by moving data a page at a time, although most VMMs will take this further and read and write multiple contiguous pages at once. The VMM’s job can certainly be aided by maximizing page utilization and maintaining strict object alignments, and this is where the heap layer is beneficial. This highlights another point: It’s unclear as to how efficiently archives and object databases handle data churn, where objects are frequently allocated, resized, and freed. By leveraging sound volatile heap algorithms, Taiyen persistent heaps do excel in this type of environment.

Another problem with the typical database management system is that it plays tug of war with the operating system instead of facilitating it. For example, a common practice employed by database managers is to cache frequently used objects. In principle, this is a good idea, but it demonstrates a redundancy. The VMM is already charged with this task, and there’s no way to turn the VMM off. Its threads are going to receive the same time slices regardless of what other threads are running, so they might as well be utilized. Moreover, the VMM is able to monitor page usage patterns on a global basis across all running processes, so it is better suited for the mission of allotting physical memory. Some embedded systems manage caching through disk pages, which is getting even more redundant. Not only is there no advantage to writing a custom paging system, there’s no way to bypass that of the operating system, leaving you with a paging system on top of a paging system.

A further problem with object and embedded databases is encountered when pointer references are needed. For instance, OMDG databases require such references be made through d_Ref<T> objects. Because persistent heaps are always restored to the same address at which they were created, pointers can be used at will. The system does offer a means of relocating a heap to a different address, but this is done via compression and decompression. Naturally, doing so will require that pointer references in the heap at least be offset, and if the heap is defragmented, adjusted individually.

Creating and loading (or restoring) persistent heaps are very simple processes, and the same functions are used to create and restore all three heap types: archive, snapshot, and transactional, with the type being determined by what file handles are passed to the functions. For example, attaching just a backing file creates or restores an archive, including a data file gives you a snapshot heap, and including a transaction log with a buffer gives you a transactional heap. These last two files can be attached at later time as well. Persistent heaps are created via CPersistentHeapManager_pcCreate() and restored by CPersistentHeapManager_pcRestoreHeapFromHeapFiles().

This second function is very powerful as it can restore a heap that has been closed properly or prematurely as a result of system failure. It can determine the failure mode and apply the appropriate method of recovery. Once a persistent heap has been created or restored, various buffers may be attached to add functionality. Undo and redo buffers may be attached to a snapshot heap and a rollback buffer may be attached to a transactional heap. The system also provides many ancillary functions to do things such as: extract heap parameters from the data file before restoration, check internal structures after restoration, verify checksums, etc.

Restoring a persistent heap that has been closed normally (as opposed to a closure resulting from system failure), is an extremely fast operation indeed, because it is much more a function of processor speed than disk speed. Persistent heaps are closed by calling CPersistentHeapManager_uwClose(), which performs two tasks: One, it performs a final snapshot to place the data file in congruence with the backing file (and with the heap); and two, it frees any clusters on disk that are no longer being utilized, which is possible by virtue of the system’s use of sparse heap files. When the restoration function detects that the heap files are congruent, restoration mainly entails mapping in the backing file’s views. As noted earlier, file mapping, in general, is the ideal way to load persistent data because pages are only read from disk as they are touched, thereby minimizing the disk access bottleneck. For instance, our Itanium tests produced an average loading speed of 1.75 GB per second. By the way, the GB in 1.75 GB/s represents stored data as opposed to heap space.

The system provides a very powerful undo/redo service to snapshot heaps that is enabled by simply attaching undo and redo buffers. Using this service can make the task of building undo and redo into an application significantly easier than writing the necessary routines from scratch, which can get very complex when the operations to undo are intricate and difficult to reverse. Having undo/redo at the heap layer can also yield efficiencies that are not possible otherwise. For example, when a block is freed, any pages that are no longer needed can be returned to the VMM. This is possible because the system has an explicit allocator that, when invoked, will restore the freed block to its last address, allowing it to be repopulated with its old data. In fact, the service can undo and redo all heap operations: allocations,  resizings, and the freeing of blocks, as well as other ancillary operations. To allow an allocation to be undone for instance, CPersistentHeapManager_pmAllocEUE() is called instead of CHeapManager_pmAlloc(). The EUE at the end of the call stands for embedded undo entry and such functions in general are called u-procs. This function makes a call to CHeapManager_pmAlloc() and then makes the data and procedure entries in the undo buffer needed to undo the allocation. The source code for all of the u-procs offered by the system is provided in the API Guide.

The undo buffer takes two types of entries: data and procedure. The data entry is the simpler of the two and involves copying a block of data to the buffer that will be restored to a specific address when an undo is invoked. When that undo is invoked, the current block of data at that address is copied to the redo buffer allowing it to be restored should a redo be invoked. This illuminates another advantage of the service – redo is always automatic. All the user has to do is call CPersistentHeapManager_uwRedo(), and the last operation that was done will be undone. It’s that easy, and the reason is that the user never has to make any entries in the redo buffer. These, as we term mirror entries, are made automatically by the system as operations are undone. And once an operation is redone, it can be undone again, and redone, and so on.

Calling CPersistentHeapManager_uwUndo() implicitly starts an undo/redo cycle, where undos are permitted to the extent that there are entries in the undo buffer, and where each undo causes mirror entries to be added to the redo buffer to allow the operation that was just undone to be redone. Invoking all of the entries in the undo buffer will cause the redo buffer to become fully populated with mirror entries, at which point no further entries will be added (new entries are never added to the undo buffer during an undo/redo cycle). Any redos or undos beyond this point will just move pointers in these buffers to their respective current entries. To avoid placing the heap out of sync with the undo and redo buffers, an undo/redo cycle must be explicitly exited via a call to CPersistentHeapManager_ExitUndoRedoCycle() before executing an operation that would modify any heap data. This call clears the redo buffer and resets both buffers’ internal structures, effectively locking any changes made to heap during the cycle before any modification takes place.

Because most operations in an application will constitute multiple undo buffer entries, a series of entries can be packaged as a transaction by calling CPersistentHeapManager_CloseUndoBufferTransaction() In fact, only transactions can be undone (note that the term transaction here should not be confused with a heap transaction, which is a different animal). The drawing of a line in a CAD program, for example, might constitute several sub-operations: one to allocate a block for the entity; another to set its coordinate, layer, color, and type information; and others yet to update some ancillary structures. The user isn’t going to want to undo each sub-operation individually. He’s going to want to undo them collectively in one atomic operation. To accomplish this, the undo buffer entries that represent each sub-operation are packaged into a transaction.

Any operation on a heap can be undone with one or more data entries. Data entries are made via CPersistentHeapManager_uwMakeDataEntryInUndoBuffer(). All instructions processed by a CPU ultimately are meant to modify bytes, so saving those bytes before they are modified allows a particular operation to be undone merely by resetting the affected bytes to their pre-operation values. This means that undo and redo for an entire application can be built upon data entries. Data entries, however, in certain cases may not be easy to implement because a particular operation may affect many non-contiguous bytes, or they may not be efficient because the operation effects a large swath of data. Imagine inverting a one-MB matrix. Making a one-MB data entry to undo this operation would simply consume too much undo buffer space.

To remedy this, the system offers a way store procedures in the buffer via the call, CPersistentHeapManager_uwMakeProcedureEntryInUndoBuffer(). Essentially, storing a procedure entails copying its address, parameters, as well as some ancillary information such as the calling convention, an address to take the return value (if appropriate), and a return value to verify, etc. to the buffer. But the procedure being stored shouldn’t be the one that was used to perform the operation, it must be a mirror procedure designed to reverse that operation. Furthermore, that mirror procedure must include a recursive call to CPersistentHeapManager_uwMakeProcedureEntryInUndoBuffer() to populate the redo buffer with a mirror of the mirror procedure, which can often be the original procedure that performed the operation. The use of this one function to make entries in both the undo and redo buffers may be a point of confusion, but there is a simple explanation. During normal processing, the target buffer for undo entries is the undo buffer. During an undo, however, the target buffer automatically becomes the redo buffer; and during a redo, there is no target buffer – no entries are made in either buffer.

Taking the matrix inversion example further, the same function that performs the inversion can reverse it, so implementing undo for this operation is very straightforward. Other operations can be more complex to undo, such as the resizing of a block. When CPersistentHeapManager_pmSetDynBlockSizeEUE() is called to resize a block, it makes a series of data entries and a procedure entry for the internal function, pmExplicitSetDynBlockSizeEUE() in the undo buffer. When an undo is invoked, this second call will relocate the dynamic block to its pre-resized address and restore it to its original size. Any data that was clipped as a result of the block shrinking will be restored via a data entry as well. Notice that the internal function is appended by an EUE, which indicates that, when called, it will also makes an undo entry, except in this case, it will be to the redo buffer. As described, when a procedure entry is invoked by the undo processor, all entries made thereby are automatically directed to the redo buffer; and when a procedure entry is invoked by the redo processor, all resulting requests to make a buffer entry are ignored.

There are two other features of the undo/redo service that should be mentioned. One, the buffers can be configured to be either circular or non-circular. When the buffer is circular, it won’t overflow. Instead, at some point, new transactions will start overwriting old transactions. Recall that entries are packaged together in transactions, so when one entry is overwritten, all of the other entries in that transaction are invalidated, which can get complicated because entries can wrap around from one end of the buffer to the other. When the buffer is non-circular, however, it obviously can overflow (at which point the undo service will be disabled), but using a non-circular buffer will ensure that all of the operations made since the point of attachment can be undone. The second feature that should be noted is that the buffers can be made persistent and re-attached allowing operations that were performed during one session to be undone during a subsequent session, which may be of interest to some application developers.

This brief exposition of the undo/redo service may have made it sound more complex to use than it really is. Most of the undo/redo functionality you’ll need for your application can be implemented through the use of the u-procs provided by the system and through data entries, which are very straight forward. All of the gory under-the-surface details touched upon here can generally be ignored, but in case your have more advanced requirements, the framework is fully capable of addressing those needs and is well documented. Additionally, the more complex features are not that difficult to master and can take the work out of what, by nature, can be convoluted and buggy code. Furthermore, the API Guide and the SDK, which contains plenty of sample code, provide much direction in this area.

Persistent heaps can’t grow in size, and it would be impractical to design them to. Any structure that attempts to grow in size will have to contend with being impeded by another structure standing in its way by allowing itself to be movable. Heaps aren’t very movable, of course, because they contain references to absolute addresses, both in the meta data and the user data. With the advent of virtual memory, however, this is all a moot point because heaps can be created that are gigabytes in size on 32-bit platforms and terabytes in size on 64-bit platforms while consuming virtually no physical resources. For example, a one-GB heap on a 32-bit system initially commits just two pages of physical storage. But this may raise a question: If I create a several terabyte heap, do I need backing and data files of the same size as the heap? The answer is yes, but you certainly don’t need that much disk space, and the reason is file sparseness.

Quite frankly, without file sparseness, a persistent heap wouldn’t be practical, at least on 64-bit platforms (which is where server and desktop computing is going at an ever-increasing rate). In this case, the heap size would either be limited by the attached storage, or the heap would have to be backed by a myriad of small files, which would be overly complex and a maintenance nightmare. To simplify matters, the Taiyen system won’t work on a file system that doesn’t support sparseness. Luckily, the only one of note that doesn’t is FAT32, which won’t be found on modern hardware anyway, and certainly not on 64-bit machines.

Sparseness can be likened to virtual memory for files. It allows an extremely large file to be created that only consumes disk space to the extent that there is data in the file. For example, a one-TB sparse file with a byte written at the low end and a byte written at the high end would consume only a couple of clusters. Some file systems such as Windows’ NTFS allow specific regions of a sparse file to be logically zeroed, allowing the clusters backing those regions to be returned to the pool of available clusters on disk. Not only is this efficient in terms of disk utilization, it is much faster than writing a stream of zeros to the media. Such zeroing operations are undertaken when the transaction log is cleared, and as mentioned, when a persistent heap is closed. Actually, to minimize fragmentation, only spans of disk space that constitute a segment and above are returned to the file system upon closure.

By virtue of using sparse backing and data files, expect disk utilization to be at least equivalent to the page efficiency figures cited earlier and in real world cases, much higher. Remember that these page efficiencies are the products of a substantial amount of churning, which won’t be the typical case. You might consider the use of two heap files to be an inefficiency, but in actuality, it is not. Recall that the backing file is essentially a surrogate paging file, and because this file is mapped, loading a persistent heap does not allocate from the OS’s paging files. This means that for many systems, especially server-based systems, the collective size of the paging files can be reduced to the extent that storage is needed for any backing files.

Transactional heaps extend the functionality of snapshot heaps with transactions, which are essentially ultra fast fine-grained heap saves. While a snapshot heap is guaranteed to be restorable to the last snapshot, a transactional heap will always be restorable to the last transaction. As mentioned, a transactional heap is created via the same function as a snapshot heap, CPersistentHeapManager_pcCreate() with the only difference being that the attachment of a transaction log and log buffer tells the system to make the heap transactional.

The transactional heap inherits the ease of use traits of the snapshot heap, such as freedom from having to serialize objects individually, and offers its own features to simplify the programming task while at the same time, maximizing performance. While other transactional systems exist, they can’t match the flexibility and extensibility of transactional heaps. Because these heaps operate below the application layer and are completely data agnostic, they allow virtually any operation or series of operations to be formed into a transaction. And they support all variety of structures from linked-lists, to b-trees, to hash tables, to more exotic structures such a square lists, to whatever the programmer conjures.

Similar to the undo buffer, the transaction log takes two types of entries: data and procedure, with the main difference being that entries in the log aren’t meant to reverse an operation; they’re meant to record it. Actually, a rollback buffer can be attached to a transactional heap that does take mirror entries like the undo buffer. Log entries also carry more error checking information such as checksums than do undo buffer entries because it assumed that they will be needed to rebuild a heap after a system failure. In such a case, the system is paranoid about the consistency of the heap files and the log.

One or more log entries followed by a terminator form a transaction, and the first entry following a terminator or in a fresh log implicitly starts a new transaction. Entries are made by calling CPersistentHeapManager_uwMakeDataEntryInLog() and CPersistentHeapManager_uwMakeProcedureEntryInLog(), and are terminated via CPersistentHeapManager_uwCloseTransaction(). Entries are not actually made directly in the transaction log, they are first made in a log buffer. Upon termination, the entries that constitute the current transaction in the buffer are flushed to the log. Because the log is created with file system buffering disabled (this is the case for all files used by the system), writes are made directly to the storing device, and CPersistentHeapManager_uwCloseTransaction() will not return until the transaction is resident therein.

Transactional heaps are closed and restored (just as snapshot heaps) via CPersistentHeapManager_uwClose() and CPersistentHeapManager_pcRestoreHeapFromHeapFiles(), respectively. If the heap was closed normally (resulting in congruent heap files), the restoration is as described above – a process of mapping in the backing files views. However, if the restoration is post system failure, a transactional heap will first restore to the last snapshot, and then if need be, to the last closed transaction. If the heap is being restored from the backing file (a failure occurred mid-snapshot), the image in the backing file itself represents the last transaction. On the other hand, if the heap is being restored from the data file (a failure occurred during normal processing), the last transaction will be somewhere in the transaction log. In this instance, the system will process the transactions in the log one by one, invoking data entries and procedure entries until no terminated entries are found. Any entries that are found without a terminator are deemed part of an incomplete transaction that must have been interrupted by a system failure and are therefore ignored.

A transactional heap's log remains a fixed size throughout the heap's life. This is possible by virtue of the automatic snapshot mechanism. Two events can trigger an automatic snapshot: log overflow and an attempt to make a log entry that is over half the log buffer size. Actually, these events only cause an automatic snapshot to be scheduled. The actual snapshot is made upon the next call to CPersistentHeapManager_uwCloseTransaction(). Here, the snapshot acts like sort of an über transaction, compiling all of the little transactions in the log into one large transaction to the heap files.

As is the case with a snapshot heap, a transactional-heap snapshot is comprised of a double flush, first to the backing file and then to the data file. In addition, once the backing file contains a consistent image, the log and its buffer are cleared, opening the way for fresh entries. Because the log is a sparse file, it can be zeroed out logically (and therefore very quickly), versus having to write zeros to the media. The buffer, which is fixed at two segments in size, can also be zeroed out in short order. Note that because the buffer is generally much smaller than the log, it is cleared numerous times as the log is populated, and the system always makes sure to flush any unwritten entries before doing so. If there's a failure before or during the backing file flush, the heap will be restorable from the data file and log. If there's a failure during the remainder of the snapshot, the heap will be restorable to from the backing file.

The advantage of this approach is two-fold: One, from the perspective of the user, it makes the transaction log appear to be of infinite size, and therefore extremely easy to use. The way it works is that when an attempt is made to make an entry that would cause the log to overflow (or not leave room enough for a terminator) the entry is simply ignored and a snapshot is scheduled. In fact, once a snapshot is scheduled, all further entries are ignored; and because no error codes are returned to the user, it seems as though the entries were made without a hitch. But as soon as the transaction is closed, instead of flushing the log buffer, the pending snapshot is made.

Any attempt to make an entry that is over half the buffer size is also ignored and also schedules a snapshot. The logic here is that such an entry must be a data entry, and it would make more sense to write a block of data that size to the heap files immediately via a snapshot instead of the double-write to the log and then at some point downstream, to the heap files. Again, the system returns no error code here, and from the user’s perspective, the entry is accepted.

The second advantage of this mechanism is efficiency. Because the log is a fixed size and generally a fraction of the heap size, it can be placed on a relatively small, high I/O speed (IOS) solid state drive (SSD); while the heap files are placed in cheaper disk-based storage such as RAID that can’t achieve the IOS of SSDs but can sustain relatively high throughput for the large block writes produced during snapshots. In this way, the system can convert a choppy input data stream into a bursty sequential feed (imagine a large period square wave) that leverages the throughput capabilities of a disk array and its interconnect. This is not possible on certain transactional systems where the log continually grows throughout the life of the system or is not readily truncated, and therefore must be placed on low IOS bulk storage.

There is another efficiency gained by this approach that may not be obvious. Imagine a series of transactions that reset the value of the same byte in a heap repeatedly. As these transactions are added to the log, and let’s say that the log is one GB in size, at some point, a snapshot will be scheduled. When the pending snapshot is made, instead of having to stream a GB of transactions to the heap files, only the pages modified by those transactions need to be flushed, which in this case is one page. This allows the system to consolidate a large block of transactions into a potentially smaller block of pages to flush. Now, just the opposite effect could have occurred if each transaction modified a byte on a different page, but this would be highly unlikely. And the principle of locality teaches us otherwise: 80% of the time is spent processing 20% of the data.

The system provides a rollback service that is enabled upon attachment of a rollback buffer. This buffer takes mirror entries (just as the undo buffer) that when invoked will reverse all of the operations constituting the current transaction, thereby moving the heap’s data back to the last consistent state – that which existed when last transaction was closed. A rollback is initiated by calling CPersistentHeapManager_uwRollback(), and if this function fails, the system offers the backup function, CPersistentHeapManager_uwRebuild(). This second call is the brute-force method of rollback where the entire heap is rebuilt from the heap files and log. Because the current transaction hasn’t been terminated, it is ignored during the rebuild process, effectively rolling the heap back to the last transaction.

Unlike the undo service, a rollback can only reverse the current transaction. Remember that transactional heaps are designed for multi-user environments, so the previous transaction may well have been made by another user and is therefore off limits. Thus, the rollback buffer is cleared upon the closure of each transaction, and upon execution of a rollback. During a rollback, entries are automatically made in the transaction log corresponding to each invoked mirror entry. The log always moves forward, with the main reason being that while user data is guaranteed to be returned to its state as of the last transaction, the heap’s internal meta data may differ. So to maintain coherence between the log and the heap as a whole, anytime an operation is made on the heap (regardless of whether it is part of a rollback or not), that operation must be reflected in the log.

Similar to the u-procs that integrate heap operations with the undo buffer such as CPersistentHeapManager_uwFreeEUE(), the transactional system offers l-procs that integrate heap operations with the log and rollback buffer. One such l-proc, for example, is CPersistentHeapManager_pmAllocELE() (the ELE here stands for embedded log entry), which combines a call to CHeapManager_pmAlloc() with the log entries necessary to record the operation and the rollback entries necessary to reverse it. The system also provides r-procs, which are like l-procs, but only make rollback entries. The source code for these procedures can, again, be found in the API guide.

The system uses the one-writer / many-reader model, where any operations that modify heap data and their corresponding log entries must be wrapped within a log mutex (to be provided by the user). To guard against snapshots being made while the heap is being updated, snapshots need to be protected by this mutex as well. Sample pseudo code for this arrangement is as follows:

Writing thread ν

{
    Enter log mutex.
    Wait for object A’s read/write auto-reset event. // Waiting for any reading threads to finish up.

    // Modify object A. For example:

    Make a rollback entry for the data about to be modified. // Data rollback entries are made pre data modification.
    Modify a block of data.
    Log the data operation. // Implicitly starting the transaction. Data log entries are made post data modification.

    Call a procedure.
    Log the procedure.
    Make a rollback entry for the procedure.

    Close the transaction. // A rollback can be performed at anytime before the transaction is closed.

    Set object A’s read/write event. // Allow any pending reading threads or another writing thread in.
    Leave log mutex.
}

Reading thread ν

{
    Enter object A’s thread count critical section.

    If (++ thread count == 1) // Only the first reading thread needs to potentially wait.
        Wait for object A’s read/write auto-reset event.

    Leave object A’s thread count critical section.

    // Once one reading thread has access, all others can bypass the read/write synchronization object.
    // Read object A.


    Enter object A’s thread count critical section.

    If (-- thread count == 0) // Once all of the reading threads are out, open the object to any waiting writing thread.
        Set object A’s auto-reset event.

    Leave object A’s thread count critical section.
}

// The snapshot thread awakens every so often to get a dirty page count, and if the count is over a certain threshold,
// a snapshot is made.


Snapshot thread ν

{
    Enter log mutex. // The heap can be read during the snapshot; it just can’t be modified.

    Snapshot.

    Leave log mutex.
}

This model in conjunction with the system’s persistence engine allow heap transactions to meet the requirements given by the acronym, ACID, which stands for atomic, consistent, isolated, and durable. Atomicity is achieved as an object's updating thread blocks all reading threads until the update is complete. From the perspective of the reading threads, the heap’s data state only changes when each transaction is closed. The rollback mechanism is atomic as well because when invoked, it completely restores the heap’s user data to its state as of the previous transaction.

Consistency is achieved as each transaction moves the heap atomically from one consistent state to the next. The model also ensures that upon closure of each transaction, the heap is completely congruent with the log. This is critical in case the transaction is closed via an automatic snapshot, in which case the heap will need to be in a consistent state prior to exiting the log mutex. Because the application riding on top of a transactional heap determines when transactions are closed, some responsibility for maintaining consistency falls to the application layer.

Isolation is achieved by allowing only one thread to modify the heap at a time, ensuring that transactions are made serially, and by allowing the modifications made by one transaction to be visible to other transactions only after the first transaction either commits or aborts. Concurrency can be achieved by utilizing multiple transactional heaps, although with one stipulation: the heaps must be completely isolated from one another – no transaction can straddle two or more heaps. The high degree of isolation imposed here prevents dirty reads where a read of incomplete data or an abandoned transaction can wind up as part of a second transaction, thereby introducing data inconsistency.

Durability is achieved through use of the transaction log and fail-safe snapshots, which make transactional heaps impervious to system failure, and guarantee that a transactional heap will always restore to its state as of the last closed transaction. Durability is assured because the log entries comprising a transaction are committed to the log’s storing device before the next transaction is started. This applies to snapshots as well, which do not return until both the backing and data files have been updated. No data in a transactional heap are lazily flushed, albeit for the paging to and from the backing file, which is accounted for.

The serialization of heap transactions should not be deemed as a liability because the associated log writes have to be serialized to make it across the device bus, which can become saturated when the log device's latency drops below that of the interconnect. By its very nature, the maximum attainable log feed will be substantially less than the interconnect's bandwidth. The reason is that each log update is non-pipelined (which is to say it must be completed before the subsequent one is initiated) and is thus hindered by non-transfer latencies, which can be significant. For example, recent tests published by Intel of an eight-lane PCI Express implementation with an average payload size of 2,048 bytes showed non-pipelined transfer rates averaging 2.2 GB/s. But when the transfer size averaged 512 bytes, the transfer rate fell to nearly 1.2 GB/s, showing that non-transfer latency factors noticeably for these smaller payloads. Still, that figure is impressive and higher throughputs are possible because PCI Express can scale to 16 lanes, predicting small transaction throughputs of roughly 2.4 GB/s. The challenge for the log's SSD will be to receive transactions at these interconnect speeds (≈ 4.7 million per second at 512 bytes per transaction). Such a unit will definitely have to be directly attached to the interconnect and will have to consist of battery-backed high-speed DRAM.

Other transactional systems concentrate on achieving writing concurrency, which requires intricate structures to ensure transaction isolation. The problem is that as the IOS the of the log device approaches that of the interconnect, any achieved concurrency will show no net performance gains. We prefer the more generalized and flexible approach that guarantees transaction isolation for every conceivable operation, with the focus on getting each transaction into persistent cache as fast as possible. Additional processors are best used to concurrently serve querying threads and to accelerate individual CPU-intensive operations, when they lend themselves to parallelization. Also, available interconnect bandwidth and CPU cycles can be utilized to lazily flush dirty pages to the backing file, thereby reducing the pending snapshot size.

On top of outstanding programmability, extensibility, and flexibility, transactional heaps offer unparalleled raw speed. While the optimal media for the log is of the solid state variety, performance is still very good when the log resides on a disk drive because all writes are sequential and optimized in terms of the number the number of sectors that are written per update. Furthermore, the use of procedure entries can significantly reduce the average transaction size, but this is less of a factor because transaction speed is predominantly a function of I/O speed, not transaction size. To illustrate, the following are the results from transactional heap tests that were run on the same 2.8 GHz PC mentioned above, running Windows XP with the heap files and log being located on a Maxtor 6Y200PO Ultra ATA-133 drive with 8 MB of cache.

Number of log entries: 12,807,351.000000
Number of log transactions: 1,441,968.000000
Average number of log entries per transaction: 8.881855
Average transaction size: 274.521254
Average transaction flushing efficiency: 35.081203
Number of forced snapshots: 0
Number of timed snapshots: 12
Average snapshot size: 72,615,253.333333
Average dirty page bytes per log entry bytes: 0.692030
Number of heap restorations: 16

Average log entry speed (log entries per second): 3,986,259.995762
Average transaction termination speed (transaction terminations per second): 5,357.543433
Average transaction speed (transactions per second): 5,294.343605
Average transaction speed (bytes per second): 1,453,409.847199
Average snapshot speed (bytes per second): 12,787,703.611579
Average sustained throughput (transactions per second): 4,908.287731
Average sustained throughput (bytes per second): 1,347,429.304278

Average heap restoration from backing file speed (bytes per second): 9,978,499.730237
Average heap restoration from data file speed (bytes per second): 1,418,307.172064
Average heap restoration from congruent heap files speed (bytes per second): 1,625,675,118.906685
Average heap compression speed to buffer (bytes per second): 543,986,452.160653
Average heap decompression speed from buffer (bytes per second): 163,992,872.510485
Average heap compression speed to file (bytes per second): 44,198,439.568796
Average heap decompression speed from file (bytes per second): 40,802,358.469421

Notice the exceptional transaction speed of nearly 5,300 per second with an average of approximately nine log entries per transaction. These tests were run without enabling file system buffering for the log, which means that each transaction was written directly to the drive. The drive’s 8-MB write cache, however, was enabled, which allows the drive to signal that a write is complete once the data stream encompassing that write is resident in cache, and not necessarily on disk. In contrast, when the write cache is disabled, streamed bytes must be resident on the actual disk media before the drive can accept the next transaction, which will severely hamper I/O speed. For example, when the test at hand was run with the a disabled write cache, transaction speed was reduced to a dismal average of 116 per second, even with all writes being completely sequential. This, of course, illuminates the principal bottleneck in disk drives – mechanical latency, which is measured by the two quantities: average seek time and rotational latency. Actually, the total random access time includes other second order factors and is given by the following sum: (average seek time) + (average rotational latency) + (transfer time) + (controller overhead).

Based on the performance data for the Maxtor drive, the 116/s-transaction speed is predictable. The published numbers declare an average seek time ≤ 9.4 ms, an average rotational latency = 4.18 ms, and a controller overhead < .3 ms. Assuming that the log occupies contiguous sectors and tracks in the drive, and given that writes to the log are sequential, the seek time can be ignored. With an average of 856 sectors per track (or 438,272 bytes per track) and an average transaction size of 275 bytes, each track holds roughly 1600 transactions, which means that the drive performs relatively few seeks as the log is populated. And when a seek is performed, it is only to the next track over (a track-to-track seek takes .9 ms on this drive). With transfer times in the tens of MBs per second and a controller overhead of only .3 ms, the dominant factor is clearly rotational latency. The average latency of 4.18 ms would predict a transaction speed of nearly 240/s, but interestingly, because each transaction is made next to the one before it, the drive may have to wait one full revolution before the disk head is in the correct position to make the next write, resulting in a worst case latency of 8.36 ms (1/.00836 ≈ 116). There might be ways to optimize the process by spacing writes such that the target sector always leads the disk head by a few degrees, but this is a moot point.

As the test shows, employing the write cache eliminates nearly all of the drive latency. As mentioned, the log buffer is flushed when a transaction is terminated, so it’s actually the termination speed that specifically measures the storage IOS. (Note that the transaction speed is calculated by adding the log entry overhead to the transaction termination speed, but at nearly four million entries per second, this overhead is minimal.) At 5,358 terminations per second, the latency calculates to .187 ms, which can almost entirely be attributed to controller overhead. Throughput is maximized because sequential writes allow a continuous stream bytes to pass through the write head as the platters spin underneath. The average transaction flushing efficiency of 35% shows the actual amount of transaction data written to disk as a percentage of each log buffer flush. Because these flushes bypass the file system cache, they are aligned along sector boundaries and are sector multiples in size. Given a disk sector size of 512 and an average transaction size of 275, a 35% flushing efficiency is in line with expectations.

The extra bytes written per flush only trivially affect IOS. The transaction throughput at 1.45 MB/s is well below the drive’s maximum sustained transfer rate of 37 to 67 MB/s, which rate depending on where the write head is with respect to the disk’s outer diameter where throughput is maximized. To gain further insight, the same test was run with a sector size of 1,024, which lowered the flushing efficiency to 21.2%. The resulting speed of 5,317 transaction terminations per second reinforces the understanding that for these relatively small transaction sizes, latency (specifically, controller latency) is the predominant factor limiting IOS.

Another metric that provides insight is the average number of dirty page bytes per log entry bytes (DPPL), which turned out to be around .7 for this test. The DPPL is an indicator of the locality of heap operations where the lower the ratio, the higher the locality. Recall the extreme example above where the same byte is repeatedly reset, which would rapidly populate the log while dirtying only one page. This would yield a DPPL approaching zero. The test referenced here, because it continually churns the memory by allocating, resizing, and freeing blocks of random sizes, renders a higher DPPL than would be expected by a real world system.

Notice that the average compression and decompression speeds of 44.2 and 40.8 MB/s fall within the drive’s range of maximum sustained transfer rates. Snapshots, which entail flushes to two files instead of one, should theoretically be half the speed of compressions but fall short at about 57% of this mark. The reason why is that snapshots are not completely sequential as are compressions and decompressions. Because the snapshot algorithm does coalesce neighboring dirty pages into single block writes, snapshots are largely sequential (and with a smaller DPPL ratio, even more so), but even a small amount of sparseness in a series of disk writes can induce latency overhead. On the other hand, any sparseness in the snapshot could’ve been mitigated had the drive used a cache larger than 8 MB, which was completely overwhelmed by the average snapshot size of 72.6 MB.

Nevertheless, an average snapshot speed of 12.8 MB/s is respectable, and by no means is the intended storage a $50.00 ATA drive. One would expect even a modestly priced system to connect to U320 SCSI or Fibre Channel RAID, where snapshot speeds are measured in hundreds of MB per second (and eventually, thousands per second for 10-Gb FC). To avoid situations where a pending snapshot becomes so large so as to noticeably affect system availability, the system provides a function to retrieve the current dirty page count, which along with the other mentioned automatic triggers, could force a snapshot merely based on the dirty page tally breaching a certain threshold. Note that snapshot speed can be increased by lazily and continuously writing dirty pages to the backing file such that when each snapshot occurs, most or all of the flush is directed towards the data file, thereby reducing the total number of bytes flushed per snapshot. Most operating systems offer a way of tuning mapped-file flushing threads.

While the transaction termination speed shows the rate at which transactions are written to the log, they only partially indicate the sustained throughput capacity of a transactional heap. The reason is that the log, being of fixed size, can’t continue to accept entries ad infinitum. At some point, a snapshot will be needed to clear the log and flush the pages that were dirtied in the process of populating it. Because all writing threads must be suspended during a snapshot (recall that all snapshots must be made inside the log mutex), snapshot time must factor into any calculation of net throughput. In equation two below, net throughput, TP in bytes per unit time is defined as the number of transaction bytes, Tb in the log (at the time of a snapshot) divided by the sum: average transaction time, Tt plus average snapshot time, St.

The number of bytes flushed during a snapshot is equal to (DPPL ● Tb), and dividing this product by the snapshot speed, Ss yields the snapshot time. Equation three substitutes this quantity for St.

Dividing the numerator and denominator by Tb and substituting transaction speed, Ts for the quotient (Tb / Tt) renders the final throughput equation below.

This equation provides good insight into transactional heap performance. As expected, maximizing transaction and snapshot speeds are key, but what may have been unclear is the extent to which DPPL plays into the equation. Clearly, the smaller the DPPL the better. Two factors combine to reduce DPPL: locality and time. If modifications are relegated to relatively few pages, the DPPL, by definition, will continue to shrink as entries are added to the log. But allowing the dirty page count to grow beyond a certain point means that any upcoming snapshot may be large enough to noticeably delay any writing threads. Most applications will want to keep snapshots under a second, so if a 100 MB/s snapshot can be achieved with a decent RAID system, a snapshot trigger might be set at 50 MB. If the DPPL for a particular application averaged .2, at least a 250-MB log would be needed to avoid automatic snapshots. Notice how the DPPL determines the minimum log size, with the lower the DPPL, the larger the requisite log. Imagine that the log storage is able to achieve a transactional throughput of ten MB/s. In this case, the heap could sustain a net throughput of 9.8 MB/s (1 / (1 / 10 + .2 / 100)) with snapshots occurring every 25 seconds.

In the test results above, the sustained throughput of 1.34 MB/s is only slightly below the log throughput of 1.44 MB/s, indicating that on the hardware used for this test, transaction speed has a long way to go before snapshot speed factors appreciably in the equation. A much higher transaction speed is possible when an SSD is used to store the log. The typical  SSD is capable of an IOS in the 100,000 per second range and even higher for bus-attached units. To simulate the use of an SSD, the same test was run with file buffering enabled for the log. The resulting transaction speed wound up at over ten times the non-buffered figure at 56,639 transactions/s, yielding a net throughput of 8.5 MB/s at 30,972 transactions/s. To put this figure in perspective, this equates to over 1.8 million transactions per minute. At the time of this writing, the world record transaction speed was around one million per minute.

There’s no way to build a completely reliable system, and by system we are referring to the combination of hardware and software. There are just too many modes of failure from power outages, to actual circuitry failures, to intentional destruction (as in the case of military systems), to software bugs. The beauty of transactional systems, in general, is their ability to recover from failure, but some do not do this very gracefully and are not very easy to administer. Transactional heaps, on the other hand, are highly adept at handling system failures and are extremely easy to administer. Merely pass in handles to the two heap files and the log, and the restoration function, CPersistentHeapManager_pcRestoreHeapFromHeapFiles() does the rest. If a failure did occur, this function will fill out a CHeapFileParameters structure with details about the failure and recovery.

After the heap is recovered, the user can call CHeapManager_uwVerifyHeap(), which quickly performs a top-down check of all of the heap’s internal structures. This function works on volatile heaps as well. The restoration function performs many checks as well during the restoration process such as verifying checksums as log entries are invoked. Restorations from failure are also very fast. As the results above show, restorations from the backing and data files averaged 10.0 and 1.4 MB per second, respectively.

The Taiyen system was devised with two fundamental principles in mind. One, there are two basic data types: volatile and persistent. And two, there are two basic application types: single-user and multi-user. The outcome was an uncomplicated software stack where volatile data is managed via volatile heaps, single-user persistent data is managed via snapshot heaps, and multi-user persistent data is managed via transactional heaps. It’s a powerful model that can service a wide variety of applications. And by no means is the performance for one application type sacrificed by supporting the other. In other words, the system’s volatile allocation speed could not have been increased by foregoing persistent heap support, nor could’ve its transactional performance been improved by omitting snapshot heaps. Ultimately, the system’s generalized approach to memory management is one of its key strengths.

First and foremost, the system is a performance component. But it’s almost a misnomer to say that software is fast. It’s really hardware that is fast. CPUs can be made faster through the brute-force increasing of clock frequency and through more clever optimizations like pipelining, where multiple instructions are overlapped in execution. Disk drive transfer rates can be improved by increasing the areal density of the platters and rotation speed. In software, there’s no clock to boost and there’s no motor to spin faster. The hardware determines the potential performance. It is incumbent upon the software to realize this potential, and somewhere in the myriad of instruction combinations is an optimal solution. But there are general principles that can guide the way: First-order optimizations focus on minimizing algorithm time and memory complexity. Second-order optimizations maximize the amount of work done per clock cycle, and focus on good coding techniques such as the minimization of branching. The continual divergence between CPU speed and storage speed (both volatile and persistent) makes locality increasingly vital so as to maximize cache and page hits. Multi-threaded applications demand the minimization of thread contention and context switching. And disk access needs to be sequential as possible.

The Taiyen system leverages these principles to achieve exceptional performance. The extent to which overall application performance will improve through use of the system will depend on a variety of factors such as the allocation type, how memory-intensive the application is, how allocation-intensive the application is, whether access will be single-user or multi-user, and so on. Server-based applications will potentially see the largest gains, especially transactional systems, where the main bottleneck is the interaction with persistent storage. A primary aim of the system was to deliver enterprise-level transactional performance on modest hardware, which, as demonstrated, was achieved. The system, unmatched in terms of transaction / (second ● dollar), promises to lower IT infrastructure costs, operating costs, and increase the accessibility of transactional systems to an ever wider audience. The economic feasibility of many nascent technologies such as micro-payments and RFID depends on the ability to sustain low per-transaction costs. But beyond performance, transactional heaps are more adaptable and extensible than other transactional systems, which typically shoehorn the developer into a particular data model. In fact, these other object and relational systems could easily be built on top of transactional heaps.

With respect to desktop applications, the Taiyen system and specifically, snapshot heaps promise efficient and fail-safe saves, and ultra-fast document loads. Moreover, they promise to make the development of complex data-intensive programs a significantly easier task than it is otherwise. For one thing, the programmer doesn’t have to necessarily invent a file format. He has only to worry about the memory structures needed to run the application. To add persistence, he simply allocates those exact structures in a snapshot heap. PTE redirection allows the manipulation of large dynamic structures without having to worry about relocation latency. Undo and redo is easily incorporated via the built-in undo/redo service. Not having to mess with a file format makes the application more extensible allowing the creation of ever more complex and rich content. In lieu of a snapshot heap (or even in conjunction there with), persistence could also be gained via heap compression. With compression and decompression speeds in the 40 MB/s range on a typical PC, even relatively large documents could typically be saved and loaded in sub-second times.

The 64-bit desktop is already here and will soon be pervasive. The greatly expanded addressability of 64-bit computing will yield ever more powerful applications, especially in the categories of CAD/CAE, financial systems, video editing, and scientific computing. The data sets handled by these applications will grow from the megabyte range to the gigabyte range, and it will be critical to handle this data as efficiently as possible. From its inception, the Taiyen system was originally designed for 64 bits and as such, would make an ideal platform for this new breed of application.

The capabilities outlined here really only touch the surface of the system’s complete feature set, which includes over 80 API calls. For more information, download our free white papers, API guide, and demo SDK. The demo SDK contains the source code for tester used to generate all the performance results cited herein as well as raw test results. It also contains a set of small sample programs that run on both Unix and Windows and series of extensive user guides that function as a tutorial for the system. The demo SDK contains a complete set of non-optimized debug libraries for each platform that we currently support. There is no time limit on the SDK. The only limits are the number of CPUs to which the volatile heap manager will scale and how many allocations can be made during a particular session, which is currently set to 100K.

  • IA32(x86)

    1. Linux. The system has been verified to run on the following distributions:
    RedHat Linux 9.0.

    2. Windows 2000 and later.

  • IA64(Itanium I/II)
    The system has been optimized for Itanium II processors.

    1. Windows XP/2003 and later.

  • USPARC64

    1. Solaris 5.9 and later.

Builds for additional Unix platforms are forthcoming as well as the 64-bit AMD Opteron version of Windows. Please call or email if builds for a specific platform are needed, and we will try to expedite their development.

Small entity (< 100 employees):
Large entity (>= 100 employees):
Introductory price! (for a limited time) $799.95 $349.95 per seat.
Please call (415.567.5850).

All distributions are royalty free and include all of the Unix and Windows libraries (32 and 64-bit), documentation and sample code available at the time of order, plus an account through which to download the latest Taiyen SDK for a period of one year. All retail libraries are fully functional and allow full allocator scalability (to 32 CPUs on 32-bit platforms and 64 CPUs on 64-bit platforms). In other words, no libraries (other than those in the demo SDK) contain any artificial limitations. Each seat is authorized one year of tech support via email; phone support packages are also available.

The Taiyen system is the clear value choice when compared with competing systems, which cost many times more and lack Taiyen's innovative and powerful feature set.

Site licenses are also available and can offer savings for multi-seat purchases.

Terms of use: Demo SDK, Retail SDK