|
|
|
|
|
The Taiyen system is primarily a two-layer component
comprised of volatile and persistent heap managers that allows the creation
and management of four heap types: volatile, archive, snapshot, and transactional.
The system offers productivity and performance features that simply aren’t
found in or matched by other similar components. Originally, it was conceived
as a tool to be used internally to help us build end-user products, but
we felt that its feature set might be interesting to other developers; and
moreover, that eventually other similar products would find their way into
the market place anyway. So we decided to distribute the system.
We have builds for 32 and 64-bit Windows (on Itanium) and 32 and 64-bit
Unix (currently, we support Linux and Solaris, and we will be adding other
Unix flavors over the coming months). The system comes in the form of static
libraries, so the only functions linked into the application are the ones
that are used, and there’s no need to worry about having to distribute yet
another DLL. The release library’s small footprint of ~(200 KB – 500 KB),
depending on the platform, lessens any advantage that would be derived by
using a shared library, especially given the small odds that multiple processes
running at a time will be running the same Taiyen code. Furthermore, avoiding
the DLL route foregoes the versioning problems associated with DLLs and
provides more flexibility in terms of the library interface. All functions
in the API have C linkage with cdecl as the calling convention. We also
have fastcall versions of our 32-bit Windows libraries, which use faster
register-based parameter passing.
|
|
|
|
|
|
|
Unlike the case with competing systems, all Taiyen heaps
are self-contained contiguous structures that are created at explicit addresses
in the process, similar to the way private heaps are created via Windows’
HeapCreate(). To this end, the system provides functions to find available
address space at both the low and high ends of the process. The Taiyen system
works independently from the CRT and handles the allocation of all block
sizes, even the largest multi-page blocks, internally. Other systems are
often built on top of the CRT’s heap and pass the allocation and resizing
of large blocks off to the virtual memory manager (VMM) (as is the case
for a Windows private heap, which hands allocation requests greater than
0x7FFF8 bytes in size to the VMM). This approach, while easier to implement,
simply couldn’t have been possible, given the complexity and breadth of
features that we wanted to provide. We needed our heaps to be stand-alone
structures wherein all internal control structures, meta data, and user
data had to reside in the same contiguous span.
This yielded several advantages: One, this allowed us to precisely control
the alignment of internal structures optimally with respect to cache lines
and pages, thereby maximizing locality. This is critical to prevent different
threads running on different processors from modifying the same cache line
at the same time, a phenomenon called false sharing, which severely
degrades performance as the cache coherency mechanism goes into overdrive.
Two, this allowed us to leverage the same volatile heap code to create and
manage persistent heaps, which definitely need to be self-contained objects.
Sharing the same code reduces the code footprint, improving code locality
(and therefore, at the margin, speed); and more importantly, reliability
(less code needs to be tested). Three, this allowed us to implement PTE
redirection, a feature to be covered below. Four, this allowed the ability
to create multiple volatile and persistent heaps within a process. The capacity
to create multiple volatile heaps was needed to provide system scalability
on NUMA platforms where each node could be assigned its own heap. And five,
having all user blocks reside within the heap space, allows the quick resolution
of a block to its respective heap.
|
|
|
|
|
|
|
A critical feature of our allocator is its faculty to
maximize availability while still maintaining exceptional speed. We define
availability as a measurement of a heap's ability to accommodate allocation
requests as a percentage of its ability were all of its free blocks coalesced
together, and is given by equation one below:
For a series of N free blocks that are available for allocation, ordered
from the largest to the smallest, (F1, F2, F3, …FN), and O is the total
number of free blocks in the heap, the heap’s availability, A is as follows:
|
|
|
|
|
|
|
To illustrate the concept of availability, consider the
four heaps in the figure below: |
|
|
|
|
|
|
As evidenced by the figure, higher availability indicates
increased allocation potential within a given space. Availability may be
thought of as the inverse of fragmentation, which occurs because blocks
can be freed in random order, as opposed to blocks being deallocated on
the stack, for instance. While there is no way for the heap manager to control
the order in which blocks are freed, it can control where new blocks are
directed; which, of course, is what segregates one allocation algorithm
from another. A variety of strategies have been devised to tackle fragmentation
from aggregation strategies, where blocks of similar size are grouped together
in mini heaps (which can be very fast but lead to low availability)
to best-fit strategies, where the free block offering the best fit
is targeted for allocation (which can be slower but yields very good availability).
This second strategy, for example, would place a block one unit is size
in the right most block of heap three.
The Taiyen’s allocator employs what we call a modified-best-fit strategy,
which sacrifices a small amount of availability (only a couple of percent)
to achieve superior speed. One key attribute of the algorithm is its robustness,
which is its ability to handle a wide range of block sizes as well as a
dynamically varying average block size. This is in contrast to many specialty
allocators such as Hoard that are designed for a specific block range (usually
10s – 100s of bytes). The allocator is also a coalescing allocator, which
means that every time a block is deleted or moved, any created free blocks
are coalesced with neighboring free blocks. Some allocators avoid coalescing
because it can be computationally expensive, but such a practice can lead
to a substantial loss of availability.
|
|
|
|
|
|
|
The Taiyen system is fully scalable on SMP machines,
and offers complete protection against the false sharing of cache lines,
for both internal structures (as mentioned) and user allocated blocks. The
system is also scalable on NUMA platforms by assigning each node its own
heap. The figure below shows Taiyen performance relative to stock CRT allocators.
These tests (as well as other tests referenced herein) were performed on
three platforms: an Intel S450NX with four 550 MHz Xeon processors running
Windows 2003 Server; an HP i2000 workstation with two 800 MHz Itanium I
processors running Windows 2003 Server; and a Sun e4500 with eight 336 MHz
Ultrasparc II processors running Solaris 9. The source code (TestMemoryManager.cpp)
for the application used to perform these tests is available in the SDKs.
|
|
|
|
|
|
|
The figure shows the Taiyen system’s strong single-threaded
performance and nearly linear scalability. The stock allocators don’t scale
because thread access is serialized. Compounding the problem, as multiple
threads attempt to engage the heap, they are blocked, resulting in those
threads being immediately put into wait states, which then results in a
series of expensive context switches as the thread scheduler shortcuts each
thread’s time-slice. Worse yet, the clock cycles consumed in response to
any cache misses will have been wasted as well.
Much work in the area of heap management has focused on this particular
deficiency, but there are many other areas where there is room for improvement,
as we argue below. Two specialty allocators specifically designed to provide
scalability against which we have tested are Hoard (an allocator developed
at the Universities of Texas and Massachusetts) and mtmalloc (Sun Microsystem’s
multithreaded offering).
|
|
|
|
|
|
|
|
|
|
While Hoard did show strong performance for small block
sizes, it didn’t prove to be very robust – it was only able to complete
five of 36 simulations of our testing regimen (which is described in detail
in our white paper) before it consumed too much paging file space too continue;
and on occasion, the allocator locked up. mtmalloc is a power-of-two allocator
that sacrifices availability and memory efficiency for speed. After completing
11 of 46 simulations, it failed after consuming too much disk space. Both
of these allocators required special simulations to derive the performance
data shown above.
|
|
|
|
|
|
|
The Taiyen system aligns all allocated blocks optimally
with respect to cache lines and pages. This doesn’t mean that every block
will be allocated along a cache line boundary, just that no block will touch
extra cache lines. For example, a 16-byte block will never straddle a cache
line, or a page boundary for that matter. Blocks, however, that are greater
than equal to a cache line in size will be aligned along cache boundaries,
and likewise, blocks that are greater than or equal to a page in size will
be aligned along page boundaries. There are rare cases where the system
may position a sub-page block such that it straddles a page boundary, but
this should not be considered a liability, as page alignment is significantly
more critical for multi-page blocks.
The aim of this strategy is to increase block access speed by maximizing
cache, TLB, and page hit ratios in turn by maximizing locality. The strict
alignment policy will also prevent cache unit splits where accessing a cache
line's worth of data causes two cache line loads. And the outcome can be
significant, especially on platforms where the ratio of processor speed
to main memory access speed is high. One such machine where such locality
yields noticeable performance gains is the Itanium, which has a main memory
latency of 176+ clocks, an L3 cache latency of 22 clocks, an L2 latency
of 6 clocks, and an L1 latency of 2 clocks. The figure below shows the difference
in block access speed yielded by the Taiyen system versus that of the stock
allocators on the three platforms used for testing, with an average block
size of 33 bytes on the 32-bit platform and 43 bytes on the 64-bit platforms,
and with a cache line size of 64-bytes on all three machines. On Itanium,
the Taiyen system yielded nearly a 50% improvement in access speed over
the CRT, which allowed over 99% of the allocated blocks to straddle cache
lines.
|
|
|
|
|
|
|
Note that these alignment rules don’t necessarily allow
the Taiyen allocator to be substitute for the CRT’s aligned_malloc(size_t
size, size_t alignment), which forces a block allocated thereby
to be aligned along the power-of-two boundary, alignment. For example,
the Taiyen allocator will not guarantee that a 512-byte block will be aligned
along a 512-byte boundary. It will guarantee, however, that it will touch
the minimum number of cache lines. The allocator does, on the other hand,
guarantee power-of-two alignments for seven critical granularities: the
system word, the double-word, the quad-word, the cache line, double the
cache line, the page, and the segment (with a segment being defined as 16
pages). Here, a word refers to a system word (i.e. four bytes on a 32-bit
system and eight bytes on a 64-bit system).
|
|
| |
|
|
|
|
Another significant difference between the Taiyen system
and competing technologies is its native support for dynamic blocks, which
as the name implies, are designed to contain structures such as dynamic
arrays that need to be continually resized over time. To facilitate this,
dynamic blocks are given built-in buffers to reduce the frequency of block
relocations. When these blocks are page multiples in size, the buffers only
consume virtual address space. To accelerate the relocation of multi-page
dynamic blocks, the system employs a proprietary technique that we call
page table entry (PTE) redirection, which allows a block to be moved
in virtual address space while keeping its data stationary in physical storage.
Such relocations based on redirection can be made up to two orders of magnitude
faster than the byte-by-byte method used by memmove(), for example. Another
important virtue of redirection is that it allows block data that have been
paged to disk to be relocated without having to read that data it into main
memory beforehand.
|
|
|
|
|
|
|
|
|
|
The two charts above show Taiyen performance versus stock
allocator performance with respect to dynamic blocks, with the second chart
focusing on multi-page block resizing speed. In the tests highlighted by
this second chart, 76.4% of the operations were block resizings, with the
average increase in block size being 141.07% and the average decrease, 49.94%.
What’s notable is that even with Taiyen’s use of expansion buffers, with
a total of 428,748 relocations, it relocated blocks 97,279 more times than
did Windows; 68.84% of the Taiyen resizings required a relocation. This
speaks to the power of redirection, which as the chart shows, can produce
exceptional relocation speeds.
Another key difference between the way Taiyen and the stock allocators handle
dynamic blocks is that the stock allocators always shrink blocks
in place. We regard this as a significant deficiency, because it can lead
to the worst kind of fragmentation – small blocks spread evenly across the
address space. In contrast, the Taiyen system views every block resizing
as an opportunity to reduce heap fragmentation, especially when blocks are
contracting; only 45% of the blocks referenced in the lower chart were contracted
in place. This, of course, can only be practicable with redirection.
|
|
|
|
|
|
|
We measure page efficiency via a spy device that scans
the PTEs of the heap’s pages for dirty bits. Efficiency is then calculated
by dividing the number of allocated bytes by the number of dirty page bytes.
The result factors in the overhead of the heap’s meta data as well as how
efficiently the allocated blocks are laid out in the heap space, with the
more compact layout yielding the better efficiency. A principal aim of the
system was to maintain good page efficiency while maximizing speed, and
the charts below show that the Taiyen system compares well with the Windows
CRT, which is a highly efficient allocator. The page efficiencies shown
in the charts below are the products of heavily thrashed systems, running
a worst case scenario where millions of blocks of varying sizes are repeatedly
allocated, freed, and resized.
|
|
|
|
|
|
|
|
|
|
|
|
|
The system offers a plethora of debugging features from
bounds checking, which is automatic for debug builds, to multi-threaded
leak detection. Reliability is the most important feature of all and the
system contains copious checking to catch both internal errors and user
errors such as the freeing of a dangling pointer.
|
|
|
|
|
|
|
The system offers a means to compress and decompress
heaps to and from buffers and files. Included are sub-features that allow
heaps to be resized, relocated, and even defragmented upon decompression;
and we supply ancillary functions and macros to help resolve pointer references
to any moved blocks. Both volatile and persistent heaps may be compressed,
and volatile heaps may optionally be made persistent during decompression
(or vice versa). The compression routines are highly optimized and file
compression/decompression only requires a small 16-page buffer. The following
are compression data from a run of the mill 2.8 GHz PC running Windows XP
using a Maxtor 6Y200PO ultra ATA-133 drive with 8 MB of cache:
Average heap compression speed to buffer (bytes per second): 543,986,452.160653
Average heap decompression speed from buffer (bytes per second): 163,992,872.510485
Average heap compression speed to file (bytes per second): 44,198,439.568796
Average heap decompression speed from file (bytes per second): 40,802,358.469421
|
|
|
|
|
|
|
From its inception, a primary goal of our system was
heap persistence. In principle, any persistent heap should be based on solid
heap algorithms, so working backwards, we developed the volatile heap manager
first to provide the core functionality that persistent heaps would use
to manage their data. As mentioned, most of the volatile heap code base
is reused by the persistent system. In fact, leveraging existing highly
optimized components is what makes our system so lightweight and yet, powerful.
Persistent heaps incorporate six technologies: one, volatile heap logic
to manage the efficient apportioning of virtual address space; two, memory
mapped files; three, the virtual memory manager, which excels at keeping
the most frequently utilized pages in memory; four, exception handling (signal
handling on Unix) to keep track of written pages; five, sparse files to
minimize disk consumption; and six, low-level file I/O for high-speed transactions.
On top of the code needed to integrate these components are routines that
provide fail-safe snapshots, multiple undo/redo, transactions, rollback,
etc.
Handling persistent data has always been messy. Given a file, the first
task is to get the needed data into RAM where it can be read and manipulated,
and there are a variety or ways to accomplish this. The most basic way is
to simply read the entire file into RAM in one shot, operate on the data,
and then write the data back to disk, again, in one shot. This, of course,
isn’t very efficient. What if the file is a large CAD drawing, and only
one entity needs to be changed, or the file just needs to be quickly viewed?
What if the file needs to be parsed as it’s read? Can it be loaded into
a giant buffer and then parsed, or will the buffer need to be so large that
the system will start paging at the same time the file is being read, with
the parsed data being loaded into a heap, and with that data immediately
being paged back to the same disk its being read from? This can be addressed
by using a small fixed-sized buffer, but this can often call for sophisticated
code to handle cases where structures that need to be read as a whole either
won’t fit in the buffer or will straddle sequential buffer reads; and of
course, all of this code needs to be fault tolerant. Even with efficient
buffering, the entire file is still being read and written at a time.
Once the data is in RAM, how should it be periodically saved? The brute
force method of writing the whole file to disk with every save could be
workable for moderately sized data sets but is surely not the most elegant
solution and certainly not practical for larger files. The more efficient
approach would obviously be to keep track of unsaved data and write only
that information to disk with each save, but this may not be easy when dealing
with complex data structures. And how should the stream be reconciled with
the existing file? Should each save append unsaved data to the end of the
file, causing the file to grow even though existing data may have only been
modified versus new data being added? And this approach would still require
that, at some point, these appended saves all be reconciled together. Or
should each save be reconciled on the fly, which could require some rather
intricate code. On top of all of this, again, what happens when there is
a system failure during a disk write? Will the file be found in a consistent
state, or will it contain transient data, making portions of it or even
all of it indecipherable? Imagine the case where the data is in the form
of a linked list and the file is found with one of the links broken.
Fortunately, most all modern operating systems offer a superior means of
accessing files over standard file I/O in the form of memory mapping. The
memory mapped file is a powerful construct that addresses a few of the problems
noted above by using the VMM to handle the loading and saving of files at
page granularities. When a file view is initially mapped, the data is only
loaded virtually. Actual disk reads occur on demand only as pages are touched,
but from the perspective of the user, the data may always be assumed to
be completely present in memory. This is the beauty of virtual memory; it
relieves the user from having to worry about where his data lies (RAM or
disk).
Imagine using mapping to load in a large CAD file. The drawing loads almost
instantaneously because only the data need to paint the current view of
the drawing is actually read from disk. Under the surface, the processor
issues a page fault for each page that is touched during the initial screen
paint. Each fault, however, is trapped by the VMM, which goes to work reading
the accessed pages from disk.
Updates are also efficient with respect to mapped files. Essentially, a
mapped file is a named paging file. Understand that every page committed
in virtual memory will be backed by some sort of file. It can either be
named or anonymous, but the option to have no backing file doesn't exist.
Transparently to the user, the VMM continually moves pages to and from the
mapped file based on the demands of the system, just as it does with the
operating system’s paging files. Because the VMM keeps track of dirty pages,
file saves (or flushes) are highly efficient. Only those pages modified
since the last flush are written to disk, and of course, already paged-out
data are ignored. The user needn’t keep track of which pages are present
in memory, which pages have been modified, etc.
If memory mapped files can do all of this, you may be asking why a persistent
heap is even necessary. First off, a mapping is just a span of virtual address,
and while the VMM does a great job of managing virtual address space, it
only does so at the page level. Clearly, some other component must reside
on top of the VMM to organize this address space, that is: allow
it to be subdivided into blocks of varying size, both sub-page and multi-page.
This component is what’s commonly termed a heap. But heaps typically extend
the VMM layer in more ways than this through features like bounds checking,
etc.
The Taiyen system allows the creation of persistent heaps through its persistent
heap manager. This component inherits most of the functionality of the volatile
heap manager on which it’s based, which includes features such as: high
availability, optimal block alignment, native dynamic block support, PTE
redirection, good page efficiency, bounds checking, leak detection, and
compression/decompression.
The system allows for the creation of the three persistent heap types: archive,
snapshot, and transactional. Archive heaps are designed for information
retrieval. Snapshot heaps are designed for single-user desktop systems,
whereas transactional heaps are designed for multi-user server systems.
The one feature that is not inherited from volatile heaps is SMP scalability.
The reason for this is that scalability would primarily be an issue for
a server-based transactional heap, which can’t scale because it uses a transaction
log that must be serialized, effectively serializing access to the heap,
a subject to be expanded upon below. Scalability can be achieved, however,
by creating multiple persistent heaps, with the stipulation that they operate
independently from one another. Mentioning details such as the transaction
log highlights some important ways in which persistent heaps greatly extend
memory mapped files. At this point, these features will be explored in detail.
|
|
|
|
|
|
|
|
|
|
One weakness of the memory mapped file is that a system
failure during a file flush can leave the file in an inconsistent state.
The Taiyen system addresses this shortcoming by providing fail-safe snapshots.
A snapshot heap is supported by two files: a backing file and a data file,
which are collectively termed heap files. The backing file is mapped
and the data file is completely isolated. If a failure occurs at any point,
the heap will be restorable to its state as of the last successful snapshot
from one of these two files.
A snapshot is essentially a double file flush, first to the backing file
and secondly to the data file. The double flush is necessary to always maintain
a consistent image of the heap. While the backing file is being written,
that consistent image is located on the data file and vice versa. And because
the backing file is constantly being updated by the VMM as it flushes dirty
pages to it in the background, it can’t be relied upon to have a consistent
image between snapshots. The only time it does is mid-snapshot, after all
of the dirty pages in the heap have been explicitly and completely flushed.
This, of course, is predicated by the restriction that during a snapshot,
no pages in the heap be modified. Once the backing file contains a consistent
image, those pages that were modified since the last snapshot are now written
to the data file. If a failure were to occur during this data file update,
the heap would be restorable from the backing file, and if a failure were
to occur at any other time, the heap would be restorable from the data file.
To know which pages to flush to the data file during a snapshot, the system
must keep track of dirty pages. The VMM does this under the surface by setting
the modified bit in the PTE for each page that is dirtied. But the VMM clears
this bit when it reads the page back into memory, which means that the PTEs
representing the heap space can not be used to indicate which pages have
been modified since the last snapshot. Furthermore, any scheme that was
based on scanning PTEs, would require a device driver to gain access to
these kernel level structures. As noted above, we did write such a driver
for 32-bit Windows to measure page efficiency, but this was only meant to
be a testing tool.
To remedy this, the Taiyen system maintains its own structures to keep track
of dirty pages, which include both dirty segment and dirty page bit arrays,
with a segment being defined as 16 pages. The system sets page protections
that cause the CPU to issue an access violation (or SIGSEGV) when a particular
page is modified, which is immediately trapped by the system’s exception
handler, allowing the bit arrays to be updated. Including a segment array
allows upwards of 8 MB of heap space to be scanned per clock cycle on 64-bit
systems (64 bits * 131,072 bytes/bit = 8 MB). This comes in handy for multi-terabyte
heaps where only a handful of pages at the end of the heap, opposite that
where the scan is started, are dirty, making the page scan a significant
portion of the overall snapshot time. Bit scanning is a critical function
in the Taiyen system and most of these routines as well as other time critical
functions have been hand-coded in assembly to maximize performance.
The system grants access to the dirty page bit array so that the user can
monitor pages that have been modified. Those of you familiar with the Windows
API may notice the similarity between this capability and the Windows function,
GetWriteWatch(), which unfortunately, is not supported by NT, 2000, or later.
When a snapshot heap is created or restored without a data file, it will
be configured for archival use. Functionally, an archive heap operates identically
to a snapshot heap; it can even make snapshots, but they won’t be fail-safe.
If a system failure occurs on an archive heap that has been modified, the
heap may not be consistent and/or recoverable. A data file can be attached
to an archive heap at any time to enable durable snapshots. And a transaction
log may be attached as well to enable transactions.
|
|
|
|
|
|
|
Other technologies (beyond basic file I/O) conjured to
facilitate the management of persistent data include: archives, object databases,
and relational databases, with the latter providing a significant abstraction
away from the raw data. While persistent heaps can’t be viewed as direct
substitutes for relational databases, they are a compelling alternative
to archives and object databases, and in fact, were originally conceived
as such. One problem with archives like Microsoft’s CArchive (a class within
the MFC framework) and object databases is that they impose strict semantic
requirements. Persistent heaps, on the other hand, are data semantics and
organization agnostic, which makes them easier to use and more extensible.
This is sometimes referred to as orthogonal persistence, which allows the
programmer to treat structures the same way regardless of whether they exist
in volatile or persistent domains.
For example, to gain persistence in MFC, all objects must be derived from
the CObject base class, and with respect to object databases the base class
is generally, d_Object. Firstly, as soon as the word derived is mentioned,
the application must be coded in an object oriented language. While this
is certainly allowable for persistent heap utilization, it is not necessary,
as the Taiyen libraries provide C linkage, allowing their use in procedural
code. Moreover, no objects allocated in a persistent heap need to be derived
from any base class. Instead of segregating objects by type, they are segregated
by location, allowing the persistence means to be less meddlesome. Quite
simply, if a particular object is allocated in a volatile heap, it is volatile.
If the same object is allocated in a persistent heap, it is persistent.
Essentially, persistent heaps are used just as volatile heaps. In fact,
both types use the same core function calls: CHeapManager_pmAlloc(), CHeapManager_pmSetDynBlockSize(),
and CHeapManager_Free(). The system can tell what type of heap is being
used from the heap address passed to these calls. There are, however, similar
persistent heap functions that combine these calls with other actions such
as making undo buffer entries. For example, CPersistentHeapManager_pmAllocEUE()
allocates a block and makes the undo entries in the undo buffer (if attached)
necessary to deallocate and possibly reallocate the block during an undo/redo
cycle. Undo/redo will be covered in more detail below.
Another cumbersome facet of archive and object databases is that objects
stored thereby need to be individually serialized. Explicitly loading objects
and saving objects can be a tremendous programming burden, akin to the problems
associated with basic file I/O described above. The code can get quite convoluted,
especially if mechanisms have to be built to keep track of modified objects.
In contrast, no objects in a persistent heap need to be individually serialized.
When a persistent heap is restored, all objects therein are loaded at once,
but only virtually. Only as the objects are referenced in the application,
are they physically read from disk, courtesy of the VMM. And saving objects
is just as easy. There’s no need to develop an algorithm to keep track of
modified objects. Simply perform a snapshot, and any objects that have been
modified since the last snapshot are flushed to the backing and data files,
automatically and transparently to the user.
Of course, because the grunt work is being handled by the VMM, all of this
is occurring at the page level; but managing data at the page level is preferred
over the object level. For instance, if only a small portion of a multi-page
object is being accessed frequently, the dormant part of that object can
be paged to disk, allowing the consumed memory to be put to better use somewhere
else. It also enables the system to make good use of I/O bandwidth by moving
data a page at a time, although most VMMs will take this further and read
and write multiple contiguous pages at once. The VMM’s job can certainly
be aided by maximizing page utilization and maintaining strict object alignments,
and this is where the heap layer is beneficial. This highlights another
point: It’s unclear as to how efficiently archives and object databases
handle data churn, where objects are frequently allocated, resized, and
freed. By leveraging sound volatile heap algorithms, Taiyen persistent heaps
do excel in this type of environment.
Another problem with the typical database management system is that it plays
tug of war with the operating system instead of facilitating it. For example,
a common practice employed by database managers is to cache frequently used
objects. In principle, this is a good idea, but it demonstrates a redundancy.
The VMM is already charged with this task, and there’s no way to
turn the VMM off. Its threads are going to receive the same time slices
regardless of what other threads are running, so they might as well be utilized.
Moreover, the VMM is able to monitor page usage patterns on a global basis
across all running processes, so it is better suited for the mission of
allotting physical memory. Some embedded systems manage caching through
disk pages, which is getting even more redundant. Not only is there
no advantage to writing a custom paging system, there’s no way to bypass
that of the operating system, leaving you with a paging system on top of
a paging system.
A further problem with object and embedded databases is encountered when
pointer references are needed. For instance, OMDG databases require such
references be made through d_Ref<T> objects. Because persistent heaps are
always restored to the same address at which they were created, pointers
can be used at will. The system does offer a means of relocating a heap
to a different address, but this is done via compression and decompression.
Naturally, doing so will require that pointer references in the heap at
least be offset, and if the heap is defragmented, adjusted individually.
|
|
|
|
|
|
|
Creating and loading (or restoring) persistent heaps
are very simple processes, and the same functions are used to create and
restore all three heap types: archive, snapshot, and transactional, with
the type being determined by what file handles are passed to the functions.
For example, attaching just a backing file creates or restores an archive,
including a data file gives you a snapshot heap, and including a transaction
log with a buffer gives you a transactional heap. These last two files can
be attached at later time as well. Persistent heaps are created via CPersistentHeapManager_pcCreate()
and restored by CPersistentHeapManager_pcRestoreHeapFromHeapFiles().
This second function is very powerful as it can restore a heap that has
been closed properly or prematurely as a result of system failure. It can
determine the failure mode and apply the appropriate method of recovery.
Once a persistent heap has been created or restored, various buffers may
be attached to add functionality. Undo and redo buffers may be attached
to a snapshot heap and a rollback buffer may be attached to a transactional
heap. The system also provides many ancillary functions to do things such
as: extract heap parameters from the data file before restoration, check
internal structures after restoration, verify checksums, etc.
Restoring a persistent heap that has been closed normally (as opposed to
a closure resulting from system failure), is an extremely fast operation
indeed, because it is much more a function of processor speed than disk
speed. Persistent heaps are closed by calling CPersistentHeapManager_uwClose(),
which performs two tasks: One, it performs a final snapshot to place the
data file in congruence with the backing file (and with the heap); and two,
it frees any clusters on disk that are no longer being utilized, which is
possible by virtue of the system’s use of sparse heap files. When the restoration
function detects that the heap files are congruent, restoration mainly entails
mapping in the backing file’s views. As noted earlier, file mapping, in
general, is the ideal way to load persistent data because pages are only
read from disk as they are touched, thereby minimizing the disk access bottleneck.
For instance, our Itanium tests produced an average loading speed of 1.75
GB per second. By the way, the GB in 1.75 GB/s represents stored
data as opposed to heap space.
|
|
|
|
|
|
|
The system provides a very powerful undo/redo service
to snapshot heaps that is enabled by simply attaching undo and redo buffers.
Using this service can make the task of building undo and redo into an application
significantly easier than writing the necessary routines from scratch, which
can get very complex when the operations to undo are intricate and difficult
to reverse. Having undo/redo at the heap layer can also yield efficiencies
that are not possible otherwise. For example, when a block is freed, any
pages that are no longer needed can be returned to the VMM. This is possible
because the system has an explicit allocator that, when invoked, will restore
the freed block to its last address, allowing it to be repopulated with
its old data. In fact, the service can undo and redo all heap operations:
allocations, resizings, and the freeing of blocks, as well as other
ancillary operations. To allow an allocation to be undone for instance,
CPersistentHeapManager_pmAllocEUE() is called instead of CHeapManager_pmAlloc().
The EUE at the end of the call stands for embedded undo entry
and such functions in general are called u-procs. This function makes
a call to CHeapManager_pmAlloc() and then makes the data and procedure entries
in the undo buffer needed to undo the allocation. The source code for all
of the u-procs offered by the system is provided in the API Guide.
The undo buffer takes two types of entries: data and procedure. The data
entry is the simpler of the two and involves copying a block of data to
the buffer that will be restored to a specific address when an undo is invoked.
When that undo is invoked, the current block of data at that address is
copied to the redo buffer allowing it to be restored should a redo be invoked.
This illuminates another advantage of the service – redo is always automatic.
All the user has to do is call CPersistentHeapManager_uwRedo(), and the
last operation that was done will be undone. It’s that easy, and the reason
is that the user never has to make any entries in the redo buffer. These,
as we term mirror entries, are made automatically by the system as
operations are undone. And once an operation is redone, it can be undone
again, and redone, and so on.
Calling CPersistentHeapManager_uwUndo() implicitly starts an undo/redo cycle,
where undos are permitted to the extent that there are entries in the undo
buffer, and where each undo causes mirror entries to be added to the redo
buffer to allow the operation that was just undone to be redone. Invoking
all of the entries in the undo buffer will cause the redo buffer to become
fully populated with mirror entries, at which point no further entries will
be added (new entries are never added to the undo buffer during an undo/redo
cycle). Any redos or undos beyond this point will just move pointers in
these buffers to their respective current entries. To avoid placing the
heap out of sync with the undo and redo buffers, an undo/redo cycle must
be explicitly exited via a call to CPersistentHeapManager_ExitUndoRedoCycle()
before executing an operation that would modify any heap data. This call
clears the redo buffer and resets both buffers’ internal structures, effectively
locking any changes made to heap during the cycle before any modification
takes place.
Because most operations in an application will constitute multiple undo
buffer entries, a series of entries can be packaged as a transaction by
calling CPersistentHeapManager_CloseUndoBufferTransaction() In fact, only
transactions can be undone (note that the term transaction here should
not be confused with a heap transaction, which is a different animal). The
drawing of a line in a CAD program, for example, might constitute several
sub-operations: one to allocate a block for the entity; another to set its
coordinate, layer, color, and type information; and others yet to update
some ancillary structures. The user isn’t going to want to undo each sub-operation
individually. He’s going to want to undo them collectively in one atomic
operation. To accomplish this, the undo buffer entries that represent each
sub-operation are packaged into a transaction.
Any operation on a heap can be undone with one or more data entries. Data
entries are made via CPersistentHeapManager_uwMakeDataEntryInUndoBuffer().
All instructions processed by a CPU ultimately are meant to modify bytes,
so saving those bytes before they are modified allows a particular operation
to be undone merely by resetting the affected bytes to their pre-operation
values. This means that undo and redo for an entire application can be built
upon data entries. Data entries, however, in certain cases may not be easy
to implement because a particular operation may affect many non-contiguous
bytes, or they may not be efficient because the operation effects a large
swath of data. Imagine inverting a one-MB matrix. Making a one-MB data entry
to undo this operation would simply consume too much undo buffer space.
To remedy this, the system offers a way store procedures in the buffer via
the call, CPersistentHeapManager_uwMakeProcedureEntryInUndoBuffer(). Essentially,
storing a procedure entails copying its address, parameters, as well as
some ancillary information such as the calling convention, an address to
take the return value (if appropriate), and a return value to verify, etc.
to the buffer. But the procedure being stored shouldn’t be the one that
was used to perform the operation, it must be a mirror procedure
designed to reverse that operation. Furthermore, that mirror procedure must
include a recursive call to CPersistentHeapManager_uwMakeProcedureEntryInUndoBuffer()
to populate the redo buffer with a mirror of the mirror procedure, which
can often be the original procedure that performed the operation. The use
of this one function to make entries in both the undo and redo buffers may
be a point of confusion, but there is a simple explanation. During normal
processing, the target buffer for undo entries is the undo buffer. During
an undo, however, the target buffer automatically becomes the redo buffer;
and during a redo, there is no target buffer – no entries are made in either
buffer.
Taking the matrix inversion example further, the same function that performs
the inversion can reverse it, so implementing undo for this operation is
very straightforward. Other operations can be more complex to undo, such
as the resizing of a block. When CPersistentHeapManager_pmSetDynBlockSizeEUE()
is called to resize a block, it makes a series of data entries and a procedure
entry for the internal function, pmExplicitSetDynBlockSizeEUE() in the undo
buffer. When an undo is invoked, this second call will relocate the dynamic
block to its pre-resized address and restore it to its original size. Any
data that was clipped as a result of the block shrinking will be restored
via a data entry as well. Notice that the internal function is appended
by an EUE, which indicates that, when called, it will also makes an undo
entry, except in this case, it will be to the redo buffer. As described,
when a procedure entry is invoked by the undo processor, all entries made
thereby are automatically directed to the redo buffer; and when a procedure
entry is invoked by the redo processor, all resulting requests to make a
buffer entry are ignored.
There are two other features of the undo/redo service that should be mentioned.
One, the buffers can be configured to be either circular or non-circular.
When the buffer is circular, it won’t overflow. Instead, at some point,
new transactions will start overwriting old transactions. Recall that entries
are packaged together in transactions, so when one entry is overwritten,
all of the other entries in that transaction are invalidated, which can
get complicated because entries can wrap around from one end of the buffer
to the other. When the buffer is non-circular, however, it obviously can
overflow (at which point the undo service will be disabled), but using a
non-circular buffer will ensure that all of the operations made since the
point of attachment can be undone. The second feature that should be noted
is that the buffers can be made persistent and re-attached allowing operations
that were performed during one session to be undone during a subsequent
session, which may be of interest to some application developers.
This brief exposition of the undo/redo service may have made it sound more
complex to use than it really is. Most of the undo/redo functionality you’ll
need for your application can be implemented through the use of the u-procs
provided by the system and through data entries, which are very straight
forward. All of the gory under-the-surface details touched upon here can
generally be ignored, but in case your have more advanced requirements,
the framework is fully capable of addressing those needs and is well documented.
Additionally, the more complex features are not that difficult to master
and can take the work out of what, by nature, can be convoluted and buggy
code. Furthermore, the API Guide and the SDK, which contains plenty of sample
code, provide much direction in this area.
|
|
|
|
|
|
|
Persistent heaps can’t grow in size, and it would be
impractical to design them to. Any structure that attempts to grow in size
will have to contend with being impeded by another structure standing in
its way by allowing itself to be movable. Heaps aren’t very movable, of
course, because they contain references to absolute addresses, both in the
meta data and the user data. With the advent of virtual memory, however,
this is all a moot point because heaps can be created that are gigabytes
in size on 32-bit platforms and terabytes in size on 64-bit platforms while
consuming virtually no physical resources. For example, a one-GB heap on
a 32-bit system initially commits just two pages of physical storage. But
this may raise a question: If I create a several terabyte heap, do I need
backing and data files of the same size as the heap? The answer is yes,
but you certainly don’t need that much disk space, and the reason is file
sparseness.
Quite frankly, without file sparseness, a persistent heap wouldn’t be practical,
at least on 64-bit platforms (which is where server and desktop computing
is going at an ever-increasing rate). In this case, the heap size would
either be limited by the attached storage, or the heap would have to be
backed by a myriad of small files, which would be overly complex and a maintenance
nightmare. To simplify matters, the Taiyen system won’t work on a file system
that doesn’t support sparseness. Luckily, the only one of note that doesn’t
is FAT32, which won’t be found on modern hardware anyway, and certainly
not on 64-bit machines.
Sparseness can be likened to virtual memory for files. It allows an extremely
large file to be created that only consumes disk space to the extent that
there is data in the file. For example, a one-TB sparse file with a byte
written at the low end and a byte written at the high end would consume
only a couple of clusters. Some file systems such as Windows’ NTFS allow
specific regions of a sparse file to be logically zeroed, allowing the clusters
backing those regions to be returned to the pool of available clusters on
disk. Not only is this efficient in terms of disk utilization, it is much
faster than writing a stream of zeros to the media. Such zeroing operations
are undertaken when the transaction log is cleared, and as mentioned, when
a persistent heap is closed. Actually, to minimize fragmentation, only spans
of disk space that constitute a segment and above are returned to the file
system upon closure.
By virtue of using sparse backing and data files, expect disk utilization
to be at least equivalent to the page efficiency figures cited earlier and
in real world cases, much higher. Remember that these page efficiencies
are the products of a substantial amount of churning, which won’t be the
typical case. You might consider the use of two heap files to be an inefficiency,
but in actuality, it is not. Recall that the backing file is essentially
a surrogate paging file, and because this file is mapped, loading a persistent
heap does not allocate from the OS’s paging files. This means that for many
systems, especially server-based systems, the collective size of the paging
files can be reduced to the extent that storage is needed for any backing
files.
|
|
|
|
|
|
|
Transactional heaps extend the functionality of snapshot
heaps with transactions, which are essentially ultra fast fine-grained heap
saves. While a snapshot heap is guaranteed to be restorable to the last
snapshot, a transactional heap will always be restorable to the last transaction.
As mentioned, a transactional heap is created via the same function as a
snapshot heap, CPersistentHeapManager_pcCreate() with the only difference
being that the attachment of a transaction log and log buffer tells the
system to make the heap transactional.
The transactional heap inherits the ease of use traits of the snapshot heap,
such as freedom from having to serialize objects individually, and offers
its own features to simplify the programming task while at the same time,
maximizing performance. While other transactional systems exist, they can’t
match the flexibility and extensibility of transactional heaps. Because
these heaps operate below the application layer and are completely data
agnostic, they allow virtually any operation or series of operations to
be formed into a transaction. And they support all variety of structures
from linked-lists, to b-trees, to hash tables, to more exotic structures
such a square lists, to whatever the programmer conjures.
|
|
|
|
|
|
|
Similar to the undo buffer, the transaction log takes
two types of entries: data and procedure, with the main difference being
that entries in the log aren’t meant to reverse an operation; they’re meant
to record it. Actually, a rollback buffer can be attached to a transactional
heap that does take mirror entries like the undo buffer. Log entries also
carry more error checking information such as checksums than do undo buffer
entries because it assumed that they will be needed to rebuild a heap after
a system failure. In such a case, the system is paranoid about the consistency
of the heap files and the log.
One or more log entries followed by a terminator form a transaction, and
the first entry following a terminator or in a fresh log implicitly starts
a new transaction. Entries are made by calling CPersistentHeapManager_uwMakeDataEntryInLog()
and CPersistentHeapManager_uwMakeProcedureEntryInLog(), and are terminated
via CPersistentHeapManager_uwCloseTransaction(). Entries are not actually
made directly in the transaction log, they are first made in a log buffer.
Upon termination, the entries that constitute the current transaction in
the buffer are flushed to the log. Because the log is created with file
system buffering disabled (this is the case for all files used by the system),
writes are made directly to the storing device, and CPersistentHeapManager_uwCloseTransaction()
will not return until the transaction is resident therein.
Transactional heaps are closed and restored (just as snapshot heaps) via
CPersistentHeapManager_uwClose() and CPersistentHeapManager_pcRestoreHeapFromHeapFiles(),
respectively. If the heap was closed normally (resulting in congruent heap
files), the restoration is as described above – a process of mapping in
the backing files views. However, if the restoration is post system failure,
a transactional heap will first restore to the last snapshot, and then if
need be, to the last closed transaction. If the heap is being restored from
the backing file (a failure occurred mid-snapshot), the image in the backing
file itself represents the last transaction. On the other hand, if the heap
is being restored from the data file (a failure occurred during normal processing),
the last transaction will be somewhere in the transaction log. In this instance,
the system will process the transactions in the log one by one, invoking
data entries and procedure entries until no terminated entries are found.
Any entries that are found without a terminator are deemed part of an incomplete
transaction that must have been interrupted by a system failure and are
therefore ignored.
|
|
|
|
|
|
|
A transactional heap's log remains a fixed size throughout
the heap's life. This is possible by virtue of the automatic snapshot mechanism.
Two events can trigger an automatic snapshot: log overflow and an attempt
to make a log entry that is over half the log buffer size. Actually, these
events only cause an automatic snapshot to be scheduled. The actual snapshot
is made upon the next call to CPersistentHeapManager_uwCloseTransaction().
Here, the snapshot acts like sort of an über transaction, compiling all
of the little transactions in the log into one large transaction to the
heap files.
As is the case with a snapshot heap, a transactional-heap snapshot is comprised
of a double flush, first to the backing file and then to the data file.
In addition, once the backing file contains a consistent image, the log
and its buffer are cleared, opening the way for fresh entries. Because the
log is a sparse file, it can be zeroed out logically (and therefore very
quickly), versus having to write zeros to the media. The buffer, which is
fixed at two segments in size, can also be zeroed out in short order. Note
that because the buffer is generally much smaller than the log, it is cleared
numerous times as the log is populated, and the system always makes sure
to flush any unwritten entries before doing so. If there's a failure before
or during the backing file flush, the heap will be restorable from the data
file and log. If there's a failure during the remainder of the snapshot,
the heap will be restorable to from the backing file.
The advantage of this approach is two-fold: One, from the perspective of
the user, it makes the transaction log appear to be of infinite size, and
therefore extremely easy to use. The way it works is that when an attempt
is made to make an entry that would cause the log to overflow (or not leave
room enough for a terminator) the entry is simply ignored and a snapshot
is scheduled. In fact, once a snapshot is scheduled, all further entries
are ignored; and because no error codes are returned to the user, it seems
as though the entries were made without a hitch. But as soon as the transaction
is closed, instead of flushing the log buffer, the pending snapshot is made.
Any attempt to make an entry that is over half the buffer size is also ignored
and also schedules a snapshot. The logic here is that such an entry must
be a data entry, and it would make more sense to write a block of data that
size to the heap files immediately via a snapshot instead of the double-write
to the log and then at some point downstream, to the heap files. Again,
the system returns no error code here, and from the user’s perspective,
the entry is accepted.
The second advantage of this mechanism is efficiency. Because the log is
a fixed size and generally a fraction of the heap size, it can be placed
on a relatively small, high I/O speed (IOS) solid state drive (SSD); while
the heap files are placed in cheaper disk-based storage such as RAID that
can’t achieve the IOS of SSDs but can sustain relatively high throughput
for the large block writes produced during snapshots. In this way, the system
can convert a choppy input data stream into a bursty sequential feed (imagine
a large period square wave) that leverages the throughput capabilities of
a disk array and its interconnect. This is not possible on certain transactional
systems where the log continually grows throughout the life of the system
or is not readily truncated, and therefore must be placed on low IOS bulk
storage.
There is another efficiency gained by this approach that may not be obvious.
Imagine a series of transactions that reset the value of the same byte in
a heap repeatedly. As these transactions are added to the log, and let’s
say that the log is one GB in size, at some point, a snapshot will be scheduled.
When the pending snapshot is made, instead of having to stream a GB of transactions
to the heap files, only the pages modified by those transactions need to
be flushed, which in this case is one page. This allows the system to consolidate
a large block of transactions into a potentially smaller block of pages
to flush. Now, just the opposite effect could have occurred if each transaction
modified a byte on a different page, but this would be highly unlikely.
And the principle of locality teaches us otherwise: 80% of the time is spent
processing 20% of the data.
|
|
|
|
|
|
|
The system provides a rollback service that is enabled
upon attachment of a rollback buffer. This buffer takes mirror entries (just
as the undo buffer) that when invoked will reverse all of the operations
constituting the current transaction, thereby moving the heap’s data back
to the last consistent state – that which existed when last transaction
was closed. A rollback is initiated by calling CPersistentHeapManager_uwRollback(),
and if this function fails, the system offers the backup function, CPersistentHeapManager_uwRebuild().
This second call is the brute-force method of rollback where the entire
heap is rebuilt from the heap files and log. Because the current transaction
hasn’t been terminated, it is ignored during the rebuild process, effectively
rolling the heap back to the last transaction.
Unlike the undo service, a rollback can only reverse the current transaction.
Remember that transactional heaps are designed for multi-user environments,
so the previous transaction may well have been made by another user and
is therefore off limits. Thus, the rollback buffer is cleared upon the closure
of each transaction, and upon execution of a rollback. During a rollback,
entries are automatically made in the transaction log corresponding to each
invoked mirror entry. The log always moves forward, with the main reason
being that while user data is guaranteed to be returned to its state as
of the last transaction, the heap’s internal meta data may differ. So to
maintain coherence between the log and the heap as a whole, anytime an operation
is made on the heap (regardless of whether it is part of a rollback or not),
that operation must be reflected in the log.
Similar to the u-procs that integrate heap operations with the undo buffer
such as CPersistentHeapManager_uwFreeEUE(), the transactional system offers
l-procs that integrate heap operations with the log and rollback
buffer. One such l-proc, for example, is CPersistentHeapManager_pmAllocELE()
(the ELE here stands for embedded log entry), which combines
a call to CHeapManager_pmAlloc() with the log entries necessary to record
the operation and the rollback entries necessary to reverse it. The system
also provides r-procs, which are like l-procs, but only make rollback
entries. The source code for these procedures can, again, be found in the
API guide.
|
|
|
|
|
|
|
The system uses the one-writer / many-reader model, where
any operations that modify heap data and their corresponding log entries
must be wrapped within a log mutex (to be provided by the user). To guard
against snapshots being made while the heap is being updated, snapshots
need to be protected by this mutex as well. Sample pseudo code for this
arrangement is as follows:
Writing thread ν
{
Enter log mutex.
Wait for object A’s read/write auto-reset event.
// Waiting for any reading threads to finish up.
// Modify object A. For example:
Make a rollback entry for the data about to be modified.
// Data rollback entries are made pre data modification.
Modify a block of data.
Log the data operation. // Implicitly starting
the transaction. Data log entries are made post data modification.
Call a procedure.
Log the procedure.
Make a rollback entry for the procedure.
Close the transaction. // A rollback can
be performed at anytime before the transaction is closed.
Set object A’s read/write event. // Allow
any pending reading threads or another writing thread in.
Leave log mutex.
}
Reading thread ν
{
Enter object A’s thread count critical section.
If (++ thread count == 1) // Only the first
reading thread needs to potentially wait.
Wait for object A’s read/write auto-reset
event.
Leave object A’s thread count critical section.
// Once one reading thread has
access, all others can bypass the read/write synchronization object.
// Read object A.
Enter object A’s thread count critical section.
If (-- thread count == 0) // Once all of
the reading threads are out, open the object to any waiting writing thread.
Set object A’s auto-reset event.
Leave object A’s thread count critical section.
}
// The snapshot thread awakens every so often to get
a dirty page count, and if the count is over a certain threshold,
// a snapshot is made.
Snapshot thread ν
{
Enter log mutex. // The heap can be read
during the snapshot; it just can’t be modified.
Snapshot.
Leave log mutex.
}
This model in conjunction with the system’s persistence engine allow heap
transactions to meet the requirements given by the acronym, ACID, which
stands for atomic, consistent, isolated, and durable. Atomicity is achieved
as an object's updating thread blocks all reading threads until the update
is complete. From the perspective of the reading threads, the heap’s data
state only changes when each transaction is closed. The rollback mechanism
is atomic as well because when invoked, it completely restores the heap’s
user data to its state as of the previous transaction.
Consistency is achieved as each transaction moves the heap atomically from
one consistent state to the next. The model also ensures that upon closure
of each transaction, the heap is completely congruent with the log. This
is critical in case the transaction is closed via an automatic snapshot,
in which case the heap will need to be in a consistent state prior to exiting
the log mutex. Because the application riding on top of a transactional
heap determines when transactions are closed, some responsibility for maintaining
consistency falls to the application layer.
Isolation is achieved by allowing only one thread to modify the heap at
a time, ensuring that transactions are made serially, and by allowing the
modifications made by one transaction to be visible to other transactions
only after the first transaction either commits or aborts. Concurrency can
be achieved by utilizing multiple transactional heaps, although with one
stipulation: the heaps must be completely isolated from one another – no
transaction can straddle two or more heaps. The high degree of isolation
imposed here prevents dirty reads where a read of incomplete data
or an abandoned transaction can wind up as part of a second transaction,
thereby introducing data inconsistency.
Durability is achieved through use of the transaction log and fail-safe
snapshots, which make transactional heaps impervious to system failure,
and guarantee that a transactional heap will always restore to its state
as of the last closed transaction. Durability is assured because the log
entries comprising a transaction are committed to the log’s storing device
before the next transaction is started. This applies to snapshots as well,
which do not return until both the backing and data files have been updated.
No data in a transactional heap are lazily flushed, albeit for the paging
to and from the backing file, which is accounted for.
The serialization of heap transactions should not be deemed as a liability
because the associated log writes have to be serialized to make it across
the device bus, which can become saturated when the log device's latency
drops below that of the interconnect. By its very nature, the maximum attainable
log feed will be substantially less than the interconnect's bandwidth. The
reason is that each log update is non-pipelined (which is to say it must
be completed before the subsequent one is initiated) and is thus hindered
by non-transfer latencies, which can be significant. For example, recent
tests published by Intel of an eight-lane PCI Express implementation with
an average payload size of 2,048 bytes showed non-pipelined transfer rates
averaging 2.2 GB/s. But when the transfer size averaged 512 bytes, the transfer
rate fell to nearly 1.2 GB/s, showing that non-transfer latency factors
noticeably for these smaller payloads. Still, that figure is impressive
and higher throughputs are possible because PCI Express can scale to 16
lanes, predicting small transaction throughputs of roughly 2.4 GB/s. The
challenge for the log's SSD will be to receive transactions at these interconnect
speeds (≈ 4.7 million per second at 512 bytes per transaction). Such a unit
will definitely have to be directly attached to the interconnect and will
have to consist of battery-backed high-speed DRAM.
Other transactional systems concentrate on achieving writing concurrency,
which requires intricate structures to ensure transaction isolation. The
problem is that as the IOS the of the log device approaches that of the
interconnect, any achieved concurrency will show no net performance gains.
We prefer the more generalized and flexible approach that guarantees transaction
isolation for every conceivable operation, with the focus on getting each
transaction into persistent cache as fast as possible. Additional processors
are best used to concurrently serve querying threads and to accelerate individual
CPU-intensive operations, when they lend themselves to parallelization.
Also, available interconnect bandwidth and CPU cycles can be utilized to
lazily flush dirty pages to the backing file, thereby reducing the pending
snapshot size.
|
|
|
|
|
|
|
On top of outstanding programmability, extensibility,
and flexibility, transactional heaps offer unparalleled raw speed. While
the optimal media for the log is of the solid state variety, performance
is still very good when the log resides on a disk drive because all writes
are sequential and optimized in terms of the number the number of sectors
that are written per update. Furthermore, the use of procedure entries can
significantly reduce the average transaction size, but this is less of a
factor because transaction speed is predominantly a function of I/O speed,
not transaction size. To illustrate, the following are the results from
transactional heap tests that were run on the same 2.8 GHz PC mentioned
above, running Windows XP with the heap files and log being located on a
Maxtor 6Y200PO Ultra ATA-133 drive with 8 MB of cache.
Number of log entries: 12,807,351.000000
Number of log transactions: 1,441,968.000000
Average number of log entries per transaction: 8.881855
Average transaction size: 274.521254
Average transaction flushing efficiency: 35.081203
Number of forced snapshots: 0
Number of timed snapshots: 12
Average snapshot size: 72,615,253.333333
Average dirty page bytes per log entry bytes: 0.692030
Number of heap restorations: 16
Average log entry speed (log entries per second): 3,986,259.995762
Average transaction termination speed (transaction terminations per second):
5,357.543433
Average transaction speed (transactions per second): 5,294.343605
Average transaction speed (bytes per second): 1,453,409.847199
Average snapshot speed (bytes per second): 12,787,703.611579
Average sustained throughput (transactions per second): 4,908.287731
Average sustained throughput (bytes per second): 1,347,429.304278
Average heap restoration from backing file speed (bytes per second): 9,978,499.730237
Average heap restoration from data file speed (bytes per second): 1,418,307.172064
Average heap restoration from congruent heap files speed (bytes per second):
1,625,675,118.906685
Average heap compression speed to buffer (bytes per second): 543,986,452.160653
Average heap decompression speed from buffer (bytes per second): 163,992,872.510485
Average heap compression speed to file (bytes per second): 44,198,439.568796
Average heap decompression speed from file (bytes per second): 40,802,358.469421
Notice the exceptional transaction speed of nearly 5,300 per second with
an average of approximately nine log entries per transaction. These tests
were run without enabling file system buffering for the log, which means
that each transaction was written directly to the drive. The drive’s 8-MB
write cache, however, was enabled, which allows the drive to signal that
a write is complete once the data stream encompassing that write is resident
in cache, and not necessarily on disk. In contrast, when the write cache
is disabled, streamed bytes must be resident on the actual disk media before
the drive can accept the next transaction, which will severely hamper I/O
speed. For example, when the test at hand was run with the a disabled write
cache, transaction speed was reduced to a dismal average of 116 per second,
even with all writes being completely sequential. This, of course, illuminates
the principal bottleneck in disk drives – mechanical latency, which is measured
by the two quantities: average seek time and rotational latency. Actually,
the total random access time includes other second order factors and is
given by the following sum: (average seek time) + (average rotational latency)
+ (transfer time) + (controller overhead).
Based on the performance data for the Maxtor drive, the 116/s-transaction
speed is predictable. The published numbers declare an average seek time
≤ 9.4 ms, an average rotational latency = 4.18 ms, and a controller overhead
< .3 ms. Assuming that the log occupies contiguous sectors and tracks in
the drive, and given that writes to the log are sequential, the seek time
can be ignored. With an average of 856 sectors per track (or 438,272 bytes
per track) and an average transaction size of 275 bytes, each track holds
roughly 1600 transactions, which means that the drive performs relatively
few seeks as the log is populated. And when a seek is performed, it is only
to the next track over (a track-to-track seek takes .9 ms on this drive).
With transfer times in the tens of MBs per second and a controller overhead
of only .3 ms, the dominant factor is clearly rotational latency. The average
latency of 4.18 ms would predict a transaction speed of nearly 240/s, but
interestingly, because each transaction is made next to the one before it,
the drive may have to wait one full revolution before the disk head is in
the correct position to make the next write, resulting in a worst case latency
of 8.36 ms (1/.00836 ≈ 116). There might be ways to optimize the process
by spacing writes such that the target sector always leads the disk head
by a few degrees, but this is a moot point.
As the test shows, employing the write cache eliminates nearly all of the
drive latency. As mentioned, the log buffer is flushed when a transaction
is terminated, so it’s actually the termination speed that specifically
measures the storage IOS. (Note that the transaction speed is calculated
by adding the log entry overhead to the transaction termination speed, but
at nearly four million entries per second, this overhead is minimal.) At
5,358 terminations per second, the latency calculates to .187 ms, which
can almost entirely be attributed to controller overhead. Throughput is
maximized because sequential writes allow a continuous stream bytes to pass
through the write head as the platters spin underneath. The average transaction
flushing efficiency of 35% shows the actual amount of transaction data written
to disk as a percentage of each log buffer flush. Because these flushes
bypass the file system cache, they are aligned along sector boundaries and
are sector multiples in size. Given a disk sector size of 512 and an average
transaction size of 275, a 35% flushing efficiency is in line with expectations.
The extra bytes written per flush only trivially affect IOS. The transaction
throughput at 1.45 MB/s is well below the drive’s maximum sustained transfer
rate of 37 to 67 MB/s, which rate depending on where the write head is with
respect to the disk’s outer diameter where throughput is maximized. To gain
further insight, the same test was run with a sector size of 1,024, which
lowered the flushing efficiency to 21.2%. The resulting speed of 5,317 transaction
terminations per second reinforces the understanding that for these relatively
small transaction sizes, latency (specifically, controller latency) is the
predominant factor limiting IOS.
Another metric that provides insight is the average number of dirty page
bytes per log entry bytes (DPPL), which turned out to be around .7 for this
test. The DPPL is an indicator of the locality of heap operations where
the lower the ratio, the higher the locality. Recall the extreme example
above where the same byte is repeatedly reset, which would rapidly populate
the log while dirtying only one page. This would yield a DPPL approaching
zero. The test referenced here, because it continually churns the memory
by allocating, resizing, and freeing blocks of random sizes, renders a higher
DPPL than would be expected by a real world system.
Notice that the average compression and decompression speeds of 44.2 and
40.8 MB/s fall within the drive’s range of maximum sustained transfer rates.
Snapshots, which entail flushes to two files instead of one, should theoretically
be half the speed of compressions but fall short at about 57% of this mark.
The reason why is that snapshots are not completely sequential as are compressions
and decompressions. Because the snapshot algorithm does coalesce neighboring
dirty pages into single block writes, snapshots are largely sequential (and
with a smaller DPPL ratio, even more so), but even a small amount of sparseness
in a series of disk writes can induce latency overhead. On the other hand,
any sparseness in the snapshot could’ve been mitigated had the drive used
a cache larger than 8 MB, which was completely overwhelmed by the average
snapshot size of 72.6 MB.
Nevertheless, an average snapshot speed of 12.8 MB/s is respectable, and
by no means is the intended storage a $50.00 ATA drive. One would expect
even a modestly priced system to connect to U320 SCSI or Fibre Channel RAID,
where snapshot speeds are measured in hundreds of MB per second (and eventually,
thousands per second for 10-Gb FC). To avoid situations where a pending
snapshot becomes so large so as to noticeably affect system availability,
the system provides a function to retrieve the current dirty page count,
which along with the other mentioned automatic triggers, could force a snapshot
merely based on the dirty page tally breaching a certain threshold. Note
that snapshot speed can be increased by lazily and continuously writing
dirty pages to the backing file such that when each snapshot occurs, most
or all of the flush is directed towards the data file, thereby reducing
the total number of bytes flushed per snapshot. Most operating systems offer
a way of tuning mapped-file flushing threads.
While the transaction termination speed shows the rate at which transactions
are written to the log, they only partially indicate the sustained throughput
capacity of a transactional heap. The reason is that the log, being of fixed
size, can’t continue to accept entries ad infinitum. At some point, a snapshot
will be needed to clear the log and flush the pages that were dirtied in
the process of populating it. Because all writing threads must be suspended
during a snapshot (recall that all snapshots must be made inside the log
mutex), snapshot time must factor into any calculation of net throughput.
In equation two below, net throughput, TP in bytes per unit time is defined
as the number of transaction bytes, Tb in the log (at the time of a snapshot)
divided by the sum: average transaction time, Tt plus average snapshot time,
St.
|
|
|
|
|
|
|
The number of bytes flushed during a snapshot is equal
to (DPPL ● Tb), and dividing this product by the snapshot speed, Ss yields
the snapshot time. Equation three substitutes this quantity for St.
|
|
|
|
|
|
|
Dividing the numerator and denominator by Tb and substituting
transaction speed, Ts for the quotient (Tb / Tt) renders the final throughput
equation below.
|
|
|
|
|
|
|
This equation provides good insight into transactional
heap performance. As expected, maximizing transaction and snapshot speeds
are key, but what may have been unclear is the extent to which DPPL plays
into the equation. Clearly, the smaller the DPPL the better. Two factors
combine to reduce DPPL: locality and time. If modifications are relegated
to relatively few pages, the DPPL, by definition, will continue to shrink
as entries are added to the log. But allowing the dirty page count to grow
beyond a certain point means that any upcoming snapshot may be large enough
to noticeably delay any writing threads. Most applications will want to
keep snapshots under a second, so if a 100 MB/s snapshot can be achieved
with a decent RAID system, a snapshot trigger might be set at 50 MB. If
the DPPL for a particular application averaged .2, at least a 250-MB log
would be needed to avoid automatic snapshots. Notice how the DPPL determines
the minimum log size, with the lower the DPPL, the larger the requisite
log. Imagine that the log storage is able to achieve a transactional throughput
of ten MB/s. In this case, the heap could sustain a net throughput of 9.8
MB/s (1 / (1 / 10 + .2 / 100)) with snapshots occurring every 25 seconds.
In the test results above, the sustained throughput of 1.34 MB/s is only
slightly below the log throughput of 1.44 MB/s, indicating that on the hardware
used for this test, transaction speed has a long way to go before snapshot
speed factors appreciably in the equation. A much higher transaction speed
is possible when an SSD is used to store the log. The typical SSD
is capable of an IOS in the 100,000 per second range and even higher for
bus-attached units. To simulate the use of an SSD, the same test was run
with file buffering enabled for the log. The resulting transaction speed
wound up at over ten times the non-buffered figure at 56,639 transactions/s,
yielding a net throughput of 8.5 MB/s at 30,972 transactions/s. To put this
figure in perspective, this equates to over 1.8 million transactions per
minute. At the time of this writing, the world record transaction speed
was around one million per minute.
|
|
|
|
|
|
|
There’s no way to build a completely reliable system,
and by system we are referring to the combination of hardware and software.
There are just too many modes of failure from power outages, to actual circuitry
failures, to intentional destruction (as in the case of military systems),
to software bugs. The beauty of transactional systems, in general, is their
ability to recover from failure, but some do not do this very gracefully
and are not very easy to administer. Transactional heaps, on the other hand,
are highly adept at handling system failures and are extremely easy to administer.
Merely pass in handles to the two heap files and the log, and the restoration
function, CPersistentHeapManager_pcRestoreHeapFromHeapFiles() does the rest.
If a failure did occur, this function will fill out a CHeapFileParameters
structure with details about the failure and recovery.
After the heap is recovered, the user can call CHeapManager_uwVerifyHeap(),
which quickly performs a top-down check of all of the heap’s internal structures.
This function works on volatile heaps as well. The restoration function
performs many checks as well during the restoration process such as verifying
checksums as log entries are invoked. Restorations from failure are also
very fast. As the results above show, restorations from the backing and
data files averaged 10.0 and 1.4 MB per second, respectively.
|
|
|
|
|
|
|
The Taiyen system was devised with two fundamental principles
in mind. One, there are two basic data types: volatile and persistent. And
two, there are two basic application types: single-user and multi-user.
The outcome was an uncomplicated software stack where volatile data is managed
via volatile heaps, single-user persistent data is managed via snapshot
heaps, and multi-user persistent data is managed via transactional heaps.
It’s a powerful model that can service a wide variety of applications. And
by no means is the performance for one application type sacrificed by supporting
the other. In other words, the system’s volatile allocation speed could
not have been increased by foregoing persistent heap support, nor could’ve
its transactional performance been improved by omitting snapshot heaps.
Ultimately, the system’s generalized approach to memory management is one
of its key strengths.
First and foremost, the system is a performance component. But it’s almost
a misnomer to say that software is fast. It’s really hardware that
is fast. CPUs can be made faster through the brute-force increasing of clock
frequency and through more clever optimizations like pipelining, where multiple
instructions are overlapped in execution. Disk drive transfer rates can
be improved by increasing the areal density of the platters and rotation
speed. In software, there’s no clock to boost and there’s no motor to spin
faster. The hardware determines the potential performance. It is incumbent
upon the software to realize this potential, and somewhere in the myriad
of instruction combinations is an optimal solution. But there are general
principles that can guide the way: First-order optimizations focus on minimizing
algorithm time and memory complexity. Second-order optimizations maximize
the amount of work done per clock cycle, and focus on good coding techniques
such as the minimization of branching. The continual divergence between
CPU speed and storage speed (both volatile and persistent) makes locality
increasingly vital so as to maximize cache and page hits. Multi-threaded
applications demand the minimization of thread contention and context switching.
And disk access needs to be sequential as possible.
The Taiyen system leverages these principles to achieve exceptional performance.
The extent to which overall application performance will improve through
use of the system will depend on a variety of factors such as the allocation
type, how memory-intensive the application is, how allocation-intensive
the application is, whether access will be single-user or multi-user, and
so on. Server-based applications will potentially see the largest gains,
especially transactional systems, where the main bottleneck is the interaction
with persistent storage. A primary aim of the system was to deliver enterprise-level
transactional performance on modest hardware, which, as demonstrated, was
achieved. The system, unmatched in terms of transaction / (second ● dollar),
promises to lower IT infrastructure costs, operating costs, and increase
the accessibility of transactional systems to an ever wider audience. The
economic feasibility of many nascent technologies such as micro-payments
and RFID depends on the ability to sustain low per-transaction costs. But
beyond performance, transactional heaps are more adaptable and extensible
than other transactional systems, which typically shoehorn the developer
into a particular data model. In fact, these other object and relational
systems could easily be built on top of transactional heaps.
With respect to desktop applications, the Taiyen system and specifically,
snapshot heaps promise efficient and fail-safe saves, and ultra-fast document
loads. Moreover, they promise to make the development of complex data-intensive
programs a significantly easier task than it is otherwise. For one thing,
the programmer doesn’t have to necessarily invent a file format. He has
only to worry about the memory structures needed to run the application.
To add persistence, he simply allocates those exact structures in a snapshot
heap. PTE redirection allows the manipulation of large dynamic structures
without having to worry about relocation latency. Undo and redo is easily
incorporated via the built-in undo/redo service. Not having to mess with
a file format makes the application more extensible allowing the creation
of ever more complex and rich content. In lieu of a snapshot heap (or even
in conjunction there with), persistence could also be gained via heap compression.
With compression and decompression speeds in the 40 MB/s range on a typical
PC, even relatively large documents could typically be saved and loaded
in sub-second times.
The 64-bit desktop is already here and will soon be pervasive. The greatly
expanded addressability of 64-bit computing will yield ever more powerful
applications, especially in the categories of CAD/CAE, financial systems,
video editing, and scientific computing. The data sets handled by these
applications will grow from the megabyte range to the gigabyte range, and
it will be critical to handle this data as efficiently as possible. From
its inception, the Taiyen system was originally designed for 64 bits and
as such, would make an ideal platform for this new breed of application.
The capabilities outlined here really only touch the surface of the system’s
complete feature set, which includes over 80 API calls. For more information,
download our free white papers, API guide, and demo SDK. The demo SDK contains
the source code for tester used to generate all the performance results
cited herein as well as raw test results. It also contains a set of small
sample programs that run on both Unix and Windows and series of extensive
user guides that function as a tutorial for the system. The demo SDK contains
a complete set of non-optimized debug libraries for each platform that we
currently support. There is no time limit on the SDK. The only limits are
the number of CPUs to which the volatile heap manager will scale and how
many allocations can be made during a particular session, which is currently
set to 100K.
|
|
|
|
|
|
-
IA32(x86)
1. Linux. The system has been verified to run on the following distributions:
RedHat Linux 9.0.
2. Windows 2000 and later.
-
IA64(Itanium I/II)
The system has been optimized for Itanium II processors.
1. Windows XP/2003 and later.
-
USPARC64
1. Solaris 5.9 and later.
|
|
|
Builds for additional Unix platforms are forthcoming
as well as the 64-bit AMD Opteron version of Windows. Please call or email
if builds for a specific platform are needed, and we will try to expedite
their development.
|
|
|
|
|
|
|
Small entity (< 100 employees):
Large entity (>= 100 employees): |
Introductory price! (for a limited time)
$799.95 $349.95 per seat.
Please call (415.567.5850). |
|
|
|
All distributions are royalty free and include all of
the Unix and Windows libraries (32 and 64-bit), documentation and sample
code available at the time of order, plus an account through which to download
the latest Taiyen SDK for a period of one year. All retail libraries are
fully functional and allow full allocator scalability (to 32 CPUs on 32-bit
platforms and 64 CPUs on 64-bit platforms). In other words, no libraries
(other than those in the demo SDK) contain any artificial limitations. Each
seat is authorized one year of tech support via email; phone support packages
are also available.
The Taiyen system is the clear value choice when compared
with competing systems, which cost many times more and lack Taiyen's innovative
and powerful feature set.
Site licenses are also available and can offer savings
for multi-seat purchases.
Terms of use: Demo
SDK, Retail SDK
|