Garbage Collection Algorithm
Garbage collection is an automatic memory management technique in which a garbage collector reclaims the memory used by objects that will never be accessed again by the application. It was first developed by John McCarthy around 1959 to solve the problems of manual memory management in the programming language Lisp. Its purpose is to reuse space that has served its function: once the application (or mutator) that occupied the space no longer has any use for it, the memory retrieval system reclaims the inaccessible memory through a collector.
Since garbage collection is a language-based feature, the development of new languages has been accompanied by the development of a garbage collector for each of them. Languages like Java and C# require garbage collection as part of the language specification, while formal languages such as the lambda calculus require it for any practical implementation; these are said to be garbage-collected languages. Other languages, like C and C++, were designed for use with manual memory management but have implementations that support garbage collection.
Garbage collection brings numerous software engineering advantages, but it has been found to interact poorly with virtual memory managers. The popularity of languages like Java and C# is due in part to garbage collection. However, a garbage-collected application requires considerably more memory than one using explicit memory management, so it needs a larger RAM, and fewer garbage-collected applications fit in a given amount of RAM. The substitute for this space problem is a disc-backed system in which disc space is used rather than physical memory alone. Performance then degrades because a disc access is far more expensive than a main-memory access, requiring almost six times more energy, and the resulting paging can stretch pauses to tens of seconds or even minutes. Even when main memory is large enough to hold an application's working set, a later full-heap collection can induce paging: most existing garbage collectors touch pages without regard to whether those pages are resident in memory, and during a full-heap collection they visit many more pages than are in the application's working set.
Garbage collection often disrupts the virtual memory manager and destroys the reference-history information the virtual memory manager keeps for tracking page usage. This is perhaps the most widely known undesirable behavior of garbage collectors, but it has been tackled only indirectly, through generational collectors that focus collection effort on short-lived objects. Because most objects have a low survival rate, generational collection reduces the frequency of full-heap garbage collections. However, when a generational collector eventually performs a full-heap collection, it triggers paging. This problem has led to a number of workarounds. One standard way to avoid paging is to size the heap so that it never exceeds the size of available physical memory; however, choosing an appropriate size statically is impossible on a multi-programmed system, where the amount of available memory changes. Another possible approach is to over-provision systems with memory, but high-speed, high-density RAM remains expensive, and it is generally impractical to require that users purchase more memory in order to run garbage-collected applications. Furthermore, even in an over-provisioned system, a single unanticipated workload that exceeds available memory can render the system unresponsive. These problems have led some to recommend that garbage collection be used only for small applications with minimal memory footprints.
In a distributed application environment, where users want to retrieve data seamlessly, developers need to understand the needs of the user as well as the resources and other constraints of limited devices. Memory is one of the biggest issues for mobile device applications; therefore developers need to understand the garbage collection mechanism in order to make their applications more efficient and reliable.
Garbage collection is often portrayed as the opposite of manual memory management, which requires the programmer to specify which objects to deallocate and return to the memory system. However, many systems use a combination of the two approaches, and there are other techniques being studied (such as region inference) to solve the same fundamental problem. Note that there is an ambiguity of terms, as theory often uses the terms manual garbage-collection and automatic garbage-collection rather than manual memory management and garbage-collection, and does not restrict garbage-collection to memory management, rather considering that any logical or physical resource may be garbage-collected.
The basic principle of how a garbage collector works is:
- Determine what data objects in a program will not be accessed in the future
- Reclaim the resources used by those objects
By making manual memory deallocation unnecessary (and typically impossible), garbage collection frees the programmer from having to worry about releasing objects that are no longer needed, which can otherwise consume a significant amount of design effort. It also aids programmers in their efforts to make programs more stable, because it prevents several classes of runtime errors. For example, it prevents dangling pointer errors, where a reference to a deallocated object is used. (The pointer still points to the location in memory where the object or data was, even though the object or data has since been deleted and the memory may now be used for other purposes, creating a dangling pointer.)
Many computer languages require garbage collection, either as part of the language specification (e.g. C# and most scripting languages) or effectively for practical implementation (e.g. formal languages like the lambda calculus); these are said to be garbage-collected languages. Other languages were designed for use with manual memory management but have garbage-collected implementations (e.g. C, C++). Newer Delphi versions support garbage-collected dynamic arrays, long strings, variants, and interfaces. Some languages, like Modula-3, allow garbage collection and manual memory management to co-exist in the same application by using separate heaps for collected and manually managed objects; others, like D, are garbage-collected but allow the user to delete objects manually and even to disable garbage collection entirely when speed is required. In any case, it is far easier to implement garbage collection as part of the language's compiler and runtime system, but post hoc GC systems exist, including ones that do not require recompilation. The garbage collector will almost always be closely integrated with the memory allocator.
1.1 Definition of Garbage Collector
The name "garbage collection" implies that objects no longer needed by the program are garbage and can be thrown away. Garbage collection is the process of collecting all unused nodes and returning them to available space. This process is carried out in two phases. In the first phase, known as the marking phase, all the nodes in use are marked. In the second phase, all unmarked nodes are returned to the available-space list. When variable-size nodes are in use, it is also necessary to compact memory so that all free nodes form a contiguous block; this step is known as memory compaction. Compaction of disk space to reduce average retrieval time is desirable even for fixed-size nodes.
A garbage collection algorithm identifies the objects that are live. An object is live if it is referenced by a predefined variable called a root, or by a variable contained in a live object. Non-live objects, which have no references to them, are considered garbage. Objects and references can be viewed as a directed graph; live objects are those reachable from the root. Fig. 1 shows how garbage collection works.
Objects shown as blue squares are reachable from the root, but objects shown in red are not. An object may refer to a reachable object and still be unreachable itself.
1.2 Basics of Garbage Collector Algorithms
There are three basic garbage collector algorithms available.
Reference Counting: each object keeps a count of the number of references to it, and the garbage collector reclaims the object's memory when the count reaches zero.
Mark and Sweep: the mark-and-sweep algorithm is also known as a tracing garbage collector. In the mark phase, the collector marks all accessible objects; in the sweep phase, it scans through the heap and reclaims all unmarked objects.
- Figure shows the operation of the mark-and-sweep garbage collector algorithm. Fig. a shows the conditions before the garbage collector begins; Fig. b shows the effect of the mark phase, at which point all live objects are marked; Fig. c shows the effect after the sweep has been performed.
Compact: compaction shuffles all live objects in memory so that the free entries form large contiguous chunks.
1.3 Problem Statement
The performance evaluations in this thesis were conducted with three major goals: to make controlled comparisons so that the performance effects of isolated parameters can be determined; to allow easy exploration of the design space so that parameters of interest can be quickly evaluated; and to provide information about parts of the design space that are not easily implementable. As with other experimental sciences, hypotheses about performance can only be tested if experimental conditions are carefully controlled. For example, to accurately compare non-incremental with incremental copying garbage collection, other algorithm parameters, such as semi-space size, promotion policy, allocation policy, and copying policy, must be held constant. Furthermore, the Lisp systems in which the algorithms are implemented must be identical: comparing incremental collection on a Lisp machine to stop-and-copy collection on a RISC workstation would provide little information.
A second characteristic of an effective evaluation method is its ability to allow easy exploration of the space of design possibilities. In the case of garbage collection evaluation, new algorithms should be easy to specify, parameterize, and modify. Parameters that govern the behavior of the algorithms should be easy to introduce and change. Examples of such parameters include semi-space size, physical memory page size, promotion policy, and the number of bytes in a pointer.
A good evaluation method will answer questions about systems that do not exist or are not readily implementable. If technology trends indicate certain systems are likely to be of interest, performance evaluation should help guide future system design. In the case of garbage collection, several trends have already been noted. In particular, garbage collection evaluation techniques may help guide computer architects in building effective memory system configurations. In the case of multiprocessors, evaluation methods that predict an algorithm’s performance without requiring its detailed implementation on a particular multiprocessor will save much implementation effort. If a technique for evaluating garbage collection algorithms can provide these capabilities, then a much broader understanding of the performance tradeoffs inherent in each algorithm is possible.
2. GARBAGE COLLECTION ALGORITHMS
Garbage collection provides a solution where storage reclamation is automatic. This section provides an overview of the simplest approaches to garbage collection, and then discusses the two forms of garbage collection most relevant to this dissertation: generational collection and conservative collection.
2.1 Simple Approaches
All garbage collection algorithms attempt to de-allocate objects that will never be used again. Since they cannot predict future accesses to objects, collectors make the simplifying assumption that any object that is accessible to the program will indeed be accessed and thus cannot be de-allocated. Thus, garbage collectors, in all their variety, always perform two operations: identify unreachable objects (garbage) and then de-allocate (collect) them.
Reference-counting collectors identify unreachable objects and de-allocate them as soon as they are no longer referenced (Collins, 1960; Knuth, 1973). Associated with each object is a reference count that is incremented each time a new pointer to the object is created and decremented each time one is destroyed. When the count falls to zero, the reference counts of the object's immediate descendants are decremented and the object is de-allocated. Unfortunately, reference-counting collectors are expensive because the counts must be maintained, and it is difficult to reclaim circular data structures using only local reachability information.
Mark-sweep collectors are able to reclaim circular structures by determining global reachability (McCarthy, 1960; Knuth, 1973). Periodically (e.g. when a memory threshold is exhausted), the collector marks all reachable objects and then reclaims the space used by the unmarked ones. Mark-sweep collectors are also expensive because every dynamically allocated object must be visited: the live ones during the mark phase and the dead ones during the sweep phase. On systems with virtual memory, where the program address space is larger than primary memory, visiting all these objects may require the entire contents of dynamic memory to be brought into primary memory each time a collection is performed. Also, after many collections, objects become scattered across the address space because the space reclaimed from unreachable objects is fragmented into many pieces by the remaining live objects. (Explicit de-allocation also suffers from this problem.) Scattering reduces locality of reference and ultimately increases the size of primary memory required to support a given application program.
Copying collectors provide a partial solution to this problem (Baker 1978, Cohen 1981). These algorithms mark objects by copying them to a separate contiguous area of primary memory. Once all the reachable objects have been copied, the entire address space consumed by the remaining unreachable objects is reclaimed at once; garbage objects need not be swept individually. Because in most cases the ratio of live to dead objects tends to be small (by selecting an appropriate collection interval), the cost of copying live objects is more than offset by the drastically reduced cost of reclaiming the dead ones. As an additional benefit, spatial locality is improved as the copying phase compacts all the live objects. Finally, allocation of new objects from the contiguous free space becomes extremely inexpensive. A pointer to the beginning of the free space is maintained; allocation consists of returning the pointer and incrementing it by the size of the allocated object.
But copying collectors are not a panacea: they cause disruptive pauses, and they can only be used when pointers can be reliably identified. Long pauses occur when a large number of reachable objects must be traced at each collection. Generational collectors reduce tracing costs by limiting the number of objects traced (Lieberman & Hewitt, 1983; Moon, 1984; Ungar, 1984). The precise runtime type information available in languages such as LISP, ML, Modula, and Smalltalk allows pointers to be reliably identified. However, for languages such as C or C++, copying collection is difficult to implement because the lack of runtime type information prevents pointer identification. One solution is to have the compiler provide the necessary information (Diwan, Moss & Hudson, 1992). Conservative collectors provide a solution when such compiler support is unavailable (Boehm & Weiser, 1988).
2.2 Generational Collection
For best performance, a collector should minimize the number of times each reachable object is traced during its lifetime. Generational collectors exploit the experimental observation that old objects are less likely to die than young ones by tracing old objects less frequently. Since most of the dead objects will be young, only a small fraction of the reclaimable space will remain unreclaimed after each collection, and the cost of frequently retracing all the old objects is saved. Eventually, even the old objects must be traced to reclaim long-lived dead objects. Generational collectors divide the memory space into several generations, where each successively older generation is traced less frequently than the younger ones. Adding generations to a copying collector reduces scavenge-time pauses because old objects are neither copied nor traced on every collection.
Generational collectors can avoid tracing objects in the older generations when pointers from older objects to younger objects are rare. Tracing the old objects is especially expensive when they have been paged out to disc, and this cost increases as the older generations become significantly larger than the younger ones, as is typically the case. One way implementations of generational collectors reduce tracing costs is to segregate large objects that are known not to contain pointers into a special untraced area (Ungar & Jackson, 1992). Another way is to maintain forward-in-time intergenerational pointers explicitly in a collector data structure, the remembered set, which becomes an extension of the root set. When a pointer to a young object is stored into an object in an older generation, that pointer is added to the remembered set for the younger generation. Tracking such stores is called maintaining the write barrier. Stores from young objects into old ones are not explicitly tracked; instead, whenever a given generation is collected, all younger generations are also collected. The write barrier is often maintained by using virtual memory to write-protect pages that are eligible to contain such pointers (Appel, Ellis & Li, 1988). Another method is to use explicit inline code to check for such stores. Such a check may be inserted by the compiler, but other approaches are possible; for example, a post-processing program may be able to recognize pointer stores in the compiler output and insert the appropriate instructions.
Designers of generational collectors must also establish the size, collection, and promotion policies for each generation, and decide how many generations are appropriate. The collection policy determines when to collect, the number of generations, and their sizes; the promotion policy determines what is collected.
The collector must determine how frequently to scavenge each generation: more frequent collections reduce memory requirements at the expense of increased CPU time, because space is reclaimed sooner but live objects are traced more often. As objects age, they must be promoted to older generations to reduce scavenge costs. Promoting a short-lived object too soon wastes space, because the object may be reclaimed long after it becomes unreachable; promoting a long-lived object too late wastes CPU time, as the object is traced repeatedly. The space required by each generation is strongly influenced by the promotion and scavenge policies. If the promotion policy of a generational collector is chosen poorly, then tenured garbage will cause excessive memory consumption. Tenured garbage arises when many objects that are promoted to older generations die long before the generation is scavenged. This problem is most acute with a fixed-age policy that promotes objects after a fixed number of collections. Ungar and Jackson devised a policy that uses object demographics to delay promotion of objects until the collector's scavenge costs require it (Ungar & Jackson, 1992).
Because generational collectors trade CPU time spent maintaining the remembered sets for reduced scavenge time, their success depends on many aspects of program behavior. If the objects in older generations that consume most of the storage have long lifetimes, if they contain few pointers to young objects (so that pointer stores into them are rare), and if most objects die at a far younger age, then generational collectors will be very effective. However, even generational collectors must still occasionally do a full collection, which can cause long delays for some programs. Often, collectors provide tuning mechanisms that must be manipulated directly by the end user to optimize performance for each program (Apple Computer Inc., 1992; Symbolics Inc., 1985; Xerox Corp., 1983). Generational collectors have been implemented successfully in prototyping languages such as LISP, Modula-3, Smalltalk, and PCedar. These languages share the characteristic that pointers to objects are readily identifiable, or hardware tags are used to identify pointers. When pointers cannot be identified, copying collectors cannot be used, for when an object is copied all pointers referring to it must be changed to reflect its new address; if a pointer cannot be distinguished from other data, its value cannot be updated, because doing so might alter the value of a variable. Existing practice in languages such as C and C++, which prevents reliable pointer identification, has motivated research into conservative non-copying collectors.
2.3 Conservative Collection
Conservative collectors may be used in language systems where pointers cannot be reliably identified (Boehm & Weiser, 1988). Indeed, an implementation already exists that allows a C programmer to retrofit a conservative garbage collector to an existing application (Boehm, 1994). This class of collectors exploits the surprising fact that values that look like pointers (ambiguous pointers) usually are pointers. Misidentified pointers result in some objects being treated as live when, in fact, they are garbage. Although some applications can exhibit severe leakage (Boehm, 1993; Wentworth, 1990), usually only a small percentage of memory is lost to conservative pointer identification.
Imprecise pointer identification causes two problems: valid pointers to allocated objects may not be recognized (derived pointers), or non-pointers may be misidentified as pointers (false pointers). Both cases turn out to be critical concerns for collector implementers.
A derived pointer is one that does not contain the base address of the object to which it refers. Such pointers are typically created by optimizations made either by a programmer or by a compiler, and they occur in two forms. Interior pointers point into the middle of an object; array indices and fields of a record are common examples (BGS 94). Sometimes an object that has no pointer into it from anywhere is still reachable: for example, an array whose lowest index is a non-zero integer may only be reachable from a pointer referring to index zero. Here the problem is that a garbage collector may mistakenly identify the object as unreachable because no explicit pointers to it exist.
With the exception of interior pointers, which are merely more expensive to trace, compiler support is required to solve this problem no matter what collection algorithm is used. In practice, compiler optimizations have not yet been a problem (as of June 1995), because enabling sophisticated optimizations often breaks other code in the user's program, so they are not used with garbage-collected programs in practice (Boehm, 1995b). Such support has been studied by other researchers and will not be discussed further in this dissertation (Boehm, 1991; Diwan, Moss & Hudson, 1992; Ellis & Detlefs, 1994).
False pointers arise when the type of a value (whether or not it is a pointer) is not available to the collector. For example, if the value contained in an integer variable corresponds to the address of an allocated but unreachable object, a conservative collector will not de-allocate that object. A heuristic called blacklisting reduces this problem by not allocating new objects from memory that corresponded to previously discovered false pointers (Boehm, 1993). But even when the type is available, false pointers may still exist: for example, a pointer may be stored into a compiler-generated temporary (in a register or on the stack) that is not overwritten until long after its last use. While the memory leakage caused by the degree of conservatism chosen for a particular collector is still an area of active research, it will not be discussed further in this dissertation except in the context of costs incurred by the conservative collector's pointer-finding heuristic.
Not only can false pointers cause memory leakage, they also preclude copying. When a copying collector finds a reachable object, it creates a new one, copies the contents of the old object into it, deletes the original, and overwrites all pointers to the old object with the address of the new one. If an overwritten "pointer" was not a pointer but the value of a variable, this false pointer cannot be altered by the collector. This problem can be partly solved by moving only objects that are not referenced through false pointers, as in Bartlett's Mostly Copying collection algorithm (Bartlett, 1990).
If true pointers cannot be recognized, then the collector may not copy any objects after they are created, and one of the chief advantages of copying collectors, locality of reference, is lost (Moon, 1984). A conservative collector can also cause a substantial increase in the size of a process's working set as long-lived objects become scattered over a large number of pages. Memory becomes fragmented as the storage freed from dead objects of varying sizes becomes interspersed with long-lived live ones. This problem is no different from the one faced by traditional explicit memory allocation systems, such as the malloc/free interface in widespread use in the C and C++ community, and solutions may be readily transferable between garbage collection and explicit memory allocation algorithms.
The trace and sweep phases of garbage collection, which are not present in explicit memory allocation systems, can dramatically alter the paging behavior of a program. Implementations of copying collectors already adjust the order in which reachable objects are traced during the mark phase to minimize the number of times each page must be brought into main memory. Zorn has shown that isolating the mark bits from the objects in a mark-sweep collector, among other improvements, also reduces collector-induced paging. Generational collectors dramatically reduce the number of pages referenced as well (Moon, 1984).
Even though generational collectors reduce pause times, work is also being done to make garbage collection suitable for the strict deadlines of real-time computing. Baker (1978) suggested incremental collection, which interleaves collection with the allocating program (the mutator) rather than stopping it for the entire duration of the collection. Each time an object is allocated, the collector does enough work to ensure that the current collection completes before another one is required.
Incremental collectors must ensure that traced objects (those that have already been scanned for pointers) are not altered, for if a pointer to an otherwise unreachable object is stored into a previously scanned object, that pointer will never be discovered and the object, which is now reachable, will be erroneously reclaimed. Although originally maintained by a read barrier (Baker, 1978), this invariant may also be maintained by a write barrier: the write barrier detects when a pointer to an untraced object is stored into a traced one, which is then retraced. Notice that this barrier may be implemented by the same method as the one for the remembered set in generational collectors; only the set of objects monitored by the barrier changes. Nettles and O'Toole (1993) relaxed this invariant in a copying collector by using the write barrier to monitor stores into threatened objects and altering their copies before de-allocation. Because incremental collectors are often used where performance is critical, any technology that improves write-barrier performance is important to these collectors. Conversely, high-performance collection of any type is more widely useful if designed so that it may easily be adapted to become incremental. This dissertation will not explicitly discuss incremental collection further, but keep in mind that write-barrier performance applies to incremental as well as generational collectors.
2.4 Related Work
This dissertation combines and expands upon the work done by several key researchers. Xerox PARC developed a formal model and the concept of explicit threatened and immune sets; Ungar and Jackson developed a dynamic promotion policy; Hosking, Moss, and Stefanovic compared the performance of various write barriers for precise collection; and Zorn showed that inline write barriers can be quite efficient. I shall now describe each of these works and then introduce the key contributions this dissertation will make and how they relate to the previous work.
2.4.1 Theoretical Models and Implementations
Researchers at Xerox PARC have developed a powerful formal model for describing the parameter spaces for collectors that are both generational and conservative. A garbage collection becomes a mapping from one storage state to another. They show that storage states may be partitioned into threatened and immune sets. The method of selecting these sets induces a specific garbage collection algorithm. A pointer augmentation provides the formalism for modeling remembered sets and imprecise pointer identifications. Finally, they show how the formalism may be used to combine any generational algorithm with a conservative one. They used the model to design and then implement two different conservative generational garbage collectors. Their Sticky Mark Bit collector uses two generations and promotes objects surviving a single collection. A refinement of this collector (Collector II) allows objects allocated beyond an arbitrary point in the past to be immune from collection and tracing. This boundary between old objects, which are immune, and the new objects, which are threatened, is called the threatening boundary. More recently, these authors have received a software patent covering their ideas.
Until now, Collector II was the only collector that made the threatening boundary an explicit part of the algorithm. It used a fixed threatening boundary and time scale that advanced only one unit per collection. This choice was made to allow an easy comparison with a non-generational collector, not to show the full capability of such an idea.
Both collectors show that the use of two generations substantially reduces the number of pages referenced by the collector during each collection. However, these collectors exhibited very high CPU overhead: the generational collectors frequently doubled the total CPU time. In later work, they implemented a Mostly Parallel concurrent two-generation conservative Sticky Mark Bit collector for the PCedar language. This combination substantially reduced collection pause times compared to a simple full-sweep collector for the two programs they measured. These collectors used page-protection traps to maintain the write barrier: they write-protected the entire heap address space and installed a trap handler to update a dirty bit on the first write to each page. Pause times were reduced by conducting the trace in parallel with the mutator; once the trace was complete, they stopped the mutator and retraced objects on all pages flagged as dirty. All their collectors shared the limitation that, once promoted to the next generation, objects were only reclaimed when a full collection occurred, so scavenger updates to the remembered set were not addressed; tenured garbage could only be reclaimed by collecting the entire heap. My work extends theirs by exploiting the full power of their model to dynamically update the threatening boundary at each collection rather than relying only upon a simple fixed-age or full-collection policy.
Ungar and Jackson measured the effect of a dynamic promotion policy, feedback mediation, upon the amount of tenured garbage and pause times for four six-hour Smalltalk sessions (UJ). They observed that object lifetime distributions are irregular and that object lifetime demographics can change during execution of the program. This behavior undermines a fixed-age tenuring policy, causing long pause times when a preponderance of young objects results in too little tenuring, and excessive garbage when old objects cause too much tenuring.
They attempted to solve this problem using two different approaches. First, they placed pointer-free objects (bitmaps and strings) larger than one kilobyte into a separate area; this approach was effective because such objects need not be traced and are expensive to copy. Second, they devised a dynamic tenuring policy that used feedback mediation and demographic information to alter the promotion policy so as to limit pause times. Rather than promoting objects after a fixed number of collections, feedback mediation only promoted objects when a pause-time constraint was exceeded because a high percentage of data survived a scavenge and would be costly to trace again. To determine how much to promote, they maintained object demographic information as a table containing the number of bytes surviving at each age, where age is the number of scavenges survived. The tenuring threshold was then set so the next scavenge would likely promote the number of bytes necessary to reduce the size of the youngest generation to the desired value.
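The threshold computation just described can be sketched as follows; the table shape, function name, and promote-the-oldest-cohorts-first policy are illustrative assumptions, not Ungar and Jackson's actual code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical survival table: bytes_surviving[a] is the number of bytes
// that have survived exactly a scavenges (a = age in scavenges).
// choose_tenuring_threshold picks the smallest age threshold such that
// promoting every object of that age or older at the next scavenge is
// expected to shrink the youngest generation to at most desired_bytes.
std::size_t choose_tenuring_threshold(const std::vector<std::size_t>& bytes_surviving,
                                      std::size_t desired_bytes) {
    std::size_t total = 0;
    for (std::size_t b : bytes_surviving) total += b;

    // Promote the oldest cohorts first: walk the threshold down from the
    // maximum recorded age until enough bytes would be promoted.
    std::size_t threshold = bytes_surviving.size();
    std::size_t remaining = total;
    while (threshold > 0 && remaining > desired_bytes) {
        --threshold;
        remaining -= bytes_surviving[threshold];
    }
    return threshold;  // objects aged >= threshold are promoted
}
```

With a table of {100, 80, 60, 40, 20} bytes at ages 0 through 4 and a desired youngest-generation size of 200 bytes, the threshold lands at age 2: promoting ages 2 and older removes 120 bytes, leaving 180.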
Their collector appears similar to Collector II in that it uses an explicit threatening boundary, but differs because it does so for promotion only, not for selecting the immune set directly. My work extends theirs by allowing objects to be demoted. Their object promotion policies can be modeled by advancing the threatening boundary by an amount determined by the demographic information each time the pause-time constraint is exceeded. I extend this policy by moving the threatening boundary backward in time to reclaim the tenured garbage that was previously promoted. Hanson implemented a movable threatening boundary in a garbage collector for the SNOBOL-4 programming language. After each collection, surviving objects were moved to the beginning of the allocated space and the remaining (now contiguous) space was freed. Allocation subsequently proceeded in sequential address order from the free space. After the mark phase, and before the sweep phase, the new threatening boundary was set to the address of the lowest unmarked object found by a sequential scan of memory. This action corresponds to a policy of setting the threatening boundary to the age of the oldest unmarked object before each sweep. His scheme is an optimization of a full copying garbage collector that saves the cost of copying long-lived objects, but his collector must still mark and sweep the entire memory space.
2.4.3. Write Barrier Performance
Hosking, Moss, and Stefanovic at the University of Massachusetts evaluated the relative performance of various inline write-barrier implementations for a precise copying collector using five Smalltalk programs. They developed a language-independent garbage collector toolkit for copying, precise, generational garbage collection which, like Ungar and Jackson's collector, maintains a large object space. They compared the performance of several write-barrier implementations: card marking using either inline store checks or virtual memory, and explicit remembered sets. They presented a breakdown of scavenge time for each write barrier and program. Their research showed that maintaining the remembered sets explicitly outperformed the other approaches in terms of CPU overhead for Smalltalk.
Zorn showed that an inline write barrier exhibited lower than expected CPU overhead compared with using operating-system page-protection traps to maintain a virtual-memory write barrier. Specifically, he concluded that carefully designed inline software tests appear to be the most effective way to implement the write barrier, resulting in overheads of 2-6%.
In separate work, he showed that properly designed mark-sweep collectors can significantly reduce memory overhead for a small increase in CPU overhead in large LISP programs. These results support the notion that using an inline write barrier and non-copying collection can improve the performance of garbage collection algorithms.
Ungar and Jackson's collector provided a powerful tool for reducing the creation rate of tenured garbage by adjusting the promotion policy dynamically. I take this policy a step further and adjust the generation boundary directly instead. PARC's Collector II maintains such a threatening boundary, but they measured only the case where the time of the last collection was considered. I alter the threatening boundary dynamically before each scavenge, which, unlike Ungar and Jackson's collector, allows objects to be untenured and hence further reduces memory overhead due to tenured garbage. Unlike other generational garbage collection algorithms, I have adopted PARC's notation for immune and threatened sets, which simplifies the specification of my collector relative to other generational collectors. In order to avoid compiler modifications, previous conservative collectors have used page-protection calls to the operating system to maintain the write barrier. Recent work has shown that program binaries may be modified without compiler support: tools exist, such as QPT, Pixie, and ATOM, that alter the executable directly to perform tasks such as trace generation and profiling. The same techniques may be applied to generational garbage collectors to add an inline write barrier by inserting explicit instructions to check for pointer stores into the heap.
Previous work has only evaluated inline write barriers for languages other than C, e.g., LISP, Smalltalk, and Cedar. I evaluate the costs of using an inline write barrier for compiled C programs. Generational copying collectors avoid destroying the locality of the program by compacting objects; conservative, non-copying collectors cannot do this compaction. Even so, Zorn showed that mark-sweep collectors can perform well, and malloc/free systems have worked in C and C++ for years with the same limitation. In previous work I examined the effectiveness of using the allocation site to predict short-lived objects. For the five C programs measured in that paper, the majority of all objects were short-lived, and the allocation site often predicted over 80% of them. In addition, over 40% of all dynamic references were to predictable short-lived objects. By using the allocation site and object size to segregate short-lived objects into a small (64-Kbyte) arena, short-lived objects can be prevented from fragmenting the memory occupied by long-lived ones. Because most references are to short-lived objects now contained in a small arena, reference locality is significantly improved. In this document, I will discuss new work based upon lifetime prediction and store behavior to show future opportunities for applying the prediction model.
The same could be said of designs for complex software systems. The designer's task is to choose the simplest dynamic storage allocation system that meets the application's needs. Which system is chosen ultimately depends upon program behavior. The designer chooses an algorithm, data structure, and implementation based upon the anticipated behavior and requirements of the application. Data of known size that lives for the entire duration of the program may be allocated statically. Stack allocation works well for the stack-like control flow of subroutine invocations. Program portions that allocate only fixed-size objects lead naturally to the idea of using explicit free lists to minimize memory fragmentation. The observation that the survival rate of objects is lower for the youngest ones motivated the implementation of generational garbage collection. In all cases, observing the behavior of the program resulted in innovative solutions. All the work presented in this dissertation is based upon concrete measurements of program behavior. Program behavior is often the most important factor in deciding what algorithm or policy is most appropriate. While I present measurements in the context of the above three contributions, they are presented in enough detail to allow current and future researchers to gain useful insight from the behavior measurements themselves. Specifically, I present material about the store behavior of C programs which has not previously appeared elsewhere.
- Implementation of Garbage Collection Algorithm
Any type of dynamic storage allocation system imposes both CPU and memory costs. These costs often strongly affect the performance of the system and pass directly to the purchaser of the hardware as well as to software project schedules. Thus, the selection of the appropriate storage management technique will often be determined primarily by its costs. This chapter discusses the implementation model for garbage collection so that the experimental methods and results to follow may be evaluated properly. I will proceed from the simplest storage allocation strategies to the more complex ones, adding refinements and describing their costs as I proceed. For each strategy, I will outline the algorithm and data structures, and then provide details of the CPU and memory costs. Initially, explicit storage allocation costs are discussed to provide context and motivation for the costs of the simplest garbage collection algorithms: mark-sweep and copying. Lastly, the more elaborate techniques of conservative and generational garbage collection are discussed.
3.1 Explicit Storage Allocation
Explicit dynamic storage allocation (DSA) provides two operations to the programmer: allocate and de-allocate. Allocate creates uninitialized contiguous storage of the required size for a newly allocated object and returns a reference to that storage. De-allocate takes a reference to an object and makes its storage available for future allocation by adding it to a free-list data structure (objects in the free list are called de-allocated objects). A size must be maintained for each allocated object so that de-allocate can update the free list properly. Allocate obtains new storage either from the free list or by calling an operating system function. Allocate searches the free list first. If an appropriately sized memory segment is not available, allocate either breaks up an existing segment from the free list (if available) or requests a large segment from the operating system and adds it to the free list.
Correspondingly, de-allocate may coalesce segments with adjacent addresses into a single segment as it adds new entries to the free list (boundary tags may be added to each object to make this operation easier). The implementation is complicated slightly by alignment constraints of the CPU architecture, since the storage must be appropriately aligned for access to the returned objects. The costs of this strategy, in terms of CPU and memory overhead, depend critically upon the implementation of the free-list data structure and the policies used to modify it. The CPU cost of allocation depends upon how long it takes to find a segment of the specified size in the free list (if present), possibly fragment it, remove it, and return the storage to the program. The CPU cost of deallocation depends upon the time to insert a segment of the specified address and size into the free list and coalesce adjacent segments. The total CPU overhead depends upon the allocation rate of the program, as measured by the ratio of the total number of instructions required by the allocation and deallocation routines to the total number of instructions executed.
The memory overhead consists entirely of space consumed by objects in the free list waiting to be allocated (external fragmentation) (Ran), assuming that internal fragmentation and the space consumed by the size fields and boundary tags is negligible.
Internal fragmentation is caused by objects that were allocated more storage than required, either to meet alignment constraints or to avoid creating too small a free-space element; careful tuning is often done to the allocator to minimize this internal fragmentation.
The data structure required to maintain the free list may often be ignored because it can be stored in the free space itself. The amount of storage consumed by items in the free list depends highly upon the program behavior and upon the policy used by allocate to select among multiple eligible candidates in the free list. For example, if the program interleaves creation of long-lived objects with many small short-lived ones and then later creates large objects, most of the items in the free list will be unused. Memory overheads (as measured by the ratio of the size of the free space to the total memory required) of thirty to fifty percent are not unexpected (Knu), which leaves much room for improvement (CL).
The total memory overhead depends upon the size of the free space as compared to the total memory required by the program. This free-list overhead is the proper one to use for comparing explicit dynamic storage allocation space overheads to those of garbage collection algorithms, since garbage collection can be considered a form of deferred deallocation. Often, both the CPU and memory costs of explicit deallocation are unacceptably high. Programmers often write specific allocation routines for objects of the same size and maintain a free list for those objects explicitly, thereby avoiding both memory fragmentation and high CPU costs to maintain the free list. But as the number of distinct object sizes increases, the space consumed by the multiple free lists becomes prohibitive. Also, the memory savings depend critically upon the programmer's ability to determine as soon as possible when storage is no longer required. When allocated objects may have more than one reference to them (object sharing), high CPU costs can occur as code is invoked to maintain reference counts. Memory can be wasted by circular structures or by storage that is kept live longer than necessary to ensure program correctness.
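As a concrete illustration of the allocate/de-allocate pair described in this section, here is a minimal first-fit free-list allocator. The arena size, alignment rule, and splitting policy are assumptions for illustration, and coalescing of adjacent segments is omitted from this sketch.

```cpp
#include <cassert>
#include <cstddef>

// Each segment carries a size header; free segments are threaded through
// a singly linked free list stored in the free space itself.
struct Segment {
    std::size_t size;   // payload bytes available in this segment
    Segment* next;      // next free segment (valid only while free)
};

alignas(alignof(std::max_align_t)) static unsigned char arena[4096];
static Segment* free_list = nullptr;

void dsa_init() {
    free_list = reinterpret_cast<Segment*>(arena);
    free_list->size = sizeof(arena) - sizeof(Segment);
    free_list->next = nullptr;
}

// allocate: first-fit search; split a segment when the remainder is large
// enough to hold a header plus some payload. A real allocator would call
// the operating system on failure instead of returning null.
void* dsa_allocate(std::size_t bytes) {
    bytes = (bytes + 7) & ~std::size_t(7);  // alignment constraint
    for (Segment** link = &free_list; *link; link = &(*link)->next) {
        Segment* seg = *link;
        if (seg->size < bytes) continue;
        if (seg->size >= bytes + sizeof(Segment) + 8) {
            // Split: carve the tail of this segment for the request.
            seg->size -= bytes + sizeof(Segment);
            Segment* out = reinterpret_cast<Segment*>(
                reinterpret_cast<unsigned char*>(seg + 1) + seg->size);
            out->size = bytes;
            return out + 1;
        }
        *link = seg->next;      // close fit: unlink the whole segment
        return seg + 1;
    }
    return nullptr;             // would request more memory from the OS
}

// de-allocate: push the segment back onto the free list (coalescing of
// adjacent segments, via boundary tags, is omitted here).
void dsa_deallocate(void* p) {
    Segment* seg = static_cast<Segment*>(p) - 1;
    seg->next = free_list;
    free_list = seg;
}
```

A freed segment is immediately reusable: allocating, freeing, and allocating the same size again returns the same storage.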
3.2 Mark-Sweep Garbage Collection
Mark-sweep garbage collection relieves the programmer of the burden of invoking the de-allocate operation; the collector performs the deallocation. In the simplest case, there is assumed to be a finite fixed upper bound on the amount of memory available to the allocate function. When the bound is exceeded, a garbage collector is invoked to search for and deallocate objects that will never be referenced again. The mark phase discovers reachable objects, and the sweep phase deallocates all unmarked objects. A set of mark bits is maintained, one mark bit for each allocated object. A queue is maintained to record reachable objects that have not yet been traced. The algorithm proceeds as follows. Initially, the queue is empty, all the mark bits are cleared, and the search for reachable objects begins by adding to the queue all roots, that is, statically allocated objects, objects on the stack, and objects pointed to by CPU registers. As each object is removed from the queue, its contents are scanned sequentially for pointers to allocated objects. As each pointer is discovered, the mark bit for the object being pointed to is tested and set and, if it was unmarked, the object is queued. The mark phase terminates when the queue is empty. Next, during the sweep phase, the mark bit for each allocated object is examined and, if clear, deallocate is called on that object.
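The mark and sweep phases just described might be sketched as follows over a toy object graph; the representation (indices standing in for addresses, a vector of outgoing pointers per object) is an assumption for illustration, and the worklist here is popped in last-in-first-out order, which does not affect correctness.

```cpp
#include <cassert>
#include <vector>

// Toy heap object: a mark bit plus a list of outgoing pointers,
// represented as indices into the heap vector.
struct Object {
    bool marked = false;
    std::vector<int> pointers;
};

// mark: clear all mark bits, seed the worklist with the roots, then
// trace: test-and-set the mark bit of each referenced object and queue
// it if it was unmarked. Terminates when the worklist is empty.
void mark(std::vector<Object>& heap, const std::vector<int>& roots) {
    for (Object& o : heap) o.marked = false;
    std::vector<int> worklist(roots);
    for (int r : roots) heap[r].marked = true;
    while (!worklist.empty()) {
        int i = worklist.back();
        worklist.pop_back();
        for (int j : heap[i].pointers) {
            if (!heap[j].marked) {      // test the mark bit...
                heap[j].marked = true;  // ...set it, and queue the object
                worklist.push_back(j);
            }
        }
    }
}

// sweep: every object left unmarked is unreachable; a real sweep would
// call de-allocate on each one, while this sketch just counts them.
int sweep_count(const std::vector<Object>& heap) {
    int garbage = 0;
    for (const Object& o : heap)
        if (!o.marked) ++garbage;
    return garbage;
}
```

For a four-object heap where object 0 (a root) points to 1, 1 points to 2, and 3 is unreferenced, the sweep finds exactly one garbage object.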
As a refinement, the implementor may use a set instead of a queue and may choose an order other than first-in-first-out for removing elements from the set. Mark-sweep collection adds CPU costs over explicit DSA for clearing the mark bits and, for each reachable object, setting the mark bit, enqueuing, scanning, and dequeuing. In addition, the mark bit must be tested for each allocated object, and each unreachable object must be located and de-allocated. Deferred sweeping may be used to reduce the length of pauses caused when the collector interrupts the application. With deferred sweep, the collector resumes the program after the mark phase; subsequent allocate requests test mark bits, deallocating unmarked objects until one of the required size is found. Deferred sweeping should be completed before the next collection is invoked, since starting a collection when memory is available is probably premature. The first component of the memory cost for mark-sweep is the same as for explicit deallocation where the deallocation for each object is deferred until the next collection; this cost can be very significant, often one and one half to three times the memory required by explicit deallocation. In addition to the size, a mark bit must be maintained for each allocated object.
Memory for the queue that maintains the set of objects to be traced must be managed by clever means to avoid becoming excessive. A brute-force technique to handle queue overflow is to discard the queue and restart the mark phase without clearing the previously set mark bits. If at least one mark bit is set before the queue is discarded, the algorithm will eventually terminate. Virtual memory makes it attractive to collect more frequently than each time the entire virtual address space is exhausted. The frequency of collection affects both the CPU and memory overhead. As collections occur more frequently, the memory overhead is reduced because unreachable objects are deallocated sooner, but the CPU overhead rises as objects are traced multiple times before they are deallocated. The two degenerate cases are interesting. Collecting at every allocation uses no more storage than explicit deallocation but at maximal CPU cost; no collection at all has the minimum CPU overhead of explicit deallocation with a zero-cost deallocate operation, but consumes the most memory. The latter case may often be the best for short-lived programs that must be composed rapidly.
The designer of the collector must tune the collection interval to match the resources available. Although this dissertation will not discuss it further, policies for setting the collection interval are an interesting topic in their own right, and there is much room for future research. As mentioned earlier, during explicit dynamic storage deallocation, fragmentation can consume a significant portion of available memory, especially for systems that have high allocation and deallocation rates of objects of a wide variety of sizes and lifetimes. Other researchers have observed that the vast majority of objects have very short lifetimes, under one megabyte of allocation or a few million instructions. This observation motivates two other forms of garbage collection, copying collection, which reduces fragmentation and sweep costs, and generational collection, which reduces trace times for each collection.
3.3 Copying Garbage Collection
Copying garbage collection marks objects by copying them to a separate empty address space, called to-space. Mark bits are unnecessary because an address in to-space implicitly marks the object as reachable. After each object is copied, the address of the newly copied object is written into the old object's storage. The presence of this forwarding pointer indicates a previously marked object that need not be copied each subsequent time the object is visited. As each object is copied or a reference to a forwarding pointer is discovered, the collector overwrites the original object reference with the address of the new copy. The sweep phase does not require examining mark bits or explicit calls to de-allocate each unmarked object. Instead, the unused portion of to-space and the entire old address space (from-space) become the new free list (new-space).
Allocation from new-space becomes very inexpensive: incrementing an address, testing it for overflow, and returning the previous address. Collection occurs each time the test indicates overflow of the size of to-space. No explicit free-list management is required. Copying collection adds CPU overhead for copying the contents of each of the reachable objects. Memory overhead is added for maintaining a copy in to-space during the collection, but fragmentation is eliminated because copying makes the free list a contiguous new-space. To-space may be kept small by ensuring that the survival rate is kept low by increasing the collection interval. Copying collection can only be used where pointers can be reliably identified. If a value that appears to point to an object is changed to reflect the updated object's address and that value is not a pointer, the program semantics would be altered.
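The bump-pointer allocation sequence just described can be sketched in a few lines; the space size here is an arbitrary assumption, and the collection that a real system would trigger on overflow is elided.

```cpp
#include <cassert>
#include <cstddef>

// New-space and the bump pointer. A real collector would flip spaces and
// collect when the overflow test fails; this sketch just returns null.
alignas(8) static unsigned char new_space[1024];
static std::size_t bump = 0;

void* gc_allocate(std::size_t bytes) {
    bytes = (bytes + 7) & ~std::size_t(7);   // keep objects aligned
    if (bump + bytes > sizeof(new_space))    // overflow => collect
        return nullptr;                      // (collection elided here)
    void* result = new_space + bump;         // return the previous address
    bump += bytes;                           // increment
    return result;
}
```

Two consecutive 10-byte requests come back 16 bytes apart (rounded up for alignment), and a request larger than the space fails the overflow test.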
3.4 Conservative Garbage Collection
Unlike copying collectors, conservative collectors may be used in languages where pointers are difficult to identify reliably. Conservative collectors are conservative in two ways: they assume that values are pointers for the purposes of determining whether an object is reachable, and that values are not pointers when considering an object for movement. They will not deallocate any object (or its descendants referenced only by a value that appears to be a pointer), and they will not move an object once it has been allocated. Conservative garbage collection requires a pointer-finding heuristic to determine which values will be considered potential pointers. More precise heuristics avoid unnecessarily retained memory caused by misidentified pointers, at the cost of additional memory and CPU overhead. The heuristic must maintain all allocated objects in a data structure that is accessed each time a value is tested for pointer membership. The test takes a value that appears to be an address, and returns true if the value corresponds to an address pointing into a currently allocated object. This test occurs for each value contained in each traced root or heap object during the mark phase.
The precise cost of the heuristic depends highly upon the architecture of the computer, operating system, language, compiler, runtime environment, and the program itself. The Boehm collector requires only a few instructions on the DEC Alpha to map a value to the corresponding allocated object descriptor. In addition to the trace cost, CPU overhead is incurred to insert an object into the pointer-finding data structure at each allocation, and to remove it at each deallocation. As with mark-sweep, deferred sweep may be used.
In addition to the memory for the mark bits previously mentioned for mark-sweep, conservative collectors require space for the pointer-finding data structure. On the DEC Alpha, the Boehm collector uses a two-level hash table to map addresses to a page descriptor. All objects on a page are the same size. Six pointer-sized words per virtual-memory page are required. The space for page descriptors is interleaved throughout dynamically allocated memory in pages that are never deallocated.
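A pointer-membership test of the kind described above can be sketched as follows. Note this uses an ordered std::map keyed by object start address rather than the Boehm collector's two-level table, and the interface names are assumptions for illustration; the point is only the test itself: a value "looks like" a pointer exactly when it falls inside some currently allocated object.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>

// Registry of allocated objects: start address -> size in bytes.
static std::map<std::uintptr_t, std::size_t> allocated;

void note_allocation(void* p, std::size_t size) {
    allocated[reinterpret_cast<std::uintptr_t>(p)] = size;
}

void note_deallocation(void* p) {
    allocated.erase(reinterpret_cast<std::uintptr_t>(p));
}

// is_potential_pointer: true iff the value corresponds to an address
// pointing into a currently allocated object (interior pointers count).
bool is_potential_pointer(std::uintptr_t value) {
    auto it = allocated.upper_bound(value);  // first object starting after value
    if (it == allocated.begin()) return false;
    --it;                                    // object starting at or before value
    return value < it->first + it->second;   // inside [start, start + size)?
}
```

Every word of every traced root and heap object would be run through this test during the mark phase, which is why the per-lookup cost matters so much.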
3.5 Generational Garbage Collection
Recall that generational garbage collectors attempt to reduce collection pauses by partitioning memory into generations based upon the allocation time of an object; the youngest objects are collected more frequently than the oldest. Objects are assigned to generations and are promoted to older generations as they age, and a write barrier is used to maintain the remembered set for each generation. The memory overhead consists of generation identifiers, tenured garbage, and the remembered set. Also, partition fragmentation can increase memory consumption for copying generational collectors when the memory space reserved for one generation cannot be used for another. The CPU overhead consists of costs for promoting objects, the write barrier, and updating the remembered set. Each of these costs is discussed in this section; an understanding of them is required to evaluate the results presented in the experimental chapters later in this dissertation. The collector must keep track of which generation each object belongs to. For copying collectors, the generation is encoded by the object's address. For mark-sweep collectors, the generation must be maintained explicitly, usually by clustering objects into blocks of contiguous addresses and maintaining a word in the block encoding the generation to which all objects within the block belong.
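The block-clustering scheme for mark-sweep collectors described above might look like this in outline; the block size, header layout, and function names are assumptions for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Objects are clustered into aligned blocks; one word at the start of
// each block records the generation of every object in the block. A
// 4-Kbyte block size is an assumption for illustration.
constexpr std::uintptr_t kBlockSize = 4096;

struct BlockHeader {
    int generation;  // generation shared by all objects in this block
};

// generation_of: mask off the low address bits to find the enclosing
// block header, then read the generation word.
int generation_of(const void* object) {
    std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(object);
    auto* header =
        reinterpret_cast<const BlockHeader*>(addr & ~(kBlockSize - 1));
    return header->generation;
}

// Promotion by "changing the value of the corresponding generation
// field": the whole block of objects ages together.
void promote_block(void* any_object_in_block) {
    std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(any_object_in_block);
    auto* header = reinterpret_cast<BlockHeader*>(addr & ~(kBlockSize - 1));
    ++header->generation;
}
```

The appeal of this encoding is that the generation lookup is a mask and a load, with no per-object generation field.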
As objects age, they may be promoted to older generations either by copying or by changing the value of the corresponding generation field. Tenured garbage is memory overhead that occurs in generational collectors when objects in promoted generations are not collected until long after they become unreachable; it persists until the next scavenge of the generations containing that garbage. In a sense, all garbage collectors generate tenured garbage from the time objects become unreachable until the next collection, and memory leaks are the tenured garbage of explicit dynamic storage allocation systems. One of the central research contributions of this dissertation is to quantify the amount of tenured garbage for some applications, to show how it may be reduced, and to show how that reduction can impact total memory requirements.
In order to avoid tracing objects in generations older than the one currently being collected, a data structure called the remembered set is maintained for each generation. The remembered set contains the locations of all pointers into a generation from objects outside that generation. The remembered set is traced along with the root set when the scavenge begins. PARC's formal model called the remembered set a pointer augmentation, and each element of the set was called a rescuer. This additional tracing guarantees that the collector will not erroneously collect objects in the younger, traced generation that are reachable only by indirection through the older, untraced generations. CPU overhead occurs during the trace phase in adding the appropriate remembered set to the roots, and in scanning each object pointed to from the remembered
set. A heuristic to reduce the size and memory overhead of the remembered set is often (indeed, universally) used: only pointers from generations older than the scavenged generation are recorded, at the cost of requiring all younger generations to be traced. This heuristic makes a time-space trade-off, accepting increased CPU overhead for tracing younger generations in order to reduce the size of the remembered set, based upon the assumption that forward-in-time pointers (pointers from older objects to younger ones) are rare. If objects containing pointers are rarely overwritten after being initialized, the assumption would appear to be justified; however, empirical evidence supporting this assumption for specific language environments is often lacking in the literature. Still, collecting all younger generations does have the advantage of reclaiming circular structures crossing generation boundaries. The write barrier adds pointers to the remembered set as they are created by the application program. Each store that creates a pointer into a younger generation from an older one inserts that pointer into the remembered set. The write barrier may be implemented either by an explicit inline instruction sequence or by virtual-memory page-protection traps. The CPU cost of the instruction sequence consists of instructions inserted at each store. The sequence tests for creation of a forward-in-time intergenerational pointer and inserts the address of each pointer into the remembered set.
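A sketch of such an inline write-barrier sequence for a two-generation collector follows; the object layout, generation test, and remembered-set representation are illustrative assumptions, not a particular collector's implementation.

```cpp
#include <cassert>
#include <unordered_set>

// Toy heap object: a generation tag (0 = young, 1 = old) and a single
// pointer slot, for simplicity.
struct Obj {
    int generation;
    Obj* field;
};

// Remembered set: addresses of pointer slots in old objects that refer
// to young objects.
static std::unordered_set<Obj**> remembered_set;

// write_barrier: performed at every pointer store. If the store creates
// a forward-in-time pointer (old object -> younger object), record the
// location of the updated slot in the remembered set. Inlined, this is
// the store, a compare of the two generation tags, and a rarely taken
// insertion path.
void write_barrier(Obj* source, Obj* target) {
    source->field = target;  // the store itself
    if (target != nullptr && source->generation > target->generation)
        remembered_set.insert(&source->field);
}
```

Only the old-to-young store pays the insertion cost; young-to-old and same-generation stores fall through after the compare, which is what makes the inline sequence competitive with trap-based barriers.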
The virtual-memory CPU cost consists of delays caused by page write-protect traps used to field the first store to each page in an older generation since the last collection of that generation. Because the cost of page-protection traps can be significant, on the order of microseconds, there is motivation for investigating an explicit instruction sequence for the write barrier. When three or more generations exist, updating the remembered sets requires the capability to delete entries. The collector must ensure that unreachable objects discovered and deallocated from scavenged generations are removed from the remembered sets. A crude, but correct, approach is to delete all pointers from the remembered sets for the scavenged generations and then add them back as the trace phase proceeds. Consider an n-generation collector containing generations 1, the youngest, through n, the oldest. Before initiating the trace phase, suppose we decide to collect generations k and younger for some k such that k ≤ n. We delete from the remembered set of each collected generation all pointers originating in generations k and younger (pointers from older, uncollected generations must be retained, since they will not be rediscovered); as the trace
proceeds, any pointer traced that crosses one or more generation boundaries from an older generation s to a younger generation t is then added to the remembered set for the target generation. Another approach is to explicitly remove from each generation's remembered set all entries corresponding to pointers contained in each object as it is scanned. This deletion can occur during the mark phase or as each object is deallocated during the (possibly deferred) sweep phase. The recent literature is not very precise about this, presumably because currently only generational collectors that use two generations are common. In that case, only one remembered set exists (for the older generation), and it is completely cleared only when a full collection occurs, so precise remembered-set update operations are not required.
- EVALUATION OF GARBAGE COLLECTION ALGORITHMS
3.1 Write-Barrier for C
3.2 Garbage Collection for C++
A number of different mechanisms for adding automatic memory reclamation (garbage collection) to C++ have been considered over the years:
- Smart-pointer-based approaches, which recycle objects no longer referenced via special library-defined replacement pointer types. Boost's shared_ptr (in TR1; see N1450=03-0033) is the most widely used example. The underlying implementation is often based on reference counting, but it does not need to be.
- The introduction of a new kind of primitive pointer type which must be used to refer to garbage-collected (“managed”) memory. Uses of this type are more restricted than C pointers. This is the approach taken by C++/CLI, which is currently under consideration by ECMA TC39/TG5. This approach probably provides the most freedom to the implementor of the underlying garbage collector, thus potentially providing the best GC performance, and possibly the best interoperability with aggressive implementations of languages like C#.
- Transparent GC, which allows objects referenced by ordinary pointers to be reclaimed when they are no longer reachable.
We propose to support the third alternative, independently of the other two. While manual memory management is a powerful feature of C++, this proposal gives developers the choice of not using manual memory management without feeling penalized by its presence in the language. This is supported by the principle that C++ programmers should not be impacted by unused features. Likewise, programs using explicit memory management should not be impacted in any way by the presence of the optional garbage collection feature we are proposing. This proposal allows C++ to provide full support for the large class of applications that do not have a specific need for manual memory management and could be more quickly and reliably developed in a fully garbage-collected environment. We believe this will make C++ a simpler and more attractive option for the large number of developers and development organizations that are unwilling or unable to use manual memory management and do not develop applications requiring it, without negatively affecting current users of C++. Our intent is to support the use of preexisting C++ code with a garbage collector in as many cases as possible.
Transparent collection creates support for a variety of useful C++ scenarios:
- Transparent garbage collection provides C++ with support for fully garbage collected applications on a par with other popular languages with respect to ease of use, standard library support, performance, automatic collection of cycles, etc. This would make C++ a simpler and more attractive option for the large class of applications that do not require manual memory management, which are currently often written in other languages solely due to their transparent support for automatic memory management. Although smart pointers are known to work well in some contexts, particularly if only a distinguished set of large objects is affected and smart-pointer updates can be made infrequent, they are not suitable for the many programmers who wish to dispense with manual memory management entirely. This underscores the complementary value provided by the transparent garbage collection approach.
- Most existing code can be converted to garbage collection with no code changes, such that the code no longer fails to deallocate "unreachable" memory. Because the existing code's deallocation calls are still executed, garbage collection is used only to reclaim leaked memory, so collection cycles need occur only very infrequently, providing the safety of full garbage collection without the performance cost of frequent garbage collection cycles. This mode of operation is often referred to as "litter collection."
- Even if the programmer's goal is to continue to use explicit memory deallocation, this approach strengthens the use of tools such as IBM/Rational Purify's leak detector. Since these tools are based on conservative garbage collectors, they suffer the same issues as transparently garbage-collected applications, though the failure mode is often limited to spurious error messages.
- Unlike the smart-pointer based approaches, this approach to garbage collection allows pointers to be manipulated as in traditional C and C++ code. There are no correctness restrictions on, for example, the life-time of C++ references to garbage-collected memory. There is no performance motivation to pass pointers by reference. Thus it does not require the programmer to relearn some basic C idioms. Since we do not reference count, we avoid difficult-to-debug cyclic pointer chain issues that may occur with reference-counted smart pointers.
- This approach will normally significantly outperform smart-pointer based techniques for applications manipulating many small objects, particularly if the application is multi-threaded. Transparent garbage collection allows garbage-collector implementations that perform well enough to be used in open source Java and CLI implementations, though probably not quite as well as what can be accomplished for C++/CLI.
- Unlike the C++/CLI approach, transparent garbage collection allows easy "cut-and-paste" reuse of existing source code and object libraries without the need to modify their memory management or learn how to manipulate two types of pointers. The same template code that was designed for manually managed memory can almost always be applied to garbage collected memory. The transparent garbage collection approach also allows safe reuse of the large body of C and C++ code that is not known to be fully type safe, as long as the Required Changes below are verified. The tradeoff for the greater reuse and simplicity is that transparent garbage collection is not quite as safe as the C++/CLI approach, because programmers must recognize when they are hiding pointers and use one of the Required Changes mechanisms in that infrequent case.
- The approach will interact well with atomic pointer update primitives, once those are added to the language. Smart-pointer-based approaches generally cannot accommodate concurrent updates to a shared pointer, at least not without significant additional cost. This is important for some high-performance lock-free algorithms.
We believe we can provide robust support for transparent GC with minimal changes to the existing language. More importantly, we believe that except for those few programs requiring “advanced” garbage collection features, most programs will require no code changes at all.
- In obscure cases, the current language allows the program to effectively hide “pointers” from the garbage collector, thus potentially inducing the collector to recycle memory that is still in use. We propose rules similar to Stroustrup’s original proposal (N0932) to clarify when this may happen.
- We propose a set of pragmas to allow the programmer to specify any assumptions about garbage collection made by the source file. In the absence of any such specifications, it is implementation defined whether a garbage collector will be used. We expect this to be controlled by a compiler flag.
- We propose a small set of APIs and classes to access advanced but occasionally necessary garbage collection features. We expect that these APIs will not be used outside of specialized circumstances.
3.3 Garbage Collection in Java-Based Embedded Systems
Embedded systems have tight memory requirements, and applications are often long-running. It is therefore essential to place a tight bound on memory loss due to fragmentation, which cannot be done without compaction (or some other technique that moves objects). Our mark-compact collector, based on Saunders' original mark-compact algorithm, allocates linearly until the heap is exhausted and then compacts by sliding objects "to the left" (towards low memory). It therefore tends to preserve (or even improve) locality, and fragmentation is eliminated completely on every collection. As in a semi-space copying collector, allocation is very fast: a simple bump pointer and range check. This allocation sequence has the advantage that it is short enough to consider inlining, although in our JVM we use a hand-coded, but out-of-line, allocation sequence. Note, however, that on platforms that present a segmented, non-virtual memory interface, fragmentation at the end of segments becomes an issue that must be addressed.
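The bump-pointer-and-range-check fast path described above can be sketched as follows. This is a minimal illustration, with addresses modeled as plain longs and the object headers and collection machinery omitted; the class name, the failure convention, and the sizes are ours, not the paper's:

```java
// Sketch of bump-pointer allocation with a range check, the fast path of a
// mark-compact collector. All names and the -1 failure convention are
// illustrative assumptions.
public class BumpAllocator {
    private long cursor;      // next free address
    private final long limit; // end of the current heap region

    public BumpAllocator(long start, long end) {
        this.cursor = start;
        this.limit = end;
    }

    /** Returns the address of the new object, or -1 if the region is
     *  exhausted (which would trigger a collection in a real VM). */
    public long allocate(long size) {
        long result = cursor;
        long next = cursor + size;
        if (next > limit) {
            return -1; // out of space: caller must collect and retry
        }
        cursor = next;
        return result;
    }
}
```

Because the fast path is just an add and a compare, it is a plausible candidate for inlining, as the text notes.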
2.1 Compaction and Its Optimizations
The sliding compaction algorithm requires four phases:
- Mark: Traverse the object graph beginning at the roots, marking each object encountered as live.
- Sweep: Scan memory sequentially, looking for dead objects and coalescing them into contiguous free chunks. Compute the new address for each object and store a forwarding pointer in the object.
- Forward: Change all object pointers to point to the forwarded value as determined by the Sweep phase.
- Compact: From left to right, move objects to their new locations.
Typically, the Sweep phase is the most expensive, since it must scan all of memory, while the other phases are proportional to the live data. The Mark and Forward phases are typically similar in cost, since both essentially traverse the live objects and examine each field. The Compact phase is the fastest, since it does not look inside objects but simply copies them a word at a time. Although the Forward and Compact phases scan the heap linearly, their costs are proportional only to the live objects, since the preceding Sweep phase has coalesced adjacent dead objects into contiguous free chunks.
The Forward phase can be folded into the other phases. Whenever an object is encountered during the Sweep phase, the forwarded addresses of objects to its left have already been computed. Therefore, each pointer in the object is examined in turn: if it points to the left, it is replaced with the forwarded version stored with the destination object; if it points to the right, it is left unchanged. The Forward phase is then omitted entirely, and the Compact phase is extended so that before moving an object (by sliding it to the left), each of its pointers is examined in turn; if a pointer points to the right, it was not forwarded in the previous pass, and its forwarding pointer is still available. In practice, however, this optimization does not pay off: there are now two passes that examine the pointers in each object for forwarding, and to do so they must look at the target of each pointer, which results in a random access pattern. Two sequential passes over live memory thus cost about the same as one pass with random access, and since the folded scheme looks through the pointers in each object an extra time, an expensive operation, and there are fewer objects than pointers, the separate extra phase wins. We therefore do not consider this optimization further, although it may give important performance benefits in systems with different languages, memory technologies, etc.
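The Sweep-computed forwarding addresses and the left-to-right slide can be sketched as follows. This is a minimal simulation over a word-addressed array: liveness is given as input rather than computed by a Mark phase, pointer forwarding is abstracted into the returned map, and all names are illustrative:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.SortedMap;

// Toy sliding compaction over a word-addressed heap array. Illustrates the
// two key properties from the text: forwarding addresses of successive live
// objects increase monotonically, and left-to-right sliding never overwrites
// live data that has not yet moved.
public class SlidingCompactor {
    /** liveObjects maps each live object's offset to its size in words,
     *  in ascending offset order. Returns old offset -> new offset. */
    public static Map<Integer, Integer> computeForwarding(
            SortedMap<Integer, Integer> liveObjects) {
        Map<Integer, Integer> forward = new LinkedHashMap<>();
        int free = 0; // grows monotonically: the sliding property
        for (Map.Entry<Integer, Integer> e : liveObjects.entrySet()) {
            forward.put(e.getKey(), free);
            free += e.getValue();
        }
        return forward;
    }

    /** Slides live objects left according to the forwarding map. */
    public static void compact(int[] heap,
                               SortedMap<Integer, Integer> liveObjects,
                               Map<Integer, Integer> forward) {
        for (Map.Entry<Integer, Integer> e : liveObjects.entrySet()) {
            int from = e.getKey(), size = e.getValue(), to = forward.get(from);
            // Destinations are always at or left of sources, so copying in
            // ascending order is safe.
            System.arraycopy(heap, from, heap, to, size);
        }
    }
}
```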
2.2 Forwarding Pointer Elimination
The compaction algorithm requires an extra forwarding pointer at the beginning of every object since, unlike a copying collector, the forwarding pointer cannot overlay any data or header fields.
2.2.1 Encoded Class Indices
Every Java object contains a class pointer, which is used to find the table of virtual functions of the class, to perform class tests and cast operations, and to support various run-time system operations. The representation of a Java object with a forwarding pointer is shown in Figure 1(a). The separate forwarding pointer can be eliminated by encoding the class pointer during the compaction phase and using the space freed in the class-pointer word to store a compressed forwarding pointer. Instead of a class pointer, a 14-bit class index is stored, and the class pointer is obtained by looking up the index in a table. The class page table (CPT) contains pointers to the class pages; it requires only 64 entries, or 256 bytes. The loss of space due to internal fragmentation in the last class page is at most 1KB, and only 512 bytes on average. The 14-bit class index is subdivided into a 6-bit class page table index and an 8-bit class page offset. To reduce the overhead of CPT lookups, a single-element cache of the last lookup value is used. Each class object must also contain its 14-bit class index (stored in a half-word), so the total overhead is 1.5 words per class, plus 256 bytes, plus 0-1020 bytes lost to internal fragmentation. Class pointers are converted into class indices during the Forward phase and converted back into class pointers during the Compact phase.
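The 14-bit class-index scheme can be sketched as follows. The bit widths (6-bit CPT index, 8-bit class page offset) and the single-element cache follow the text; the 4-byte class-page entries, the table contents, and all names are illustrative assumptions:

```java
// Sketch of 14-bit class-index encoding and decoding: the top 6 bits index
// the class page table (CPT), and the low 8 bits are an entry offset within
// a class page. Entry width and names are illustrative.
public class ClassIndex {
    static final int PAGE_BITS = 8;        // 8-bit class page offset
    static final int CPT_SIZE = 64;        // 6-bit CPT index -> 64 entries
    final long[] cpt = new long[CPT_SIZE]; // pointers to class pages

    private int lastIndex = -1;            // single-element lookup cache
    private long lastPointer;

    static int encode(int cptIndex, int pageOffset) {
        return (cptIndex << PAGE_BITS) | pageOffset; // 14-bit class index
    }

    long decode(int classIndex) {
        if (classIndex == lastIndex) return lastPointer; // cache hit
        int cptIndex = classIndex >>> PAGE_BITS;
        int offset = classIndex & ((1 << PAGE_BITS) - 1);
        long pointer = cpt[cptIndex] + 4L * offset; // assumed 4-byte entries
        lastIndex = classIndex;
        lastPointer = pointer;
        return pointer;
    }
}
```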
2.2.2 Encoded Relocation Addresses
The reason that traditional compaction requires an extra word per object is that a relocation address is computed for each object. That relocation address must itself be encoded, since we do not have enough space for a full-width relocation pointer. We observe that sliding compaction has the property that the relocation addresses of successive objects in memory increase monotonically. Therefore, for any region of memory of size s, as long as objects do not increase in size during relocation, the relocation address can be represented as the relocation address of the first object in the region plus an offset in the range [0, s). There are two potential sources of object expansion. One is the potential change in object size due to optimizations in object representation; these optimizations, and the manner in which they avoid such expansion, are described in Section 4. The other is alignment requirements: an arbitrary number of objects may have been correctly aligned with no padding at their original addresses, while their target addresses are misaligned. This can cause a relocated region to actually grow in size.
However, there is always a schedule of relocations that eliminates such misalignment. In particular, it is sufficient to align the first object in the page to the same alignment it had at its original location. This is always possible, since if there is no space left to align it, we can place it in exactly the same relative position. Preserving the alignment of the first object guarantees that there will be no subsequent growth within the memory region due to alignment changes. Memory is therefore divided into 128KB pages, and a relocation base table (RBT) contains the relocation address of the first live object in each 128KB page. The RBT is allocated at startup time based on the maximum heap size. For example, on a system with 16MB of memory, the RBT contains 128 entries, which consume 512 bytes; this is the only space overhead for relocation. To determine the relocation address of an object, its (shifted) original address is used as an index into the RBT, from which the relocation base address is loaded. The relocation address is then the sum of the base and the offset.
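The RBT lookup can be sketched as follows. The 128KB region size and the 16MB example follow the text; the class and method names, and modeling addresses as longs, are illustrative assumptions:

```java
// Sketch of relocation-address decoding via the relocation base table (RBT):
// the object's original address, shifted by the 128KB region size, indexes
// the RBT, and the compressed per-object offset is added to the base.
public class RelocationBaseTable {
    static final int REGION_SHIFT = 17; // 128KB = 2^17 bytes
    final long[] rbt;                   // one relocation base per 128KB page

    RelocationBaseTable(long maxHeapBytes) {
        // e.g. a 16MB heap yields 128 entries (512 bytes of overhead)
        rbt = new long[(int) (maxHeapBytes >>> REGION_SHIFT)];
    }

    /** New address = relocation base of the object's region + encoded offset. */
    long relocationAddress(long originalAddress, long encodedOffset) {
        return rbt[(int) (originalAddress >>> REGION_SHIFT)] + encodedOffset;
    }
}
```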
2.3 Mark Coalesce Compact
A variant of the mark-compact collector described so far skips compaction entirely when it discovers enough contiguous free space. Compaction is performed only when a large allocation request cannot be satisfied with contiguous memory, or when excessive fragmentation is discovered. This technique has been used in a number of collectors. While it might seem to promise large speedups, since it eliminates half of the collection phases, the two phases it eliminates are the two fastest, so the performance impact is not as dramatic as might be expected. Nevertheless, it provides a potential improvement while introducing minimal additional complexity and code expansion, and may therefore be worthwhile.
2.3.1 Synchronization Issues
Both the mark-compact and mark-coalesce-compact collectors normally allocate from small thread-local chunks of memory; otherwise, synchronization overhead would dominate the cost of allocation, causing roughly a 15% reduction in application throughput. To avoid this, the hand-coded assembly-language allocation sequence attempts to allocate in the thread-local area (typically 1-4 kilobytes, depending on object-size demographics and the level of multiprogramming). If a large object is requested, or if the thread-local area is full, a call is made to the synchronized allocator. These synchronization issues have a significant impact on collector design. In particular, the mark-coalesce-compact collector cannot directly reuse all of the recovered space it finds, but only contiguous free chunks large enough to amortize the synchronization cost. This consideration is particularly important in very tight heaps, since the smaller the heap, the smaller the average contiguous free region.
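The thread-local fast path with a synchronized slow path can be sketched as follows. The chunk size, the large-object threshold, and the unbounded shared cursor are illustrative simplifications of the hand-coded assembly sequence described above:

```java
// Sketch of thread-local allocation: each thread bump-allocates from a
// private chunk and takes the shared lock only to refill its chunk or to
// allocate a large object. All constants and names are illustrative.
public class ThreadLocalAllocator {
    static final int CHUNK = 2048; // typical 1-4KB thread-local area
    static final int LARGE = 512;  // large objects bypass the local chunk

    private long globalCursor = 0; // shared heap cursor (simplified: unbounded)
    private final ThreadLocal<long[]> local = // {cursor, limit} per thread
        ThreadLocal.withInitial(() -> new long[] {0, 0});

    public long allocate(int size) {
        if (size >= LARGE) return allocateShared(size); // slow path
        long[] chunk = local.get();
        if (chunk[0] + size > chunk[1]) { // local area full: refill it
            chunk[0] = allocateShared(CHUNK);
            chunk[1] = chunk[0] + CHUNK;
        }
        long result = chunk[0];
        chunk[0] += size;
        return result;
    }

    /** Synchronized slow path; its cost is amortized over CHUNK/size objects. */
    private synchronized long allocateShared(int size) {
        long result = globalCursor;
        globalCursor += size;
        return result;
    }
}
```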
3. PAGED MARK-SWEEP DEFRAGMENT
The PMSD collector is a whole-heap, mark-sweep collector with optional defragmentation. The heap is divided into 1KB pages. Each page either holds meta-data that describes other pages or holds application data. In our configuration, 1.5% of the heap is dedicated to meta-data. Pages that hold application data are categorized as holding small data (objects less than 512 bytes) or large data. Each small-data page has an associated size class (chosen from one of 25 sizes ranging from 16 bytes to 512 bytes), and the page is subdivided into blocks of the associated size. A small object is allocated into the smallest free block that will accommodate it. Large objects consume multiple contiguous pages. The type and state of each page is stored in its corresponding address-indexed meta-data structure. At the end of each garbage collection, contiguous free pages are coalesced into contiguous block ranges. There are two block-range lists, one holding singleton blocks and one holding multi-block ranges. During allocation, page requests that result from (small) free-block exhaustion are preferentially satisfied from the singleton-block list. For multi-page requests and failed single-page requests, a first-fit search of the multi-block list is used. Whenever the free list of a size is exhausted, the dead (hence free) blocks of a small-object page of the same size are linked together. This batching allows most small-object allocation to be fast.
If all small-object pages of the requested size are in use, a completely fresh page is requested. To avoid expensive atomic operations on the free list, each thread has its own free lists, which are created dynamically in response to application demand. Each garbage collection begins with a mark phase, in which a traversal of all objects reachable from the roots sets the mark bits of live objects. The sweep phase then clears the mark bits of live objects and designates blocks containing unmarked objects as dead blocks. In this phase, the overall fragmentation of the system is computed. If the fragmentation exceeds 25%, or if the current allocation request is unsatisfiable due to fragmentation, defragmentation is triggered. There are five sources of fragmentation in this scheme. If a small object's size does not exactly match an existing size class, the next larger size class is chosen.
The resulting per-object wastage is called block-internal fragmentation. Since the 1KB page size may not be a precise multiple of a size class, the end of each small-object page may be wasted; this is called page-internal fragmentation. Perhaps the most important source of fragmentation is block-external fragmentation, which results from partially used pages of small objects. Consider a program that allocates enough objects of the same size to fill 10 pages of memory. If every other object dies and the program then ceases to allocate objects of that size class, half of the blocks in those pages will be wasted. Page-external fragmentation can result from the allocation of multi-page objects that leave multi-page holes: if a multi-page request is smaller than the sum of the holes but larger than any single hole, the request will fail even though there are sufficient free pages.
Finally, since using even a single block of a page dedicates the page to a particular size class, up to almost one page per size class can be wasted if that size class is only lightly used. In the worst case, this size-external fragmentation is the product of the page size and the number of size classes. Block-external fragmentation is reduced by moving small objects from mostly empty pages to mostly full pages of the same size class. Since there is no overlap of live and dead data, the forwarding pointer can be written in the class-pointer slot without any compression. In some cases, page-level defragmentation is necessary to combat page-external fragmentation. Currently, pages holding small objects can be relocated to empty pages by a block copy, but there is no multi-page defragmentation support. This temporary shortcoming puts PMSD at a disadvantage for applications that make heavy use of large arrays. Because our size classes are statically chosen, size-external fragmentation can be severe for very small heaps. One solution is to choose size classes dynamically: at runtime, neighboring size classes are coalesced if the smaller size class is not heavily utilized. In this way, the slight increase in block-internal fragmentation can be more than offset by the decrease in size-external fragmentation. Fewer size classes can also combat page-internal fragmentation.
On the other hand, the same adaptive technique can create more size classes to densely cover size ranges where objects are prolific. In this case, more size classes will decrease block-internal fragmentation.
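Size-class selection and the resulting wastage can be illustrated as follows. The text fixes only the 1KB page and the 16-512-byte range with 25 classes, so the particular (smaller) class table here is an assumption, as are all names:

```java
// Sketch of PMSD-style small-object size classes: a request is rounded up to
// the smallest class that fits, each 1KB page is subdivided into blocks of
// one class, and the page remainder is page-internal fragmentation. The
// class spacing below is an illustrative subset, not the paper's 25 classes.
public class SizeClasses {
    static final int PAGE = 1024;
    static final int[] CLASSES = {
        16, 24, 32, 48, 64, 96, 128, 192, 256, 384, 512
    };

    /** Smallest size class that fits the request, or -1 for large objects. */
    static int sizeClassFor(int size) {
        for (int c : CLASSES) {
            if (size <= c) return c;
        }
        return -1; // objects over 512 bytes use contiguous whole pages
    }

    /** Blocks per page; the leftover bytes are page-internal fragmentation. */
    static int blocksPerPage(int sizeClass) {
        return PAGE / sizeClass;
    }
}
```

Rounding a 20-byte request up to a 24-byte block wastes 4 bytes per object (block-internal), while a 48-byte class leaves 16 bytes unusable at the end of each page (page-internal).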
4. SINGLE-WORD OBJECT HEADER
Typical Java run-time environments use 3-word object headers: one word for the class pointer, one word containing a thin lock, and one word containing a hash code and garbage collector information. Furthermore, mark-compact collectors previously required an additional word to hold the forwarding pointer, which is only used during garbage collection. However, in an embedded environment, this profligate use of space is not acceptable. Bacon, Fink, and Grove  showed how the object header (without a forwarding pointer) can be compacted into a single word, at the cost of requiring a mask operation on the class pointer, or into two words at virtually no cost. The optimizations can be briefly summarized as follows: the thin lock is removed from the object header and instead is treated as an optional field that is implicitly declared by the first synchronized method or synchronized(this) block that appears in the class hierarchy.
Since most objects are never synchronized, and virtually all objects that are synchronized have synchronized methods, this gives virtually the same performance as a dedicated thin lock in all objects, yet requires space in only a very small number of objects. A special case is instances of Object, which are provided with a thin lock, since one of the few uses for such instances is to serve as a lock for synchronized blocks.
4.1 The Mash Table
In a collector that does not perform compaction, objects never move and the hash code can simply be implemented as a function of the object’s address. However, compaction is a requirement for embedded systems. Previous work  showed that the space for the hash code could be reduced to only two bits, by using the address of an unmoved object as its hash code. When an object whose hash code has been taken is moved, its original address is appended to the object and the hash function makes use of this value instead.
As a result, the extra state for each object is normally only 2 bits for the hash code and a few bits for the garbage collector state. If the collector state bits are sufficiently few, then class objects can be aligned on (for instance) 16-byte boundaries, providing 4 unused low bits in which to store object state. Then, to use a class pointer, the low bits must be masked out with an and immediate instruction. The result is shown in Figure 2(d).
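The packing of state bits into the low bits of the class-pointer word can be sketched as follows. The 16-byte class alignment and the masking AND follow the text; the names and the 4-bit mask constant are illustrative:

```java
// Sketch of the single-word header: class objects aligned on 16-byte
// boundaries leave 4 free low bits in the class-pointer word for object
// state (hash and GC bits in the scheme described in the text). Reading the
// class pointer masks those bits off with a single AND.
public class HeaderWord {
    static final long STATE_MASK = 0xF; // 4 low bits from 16-byte alignment

    /** classPointer must be 16-byte aligned so its low bits are zero. */
    static long pack(long classPointer, int stateBits) {
        return classPointer | (stateBits & STATE_MASK);
    }

    static long classPointer(long header) {
        return header & ~STATE_MASK; // the "and immediate" from the text
    }

    static int stateBits(long header) {
        return (int) (header & STATE_MASK);
    }
}
```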
However, this technique of hash code compression suffers from two significant disadvantages: (1) it consumes bits in the header word of each object, even though hash codes are rarely used; even worse, those bits are modified during execution and during garbage collection, which tends to complicate the implementation. (2) It causes objects to change size during their lifetime, which significantly complicates garbage collection. In the mark-compact collector, the forwarding pointer compression technique relies on the property that live objects in a range of memory will be compacted into an equally sized or smaller range of memory. If objects can increase in size when they are moved, this is no longer true.
Therefore, rather than storing the hash codes of moved objects at the end of the object, they are stored in a structure called the mash table. The mash table is a hash table of hash codes. It works as follows: when an object's Java hashCode() method is called, we compute a hash value based on its current address in storage; this is its hash index into the mash table. At garbage collection time, objects may move or die, so we must in essence garbage-collect the mash table itself: references to dead objects are removed, and references to moved objects have their key field updated to the new address and are relocated in the mash table based on the new mash code. This is done after marking and forwarding have been performed, but before the actual relocation of objects.
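The mash-table bookkeeping can be sketched as follows. This simplified version records a hash code for every hashed object (whereas the scheme above needs entries only for moved objects) and abstracts forwarding as a map from old to new addresses; all names are illustrative:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of a mash table: hash codes keyed by object address. At collection
// time, entries for dead objects are dropped and entries for moved survivors
// are rehashed under their new addresses, preserving the hash codes.
public class MashTable {
    private Map<Long, Integer> table = new HashMap<>();

    /** hashCode(): the first request fixes the code to the current address. */
    int hashCodeFor(long address) {
        return table.computeIfAbsent(address, a -> Long.hashCode(a));
    }

    /** Run after marking and forwarding, before objects actually move.
     *  forwarding maps old -> new address; absent keys denote dead objects. */
    void rehashAfterCollection(Map<Long, Long> forwarding) {
        Map<Long, Integer> next = new HashMap<>();
        for (Map.Entry<Long, Integer> e : table.entrySet()) {
            Long newAddress = forwarding.get(e.getKey());
            if (newAddress != null) {
                next.put(newAddress, e.getValue()); // hash code is preserved
            }
        }
        table = next;
    }
}
```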
The only complication with the mash table is that concurrent access by multiple threads must be prevented from corrupting it. On a uniprocessor, if the virtual machine switches between Java threads only at "safe points," this is achieved by placing no safe points in the mash-table code. In our current implementation, the mash table is a separate structure written in C++. However, to be robust in the face of pathological cases, it must be possible to resize the mash table and collect unused entries, and doing this in a separate region of memory is complex, error-prone, and inefficient. Therefore, in the next generation we plan to implement the mash table in Java as a collection of private helper methods of java.lang.Object, including a helper method that obtains the physical address of an object and a helper method called by the system at the end of garbage collection to rehash the moved objects.
4.2 Elimination of Header Masking
The single-word object models of Figures 2(d) and (e) contain only a class pointer and a few state bits. Thus, when the system makes use of the class pointer (for virtual method dispatch or dynamic type tests), it must first mask off the low bits of the header word, which are not part of the class pointer. This both slows down the code and increases code size, due to the extra instruction. However, after eliminating the hash code, the only remaining object state bits are the 1-3 garbage collector bits. These are zero during normal execution, so the masking operation can be eliminated.