next up previous
Next: 10 Discussion Up: An ASCII Database Previous: 8 Potential extensions

9 Comparisons to commonly used approaches

Our design of qddb was influenced strongly by the refer suite of programs [Les78], which store and retrieve bibliographic data. Like refer, qddb uses Ascii for the principal data file, allows arbitrary strings as attribute values, allows attribute values to be missing, builds index files for quick searches, and searches for keys independently of which attribute they are in. Unlike refer, qddb\ permits a relation to contain both a stable and an instable component, permits subattributes, and distinguishes Ascii presentation from Ascii storage format.

We believe the flexibility of our approach is its major advantage over many other approaches. Most techniques for retrieval are optimized for access using keys in particular attributes. We do not think that restricting to given attributes is always reasonable; some applications may require access to entries containing a specific key in an arbitrary attribute. Many other methods require that a key be unique in the relation. We feel that this assumption is too restrictive. We have shown how to enhance our approach to allow attribute-specific retrievals, so our approach may be used whenever the other approaches are appropriate. Consistency checkers can police attribute uniqueness when that restriction is useful.

Our implementation imposes a particular structure on the relation file. Other methods, such as B-trees [Bay72], impose a rigid structure requiring a number of read system calls that increases logarithmically (with a possibly large base) with the number of keys in the relation. The number of reads required by our method does not increase with the size of the relation, although the underlying file organization of Unix will impose an increased number of low-level disk reads for random access within very large files. A truly native implementation of qddb would place raw disk addresses in locators and avoid this extra level of overhead [Sto81].

9.1 Indexed files

Methods that maintain indexed files, such as the indexed-sequential method, keep an index that consists of (key, index number) pairs. Generally, index files are kept in a tree form so that a particular key may be found in logarithmic time. The entire index is sometimes kept in memory to make the search fast; but this step is not possible on machines with limited memory or with very large relations. An index must be maintained for each attribute on which searches are allowed.

Qddb only requires that the hash table and a single hash bucket reside in memory at any given time. Using RandomHashTable, we reduce the memory requirement to one hash-table entry and one hash bucket. The hash buckets are read at the time of the search and discarded afterwards to free memory. Qddb requires enough memory to hold the largest hash bucket, so a good hash function will keep the memory requirements small. The support of TupleList operations requires enough memory to hold a TupleList for each key used in the operations; a query using a dozen keys would require memory for a dozen TupleLists. A TupleList requires memory proportional to the number of entries in the solution set of the single key query.

9.2 Hashed files

A classical hashed-file scheme builds a separate hash file for each attribute on which searches might be made[Wie87]. Within such a file, each hash bucket contains the contents of (not locators to) all tuples that have keys in a given attribute that hash to that bucket. Hashed files provide reasonable search times when:

  1. The hash function is good (the number of collisions is very small).
  2. Only searches on keys within a single attribute are required.
  3. The size of the individual entries is reasonably small.
The disadvantages of hashed database files include an extra read of an entry for each collision and the inflexibility of hashing keys in a single attribute.

Qddb uses the method of hashed files in its indexes to produce the speed we desire. Since we don't update the hash table until the next stabilization, we need no extra space for future additions to buckets, and the hash buckets are contiguous and of known length. We can always read an entire bucket with one system call.

9.3 Tree-structured files

Tree-structured files provide a tree structure (usually binary) for a particular attribute in each entry. Multi-attribute searches require a separate tree for each attribute on which queries are allowed. The number of reads necessary for a single attribute query is logarithmic. Multi-attribute searches require a search for each attribute, so the number of reads is:

displaymath843

This estimate is based on a balanced binary tree.

Our approach also requires a search for each attribute in a multi-attribute search, but the number of reads required for each search is a constant (two) plus the number of entries that satisfy the query. The number of trees required to allow a search on any attribute in a tree-structured database is the number of attributes, so each attribute in each entry must have two pointers associated with it. Any search that is not specific to a particular attribute would require searching for the key in every tree. Our approach requires a similar amount of storage, but provides a significant performance improvement for large relations and searches that are not specific to a particular attribute.

9.4 B-trees

B-tree implementations generally impose a fixed-length key so that blocks of keys are of a fixed size (see Figure 7). Our approach is somewhat similar to the following variant of a two-level B-tree: We could fix the size of the each bucket so that the size of each read of Index is of a given size ( tex2html_wrap_inline845 some percentage). The entire Index file is sorted by key. Instead of a hash table, we could access the Index file via the top-level node, which would be a list of key ranges and associated locators to buckets. A binary search on the top-level node produces a locator to the bucket containing the desired key, where another binary search finds the appropriate locators to Stable.

   figure337
Figure 7: A two-level b-tree

The number of read system calls is identical to the number required by our approach, but it allows range searching on Ascii keys. Another advantage to B-trees is that the size of the largest bucket is somewhat constrained. We do not have control over the size of the TupleList associated with any given key, but our stabilization process could easily find a good distribution. The b-tree approach costs somewhat more than our hash-table approach in storage (keys are arbitrary lengths) and in memory lookup (a logarithmic number of string comparisons must be made).


next up previous
Next: 10 Discussion Up: An ASCII Database Previous: 8 Potential extensions

Herrin Software Development, Inc.