Implement your own Scala collection
In Scala there are two families of collections: mutable and immutable ones. When you operate on a mutable collection, you change it in place, as a side effect. On the other hand, an immutable collection doesn’t change but rather returns a new collection with the update you performed. Right now, the Scala library provides an immutable TreeSet but not its mutable counterpart. Let’s see how to integrate our own.
Preliminary choices
- Since it’s a tree based set implementation, it would be a shame not to make it sorted.
- I’m really interested in exploring assets and liabilities of immutable data structures. So although the API is exposing a mutable collection, the underlying tree will be immutable. Mutability will be obtained by simply changing the referenced immutable tree.
- AVL tree gets my preference since it’s an efficient and elegant kind of balanced and sorted trees.
Implementation & integration
The first step in order to integrate the collection is to extend the right base trait(s). This choice is important, since it determines the interface exposed by the collections as well as the methods to implement.
In order to allow reusability of methods shared by distinct collections, the Scala collection convention is factor them in traits suffixed by ‘Like’. For instance, TraversableLike provides implementation of many methods based on the ‘foreach’ base method. This method must be implemented by the concrete class mixing-in this trait.
Pluging into the Scala collection hierarchy
As said above we are implementing a Set that keeps its elements sorted, so not only we have to mix-in from Set and SetLike but also from SortedSet and SortedSetLike.
Here is the definition of TreeSet from a type hierarchy point of view :
class TreeSet[A](implicit val ordering: Ordering[A]) extends SortedSet[A] with SetLike[A, TreeSet[A]] with SortedSetLike[A, TreeSet[A]] with Set[A] { ... }
Since TreeSet sorts its elements, you need to provide it with a means of defining the order relation beetween all its elements (of type A). This is why TreeSet’s primary constructor requires an Ordering for the elements of type A.
Extending this hierarchy only requires implementation of a few methods :
empty: TreeSet[A] rangeImpl(from: Option[A], until: Option[A]): TreeSet[A] -=(elem: A): TreeSet[A] +=(elem: A): TreeSet[A] contains(elem: A): Boolean iterator: Iterator[A]
AVL Tree
An AVL tree is both a binary and a balanced tree. An AVL is said balanced if for any of its nodes, the depth difference beetween its left and right sub-trees is stricty beetween -2 and 2. It allows insertion, removal and presence test in logarithmic time. Please check the dedicated Wikipedia page for more information. This data structure is implemented in the file AVLTree.scala. It defines basic operations, such as insertion, removal and iteration. In the implementation, a special care was taken when writing the recursive methods in order to ensure that the compiler applies tail call optimisation. This optimization allows recursion without exceeding the stack limit.
Iterator
As expected, iteration on the tree is obtained by an inorder traversal. This means we traverse the left-subtree first since it contains the smallest elements, then it’s the turn of the node’s value itself and finally we process the biggest elements by travesing the right-subtree.
Bounded view
/** Creates a ranged projection of this collection. Any mutations in the * ranged projection will update this collection and vice versa. Note: keys * are not garuanteed to be consistent between this collection and the projection. * This is the case for buffers where indexing is relative to the projection. * * @param from The lower-bound (inclusive) of the ranged projection. * <code>None</code> if there is no lower bound. * @param until The upper-bound (exclusive) of the ranged projection. * <code>None</code> if there is no upper bound. */ def rangeImpl(from: Option[K], until: Option[K]): This
rangeImpl is a mandatory method in order to conforms to the SortedSetLike trait. The simplest way I found to do this is to modify ‘iterator’ behavior of the projection. Indeed since in an AVL all elements in the left sub-tree are smaller than the node’s element, witch is itself smaller than all the elements in its right sub-tree, we can simply stop the left or right iteration once we detect we are off bounds. Here is the code :
// AVLTree.scala /** * Returns a bounded stream of elements in the tree */ def toStream[A](tree: AVLTree[A], isLeftAcceptable: A => Boolean, isRightAcceptable: A => Boolean): Stream[A] = tree match { case Leaf => Stream.empty case Node(a, left, right) => if (isLeftAcceptable(a)) { if (isRightAcceptable(a)) toStream(left, isLeftAcceptable, isRightAcceptable) ++ Stream(a) ++ toStream(right, isLeftAcceptable, isRightAcceptable) else toStream(left, isLeftAcceptable, isRightAcceptable) } else if (isRightAcceptable(a)) { toStream(right, isLeftAcceptable, isRightAcceptable) } else { Stream.empty } }
// AVLTree.scala def iterator[A](tree: AVLTree[A], isLeftAcceptable: A => Boolean, isRightAcceptable: A => Boolean): Iterator[A] = toStream(tree, isLeftAcceptable, isRightAcceptable).iterator
// TreeSet.scala // Projection constructor private def this(base: Option[TreeSet[A]], from: Option[A], until: Option[A])(implicit ordering: Ordering[A]) { this() this.base = base this.from = from this.until = until }
// TreeSet.scala override def rangeImpl(from: Option[A], until: Option[A]): TreeSet[A] = new TreeSet(Some(this), from, until)
rangeImpl returns a projection of the initial tree, as a proxy, every operation is in fact applied to the original TreeSet. Doing so we ensure that ‘Any mutations in the ranged projection will update this collection and vice versa’. The only specific data is the tuple of bounds ‘from’ and ‘until’. This data is used to limit iteration on projection. So the desired effect is obtained.
Size method optimisation
In order to improve performances, some base methods have to be overriden. This is the case for ‘size’ for example, where the base implementation consumes an iterator to compute its size. This makes this method inefficient since its complexity in time is O(n). A basic solution is to store size and keep it updated upon insertions and removals.
However, this optimisation doesn’t work well for projections. Indeed a projection doesn’t have exclusive access to the avl tree reference. This means that avl tree can change without the projection even ‘knowing’ it. Moreover, the size computation has to take bounds into account. For this reasons, it can become tricky for a projection to maintain a valid cached size. I decided to give up concerning size optimisation for projections.
Clone method optimisation
clone is also a good candidate for optimisation. The base implementation creates an empty set and iterates over the inital one to add its elements to the clone. So here again there is an O(n) complexity in time. To resolve this, we take advantage of the *immutable* nature of the underlying AVL tree. Thanks to the immutability of the avl, we can simply share it among the original set and his clone. Good point for immutable data structures! Complexity in times becomes constant.
Quick performance benchmarks
These curves give an indication of the observed time consumption in order to fill up and empty collections.
Conclusion
While mutable.TreeSet exposes a large number of methods, only about ten of them had to be implemented. And not all of them were mandatory. This has been achieved by the factorization of every recurrent operation into base trait(s) in the Scala library. This property makes implementing your own scala collection relatively fast and easy. In that sense, this experience helped me better understand Scala’s name origin, i.e. ScAlable LAnguage .
This implementation is an answer to this ticket t4147. The full source code can be accessed here.