Wednesday, June 17, 2009

A note on CAN implementation

Content Addressable Network is an unique structured P2P system. Its topology is a d-dimensional torus. This overlay offer constant states (number of neighbours per nodes) and short hop counts (in terms of routing path length). Some works cited CAN as using hyper-cube space as its topology, which is a mistake. It is clearly stated in the original CAN paper that the space wrapped around, meaning the shape of the space is a torus.

Having a number of interesting properties, CAN is remarkably unpopular compared to Chord and Pastry, with respect to implementations. The keys challenges that a CAN implementation is to address are as below (solutions to them involve thinking in high dimension space):

1. What is the nature of a peer's ID? How is its generated and changed during the evolution of the network?

2. How to determine if two peers are neighbors, in d-dimensional tori.

3. How to determine if one peer can "takeover" another? In other words, can two neighbors be merged?

4. How to compute the distance from a search key to a peer ?

Googling an existing implementation of CAN will most likely yield nothing. Thanks to Ryan Huebsch (author of the paper titled: Content-based multicast: comparison of implementation options), a Java implementation of the system comes out of its hidding corner. As part of the PIER project, Ryan's group has implemented the CAN overlay, whose source can be obtained in the project's website.

Reading through the code, the challenges mentioned above can be addressed as follows:

1. ID of a peer IS the coordinate of its zone. In particular, it will be the vector ((l_1,h_1), (l_2,h_2), ..,(l_d,h_d)) where l_i, h_i is the low and high value of the edge in the i^th dimension. As new peer joins, the zone is split and as the consequence the peer's ID changes. Notice that in the PhD thesis of the CAN's author, an alternative view of the ID as path from root to leaf of a partition tree is not easy to implement.

2. In a d-dimensional torus, two peers (or zones) are neighbors if their edges overlap in exactly (d-1) dimensions and abut in exactly 1 dimension. The overlapping and abut properties can be easily checked using the above representation of a zone (the edge in dimension i is the pair (l_i,h_i)). One must also take the space-wrapping into consideration. Basically, one can see the dimension i as a ring, instead of a straight line.

3. To see if zone P and Q can be merged together, the original paper of CAN provide an inelegant solution based on the partition tree. As a result of that, a peer must also store an history of other nodes it took-over, as opposed to simply storing a set of coordinates as in (1). In this implementation, P and Q are mergable if their areas added up to the new zone's area, whose coordinate set is constructed from that of P and Q. In particular: (l_i,h_i) = (min(l_i',l_i''), max(h_i',h_i'')). Simple!

4. The key is represent by its coordinate (k_1,k_2,..,k_d). Its distance from/towards a zone P is computed by first identifying the closest point in the zone to K. Then the distance is sqrt((k_1-p_1)^2 + (k_2-p_2)^2 + ... + (k_d-p_d)^2)

Tuesday, June 16, 2009

The cost of privacy

A Slashdot article shows links to an interesting article about the Hidden Cost of Privacy, which also caught Bruce Schieier's attention. As Bruce himself said, this article presents valid points, they are summarized as below:

1. Too much paperworks in the "process" of protecting one's privacy could be so overwhelming that it has negative effects on increasing one's awareness of his own privacy.

2. Too many hoops, too many papers to go through hinder scientific progress. This is especially true for medical research. In other words, regulations are too tight, while ones whose privacy are being protected may not even know about them.

This is quite an insightful piece of thought, especially for an online article from a full-of-ads website. This article really reminds me of a recent book I just finished reading: Free Culture.
There seems to be similarities between privacy regulations and copy-right regulations. On one hand, these laws meet what they set out to do: to protect one's right. Copyright laws prevents others from taking advantage of one's creativity. It protects the creative work just like other properties. Privacy laws protect one's control of the dissemination of his information (in any form). As far as the normal citizens/users are concerned, these law are effective. We're at fault at not spotting companies or individuals that violates these rights, we can not blame the laws for not protecting us.

Now, there's a grey area that is discussed extensively in the book Free Culture, which could also be extrapolated to privacy law. The author argues that while the current copyright regulations protect copyright owners, the public/soceity and our own culture suffer. Among other things, it is because of the long duration during which copyrighted work are private properties. It is also because of the system in use for tracking copyright owner and obtaining permission. All these process hinder creativity an innovation, as individuals would be refrained from create and innovate using others' works (not in the public domain). The same could be said for privacy law.

There's another point of resemblance between copyright and privacy. There do exists people who don't want to copyright their work; as well as there are people who are not concerned about how their information are being disseminated (abeilt trivial information). The laws must make room for these cases, by changing the current regulations to be more flexible.

The author of Free culture argues that regulations must always seek for a balance among many different factors, and they should also be adjusted as new (technologicial) environment emerges. However, not all of these cost and benefit could be easily quantified. In addition, it is difficult to predict how the current trends (of technology) develop in the future. Thus, making regulations regarding both copyright and privacy are difficult.

In conclusion, while I'm in favour of the current law of privacy to be made more flexible, I don't see it happening in the near future. There must be more cases against the tightness of the current regulations, which must also demonstrate losses in money or other important values of soceity.

Wednesday, June 03, 2009

Binary search tree feature with TreeSet

The following information is for the TreeSet class in Java.

As the name suggests, this class implements a tree datastructure, achieving the complexity of O(logN) with respects to add(), remove() and contains() methods. Take the last method, contains(k), where k is the search key as an example. The implementation of TreeSet will make sure this method return True or False within O(logN).

Now if k is not in the set, one may wish to find the elements immediately on the left and right of k in the set. In other words, the result would be the same as executing binarySearch() method on the ordered list of all the elements in the set. The most efficient way I found in doing that is as follows:


left = set.headSet(k).last();
right = set.tailSet(k).first();


These will have run-time complexity of O(logN) without introducing any other datastructure (or extra memory). I've experimented with other methods such as using Iterator on the set, or ArrayList, but they all appear inferior. The reason for this efficiency seems to be the fact that eventhough headSet(k) and tailSet(k) return subsets, elements of which are only computed while using Iterators to access them. Thus, calling last() and first() on these subsets are O(1). The O(logN) comes from the search for key k, a side-effect of which are pointers to these subsets.