Tuesday, September 15, 2009

Things learned from P2P'09

Besides many excellent technical papers presented at the P2P'09 conference in Seattle, I am particularly impressed with the 3 keynote speeches. Videos of these talks are now available on the Website (I was sitting at the back, therefore can't be found). Just like to blog a few lines here to remind me to not forget what were communicated:

1. Ian Clark, creator of the Freenet Project, kicks off with his recent work on SWARM, a mid-layer infrastructure to write scalable distributed application. The one point I took away from his presentation, besides the colorful and animated slides, is the argument that if Facebook, Twitter and other large Web applications used better tools during developments, they can save development time and rest assured that their applications will scale. Such infrastructure is particular useful if adopted by Cloud providers. As details are abstracted away from developers, they can now concentrate in making applications with more practical values.

Also, this is the talk in which I first heard of MapReduce. Few hours of reading and following links in Wikipedia reveal that it is a very popular programming paradigm used by Google for distributing its computation loads. Even though Database experts aren't very fond of it, MapReduce looks easy to use and achieve great parallelism by Mapping the problems, distributing the sub-problems to servers, results of which are eventually aggreated by calls to Reduce functions. A distributed programming course using MapReduce is also availbe online.


2. Roger Barga from Microsoft makes his contribution to the Cloud computing theme in an engaging talk. He starts describing how High Performance Computing (HPC) and Cloud Computing are essentially twins separated at birth. Cloud Computing, on the other hands, enjoys explosive growth due to interest from the industry. He paints a quite ironic picture that a medium-size data center offers more computation power than the world's three largest academic clusters put together. One of Microsoft's focus at the moment (beside churning out worse and worse Windows - this is not his words) is on developing a new programming paradigm and laguge to take full advantage of the Cloud. The argument for this endeavour is that the current implementation is Virtual-Machine based where customers have full controls over VMs. This does not scale well. A abstraction with which developers only see data and need not concern with how many and which VMs they have is more scalable and preferable. He ends the talk with the observation that Cloud computing is transforming science. More specifically, science history started with Experimental Science (observe, then hypothesize), moved to Theoretical Sience (the like of quantum physics, Einstein et al), then Computational Science (weather modeling, etc). Now, science is in the Data-Intensive age, with more and more data coming from the ProteinFolding, LHC, etc. And the enabler is the Cloud.

3. Ken Birman from Cornell is the last keynote speaker. He argues for the current change in practice adopted by the industry: embrace consistency. This means: forget about consistency algorithms, learns how to deal with it. Ken continues by introducing his Gossip protocol that could provide an answer to the consistency problem in the cloud. It is important to preserve consistency, because things like security can not be achieved without consistency. Maybe this is our last hope for a stable and secure Cloud ?

Wednesday, June 17, 2009

A note on CAN implementation

Content Addressable Network is an unique structured P2P system. Its topology is a d-dimensional torus. This overlay offer constant states (number of neighbours per nodes) and short hop counts (in terms of routing path length). Some works cited CAN as using hyper-cube space as its topology, which is a mistake. It is clearly stated in the original CAN paper that the space wrapped around, meaning the shape of the space is a torus.

Having a number of interesting properties, CAN is remarkably unpopular compared to Chord and Pastry, with respect to implementations. The keys challenges that a CAN implementation is to address are as below (solutions to them involve thinking in high dimension space):

1. What is the nature of a peer's ID? How is its generated and changed during the evolution of the network?

2. How to determine if two peers are neighbors, in d-dimensional tori.

3. How to determine if one peer can "takeover" another? In other words, can two neighbors be merged?

4. How to compute the distance from a search key to a peer ?

Googling an existing implementation of CAN will most likely yield nothing. Thanks to Ryan Huebsch (author of the paper titled: Content-based multicast: comparison of implementation options), a Java implementation of the system comes out of its hidding corner. As part of the PIER project, Ryan's group has implemented the CAN overlay, whose source can be obtained in the project's website.

Reading through the code, the challenges mentioned above can be addressed as follows:

1. ID of a peer IS the coordinate of its zone. In particular, it will be the vector ((l_1,h_1), (l_2,h_2), ..,(l_d,h_d)) where l_i, h_i is the low and high value of the edge in the i^th dimension. As new peer joins, the zone is split and as the consequence the peer's ID changes. Notice that in the PhD thesis of the CAN's author, an alternative view of the ID as path from root to leaf of a partition tree is not easy to implement.

2. In a d-dimensional torus, two peers (or zones) are neighbors if their edges overlap in exactly (d-1) dimensions and abut in exactly 1 dimension. The overlapping and abut properties can be easily checked using the above representation of a zone (the edge in dimension i is the pair (l_i,h_i)). One must also take the space-wrapping into consideration. Basically, one can see the dimension i as a ring, instead of a straight line.

3. To see if zone P and Q can be merged together, the original paper of CAN provide an inelegant solution based on the partition tree. As a result of that, a peer must also store an history of other nodes it took-over, as opposed to simply storing a set of coordinates as in (1). In this implementation, P and Q are mergable if their areas added up to the new zone's area, whose coordinate set is constructed from that of P and Q. In particular: (l_i,h_i) = (min(l_i',l_i''), max(h_i',h_i'')). Simple!

4. The key is represent by its coordinate (k_1,k_2,..,k_d). Its distance from/towards a zone P is computed by first identifying the closest point in the zone to K. Then the distance is sqrt((k_1-p_1)^2 + (k_2-p_2)^2 + ... + (k_d-p_d)^2)

Tuesday, June 16, 2009

The cost of privacy

A Slashdot article shows links to an interesting article about the Hidden Cost of Privacy, which also caught Bruce Schieier's attention. As Bruce himself said, this article presents valid points, they are summarized as below:

1. Too much paperworks in the "process" of protecting one's privacy could be so overwhelming that it has negative effects on increasing one's awareness of his own privacy.

2. Too many hoops, too many papers to go through hinder scientific progress. This is especially true for medical research. In other words, regulations are too tight, while ones whose privacy are being protected may not even know about them.

This is quite an insightful piece of thought, especially for an online article from a full-of-ads website. This article really reminds me of a recent book I just finished reading: Free Culture.
There seems to be similarities between privacy regulations and copy-right regulations. On one hand, these laws meet what they set out to do: to protect one's right. Copyright laws prevents others from taking advantage of one's creativity. It protects the creative work just like other properties. Privacy laws protect one's control of the dissemination of his information (in any form). As far as the normal citizens/users are concerned, these law are effective. We're at fault at not spotting companies or individuals that violates these rights, we can not blame the laws for not protecting us.

Now, there's a grey area that is discussed extensively in the book Free Culture, which could also be extrapolated to privacy law. The author argues that while the current copyright regulations protect copyright owners, the public/soceity and our own culture suffer. Among other things, it is because of the long duration during which copyrighted work are private properties. It is also because of the system in use for tracking copyright owner and obtaining permission. All these process hinder creativity an innovation, as individuals would be refrained from create and innovate using others' works (not in the public domain). The same could be said for privacy law.

There's another point of resemblance between copyright and privacy. There do exists people who don't want to copyright their work; as well as there are people who are not concerned about how their information are being disseminated (abeilt trivial information). The laws must make room for these cases, by changing the current regulations to be more flexible.

The author of Free culture argues that regulations must always seek for a balance among many different factors, and they should also be adjusted as new (technologicial) environment emerges. However, not all of these cost and benefit could be easily quantified. In addition, it is difficult to predict how the current trends (of technology) develop in the future. Thus, making regulations regarding both copyright and privacy are difficult.

In conclusion, while I'm in favour of the current law of privacy to be made more flexible, I don't see it happening in the near future. There must be more cases against the tightness of the current regulations, which must also demonstrate losses in money or other important values of soceity.

Wednesday, June 03, 2009

Binary search tree feature with TreeSet

The following information is for the TreeSet class in Java.

As the name suggests, this class implements a tree datastructure, achieving the complexity of O(logN) with respects to add(), remove() and contains() methods. Take the last method, contains(k), where k is the search key as an example. The implementation of TreeSet will make sure this method return True or False within O(logN).

Now if k is not in the set, one may wish to find the elements immediately on the left and right of k in the set. In other words, the result would be the same as executing binarySearch() method on the ordered list of all the elements in the set. The most efficient way I found in doing that is as follows:


left = set.headSet(k).last();
right = set.tailSet(k).first();


These will have run-time complexity of O(logN) without introducing any other datastructure (or extra memory). I've experimented with other methods such as using Iterator on the set, or ArrayList, but they all appear inferior. The reason for this efficiency seems to be the fact that eventhough headSet(k) and tailSet(k) return subsets, elements of which are only computed while using Iterators to access them. Thus, calling last() and first() on these subsets are O(1). The O(logN) comes from the search for key k, a side-effect of which are pointers to these subsets.

Tuesday, April 28, 2009

A notes on Ubuntu's fonts

This note is for ones like me who moved from Redhat family to the Ubuntu community. It can be easily noted that the fonts used in Firefox and Thunderbird are very different.

For Thunderbird, it can be changed in the Preferences/Fonts to match the default settings used in Fedora, for instance.

For Firefox, the fix is not that simple. Thanks to helps from a friend, the gist of the problem lies in the fact that many websites use the Microsoft fonts. It therefore requires to download and install such package in Ubuntu. As the matter of fact, these fonts are in the ubuntu-restricted-extras package. In particular:

sudo apt-get install ubuntu-restricted-extras

The end.

Tuesday, March 31, 2009

Exporting diagrams from Latex

It's more common to import pictures/diagrams into Latex.

The Message Sequence Chart (msc) package in Latex is brilliant at drawing diagrams like protocols, UML message sequences. Unlike the algorithm package (alg2.sty), the MSC environment doesn't let you export its content to external files.

The reason you may want to export these graphs is to include them into your Beamer slides, the environment that scales up everything from the text to your drawing.

And as many, I prefer the diagrams in EPS format. So:

1. Draw your MSC in one page (using
or other
environment). Make sure the diagram is the only thing on the page. Suppose you named it "file.pdf"

2. pdfcrop file.pdf image.pdf

3. pdftops -eps image.pdf image.eps

All commands used in (2) and (3) come with standard Latex package. Executing these will give you the high quality image.eps file.

Tuesday, December 02, 2008

Bittorrent hits the news again

Starting with an article from the Register, condemning Bittorrent of starting the war on other network-friendly applications such as VoIP and killing the Internet.

The P2P community responded keenly to the news. Some ponder the ease of NAT travelling this "improvement" would bring. Others turn absolutely furious against the article making mountain out of molehill. Later on (not much later to be fair), a "statement" appears on TorrentFreak, where the uTorrent manager himself denied the facts on the article. Most importantly, he claims that the author of the original article (on Register) had not contacted him at all, at least for the sake of fact-checking. Talk about journalis source reliablity!

uTorrent doesn't kill the Internet, people kill the Internet. That's fact. Another fact is that I personally don't care much about anything written on the Register.