== Architecture ==
ZooKeeper can run in two modes: standalone and replicated.
In standalone mode, it runs on a single machine. For practical purposes we do not use standalone mode; it is only for testing, since a single machine offers no high availability.
In production environments, and in practically all real use cases, the replicated mode is used. In replicated mode, ZooKeeper runs on a cluster of machines called an ensemble: a ZooKeeper server is installed on every machine in the cluster, and each server knows about all of the other members of the ensemble.
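As a concrete illustration, a replicated ensemble is typically configured through each server's `zoo.cfg`, which lists every member. The hostnames below are hypothetical; the `server.N=host:port1:port2` lines follow ZooKeeper's convention, where the first port is used for follower-to-leader connections and the second for leader election:

```
tickTime=2000
dataDir=/var/lib/zookeeper
clientPort=2181
initLimit=10
syncLimit=5
server.1=zk1.example.com:2888:3888
server.2=zk2.example.com:2888:3888
server.3=zk3.example.com:2888:3888
```

Each machine also holds a `myid` file in `dataDir` containing its own server number, so it can find itself in this list.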
As soon as the ZooKeeper servers on the machines in the ensemble are started, phase 1, the leader election phase, begins. This election is part of ZAB (ZooKeeper Atomic Broadcast), ZooKeeper's own consensus protocol, which is similar in spirit to Paxos but is a distinct algorithm.
The machines in the ensemble vote for one another based on ping responsiveness and freshness of data. In this way a distinguished member called the leader is elected; the rest of the servers are termed followers. Once all of the followers have synchronized their state with the newly elected leader, the election phase finishes.
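The voting rule can be sketched as follows. This is a simplified, hypothetical model, not the real ZooKeeper API: each server votes for the candidate with the freshest data (highest last-seen transaction id, or zxid), and ties are broken in favor of the larger server id, which is how ZooKeeper's fast leader election resolves equally up-to-date candidates.

```python
# Hypothetical sketch of ZooKeeper-style leader election.
# servers: dict mapping server_id -> last_zxid (a measure of data freshness).
def elect_leader(servers):
    # Freshest data (highest zxid) wins; equal zxids fall back to
    # the larger server id as a tie-breaker.
    return max(servers, key=lambda sid: (servers[sid], sid))

# Servers 2 and 3 are equally up to date; 3 wins on server id.
ensemble = {1: 0x1A, 2: 0x1C, 3: 0x1C}
print(elect_leader(ensemble))  # 3
```

In the real system each server exchanges vote notifications over the election port until a majority agrees on one candidate; the sketch only captures the comparison rule.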
The election succeeds only if a majority of the machines is available to vote. A majority means more than 50% of the machines: out of 20 machines, a majority is 11 or more.
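The majority rule reduces to simple integer arithmetic, sketched here (the function name is illustrative):

```python
# Quorum size for an ensemble: the smallest strict majority.
def quorum_size(ensemble_size: int) -> int:
    """Smallest number of servers that is more than half the ensemble."""
    return ensemble_size // 2 + 1

print(quorum_size(20))  # 11 of 20 machines form a majority
print(quorum_size(5))   # 3 of 5
```

This is also why ensembles are usually deployed with an odd number of machines: a 5-server and a 6-server ensemble both tolerate only 2 failures, since both need at least one more than half alive.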
If at any point the leader fails, the rest of the ensemble holds a new election, typically within about 200 milliseconds.
If a majority of the machines is not available at any point in time, the leader automatically steps down.
The second phase is called atomic broadcast. Any client request that writes, modifies, or deletes data is forwarded to the leader by the followers, so there is always a single machine on which modifications are accepted. Read requests, such as ls or get, are served by any of the machines.
Once the leader has accepted a change from a client, it broadcasts the update to the followers (the other machines). This broadcast and synchronization can take time, so for a short while some followers may serve slightly stale data. That is why ZooKeeper provides eventual consistency, not strict consistency.
When a majority has persisted the change to disk, the leader commits the update and the client is sent a confirmation.
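The commit decision itself is just the majority rule applied to acknowledgements. The sketch below is a simplified model of that rule, not ZAB itself; the function name is illustrative:

```python
# Simplified model of the commit rule: the leader counts servers
# (itself included) that have persisted the proposal to disk, and
# commits only once a strict majority has acknowledged.
def can_commit(ensemble_size: int, acks: int) -> bool:
    """acks: servers, including the leader, that persisted the change."""
    return acks > ensemble_size // 2

print(can_commit(5, 3))  # True: 3 of 5 is a majority
print(can_commit(5, 2))  # False: the leader must keep waiting
```

Note that the leader does not wait for all followers, only for a quorum; the stragglers catch up afterwards, which is the source of the temporary staleness described above.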
The protocol for achieving consensus is atomic, similar to two-phase commit. Also, to ensure the durability of a change, each machine writes it to disk before applying it in memory.
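The disk-before-memory rule is the classic write-ahead pattern. A minimal self-contained sketch of the idea, not ZooKeeper's actual transaction-log format, might look like this:

```python
# Minimal write-ahead sketch analogous to ZooKeeper's durability rule:
# append the update to an on-disk log first, then apply it in memory,
# so a crash after the log write can be recovered by replay.
import json
import os
import tempfile

class TinyStore:
    def __init__(self, log_path):
        self.log_path = log_path
        self.state = {}  # in-memory tree, a stand-in for znodes

    def set(self, key, value):
        # 1. Persist the change to the log first ...
        with open(self.log_path, "a") as log:
            log.write(json.dumps({"key": key, "value": value}) + "\n")
            log.flush()
            os.fsync(log.fileno())
        # 2. ... only then apply it to memory.
        self.state[key] = value

    def replay(self):
        # After a crash, rebuild the in-memory state from the durable log.
        self.state = {}
        with open(self.log_path) as log:
            for line in log:
                record = json.loads(line)
                self.state[record["key"]] = record["value"]

log_file = os.path.join(tempfile.mkdtemp(), "txn.log")
store = TinyStore(log_file)
store.set("/config/mode", "replicated")

# Simulate a restart: a fresh process recovers the state from disk.
recovered = TinyStore(log_file)
recovered.replay()
print(recovered.state)  # {'/config/mode': 'replicated'}
```

Because the log write is fsynced before the in-memory update, an acknowledged change survives a crash; this is what makes the quorum acknowledgements in the previous phase meaningful.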