Thursday, 1 December 2011

Liskov Substitution Principle




SOLID - Liskov Substitution Principle

The Liskov Substitution Principle (LSP) states that:
Methods that use references to base classes must be able to use objects of derived classes without knowing it.

Barbara Liskov introduced this principle in 1988.
The idea is that an instance of a subtype must be usable wherever a supertype reference is expected, without affecting the correctness of the program.

This principle is closely related to the Open Closed Principle (OCP): a violation of LSP in turn violates the OCP. Let me explain.

If a subtype is not substitutable for the supertype reference, then to support instances of that subtype we have to go back and change existing code. That is a clear violation of OCP.

This mostly shows up where we do run time type identification and then cast to the appropriate reference type. Whenever a new subtype is added, we have to edit that code to add an instanceof test for the new subtype.
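
The type-check pattern described above can be sketched roughly as follows. The Shape, Circle, Square and AreaCalculator names are hypothetical and exist only for this illustration.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical hierarchy, used only to illustrate the type-check pattern.
abstract class Shape {}

class Circle extends Shape {
    final double radius;
    Circle(double radius) { this.radius = radius; }
}

class Square extends Shape {
    final double side;
    Square(double side) { this.side = side; }
}

class AreaCalculator {
    // Every new Shape subtype forces another branch here,
    // so this method is never closed for modification.
    static double totalArea(List<Shape> shapes) {
        double total = 0;
        for (Shape s : shapes) {
            if (s instanceof Circle) {
                Circle c = (Circle) s;
                total += Math.PI * c.radius * c.radius;
            } else if (s instanceof Square) {
                Square sq = (Square) s;
                total += sq.side * sq.side;
            } else {
                throw new IllegalArgumentException("Unknown shape: " + s.getClass());
            }
        }
        return total;
    }

    public static void main(String[] args) {
        List<Shape> shapes = Arrays.asList(new Square(2), new Circle(1));
        System.out.println(totalArea(shapes));
    }
}
```

Adding, say, a Triangle subtype would compile fine but fail at run time until totalArea is edited, which is exactly the coupling OCP warns about.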

Let me give a subtle example:

import java.util.ArrayList;
import java.util.List;

class Bird {
    public void fly(){}
    public void eat(){}
}

class Crow extends Bird {}

class Ostrich extends Bird {
    // An ostrich cannot fly, so it breaks the contract it inherits from Bird.
    @Override
    public void fly(){
        throw new UnsupportedOperationException();
    }
}

public class BirdTest {
    public static void main(String[] args){
        List<Bird> birdList = new ArrayList<Bird>();
        birdList.add(new Bird());
        birdList.add(new Crow());
        birdList.add(new Ostrich());
        letTheBirdsFly(birdList);
    }

    // Written against Bird, so it assumes every element can fly.
    static void letTheBirdsFly(List<Bird> birdList){
        for (Bird b : birdList) {
            b.fly();
        }
    }
}

What do you think happens when this code is executed? As soon as the loop reaches the Ostrich instance, fly() throws an UnsupportedOperationException. Here the subtype is not substitutable for the supertype.

How do we fix such issues?

By factoring. Sometimes moving a feature that not all subtypes share out of the base class and into a separate class helps create a hierarchy that conforms to LSP.

In the above scenario we can factor the fly feature out into two kinds of birds: FlightBird and NonFlight.

class Bird {
    public void eat(){}
}

class FlightBird extends Bird{
    public void fly(){}
}

class NonFlight extends Bird{}

So instead of dealing with Bird directly, code that needs flying deals with FlightBird, and NonFlight birds never expose a fly() method at all.
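
As a rough sketch of how client code might look after this refactoring: placing Crow under FlightBird and Ostrich under NonFlight is an assumption here, as are the FlightBirdTest name and the returned count.

```java
import java.util.ArrayList;
import java.util.List;

class Bird {
    public void eat() {}
}

class FlightBird extends Bird {
    public void fly() {}
}

class NonFlight extends Bird {}

// Assumed placement of the earlier subclasses in the new hierarchy.
class Crow extends FlightBird {}
class Ostrich extends NonFlight {}

class FlightBirdTest {
    // Accepts only birds that can fly, so every element honours fly().
    // Returns how many birds flew (added here only to make the result observable).
    static int letTheBirdsFly(List<FlightBird> birds) {
        int flown = 0;
        for (FlightBird b : birds) {
            b.fly();
            flown++;
        }
        return flown;
    }

    public static void main(String[] args) {
        List<FlightBird> flyers = new ArrayList<>();
        flyers.add(new Crow());
        // flyers.add(new Ostrich()); // does not compile: Ostrich is not a FlightBird
        System.out.println(letTheBirdsFly(flyers) + " bird(s) flew");
    }
}
```

The mistake that used to surface as a run time exception is now rejected by the compiler.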

How can we identify LSP violation?

A derived class may need less functionality than the base class offers, so some inherited methods end up redundant, empty, or throwing exceptions.
We often use the IS-A test to check super/sub relationships, but IS-A alone is not enough for LSP: the subtypes must also be substitutable for the superclass. Substitutability of a subclass cannot be decided in isolation. One has to consider how the clients of the class hierarchy are going to use it.

Wednesday, 30 November 2011

Single Responsibility Principle


The Single Responsibility principle (SRP) states that:


There should never be more than one reason for a class to change.

We can relate the "reason to change" to "the responsibility of the class", so each responsibility is an axis of change. This principle is similar to designing classes that are highly cohesive. The idea is to design a class that has one responsibility, or in other words implements one piece of functionality. To be clear, one responsibility doesn't mean that the class has only ONE method. A responsibility can be implemented by several methods in the class.

Why is this principle required?

Imagine designing classes with more than one responsibility, each implementing more than one piece of functionality. Nothing stops you from doing this. But imagine how many internal dependencies such a class accumulates over the course of development. When you are asked to change one piece of functionality, you cannot be sure how the change will affect the others implemented in the same class. The change might or might not impact other features, but you can't take that risk, especially in production applications. So you end up testing all the dependent features.

You might say that we have automated tests and the number of tests to run is low, but imagine the impact over time. These kinds of changes accumulate, owing to the viscosity of the code, and make it fragile and rigid.

One way to correct a violation of SRP is to decompose the class's functionality into different classes, each of which conforms to SRP.

An example to clarify this principle:

Suppose you are asked to implement a UserSetting service in which the user can change settings, but only after being authenticated. One way to implement this would be:
public class UserSettingService {
    public void changeEmail(User user){     
        if(checkAccess(user)){
            //Grant option to change
        }
    }
    public boolean checkAccess(User user){
        //Verify if the user is valid.
        return false; //Placeholder so the sample compiles; the real check goes here.
    }
} 
All looks good until you want to reuse the checkAccess code somewhere else, OR change the way checkAccess is done, OR change the way email changes are approved. In the latter two cases you end up changing the same class, and in the first case you have to go through UserSettingService just to check access, which is unnecessary.

One way to correct this is to decompose UserSettingService into UserSettingService and SecurityService, and move the checkAccess code into SecurityService.

public class UserSettingService{
    public void changeEmail(User user){
        if(SecurityService.checkAccess(user)){
            //Grant option to change
        }
    }
}

public class SecurityService{
    public static boolean checkAccess(User user){
        //check the access.
        return false; //Placeholder so the sample compiles; the real check goes here.
    }
}
Another example would be:

Suppose there is a requirement to download a file (in CSV, JSON or XML format), parse it, and then write the contents to a database or file system. One approach would be:

public class Task{
    //Parameter types are assumed here for illustration.
    public void downloadFile(String location){
        //Download the file
    }
    public void parseTheFile(java.io.File file){
        //Parse the contents of the file- XML/JSON/CSV
    }
    public void persistTheData(Object data){
        //Persist the data to Database or file system.
    }
}

Looks good: everything is in one place and easy to understand. But what about the number of times this class has to be updated? What about the reusability of the parser code, or the download code? It's not a good design in terms of reusability of the different parts of the code, or in terms of cohesiveness.

One way to decompose the Task class is to create separate classes for each job: a Downloader for downloading the file, a Parser for parsing it, and another class for persisting to the database or file system.
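
A minimal sketch of that decomposition might look like the following. Only Downloader and Parser are named above; Persister, ImportTask, the String-based signatures and the sample location are assumptions made for the illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Each interface has exactly one reason to change.
interface Downloader { String download(String location); }
interface Parser { List<String> parse(String rawContent); }
interface Persister { void persist(List<String> records); }

// Coordinates the three collaborators without knowing how any of them works.
class ImportTask {
    private final Downloader downloader;
    private final Parser parser;
    private final Persister persister;

    ImportTask(Downloader downloader, Parser parser, Persister persister) {
        this.downloader = downloader;
        this.parser = parser;
        this.persister = persister;
    }

    void run(String location) {
        String raw = downloader.download(location);
        List<String> records = parser.parse(raw);
        persister.persist(records);
    }
}

class ImportTaskDemo {
    public static void main(String[] args) {
        List<String> store = new ArrayList<>(); // stands in for a database
        ImportTask task = new ImportTask(
                location -> "a,b,c",                   // fake download
                raw -> Arrays.asList(raw.split(",")),  // CSV-style parser
                store::addAll);                        // in-memory persister
        task.run("hypothetical://location");
        System.out.println(store);
    }
}
```

Swapping the CSV parser for a JSON one, or the in-memory list for a real database, now touches only the class responsible for that step.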

Even in the JDK you will have seen that Rectangle2D and the other Shape classes in the java.awt.geom package carry no information about how they are drawn on the UI. The drawing logic lives in the Graphics/Graphics2D classes.

A detailed description can be found here. 

Tuesday, 29 November 2011

NoSQL


What is NoSQL?


NoSQL is a term used to refer to a class of database systems that differ from traditional relational database management systems (RDBMS) in many ways. RDBMSs are accessed using SQL, hence the term NoSQL implies "not accessed by SQL". More specifically it means "not RDBMS", or more accurately "not relational".


Some key characteristics of NoSQL databases are:
  • They are distributed, can scale horizontally and can handle data volumes of the order of several terabytes or petabytes, with low latency.
  • They have less rigid schemas than a traditional RDBMS.
  • They have weaker transactional guarantees.
  • As suggested by the name, these databases do not support SQL.
  • Many NoSQL databases model data as rows with column families, key value pairs or documents.

To understand what non relational means, it might be useful to recap what relational means.




Theoretically, relational databases comply with Codd's 12 rules of the relational model. More simply, in an RDBMS a table is a relation and a database has a set of such relations. A table has rows and columns. Each table has constraints, and the database enforces the constraints to ensure the integrity of the data. Each row in a table is identified by a primary key, and tables are related using foreign keys. You eliminate duplicate data during the process of normalization, by moving columns into separate tables but keeping the relation using foreign keys. Getting data out of multiple tables requires joining the tables using the foreign keys. This relational model has been useful in modeling most real world problems and has been in widespread use for the last 20 years.

In addition, RDBMS vendors have gone to great lengths to ensure that RDBMSs do a great job of maintaining the ACID (atomicity, consistency, isolation, durability) transactional properties for the data stored. Recovery from unexpected failures is supported. This has led to relational databases becoming the de facto standard for storing enterprise data.

If RDBMSs are so good, why does anyone need NoSQL databases?

Even the largest enterprises have users only in the order of thousands and data requirements in the order of a few terabytes. But when your application is on the internet, where you are dealing with millions of users and data in the order of petabytes, things start to slow down with an RDBMS. The basic operations with any database are read and write. Reads can be scaled by replicating data to multiple machines and load balancing read requests. However, this does not work for writes, because every replica would have to apply every write to keep the data consistent. Writes can be scaled only by partitioning the data. But this affects reads, as distributed joins can be slow and hard to implement. Additionally, to maintain ACID properties, databases need to lock data at the cost of performance.

Companies like Google, Facebook and Twitter have found that relaxing the constraints of RDBMSs and distributing data gives them better performance for use cases that involve:

  • Large datasets of the order of petabytes. Typically this needs to be stored using multiple machines.
  • The application does a lot of writes.
  • Reads require low latency.
  • Data is semi structured.
  • You need to be able to scale without hitting a bottleneck.
  • The application knows what it is looking for. Ad hoc queries are not required.

What are the NoSQL solutions out there?

There are a few different types.

1. Key Value Stores 

They allow clients to read and write values using a key. Amazon's Dynamo is an example of a key value store. Its interface comes down to two operations:

get(key) returns an object or list of objects
put(key,object) store the object as a blob

Dynamo uses hashing to partition data across the hosts that store it. To ensure high availability, each write is replicated across several hosts. Hosts are equal and there is no master. The advantages of Dynamo are that the key value model is simple and that it is highly available for writes.
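
To make partitioning and replication concrete, here is a deliberately simplified sketch. It uses plain modulo hashing over in-memory maps, which is an assumption made for brevity and not how Dynamo itself places keys; the PartitionedKeyValueStore class and all of its methods are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class PartitionedKeyValueStore {
    // Each map stands in for one host; all hosts are equal, there is no master.
    private final List<Map<String, byte[]>> hosts = new ArrayList<>();
    private final int replicas;

    PartitionedKeyValueStore(int hostCount, int replicas) {
        if (replicas < 1 || replicas > hostCount) {
            throw new IllegalArgumentException("replicas must be between 1 and hostCount");
        }
        for (int i = 0; i < hostCount; i++) {
            hosts.add(new HashMap<>());
        }
        this.replicas = replicas;
    }

    // The key's hash picks the first host responsible for it.
    int primaryHost(String key) {
        return Math.floorMod(key.hashCode(), hosts.size());
    }

    // Store the value as a blob on the primary host and on the next replicas-1 hosts.
    void put(String key, byte[] value) {
        int first = primaryHost(key);
        for (int i = 0; i < replicas; i++) {
            hosts.get((first + i) % hosts.size()).put(key, value);
        }
    }

    // Try each replica in turn, so one lost host does not lose the key.
    byte[] get(String key) {
        int first = primaryHost(key);
        for (int i = 0; i < replicas; i++) {
            byte[] value = hosts.get((first + i) % hosts.size()).get(key);
            if (value != null) {
                return value;
            }
        }
        return null;
    }

    // Simulates a host losing its data.
    void failHost(int index) {
        hosts.get(index).clear();
    }

    public static void main(String[] args) {
        PartitionedKeyValueStore store = new PartitionedKeyValueStore(4, 2);
        store.put("user:42", "john".getBytes(StandardCharsets.UTF_8));
        store.failHost(store.primaryHost("user:42"));
        System.out.println(new String(store.get("user:42"), StandardCharsets.UTF_8));
    }
}
```

A real system also has to deal with hosts joining and leaving and with conflicting concurrent writes, none of which this sketch attempts.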

2. Document stores 

The key value pairs that make up the data are encapsulated as a document. Apache CouchDB is an example of a document store. In CouchDB, documents have fields. Each field has a key and a value. A document could be:

{
    "firstname": "John",
    "lastname": "Doe",
    "street": "1 main st",
    "city": "New York"
}

In CouchDB, distribution and replication are peer to peer. The client interface is RESTful HTTP, which integrates well with existing HTTP load balancing solutions.

3. Column based stores 

Data is read and written by column rather than by row. The best known examples are Google's BigTable and the likes of HBase and Cassandra, which were inspired by BigTable. The BigTable paper says that BigTable is a sparse, distributed, persistent, multidimensional sorted map. While that sentence seems complicated, reading each word individually gives clarity.

  • sparse - some cells can be empty
  • distributed - data is partitioned across many hosts
  • persistent - stored to disk
  • multidimensional - more than 1 dimension
  • Map - key and value
  • sorted - maps are generally not sorted but this one is

This sample might help you visualize a BigTable map
{
row1:{
    user:{
          name: john
          id : 123
    },
    post: {
          title:This is a post
          text : xyxyxyxx
    }
},
row2:{
    user:{
          name: joe
          id : 124
    },
    post: {
          title:This is a post
          text : xyxyxyxx
    }
},
row3:{
    user:{
          name: jill
          id : 125
    },
    post: {
          title:This is a post
          text : xyxyxyxx
    }
}
}
The outermost keys row1, row2 and row3 are analogous to rows. user and post are what are called column families. The column family user has columns name and id. post has columns title and text.

Columnfamily:column is how you refer to a column, for example user:id or post:text. In HBase, the column families need to be specified when you create the table, but columns can be added on the fly. HBase provides high availability and scalability using a master-slave architecture.
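
The "sparse, sorted map of maps" idea can be sketched in a few lines of plain Java. This is only an in-memory illustration, neither distributed nor persistent, and it does not use the HBase or BigTable APIs. The SparseSortedTable class and its methods are hypothetical.

```java
import java.util.SortedMap;
import java.util.TreeMap;

class SparseSortedTable {
    // row key -> "family:column" -> value; TreeMap keeps both levels sorted by key.
    private final SortedMap<String, SortedMap<String, String>> rows = new TreeMap<>();

    // Columns are created on the fly; nothing is stored for cells never written.
    void put(String rowKey, String family, String column, String value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>())
            .put(family + ":" + column, value);
    }

    // Returns null for an empty cell, which is what "sparse" means here.
    String get(String rowKey, String family, String column) {
        SortedMap<String, String> row = rows.get(rowKey);
        return row == null ? null : row.get(family + ":" + column);
    }

    // Because row keys are sorted, a contiguous range of rows is cheap to read.
    SortedMap<String, SortedMap<String, String>> scan(String fromRow, String toRowExclusive) {
        return rows.subMap(fromRow, toRowExclusive);
    }

    public static void main(String[] args) {
        SparseSortedTable table = new SparseSortedTable();
        table.put("row1", "user", "name", "john");
        table.put("row1", "user", "id", "123");
        table.put("row2", "user", "name", "joe");
        System.out.println(table.get("row1", "user", "id"));
        System.out.println(table.scan("row1", "row2").keySet());
    }
}
```

Sorting by row key is what makes range scans practical, which is why the choice of row key matters so much in this kind of store.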

Do I need a NoSQL store?

You do not need a NoSQL store if
  • All your data fits into 1 machine and does not need to be partitioned.
  • You are doing OLTP, which requires the ACID transaction properties and data consistency that RDBMSs are good at.
  • You need ad hoc querying using a language like SQL.
  • You have complicated relationships between the entities in your applications.
  • Decoupling data from application is important to you.
You might want to start considering NoSQL stores if
  • Your data has grown so large that it can no longer be handled without partitioning.
  • Your RDBMS can no longer handle the load.
  • You need very high write performance and low latency reads.
  • Your data is not very structured.
  • You cannot afford a single point of failure.
  • You can tolerate some data inconsistency.

The bottom line is that NoSQL stores are a new and complex technology. There are many choices and no standards. There are specific use cases for which NoSQL is a good fit, but an RDBMS does just fine for most vanilla use cases.