
Cassandra data modelling: Redundant data, a tough decision

The biggest challenge in building an efficient data model for Cassandra is data redundancy.

Although the basic rules for data modelling with Cassandra list the usual RDBMS modelling goals as non-goals for Cassandra (see: Basic rules for C* data modelling), they build on the assumption that clusters run on commodity hardware, storage is cheap, and, as data needs grow, more nodes can be added to the cluster at very low cost.

But in real life we face technical as well as non-technical problems.
a. Keeping multiple column families in sync is a major overhead if the same data is spread across them. What if writes to some column families succeed and others fail? How long, and how many times, will we retry?
b. Horizontal scalability may be real, but consider the mundane question of where to house all those heat-producing, energy-guzzling machines.

So how do we model a database that does not allow joins, without redundancy?

The simple answer is: we do not. What we do is manage redundancy.

Let us consider a case where we need to query a dataset containing 1000 attributes, and the queries involve two mutually exclusive identifying keys.
If we know that each key will yield only a few rows, we would rather build one column family indexed on the more frequently used key (say key1; even if it is used just 0.1% more often, the idea is to choose the key used more) and a second column family containing the mapping between the two keys, indexed on the other key (key2). We would then do an in-memory join.
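The lookup pattern above can be sketched in a few lines. This is a minimal illustration only, using plain dicts to stand in for the two column families; the names (key1, key2, users_by_key1, key1_by_key2) are hypothetical and not from any real schema.

```python
# Main "column family", partitioned on key1 (the more frequently used key).
users_by_key1 = {
    "k1-a": {"name": "alice", "city": "pune"},
    "k1-b": {"name": "bob", "city": "delhi"},
}

# Lookup "column family": maps the other key (key2) back to key1.
key1_by_key2 = {
    "k2-x": "k1-a",
    "k2-y": "k1-b",
}

def fetch_by_key1(key1):
    # One read against the main column family.
    return users_by_key1.get(key1)

def fetch_by_key2(key2):
    # Two reads, joined in memory: resolve key2 -> key1,
    # then read the main column family by key1.
    key1 = key1_by_key2.get(key2)
    return users_by_key1.get(key1) if key1 is not None else None
```

The point is that the full 1000-attribute row is stored once; the second query key costs only a small mapping table and one extra read.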

On the other hand, if we need to query a dataset containing 100 attributes, the queries involve two mutually exclusive keys, and each key would yield 10000 rows of data, we would want to live with data redundancy. An in-memory join would be a really bad idea here.

Some may ask: what if the dataset contains 1000 attributes and 10000 rows of data? Keeping with the idea that our data should be well spread across the cluster, this case falls under the question "have we really spread the data correctly?" Another factor to remember is to model around your queries. Do all our queries need all those 1000 attributes? Do all our queries need all those 10000 rows? Most of the time, the answer will be no to one of those questions. If it is no to the first, we create column families with only the relevant columns. If it is no to the second, we spread our data more evenly by choosing another attribute from the data and creating a composite partition key.
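The composite-partition-key idea can be sketched as follows. This is a hypothetical illustration: it assumes the rows carry a secondary attribute (here called "region") that we can fold into the partition key; the names are made up for the example.

```python
def partition_key(query_key, region):
    # Composite partition key: (query_key, region) instead of query_key
    # alone, so the 10000 rows for one query_key are spread across
    # several partitions rather than landing on a single hot one.
    return (query_key, region)

rows = [
    {"query_key": "q1", "region": "east", "value": 1},
    {"query_key": "q1", "region": "west", "value": 2},
    {"query_key": "q1", "region": "east", "value": 3},
]

# Group rows by their composite partition key, the way Cassandra would
# place them on different partitions.
partitions = {}
for row in rows:
    key = partition_key(row["query_key"], row["region"])
    partitions.setdefault(key, []).append(row)
```

The same query_key now occupies multiple partitions; the trade-off is that a query for all of q1 must now name each region it wants.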

We can further refine this by giving the main column family a generated id as its partition key, and then building lookup column families whose partition keys are the query keys and whose values are collections of generated ids. Data is then read by first resolving the generated ids and then issuing individual queries by id. This reduces data redundancy, keeps the in-memory processing light, and spreads the data much better across the cluster.
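A minimal sketch of this generated-id pattern, again using dicts as stand-ins for the column families; all names (data_by_id, ids_by_key1, and so on) are illustrative only.

```python
import uuid

data_by_id = {}    # main column family: generated id -> full row
ids_by_key1 = {}   # lookup column family for query key1
ids_by_key2 = {}   # lookup column family for query key2

def insert(row, key1, key2):
    # The row is stored exactly once, under a generated id; each lookup
    # column family stores only a small collection of ids.
    row_id = str(uuid.uuid4())
    data_by_id[row_id] = row
    ids_by_key1.setdefault(key1, set()).add(row_id)
    ids_by_key2.setdefault(key2, set()).add(row_id)
    return row_id

def query_by_key1(key1):
    # Resolve the ids first, then fetch each row individually.
    return [data_by_id[i] for i in ids_by_key1.get(key1, set())]
```

Both query keys are served without duplicating the wide rows; only the id collections are redundant.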

P.S.: Try to have the queries read one key at a time. Contrary to RDBMS practice, an IN clause with multiple partition keys will do more harm than good. Instead, loop over the keys and issue multiple queries. If needed, the queries can be invoked asynchronously.
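The loop-over-keys approach can be sketched like this. It is only an illustration: fetch_one stands in for a real per-key read (in a driver this would be something like one execute-async call per key), and a thread pool simulates the concurrent fan-out.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for a Cassandra table; in reality fetch_one would be a
# single-partition read issued through the driver.
table = {"k1": "row1", "k2": "row2", "k3": "row3"}

def fetch_one(key):
    # One lightweight, single-partition query.
    return table.get(key)

def fetch_many(keys):
    # One query per key, issued concurrently, instead of a single
    # IN (...) clause that makes one coordinator touch many partitions.
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(fetch_one, keys))
```

Each request stays small and routable to the right replica, which is exactly what the multi-key IN clause defeats.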
