Thursday, May 8, 2014

Comparing Cassandra's CQL vs Spark/Shark queries vs Hive/Hadoop (DSE version) - Stack Overflow


I would like to hear your thoughts and experiences on using CQL versus the in-memory query engine Spark/Shark. As far as I know, the CQL processor runs inside the Cassandra JVM on each node, whereas a Shark/Spark query processor attached to a Cassandra cluster runs outside it, in a separate cluster. DataStax also has a DSE version of Cassandra that allows deploying Hadoop/Hive. The question is: in which use cases would we pick one solution over the others?




I will share a few thoughts based on my experience. If possible, though, please tell us about your use case; it will help us answer your questions better.


1- If you are going to have a lot of writes and fewer reads, Cassandra is a good choice. If you are coming from a SQL background and plan to use Cassandra, you will find CQL very helpful. But if you need operations like JOIN and GROUP BY, CQL is not the answer.


2- Shark is very fast for an obvious reason: in-memory processing, which can make it up to ~100x faster than Hive. This becomes a problem when your dataset is too large to fit into memory, and the situation gets worse when you need JOINs on such a dataset. In that case it runs on-disk queries, which are still faster than Hive (by ~5-10x). Go for it when you need ad-hoc, real-time querying. It is not suitable for long-running jobs over gigantic amounts of data.


3- Hive is basically a data warehouse that runs on top of your existing Hadoop cluster and gives you a SQL-like interface to your data. Hive is not suitable for real-time needs; it is best suited for offline batch processing. It doesn't need any additional infrastructure, since it uses the underlying HDFS for data storage. Go for it when you have to perform operations like JOIN and GROUP BY on large datasets, and for OLAP.
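To make point 1 a bit more concrete: since CQL has no JOIN, combining two tables means either denormalizing your data model or doing the join somewhere else (in the client, or in Shark/Hive). Below is a minimal sketch of a client-side hash join plus a GROUP BY-style sum in plain Python. The `users` and `orders` rows, the column names, and the helper functions `hash_join` and `sum_by` are all made up for illustration; they just stand in for the results of two separate CQL SELECTs.

```python
from collections import defaultdict

# Hypothetical rows, standing in for the results of two separate CQL SELECTs.
users = [
    {"user_id": 1, "name": "alice"},
    {"user_id": 2, "name": "bob"},
]
orders = [
    {"order_id": 10, "user_id": 1, "total": 25.0},
    {"order_id": 11, "user_id": 1, "total": 5.0},
    {"order_id": 12, "user_id": 2, "total": 12.5},
]


def hash_join(left, right, key):
    """Inner-join two lists of dicts on `key`, entirely in client memory."""
    # Build a lookup table over the left side.
    index = defaultdict(list)
    for row in left:
        index[row[key]].append(row)
    # Probe it with each row from the right side.
    joined = []
    for row in right:
        for match in index.get(row[key], []):
            joined.append({**match, **row})
    return joined


def sum_by(rows, group_key, value_key):
    """GROUP BY `group_key` with SUM(`value_key`), done in the client."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[group_key]] += row[value_key]
    return dict(totals)


joined = hash_join(users, orders, "user_id")
totals = sum_by(joined, "name", "total")
print(totals)
```

This is fine for small result sets, but both tables have to be pulled into the client, which is exactly the kind of work you would rather hand to Shark or Hive once the data gets large.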


Note: Shark builds directly on the Apache Hive codebase, so it naturally supports virtually all Hive features. It supports the existing Hive SQL language, Hive data formats (SerDes), user-defined functions (UDFs), and queries that call external scripts.
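If it helps, the rules of thumb in points 1-3 can be condensed into a tiny decision helper. This is only a rough sketch of my suggestions above, not a sizing tool: the function name `suggest_engine` and its three boolean parameters are hypothetical, and real workloads will have more dimensions than this.

```python
def suggest_engine(needs_join_or_group_by, interactive, fits_in_memory):
    """Rough suggestion following the rules of thumb in points 1-3.

    All three parameters are illustrative assumptions, not real settings
    of Cassandra, Shark, or Hive.
    """
    # Point 1: without JOIN / GROUP BY, plain CQL on Cassandra is enough.
    if not needs_join_or_group_by:
        return "CQL"
    # Point 2: ad-hoc, real-time querying is where Shark fits; it is
    # fastest when the dataset fits in memory and falls back to slower
    # on-disk queries otherwise.
    if interactive:
        return "Shark (in-memory)" if fits_in_memory else "Shark (on-disk)"
    # Point 3: offline batch JOIN / GROUP BY over large data goes to Hive.
    return "Hive"


print(suggest_engine(False, True, True))
print(suggest_engine(True, True, False))
print(suggest_engine(True, False, False))
```

Treat the output as a starting point for your own benchmarks rather than a final answer.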


But I think you will only be able to evaluate the pros and cons of all these tools properly after getting your hands dirty. I can only make suggestions based on your question.


Hope this answers some of your queries.


P.S.: The above answer is based solely on my experience. Comments/corrections are welcome.




There is a very good benchmarking effort documented here: https://amplab.cs.berkeley.edu/benchmark/




