Join - Cluster by multiple columns in Hive

Thursday, May 8, 2014

Join - Cluster by multiple columns in Hive - Stack Overflow

I'm studying bucketed tables as an option for my storage. What would be a use case where clustering by multiple columns is useful?

I'm trying to optimize a join between two tables with filtering.

Let's say Table A has columns (Id, country, ...) and Table B has columns (Id, country, ...).

Note: A country could have multiple Ids.

Single column clustering

Suppose I cluster both tables by the Id column into 8 buckets.

Table A would have files FileA1, FileA2, ..., FileA8.

Similarly, Table B would have files FileB1, ..., FileB8.

For a join on the Id column, I would imagine FileA1 being joined with FileB1, FileA2 with FileB2, and so on, with the country filter applied within each join. This would avoid comparing FileA1 with any file other than FileB1, so I would expect a performance gain.
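
The bucket-by-bucket join imagined above can be sketched in plain Python. This is only a toy model of the idea, not how Hive executes the join: `bucket_of` is a hypothetical stand-in for the engine's bucketing hash (a simple modulo on a non-negative integer Id), and the two 16-row tables are made-up sample data.

```python
NUM_BUCKETS = 8  # same bucket count as in the question


def bucket_of(row_id, num_buckets=NUM_BUCKETS):
    """Hypothetical stand-in for the bucketing hash (non-negative int ids)."""
    return row_id % num_buckets


def bucketize(rows, num_buckets=NUM_BUCKETS):
    """Split rows into num_buckets lists, playing the role of FileX1..FileX8."""
    buckets = [[] for _ in range(num_buckets)]
    for row in rows:
        buckets[bucket_of(row["id"], num_buckets)].append(row)
    return buckets


def bucketed_join(rows_a, rows_b, country):
    """Join A and B on id, one bucket pair at a time, filtering on country."""
    joined, comparisons = [], 0
    for file_a, file_b in zip(bucketize(rows_a), bucketize(rows_b)):
        # Only rows from the matching bucket pair are ever compared.
        for a in file_a:
            for b in file_b:
                comparisons += 1
                if (a["id"] == b["id"]
                        and a["country"] == country
                        and b["country"] == country):
                    joined.append((a["id"], country))
    return joined, comparisons


# Made-up sample data: odd ids belong to "FR", even ids to "US".
table_a = [{"id": i, "country": "FR" if i % 2 else "US"} for i in range(16)]
table_b = [{"id": i, "country": "FR" if i % 2 else "US"} for i in range(16)]

pairs, comparisons = bucketed_join(table_a, table_b, "FR")
naive_comparisons = len(table_a) * len(table_b)
```

In this toy model each bucket pair holds two rows per side, so the join performs far fewer row comparisons than comparing every row of A against every row of B, which is the gain described above.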

Multiple Column Clustering

How would clustering on two columns, Id and country, play out in this scenario?
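
To make the question concrete, here is a minimal sketch of what bucketing on both columns could look like. The hash below is an assumption chosen for illustration (a Java-style string hash combined with the Id), not necessarily what Hive uses, and the function names and sample values are hypothetical.

```python
NUM_BUCKETS = 8  # same bucket count as in the single-column case


def string_hash(text):
    """Illustrative Java-style string hash; an assumption, not Hive's actual code."""
    h = 0
    for ch in text:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h


def bucket_by_id(row_id, num_buckets=NUM_BUCKETS):
    """Bucket derived from the Id column alone."""
    return row_id % num_buckets


def bucket_by_id_and_country(row_id, country, num_buckets=NUM_BUCKETS):
    """Bucket derived from both columns combined into one hash."""
    return (31 * row_id + string_hash(country)) % num_buckets


# Hypothetical values: the same Id paired with two different countries.
single_column = bucket_by_id(1)
with_fr = bucket_by_id_and_country(1, "FR")
with_us = bucket_by_id_and_country(1, "US")
```

Under this assumed hash, the bucket of a row depends on both values, so the Id alone no longer determines which file a row lands in. Whether a join on Id only can still pair FileA1 with FileB1 in that layout is exactly what the question is asking.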