I'm studying bucketed tables as an option for my storage. What would be a use case where it is useful to cluster by multiple columns?
I'm trying to optimize a join between two tables that also applies a filter.
Let's say Table A has columns (Id, country, ...) and Table B has columns (Id, country, ...).
Note: A country can have multiple Ids.
Single column clustering
Suppose I cluster both tables by the Id column into 8 buckets.
Table A would have files FileA1, FileA2, ..., FileA8.
Similarly, Table B would have files FileB1, ..., FileB8.
For a join on the Id column, I would imagine FileA1 being joined with FileB1, FileA2 with FileB2, and so on, with the filter on country applied within each of these joins. This avoids comparing FileA1 with any file other than FileB1, which is where I see the performance gain.
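The pairing described above can be sketched in a few lines of Python. This is only a toy model under stated assumptions: `bucket_for`, `bucketize` and `bucketed_join` are hypothetical helpers, the CRC32 hash is a stand-in for whatever hash function the storage engine really uses, and each inner list plays the role of one file (FileA1..FileA8, FileB1..FileB8).

```python
import zlib

NUM_BUCKETS = 8  # same bucket count for both tables, as in the example


def bucket_for(key, num_buckets=NUM_BUCKETS):
    # Stand-in hash (assumption): a real engine uses its own hash function.
    return zlib.crc32(str(key).encode("utf-8")) % num_buckets


def bucketize(rows, num_buckets=NUM_BUCKETS):
    # Split rows into num_buckets "files" based on the hash of the Id column.
    buckets = [[] for _ in range(num_buckets)]
    for row in rows:
        buckets[bucket_for(row["id"], num_buckets)].append(row)
    return buckets


def bucketed_join(buckets_a, buckets_b, country):
    # Join bucket i of A only with bucket i of B, filtering on country.
    result = []
    for file_a, file_b in zip(buckets_a, buckets_b):
        b_by_id = {}
        for row in file_b:
            if row["country"] == country:
                b_by_id.setdefault(row["id"], []).append(row)
        for row in file_a:
            if row["country"] == country:
                for match in b_by_id.get(row["id"], []):
                    result.append((row, match))
    return result


# Made-up sample data: even Ids belong to "FR", odd Ids to "US".
table_a = [{"id": i, "country": "US" if i % 2 else "FR"} for i in range(20)]
table_b = [{"id": i, "country": "US" if i % 2 else "FR"} for i in range(0, 20, 2)]

joined = bucketed_join(bucketize(table_a), bucketize(table_b), "FR")
print(len(joined), "matching row pairs")
```

Because both tables use the same hash and the same bucket count, equal Ids always land at the same bucket index, so no row in FileA1 ever needs to be compared with a row outside FileB1.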
Multiple column clustering
How would clustering on two columns, Id and country, play out in this scenario?
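To make the question concrete, here is a minimal sketch of the difference between the two layouts, assuming the bucket is derived from a hash of all clustering columns combined. The `bucket_for` helper, the CRC32 hash and the sample rows (including the same Id appearing under two countries) are assumptions made for illustration, not the behaviour of any specific engine.

```python
import zlib

NUM_BUCKETS = 8  # same bucket count as in the single column example


def bucket_for(*key_parts, num_buckets=NUM_BUCKETS):
    """Map one or more clustering column values to a bucket index (toy hash)."""
    raw = "|".join(str(part) for part in key_parts).encode("utf-8")
    return zlib.crc32(raw) % num_buckets


# Hypothetical rows: the same Id appears under two different countries.
rows = [
    {"id": 1, "country": "US"},
    {"id": 1, "country": "FR"},
    {"id": 2, "country": "US"},
    {"id": 2, "country": "FR"},
]

# Clustering by Id only: the bucket depends on Id alone.
by_id = {(r["id"], r["country"]): bucket_for(r["id"]) for r in rows}

# Clustering by (Id, country): the bucket depends on both values together.
by_id_country = {
    (r["id"], r["country"]): bucket_for(r["id"], r["country"]) for r in rows
}

for key in by_id:
    print(key, "Id only ->", by_id[key], "| Id + country ->", by_id_country[key])
```

In this toy model, rows that share an Id always share a bucket under Id-only clustering, while under (Id, country) clustering the bucket is computed from both values, so it is only guaranteed to match for rows that agree on both columns.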