samedi 9 août 2014

performance - distribué local algorithme de coefficient clustering (MapReduce/Hadoop) - Stack Overflow


I have implemented MapReduce paradigm based local clustering coefficient algorithm. However I have run into serious troubles for bigger datasets or specific datasets (high average degree of a node). I tried to tune my hadoop platform and the code however the results were unsatisfactory (to say the least). No I have turned my attention to actually change/improve the algorithm. Below is my current algorithm (pseudo code)


foreach(Node in Graph) {
//Job1
/* Transform edge-based input dataset to node-based dataset */

//Job2
map() {
emit(this.Node, this.Node.neighbours) //emit myself data to all my neighbours
emit(this.Node, this.Node) //emit myself to myself
}

reduce() {
NodeNeighbourhood nodeNeighbourhood;
while(values.hasNext) {
if(myself)
this.nodeNeighbourhood.setCentralNode(values.next) //store myself data
else
this.nodeNeighbourhood.addNeighbour(values.next) //store neighbour data
}

emit(null, this.nodeNeighbourhood)
}

//Job3
map() {
float lcc = calculateLocalCC(this.nodeNeighbourhood)
emit(0, lcc) //emit all lcc to specific key, combiners are used
}

reduce() {
float combinedLCC;
int numberOfNodes;
while(values.hasNext) {
combinedLCC += values.next;
}

emit(null, combinedLCC/numberOfNodes); // store graph average local clustering coefficient
}
}

Little bit more details about the code. For directed graphs neighbour data is restricted to node ID and OUT edges destination IDs (to decrease the data size), for undirected its also node ID and edges destination IDs. Sort and Merge buffers are increased to 1.5 Gb, merge streams 80.


It can be clearly seen that Job2 is the actual problem of the whole algorithm. It generates massive amount of data that has to be sorted/copied/merged. This basically kills my algorithm performance for certain datasets. Can someone guide me on how to improve the algorithm (I was thinking about creating an iterative Job2 ["process" only M nodes out of N in each iteration until every node is "processed"], but I have abandoned this idea for now). In my opinion the Job2 map-output should be decreased, to avoid costly sort/merge processes, which kill the performance.


I have also implemented the same algorithm (3 Jobs as well, same "communication" pattern, also "Job2" problem) for the Giraph platform. However Giraph is an in-memory platform and the algorithm for the same "problematic" datasets results in an OutOfMemoryException.


For any comment, remark, guideline I will be grateful.




UPDATE


I'm going to change the algorithm "drastically". I've found this article Counting Triangles.


Once the code is implemented I'm gonna post my opinion here and more detailed code (if this approach will be successful).




UPDATE_2


In the end I've ended "modifying" NodeIterator++ algorithm to my needs (Yahoo paper is available through a link in the article). Unfortunately though I can see an improvement in the performance the end result is not as good as I have hoped. The conclusion I have reached is that the cluster which is available to me is just too small to make the LCC calculations feasible for these specific datasets. So the question remains, or rather it evolves. Does any one know of an efficient distributed/sequential algorithm for calculating LCC or triangles with limited resources available? (By no means I am stating that the NodeIterator++ algorithm is bad, I simple state that the resources which are available to me are just not sufficient).



I have implemented MapReduce paradigm based local clustering coefficient algorithm. However I have run into serious troubles for bigger datasets or specific datasets (high average degree of a node). I tried to tune my hadoop platform and the code however the results were unsatisfactory (to say the least). No I have turned my attention to actually change/improve the algorithm. Below is my current algorithm (pseudo code)


foreach(Node in Graph) {
//Job1
/* Transform edge-based input dataset to node-based dataset */

//Job2
map() {
emit(this.Node, this.Node.neighbours) //emit myself data to all my neighbours
emit(this.Node, this.Node) //emit myself to myself
}

reduce() {
NodeNeighbourhood nodeNeighbourhood;
while(values.hasNext) {
if(myself)
this.nodeNeighbourhood.setCentralNode(values.next) //store myself data
else
this.nodeNeighbourhood.addNeighbour(values.next) //store neighbour data
}

emit(null, this.nodeNeighbourhood)
}

//Job3
map() {
float lcc = calculateLocalCC(this.nodeNeighbourhood)
emit(0, lcc) //emit all lcc to specific key, combiners are used
}

reduce() {
float combinedLCC;
int numberOfNodes;
while(values.hasNext) {
combinedLCC += values.next;
}

emit(null, combinedLCC/numberOfNodes); // store graph average local clustering coefficient
}
}

Little bit more details about the code. For directed graphs neighbour data is restricted to node ID and OUT edges destination IDs (to decrease the data size), for undirected its also node ID and edges destination IDs. Sort and Merge buffers are increased to 1.5 Gb, merge streams 80.


It can be clearly seen that Job2 is the actual problem of the whole algorithm. It generates massive amount of data that has to be sorted/copied/merged. This basically kills my algorithm performance for certain datasets. Can someone guide me on how to improve the algorithm (I was thinking about creating an iterative Job2 ["process" only M nodes out of N in each iteration until every node is "processed"], but I have abandoned this idea for now). In my opinion the Job2 map-output should be decreased, to avoid costly sort/merge processes, which kill the performance.


I have also implemented the same algorithm (3 Jobs as well, same "communication" pattern, also "Job2" problem) for the Giraph platform. However Giraph is an in-memory platform and the algorithm for the same "problematic" datasets results in an OutOfMemoryException.


For any comment, remark, guideline I will be grateful.




UPDATE


I'm going to change the algorithm "drastically". I've found this article Counting Triangles.


Once the code is implemented I'm gonna post my opinion here and more detailed code (if this approach will be successful).




UPDATE_2


In the end I've ended "modifying" NodeIterator++ algorithm to my needs (Yahoo paper is available through a link in the article). Unfortunately though I can see an improvement in the performance the end result is not as good as I have hoped. The conclusion I have reached is that the cluster which is available to me is just too small to make the LCC calculations feasible for these specific datasets. So the question remains, or rather it evolves. Does any one know of an efficient distributed/sequential algorithm for calculating LCC or triangles with limited resources available? (By no means I am stating that the NodeIterator++ algorithm is bad, I simple state that the resources which are available to me are just not sufficient).


0 commentaires:

Enregistrer un commentaire