Tuesday, April 8, 2014

Java - ArrayList Vs LinkedList - Stack Overflow


I was following a previous post on this that says:


For LinkedList

* get is O(n)
* add is O(1)
* remove is O(n)
* Iterator.remove is O(1)

For ArrayList

* get is O(1)
* add is O(1) amortized, but O(n) worst-case since the array must be resized and copied
* remove is O(n)

So by looking at this, I concluded that if I have to do just sequential inserts into my collection of, say, 5,000,000 elements, LinkedList should outclass ArrayList.


And if I just have to fetch the elements from the collection by iterating, i.e. not grabbing an element in the middle, LinkedList should still outclass ArrayList.


Now to verify these two statements, I wrote the sample program below... but I'm surprised that both statements were proven wrong.


ArrayList outclassed LinkedList in both cases: it took less time to add elements as well as to fetch them from the collection. Is there anything I'm doing wrong, or do the initial statements about LinkedList and ArrayList not hold true for collections of size 5,000,000?


I mention size because if I reduce the number of elements to 50,000, LinkedList performs better and the initial statements hold true.


long nano1 = System.nanoTime();

List<Integer> arr = new ArrayList<>();
for (int i = 0; i < 5000000; ++i) {
    arr.add(i);
}
System.out.println(System.nanoTime() - nano1);

for (int j : arr) {
    ;
}
System.out.println(System.nanoTime() - nano1);

long nano2 = System.nanoTime();

List<Integer> arrL = new LinkedList<>();
for (int i = 0; i < 5000000; ++i) {
    arrL.add(i);
}
System.out.println(System.nanoTime() - nano2);

for (int j : arrL) {
    ;
}
System.out.println(System.nanoTime() - nano2);



Remember that big-O complexity describes asymptotic behaviour and may not reflect actual implementation speed. It describes how the cost of each operation grows with the size of the list, not the speed of each operation. For example, the following implementation of add is O(1) but is not fast:


public class MyList<E> extends LinkedList<E> {
    @Override
    public boolean add(E o) {
        try {
            Thread.sleep(10000); // still O(1), but ten seconds per call
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return super.add(o);
    }
}

I suspect in your case ArrayList is performing well because it increases its internal buffer size fairly aggressively, so there will not be a large number of reallocations. When the buffer does not need to be resized, ArrayList will have faster adds.
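To see why the reallocations stay cheap in aggregate, here is a rough sketch (not the JDK source) that counts how often the backing array would be reallocated. It assumes OpenJDK's approximate 1.5x growth policy and default initial capacity of 10; both constants are assumptions about the implementation and may differ across versions.

```java
// Sketch: estimate how many times ArrayList's backing array is reallocated
// when adding n elements. Assumed growth policy: roughly 1.5x per resize
// (newCapacity = oldCapacity + oldCapacity/2), starting from capacity 10.
public class GrowthCount {
    static int reallocations(int n) {
        int capacity = 10;  // assumed default initial capacity
        int grows = 0;
        while (capacity < n) {
            capacity = capacity + (capacity >> 1); // ~1.5x growth
            grows++;
        }
        return grows;
    }

    public static void main(String[] args) {
        // Only a few dozen reallocations are needed even for 5,000,000 adds,
        // which is why the amortized cost per add stays small.
        System.out.println(reallocations(5_000_000));
    }
}
```

Under these assumptions, 5,000,000 adds trigger only about 33 resize-and-copy passes, so the copying cost is spread extremely thin.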


You also need to be very careful when you do this kind of profiling. I'd suggest you change your profiling code to do a warm-up phase (so the JIT has the opportunity to do some optimization without affecting your results) and average the results over a number of runs.


private final static int WARMUP = 1000;
private final static int TEST = 1000;
private final static int SIZE = 500000;

public void perfTest() {
    // Warmup
    for (int i = 0; i < WARMUP; ++i) {
        buildArrayList();
    }
    // Test
    long sum = 0;
    for (int i = 0; i < TEST; ++i) {
        sum += buildArrayList();
    }
    System.out.println("Average time to build array list: " + (sum / TEST));
}

public long buildArrayList() {
    long start = System.nanoTime();
    List<Integer> a = new ArrayList<>();
    for (int i = 0; i < SIZE; ++i) {
        a.add(i);
    }
    long end = System.nanoTime();
    return end - start;
}

... same for buildLinkedList

(Note that sum may overflow, and you might be better off using System.currentTimeMillis().)


It's also possible that the compiler is optimizing away your empty get loops. Make sure the loop actually does something to ensure that the right code is getting called.
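One informal way to make sure the loop does something is to accumulate the elements into a value that is used afterwards; this is roughly what a benchmark harness like JMH does with its Blackhole. A sketch under that assumption:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: give the iteration loop an observable side effect so the JIT
// cannot eliminate it as dead code. Summing the elements and using the
// result afterwards keeps the loop alive.
public class IterateBench {
    static long consume(List<Integer> list) {
        long sink = 0;
        for (int j : list) {
            sink += j;   // the loop body now does real work
        }
        return sink;     // returning the sum makes the loop observable
    }

    public static void main(String[] args) {
        List<Integer> arr = new ArrayList<>();
        for (int i = 0; i < 5_000_000; i++) {
            arr.add(i);
        }
        long start = System.nanoTime();
        long sink = consume(arr);
        long elapsed = System.nanoTime() - start;
        // Printing sink prevents the whole computation being optimized away.
        System.out.println(sink + " in " + elapsed + " ns");
    }
}
```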




This is a bad benchmark IMO.



  • You need to repeat the test in a loop multiple times to warm up the JVM.

  • You need to DO something in your iteration loop or it can be optimized away.

  • ArrayList resizes, which is costly. If you had constructed the ArrayList as new ArrayList(500000), it would be constructed in one shot, and then all the adds would be quite cheap (one preallocated backing array).

  • You don't specify your JVM memory settings - it should be run with -Xms == -Xmx (everything preallocated) and sufficiently high that no GC is likely to be triggered.

  • This benchmark doesn't cover the most unpleasant aspect of LinkedList - random access. (An iterator isn't necessarily the same thing.) If you feed say 10% of the size of a large collection into a random selection of list.get calls, you will find LinkedList is awful for grabbing anything other than the first or last element.
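The preallocation point above can be sketched as follows; the capacity value matches the benchmark size and is otherwise arbitrary:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: sizing the ArrayList up front means the backing array is
// allocated once and never copied during the fill loop.
public class Prealloc {
    public static void main(String[] args) {
        List<Integer> preSized = new ArrayList<>(5_000_000); // one allocation
        for (int i = 0; i < 5_000_000; i++) {
            preSized.add(i); // never triggers a resize-and-copy
        }
        System.out.println(preSized.size());
    }
}
```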


For an ArrayList, the JDK get is what you'd expect:


public E get(int index) {
    RangeCheck(index);

    return elementData[index];
}

(Basically, it just returns the indexed array element.)


For a LinkedList:


public E get(int index) {
    return entry(index).element;
}

Looks similar? Not quite. entry is a method call, not a primitive array access, and look what it has to do:


private Entry<E> entry(int index) {
    if (index < 0 || index >= size)
        throw new IndexOutOfBoundsException("Index: "+index+
                                            ", Size: "+size);
    Entry<E> e = header;
    if (index < (size >> 1)) {
        for (int i = 0; i <= index; i++)
            e = e.next;
    } else {
        for (int i = size; i > index; i--)
            e = e.previous;
    }
    return e;
}

That's right: if you ask for, say, list.get(250000), it has to start at the head and repeatedly step through next elements, 250,000 accesses or so (there is an optimization in the code where it starts at the head or the tail depending on which requires fewer accesses).
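The consequence of that traversal can be sketched with a small program: an indexed-get loop over a LinkedList is O(n^2) overall, while the same loop over an ArrayList is O(n). The sizes here are deliberately small so the demo finishes quickly.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

// Sketch: the same indexed loop, run against both list types.
public class RandomAccessDemo {
    static long sumByIndex(List<Integer> list) {
        long sum = 0;
        for (int i = 0; i < list.size(); i++) {
            sum += list.get(i); // O(1) for ArrayList, O(n) for LinkedList
        }
        return sum;
    }

    public static void main(String[] args) {
        List<Integer> a = new ArrayList<>();
        List<Integer> l = new LinkedList<>();
        for (int i = 0; i < 10_000; i++) {
            a.add(i);
            l.add(i);
        }
        // Same answer, very different cost profile as the size grows.
        System.out.println(sumByIndex(a) == sumByIndex(l)); // true
    }
}
```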




An ArrayList is a simpler data structure than a LinkedList. An ArrayList has a single array of pointers in contiguous memory locations. It only has to be recreated if the array is expanded beyond its allocated size.


A LinkedList consists of a chain of nodes; each node is separately allocated and has pointers to the previous and next nodes.


So what does this mean? Unless you need to insert in the middle, splice, delete in the middle, etc., an ArrayList will usually be faster. It needs fewer memory allocations, has much better locality of reference (which is important for processor caching), etc.
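For completeness, a sketch of the one pattern where LinkedList's structure pays off: inserting during traversal via a ListIterator. Each add is O(1) because the iterator already holds the node; the same pattern on an ArrayList shifts the tail of the array on every insertion.

```java
import java.util.LinkedList;
import java.util.List;
import java.util.ListIterator;

// Sketch: splice new elements into the middle of a LinkedList while
// iterating, without any per-insert traversal from the head.
public class MidInsert {
    public static void main(String[] args) {
        List<Integer> list = new LinkedList<>(List.of(1, 3, 5));
        ListIterator<Integer> it = list.listIterator();
        while (it.hasNext()) {
            int v = it.next();
            it.add(v + 1); // links a new node right after the current one
        }
        System.out.println(list); // [1, 2, 3, 4, 5, 6]
    }
}
```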




To understand why the results you got do not contradict the "big O" characterization, we need to go back to first principles; i.e. the definition.



Let f(x) and g(x) be two functions defined on some subset of the real numbers. One writes


f(x) = O(g(x)) as x -> infinity

if and only if, for sufficiently large values of x, f(x) is at most a constant multiplied by g(x) in absolute value. That is, f(x) = O(g(x)) if and only if there exists a positive real number M and a real number x_0 such that


|f(x)| <= M |g(x)| for all x > x_0.

In many contexts, the assumption that we are interested in the growth rate as the variable x goes to infinity is left unstated, and one writes more simply that f(x) = O(g(x)).



So the statement "add1 is O(1)" means that the time cost of an add1 operation on a list of size N tends towards a constant Cadd1 as N tends to infinity.


And the statement "add2 is O(1) amortized over N operations" means that the average time cost of one of a sequence of N add2 operations tends towards a constant Cadd2 as N tends to infinity.
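The amortized claim for ArrayList's add can be made concrete with a quick calculation. This sketch assumes a doubling growth factor for simplicity; OpenJDK's ArrayList actually grows by roughly 1.5x, which changes only the constant:

```latex
% Cost model: each add writes one element; when a full buffer of capacity C
% is hit, a buffer of capacity 2C is allocated and C elements are copied.
% Over N adds starting from capacity 1, the total copying work is
1 + 2 + 4 + \dots + 2^{\lfloor \log_2 N \rfloor} < 2N,
% so the total cost of N adds is bounded by
\underbrace{N}_{\text{writes}} + \underbrace{2N}_{\text{copies}} = O(N)
\quad\Longrightarrow\quad
\text{amortized cost per add} = \frac{O(N)}{N} = O(1).
```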


What it does not say is what those constants Cadd1 and Cadd2 are. In fact, the reason that LinkedList is slower than ArrayList in your benchmark is that Cadd1 is larger than Cadd2.


The lesson is that big O notation does not predict absolute or even relative performance. All it predicts is the shape of the performance function as the controlling variable gets very large. This is useful to know, but it doesn't tell you everything you need to know.




Big-O notation is not about absolute timings but about relative timings, and you can't compare the numbers of one algorithm to another.


You only get information about how the same algorithm reacts to increasing or decreasing numbers of elements.


One algorithm might take an hour for one operation and two hours for two operations, and be O(n); another might also be O(n) but take one millisecond for one operation and two milliseconds for two.


Another issue when measuring with the JVM is optimization by the HotSpot compiler: a do-nothing loop might be eliminated entirely by the JIT compiler.


A third thing to consider is the OS and the JVM, which use caches and may run garbage collection in the meantime.




It's hard to find a good use case for LinkedList. If you only need to make use of the Deque interface, you should probably use ArrayDeque. If you really need to use the List interface, you will often hear the suggestion to always use ArrayList, because LinkedList behaves really poorly when accessing a random element.
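A quick sketch of that suggestion: when only double-ended queue operations are needed, ArrayDeque provides O(1) amortized adds and removes at both ends without LinkedList's per-node allocation overhead.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch: ArrayDeque covering the typical Deque use cases.
public class DequeDemo {
    public static void main(String[] args) {
        Deque<Integer> deque = new ArrayDeque<>();
        deque.addFirst(1); // O(1) amortized at the head
        deque.addLast(2);  // O(1) amortized at the tail
        deque.addLast(3);
        System.out.println(deque.pollFirst()); // 1
        System.out.println(deque.pollLast());  // 3
    }
}
```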


Unfortunately, ArrayList also has its performance problems if elements at the beginning or in the middle of the list must be removed or inserted.


There is however a new list implementation called GapList which combines the strengths of both ArrayList and LinkedList. It has been designed as a drop-in replacement for both ArrayList and LinkedList and therefore implements both the List and Deque interfaces. All public methods provided by ArrayList are also implemented (ensureCapacity, trimToSize).


GapList's implementation guarantees efficient random access to elements by index (as ArrayList does) and at the same time efficient adding and removing elements to and from head and tail of the list (as LinkedList does).


You can find more information about GapList at http://java.dzone.com/articles/gaplist-%E2%80%93-lightning-fast-list.




O notation analysis provides important information, but it has its limitations. By definition, O notation analysis assumes that every operation takes approximately the same time to execute, which is not true. As @seand pointed out, linked lists internally use more complex logic to insert and fetch elements (take a look at the source code; you can Ctrl+click in your IDE). ArrayList internally only needs to insert elements into an array and increase its size once in a while (which, despite being an O(n) operation, in practice can be accomplished pretty fast).


Cheers




