Wednesday 19 February 2014

How does sort actually work?

I talked about sort in my last post, but I started to read more about it and got to wondering: how does it actually work?

You can use sort to order lists numerically or lexically. To sort lexically you use the cmp operator within the curly braces:
sort { $a cmp $b } ("please", "sort", "this", "list");
and for sorting numerically, you use the <=> operator:
sort { $a <=> $b } (13, 2, 8, 1, 3, 1, 5, 21);
What I think this means is that $a and $b refer to the two elements being compared each time the block in the curly braces runs. Because $a is written before the operator and $b after it, the element that compares as smaller ends up earlier in the result, so words beginning with letters earlier in the alphabet are sorted in front of the others (and swapping $a and $b would reverse the order). Hopefully this all makes sense and is actually accurate.
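
To convince myself, here's a little script to play with (the $b <=> $a version at the end isn't from the examples above - it's just there to show that swapping $a and $b reverses the order):

use strict;
use warnings;

# Ascending lexical sort - the same result as a plain sort()
my @words = sort { $a cmp $b } ("please", "sort", "this", "list");
print "@words\n";    # list please sort this

# Ascending numerical sort
my @numbers = sort { $a <=> $b } (13, 2, 8, 1, 3, 1, 5, 21);
print "@numbers\n";    # 1 1 2 3 5 8 13 21

# Swapping $a and $b gives a descending sort
my @descending = sort { $b <=> $a } (13, 2, 8, 1, 3, 1, 5, 21);
print "@descending\n";    # 21 13 8 5 3 2 1 1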

Perl 5.6 and earlier used the quicksort algorithm; perl 5.8 onwards uses the merge sort algorithm. On Wikipedia (http://en.wikipedia.org/wiki/Merge_sort) there's a pretty good animation at the top of the page that shows how merge sort works.

Quicksort - perl 5.6 and earlier

I remember learning this at school and uni and thinking it's pretty cool but also wondering what it was used for. We had to write a program in Java that carried out a quicksort. I should have just submitted one line of perl code...

Quicksort is a divide and conquer algorithm and there are different variations of it based on which position you choose for the pivot (explanation very soon), but I couldn't find out which variation was used for sort in perl. So for simplicity, and because it's how I was taught, I'm going to use the value in the middle of the list as the pivot.

So we have a list of numbers (assuming we're using numbers and sorting them in ascending numerical order) and the value in the centre of the list is the pivot. All numbers lower than the pivot are put on the left of this pivot and all numbers higher are put on the right. This creates three different lists: numbers lower than the pivot, numbers higher than the pivot, and the pivot itself. The pivot list is one element long and is considered sorted, so now we need to sort the other two lists in the same way. For each list a pivot is chosen and, again, lower numbers go on the left and higher numbers go on the right, which creates up to seven lists. This continues until every list contains only one element; putting these lists all together gives one sorted list. Here's an example (please feel free to skip the example if you already know how it works or if it doesn't interest you!):

(4,2,9,6,5,1,8,7,3)

The number in the centre of the list, 5, is the pivot.

Starting from the left, numbers lower than 5 are put into one list to the left and numbers higher than 5 are put into another list on the right. 4 is taken out first giving:

(4)  (2,9,6,5,1,8,7,3)

Then 2 is taken out and put in the list on the left:

(4,2) (9,6,5,1,8,7,3)

Then 9:

(4,2) (6,5,1,8,7,3) (9)

This is repeated (excluding the pivot) until we have three lists, with the pivot on its own:

(4,2,1,3)  (5)  (9,6,8,7)

Pivots are now chosen for any lists with more than one element. I think this time, as there is no real middle element, I will use the (n+1)/2th element, rounding up - so the pivots of (4,2,1,3) and (9,6,8,7) are 1 and 8 respectively.

(4,2,1,3)  (5)  (9,6,8,7)

The same process is applied, starting with the first list: all elements lower than the pivot (1) go on the left and all numbers higher go on the right:

(1) (4,2,3) (5) (9,6,8,7)

The same is applied to the other list:

(1) (4,2,3) (5) (6,7) (8) (9)

Again, pivots are picked for the lists with more than one element (2 for the first and 7 for the second):

(1) (4,2,3) (5) (6,7) (8) (9)

And the same sorting is applied giving:

(1) (2) (4,3) (5) (6) (7) (8) (9)

Then the final list that contains more than one element is sorted:

(1) (2) (3) (4) (5) (6) (7) (8) (9)

Put this all together and you have a sorted list:

(1,2,3,4,5,6,7,8,9)
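
Here's a rough Perl sketch of the quicksort described above (middle-element pivot, three lists). It's just to illustrate the idea - it's not how perl actually implements its sort internally:

use strict;
use warnings;

# Quicksort sketch: pick the middle element as the pivot, split the rest
# into a "lower" list and a "higher" list, then sort each of those the
# same way and stick the three pieces back together.
sub quicksort {
    my @list = @_;
    return @list if @list <= 1;    # 0 or 1 elements is already sorted

    my $pivot = splice(@list, int(@list / 2), 1);    # remove the middle element

    my @lower  = grep { $_ <  $pivot } @list;
    my @higher = grep { $_ >= $pivot } @list;    # equal elements go right

    return (quicksort(@lower), $pivot, quicksort(@higher));
}

print join(",", quicksort(4, 2, 9, 6, 5, 1, 8, 7, 3)), "\n";    # 1,2,3,4,5,6,7,8,9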

Merge Sort - perl 5.8 onwards

So all of that wasn't entirely relevant for those of you who have upgraded your perl version since 2002, because since perl 5.8 the sort function has used the merge sort algorithm.

Merge sort is also a divide and conquer algorithm. It works by splitting the list up into individual elements. Each element is compared to its neighbour and (again assuming we're using numbers and sorting them in ascending numerical order) the smaller number is put on the left and the larger on the right. Then the next pair is compared, and so on until the end of the list. Then these sorted pairs are compared to each other and merged into sorted groups of four. Then these groups are compared and merged with the other groups, and this is repeated until the whole list is in one big, sorted group. Hopefully this example will make all this clearer (again, feel free to skip this bit if you don't really care!):

Say we have a list of:

(4,2,6,5,1,8,7,3)

We then split it into sublists which contain one individual element. This means we have eight sublists of single elements:

(4)  (2)  (6)  (5)  (1)  (8)  (7)  (3)

Then we compare each element to its next-door neighbour. First up are (4) and (2): 2 is smaller so it goes in front of 4, and we have our first sorted pair, (2,4):

(2,4)  (6)  (5)  (1)  (8)  (7)  (3)

The smallest sublists are always compared first, so we compare the next two single-element sublists, (6) and (5), giving:

(2,4)  (5,6)  (1)  (8)  (7)  (3)

This is continued until all the sublists contain two sorted elements:

(2,4)  (5,6)  (1,8)  (7)  (3)

(2,4)  (5,6)  (1,8)  (3,7)

Then the pairs are merged with each other, starting with (2,4) and (5,6) and always comparing the first element of each list. First 2 and 5 are compared; 2 is smaller so it is taken out to start a new list:

(2,4)  (5,6)  (1,8)  (3,7)

(2

Then the first elements of the two sublists being merged are compared - 4 and 5. 4 is smaller so it is taken out, making:

(4)  (5,6)  (1,8)  (3,7)

(2,4

Then there's only one sublist left to merge, so its elements are taken out one by one from the front:

(5,6)  (1,8)  (3,7)

(2,4,5

(6)  (1,8)  (3,7)

(2,4,5,6)

And finally:

(2,4,5,6) (1,8) (3,7)

Then the pairs (1,8) and (3,7) are compared in the same way, with the intermediate steps being:

(2,4,5,6) (1,8) (3,7)

(1

(2,4,5,6) (8) (3,7)

(1,3

(2,4,5,6) (8) (7)

(1,3,7

(2,4,5,6) (8)

(1,3,7,8)

Giving us two sorted lists of:

(2,4,5,6) (1,3,7,8)

Finally, these two sublists are merged in the same way to give one sorted list of:

(1,2,3,4,5,6,7,8)

If that doesn't make sense, again, I highly recommend looking at the animation on the wikipedia page.
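
Here's a rough Perl sketch of merge sort too. It's the recursive split-in-half version rather than the bottom-up pairs version in the walkthrough above, but the merging step is exactly the same idea (and again, it's not perl's actual internal implementation):

use strict;
use warnings;

# Merge sort sketch: split the list in half, sort each half, then merge
# the two sorted halves by repeatedly taking the smaller front element.
sub merge_sort {
    my @list = @_;
    return @list if @list <= 1;

    my $mid   = int(@list / 2);
    my @left  = merge_sort(@list[0 .. $mid - 1]);
    my @right = merge_sort(@list[$mid .. $#list]);

    my @merged;
    while (@left && @right) {
        # <= takes equal elements from @left first, which is what makes
        # merge sort stable
        push @merged, ($left[0] <= $right[0]) ? shift(@left) : shift(@right);
    }
    return (@merged, @left, @right);    # one of @left/@right is already empty
}

print join(",", merge_sort(4, 2, 6, 5, 1, 8, 7, 3)), "\n";    # 1,2,3,4,5,6,7,8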


Why was quicksort ditched in favour of merge sort?

According to the perl docs, the quicksort algorithm used was unstable because it can become quadratic. What????

A stable sort preserves the order of list elements that compare equal so maybe the quicksort algorithm kept sorting list items that were exactly the same as each other, which made the run time and complexity higher.

The run time of both of these algorithms comes out as O(n log n) when averaged over all arrays of length n. However, quicksort's run time depends on its pivot choices, and on unlucky inputs (for the implementation perl used, ones with lots of equal elements) it can degrade to O(n^2) - a quadratic, because the n is squared. Merge sort's worst case stays at O(n log n), and it also preserves the order of equal elements, so quicksort was replaced.
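
To see what "stable" means in practice, here's a little sketch with some made-up records, sorting only on the number and checking that ties keep their original order (perl's merge sort does this, and you can ask for it explicitly with use sort 'stable';):

use strict;
use warnings;

# Made-up records: sort by the score only and watch what happens to ties.
my @records = (
    [ "alice", 2 ],
    [ "bob",   1 ],
    [ "carol", 2 ],    # same score as alice
    [ "dave",  1 ],    # same score as bob
);

# A stable sort keeps bob before dave and alice before carol,
# because that's the order they started in.
my @sorted = sort { $a->[1] <=> $b->[1] } @records;

print "$_->[0] $_->[1]\n" for @sorted;
# bob 1
# dave 1
# alice 2
# carol 2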

The perl docs do say that for "some inputs" on "some platforms" the original quicksort was faster, but no other information is given - I'm not sure which inputs or which platforms, but I'm sure that these are a minority. Also as an extra note, I think that 99.99% of the time, any time or performance differences will not be large enough to be noticeable anyway.



7 comments:

  1. I'm quite enjoying your perl blog. I've found that many perl programmers are rather long in the tooth, and as such no longer cover some of the topics that you've touched on. It's quite refreshing!

    Cheers, pete / http://petermblair.com

  2. You have to profile your code given the task you are trying to tackle. When you say, "but I'm sure that these are a minority..." you are making a generalization, an unfounded one at that. The fact is that the choice of algorithm can lessen the problem space -- that is, you know that algorithm A *should* perform better than algorithm B -- but that is little more than a starting point. Some small unanticipated factor may skew your results and demand a switch in your algorithmic choice for the implementation.

    Perl 5 can be extended through XS modules so if you find that you are unhappy with the particular sorting algorithm (or anything else that is performing like crud) employed you can write your own in C and call it to your heart's content. Your blog post appears to omit that. Profiling should tell you when that needs to occur. It doesn't matter if you are talking about Perl, PHP, Scala, or whatever -- you are going to find yourself in the same boat here. The only place on this page that I find the word 'profile' is under your picture.

    Furthermore, your article should be classified under 'generic coverage of sorting algorithms' rather than Perl. Core Perl implemented a particular algorithm, found a problem with that, and implemented a different one, but properly indicated that in the POD. The coverage of WHY the switch was made may be entirely too involved for a POD.

    My recommendation, if you insist upon blogging about Perl, is that you spend more time researching or asking questions of people in the Perl community that have a better understanding of it. Try PerlMonks.org.

  3. Joel, her blog is not about being a Perl expert, it's about readers getting a window into her journey of learning Perl. Her explanation made sense to me in a way that a lot of Perl documentation doesn't. Too much Perl documentation explains things by using terms that themselves need explaining. If you're not a computer expert before you start learning Perl, it can be daunting. I've learned about two different ways that Perl approaches sort today, and it was really helpful. I'm not going to be "writing Perl extended modules in C."

    Replies
    1. I completely agree with what you said about perl documentation and I'm glad that this blog helped you!

  4. 'sort { $a cmp $b } @array' is exactly the same as 'sort @array'; it's also the same as 'sort sub{ $a cmp $b } @array'.

    The code that is in the braces ('{ }') is ordinary Perl code, it could be any code, it could also be a subroutine name 'sub example { $a cmp $b } ... sort example @array'.

    For example if the array is a list of hexadecimal numbers 'sort { hex($a) <=> hex($b) } @array'.

    '<=>' returns -1 if the left value is lower numerically, 0 if the two values are equal, 1 if the right value is lower.

    'cmp' does the same thing but compares them by their codepoints.
    A is before Z, but Z is before a. For single characters it is the same as 'ord($a) <=> ord($b)'.

    Also you can change the sort type using the 'sort' pragma.

    A more efficient way to use sort, if there is a sizeable overhead in the code block given to sort, is to use what is called a Schwartzian transform (known as decorate-sort-undecorate in Lisp).
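
    A minimal sketch of that decorate-sort-undecorate idea - sorting short strings by length here is just an illustrative stand-in for a genuinely expensive key:

    use strict;
    use warnings;

    my @words = ("pear", "fig", "banana", "kiwi");

    # Plain version: length() gets called twice for every comparison.
    my @slow = sort { length($a) <=> length($b) } @words;

    # Schwartzian transform: compute each key once, sort the [key, value]
    # pairs, then throw the keys away.
    my @fast = map  { $_->[1] }                 # undecorate
               sort { $a->[0] <=> $b->[0] }     # sort by the precomputed key
               map  { [ length($_), $_ ] }      # decorate
               @words;

    print "@fast\n";    # fig pear kiwi banana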

  5. If it makes you feel better, everything you wrote (including the "please correct me if I'm wrong" parts) is correct. I would like to focus on

    > According to the perl docs, the quicksort algorithm used was unstable
    > because it can become quadratic. What????
    >
    > A stable sort preserves the order of list elements that compare equal so
    > maybe the quicksort algorithm kept sorting list items that were exactly the
    > same as each other, which made the run time and complexity higher.

    (because of the "maybe").

    You are correct that a stable sorting algorithm maintains the relative positions of equivalent elements. Sometimes, people talk about "stable algorithms" in the sense that you don't have abysmal worst case performance. It so happens that merge sort is stable in both senses of the term, and quick sort isn't stable in either sense. Unfortunately, the documents aren't all that clear here.

    Quicksort's worst case performance comes from what could be called a bug in how the original implementation chooses the pivot. If there were a lot of equivalent elements being sorted, the performance fell apart. Unfortunately, many libraries exist with this performance bug.

    However, it *is* possible to pick the pivot in a way that will never cause running time to go quadratic: http://www.cs.princeton.edu/~rs/talks/QuicksortIsOptimal.pdf .
