• Re: sorting Was: Isn't that beauty ? (no it's not)

    From cross@[email protected] (Dan Cross) to comp.lang.c on Fri Apr 10 11:35:38 2026
    From Newsgroup: comp.lang.c

    In article <10r9d06$32585$[email protected]>,
    Waldek Hebisch <[email protected]> wrote:
    Tim Rentsch <[email protected]> wrote:
    [email protected] (Waldek Hebisch) writes:

    Tim Rentsch <[email protected]> wrote:

    Michael S <[email protected]> writes:

    On Mon, 06 Apr 2026 15:13:32 -0700
    Tim Rentsch <[email protected]> wrote:

    Obviously what words (or lines) appear can affect the character
    counts, but that still doesn't change BigO. By the way you don't
    say whether you are sorting words or lines.

    This sub-thread is about sorting lines with average length of few
    dozens characters, i.e. many times longer than log2(N). That was
    stated at one of earlier posts.

    That has nothing to do with BigO, which is about asymptotic
    behavior as N goes to infinity.

    Honest Big(O) varies length of the key with N. In practical range
    key length may be constant, but fixing length gives unrealistic
    problem for Big(O) analysis: without varying key length there are
    finitely many keys and sorting is equivalent to counting how many
    times each key appears in the input.

    There's an important clarification to make here. There are two
    independent parameters: N, the number of records to be sorted (a
    record being a character string that is either a word or a line),
    and the (maximum) length of any record, which in the discussion is
    bounded above by a constant.

    There is one choice, made often to simplify problem. But there
    is quite universal choice: N is total size of input data.

    This is too simplistic.

    Analysis of sorting algorithms treats $n$ as the number of input
    _records_, only considering their size tangentially. In turn,
    records are distinguished by having some kind of "key" and the
    existence of some comparison operator, "<", that obeys the usual
    ordering properties of mathematics.

    Sorting involves establishing an order for some collection of
    records, <R_1, ..., R_n> so that R_1 < R_2 < ... < R_n. When we
    say, O(n lg n), we are referring to the number of comparisons
    required.

    Whether the comparison function itself is complex, specifically
    whether it is non-constant, is an entirely separate matter.

    "Total size of input data" doesn't enter into it at at all;
    indeed, it doesn't even make much sense conceptually: suppose I
    have some sequence of large records, but the key is just a fixed
    size integer in that record. Then the comparison is cheap;
    probably just a single machine instruction on real computers,
    despite the data itself being large.

    "Aha, but what about the cost of moving data within such a
    sequence? If the records are large, that's non-trivial and you
    must account for that." Indeed, this can be an issue; often one
    works around it by holding an auxiliary array of pointers to the
    records themselves, and sorting those. Again, exchanges here
    are cheap: just a handful of machine instructions.

    Do real-world programmers care about the actual cost of both
    the comparison function _and_ moving data around to exchange
    elements in e.g., an array when sorting? Absolutely. But
    trying to shoehorn those concerns into big-O notation is not a
    useful exercise in accounting for them.

    - Dan C.

    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From Tim Rentsch@[email protected] to comp.lang.c on Sat Apr 11 09:04:44 2026
    From Newsgroup: comp.lang.c

    [email protected] (Dan Cross) writes:

    In article <[email protected]>,
    Tim Rentsch <[email protected]> wrote:

    Michael S <[email protected]> writes:

    On Mon, 06 Apr 2026 21:00:46 -0700
    Tim Rentsch <[email protected]> wrote:

    Michael S <[email protected]> writes:

    On Mon, 06 Apr 2026 15:13:32 -0700
    Tim Rentsch <[email protected]> wrote:

    Obviously what words (or lines) appear can affect the character
    counts, but that still doesn't change BigO. By the way you don't
    say whether you are sorting words or lines.

    This sub-thread is about sorting lines with average length of few
    dozens characters, i.e. many times longer than log2(N). That was
    stated at one of earlier posts.

    That has nothing to do with BigO, which is about asymptotic behavior
    as N goes to infinity.

    So, how do you call part of the curve from 1e2 to 3e5 lines? MiddleO ?

    Consider a thought experiment. We have a qsort-like sorting function,
    for the sake of discussion let's say it uses heapsort. It is widely
    understood that heapsort is an O(N*log(N)) algorithm (not counting the
    theoretical objections mentioned in Waldek Hebisch's post and my
    response to those comments).

    Now suppose we have two distinct invocations of said function. In
    both cases the arguments are length 1000 character strings, having
    only spaces and alphabetic characters, and no duplicates between the
    strings. In one call all the strings have distinct values in the
    first six characters, and in the other call the strings are all the
    same in the first 993 characters, differing only in the last six
    characters. The call to the sorting function points back to a
    function that uses strcmp() to do its comparisons.

    The heapsort algorithm is well-known to be N log N. All the strings
    in both sorts are exactly 1000 characters, so the strcmp() calls have
    a fixed maximum time cost. And yet one of these sorts runs almost 200
    times as fast as the other.

    I think what you're seeing is a data dependency that is orthogonal to
    the time complexity of the algorithm. For example, the travelling
    salesman problem is widely known to be NP complete, and yet on some
    inputs a solution can be found very quickly.

    A useful thought experiment, but it is perhaps easier to explain
    that the asymptotic time complexity of a sorting algorithm, for
    example heapsort being O(n log n), is expressed in terms of the
    number of comparison operations required by the algorithm,
    independent of the complexity of those comparisons.

    No. This misses the essence of what I was trying to convey.

    BigO is about worst case behavior, or sometimes average case
    behavior. Other kinds of behavior can depend dramatically on the
    nature of the inputs.

    I disagree with this. The "Big-O" of an algorithm isn't about
    worst-case behavior, it's a notation for categorizing which term
    dominates in an expression describing the complexity of an
    algorithm, as the number of inputs to that algorithm grows.

    You misunderstood the point of my comment. Of course O()
    notation is about expressing an asymptotic upper bound of
    some function. The key issue here though is not about O()
    but about what function should be the one under consideration.
    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From cross@[email protected] (Dan Cross) to comp.lang.c on Sat Apr 11 19:55:10 2026
    From Newsgroup: comp.lang.c

    In article <[email protected]>,
    Tim Rentsch <[email protected]> wrote:
    [email protected] (Dan Cross) writes:

    In article <[email protected]>,
    Tim Rentsch <[email protected]> wrote:

    Michael S <[email protected]> writes:

    On Mon, 06 Apr 2026 21:00:46 -0700
    Tim Rentsch <[email protected]> wrote:

    Michael S <[email protected]> writes:

    On Mon, 06 Apr 2026 15:13:32 -0700
    Tim Rentsch <[email protected]> wrote:

    Obviously what words (or lines) appear can affect the character
    counts, but that still doesn't change BigO. By the way you don't
    say whether you are sorting words or lines.

    This sub-thread is about sorting lines with average length of few
    dozens characters, i.e. many times longer than log2(N). That was
    stated at one of earlier posts.

    That has nothing to do with BigO, which is about asymptotic behavior
    as N goes to infinity.

    So, how do you call part of the curve from 1e2 to 3e5 lines? MiddleO ?

    Consider a thought experiment. We have a qsort-like sorting function,
    for the sake of discussion let's say it uses heapsort. It is widely
    understood that heapsort is an O(N*log(N)) algorithm (not counting the
    theoretical objections mentioned in Waldek Hebisch's post and my
    response to those comments).

    Now suppose we have two distinct invocations of said function. In
    both cases the arguments are length 1000 character strings, having
    only spaces and alphabetic characters, and no duplicates between the
    strings. In one call all the strings have distinct values in the
    first six characters, and in the other call the strings are all the
    same in the first 993 characters, differing only in the last six
    characters. The call to the sorting function points back to a
    function that uses strcmp() to do its comparisons.

    The heapsort algorithm is well-known to be N log N. All the strings
    in both sorts are exactly 1000 characters, so the strcmp() calls have
    a fixed maximum time cost. And yet one of these sorts runs almost 200
    times as fast as the other.

    I think what you're seeing is a data dependency that is orthogonal to
    the time complexity of the algorithm. For example, the travelling
    salesman problem is widely known to be NP complete, and yet on some
    inputs a solution can be found very quickly.

    A useful thought experiment, but it is perhaps easier to explain
    that the asymptotic time complexity of a sorting algorithm, for
    example heapsort being O(n log n), is expressed in terms of the
    number of comparison operations required by the algorithm,
    independent of the complexity of those comparisons.

    No. This misses the essence of what I was trying to convey.

    Then what you are trying to convey is misleading. Note that I
    was making factual statements.

    BigO is about worst case behavior, or sometimes average case
    behavior. Other kinds of behavior can depend dramatically on the
    nature of the inputs.

    I disagree with this. The "Big-O" of an algorithm isn't about
    worst-case behavior, it's a notation for categorizing which term
    dominates in an expression describing the complexity of an
    algorithm, as the number of inputs to that algorithm grows.

    You misunderstood the point of my comment. Of course O()
    notation is about expressing an asymptotic upper bound of
    some function. The key issue here though is not about O()
    but about what function should be the one under consideration.

    The misunderstanding is yours, I'm afraid.

    You wrote, "BigO is about worst case behavior." That is
    definitionally incorrect, and it's not really something that is
    up for debate or open to alternative interpretations.

    If you want to account for the complexity of a comparison
    function, and you can characterize it in some useful way, you
    can of course incorporate that into the analysis. For example,
    string comparison is linear in the size of the input strings;
    if one denotes that as m, then the time required to sort strings
    using heapsort might be described as O(m*n*log(n)).

    - Dan C.

    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From Tim Rentsch@[email protected] to comp.lang.c on Sat Apr 11 21:32:21 2026
    From Newsgroup: comp.lang.c

    [email protected] (Dan Cross) writes:

    In article <[email protected]>,
    Tim Rentsch <[email protected]> wrote:

    [email protected] (Waldek Hebisch) writes:

    Tim Rentsch <[email protected]> wrote:

    Michael S <[email protected]> writes:

    On Mon, 06 Apr 2026 15:13:32 -0700
    Tim Rentsch <[email protected]> wrote:

    Obviously what words (or lines) appear can affect the character
    counts, but that still doesn't change BigO. By the way you don't
    say whether you are sorting words or lines.

    This sub-thread is about sorting lines with average length of few
    dozens characters, i.e. many times longer than log2(N). That was
    stated at one of earlier posts.

    That has nothing to do with BigO, which is about asymptotic
    behavior as N goes to infinity.

    Honest Big(O) varies length of the key with N. In practical range
    key length may be constant, but fixing length gives unrealistic
    problem for Big(O) analysis: without varying key length there are
    finitely many keys and sorting is equivalent to counting how many
    times each key appears in the input.

    There's an important clarification to make here. There are two
    independent parameters: N, the number of records to be sorted (a
    record being a character string that is either a word or a line),
    and the (maximum) length of any record, which in the discussion is
    bounded above by a constant.

    What is being asked about is the behavior as a function of N as N
    increases without bound. Of course, theoretically, as the number of
    records increases without bound, eventually the character strings
    being sorted will have to have duplicates. But long before that
    happens the index variable N will run out of bits. This property is
    well understood in theoretical computer science, not just in terms
    of how much time is used but how much storage is needed. In theory
    log N bits are needed just to hold the index pointers. It is
    customary though, outside of purely theoretical discussions, to
    ignore that and treat the size of an index or pointer variable as
    constant. In purely theoretical terms no sorting algorithm is
    O(N*log(N)), because just incrementing a pointer takes more than
    O(1) operations. Surely the discussions in Knuth's books take such
    things into consideration.

    If by "Knuth's books" you're referring to TAOCP, then he does
    not seem to give it too much attention. [...]

    In most of the chapter on Sorting, TAOCP uses the number of
    comparisons as the basis of comparison. But not everywhere
    in the chapter.

    My statement was not meant to be limited to the discussion of
    Sorting.

    On the practical side, which almost
    always covers discussions that take place in usenet newsgroups,
    these minor theoretical issues are ignored. Any actual computer in
    the physical universe will never have occasion to process more than
    2**512 records, due to the limitation of the number of elementary
    particles in the universe, so a 512-bit address (or index value)
    always suffices.

    So yes, in theory, the considerations around processing an enormous
    number of values are relevant. In the practical context of the
    discussion underway here, they aren't.

    Indeed. As Rob Pike once put it, "Fancy algorithms are slow
    when $n$ is small, and $n$ is usually small. Fancy algorithms
    have big constants. Until you know that $n$ is frequently going
    to get big, don't get fancy."

    Whether the Rob Pike advisory is applicable or not is beside the
    point. My comment was about fancy mathematics, not fancy
    algorithms. My statement is just as applicable to Tim Sort (one
    of the fancier sorting algorithms) as it is to Bubble Sort.
    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From cross@[email protected] (Dan Cross) to comp.lang.c on Sun Apr 12 04:59:39 2026
    From Newsgroup: comp.lang.c

    In article <[email protected]>,
    Tim Rentsch <[email protected]> wrote:
    [email protected] (Dan Cross) writes:
    In article <[email protected]>,
    Tim Rentsch <[email protected]> wrote:
    [email protected] (Waldek Hebisch) writes:
    Tim Rentsch <[email protected]> wrote:
    Michael S <[email protected]> writes:
    On Mon, 06 Apr 2026 15:13:32 -0700
    Tim Rentsch <[email protected]> wrote:
    Obviously what words (or lines) appear can affect the character
    counts, but that still doesn't change BigO. By the way you don't
    say whether you are sorting words or lines.

    This sub-thread is about sorting lines with average length of few
    dozens characters, i.e. many times longer than log2(N). That was
    stated at one of earlier posts.

    That has nothing to do with BigO, which is about asymptotic
    behavior as N goes to infinity.

    Honest Big(O) varies length of the key with N. In practical range
    key length may be constant, but fixing length gives unrealistic
    problem for Big(O) analysis: without varying key length there are
    finitely many keys and sorting is equivalent to counting how many
    times each key appears in the input.

    There's an important clarification to make here. There are two
    independent parameters: N, the number of records to be sorted (a
    record being a character string that is either a word or a line),
    and the (maximum) length of any record, which in the discussion is
    bounded above by a constant.

    What is being asked about is the behavior as a function of N as N
    increases without bound. Of course, theoretically, as the number of
    records increases without bound, eventually the character strings
    being sorted will have to have duplicates. But long before that
    happens the index variable N will run out of bits. This property is
    well understood in theoretical computer science, not just in terms
    of how much time is used but how much storage is needed. In theory
    log N bits are needed just to hold the index pointers. It is
    customary though, outside of purely theoretical discussions, to
    ignore that and treat the size of an index or pointer variable as
    constant. In purely theoretical terms no sorting algorithm is
    O(N*log(N)), because just incrementing a pointer takes more than
    O(1) operations. Surely the discussions in Knuth's books take such
    things into consideration.

    If by "Knuth's books" you're referring to TAOCP, then he does
    not seem to give it too much attention. [...]

    In most of the chapter on Sorting, TAOCP uses the number of
    comparisons as the basis of comparison. But not everywhere
    in the chapter.

    It seems like I pointed out a few places where he acknowledges
    a more complex picture. Are there other places to which you are
    referring?

    My statement was not meant to be limited to the discussion of
    Sorting.

    What do you think I was referring to, exactly? I was responding
    to your comments about Knuth's books, specifically, and the
    quoted text above, which seems concerned solely with sorting.

    As I mentioned, Dasgupta et al _do_ mention that analysis of
    algorithms is more complex than most treatments, because of
    precisely the idea that as things grow, seemingly constant
    operations are no longer constant. As I mentioned, they did
    this within the context of Fibonacci numbers, not sorting, but
    the point stands. Since, as you say, your statement was not
    meant to be limited to discussions of sorting, then it seems to
    be supporting what you are saying.

    On the practical side, which almost
    always covers discussions that take place in usenet newsgroups,
    these minor theoretical issues are ignored. Any actual computer in
    the physical universe will never have occasion to process more than
    2**512 records, due to the limitation of the number of elementary
    particles in the universe, so a 512-bit address (or index value)
    always suffices.

    So yes, in theory, the considerations around processing an enormous
    number of values are relevant. In the practical context of the
    discussion underway here, they aren't.

    Indeed. As Rob Pike once put it, "Fancy algorithms are slow
    when $n$ is small, and $n$ is usually small. Fancy algorithms
    have big constants. Until you know that $n$ is frequently going
    to get big, don't get fancy."

    Whether the Rob Pike advisory is applicable or not is beside the
    point.

    On the contrary; I mentioned it because it supports your thesis.

    My comment was about fancy mathematics, not fancy
    algorithms. My statement is just as applicable to Tim Sort (one
    of the fancier sorting algorithms) as it is to Bubble Sort.

    There's nothing particularly fancy about it, but that aside, I'm
    honestly not sure what exactly I said that you are (apparently?)
    disagreeing with.

    I was responding with a specific statement about Knuth's books,
    a reference to another book in support of your statement, and
    yet another reference to something that Pike had written that,
    again, supports your point.

    - Dan C.

    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From Tim Rentsch@[email protected] to comp.lang.c on Sun Apr 12 06:17:12 2026
    From Newsgroup: comp.lang.c

    Michael S <[email protected]> writes:

    On Thu, 9 Apr 2026 21:15:14 -0000 (UTC)
    [email protected] (Dan Cross) wrote:
    [...]
    Indeed. As Rob Pike once put it, "Fancy algorithms are slow
    when $n$ is small, and $n$ is usually small. Fancy algorithms
    have big constants. Until you know that $n$ is frequently going
    to get big, don't get fancy."

    One case where these considerations are not at all theoretical, and
    where simple textbook quicksort performs very slowly precisely
    because each lexicographic comparison takes more and more time as
    the sort progresses, is the sorting at the core of the
    Burrows-Wheeler transform, which in turn is at the core of various
    compression schemes, including bzip2. The problem hits worst when
    the data set compresses well.

    I'm curious to know if you can quantify this to some degree, and
    if so how much. I don't have any experience with Burrows-Wheeler
    transform or (any internals of) bzip2.

    In the specific case of bzip2, they limited the block size to 900KB,
    which is quite low, and did preprocessing on the input data, which
    often seriously impacts the quality of compression. I can't say for
    sure, but it seems to me that the reason was exactly that: avoiding
    prohibitively slow sorting. Had they had the time and desire to use
    "fancy algorithms", either combinations of bucket and non-bucket
    variants of radix sort, or a quicksort that memorizes common
    prefixes, or even a combination of all three, then they would not
    have needed preprocessing and small blocks, and would have ended up
    with better compression ratios.

    There could be other reasons for wanting to limit the block size.
    Have you done any web searches or tried to investigate in some
    other ways?
    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From Tim Rentsch@[email protected] to comp.lang.c on Sun Apr 12 07:13:41 2026
    From Newsgroup: comp.lang.c

    [email protected] (Waldek Hebisch) writes:

    Tim Rentsch <[email protected]> wrote:

    [email protected] (Waldek Hebisch) writes:

    Tim Rentsch <[email protected]> wrote:

    Michael S <[email protected]> writes:

    On Mon, 06 Apr 2026 15:13:32 -0700
    Tim Rentsch <[email protected]> wrote:

    Obviously what words (or lines) appear can affect the character
    counts, but that still doesn't change BigO. By the way you don't
    say whether you are sorting words or lines.

    This sub-thread is about sorting lines with average length of few
    dozens characters, i.e. many times longer than log2(N). That was
    stated at one of earlier posts.

    That has nothing to do with BigO, which is about asymptotic
    behavior as N goes to infinity.

    Honest Big(O) varies length of the key with N. In practical range
    key length may be constant, but fixing length gives unrealistic
    problem for Big(O) analysis: without varying key length there are
    finitely many keys and sorting is equivalent to counting how many
    times each key appears in the input.

    There's an important clarification to make here. There are two
    independent parameters: N, the number of records to be sorted (a
    record being a character string that is either a word or a line),
    and the (maximum) length of any record, which in the discussion is
    bounded above by a constant.

    There is one choice, made often to simplify problem. But there
    is quite universal choice: N is total size of input data.

    No, N is usually the number of records to be sorted, not the size of
    the input. Also, in the discussion I was responding to, there were
    clearly two distinct axes relevant to the discussion.

    What is being asked about is the behavior as a function of N as N
    increases without bound. Of course, theoretically, as the number of
    records increases without bound, eventually the character strings
    being sorted will have to have duplicates. But long before that
    happens the index variable N will run out of bits. This property is
    well understood in theoretical computer science, not just in terms
    of how much time is used but how much storage is needed. In theory
    log N bits are needed just to hold the index pointers. It is
    customary though, outside of purely theoretical discussions, to
    ignore that and treat the size of an index or pointer variable as
    constant. In purely theoretical terms no sorting algorithm is
    O(N*log(N)), because just incrementing a pointer takes more than
    O(1) operations. Surely the discussions in Knuth's books take such
    things into consideration. On the practical side, which almost
    always covers discussions that take place in usenet newsgroups,
    these minor theoretical issues are ignored. Any actual computer in
    the physical universe will never have occasion to process more than
    2**512 records, due to the limitation of the number of elementary
    particles in the universe, so a 512-bit address (or index value)
    always suffices.

    So yes, in theory, the considerations around processing an enormous
    number of values are relevant. In the practical context of the
    discussion underway here, they aren't.

    It was you who insisted on asymptotic complexity, rejecting
    practical amendments...

    This statement isn't right. BigO was already part of the discussion
    when I joined in. Also, it is customary in discussion of sorting
    algorithms to use the metric of number of comparisons done, without
    regard to the size of the variables needed to hold the indices of
    the records being sorted. See Knuth chapter 5, on Sorting, in
    volume 3 of TAOCP.
    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From Tim Rentsch@[email protected] to comp.lang.c on Sun Apr 12 11:16:50 2026
    From Newsgroup: comp.lang.c

    Tim Rentsch <[email protected]> writes:

    [...]

    After my previous posting I got curious about how different sorting
    algorithms behave for "nearly sorted" inputs. I decided to run some
    trials for some of the sorting functions given in my last posting.
    Here are the results:

    Average number of compares for sorting 100 elements, with one
    element out of place:

    insertion sort 132.647
    gnome sort 166.293
    ping pong sort 180.177
    knuth bubble sort 956.500

    Average number of compares for sorting 100 elements, with two
    elements out of place:

    insertion sort 166.126
    gnome sort 233.253
    ping pong sort 234.163
    knuth bubble sort 1566.487

    My conclusion, based on the above, is that bubble sort has nothing
    to recommend it as a practical sorting algorithm, even considering
    the refinement of early pruning (called the "knuth bubble sort" in
    the list above). The gnome sort is both simpler and better in terms
    of performance.

    Incidentally, some years ago I devised a variation of gnome sort,
    which I call "leaping gnome sort". A leaping gnome sort is like
    gnome sort when going right to left, that is, when the elements
    being looked at are out of order, swap them and move one place to
    the left. When no swap is needed, the gnome "leaps" to the last
    place a rightward shift was initiated. In terms of code, leaping
    gnome sort can be written as follows:

    void
    leaping_gnome_sort( unsigned n, Element a[n] ){
        for(
            unsigned i = 1, j = 2;
            i < n;
            i = follows(i-1,i,a) ? swap(i-1,i,a), i > 1 ? i-1 : j++ : j++
        ){}
    }

    The "leaping" behavior is evident whenever the current location 'i'
    gets the value 'j++' rather than 'i-1'.

    Of course this algorithm could be written using a while() loop
    rather than a for() loop:

    void
    leaping_gnome_sort( unsigned n, Element a[n] ){
        unsigned i = 1, j = 2;
        while( i < n ){
            if( !follows(i-1,i,a) ) i = j++;
            else {
                swap(i-1,i,a);
                i = i > 1 ? i-1 : j++;
            }
        }
    }

    In either case it's nice having only a single loop structure rather
    than an outer loop and a second, nested loop.
    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From cross@[email protected] (Dan Cross) to comp.lang.c on Mon Apr 13 20:44:07 2026
    From Newsgroup: comp.lang.c

    In article <[email protected]>,
    Tim Rentsch <[email protected]> wrote:
    [email protected] (Waldek Hebisch) writes:
    Tim Rentsch <[email protected]> wrote:
    [email protected] (Waldek Hebisch) writes:
    Tim Rentsch <[email protected]> wrote:
    Michael S <[email protected]> writes:
    On Mon, 06 Apr 2026 15:13:32 -0700
    Tim Rentsch <[email protected]> wrote:

    Obviously what words (or lines) appear can affect the character
    counts, but that still doesn't change BigO. By the way you don't
    say whether you are sorting words or lines.

    This sub-thread is about sorting lines with average length of few
    dozens characters, i.e. many times longer than log2(N). That was
    stated at one of earlier posts.

    That has nothing to do with BigO, which is about asymptotic
    behavior as N goes to infinity.

    Honest Big(O) varies length of the key with N. In practical range
    key length may be constant, but fixing length gives unrealistic
    problem for Big(O) analysis: without varying key length there are
    finitely many keys and sorting is equivalent to counting how many
    times each key appears in the input.

    There's an important clarification to make here. There are two
    independent parameters: N, the number of records to be sorted (a
    record being a character string that is either a word or a line),
    and the (maximum) length of any record, which in the discussion is
    bounded above by a constant.

    There is one choice, made often to simplify problem. But there
    is quite universal choice: N is total size of input data.

    No, N is usually the number of records to be sorted, not the size of
    the input. Also, in the discussion I was responding to, there were
    clearly two distinct axes relevant to the discussion.

    What is being asked about is the behavior as a function of N as N
    increases without bound. Of course, theoretically, as the number of
    records increases without bound, eventually the character strings
    being sorted will have to have duplicates. But long before that
    happens the index variable N will run out of bits. This property is
    well understood in theoretical computer science, not just in terms
    of how much time is used but how much storage is needed. In theory
    log N bits are needed just to hold the index pointers. It is
    customary though, outside of purely theoretical discussions, to
    ignore that and treat the size of an index or pointer variable as
    constant. In purely theoretical terms no sorting algorithm is
    O(N*log(N)), because just incrementing a pointer takes more than
    O(1) operations. Surely the discussions in Knuth's books take such
    things into consideration. On the practical side, which almost
    always covers discussions that take place in usenet newsgroups,
    these minor theoretical issues are ignored. Any actual computer in
    the physical universe will never have occasion to process more than
    2**512 records, due to the limitation of the number of elementary
    particles in the universe, so a 512-bit address (or index value)
    always suffices.

    So yes, in theory, the considerations around processing an enormous
    number of values are relevant. In the practical context of the
    discussion underway here, they aren't.

    It was you who insisted on asymptotic complexity, rejecting
    practical amendments...

    This statement isn't right. BigO was already part of the discussion
    when I joined in. Also, it is customary in discussion of sorting
    algorithms to use the metric of number of comparisons done, without
    regard to the size of the variables needed to hold the indices of
    the records being sorted.
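    The comparison-count metric is easy to make concrete: route every
    comparison through an instrumented function and tally the calls. A
    sketch (the counter and wrapper function are my framing, not
    anything from the thread):

```c
/* Count the comparisons a sort performs by instrumenting the
 * comparison function itself; the tally -- not the width of the
 * index variables -- is the conventional cost metric for sorting. */
#include <stddef.h>
#include <stdlib.h>

static unsigned long ncompares;

static int counted_cmp(const void *pa, const void *pb)
{
    int a = *(const int *)pa, b = *(const int *)pb;
    ncompares++;
    return (a > b) - (a < b);
}

/* Sort n ints with qsort and report how many comparisons it used. */
unsigned long sort_counting_compares(int *a, size_t n)
{
    ncompares = 0;
    qsort(a, n, sizeof a[0], counted_cmp);
    return ncompares;
}
```

    Nothing here depends on how many bits an index needs, which is the
    point: the metric deliberately abstracts that away.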

    Hmm, this response echoes almost exactly what I wrote in
    <10ranaa$ihf$[email protected]> on April 10th, also in
    response to Waldek. Did you see it?

    See Knuth chapter 5, on Sorting, in volume 3 of TAOCP.

    As for TAOCP, as much as I respect and admire Knuth, I don't
    think "The Art of Computer Programming" is a good reference for
    working programmers. At 391 pages long, Chapter Five occupies
    more than half of an ~750 page book, the examples are all in
    assembly language for his notional MIX computer, and the
    analysis he presents presumes a mathematics background that,
    frankly, most programmers neither have nor need. "See chapter
    5" is thus not useful as a reference, and rather comes across as
    an admonishment.

    A much more useful treatment is Cormen, Leiserson, Rivest, and
    Stein's "Introduction to Algorithms." The treatment is much
    more accessible than TAOCP, while still rigorous. Disclaimer:
    I once served as an industry mentor for students taking one of
    Leiserson's classes, though I haven't worked with him closely.

    CLRS suggests that Aho, Hopcroft, and Ullman advocated for
    asymptotic analysis of algorithms in "The Design and Analysis
    of Computer Algorithms". Again, it's a wonderful book, but I
    would argue it's more theoretical than most programmers would
    find comfortable. Disclaimer: Aho taught me compilers.

    Personally, I like Skiena's "Algorithm Design Manual" and think
    that its treatment is among the best I've seen when it comes to
    threading the needle between rigorous attention to detail and
    accessibility, though one needs more mathematics to negotiate it
    than e.g., CLRS.

    - Dan C.

    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From Tim Rentsch@[email protected] to comp.lang.c on Thu Apr 16 10:23:23 2026
    From Newsgroup: comp.lang.c

    Michael S <[email protected]> writes:

    On Tue, 07 Apr 2026 02:12:49 -0700
    Tim Rentsch <[email protected]> wrote:


    <snip>

    My conclusions...

    The right place to start on sorting algorithms is with bubble
    sort, not because it's a good algorithm, but because it's the
    easiest one to show, and, what is more important, it's a good way
    to teach about how algorithms evolve.

    If you write it in terms of moves rather than in terms of swaps,
    bubble sort is no easier than insertion or selection sort.

    A reason to use swap() is that swap is easier to understand
    than moves. Each swap() operation, all by itself, preserves
    the invariant that the array contains the same elements after
    the swap() is done that were there before. With code written
    using moves, several statements together need to be looked at
    to verify that they are doing the right thing. Furthermore,
    when used in conjunction with follows(), whenever we see the
    two together written like this

    if( follows(x,y,array) ) swap(x,y,array);

    we know that both (1) the array has the same elements as it did
    before, and (2) the "sortedness" of array elements has increased,
    or at least has remained unchanged. Compare that to an insertion
    sort written using moves, where a whole loop body (including a
    'break;' no less), plus a statement after, needs to be examined
    to verify that it is doing something sensible. Clearly less
    mental effort is needed to see that the one line written above is
    working than what is needed for the eight lines seen in the
    insertion sort shown in your (much) earlier posting.
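    The follows()/swap() discipline above can be sketched as a complete
    bubble sort. The two names come from the discussion; the concrete
    signatures (index pairs into an int array) are an assumption on my
    part:

```c
/* Bubble sort built from the two primitives discussed above.
 * follows(i, j, a) asks whether a[i] ought to come after a[j]
 * (i.e., with i < j the pair is out of order), and swap()
 * exchanges two elements.  Each guarded swap preserves the
 * multiset of elements and never decreases sortedness. */
#include <stddef.h>

static int follows(size_t i, size_t j, const int *a)
{
    return a[i] > a[j];          /* a[i] belongs after a[j] */
}

static void swap(size_t i, size_t j, int *a)
{
    int t = a[i];
    a[i] = a[j];
    a[j] = t;
}

void bubble_sort(int *a, size_t n)
{
    size_t i, j;

    for (i = 0; i + 1 < n; i++)
        for (j = 0; j + 1 < n - i; j++)
            if (follows(j, j + 1, a))
                swap(j, j + 1, a);
}
```

    Every mutation of the array goes through the one guarded line, so
    both invariants (same elements, non-decreasing sortedness) can be
    checked in a single place.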

    There is no reason to use the knuth version of bubble sort.
    Compared to ordinary bubble sort the upside is small and not
    worth the downside of needing more intricate code.

    The knuth version has at least one merit. You claim that in
    practice it's too rare for that merit to be exercised. Maybe that
    is true. Or maybe almost-sorted arrays whose misplaced elements
    come in pairs in close proximity are not that uncommon. I don't
    know. But the merit exists.

    I posted a performance comparison of several sorts including knuth
    bubble sort. It's true that knuth bubble sort is better than plain
    bubble sort. But it's still awful compared to the even simpler
    gnome sort.

    OTOH, 'ordinary bubble sort' has no merits at all relative to
    simple insertion sort.

    Simple bubble sort is easier to understand than insertion sort.

    For those who don't like bubble sort, there is gnome sort, which
    is even simpler than bubble sort, and actually has better
    performance. And, if people will forgive me, it's only a short
    leap to get to leaping gnome sort, which has better performance
    still.
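    For reference, gnome sort really is a single loop with one index.
    A sketch of both variants follows; the "leaping" version is my
    reading of the usual optimization (remember the furthest point
    reached and leap back to it after carrying an element down), not
    necessarily the exact algorithm meant above:

```c
/* Plain gnome sort: one index walks the array, stepping back
 * whenever it swaps an out-of-order pair. */
#include <stddef.h>

void gnome_sort(int *a, size_t n)
{
    size_t i = 1;

    while (i < n) {
        if (i == 0 || a[i - 1] <= a[i]) {
            i++;                    /* in order: step forward */
        } else {
            int t = a[i - 1];       /* out of order: swap, step back */
            a[i - 1] = a[i];
            a[i] = t;
            i--;
        }
    }
}

/* "Leaping" variant: remember the furthest point reached and leap
 * straight back to it once an element has been carried down, rather
 * than re-walking the already sorted prefix one step at a time. */
void leaping_gnome_sort(int *a, size_t n)
{
    size_t i = 1, leap = 2;

    while (i < n) {
        if (i == 0 || a[i - 1] <= a[i]) {
            i = leap++;             /* leap back to the frontier */
        } else {
            int t = a[i - 1];
            a[i - 1] = a[i];
            a[i] = t;
            i--;
        }
    }
}
```

    The leaping variant behaves much like insertion sort with swaps:
    the carry-down loop still moves one element at a time, but the
    return trip is free.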

    BTW, thank you for 'boustrophedon'. Not that there is a
    serious chance that I will remember the word tomorrow or even
    tonight, but hopefully I'll remember that such a word exists.

    It comes from Greek, meaning roughly "ox plow turning". Maybe
    that will help you remember; I think it does for me.
    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From Tim Rentsch@[email protected] to comp.lang.c on Sat Apr 25 15:47:28 2026
    From Newsgroup: comp.lang.c

    [email protected] (Dan Cross) writes:

    In article <[email protected]>,
    Tim Rentsch <[email protected]> wrote:

    [email protected] (Waldek Hebisch) writes:

    Tim Rentsch <[email protected]> wrote:

    [email protected] (Waldek Hebisch) writes:

    Tim Rentsch <[email protected]> wrote:

    Michael S <[email protected]> writes:

    On Mon, 06 Apr 2026 15:13:32 -0700
    Tim Rentsch <[email protected]> wrote:

    Obviously what words (or lines) appear can affect the character
    counts, but that still doesn't change BigO. By the way you don't
    say whether you are sorting words or lines.

    This sub-thread is about sorting lines with average length of few
    dozens characters, i.e. many times longer than log2(N). That was
    stated at one of earlier posts.

    That has nothing to do with BigO, which is about asymptotic
    behavior as N goes to infinity.

    Honest Big(O) varies the length of the key with N. In the practical
    range the key length may be constant, but fixing the length gives an
    unrealistic problem for Big(O) analysis: without varying the key
    length there are finitely many keys, and sorting is equivalent to
    counting how many times each key appears in the input.

    There's an important clarification to make here. There are two
    independent parameters: N, the number of records to be sorted (a
    record being a character string that is either a word or a line),
    and the (maximum) length of any record, which in the discussion is
    bounded above by a constant.

    That is one choice, often made to simplify the problem. But there
    is a quite universal choice: N is the total size of the input data.

    No, N is usually the number of records to be sorted, not the size of
    the input. Also, in the discussion I was responding to, there were
    clearly two distinct axes relevant to the discussion.

    What is being asked about is the behavior as a function of N as N
    increases without bound. Of course, theoretically, as the number of
    records increases without bound, eventually the character strings
    being sorted will have to have duplicates. But long before that
    happens the index variable N will run out of bits. This property is
    well understood in theoretical computer science, not just in terms
    of how much time is used but how much storage is needed. In theory
    log N bits are needed just to hold the index pointers. It is
    customary though, outside of purely theoretical discussions, to
    ignore that and treat the size of an index or pointer variable as
    constant. In purely theoretical terms no sorting algorithm is
    O(N*log(N)), because just incrementing a pointer takes more than
    O(1) operations. Surely the discussions in Knuth's books take such
    things into consideration. On the practical side, which almost
    always covers discussions that take place in usenet newsgroups,
    these minor theoretical issues are ignored. Any actual computer in
    the physical universe will never have occasion to process more than
    2**512 records, due to the limitation of the number of elementary
    particles in the universe, so a 512-bit address (or index value)
    always suffices.

    So yes, in theory, the considerations around processing an enormous
    number of values are relevant. In the practical context of the
    discussion underway here, they aren't.

    It was you who insisted on asymptotic complexity, rejecting
    practical amendments...

    This statement isn't right. BigO was already part of the discussion
    when I joined in. Also, it is customary in discussion of sorting
    algorithms to use the metric of number of comparisons done, without
    regard to the size of the variables needed to hold the indices of
    the records being sorted.

    Hmm, this response echoes almost exactly what I wrote in
    <10ranaa$ihf$[email protected]> on April 10th, also in
    response to Waldek. Did you see it?

    Probably, but I don't remember it specifically.

    See Knuth chapter 5, on Sorting, in volume 3 of TAOCP.

    As for TAOCP, as much as I respect and admire Knuth, I don't
    think "The Art of Computer Programming" is a good reference for
    working programmers. At 391 pages long, Chapter Five occupies
    more than half of an ~750 page book, the examples are all in
    assembly language for his notional MIX computer, and the
    analysis he presents presumes a mathematics background that,
    frankly, most programmers neither have nor need.

    I am not recommending TAOCP as a textbook. It's a well-known work
    and respected for its theoretical treatments. Also it happens to be
    what I consulted before posting, mainly because it was handy.

    "See chapter 5" is thus not useful as a reference, and rather
    comes across as an admonishment.

    No admonishment was intended. I mentioned it only as a supporting
    work for my previous statement.
    --- Synchronet 3.21f-Linux NewsLink 1.2