Tim Rentsch <[email protected]> wrote:
[email protected] (Waldek Hebisch) writes:
Tim Rentsch <[email protected]> wrote:
Michael S <[email protected]> writes:
On Mon, 06 Apr 2026 15:13:32 -0700
Tim Rentsch <[email protected]> wrote:
Obviously what words (or lines) appear can affect the character
counts, but that still doesn't change BigO. By the way you don't
say whether you are sorting words or lines.
This sub-thread is about sorting lines with an average length of a few
dozen characters, i.e. many times longer than log2(N). That was
stated in one of the earlier posts.
That has nothing to do with BigO, which is about asymptotic
behavior as N goes to infinity.
An honest Big-O analysis varies the length of the key with N. In the
practical range the key length may be constant, but fixing the length
gives an unrealistic problem for Big-O analysis: without varying the
key length there are only finitely many keys, and sorting is equivalent
to counting how many times each key appears in the input.
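The reduction described here can be sketched directly: once the key space is finite and fixed, sorting degenerates into counting occurrences, linear in N, with no comparisons at all. A minimal illustration (Python for brevity; single-character keys are an assumption made just for the example):

```python
from collections import Counter

def counting_sort(keys, alphabet):
    """Sort keys drawn from a fixed, finite alphabet in O(N + |alphabet|).

    With the key space fixed in advance, no comparisons are needed:
    one pass counts occurrences, a second pass emits keys in order.
    """
    counts = Counter(keys)              # single O(N) counting pass
    out = []
    for k in alphabet:                  # alphabet must be given pre-sorted
        out.extend([k] * counts[k])     # emit each key counts[k] times
    return out

# With finitely many keys, sorting and counting are the same problem:
print(counting_sort(list("banana"), ["a", "b", "n"]))
```

This is exactly why a Big-O analysis that fixes the key length is arguing about a problem that is, in principle, already linear.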
There's an important clarification to make here. There are two
independent parameters: N, the number of records to be sorted (a
record being a character string that is either a word or a line),
and the (maximum) length of any record, which in the discussion is
bounded above by a constant.
There is one choice, often made to simplify the problem. But there
is a quite universal choice: N is the total size of the input data.
In article <[email protected]>,
Tim Rentsch <[email protected]> wrote:
Michael S <[email protected]> writes:
On Mon, 06 Apr 2026 21:00:46 -0700
Tim Rentsch <[email protected]> wrote:
Michael S <[email protected]> writes:
On Mon, 06 Apr 2026 15:13:32 -0700
Tim Rentsch <[email protected]> wrote:
Obviously what words (or lines) appear can affect the character
counts, but that still doesn't change BigO. By the way you don't
say whether you are sorting words or lines.
This sub-thread is about sorting lines with an average length of a few
dozen characters, i.e. many times longer than log2(N). That was
stated in one of the earlier posts.
That has nothing to do with BigO, which is about asymptotic behavior
as N goes to infinity.
So, what do you call the part of the curve from 1e2 to 3e5 lines? MiddleO?
Consider a thought experiment. We have a qsort-like sorting function,
for the sake of discussion let's say it uses heapsort. It is widely
understood that heapsort is an O(N*log(N)) algorithm (not counting the
theoretical objections mentioned in Waldek Hebisch's post and my
response to those comments).
Now suppose we have two distinct invocations of said function. In
both cases the arguments are length 1000 character strings, having
only spaces and alphabetic characters, and no duplicates between the
strings. In one call all the strings have distinct values in the
first six characters, and in the other call the strings are all the
same in the first 993 characters, differing only in the last six
characters. The call to the sorting function points back to a
function that uses strcmp() to do its comparisons.
The heapsort algorithm is well-known to be N log N. All the strings
in both sorts are exactly 1000 characters, so the strcmp() calls have
a fixed maximum time cost. And yet one of these sorts runs almost 200
times as fast as the other.
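The effect in this thought experiment can be made concrete by counting the characters a strcmp-style comparator actually examines, rather than wall-clock time. This is an illustrative sketch only (Python for brevity; six decimal digits stand in for the six distinguishing characters, and the sort itself is Python's built-in rather than heapsort):

```python
import functools
import random

CHARS_EXAMINED = 0

def strcmp_like(a, b):
    """strcmp-style three-way compare that tallies characters examined."""
    global CHARS_EXAMINED
    for x, y in zip(a, b):
        CHARS_EXAMINED += 1
        if x != y:
            return -1 if x < y else 1
    return 0

def make_strings(n, length, prefix_diff):
    """n distinct strings of `length` chars whose six distinguishing
    characters sit at the front (prefix_diff=True) or at the very end."""
    random.seed(0)
    tags = random.sample(range(10 ** 6), n)
    pad = "a" * (length - 6)
    return [f"{t:06d}" + pad if prefix_diff else pad + f"{t:06d}"
            for t in tags]

def chars_used_sorting(strings):
    """Characters examined while sorting with the counting comparator."""
    global CHARS_EXAMINED
    CHARS_EXAMINED = 0
    sorted(strings, key=functools.cmp_to_key(strcmp_like))
    return CHARS_EXAMINED

fast = chars_used_sorting(make_strings(1000, 1000, prefix_diff=True))
slow = chars_used_sorting(make_strings(1000, 1000, prefix_diff=False))
print(f"prefix-differs: {fast} chars, suffix-differs: {slow} chars")
```

The comparison count is essentially identical in both calls; only the per-comparison character work differs, which is the data dependency at issue.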
I think what you're seeing is a data dependency that is orthogonal to
the time complexity of the algorithm. For example, the travelling
salesman problem is widely known to be NP complete, and yet on some
inputs a solution can be found very quickly.
A useful thought experiment, but it is perhaps easier to explain
that the asymptotic time complexity of a sorting algorithm, for
example heapsort being O(n log n), is expressed in terms of the
number of comparison operations required by the algorithm,
independent of the complexity of those comparisons.
BigO is about worst case behavior, or sometimes average case
behavior. Other kinds of behavior can depend dramatically on the
nature of the inputs.
I disagree with this. The "Big-O" of an algorithm isn't about
worst-case behavior, it's a notation for categorizing which term
dominates in an expression describing the complexity of an
algorithm, as the number of inputs to that algorithm grows.
[email protected] (Dan Cross) writes:
In article <[email protected]>,
Tim Rentsch <[email protected]> wrote:
Michael S <[email protected]> writes:
On Mon, 06 Apr 2026 21:00:46 -0700
Tim Rentsch <[email protected]> wrote:
Michael S <[email protected]> writes:
On Mon, 06 Apr 2026 15:13:32 -0700
Tim Rentsch <[email protected]> wrote:
Obviously what words (or lines) appear can affect the character
counts, but that still doesn't change BigO. By the way you don't
say whether you are sorting words or lines.
This sub-thread is about sorting lines with an average length of a few
dozen characters, i.e. many times longer than log2(N). That was
stated in one of the earlier posts.
That has nothing to do with BigO, which is about asymptotic behavior
as N goes to infinity.
So, what do you call the part of the curve from 1e2 to 3e5 lines? MiddleO?
Consider a thought experiment. We have a qsort-like sorting function,
for the sake of discussion let's say it uses heapsort. It is widely
understood that heapsort is an O(N*log(N)) algorithm (not counting the
theoretical objections mentioned in Waldek Hebisch's post and my
response to those comments).
Now suppose we have two distinct invocations of said function. In
both cases the arguments are length 1000 character strings, having
only spaces and alphabetic characters, and no duplicates between the
strings. In one call all the strings have distinct values in the
first six characters, and in the other call the strings are all the
same in the first 993 characters, differing only in the last six
characters. The call to the sorting function points back to a
function that uses strcmp() to do its comparisons.
The heapsort algorithm is well-known to be N log N. All the strings
in both sorts are exactly 1000 characters, so the strcmp() calls have
a fixed maximum time cost. And yet one of these sorts runs almost 200
times as fast as the other.
I think what you're seeing is a data dependency that is orthogonal to
the time complexity of the algorithm. For example, the travelling
salesman problem is widely known to be NP complete, and yet on some
inputs a solution can be found very quickly.
A useful thought experiment, but it is perhaps easier to explain
that the asymptotic time complexity of a sorting algorithm, for
example heapsort being O(n log n), is expressed in terms of the
number of comparison operations required by the algorithm,
independent of the complexity of those comparisons.
No. This misses the essence of what I was trying to convey.
BigO is about worst case behavior, or sometimes average case
behavior. Other kinds of behavior can depend dramatically on the
nature of the inputs.
I disagree with this. The "Big-O" of an algorithm isn't about
worst-case behavior, it's a notation for categorizing which term
dominates in an expression describing the complexity of an
algorithm, as the number of inputs to that algorithm grows.
You misunderstood the point of my comment. Of course O()
notation is about expressing an asymptotic upper bound of
some function. The key issue here though is not about O()
but about what function should be the one under consideration.
In article <[email protected]>,
Tim Rentsch <[email protected]> wrote:
[email protected] (Waldek Hebisch) writes:
Tim Rentsch <[email protected]> wrote:
Michael S <[email protected]> writes:
On Mon, 06 Apr 2026 15:13:32 -0700
Tim Rentsch <[email protected]> wrote:
Obviously what words (or lines) appear can affect the character
counts, but that still doesn't change BigO. By the way you don't
say whether you are sorting words or lines.
This sub-thread is about sorting lines with an average length of a few
dozen characters, i.e. many times longer than log2(N). That was
stated in one of the earlier posts.
That has nothing to do with BigO, which is about asymptotic
behavior as N goes to infinity.
An honest Big-O analysis varies the length of the key with N. In the
practical range the key length may be constant, but fixing the length
gives an unrealistic problem for Big-O analysis: without varying the
key length there are only finitely many keys, and sorting is equivalent
to counting how many times each key appears in the input.
There's an important clarification to make here. There are two
independent parameters: N, the number of records to be sorted (a
record being a character string that is either a word or a line),
and the (maximum) length of any record, which in the discussion is
bounded above by a constant.
What is being asked about is the behavior as a function of N as N
increases without bound. Of course, theoretically, as the number of
records increases without bound, eventually the character strings
being sorted will have to have duplicates. But long before that
happens the index variable N will run out of bits. This property is
well understood in theoretical computer science, not just in terms
of how much time is used but how much storage is needed. In theory
log N bits are needed just to hold the index pointers. It is
customary though, outside of purely theoretical discussions, to
ignore that and treat the size of an index or pointer variable as
constant. In purely theoretical terms no sorting algorithm is
O(N*log(N)), because just incrementing a pointer takes more than
O(1) operations. Surely the discussions in Knuth's books take such
things into consideration.
If by "Knuth's books" you're referring to TAOCP, then he does
not seem to give it too much attention. [...]
On the practical side, which almost
always covers discussions that take place in usenet newsgroups,
these minor theoretical issues are ignored. Any actual computer in
the physical universe will never have occasion to process more than
2**512 records, due to the limitation of the number of elementary
particles in the universe, so a 512-bit address (or index value)
always suffices.
So yes, in theory, the considerations around processing an enormous
number of values are relevant. In the practical context of the
discussion underway here, they aren't.
Indeed. As Rob Pike once put it, "Fancy algorithms are slow
when $n$ is small, and $n$ is usually small. Fancy algorithms
have big constants. Until you know that $n$ is frequently going
to get big, don't get fancy."
[email protected] (Dan Cross) writes:
In article <[email protected]>,
Tim Rentsch <[email protected]> wrote:
[email protected] (Waldek Hebisch) writes:
Tim Rentsch <[email protected]> wrote:
Michael S <[email protected]> writes:
On Mon, 06 Apr 2026 15:13:32 -0700
Tim Rentsch <[email protected]> wrote:
Obviously what words (or lines) appear can affect the character
counts, but that still doesn't change BigO. By the way you don't
say whether you are sorting words or lines.
This sub-thread is about sorting lines with an average length of a few
dozen characters, i.e. many times longer than log2(N). That was
stated in one of the earlier posts.
That has nothing to do with BigO, which is about asymptotic
behavior as N goes to infinity.
An honest Big-O analysis varies the length of the key with N. In the
practical range the key length may be constant, but fixing the length
gives an unrealistic problem for Big-O analysis: without varying the
key length there are only finitely many keys, and sorting is equivalent
to counting how many times each key appears in the input.
There's an important clarification to make here. There are two
independent parameters: N, the number of records to be sorted (a
record being a character string that is either a word or a line),
and the (maximum) length of any record, which in the discussion is
bounded above by a constant.
What is being asked about is the behavior as a function of N as N
increases without bound. Of course, theoretically, as the number of
records increases without bound, eventually the character strings
being sorted will have to have duplicates. But long before that
happens the index variable N will run out of bits. This property is
well understood in theoretical computer science, not just in terms
of how much time is used but how much storage is needed. In theory
log N bits are needed just to hold the index pointers. It is
customary though, outside of purely theoretical discussions, to
ignore that and treat the size of an index or pointer variable as
constant. In purely theoretical terms no sorting algorithm is
O(N*log(N)), because just incrementing a pointer takes more than
O(1) operations. Surely the discussions in Knuth's books take such
things into consideration.
If by "Knuth's books" you're referring to TAOCP, then he does
not seem to give it too much attention. [...]
In most of the chapter on Sorting, TAOCP uses the number of
comparisons as the basis of comparison. But not everywhere
in the chapter.
My statement was not meant to be limited to the discussion of
Sorting.
On the practical side, which almost
always covers discussions that take place in usenet newsgroups,
these minor theoretical issues are ignored. Any actual computer in
the physical universe will never have occasion to process more than
2**512 records, due to the limitation of the number of elementary
particles in the universe, so a 512-bit address (or index value)
always suffices.
So yes, in theory, the considerations around processing an enormous
number of values are relevant. In the practical context of the
discussion underway here, they aren't.
Indeed. As Rob Pike once put it, "Fancy algorithms are slow
when $n$ is small, and $n$ is usually small. Fancy algorithms
have big constants. Until you know that $n$ is frequently going
to get big, don't get fancy."
Whether the Rob Pike advisory is applicable or not is beside the
point.
My comment was about fancy mathematics, not fancy
algorithms. My statement is just as applicable to Tim Sort (one
of the fancier sorting algorithms) as it is to Bubble Sort.
On Thu, 9 Apr 2026 21:15:14 -0000 (UTC)
[email protected] (Dan Cross) wrote:
Indeed. As Rob Pike once put it, "Fancy algorithms are slow
when $n$ is small, and $n$ is usually small. Fancy algorithms
have big constants. Until you know that $n$ is frequently going
to get big, don't get fancy."
One case where these considerations are not at all theoretical, and
where simple textbook quicksort performs very, very slowly exactly
because each lexicographic comparison takes more and more time as the
sort progresses, is the sort at the core of the Burrows-Wheeler
transform, which in turn is at the core of various compression
schemes, including bzip2. The problem hits you worst when the data set
compresses well.
In the specific case of bzip2, they limited the block size to 900KB,
which is quite low, and did preprocessing on the input data, which
often seriously impacts the quality of compression. I can't say for
sure, but it seems to me that the reason was exactly that: avoiding
prohibitively slow sorting. Had they had the time and desire to use
"fancy algorithms", either combinations of bucket and non-bucket
variants of radix sort, or a quicksort that memorizes common prefixes,
or even a combination of all three, they would not have needed the
preprocessing and small blocks, and would have ended up with better
compression ratios.
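The slowdown described for the Burrows-Wheeler transform can be seen in miniature by sorting the cyclic rotations that a naive BWT works on, and counting the characters a strcmp-style comparator examines. This is an illustrative sketch, not bzip2's actual code (Python for brevity; the input sizes are tiny and arbitrary):

```python
import functools
import random

def rotations(s):
    """All cyclic rotations of s; a naive BWT sorts exactly these."""
    return [s[i:] + s[:i] for i in range(len(s))]

def comparison_chars(strings):
    """Total characters a strcmp-style comparator examines during a sort."""
    examined = 0
    def cmp(a, b):
        nonlocal examined
        for x, y in zip(a, b):
            examined += 1
            if x != y:
                return -1 if x < y else 1
        return 0
    sorted(strings, key=functools.cmp_to_key(cmp))
    return examined

random.seed(1)
n = 256
repetitive = "ab" * (n // 2)    # highly compressible input
rand = "".join(random.choice("abcdefgh") for _ in range(n))
cheap = comparison_chars(rotations(rand))
costly = comparison_chars(rotations(repetitive))
print(f"random input: {cheap} chars, repetitive input: {costly} chars")
```

Rotations of repetitive input share enormous common prefixes (here many rotations are outright equal), so every comparison scans deep into the strings, which is exactly why highly compressible blocks hit the naive sort hardest.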
Tim Rentsch <[email protected]> wrote:
[email protected] (Waldek Hebisch) writes:
Tim Rentsch <[email protected]> wrote:
Michael S <[email protected]> writes:
On Mon, 06 Apr 2026 15:13:32 -0700
Tim Rentsch <[email protected]> wrote:
Obviously what words (or lines) appear can affect the character
counts, but that still doesn't change BigO. By the way you don't
say whether you are sorting words or lines.
This sub-thread is about sorting lines with an average length of a few
dozen characters, i.e. many times longer than log2(N). That was
stated in one of the earlier posts.
That has nothing to do with BigO, which is about asymptotic
behavior as N goes to infinity.
An honest Big-O analysis varies the length of the key with N. In the
practical range the key length may be constant, but fixing the length
gives an unrealistic problem for Big-O analysis: without varying the
key length there are only finitely many keys, and sorting is equivalent
to counting how many times each key appears in the input.
There's an important clarification to make here. There are two
independent parameters: N, the number of records to be sorted (a
record being a character string that is either a word or a line),
and the (maximum) length of any record, which in the discussion is
bounded above by a constant.
There is one choice, often made to simplify the problem. But there
is a quite universal choice: N is the total size of the input data.
What is being asked about is the behavior as a function of N as N
increases without bound. Of course, theoretically, as the number of
records increases without bound, eventually the character strings
being sorted will have to have duplicates. But long before that
happens the index variable N will run out of bits. This property is
well understood in theoretical computer science, not just in terms
of how much time is used but how much storage is needed. In theory
log N bits are needed just to hold the index pointers. It is
customary though, outside of purely theoretical discussions, to
ignore that and treat the size of an index or pointer variable as
constant. In purely theoretical terms no sorting algorithm is
O(N*log(N)), because just incrementing a pointer takes more than
O(1) operations. Surely the discussions in Knuth's books take such
things into consideration. On the practical side, which almost
always covers discussions that take place in usenet newsgroups,
these minor theoretical issues are ignored. Any actual computer in
the physical universe will never have occasion to process more than
2**512 records, due to the limitation of the number of elementary
particles in the universe, so a 512-bit address (or index value)
always suffices.
So yes, in theory, the considerations around processing an enormous
number of values are relevant. In the practical context of the
discussion underway here, they aren't.
It was you who insisted on asymptotic complexity, rejecting
practical amendments...
[email protected] (Waldek Hebisch) writes:
Tim Rentsch <[email protected]> wrote:
[email protected] (Waldek Hebisch) writes:
Tim Rentsch <[email protected]> wrote:
Michael S <[email protected]> writes:
On Mon, 06 Apr 2026 15:13:32 -0700
Tim Rentsch <[email protected]> wrote:
Obviously what words (or lines) appear can affect the character
counts, but that still doesn't change BigO. By the way you don't
say whether you are sorting words or lines.
This sub-thread is about sorting lines with an average length of a few
dozen characters, i.e. many times longer than log2(N). That was
stated in one of the earlier posts.
That has nothing to do with BigO, which is about asymptotic
behavior as N goes to infinity.
An honest Big-O analysis varies the length of the key with N. In the
practical range the key length may be constant, but fixing the length
gives an unrealistic problem for Big-O analysis: without varying the
key length there are only finitely many keys, and sorting is equivalent
to counting how many times each key appears in the input.
There's an important clarification to make here. There are two
independent parameters: N, the number of records to be sorted (a
record being a character string that is either a word or a line),
and the (maximum) length of any record, which in the discussion is
bounded above by a constant.
There is one choice, often made to simplify the problem. But there
is a quite universal choice: N is the total size of the input data.
No, N is usually the number of records to be sorted, not the size of
the input. Also, in the discussion I was responding to, there were
clearly two distinct axes relevant to the discussion.
What is being asked about is the behavior as a function of N as N
increases without bound. Of course, theoretically, as the number of
records increases without bound, eventually the character strings
being sorted will have to have duplicates. But long before that
happens the index variable N will run out of bits. This property is
well understood in theoretical computer science, not just in terms
of how much time is used but how much storage is needed. In theory
log N bits are needed just to hold the index pointers. It is
customary though, outside of purely theoretical discussions, to
ignore that and treat the size of an index or pointer variable as
constant. In purely theoretical terms no sorting algorithm is
O(N*log(N)), because just incrementing a pointer takes more than
O(1) operations. Surely the discussions in Knuth's books take such
things into consideration. On the practical side, which almost
always covers discussions that take place in usenet newsgroups,
these minor theoretical issues are ignored. Any actual computer in
the physical universe will never have occasion to process more than
2**512 records, due to the limitation of the number of elementary
particles in the universe, so a 512-bit address (or index value)
always suffices.
So yes, in theory, the considerations around processing an enormous
number of values are relevant. In the practical context of the
discussion underway here, they aren't.
It was you who insisted on asymptotic complexity, rejecting
practical amendments...
This statement isn't right. BigO was already part of the discussion
when I joined in. Also, it is customary in discussion of sorting
algorithms to use the metric of number of comparisons done, without
regard to the size of the variables needed to hold the indices of
the records being sorted.
See Knuth chapter 5, on Sorting, in volume 3 of TAOCP.
On Tue, 07 Apr 2026 02:12:49 -0700
Tim Rentsch <[email protected]> wrote:
<snip>
My conclusions...
The right place to start on sorting algorithms is with bubble
sort, not because it's a good algorithm, but because it's the
easiest one to show, and what is more important it's a good way
to teach about how algorithms evolve.
If you write it in terms of moves rather than in terms of swaps,
bubble sort is no easier than insertion or selection sort.
There is no reason to use the Knuth version of bubble sort.
Compared to ordinary bubble sort the upside is small and not
worth the downside of needing more intricate code.
The Knuth version has at least one merit. You claim that in practice
it's too rare that the merit is exercised. Maybe that is true. Or
maybe the case of an almost-sorted array with misplaced elements
coming in pairs in close proximity is not that uncommon. I don't
know. But the merit exists.
OTOH, 'ordinary bubble sort' has no merits at all relative to simple
insertion sort.
BTW, thank you for 'boustrophedon'. Not that there is a serious
chance that I will remember the word tomorrow or even tonight, but
hopefully I'll remember that such a word exists.
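One common reading of "the Knuth version" of bubble sort, and the reading assumed here, is the variant that remembers where the last swap occurred and shrinks the next pass to that point, which is what pays off on almost-sorted input with a misplaced pair. A sketch of both variants with comparison counts (Python for brevity):

```python
def bubble_sort(a):
    """Ordinary bubble sort: scan the whole unsorted prefix each pass."""
    a = list(a)
    comps = 0
    for limit in range(len(a) - 1, 0, -1):
        swapped = False
        for i in range(limit):
            comps += 1
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
                swapped = True
        if not swapped:
            break                      # a clean pass means we are done
    return a, comps

def bubble_sort_lastswap(a):
    """Variant that remembers where the last swap happened; everything
    past that point is already in place, so the next pass stops there."""
    a = list(a)
    comps = 0
    limit = len(a) - 1
    while limit > 0:
        last = 0
        for i in range(limit):
            comps += 1
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
                last = i
        limit = last                   # shrink the next pass sharply
    return a, comps

# An almost-sorted array with one misplaced adjacent pair:
nearly_sorted = list(range(100))
nearly_sorted[10], nearly_sorted[11] = nearly_sorted[11], nearly_sorted[10]
print(bubble_sort(nearly_sorted)[1], bubble_sort_lastswap(nearly_sorted)[1])
```

On this input the last-swap variant does noticeably fewer comparisons, which is the merit under discussion; on generic input the two behave essentially alike.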
In article <[email protected]>,
Tim Rentsch <[email protected]> wrote:
[email protected] (Waldek Hebisch) writes:
Tim Rentsch <[email protected]> wrote:
[email protected] (Waldek Hebisch) writes:
Tim Rentsch <[email protected]> wrote:
Michael S <[email protected]> writes:
On Mon, 06 Apr 2026 15:13:32 -0700
Tim Rentsch <[email protected]> wrote:
Obviously what words (or lines) appear can affect the character
counts, but that still doesn't change BigO. By the way you don't
say whether you are sorting words or lines.
This sub-thread is about sorting lines with an average length of a few
dozen characters, i.e. many times longer than log2(N). That was
stated in one of the earlier posts.
That has nothing to do with BigO, which is about asymptotic
behavior as N goes to infinity.
An honest Big-O analysis varies the length of the key with N. In the
practical range the key length may be constant, but fixing the length
gives an unrealistic problem for Big-O analysis: without varying the
key length there are only finitely many keys, and sorting is equivalent
to counting how many times each key appears in the input.
There's an important clarification to make here. There are two
independent parameters: N, the number of records to be sorted (a
record being a character string that is either a word or a line),
and the (maximum) length of any record, which in the discussion is
bounded above by a constant.
There is one choice, often made to simplify the problem. But there
is a quite universal choice: N is the total size of the input data.
No, N is usually the number of records to be sorted, not the size of
the input. Also, in the discussion I was responding to, there were
clearly two distinct axes relevant to the discussion.
What is being asked about is the behavior as a function of N as N
increases without bound. Of course, theoretically, as the number of
records increases without bound, eventually the character strings
being sorted will have to have duplicates. But long before that
happens the index variable N will run out of bits. This property is
well understood in theoretical computer science, not just in terms
of how much time is used but how much storage is needed. In theory
log N bits are needed just to hold the index pointers. It is
customary though, outside of purely theoretical discussions, to
ignore that and treat the size of an index or pointer variable as
constant. In purely theoretical terms no sorting algorithm is
O(N*log(N)), because just incrementing a pointer takes more than
O(1) operations. Surely the discussions in Knuth's books take such
things into consideration. On the practical side, which almost
always covers discussions that take place in usenet newsgroups,
these minor theoretical issues are ignored. Any actual computer in
the physical universe will never have occasion to process more than
2**512 records, due to the limitation of the number of elementary
particles in the universe, so a 512-bit address (or index value)
always suffices.
So yes, in theory, the considerations around processing an enormous
number of values are relevant. In the practical context of the
discussion underway here, they aren't.
It was you who insisted on asymptotic complexity, rejecting
practical amendments...
This statement isn't right. BigO was already part of the discussion
when I joined in. Also, it is customary in discussion of sorting
algorithms to use the metric of number of comparisons done, without
regard to the size of the variables needed to hold the indices of
the records being sorted.
Hmm, this response echoes almost exactly what I wrote in <10ranaa$ihf$[email protected]> on April 10th, also in
response to Waldek. Did you see it?
See Knuth chapter 5, on Sorting, in volume 3 of TAOCP.
As for TAOCP, as much as I respect and admire Knuth, I don't
think "The Art of Computer Programming" is a good reference for
working programmers. At 391 pages long, Chapter Five occupies
more than half of an ~750 page book, the examples are all in
assembly language for his notional MIX computer, and the
analysis he presents presumes a mathematics background that,
frankly, most programmers neither have nor need.
"See chapter 5" is thus not useful as a reference, and rather
comes across as an admonishment.