Abstract |
---|
Shannon's entropy is a clear lower bound for statistical compression. The situation is not so well understood for dictionary-based compression. A plausible lower bound is b, the least number of phrases in a general bidirectional parse of a text, where phrases can be copied from anywhere else in the text. Since computing b is NP-complete, a popular gold standard is z, the number of phrases in the Lempel-Ziv parse of the text, where phrases can be copied only from the left. While z can be computed in linear time, almost nothing has been known for decades about its approximation ratio with respect to b. In this paper we prove that z = O(b log(n/b)), where n is the text length. We also show that the bound is tight as a function of n, by exhibiting a string family where z = Omega(b log n). Our upper bound is obtained by building a run-length context-free grammar based on a locally consistent parsing of the text. Our lower bound is obtained by relating b to r, the number of equal-letter runs in the Burrows-Wheeler transform of the text. Along the way, we prove other relevant bounds between compressibility measures. |
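Two of the measures the abstract compares, z and r, are easy to compute naively. Below is a minimal sketch, not the paper's constructions: a quadratic-time greedy Lempel-Ziv parse counting z, where each phrase is the longest prefix of the unparsed suffix that occurs starting at an earlier position (the source may overlap the phrase), or a single fresh character when no such prefix exists; and a naive Burrows-Wheeler transform via sorted rotations counting the equal-letter runs r. The linear-time computation of z mentioned in the abstract relies on suffix-array or suffix-tree machinery omitted here; the function names and example string are illustrative.

```python
def lz_phrase_count(text: str) -> int:
    """Count phrases z in a greedy left-to-right Lempel-Ziv parse.

    Each phrase is the longest prefix of the unparsed suffix that also
    occurs starting at an earlier position (sources may overlap the
    phrase being formed), or one fresh character if no such prefix
    exists. Quadratic time, for illustration only.
    """
    n = len(text)
    i = 0  # start of the unparsed suffix
    z = 0  # number of phrases emitted so far
    while i < n:
        longest = 0
        for j in range(i):  # every candidate earlier source position
            l = 0
            while i + l < n and text[j + l] == text[i + l]:
                l += 1
            longest = max(longest, l)
        i += longest if longest > 0 else 1  # copy a phrase, or emit one character
        z += 1
    return z


def bwt_run_count(text: str) -> int:
    """Count equal-letter runs r in the Burrows-Wheeler transform,
    built naively by sorting all rotations of text plus a sentinel."""
    s = text + "\x00"  # sentinel, lexicographically smallest
    rotations = sorted(s[k:] + s[:k] for k in range(len(s)))
    bwt = [rot[-1] for rot in rotations]  # last column of the sorted rotations
    return 1 + sum(bwt[k] != bwt[k - 1] for k in range(1, len(bwt)))


if __name__ == "__main__":
    t = "alabar_a_la_alabarda"
    print("z =", lz_phrase_count(t), " r =", bwt_run_count(t))
```

On "aaaa", for instance, this parse yields z = 2 (phrases "a" and "aaa"), showing why the source of a phrase is allowed to overlap the phrase itself.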
Year | DOI | Venue |
---|---|---|
2018 | 10.1007/978-3-319-77404-6_36 | Lecture Notes in Computer Science |

Field | DocType | Volume
---|---|---
Compressibility, Binary logarithm, Discrete mathematics, Combinatorics, Upper and lower bounds, Computer science, Grammar, Parsing, Time complexity | Conference | 10807

ISSN | Citations | PageRank
---|---|---
0302-9743 | 5 | 0.44

References | Authors
---|---
17 | 3

Name | Order | Citations | PageRank |
---|---|---|---|
Travis Gagie | 1 | 643 | 63.61 |
Gonzalo Navarro | 2 | 109 | 11.07 |
Nicola Prezza | 3 | 71 | 19.38 |