Mathjax

Sunday, October 13, 2013

Impact of Corpus Size on Quality of Frequency Table

Hypothesis

The reverse Huffman steganography technique uses as input a n-character long frequency table drawn from a corpus. We know that the quality of the output increases as n grows, albeit not monotonically. n cannot grow beyond the size of the corpus and the number of branches declines as n grows leading to greater spatial inefficiency in the encoding process. We hypothesize that there are similar declines in the marginal improvement as the corpus size, s, increases.

Methodology
During the encoding process, the algorithm walks through a Huffman encoding of the frequency list to determine what tokens to emit based on a binary interpretation of the input. The actual frequency values are relatively unimportant; it is the frequency values in relation to each other that impact how the Huffman tree is generated. To measure quality, we compared the frequency tables for different sizes of s, where the different versions of the corpus are drawn from the same work.

As our distance measure, we used the Euclidean distance of the normalized frequency vectors. The tokens represent dimensions and the frequency counts are the lengths. For instance, the distance between u={a:4, b:2, c:1} and v={a:8, b:4, c:2} is zero.

Choice of Corpus
We used Darwin's The Origin of Species as the corpus for this test. Species is a fairly large work with greater than 893,000 characters in size, which allowed us to break it into 88 "chunks" of 10,149 characters. We then aggregated the chunks in increasingly large sizes via the Fibonacci sequence to arrive at nine corpora segments. The sizes of each of the corpora segments, along with sizes of representative works for comparison, are depicted in Figure 1.

Figure 1: Size of corpora increases per Fibonacci sequence
Sizes of representative works were taken from the "Plain Text UTF-8" version posted to Project Gutenberg.

For each of the nine corpora, we generated a frequency list for n=2 to 5, inclusive. We then compared the frequency lists for each pair of corpora.

Results
In the tables below, the rows and columns indicate the corpora number and the contents are the distance score for that pair (lower numbers mean that the corpora are more similar to each other).

Figure 2: Distance when n=2
Figure 3: Distance when n=3
Figure 4: Distance when n=4

Figure 5: Distance when n=5

In Figure 6 below, we show the same data as in the tables above, except the x-axis represents the ratio of size (in characters) of the two corpora being compared and the y-axis is the computed distance. Although it appears that distances converge to some average value as the ratio of sizes increase, that is almost certainly an artifact of the fact that we have fewer instances of those ratios.

Figure 6: Ratio of Corpora Sizes versus Distances
Conclusion
Figure 6 suggests that a frequency table generated from one segment of a single corpus will remain fairly distinct from a frequency table generated from another segment of the same corpus and that distinction is fairly scale invariant. This is evidence for rejecting the hypothesis. The results could be peculiar to our choice of corpus; for instance, one chapter might be written in a very different style than another chapter in the same book. However, it is also promising because it suggests that if the key is constructed from a subset of a corpus, a leak of the identity of that corpus may reveal less information to the attacker than previously assumed.

No comments:

Post a Comment