Thursday, October 31, 2013
Hide in Plain Sight add-on passes preliminary review
The Hide in Plain Sight add-on has passed preliminary review!
You can download it at: https://addons.mozilla.org/en-US/firefox/addon/hide-in-plain-sight/
Thursday, October 24, 2013
Hide in Plain Sight add-on submitted for preliminary review
The add-on is now in the AMO's review queue, but is already available at: https://addons.mozilla.org/en-US/firefox/addon/hide-in-plain-sight/ !
Wednesday, October 23, 2013
Improving the performance of reversehuffman.encrypt, Part II
In the encrypt algorithm, one stage scans the frequency table to generate a list of potential next characters based on the previous n-1 characters. In previous versions of the algorithm, this was a linear scan over the whole table. Since the frequency table is sorted in string order, an obvious optimization is to use a binary search to locate the first matching entry and then terminate the scan as soon as entries stop sharing the prefix.
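As a rough illustration of that change, here is a minimal sketch. It assumes the table is an array of strings sorted lexicographically; the names (lowerBound, candidatesWithPrefix) are illustrative and this is not the add-on's actual code:
// Sketch only: assumes `table` is an array of strings sorted lexicographically.
function lowerBound(table, prefix) {
    // Binary search for the first index whose entry sorts >= prefix.
    let lo = 0, hi = table.length;
    while (lo < hi) {
        let mid = (lo + hi) >> 1;
        if (table[mid] < prefix) {
            lo = mid + 1;
        } else {
            hi = mid;
        }
    }
    return lo;
}

function candidatesWithPrefix(table, prefix) {
    // Scan forward from the insertion point; stop at the first entry
    // that no longer starts with the prefix (early termination).
    let result = [];
    for (let i = lowerBound(table, prefix); i < table.length; i++) {
        if (table[i].slice(0, prefix.length) !== prefix) {
            break;
        }
        result.push(table[i]);
    }
    return result;
}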
The results:
Implementation | User Time |
---|---|
(Baseline) String.indexOf(...) === 0 | 1m21.828s |
String.beginsWith(...) | 1m10.723s |
beginsWith(str, prefix) | 1m11.457s |
Binary search | 0m01.713s |
Now, that's a nice result.
For completeness, my updated detailed profile results:
Statistical profiling result from /home/starr/svn/plainsight-addon/lib/v8.log, (1550 ticks, 0 unaccounted, 0 excluded).
[Shared libraries]:
ticks total nonlib name
433 27.9% 0.0% /usr/lib/libv8.so.3.14.5
16 1.0% 0.0% /lib/x86_64-linux-gnu/libc-2.17.so
13 0.8% 0.0% 7fff58bfe000-7fff58c00000
1 0.1% 0.0% /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.18
1 0.1% 0.0% /usr/bin/nodejs
1 0.1% 0.0% /lib/x86_64-linux-gnu/ld-2.17.so
[JavaScript]:
ticks total nonlib name
115 7.4% 10.6% LazyCompile: Object.defineProperty.value /home/starr/svn/plainsight-addon/lib/reversehuffman.js:28
83 5.4% 7.6% LazyCompile: huffmantree /home/starr/svn/plainsight-addon/lib/textutils.js:220
71 4.6% 6.5% LazyCompile: *InsertionSort native array.js:764
68 4.4% 6.3% LazyCompile: *nextChars /home/starr/svn/plainsight-addon/lib/reversehuffman.js:170
54 3.5% 5.0% LazyCompile: *binaryInsertionPoint /home/starr/svn/plainsight-addon/lib/textutils.js:20
53 3.4% 4.9% Stub: CompareICStub
47 3.0% 4.3% Stub: FastNewClosureStub
41 2.6% 3.8% LazyCompile: *sort native array.js:741
39 2.5% 3.6% LazyCompile: *reduce native array.js:1381
37 2.4% 3.4% Stub: CompareICStub {2}
35 2.3% 3.2% LazyCompile: *self.push /home/starr/svn/plainsight-addon/lib/priority_queue.js:129
35 2.3% 3.2% LazyCompile: *QuickSort native array.js:793
34 2.2% 3.1% LazyCompile: *self.pop /home/starr/svn/plainsight-addon/lib/priority_queue.js:57
31 2.0% 2.9% Builtin: A builtin from the snapshot
The function nextChars is where I implemented the binary search.
Improving the performance of reversehuffman.encrypt, Part I
The Reverse Huffman encryption and decryption routines are the most expensive operations within the add-on. One of my beta testers reports that it took him 108 seconds to encrypt 116 words (of Lorem Ipsum) on his computer. Let's bring that down.
I created this script for profiling use:
var fs = require("fs");
var rh = require("./reversehuffman");
var textutils = require("./textutils");
let source = fs.readFileSync("../doc/independence.txt", {encoding: "utf-8", flag: "r"});
let ft = textutils.frequencyTable(4, source);
for(let i = 0; i < 1000; i++) {
    rh.encrypt("this is some plain text", ft);
}
Running my initial implementation through the profiler, I get the following report (report cropped):
Statistical profiling result from /home/starr/svn/plainsight-addon/lib/v8.log, (77854 ticks, 0 unaccounted, 0 excluded).
[Shared libraries]:
ticks total nonlib name
31451 40.4% 0.0% /usr/lib/libv8.so.3.14.5
36 0.0% 0.0% /lib/x86_64-linux-gnu/libc-2.17.so
16 0.0% 0.0% 7fff2d9fe000-7fff2da00000
3 0.0% 0.0% /usr/bin/nodejs
1 0.0% 0.0% /lib/x86_64-linux-gnu/libpthread-2.17.so
1 0.0% 0.0% /lib/x86_64-linux-gnu/libm-2.17.so
[JavaScript]:
ticks total nonlib name
21179 27.2% 45.7% LazyCompile: encrypt /home/starr/svn/plainsight-addon/lib/reversehuffman.js:60
8920 11.5% 19.2% LazyCompile: *indexOf native string.js:118
5013 6.4% 10.8% Stub: KeyedLoadElementStub
3718 4.8% 8.0% Stub: CEntryStub
395 0.5% 0.9% KeyedLoadIC: args_count: 0
177 0.2% 0.4% LazyCompile: huffmantree /home/starr/svn/plainsight-addon/lib/textutils.js:219
128 0.2% 0.3% Stub: FastNewClosureStub
85 0.1% 0.2% LazyCompile: *sort native array.js:741
79 0.1% 0.2% LazyCompile: *slice native string.js:510
78 0.1% 0.2% LazyCompile: *InsertionSort native array.js:764
60 0.1% 0.1% Builtin: A builtin from the snapshot
58 0.1% 0.1% Stub: FastCloneShallowObjectStub
54 0.1% 0.1% LazyCompile: *self.push /home/starr/svn/plainsight-addon/lib/priority_queue.js:129
52 0.1% 0.1% LazyCompile: *QuickSort native array.js:793
51 0.1% 0.1% LazyCompile: huffmanencodetoken /home/starr/svn/plainsight-addon/lib/textutils.js:179
Within the table, 'ticks' is the number of times the sampling profiler found execution inside that function. Overall, the baseline test takes 1 minute, 21.828 seconds of user time.
(By impact, I should focus on encrypt first, but I'll start with indexOf because the cost of improving it is much smaller.)
A surprising amount of time is spent in indexOf: within the encryption algorithm, we scan the key value and extract strings that start with a given prefix. Mozilla provides a String.startsWith function, but it is implemented in terms of computing indexOf and comparing the result against zero.
As an alternative, I implemented a beginsWith function:
if (!String.prototype.beginsWith) {
    Object.defineProperty(String.prototype, 'beginsWith', {
        enumerable: false,
        configurable: false,
        writable: false,
        value: function (prefix) {
            "use strict";
            if(this.length < prefix.length) {
                return false;
            } else {
                for(let i = 0; i < prefix.length; i++) {
                    if(this.charAt(i) != prefix.charAt(i)) {
                        return false;
                    }
                }
                return true;
            }
        }
    });
}
Re-running the tests, I get:
Implementation | User Time |
---|---|
(Baseline) String.indexOf(...) === 0 | 1m21.828s |
String.beginsWith(...) | 1m10.723s |
That shows a nice gross improvement, but what do the more detailed results say?
Statistical profiling result from /home/starr/svn/plainsight-addon/lib/v8.log, (65316 ticks, 0 unaccounted, 0 excluded).
[Shared libraries]:
ticks total nonlib name
2253 3.4% 0.0% /usr/lib/libv8.so.3.14.5
524 0.8% 0.0% /lib/x86_64-linux-gnu/libc-2.17.so
57 0.1% 0.0% 7ffff93fe000-7ffff9400000
2 0.0% 0.0% /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.18
1 0.0% 0.0% /lib/x86_64-linux-gnu/libm-2.17.so
[JavaScript]:
ticks total nonlib name
28448 43.6% 45.5% LazyCompile: Object.defineProperty.value /home/starr/svn/plainsight-addon/lib/reversehuffman.js:22
18774 28.7% 30.0% LazyCompile: encrypt /home/starr/svn/plainsight-addon/lib/reversehuffman.js:81
4278 6.5% 6.8% Stub: KeyedLoadElementStub
3993 6.1% 6.4% Stub: FastNewBlockContextStub
3549 5.4% 5.7% Stub: CompareICStub {1}
774 1.2% 1.2% Stub: CompareICStub {2}
398 0.6% 0.6% Stub: ToBooleanStub_Bool
390 0.6% 0.6% KeyedLoadIC: args_count: 0
142 0.2% 0.2% LazyCompile: huffmantree /home/starr/svn/plainsight-addon/lib/textutils.js:219
137 0.2% 0.2% Stub: FastNewClosureStub
93 0.1% 0.1% LazyCompile: *InsertionSort native array.js:764
90 0.1% 0.1% LazyCompile: *sort native array.js:741
82 0.1% 0.1% LazyCompile: *slice native string.js:510
The Object.defineProperty.value entry refers to the new beginsWith code. Why is it so expensive? To see whether this is a one-time setup expense or a high per-call overhead, I increased the iteration count in my profiling script from 1000 to 10000 and re-ran. That run took 11m43.463s of user time (703.463 seconds), almost exactly ten times the previous run. The detailed profiler results showed the same pattern: the number of ticks scales with the number of iterations, so it is not a setup expense.
Furthermore, making the function stand-alone is not helpful, although it does make the detailed results easier to understand since the actual function being executed is no longer hiding behind Object.defineProperty.value (a sketch of the stand-alone variant follows the table below):
Implementation | User Time |
---|---|
(Baseline) String.indexOf(...) === 0 | 1m21.828s |
String.beginsWith(...) | 1m10.723s |
beginsWith(str, prefix) | 1m11.457s |
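For reference, here is a minimal sketch of what the stand-alone variant might look like; this is my reconstruction from the description above, not the add-on's actual code:
// Hypothetical stand-alone variant corresponding to the third row of the table.
function beginsWith(str, prefix) {
    "use strict";
    // Reject early if the candidate string is shorter than the prefix.
    if (str.length < prefix.length) {
        return false;
    }
    for (let i = 0; i < prefix.length; i++) {
        if (str.charAt(i) !== prefix.charAt(i)) {
            return false;
        }
    }
    return true;
}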
So, at this point, we have made a slight improvement in performance, but we now need to work on the main algorithm rather than continue tuning a subsidiary function.
Tuesday, October 22, 2013
Profiling Mozilla Add-ons (Stopgap)
There is currently no good way to profile Firefox JavaScript add-on code. For code that runs within the context of a web page, there are a number of options available (e.g. Firebug), but those solutions do not work for code that runs within the browser itself. For many add-ons, this is acceptable since they do not include any complex computations, but my add-on, Hide in Plain Sight, includes some expensive operations, so I needed a profiler in order to perform evidence-based optimization.
My stopgap measure is to use the profiler built into the V8 JavaScript engine, the engine used in Chrome. Profiling code that will run in Firefox with Chrome's engine is not ideal, but I hypothesize that the two engines have similar gross performance characteristics.
Installation
In Ubuntu 13.10, I installed nodejs using apt-get. (The version of nodejs that ships with 12.04 is many versions behind.)
Getting tools/linux-tick-processor (and the other tooling necessary to parse the profiling output) was more involved.
Based on my shell history (i.e., I haven't double-checked these steps), this is what I did:
$ apt-get source libv8
$ cd libv8-3.8.9.20
$ make dependencies
$ make native werror=no
Running the Profiler
First, you need some code to profile. It is not ideal, but I wrote a driver script exercising frequencyTable, one of the functions I wanted to profile:
var fs = require("fs");
var textutils = require("./textutils");
let source = fs.readFileSync("../doc/independence.txt", {encoding: "utf-8", flag: "r"});
for(let i = 0; i < 10000; i++) {
    textutils.frequencyTable(4, source);
}
The "fs" library is provided by node.js.
Inside my add-on's lib/ directory, I ran the command:
$ nodejs --harmony --use_strict --prof prof_frequencyTable.js
The '--harmony' and '--use_strict' arguments are necessary to force node.js to allow "let" and other modern JavaScript features.
This command will run your script and create a "v8.log" file in your current directory.
To extract some useful data from v8.log, change directory to your libv8 directory and run:
$ tools/linux-tick-processor path/to/your/v8.log
This will write a profiling report to stdout.
Monday, October 21, 2013
Limited Beta Release of Hide in Plain Sight add-on
Today, we initiated a limited beta of the Hide in Plain Sight Firefox add-on. This release implements a Reverse Huffman steganographic algorithm and key management capabilities.
After limited beta, our plan is to push the add-on to addons.mozilla.org to simplify installation and maintenance and improve security.
Superior algorithms and releases for other browsers (Chrome has the highest priority) are on the road map.
Sunday, October 13, 2013
Impact of Corpus Size on Quality of Frequency Table
Hypothesis
The reverse Huffman steganography technique takes as input an n-character frequency table drawn from a corpus. We know that the quality of the output increases as n grows, albeit not monotonically. n cannot grow beyond the size of the corpus, and the number of branches declines as n grows, leading to greater spatial inefficiency in the encoding process. We hypothesize that there are similar declines in the marginal improvement as the corpus size, s, increases.
Methodology
During the encoding process, the algorithm walks through a Huffman encoding of the frequency list to determine what tokens to emit based on a binary interpretation of the input. The actual frequency values are relatively unimportant; it is the frequency values in relation to each other that impact how the Huffman tree is generated. To measure quality, we compared the frequency tables for different sizes of s, where the different versions of the corpus are drawn from the same work.
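As a rough sketch of that walk (my reading of the description above, with an assumed {left, right, token} node shape; not the add-on's actual code), bits of the input select branches, and reaching a leaf emits its token and restarts at the root:
// Assumed node shape: interior nodes have .left/.right, leaves have .token.
function emitTokens(root, bits) {
    let output = [];
    let node = root;
    for (let i = 0; i < bits.length; i++) {
        node = (bits[i] === 0) ? node.left : node.right;
        if (node.token !== undefined) {   // reached a leaf
            output.push(node.token);
            node = root;                  // restart at the root for the next token
        }
    }
    return output.join("");
}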
As our distance measure, we used the Euclidean distance between the normalized frequency vectors: the tokens are the dimensions and the frequency counts are the coordinate values. For instance, the distance between u={a:4, b:2, c:1} and v={a:8, b:4, c:2} is zero.
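Written out, the measure is (my reconstruction from the description above, not notation taken from the post):
$$ d(u, v) = \left\lVert \frac{u}{\lVert u \rVert_2} - \frac{v}{\lVert v \rVert_2} \right\rVert_2 $$
Since v = 2u in the example, both vectors normalize to the same unit vector, so d(u, v) = 0.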
Choice of Corpus
We used Darwin's The Origin of Species as the corpus for this test. Species is a fairly large work, greater than 893,000 characters in size, which allowed us to break it into 88 "chunks" of 10,149 characters each. We then aggregated the chunks into increasingly large groups sized according to the Fibonacci sequence (1, 1, 2, 3, 5, 8, 13, 21, and 34 chunks, which sum to 88) to arrive at nine corpora segments. The sizes of each of the corpora segments, along with the sizes of representative works for comparison, are depicted in Figure 1.
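One plausible reading of that aggregation (an assumption on my part; the post does not spell out whether the groups are disjoint) is to partition the 88 chunks into consecutive groups of Fibonacci size. A sketch:
// Illustrative only: split the corpus into 88 equal chunks, then build nine
// segments whose sizes in chunks follow the Fibonacci sequence (1+1+2+...+34 = 88).
function fibonacciSegments(corpus) {
    let chunkSize = Math.floor(corpus.length / 88);
    let chunks = [];
    for (let i = 0; i < 88; i++) {
        chunks.push(corpus.slice(i * chunkSize, (i + 1) * chunkSize));
    }
    let sizes = [1, 1, 2, 3, 5, 8, 13, 21, 34];
    let segments = [];
    let offset = 0;
    for (let j = 0; j < sizes.length; j++) {
        segments.push(chunks.slice(offset, offset + sizes[j]).join(""));
        offset += sizes[j];
    }
    return segments;
}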
Figure 1: Size of corpora increases per the Fibonacci sequence
Sizes of representative works were taken from the "Plain Text UTF-8" version posted to Project Gutenberg.
For each of the nine corpora, we generated a frequency list for n=2 to 5, inclusive. We then compared the frequency lists for each pair of corpora.
Results
In the tables below, the rows and columns indicate the corpora number and the contents are the distance score for that pair (lower numbers mean that the corpora are more similar to each other).
Figure 2: Distance when n=2
Figure 3: Distance when n=3
Figure 4: Distance when n=4
Figure 5: Distance when n=5
In Figure 6 below, we show the same data as in the tables above, except that the x-axis represents the ratio of the sizes (in characters) of the two corpora being compared and the y-axis is the computed distance. Although it appears that the distances converge to some average value as the ratio of sizes increases, that is almost certainly an artifact of having fewer instances of those ratios.
Figure 6: Ratio of Corpora Sizes versus Distances
Conclusion
Figure 6 suggests that a frequency table generated from one segment of a corpus will remain fairly distinct from a frequency table generated from another segment of the same corpus, and that the distinction is fairly scale invariant. This is evidence for rejecting the hypothesis. The results could be peculiar to our choice of corpus; for instance, one chapter might be written in a very different style than another chapter in the same book. However, the result is also promising because it suggests that if the key is constructed from a subset of a corpus, a leak of the identity of that corpus may reveal less information to the attacker than previously assumed.
Friday, October 11, 2013
Launch of Website - Technical infrastructure
Tonight, I did something you are never supposed to do: I launched the company website on a Friday.
For this release, my goal was to establish an initial presence along with some technical content to give a sense of direction. For that, I did not need any dynamic content. I looked into adopting a content management system (a la WordPress), but decided against it for now.
Technologically, the website is based on HTML5 and CSS and uses the Bootstrap framework. jQuery is my JavaScript library of choice, although the current website makes minimal use of it.
Website content is being served using Amazon's S3 service and DNS via Route 53. I chose this stack due to the low cost and ease of switching to a different solution in the future.
Tuesday, October 8, 2013
Ambiguous Pattern Matching in Mathematica
I was tripped up by an ambiguity in Mathematica's pattern matching. This is the behavior I want:
In[92]:= {{"a", 1}, {"c", 2}, {"e", 3}} /. {t_, w_} -> {Null, t, w, Null}
Out[92]= {{Null, "a", 1, Null}, {Null, "c", 2, Null}, {Null, "e", 3, Null}}
However, if the list has only two items in it, then the output is:
In[91]:= {{"a", 1}, {"c", 2}} /. {t_, w_} -> {Null, t, w, Null}
Out[91]= {Null, {"a", 1}, {"c", 2}, Null}
With only two sublists, the outer list itself matches {t_, w_} (ReplaceAll tries the whole expression before its parts), so the rule fires at the top level instead of on each sublist. One of the ways to correct the issue (when possible) is to constrain the patterns further, such as:
In[99]:= {{"a", 3}, {"c", 4}} /. {t_String, w_Integer} -> {Null, t, w, Null}
Out[99]= {{Null, "a", 3, Null}, {Null, "c", 4, Null}}
Monday, October 7, 2013
Mimicking Tweets
In Chapter 6 of Peter Wayner's Disappearing Cryptography, he lays out an algorithm for generating text that appears statistically similar to another text. Essentially, the algorithm works by "chunking" the source text into n-character-long strings and, based on the overlap between chunks, probabilistically generating a new character; iterate until you have sufficient text. Using the chapter itself as a source text, he generates examples of the algorithm's output for n from one to five (first order to fifth order). By n=5, the text appears very similar to English, although a close reading shows that the content makes no sense and it only "statistically" follows grammatical and syntactical rules.
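To make the mechanics concrete, here is a minimal sketch of that kind of order-n generator, assuming the whole source is one string; the names (buildTable, generate) are illustrative and this is not the add-on's actual frequency-table or encryption code:
// Map each (n-1)-character prefix to the characters that follow it, with counts.
function buildTable(source, n) {
    let table = {};
    for (let i = 0; i + n <= source.length; i++) {
        let prefix = source.slice(i, i + n - 1);
        let next = source.charAt(i + n - 1);
        if (!table[prefix]) {
            table[prefix] = {};
        }
        table[prefix][next] = (table[prefix][next] || 0) + 1;
    }
    return table;
}

// Starting from a seed of length n-1, repeatedly sample the next character
// in proportion to its observed frequency after the current prefix.
function generate(table, n, seed, length) {
    let out = seed;
    while (out.length < length) {
        let counts = table[out.slice(out.length - (n - 1))];
        if (!counts) {
            break; // dead end: this prefix never occurred in the source
        }
        let total = 0;
        for (let ch in counts) {
            total += counts[ch];
        }
        let r = Math.random() * total;
        for (let ch in counts) {
            r -= counts[ch];
            if (r <= 0) {
                out += ch;
                break;
            }
        }
    }
    return out;
}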
What if the source text is not an academic work, but rather text filled with acronyms, misspellings, and idioms? In other words, does increasing the source text's entropy increase the quality of the algorithm's output?
Entropy
As a baseline measurement, the 1.5 million characters in my Twitter dataset have a measured entropy (base e) of 9.84. For comparison, Darwin's On the Origin of Species has an entropy of 2.97. The Twitter dataset has 787 unique characters while Origin has only 75. The increased entropy can be attributed to a combination of the wider variety of characters (see Figure 1) and the more colloquial writing style.
Fig. 1: "English" Tweets feature characters from the Latin alphabet, box drawing, dingbats, and Korean
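The post does not show how the entropy was computed; for reference, here is a minimal sketch of a straightforward character-level Shannon entropy in nats (base e), which may differ from the measurement actually used above:
// Character-level Shannon entropy in nats: H = -sum(p * ln(p)).
function charEntropy(text) {
    let counts = {};
    for (let i = 0; i < text.length; i++) {
        let ch = text.charAt(i);
        counts[ch] = (counts[ch] || 0) + 1;
    }
    let h = 0;
    for (let ch in counts) {
        let p = counts[ch] / text.length;
        h -= p * Math.log(p); // Math.log is the natural logarithm
    }
    return h;
}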
Example n-Order Output
n=3
Third order text is highly random, with only a few recognizable words spliced between noise.
6 me banshou tweded HURT @oui: I
odur our itto chapposeeknowspass
_que-se.Yore!!!! houst usias th t
o dif able forthe whine aspre On
and penreekins_ @ouldri @on a ar
roursomemy ar en ithellbyFan http
stude... ❤ ❤ ❤ ❤✌ on I\[Ellipsis] hamilaut
atersto gasequally Tictince Hom c
| MERE Bry, tomed 5: Itz: @ ohtt
y we fole wit ittink &ame justur
n=4
Fourth order text has more recognizable words, but is still very rough and easily recognizable as random text.
U MONE? I far is me sing LikeJohnn
y: Thats best Mar0tte to Gla\[Ellipsis] Middodr
s to Down and and Go Britisiter tw
yn Halfd to suren_ @o a kiddlovers
llo97 whool I WAS THE OF THEY #cam
andthe LET'S AND I aciet as Some.
ed to get with kick his us an You
" Luke: i way?.. El Dont, built lo
ation. (Re)-straigestill could I c
ed it's Clasten. Welcore #Frentshu
n=5
Fifth order text is again an incremental improvement on fourth order text.
mishopper_: makes tho <<3 CA
solute their to get wonde :)x I get
m1345://t.co/2GXdLx0fW2 via @Lilia_rmz:
post "Wreckless is dey hurt I said
ght wing peace Tell message really
x5HRu us to stats: No next me followed
' over 5sos almost into far upset a
anz: Some I'm wait--it's become sea
not get up & Shiieel soon. #En
ng for a Duval in dem when I'm not
n=9
Ninth order text has recognizable sentence fragments and could be mistaken for badly written English.
A Cop ðpping in fellaake: so far we get the champion
oncentration Falsifying Federal Retail
sleep TW\[Ellipsis]l talk to me at booth 125 and 670! htt
a wizard crib for the follow or tweet #
abb_sanchez: Want Vs. Need #TheStruggle
sx awwww thank you poop, thank YOU! It
on the floor \[Ellipsis]opped my iPhone 5c! Join to win!
heart: I meet One Direction is gonna st
ng anything else is just inside.. She w
e, but FAITH is what happens , happen t
n=12
Twelfth order text is filled with recognizable sentence fragments.
ingEDM: The sun hits like a bullet of fait
ES YOU #camilacafollowspree @camilacabello
age when the sun is starting this month my
tness #beverages #drinking #lifestyleK, kids who cut
hanelpuke: my rapper name would be A1 righ
ats Behind Luis Suarez made his long antic
y the same person twice. Becomes a prob wh
BTHEBASEDGOD @GirlTimeUSA You're very bea
gh key hate Steviess you think is funny but everyone
ink the point where we realize we've... h
Cost-Benefit of Increasing n
Given the increasing quality of the algorithm's output as n increases, what is the cost?
In terms of storage space, the number of records in the frequency table increases per Figure 2. In this example, the source text was held constant and each Tweet was treated as an independent record. (Tweets with a length less than n are not included in the table; only a small percentage of Tweets are eliminated, and Figure 3 shows a histogram of Tweet length.)
Fig. 2: Sub-linear growth as n increases, but with a very large constant factor
Fig. 3: Tweet lengths cluster around 50 and 140
As n grows, the number of possible branches for the text decreases (Figure 4). If we hope to encode information as branches, this is an undesirable behavior.
Fig. 4: Branches decrease as n increases
Increasing the size of the source text can alleviate the problem since it will, almost certainly, increase the number of branches, but at ever increasing computational cost.
Sunday, October 6, 2013
Quick local web server
For all my time with Python, I didn't know you could create a local web server with:
$ python -m SimpleHTTPServer
More here.
Wednesday, October 2, 2013
Twitter archives for research
Twitter is a source of massive, metadata-rich, multi-lingual textual feeds. Researchers seeking a large set of tweets for analysis will eventually be able to access the Library of Congress's archive, and there are commercial sources that maintain databases. However, using Twitter's sampling streaming API is free and straightforward.
First, you need to get a Twitter account. Second, using dev.twitter.com, you need to create a new application. With the new application, Twitter will assign you four codes: a consumer key, a consumer secret, access token, and an access token secret. You will need these four codes for authentication.
For my sampling application, I used Python and a library called tweepy. Tweepy's streaming example is almost exactly what I needed, but in order to receive the broadest possible content I modified the script as follows:
#!/usr/bin/env python
# Stream tweets -- https://github.com/joshthecoder/tweepy/blob/master/examples/streaming.py
from tweepy.streaming import StreamListener
from tweepy import OAuthHandler
from tweepy import Stream
# Go to http://dev.twitter.com and create an app.
# The consumer key and secret will be generated for you after
consumer_key="xxxxx"
consumer_secret="xxxxx"
# After the step above, you will be redirected to your app's page.
# Create an access token under the "Your access token" section
access_token="xxxxx"
access_token_secret="xxxxx"
class StdOutListener(StreamListener):
    """ A listener that handles tweets received from the stream.
    This is a basic listener that just prints received tweets to stdout.
    """
    def on_data(self, data):
        print data
        return True

    def on_error(self, status):
        print status

if __name__ == '__main__':
    l = StdOutListener()
    auth = OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    stream = Stream(auth, l)
    stream.sample()
Run the program with:
$ python streamtweets.py > tweets.json
JSON-encoded tweets will be written to the file. Kill the script when you have a sufficient quantity.