Mathjax

Monday, October 7, 2013

Mimicking Tweets

In Chapter 6 of Peter Wayner's Disappearing Cryptography, he lays out an algorithm for generating text that would appear statistically similar to another text. Essentially, the algorithm works by "chunking" the source text into n-character long strings and, based on the overlap between chunks, probabilistically generating a new character. Iterate until you sufficient text. Using the chapter itself as a source text, he generates examples of the algorithm output for n from one to five (first order to fifth order). By n=5, the text appears very similar to English, although a close reading will show that the content makes no sense and it only "statistically" follows grammatical and syntactical rules

What if the source text is not an academic work, but rather text filled with acronyms, misspellings, and idioms? In other words, does increasing the source text's entropy increase the quality of the algorithm's output?

Entropy

As a baseline measurement, the 1.5 million characters in my Twitter dataset have a measured entropy (base e) of 9.84. For comparison, Darwin's On the Origin of Species has an entropy of 2.97. The Twitter dataset has 787 unique characters while Origin has only 75. The increased entropy can be contributed to a combination of the wider variety of characters (see Figure 1) and the more colloquial writing style.

Fig. 1: "English" Tweets feature characters from the Latin alphabet, box drawing, dingbats, and Korean

Example n-Order Output

n=3

Third order text is highly random, with only a few recognizable words spliced between noise.
6 me banshou tweded HURT @oui: I
odur our itto chapposeeknowspass
_que-se.Yore!!!! houst usias th t
o dif able forthe whine aspre On
and penreekins_ @ouldri @on a ar
roursomemy ar en ithellbyFan http
stude... ❤ ❤ ❤ ❤✌ on I\[Ellipsis] hamilaut
atersto gasequally Tictince Hom c
 | MERE Bry, tomed 5: Itz: @ ohtt
y we fole wit ittink &ame justur

n=4

Fourth order text has more recognizable words, but is still very rough and easily recognizable as random text.
U MONE? I far is me sing LikeJohnn
y: Thats best Mar0tte to Gla\[Ellipsis] Middodr
s to Down and and Go Britisiter tw
yn Halfd to suren_ @o a kiddlovers
llo97 whool I WAS THE OF THEY #cam
 andthe LET'S AND I aciet as Some.
ed to get with kick his us an You
" Luke: i way?.. El Dont, built lo
ation. (Re)-straigestill could I c
ed it's Clasten. Welcore #Frentshu

n=5

Fifth order text is again an incremental improvement on fourth order text.
 mishopper_: makes tho <<3 CA
solute their to get wonde :)x I get
m1345://t.co/2GXdLx0fW2 via @Lilia_rmz:
post "Wreckless is dey hurt I said
ght wing peace Tell message really
x5HRu us to stats: No next me followed
' over 5sos almost into far upset a
anz: Some I'm wait--it's become sea
 not get up & Shiieel soon. #En
ng for a Duval in dem when I'm not

n=9

Ninth order text has recognizable sentence fragments and could be mistaken as badly written English.
A Cop ðpping in fellaake: so far we get the champion
oncentration Falsifying Federal Retail
sleep TW\[Ellipsis]l talk to me at booth 125 and 670! htt
a wizard crib for the follow or tweet #
abb_sanchez: Want Vs. Need #TheStruggle
sx awwww thank you poop, thank YOU! It
on the floor \[Ellipsis]opped my iPhone 5c! Join to win!
heart: I meet One Direction is gonna st
ng anything else is just inside.. She w
e, but FAITH is what happens , happen t

n=12

Twelfth order text is filled with recognizable sentence fragments.
ingEDM: The sun hits like a bullet of fait
ES YOU #camilacafollowspree @camilacabello
age when the sun is starting this month my
tness #beverages #drinking #lifestyleK, kids who cut
hanelpuke: my rapper name would be A1 righ
ats Behind Luis Suarez made his long antic
y the same person twice. Becomes a prob wh
BTHEBASEDGOD @GirlTimeUSA You're  very bea
gh key hate Steviess you think is funny but everyone
ink the point where we realize we've... h

Cost-Benefit of Increasing n

Given the increasing quality of the algorithm's output as n increases, what is the cost?

In terms of storage space, the number of records in the frequency table increase per Figure 2. In the example, the source text was held constant and each Tweet was treated as an independent record. (A Tweet with a length less than n are not included in the table. Only a small percentage of Tweets are eliminated; Figure 3 shows a histogram of Tweet length.)

Fig. 2: Sub-Linear growth as n increases but with a very large constant factor
Fig. 3: Tweet lengths cluster around 50 and 140
As n grows, the number of possible branches for the text decreases. If we hope to encode information as branches, this is an undesirable behavior.


Fig. 4: Branches decrease as n increases

Increasing the size of the source text can alleviate the problem since it will, almost certainly, increase the number of branches, but at ever increasing computational cost.



No comments:

Post a Comment