memes.angrygoats.net

Haiku2

I've had a lot of requests to explain how this thing works. Many people wonder why the haiku seem so pertinent to their own lives, contain poems they've written, and so on. The answer is that each haiku is built entirely from your own journal: the generator has no knowledge of English (or any other language) at all!

It's not a Haiku! The syllable count is wrong!

Yes, the output may not be strictly 5-7-5, although there is code that tries to count syllables correctly. Syllables in English (my primary language) are hard to count accurately! Some syllable information is taken from GCIDE, a free dictionary.

There isn't a seasonal reference!

Well, make more of them in your blog! The meme only uses words you have written!

My haiku is rude, my haiku is depressing!

Then, in all likelihood at some point your LiveJournal was rude or depressing :-) I wouldn't worry about it, it's just a silly meme.

The gory details

The script has four steps:

  1. Determine the location of your blog's RSS feed, check if it has updated since the last time we visited it, and if so download and tokenize it.
  2. Construct a Markov model of the journal.
  3. Apply entropic chunking. I won't go into the details here, but this process effectively determines which tokens represent 'grammar' in your entries. Typically these would be 'of the', 'with the', etc. These tokens are upchunked, so each one is represented as a single token. This helps the model make more "sense" in terms of the language.
  4. Attempt to produce a Haiku using the model.
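
As a rough illustration of step 1 only, here is a minimal sketch in Python. It is not the site's actual code: the function names `tokenize` and `fetch_tokens`, the regular expressions, and the choice to read each entry's `summary` field are all my assumptions. It relies on Feedparser's support for passing an `etag` and `modified` value back to the server, which answers with HTTP status 304 if the feed hasn't changed. Finding the feed's location in the first place is not shown.

```python
import re
from html import unescape


def tokenize(html_text):
    """Strip HTML tags, then split the text into words and sentence punctuation."""
    text = unescape(re.sub(r"<[^>]+>", " ", html_text))
    return re.findall(r"[A-Za-z']+|[.!?]", text)


def fetch_tokens(feed_url, etag=None, modified=None):
    """Return (tokens, etag, modified); tokens is None if the feed is unchanged.

    Hypothetical helper: pass in the etag/modified values saved from the
    previous visit so the server can tell us nothing has changed.
    """
    import feedparser  # third-party; imported here so tokenize() works without it

    feed = feedparser.parse(feed_url, etag=etag, modified=modified)
    if feed.get("status") == 304:  # not modified since the last visit
        return None, etag, modified
    tokens = []
    for entry in feed.entries:
        tokens.extend(tokenize(entry.get("summary", "")))
    return tokens, feed.get("etag"), feed.get("modified")
```

Calling `tokenize("<p>I am happy.</p>")` gives a flat list of words and punctuation marks, which is the kind of input the later steps need.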

The Markov chain code builds an index so that, for any two words you've used one after the other, it knows the probability of each word that could come third. For example, if you'd used "I am", then the probability table for the next word might look like:

Word      Probability
angry     30%
happy     50%
silly     10%
smelly    10%

The program then rolls a virtual dice; 30% of the time it would use the word "angry" in the haiku, 10% "smelly", etc.
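
To make the index and the dice roll concrete, here is a minimal sketch in Python. It is an illustration rather than the site's real code: the names `build_model`, `probabilities` and `next_word` are made up, and the toy journal is chosen so that "I am" is followed by "angry" 3 times, "happy" 5 times, and "silly" and "smelly" once each.

```python
import random
from collections import Counter, defaultdict


def build_model(tokens):
    """Map each pair of consecutive words to a Counter of the words that followed it."""
    model = defaultdict(Counter)
    for first, second, third in zip(tokens, tokens[1:], tokens[2:]):
        model[(first, second)][third] += 1
    return model


def probabilities(model, pair):
    """Return {word: probability} for the words seen after `pair`."""
    counts = model.get(pair, Counter())
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def next_word(model, pair, rng=random):
    """Roll the virtual dice: pick a follower of `pair`, weighted by how often it occurred."""
    counts = model.get(pair)
    if not counts:
        return None  # this pair was never followed by anything
    words = list(counts)
    return rng.choices(words, weights=[counts[w] for w in words], k=1)[0]


# A toy "journal" that reproduces the table above.
tokens = (["I", "am", "angry"] * 3 + ["I", "am", "happy"] * 5
          + ["I", "am", "silly"] + ["I", "am", "smelly"])
model = build_model(tokens)
print(probabilities(model, ("I", "am")))
print(next_word(model, ("I", "am")))
```

Storing raw counts rather than percentages keeps the index easy to update when new entries arrive; the percentages are only worked out when needed.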

So the program picks the first two words of the haiku from words you have used; it then moves on, choosing each following word by rolling this virtual dice. For each word, it uses a simple algorithm to estimate the number of syllables. If the number of syllables on a generated line doesn't match the requested form, it drops the line and tries again.
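
The generate-and-retry loop can be sketched like this in Python. Everything here is an assumption made for illustration: the vowel-group heuristic in `estimate_syllables` is a common rough approach and not necessarily the "simple algorithm" the site uses (and it ignores the GCIDE lookup entirely), and `build_followers` and `generate_line` are hypothetical names. The sketch also generates a single line on its own, picking a fresh starting pair for each attempt.

```python
import random
import re
from collections import defaultdict


def estimate_syllables(word):
    """Rough estimate: count groups of vowels, discounting a trailing silent 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def build_followers(tokens):
    """Map each word pair to the list of words that followed it (repeats keep the odds)."""
    model = defaultdict(list)
    for first, second, third in zip(tokens, tokens[1:], tokens[2:]):
        model[(first, second)].append(third)
    return model


def generate_line(model, target, rng, max_attempts=1000):
    """Return a list of words with exactly `target` estimated syllables, or None."""
    pairs = list(model)
    for _ in range(max_attempts):
        line = list(rng.choice(pairs))  # the first two words come from the journal
        total = sum(estimate_syllables(w) for w in line)
        while total < target:
            followers = model.get(tuple(line[-2:]))
            if not followers:
                break  # dead end: nothing ever followed this pair
            word = rng.choice(followers)
            line.append(word)
            total += estimate_syllables(word)
        if total == target:
            return line
        # Wrong syllable count: drop the line and try again.
    return None


tokens = "I am angry today and I am happy today".split()
model = build_followers(tokens)
print(generate_line(model, 5, random.Random(0)))
```

Throwing a whole line away and starting over is wasteful but simple, and with a journal of any size a line that fits usually turns up quickly.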

If you haven't got many entries, there will often only be one possibility for the generator to choose from -- for example, the words "I met" might only have been followed by "Sally" in your journal. This causes the behaviour where it parrots back entire sentences of your journal. If you come back later after writing more posts, they will be included in your new Haiku.
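
Here is a small Python sketch of why short journals get parroted back. It measures what fraction of word pairs have exactly one possible follower; when that fraction is 1.0 the dice roll never has a real choice. The function names and the example sentences are invented for illustration.

```python
from collections import defaultdict


def follower_sets(tokens):
    """Map each word pair to the set of distinct words that followed it."""
    model = defaultdict(set)
    for first, second, third in zip(tokens, tokens[1:], tokens[2:]):
        model[(first, second)].add(third)
    return model


def forced_fraction(tokens):
    """Fraction of word pairs that have exactly one possible next word."""
    model = follower_sets(tokens)
    if not model:
        return 0.0
    forced = sum(1 for followers in model.values() if len(followers) == 1)
    return forced / len(model)


short_journal = "I met Sally at the park".split()
longer_journal = "I met Sally at the park and I met Tom at the shop".split()

print(forced_fraction(short_journal))   # every pair has a single follower
print(forced_fraction(longer_journal))  # "I met" and "at the" now have two each
```

In the longer journal, "I met" can lead to either "Sally" or "Tom", so the generator can start to recombine your sentences instead of repeating them.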

Hope this helps; if you have any questions feel free to post in my LJ. I'll try and get back to you :-)

Why?

I wanted a cool signature for my email, so I wrote a haiku generator and fed it the King James Bible from Project Gutenberg. This was kind of fun, and then I came up with the idea of running it over people's LiveJournals. I then procrastinated about it for ages, and eventually finished it. Woo :-)

More information

Credits

The most important change to the Haiku meme is thanks to Tom Lynch. Many thanks to him for pointing me in the direction of entropic chunking!

Much open source software lets this site function; it is running via web.py on lighttpd, and of course is written in the Python programming language. Feedparser provides an invaluable interface which allows the site to grab blog feeds.

zedfnegt2zzca@gaisde.angrygoats.net