Background

On a long road trip in the Smoky Mountains in Tennessee, I was stuck behind a lady driving at 30 mph on a 55 mph road. Being a law-abiding citizen who shuns road rage, I vented my frustration through a short, passionate Pallavi in the raga Pantuvarali, set to a five-beat cycle.

nI bAdhamEmiTi ammA, nAku teliyarAda \ What is your problem, lady? I do not comprehend it.

While that made for a memorable few hours of improvisational music 1, what struck me most was that my friend was impressed by how realistic the line I had come up with sounded. Coming up with a line appropriate to the situation is certainly a hard problem, but I felt it shouldn't be very difficult to automatically generate coherent, realistic-sounding lines.

In fact, the requirement of making semantic sense could be safely dropped because:

  1. said friend didn’t speak Telugu and would’ve bought into anything that sounded authentic.
  2. the brain is great at filling gaps (or else most jokes wouldn’t work), and even a moderately creative person will make sense of almost anything.

The rest of this post is about how I went about verifying this hunch.

Idea

The simplest statistical model that can generate language is an n-gram language model. Such a model gobbles up text and remembers only how often sequences of words occur together - it doesn't "understand" anything per se, and it is restricted to short sequences of words 2.
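To make this concrete, here is a minimal sketch (in TypeScript, since the app described later ran on node.js) of what "training" such a model amounts to; the sample line is made up for illustration:

```typescript
// The model is nothing more than count tables like these.
type Counts = Map<string, number>;

function countNgrams(tokens: string[], n: number): Counts {
  const counts: Counts = new Map();
  for (let i = 0; i + n <= tokens.length; i++) {
    const gram = tokens.slice(i, i + n).join(" ");
    counts.set(gram, (counts.get(gram) ?? 0) + 1);
  }
  return counts;
}

// A made-up line with an END sentinel appended.
const tokens = "SrI rAma SrI rAma O manasA END".split(" ");
console.log(countNgrams(tokens, 2)); // e.g. "SrI rAma" -> 2
console.log(countNgrams(tokens, 3)); // e.g. "rAma SrI rAma" -> 1
```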

I should add that natural language processing is a decades-old field and the current state of the art consists of vastly superior methods 3. The flip side is that these sophisticated methods require much more data than I had.

The goal was then to train an n-gram language model using as many of Tyagaraja’s compositions as I could lay my hands on and see what came out.

Challenges

A general challenge in using computational methods on classical works is that they span multiple centuries and straddle many different cultural eras, through which the language itself would have evolved. In other words, the amount of data available in any particular linguistic "era" is limited. The issue is worse in our case because Carnatic music was composed in multiple languages, and the good folks who composed in Telugu passed away by the mid-19th century, leaving behind only a precious few thousand compositions.

In addition, there is no standard way of romanizing Indian languages, which makes the data hard to clean and also adds to data sparsity (heuristics go only so far in cleaning the data, and patience is a scarce resource!).
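For a flavor of what those heuristics look like, here is a sketch; the variant spellings and rules below are hypothetical stand-ins for the inconsistencies in the actual data:

```typescript
// Hypothetical cleanup rules for inconsistently romanized lyrics.
// Real rules depend on the transliteration schemes the source sites used.
const rules: Array<[RegExp, string]> = [
  [/\b(shrI|shree|sree)\b/g, "SrI"], // collapse variant spellings (assumed)
  [/\bthyAgarAja\b/g, "tyAgarAja"],
  [/[,.!?;:"']/g, ""],               // strip punctuation
];

function normalize(line: string): string {
  let s = line;
  for (const [pattern, replacement] of rules) {
    s = s.replace(pattern, replacement);
  }
  return s.replace(/\s+/g, " ").trim();
}

console.log(normalize("shree thyAgarAja nuta!")); // "SrI tyAgarAja nuta"
```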

Implementation

I scraped several popular websites hosting Carnatic music lyrics and picked the one that was easiest to parse and most consistent. I then cleaned up the data a little but did not stem the words.

I had only 700 songs, so I didn't prune the vocabulary. There were 13k distinct words, 30k pairs, and 33k triples. The lists of top n-grams hold no surprises (END is a sentinel appended at the end of each song):

  1. Top words: rAma, SrI, tyAgarAja
  2. Top pairs: SrI rAma, tyAgarAja nuta, rAma END
  3. Top triples: tyAgarAja nuta END, tyAgarAja vinuta END, O manasA END

I wrote a node.js app that sampled 4 from this trigram language model 50% of the time and from the unigram distribution the rest of the time, and hosted it on Heroku. I also wanted the generated lyrics stored somewhere, so I eventually made a Twitter bot to tweet out lyrics every 15 minutes.
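The sampling logic was roughly the following (a TypeScript sketch, not the app's actual code; the table shapes and names are mine, with counts built by countNgrams-style passes over the corpus):

```typescript
// 50% of the time extend the lyric from the trigram distribution
// conditioned on the last two words; otherwise draw a unigram.
type Dist = Map<string, number>;

function sampleFrom(dist: Dist): string {
  // Linear scan over the counts; see footnote 4 for a faster method.
  let total = 0;
  for (const count of dist.values()) total += count;
  let r = Math.random() * total;
  for (const [word, count] of dist) {
    r -= count;
    if (r <= 0) return word;
  }
  throw new Error("cannot sample from an empty distribution");
}

function generate(
  trigrams: Map<string, Dist>, // "w1 w2" -> distribution over the next word
  unigrams: Dist,
  maxWords = 20
): string {
  const words: string[] = [];
  while (words.length < maxWords) {
    const context = words.slice(-2).join(" ");
    const followers = trigrams.get(context);
    const next =
      followers && Math.random() < 0.5
        ? sampleFrom(followers) // trigram step
        : sampleFrom(unigrams); // unigram step
    if (next === "END") break; // the sentinel ends the lyric
    words.push(next);
  }
  return words.join(" ");
}
```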

Sample lyrics

These samples span the full spectrum from the profoundly philosophical to the hilarious. Translations provided courtesy of yours truly.

In the Kaliyuga, one can only achieve a cute name.

Rama having summoned someone like a tiger.

Unable to see the swift power, easily wander and cross the town, having felt like serving Garuda 5.

While some of these are too contrived, others are simple and profound. You can help by following the Twitter bot and liking the good lyrics!

TL;DR

Computer generates lyrics similar to 18th-century Indian classical music! Please follow!

[Embedded Twitter timeline: Tweets by TyagaRaga]


  1. with my musician friend doing the heavy lifting and me tagging along. ↩︎

  2. once you fix a set of W words as your vocabulary, there can be W^k possible sequences of length k, and this quickly becomes too large to handle. ↩︎

  3. long short-term memory (LSTM) recurrent neural networks, for example. ↩︎

  4. Do you know of the Alias method to efficiently sample repeatedly from a discrete probability distribution? A sketch follows these notes. ↩︎

  5. humanoid bird in Hindu mythology ↩︎
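For the curious, here is a compact sketch of the alias method mentioned in footnote 4 (Vose's variant, with names of my choosing). Setup costs O(n); every sample after that is O(1), versus the O(n) linear scan in the sketch above.

```typescript
// Vose's alias method for repeated sampling from a fixed discrete
// distribution: O(n) preprocessing, O(1) per sample.
class AliasSampler {
  private readonly prob: number[];
  private readonly alias: number[];

  constructor(weights: number[]) {
    const n = weights.length;
    const total = weights.reduce((a, b) => a + b, 0);
    const scaled = weights.map((w) => (w * n) / total);
    this.prob = new Array(n).fill(0);
    this.alias = new Array(n).fill(0);
    const small: number[] = []; // columns with scaled weight < 1
    const large: number[] = []; // columns with scaled weight >= 1
    scaled.forEach((p, i) => (p < 1 ? small : large).push(i));
    while (small.length > 0 && large.length > 0) {
      const s = small.pop()!;
      const l = large.pop()!;
      this.prob[s] = scaled[s];
      this.alias[s] = l; // the rest of column s is filled by l
      scaled[l] += scaled[s] - 1;
      (scaled[l] < 1 ? small : large).push(l);
    }
    for (const i of [...small, ...large]) this.prob[i] = 1; // full columns
  }

  sample(): number {
    // Pick a column uniformly, then flip a biased coin within it.
    const i = Math.floor(Math.random() * this.prob.length);
    return Math.random() < this.prob[i] ? i : this.alias[i];
  }
}

// e.g. indices weighted like n-gram counts:
const sampler = new AliasSampler([5, 1, 2]);
console.log(sampler.sample()); // 0 w.p. 5/8, 1 w.p. 1/8, 2 w.p. 2/8
```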