
Monday, May 26, 2014

OCaml and Functional Programming Contest - May '14

About two weeks ago, I decided it was time to get some experience with the ML family of programming languages, so I installed OCaml and started writing programs. For motivation, I first worked through the Functional Programming track on HackerRank and then entered this month's FP contest. (I placed 19th out of 161.) My thoughts so far on the language:

  • Pattern matching is a great language feature. It provides an "if-then" mechanism that can dispatch on both the input's value and the input's structure (see the toy example after this list). In particular, the SKIBC translation problem from the contest decomposed marvellously into a recursive pattern-match/transform function once the proper data types were specified.
  • The standard library is weak. I had to write several debugging functions to print structures. Because OCaml implements polymorphism by erasing type information from the compiled output, it is essentially impossible to write a generic 'to_string' function. It usually took more code to write the I/O routines (particularly parsing the contest input) than to write the algorithms.
    • Although I don't think there is a way to use it on HackerRank, Jane Street's OCaml library (Core) is an attempt at 'completing' the standard library and making it more competitive with other languages (e.g. Perl, Python, Ruby). It would be worth my time to study it.
  • Performance profiling tools, at least as far as "how many times does this execute", are quite good. However, I have not found a way to gauge how expensive individual operations are versus how often they are executed.
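
For a flavour of what this looks like, here is a toy example (not the contest code) that simplifies a small arithmetic type by matching on both the value and the shape of the input:

type expr = Num of int | Add of expr * expr | Mul of expr * expr

(* Each case dispatches on structure (Add/Mul) and on value (Num 0, Num 1). *)
let rec simplify = function
  | Add (Num 0, e) | Add (e, Num 0) -> simplify e
  | Mul (Num 1, e) | Mul (e, Num 1) -> simplify e
  | Mul (Num 0, _) | Mul (_, Num 0) -> Num 0
  | Add (a, b) -> Add (simplify a, simplify b)
  | Mul (a, b) -> Mul (simplify a, simplify b)
  | Num n -> Num n

For instance, simplify (Add (Num 0, Mul (Num 1, Num 7))) evaluates to Num 7.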

Sunday, May 18, 2014

OCaml Profiling Idiom

The idiom for profiling an OCaml program is:

ocamlcp -p a program.ml -o program      # compile with call-count instrumentation
cat input | ./program                   # run normally; counts are written to ocamlprof.dump
ocamlprof program.ml > program.ml.prof  # merge the counts into an annotated source listing

The file program.ml.prof will contain a source listing of program.ml annotated with call counts.
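
The annotations are ordinary OCaml comments containing execution counts. The listing looks roughly like this (the counts below are invented for illustration):

let rec fib n =
  (* 1000 *) if n < 2 then (* 499 *) 1 else (* 501 *) fib (n - 1) + fib (n - 2)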

Tuesday, May 13, 2014

Choosing an Algorithm Based on Characteristics of the Problem

I believe I first came across the idea of choosing an algorithm based on the characteristics of the problem instance in a review of Fast Fourier Transform methods. In this article, Leyton-Brown et al. describe using machine learning to predict the runtime of various SAT solvers. Practically, this meant they could write a program that selects the appropriate SAT solver from a pool and, in general, outperforms any single competitor. The far more useful result, which the paper only lightly addresses, would be understanding why SAT solvers do so well on problems that are, complexity-wise, NP-complete. By itself, I am doubtful that a machine learning approach will yield a model capable of answering "why", but perhaps it could if the approach incorporated more performance outputs (which collectively determine the run time) with an augmented machine learning approach, as in Schmidt and Lipson?


Sunday, May 11, 2014

OCaml - Return lines of text from file as list

I've started some challenges on HackerRank and, after completing several of them in Python, decided to try them in OCaml, a functional language. I found the documentation around I/O somewhat opaque, so I'm sharing some sample code to make the challenges easier to complete. This code reads all lines from a channel and returns them as a list:


(* Wrap an input channel as a stream of lines; the stream ends when
   input_line raises End_of_file. *)
let line_stream_of_channel channel =
  Stream.from
    (fun _ ->
      try Some (input_line channel) with End_of_file -> None);;

(* Drain a stream into a list, preserving the order of its elements. *)
let list_of_stream stream =
  let result = ref [] in
  Stream.iter (fun value -> result := value :: !result) stream;
  List.rev !result ;;


As an example, this line calls print_string on every element read from stdin:


List.iter print_string (list_of_stream (line_stream_of_channel stdin)) ;;

Thursday, February 27, 2014

Squarified Treemap Algorithm, Functionally

I'm preparing to implement the squarified treemap algorithm in an unfamiliar package, so I wanted a clear and complete description of the process. The original paper by Bruls, Huizing, and van Wijk is written procedurally and relies on global state.

I've written an extended explanation of the algorithm and have chosen to implement it in a functional manner. The code is in Mathematica, but it should be readily adaptable to other functional languages like Scheme, Scala, F#, or JavaScript/D3.

Formats: PDF, Mathematica notebook (NB), GitHub

Friday, February 21, 2014

Axis/Scale Boundaries

When charting unknown data, we would like to set the axis scale such that the details in the data are clear and the end points on the scale are aesthetically pleasing.

D3 and Protovis use the "nice" algorithm for linear scales (and a modification for logarithmic scales). The pull request implementing the algorithm in JavaScript is on GitHub, but, mathematically, for an array of data containing a \(min\) and \(max\) value:

Let $$step = 10^{\operatorname{round}(\log_{10}(max - min)) - 1}$$

then,

$$scale_{min} = \lfloor\frac{min}{step}\rfloor \times step$$
$$scale_{max} = \lceil\frac{max}{step}\rceil \times step$$

If the data has a range of 0.002454 to 0.1455, then \(step = 0.01\) and the scale will extend from 0 to 0.15. Visually, the bounding boxes of the data (tan) and of the scale (green) nest inside one another.
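
As a sanity check on the formulas, here is a short OCaml sketch of the computation (my own illustration, not D3's source):

(* "Nice" scale bounds: round is half-up, matching JavaScript's Math.round. *)
let nice_bounds data_min data_max =
  let round x = floor (x +. 0.5) in
  let step = 10. ** (round (log10 (data_max -. data_min)) -. 1.) in
  (floor (data_min /. step) *. step, ceil (data_max /. step) *. step)

let () =
  let lo, hi = nice_bounds 0.002454 0.1455 in
  Printf.printf "%g to %g\n" lo hi    (* prints: 0 to 0.15 *)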

Wednesday, February 12, 2014

Minimizing Autocorrelation of a List

As part of a graph rendering algorithm, I needed to take a list of y-values and disperse the points within an x-y range. (Within the list, the x coordinate has no meaning; I'm not re-ordering a time series.) To model the dispersion, I decided to use the autocorrelation metric. Autocorrelation measures the correlation of a list with itself. Put another way, autocorrelation measures how well the prior value predicts the next value. (For this discussion, I'm using 1 as the lag parameter. Additionally, since reordering does not change the variance, minimizing the autocorrelation is the same as minimizing the autocovariance.)

For example, the figure below shows a sinusoid (sampled at a regular interval). The autocorrelation of the figure is 0.994705 --- almost the maximum value of 1.


In contrast, the figure below has been modified by the autocorrelation minimization algorithm. The two figures have the same y-values, but the x-values have been changed, resulting in a new autocorrelation of -0.999955.

The Algorithm

The formula for calculating autocorrelation includes a number of factors, but the only part that is affected by changing the order of the list reduces to the sum:

$$ \sum_i (x_i - \mu)(x_{i+1} - \mu) $$

where \(\mu\) is the arithmetic mean of the list.

The approach is greedy: build an ordering in which each product in this sum is as negative as possible.

  1. Create a list terms by subtracting the mean of the list from each value.
  2. Sort terms.
  3. Initialize the merged list with the head and tail of terms (removing them from terms).
  4. While elements remain in terms,
    1. Remove either the head or the tail of terms and attach it to either end of merged, choosing the combination whose product (the element times that end of merged) is most negative.
  5. Add the mean back to each element of merged. Return that list.

I've uploaded a Python implementation of the algorithm to Gist.
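
Below is a condensed OCaml sketch of the steps above, using the four-way choice described in step 4; it is an illustration written for this post, not the Python code in the Gist.

(* Reorder xs so that consecutive products (x_i - mu)(x_{i+1} - mu) are
   driven negative. Lists are used throughout, so this is O(n^2). *)
let minimize_autocorrelation xs =
  let mu = List.fold_left (+.) 0. xs /. float_of_int (List.length xs) in
  (* Steps 1-2: mean-centred values, sorted ascending. *)
  let terms = List.sort compare (List.map (fun x -> x -. mu) xs) in
  let rec last = function [x] -> x | _ :: tl -> last tl | [] -> invalid_arg "empty" in
  let rec drop_last = function [] | [_] -> [] | x :: tl -> x :: drop_last tl in
  (* Step 4: attach the head or tail of terms to either end of merged,
     choosing the pairing whose product is most negative. *)
  let rec go terms merged =
    match terms with
    | [] -> merged
    | _ ->
      let cands =
        [ (List.hd terms *. List.hd merged, `Head_to_front);
          (List.hd terms *. last merged,    `Head_to_back);
          (last terms    *. List.hd merged, `Last_to_front);
          (last terms    *. last merged,    `Last_to_back) ] in
      (match snd (List.fold_left min (List.hd cands) (List.tl cands)) with
       | `Head_to_front -> go (List.tl terms)   (List.hd terms :: merged)
       | `Head_to_back  -> go (List.tl terms)   (merged @ [List.hd terms])
       | `Last_to_front -> go (drop_last terms) (last terms :: merged)
       | `Last_to_back  -> go (drop_last terms) (merged @ [last terms]))
  in
  match terms with
  | [] | [_] -> xs
  | smallest :: rest ->
    (* Step 3: seed merged with the smallest and largest term. *)
    let merged = go (drop_last rest) [smallest; last rest] in
    (* Step 5: add the mean back. *)
    List.map (fun t -> t +. mu) merged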

Updated: Algorithm simplified

Friday, January 31, 2014

Dateinfer Updated

The dateinfer package has been updated to version 0.20.

You can get it from PyPI or GitHub, or install it via pip.

Tuesday, January 14, 2014

Release of dateinfer: v0.1.1

I have released the first version of a new Python library called dateinfer. dateinfer makes a "best guess" at the date format given a list of example date strings. For example:

>>> import dateinfer
>>> dateinfer.infer(['Mon Jan 13 09:52:52 MST 2014', 'Tue Jan 21 15:30:00 EST 2014'])
'%a %b %d %H:%M:%S %Z %Y'
>>>

The library is available through PyPI and is hosted on GitHub.

Wednesday, January 8, 2014

Access denied checking streaming input path

When I started launching my Elastic Map Reduce (EMR) jobs from within an Elastic Beanstalk EC2 instance, I was stymied by the error message:

Terminated with errors Access denied checking streaming input path: s3://bucket/key

I first opened up permissions on my S3 bucket and file, but that didn't work. I then explicitly set the IAM role for EMR and assigned a policy for full read/write rights to S3. That also did not work.

When I contacted Amazon Web Services technical support, they noted that I was making requests using temporary credentials. EMR does not support temporary credentials, so the actual request was being made by an identity with no authority to access any resources.

I solved the problem by explicitly setting my credentials (AWS access key, secret access key) in my job creation code. Since I am using mrjob, this was a matter of:

runner = EMRJobRunner(aws_access_key_id='xxxxx',
                      aws_secret_access_key='xxxxxxx',
                      ...)

Before, I was not setting the access keys explicitly, so mrjob fell back to the credentials boto was using, which were apparently temporary credentials passed in via Elastic Beanstalk.