Oranges, Lemons and Apples dataset
In March 2006 I gave a talk to the Thinking Society in Cambridge on Machine Learning and artificial thinking. For this talk I wanted to show a data set and show a couple of algorithms running on it so that, hopefully, people could understand what was going on from start to finish. I did not want to draw a fake dataset and assert that “data looks like this”. And gathering it myself forced me to think about some of the issues involved a bit more too. This is one of the plots I made:
I recorded the height, width and mass of a selection of oranges, lemons and apples. I deliberately bought a few each of some different types to introduce some variety. I did not have ready access to calipers or any sophisticated equipment. The masses were measured with some digital kitchen scales, which rounded to the nearest 2g. The lengths were measured by holding the fruit between two CDs (compact discs) and making marks on a sheet of paper(!). This probably introduced some systematic error and a fair amount of random error. I just tried to be fairly consistent in my procedure. We often have to deal with whatever poor-quality features are available. The heights were measured along the core of the fruit. The widths were the widest width perpendicular to the height.
If you just want to look at some more pictures, see a subset of the slides from my talk showing just the oranges and lemons and a demo of K-means clustering. The clusters it finds roughly correspond to the different types of orange and lemon I bought. Only the “seconds” oranges, a selection of deformed oranges I bought cheaply, are not in a cluster of their own. These high-variance oranges are scattered around the normal oranges, and one of them looks like a lemon…especially when coloured yellow. Full disclosure: it took me three random initializations to get the K-means demo to find the “right” solution.
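The dependence on random initialization can be made concrete with a short sketch. The Python below is not the code behind the demo in the slides; it is a minimal, standard-library version of K-means (Lloyd's algorithm) that restarts from several random initializations and keeps the run with the lowest within-cluster sum of squared distances. The function names, the `restarts` and `seed` parameters, and the choice of starting centres (k distinct data points) are all assumptions made for illustration.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, rng, max_iters=100):
    """One run of K-means (Lloyd's algorithm) from a random initialization.

    points is a list of numeric tuples. Returns (centroids, assignments, cost),
    where cost is the total squared distance from each point to its centroid.
    """
    # Assumption: start from k distinct data points chosen at random.
    centroids = rng.sample(points, k)
    assignments = None
    for _ in range(max_iters):
        # Assignment step: each point joins its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # nothing moved, so the run has converged
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    cost = sum(squared_distance(p, centroids[a]) for p, a in zip(points, assignments))
    return centroids, assignments, cost


def kmeans_restarts(points, k, restarts=10, seed=0):
    """Run K-means several times and keep the lowest-cost run."""
    rng = random.Random(seed)
    runs = [kmeans(points, k, rng) for _ in range(restarts)]
    return min(runs, key=lambda run: run[2])
```

Different seeds can converge to different local optima, which is why a single run may not find the “right” clustering. Keeping the lowest-cost run is a common heuristic, but the lowest cost need not coincide with the grouping a person would pick by eye.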
The data are available in a tab-separated unix text file format. The columns correspond to fruit type, defined in this file. An example Octave/Matlab script that reads in these data and does a simple scatter plot may help clarify things. Several people have emailed me to ask if I have more of this data than provided here. I don’t.
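For readers without Octave or Matlab, here is a rough Python equivalent of the read-and-plot step. It is a sketch, not a translation of the script mentioned above. The column names and their order in `ASSUMED_COLUMNS` are hypothetical and should be checked against the file that defines the columns. The sketch also assumes that the first field on each line is the fruit type and that every remaining field is numeric. Plotting uses the third-party matplotlib library, imported only when the plot is requested.

```python
import csv
import io
from collections import defaultdict

# Hypothetical column names and order: check these against the real
# column definitions before trusting any axis label.
ASSUMED_COLUMNS = ("fruit_type", "height", "width", "mass")


def read_fruit_rows(text, columns=ASSUMED_COLUMNS):
    """Parse tab-separated text into a list of dicts, one per fruit.

    Assumes the first field is the fruit type (kept as a string) and
    every remaining field is a number.
    """
    rows = []
    for fields in csv.reader(io.StringIO(text), delimiter="\t"):
        if not fields:
            continue  # skip blank lines
        label, *numbers = fields
        row = {columns[0]: label}
        row.update(zip(columns[1:], map(float, numbers)))
        rows.append(row)
    return rows


def group_xy(rows, x, y, by=ASSUMED_COLUMNS[0]):
    """Collect (xs, ys) lists per fruit type, ready for a scatter plot."""
    groups = defaultdict(lambda: ([], []))
    for row in rows:
        xs, ys = groups[row[by]]
        xs.append(row[x])
        ys.append(row[y])
    return dict(groups)


def scatter_plot(rows, x="width", y="height"):
    """Draw one scatter series per fruit type (requires matplotlib)."""
    import matplotlib.pyplot as plt

    for label, (xs, ys) in group_xy(rows, x, y).items():
        plt.scatter(xs, ys, label=str(label))
    plt.xlabel(x)
    plt.ylabel(y)
    plt.legend()
    plt.show()
```

To use it, read the downloaded file into a string, pass that string to `read_fruit_rows`, and hand the result to `scatter_plot`.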
October 2011: The BBC report on a real machine vision application: detecting rotten oranges.