Use the html version of this lab if you want to easily access the links or copy and paste commands.
Or use the pdf version if you want nicer formatting or a printable sheet.
This week's lab is intended to help build your intuitions about estimating probability distributions from data, and to give you some example code for generating bar plots and for sampling from a discrete probability distribution.
We have written most of the code here already and included a lot of explanatory comments, but we will ask you to add a few things here and there. For students with more programming background, the 'Going Further' section will give you a chance to explore some more advanced topics and efficiency issues.
We encourage you to work with a partner on the lab and discuss the questions and solutions with each other. Try to help with each other's weaknesses: if one student is stronger with programming, let the other student take the keyboard and give them a chance to work things out, offering help as needed. And perhaps that student can return the favor with the math!
At the end of today's lab session, please email me a short message including the following information:
I will post solutions to each lab but will not provide individual feedback; I just want to make sure that you are actually making an effort.
Before starting, you'll need to download the lab materials. We suggest creating a folder to hold all the labs (e.g., introCL-labs), with a subfolder for each one (e.g., lab1 for this lab).
Download the file lab1.py into your new folder. (In Windows, this can be done by right-clicking on the link and selecting Save link as..., then choosing the folder to save in.)
Using a Python editor or development environment, open the lab1.py file so you can see the code, and start a Python interpreter. (If you've installed Anaconda, run its development environment, called Spyder, and open the file there. The interpreter console is in the bottom right.)
Run the code in the lab1.py file. (If using Spyder, click on the green triangle.)
You should see a plot appear with two sets of bars, "True" and "Est". "True" plots the probabilities of the different outcomes defined by the distribution dictionary at the top of the main body of code. "Est" will eventually plot the probabilities as estimated from some data, although at the moment the values of the bars are incorrect.
Now look at the text that was printed out in the interpreter window. (In some versions of Python, you might need to close the plot window first.) By looking at the printout, the code, and/or debugging information in the top right window of Spyder, can you see what str_list, str_counts, and str_probs are intended to be, and what datatypes they are? Which of these three variables has an incorrect value?
Take a look at the function normalize_counts. What is this function supposed to do? What is it actually doing? Fix the function so that it does what it is supposed to do. (You may want to use the sum function to help you.)
What is a simple error check you could perform on the return value of normalize_counts to make sure it is a probability distribution?
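If you get stuck, here is a minimal sketch of what a working normalize_counts might look like, together with the kind of error check the question asks about. The function name follows the lab file, but the body is our reconstruction, assuming the counts are stored in a dictionary mapping outcomes to integer counts:

```python
def normalize_counts(counts):
    # Convert a dict of counts into a probability distribution
    # by dividing each count by the total number of observations.
    total = sum(counts.values())
    return {outcome: n / total for outcome, n in counts.items()}

probs = normalize_counts({'a': 3, 'b': 1})

# Error check: every value is non-negative, and the values sum
# to (approximately) 1, as a probability distribution must.
assert all(p >= 0 for p in probs.values())
assert abs(sum(probs.values()) - 1.0) < 1e-9
```

Note the approximate comparison in the sum check: because of floating-point rounding (see the note below), the values may sum to something like 0.9999999999999999 rather than exactly 1.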
Note: you may find as you are working through this lab that some of the numerical results you get are not quite what you expect; for example, you might get 0.20000000000000001 as the answer to 1/5. (If you want to test this in the interpreter under Python 2, you will need to type from __future__ import division first.) This happens because 1/5 is easy to represent exactly in base 10 (which we use), but not in base 2 (which the computer uses internally), so a tiny rounding error is introduced when the number is stored in base 2 and converted back to base 10 for display. Most programming languages show numbers to only five or six decimal places by default, so you won't see the error, but Python shows more digits, so you do see it.
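You can reproduce this effect in any recent Python interpreter. The exact digits printed for 1/5 vary by Python version, so the example below uses 0.1 + 0.2, which displays the rounding error in all current versions, and shows the usual remedy of comparing approximately rather than exactly:

```python
import math

# The stored binary values of 0.1 and 0.2 are each slightly off,
# so their sum is not exactly the stored binary value of 0.3.
print(repr(0.1 + 0.2))               # prints 0.30000000000000004

# For this reason, compare floats approximately, not with ==.
print(math.isclose(0.1 + 0.2, 0.3))  # prints True
```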
If you correctly fixed the normalize_counts function, you should be able to rerun the whole file and see a comparison between the true distribution over characters and an estimate of that distribution which is based on the observed counts of each character in the randomly generated sequence (which in turn was generated from the true distribution). What is the name for the kind of estimate produced here?
Now look at the plot comparing the true and estimated distributions. Notice that there are several letters with very low probability under the true distribution. You will probably see that some of them occur in the randomly generated sequence and some do not, just due to random chance. For the low-probability letters that do occur in the sequence, are their estimated probabilities generally higher or lower than the true probabilities? What about for the low-probability letters that do not occur in the sequence?
Change the code so that the length of the generated sequence is much smaller or much larger than 50 and rerun the code. Are the estimated probabilities more or less accurate with larger amounts of data? Are you able to get estimates that no longer include zero probabilities? Would it be possible to adjust the sample size in a natural language corpus to avoid zero probabilities while still using the same kind of probability estimation you have here? Why or why not?
In class, we discussed the likelihood, defined as the probability of the observed data given a particular model. Here, our true distribution and estimated distribution are different possible models. Let's find the likelihood of each model.
First, change the code back so that you are generating sequences of length 50 again. You may also want to comment out the line that produces the plot, since we won't be using it again.
Next, fill in the correct code in the compute_likelihood function so that it returns the probability of a sequence of data (first argument) given the model provided as the second argument.
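If you get stuck, the core of the computation is just a product of per-outcome probabilities, assuming the outcomes in the sequence are independent. A sketch (the function name follows the lab file; the argument names and body are our own):

```python
def compute_likelihood(data, model):
    # data: a sequence of outcomes (e.g., a string of characters)
    # model: a dict mapping each outcome to its probability
    # Under independence, the probability of the whole sequence is
    # the product of the probabilities of its outcomes.
    likelihood = 1.0
    for outcome in data:
        likelihood *= model[outcome]
    return likelihood

compute_likelihood('aab', {'a': 0.5, 'b': 0.5})  # 0.5 * 0.5 * 0.5 = 0.125
```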
The function you just defined is being used to compute two different likelihoods using the generated sequence of 50 characters: first, the likelihood of the true distribution, and then the likelihood of the estimated distribution. Look at the values being printed out. Which model assigns higher probability to the data? By how much? (You may want to run the code a few times to see how these values change depending on the particular random sequence that is generated each time.)
Note: make sure you are familiar with the floating-point notation used by Python (and other computer programs): a number like 1.2e4 means 1.2 x 10^4, or 12000; similarly, 1.2e-4 means 1.2 x 10^-4, or 0.00012.
Increase the length of the random sequence of data to 500 characters. You should find that your program now says both likelihoods are 0. Is that correct? Do you know why this happened?
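As a hint, the effect is easy to demonstrate directly: with 500 factors, even moderate per-character probabilities produce a product far below the smallest positive double-precision number (about 5e-324), so the result rounds to exactly 0.0. The probability 0.05 below is made up for illustration:

```python
# Multiply 500 probabilities of 0.05 each: 0.05 ** 500 is about
# 1e-650, far below the smallest positive float (~5e-324), so the
# running product eventually underflows to exactly 0.0.
p = 1.0
for _ in range(500):
    p *= 0.05
print(p)  # prints 0.0
```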
The problem you just saw is one practical reason for using logs in so many of the computations we do with probabilities. So, instead of computing the likelihood, we typically compute the log likelihood (or negative log likelihood).
One way to try to compute the log likelihood would be to just call the likelihood function you already wrote, and then take the log of the result. However, we don't do things that way. Why not?
To correctly compute the log likelihood, you will need to fill in the body of the compute_log_likelihood function with code that is specific to the purpose. Please use log base 10 (see note below). Hint: remember that log(xy) = log x + log y.
Note: For this lab, we ask you to use log base 10 because it has an intuitive interpretation. For any x that is a power of 10, log10 x is the number of places to shift the decimal point from 1 to get x. For example, log10 100 = 2, and log10 0.01 = -2. Numbers falling between powers of 10 have non-integer logs, but taking the floor of the log value still tells you the number's order of magnitude, i.e., the power of 10 in its scientific notation. Consider: what is the relationship between log10 x and the floating-point representation of x?
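As with compute_likelihood above, here is a hedged sketch of what the body might look like (the function name follows the lab file; the argument names are our own). Instead of multiplying probabilities, it sums their base-10 logs, so nothing ever underflows:

```python
import math

def compute_log_likelihood(data, model):
    # Summing log10 probabilities gives log10 of their product,
    # but stays in a representable range even for long sequences.
    total = 0.0
    for outcome in data:
        total += math.log10(model[outcome])
    return total

compute_log_likelihood('aab', {'a': 0.5, 'b': 0.5})  # about -0.903 (= log10 of 0.125)
```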
Now, what is the log likelihood of your random sequence of 500 characters?
Some of you have probably used the NumPy (numpy) library before. If you have, you can probably skip this section. If you haven't, and are planning to continue with numerical computing (including computational linguistics), it's a very useful package to know about---especially if you want to reuse or modify any of our code here!
There is a huge amount of stuff in NumPy, but one of the most basic concepts, and one we used here in several places, is the NumPy array datatype. An array stores a sequence of items, like a list, except that all the items must have the same type (e.g., all strings or all integers). If the items are numbers, then the array can be treated as a vector. To see how it works, let's create some arrays and lists we can manipulate. (Note that we imported numpy as np at the top of our file.)
l=range(4)
m=[1,0,2,3]
a=np.arange(4)
b=np.array([2,0,1,1])
c=np.array([2,3])
Now, try to predict what each of the following lines will do, then check to see if you are right. Note that a few of these statements will give you errors.
l
a
l+[1]
a+[1]
l+1
a+1
l+m
a+b
a+c
2*l
2*a
np.array(l)
a[b]
a[c]
a[m]
l[m]
sum(l)
sum(a)
np.product(c)
This section is just a tiny taste of NumPy, and even of arrays (e.g., we can create multi-dimensional arrays and use them as matrices with functions available for standard matrix operations). If you are going to be doing a lot of numeric programming in Python, we recommend spending the time to familiarize yourself with more of NumPy (and SciPy, another package for scientific computing in Python, which we will refer to in the Going Further section).
You don't need to understand how most of the functions in this lab are implemented in order to do the previous parts of the lab. However, you may want to reuse/modify some of them in the future, and for that reason it is a good idea to understand how they work. First, make sure you understand the previous section on NumPy arrays. Then, look at the following two statements, which are used in the generate_random_sequence function. Can you figure out what these statements do, and how the whole function works? You may need to try calling cumsum or digitize yourself on some examples and/or look up the documentation.
np.cumsum(a)
np.digitize([.1, .7, 2.3, 1.2, 3.1, .3], a)
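A small worked example may help here (the probabilities below are made up for illustration, not taken from the lab's distribution dictionary): cumsum turns the probabilities into cumulative interval boundaries, and digitize maps each uniform random draw to the interval it falls in, which is exactly sampling from the discrete distribution.

```python
import numpy as np

probs = np.array([0.2, 0.5, 0.3])    # probabilities of outcomes 0, 1, 2
bins = np.cumsum(probs)              # array([0.2, 0.7, 1.0]): cumulative boundaries
draws = np.random.random(10)         # 10 uniform samples in [0, 1)
outcomes = np.digitize(draws, bins)  # index of the interval each draw lands in
# A draw below 0.2 gives outcome 0; in [0.2, 0.7) gives 1; in [0.7, 1.0) gives 2,
# so each outcome is chosen with exactly its probability.
```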
You can go even further with this question by looking at the numpy.random.multinomial() function as well. It does not generate a sequence explicitly; instead, it returns only the total number of outcomes of each type. But if that is all you care about, is this function more efficient than the other two?
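For example, a single call replaces generating the whole sequence and then counting (the probabilities below are again made up for illustration):

```python
import numpy as np

# Draw 500 outcomes from a 3-way distribution; only the per-outcome
# counts come back, not the sequence of draws itself.
counts = np.random.multinomial(500, [0.2, 0.5, 0.3])
print(counts)  # e.g. [102 251 147]; the three entries always sum to 500
```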
If you pursue this question carefully, please send your code and results! It will be especially interesting to compare if we get results from multiple people.
In general, if you are interested in questions of program efficiency in Python, you may want to look at this page.