Wednesday, June 6, 2007

Collocations

Chapter 5 of Foundations of Statistical Natural Language Processing

- a collocation is an expression consisting of two or more words that corresponds to some conventional way of saying things.

- Collocations are characterized by limited compositionality.
(we call a natural language expression compositional if the meaning of the expression can be predicted from the meaning of the parts.)
Collocations are not fully compositional in that there is usually an element of meaning added to the combination.
--> non-compositionality
--> non-substitutability
--> non-modifiability


- term: the word "term" has a different meaning in information retrieval, where it refers to both words and phrases.

- a number of approaches to finding collocations:
a) selection by frequency
raw frequency alone doesn't work: the most frequent word pairs are mostly function-word pairs such as "of the".
With part-of-speech tag patterns, one gets surprisingly good results. <-- Justeson and Katz's method. Hint: a simple quantitative technique combined with a small amount of linguistic knowledge goes a long way.
works well for fixed phrases.
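
A minimal sketch of the frequency-plus-tag-filter idea, assuming the corpus is already part-of-speech tagged. The one-letter tags, the two patterns kept here (adjective-noun and noun-noun), the function name and the toy sentence are all illustrative assumptions, not the full pattern list from Justeson and Katz.

```python
from collections import Counter

# Assumed tag set: A = adjective, N = noun, V = verb, D = determiner,
# P = preposition, C = conjunction. Only two bigram patterns are kept here.
PATTERNS = {("A", "N"), ("N", "N")}

def candidate_bigrams(tagged_tokens):
    """Count adjacent word pairs whose tag sequence matches an allowed pattern."""
    counts = Counter()
    for (w1, t1), (w2, t2) in zip(tagged_tokens, tagged_tokens[1:]):
        if (t1, t2) in PATTERNS:
            counts[(w1.lower(), w2.lower())] += 1
    return counts

# Hypothetical toy corpus, already tagged.
tagged = [
    ("the", "D"), ("stock", "N"), ("market", "N"), ("fell", "V"),
    ("in", "P"), ("the", "D"), ("last", "A"), ("week", "N"),
    ("and", "C"), ("the", "D"), ("stock", "N"), ("market", "N"),
    ("rose", "V"), ("of", "P"), ("the", "D"),
]

counts = candidate_bigrams(tagged)
for pair, freq in counts.most_common():
    print(freq, " ".join(pair))
```

The filter drops pairs like "of the" before counting, so the frequency ranking is only over phrase-like candidates.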

b) selection based on the mean and variance of the distance between the focal word and the collocating word
scenario: the distance between the two words is not constant, so a fixed-phrase approach would not work.

collocational window (usually a window of 3 to 4 words on each side of a word)

Compute the mean and variance of the offsets between the two words in a corpus. A low variance means the two words usually occur at about the same distance from each other.
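
A small sketch of the offset statistics, assuming tokenized sentences and a window of 4 words on each side. The function name and the three example sentences are made up for illustration; the variance is the sample variance (dividing by n - 1).

```python
from statistics import mean, variance

def offsets(tokens, focal, collocate, window=4):
    """Signed distances (collocate index minus focal index) found inside the window."""
    found = []
    for i, tok in enumerate(tokens):
        if tok != focal:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i and tokens[j] == collocate:
                found.append(j - i)
    return found

# Hypothetical sentences; each is processed separately so that
# pairs are never formed across sentence boundaries.
sentences = [
    "she knocked on his door",
    "they knocked at the door",
    "he knocked on the front door",
]

ds = []
for s in sentences:
    ds.extend(offsets(s.split(), "knocked", "door"))

print(ds)            # the collected offsets
print(mean(ds))      # mean offset
print(variance(ds))  # sample variance of the offsets
```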

c)hypothesis testing (********)

in b) we cannot rule out that the high frequency and low variance of two words are accidental. So we also take into account how much data we have seen. Even if there is a remarkable pattern, we will discount it if we haven't seen enough data to be certain that it couldn't be due to chance.

-->null hypothesis.
--> t test: assumes that the probabilities are approximately normally distributed.
The t test looks at the mean and variance of a sample of measurements, where the null hypothesis is that the sample is drawn from a distribution with mean μ. The test looks at the difference between the observed and expected means, scaled by the variance of the data, and tells us how likely one is to get a sample of that mean and variance (or a more extreme mean and variance) assuming that the sample is drawn from a normal distribution with mean μ.
(todo)
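
One way to apply the t test to a bigram, as a sketch: treat each bigram position in the corpus as a Bernoulli trial (1 if it is the bigram of interest, 0 otherwise), take μ = P(w1)P(w2) as the mean expected under independence, and compute t = (x̄ - μ) / sqrt(s²/N). The function name and the counts below are illustrative assumptions.

```python
import math

def t_score(c1, c2, c12, n):
    """t statistic for the bigram (w1, w2) under the null hypothesis of independence.

    c1, c2: counts of w1 and w2; c12: count of the bigram; n: corpus size in tokens.
    """
    mu = (c1 / n) * (c2 / n)   # expected bigram probability if w1 and w2 are independent
    x_bar = c12 / n            # observed bigram probability (the sample mean)
    s2 = x_bar * (1 - x_bar)   # Bernoulli variance, close to x_bar when x_bar is tiny
    return (x_bar - mu) / math.sqrt(s2 / n)

# Illustrative counts for a bigram seen 8 times in a corpus of about 14 million tokens.
t = t_score(c1=15828, c2=4675, c12=8, n=14307668)
print(round(t, 3))
```

With these counts t comes out at about 1, well below 2.576 (the large-sample critical value at the one-tailed 0.005 level), so the null hypothesis of independence cannot be rejected.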

--> Chi-square test
The essence of the test is to compare the observed frequencies in the table with the frequencies expected under independence. If the difference between observed and expected frequencies is large, then we can reject the null hypothesis of independence.
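
For a bigram the table is 2-by-2 (w1 or not w1, followed by w2 or not w2), and the statistic has a closed form. A sketch, where the function name and the counts are illustrative assumptions:

```python
def chi_square_2x2(o11, o12, o21, o22):
    """Pearson chi-square statistic for a 2x2 contingency table of observed counts."""
    n = o11 + o12 + o21 + o22
    numerator = n * (o11 * o22 - o12 * o21) ** 2
    denominator = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
    return numerator / denominator

# Illustrative counts: c1 = count(w1), c2 = count(w2), c12 = count(w1 w2), n = bigrams.
c1, c2, c12, n = 15828, 4675, 8, 14307668
chi2 = chi_square_2x2(
    o11=c12,                 # w1 followed by w2
    o12=c2 - c12,            # not-w1 followed by w2
    o21=c1 - c12,            # w1 followed by not-w2
    o22=n - c1 - c2 + c12,   # neither
)
print(round(chi2, 2))
```

The value is compared with the chi-square critical value for one degree of freedom (3.841 at the 0.05 level); a statistic below it means independence cannot be rejected.
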
d)mutual information


--> likelihood ratios:
more appropriate for sparse data than the chi-square test
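
A sketch of a likelihood ratio for a bigram, assuming the usual binomial formulation: hypothesis 1 says P(w2 | w1) = P(w2 | not w1) = p, hypothesis 2 says the two probabilities differ (p1 and p2). The statistic -2 log λ is asymptotically chi-square distributed. Function names and counts are illustrative assumptions.

```python
import math

def log_l(k, n, x):
    """Log binomial likelihood of k successes in n trials with success probability x
    (without the binomial coefficient, which cancels in the ratio).
    Terms with a zero count are skipped, treating 0 * log(0) as 0."""
    total = 0.0
    if k > 0:
        total += k * math.log(x)
    if n - k > 0:
        total += (n - k) * math.log(1 - x)
    return total

def log_likelihood_ratio(c1, c2, c12, n):
    """-2 log lambda for the bigram (w1, w2); larger means stronger association."""
    p = c2 / n                   # P(w2) under independence
    p1 = c12 / c1                # P(w2 | w1)
    p2 = (c2 - c12) / (n - c1)   # P(w2 | not w1)
    log_lambda = (
        log_l(c12, c1, p) + log_l(c2 - c12, n - c1, p)
        - log_l(c12, c1, p1) - log_l(c2 - c12, n - c1, p2)
    )
    return -2 * log_lambda

llr = log_likelihood_ratio(c1=15828, c2=4675, c12=8, n=14307668)
print(llr)
```

Because it works with likelihoods directly rather than a normal approximation to the counts, the score stays interpretable even when the bigram count is very small.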
