Sunday, June 17, 2007

Generative Model and Disriminative Model

==
Ref: MLWiki
A generative model is one which explicitly states how the observations are assumed to have been generated. Hence, it defines the joint probability of the data and latent variables of interest.



==
See also: Generative Model from wikipedia.

==
ref: A simple comparison
of
generative model (model likelihood and prior) <-- NB..
and discriminative model (model posterior) <-- SVM

==
ref: Classify Semantic Relations in Bioscience Texts.

Generative models learn the prior probability of the class and the probability of the features given the class; they are the natural choice in cases with hidden variables (partially observed or missing data). Since labeled data is expensive to collect, these models may be useful when no labels are available. However, in this paper we test the generative models on fully observed data and show that , although not as accurate as the discriminative model, their performance is promising enough to encourage their use for the case of partially observed data.

Discriminative models learn the probability of the class given the features. When we have fully observed data and we just need to learn the mapping from features to classes(classification), a discriminative approach may be more appropriate.

It must be pointed out that the neural network (discriminative model) is much slower than the graphical models (HMM-like generative models), and requires a great deal of memory.

Wednesday, June 6, 2007

Collocations

Chap 5 of foundations of statistical natural language processing

- a collocation is an expression consisting of two or more words that correspond to some conventional say of saying things.

- Collocations are characterized by limited compositionality.
(we call a natural language expression compositional if the meaning of the expression can be predicted from the meaning of the parts.)
Collocations are note fully compositional in that there is usually an element of meaning added to the combination.
--> non-compositionality
--> non-substitutability
--> non-modifiability


-term: the word term has a different meaning in information retrieval. There it refers to both words and phrases.

- a number of approaches to finding collocations:
a)selections by frequency,
raw frequency doesn't work.
With part of speech tag patterns, one gets surprisingly good result.<-- Justeson and Katz' method. hints: a simple quantitative technique combined with a small amount of linguistic knowledge goes a longway.
works well for fixed phrases.

b)selection based on mean and variance of the distance between focal word collocating word
scenario: the distance between two words in not constant so a fixed phrase approach would not work.

collocational window (usually a window of 3 to 4 words on each side fo a word)

Mean and variance o the offsets between two words in a corpus.

c)hypothesis testing (********)

in b) we can not make for sure that the high frequency and low variance of two words can be accidental. So we are also taking into account how much data we have seen. Even if there is a remarkable pattern, we will discount it if we haven't seen enough data to be certain that it couldn't be due to chance.

-->null hypothesis.
-->t test : assume that the probabilities are approximately normally distributed.
The t test looks at the mean and variance of a sample of measurements, where the null hypothesis is that the sample is drawn from a distribution with mean miu. The test looks at the difference between the observed and expected means, scaled by the variance of the data, and tells us how likely one is to get a sample of that mean and variance ( or a more extreme mean and variance) assuming that the sample is drawn from a normal distribution with mean miu.
(todo)

--> Chi-square test
The essence of the test is to compare the observed frequencies in the table with the frequencies expected for independence. If the difference between observed and expected frequencies is large, then we can eject the null hypothesis of independence.
d)mutual information


--> likelihood ratios:
more appropriate for sparse data than the chi-square test

Controlled Language

What are Controlled Natural Languages?
"Controlled Natural Languages are subsets of natural languages whose grammars and dictionaries have been restricted in order to reduce or eliminate both ambiguity and complexity. Traditionally, controlled languages fall into two major categories: those that improve readability for human readers, particularly non-native speakers, and those that improve computational processing of the text."

[link]