How Does the GloVe Encoder Work

15.5.1. Skip-Gram with Global Corpus Statistics

Denoting by $q_{ij}$ the conditional probability $P(w_j \mid w_i)$ of word $w_j$ given word $w_i$ in the skip-gram model, we have

$$q_{ij} = \frac{\exp(\mathbf{u}_j^\top \mathbf{v}_i)}{\sum_{k \in \mathcal{V}} \exp(\mathbf{u}_k^\top \mathbf{v}_i)}, \tag{15.5.1}$$

where for any index $i$, vectors $\mathbf{v}_i$ and $\mathbf{u}_i$ represent word $w_i$ as the center word and context word, respectively, and $\mathcal{V} = \{0, 1, \ldots, |\mathcal{V}|-1\}$ is the index set of the vocabulary.
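As a concrete illustration of (15.5.1), $q_{ij}$ is simply a softmax over the inner products between the center-word vector of $w_i$ and every context-word vector. The following minimal NumPy sketch uses made-up array names and tiny sizes purely for illustration, not any particular library's API.

```python
import numpy as np

np.random.seed(0)
embed_dim, vocab_size = 4, 6                 # tiny, made-up sizes
v_i = np.random.randn(embed_dim)             # center-word vector of w_i
U = np.random.randn(vocab_size, embed_dim)   # row k is the context vector u_k

scores = U @ v_i                             # u_k^T v_i for every k in V
q_i = np.exp(scores) / np.exp(scores).sum()  # softmax: q_ij for every j

print(q_i.sum())  # ~1.0 -- a valid conditional distribution over the vocabulary
```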

Consider word $w_i$ that may occur multiple times in the corpus. In the entire corpus, all the context words wherever $w_i$ is taken as their center word form a multiset $\mathcal{C}_i$ of word indices that allows for multiple instances of the same element. For any element, its number of instances is called its multiplicity. To illustrate with an example, suppose that word $w_i$ occurs twice in the corpus and the indices of the context words that take $w_i$ as their center word in the two context windows are $k, j, m, k$ and $k, l, k, j$. Thus, multiset $\mathcal{C}_i = \{j, j, k, k, k, k, l, m\}$, where the multiplicities of elements $j, k, l, m$ are 2, 4, 1, 1, respectively.
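To make the multiset concrete, here is a minimal sketch that rebuilds $\mathcal{C}_i$ and its multiplicities for the two context windows from the example above, using Python's `collections.Counter`; the string labels simply stand in for the word indices $k, j, m, k$ and $k, l, k, j$.

```python
from collections import Counter

# Context-word indices from the two windows where w_i is the center word
window_1 = ["k", "j", "m", "k"]
window_2 = ["k", "l", "k", "j"]

# The multiset C_i is the concatenation of all such windows;
# Counter records the multiplicity of each element.
C_i = Counter(window_1 + window_2)
print(C_i)                 # Counter({'k': 4, 'j': 2, 'm': 1, 'l': 1})
print(sum(C_i.values()))   # |C_i| = 8
```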

Now let’s denote the multiplicity of element $j$ in multiset $\mathcal{C}_i$ as $x_{ij}$. This is the global co-occurrence count of word $w_j$ (as the context word) and word $w_i$ (as the center word) in the same context window in the entire corpus. Using such global corpus statistics, the loss function of the skip-gram model is equivalent to

$$-\sum_{i \in \mathcal{V}} \sum_{j \in \mathcal{V}} x_{ij} \log\, q_{ij}. \tag{15.5.2}$$


We further denote by $x_i$ the number of all the context words in the context windows where $w_i$ occurs as their center word, which is equivalent to $|\mathcal{C}_i|$. Letting $p_{ij}$ be the conditional probability $x_{ij}/x_i$ for generating context word $w_j$ given center word $w_i$, (15.5.2) can be rewritten as

$$-\sum_{i \in \mathcal{V}} x_i \sum_{j \in \mathcal{V}} p_{ij} \log\, q_{ij}. \tag{15.5.3}$$

In (15.5.3), $-\sum_{j \in \mathcal{V}} p_{ij} \log\, q_{ij}$ calculates the cross-entropy between the conditional distribution $p_{ij}$ of global corpus statistics and the conditional distribution $q_{ij}$ of model predictions. As explained above, this loss is also weighted by $x_i$. Minimizing the loss function in (15.5.3) will allow the predicted conditional distribution to get close to the conditional distribution from the global corpus statistics.
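Continuing the running example, the sketch below computes $x_i$, the empirical distribution $p_{ij}$, and the contribution of one center word $w_i$ to the loss in (15.5.3). The model distribution `q_i` is a made-up placeholder for whatever the skip-gram model currently predicts; words with $x_{ij} = 0$ drop out because $p_{ij} = 0$.

```python
import math

# Global co-occurrence counts x_ij for center word w_i (from the multiset example)
x_ij = {"j": 2, "k": 4, "l": 1, "m": 1}
x_i = sum(x_ij.values())                      # |C_i| = 8

# Empirical conditional distribution p_ij = x_ij / x_i
p_i = {w: c / x_i for w, c in x_ij.items()}

# Placeholder model predictions q_ij (assumed values, must sum to 1 over the vocabulary)
q_i = {"j": 0.3, "k": 0.4, "l": 0.2, "m": 0.1}

# Term of (15.5.3) for this center word: x_i times the cross-entropy between p_i and q_i
loss_i = -x_i * sum(p_i[w] * math.log(q_i[w]) for w in x_ij)
print(loss_i)
```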

Though commonly used for measuring the distance between probability distributions, the cross-entropy loss function may not be a good choice here. On the one hand, as we mentioned in Section 15.2, the cost of properly normalizing $q_{ij}$ results in a sum over the entire vocabulary, which can be computationally expensive. On the other hand, a large corpus contains a large number of rare events, and the cross-entropy loss often assigns them too much weight.
