The Bayesian skip-gram model (Barkan, 2017) is a probabilistic interpretation of word2vec (Mikolov et al., 2013b). The left part of Figure 1 shows the generative process. For each word i in the vocabulary, the model draws a latent word embedding vector U_i ∈ R^E and a latent context embedding vector V_i ∈ R^E from a Gaussian prior N(0, λ²I). The model then constructs N labeled pairs of words following a two-step process. First, a proposal pair of words (i, j) is drawn from a uniform distribution over the vocabulary. Then, the model assigns to the proposal pair a binary label z_ij ~ Bern(σ(U_i^T V_j)), where σ(x) = 1/(1 + e^(−x)) is the logistic function. The pairs with label z_ij = 1 form the so-called positive examples, and are assumed to correspond to occurrences of the word i in the context of word j somewhere in the corpus. The so-called negative examples, with label z_ij = 0, do not correspond to any observation in the corpus. When training the model, we resort to the heuristics proposed in (Mikolov et al., 2013b) to create artificial evidence for the negative examples (see Section 3.2 below).
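To make the two-step generative process concrete, here is a minimal sketch in plain Java. It is not code from the paper: the vocabulary size, embedding dimension, and prior scale below are illustrative assumptions. It draws the embeddings from the Gaussian prior, proposes a pair uniformly, and labels it with a Bernoulli draw.

import java.util.Random;

public class BayesianSkipGramSketch {
    public static void main(String[] args) {
        int W = 1000;        // vocabulary size (assumed)
        int E = 50;          // embedding dimension (assumed)
        double lambda = 1.0; // std. dev. of the Gaussian prior (assumed)
        Random rng = new Random(0);

        // Draw latent word and context embeddings U_i, V_i ~ N(0, lambda^2 I).
        double[][] U = new double[W][E];
        double[][] V = new double[W][E];
        for (int i = 0; i < W; i++)
            for (int e = 0; e < E; e++) {
                U[i][e] = lambda * rng.nextGaussian();
                V[i][e] = lambda * rng.nextGaussian();
            }

        // Construct one labeled pair: propose (i, j) uniformly over the
        // vocabulary, then label it z_ij ~ Bern(sigmoid(U_i . V_j)).
        int i = rng.nextInt(W), j = rng.nextInt(W);
        double dot = 0.0;
        for (int e = 0; e < E; e++) dot += U[i][e] * V[j][e];
        double sigmoid = 1.0 / (1.0 + Math.exp(-dot));
        int z = rng.nextDouble() < sigmoid ? 1 : 0;
        System.out.printf("pair (%d, %d) labeled z = %d (p = %.3f)%n", i, j, z, sigmoid);
    }
}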
If you have ever trained word vectors yourself, you may have found that the model and the vector representations differ across trainings, even when you feed in the same training data. This is because of the randomness introduced at training time. Code can speak for itself, so let's take a look at where the randomness comes from and how to eliminate it thoroughly. I will use DL4j's implementation of paragraph vectors to show the code. If you want to look at another package, go to gensim's doc2vec, which uses the same method of implementation.

Where the randomness comes from

The initialization of weights and matrix

We know that before training, the weights of the model and the vector representation are initialized randomly, and the randomness is controlled by a seed. Here is the place where the seed takes effect:

syn0 = Nd4j.rand(new int[] {vocab.numWords(), vectorLength}, rng); // arguments after "new int" reconstructed for illustration

Hence, if we set the seed as 0, we will get exactly the same initialization every time.
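As a toy illustration of why a fixed seed removes this source of randomness, the following uses plain java.util.Random rather than DL4j's Nd4j, and the initWeights helper is hypothetical: two initializations with the same seed are identical.

import java.util.Arrays;
import java.util.Random;

public class SeededInitDemo {
    // Initialize n weights from a seeded RNG (toy stand-in for syn0 above).
    static double[] initWeights(long seed, int n) {
        Random rng = new Random(seed);
        double[] w = new double[n];
        for (int i = 0; i < n; i++) w[i] = rng.nextDouble() - 0.5;
        return w;
    }

    public static void main(String[] args) {
        double[] first  = initWeights(0L, 10);
        double[] second = initWeights(0L, 10);
        System.out.println(Arrays.equals(first, second)); // true: same seed, same initialization
    }
}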
Queue that provides sequences to every thread to train

Before training starts, DL4j initializes a sequencer that provides data to the worker threads:

AsyncSequencer sequencer = new AsyncSequencer(this.iterator, this.stopWords); // every thread points to the same sequencer
limitUpper = workers * batchSize * 2; // threads get data from the queue through the sequencer

This LinkedBlockingQueue gets sequences from the iterator over the training text and provides these sequences to each thread. Since the threads can arrive in a random order, each thread can get different sequences to train on in every run of training. Hence, if we set the number of workers as 1, training will run in a single thread and feed the data in exactly the same order every time. But notice that a single thread will tremendously slow down the training.

To summarize, the following is what we need to do to exclude the randomness thoroughly:

1. Set the seed as 0.
2. Set the number of workers as 1.

Then we will get exactly the same word vectors and paragraph vectors if we feed in the same data.
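To see why multiple workers introduce nondeterminism, here is a small standalone sketch in plain Java with hypothetical names, not DL4j's actual sequencer: several threads draining one shared LinkedBlockingQueue interleave in an order the scheduler decides, so each run consumes the sequences in a different order.

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class SharedQueueDemo {
    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<Integer> queue = new LinkedBlockingQueue<>();
        for (int i = 0; i < 20; i++) queue.put(i); // the "sequences" to train on

        Runnable worker = () -> {
            Integer seq;
            while ((seq = queue.poll()) != null) {
                // Which thread grabs which sequence depends on scheduling,
                // so the consumption order differs from run to run.
                System.out.println(Thread.currentThread().getName() + " got sequence " + seq);
            }
        };

        Thread t1 = new Thread(worker, "worker-1");
        Thread t2 = new Thread(worker, "worker-2");
        t1.start(); t2.start();
        t1.join(); t2.join();
    }
}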
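Putting the two points together, here is a hedged sketch of a deterministic configuration. The builder methods follow DL4j's ParagraphVectors/Word2Vec builders as I recall them and may differ across versions, and the iterator parameter stands in for your own data pipeline.

import org.deeplearning4j.models.paragraphvectors.ParagraphVectors;
import org.deeplearning4j.text.documentiterator.LabelAwareIterator;
import org.deeplearning4j.text.tokenization.tokenizerfactory.DefaultTokenizerFactory;
import org.deeplearning4j.text.tokenization.tokenizerfactory.TokenizerFactory;

public class DeterministicParagraphVectors {
    public static ParagraphVectors build(LabelAwareIterator iterator) {
        TokenizerFactory tokenizer = new DefaultTokenizerFactory();
        return new ParagraphVectors.Builder()
                .seed(0)           // point 1: fixed seed, identical weight initialization
                .workers(1)        // point 2: single thread, identical data order
                .iterate(iterator)
                .tokenizerFactory(tokenizer)
                .build();
    }
}

With this configuration, calling fit() on the built model with the same data should then produce the same vectors every time, at the cost of single-threaded training speed.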