Automatically exported from code.google.com/p/word2vec

Tools for computing distributed representations of words
---------------------------------------------------------

We provide implementations of the Continuous Bag-of-Words (CBOW) and Skip-gram (SG) models, as well as several demo scripts.

Given a text corpus, the word2vec tool learns a vector for every word in the vocabulary using the Continuous
Bag-of-Words or the Skip-gram neural network architecture. The user should specify the following (an example
invocation is shown after the list):
 - desired vector dimensionality
 - size of the context window for either the Skip-gram or the Continuous Bag-of-Words model
 - training algorithm: hierarchical softmax and / or negative sampling
 - threshold for downsampling the frequent words
 - number of threads to use
 - format of the output word vector file (text or binary)
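
For example, a minimal training run might look like the following sketch (corpus.txt and
vectors.bin are placeholder file names; the flags are those accepted by the word2vec binary
in this repository, and running ./word2vec with no arguments prints the full list):

  ./word2vec -train corpus.txt -output vectors.bin -cbow 1 -size 200 -window 5 \
    -negative 5 -hs 0 -sample 1e-4 -threads 12 -binary 1

Here -cbow 1 selects the Continuous Bag-of-Words architecture (-cbow 0 selects Skip-gram),
-size sets the vector dimensionality, -window the context size, -negative the number of
negative samples (-hs 1 would enable hierarchical softmax), -sample the downsampling
threshold, -threads the number of worker threads, and -binary the output format.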

Usually, the other hyper-parameters such as the learning rate do not need to be tuned for different training sets. 

The script demo-word.sh downloads a small (100MB) text corpus from the web and trains a small word vector model.
After training finishes, the user can interactively explore word similarities.
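
As a sketch of that workflow (text8 is the corpus the script downloads; the exact output
file name may depend on the script version), the steps amount to:

  ./demo-word.sh          # downloads the text8 corpus, builds the tools, trains a model
  ./distance vectors.bin  # then type a word, e.g. france, to list its nearest neighbors

The distance tool can also be run directly at any later time on a previously trained
vector file.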

More information about the scripts is provided at https://code.google.com/p/word2vec/