Word2Vec implementation for the .NET Framework

  • Source name: Word2Vec.Net
  • Source URL: http://www.github.com/eabdullin/Word2Vec.Net
  • Git URL:
    git://www.github.com/eabdullin/Word2Vec.Net.git
  • Git clone to a local copy:
    git clone http://www.github.com/eabdullin/Word2Vec.Net
  • Subversion checkout to a local copy:
    $ svn co --depth empty http://www.github.com/eabdullin/Word2Vec.Net
    Checked out revision 1.
    $ cd repo
    $ svn up trunk
  • Word2Vec.Net


    Word2Vec for the .NET Framework (https://code.google.com/p/word2vec/)

    #Getting Started

    ##Using

    int i;
    var builder = Word2VecBuilder.Create();
    if ((i = ArgPos("-train", args)) > -1)
        builder.WithTrainFile(args[i + 1]);
    if ((i = ArgPos("-output", args)) > -1)
        builder.WithOutputFile(args[i + 1]);
    // all other parameters will be set to default values
    var word2Vec = builder.Build();
    word2Vec.TrainModel();
    var distance = new Distance(args[i + 1]);
    BestWord[] bestwords = distance.Search("some_word");
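
    ArgPos is not defined in the snippet above; it mirrors the argument-scanning helper from the original word2vec command-line tool. A minimal sketch of such a helper (a hypothetical addition, not part of the Word2Vec.Net API) could look like this:

    // Hypothetical helper: returns the index of a command-line flag (e.g. "-train") in args, or -1 if it is absent.
    static int ArgPos(string flag, string[] args)
    {
        for (int a = 0; a < args.Length; a++)
            if (args[a] == flag)
                return a;
        return -1;
    }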

    Or

    // more explicit options
    string trainfile = "C:/data.txt";
    string outputFileName = "C:/output.bin";
    var word2Vec = Word2VecBuilder.Create()
        .WithTrainFile(trainfile)          // Use text data to train the model
        .WithOutputFile(outputFileName)    // Use <file> to save the resulting word vectors / word clusters
        .WithSize(200)                     // Set size of word vectors; default is 100
        .WithSaveVocubFile()               // The vocabulary will be saved to <file>
        .WithDebug(2)                      // Set the debug mode (default = 2 = more info during training)
        .WithBinary(1)                     // Save the resulting vectors in binary mode; default is 0 (off)
        .WithCBow(1)                       // Use the continuous bag of words model; default is 1 (use 0 for skip-gram model)
        .WithAlpha(0.05)                   // Set the starting learning rate; default is 0.025 for skip-gram and 0.05 for CBOW
        .WithWindow(7)                     // Set max skip length between words; default is 5
        .WithSample((float) 1e-3)          // Set threshold for occurrence of words. Words that appear with higher frequency in the training data will be randomly down-sampled; default is 1e-3, useful range is (0, 1e-5)
        .WithHs(0)                         // Use hierarchical softmax; default is 0 (not used)
        .WithNegative(5)                   // Number of negative examples; default is 5, common values are 3 - 10 (0 = not used)
        .WithThreads(5)                    // Use <int> threads (default 12)
        .WithIter(5)                       // Run more training iterations (default 5)
        .WithMinCount(5)                   // This will discard words that appear less than <int> times; default is 5
        .WithClasses(0)                    // Output word classes rather than word vectors; default number of classes is 0 (vectors are written)
        .Build();
    word2Vec.TrainModel();
    var distance = new Distance(outputFileName);
    BestWord[] bestwords = distance.Search("some_word");
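
    Search returns the words closest to the query; a minimal sketch of printing the results, assuming BestWord exposes Word and Distance properties (check the library source for the exact member names):

    // Print each nearest word and its similarity score (property names are assumed).
    foreach (BestWord bestWord in bestwords)
        Console.WriteLine("{0}\t{1:F4}", bestWord.Word, bestWord.Distance);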

    ##Information from Google word2vec

    ###Tools for computing distributed representations of words

    We provide an implementation of the Continuous Bag-of-Words (CBOW) and the Skip-gram (SG) models, as well as several demo scripts.

    Given a text corpus, the word2vec tool learns a vector for every word in the vocabulary using the Continuous Bag-of-Words or the Skip-gram neural network architecture. The user should specify the following:

    • desired dimensionality of the word vectors
    • size of the context window for either the Skip-gram or the Continuous Bag-of-Words model
    • training algorithm: hierarchical softmax and/or negative sampling
    • threshold for down-sampling the frequent words
    • number of threads to use
    • format of the output word vector file (text or binary)

    Usually, other hyperparameters such as the learning rate do not need to be tuned for different training sets.

    The script demo-word.sh downloads a small (100 MB) text corpus from the web and trains a small word vector model. After the training is finished, the user can interactively explore the similarity of words.
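
    A comparable end-to-end flow with Word2Vec.Net might look like the sketch below; the file paths, parameter values and the interactive loop are illustrative placeholders rather than a port of demo-word.sh, and the BestWord property names are assumptions:

    // Sketch: train a small model on a local plain-text corpus, then explore word similarities interactively.
    var word2Vec = Word2VecBuilder.Create()
        .WithTrainFile("C:/corpus/text.txt")     // plain-text training corpus (placeholder path)
        .WithOutputFile("C:/corpus/vectors.bin") // where the trained vectors are written (placeholder path)
        .WithSize(200)
        .WithBinary(1)
        .Build();
    word2Vec.TrainModel();

    var distance = new Distance("C:/corpus/vectors.bin");
    while (true)
    {
        Console.Write("Enter a word (or EXIT to quit): ");
        string word = Console.ReadLine();
        if (word == null || word == "EXIT")
            break;
        foreach (BestWord bestWord in distance.Search(word))
            Console.WriteLine("{0}\t{1:F4}", bestWord.Word, bestWord.Distance);
    }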

    More information about the scripts is available at https://code.google.com/p/word2vec/.


