qfdk's Blog

A Summary of Training word2vec on Wikipedia

2017-03-05 word2vec 458


First, download the data dump from Wikipedia:

wget https://dumps.wikimedia.org/zhwiki/20170301/zhwiki-20170301-pages-articles-multistream.xml.bz2
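Before processing the dump it can be useful to confirm that the download is intact. The following is a minimal sketch, not part of the original workflow: the helper name `peek_dump` is hypothetical, and it only assumes that the file is a bz2-compressed UTF-8 text file. It reads the first few lines without decompressing the whole archive.

```python
import bz2
from itertools import islice


def peek_dump(path, n_lines=5):
    """Return the first n_lines of a bz2-compressed UTF-8 text file."""
    # bz2.open in text mode decompresses lazily, so only the beginning
    # of the archive is read, however large the dump is.
    with bz2.open(path, mode="rt", encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in islice(handle, n_lines)]


if __name__ == "__main__":
    # File name taken from the wget command above.
    for line in peek_dump("zhwiki-20170301-pages-articles-multistream.xml.bz2"):
        print(line)
```

If the file was truncated during download, decompression will typically fail with an error instead of returning text.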

Once the download has finished, the text needs to be processed.

# Download the extraction script

git clone https://github.com/attardi/wikiextractor.git wikiextractor

python wikiextractor/WikiExtractor.py -b 2000M -o zhwiki_extr......
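As a rough illustration of what comes after extraction, here is a minimal sketch that splits the extracted text into articles. It assumes the extractor's default output, in which each article is wrapped in a `<doc id="..." url="..." title="...">` line and a closing `</doc>` line; the function name `iter_articles` is hypothetical, and the format should be checked against the files actually produced.

```python
import re

# Assumed opening line of each article in the extractor's default output.
DOC_OPEN = re.compile(r'<doc id="([^"]*)" url="([^"]*)" title="([^"]*)">')


def iter_articles(lines):
    """Yield (title, text) pairs from an iterable of extracted lines."""
    title, body = None, []
    for line in lines:
        line = line.rstrip("\n")
        match = DOC_OPEN.match(line)
        if match:
            # A new article starts: remember its title and reset the body.
            title, body = match.group(3), []
        elif line == "</doc>" and title is not None:
            # The article ends: emit it and wait for the next opening line.
            yield title, "\n".join(body).strip()
            title = None
        elif title is not None:
            body.append(line)
```

Any open text file can be passed as `lines`, so large output files are processed one line at a time. Chinese text would still need word segmentation before it is fed to word2vec; that step is not shown here.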

Apache nifi with Elasticsearch experiment

2016-12-13 Apache elasticsearch 5329

1. General

1.1 HORTONWORKS DATAFLOW (HDF™)

HDF makes streaming analytics faster and easier, by enabling accelerated data collection, curation, analysis and delivery in real-time, on-premises or in the cloud through an integrated solution with Apache NiFi, Kafka and Storm.

1.2 What is Apache NiFi

......

Apache Spark with Pipeline and LDA

2016-09-21 Apache Spark 347

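I recently got hooked on Apache Spark. This distributed framework showed me what "big data" really means, along with some of the problems you run into when processing it. First, the language: Scala, of course. It feels rather unnatural at first, but after some time exploring it I found it genuinely pleasant to use and even came to like it, as long as it doesn't throw errors.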

My working environment:

Scala IDE

Scala 2.10.6

Apache Spark 1.6.1

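Apache Zeppelin

These versions must match one another, otherwise you are in for serious trouble. The cluster is managed with Ambari, a tool that makes it easy to work with the cluster through a graphical interface.

Here we use the Spark on YARN mode, in which submission......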
