
A Knowledge Map for Big Data Engineers

2015-06-25 13:16

http://yanbohappy.sinaapp.com/?cat=32


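What knowledge does it actually take to work on big data in an enterprise? I think the question should be viewed from two angles: technology and business. The technology side mainly involves probability and mathematical statistics, computer systems, algorithms, and programming; the business side varies with each company's line of business. Engineers who work on big data need to learn to apply data mining methods, with the help of computer systems and programming tools, to solve real problems. Only then can they mine massive data for whatever drives business growth and create more value for the company amid fierce market competition.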

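Business differs from company to company, but the technical points are common to all of them. Here I briefly summarize the technical knowledge that big data engineers need to master. It mainly covers the fundamentals of databases, data warehouses, programming, distributed systems, the Hadoop ecosystem, data mining, and machine learning. Of course, what I list here is what the members of a team should have collectively; each person will have a different emphasis depending on their role in the team. I offer this as a starting point for discussion, and comments are welcome.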

| Topic | Content | Key points | Reference |
|---|---|---|---|
| DB/OLTP & DW/OLAP | Database/OLTP basics | The relational model, SQL, index/secondary index, inner join/left join/right join/full join, transaction/ACID | Ramakrishnan, Raghu, and Johannes Gehrke. Database Management Systems. |
| | Database internals & implementation | Architecture, memory management, storage/B+ tree, query parsing/optimization/execution, hash join/sort-merge join | |
| | Distributed and parallel databases | Sharding, database proxy | |
| | Data warehouse/OLAP | Materialized views, ETL, column-oriented storage, reporting, BI tools | |
| Basic programming | Programming languages | Java, Python (Pandas/NumPy/SciPy/scikit-learn), SQL, functional programming, R/SAS/SPSS | Wes McKinney. Python for Data Analysis: Agile Tools for Real World Data. |
| | OS | Linux | |
| | DB & DW systems | MySQL/Hive/Impala | |
| | Text formats and processing | JSON/XML, regex | |
| | Tools | Git/SVN, Maven | |
| Distributed system & Hadoop ecosystem & NoSQL | Distributed system principles and theory | CAP theorem, RPC (Protocol Buffers/Thrift/Avro), ZooKeeper, metadata management (HCatalog) | |
| | Distributed storage & computing framework & resource management | Hadoop/HDFS/MapReduce/YARN | Tom White. Hadoop: The Definitive Guide. Donald Miner, Adam Shook. MapReduce Design Patterns: Building Effective Algorithms and Analytics for Hadoop and Other Systems. |
| | SQL on Hadoop: data (log) acquisition/integration/fusion, normalization, feature extraction | Sqoop, Flume/Scribe/Chukwa, SerDe | Edward Capriolo, Dean Wampler, Jason Rutherglen. Programming Hive. |
| | SQL on Hadoop: query & in-database analytics | Hive, Impala, UDF/UDAF | |
| | Large-scale data mining & machine learning frameworks | Spark/MLbase, MR/Mahout | |
| | Stream processing | Storm | |
| | NoSQL | HBase/Cassandra (column-oriented databases) | Lars George. HBase: The Definitive Guide. |
| | | MongoDB (document database) | |
| | | Neo4j (graph database) | |
| | | Redis (cache) | |
| Data mining & Machine learning | DM & ML basics | Numerical/categorical variables, training/test data, overfitting, bias/variance, precision/recall, tagging | |
| | Statistics | Data exploration (mean, median, range, standard deviation, variance, histogram), probability distributions (Normal/Gaussian, Poisson), covariance, correlation coefficient, distance and similarity computation, Bayes' theorem, Monte Carlo method, hypothesis testing | |
| | Supervised learning | Classifiers, boosting, prediction, regression analysis | Han, Jiawei, Micheline Kamber, and Jian Pei. Data Mining: Concepts and Techniques. |
| | Unsupervised learning | Clustering, deep learning | |
| | Collaborative filtering | Item-based CF, user-based CF | |
| | Algorithms: classifiers | Decision trees, KNN (k-nearest neighbors), SVM (support vector machines), SVD (singular value decomposition), naïve Bayes classifiers, neural networks | |
| | Algorithms: regression | Linear regression, logistic regression, ranking, perceptron | |
| | Algorithms: clustering | Hierarchical clustering, K-means clustering, spectral clustering | |
| | Algorithms: dimensionality reduction | PCA (principal component analysis), LDA (linear discriminant analysis), MDS (multidimensional scaling) | |
| | Text mining & information retrieval | Corpus, term-document matrix, term frequency & weight, association rules, market basket analysis, vocabulary mapping, sentiment analysis, tagging, PageRank, VSM (vector space model), inverted index | Jimmy Lin and Chris Dyer. Data-Intensive Text Processing with MapReduce. |
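
The text mining row of the table lists term frequency and the inverted index among its key points. As a rough illustration of how the two fit together, here is a minimal Python sketch that builds an inverted index storing per-document term frequencies. The function names `build_inverted_index` and `search`, the sample documents, and the whitespace tokenizer are all assumptions made for this example; a real system would add proper tokenization, stop-word handling, and a weighting scheme such as TF-IDF.

```python
from collections import Counter, defaultdict


def build_inverted_index(docs):
    """Map each term to {doc_id: term frequency}, given a dict of doc_id -> text."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        # Naive tokenization (an assumption): lowercase, split on whitespace.
        counts = Counter(text.lower().split())
        for term, tf in counts.items():
            index[term][doc_id] = tf
    return dict(index)


def search(index, term):
    """Return ids of documents containing the term, highest term frequency first."""
    postings = index.get(term.lower(), {})
    # Sort by descending frequency, then by doc id to keep the order deterministic.
    return sorted(postings, key=lambda doc_id: (-postings[doc_id], doc_id))


# Hypothetical sample corpus, invented for illustration only.
docs = {
    "d1": "hadoop stores data in hdfs",
    "d2": "hive runs sql on hadoop and hadoop runs mapreduce",
    "d3": "redis is a cache",
}

index = build_inverted_index(docs)
print(search(index, "hadoop"))
```

Storing the frequency inside each posting lets a lookup rank documents without rescanning their text, which is the basic idea that the vector space model in the same row builds on.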

This entry was posted in Data Mining, Data Warehouse, Database, Hadoop, HBase, Hive, Impala, Machine Learning, NewSQL, NoSQL, PostgreSQL and tagged BigData, Data Mining, Data Warehouse, Database, Hadoop, HBase, Machine Learning on November 5, 2013 by ybliang8@gmail.com.
