A Knowledge Map for Big Data Engineers

2015-06-25 13:16
http://yanbohappy.sinaapp.com/?cat=32

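What do you actually need to know to work on big data in a company? I think the answer has two sides: technology and business. On the technology side, the main areas are probability and mathematical statistics, computer systems, algorithms, and programming. The business side varies with each company's line of business. Big data engineers need to learn to apply data mining methods, with the help of computer systems and programming tools, to solve real problems. Only then can they dig drivers of business growth out of massive data sets and create more value for the company in a fiercely competitive market.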

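Business differs from company to company, but the technical points are common to all. Here I briefly summarize the technical knowledge that big data engineers need to master. It mainly covers the fundamentals of databases, data warehouses, programming, distributed systems, the Hadoop ecosystem, data mining, and machine learning. Of course, what I list here is what the members of a team should cover collectively; each person will emphasize different areas depending on their role in the team. I offer this as a starting point for discussion, and comments are welcome.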

| Topic | Content | Key points | Reference |
| --- | --- | --- | --- |
| DB/OLTP & DW/OLAP | Database/OLTP basics | Relational model, SQL, index/secondary index, inner join/left join/right join/full join, transaction/ACID | Ramakrishnan, Raghu, and Johannes Gehrke. Database Management Systems. |
| | Database internals & implementation | Architecture, memory management, storage/B+ tree, query parsing/optimization/execution, hash join/sort-merge join | |
| | Distributed and parallel databases | Sharding, database proxy | |
| | Data warehouse/OLAP | Materialized views, ETL, column-oriented storage, reporting, BI tools | |
| Basic programming | Programming languages | Java, Python (Pandas/NumPy/SciPy/scikit-learn), SQL, functional programming, R/SAS/SPSS | Wes McKinney. Python for Data Analysis: Agile Tools for Real World Data. |
| | OS | Linux | |
| | DB & DW systems | MySQL/Hive/Impala | |
| | Text formats and processing | JSON/XML, regex | |
| | Tools | Git/SVN, Maven | |
| Distributed system & Hadoop ecosystem & NoSQL | Distributed system principles and theory | CAP theorem, RPC (Protocol Buffers/Thrift/Avro), ZooKeeper, metadata management (HCatalog) | |
| | Distributed storage & computing framework & resource management | Hadoop/HDFS/MapReduce/YARN | Tom White. Hadoop: The Definitive Guide.; Donald Miner, Adam Shook. MapReduce Design Patterns: Building Effective Algorithms and Analytics for Hadoop and Other Systems. |
| SQL on Hadoop | Data (log) acquisition/integration/fusion, normalization, feature extraction | Sqoop, Flume/Scribe/Chukwa, SerDe | Edward Capriolo, Dean Wampler, Jason Rutherglen. Programming Hive. |
| | Query & in-database analytics | Hive, Impala, UDF/UDAF | |
| | Large-scale data mining & machine learning frameworks | Spark/MLbase, MR/Mahout | |
| | Stream processing | Storm | |
| | NoSQL | HBase/Cassandra (column-oriented database); MongoDB (document database); Neo4j (graph database); Redis (cache) | Lars George. HBase: The Definitive Guide. |
| Data mining & Machine learning | DM & ML basics | Numerical/categorical variables, training/test data, overfitting, bias/variance, precision/recall, tagging | |
| | Statistics | Data exploration (mean, median, range, standard deviation, variance, histogram), probability distributions (normal/Gaussian, Poisson), covariance, correlation coefficient, distance and similarity computation, Bayes' theorem, Monte Carlo method, hypothesis testing | |
| | Supervised learning | Classifier, boosting, prediction, regression analysis | Han, Jiawei, Micheline Kamber, and Jian Pei. Data Mining: Concepts and Techniques. |
| | Unsupervised learning | Clustering, deep learning | |
| | Collaborative filtering | Item-based CF, user-based CF | |
| Algorithm | Classifier | Decision trees, KNN (k-nearest neighbors), SVM (support vector machines), SVD (singular value decomposition), naïve Bayes classifiers, neural networks | |
| | Regression | Linear regression, logistic regression, ranking, perceptron | |
| | Clustering | Hierarchical clustering, K-means clustering, spectral clustering | |
| | Dimensionality reduction | PCA (principal component analysis), LDA (linear discriminant analysis), MDS (multidimensional scaling) | |
| | Text mining & information retrieval | Corpus, term-document matrix, term frequency & weight, association rules, market basket analysis, vocabulary mapping, sentiment analysis, tagging, PageRank, VSM (vector space model), inverted index | Jimmy Lin and Chris Dyer. Data-Intensive Text Processing with MapReduce. |

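
To make one of the listed basics concrete, here is a minimal sketch of the "precision/recall" entry from the DM & ML basics row, written in plain Python. The function name `precision_recall`, the sample labels, and the choice to report 0.0 when a denominator is zero are all assumptions made for illustration; none of them come from the original post.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Return (precision, recall) for a single positive class.

    precision = TP / (TP + FP)
    recall    = TP / (TP + FN)
    A ratio whose denominator is zero is reported as 0.0
    (a convention chosen for this sketch, not a universal rule).
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    pairs = list(zip(y_true, y_pred))
    # True positives: labeled positive and predicted positive.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    # False positives: labeled negative but predicted positive.
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    # False negatives: labeled positive but predicted negative.
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical labels: 1 = relevant, 0 = not relevant.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0]

precision, recall = precision_recall(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy data there are 3 true positives, 1 false positive, and 1 false negative, so both values come out to 0.75. In practice a team would more likely rely on a library such as scikit-learn, which the table lists, than on a hand-written helper.
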
This entry was posted in Data Mining, Data Warehouse, Database, Hadoop, HBase, Hive, Impala, Machine Learning, NewSQL, NoSQL, PostgreSQL and tagged BigData, Data Mining, Data Warehouse, Database, Hadoop, HBase, Machine Learning on November 5, 2013 by ybliang8@gmail.com.