Knowledge Map for Big Data Engineers
2015-06-25 13:16
http://yanbohappy.sinaapp.com/?cat=32
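What do you actually need to know to work on big data in industry? I think the answer has two sides: technology and business. On the technology side, the main areas are probability and mathematical statistics, computer systems, algorithms, and programming. The business side varies with each company's line of business. Big data engineers need to learn to apply data mining methods, with the help of computer systems and programming tools, to solve real problems. Only then can they dig the drivers of business growth out of massive data and create more value for the company in a fiercely competitive market.
Business differs from company to company, but the technical foundations are shared. Here I briefly summarize the technical knowledge that big data engineers need. It mainly covers the basics of databases, data warehouses, programming, distributed systems, the Hadoop ecosystem, data mining, and machine learning. Of course, what I list here is what a whole team should cover collectively; each person will emphasize different areas depending on their role in the team. I offer this as a starting point for discussion, and comments are welcome.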
This entry was posted in Data Mining, Data Warehouse, Database, Hadoop, HBase, Hive, Impala, Machine Learning, NewSQL, NoSQL, PostgreSQL and tagged BigData, Data Mining, Data Warehouse, Database, Hadoop, HBase, Machine Learning on November 5, 2013 by ybliang8@gmail.com.
Topic | Content | Key points | Reference |
DB/OLTP & DW/OLAP | Database/OLTP basics | Relational model, SQL, index/secondary index, inner/left/right/full join, transaction/ACID | Raghu Ramakrishnan and Johannes Gehrke. Database Management Systems. |
DB/OLTP & DW/OLAP | Database internals & implementation | Architecture, memory management, storage/B+ tree, query parsing/optimization/execution, hash join/sort-merge join | |
DB/OLTP & DW/OLAP | Distributed and parallel databases | Sharding, database proxy | |
DB/OLTP & DW/OLAP | Data warehouse/OLAP | Materialized views, ETL, column-oriented storage, reporting, BI tools | |
Basic programming | Programming languages | Java, Python (Pandas/NumPy/SciPy/scikit-learn), SQL, functional programming, R/SAS/SPSS | Wes McKinney. Python for Data Analysis: Agile Tools for Real World Data. |
Basic programming | OS | Linux | |
Basic programming | DB & DW systems | MySQL/Hive/Impala | |
Basic programming | Text formats and processing | JSON/XML, regex | |
Basic programming | Tools | Git/SVN, Maven | |
Distributed system & Hadoop ecosystem & NoSQL | Distributed system principles and theory | CAP theorem, RPC (Protocol Buffers/Thrift/Avro), ZooKeeper, metadata management (HCatalog) | |
Distributed system & Hadoop ecosystem & NoSQL | Distributed storage & computing framework & resource management | Hadoop/HDFS/MapReduce/YARN | Tom White. Hadoop: The Definitive Guide. Donald Miner and Adam Shook. MapReduce Design Patterns: Building Effective Algorithms and Analytics for Hadoop and Other Systems. |
Distributed system & Hadoop ecosystem & NoSQL | SQL on Hadoop: data (log) acquisition/integration/fusion, normalization, feature extraction | Sqoop, Flume/Scribe/Chukwa, SerDe | Edward Capriolo, Dean Wampler, and Jason Rutherglen. Programming Hive. |
Distributed system & Hadoop ecosystem & NoSQL | SQL on Hadoop: query & in-database analytics | Hive, Impala, UDF/UDAF | |
Distributed system & Hadoop ecosystem & NoSQL | Large-scale data mining & machine learning frameworks | Spark/MLbase, MR/Mahout | |
Distributed system & Hadoop ecosystem & NoSQL | Stream processing | Storm | |
Distributed system & Hadoop ecosystem & NoSQL | NoSQL | HBase/Cassandra (column-oriented database) | Lars George. HBase: The Definitive Guide. |
Distributed system & Hadoop ecosystem & NoSQL | NoSQL | MongoDB (document database) | |
Distributed system & Hadoop ecosystem & NoSQL | NoSQL | Neo4j (graph database) | |
Distributed system & Hadoop ecosystem & NoSQL | NoSQL | Redis (cache) | |
Data mining & Machine learning | DM & ML basics | Numerical/categorical variables, training/test data, overfitting, bias/variance, precision/recall, tagging | |
Data mining & Machine learning | Statistics | Data exploration (mean/median/range/standard deviation/variance/histogram), probability distributions (normal/Gaussian, Poisson), covariance, correlation coefficient, distance and similarity computation, Bayes' theorem, Monte Carlo method, hypothesis testing | |
Data mining & Machine learning | Supervised learning | Classification, boosting, prediction, regression analysis | Jiawei Han, Micheline Kamber, and Jian Pei. Data Mining: Concepts and Techniques. |
Data mining & Machine learning | Unsupervised learning | Clustering, deep learning | |
Data mining & Machine learning | Collaborative filtering | Item-based CF, user-based CF | |
Data mining & Machine learning | Algorithm: classifier | Decision trees, KNN (k-nearest neighbors), SVM (support vector machines), SVD (singular value decomposition), naïve Bayes classifiers, neural networks | |
Data mining & Machine learning | Algorithm: regression | Linear regression, logistic regression, ranking, perceptron | |
Data mining & Machine learning | Algorithm: clustering | Hierarchical clustering, K-means clustering, spectral clustering | |
Data mining & Machine learning | Algorithm: dimensionality reduction | PCA (principal component analysis), LDA (linear discriminant analysis), MDS (multidimensional scaling) | |
Data mining & Machine learning | Text mining & information retrieval | Corpus, term-document matrix, term frequency & weighting, association rules, market basket analysis, vocabulary mapping, sentiment analysis, tagging, PageRank, VSM (vector space model), inverted index | Jimmy Lin and Chris Dyer. Data-Intensive Text Processing with MapReduce. |
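To make one of the listed key points concrete, here is a minimal sketch of an inverted index that also records term frequency, two items from the text mining & information retrieval row. It is not taken from the original post or from any of the referenced books. The function names (`tokenize`, `build_inverted_index`, `search`) and the toy corpus are hypothetical, and the tokenizer is a deliberately naive assumption that handles only lowercase ASCII words.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (naive)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Map each term to {doc_id: term frequency in that document}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(tokenize(text)).items():
            index[term][doc_id] = tf
    return dict(index)


def search(index, query):
    """Return the sorted ids of documents that contain every query term."""
    terms = tokenize(query)
    if not terms:
        return []
    postings = [set(index.get(term, {})) for term in terms]
    return sorted(set.intersection(*postings))


# Hypothetical toy corpus, for illustration only.
docs = {
    "d1": "Hive runs SQL queries on Hadoop",
    "d2": "HBase is a NoSQL store on Hadoop",
    "d3": "Redis is an in-memory cache",
}
index = build_inverted_index(docs)
print(index["hadoop"])
print(search(index, "on hadoop"))
```

In this sketch a lookup touches only the posting lists of the query terms instead of scanning every document, which is the property that makes the structure useful at scale. A production system would add stemming, stop-word handling, and a weighting scheme such as TF-IDF on top of the raw counts.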