
Books on Scala for statistical computing and data science

2017-04-15 18:10


Introduction

People regularly ask me about books and other resources for getting started with Scala for statistical computing and data science. This post focuses on books, but it’s worth briefly noting that there are a number of other resources, on-line and otherwise, that are also worth considering. I particularly like the Coursera course Functional Programming Principles in Scala – I still think this is probably the best way for most people to get started with Scala and functional programming. In fact, there is an entire Functional Programming in Scala Specialization that is worth considering – I’ll probably discuss that more in another post. I’ve got a draft page of Scala links which has a bias towards scientific and statistical computing, and I’m currently putting together a short course in that area, which I’ll also discuss further in future posts. But this post will concentrate on books.


Reading list


Getting started with Scala

Before one can dive into statistical computing and data science using Scala, it’s a good idea to understand a bit about the language and about functional programming. There are by now many books on Scala, and I haven’t carefully reviewed all of them, but I’ve
looked at enough to have an idea about good ways of getting started.

Programming in Scala: Third edition, Odersky et al, Artima.

This is the Scala book, often referred to on-line as PinS.
It is a weighty tome, and works through the Scala language in detail, starting from the basics. Every serious Scala programmer should own this book. However, it isn’t the easiest introduction to the language.

Scala for the Impatient, Horstmann, Addison-Wesley.

As the name suggests, this is a much quicker and easier introduction to Scala than PinS, but it assumes reasonable familiarity with programming in general, and sort-of assumes that the reader has a basic knowledge of Java and the JVM ecosystem. That said, it does not assume that the reader is a Java expert. My feeling is that for someone who has a reasonable programming background and a passing familiarity with Java, this book is probably the best introduction to the language. Note that there is a second edition in the works.

Functional Programming in Scala, Chiusano and Bjarnason, Manning.

It is possible to write Scala code in the style of "Java-without-the-semi-colons", but really the whole point of Scala is to move beyond that kind of Object-Oriented programming style. How far you venture down the path towards pure Functional Programming is very much a matter of taste, but many of the best Scala programmers are pretty hard-core FP, and there’s probably a reason for that. But many people coming to Scala don’t have a strong FP background, and getting up to speed with strongly-typed FP isn’t easy for people who only know an imperative (Object-Oriented) style of programming. This is the book that will help you to make the jump to FP. Sometimes referred to on-line as FPiS, or more often just as the red book, this is also a book that every serious Scala programmer should own (and read!). Note that it isn’t really a book about Scala – it is a book about strongly typed FP that just "happens" to use Scala for illustrating the ideas. Consequently, you will probably want to augment this book with a book that really is about Scala, such as one of the books above. Since this is the first book on the list published by Manning, I should also mention how much I like computing books from this publisher. They are typically well-produced, and their paper books (pBooks) come with complimentary access to well-produced DRM-free eBook versions, however you purchase them.
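As a rough illustration of the difference between "Java-without-the-semi-colons" and a more functional style, here is a minimal sketch. It is not taken from any of the books above, and the object and method names (StyleSketch, sumOfSquaresImperative, and so on) are made up purely for illustration. It computes a sum of squares three ways: with a mutable accumulator and a loop, with map and sum on an immutable collection, and with an explicit fold.

```scala
// Hypothetical example: names are illustrative, not from any book in this list.
object StyleSketch {

  // "Java-without-the-semi-colons": a mutable accumulator and an explicit loop.
  def sumOfSquaresImperative(xs: Array[Double]): Double = {
    var total = 0.0
    var i = 0
    while (i < xs.length) {
      total += xs(i) * xs(i)
      i += 1
    }
    total
  }

  // Functional style: no mutation; the result is a single expression built
  // from higher-order functions applied to an immutable collection.
  def sumOfSquaresFunctional(xs: Vector[Double]): Double =
    xs.map(x => x * x).sum

  // The same computation written as an explicit left fold.
  def sumOfSquaresFold(xs: Vector[Double]): Double =
    xs.foldLeft(0.0)((acc, x) => acc + x * x)
}
```

All three give the same answer for the same data. The functional versions are shorter and have no intermediate state to get wrong, which is roughly the kind of benefit the red book argues for at much greater depth.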

Functional and Reactive Domain Modeling, Ghosh, Manning.

This is another book that isn’t really about Scala, but about software engineering using a strongly typed FP language. But
again, it uses Scala to illustrate the ideas, and is an excellent read. You can think of it as a more practical "hands-on" follow-up to the red book, which shows how the ideas from the red book translate into effective solutions to real-world problems.

Structure and Interpretation of Computer Programs, second edition, Abelson et al, MIT Press.

This is not a Scala book! This is the only book in this list which doesn’t use Scala at all. I’ve included it on the list because it is one of the best books on programming that I’ve read, and is the book that I wish someone had told me about 20 years ago! In fact the book uses Scheme (a Lisp derivative) as the language to illustrate the ideas. There are obviously important differences between Scala and Scheme – e.g. Scala is strongly statically typed and compiled, whereas Scheme is dynamically typed and interpreted. However, there are also similarities – e.g. both languages support and encourage a functional style of programming but are not pure FP languages. Referred to on-line as SICP, this book is a classic. Note that there is no need to buy a paper copy if you like eBooks, since electronic versions are available free on-line.


Scala for statistical computing and data science

Scala for Data Science, Bugnion, Packt.

Not to be confused with the (terrible) book, Scala for Machine Learning, by the same publisher. Scala for Data Science is my top recommendation for getting started with statistical computing and data science applications using Scala. I have reviewed this book in another post, so I won’t say more about it here (but I like it).

Scala Data Analysis Cookbook, Manivannan, Packt.

I’m not a huge fan of the cookbook format, but this book is really mis-named, as it isn’t really a cookbook and isn’t really about data analysis in Scala! It is really a book about Apache Spark, and proceeds fairly sequentially in the form of a tutorial introduction to Spark. Spark is an impressive piece of technology, and it is obviously one of the factors driving interest in Scala, but it’s important to understand that Spark isn’t Scala, and that many typical data science applications will be better tackled using Scala without Spark. I’ve not read this book cover-to-cover as it offers little over Scala for Data Science, but its coverage of Spark is a bit more up-to-date than the Spark books I mention below, so it could be of interest to those who are mainly interested in Scala for Spark.

Scala High Performance Programming, Theron and Diamant, Packt.

This is an interesting book, fundamentally about developing high performance streaming data processing algorithm pipelines in Scala. It makes no reference to Spark. The running application is an on-line financial trading system. It takes a deep dive into understanding
performance in Scala and on the JVM, and looks at how to benchmark and profile performance, diagnose bottlenecks and optimise code. This is likely to be of more interest to those interested in developing efficient algorithms for scientific and statistical
computing rather than applied data scientists, but it covers some interesting material not covered by any of the other books in this list.

Learning Spark, Karau et al, O’Reilly.

This book provides an introduction to Apache Spark, written by some of the people who developed it. Spark is a big data analytics framework built on top of Scala. It is arguably the best available framework for big data analytics on computing clusters in the cloud, and hence there is a lot of interest in it. The book is a perfectly good introduction to Spark, and shows most examples implemented using the Java and Python APIs in addition to the canonical Scala (Spark Shell) implementation. This is useful for people working with multiple languages, but can be mildly irritating to anyone who is only interested in Scala. However, the big problem with this (and every other) book on Spark is that Spark is evolving very quickly, and so by the time any book on Spark is written and published it is inevitably very out of date. It’s not clear that it is worth buying a book specifically about Spark at this stage, or whether it would be better to go for a book like Scala for Data Science, which has a couple of chapters of introduction to Spark, which can then provide a starting point for engaging with Spark’s on-line documentation (which is reasonably good).

Advanced Analytics with Spark, Ryza et al, O’Reilly.

This book has a bit of a "cookbook" feel to it, which some people like and some don’t. It’s really more like an "edited volume" with different chapters authored by different people. Unlike Learning Spark it focuses exclusively on the Scala API. The book basically
covers the development of a bunch of different machine learning pipelines for a variety of applications. My main problem with this book is that it has aged particularly badly, as all of the pipelines are developed with raw RDDs, which isn’t how ML pipelines
in Spark are constructed any more. So again, it’s difficult for me to recommend. The message here is that if you are thinking of buying a book about Spark, check very carefully when it was published and what version of Spark it covers and whether that is sufficiently
recent to be of relevance to you.


Summary

There are lots of books for getting started with Scala for statistical computing and data science applications. My "bare minimum" recommendation would be some generic Scala book (it doesn’t really matter which one), the red book, and Scala for Data Science. After reading those, you will be very well placed to top-up your knowledge as required with on-line resources.
Tags: scala