Hadley Wickham  
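Chief Scientist at RStudio, assistant professor at Rice University, and a veteran member of the R community who has developed more than 30 R packages. For his outstanding contributions to tools for data manipulation and visualization, he received the John Chambers Award, which was established specifically for statistical computing.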


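Hadley was born in Hamilton, New Zealand, into a family of statisticians. His father, Brian Wickham, holds a PhD in statistics applied to animal breeding from Cornell University, and his sister earned a PhD in statistics from the University of California, Berkeley.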

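If there is such a thing as a child prodigy in data structures, Hadley would count as one. He once proudly recounted his experience:

"My first job, at 15, was developing Microsoft Access databases, and it was fun. I wrote some database documentation back then, and people are still using the databases I wrote."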

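Hadley first encountered R in a statistics course at the University of Auckland in New Zealand. He describes R as "a programming language for understanding data." Along with SQL and Python, R is among the most popular programming languages for data scientists.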

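Like Hadley, the R programming language comes from New Zealand. R was created in 1993 by the statisticians Ross Ihaka and Robert Gentleman at the University of Auckland. It is used mainly for data analysis, but it has some quirks (such as how data structures are indexed and how data is held in physical memory). As a result, users of other programming languages mostly find R strange. After using Java, VBA, and PHP, Hadley found R "different." "(Many programmers) think R is absurd and clumsy. I don't," he said. "I think R is really interesting."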

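After moving to the United States for his PhD at Iowa State University, Hadley started developing R packages. In his own words, a package has to contain "code that helps people solve a problem, and then you have to document that code so that others can understand how to use it." The first package he created, as part of a class project, visualized bioinformatics data. That package was never released, but this did nothing to dampen his enthusiasm for sharing.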

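In 2005 he released the reshape package, which attracted wide attention and marked the starting point of his R package development. The package has been downloaded many thousands of times. reshape aims to reduce the "tedium and pain" of aggregating and manipulating data. Simplifying data transformation may not look difficult, but for data scientists and statisticians it is often the most time-consuming part of the work.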
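To make the kind of reshaping described above concrete, here is a minimal sketch in plain Python of a wide-to-long transformation. It is an illustration only: the helper `melt` borrows the name of the reshape package's verb, but it is a hypothetical function written for this example, not that package's API, and the `subject`, `height`, and `weight` values are made-up sample data.

```python
from typing import Dict, List, Sequence

Row = Dict[str, object]


def melt(
    rows: List[Row],
    id_vars: Sequence[str],
    var_name: str = "variable",
    value_name: str = "value",
) -> List[Row]:
    """Turn wide rows (one column per measurement) into long rows
    (one row per identifier/measurement pair)."""
    long_rows: List[Row] = []
    for row in rows:
        for column, value in row.items():
            if column in id_vars:
                continue  # identifier columns are copied, not melted
            long_row: Row = {key: row[key] for key in id_vars}
            long_row[var_name] = column
            long_row[value_name] = value
            long_rows.append(long_row)
    return long_rows


# Hypothetical sample data: one row per subject, one column per measurement.
wide = [
    {"subject": "a", "height": 170, "weight": 65},
    {"subject": "b", "height": 182, "weight": 80},
]

long = melt(wide, id_vars=["subject"])
for record in long:
    print(record)
```

Each measurement ends up in its own row, which is the shape that later aggregation steps tend to expect.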

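Clearly, Hadley enjoys the success of the reshape package. He felt that the existing approaches were imperfect, so new packages had to be written. This is not boasting; he has plenty of confidence. "I firmly believe I have the right way of doing things," he stresses, "for better or worse."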

--------------

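His latest work, R Packages, aims to take readers from being users of R packages to being developers of them, and presents the philosophy of R package development. The book explains in detail how to bundle reusable R functions, sample data, and documentation together so that you can share code with others, save development time, organize data analyses, and automate as much of your work as possible.

  • Learn the most useful components of an R package, including vignettes and unit tests
  • Automate tasks with devtools
  • Get tips on good coding style, such as how to organize functions into files
  • Streamline your development process with devtools
  • Discover the best way to submit a package to CRAN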

--------


Having seen your picture, some followers asked, "With a face like that, why code for a living instead of relying on your looks?" Fans often praise their idols this way, along the lines of "With such a perfect face, (Johnny Depp, etc.) could live comfortably on looks alone, yet he still works hard at improving his acting." So what is your reason for coding?

I love coding for two main reasons. Firstly, I really enjoy figuring out the underlying structure behind problems that on the surface seem very different. For example, I found it very satisfying to develop the ideas behind tidy data and the tidyr package because I enjoyed figuring out the deeper underlying theory.

Secondly, I really enjoy programming because it helps other people. Producing R packages is a great way to turn my ideas into tools that other people can take advantage of, and I enjoy all the feedback that I get from the R community. Hearing that people are using my code and finding it useful is one of the things that keeps me motivated.

R Packages is available for free online. Aren't you worried that this may reduce sales of the print edition? And why publish the book at all, given that there is little financial incentive?

My goal in writing books is not to make money, but to reach as many people as possible. I think making the book available in both forms achieves this goal well. Younger people who don’t have a lot of money to spend on books can use the website. People who enjoy reading physical books can still buy one, and the marketing around a physical book is more likely to reach people who aren’t as active on the internet.

I know R Packages was written in the open. Could you describe your experience of that crowd-sourced process?

I think writing a book in the open is a truly excellent way to write. One of the challenges of writing a book is that it is a large project that can take one or more years. It’s hard to maintain excitement and motivation about such a big project. However, when you write in the open, you constantly get feedback. This makes it much easier to stay motivated!

I’m also quite bad at proofreading, and I really enjoy that the R community can contribute through GitHub pull requests to fix all of my silly mistakes! People also contribute larger fixes, and point out other problems with the text. Altogether, writing in the open makes the book much better than it would otherwise be!

Could we compare R package development to API design? In addition to encapsulation, robustness, and usability, is there anything else that needs special attention?

I think there are some general principles that make my packages work together particularly smoothly. Currently, those principles are mostly intuitive to me: I know what to do, but I can’t explain it well enough for other people to learn. I am trying to change that by writing up the principles that underlie the “tidyverse”, and you can find my first attempt at https://github.com/hadley/tidyverse/blob/master/vignettes/manifesto.Rmd. I think these are important principles for the design of R packages because they make an API feel like R, and help packages work together naturally.

R was designed for data analysis but has some quirks, such as the way data structures are indexed and the requirement that data be held in physical memory. Do you think R could borrow from the memory management approaches of C++ and Spark?

R is not perfect, but I think it does a really good job of making the human data analyst as effective as possible. R is a very flexible language, which means that it’s possible to design domain specific languages like ggplot2 and dplyr that help solve certain subdomains of the data analysis problem. That flexibility has its downsides: generally slower performance. I think it’s worthwhile to have different languages for different domains: R is great for making humans efficient at doing data analysis; C++ is great for making computers calculate as efficiently as possible. I personally don’t believe it’s possible to have one language that does both. (In other words, I believe in Ousterhout’s dichotomy, https://en.wikipedia.org/wiki/Ousterhout%27s_dichotomy)

Statistics and data analysis in R have unique advantages, but performance is low. Could C interfaces be used in R package development to build efficient components more easily?

Yes, and many, many packages now use Rcpp and C++ to do exactly that. As we see more experienced programmers learn R, and more R users become experienced programmers, I think we will see more and more packages that are designed for high efficiency.

Microsoft and IBM have adopted R. There are also commercial companies, such as H2O, that provide R packages with better performance. What do you think of companies' influence on R development?

I think it’s a great sign of R’s continued evolution and its growing up as a programming language. R is now a critical part of many companies, and that means that there will be more resources to work on R generally. One particularly exciting initiative that I’m involved with is the R Consortium (https://www.r-consortium.org). This is a way for companies to give back to the R community, and have their money be spent to make R better for everyone.

According to you, RStudio is the best development environment for R users. Some readers are concerned that your books are too focused on RStudio, and suggest decoupling the material from RStudio.

There are other ways to use R apart from RStudio, and the most popular tool after RStudio is ESS (Emacs Speaks Statistics). These tools are powerful, but because they’re more tailored for advanced users, I’ve chosen to focus on RStudio in my books. I think that’s a reasonable trade-off: if you don’t use RStudio, you can just ignore the bits that don’t apply (and you’re probably a more experienced R programmer, so you are able to figure out the equivalents yourself).

You've contributed so much to R, particularly in R packages. How do you manage to be so productive?

Here are a few more thoughts from a personal perspective.

Writing. I have worked really hard to build a solid writing habit - I try and write for 60-90 minutes every morning. It's the first thing I do after I get out of bed. I think writing is really helpful to me for a few reasons. First, I often use my writing as a reference - I don't program in C++ every day, so I'm constantly referring to @Rcpp every time I do. Writing also makes me aware of gaps in my knowledge and my tools, and filling in those gaps tends to make me more efficient at tackling new problems.

Reading. I read a lot. I follow about 300 blogs, and keep a pretty close eye on the R tags on Twitter and Stack Overflow. I don't read most things deeply - the majority of content I only briefly skim. But this wide exposure helps me keep up with changes in technology, interesting new programming languages, and what others are doing with data. It's also helpful if, when you're tackling a new problem, you can recognise its basic name - then googling for it will suggest possible solutions. If you don't know the name of a problem, it's very hard to research it.

Chunking. Context-switching is expensive, so if I worked on many packages at the same time, I'd never get anything done. Instead, at any point in time, most of my packages are lying fallow, steadily accumulating issues and ideas for new features. Once a critical mass has accumulated, I'll spend a couple of days on the package.

Finally, it's hard to over-emphasise the impact that working full-time on R has. Since I left Rice, I have spent well over 90% of my work time thinking about and programming in R. This has a compounding effect because as I build better tools (cognitive and computational) it becomes even easier to build new tools. I can create a new package in seconds, and I have many techniques on-hand (in-brain) for solving new problems.

