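Holden Karau is a principal software engineer at IBM, where she works on improving Apache Spark and helping developers contribute code to Spark. Holden was previously a software development engineer at Databricks, working on the backend of Spark and Databricks Cloud. She has also worked as a software developer at Google and Amazon, on the Google+ backend and on Amazon's intelligent classification system, respectively. She has extensive experience in big data and search, and is proficient in Scala, Scheme, Java, Perl, C, C++, Ruby, and other languages. Holden is the author of Fast Data Processing with Spark and a co-author of Learning Spark.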

iTuring: You are an author of both Fast Data Processing with Spark and Learning Spark. What are the differences between these two books? What was your writing experience like?

Fast Data Processing with Spark was the first book written on Apache Spark and was very focused on just getting people started. Learning Spark was written much later, after Spark SQL and other important components were added to Spark, and is a bit more detail oriented while still targeted at individuals new to Spark. My writing practice changed a lot between the two for a mixture of reasons. Learning Spark is a much more collaborative book, and we had early releases along with technical reviewers involved from a very early stage, so it was much easier to make changes, and the feedback we got was quite helpful in making Learning Spark. I was also working at Databricks while writing Learning Spark, so it was much easier to fact check and get feedback from the committers since many of them worked in the same office.

iTuring: What is the biggest difference between your job at Databricks and your job at IBM? Have you had to make any adjustments in your work?

Probably the biggest difference in my day-to-day at IBM is having more time to focus on working on Spark. While I was at Databricks I had to spend a lot of my time working on Databricks Cloud (the commercial offering). Another change is that Databricks has most of the Spark committers, so getting questions answered and code reviewed was faster there. There are also the usual small company versus big company differences, but within our group things are surprisingly flexible.

iTuring: As a developer who has spent a lot of time building Spark, and considering the popularity of the R language in the open source world, do you think Spark will provide an interface for R in the future?

It already does! The SparkR project is now part of Spark and offers an R API, although as the newest component it is far from done and quite a way from feature parity with Scala.

iTuring: Many enterprises have difficulty moving from relational databases to modern big data processing tools such as Spark. What are your suggestions for these companies?

I think moving from traditional relational systems to more distributed systems involves a lot of changes for the developers. Spark SQL can bridge some of the gap for analytics, but I think an important part is gaining an understanding of how distributed systems work in practice. Rather than trying to start by rewriting an existing complex system, starting with a new project from scratch (perhaps on a new data source) can help build the institutional knowledge.

iTuring: Many people believe that Spark will overthrow Hadoop with its superb performance. Do you agree? What will the ecosystem of big data processing technologies such as Hadoop, Pig, Tez, Hive, and Spark look like in the future?

It's difficult to predict what's going to happen with the Big Data ecosystem over time, especially with so many people involved in the open source community. I believe Spark will replace much of MapReduce and many specialized systems over time, and other systems may use Spark as an execution engine. There will still be use cases where specialized systems are a better fit.

iTuring: How do you choose between command line and Spark for different environments of data analysis?

Generally I tend to be more comfortable in the command line, although for exploratory work rather than debugging, using things like notebooks is really quite useful. There is of course Databricks Cloud, but I've also had good experiences with Jupyter and Zeppelin. For production jobs, though, I find notebooks to be too limiting and difficult to test, so more traditional packaged JARs are what I use when moving beyond the exploratory phase.

iTuring: What is the relationship between Hive on Spark and Spark SQL? Which one do you believe has a more promising future?

Spark SQL is an important component of Spark, and the introduction of Datasets brings functional-style programming to Spark SQL in addition to the existing relational APIs. I'm very excited about the future of Spark SQL.

iTuring: For someone who has already mastered Hadoop, what is the roadmap for learning Spark? Is reading the source code a recommended way to learn Spark?

I'm obviously a little biased and think Learning Spark would be a great book, but doing exploratory work in the Spark shell can also be a great way to get up to speed. I think Spark is at the point where reading the code makes sense for people who are going to be developers on Spark itself, but for end users hopefully it's not necessary unless you want to use the latest features.

iTuring: How can one effectively read the source code of giant open source projects like Spark and Hadoop? Are there any tools that would help in the process?

I think reading the source code of Spark is an excellent activity for people interested in contributing to Spark. Since I'm an Emacs user I tend to use Magit, but I've also used ENSIME. A lot of other developers find IntelliJ to be quite useful.

iTuring: Female developers are rarely seen in China, especially in the field of “big data”. What suggestions do you have for the girls and women in China who want to be developers or software engineers?

I wish I had better advice, and obviously what advice I do have comes from my own experiences, which may be different. That being said, I've found joining groups like Women Who Code and Double Union (a local women's hackerspace in San Francisco) really useful both for learning and for having a network. I think getting involved in open source can be a good way to gain experience and build a portfolio when getting started, and it can help when interviewing. That being said, open source communities can sometimes have a lot of infighting depending on the project, so I always try to look for friendly people or work with my friends when possible. I also think giving talks can be helpful as a way to showcase your work and meet interesting people in the field.

