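Cathy O'Neil is a senior data scientist at Johnson Research Labs. She holds a PhD in mathematics from Harvard University, was a postdoc in the mathematics department at MIT and a professor at Barnard College, and has published many papers in arithmetic algebraic geometry. She worked as a hedge fund quant at D.E. Shaw, the well-known global investment management firm, and later joined RiskMetrics, a software company that assesses risk for banks and hedge funds. Cathy is a mathematician turned data scientist, and her personal blog, http://mathbabe.org/, is widely read. She and Rachel Schutt, an adjunct professor in the Department of Statistics at Columbia University, wrote the book Doing Data Science, based on a course called "Introduction to Data Science".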

iTuring: What attracts you most about being a data scientist? Have you found any clues to this question: what can a non-academic mathematician do to make the world a better place?

I love data! I love seeing how we can learn about the way things work by measuring them. I particularly enjoy figuring out how to quantify something we only vaguely knew, or to compare the effects of two things that until that moment seemed incomparable.

My biggest clue is that I need to spend more time and energy making sure data scientists think before they act. Data science is powerful and influential and can be used for evil or good. We need to recognize that fact.

iTuring: Many readers are inspired by your blog, mathbabe. Have you been inspired by them in turn through your interactions?

Absolutely! My readers have brought me on many strange and intellectually stimulating journeys. I am thankful every day to them for doing so.

iTuring: In your view, what qualities and educational background make a person best suited to data science work?

It really depends. I've written the book Doing Data Science with a mathematical background in mind, but honestly a data science team should also have people with backgrounds in philosophy and ethics who learn more scientific approaches. We need diversity of thought to solve a problem well.

iTuring: Some people believe that applications based on big data actually reinforce people's reliance on old habits, which discourages them from trying new experiences. Do you agree?

That can be true. For example, a resume or application sorting algorithm that simply learns from historical data and then regenerates that old-fashioned decision-making is merely codifying all the biases that the system had, whether it's sexism or an over-reliance on certain college degrees. I suggest that people try to figure out what it is that they are actually looking for, and how they can locate those skills while staying as unbiased as possible. We should at least attempt this.

iTuring: Many companies have benefited enormously from big data analysis, but others that use big data to formulate policies and strategies have gained little or have even failed. What mistakes do they make along the way?

Often they think that big data is magical. Of course it's not. You need good questions, and moreover you don't just need big data; you need the right data, which often hasn't been collected.

iTuring: Big data is largely used for prediction. Do you think accidental events can be predicted from determinate data?

This question is a bit vague, but if I understand it correctly it's asking whether something that is fundamentally unpredictable can be predicted. I guess not! However, it's of course true that even stochastic processes have some underlying characteristics. For example, if you have a waiting time process, you can talk about when you'd be "surprised" that the event hasn't occurred, after defining what surprises you.
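As a rough illustration of the waiting-time remark, here is a minimal sketch. It assumes the waiting time is exponentially distributed with a known rate, which is an assumption made for this example and not something stated in the interview. "Surprise" is defined as the point where the probability that the event still has not occurred falls below a chosen threshold `alpha`. The function names are hypothetical.

```python
import math


def survival_probability(rate: float, t: float) -> float:
    """P(T > t) for an exponential waiting time with the given rate (assumed model)."""
    return math.exp(-rate * t)


def surprise_time(rate: float, alpha: float = 0.05) -> float:
    """Earliest time t at which P(T > t) drops to alpha.

    Past this point, the event still not having occurred is something
    we agreed in advance to call "surprising".
    """
    if rate <= 0:
        raise ValueError("rate must be positive")
    if not 0 < alpha < 1:
        raise ValueError("alpha must be strictly between 0 and 1")
    # Solve exp(-rate * t) = alpha for t.
    return -math.log(alpha) / rate


if __name__ == "__main__":
    # Hypothetical numbers: on average one event per 10 minutes.
    rate_per_minute = 0.1
    t = surprise_time(rate_per_minute, alpha=0.05)
    print(f"Surprised after about {t:.1f} minutes without an event")
```

The threshold `alpha` is where "defining what surprises you" enters: a smaller `alpha` means you wait longer before calling the absence of the event surprising.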

iTuring: NoSQL arose to make web data easier to access, while the traditional database community has come up with the concept of the Data Space, in which the data comes first and the model comes afterward. How is this technology applied today? Are there similar topics that most people are not familiar with?

Generally speaking, big data uses unstructured and dirty data, at least to develop models. After the models go into production, there's sometimes a standard database being used, and definitely by the time the results and daily reports are being made, it is using standard databases.

I tend to ignore the details of this kind of data storage question, not because it's uninteresting but because it is rapidly changing. When I need to work on a new project, I go figure out what the current best technology is.

iTuring: In machine learning, the training data is usually given. From an engineering point of view, what is the most important (or trickiest) thing when extracting training data from a database: the characteristics of the data, the size of the data, or the way the data is extracted?

Really hard to say in general! Of course, sometimes you need just a huge amount of training data to train your model, and other times not so much but you need to be careful you are pulling a representative sample.

For my part I almost always split my data according to timestamps, when possible. I train my model on earlier data, then I test it on later data.
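The timestamp-based approach described here can be sketched as follows. This is only an illustration under assumed conventions: each record is a `(timestamp, features, label)` tuple, and the function name `time_split`, the cutoff date, and the sample records are all hypothetical.

```python
from datetime import datetime


def time_split(records, cutoff):
    """Split records into (train, test) by timestamp.

    Records strictly before `cutoff` go to the training set and
    records at or after `cutoff` go to the test set, so the model is
    never evaluated on data older than what it was trained on.
    """
    ordered = sorted(records, key=lambda r: r[0])
    train = [r for r in ordered if r[0] < cutoff]
    test = [r for r in ordered if r[0] >= cutoff]
    return train, test


# Made-up records: (timestamp, features, label)
records = [
    (datetime(2014, 3, 5), {"x": 0.7}, 1),
    (datetime(2014, 1, 10), {"x": 0.2}, 0),
    (datetime(2014, 2, 15), {"x": 0.4}, 0),
    (datetime(2014, 4, 20), {"x": 0.9}, 1),
]

train, test = time_split(records, cutoff=datetime(2014, 3, 1))
print(len(train), "training records,", len(test), "test records")
```

Unlike a random split, this keeps information from the future out of the training set, which is closer to how the model would be used in production.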

iTuring: To extract the key factors of a model, data analysts often need a good understanding of the business in question. Is there an easy way to acquire it, or is it an inevitable part of the job?

It is truly inevitable; only domain experts will be able to guide the modeling, at least near the beginning, when there are still easily achieved goals. Later on, when all the domain expertise has been included, it might become less domain specific.

iTuring: Data science has advanced greatly in recent years. Do you think any of the content in your book needs to be updated? And which content will remain unchanged for a long time?

Certainly! This is a fast moving field which I wanted to explain as an overview. If I rewrote this book today every chapter would be different. Even so, the overall approach of learning what you need, and being technical without losing sight of the human impact, will remain. As things progress, the techniques will get better and more mathematically complex, so in some sense this is the best time to be a data scientist.

