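About the author:
Peter Harrington holds bachelor's and master's degrees in electrical engineering. He worked for Intel in California and in China for seven years. Peter holds five US patents and has published articles in three academic journals. He is currently chief scientist at HG Data. If LinkedIn tracks business relationships between people, HG Data is dedicated to mining the business relationships between companies. He was previously the founder and chief scientist of Zillabyte, and before that he worked for two years as a machine learning software consultant. In his spare time, Peter competes in programming contests and builds 3D printers.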

iTuring: Machine learning seems to be harder than most subjects in computer science, especially for programmers with a weaker background in mathematics. Do you have any suggestions for these programmers?

PH: I would suggest trying to learn basic probability, statistics, and linear algebra on your own. You do not need to take a full semester-long course; the basics will get you very far. There are a number of online resources, such as the Khan Academy videos. (I looked on 56.com and the Khan Academy videos are there, but they are in English; there are also some Chinese-language courses.) There are also some books that are easy to approach. Once again, I am only familiar with the US English versions: “Teach Yourself”, “Statistics for Dummies”, “Probability Refresher”, “Statistics Demystified”, etc.

I actually think there is a business opportunity here. The Khan Academy videos are great because they are short, but they are in English. The Chinese videos I saw for linear algebra were very long. If you could make short videos like the Khan Academy ones in Chinese, I think they would be very popular.

iTuring: How should one learn machine learning step by step? Do you have a roadmap for beginners, or a book list for these hard-working learners?

PH: I would read “Data Mining” by Witten and Frank (数据挖掘:实用机器学习工具与技术). It has very little math and gives a good introduction to common algorithms. I think a step up from this is “Introduction to Data Mining” by Tan, Steinbach, and Kumar (数据挖掘导论).

Of course these are longer books, and if you want to figure something out immediately you won’t want to read them. If you are trying to figure out a single algorithm, I would read many different tutorials online. For the AdaBoost algorithm, for example, I think reading many different tutorials is more helpful than spending all your time on one.

Finally I would add: play around with toy examples. Ask yourself: if I changed the data, how would the results change?

iTuring: In real applications, data preprocessing may be more important than the algorithms themselves. Would you consider adding some data preprocessing techniques and examples to the new edition?

PH: I totally agree. I spend most of my time preprocessing data. I will try to add some preprocessing in the future. I’m not sure that there are any magic tricks here; sometimes it’s just hard work. I would also add: make sure you automate everything to reduce the amount of work you have to do in the future.

iTuring: Much of machine learning is about algorithms. To some people, these are the ‘fun’ parts. But some of the work is always tedious and boring, like data preprocessing, and standard datasets are not very ‘playful’ either. In your own experience, how do you handle the ‘not so interesting’ work?

PH: Sure, there are boring parts. You can try to automate these tasks so you don’t have to repeat them in the future. It will also make you a better software developer.

iTuring: Could you recommend some open source machine learning projects?

PH: Scikit-learn (http://scikit-learn.org/stable/) is the best one I can think of right now. It is written in Python and uses SciPy and NumPy.

iTuring: Data scientist has been called the hottest job in the world. Do you agree? As a data scientist yourself, do you have some experience to share with us? What does it take to become a data scientist?

PH: I think data scientists have an easy time finding jobs right now. What is a data scientist? I think a data scientist is somewhere between a statistician and a software engineer. Companies, individuals, non-profit organizations, and even sports teams are using data to make decisions. They want people who can analyze this data, and that requires some skills from both of the fields I just mentioned. They don’t want a pure statistician who is going to sit around arguing about whether he is Bayesian or non-Bayesian: they want people that DO THINGS.

So I would recommend doing things. What do I mean by that? Create projects, collect data, preprocess the data, do some data analysis, display it, make it publicly available, show it to people. If you do these things, you will have a portfolio of work you can show employers and talk about. Almost every example in my book could be made into a web or smartphone app that you could show off.

iTuring: Artificial intelligence has hit a bottleneck, and machine learning is likely to be where breakthroughs take place. What do you think are the major factors that give machine learning such a big role?

PH: The field of AI is relatively young if you compare it to physics or electrical engineering. As with anything that young, the topics and principles are still being discovered and refined. Often things that are research projects get presented as fact, and I think that this is where “AI over promised and under delivered” comes from.

I think a great example of this is researchers trying to recreate the mammal brain with neural networks. It reminds me of early attempts at building airplanes when people would build wings that looked like bird wings and try to fly only to end up with some broken bones. I’m not going to criticize anyone for working on neural networks: it was an experiment, there are some useful applications, but it did not solve all our problems or lead to sentient machines. The problem is these experiments get put in textbooks, movies and news and they are perceived as fact, when they are just experiments.

Back to the airplane example. When humans first figured out powered flight, they did it by trying to solve a simple task, not by trying to build a robotic bird. I think this same approach has led to some very big successes in AI. The years 2010 and 2011 gave us several: IBM’s Watson computer, Google’s self-driving car, and the Siri speech recognition on the iPhone. There is even a company that uses AI to write news articles. These are not experiments; these are production products used by millions of people every day. AI purists will say these are tools designed to do a specific task well, but not intelligent machines.

Back to the question, I think machine learning is a very practical tool, used to solve specific problems, while AI has a very lofty goal that is hard to attain. That is the reason we will continue to feel let down by AI but surprised by machine learning.

iTuring: Many big (data) companies such as Google, Facebook and Baidu have invested a lot of energy in deep learning. Do you think that deep learning will replace the methodology of ‘artificial features + machine learning’ in the near future?

PH: No, I don’t think deep learning will replace artificial features + machine learning. There are some problems that deep learning is good at, such as recognizing things in images. There are also some problems where existing algorithms perform better.

iTuring: After deep learning, what do you think is the next hotspot in machine learning?

PH: I am not really sure, maybe you can create a model to predict research hotspots based on conference paper submissions.

iTuring: It has been pointed out that prediction will be the major application of big data and machine learning. To take a specific example: if a company wants to predict its revenue, which model might it use?

PH: You are absolutely correct. I know large retailers who are staffing teams of people whose only job is to predict sales. If they can predict sales, they can save a lot of money. For a company’s revenue, I would start with regression + logistic regression. Logistic regression allows us to turn things on and off, which may be a good model for relating events taking place to money coming in the door.

iTuring: Some readers ask about “the famous 45 problem” in Section 7.3, what is it?

PH: Sorry, this should have been explained in the book. It also came up on the English-language forum.

The 45 problem refers to data that lies on a line at an angle of 45 degrees, that is, a line of the form y = x. The problem arises when you try to build a simple classifier for this data.

To see why this is a problem, say we have one class, 1, that lies on the line y = x, and a second class, 0, that lies on the line y = x + 6. Now try to choose a value on the X axis (a vertical line) that puts all the values from class 1 on one side and all the values from class 0 on the other side. Try again with a single value on the Y axis (a horizontal line). You cannot choose a single X or Y split point that will discriminate between the two classes. That is the 45 problem.

A support vector machine or logistic regression will not have a problem with this data. You could also apply a transform to the data and then easily handle it with a decision stump.
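
The 45 problem can be checked on a toy example. The following is a minimal sketch in plain Python, not the code from the book. It assumes ten made-up integer points per class, and `best_stump_accuracy` is a hypothetical helper that tries every single-threshold split on one feature. The transform used here (the difference y - x) is one possible choice among many.

```python
# Made-up toy data: class 1 on y = x, class 0 on y = x + 6.
xs = list(range(10))
points = [(x, x) for x in xs] + [(x, x + 6) for x in xs]
labels = [1] * len(xs) + [0] * len(xs)


def best_stump_accuracy(values, labels):
    """Best training accuracy of any single-threshold split on one feature."""
    n = len(labels)
    distinct = sorted(set(values))
    # Candidate thresholds: below all values, between neighbours, above all values.
    thresholds = [distinct[0] - 1]
    thresholds += [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    thresholds.append(distinct[-1] + 1)
    best = 0.0
    for t in thresholds:
        # Points classified correctly if "value <= t" means class 1.
        correct = sum((v <= t) == (lab == 1) for v, lab in zip(values, labels))
        # The flipped rule ("value <= t" means class 0) gets the remaining points right.
        best = max(best, max(correct, n - correct) / n)
    return best


acc_x = best_stump_accuracy([x for x, _ in points], labels)  # vertical split
acc_y = best_stump_accuracy([y for _, y in points], labels)  # horizontal split
# Transform: the difference y - x is 0 for class 1 and 6 for class 0.
acc_diff = best_stump_accuracy([y - x for x, y in points], labels)

print(acc_x, acc_y, acc_diff)
```

On this toy data, no split on X or on Y alone separates the classes perfectly (the best accuracies are 0.5 and 0.8), while a single split on the transformed feature y - x reaches 1.0.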

iTuring: Do you have plans to make this book even more interesting? For example, like including a daily life related problem in each chapter?

PH: That sounds like a good idea.