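I have finally finished a rough read of 《大数据时代》 (the Chinese edition of Big Data). It was a stumbling read: many passages struck me as illogical and baffling. The first three chapters swing between radical and moderate positions, while the later chapters are very rational, as if a punch already thrown had been pulled back. Given that the author works in law and has published in Science, the logic should not be this muddled, so I decided to track down the original and see for myself. It turns out the Chinese and English editions do not quite match. Let us start with the book's summary of its key claims. The following is excerpted from page 29 of the Chinese edition (in literal translation):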

First, analyze all the data related to a thing, rather than relying on the analysis of a small data sample.

Second, we are happy to accept the messiness of data, and no longer pursue precision.

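Finally, our thinking has shifted: we no longer explore elusive causal relationships, and instead attend to the correlations between things.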

The English edition, by contrast, reads as follows:

The first is the ability to analyze vast amounts of data about a topic rather than be forced to settle for smaller sets. The second is a willingness to embrace data’s real-world messiness rather than privilege exactitude. The third is a growing respect for correlations rather than a continuing quest for elusive causality.

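Overall, the Chinese edition's claims are forceful but inevitably one-sided, while the English edition's are moderate and less absolute. Put differently, the Chinese edition sets each pair of alternatives against each other as mutually exclusive, whereas the English edition lets them coexist, with one gaining ground as the other recedes.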

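Below are some of the places where the two editions diverge. I focus on the three key claims and ignore discrepancies in minor details. Page numbers refer to the Chinese edition, whose text is given in literal translation.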

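P9: Chinese edition: Most strikingly, society needs to give up its craving for causality and attend only to correlations. That is, we only need to know what, not why. English edition: Most strikingly, society will need to shed some of its obsession for causality in exchange for simple correlations: not knowing why but only what.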

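P18: Chinese edition: The second change is that there is so much data to study that we are no longer keen on pursuing precision. This will be discussed in Chapter 2. When our ability to measure things is limited, focusing on the most important things and obtaining the most precise results is advisable. English edition: Looking at vastly more data also permits us to loosen up our desire for exactitude, the second shift, which we identify in Chapter Three. It’s a tradeoff: with less error from sampling we can accept more measurement error. When our ability to measure is limited, we count only the most important things. Striving to get the exact number is appropriate.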

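P18: Chinese edition: This way of thinking suits situations where one has "small data": because there is little data to analyze, we must quantify our records as precisely as possible. In some respects, we have already become aware of the difference. English edition: This type of thinking was a function of a “small data” environment: with so few things to measure, we had to treat what we did bother to quantify as precisely as possible. In some ways this is obvious: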

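P18: Chinese edition: Achieving precision requires specialized databases. For small amounts of data and specific matters, pursuing precision is still feasible, such as whether a person's bank account holds enough money to write a check. But in this big-data era, pursuing precision has in many cases become infeasible, even unwelcome. When we have massive amounts of real-time data, absolute precision is no longer our main goal. English edition: Exactness requires carefully curated data. It may work for small quantities, and of course certain situations still require it: one either does or does not have enough money in the bank to write a check. But in return for using much more comprehensive datasets we can shed some of the rigid exactitude in a big-data world.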

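P18-19: Chinese edition: By contrast, in the big-data era we no longer need to keep our eyes fixed on the causal relationships between things, and should instead look for the correlations between them, which will offer us very novel and valuable insights. English edition: In a big-data world, by contrast, we won’t have to be fixated on causality; instead we can discover patterns and correlations in the data that offer us novel and invaluable insights.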

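P19: Chinese edition: Big data tells us "what" rather than "why". In the big-data era, we need not know the reasons behind phenomena; we only need to let the data speak for itself. English edition: Big data is about what, not why. We don’t always need to know the cause of a phenomenon; rather, we can let data speak for itself.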

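Having listed this much, I am reminded of the Global Times (《环球时报》) blunder of a few months ago. There had been a string of cases of school principals molesting primary school pupils, and some netizens made a spoof of Sora Aoi, photoshopping in the slogan "Principals: get a room with me, and leave the pupils alone." Global Times, unaware of the backstory, published an article criticizing Sora Aoi for having no moral bottom line in her self-promotion, treating the whole thing as if it were real. That made me think: after all, I downloaded the English edition from the internet, and I cannot be sure the Chinese edition I bought is genuine (even though I bought it from a major online bookstore). So do these differences really exist, or are they the product of pirated copies? To avoid repeating the Global Times' mistake, I will stop here.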