Mining the Web: Discovering Knowledge from Hypertext Data (English Edition)
Turing Original Edition Computer Science Series

Soumen Chakrabarti (Author)
No longer on sale
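This book is a classic in the field of information retrieval. It explains in depth the techniques for extracting and generating knowledge from large volumes of unstructured Web data. It first covers the foundations of the Web (Web crawling, Web indexing, and keyword-based or similarity-based search), then systematically describes the fundamentals of Web mining, focusing on hypertext-based machine learning and data mining methods such as clustering, collaborative filtering, supervised learning, and semisupervised learning. It closes with the application of these principles to Web mining. The book gives readers a solid technical background and up-to-date knowledge.
This book is an ideal reference for professionals engaged in data mining research and development, and is also suitable as a textbook for graduate students in computer science and related disciplines.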
Print edition
¥59.00

Publication Information

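  • Title: Mining the Web: Discovering Knowledge from Hypertext Data (English Edition)
  • Series: Turing Original Edition Computer Science Series
  • Executive editor: for any questions about the content of this book, contact Fu Zhihong (傅志红)
  • Publication date: 2008-12-30
  • ISBN: 978-7-115-19404-6
  • List price: ¥59.00
  • Pages: 360
  • Trim size: 16开 (16-kai)
  • Publication status: No longer on sale
  • Original title: Mining the Web: Discovering Knowledge from Hypertext Data
  • Original ISBN: 1-55860-754-4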

Table of Contents

1 INTRODUCTION
1.1 Crawling and Indexing 6
1.2 Topic Directories 7
1.3 Clustering and Classification 8
1.4 Hyperlink Analysis 9
1.5 Resource Discovery and Vertical Portals 11
1.6 Structured vs. Unstructured Data Mining 11
1.7 Bibliographic Notes 13
PART I INFRASTRUCTURE
2 CRAWLING THE WEB
2.1 HTML and HTTP Basics 18
2.2 Crawling Basics 19
2.3 Engineering Large-Scale Crawlers 21
2.3.1 DNS Caching, Prefetching, and Resolution 22
2.3.2 Multiple Concurrent Fetches 23
2.3.3 Link Extraction and Normalization 25
2.3.4 Robot Exclusion 26
2.3.5 Eliminating Already-Visited URLs 26
2.3.6 Spider Traps 28
2.3.7 Avoiding Repeated Expansion of Links on Duplicate Pages 29
2.3.8 Load Monitor and Manager 29
2.3.9 Per-Server Work-Queues 30
2.3.10 Text Repository 31
2.3.11 Refreshing Crawled Pages 33
2.4 Putting Together a Crawler 35
2.4.1 Design of the Core Components 35
2.4.2 Case Study: Using w3c-libwww 40
2.5 Bibliographic Notes 40
3 WEB SEARCH AND INFORMATION RETRIEVAL
3.1 Boolean Queries and the Inverted Index 45
3.1.1 Stopwords and Stemming 48
3.1.2 Batch Indexing and Updates 49
3.1.3 Index Compression Techniques 51
3.2 Relevance Ranking 53
3.2.1 Recall and Precision 53
3.2.2 The Vector-Space Model 56
3.2.3 Relevance Feedback and Rocchio's Method 57
3.2.4 Probabilistic Relevance Feedback Models 58
3.2.5 Advanced Issues 61
3.3 Similarity Search 67
3.3.1 Handling "Find-Similar" Queries 68
3.3.2 Eliminating Near Duplicates via Shingling 71
3.3.3 Detecting Locally Similar Subgraphs of the Web 73
3.4 Bibliographic Notes 75
PART II LEARNING
4 SIMILARITY AND CLUSTERING
4.1 Formulations and Approaches 81
4.1.1 Partitioning Approaches 81
4.1.2 Geometric Embedding Approaches 82
4.1.3 Generative Models and Probabilistic Approaches 83
4.2 Bottom-Up and Top-Down Partitioning Paradigms 84
4.2.1 Agglomerative Clustering 84
4.2.2 The k-Means Algorithm 87
4.3 Clustering and Visualization via Embeddings 89
4.3.1 Self-Organizing Maps (SOMs) 90
4.3.2 Multidimensional Scaling (MDS) and FastMap 91
4.3.3 Projections and Subspaces 94
4.3.4 Latent Semantic Indexing (LSI) 96
4.4 Probabilistic Approaches to Clustering 99
4.4.1 Generative Distributions for Documents 101
4.4.2 Mixture Models and Expectation Maximization (EM) 103
4.4.3 Multiple Cause Mixture Model (MCMM) 108
4.4.4 Aspect Models and Probabilistic LSI 109
4.4.5 Model and Feature Selection 112
4.5 Collaborative Filtering 115
4.5.1 Probabilistic Models 115
4.5.2 Combining Content-Based and Collaborative Features 117
4.6 Bibliographic Notes 121
5 SUPERVISED LEARNING
5.1 The Supervised Learning Scenario 126
5.2 Overview of Classification Strategies 128
5.3 Evaluating Text Classifiers 129
5.3.1 Benchmarks 130
5.3.2 Measures of Accuracy 131
5.4 Nearest Neighbor Learners 133
5.4.1 Pros and Cons 134
5.4.2 Is TFIDF Appropriate? 135
5.5 Feature Selection 136
5.5.1 Greedy Inclusion Algorithms 137
5.5.2 Truncation Algorithms 144
5.5.3 Comparison and Discussion 145
5.6 Bayesian Learners 147
5.6.1 Naive Bayes Learners 148
5.6.2 Small-Degree Bayesian Networks 152
5.7 Exploiting Hierarchy among Topics 155
5.7.1 Feature Selection 155
5.7.2 Enhanced Parameter Estimation 155
5.7.3 Training and Search Strategies 157
5.8 Maximum Entropy Learners 160
5.9 Discriminative Classification 163
5.9.1 Linear Least-Square Regression 163
5.9.2 Support Vector Machines 164
5.10 Hypertext Classification 169
5.10.1 Representing Hypertext for Supervised Learning 169
5.10.2 Rule Induction 171
5.11 Bibliographic Notes 173
6 SEMISUPERVISED LEARNING
6.1 Expectation Maximization 178
6.1.1 Experimental Results 179
6.1.2 Reducing the Belief in Unlabeled Documents 181
6.1.3 Modeling Labels Using Many Mixture Components 183
6.2 Labeling Hypertext Graphs 184
6.2.1 Absorbing Features from Neighboring Pages 185
6.2.2 A Relaxation Labeling Algorithm 188
6.2.3 A Metric Graph-Labeling Problem 193
6.3 Co-training 195
6.4 Bibliographic Notes 198
PART III APPLICATIONS
7 SOCIAL NETWORK ANALYSIS
7.1 Social Sciences and Bibliometry 205
7.1.1 Prestige 205
7.1.2 Centrality 206
7.1.3 Co-citation 207
7.2 PageRank and HITS 209
7.2.1 PageRank 209
7.2.2 HITS 212
7.2.3 Stochastic HITS and Other Variants 216
7.3 Shortcomings of the Coarse-Grained Graph Model 219
7.3.1 Artifacts of Web Authorship 219
7.3.2 Topic Contamination and Drift 223
7.4 Enhanced Models and Techniques 225
7.4.1 Avoiding Two-Party Nepotism 225
7.4.2 Outlier Elimination 226
7.4.3 Exploiting Anchor Text 227
7.4.4 Exploiting Document Markup Structure 228
7.5 Evaluation of Topic Distillation 235
7.5.1 HITS and Related Algorithms 235
7.5.2 Effect of Exploiting Other Hypertext Features 238
7.6 Measuring and Modeling the Web 243
7.6.1 Power-Law Degree Distributions 243
7.6.2 The "Bow Tie" Structure and Bipartite Cores 246
7.6.3 Sampling Web Pages at Random 246
7.7 Bibliographic Notes 254
8 RESOURCE DISCOVERY
8.1 Collecting Important Pages Preferentially 257
8.1.1 Crawling as Guided Search in a Graph 257
8.1.2 Keyword-Based Graph Search 259
8.2 Similarity Search Using Link Topology 264
8.3 Topical Locality and Focused Crawling 268
8.3.1 Focused Crawling 270
8.3.2 Identifying and Exploiting Hubs 277
8.3.3 Learning Context Graphs 279
8.3.4 Reinforcement Learning 280
8.4 Discovering Communities 284
8.4.1 Bipartite Cores as Communities 284
8.4.2 Network Flow/Cut-Based Notions of Communities 285
8.5 Bibliographic Notes 288
9 THE FUTURE OF WEB MINING
9.1 Information Extraction 290
9.2 Natural Language Processing 295
9.2.1 Lexical Networks and Ontologies 296
9.2.2 Part-of-Speech and Sense Tagging 297
9.2.3 Parsing and Knowledge Representation 299
9.3 Question Answering 302
9.4 Profiles, Personalization, and Collaboration 305
References 307
Index 327