000 11959cam a2200673 i 4500
001 on1019853884
005 20250919141833.0
006 m o d
007 cr cnu---unuuu
008 171212s2018 nju ob 001 0 eng
010 _a 2017059014
020 _a9781118897140
_q(electronic book)
020 _a1118897145
_q(electronic book)
020 _a9781118897126
_q(electronic book)
020 _a1118897129
_q(electronic book)
020 _z9781118897133
020 _z1118897137
020 _z9781118897157
_q(hardcover)
029 1 _aCHNEW
_b001003072
029 1 _aCHVBK
_b516427059
035 _a(OCoLC)1019853884
040 _aDLC
_beng
_erda
_epn
_cDLC
_dOCLCO
_dNST
_dOCLCF
_dSNM
_dDG1
_dMERER
_dOCLCQ
_dUAB
_dDEFHM
_dOCLCQ
_dUPM
_dYDX
_dOCLCO
_dCOO
_dOCLCQ
_dWYU
_dRECBK
_dLVT
042 _apcc
049 _aMAIN
050 1 0 _aQA276.45.R3
_bJ66 2018
072 7 _aMAT
_x003000
_2bisacsh
072 7 _aMAT
_x029000
_2bisacsh
082 0 0 _a519.50285/5133
_223
100 1 _aJonge, Edwin de,
_d1972-
_eauthor.
245 1 0 _aStatistical data cleaning with applications in R /
_cEdwin de Jonge, Mark van der Loo.
264 1 _a[Hoboken, NJ] :
_bJohn Wiley & Sons Ltd,
_c[2018]
300 _a1 online resource
336 _atext
_btxt
_2rdacontent
337 _acomputer
_bc
_2rdamedia
338 _aonline resource
_bcr
_2rdacarrier
504 _aIncludes bibliographical references and index.
588 0 _aOnline resource; title from digital title page (viewed on June 13, 2018).
520 _aA comprehensive guide to automated statistical data cleaning. The production of clean data is a complex and time-consuming process that requires both technical know-how and statistical expertise. Statistical Data Cleaning brings together a wide range of techniques for cleaning textual, numeric or categorical data. This book examines technical data cleaning methods relating to data representation and data structure. A prominent role is given to statistical data validation, data cleaning based on predefined restrictions, and data cleaning strategy. Key features: -Focuses on the automation of data cleaning methods, including both theory and applications written in R. -Enables the reader to design data cleaning processes for either one-off analytical purposes or for setting up production systems that clean data on a regular basis. -Explores statistical techniques for solving issues such as incompleteness, contradictions and outliers, integration of data cleaning components and quality monitoring. -Supported by an accompanying website featuring data and R code. This book enables data scientists and statistical analysts working with data to deepen their understanding of data cleaning as well as to upgrade their practical data cleaning skills. It can also be used as material for a course in data cleaning and analyses.
505 0 _aCover -- Title Page -- Copyright -- Contents -- Foreword -- About the Companion Website -- Chapter 1 Data Cleaning -- 1.1 The Statistical Value Chain -- 1.1.1 Raw Data -- 1.1.2 Input Data -- 1.1.3 Valid Data -- 1.1.4 Statistics -- 1.1.5 Output -- 1.2 Notation and Conventions Used in this Book -- Chapter 2 A Brief Introduction to R -- 2.1 R on the Command Line -- 2.1.1 Getting Help and Learning R -- 2.2 Vectors -- 2.2.1 Computing with Vectors -- 2.2.2 Arrays and Matrices -- 2.3 Data Frames -- 2.3.1 The Formula-Data Interface -- 2.3.2 Selecting Rows and Columns -- Boolean Operators -- 2.3.3 Selection with Indices -- 2.3.4 Data Frame Manipulation: The dplyr Package -- 2.4 Special Values -- 2.4.1 Missing Values -- 2.5 Getting Data into and out of R -- 2.5.1 File Paths in R -- 2.5.2 Formats Provided by Packages -- 2.5.3 Reading Data from a Database -- 2.5.4 Working with Data External to R -- 2.6 Functions -- 2.6.1 Using Functions -- 2.6.2 Writing Functions -- 2.7 Packages Used in this Book -- Chapter 3 Technical Representation of Data -- 3.1 Numeric Data -- 3.1.1 Integers -- 3.1.2 Integers in R -- 3.1.3 Real Numbers -- 3.1.4 Double Precision Numbers -- 3.1.5 The Concept of Machine Precision -- 3.1.6 Consequences of Working with Floating Point Numbers -- 3.1.7 Dealing with the Consequences -- 3.1.8 Numeric Data in R -- 3.2 Text Data -- 3.2.1 Terminology and Encodings -- 3.2.2 Unicode -- 3.2.3 Some Popular Encodings -- 3.2.4 Textual Data in R: Objects of Class Character -- 3.2.5 Encoding in R -- 3.2.6 Reading and Writing of Data with Non-Local Encoding -- 3.2.7 Detecting Encoding -- 3.2.8 Collation and Sorting -- 3.3 Times and Dates -- 3.3.1 AIT, UTC, and POSIX Seconds Since the Epoch -- 3.3.2 Time and Date Notation -- 3.3.3 Time and Date Storage in R -- 3.3.4 Time and Date Conversion in R -- 3.3.5 Leap Days, Time Zones, and Daylight Saving Times.
505 8 _a3.4 Notes on Locale Settings -- Chapter 4 Data Structure -- 4.1 Introduction -- 4.2 Tabular Data -- 4.2.1 data.frame -- 4.2.2 Databases -- 4.2.3 dplyr -- 4.3 Matrix Data -- 4.4 Time Series -- 4.5 Graph Data -- 4.6 Web Data -- 4.6.1 Web Scraping -- 4.6.2 Web API -- 4.7 Other Data -- 4.8 Tidying Tabular Data -- 4.8.1 Variable Per Column -- 4.8.2 Single Observation Stored in Multiple Tables -- Chapter 5 Cleaning Text Data -- 5.1 Character Normalization -- 5.1.1 Encoding Conversion and Unicode Normalization -- 5.1.2 Character Conversion and Transliteration -- 5.2 Pattern Matching with Regular Expressions -- 5.2.1 Basic Regular Expressions -- 5.2.2 Practical Regular Expressions -- 5.2.3 Generating Regular Expressions in R -- 5.3 Common String Processing Tasks in R -- 5.4 Approximate Text Matching -- 5.4.1 String Metrics -- 5.4.2 String Metrics and Approximate Text Matching in R -- Chapter 6 Data Validation -- 6.1 Introduction -- 6.2 A First Look at the validate Package -- 6.2.1 Quick Checks with check_that -- 6.2.2 The Basic Workflow: validator and confront -- 6.2.3 A Little Background on validate and DSLs -- 6.3 Defining Data Validation -- 6.3.1 Formal Definition of Data Validation -- 6.3.2 Operations on Validation Functions -- 6.3.3 Validation and Missing Values -- 6.3.4 Structure of Validation Functions -- 6.3.5 Demarcating Validation Rules in validate -- 6.4 A Formal Typology of Data Validation Functions -- 6.4.1 A Closer Look at Measurement -- 6.4.2 Classification of Validation Rules -- 6.5 Validating Data with the validate Package -- 6.5.1 Validation Rules in the Console and the validator Object -- 6.5.2 Validating in the Pipeline -- 6.5.3 Raising Errors or Warnings -- 6.5.4 Tolerance for Testing Linear Equalities -- 6.5.5 Setting and Resetting Options -- 6.5.6 Importing and Exporting Validation Rules from and to File.
505 8 _a6.5.7 Checking Variable Types and Metadata -- 6.5.8 Checking Value Ranges and Code Lists -- 6.5.9 Checking In-Record Consistency Rules -- 6.5.10 Checking Cross-Record Validation Rules -- 6.5.11 Checking Functional Dependencies -- 6.5.12 Cross-Dataset Validation -- 6.5.13 Macros, Variable Groups, Keys -- 6.5.14 Analyzing Output: validation Objects -- 6.5.15 Output Dimensionality and Output Selection -- 6.5.15 Exercises for Section -- Chapter 7 Localizing Errors in Data Records -- 7.1 Error Localization -- 7.2 Error Localization with R -- 7.2.1 The Errorlocate Package -- 7.3 Error Localization as MIP-Problem -- 7.3.1 Error Localization and Mixed-Integer Programming -- 7.3.2 Linear Restrictions -- 7.3.3 Categorical Restrictions -- 7.3.4 Mixed-Type Restrictions -- 7.4 Numerical Stability Issues -- 7.4.1 A Short Overview of MIP Solving -- 7.4.2 Scaling Numerical Records -- 7.4.3 Setting Numerical Threshold Values -- 7.5 Practical Issues -- 7.5.1 Setting Reliability Weights -- 7.5.2 Simplifying Conditional Validation Rules -- 7.6 Conclusion -- Chapter 8 Rule Set Maintenance and Simplification -- 8.1 Quality of Validation Rules -- 8.1.1 Completeness -- 8.1.2 Superfluous Rules and Infeasibility -- 8.2 Rules in the Language of Logic -- 8.2.1 Using Logic to Rewrite Rules -- 8.3 Rule Set Issues -- 8.3.1 Infeasible Rule Set -- 8.3.2 Fixed Value -- 8.3.3 Redundant Rule -- 8.3.4 Nonrelaxing Clause -- 8.3.5 Nonconstraining Clause -- 8.4 Detection and Simplification Procedure -- 8.4.1 Mixed-Integer Programming -- 8.4.2 Detecting Feasibility -- 8.4.3 Finding Rules Causing Infeasibility -- 8.4.4 Detecting Conflicting Rules -- 8.4.5 Detect Partial Infeasibility -- 8.4.6 Detect Fixed Values -- 8.4.7 Detect Nonrelaxing Clauses -- 8.4.8 Detect Nonconstraining Clauses -- 8.4.9 Detect Redundant Rules -- 8.5 Conclusion.
505 8 _aChapter 9 Methods Based on Models for Domain Knowledge -- 9.1 Correction with Data Modifying Rules -- 9.1.1 Modifying Functions -- 9.1.2 A Class of Modifying Functions on Numerical Data -- 9.1.2 Exercises for Section -- 9.2 Rule-Based Correction with dcmodify -- 9.2.1 Reading Rules from File -- 9.2.2 Modifying Rule Syntax -- 9.2.3 Missing Values -- 9.2.4 Sequential and Sequence-Independent Execution -- 9.2.5 Options Settings Management -- 9.3 Deductive Correction -- 9.3.1 Correcting Typing Errors in Numeric Data -- 9.3.1 Exercises for Section -- 9.3.2 Deductive Imputation Using Linear Restrictions -- Chapter 10 Imputation and Adjustment -- 10.1 Missing Data -- 10.1.1 Missing Data Mechanisms -- 10.1.2 Visualizing and Testing for Patterns in Missing Data Using R -- 10.2 Model-Based Imputation -- 10.3 Model-Based Imputation in R -- 10.3.1 Specifying Imputation Methods with simputation -- 10.3.2 Linear Regression-Based Imputation -- 10.3.3 M-Estimation -- 10.3.4 Lasso, Ridge, and Elasticnet Regression -- 10.3.5 Classification and Regression Trees -- 10.3.6 Random Forest -- 10.4 Donor Imputation with R -- 10.4.1 Random and Sequential Hot Deck Imputation -- 10.4.2 k Nearest Neighbors and Predictive Mean Matching -- 10.5 Other Methods in the simputation Package -- 10.6 Imputation Based on the EM Algorithm -- 10.6.1 The EM Algorithm -- 10.6.2 EM Imputation Assuming the Multivariate Normal Distribution -- 10.7 Sampling Variance under Imputation -- 10.8 Multiple Imputations -- 10.8.1 Multiple Imputation Based on the EM Algorithm -- 10.8.2 The Amelia Package -- 10.8.3 Multivariate Imputation with Chained Equations (Mice) -- 10.8.4 Imputation with the mice Package -- 10.9 Analytic Approaches to Estimate Variance of Imputation -- 10.9.1 Imputation as Part of the Estimator -- 10.10 Choosing an Imputation Method -- 10.11 Constraint Value Adjustment.
505 8 _a10.11.1 Formal Description -- 10.11.2 Application to Imputed Data -- 10.11.3 Adjusting Imputed Values with the rspa Package -- Chapter 11 Example: A Small Data-Cleaning System -- 11.1 Setup -- 11.1.1 Deterministic Methods -- 11.1.2 Error Localization -- 11.1.3 Imputation -- 11.1.4 Adjusting Imputed Data -- 11.2 Monitoring Changes in Data -- 11.2.1 Data Diff (Daff) -- 11.2.2 Summarizing Cell Changes -- 11.2.3 Summarizing Changes in Conformance to Validation Rules -- 11.2.4 Track Changes in Data Automatically with lumberjack -- 11.3 Integration and Automation -- 11.3.1 Using RScript -- 11.3.2 The docopt Package -- 11.3.3 Automated Data Cleaning -- References -- Index -- EULA.
650 0 _aStatistics
_xData processing.
650 0 _aR (Computer program language)
650 7 _aMATHEMATICS
_xApplied.
_2bisacsh
650 7 _aMATHEMATICS
_xProbability & Statistics
_xGeneral.
_2bisacsh
650 7 _aR (Computer program language)
_2fast
_0(OCoLC)fst01086207
650 7 _aStatistics
_xData processing.
_2fast
_0(OCoLC)fst01132113
655 4 _aElectronic books.
700 1 _aLoo, Mark van der,
_d1976-
_eauthor.
776 0 8 _iPrint version:
_aJonge, Edwin de, 1972-
_tStatistical data cleaning with applications in R.
_dHoboken, NJ : John Wiley & Sons, 2018
_z9781118897157
_w(DLC) 2017049091
856 4 0 _uhttps://eresourcesptsl.ukm.remotexs.co/user/login?url=https://doi.org/10.1002/9781118897126
_zWiley Online Library
907 _a.b16814551
_b2022-11-02
_c2020-07-17
942 _n0
998 _ae
_b2020-07-17
_cm
_dz
_feng
_gnju
_y0
_z.b16814551
999 _c648865
_d648865