MARC View

000			11959cam a2200673 i 4500
001			on1019853884
005			20250919141833.0
006			m o d
007			cr cnu---unuuu
008			171212s2018 nju ob 001 0 eng
010			_a 2017059014
020			_a9781118897140 _q(electronic book)
020			_a1118897145 _q(electronic book)
020			_a9781118897126 _q(electronic book)
020			_a1118897129 _q(electronic book)
020			_z9781118897133
020			_z1118897137
020			_z9781118897157 _q(hardcover)
029	1		_aCHNEW _b001003072
029	1		_aCHVBK _b516427059
035			_a(OCoLC)1019853884
040			_aDLC _beng _erda _epn _cDLC _dOCLCO _dNST _dOCLCF _dSNM _dDG1 _dMERER _dOCLCQ _dUAB _dDEFHM _dOCLCQ _dUPM _dYDX _dOCLCO _dCOO _dOCLCQ _dWYU _dRECBK _dLVT
042			_apcc
049			_aMAIN
050	1	0	_aQA276.45.R3 _bJ66 2018
072		7	_aMAT _x003000 _2bisacsh
072		7	_aMAT _x029000 _2bisacsh
082	0	0	_a519.50285/5133 _223
100	1		_aJonge, Edwin de, _d1972- _eauthor.
245	1	0	_aStatistical data cleaning with applications in R / _cEdwin de Jonge, Mark van der Loo.
264		1	_a[Hoboken, NJ] : _bJohn Wiley & Sons Ltd, _c[2018]
300			_a1 online resource
336			_atext _btxt _2rdacontent
337			_acomputer _bc _2rdamedia
338			_aonline resource _bcr _2rdacarrier
504			_aIncludes bibliographical references and index.
588	0		_aOnline resource; title from digital title page (viewed on June 13, 2018).
520			_aA comprehensive guide to automated statistical data cleaning. The production of clean data is a complex and time-consuming process that requires both technical know-how and statistical expertise. Statistical Data Cleaning brings together a wide range of techniques for cleaning textual, numeric or categorical data. This book examines technical data cleaning methods relating to data representation and data structure. A prominent role is given to statistical data validation, data cleaning based on predefined restrictions, and data cleaning strategy. Key features: - Focuses on the automation of data cleaning methods, including both theory and applications written in R. - Enables the reader to design data cleaning processes for either one-off analytical purposes or for setting up production systems that clean data on a regular basis. - Explores statistical techniques for solving issues such as incompleteness, contradictions and outliers, integration of data cleaning components and quality monitoring. - Supported by an accompanying website featuring data and R code. This book enables data scientists and statistical analysts working with data to deepen their understanding of data cleaning as well as to upgrade their practical data cleaning skills. It can also be used as material for a course in data cleaning and analyses.
505	0		_aCover -- Title Page -- Copyright -- Contents -- Foreword -- About the Companion Website -- Chapter 1 Data Cleaning -- 1.1 The Statistical Value Chain -- 1.1.1 Raw Data -- 1.1.2 Input Data -- 1.1.3 Valid Data -- 1.1.4 Statistics -- 1.1.5 Output -- 1.2 Notation and Conventions Used in this Book -- Chapter 2 A Brief Introduction to R -- 2.1 R on the Command Line -- 2.1.1 Getting Help and Learning R -- 2.2 Vectors -- 2.2.1 Computing with Vectors -- 2.2.2 Arrays and Matrices -- 2.3 Data Frames -- 2.3.1 The Formula-Data Interface -- 2.3.2 Selecting Rows and Columns -- Boolean Operators -- 2.3.3 Selection with Indices -- 2.3.4 Data Frame Manipulation: The dplyr Package -- 2.4 Special Values -- 2.4.1 Missing Values -- 2.5 Getting Data into and out of R -- 2.5.1 File Paths in R -- 2.5.2 Formats Provided by Packages -- 2.5.3 Reading Data from a Database -- 2.5.4 Working with Data External to R -- 2.6 Functions -- 2.6.1 Using Functions -- 2.6.2 Writing Functions -- 2.7 Packages Used in this Book -- Chapter 3 Technical Representation of Data -- 3.1 Numeric Data -- 3.1.1 Integers -- 3.1.2 Integers in R -- 3.1.3 Real Numbers -- 3.1.4 Double Precision Numbers -- 3.1.5 The Concept of Machine Precision -- 3.1.6 Consequences of Working with Floating Point Numbers -- 3.1.7 Dealing with the Consequences -- 3.1.8 Numeric Data in R -- 3.2 Text Data -- 3.2.1 Terminology and Encodings -- 3.2.2 Unicode -- 3.2.3 Some Popular Encodings -- 3.2.4 Textual Data in R: Objects of Class Character -- 3.2.5 Encoding in R -- 3.2.6 Reading and Writing of Data with Non-Local Encoding -- 3.2.7 Detecting Encoding -- 3.2.8 Collation and Sorting -- 3.3 Times and Dates -- 3.3.1 AIT, UTC, and POSIX Seconds Since the Epoch -- 3.3.2 Time and Date Notation -- 3.3.3 Time and Date Storage in R -- 3.3.4 Time and Date Conversion in R -- 3.3.5 Leap Days, Time Zones, and Daylight Saving Times.
505	8		_a3.4 Notes on Locale Settings -- Chapter 4 Data Structure -- 4.1 Introduction -- 4.2 Tabular Data -- 4.2.1 data.frame -- 4.2.2 Databases -- 4.2.3 dplyr -- 4.3 Matrix Data -- 4.4 Time Series -- 4.5 Graph Data -- 4.6 Web Data -- 4.6.1 Web Scraping -- 4.6.2 Web API -- 4.7 Other Data -- 4.8 Tidying Tabular Data -- 4.8.1 Variable Per Column -- 4.8.2 Single Observation Stored in Multiple Tables -- Chapter 5 Cleaning Text Data -- 5.1 Character Normalization -- 5.1.1 Encoding Conversion and Unicode Normalization -- 5.1.2 Character Conversion and Transliteration -- 5.2 Pattern Matching with Regular Expressions -- 5.2.1 Basic Regular Expressions -- 5.2.2 Practical Regular Expressions -- 5.2.3 Generating Regular Expressions in R -- 5.3 Common String Processing Tasks in R -- 5.4 Approximate Text Matching -- 5.4.1 String Metrics -- 5.4.2 String Metrics and Approximate Text Matching in R -- Chapter 6 Data Validation -- 6.1 Introduction -- 6.2 A First Look at the validate Package -- 6.2.1 Quick Checks with check_that -- 6.2.2 The Basic Workflow: validator and confront -- 6.2.3 A Little Background on validate and DSLs -- 6.3 Defining Data Validation -- 6.3.1 Formal Definition of Data Validation -- 6.3.2 Operations on Validation Functions -- 6.3.3 Validation and Missing Values -- 6.3.4 Structure of Validation Functions -- 6.3.5 Demarcating Validation Rules in validate -- 6.4 A Formal Typology of Data Validation Functions -- 6.4.1 A Closer Look at Measurement -- 6.4.2 Classification of Validation Rules -- 6.5 Validating Data with the validate Package -- 6.5.1 Validation Rules in the Console and the validator Object -- 6.5.2 Validating in the Pipeline -- 6.5.3 Raising Errors or Warnings -- 6.5.4 Tolerance for Testing Linear Equalities -- 6.5.5 Setting and Resetting Options -- 6.5.6 Importing and Exporting Validation Rules from and to File.
505	8		_a6.5.7 Checking Variable Types and Metadata -- 6.5.8 Checking Value Ranges and Code Lists -- 6.5.9 Checking In-Record Consistency Rules -- 6.5.10 Checking Cross-Record Validation Rules -- 6.5.11 Checking Functional Dependencies -- 6.5.12 Cross-Dataset Validation -- 6.5.13 Macros, Variable Groups, Keys -- 6.5.14 Analyzing Output: validation Objects -- 6.5.15 Output Dimensionality and Output Selection -- 6.5.15 Exercises for Section -- Chapter 7 Localizing Errors in Data Records -- 7.1 Error Localization -- 7.2 Error Localization with R -- 7.2.1 The Errorlocate Package -- 7.3 Error Localization as MIP-Problem -- 7.3.1 Error Localization and Mixed-Integer Programming -- 7.3.2 Linear Restrictions -- 7.3.3 Categorical Restrictions -- 7.3.4 Mixed-Type Restrictions -- 7.4 Numerical Stability Issues -- 7.4.1 A Short Overview of MIP Solving -- 7.4.2 Scaling Numerical Records -- 7.4.3 Setting Numerical Threshold Values -- 7.5 Practical Issues -- 7.5.1 Setting Reliability Weights -- 7.5.2 Simplifying Conditional Validation Rules -- 7.6 Conclusion -- Chapter 8 Rule Set Maintenance and Simplification -- 8.1 Quality of Validation Rules -- 8.1.1 Completeness -- 8.1.2 Superfluous Rules and Infeasibility -- 8.2 Rules in the Language of Logic -- 8.2.1 Using Logic to Rewrite Rules -- 8.3 Rule Set Issues -- 8.3.1 Infeasible Rule Set -- 8.3.2 Fixed Value -- 8.3.3 Redundant Rule -- 8.3.4 Nonrelaxing Clause -- 8.3.5 Nonconstraining Clause -- 8.4 Detection and Simplification Procedure -- 8.4.1 Mixed-Integer Programming -- 8.4.2 Detecting Feasibility -- 8.4.3 Finding Rules Causing Infeasibility -- 8.4.4 Detecting Conflicting Rules -- 8.4.5 Detect Partial Infeasibility -- 8.4.6 Detect Fixed Values -- 8.4.7 Detect Nonrelaxing Clauses -- 8.4.8 Detect Nonconstraining Clauses -- 8.4.9 Detect Redundant Rules -- 8.5 Conclusion.
505	8		_aChapter 9 Methods Based on Models for Domain Knowledge -- 9.1 Correction with Data Modifying Rules -- 9.1.1 Modifying Functions -- 9.1.2 A Class of Modifying Functions on Numerical Data -- 9.1.2 Exercises for Section -- 9.2 Rule-Based Correction with dcmodify -- 9.2.1 Reading Rules from File -- 9.2.2 Modifying Rule Syntax -- 9.2.3 Missing Values -- 9.2.4 Sequential and Sequence-Independent Execution -- 9.2.5 Options Settings Management -- 9.3 Deductive Correction -- 9.3.1 Correcting Typing Errors in Numeric Data -- 9.3.1 Exercises for Section -- 9.3.2 Deductive Imputation Using Linear Restrictions -- Chapter 10 Imputation and Adjustment -- 10.1 Missing Data -- 10.1.1 Missing Data Mechanisms -- 10.1.2 Visualizing and Testing for Patterns in Missing Data Using R -- 10.2 Model-Based Imputation -- 10.3 Model-Based Imputation in R -- 10.3.1 Specifying Imputation Methods with simputation -- 10.3.2 Linear Regression-Based Imputation -- 10.3.3 M-Estimation -- 10.3.4 Lasso, Ridge, and Elasticnet Regression -- 10.3.5 Classification and Regression Trees -- 10.3.6 Random Forest -- 10.4 Donor Imputation with R -- 10.4.1 Random and Sequential Hot Deck Imputation -- 10.4.2 k Nearest Neighbors and Predictive Mean Matching -- 10.5 Other Methods in the simputation Package -- 10.6 Imputation Based on the EM Algorithm -- 10.6.1 The EM Algorithm -- 10.6.2 EM Imputation Assuming the Multivariate Normal Distribution -- 10.7 Sampling Variance under Imputation -- 10.8 Multiple Imputations -- 10.8.1 Multiple Imputation Based on the EM Algorithm -- 10.8.2 The Amelia Package -- 10.8.3 Multivariate Imputation with Chained Equations (Mice) -- 10.8.4 Imputation with the mice Package -- 10.9 Analytic Approaches to Estimate Variance of Imputation -- 10.9.1 Imputation as Part of the Estimator -- 10.10 Choosing an Imputation Method -- 10.11 Constraint Value Adjustment.
505	8		_a10.11.1 Formal Description -- 10.11.2 Application to Imputed Data -- 10.11.3 Adjusting Imputed Values with the rspa Package -- Chapter 11 Example: A Small Data-Cleaning System -- 11.1 Setup -- 11.1.1 Deterministic Methods -- 11.1.2 Error Localization -- 11.1.3 Imputation -- 11.1.4 Adjusting Imputed Data -- 11.2 Monitoring Changes in Data -- 11.2.1 Data Diff (Daff) -- 11.2.2 Summarizing Cell Changes -- 11.2.3 Summarizing Changes in Conformance to Validation Rules -- 11.2.4 Track Changes in Data Automatically with lumberjack -- 11.3 Integration and Automation -- 11.3.1 Using RScript -- 11.3.2 The docopt Package -- 11.3.3 Automated Data Cleaning -- References -- Index -- EULA.
650		0	_aStatistics _xData processing.
650		0	_aR (Computer program language)
650		7	_aMATHEMATICS _xApplied. _2bisacsh
650		7	_aMATHEMATICS _xProbability & Statistics _xGeneral. _2bisacsh
650		7	_aR (Computer program language) _2fast _0(OCoLC)fst01086207
650		7	_aStatistics _xData processing. _2fast _0(OCoLC)fst01132113
655		4	_aElectronic books.
700	1		_aLoo, Mark van der, _d1976- _eauthor.
776	0	8	_iPrint version: _aJonge, Edwin de, 1972- _tStatistical data cleaning with applications in R. _dHoboken, NJ : John Wiley & Sons, 2018 _z9781118897157 _w(DLC) 2017049091
856	4	0	_uhttps://eresourcesptsl.ukm.remotexs.co/user/login?url=https://doi.org/10.1002/9781118897126 _zWiley Online Library
907			_a.b16814551 _b2022-11-02 _c2020-07-17
942			_n0
998			_ae _b2020-07-17 _cm _dz _feng _gnju _y0 _z.b16814551
999			_c648865 _d648865