Iraq - Household Socio-Economic Survey, IHSES 2006/2007
Data editing took place at a number of stages throughout the processing, including:
A. Software packages
The data processing system for the IHSES survey was constructed primarily with CSPro, a specialized package widely used for census and household surveys. In addition, Visual Basic was used to build the user’s menu for the system.
Validation rules were established for most fields, with screens to control the entered data. The objectives of these validation rules are to:
• Ensure accurate entry and editing of the questionnaire data.
• Check that all rules and instructions for filling out the questionnaire are followed—for example, skipping between fields and filtering the data.
• Provide capacity to detect, follow up, and correct inconsistencies.
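The validation rules listed above can be illustrated with a minimal sketch in Python. It is not the CSPro logic used in the survey: the field names, the allowed ranges, and the skip rule are all hypothetical and only show how a range check and a skip check might produce rejection messages.

```python
# Hypothetical field names and limits, for illustration only.
RANGES = {"age": (0, 120), "hours_worked": (0, 168)}


def validate_record(record):
    """Return a list of rejection messages for one person record."""
    rejections = []
    # Range rule: each numeric field must fall inside its allowed interval.
    for field, (low, high) in RANGES.items():
        value = record.get(field)
        if value is not None and not (low <= value <= high):
            rejections.append(f"{field}={value} outside [{low}, {high}]")
    # Skip rule (assumed): a person coded 2 (did not work) must leave hours blank.
    if record.get("worked_last_week") == 2 and record.get("hours_worked") is not None:
        rejections.append("hours_worked filled although worked_last_week=2")
    return rejections
```

A record that passes every rule returns an empty list; each violated rule adds one message that an operator could follow up and correct.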
Data entry, editing, and data processing employed the following programs:
• Data entry: CSPro was primarily used to write the system. Screens were built to conform with the numbering of the questionnaire items and the field names.
• Data editing and consistency: CSPro was used to create rejection reports in the three languages used in the survey (Arabic, Kurdish, and English). The programs were prepared to detect and report a total of 315 abnormal situations in the data.
• Exporting data to the system to produce output tables: SPSS was used to produce output tables. A separate program was designed to transfer the raw data into the SPSS databases for statistical analysis. The exporting process produced files corresponding to the parts of the questionnaire.
• Processing for remaining rejections: The STATA software package was used to create programs to check and correct unresolved errors or rejections in the data files after the fieldwork had ended. These programs relied on mathematical and statistical methods and comparisons among households and governorates. They were able to identify outliers and adjust values automatically. When these data checks were complete, the files were converted from STATA to SPSS in order to create the output tables.
• Remote access: The LogMeIn service was used over the Internet, allowing the data management team at a central location to follow up and download files from the data entry computers in the field.
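The source does not say which statistical methods the STATA programs used to identify outliers and adjust values. As one possible illustration, the Python sketch below compares each value with the median of its governorate and replaces values that lie too many median absolute deviations away. The threshold, the replacement rule, and the sample figures are assumptions, not the survey's actual procedure.

```python
from statistics import median


def adjust_outliers(values, threshold=5.0):
    """Replace values far from the group median with the median itself."""
    center = median(values)
    # Median absolute deviation: a spread measure that outliers barely affect.
    mad = median(abs(v - center) for v in values)
    if mad == 0:
        return list(values)  # no spread to compare against; leave untouched
    return [center if abs(v - center) / mad > threshold else v for v in values]


# Hypothetical reported values grouped by governorate.
rents = {"Baghdad": [100, 110, 95, 105, 5000], "Basrah": [80, 85, 90]}
adjusted = {gov: adjust_outliers(vals) for gov, vals in rents.items()}
```

Grouping by governorate before comparing keeps regional differences in price levels from being mistaken for errors.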
B. Stages of data processing
To ensure accuracy and consistency, the data were edited at the following stages:
• Interviewer: Double-checks all answers on the household questionnaire, confirming that they are clear and correct. Writes in codes by hand for each field. Some calculations are made within the questionnaire.
• Local supervisor: Checks that the questionnaire has been completed correctly before it is forwarded to the data entry operator.
• Data management: During data entry, rejected items are flagged through editing and a consistency check program, based on validation rules and price ranges specified in the program. These controls are repeated, first during the entry sessions and then when the data is entirely entered. The same entry program is used, with adaptations for interactive work and for batch-runs without entry operators.
• Statistical analysis: After exporting the data files from CSPro to SPSS, the Statistical Analysis Unit uses program commands to identify irregular or nonlogical values, in addition to auditing some variables.
• World Bank consultants in coordination with the COSIT data management team: The World Bank technical consultants use additional programs in SPSS and STATA to examine and correct remaining inconsistencies within the data files. The software detects errors by analyzing questionnaire items according to the expected
- The SPSS package is used to clean and harmonize the datasets.
- The harmonization process starts with a cleaning process for all raw data files received from the Statistical Agency.
- All cleaned data files are then merged to produce one data file on the individual level containing all variables subject to harmonization.
- A country-specific program is generated for each dataset to generate, compute, recode, rename, format, and label harmonized variables.
- A post-harmonization cleaning process is then conducted on the data.
- Harmonized data is saved on the household as well as the individual level, in SPSS and then converted to STATA, to be disseminated.
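The merge and recode steps above were carried out in SPSS. The Python sketch below only illustrates the idea under stated assumptions: the raw variable names (q102, q103, q501), the harmonized names, and the code lists are all hypothetical.

```python
# Hypothetical cleaned files, keyed by (household id, person number).
roster = {(101, 1): {"q102": 1, "q103": 45}, (101, 2): {"q102": 2, "q103": 41}}
labour = {(101, 1): {"q501": 1}, (101, 2): {"q501": 2}}

# Hypothetical country-specific mappings: raw name -> harmonized name,
# then raw codes -> harmonized labels.
RENAME = {"q102": "sex", "q103": "age", "q501": "work_status"}
RECODE = {
    "sex": {1: "male", 2: "female"},
    "work_status": {1: "employed", 2: "not employed"},
}


def harmonize(*files):
    """Merge files on the individual key, then rename and recode variables."""
    merged = {}
    for data_file in files:
        for key, row in data_file.items():
            merged.setdefault(key, {}).update(row)
    harmonized = {}
    for key, row in merged.items():
        new_row = {}
        for raw_name, value in row.items():
            name = RENAME.get(raw_name, raw_name)
            # Variables without a code list keep their original value.
            new_row[name] = RECODE.get(name, {}).get(value, value)
        harmonized[key] = new_row
    return harmonized
```

Calling harmonize(roster, labour) yields one record per individual that holds all harmonized variables, which mirrors the single individual-level file described above.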
The data collected in the field was entered, wave after wave, separately in each governorate. All the rejections issued by the entry programs were dealt with within each team. At the end of each of the 18 waves, the data was sent to (or centrally picked up by) the Data Management Team (DMT), which re-checked the information and sent back any incomplete or unacceptable data for fixing.
Then, the final consolidated data for a wave was exported to SPSS as a set of files delivered to the Data Analysis Unit (DAU) in a pack known as "generation 1" of the wave. DAU identified specific issues in the data and requested further fixes from DMT or cleaned up the outliers and unacceptable cases. This activity produced a "generation 2" of the SPSS databases, which was used as input for adding variables such as expenditure and income aggregates and new classifications of households and persons, including unemployment descriptors, to produce a "generation 3". The latter was used to create a final "generation 4" of the databases, adding consumption aggregates, the classification of households by poverty status, and other poverty-related variables.
To handle all the data management responsibilities, the DMT produced or acquired a number of software tools to better support the project.
The core piece of software, a data entry program (developed in CSPro 3.01), allowed entry operators to enter and validate the information collected in the field, with strong consistency checks for improving the quality of the data. Main controls included: (1) ranges for numeric variables, (2) demographic consistency within the household including full control on education, health and labor data, (3) check unitary values and measurement units for acquired items, (4) extensive use of control subtotals for critical sections, (5) check the household metadata against the sample, and (6) balance of calories per capita based on food transactions. The screens and error messages were displayed in three languages (Arabic, Kurdish and English) depending on the choice of the data entry operator.
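Two of the controls listed above, control subtotals (4) and unit value checks (3), can be sketched in Python as follows. This is an illustration only, not the CSPro code: the item codes, the record layout, and the acceptable unit value ranges are hypothetical.

```python
def check_section(items, declared_total, unit_value_ranges):
    """Return the problems found in one expenditure section.

    items: list of dicts with 'code', 'quantity' and 'value' (assumed layout).
    declared_total: the control subtotal written on the questionnaire.
    unit_value_ranges: item code -> (lowest, highest) plausible unit value.
    """
    problems = []
    # Control subtotal: the entered lines must add up to the declared total.
    computed_total = sum(item["value"] for item in items)
    if computed_total != declared_total:
        problems.append(
            f"subtotal mismatch: entered {declared_total}, computed {computed_total}"
        )
    # Unit value: value divided by quantity must be plausible for the item.
    for item in items:
        if item["quantity"] <= 0:
            problems.append(f"{item['code']}: quantity must be positive")
            continue
        low, high = unit_value_ranges[item["code"]]
        unit_value = item["value"] / item["quantity"]
        if not (low <= unit_value <= high):
            problems.append(
                f"{item['code']}: unit value {unit_value:.0f} outside [{low}, {high}]"
            )
    return problems
```

A subtotal mismatch usually points to a keying error in one of the lines, while an implausible unit value often points to a wrong measurement unit, so reporting the two separately helps the operator locate the mistake.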
Time use sheets, collected for one third of the surveyed households, were converted into text files using scanners in each governorate. Despite the difficulties posed by the variety of formats and scanning devices available, scanning was the only choice for recording the activities declared by the interviewees at quarter-hour intervals across the 24 hours of a day.
An export module, also in CSPro, was included for transferring data into SPSS and Stata. During the export process, the same consistency checks of the data entry program were run again, plus other controls that checked the completion of the work in each governorate after each wave. The scripted export module reduced the data to just 12 interlinkable files.
User-friendly menus written in Visual Basic simplified the use of the different components of the entry tool.
Starting with the 7th wave, the data files of some governorates could be accessed and retrieved from a central location using remote Internet access via LogMeIn. The remaining governorates kept sending their files by email, since there were technical problems that the data management team could not solve because of security constraints.
6. Processing ends when data has been verified by both Data Management and DAU