Cleaning operations
Raw Data
=======
The questionnaires for the 15 Northern states were scanned centrally at CBS premises in Khartoum using a high-capacity scanner and optical character recognition (OCR) software. Approximately 96-97% of all filled-in characters were automatically interpreted and entered into the software's internal database. The scanning procedure included manual on-screen verification of the remaining data that could not be interpreted automatically. Finally, the scanned data were exported as ASCII files with corresponding digital images of each questionnaire. The data files were then converted, further processed and edited, and tabulated using SPSS/PASW.
The NBHS2009 data were edited through a combination of post-scanning automated edits and manual back-checks against the electronic images (TIF files) stored for each questionnaire. The latter were mainly used to verify outliers that could be due to scanning or fieldworker errors.
The automated edits were pre-programmed to identify and correct consistency errors within each thematic section of the questionnaire. For age-related variables in particular (marital status, education and work), checks were also applied across sections.
Outliers were defined as values outside the range MEAN +/- 3 x STDV of the variable within its stratum. Outliers were listed and, unless a subject matter specialist intervened manually, automatically imputed with the MEDIAN value of the stratum.
However, for the very thorough edits of questionnaire section M (purchase and consumption), additional information on local market prices was used to correct the raw data.
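The general outlier rule (MEAN +/- 3 x STDV within stratum, imputed with the stratum median) can be sketched as follows. This is only an illustration in Python, not the SPSS syntax used for the survey. The field names `stratum`, `value` and `flag_imputed` are hypothetical, and the choices to use the sample standard deviation and to compute the statistics on all values in the stratum (including the outliers themselves) are assumptions, since the documentation does not specify them. Manual overrides by a subject matter specialist are not modelled.

```python
from statistics import mean, median, stdev


def impute_outliers(records, value_key="value", stratum_key="stratum", k=3.0):
    """Return edited copies of records; outliers get the stratum median and a flag."""
    # Group the observed values by stratum.
    by_stratum = {}
    for rec in records:
        by_stratum.setdefault(rec[stratum_key], []).append(rec[value_key])

    # Compute lower limit, upper limit and median for each stratum.
    limits = {}
    for stratum, values in by_stratum.items():
        if len(values) < 2:
            continue  # sample standard deviation is undefined for one value
        m, s = mean(values), stdev(values)
        limits[stratum] = (m - k * s, m + k * s, median(values))

    cleaned = []
    for rec in records:
        out = dict(rec, flag_imputed=False)  # copy, so the raw data stay intact
        if rec[stratum_key] in limits:
            low, high, med = limits[rec[stratum_key]]
            if not low <= rec[value_key] <= high:
                out[value_key] = med
                out["flag_imputed"] = True  # keeps track of the edit
        cleaned.append(out)
    return cleaned
```

Because the function returns copies, the raw values remain available for back-checks, and the flag records which values were imputed.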
If a skip was missing or inconsistent with the responses given in the related detailed question, the detailed question response overruled the skip and the skip was adjusted.
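A minimal sketch of this skip rule in Python, assuming a hypothetical encoding in which the filter (skip) question holds "yes", "no" or None (missing) and a blank detailed question is None. The actual variable codes in the questionnaire are not given here. The case of a "yes" filter with a blank detailed question is left unchanged, because the rule above does not cover it.

```python
def reconcile_skip(filter_answer, detail_answer):
    """Return (adjusted_filter, was_edited) so the filter agrees with the detail."""
    # A response in the detailed question overrules a missing or contradicting skip.
    if detail_answer is not None and filter_answer != "yes":
        return "yes", True
    # Otherwise the filter is kept as reported.
    return filter_answer, False
```

For example, `reconcile_skip("no", 5)` adjusts the filter to "yes" and reports that an edit was made, while `reconcile_skip("no", None)` leaves the record untouched.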
The difficulty of achieving consistency between age and the level of school currently attended was addressed by introducing a predefined acceptable age range, with upper and lower cut-offs, for each level of school from Primary 1 to University. People deemed too old for the school level reported were corrected to "not currently attending", and the initially reported school level was imputed into the "highest ever school level" variable.
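The age and school level edit could look roughly like the Python sketch below. The cut-off values in `AGE_RANGES` are invented for illustration and are not the ranges used in the NBHS2009. The field names are likewise hypothetical. Only the "too old" correction described above is implemented. How respondents below the lower cut-off were treated is not stated, so they are left unchanged here.

```python
# Hypothetical (lower, upper) age cut-offs per school level; NOT the survey's values.
AGE_RANGES = {
    "primary_1": (5, 12),
    "university": (16, 40),
}


def edit_school_attendance(person):
    """Return an edited copy of a person record with a flag for the edit."""
    out = dict(person, flag_school_edit=False)
    level = person.get("current_level")
    if level in AGE_RANGES and person["age"] > AGE_RANGES[level][1]:
        # Too old for the reported level: treat it as the highest level ever attended.
        out["current_level"] = "not currently attending"
        out["highest_level"] = level
        out["flag_school_edit"] = True
    return out
```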
To keep track of the amount and type of edits done, all variables with automated or manual intervention were flagged.
Two cleaned master data files were produced from the NBHS2009: one file at the individual level (sections B-D) and one file at the household level (sections E-O). In addition, special files were produced for commodities (section M), used for poverty and food security calculations, and for agriculture (section N), covering crop production and structures.
There were some challenges encountered in the implementation of the survey:
· The change from the Quick Baseline Poverty Survey (QBPS) to the NBHS concept added further modules that inflated the questionnaire. This involved much more work, and additional funds were required to conduct the survey.
· A delay of almost one month in transferring the field work budget to the CBS statistical offices in the states postponed the start of data collection from April to May 2009.
· Due to insecurity in some parts of the Darfur region, six clusters in South Darfur, three in North Darfur and one in West Darfur were replaced within the same geographical areas. In addition, because respondents refused to cooperate with the field work teams in two EAs (clusters), one in each of Blue Nile and Nahr Elnil states, these selected EAs were replaced and the field work was completed.
· The collection of consumption information for some items was made especially hard by the lack of standardized units of measurement in North Sudan. Because these items are obtained in non-standardized units (such as heaps, cups, bundles, rubu etc.), it is hard to calculate consumption in standardized, comparable units (such as kilograms and litres). Accordingly, the questionnaire allowed respondents to report consumption in non-standardized units. A market survey, conducted at state level, provided specific conversion factors for the non-standardized measurement units. While this was the only feasible solution, it may still be prone to non-trivial measurement errors.
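The unit conversion described in the last point can be illustrated with a small Python sketch. The conversion factors, state names and item names in the table below are invented placeholders, not figures from the market survey. The lookup key (state, item, unit) is an assumption based on the statement that factors were collected at state level.

```python
# Hypothetical conversion factors to kilograms, keyed by (state, item, unit).
# The numbers are placeholders, NOT results from the NBHS2009 market survey.
CONVERSION_KG = {
    ("Khartoum", "sorghum", "rubu"): 2.5,
    ("Khartoum", "onions", "heap"): 0.8,
}


def to_kilograms(state, item, unit, quantity):
    """Convert a reported quantity to kilograms, or return None if no factor exists."""
    if unit == "kg":
        return float(quantity)  # already in the standard unit
    factor = CONVERSION_KG.get((state, item, unit))
    if factor is None:
        return None  # leave for manual review rather than guess a factor
    return quantity * factor
```

Returning None for a missing factor keeps unconvertible reports visible instead of silently assigning them a value.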
Harmonized Data
============
- The Statistical Package for the Social Sciences (SPSS) is used to clean and harmonize the datasets.
- The harmonization process starts with cleaning all raw data files received from the Statistical Agency.
- The cleaned data files are then merged to produce one data file at the individual level containing all variables subject to harmonization.
- A country-specific program is written for each dataset to generate, compute, recode, rename, format and label the harmonized variables.
- A post-harmonization cleaning process is then conducted on the data.
- Harmonized data are saved at both the household and the individual level in SPSS format and converted to Stata format.
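The merge step in the list above, which brings household-level variables onto each individual record, might be sketched as follows. The harmonization itself is done in SPSS. This Python version only illustrates the logic, and the key name `hh_id` and the example fields are hypothetical.

```python
def merge_to_individual_level(individuals, households, key="hh_id"):
    """Attach household variables to each individual record, matched on a household key."""
    hh_by_id = {hh[key]: hh for hh in households}
    merged = []
    for person in individuals:
        row = dict(hh_by_id.get(person[key], {}))  # household variables first
        row.update(person)  # individual values win if a variable name clashes
        merged.append(row)
    return merged
```

The result has one row per individual, so household variables are repeated for every member of the same household.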