Corvus Corax
2009-07-28
Statistical data is gathered using the tool safecopy (safecopy web page http://safecopy.sourceforge.net) version 1.5. It provides the features needed for this task: low level device access, a configurable number of read attempts per sector, a per-sector timing log and a bad block list output.
To produce comparable statistical data, the user-definable options of safecopy are set as follows for all statistics gathering runs:
safecopy --sync -L 2 -Z 0 -R 4 -f 1* -T timing.log -o badblocks.log -b 2048 /dev/cdrom test.dat
The relevant output of safecopy is stored in two files: the timing log (timing.log), containing the read time for each block, and the bad block list (badblocks.log).
These files can then be read by a graphical visualization tool like gnuplot.
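As an alternative to gnuplot, the same plots can be produced with a few lines of Python. The sketch below is only illustrative and assumes that each line of timing.log holds a block number followed by a read time; the exact column layout should be verified against the files produced by the run above.

  # Sketch: plot per-sector read times from a safecopy timing log.
  # Assumes a whitespace separated "blocknumber readtime" layout per line;
  # verify this against the actual timing.log produced by safecopy.
  import matplotlib.pyplot as plt

  blocks, times = [], []
  with open("timing.log") as logfile:
      for line in logfile:
          fields = line.split()
          if len(fields) < 2:
              continue  # skip comments or malformed lines
          blocks.append(int(fields[0]))
          times.append(float(fields[1]))

  # logarithmic y axis, since read timings span several orders of magnitude
  plt.semilogy(blocks, times, ",")
  plt.xlabel("block number")
  plt.ylabel("read time")
  plt.savefig("timing.png")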
Safecopy only repeatedly retries the first sector of any set of adjacent sectors that cause read errors. As such, it cannot distinguish between recoverable and unrecoverable sectors within or at the end of such a set. Also, the number of read attempts is finite, which implies the possibility of missed recoverable sectors that simply did not recover during the attempted read calls. Therefore it must be assumed that some of the sectors listed as unrecoverable are in fact hidden recoverable sectors.
On the other hand, some of the sectors identified as readable could be hidden recoverable sectors that happened to be read successfully on the first attempt. Then again, every sector has the potential to eventually fail.
This has implications when using the gathered data as a template for simulated bad media, as discussed below.
To provide data on real world behavior, statistical data has been gathered on real damaged media under several different conditions.
CDROMs are prime candidates for data recovery statistics since they are very common media, easily damaged, and common damage like scratches is easily reproducible. The particular CD used for data generation is worn from heavy use and additionally had deliberately made deep scratches on both the front and the back side of the self-burned CD, producing both permanently damaged sectors and sectors which a drive with good error correction has a chance of recovering. This CD is read in two different drives.
The physical sector size of this medium is 2352 bytes, with 2048 bytes of user data and 304 bytes of error correction and sector address coding. Safecopy can address it in RAW mode, reading the 2352 byte sectors directly, but stores only the 2048 bytes of user data in the destination file. The overall disk size (user data) is 730040320 bytes (696 MB).
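At 2048 bytes of user data per physical sector, this corresponds to 730040320 / 2048 = 356465 sectors.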
An interesting effect on data recovery: while the area with missing reflective layer around sector #100000 is only about 4 millimeters in diameter, the number of affected unreadable sectors is often up to 12, which is more than half a track, sometimes leaving only a few readable sectors on that track.
The likely cause of this behavior, which occurred on all tested CD-drives, is that the laser optic loses its focus where the reflective layer is missing, and needs some time to refocus on the CD surface. Since the CD is constantly spinning, sectors that ``spun through'' during the refocusing time cannot be read. The focus loss appears on every turn in that area, reproducing the ``after scratch out of focus effect'' so to speak. A workaround could be to force an extremely low spinning speed, but since that cannot easily be set in software, it would require a firmware patch or special hardware meant for data rescue.
Re-establishing the reflective layer on the back surface of the CD could possibly reduce the out-of-focus effect; even if this cannot make the directly affected data readable, it would make the data directly following on the same track accessible again.
The mentioned CD is read in the Dell Inspiron 8200's built-in DVD-ROM drive, identifying itself as
HL-DT-STDVD-ROM GDR8081N, ATAPI CD/DVD-ROM drive
This drive handles scratches on the front side of the media badly and is unable to recover most of the affected sectors. It is interesting to see the relative difference in absolute reading speed as the drive automatically adjusts its spin speed when it encounters problems.
ATAPI 24X DVD-ROM drive, 512kb Cache, DMA
As one can see, read attempts on unrecoverable sectors are all slower by about three orders of magnitude. Reading from the area where the reflective layer has been damaged is even slower, leading to recovery times of up to 10 seconds per block.
In bad areas, unrecoverable sectors are heavily mixed with readable and recoverable sectors, usually with both types on the same track only a small number of sectors apart, but spanning a noticeable range of tracks affected by the error cause.
The same CD is read in a
LITE-ON LTR-32123S, ATAPI CD/DVD-ROM drive
This drive performs significantly better in terms of reading from problematic media. It has read most of the data covered by scratches on the underside of the CD. It is hard to believe this was the same source medium.
ATAPI 40X CD-ROM CD-R/RW drive, 1984kB Cache
Here, too, unrecoverable sectors are heavily mixed with readable sectors and recoverable sectors in the bad area, as described above due to the laser refocusing issue.
SAMSUNG DVD-ROM SD-608, ATAPI CD/DVD-ROM drive
This particular drive has not been used for statistics collection; it is mentioned only for reference. Every unreadable sector it encountered caused a firmware crash, leading to a 60 second timeout during which the drive would not respond, followed by an ATAPI bus reset and a drive reset automatically issued by the kernel, which rebooted the drive firmware. Reading an entire damaged CD on this drive would simply have taken far too long.
ATAPI 32X DVD-ROM drive, 512kB Cache
While hopelessly outdated, any old floppy disks still around are by now bound to have at least some unreadable sectors due to loss of magnetization, dust, the long term effect of background radioactivity, etc. Therefore it is likely that any important data needing to be recovered from such a disk will be subject to data recovery tools.
The sector size of this medium is 512 bytes, stored in the IBM AT 1.2 MB 80 track low level format. Safecopy reads this medium through the Linux kernel driver from a standard (though old and dusty) 5¼'' floppy drive, using O_DIRECT to read from /dev/fd1, seeing only the 512 bytes of user data per block. This particular disk has 3 unreadable sectors on tracks 0 and 1 for unknown reasons (probably loss of magnetization).
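For reference, the IBM AT 1.2 MB format stores 15 sectors of 512 bytes per track on 80 tracks and 2 sides, i.e. 80 × 2 × 15 × 512 = 1228800 bytes in 2400 sectors, so the head positioning delay discussed below recurs every 15 sectors.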
A significant feature of these drives is the high head positioning time, resulting in high read times for the first sector of each track. This delay is of the same order of magnitude as the delay imposed by read errors. However, the exact timings for sectors in the different groups are still significantly distinct.
Bad sectors are more than four orders of magnitude slower to respond than good ones.
A feature of this particular disk is the low number of unrecoverable sectors. Nevertheless, the existing bad sectors occur physically and logically close to each other, grouped in a bad area.
SATA and PATA hard disks of different capacities are among the most common hard disks found, and as such are often subject to data recovery. According to Murphy's law, the probability of fatal hard disk failure increases exponentially with the time since the last backup. Recovery performance on these disks is heavily influenced by intelligent drive firmware, sometimes even for the better.
This is a 6007357440 byte (6 so-called GB) IDE drive with a logical sector size of 512 bytes, identifying as:
TOSHIBA MK6015MAP, ATA DISK drive
This drive had been removed from a laptop after sector read errors occurred.
Model=TOSHIBA MK6015MAP, FwRev=U8.12 A, Serial No=51K92837T
As one can see, similar to the behavior of the CD-ROMs, the drive firmware reduces the absolute drive speed after it encounters errors. However, the drive does not return to full speed until it is powered down and restarted. On hard disks, the difference in read timings between successfully read data and errors is extreme, ranging over 5 orders of magnitude (about 100 µs versus 10 seconds per sector).
The examined data confirms the assumption that accessing unrecoverable sectors, or even recoverable sectors, is exceedingly slow compared to accessing readable sectors. It also shows a typical distribution of unrecoverable sectors, which are clustered in bad areas with a shared error cause (for example a scratch, or a head crash). These bad areas can span from a few tracks up to a significant part of the storage medium, depending again on the cause and severity of the error.
It needs to be kept in mind that on some media (for example a HDD with a head crash) accessing unrecoverable sectors, or even bad areas can potentially further damage the drive, rendering more sectors unreadable and increasing data loss.
Therefore, when recovering data from a bad medium, accessing unrecoverable sectors should be avoided completely and accessing recoverable sectors should be minimized and take place at the end of a recovery process. Accessing data outside of bad areas should take place at the beginning.
Within a bad area itself there are two likely sector distributions: either the whole area consists of one contiguous block of unrecoverable sectors, or unrecoverable sectors alternate with readable and recoverable sectors throughout the area, as seen on the CD samples above. The first distribution can be handled comparatively easily by skipping ahead until readable sectors are found again.
For the second (and apparently more common) distribution, however, a satisfying solution is difficult to find. Even finding the end of the bad area with few read attempts is tricky, since there is a chance that such a read attempt hits a readable sector within the bad area, suggesting a premature end of the bad area. This could be addressed with heuristic approaches, for example by analyzing the periodicity of the disk and guessing the likelihood that a sector is unreadable based on its position modulo the track length, relative to the first encountered bad sector. For this, however, the disk geometry must be known, since sector counts per track vary from disk to disk and sometimes even from track to track. In such a case the likelihood of guessing correctly decreases the further the guessed sector is from the reference sector.
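To make this idea concrete, the following Python sketch (illustrative only, not part of safecopy; sectors_per_track, spot_width and decay_tracks are made-up parameters) estimates the probability that a sector is unreadable from its offset within the track relative to the first bad sector, with the confidence decaying the further the sector lies from that reference.

  # Illustrative sketch of a track periodicity heuristic (not safecopy code).
  # A sector is considered likely bad if its offset within the track falls
  # near the offset of the first observed bad sector; confidence decays with
  # the number of tracks between the sector and that reference.
  import math

  def bad_probability(sector, reference_bad, sectors_per_track,
                      spot_width=4, decay_tracks=50.0):
      offset = (sector - reference_bad) % sectors_per_track
      # distance of the sector from the damaged spot within its track
      ring_distance = min(offset, sectors_per_track - offset)
      if ring_distance > spot_width:
          return 0.0
      tracks_apart = abs(sector - reference_bad) / sectors_per_track
      return math.exp(-tracks_apart / decay_tracks)

  def should_postpone(sector, reference_bad, sectors_per_track, threshold=0.5):
      # treat the sector like an unreadable one and leave it for a later run
      return bad_probability(sector, reference_bad, sectors_per_track) > threshold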
The other problem with the second distribution is potentially recoverable data within a bad area. Especially on hard disks with multiple physical platters inside, the amount of data on a single cylinder (spanning all disk surfaces), and as such between two sets of unrecoverable sectors, can be significant, and can potentially span entire files that need to be recovered.
It is conceivable that a heuristic approach could rescue some of this data quickly by making use of the periodicity and the Gaussian-like guessable size of the bad spot within each track, avoiding those areas. However, this still would not rescue readable and recoverable sectors closer to the unrecoverable sectors on the same track, which would require a more thorough access later.
The bad area skipping algorithm of safecopy currently works quite efficiently at avoiding unrecoverable sectors within a set of unrecoverable sectors on a single track. However, skipping over entire bad areas in a first rescue attempt is done via a relatively dumb approach (--stage1 skips 10% of the disk regardless of the size of the erroneous area). As mentioned above, finding the end of an oscillating bad area efficiently is a non-trivial process if the disk geometry is not known. A heuristic algorithm that efficiently determines the extent of such areas could therefore increase the data rescue effectiveness of safecopy.
The same is true for rescuing within a bad area. A probabilistic approach could proactively avoid accessing sectors that have a high likelihood of being unreadable, and treat them like unreadable sectors to be accessed in a later safecopy run.
For both of these, the periodicity of errors needs to be learned on the fly for each particular medium. Since safecopy can identify some hardware by means of its low level driver, for some media an assumed geometry can be provided by the low level back end.
The best implementation would probably be an optionally activated heuristic module, which would dynamically adjust seek resolution and failure skip size, and proactively assume sectors to be bad based on probability. Since the geometry is subject to change, the assumed probability should decrease with the ``distance'' to the last ``real'' error encountered.
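A very rough Python sketch of such a module follows (again purely illustrative and not safecopy code; the growth and shrink factors, limits and decay constant are arbitrary placeholders): the failure skip size grows while read attempts keep failing, shrinks back once reads succeed, and the assumed probability of a sector being bad fades with the distance from the last real error.

  # Illustrative sketch of an adaptive skip size heuristic (not safecopy code).
  import math

  class AdaptiveSkipper:
      def __init__(self, grow=2.0, shrink=0.5,
                   min_skip=1, max_skip=65536, decay_blocks=1024.0):
          self.skip = min_skip            # current failure skip size in blocks
          self.grow = grow                # factor applied after another failure
          self.shrink = shrink            # factor applied after a successful read
          self.min_skip = min_skip
          self.max_skip = max_skip
          self.decay_blocks = decay_blocks
          self.last_error = None          # block address of the last real error

      def on_read_error(self, block):
          self.last_error = block
          self.skip = min(self.max_skip,
                          max(self.min_skip, int(self.skip * self.grow)))
          return self.skip                # how far to jump ahead before retrying

      def on_read_success(self, block):
          self.skip = max(self.min_skip, int(self.skip * self.shrink))

      def assume_bad(self, block, threshold=0.5):
          # proactively treat a block as bad; confidence decreases with the
          # distance to the last real error, since the geometry may change
          if self.last_error is None:
              return False
          distance = abs(block - self.last_error)
          return math.exp(-distance / self.decay_blocks) > threshold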
After creating a benchmark that produces reproducible test results of the behavior of a recovery tool under realistic conditions, it can be used to benchmark other tools, to show their strengths and shortcomings in comparison, and to draw conclusions on the effectiveness of different approaches.
A benchmark for data recovery should mimic the behavior of a set of representative test cases of real bad media. For the creation of this benchmark suite, both the qualitative effect of a sector read attempt and its time consumption should be realistic.
A good starting point for such a benchmark is the debug test library shipped with safecopy, since it sets up a test environment transparently for the program being tested, which remains unaware of this virtual environment. The test library of safecopy 1.4 already provides the ability to produce simulated recoverable and unrecoverable sectors, so all it needs in addition is the ability to simulate timing behavior, plus a reference set of realistic media behavior to be simulated.
As the analysis of the timing statistics of real media above suggests, the timing behavior of tested media can be approximated with a simplified timing behavior, as long as the spread of read timings over several orders of magnitude is preserved in a way that still mimics the characteristics of real media. Non-exponential variations, on the other hand, can probably be ignored since they are subject to uncontrollable dynamic causes (CPU and IO load caused by other processes, etc). That leads to the following observations.
A benchmark tool that wants to simulate a damaged medium according to the simplified media timing characteristics deduced above must simulate the following: the outcome of each individual sector read attempt (readable, recoverable or unrecoverable), and per-sector read delays that preserve the orders-of-magnitude differences between these groups.
These requirements are met by the modified test debug library as shipped with safecopy-1.5.
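The sketch below loosely re-implements the idea in Python for illustration; it is not the code of the safecopy test library, and all concrete numbers (the base read time, the delay factors and the number of attempts after which a recoverable sector succeeds) are placeholders that would have to be derived from measured media.

  # Illustrative sketch of a simplified media timing model (not the actual
  # safecopy test library). All constants are placeholders.
  import time

  class SimulatedMedium:
      def __init__(self, t1=0.0001, slow=(), bad=(), recoverable=(),
                   slow_factor=100, bad_factor=100000, recover_after=3):
          self.t1 = t1                          # base read time in seconds
          self.slow = set(slow)                 # e.g. first sector of each track
          self.bad = set(bad)                   # unrecoverable sectors
          self.recoverable = set(recoverable)   # sectors that eventually succeed
          self.slow_factor = slow_factor
          self.bad_factor = bad_factor          # keeps the orders-of-magnitude gap
          self.recover_after = recover_after
          self.attempts = {}                    # per-sector read attempt counter

      def read(self, sector):
          # simulate one read attempt; returns True on success
          delay = self.t1
          success = True
          if sector in self.bad:
              delay *= self.bad_factor
              success = False
          elif sector in self.recoverable:
              count = self.attempts.get(sector, 0) + 1
              self.attempts[sector] = count
              delay *= self.bad_factor
              success = count >= self.recover_after
          elif sector in self.slow:
              delay *= self.slow_factor
          time.sleep(delay)
          return success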
Some characteristic behavior of media cannot be described by the simplified media timing characteristics. This model simulates sector read times for each sector individually. On real world media, however, the read time also depends on the state of the drive.
One important feature not modelled is the seek time a drive needs to reach a certain sector. Instead, as can be seen in the floppy disk data, these times are simulated as a feature of a single sector (for example the first sector of a track). This does not match real world media, as this delay depends on the sequence and order of sectors being read: reading a medium backwards, for example, would induce this delay on the last sector of a track instead.
A possible workaround would be to measure seek times on a device and include them in the simulation. However, the seek time is a complex function of head position, rotation speed, rotational position and several other drive dependent parameters, which cannot easily be measured in the generic case. A linear delay function depending on the block address distance between the current position and the seek destination could probably produce better simulation results if the reading software does a lot of seeking, although the accuracy of the simulation would still be lacking.
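A linear model of this kind is simple to add to such a simulation. The Python fragment below is an illustrative sketch (not the safecopy implementation); its default of 649 nanoseconds per sector of distance matches the value measured for the Dell drive later in this article and would have to be re-measured for other devices.

  # Illustrative sketch of a linear seek delay model.
  class LinearSeekModel:
      def __init__(self, seek_per_sector=649e-9):   # seconds per sector of distance
          self.seek_per_sector = seek_per_sector
          self.position = 0                         # last accessed sector

      def seek_delay(self, sector):
          # delay charged before reading `sector`, proportional to the
          # block address distance from the previous position
          distance = abs(sector - self.position)
          self.position = sector
          return distance * self.seek_per_sector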
To create and demonstrate a benchmark test case, the statistical data of a real media test must be analyzed and the parameters of the simplified media timing characteristics must be determined. This test case should then be validated by comparing the statistics generated on the test case with the base data set of the real media, confirming that the test case behaves ``reasonably similar'' to the original. ``Reasonably similar'' here means closer to each other than to typical damaged media other than the reference source, while also making recovery tools behave similarly when comparing success rate and time consumption.
Data source for this example is the above mentioned KNOPPIX CD read on the Dell.
Calculating the exponent multiplicands based on T1 leads to the following simplified media timing characteristics:
Simulating this disk with the safecopy test debug library and measuring sector read timings with safecopy-1.5, the following sector read timing statistics were gathered:
As expected, there are random deviations of the actual sector read timings from the simulated times, but they stay within the estimated error band E, where each measured read time differs from the simulated one by at most a factor of 2. Therefore a simplification of the simulation results would reproduce the original simplified media characteristics given by T1, {Xs}, {Xb}, {Xr}, {Yr}.
During the simulation, the same sectors were recovered as in the original safecopy run on the real disk, and read attempts were made on the same sectors. Intuitively, the above requirements of ``reasonable similarity'' seem to be met. To give this claim a foundation, it is necessary to calculate the accumulated performance difference between the original run and the simulation based on the gathered statistics:
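One simple way to compute such a comparison is to accumulate the read times of both runs from their timing logs and look at the relative difference; the snippet below is only a sketch and again assumes the two-column timing-log layout used earlier, with hypothetical file names.

  # Sketch: compare accumulated read times of the real run and the simulation.
  def total_time(path):
      total = 0.0
      with open(path) as logfile:
          for line in logfile:
              fields = line.split()
              if len(fields) >= 2:
                  total += float(fields[1])
      return total

  real = total_time("timing-real.log")              # hypothetical file names
  simulated = total_time("timing-simulated.log")
  print("accumulated difference: %+.1f%%" % (100.0 * (simulated - real) / real))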
The simulation by simplified media characteristics caps the reading time error at 200% for each single sector, while the overall reading time for the whole disk is determined by the reading time error accumulated over the most dominant sectors. In the case of this CD, those are clearly the damaged sectors.
The similarity between the two curves for overall reading time is clearly visible in this diagram.
The same CD in the LITE-ON drive:
Data source is the above analyzed 5¼'' floppy disk.
Most of the simulated readable sectors take a bit too long to read. However, the overall simulation is about 20% too short, since the significantly long read times of the dominant first sector of each track (originally caused by head seeking) are simulated about that much too fast. These errors are within the expected error margin of the simplified media characteristics.
It appears to be possible to perform a reasonably accurate simulation of the behavior of damaged media based on simplified media characteristics, and to use this for benchmarking recovery tools.
The most important aspect not covered is that a benchmark simulation would need to simulate the time consumption of seeking. Otherwise data recovery tools that do a lot of seeking on the source media would have the potential to perform significantly better on the benchmark than on real damaged media. However, a linear delay based on the sector address difference should be sufficient to penalize excessive seeking, even if it does not simulate real world seek times accurately in all cases.
Special consideration is needed for media where seek times are high compared to the read times of damaged sectors even during a linear read, like the 5¼'' floppy disk seen above. If the seek time is simulated by the separate algorithm above, significant delays inflicted on single sectors by seeking during statistics gathering on the source must be compensated for by taking those sectors out of the {Xs} set.
Luckily most media types have insignificant seek times when reading linearly, so no compensation needs to be done.
To estimate the seek time, we compare the read times of two readable sectors that are physically far apart and are not in {Xs} (meaning their read time when reading linearly stays close to the normal read time T1) when read directly after each other. The seek time for a distance of one sector would then be calculated as tseek = (t2 - T1) / n, where t2 is the read time of the second sector and n is the number of sectors the two sectors are apart.
This delay has therefore been measured on some of the media covered by this article, and the results are bundled into a benchmark suite, to be published when safecopy-1.5 is, together with this article, the data files and scripts.
Doing a forced seek over the whole disk (on the Dell drive) with a sparse include file using safecopy resulted in the following read times:
After setting up the simulator with a delay of 649 nanoseconds per seeked sector, the simulation of the same sequence led to these very accurate timings:
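With 649 nanoseconds per sector of distance, a seek across the whole 356465-sector image would, for example, be simulated as 356465 × 649 ns ≈ 0.23 seconds.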
The difference in seek and read times when reading media ``backwards'', however, is still not simulated. My assumption would be that seeking backwards on a CD might be slower, while on a hard disk there should be no measurable difference. However, since safecopy does not offer a ``read backwards'' mode, measuring this is currently just too much effort.