Update: Friday 8 January 2021 - Fixed!
Since kernel version 5.4, my Aarch64 systems have become very unreliable,
requiring regular reboots to keep them working. Worryingly, symptoms have
so far pointed towards filesystem data corruption, which results in the
root filesystem being marked read-only. This normally results in something
like one of these messages:
EXT4-fs error (device nvme0n1p2): ext4_lookup:1707: inode #271688: comm mandb: iget: checksum invalid
[7478798.720368] EXT4-fs error (device mmcblk0p1): ext4_lookup:1707: inode #157096: comm mandb: iget: checksum invalid
EXT4-fs error (device mmcblk0p1): ext4_lookup:1707: inode #173544: comm mandb: iget: checksum invalid
[365750.234472] EXT4-fs error (device mmcblk0p1): ext4_lookup:1707: inode #166384: comm mandb: iget: checksum invalid
[4175456.231948] EXT4-fs error (device mmcblk0p1): htree_dirblock_to_tree:1004: inode #396582: comm find: Directory block failed checksum
The result is that the journal is aborted and the rootfs is marked read-only.
The known facts so far:
- it has not been seen on kernel 5.2 on Armada 8040 hardware
(with an uptime of 560 days).
- it has been seen on all mainline kernel versions from 5.4 to 5.9.
- it occurs on several of my Armada 8040 and NXP LX2160A based systems,
which are both Cortex-A72 based systems. I have all the errata enabled
in the kernel.
- it seems independent of the media; it has been seen on the rootfs of
two different NVMes on two different platforms, uSD, and eMMC.
- the time to failure ranges from a week to three months, which makes
attempting a bisection of the changes between 5.2 and 5.4 infeasible.
- I've run xfstests (as suggested by tytso) on the LX2160A and
generic/531 triggered the inode checksum error.
Investigation with debugfs sometimes shows that the inode checksum
is invalid, but if the block device is flushed (via hdparm) and re-read
from the media, the inode checksum is then correct. This implies that the
data in memory/CPU caches does not match the data on the media, especially
when the inode has not changed for days.
Below is a log of some of the recent instances:
29th February 2020
Error: [73729.556544] EXT4-fs error (device nvme0n1p2): ext4_lookup:1700: inode #917524: comm rm: iget: checksum invalid
Platform: NXP LX2160A
Media: XPG SX8200PNP NVMe
Kernel: 5.5
Uptime: 20 hours
Inode #917524 was /var/backups/dpkg.status.6.gz.
Running e2fsck -n /dev/nvme0n1p2 without rebooting showed that the
checksum was incorrect, so further investigation with debugfs was
warranted:
debugfs: id <917524>
0000 a481 0000 30ff 0300 3d3d 465e bd77 4f5e ....0...==F^.wO^
0020 29ca 345e 0000 0000 0000 0100 0002 0000 ).4^............
0040 0000 0800 0100 0000 0af3 0100 0400 0000 ................
0060 0000 0000 0000 0000 4000 0000 c088 3800 ........@.....8.
0100 0000 0000 0000 0000 0000 0000 0000 0000 ................
*
0140 0000 0000 5fc4 cfb4 0000 0000 0000 0000 ...._...........
0160 0000 0000 0000 0000 0000 0000 af23 0000 .............#..
0200 2000 1cc3 ac95 c9c8 a4d2 9883 583e addf ...........X>..
0220 3de0 485e b04d 7151 0000 0000 0000 0000 =.H^.MqQ........
0240 0000 0000 0000 0000 0000 0000 0000 0000 ................
*
debugfs: stat <917524>
Inode: 917524 Type: regular Mode: 0644 Flags: 0x80000
Generation: 3033515103 Version: 0x00000000:00000001
User: 0 Group: 0 Project: 0 Size: 261936
File ACL: 0
Links: 1 Blockcount: 512
Fragment: Address: 0 Number: 0 Size: 0
ctime: 0x5e4f77bd:c8c995ac -- Fri Feb 21 06:25:01 2020
atime: 0x5e463d3d:dfad3e58 -- Fri Feb 14 06:25:01 2020
mtime: 0x5e34ca29:8398d2a4 -- Sat Feb 1 00:45:29 2020
crtime: 0x5e48e03d:51714db0 -- Sun Feb 16 06:25:01 2020
Size of extra inode fields: 32
Inode checksum: 0xc31c23af
EXTENTS:
(0-63):3705024-3705087
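As a sanity check, the extent map is consistent with the inode's size and
block count fields. A minimal sketch of the arithmetic (assuming the usual
4096-byte ext4 block size; the helper name is mine, not from any tool):

```c
#include <stdint.h>

/* Sanity-check the extent map against the inode: the single extent
 * (0-63):3705024-3705087 spans 64 filesystem blocks.  Assuming a
 * 4096-byte ext4 block size, that is 262144 bytes - enough to hold the
 * 261936-byte file, and matching the inode's Blockcount of 512 (counted
 * in 512-byte sectors). */
static uint64_t extent_bytes(uint64_t first_lblk, uint64_t last_lblk,
                             uint32_t block_size)
{
	return (last_lblk - first_lblk + 1) * (uint64_t)block_size;
}
```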
This is, as I remember, operating on the in-memory data rather than
the on-disk data, and the inode checksum of 0xc31c23af was incorrect.
I corrected the checksum using debugfs "sif" command, which wrote a
corrected checksum. This resulted in:
debugfs: id <917524>
0000 a481 0000 30ff 0300 3d3d 465e bd77 4f5e ....0...==F^.wO^
0020 29ca 345e 0000 0000 0000 0100 0002 0000 ).4^............
0040 0000 0800 0100 0000 0af3 0100 0400 0000 ................
0060 0000 0000 0000 0000 4000 0000 c088 3800 ........@.....8.
0100 0000 0000 0000 0000 0000 0000 0000 0000 ................
*
0140 0000 0000 5fc4 cfb4 0000 0000 0000 0000 ...._...........
0160 0000 0000 0000 0000 0000 0000 b61f 0000 ................
^^^^
0200 2000 aa15 ac95 c9c8 a4d2 9883 583e addf ...........X>..
^^^^
0220 3de0 485e b04d 7151 0000 0000 0000 0000 =.H^.MqQ........
0240 0000 0000 0000 0000 0000 0000 0000 0000 ................
*
With only that change, e2fsck then passed:
e2fsck -n /dev/nvme0n1p2
e2fsck 1.44.5 (15-Dec-2018)
Warning: skipping journal recovery because doing a read-only filesystem check.
/dev/nvme0n1p2 contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/nvme0n1p2: 121163/2097152 files (0.1% non-contiguous), 1349227/8388608 blocks
The file seemed to be intact; being a gzip file, that's easy to verify
since gzip files contain their own checksums, and if the data is invalid
they won't be readable anyway.
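The self-checking property comes from the gzip container format itself: each
member starts with two magic bytes and ends with an eight-byte trailer
holding the CRC-32 and length of the uncompressed data, so a full
decompression (e.g. "gzip -t") verifies the payload. A hedged sketch of that
layout (helper names are mine):

```c
#include <stddef.h>
#include <stdint.h>

/* A gzip member begins with the magic bytes 0x1f 0x8b and ends with an
 * 8-byte little-endian trailer: the CRC-32 of the uncompressed data,
 * followed by its length modulo 2^32 (RFC 1952).  Decompressing the
 * whole file therefore checks the payload against its own checksum. */
static int gzip_magic_ok(const uint8_t *buf, size_t len)
{
	return len >= 2 && buf[0] == 0x1f && buf[1] == 0x8b;
}

static uint32_t gzip_trailer_crc(const uint8_t *buf, size_t len)
{
	const uint8_t *t = buf + len - 8;	/* caller ensures len >= 18 */

	return t[0] | (t[1] << 8) | ((uint32_t)t[2] << 16) |
	       ((uint32_t)t[3] << 24);
}
```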
6th June 2020
Error: EXT4-fs error (device nvme0n1p2): ext4_lookup:1707: inode #271688: comm mandb: iget: checksum invalid
Platform: NXP LX2160A
Media: XPG SX8200PNP NVMe
When I originally noticed the problem just after midnight, debugfs said
that the inode did indeed have an incorrect checksum. However, by 11am,
debugfs said the checksum was correct - the machine had not been rebooted
and the rootfs was mounted read-only. This suggests that the on-media
copy was in fact correct, but the in-memory copy was incorrect.
This is the dump of the inode after it had "self-healed":
debugfs: id <271688>
0000 a481 0000 f108 0000 2518 fd5d 2518 fd5d ........%..]%..]
0020 9f49 715c 0000 0000 0000 0100 0800 0000 .Iq\............
0040 0000 0800 0100 0000 0af3 0100 0400 0000 ................
0060 0000 0000 0000 0000 0100 0000 ed19 1100 ................
0100 0000 0000 0000 0000 0000 0000 0000 0000 ................
*
0140 0000 0000 b42f 4f06 0000 0000 0000 0000 ...../O.........
0160 0000 0000 0000 0000 0000 0000 c9cf 0000 ................
0200 2000 8d83 086d bebf 0000 0000 086d bebf ....m.......m..
0220 2518 fd5d 086d bebf 0000 0000 0000 0000 %..].m..........
0240 0000 0000 0000 0000 0000 0000 0000 0000 ................
*
debugfs: stat <271688>
Inode: 271688 Type: regular Mode: 0644 Flags: 0x80000
Generation: 105852852 Version: 0x00000000:00000001
User: 0 Group: 0 Project: 0 Size: 2289
File ACL: 0
Links: 1 Blockcount: 8
Fragment: Address: 0 Number: 0 Size: 0
ctime: 0x5dfd1825:bfbe6d08 -- Fri Dec 20 18:51:17 2019
atime: 0x5dfd1825:bfbe6d08 -- Fri Dec 20 18:51:17 2019
mtime: 0x5c71499f:00000000 -- Sat Feb 23 13:24:47 2019
crtime: 0x5dfd1825:bfbe6d08 -- Fri Dec 20 18:51:17 2019
Size of extra inode fields: 32
Inode checksum: 0x838dcfc9
EXTENTS:
(0):1120749
# e2fsck -n /dev/nvme0n1p2
e2fsck 1.44.5 (15-Dec-2018)
Warning: skipping journal recovery because doing a read-only filesystem check.
/dev/nvme0n1p2 contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/nvme0n1p2: 147476/2097152 files (0.1% non-contiguous), 1542719/8388608 blocks
12th July 2020
Error: [7478798.720368] EXT4-fs error (device mmcblk0p1): ext4_lookup:1707: inode #157096: comm mandb: iget: checksum invalid
Platform: SolidRun Clearfog GT-8k (Armada 8040)
Media: eMMC
Uptime: 89 days
Inode #157096 is /usr/share/man/nl/man1/apt-transport-mirror.1.gz, which
debugfs gives:
ctime: 0x5ebcd62f:ba34bf1c -- Thu May 14 06:25:03 2020
atime: 0x5ebcd63b:a2906fa0 -- Thu May 14 06:25:15 2020
mtime: 0x5eba730a:00000000 -- Tue May 12 10:57:30 2020
crtime: 0x5ebcd62f:a25cccf4 -- Thu May 14 06:25:03 2020
Inode checksum: 0x13fd5c3c (bad)
Inode checksum: 0x600eba80 (good)
The different checksum is the only difference that debugfs reports for
the inode between the failing and corrected inodes. Also, we seem to
have an inode that has demonstrably not changed for over a month with an
incorrect checksum in memory, but good checksum on the media. This seems
to mean that either the checksum in memory is wrong or the data in memory
is wrong. The following rather confirms this.
Running e2fsck -n /dev/mmcblk0p1 without a reboot gave:
Inode 13755 passes checks, but checksum does not match inode. Fix? no
Inode 157096 passes checks, but checksum does not match inode. Fix? no
Simply flushing the block device with "hdparm -f" made these errors go
away; e2fsck then did not complain about the checksum failures.
The contents of the file are valid gzip, and the only thing that is wrong
is the inode checksum.
16th August 2020
Error: EXT4-fs error (device mmcblk0p1): ext4_lookup:1707: inode #173544: comm mandb: iget: checksum invalid
Platform: SolidRun Macchiatobin Single-shot (Armada 8040)
Media: eMMC
This is another instance where the problem with inode #173544 has
corrected itself.
30th August 2020
Error: [365750.234472] EXT4-fs error (device mmcblk0p1): ext4_lookup:1707: inode #166384: comm mandb: iget: checksum invalid
Platform: SolidRun Clearfog GT-8K (Armada 8040)
Media: eMMC
Uptime: 4 days
Kernel: 5.8
I've added some debug code to ext4_inode_csum_verify() to dump out the
inode contents and checksums when there is a checksum failure.
9th November 2020 failure
Error: not recorded iget: checksum invalid
Platform: SolidRun Clearfog GT-8K (Armada 8040)
Media: eMMC
Uptime: 70 days
Kernel: 5.8
After the previous instance, I added some debug code. After running for
70 days on kernel 5.8, the kernel spat out another inode checksum failure
along with my debug output:
[6131696.234604] provided = ea2b60d5 calculated = 7929a3c0
[6131696.238402] inode(ffffff839e059500) = a4 81 00 00 46 0d 00 00 5c 92 88 5e 17 92 88 5e c6 56 f0 5b 00 00 00 00 00 00 01 00 08 00 00 00 00 00 08 00 01 00 00 00 0a f3 01 00 04 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00 e4 14 0a 00
This translates to (ext4 data is little endian):
i_mode = 0x81a4
i_uid = 0x0000
i_size_lo = 0x00000d46
i_atime = 0x5e88925c (Sat Apr 4 14:57:48 2020 +0100)
i_ctime = 0x5e889217 (Sat Apr 4 14:56:39 2020 +0100)
i_mtime = 0x5bf056c6 (Sat Nov 17 17:58:30 2018 +0100)
i_dtime = 0x00000000
i_gid = 0x0000
i_links_count = 0x0001
i_blocks_lo = 0x00000008
i_flags = 0x00080000
l_i_version = 0x00000001
i_block = {
0x0001f30a
0x00000004
0x00000000
0x00000000
0x00000001
0x000a14e4
...
}
Note that the i_block array serves several different purposes, and is 15
32-bit words long.
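The decode above can be sketched with a couple of little-endian accessor
helpers operating on the raw byte dump (hypothetical names, not from the
kernel):

```c
#include <stdint.h>

/* Read little-endian 16- and 32-bit fields from a raw inode byte dump,
 * as in the field-by-field decode above.  The first eight bytes of the
 * dump (a4 81 00 00 46 0d 00 00) decode to i_mode = 0x81a4 and
 * i_size_lo = 0x00000d46. */
static uint32_t get_le16(const uint8_t *p)
{
	return p[0] | (p[1] << 8);
}

static uint32_t get_le32(const uint8_t *p)
{
	return p[0] | (p[1] << 8) | ((uint32_t)p[2] << 16) |
	       ((uint32_t)p[3] << 24);
}
```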
Further investigation reveals:
- The data dump above (first 64 bytes) matches the on-media copy.
- The access/modification times are months before the time when the
checksum error has happened, suggesting that the inode has not been
modified recently.
- The "provided" checksum is correct for the data on the media, as
confirmed with debugfs.
Unfortunately, this is not the complete 256 bytes of inode, so there is
no way to know why the checksum has failed - it doesn't even contain the
stored checksums (which are stored as two separate 16-bit integers.) I
updated the debug code to print the full 256 bytes of the inode as per
the patch below, rebooted the system into a 5.9 kernel and waited for
the problem to recur.
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index bf596467c234..f5d335452f1d 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -98,6 +98,13 @@ static int ext4_inode_csum_verify(struct inode *inode, struct ext4_inode *raw,
 	else
 		calculated &= 0xFFFF;
 
+	if (provided != calculated) {
+		pr_err("provided = %08x calculated = %08x\n", provided, calculated);
+		pr_err("inode(%p)\n", raw);
+		print_hex_dump(KERN_ERR, "", DUMP_PREFIX_OFFSET, 16, 1, raw, EXT4_INODE_SIZE(inode->i_sb), false);
+		pr_err("recalculated = %08x\n", ext4_inode_csum(inode, raw, ei));
+	}
+
 	return provided == calculated;
 }
In this patch, I print the in-memory checksum and the calculated
checksum, print the address of the inode, dump all 256 bytes of the
inode, and then print a freshly recalculated checksum (which should match
the initial calculation). If it doesn't match the first calculation, it
means that there is some bug in the CRC32c crypto code, or a problem
with memory ordering/coherency.
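As noted earlier, the stored checksum is split across two 16-bit on-disk
fields (l_i_checksum_lo in the osd2 union, and i_checksum_hi in the extra
inode space when present). A minimal sketch of the recombination the verify
path performs (the helper name is mine):

```c
#include <stdint.h>

/* Recombine the two little-endian 16-bit halves of the stored inode
 * checksum, as ext4_inode_csum_verify() does.  For the inode dumped
 * earlier, lo = 0x23af and hi = 0xc31c give back the checksum that
 * debugfs reported, 0xc31c23af. */
static uint32_t ext4_stored_csum(uint16_t lo, uint16_t hi)
{
	return (uint32_t)lo | ((uint32_t)hi << 16);
}
```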
28th December 2020 failure
Error: [4175456.231948] EXT4-fs error (device mmcblk0p1): htree_dirblock_to_tree:1004: inode #396582: comm find: Directory block failed checksum
Platform: SolidRun Clearfog GT-8K (Armada 8040)
Media: eMMC
Uptime: 48 days
This failure is different from the previous, as it did not produce the
usual "iget: checksum invalid" but "Directory block failed
checksum" instead, which has never been seen before. The directory
concerned is "/var/cache/man/pt/cat8". However, as with many of
the previous instances, I find that the problem seems to have "self-healed"
by the time I've noticed it. The directory which failed its checksum is
perfectly readable by the system (and without rebooting it) - again
suggesting that the in-memory copy was faulty but the on-media copy was
fine.
Consequently, this means another reboot, and restarting the three-month
wait for the next failure.
Some questions and answers:
- how does ext4 calculate the inode checksums?
The checksums are calculated by calling out to the kernel's crypto
shash crc32c code.
- how are ext4 inodes aligned on disk and memory?
ext4 inodes on this media are 256 bytes in size, and are naturally
aligned.
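The crc32c shash computes the Castagnoli CRC. A plain userspace sketch of
that polynomial is below; it is illustrative only, since the kernel goes
through the crypto shash driver and ext4 seeds the calculation with
per-filesystem and per-inode data, neither of which is reproduced here:

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32C (Castagnoli polynomial, reversed form 0x82F63B78).
 * With a zero seed this matches the standard CRC-32C check value:
 * the ASCII string "123456789" hashes to 0xE3069283. */
static uint32_t crc32c(uint32_t crc, const void *buf, size_t len)
{
	const uint8_t *p = buf;

	crc = ~crc;
	while (len--) {
		crc ^= *p++;
		for (int i = 0; i < 8; i++)
			crc = (crc >> 1) ^ (0x82F63B78 & -(crc & 1));
	}
	return ~crc;
}
```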
The corruption feels very much like a memory ordering bug or a cache
coherence bug, but these systems are supposed to be cache coherent.
As this has been going on for so long, and there isn't a clear cause, it
has completely eroded my confidence in Aarch64 as a viable architecture
for running anything useful. This presents quite a problem: if the problem
"vanishes" without there being an adequate explanation (e.g. after changing
the compiler or filesystem type), how would I know that the systems are
then stable? Would they be stable if they run without problem for three
months, six months, a year, a decade?
I have reported this problem a number of times on mailing lists but it
has attracted very little interest - somewhat understandably so, given
that it takes up to three months to appear.
4th January 2021
Today, I've been able to trigger an inode checksum failure a few times on
the LX2160A:
provided = d06328dd calculated = 3ba43925
inode(ffffffa6d6782000)
00000000: a4 81 00 00 de 08 00 00 79 2b 05 5e 7e 2b 05 5e
00000010: 6a cb 45 5c 00 00 00 00 00 00 01 00 08 00 00 00
00000020: 00 00 08 00 01 00 00 00 0a f3 01 00 04 00 00 00
00000030: 00 00 00 00 00 00 00 00 01 00 00 00 30 a2 0a 00
00000040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00000050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00000060: 00 00 00 00 24 42 a8 74 00 00 00 00 00 00 00 00
00000070: 00 00 00 00 00 00 00 00 00 00 00 00 dd 28 00 00
00000080: 20 00 63 d0 5c 48 ce e5 00 00 00 00 00 00 00 00
00000090: 7e 2b 05 5e e0 1a a2 ab 00 00 00 00 00 00 00 00
000000a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
000000b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
000000c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
000000d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
000000e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
000000f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
recalculated = d06328dd
EXT4-fs error (device nvme0n1p2): ext4_lookup:1707: inode #144001: comm md5sum: iget: checksum invalid
The dump of the inode data appears to be correct, and the recalculated
checksum after dump agrees with the checksum in the inode itself. This
is, of course, the worst news, because it doesn't really narrow down
what is going on. The same questions remain - was the data fed into
the first checksum incorrect in some way (due to cache coherence or
memory ordering) or was the checksum calculation faulty? There's no way
to know from the above. What we can say is that dumping the data and
recalculating the checksum gives the correct answer, and the first
checksum is wrong for some reason.
I have disabled the ARM64 optimised CRC32 support (arch/arm64/lib/crc32.S),
introduced in commit 7481cddf29ed ("arm64/lib: add accelerated crc32
routines") as part of v4.20, and as expected, the problem still exists.
5th January 2021
Having ruled out the CRC32 code yesterday, that left two possibilities -
cache coherence or memory ordering. To work out which, I decided to add
a mb() into ext4_inode_csum_verify() right before the checksum is
initially calculated. Initial testing on the LX2160A platform seems to
suggest that makes the inode checksum failure much less likely to happen.
(I am not going to say "doesn't", since it's going to take at least
three months, if not more, to even give a hint.)
Digging further, there was a change during the 5.4 merge window which
changed the barriers - 22ec71615d82 ("arm64: io: Relax implicit barriers
in default I/O accessors"). Will Deacon assures me that this is correct,
and he spent a long time validating it. I have now reverted this commit,
rebuilt the kernel and put it on the ARM64 platforms that I have been
running 5.4+ kernels. Time (three to six months) will tell whether this
has fixed the problem.
In the mean time, some further debug - I've tried changing the __iormb()
and __iowmb() to use "dmb osh" rather than the load/store variants. I've
now ended up with this from the LX2160A platform:
[ 23.252955] provided = d22f8aab calculated = cac5d3d7
[ 23.256697] inode(ffffffa6d9006f00)
[ 23.258963] 00000000: a4 81 00 00 43 02 00 00 ec 56 f3 5f 2c 18 fd 5d
[ 23.264104] 00000010: 7d c9 ff 5b 00 00 00 00 00 00 01 00 08 00 00 00
[ 23.269246] 00000020: 00 00 08 00 01 00 00 00 0a f3 01 00 04 00 00 00
[ 23.274389] 00000030: 00 00 00 00 00 00 00 00 01 00 00 00 76 91 20 00
[ 23.279529] 00000040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 23.284671] 00000050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 23.289813] 00000060: 00 00 00 00 fd aa b9 50 00 00 00 00 00 00 00 00
[ 23.294953] 00000070: 00 00 00 00 00 00 00 00 00 00 00 00 ab 8a 00 00
[ 23.300097] 00000080: 20 00 2f d2 8c 79 5a c7 00 00 00 00 48 4f 1c 90
[ 23.305239] 00000090: 2c 18 fd 5d 8c 79 5a c7 00 00 00 00 00 00 00 00
[ 23.310378] 000000a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 23.315520] 000000b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 23.320659] 000000c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 23.325798] 000000d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 23.330940] 000000e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 23.336079] 000000f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 23.341218] recalculated = cac5d3d7
[ 23.343397] EXT4-fs error (device nvme0n1p2): ext4_lookup:1707: inode #525808: comm md5sum: iget: checksum invalid
This is unexpected - note that the recalculated value is the same as the
initial calculation. However, after extensive checking, the hexdump
matches what is on disk, and debugfs and e2fsck are happy
with that - as now is the kernel. So, d22f8aab is in fact the correct
checksum. And the hexdump is correct. But calling ext4_inode_csum()
after dumping the entire inode correctly still produced the wrong
result. This makes little sense.
Further testing - with __iormb() using dma_rmb() (dmb oshld) and __iowmb()
using wmb() (dsb st), the problem remains:
provided = 2e5f9e28 calculated = 12ec3a0e
...
recalculated = 2e5f9e28
EXT4-fs error (device nvme0n1p2): ext4_lookup:1707: inode #272094: comm md5sum: iget: checksum invalid
However, testing with __iormb() as rmb() (dsb ld) and __iowmb() as
dma_wmb() (dmb oshst) appears to pass tests.
7th January 2021
Finally, we have got to the bottom of this problem, with the help of
Will Deacon and Arnd Bergmann. It appears to be a bug in mainline
gcc-4.9 - Android and Linaro gcc-4.9 have the fix. The kernel tickles
this bug in the EXT4 checksum code with the stack protector disabled.
This exhibits itself with this code in ext4:
static inline u32 ext4_chksum(struct ext4_sb_info *sbi, u32 crc,
			      const void *address, unsigned int length)
{
	struct {
		struct shash_desc shash;
		char ctx[4];
	} desc;

	BUG_ON(crypto_shash_descsize(sbi->s_chksum_driver) != sizeof(desc.ctx));

	desc.shash.tfm = sbi->s_chksum_driver;
	*(u32 *)desc.ctx = crc;

	BUG_ON(crypto_shash_update(&desc.shash, address, length));

	return *(u32 *)desc.ctx;
}
generating:
0000000000000004 <ext4_chksum.isra.14.constprop.19>:
4: a9be7bfd stp x29, x30, [sp, #-32]! <------
8: 2a0103e3 mov w3, w1
c: aa0203e1 mov x1, x2
10: 910003fd mov x29, sp <------
14: f9000bf3 str x19, [sp, #16]
18: d10603ff sub sp, sp, #0x180 <------
1c: 9101fff3 add x19, sp, #0x7f <------
20: b9400002 ldr w2, [x0]
24: 9279e273 and x19, x19, #0xffffffffffffff80 <------
28: 7100105f cmp w2, #0x4
2c: 540001a1 b.ne 60 <ext4_chksum.isra.14.constprop.19+0x5c> // b.any
30: 2a0303e4 mov w4, w3
34: aa0003e3 mov x3, x0
38: b9008264 str w4, [x19, #128] <------
3c: aa1303e0 mov x0, x19
40: f9000263 str x3, [x19] <------
44: 94000000 bl 0 <crypto_shash_update>
44: R_AARCH64_CALL26 crypto_shash_update
48: 350000e0 cbnz w0, 64 <ext4_chksum.isra.14.constprop.19+0x60>
4c: 910003bf mov sp, x29 <======
50: b9408260 ldr w0, [x19, #128] <======
54: f9400bf3 ldr x19, [sp, #16]
58: a8c27bfd ldp x29, x30, [sp], #32
5c: d65f03c0 ret
60: d4210000 brk #0x800
64: 97ffffe7 bl 0 <ext4_chksum.isra.14.part.15>
The bug is the order of the two instructions marked with "<======": the
"mov sp, x29" deallocates the local variable "desc" from the stack, and
the following "ldr" then reads the result back from that deallocated
memory. If we receive an interrupt and context switch at that point,
"desc" will be overwritten, and hence the checksum will be corrupted.
8th January 2021
It is a big relief that a definitive reason for the problem has finally
been found. When you consider that merely upgrading the compiler would
have made the bug vanish without explanation, you would have been left
not knowing whether the bug had been solved, or whether it had merely
been masked by different instruction timings. This in turn means that
you'd forever be wondering whether your filesystems would be corrupted,
or your system would fail at some random point in the future - would you
trust your data on such a system? Many would likely not.
Hence, it became very important to find the cause of this problem. As
I have said, I got to the point of considering taking all my Aarch64
hardware down to the local recycling centre precisely because this bug
had completely eroded my ability to trust Aarch64 as an architecture,
and it was taking so long to track down the bug.
I am very grateful to Will Deacon and Arnd Bergmann for their time
helping to track this down - which was really key. Will Deacon found a
recipe that reproduced it more reliably than I had managed. Will also
identified that 5.10 built with his kernel configuration did not exhibit
it, but 5.9 built with my configuration did - that then gave me something
to work with, to identify what change in the kernel configuration seemed
to mask the bug. Thanks!