My server keeps crashing (well, to be precise, it gets so stuck that it needs a cold reboot). Does anyone have an idea what to do?
@transcaffeine have you tried to reproduce this with other kernel versions or even other Linux distros? I had a similar bug that was fixed with a kernel update (but this was years ago; I don't think the two are directly related ^^)
@jakob since the behaviour is very unpredictable (not load-related, temp-related or kernel-version-related), i could not reproduce it. i have a similar motherboard with the same chipset and the same firmware, which does not show this problem. it's very strange, crashing after hours of running with no indication in the logs
@jakob that is the pattern, the missing spots are crashes. there were crashes at ~17:00 as well, but i caught these faster than the ones in the middle of the night
@transcaffeine hm. Sounds a lot like bad hardware. If it's an option to have the system down for a day I'd suggest trying to re-seat cpu and memory and running memtest (even though I don't think it's the ram)
@jakob funnily enough, it all started with a CPU cooler swap. since then, i have reseated the CPU once again and inspected the socket and CPU pins, and everything appears to be normal
@x44203 that's calling sqrt() and malloc()/free() a lot, and is holding up completely fine since it started. i don't have a "known-good" CPU to swap, sadly. how could i stress-test every instruction the CPU has?
@transcaffeine I don't know, I guess running what was running before could work too? Though it'd take a while till it would crash again. It would be nice to be able to reproducibly crash it.
@x44203 yep, even 100% load is not crashing it. it's mostly crashing after hours and hours of running totally normally with no warnings, no high load and normal thermals, running the usual services
@transcaffeine Well, the thing with ESD is that it can cause exactly this kind of hard-to-diagnose, sporadic error, because you don't really know which transistor(s) are damaged, or where
@x44203 so basically i need to swap everything to be sure
@transcaffeine At least that's what you usually do when probably one component is damaged: if it stops crashing after swapping something, that part was likely broken.
@transcaffeine @jakob Hm, apparently the IHS is connected with thermal epoxy, I guess the backside of the chip is grounded or so. I don't think that dried paste would cause much ESD, and if you didn't remove the chip itself, I don't think that ESD is very likely unless you touched other stuff on the mainboard... Did you do things like a RAM test yet to exclude that being the problem?
@x44203 @jakob i did not run a memtest yet, but should look into it - is there a memtest86 live USB for DDR4 now? last time i checked, there still was none.
the CPU got pulled out of the socket because the thermal paste was so shitty, it basically glued the CPU to the HS, which is super annoying and might have caused problems.