
My server keeps crashing (well, to be precise, getting so stuck that it needs a cold reboot) - does anyone have an idea what to do?


@transcaffeine have you tried to reproduce this with other kernel versions or even linux distros? Had a similar bug that was fixed with a kernel update (but this was years ago. I don't think that those two are directly related ^^)

@jakob since the behaviour is very unpredictable (not load-related, temp-related or kernel-version-related), i could not reproduce it. i have a similar motherboard with the same chipset and the same firmware, which does not show this problem. it's very strange, crashing after hours of running with no indication in the logs

@jakob that is the pattern, the missing spots are crashes. there were crashes at ~17:00h as well, but i caught those faster than the ones in the middle of the night
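The "missing spots are crashes" idea can be made explicit: if something on the box writes a timestamp at a fixed interval, every gap in that series marks a window in which the machine was hung. A minimal sketch, assuming the timestamps are already available as `datetime` objects; the function name `find_gaps` and the `tolerance` factor are made up for illustration and not taken from any monitoring tool.

```python
from datetime import datetime, timedelta


def find_gaps(timestamps, expected_interval, tolerance=2.0):
    """Return (last_seen, next_seen) pairs where samples are missing.

    A gap is reported when two consecutive samples are further apart
    than tolerance * expected_interval (both names are hypothetical).
    """
    ordered = sorted(timestamps)
    limit = expected_interval * tolerance
    gaps = []
    for prev, cur in zip(ordered, ordered[1:]):
        if cur - prev > limit:
            gaps.append((prev, cur))
    return gaps


if __name__ == "__main__":
    # Example data: one sample per minute, then silence until 17:30.
    start = datetime(2020, 1, 1, 16, 55)
    samples = [start + timedelta(minutes=m) for m in range(6)]
    samples += [datetime(2020, 1, 1, 17, 30)]
    for last_seen, next_seen in find_gaps(samples, timedelta(minutes=1)):
        print("no data between", last_seen, "and", next_seen)
```

The start of each gap is roughly the last moment the system was still alive, which is the timestamp to compare against logs or sensor readings.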

@transcaffeine hm. Sounds a lot like bad hardware. If it's an option to have the system down for a day I'd suggest trying to re-seat cpu and memory and running memtest (even though I don't think it's the ram)

@jakob funny enough, it all started with a CPU cooler swap. since then, i have reseated the CPU once again and inspected socket and CPU pins, and everything appears to be normal

@x44203 @jakob possible but unlikely, ground was connected at all times. it would have to be ESD on the CPU when it was out of the case. which _could_ happen but since the pins weren't touched...? idk

@x44203 @jakob is it possible to detect this? especially if it is a bad CPU or MoBo? I don't want to swap both...

@transcaffeine @jakob Maybe with stress testing with something that tries to execute all possible stuff on the CPU, or just what was running before, but insert a known-good CPU or put the CPU into a system where the rest works fine.

@x44203 that's calling sqrt() and malloc()/free() a lot, and is holding up completely fine since it started. i don't have a "known-good" CPU to swap, sadly. how could i stress test every instruction the CPU has?
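One cheap addition to a pure load test is to check results instead of only generating heat: run the same deterministic workload over and over and compare digests, so a silently wrong calculation shows up as a mismatch. A minimal sketch, assuming Python is available on the machine; `workload_digest` and `run_consistency_check` are hypothetical names, and this does not come close to exercising every instruction the CPU has.

```python
import hashlib
import math


def workload_digest(rounds=20000):
    """Run a fixed floating-point and allocation workload, return its SHA-256."""
    h = hashlib.sha256()
    for i in range(1, rounds + 1):
        value = math.sqrt(i) * math.sin(i)
        # Allocate a small buffer per step so memory is touched as well.
        buf = bytearray(repr(value).encode())
        h.update(buf)
    return h.hexdigest()


def run_consistency_check(passes=5, rounds=20000):
    """Repeat the workload and count passes whose digest differs from the first."""
    reference = workload_digest(rounds)
    mismatches = 0
    for _ in range(passes):
        if workload_digest(rounds) != reference:
            mismatches += 1
    return mismatches


if __name__ == "__main__":
    bad = run_consistency_check()
    print("mismatching passes:", bad)
```

On healthy hardware the mismatch count stays at zero; any other value would point at CPU or RAM rather than software. A hang with no mismatch beforehand would still leave the question open.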

@transcaffeine I don't know, I guess running what was running before could work too? Though it'd take a while till it would crash again. It would be nice to be able to reproducibly crash it.

@x44203 yep, even 100% load is not crashing it. it's mostly crashing after hours and hours of running totally normal with no warnings, no high load and normal thermals, running the usual services

@transcaffeine Well the thing with ESD is that it can cause exactly that kind of hard-to-diagnose problems and sporadic errors, because you don't really know which transistor(s) are damaged, or where

@transcaffeine At least that's what you usually do when likely 1 component is damaged, if it stops crashing after swapping something, that was likely broken.

@transcaffeine @x44203 ESD failures are pretty hard to detect (they can even start to appear months after the ESD event) :/
maybe you can borrow a motherboard/cpu combo with the same socket to cross-test your hardware?

@jakob @x44203 but even then, i'd need to run it for 2-3 days, because the CPU has run for almost two days without symptoms as well...

@transcaffeine @jakob Hm as long as you were grounded, it's relatively unlikely. Though if not, ESD can sometimes jump a few mm.

@x44203 @jakob the whole rack is grounded and the PSU was connected but not online -> ground path was there all the time. rubbing dried thermal paste off the CPU IHS _could_ be a culprit, but the IHS can take more ESD than the pins

@transcaffeine @jakob Hm, apparently the IHS is connected with thermal epoxy, I guess the backside of the chip is grounded or so. I don't think that dried paste would make much ESD, and if you didn't remove the chip itself, I don't think that ESD is very likely unless you touched other stuff on the mainboard... Did you do things like a RAM test yet to exclude that being the problem?

@x44203 @jakob i did not run a memtest yet, but should look into it - is there a memtest86 live USB for DDR4 now? last time i checked, there still was none.
the CPU got pulled out of the socket because the thermal paste was so shitty, it basically glued the CPU to the HS, which is super annoying and might have caused problems.

@transcaffeine @jakob Hm if there was much force applied, maybe something cracked a bit or so, making sporadic contact when hot / cold / vibrations or so.

@x44203 @jakob might be, but artificial load which increases temps shows nothing. don't really want to vibrate a system full of HDDs...

@transcaffeine @jakob Hm does memtest refuse to work on DDR4? (I've only got DDR2/3 and only used it on that yet)
