I run a little headless NAS, which runs Obarun [because my daily driver and my experimental machine use Obarun as well, why complicate matters]. This is a simple little machine that is mostly responsible for storage, but also transcodes video files. Since it has a very basic processor and no video card, this is a slow process.
As long as the machine is not transcoding, it is rock solid. Never any outages. When it is transcoding, it is not so stable. Transcoding a big video file can take 50+ hours, and the system very frequently freezes during such a transcode. This doesn't happen after a set number of hours or after a set amount of data; it seems to be entirely random.
Stress-testing the RAM and chipset, the HDs and the power supply all revealed nothing wrong. Replacing the RAM didn't help. I logged temperatures, voltages, df, top and free every minute, and up to the minute a freeze happens they look perfectly normal.
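For anyone who wants to reproduce this kind of per-minute logging, here is a minimal sketch in Python. It is not the exact script used on the NAS. The log path, the function names and the use of `sensors` (from lm_sensors) for temperatures and voltages are assumptions. Each snapshot is flushed and fsynced so that the last entry before a freeze has the best chance of actually reaching the disk.

```python
import datetime
import os
import subprocess
import time

# Commands to sample. "sensors" is an assumption (requires lm_sensors);
# a missing command is recorded as a failure instead of stopping the logger.
COMMANDS = [["df", "-h"], ["free", "-m"], ["top", "-b", "-n", "1"], ["sensors"]]


def snapshot(commands):
    """Run each command and return one timestamped text block with its output."""
    stamp = datetime.datetime.now().isoformat(timespec="seconds")
    parts = [f"=== {stamp} ==="]
    for cmd in commands:
        try:
            result = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
            output = result.stdout
        except (OSError, subprocess.TimeoutExpired) as exc:
            output = f"failed: {exc}\n"
        parts.append(f"--- {' '.join(cmd)} ---\n{output}")
    return "\n".join(parts) + "\n"


def append_durably(path, text):
    """Append text and force it to disk, so a freeze cannot lose buffered lines."""
    with open(path, "a") as log:
        log.write(text)
        log.flush()
        os.fsync(log.fileno())


def log_forever(path, interval_seconds=60):
    """Write one snapshot per interval until interrupted."""
    while True:
        append_durably(path, snapshot(COMMANDS))
        time.sleep(interval_seconds)

# Example (hypothetical path): log_forever("/var/log/freeze-watch.log")
```

The fsync matters here because a hard freeze discards anything still sitting in the page cache, which could make the logs look healthier than the machine really was in its last minute.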
I access the machine using SSH, and this works fine up until a freeze happens. When I try to access the machine after a freeze, it's not even registered by my router any more. Restarting the machine makes it respond as usual, but I can see that there is no disk operation after the moment of the freeze. All logs show nothing unusual until the sudden freeze.
Since this only happens when I run a specific piece of software [avidemux], I must assume at this point that the machine fails because of some software error.
Is it possible to set up the s6-system to keep tabs on the misbehaving software, and kill it if it freezes the PC? I have never found Linux systems to be so easily crashed by misbehaving software, and I am quite nonplussed how this happens.
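To make the question concrete, here is a minimal sketch in Python of the kind of supervision I have in mind. It starts a command and kills it if a watched output file stops growing for too long. The function name, the command and the path are all hypothetical placeholders, not real avidemux options. It also has an obvious limit: it can only catch a transcode that hangs on its own. If the whole machine freezes, this watcher freezes with it.

```python
import os
import subprocess
import time


def run_with_stall_guard(cmd, watched_path, stall_seconds, poll_seconds=1.0):
    """Run cmd; kill it if watched_path does not change size for stall_seconds.

    Returns the exit code of cmd, or None if it was killed for stalling.
    """
    proc = subprocess.Popen(cmd)
    last_size = -1
    last_change = time.monotonic()
    while proc.poll() is None:
        try:
            size = os.path.getsize(watched_path)
        except OSError:
            size = -1  # file not there (yet): counts as "no progress"
        now = time.monotonic()
        if size != last_size:
            last_size, last_change = size, now
        elif now - last_change > stall_seconds:
            proc.kill()
            proc.wait()
            return None
        time.sleep(poll_seconds)
    return proc.returncode

# Example with placeholder values (the command line is hypothetical):
# run_with_stall_guard(["my-transcode-command"], "/data/out.mkv", stall_seconds=600)
```

Something like this could presumably be launched from the run script of an s6 service, but whether s6 offers a better built-in way to do it is exactly what I am asking.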