Bistable multivibrator
Non-state actor
Tabs for AI indentation, spaces for AI alignment
410,757,864,530 DEAD COMPUTERS

  • 25 Posts
  • 384 Comments
Joined 1 year ago
Cake day: July 6th, 2023

  • There are tradeoffs to higher and higher grades of redundancy, and the appropriate level depends on the situation. Across VMs, you just need to know how to set up HA for the system. Across physical hosts, you have to procure a second server and more precious Us in a rack. Across racks or aisles, you might have to rent a whole second rack. Across fire-door-separated rooms, you need a DC that offers such a feature. Across DCs, you might need more advanced networking: SDN fabrics, VPNs, BGP and the like. Across sites in different regions, you might hit latency issues, have to hire people in multiple locations or deal with multiple colo providers or ISPs, maybe even lay entire fiber lines. Across states or countries, you might have to deal with regulatory compliance in multiple jurisdictions. Especially in 2001, none of this was as easy as selecting a different Availability Zone from a dropdown (see the sketch below).

    Running a business always involves accepting some level of risk. It seems reasonable for some companies to decide that if someone does a 9/11 to them, they have bigger problems than IT redundancy.
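
    For contrast, here’s roughly what that dropdown amounts to today: a minimal sketch using boto3, where the region, AMI ID, and instance type are placeholder assumptions rather than anything from a real setup.

    ```python
    # Sketch: spread one instance into every availability zone in a region.
    # The region, AMI ID, and instance type below are placeholders.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # List every AZ the region offers; no second rack, fire doors, or fiber lines needed.
    zones = [
        z["ZoneName"]
        for z in ec2.describe_availability_zones()["AvailabilityZones"]
        if z["State"] == "available"
    ]

    # One instance per AZ: physically separate facilities, selected in a for loop.
    for zone in zones:
        ec2.run_instances(
            ImageId="ami-0123456789abcdef0",  # placeholder AMI
            InstanceType="t3.micro",
            MinCount=1,
            MaxCount=1,
            Placement={"AvailabilityZone": zone},
        )
    ```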


  • At a previous job, a colleague and I used to take on most of the physical data center work. Many of the on-prem customers were moving to public cloud, so a good portion of the work was removing hardware during decommissioning.

    We wanted to optimize our use of the rack space, so the senior people decided we would move one of our storage clusters to the adjacent rack during a service break night. The box was built for redundancy, with dual PSUs and network ports, so we considered doing the move with the device live, with at least one half of it connected at all times. In the end we settled on a more conventional approach, and one of the senior specialists live migrated the data onto another appliance before the move.

    Down in the DC, the other senior showed us what to move and where, and we started carefully unplugging the first box. He came back to check on us just after we had taken it out.

    Now, I knew what a storage cluster appliance looked like, having carried our old one out of the DC not too long before. You have your storage controller, with the CPU, OS and networking bits on it, possibly a bunch of disk slots too, and then a number of disk shelves connected to it. This one was quite a bit smaller, but that’s just hardware advancement for you: from four shelves of LFF SAS drives down to some SSDs. Capacity requirements were also trending downwards, what with customers moving to pubcloud.

    So we put the storage controller in its new home and started to remove the disk shelf from under it. There had been a 2U gap between the controller and the shelf, so we decided to ask whether that was on purpose and whether we should leave a gap in the new rack as well.

    “What disk shelf?”

    Turns out the new storage appliance was even smaller than I had thought: just one 2U box, which contained two entire independent storage controllers, not just redundant power and network. The thing we had removed was not part of the cluster we were moving. It was the second cluster, which was at that moment also handling the duties of the appliance we were actually supposed to move. Or would have been, if we hadn’t just unplugged it and carried it out.

    We re-racked the box in a hurry and then spent the rest of the very long night rebooting hundreds of VMs that had gone read-only. We called in another specialist, told the on-duty admin to ignore the exploding alarm feed and keep the customers informed, and so on. The next day we had a very serious talk with the senior guy and my boss. I wrote a postmortem in excruciating detail. Another specialist awarded me a Netflix Chaos Monkey sticker.

    The funny thing is that there was quite reasonable redundancy in place and so many opportunities to avert the incident, but Murphy’s law struck hard:

    1. We had decomm’d the old cluster not long before, reinforcing my expectation of a bigger system.
    2. The original plan of moving the system live would have left both appliances reachable at all times. Even if we made a mistake, it would have only broken one cluster’s worth of stuff.
    3. Unlike most of the hardware in the DC, the storage appliances were unlabeled.
    4. The senior guy went back to his desk right before we started unwittingly unplugging the other system.
    5. The other guy I was working with was a bit unsure about removing the second box, but figured I knew better and trusted my judgment.


  • Yeah, a plane hijacking is totally like a buffer overflow.

    Bleeding is also a bit like a buffer overflow, since blood goes in a place it’s not supposed to. Hurricanes are another example of a buffer overflow. Accidentally wearing a shirt inside out? Buffer overflow. Unskippable ads are buffer overflow. War is buffer overflow. I had my buffer overflown by some guy claiming to be a wallet inspector. Aliens are a type of buffer overflow. I sometimes have buffer overflow with my girlfriend. Buffer overflow was an inside job. I put too much shine paste in my polishing machine and you better believe that was a buffer overflow.

    When a train crashes into a station building, that’s not a buffer overflow, though. That’s a buffer overrun.


  • > AWS is only tolerated because product managers ask for it, not because engineers like it; AWS is shit.

    Yes, but the competition is hardly much better. Well, maybe Google is; I didn’t touch it much back when I still did public cloud stuff. Azure leads with “look, our VPS offering is called ‘Virtual Machines’ instead of ‘EC2’, isn’t that simple?” and then proceeds to make everything even clunkier and more complicated than AWS. And don’t get me started on the difference in technical and customer support between the two.

    > There is no moat.

    You keep reiterating this, but I still need you to explain the implications. OK, sure, you can run a model on a home computer. Notwithstanding that those models still amount to overhyped novelty toys, home computers are also capable of running servers, databases, APIs, office suites, you name it. Still, corporations and even consumers are renting these as SaaS and will continue to do so for the foreseeable future.

    The AI fad is highly hype-driven, so there’s still incentive to be the one who trains the latest, biggest and shiniest model, and that still takes datacenters’ worth of specialized compute and training data. LLM-based AI is an industry built on FOMO. How long until that shiny new LLM torrent you got from 4chan is so last season?

    And the OP is correct: Llama is not open source. “The neighbors” only took it from Meta in the same sense that warez sites have taken software forever. Only in this case, the developer was the one committing the copyright infringement.