October 1, 2009
17:29
What would people really like to see in a presentation about #haiku-os ? Any interesting facts and features anybody knows of?
Author: ddevine 
October 3, 2009
20:48

This time I am very happy to be part of the organization team for BeGeistert, the bi-annual gathering of BeOS and Haiku fans in Düsseldorf, Germany. That's because I get to see who registers, and I can tell you that I am almost bursting with excitement, since this BeGeistert will be a big one! Beside the regular BeGeistert visitors, this time there are people coming who I've known for years only via the Internet and who I can now finally meet in person. And there are also a bunch of old-timers coming who didn't participate in the event in years. Even new contributors will show up for the first time, like some of this year's Google Summer of Code and Haiku Code Drive students.

read more

Author: Haiku OS 
October 5, 2009
03:38

As many of you are well aware, we’ve been experiencing some serious downtime the past couple of days. Starting Friday evening, our network storage became virtually unavailable to us, and the site crawled to a halt.

We’re hosting everything on Amazon EC2, aka. “the cloud”, and we’re also using their EBS service for storage of everything from our database, logfiles, and user data (repositories.)

Amazon EBS is a persistent storage solution for EC2, where you get high-speed (and free) connectivity from your instances, while it’s also replicated. That gives you a lot for free, since you don’t have to worry about hardware failure, and you can create periodic “snapshots” of your volumes easily.

While we were down, it was unknown to us what exactly the problem was, but it was almost certainly a problem with the EBS store. We’ve been working closely with Amazon the past 24 hours resolving the issue, and this post will outline what exactly went wrong, and what was done to remedy the problem.

Symptoms

What we were seeing on the server was high load, even after turning off anything that took up CPU. Load is a result of stuff “waiting to happen”, and after reviewing iostat, it became apparent that the “iowait” was very high, while the “tps” (transactions per second) was very low for our EBS volume. We tried several things at this point:

  • Un-mounting and re-mounting the volume.
  • Runing xfs_check on the volume, which reported no errors (we use XFS.)
  • Moving our instances and volumes from us-east-1b to both us-east-1a and us-east-1c.

None of these resolved the problem, and it was at this point we decided to upgrade to the “Gold plan” of support to gain access to the 1-hour turnaround technical support with Amazon.

The Support (0 hours after reporting it)

We filed an “urgent” ticket with Amazons support system, and within 5 minutes we had them on the phone. I spoke to the person there, describing our issue, continuously claiming that everything pointed to a network problem between the instance and the store.

What came from that, was 5 or 6 hours of advice, some of which were obvious timesinks, while others were somewhat credible. What they kept coming back to was that EBS is a “shared network resource” and performance would vary. We were also told to use RAID0 to distribute our load over several EBS instances to increase the throughput.

At this point, we were getting less throughput than you can pull off of a 1.44MB floppy, so we didn’t accept this for an answer. We did some more tests, trying to measure the bandwidth of the machine by fetching their “100mb.bin” files, which we couldn’t do. We again emphasized that this was in fact, in all likelihood, a network problem.

At this point, our outage was well known, especially in the Twittosphere. We have some rather large customers relying on service with us, and some of these customers have some hefty support contracts with Amazon. Emails were sent.

Shortly after this, I requested an additional phone-call from Amazon, this time to our system administrator. He had been compiling some rather worrying numbers over the past hours, since up until now, the support had refused to acknowledge a problem with the service. They claimed that everything was working fine, when clearly, it was not.

This time, a different support rep. called, and this time, they were ready to acknowledge our problem as “very serious.” We sent them our aggregated logs, and shortly thereafter, they reported that “we have found something out of the ordinary with your volume.”

We had been extremely frustrated up until this point, because 1) we couldn’t actually *do* anything about it, and 2) we were being told that everything should be fine. It felt like there was an elephant right in front of us, and a person next to us was insisting that there wasn’t.

Anyway (8 hours after reporting it)

From here on, we had been graced with the acknowledgement we had been waiting for: There was a problem, and it wasn’t us. We had been thinking that, you know, *maybe* we had screwed up somewhere and this was our fault. We didn’t find anything.

So, back to waiting.

What exactly triggered what happened after this, I’m not sure.

The Big Dogs (11 hours after reporting it)

I received an unrequested phone-call from some higher-up at Amazon. He wanted to tell me what was  going on, which was much appreciated.

He wanted to re-assure me that we were now their top priority, and he had brought in a whole team of specialized engineers to look at our case. That’s nice.

I received periodic updates, and frequent things for us to try. We sent them the logs they asked for, and complied with their wishes.

From this point on, we were treated like they owed us money, which is quite the difference from basically being called a liar earlier on.

Closing in (15 hours after reporting it)

OK, so we are finally getting somewhere. We all agreed that there was a serious networking problem between our EC2 instances and our EBS. This is around the time Amazon called me and asked me to try and put the application back online. So I did. And all was well.

I kindly asked the manager I had on the phone to please explain to me what the problem had been. He said he wasn’t really sure, and that he would set up a telephone conference with his team of engineers.

I dial in, and they start explaining what the problem is.

Now, I have been specifically advised not to say what the problem was, but I believe we owe it to our customers to explain what went wrong. Also, we owe it to Amazon to clear it up, since they were looking pretty bad due to this. I’ve already mentioned the cause shortly on our earlier status page, as well as on IRC, but let me re-iterate.

We were attacked. Bigtime. We had a massive flood of UDP packets coming in to our IP, basically eating away all bandwidth to the box. This explains why we couldn’t read with any sort of acceptable speed from our EBS, as that is done over the network. So, basically a massive-scale DDOS. That’s nice.

This is 16-17 hours after we reported the problem, which frankly, is a bit disheartening. Why did it take so long to discover? Oh well.

Amazon blocked the UDP traffic a couple of levels above us, and everything went back to normal. We surveyed the services for a while longer, and after deciding that everything was holding up fine, we went to bed (it was 4am in the morning.)

This morning

So, when we got up again this morning, things weren’t looking good, again. We were having the exact same symptoms as previously, and before our morning coffee, we re-opened our urgent ticket with Amazon. 2 minutes later I had them on the phone.

I explained that the problem was back, and they assured me the team of engineers working on this yesterday would be re-gathered and have a look. Cool.

About… 2 hours later, the problem was again resolved. Seems that the DDOS-ees figured that we were now invulnerable to UDP flood, so they instead initiated something like a TCP SYNFLOOD. Amazon employed new techniques for filtering our traffic and everything is fine again now.

What’s next

Amazon contacted us again after this was over, and told us they wanted to work with us in the coming days to make sure this doesn’t happen again. They have some ideas on how both they and we can improve things in general.

Are we going to do that? Maybe. We’re seriously considering moving to a different setup now. Not because Amazon isn’t providing us with decent service, which they are, most of the time. While we were down, several large hosting companies took direct contact with us, pitching their solutions. I won’t mention names, but some of the offerings are quite tempting, and have several advantages over what we get with Amazon.

One thing’s for sure, we’re investing a lot of man-hours into making sure this won’t happen again. If this means moving to a different host, so be it. We haven’t decided yet.

In conclusion

Let me round this post off by saying that Amazon doesn’t entirely deserve the criticism it has received over this outage. I do think they could’ve taken precautions to at least be warned if one of their routers started pumping through millions of bogus UDP packets to one IP, and I also think that 16+ hours is too long to discover the root of the problem.

After a bit of stalling with their first rep., our case received absolutely stellar attention. They were very professional in their correspondence, and in communicating things to us along the way.

And to re-iterate, the problem wasn’t really Amazon EC2 or EBS, it was isolated to our case, due to the nature of the attack. All the UDP traffic was conveniently spoofed, so we can’t tell where it originated.

image
03:38

As many of you are well aware, we’ve been experiencing some serious downtime the past couple of days. Starting Friday evening, our network storage became virtually unavailable to us, and the site crawled to a halt.

We’re hosting everything on Amazon EC2, aka. “the cloud”, and we’re also using their EBS service for storage of everything from our database, logfiles, and user data (repositories.)

Amazon EBS is a persistent storage solution for EC2, where you get high-speed (and free) connectivity from your instances, while it’s also replicated. That gives you a lot for free, since you don’t have to worry about hardware failure, and you can create periodic “snapshots” of your volumes easily.

While we were down, it was unknown to us what exactly the problem was, but it was almost certainly a problem with the EBS store. We’ve been working closely with Amazon the past 24 hours resolving the issue, and this post will outline what exactly went wrong, and what was done to remedy the problem.

Symptoms

What we were seeing on the server was high load, even after turning off anything that took up CPU. Load is a result of stuff “waiting to happen”, and after reviewing iostat, it became apparent that the “iowait” was very high, while the “tps” (transactions per second) was very low for our EBS volume. We tried several things at this point:

  • Un-mounting and re-mounting the volume.
  • Runing xfs_check on the volume, which reported no errors (we use XFS.)
  • Moving our instances and volumes from us-east-1b to both us-east-1a and us-east-1c.

None of these resolved the problem, and it was at this point we decided to upgrade to the “Gold plan” of support to gain access to the 1-hour turnaround technical support with Amazon.

The Support (0 hours after reporting it)

We filed an “urgent” ticket with Amazons support system, and within 5 minutes we had them on the phone. I spoke to the person there, describing our issue, continuously claiming that everything pointed to a network problem between the instance and the store.

What came from that, was 5 or 6 hours of advice, some of which were obvious timesinks, while others were somewhat credible. What they kept coming back to was that EBS is a “shared network resource” and performance would vary. We were also told to use RAID0 to distribute our load over several EBS instances to increase the throughput.

At this point, we were getting less throughput than you can pull off of a 1.44MB floppy, so we didn’t accept this for an answer. We did some more tests, trying to measure the bandwidth of the machine by fetching their “100mb.bin” files, which we couldn’t do. We again emphasized that this was in fact, in all likelihood, a network problem.

At this point, our outage was well known, especially in the Twittosphere. We have some rather large customers relying on service with us, and some of these customers have some hefty support contracts with Amazon. Emails were sent.

Shortly after this, I requested an additional phone-call from Amazon, this time to our system administrator. He had been compiling some rather worrying numbers over the past hours, since up until now, the support had refused to acknowledge a problem with the service. They claimed that everything was working fine, when clearly, it was not.

This time, a different support rep. called, and this time, they were ready to acknowledge our problem as “very serious.” We sent them our aggregated logs, and shortly thereafter, they reported that “we have found something out of the ordinary with your volume.”

We had been extremely frustrated up until this point, because 1) we couldn’t actually *do* anything about it, and 2) we were being told that everything should be fine. It felt like there was an elephant right in front of us, and a person next to us was insisting that there wasn’t.

Anyway (8 hours after reporting it)

From here on, we had been graced with the acknowledgement we had been waiting for: There was a problem, and it wasn’t us. We had been thinking that, you know, *maybe* we had screwed up somewhere and this was our fault. We didn’t find anything.

So, back to waiting.

What exactly triggered what happened after this, I’m not sure.

The Big Dogs (11 hours after reporting it)

I received an unrequested phone-call from some higher-up at Amazon. He wanted to tell me what was  going on, which was much appreciated.

He wanted to re-assure me that we were now their top priority, and he had brought in a whole team of specialized engineers to look at our case. That’s nice.

I received periodic updates, and frequent things for us to try. We sent them the logs they asked for, and complied with their wishes.

From this point on, we were treated like they owed us money, which is quite the difference from basically being called a liar earlier on.

Closing in (15 hours after reporting it)

OK, so we are finally getting somewhere. We all agreed that there was a serious networking problem between our EC2 instances and our EBS. This is around the time Amazon called me and asked me to try and put the application back online. So I did. And all was well.

I kindly asked the manager I had on the phone to please explain to me what the problem had been. He said he wasn’t really sure, and that he would set up a telephone conference with his team of engineers.

I dial in, and they start explaining what the problem is.

Now, I have been specifically advised not to say what the problem was, but I believe we owe it to our customers to explain what went wrong. Also, we owe it to Amazon to clear it up, since they were looking pretty bad due to this. I’ve already mentioned the cause shortly on our earlier status page, as well as on IRC, but let me re-iterate.

We were attacked. Bigtime. We had a massive flood of UDP packets coming in to our IP, basically eating away all bandwidth to the box. This explains why we couldn’t read with any sort of acceptable speed from our EBS, as that is done over the network. So, basically a massive-scale DDOS. That’s nice.

This is 16-17 hours after we reported the problem, which frankly, is a bit disheartening. Why did it take so long to discover? Oh well.

Amazon blocked the UDP traffic a couple of levels above us, and everything went back to normal. We surveyed the services for a while longer, and after deciding that everything was holding up fine, we went to bed (it was 4am in the morning.)

This morning

So, when we got up again this morning, things weren’t looking good, again. We were having the exact same symptoms as previously, and before our morning coffee, we re-opened our urgent ticket with Amazon. 2 minutes later I had them on the phone.

I explained that the problem was back, and they assured me the team of engineers working on this yesterday would be re-gathered and have a look. Cool.

About… 2 hours later, the problem was again resolved. Seems that the DDOS-ees figured that we were now invulnerable to UDP flood, so they instead initiated something like a TCP SYNFLOOD. Amazon employed new techniques for filtering our traffic and everything is fine again now.

What’s next

Amazon contacted us again after this was over, and told us they wanted to work with us in the coming days to make sure this doesn’t happen again. They have some ideas on how both they and we can improve things in general.

Are we going to do that? Maybe. We’re seriously considering moving to a different setup now. Not because Amazon isn’t providing us with decent service, which they are, most of the time. While we were down, several large hosting companies took direct contact with us, pitching their solutions. I won’t mention names, but some of the offerings are quite tempting, and have several advantages over what we get with Amazon.

One thing’s for sure, we’re investing a lot of man-hours into making sure this won’t happen again. If this means moving to a different host, so be it. We haven’t decided yet.

In conclusion

Let me round this post off by saying that Amazon doesn’t entirely deserve the criticism it has received over this outage. I do think they could’ve taken precautions to at least be warned if one of their routers started pumping through millions of bogus UDP packets to one IP, and I also think that 16+ hours is too long to discover the root of the problem.

After a bit of stalling with their first rep., our case received absolutely stellar attention. They were very professional in their correspondence, and in communicating things to us along the way.

And to re-iterate, the problem wasn’t really Amazon EC2 or EBS, it was isolated to our case, due to the nature of the attack. All the UDP traffic was conveniently spoofed, so we can’t tell where it originated.

October 8, 2009
15:26
@hex I want my "i" dammit. I want to know how much data I really have. Normal users don't care about the i. You can't take my "i"!
Author: ddevine 
October 11, 2009
20:50
Wasted all weekend waiting for VPSlink to provision a server after a problem with their billing system. Got in a spat and changed providers.
Author: ddevine 
October 13, 2009
11:15
I can't stay awake while reading this MS WMI crap. Horrible book, #VisualBasic is shit and a massive time sink. Far easier to do on Linux.
Author: ddevine 
October 15, 2009
03:50

Well, after a long delay and BeGeistert 021 among us, I suppose it's time to tell my story about the Ohio LinuxFest (OLF).

My friend, Amir, and I set out to Columbus from Pittsburgh on the Friday evening before OLF, and as we traveled, I could already feel the excitement. Once settled at the Hyatt hotel, I scrambled a bit to build/install Haiku on my MSI Wind U100 netbook. I also let Amir try out a live CD on his laptop, and we shared some conversation about Haiku. He wanted to come to OLF mainly for the various talks, and naturally I was there to help man the Haiku table. As he slept, I eventually got my system set up as I liked it with a GCC4 hybrid trunk build (with additional apps/media), and then went to sleep for a couple of hours.

This was my first time being at the OLF, or any computer-related event for that matter, so I didn't know what to expect. Upon meeting Darkwyrm and Mike after registration, I felt right at home and knew the day would go just fine.

read more

Author: Haiku OS 
October 19, 2009
12:34
Modem stops working, won't boot. Fiddled with it for a while to no avail. Go to work. Girlfriend plugs it in. It works. WTF!
Author: ddevine 
14:55
@dantheman Say my name, bitch.
Author: ddevine 
October 23, 2009
16:55
@bigbrovar if you are looking for support/compatibility/performance you can't go past NVIDIA, but of course the drivers are closed source.
Author: ddevine 
October 27, 2009
04:27

On October 24, 2009 I attended and spoke at the Florida Linux Show in Orlando, FL. In this post I'll talk about my experience at the conference.

read more

Author: Haiku OS 
14:10

I'm in the TGV back to Valence on wednesday, which luckily has many power plugs, unlike the Thalys which has wifi but no plug for those battery-drained guys like me. It's 21:30 as I start writing this. Will take some more days to finish though...

Not there yet

But first things first, after attending a meeting on Friday in Grenoble, I headed back to Valence to leave some stuff there, then back to the train station, where my train got delayed by an hour or so. But the other frenchies I was to share a car with were also a bit late, so they didn't have to wait for me too much. We then took the road to Düsseldorf and started to talk about each others work, and GSoC since we had two of the students on board.

read more

Author: Haiku OS 
16:40
@supertunaman interesting thought, but uClibc is under LGPL and BusyBox is under GPLv2 so they are still GNUish. ;)
Author: ddevine 
October 29, 2009
04:10

To all the Haiku fans out there who have been eagerly looking forward to getting their hands on the first official Haiku CD, the wait is finally over: Haiku R1/Alpha 1 CDs are now available on the Haiku Store! Since a lot of sweat and tears have gone into this first Haiku release, we wanted the availability of our first official CD to also be an opportunity for the community to give something back to the project. To that effect, we have priced it as a commemorative CD which, for every unit you purchase, $15 will go into the project coffers. Thus, not only do you get a nicely branded Haiku CD, but you also help fund the future development of Haiku.

Show your support, and head over to the Haiku Store!

(Note: Cafepress-created CDs are CD-R with a full color silkscreen label)

Author: Haiku OS 
05:30
Off Haiku's webpage, Haiku Alpha R1 CDs are now available to purchase. Costing $20USD, $15 of the cost will go towards Haiku! Get your copy from CafePress:

408438071v119_350x350_Disc
Author: Haikuware 
October 30, 2009
15:23
@borncrusader Ubuntu is bloated. Try apt-getting boot up manager and thinning your startup.
Author: ddevine 
15:35
@christianpm18 Only 1600?
Author: ddevine 
16:03
@ports No idea what you said, but that song is awesome.
Author: ddevine