I have recently been looking quite a lot at different storage setups, including storage replication, and so far I have mostly relied on running rsync to copy a file system to an appropriate secondary host. For large file systems - either with a lot of files or simply a lot of changing data - this is slow and resource intensive. Not really a problem in some cases, but very problematic if you want your secondary system to have very current data. If you want to cobble something together yourself from commodity hardware, DRBD is an excellent and very feature-rich tool.
First of all, I can't recommend the DRBD User Guide enough. It really lays out the features and usage not just of DRBD but also of some common applications you would use alongside it, like LVM for storage management and Pacemaker and Heartbeat (and others) for clustering.
What DRBD does, basically, is copy writes to a block device over the network to a replica device - this storage set is called a "resource". Generally, you will have two nodes for each resource. During normal operation, one node is "Primary" and the other is "Secondary" for each resource, which logically indicates that one node is writing changes to the resource while the other is keeping a copy.

DRBD is generally very slick in handling replication and the status of the nodes. If the replication path goes down, DRBD marks at what point in time it happened and then keeps track of which blocks have changed since that point, so when the path comes back up it has a list of exactly which blocks need to be transferred instead of having to resync the whole device. (It does the initial whole-device sync for you too, when you create the resource.) And you get basically the same behaviour if your secondary node tanks, or the primary, or both nodes for that matter.

As for setup: when you configure the resource, you specify an IP address for the replication target, and generally you are going to want this on a separate network interface from your general data plane - for example, a cross-over cable for a point-to-point connection between the two nodes.
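To make that concrete, here's a rough sketch of what a resource definition might look like - host names, disks, and addresses here are made up for illustration, and the User Guide is the authority on the syntax:

resource r0 {
    protocol C;
    on node-a {
        device    /dev/drbd0;
        disk      /dev/sdb1;
        address   192.168.100.1:7788;
        meta-disk internal;
    }
    on node-b {
        device    /dev/drbd0;
        disk      /dev/sdb1;
        address   192.168.100.2:7788;
        meta-disk internal;
    }
}

The 192.168.100.x addresses would live on that dedicated replication link rather than on the regular LAN.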
The exception is if both nodes end up in a "primary" state during some overlapping time. Say you automatically bring up the secondary node on a primary failure with Pacemaker, for example, but the issue was a path failure and not a node failure - then both nodes may end up in "primary" state. Since DRBD is tracking when communication is disrupted, it will detect this problem - a "split brain". You get several options for manual resolution (I think automatic as well), including taking the changes of one node or the other, the node with the most changes, the node with the fewest changes, the oldest primary, the youngest primary... You may still be stuck losing some data - but you can also keep both nodes in split brain and consolidate externally (e.g. if you have critical data like financial records where you can never drop a transaction).
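If memory serves (the exact syntax has moved around between DRBD versions, so check the User Guide for yours), manual resolution goes roughly like this. On the node whose changes you've decided to throw away:

drbdadm secondary r0
drbdadm -- --discard-my-data connect r0

and on the surviving node, if it has already dropped the connection:

drbdadm connect r0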
DRBD supports three replication "protocols" called, intuitively, A, B, and C. "A" is asynchronous: a write completes as soon as the local device finishes writing. "B" is "semi-synchronous": a write completes once the data has reached the peer. And "C" is synchronous: a write only completes once the data is written to both devices. I found that "A" and "B" got me similar speeds and "C" was slower - but this was not very rigorous testing, and my replication link was 100Mbps through a shared data plane.
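For what it's worth, switching protocols is just the one line in the resource definition sketched above, e.g. for asynchronous replication:

protocol A;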
One of the things about any of these replication options compared to rsync is that they are generally going to be much nicer on your memory. I find that when rsync scrapes the file system, it effectively nukes the OS's disk cache, such that after rsync runs, users may notice it takes a while to "warm up" again. But replication is not a backup - if a virus eats the files on your primary node, it will eat them on the secondary node too, synchronously or asynchronously - your choice.
If you are using LVM (and you should be - I've posted about LVM before, and so have others), you'll wonder whether to layer DRBD on top of LVM or vice versa. As Chef would say: use DRBD on top of your LVs. Dramatic over-simplification aside, it does depend on what you are doing. If you are using LVM to carve up a pool of storage, for example for virtualization, and you then want the storage layer to replicate your VMs, it may make more sense to create your DRBD volume from physical storage; then it will replicate the whole LVM structure to your replica node. But there are complications, like ensuring LVM will even look at DRBD devices for PVs, managing size changes, etc. There's a time and a place for everything, and that's college.
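To give a taste of one such complication: if you put LVM on top of DRBD, you typically have to tell LVM to scan the DRBD devices (and to ignore the backing disks, or it may complain about duplicate PV signatures). In lvm.conf that's a filter along these lines - illustrative only, adjust for your own device layout:

filter = [ "a|/dev/drbd.*|", "r|.*|" ]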
Um, what else is awesome about DRBD? Offline initialization, a.k.a. "truck based replication" (a.k.a. sneakernet): replicate the node locally, ship it to the remote site, and turn up from there. DRBD Proxy (a paid feature) for when you need to buffer replication over slow or unreliable network links. Dual-primary operation (for use with something like GFS). Three-node operation by layering DRBD on top of DRBD.
Yeah, it's cool. It's Free and free. You can get it stock with Fedora and CentOS (probably Ubuntu and others, but haven't tried it yet).
And one last thing - you cannot mount a resource that is in "Secondary". So if you are getting crazy error messages because you can neither mount nor even fsck your file system, it's probably in Secondary - don't bang your head against the wall, just run "drbdadm primary <resourcename>". Is clear?
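For reference, a quick way to check which state you're in (resource name "r0" and the mount point are just examples):

cat /proc/drbd            # shows e.g. cs:Connected ro:Primary/Secondary
drbdadm role r0           # prints e.g. Primary/Secondary
drbdadm primary r0        # promote this node, then mount as usual
mount /dev/drbd0 /mnt/data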
Thursday, 10 March 2011
Wednesday, 26 January 2011
Tape Devices for Amanda
I've found a few times now that my tape server can be a bit of a pain about tape devices. Generally, I have Amanda configured to use /dev/nst0, but the tape drive isn't always this device if I attach other devices (at least other drives). So rather than configuring "nst0" and then changing it to "nst1" after a few days of realizing the backups aren't working for some reason, I've started using the "tape/by-id" devices instead. So my amanda.conf now shows:
changerdev "/dev/tape/by-id/scsi-1IBM_3573-TL_00X2U78M1255_LL0" # tape device controlled by mtx
tapedev "/dev/tape/by-id/scsi-35000e11138aa0001-nst" # the non-rewind
Is clear?
Saturday, 22 January 2011
No more mail
One more service down - no more mail. All "real" email has been offloaded or canceled, except for uro.mine.nu, which has basically just been sacked. I've closed the ports for SMTP, POP, and IMAP. So now this is it - I'm down to just the web applications that I'm hosting from home.
What I'd like to do is find a dirt-cheap web host for this stuff. None of it is high volume - the old URO forums, which are still used by some of my gamer buddies (I think - haven't checked in a while), some personal blogs including this one, and a couple of personal-site type things. iweb.ca is still offering hosting for $1.67/mo, so I'd like to give them a shot. We shall see - I'll try a couple of different services over the next couple of months.
Sunday, 26 December 2010
One Less Service on Alia
Alia, the latest in the line of servers hosted at home, has one less service to host today. I've sacked the DNS service, which had in the past provided primary DNS for some of the public domains I used. However, those are all now hosted by the DNS providers. I cleaned up the BIND configuration and closed the port so that it no longer forwards in from the Internet.
The last thing it was doing was DNS for the local LAN - the internal DNS to look up the printer (mostly). This is easily handled by DNSMasq in DD-WRT, which is basically a tick-box to replace everything Alia was doing for DNS. And it automatically adds lookups for statically configured DHCP hosts, so I don't have to set up a host once on the router for DHCP and then again on Alia for DNS.
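Under the hood that tick-box is dnsmasq, and a static lease with a resolvable name boils down to a one-liner like this (MAC, name, and address invented for illustration):

dhcp-host=00:11:22:33:44:55,printer,192.168.1.50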
At this point, it looks like Alia will be the last server I host at home. I've offloaded Jabber and now DNS, leaving SMTP and HTTP. SMTP is almost ready to go: there's only one personal domain for one user using it, and that user may retire the domain; otherwise we can move to Google Apps along with the other email. And that will leave HTTP which, since I can get shared hosting for less than $2 a month, is an easy one to offload. Not free, but shutting off Alia, even as an energy-efficient system (low-power CPU and everything), will save just over $2/mo in electricity consumption.
So we're coming to the end of an era. It really goes to show just how greatly hosted services have improved, and also the breadth of features you can get from consumer products for the home. To have all the trappings of a full network, so easy to use and so cheap, is really amazing.
Monday, 6 December 2010
Patch Your #$%^!
According to SANS, the top security threat right now is *drum roll* unpatched applications! *gasp* *shock* Yes, it's blindingly obvious, but organizations (and individuals) are downright negligent in patching desktop applications. The applications that are highly targeted are, again no surprise here, Adobe Flash, Adobe Acrobat Reader, Apple QuickTime, and Microsoft Office. And furthermore: "On average, major organizations take at least twice as long to patch client-side vulnerabilities as they take to patch operating system vulnerabilities. In other words the highest priority risk is getting less attention than the lower priority risk."
So patch your #$%^ or else Walter is going to come beat the #$%^ out of your new car while shouting "This is what happens when you find a stranger in the Alps!"
Or block Flash, Acrobat Reader, and Quicktime - can't say I'd shed any tears for those apps myself ;)
Saturday, 13 November 2010
Disk management with Logical Volume Manager (LVM)
There is a lot of documentation online on how to use Logical Volume Manager (LVM), but I'd like to go over how I've been using it, to illustrate some of its strengths and weaknesses.
The initial driving issue that made LVM a killer app was handling large disks. This one system had an older SCSI RAID attached which only supported 2TB logical disks (a limitation of 32-bit LBA, I think), but the sum of the disks (14 x 300GB) was, well, bigger. The equipment basically let me carve the array into 2TB disks. Using LVM, I can add those Physical Volumes (PVs) to a Volume Group (VG) and create Logical Volumes (LVs) of any size desired - including, ultimately, the total capacity of the RAID.
Another great feature of LVM is snapshots. Generally, a snapshot gives you a temporally fixed view of the file system for special purposes while general use continues unimpeded, by storing the subsequent changes separately. So I can take a snapshot and then back up the snapshot, which assures that the filesystem (in the snapshot) is consistent from the time the backup starts to the time it finishes. Snapshots can also be used simply to roll files back to a previous state: for example, I take a snapshot, run a test application which modifies a file, then restore that file from the snapshot to revert it.
However, LVM snapshots aren't as elegant as they are on some platforms. To create a snapshot, you must first have some unallocated space in your VG. You then allocate that space to the snapshot, where disk changes made since the snapshot can be stored. The bummer, man, is that this is a fixed amount of space you have to have on hand, and if it fills up, your snapshot device fails - and if you had, say, a long backup running against it, you have to restart that backup. Even with this limitation, snapshots are still pretty useful. You can sort of figure out the minimum size you need for a snapshot, and ultimately, if you have snapshot space equal to the live system space, your snapshot can never fill up.
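The commands themselves are simple enough - a quick sketch with made-up volume names, allocating 20GB for changes:

lvcreate -s -L 20G -n mysnap /dev/myvg/mylv   # create the snapshot
mount -o ro /dev/myvg/mysnap /mnt/snap        # back it up at your leisure
umount /mnt/snap
lvremove /dev/myvg/mysnap                     # discard the snapshot when done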
The last feature I'd like to rant about is online filesystem resizing. This is just absolutely great and very useful, especially in concert with handling large volumes and managing snapshots. First of all, if you have a hardware RAID controller which lets you add drives and expand existing arrays as an online operation, LVM is the layer that will let you expand your volumes to suit. There are two ways of doing this. The first is to expand an existing block device (e.g. grow your sda from 1TB to 1.5TB), which you do by modifying the partition table - slightly tricky, but it can be done online. The other way is to add additional devices. Some RAID controllers (good ones) will let you add a second "logical disk" (or "virtual disk", depending on your vendor's jargon). If you add that additional disk, you simply initialize it as a new PV, add it to your VG, and then add whatever you want to your LV.
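For the grow-the-existing-device route, once the partition has been enlarged, pvresize is what tells LVM about the new space (device name is just an example):

pvresize /dev/sda1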
Take the first example I had, where the equipment would only allow 2TB devices. First, you put all your disks in an array and, because you've got a lot of disks, maybe reserve one as a hot spare. So your total capacity is (14 disks - 1 hot spare - 1 for RAID-5 parity) * 300GB = 3600GB. You carve out your first LD at 2TB and it appears in the OS as /dev/sda. Now, generally you should put a partition on your drives; to my knowledge it's not required, but it's generally accepted that most disk applications will behave saner if they see a partition. So you've got /dev/sda1: you initialize it (pvcreate /dev/sda1), create a volume group (vgcreate myvgblah /dev/sda1), and spin out your first LV (lvcreate -l 100%FREE -n mylv myvgblah). Hooray - you create your filesystem (mke2fs -j -L bigfs /dev/myvgblah/mylv) and mount it for regular use. Sometime later you fill up that 2TB and realize there's a pile of unused space. Well, you carve out another LD with the remaining 1.6TB, which appears to the OS as /dev/sdb. Generally, I would expect this device to just show up - no rebooting or any crap like that. So you throw a partition on there, initialize the PV (pvcreate /dev/sdb1), and add it to the existing volume group (vgextend myvgblah /dev/sdb1). With this free space, you can either add it all to the LV (lvextend -l +100%FREE /dev/myvgblah/mylv) or add it incrementally (lvextend -L +100G /dev/myvgblah/mylv), reserving free space for snapshots, additional LVs, and future growth. Then you grow the filesystem to match, as shown below.
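One step worth spelling out: lvextend grows the LV but not the filesystem sitting on it. For ext3, resize2fs does the growing, and on reasonably recent kernels it can do so while the filesystem is mounted:

lvextend -l +100%FREE /dev/myvgblah/mylv
resize2fs /dev/myvgblah/mylv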
Very handy to have all your disks in a pool (your VG) and be able to add logical drives (LVs), snapshot your drives, and incrementally expand your drives.
- Arch
Friday, 10 September 2010
Tab Mix Plus Trick
I had been using a Firefox plugin called New Tab Jumpstart which, for new tabs, shows a splash of recently used pages much like you get with Chrome. I found that it was rarely useful and I was only using a single page from it, if anything. So I removed that plugin and found the feature I needed in Tab Mix Plus: you can control what appears in a new tab, including a specific URL. Since my "home page" is 3 pages, the "home page" option isn't quite what I need, but a specific URL does just the trick.
So there, now I use 2 features of Tab Mix Plus, but it was already #1 in my Essential Plugins simply for the mouse-wheel tab scrolling.