Migrating from GFS to GlusterFS
This week I started the process of migrating from GFS to GlusterFS. The hardware running my GFS cluster is older and I decided it was better to replace it than continue maintaining it.
Background Information
Back in 2003 I needed to find a storage solution that was fast, reliable, and fault-tolerant. It also needed to be accessed by multiple clients simultaneously. After doing much research, I ended up going with an IBM FAStT 600 2Gb Fibre Channel (FC) based storage system. The intent was to replace a single server running a JBOD SCSI disk chassis and RAID-5 card acting as an NFS server.
So, with 10 new servers, 6 FC HBAs, a pair of 8-port FC switches, and Red Hat Enterprise Linux Advanced Server 2.1 with the GFS add-on, I began configuring the new cluster setup. The primary purpose of this cluster was to serve several thousand customer home directories to web and mail servers. Mail was delivered in Maildir format and stored in each user's home directory. Each user had a disk quota, so with everything contained in one directory per user, things were organized very cleanly. I split the home directories up logically, creating 3 base home directories, one per server as the "primary", and then broke them down by first letter of the login, i.e. /homes1/a, /homes2/a, /homes3/a, /homes1/b, etc. When all was said and done, my home directory was in /homes1/m/marc. So far so good.
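The layout above can be expressed as a small shell function. This is only a sketch: the post does not say how a user was assigned to base 1, 2, or 3, so the base number is passed in as a plain argument, and the function name is made up for illustration.

```shell
# Build a home directory path in the /homesN/<first letter>/<login> layout.
# How a user is assigned to base 1, 2, or 3 is not described in the post,
# so the base number is simply an argument here (an assumption).
home_path() {
    base="$1"
    login="$2"
    # First character of the login selects the subdirectory.
    first=$(printf '%s' "$login" | cut -c1)
    printf '/homes%s/%s/%s\n' "$base" "$first" "$login"
}
```

Calling home_path 1 marc prints /homes1/m/marc, matching the example path above.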
The next step was to get the servers talking to the FAStT. So, using the QLogic configuration tools, I set up the multi-pathing through redundant switches and connected everything up. I partitioned the storage array and got GFS up and running. Since I had never worked with a shared storage system like this before, it took me several weeks of trial and error to get things going. Once everything was set up, I began the migration process, which went off without a hitch. Time to make it live.
What an abysmal failure that was. It ran for only a day or two before I migrated everything back. The problem was the GFS distributed lock manager, or DLM. In order to perform an action on a file, all of the active nodes had to agree that the file was locked before the operation could begin. This was a mail store for thousands of users, moving hundreds of thousands of email messages on and off of it each day. In addition, web pages were served from this same store. The DLM couldn't keep up with the sheer number of file operations that I needed.
So, the FAStT sat mostly idle for a few years.
Current Cluster Configuration
Several years ago, I sold the ISP business and started focusing on consulting and my VoIP products. I had the FAStT setup and several of the servers as a starting point. RHEL was now up to version 5, so I decided to test again after 5 more years of development in shared storage and GFS. This time my needs were very different. I still needed highly available shared storage for the VoIP system. The intent was to share voicemail files and other configuration files between servers in a way that there was no single point of failure. So, four systems, each with dual-port HBAs and multi-pathing, plus two FC switches, ensured there was no single point of failure in accessing the FAStT. I set the FAStT up to run RAID1+0 with two hot spares. Very fast, and very fault tolerant.
GFS in this environment works really well. It is very fast, and I can exceed 200MB per second throughput for reading and writing. A single system can fail and the others all still have access to shared storage. Perfect. Most of the time. There have been a couple of instances in the last 3 years where I have had to shut down the entire cluster and bring it back up because of DLM and quorum issues. This requires a multi-system failure, though, and is fairly rare. Over the last three years, this configuration has provided me with five nines of reliability, i.e. 99.999% uptime.
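To put five nines in perspective, the downtime allowance over a period is simple arithmetic. A minimal sketch, assuming 365-day years and using awk for the floating-point math (the function name is made up):

```shell
# Minutes of downtime allowed over a number of years at a given
# availability percentage. Assumes 365-day years.
downtime_minutes() {
    years="$1"
    availability="$2"   # e.g. 99.999
    awk -v y="$years" -v a="$availability" \
        'BEGIN { printf "%.1f\n", y * 365 * 24 * 60 * (100 - a) / 100 }'
}
```

Calling downtime_minutes 3 99.999 prints 15.8, so 99.999% over three years leaves a budget of roughly a quarter of an hour.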
Time to Expand the Cluster
So now it's time to expand the cluster. I need to add several more machines into it to meet the growing need for capacity. The new servers are in a different cabinet on the other side of the data center and have no HBAs in them. That means I need network based access to the shared storage. This part is not really a problem, as I have an HA NFS setup running on the cluster.
The FAStT system is also now quite old. The disks need to be replaced and the replacement cost for FC disks is very high compared to commodity SATA disks.
Where the problem comes in is that I also need to create a backup cluster at another physical location using the same shared storage. This is where things get more complicated. If I had an unlimited budget and an unlimited amount of bandwidth, I would have several options available to me from vendors like EMC and IBM just to name two. Since I don't have either of those two, I have to explore other avenues.
Looking at GlusterFS
Early last year a colleague of mine suggested I look at GlusterFS as a possible solution. The GlusterFS approach is very different than traditional network storage solutions. At the time I didn't feel that it was quite ready. That has changed in the last year. It has become much more polished, better documented and feature complete.
GlusterFS uses the concept of "storage bricks", which are not much more than local storage on each Gluster server. You then combine bricks into volumes, which determine how those bricks are to be accessed by the clients using what Gluster calls translators. The latest GlusterFS release contains four different translators: distributed, replicated, striped and distributed replicated. Each has its place in different types of networks. Gluster also allows you to stack volumes, which is a very cool feature.
GlusterFS also stores all of its files using standard file systems with extended attributes. This means that normal backup software can easily backup and restore data to the storage bricks.
The biggest deviation from "traditional" network file systems as I see it is the fact that the translators are all client driven. What that means is that it is the client that decides which servers to write files to and how to distribute them. The servers do coordinate things by tracking their own storage bricks, but the heavy lifting is done by the clients themselves. With the way technology has been going, LAN speeds will almost certainly outperform hard drive speeds. Gigabit networks are the norm in data centers these days, and 10-gig networks are right around the corner. Processors have been following Moore's Law, but hard drives have not -- they have mechanical parts. SSD is still far too expensive to be practical, and even then SSDs would still not be cost effective over a gigabit or 10-gigabit network. So with today's technology this approach makes sense.
This same approach also helps it to scale. Rather than having expensive file servers using all of their resources coordinating disk access, file distribution, replication, etc., between themselves, it shifts that task to the client. This does a very effective job of distributing resources in a cost-effective way and makes it massively scalable.
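The idea of the client choosing the destination can be illustrated with a toy example. This is not how GlusterFS actually places files: it simply hashes the file name with cksum and takes the result modulo the number of bricks, to show that a client can make a stable placement decision without asking a server. The function name and the brick list are assumptions for illustration.

```shell
# Pick a brick for a file name entirely on the client side.
# NOTE: illustration only. cksum(1) modulo the brick count is NOT the hash
# or layout scheme GlusterFS itself uses.
pick_brick() {
    name="$1"
    shift
    # The remaining arguments are the candidate bricks.
    [ "$#" -gt 0 ] || return 1
    sum=$(printf '%s' "$name" | cksum | cut -d' ' -f1)
    # Drop (sum mod N) bricks from the front; the next one is the choice.
    shift $(( sum % $# ))
    printf '%s\n' "$1"
}
```

The same file name always maps to the same brick as long as the brick list is unchanged, which is the property a client-driven scheme relies on.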
The latest GlusterFS (3.2.1) also includes geo-replication, which is intended to keep storage volumes in sync, even if they are running in different physical locations. This was the last piece that I needed GlusterFS to do before it was feature complete enough for my application.
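As a rough sketch of what starting geo-replication looks like from the command line, assuming the GlusterFS 3.2 CLI form "gluster volume geo-replication MASTER SLAVE start": the slave host and path used in the usage note are hypothetical, and the helper only prints the commands unless RUN=1 is set.

```shell
# Start geo-replication from a master volume to a slave, then ask for status.
# By default the commands are only printed; set RUN=1 to execute them.
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        printf '%s\n' "$*"
    fi
}

start_georep() {
    master="$1"   # name of the local volume
    slave="$2"    # remote target, e.g. host:/path (hypothetical below)
    run gluster volume geo-replication "$master" "$slave" start &&
    run gluster volume geo-replication "$master" "$slave" status
}
```

For example, start_georep shared backup1:/exports/shared-copy would print the start and status commands for a made-up slave named backup1.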
Deploying GlusterFS
Actually deploying GlusterFS was very straightforward. I had two servers with freshly installed CentOS 5.5 x86_64 on them. I downloaded the RPMs, installed them, and 10 minutes later I had a replicated volume set up. I installed the glusterfs-core and fuse packages on a third server that will act as a client and, just like the server setup, it was off and running in a matter of minutes.
Each of the two "server" machines has a pair of 500GB drives running hardware RAID1 on the first and software RAID1 on the second. While not necessary, it does add one additional level of protection since the servers won't be singly tasked with just running GlusterFS. Creating the volumes was simple:
server1# mkdir -p /exports/shared
server2# mkdir -p /exports/shared
server1# gluster peer probe server2
server1# gluster volume create shared replica 2 transport tcp server1:/exports/shared server2:/exports/shared
server1# gluster volume start shared
client1# mkdir /shared
client1# mount -t glusterfs server1:/shared /shared
client1# df
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/md0 149408868 2414152 139282720 2% /
tmpfs 1028712 0 1028712 0% /dev/shm
glusterfs#server1:/shared
466539648 66220160 376238336 15% /shared
That was all it took to get GlusterFS up and running. Total time spent was just a few minutes. The GFS setup took many hours to get things just right.
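To have the client mount the volume at boot rather than by hand, an /etc/fstab entry along these lines should work. This is a sketch reusing the server1, shared, and /shared names from above; the _netdev option is an addition here so the mount waits for networking.

```
# /etc/fstab on client1 (sketch): mount the Gluster volume at boot.
# _netdev delays the mount until the network is up.
server1:/shared  /shared  glusterfs  defaults,_netdev  0 0
```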
There was one last thing I needed to do. Since GlusterFS prefers the 64-bit architecture and I have a mixture of 32 and 64 bit systems, I decided that 64-bit clients will run the native Gluster client (as illustrated above) and that the 32-bit clients will access it via Gluster's built in NFS server. So, I needed to tune the volume to have the NFS server return 32-bit inode addresses for NFS access. This was also very simple:
server1# gluster volume set shared nfs.enable-ino32 on
From any 32-bit clients, I can now use:
mount -t nfs server1:/shared /shared
Conclusions
I'm quite impressed with the latest GlusterFS. The latest GlusterFS release made transitioning away from GFS quite painless. It performs very well and is tunable to different workloads. Any future projects I work on that require highly available shared storage will almost certainly be using GlusterFS.
LUKS Encrypted Disks under Ubuntu 10.10
This posting is mostly just a reference for myself since I don't do this often enough for me to have it memorized. These are the steps I use to create LUKS encrypted disks to use as a backup target that I can take off-site. Since they're off site and stored in my desk where others may have access to the disks, I want to make sure that I'm the only one with access to the data. I use bare OEM style hard drives in a Thermaltake BlacX hard drive docking station. I have one at home and one at the office.
Before I store a backup on a disk, it needs to be set up for LUKS encryption. This posting explains that part of the process.
WARNING: Following these steps will erase disks and lose data!
Consider yourself duly warned. I also do everything from the command line except the final steps. There are GUI tools to do this as well, but the command line is much quicker for me.
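Given that warning, a quick guard before touching a disk is to confirm that neither it nor any of its partitions is currently mounted. The function below is a minimal sketch with a made-up name: it only looks at mount sources in a mounts table and will not notice a disk that is in use through LVM or an open LUKS mapping.

```shell
# Succeed if the device, or anything whose name starts with it (such as a
# partition), appears as a mount source. The table defaults to /proc/mounts;
# the second argument exists so the check can be tried on a sample file.
# Limitation: does not detect use via LVM or an open LUKS mapping.
device_in_use() {
    dev="$1"
    mounts="${2:-/proc/mounts}"
    awk -v d="$dev" 'index($1, d) == 1 { found = 1 } END { exit !found }' "$mounts"
}
```

A typical use would be: device_in_use /dev/sdi && echo "still mounted, stop here".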
To start, I load a drive into the dock and power it up. In this case, the drive I'm loading is a 500G Maxtor drive.
From a shell window, I issue the 'dmesg' command to determine which drive it came up as:
[535060.118638] usb 1-1.4.4: new high speed USB device using ehci_hcd and address 9
[535060.230194] usb-storage 1-1.4.4:1.0: Quirks match for vid 152d pid 2329: 8020
[535060.230358] scsi13 : usb-storage 1-1.4.4:1.0
[535061.228868] scsi 13:0:0:0: Direct-Access MAXTOR S TM3500630AS PQ: 0 ANSI: 2 CCS
[535061.229582] sd 13:0:0:0: Attached scsi generic sg8 type 0
[535061.230173] sd 13:0:0:0: [sdi] 976773168 512-byte logical blocks: (500 GB/465 GiB)
[535061.230918] sd 13:0:0:0: [sdi] Write Protect is off
[535061.230922] sd 13:0:0:0: [sdi] Mode Sense: 34 00 00 00
[535061.230925] sd 13:0:0:0: [sdi] Assuming drive cache: write through
[535061.232459] sd 13:0:0:0: [sdi] Assuming drive cache: write through
[535061.232464] sdi: sdi1
[535061.249699] sd 13:0:0:0: [sdi] Assuming drive cache: write through
[535061.249703] sd 13:0:0:0: [sdi] Attached SCSI disk
So it came up as /dev/sdi. I was once using this particular disk with Fedora before I switched over to Ubuntu, so I know there are partitions on it. I'll need to get rid of those first using fdisk (started with sudo fdisk /dev/sdi), and then create a single new partition on it:
WARNING: DOS-compatible mode is deprecated. It's strongly recommended to
switch off the mode (command 'c') and change display units to
sectors (command 'u').
Command (m for help): p
Disk /dev/sdi: 500.1 GB, 500107862016 bytes
255 heads, 63 sectors/track, 60801 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x33dc2272
Device Boot Start End Blocks Id System
/dev/sdi1 * 1 60802 488386583+ 8e Linux LVM
Command (m for help): d
Selected partition 1
Command (m for help): n
Command action
e extended
p primary partition (1-4)
p
Partition number (1-4): 1
First cylinder (1-60801, default 1):
Using default value 1
Last cylinder, +cylinders or +size{K,M,G} (1-60801, default 60801):
Using default value 60801
Command (m for help): w
The partition table has been altered!
Calling ioctl() to re-read partition table.
Syncing disks.
So now the disk is ready. Set up the encryption with cryptsetup luksFormat on /dev/sdi1, then open the volume and create the filesystem:
WARNING!
========
This will overwrite data on /dev/sdi1 irrevocably.
Are you sure? (Type uppercase yes): YES
Enter LUKS passphrase:
Verify passphrase:
marc@fozzie:~$ sudo cryptsetup luksOpen /dev/sdi1 backups01
Enter passphrase for /dev/sdi1:
marc@fozzie:~$ sudo fdisk /dev/mapper/backups01
Device contains neither a valid DOS partition table, nor Sun, SGI or OSF disklabel
Building a new DOS disklabel with disk identifier 0x75dd258c.
Changes will remain in memory only, until you decide to write them.
After that, of course, the previous content won't be recoverable.
Warning: invalid flag 0x0000 of partition table 4 will be corrected by w(rite)
WARNING: DOS-compatible mode is deprecated. It's strongly recommended to
switch off the mode (command 'c') and change display units to
sectors (command 'u').
Command (m for help): n
Command action
e extended
p primary partition (1-4)
p
Partition number (1-4): 1
First cylinder (1-60800, default 1):
Using default value 1
Last cylinder, +cylinders or +size{K,M,G} (1-60800, default 60800):
Using default value 60800
Command (m for help): w
The partition table has been altered!
Calling ioctl() to re-read partition table.
WARNING: Re-reading the partition table failed with error 22: Invalid argument.
The kernel still uses the old table. The new table will be used at
the next reboot or after you run partprobe(8) or kpartx(8)
Syncing disks.
marc@fozzie:~$ sudo mkfs.ext4 -Lbackups01 /dev/mapper/backups01
mke2fs 1.41.12 (17-May-2010)
Filesystem label=backups01
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=0 blocks, Stripe width=0 blocks
30531584 inodes, 122095743 blocks
6104787 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=4294967296
3727 block groups
32768 blocks per group, 32768 fragments per group
8192 inodes per group
Superblock backups stored on blocks:
32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
102400000
Writing inode tables: done
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done
This filesystem will be automatically checked every 35 mounts or
180 days, whichever comes first. Use tune2fs -c or -i to override.
marc@fozzie:~$ sudo cryptsetup luksClose backups01
And that's it. At this point, I turn off the drive, wait a few seconds, and then turn it back on, and Gnome automatically prompts me to unlock the volume:

After entering the correct password, it will automatically mount using the volume label I specified, backups01, and appear in the file manager, ready to use. When finished using the volume, I use the Nautilus "eject" button to unmount the drive and it's ready to be taken off-site.
There are many ways to set up encrypted volumes. For me, this way is the best, easiest and most convenient.
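For reference, the cryptsetup and mkfs steps from the session above can be condensed into one function. This is a sketch, not a tested tool: the function names are made up, it omits the fdisk pass on /dev/mapper/backups01 because mkfs.ext4 is run against the whole mapped device, and it only prints the commands unless RUN=1 is set. Running it for real destroys whatever is on the partition.

```shell
# Prepare an encrypted backup disk: format for LUKS, open it, create a
# labelled ext4 filesystem, and close it again.
# DESTROYS data on the given partition when RUN=1.
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        printf '%s\n' "$*"
    fi
}

prepare_backup_disk() {
    part="$1"    # partition to encrypt, e.g. /dev/sdi1
    label="$2"   # mapper name and filesystem label, e.g. backups01
    # Each step runs only if the previous one succeeded.
    run sudo cryptsetup luksFormat "$part" &&
    run sudo cryptsetup luksOpen "$part" "$label" &&
    run sudo mkfs.ext4 -L "$label" "/dev/mapper/$label" &&
    run sudo cryptsetup luksClose "$label"
}
```

Calling prepare_backup_disk /dev/sdi1 backups01 without RUN=1 lists the four commands so they can be reviewed before anything is erased.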
Linux Soft Phone Roundup
I've been working heavily with VoIP for the last couple of years, and every few months I find myself looking at SIP soft phones again. I haven't really used them at all under Linux in a long time because none of them quite fit my needs or are as good as the ones available under Windows. Because of this, and the fact that I do 99% of my work under Linux, I've got 8 SIP phones, 3 ATA's and 2 regular phones sitting on my desk right now. This makes for quite a bit of clutter.
As I said, every few months I look again to see how the various soft phones have progressed. The projects I've been working on for the last few weeks would have gone much more easily if I had a soft phone that suited my needs. Make a minor software tweak, dial from one phone on my desk to another, wait 10 seconds, and repeat 30 or 40 times in the average day depending on what I was testing.
This is a rather big posting, so it has been split into multiple pages.
Open Source and other nuggets
The silence for the last few months has been because my life got crazy. I've been spending a lot of time working, and the time I haven't spent working has gone into preparing for our new baby. Yes, that's right, baby. In 2006 we were told by a fertility doctor that we would likely never have kids. We didn't figure it out until less than a week before Jessica entered the second trimester. So that's been consuming all of my weekends: getting the nursery (strange calling it that) and the rest of the house ready for the new baby.
I'll still try to update as time permits on the Myth setup as I have had time to work on it, just not write about it.
That brings us to the next topic I wanted to blurb about.

Classic Foxtrot
Over the last 15 years, I've written quite a bit of code. When I sold my ISP business a few years back, I retained all of the rights to the source code. I haven't really done much with the code lately, but the need to update at least some of it became pressing over the last few weeks. That being said, I decided to release my code under the GPL through my consulting company, Cheetah Information Systems.
There are three big projects and two libraries that I'll be releasing over the coming weeks. Both libraries and one of the projects have already been released. I'll talk more about those later, as well as provide some of the history of each of them here. For now, the code is located at http://code.cheetahis.com/.
These are my first "real" open source contributions in a very long time. Back in the day I did add a bit to the Linux kernel (v1.x days) and had a bit of fame with the "Marc Lewis" MySQL patches to Cistron RADIUS (now FreeRADIUS), but haven't contributed much lately.
The packages were all integral parts of my old business, but I'm not in that business any more. I'm releasing them in the hopes that someone may find them useful.
They are:
- The Communications Control Center (CCC) - A Webmail client written in PHP
- Total AccountAbility (TACC) - An ISP billing system written in C++ using the Qt library
- ACMS - A content management system written in PHP
- CWFC - A collection of PHP scripts and libraries that I've written and used to build the CCC and ACMS
- cistools - A C++ library that has mostly string manipulation functions, but also contains a MySQL wrapper for easily creating queries and walking through results
I'll talk about each of them more in future posts, and give histories. Hopefully within the next week.
Moving to MythTV Part 2 – The Frontend
So, in the last 3 weeks I've done a lot with MythTV. Most notably, I've finished the frontend system that is connected to our TV. There was a lot of trial and error, frustration and an entire system replacement, but it's now up and near perfect.
So, a week and a half ago Comcast got installed. I didn't have any of the capture cards yet, so I did more reading and checking and it turns out the Comcast boxes include firewire ports. Despite what the installers say, they're active on the boxes thanks to a nice FCC ruling a few years back. With active firewire ports, I wouldn't need to get the PVR-350. I connected everything, following the guides from the MythTV wiki, and it went pretty smoothly. Or so I thought. I did manage to get things recording, but not reliably. In part 3, I'll talk about my backend configuration since I'll be doing more with that this afternoon and I'm not done with it. With the current setup, I have been managing to record HD programs and SD programs via firewire. That brought me to the shortcomings on the frontend...
So after recording my first Daily Show in HD and trying to watch it, there was a lot of stuttering and artifacts on the front end. Turns out it was the nVidia GeForce 7300GT in the frontend I had been using. It wasn't quite up to the task of decoding HD content. So, a bit more reading and I found what appeared to be the perfect video card for it. I ordered from my favorite online store and it arrived a few days later. The card I ended up getting was a Zotac GeForce GT220. It seems that the GT2xx series are the first nVidia based cards to allow hardware accelerated decoding of MPEG4 as well as MPEG2 content. Additionally, the GT2xx series cards have a high quality scaler that's also in hardware.
Unfortunately, the frontend box was an old AGP based board. That means replacing the frontend with something a bit more modern. I managed to scrounge up a motherboard and CPU combination that would give Jessica an upgrade on her desktop machine and took her motherboard and used it for the frontend. I took her Athlon 64 3700 (2.2GHz) and replaced it with a dual core 3.4GHz Pentium D. After setting up the upgraded frontend hardware and connecting it to the TV, I still had issues with slowdowns. Xvid videos had a bar on the top 10% or so of the screen that shifted everything during scenes with motion, and MPEG2 streams from Comcast had a bit of stuttering and some blurring issues.
Turns out that I hadn't yet optimized the frontend box for the new hardware. By default, the Myth frontend uses the CPU only for decoding and displaying video content. So, I went into the configuration under Setup -> Settings -> TV Display and made a few adjustments there. I enabled the VDPAU "Normal" profile and also turned on a few more tuning parameters since I have the GT220 card:
I added "colorspace=0,vdpaubuffersize=32,vdpauhqscaling" to the Custom Filters section.
That may have been enough, but I also did some X optimizations while I was at it. I disabled the Composite extension in xorg.conf:
Section "Extensions"
Option "Composite" "Disable"
EndSection
And turned on Triple Buffering in the Device section:
Section "Device"
Identifier "Device0"
Driver "nvidia"
VendorName "NVIDIA Corporation"
BoardName "GeForce GT 220"
Option "TripleBuffer" "True"
EndSection
One other thing is that since I'm running an AMD CPU, I needed to adjust the automatic CPU frequency scaling as well. By default, it will drop down to 1GHz and lower the bus speed as well, which would slow down how fast the GT220 would get data. I added this to /etc/rc.local to tell it to only take the CPU down to 1.8GHz:
echo 1800000 > /sys/devices/system/cpu/cpu0/cpufreq/scaling_min_freq
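The line above only touches cpu0. On a multi-core frontend the same floor would need to be set per CPU, which could be sketched as a loop like the one below. The function name is made up, and the sysfs root is a parameter only so the loop can be tried against a scratch directory; on a real system it defaults to /sys and needs root.

```shell
# Set the minimum cpufreq frequency (in kHz) for every CPU that exposes
# a cpufreq directory. Writing under the real /sys requires root.
set_min_freq() {
    khz="$1"
    root="${2:-/sys}"
    for f in "$root"/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_min_freq; do
        # Skip the unexpanded pattern when no CPU exposes cpufreq.
        [ -e "$f" ] || continue
        echo "$khz" > "$f"
    done
}
```

In /etc/rc.local this would be called as set_min_freq 1800000.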
All of that may have been overkill, but let me tell you, everything is perfectly smooth now. No stutters, pauses or other glitches at all now. Fast forward and rewind are the smoothest they've ever been, rivaling our old TiVo's capability.
All in all, I'm extremely pleased with the frontend configuration. Today I will be migrating from CentOS to Ubuntu 9.10 for the backend system, and then likely migrating to Ubuntu 10.04 LTS in a few weeks.
After updating the backend, I'll post part 3 that will cover both firewire and Hauppauge setups. When I'm "finished" with the whole setup, I'll post detailed specs and relevant configuration files for X, and for LIRC with the TiVo peanut remote.
