RAID5 + LVM2 + recovery + resize HOWTO

I wanted to build a big fileserver that could recover from a disk crash. LVM2 with reiserfs partitions alone couldn’t do the trick for me. I had three 200Gb disks “united” under a logical volume, formatted with reiserfs, and I wanted to test what would happen if one disk “crashed”. So I faked a crash..I shut the machine down, unplugged one disk and rebooted. I managed to see the logical volume using the latest lvm2 sources and the latest version of the device mapper:

# lvm version
version LVM version: 2.00.24 (2004-09-16)
version Library version: 1.00.19-ioctl (2004-07-03)
version Driver version: 4.1.0

Unfortunately I had no luck reading the reiserfs partition. The superblock was corrupted and reiserfsck --rebuild-sb /device did not work… Salvation was impossible.
While googling for possible solutions I came across the wonderful idea of creating a software raid5 array out of the 3 disks and putting LVM2 on top of the raid. I would lose 1 disk’s worth of space…but I would gain the ability to recover after a disk failure and to add more disks if necessary.
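
To put a number on the “lose 1 disk” part: raid5 spends one member’s worth of space on parity, so the usable space is (number of disks - 1) x the size of the smallest disk. A quick sketch of that arithmetic in shell; the function name is made up for illustration and the sketch only handles the usual case of 3 or more disks:

```shell
#!/bin/sh
# Usable RAID5 capacity: one member's worth of space goes to parity.
# raid5_usable_gb is a made-up helper name, not part of any RAID tool.
# usage: raid5_usable_gb <number_of_disks> <size_of_smallest_disk_in_GB>
raid5_usable_gb() {
    disks=$1
    size_gb=$2
    if [ "$disks" -lt 3 ]; then
        echo "this sketch expects 3 or more disks" >&2
        return 1
    fi
    echo $(( (disks - 1) * size_gb ))
}

raid5_usable_gb 3 200   # the three 200Gb disks from this setup: prints 400
```

So the three 200Gb disks give roughly 400Gb to hand over to LVM, and every disk added later contributes its full size.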

Before we continue I must say that you really need to HAVE worked with raid and lvm before, so that some of the commands are familiar to you. This is NOT a step by step guide…but more like a draft of how things are done. I am not going to explain every little detail…man pages and google are always around if you have any questions.

Enough of this…let’s start.

  • Initialization
  • First of all let’s say that we got our 3 disks on /dev/hde, /dev/hdg, /dev/hdi
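    1) We create 1 partition on each one covering the total space using our favorite disk management software (fdisk, cfdisk, etc). (btw, the partitions should be the SAME SIZE: the array only uses as much of each member as the smallest one offers, so identical drives are the easy way.)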
    2) Then it’s time to create the /etc/raidtab file. Our contents should look like:

    raiddev /dev/md0
    raid-level 5
    nr-raid-disks 3
    nr-spare-disks 0
    persistent-superblock 1
    chunk-size 32
    parity-algorithm right-symmetric
    device /dev/hde1
    raid-disk 0
    device /dev/hdg1
    raid-disk 1
    device /dev/hdi1
    raid-disk 2

    3) Now let’s create our array:

    mkraid /dev/md0

    4) It’s time for LVM2 now…let’s edit the /etc/lvm/lvm.conf so that we add support for raid devices. My filter line looks like this:

    filter = [ "a|loop|", "a|/dev/md0|", "r|.*|" ]

    5) Start initializing the LVM:

    pvcreate /dev/md0 (you can issue a pvdisplay to see if all things are correct)
    vgcreate test /dev/md0 (you can issue a vgdisplay to see if all things are correct)

    6) Time to create a small logical volume just for testing:

    lvcreate -L15000 -nbig test

    (you can issue a lvdisplay to see if all things are correct)
    7) Now for something distro-specific. Init scripts “usually” start lvm before software raid. But in our case, when a reboot occurs, we want to a) start the raid and then b) start the lvm. I am using gentoo, and gentoo had these things the other way round…It first started the lvm and then the raid, which resulted in errors during the boot process. This is easily solved in gentoo by editing /etc/init.d/checkfs and moving the part about the LVM below the part about the software raid. The script is really easy to read so I don’t think anyone will have a problem with that…
    8) Let’s test what we’ve done so far…Let’s format that logical volume we’ve created with ext3.

    mke2fs -j /dev/test/big

    9) Make an entry inside your /etc/fstab to point to a place you want to mount that logical volume…and then issue a:

    mount /dev/test/big

    10) You are now ready to start copying data onto that volume…I’d suggest that you copy 5-10Gb out of the first 15Gb that we’ve created (remember that -L15000 ?).
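
A word on the filter line from step 4, since it’s the least obvious part of the setup. As far as I understand the lvm.conf rules, the patterns are tried in order, the first regex that matches the device path decides (a = accept, r = reject), and a path that matches nothing is accepted. So the filter accepts loop devices and /dev/md0 and rejects everything else, which keeps LVM away from the raw member disks. The sketch below only emulates that decision in plain shell; lvm_filter_verdict is a made-up name and LVM itself does not use it:

```shell
#!/bin/sh
# Rough emulation of how LVM2 reads the filter list: patterns are tried in
# order, the first regex that matches the device path decides (a = accept,
# r = reject), and a path matching nothing is accepted.
# lvm_filter_verdict is a made-up name for illustration only.
# usage: lvm_filter_verdict <device_path> <pattern>...
lvm_filter_verdict() {
    dev=$1
    shift
    for rule in "$@"; do
        action=${rule%%|*}          # leading "a" or "r"
        regex=${rule#?|}            # drop the action and the first "|"
        regex=${regex%|}            # drop the trailing "|"
        if printf '%s\n' "$dev" | grep -Eq -- "$regex"; then
            [ "$action" = "a" ] && echo accept || echo reject
            return 0
        fi
    done
    echo accept
}

for dev in /dev/md0 /dev/loop0 /dev/hde1; do
    printf '%s: ' "$dev"
    lvm_filter_verdict "$dev" 'a|loop|' 'a|/dev/md0|' 'r|.*|'
done
```

With the three patterns from step 4 this reports accept for /dev/md0 and /dev/loop0 and reject for /dev/hde1.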

  • Now it’s time to simulate a crash! 🙂
  • 11) We first unmount the volume, deactivate the logical volume (lvchange -a n /dev/test/big) and then stop the raid device:

    raidstop /dev/md0

    12) Let’s destroy one disk. Open up your favorite disk management tool again and pick one disk to destroy…let’s say /dev/hdi. Delete the partition it already has…and create a new one. All previous data is now lost!
    13) If you want to make sure that you are on the right path of destroying everything…reboot your machine. Upon reboot you should get errors on the software raid and on the LVM not being able to activate the volume group “test”.
    14) Upon the root prompt issue:

    raidstart /dev/md0

    and then do a: cat /proc/mdstat
    You should probably see something similar to this:

    cat /proc/mdstat
    Personalities : [linear] [raid0] [raid1] [raid5] [multipath]
    md0 : active raid5 hdi1[2] hdg1[1] hde1[0]
    390716672 blocks level 5, 32k chunk, algorithm 3 [3/3] [UUU]
    [========>…………] resync = 43.9% (85854144/195358336) finish=115.9min speed=15722K/sec

    15) When that is finished, raid5 has rebuilt the array after the “faulty” disk that we created and the placement of the “new” drive (both the destruction and the new disk placement were done in step 12).
    16) Issue: vgscan
    It will find the volume group again. If the group does not come up active on its own, vgchange -a y test activates it.
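
If you’d rather not stare at cat /proc/mdstat for two hours during step 14, the percentage can be pulled out of the progress line with a bit of sed. This is just a sketch: mdstat_progress is a made-up helper, and it assumes the progress line looks like the one shown above (“resync = 43.9%” or the same with “recovery”).

```shell
#!/bin/sh
# Pull the rebuild percentage out of /proc/mdstat-style text on stdin.
# Prints nothing when no resync/recovery line is present (array is idle).
# mdstat_progress is a made-up helper name.
mdstat_progress() {
    sed -E -n 's/.*(resync|recovery) *= *([0-9.]+)%.*/\2/p'
}

# Example with the progress line from the output above (on a live system
# you would run: mdstat_progress < /proc/mdstat).
sample='[========>............] resync = 43.9% (85854144/195358336) finish=115.9min speed=15722K/sec'
printf '%s\n' "$sample" | mdstat_progress   # prints 43.9
```

An empty result means there is no rebuild running, which is the sign that it’s safe to go on with vgscan.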

  • Resizing the Logical Volume
  • 17) Say that you need more space on that logical volume you created…15Gb is not that much after all…

    lvextend -L100G /dev/test/big

    We’ve now turned that previous 15Gb logical volume into a 100Gb one…already feels much better…doesn’t it ?
    18) But that’s not all, we now need to extend the ext3 partition to cover all that “new space”

    e2fsck -f /dev/test/big ; resize2fs /dev/test/big

    We first check that the partition is ok…and then resize it to the full extent of the logical volume.
    19) We are set! We just need to mount our new partition…and we now have 100Gb of space! You can now extend that even further or create more logical volumes to satisfy your needs.
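
One thing that’s easy to trip over in step 17: a bare number after -L is the NEW size of the volume, while a leading + (or - with lvreduce) is relative to the current size. The sketch below just does that arithmetic in whole megabytes; lv_new_size_mb is a made-up helper and not an LVM command, and it assumes 100G means 102400 megabytes (1024-based units).

```shell
#!/bin/sh
# What size does an LV end up with for a given -L argument?
# A bare number is the new absolute size; a leading + or - is relative to
# the current size. Sizes here are whole megabytes to keep the math simple.
# lv_new_size_mb is a made-up helper, not an LVM command.
# usage: lv_new_size_mb <current_mb> <size_arg_in_mb, e.g. 102400, +5000, -5000>
lv_new_size_mb() {
    current=$1
    arg=$2
    case $arg in
        +*) echo $(( current + ${arg#+} )) ;;
        -*) echo $(( current - ${arg#-} )) ;;
        *)  echo "$arg" ;;
    esac
}

lv_new_size_mb 15000 102400    # like lvextend -L100G: prints 102400
lv_new_size_mb 15000 +102400   # like lvextend -L+100G: prints 117400
```

So lvextend -L100G on the 15000Mb test volume gives a 100Gb volume, not a 115Gb one.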

  • Extend the raid5 array
  • This section is to come in a few days…stay tuned.

    I hope that all the above helped you to create a better and more secure fileserver. Comments are much appreciated.

    Igniting The Web (or the budget)

    There has been an ongoing campaign to spread Firefox and have as many users as possible install Firefox 1.0PR. In my truly humble opinion this is totally wrong and I can’t understand why the people at Mozilla.org are so anxious about this campaign.
    First of all I have to admit that I have been a firefox fan for more than 8 months (I can still remember installing and testing it for the first time, somewhere around version 0.6). I am very happy with this project..it serves me well and I have been doing my best to “spread the word” for a long time now. I like its stability, even though it still has severe crashing problems with Acrobat 6 and Windows Media files, plus several other minor or major problems. I like the way extensions can be used…and of course the interface.
    I have the Netscape example in mind, which is related in many ways to the mozilla engine..and mozilla developers/managers should know very well what happened when netscape needed to be updated every 10-15 days. It strikes me as pretty bad to see the same scenario come out of netscape’s grave and apply to the firefox “promotion” campaign. Netscape was by far superior to IE at the time…but it kept releasing products with errors and people had to download it from scratch…15Mb every 15 days…There were no patches for netscape like the ones MS made for IE. Downloading 300kb every week is very different from downloading 15Mb every 15 days…even though IE’s probs were bigger…who could understand that ? Have you ever read what the windows updates “fix” ? Most of them cannot be understood by the computer illiterate users who are the vast majority of the internet.
    I can’t see the point in promoting a PreRelease to as many people as you can, when even you as a developer know that this is a PreRelease and it will have problems (as previous versions did). Firefox’s “strong” point is that it doesn’t have as many security problems as IE. What happens though if you promote firefox as a flawless browser that’s here to replace IE…and then suddenly..as more and more people start messing around with it, a big security hole is found? Then people will surely go back to IE because their “dreams” of secure surfing were crushed by the only one left to magically make that dream come true.
    MS has chosen not to release another browser until the new windows…that’s about 1-2 years ahead. What’s the rush to create a rumour about a “perfect browser” NOW…and spread it NOW…when you know as a developer that it DOES have probs ? (else it would be called Firefox 1.0, no PR after it)
    What will you say to all those poor 56K modem users that will have to download another 4.5Mb in 10-15 days ? You go from being their beloved one..to the one they hate the most…cause at least MS “accepts” that their product has flaws..and the users have accepted that fact too. And they keep downloading MS’s patches whatever happens (cause it’s on windows automatic updates), but who would download one patch after another (or rather one release after another, because firefox does not provide patches…just like netscape) from an “unknown” company, accept this fact and keep doing it for as long as it takes ? For computer and Internet illiterates…mozilla is an unknown company…it’s certainly not MS…and I don’t think that firefox managers would like to target only the computer literate..cause that would be devastating for their finances.

    I am really happy that in 5 days firefox has passed 1.300.000 downloads but I am really anxious to see what will happen if a major error appears. What will all those users say? In my opinion Firefox 1.0PR stands for 1.0 Public Relations, and some people rushed a LOT to get this product on the market…and to raise their budgets. The PR team of firefox looks pretty bad to me…

    Looks like the netscape example didn’t teach them anything…

    Me, Myself and my bad luck

    Yesterday morning I did a chkdsk on my pc at home just to make sure all was ok…just in case. What I saw made me furious! My boot disk had 1 bad sector!!! The disk was just 1 year old…a Western Digital 120GB JB model. What’s going on with Western Digital ? They keep making one crappy disk after the other. More than 4 of the 200Gb JB models I have owned over the past 1-2 years have crashed. Now this 120GB…I was more than cautious with this machine because it’s the machine I use for project development. It’s on a UPS…it has separate hard disk fans…so what went wrong ? I don’t get it….
    Luckily I had a spare 120GB disk without any data on it. I booted my slackware and did a dd_rescue from the disk with the bad sector to the other. It took around 10 hours, maybe more…but now I am working on the “spare” disk (which I have double checked for bad sectors) and all looks fine.

    I am not going to buy another Western Digital disk…I am not even sure I want to send the disk back to the company for a replacement or a refund (the warranty is for 3 years…so I can send it back anytime I want). I think I might sell the replacement they send me to someone while it’s still in the company’s packaging and buy a disk from another brand. But what should I buy ? Maxtor disks get toooo hot when used for a long time (I have hard disk fans…but how secure would you feel ?), Seagate disks are rumored to be very good (as far as stability is concerned) but they are a lot slower than all the others. I am willing to hear any suggestions in the comments…

    Waiting for the next problem to appear.

    LVM Beauty

    I started using lvm today…it simply rocks. Nothing more nothing less. I won’t say much…just take a look at this…and you decide if it’s worth it…
    The following actions were performed WITHOUT unmounting ANY disks…

    # df -HT /dev/koko/test
    Filesystem Type Size Used Avail Use% Mounted on
    /dev/koko/test
    reiserfs 200G 7.2G 193G 4% /lvm

    # resize_reiserfs -s -100G /dev/koko/test (reducing the partition by 100Gb)

    # lvreduce -L100G /dev/koko/test (reducing the logical volume by 100Gb)

    # df -HT
    Filesystem Type Size Used Avail Use% Mounted on
    /dev/koko/test
    reiserfs 93G 7.2G 86G 8% /lvm
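
The order of the two commands above is what matters when shrinking: the filesystem goes first, and the logical volume must never end up smaller than the filesystem sitting on it. A tiny sanity check in shell for that rule, sizes in Gb; safe_to_lvreduce is a made-up helper, not an LVM tool:

```shell
#!/bin/sh
# Shrinking is the dangerous direction: the filesystem has to be made
# smaller first, and the logical volume must never end up smaller than it.
# safe_to_lvreduce is a made-up sanity check, sizes in GB.
# usage: safe_to_lvreduce <filesystem_size_gb_after_shrink> <target_lv_size_gb>
safe_to_lvreduce() {
    fs_gb=$1
    lv_gb=$2
    if [ "$fs_gb" -le "$lv_gb" ]; then
        echo "ok: ${fs_gb}G filesystem fits in ${lv_gb}G volume"
    else
        echo "stop: ${fs_gb}G filesystem would be cut off at ${lv_gb}G" >&2
        return 1
    fi
}

safe_to_lvreduce 100 100   # 200G shrunk by 100G, then lvreduce -L100G
```

Running lvreduce before resize_reiserfs would be the “stop” case: a 200G filesystem on a 100G volume.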

    LVM made my day 😉

    New voice chat for our wireless network

    Some of you might know that I am part of the local wireless network in the city I live in. Until today we used Ventrilo for our voice chats. The problem with ventrilo is that the unregistered version only accepts 8 simultaneous connections, and we are more than 10 and have lots of friends joining our chats from all over the world (arizona, london, germany, etc, etc). We tried to contact ventrilo to register it…but they don’t accept any further registrations (!?!?!).
    quote from their site:

    Update: At the current time we are not accepting any new license applications. Any other licensing issues should be directed to the sales@ventrilo.com account.

    I had set up TeamSpeak but we never used it because we had some problems. We had not managed to correctly set up the NAT (+ port forwarding) on the dsl (that connects us to the Internet) so that people coming from the Internet could join our conversations. I just did that 5mins ago 🙂
    For some unknown reason, whenever I used the default listening ports I saw no packets (using ethereal) coming from the dsl to the box inside the wireless…the ISP probably has some kind of firewall on that port…who knows why though…so I changed the port the Teamspeak server listens on (which wasn’t that easy because I had to “guess” that I can only change that through the webinterface and not through the .ini lying inside the dir the program runs from….d0000000000h). At least now it’s fully working!

    happy voice-chatting for us!
    YEAH!

    There goes my uptime…

    After 36 days and for no apparent reason my main machine at home running windows XP rebooted while I was taking a nap.
    The “infamous” event viewer shows nothing as usual and the pc is on a UPS. No rational explanation…apart from: “Hey man! it’s Windows…what did you expect?”. I expect my OS to tell me what happens to my machine…what service freaked out SO much that it made the machine reboot.
    Maybe the machine found a way to say: “Make me gentoo…plz!”

    I’ll satisfy its needs when my exams are over at the end of the month. After all, the only reason I wanted that machine on all day and night was to get past the 27 day barrier that a friend claimed XP couldn’t survive…I reached 36…but still, there’s a big: WHY? WHY? WHY ?

    Qmail + vpopmail + procmail + spamassassin

    You might think that’s crazy…but yes, it is possible. I have a qmail lazydog installation that has vpopmail built in. But no mailer is complete these days unless it features antispam and antivirus protection. So I thought I should implement spamassassin + clamav. I won’t show how to set up spamd or clamd, only how to process and deliver mails to users.
    How it works:
    Inside each domain in vpopmail there’s a .qmail-default file that probably contains something like this:

    | /home/vpopmail/bin/vdelivermail '' /home/vpopmail/domains/DOMAINNAME/postmaster

    But we want to use procmail, so we make it like this:

    | preline /usr/bin/procmail -p -m /home/vpopmail/etc/procmailrc

    My procmailrc file:

    # qmail Lazydog procmailrc file
    SHELL="/bin/bash"
    VHOME=`/home/vpopmail/bin/vuserinfo -d $EXT@$HOST`
    VERBOSE="no"

    # Make sure that we have a .Spam and .Virus folder to sort spam and viruses into.
    # This will create directories under the ~vpopmail/domains///Maildir
    # directory. This directory will be created as soon as the user
    # receives any mail. It simply creates the .Spam and .Virus directories,
    # as well as subscribes them to courier-imap
    :0wic
    * ? test ! -d $VHOME/Maildir/.Spam
    |( /var/qmail/bin/maildirmake $VHOME/Maildir/.Spam ; /bin/echo "INBOX/Spam" >> $VHOME/Maildir/.bincimap-subscribed )
    :0wic
    * ? test ! -d $VHOME/Maildir/.Virus
    |( /var/qmail/bin/maildirmake $VHOME/Maildir/.Virus ; /bin/echo "INBOX/Virus" >> $VHOME/Maildir/.bincimap-subscribed )

    # Run Anti-Virus and Anti-spam tests.
    :0fw
    | /var/qmail/bin/scanmail.sh

    :0:
    * ^X-Virus-Status: INFECTED
    $VHOME/Maildir/.Virus/

    # Sort anything marked as SPAM into the users Maildir/.Spam/
    :0:
    * ^X-Spam-Status: YES
    $VHOME/Maildir/.Spam/

    # Everything else goes to the users default Maildir/
    #:0:
    #*
    #$VHOME/Maildir/
    :0w
    | /home/vpopmail/bin/vdelivermail '' bounce-no-mailbox

    Notice the last 2 lines: they make procmail hand the mail back to vpopmail so that any quotas or other options are applied. Take a look at the Spam and Virus folders that are created inside everyone’s account. The scanmail.sh that is referred to inside procmailrc is provided by the lazydog package. You can configure it as you want..and it has a lot of options for how viruses and spam are treated.

    Have fun with your secure, spam-free mail….you do use smtp auth and ssl patches for your smtp+imap…don’t you ?
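
To make the sorting order of that procmailrc explicit: the virus header is checked first, then the spam header, and only what’s left goes on to vdelivermail. The sketch below mimics just that decision in plain shell so it can be tried on a test message without touching a live mailer. classify_mail is a made-up helper, and the “Yes, score=9.1” header value is only an assumed example of what the scanner writes:

```shell
#!/bin/sh
# Mimic the sorting decision of the procmailrc above: look only at the
# headers (everything before the first empty line) and pick a destination.
# Like procmail, the match is case-insensitive. classify_mail is a
# made-up helper for illustration; the real delivery is done by procmail.
classify_mail() {
    headers=$(sed '/^$/q')
    if printf '%s\n' "$headers" | grep -qi '^X-Virus-Status: INFECTED'; then
        echo ".Virus"
    elif printf '%s\n' "$headers" | grep -qi '^X-Spam-Status: YES'; then
        echo ".Spam"
    else
        echo "vdelivermail"
    fi
}

# Assumed sample message; the header value is only an example.
printf 'From: a@example.com\nX-Spam-Status: Yes, score=9.1\n\nbuy now\n' | classify_mail
```

A message carrying both headers ends up in .Virus, because that recipe comes first.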

    Something was wrong…

    The blog had a prob and I couldn’t log in. I really don’t know what ‘exactly’ was wrong…but I created a second database and started moving the data from the old db to the new one bit by bit. Export – Import…Export – Import. Finally I got it working…
    Then I exported both the working db and the non-working one, diff-ed them and saw that for some strange reason the last post I had made on the old db had ended up between two others…like this:
    (49, 1, '2004-08-28 17:10:26', '2004-08-28 14:10:26',
    (57, 1, '2004-09-05 01:03:37', '2004-09-04 22:03:37',
    (53, 1, '2004-08-31 03:46:59', '2004-08-31 00:46:59'

    Everything looks ok now…let’s hope it stays that way 🙂

    Bad Routing HOW-TO

    I recently bought a Linksys WRT54GS as an AP, but until I place it on the roof I use it for testing. What I had done, and it was absolutely wrong, was this: in my configuration I had 2 pcs behind the switch ports (of the Linksys) and the whole Linksys machine connected to another switch of mine. On that second switch my wireless client and another 2 pcs are connected (already confused ? 🙂 ). What I had been testing were the routing capabilities of the linksys. By my own mistake I had set up the linksys to route all traffic (LAN & WAN instead of just WAN) through a gateway far away on the wireless network. With that setup I had perfect pings (1ms) to machines behind the (linksys) switch ports but lousy pings (10-20ms) to the linksys switch itself. I couldn’t figure out what was wrong until I did a ping -r to the switch IP…what I saw was that the packets were going from my pc to the switch…then to the gateway on the wireless network and back to my pc…
    just try to imagine this:
    PC1–switch(1)–WRT–switch(1)–my wrls client–AP(1)–remote wrls client–gateway router–remote wrls client–AP(1)–my wrls client–switch(1)–WRT–PC1

    Nice heh ?