At the Circus we just built our first Backblaze storage pod, and I would like to take the time to document it. We rebuilt the server a number of times for testing and verification with different numbers of disks, so the output may differ throughout this post.
The cost per terabyte is right up our alley since we are a non-profit hospital. We first tried to set ours up as a Windows server with direct attached storage, but we changed direction and decided to make it a Linux-based iSCSI target.
Disk Mapping
The first problem is mapping out the port multiplier backplanes. If you follow this link it shows the way the pod is supposed to be built; however, our drives did not map out accordingly. We took the time to map out our drives by literally shutting down, pulling a disk, and turning the server back on to find the layout. If you don't take the time to do this, I feel for you when a disk dies and you have to figure out which physical drive to pull.
Boot drives:
sd 0:0:0:0: [sda]
sd 1:0:0:0: [sdb]

First row from right:
sd 7:0:0:0: [sdh]  sd 7:1:0:0: [sdi]  sd 7:2:0:0: [sdj]  sd 7:3:0:0: [sdk]  sd 7:4:0:0: [sdl]
sd 6:0:0:0: [sdc]  sd 6:1:0:0: [sdd]  sd 6:2:0:0: [sde]  sd 6:3:0:0: [sdf]  sd 6:4:0:0: [sdg]
sd 8:0:0:0: [sdm]  sd 8:1:0:0: [sdn]  sd 8:2:0:0: [sdo]  sd 8:3:0:0: [sdp]  sd 8:4:0:0: [sdq]

Second row from right:
sd 11:0:0:0: [sdw]  sd 11:1:0:0: [sdx]  sd 11:2:0:0: [sdy]  sd 11:3:0:0: [sdz]  sd 11:4:0:0: [sdaa]
sd 10:0:0:0: [sdr]  sd 10:1:0:0: [sds]  sd 10:2:0:0: [sdt]  sd 10:3:0:0: [sdu]  sd 10:4:0:0: [sdv]
sd 12:0:0:0: [sdab]  sd 12:1:0:0: [sdac]  sd 12:2:0:0: [sdad]  sd 12:3:0:0: [sdae]  sd 12:4:0:0: [sdaf]

Third row from right:
sd 14:0:0:0: [sdag]  sd 14:1:0:0: [sdah]  sd 14:2:0:0: [sdai]  sd 14:3:0:0: [sdaj]  sd 14:4:0:0: [sdak]
sd 15:0:0:0: [sdal]  sd 15:1:0:0: [sdam]  sd 15:2:0:0: [sdan]  sd 15:3:0:0: [sdao]  sd 15:4:0:0: [sdap]
sd 16:0:0:0: [sdaq]  sd 16:1:0:0: [sdar]  sd 16:2:0:0: [sdas]  sd 16:3:0:0: [sdat]  sd 16:4:0:0: [sdau]
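If you would rather not power-cycle the pod for every bay, a gentler starting point is to record each drive's SCSI address and serial number up front so a failed disk can be matched to a bay later. This is only a sketch of the idea, not how we actually mapped ours, and it assumes lsscsi (yum install lsscsi) and hdparm are available:

# List every disk with its host:channel:target:lun address
lsscsi
# udev's persistent by-path links encode the controller and port for each drive
ls -l /dev/disk/by-path/ | grep -v part
# Print each whole-disk device followed by its serial number
for d in $(ls /dev/sd* | grep -v '[0-9]'); do
  echo "$d $(hdparm -I $d 2>/dev/null | grep 'Serial Number')"
done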
Disk Setup
The next problem is that fdisk will not handle partitions larger than 2TB; parted to the rescue. With forty-five 4TB disks in the server I did not want to partition them manually. The other problem was that we had also tested the server as a Windows server, so the disks already had partitions on them. As a result we had to remove the old partitions and then create a new one on each disk. Luckily, parted can be scripted. Please note that parts of the script are commented out because we ran the script multiple times for different setups.
for I in `dmesg | grep ^sd | cut -d ' ' -f 1,2,3 | grep -v Attach | sort -u | cut -d [ -f 2 | cut -d ] -f 1`; do echo /dev/${I}\ ; done >> devices-list.txt

cat /usr/local/bin/parted-script.sh
#!/bin/sh
for i in `cat devices-list.txt`
do
        # delete previous partitions
        #parted $i --script -- rm 1
        #parted $i --script -- rm 2
        #parted $i --script -- rm 3
        # create partition to take the whole disk
        parted $i --script -- mkpart primary ext4 1 -1
        # set type lvm for jbod
        # parted $i --script -- set 1 lvm on
        # set type RAID for RAID 6
        parted $i --script -- set 1 raid on
        parted $i --script print
done
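As a quick sanity check (my addition, not part of the original run), you can confirm that every disk ended up with a single raid-flagged partition before building the arrays:

# Each disk should report exactly one line containing the raid flag
for i in `cat devices-list.txt`; do
  echo -n "$i: "
  parted $i --script print | grep -c raid
done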
Create the RAID
The first time through we set all of the disks up as a JBOD to play with, but long term that did not make sense. As a result I am only going to document creating RAID 6 iSCSI targets for Windows servers, since that is the purpose of our storage pod.
I try not to do many tasks manually, so here is the workaround that saved me from typing 45 disk names by hand.
dmesg | grep ^sd | cut -d ' ' -f 1,2,3 | grep -v Attach | sort -u | cut -d [ -f 2 | cut -d ] -f 1 >> devices.txt

for I in `cat devices.txt`; do echo -n /dev/${I}1\ ; done > devices1.txt
This creates a file with all of the disk names.
cat devices1.txt
/dev/sda1 /dev/sdc1 /dev/sdb1 /dev/sdd1 /dev/sde1 /dev/sdf1 /dev/sdg1 /dev/sdh1 /dev/sdi1 /dev/sdj1 /dev/sdk1 /dev/sdl1 /dev/sdm1 /dev/sdn1 /dev/sdo1 /dev/sdp1 /dev/sdq1 /dev/sdr1 /dev/sds1 /dev/sdt1 /dev/sdu1 /dev/sdv1 /dev/sdw1 /dev/sdx1 /dev/sdy1 /dev/sdz1 /dev/sdaa1 /dev/sdab1 /dev/sdac1 /dev/sdad1 /dev/sdae1 /dev/sdaf1
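Since there should be one entry per data disk, a quick word count (a trivial check I am adding here, not something from the original run) catches any drives that dropped off a backplane:

# Expect one partition name per data disk -- 45 on a fully populated pod
wc -w devices1.txt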
Create the different software RAID configurations. I created three RAID devices, md0, md1 and md2.
This mdadm command creates a RAID 6 array with 14 active disks and one spare. We were being cautious with our data.
mdadm --create --verbose /dev/md1 --level=6 --chunk=512 --raid-devices=14 --spare-devices=1 /dev/sdr1 /dev/sds1 /dev/sdt1 /dev/sdu1 /dev/sdv1 /dev/sdw1 /dev/sdx1 /dev/sdy1 /dev/sdz1 /dev/sdaa1 /dev/sdab1 /dev/sdac1 /dev/sdad1 /dev/sdae1 /dev/sdaf1
These mdadm commands create RAID 6 arrays using all 15 physical disks in each row; I used this configuration for the throughput testing later.
mdadm --create --verbose /dev/md0 --level=6 --chunk=512 --raid-devices=15 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1 /dev/sdh1 /dev/sdg1 /dev/sdi1 /dev/sdk1 /dev/sdj1 /dev/sdl1 /dev/sdm1 /dev/sdn1 /dev/sdo1 /dev/sdp1 /dev/sdq1

mdadm --create --verbose /dev/md1 --level=6 --chunk=512 --raid-devices=15 /dev/sdr1 /dev/sds1 /dev/sdt1 /dev/sdu1 /dev/sdv1 /dev/sdw1 /dev/sdx1 /dev/sdy1 /dev/sdz1 /dev/sdaa1 /dev/sdab1 /dev/sdad1 /dev/sdac1 /dev/sdae1 /dev/sdaf1

mdadm --create --verbose /dev/md2 --level=6 --chunk=512 --raid-devices=15 /dev/sdag1 /dev/sdah1 /dev/sdai1 /dev/sdaj1 /dev/sdak1 /dev/sdal1 /dev/sdam1 /dev/sdan1 /dev/sdao1 /dev/sdap1 /dev/sdaq1 /dev/sdar1 /dev/sdas1 /dev/sdat1 /dev/sdau1
If you are truly just building an iSCSI target, the next steps are pointless. I wanted to run a throughput test, so I had to lay down a file system, but once again there were problems: the mke2fs that ships with RedHat has a 16TB file system size limit, so you need to build a newer version of e2fsprogs.
git clone git://git.kernel.org/pub/scm/fs/ext2/e2fsprogs.git
cd e2fsprogs
mkdir build ; cd build/
../configure
make
make install

mke2fs -O 64bit,has_journal,extents,huge_file,flex_bg,uninit_bg,dir_nlink,extra_isize -i 4194304 /dev/md0
mke2fs 1.43-WIP (22-Sep-2012)
Warning: the fs_type huge is not defined in mke2fs.conf
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=0 blocks, Stripe width=0 blocks
11446336 inodes, 11721045504 blocks
586052275 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=13870563328
357698 block groups
32768 blocks per group, 32768 fragments per group
32 inodes per group
Superblock backups stored on blocks:
        32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
        4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
        102400000, 214990848, 512000000, 550731776, 644972544, 1934917632,
        2560000000, 3855122432, 5804752896

Allocating group tables: done
Writing inode tables: done
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done
Next, mount it up and test.
mount -t ext4 /dev/md0 /backup0

mount
/dev/mapper/vg_leroy-lv_root on / type ext4 (rw)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
tmpfs on /dev/shm type tmpfs (rw)
/dev/mapper/ddf1_Rootp1 on /boot type ext4 (rw)
/dev/mapper/vg_leroy-lv_home on /home type ext4 (rw)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
/dev/md0 on /backup0 type ext4 (rw)

watch cat /proc/mdstat
Every 2.0s: cat /proc/mdstat                Tue Nov 13 14:50:58 2012

md2 : active raid6 sdau1[14] sdat1[13] sdas1[12] sdar1[11] sdaq1[10] sdap1[9] sdao1[8] sdan1[7] sdam1[6] sdal1[5] sdak1[4] sdaj1[3] sdai1[2] sdah1[1] sdag1[0]
      50791197184 blocks super 1.2 level 6, 512k chunk, algorithm 2 [15/15] [UUUUUUUUUUUUUUU]
      [>....................]  resync =  0.0% (72704/3907015168) finish=5372.5min speed=12117K/sec

md1 : active raid6 sdaf1[14] sdae1[13] sdac1[12] sdad1[11] sdab1[10] sdaa1[9] sdz1[8] sdy1[7] sdx1[6] sdw1[5] sdv1[4] sdu1[3] sdt1[2] sds1[1] sdr1[0]
      50791197184 blocks super 1.2 level 6, 512k chunk, algorithm 2 [15/15] [UUUUUUUUUUUUUUU]
      [>....................]  resync =  0.0% (2583680/3907015168) finish=4776.4min speed=13623K/sec

md0 : active raid6 sdq1[14] sdp1[13] sdo1[12] sdn1[11] sdm1[10] sdl1[9] sdj1[8] sdk1[7] sdi1[6] sdg1[5] sdh1[4] sdf1[3] sde1[2] sdd1[1] sdc1[0]
      50791197184 blocks super 1.2 level 6, 512k chunk, algorithm 2 [15/15] [UUUUUUUUUUUUUUU]
      [>....................]  resync =  0.0% (3255892/3907015168) finish=5886.7min speed=11052K/sec
Finally, save the software RAID configuration so the arrays reassemble with the same names at boot.
mdadm --detail --scan >> /etc/mdadm.conf
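For reference, the scan appends one ARRAY line per device to /etc/mdadm.conf. The lines look roughly like the following; the name and UUID values here are placeholders, not our real ones:

cat /etc/mdadm.conf
ARRAY /dev/md0 metadata=1.2 name=leroy:0 UUID=00000000:00000000:00000000:00000000
ARRAY /dev/md1 metadata=1.2 name=leroy:1 UUID=00000000:00000000:00000000:00000000
ARRAY /dev/md2 metadata=1.2 name=leroy:2 UUID=00000000:00000000:00000000:00000000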
Testing
I wanted to try a throughput test, so I started by copying a CD over to the server. We just weren't getting enough throughput on the reads and writes that way, so I decided to create a ramdisk and use it as a fast local source: read from the ramdisk, write to the file system.
Create the ramdisk.
ls -alh /dev/ram*
mknod -m 660 /dev/ramdisk b 1 1
chown root.disk /dev/ramdisk
dd if=/dev/zero of=/dev/ramdisk bs=1k count=4194304
/sbin/mkfs -t ext2 -m 0 /dev/ramdisk 16384
mkdir /ramdisk
mount -t ext2 /dev/ramdisk /ramdisk
dd if=/dev/urandom of=/ramdisk/file.txt bs=1k count=15k
ls -alh /ramdisk/
Now copy the 15MB file from the ramdisk 500,000 times. I ran this loop against /backup0, /backup1 and /backup2.
for i in `seq 1 500000`; do cp /ramdisk/file.txt /backup0/test0-${i}; done
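To drive all three arrays at the same time, one option (a sketch of the idea rather than the exact commands from our run) is to background one copy loop per file system:

# Run the same copy loop against all three arrays in parallel
for fs in /backup0 /backup1 /backup2; do
  ( for i in `seq 1 500000`; do cp /ramdisk/file.txt ${fs}/test-${i}; done ) &
done
wait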
And the test output: in one minute we had written roughly 12GB, about 4GB to each of the three arrays, which works out to something like 65-70MB/s per array.
date && df -h
Sat Nov 10 16:29:49 CST 2012
Filesystem                    Size  Used Avail Use% Mounted on
/dev/mapper/vg_leroy-lv_root   50G  3.2G   44G   7% /
tmpfs                         3.9G     0  3.9G   0% /dev/shm
/dev/mapper/ddf1_Rootp1       485M   37M  423M   8% /boot
/dev/mapper/vg_leroy-lv_home  236G  188M  224G   1% /home
/dev/md0                       48T   78G   45T   1% /backup0
/dev/md1                       48T   84G   45T   1% /backup1
/dev/md2                       48T   78G   45T   1% /backup2
/dev/ramdisk                   16M   16M  302K  99% /ramdisk

date && df -h
Sat Nov 10 16:30:49 CST 2012
Filesystem                    Size  Used Avail Use% Mounted on
/dev/mapper/vg_leroy-lv_root   50G  3.2G   44G   7% /
tmpfs                         3.9G     0  3.9G   0% /dev/shm
/dev/mapper/ddf1_Rootp1       485M   37M  423M   8% /boot
/dev/mapper/vg_leroy-lv_home  236G  188M  224G   1% /home
/dev/md0                       48T   82G   45T   1% /backup0
/dev/md1                       48T   88G   45T   1% /backup1
/dev/md2                       48T   82G   45T   1% /backup2
/dev/ramdisk                   16M   16M  302K  99% /ramdisk
And the iostat output while the test was writing.
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.22    0.00    6.40   55.14    0.00   38.23

Device:    tps   Blk_read/s   Blk_wrtn/s    Blk_read    Blk_wrtn
sda       2.18        73.97       117.05     6527092    10328224
sdb       0.79         0.06       117.05        5536    10328224
sdc      87.69       882.70     82847.18    77890479  7310556800
sdd      86.17       715.18     82822.85    63108394  7308410360
sde      11.11       714.80      6402.75    63074967   564987674
sdf       4.92       714.24       135.85    63025858    11987866
sdh       5.16       714.32       134.02    63032209    11825930
sdg       5.10       714.65       135.50    63062197    11956714
sdi       4.98       714.30       133.72    63030809    11799450
sdk       4.92       714.45       133.54    63044265    11784026
sdj       4.95       714.24       133.88    63025249    11813618
sdl       5.02       714.18       134.05    63020313    11828514
sdm       5.08       714.06       133.96    63009609    11821122
sdn       5.00       714.15       133.74    63017213    11801082
sdo       4.97       714.16       133.85    63018757    11811058
sdp       4.62       714.38       130.34    63038351    11501106
sdq       4.59       714.04       128.39    63007996    11329122
sdr       4.82       784.45        45.74    69221236     4036394
sds       4.78       784.46        46.01    69221608     4060330
sdt       4.79       784.55        47.90    69229964     4226866
sdu       4.79       784.75        46.05    69247648     4063482
sdv       4.75       784.68        45.86    69241532     4046386
sdw       4.81       784.77        45.80    69249556     4041530
sdx       4.78       784.75        45.90    69247718     4050298
sdy       4.76       784.88        45.84    69259062     4045058
sdz       4.77       784.79        45.92    69251000     4052066
sdaa      4.75       784.69        45.99    69242304     4058370
sdab      4.47       784.89        45.81    69259548     4042074
sdad      4.40       784.83        45.80    69254484     4041794
sdac      4.32       784.92        47.64    69262304     4203442
sdae      4.23       784.84        45.64    69255316     4027730
sdaf      4.12       784.88        45.60    69258620     4024146
sdag      4.42       702.93        41.19    62027358     3634226
sdah      4.37       702.73        41.38    62009962     3651746
sdai      4.37       702.79        41.67    62015092     3677450
sdaj      4.35       702.87        41.50    62022040     3661962
sdak      4.35       703.19        41.26    62050556     3640690
sdal      4.37       703.53        40.96    62080556     3614570
sdam      4.34       703.60        40.85    62086828     3604554
sdan      4.33       703.42        41.02    62070532     3620082
sdao      4.34       703.41        41.22    62069532     3637226
sdap      4.32       703.41        41.15    62069548     3631570
sdaq      4.08       703.40        41.29    62069444     3643514
sdar      4.01       703.10        41.58    62042804     3669194
sdas      3.94       702.87        41.55    62021960     3666570
sdat      3.85       703.25        40.92    62055866     3611258
sdau      3.77       703.06        40.93    62039220     3611930
dm-0     16.60        73.91       117.05     6521508    10328224
dm-1      0.05         0.37         0.00       32976         168
dm-2     16.51        73.25       117.04     6463516    10328056
dm-3      5.79        72.21        32.65     6372058     2880648
dm-4     10.60         0.36        84.40       32112     7447384
dm-5      0.04         0.31         0.00       26938          24
md0      54.27         0.56       433.70       49578    38270384
md1      67.95         0.56       543.17       49602    47930328
md2      60.73         0.56       485.41       49594    42832904
Create an iSCSI Target
Once you create the iSCSI target and format the drive with a Windows file system, you lose any data that was on the file system you created earlier. Remember, with iSCSI you are presenting the target as a "physical" drive.
Install the iSCSI target utilities.
yum install scsi-target-utils
The iSCSI configuration file.
cat /etc/tgt/targets.conf
default-driver iscsi

# Parameters below are only global. They can't be configured per LUN.
# Only allow connections from 192.168.100.1 and 192.168.200.5
initiator-address 192.168.100.1
initiator-address 192.168.200.5

<target iqn.2012-11.org.eamc:leroy.target0>
    backing-store /dev/md0
    write-cache off
    lun 11
</target>

<target iqn.2012-11.org.eamc:leroy.target1>
    backing-store /dev/md1
    write-cache off
    lun 12
</target>
Turn on tgtd.
chkconfig iptables off
chkconfig tgtd on
chkconfig --list tgtd
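After starting the daemon you can confirm that the targets and LUNs are actually being exported; a quick check with the tools that ship in scsi-target-utils:

# chkconfig only enables tgtd for the next boot, so start it now
service tgtd start
# Show every configured target, its LUNs and the allowed initiator addresses
tgtadm --lld iscsi --mode target --op show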
SMARTD
One of the guys on the team pointed out that we should be monitoring the hard drives so we know when one is starting to give us trouble. As a result I installed smartmontools and configured the daemon to email us when a drive starts to fail.
Install smartmontools.
yum install smartmontools
Edit the configuration file to send email, and the first time through leave the test directive in place to make sure an email actually gets sent.
cat /etc/smartd.conf
DEVICESCAN -a -I 194 -W 4,45,55 -R 5 -m jud@circus.org -M test
Start the smartd daemon.
chkconfig smartd on
service smartd start
Now go back and remove the -M test directive from the configuration file so you don't get an email every time the smartd daemon restarts. There are a number of configuration options, so read the /etc/smartd.conf file for a better understanding.
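If you want to spot-check a single drive by hand (a habit of mine, not part of the original setup), smartmontools also ships smartctl:

# Overall health verdict and the raw SMART attributes for one drive
smartctl -H /dev/sdc
smartctl -A /dev/sdc
# Kick off a short self-test; the result shows up later in smartctl -a output
smartctl -t short /dev/sdc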
Some random commands:
mdadm --stop /dev/md124
mdadm --remove /dev/md124
mdadm --query --detail /dev/md1
mdadm --detail-platform
mdadm --monitor --scan
# examine the md superblock on a member disk
mdadm --examine /dev/sdc1