At the Circus we just built our first Backblaze storage pod and I would like to take the time to document it. We rebuilt the server a number of times for testing and verification with different numbers of disks so output may differ throughout this post.
The cost per terabyte is right up our alley, as we are a non-profit hospital. We first tried to set ours up as a Windows server so it would have direct-attached storage, but changed direction and decided to make it a Linux-based iSCSI target.
Disk Mapping
The first problem is mapping out the port multiplier backplanes. If you follow this link it shows the way the pod is supposed to be built; however, our drives did not map out accordingly. We mapped ours by literally shutting down, pulling a disk and turning the server back on to see which device had disappeared. If you don’t take the time to do this, I feel for you when a disk dies and you try to figure out which one to replace.
[code]
Boot Drives.
sd 0:0:0:0: [sda]
sd 1:0:0:0: [sdb]
First row from right.
sd 7:0:0:0: [sdh]
sd 7:1:0:0: [sdi]
sd 7:2:0:0: [sdj]
sd 7:3:0:0: [sdk]
sd 7:4:0:0: [sdl]
sd 6:0:0:0: [sdc]
sd 6:1:0:0: [sdd]
sd 6:2:0:0: [sde]
sd 6:3:0:0: [sdf]
sd 6:4:0:0: [sdg]
sd 8:0:0:0: [sdm]
sd 8:1:0:0: [sdn]
sd 8:2:0:0: [sdo]
sd 8:3:0:0: [sdp]
sd 8:4:0:0: [sdq]
Second row from right.
sd 11:0:0:0: [sdw]
sd 11:1:0:0: [sdx]
sd 11:2:0:0: [sdy]
sd 11:3:0:0: [sdz]
sd 11:4:0:0: [sdaa]
sd 10:0:0:0: [sdr]
sd 10:1:0:0: [sds]
sd 10:2:0:0: [sdt]
sd 10:3:0:0: [sdu]
sd 10:4:0:0: [sdv]
sd 12:0:0:0: [sdab]
sd 12:1:0:0: [sdac]
sd 12:2:0:0: [sdad]
sd 12:3:0:0: [sdae]
sd 12:4:0:0: [sdaf]
Third row from right.
sd 14:0:0:0: [sdag]
sd 14:1:0:0: [sdah]
sd 14:2:0:0: [sdai]
sd 14:3:0:0: [sdaj]
sd 14:4:0:0: [sdak]
sd 15:0:0:0: [sdal]
sd 15:1:0:0: [sdam]
sd 15:2:0:0: [sdan]
sd 15:3:0:0: [sdao]
sd 15:4:0:0: [sdap]
sd 16:0:0:0: [sdaq]
sd 16:1:0:0: [sdar]
sd 16:2:0:0: [sdas]
sd 16:3:0:0: [sdat]
sd 16:4:0:0: [sdau]
[/code]
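Pulling disks one at a time is still the only way to tie a device name to a physical slot, but the kernel-log half of the table can be collected in one pass. Here is a minimal sketch, assuming the kernel log lines look like the `sd H:C:T:L: [sdX]` entries above; the function name `map_sd_devices` is made up for illustration.

```bash
#!/bin/sh
# Read kernel log lines such as "sd 7:1:0:0: [sdi] Attached SCSI disk" on
# stdin and print one "H:C:T:L device" pair per disk, ordered by SCSI address.
# Lines that do not carry a [sdX] device name are ignored.
map_sd_devices() {
    sed -n 's/.*sd \([0-9]*:[0-9]*:[0-9]*:[0-9]*\): \[\(sd[a-z]*\)\].*/\1 \2/p' |
        sort -u |
        sort -t: -k1,1n -k2,2n
}

# Typical use on the pod itself:
#   dmesg | map_sd_devices > mapping-before.txt
```

Saving the output before and after pulling a disk and diffing the two files shows which address went missing.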
Disk Setup
The next problem is that fdisk will not handle partitions larger than 2TB, so parted to the rescue. With forty-five 4TB disks in the server I did not want to partition them by hand. We had also tested the server under Windows, so the disks already had partitions on them; we had to remove the old partitions, then create a new one. Luckily you can script parted. Please note that parts of the script are commented out because we ran it multiple times for different setups.
[code lang="bash"]
for I in `dmesg | grep ^sd | cut -d ' ' -f 1,2,3 | grep -v Attach | sort -u | cut -d '[' -f 2 | cut -d ']' -f 1`; do echo "/dev/${I}"; done >> devices-list.txt
cat /usr/local/bin/parted-script.sh
#!/bin/sh
for i in `cat devices-list.txt`
do
# delete previous partitions
#parted $i --script -- rm 1
#parted $i --script -- rm 2
#parted $i --script -- rm 3
# create partition to take whole disk
parted $i --script -- mkpart primary ext4 1 -1
# set type lvm for jbod
# parted $i --script -- set 1 lvm on
# set type RAID for RAID 6.
parted $i --script -- set 1 raid on
parted $i --script print
done
[/code]
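The script above never creates a disk label, presumably because our disks already carried one from the Windows test. On blank disks a GPT label has to come first, since an MBR label tops out at 2TiB with 512-byte sectors. The sketch below only prints the parted commands it would run so they can be reviewed before anything is touched; `plan_partition` is a hypothetical helper name, and `mklabel gpt` wipes the existing partition table when it is actually executed.

```bash
#!/bin/sh
# Print (do not run) the parted commands for one whole-disk RAID partition.
# Assumption: the disk is blank or may be wiped, so a fresh GPT label is fine.
plan_partition() {
    dev=$1
    echo "parted --script $dev mklabel gpt"
    echo "parted --script $dev mkpart primary 1MiB 100%"
    echo "parted --script $dev set 1 raid on"
}

# Review the plan for every data disk, then pipe it to sh once it looks right:
#   for d in `cat devices-list.txt`; do plan_partition "$d"; done
```

Starting the partition at 1MiB keeps it aligned, and `100%` avoids the `-1` end value that needs the extra `--` to stop parted from reading it as an option.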
Create the RAID
The first time through we made all of the disks a JBOD to play, but long term that did not make sense. As a result I am only going to document creating a RAID 6 iSCSI target for Windows servers as this is the purpose of our storage pod.
I try not to do many tasks manually, so here is the workaround to avoid typing 45 disk names.
[code lang="bash"]
dmesg | grep ^sd | cut -d ' ' -f 1,2,3 | grep -v Attach | sort -u | cut -d '[' -f 2 | cut -d ']' -f 1 >> devices.txt
for I in `cat devices.txt`; do echo -n "/dev/${I}1 "; done > devices1.txt
[/code]
This creates a file with all of the disk names.
[code]
cat devices1.txt
/dev/sda1 /dev/sdc1 /dev/sdb1 /dev/sdd1 /dev/sde1 /dev/sdf1 /dev/sdg1 /dev/sdh1 /dev/sdi1 /dev/sdj1 /dev/sdk1 /dev/sdl1 /dev/sdm1 /dev/sdn1 /dev/sdo1 /dev/sdp1 /dev/sdq1 /dev/sdr1 /dev/sds1 /dev/sdt1 /dev/sdu1 /dev/sdv1 /dev/sdw1 /dev/sdx1 /dev/sdy1 /dev/sdz1 /dev/sdaa1 /dev/sdab1 /dev/sdac1 /dev/sdad1 /dev/sdae1 /dev/sdaf1
[/code]
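Note that the listing above includes /dev/sda1 and /dev/sdb1, which the disk mapping identifies as the boot drives. Those must never end up in an mdadm command line. A small sketch that filters them out while building the list, assuming the boot drives really are sda and sdb on your pod (`raid_members` is a name made up for this example):

```bash
#!/bin/sh
# Read bare device names (sda, sdc, ...) on stdin, drop the boot drives,
# and print the first partition of every remaining disk on a single line.
raid_members() {
    grep -v -x -e sda -e sdb |
        sed -e 's|^|/dev/|' -e 's|$|1|' |
        paste -s -d ' ' -
}

# Typical use:
#   raid_members < devices.txt > devices1.txt
```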
Create the different software RAID configurations. I created three RAID devices, md0, md1 and md2.
This mdadm command creates a RAID 6 container with 14 physical disks and one spare. We were being cautious with our data.
[code lang="bash"]
mdadm --create --verbose /dev/md1 --level=6 --chunk=512 --raid-devices=14 --spare-devices=1 /dev/sdr1 /dev/sds1 /dev/sdt1 /dev/sdu1 /dev/sdv1 /dev/sdw1 /dev/sdx1 /dev/sdy1 /dev/sdz1 /dev/sdaa1 /dev/sdab1 /dev/sdac1 /dev/sdad1 /dev/sdae1 /dev/sdaf1
[/code]
These mdadm commands create RAID 6 containers that use all 15 physical disks in each row. I used this configuration for testing the throughput later.
[code lang="bash"]
mdadm --create --verbose /dev/md0 --level=6 --chunk=512 --raid-devices=15 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1 /dev/sdh1 /dev/sdg1 /dev/sdi1 /dev/sdk1 /dev/sdj1 /dev/sdl1 /dev/sdm1 /dev/sdn1 /dev/sdo1 /dev/sdp1 /dev/sdq1
mdadm --create --verbose /dev/md1 --level=6 --chunk=512 --raid-devices=15 /dev/sdr1 /dev/sds1 /dev/sdt1 /dev/sdu1 /dev/sdv1 /dev/sdw1 /dev/sdx1 /dev/sdy1 /dev/sdz1 /dev/sdaa1 /dev/sdab1 /dev/sdad1 /dev/sdac1 /dev/sdae1 /dev/sdaf1
mdadm --create --verbose /dev/md2 --level=6 --chunk=512 --raid-devices=15 /dev/sdag1 /dev/sdah1 /dev/sdai1 /dev/sdaj1 /dev/sdak1 /dev/sdal1 /dev/sdam1 /dev/sdan1 /dev/sdao1 /dev/sdap1 /dev/sdaq1 /dev/sdar1 /dev/sdas1 /dev/sdat1 /dev/sdau1
[/code]
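RAID 6 spends two disks' worth of space on parity, so the usable size of an array is the member count minus two, times the per-disk size. A quick sketch of that arithmetic, using the 3907015168 one-kilobyte blocks per member that /proc/mdstat reports for our partitions (the function name is made up for illustration):

```bash
#!/bin/sh
# Usable capacity of a RAID 6 array in 1K blocks.
# $1 = number of active members (spares do not count), $2 = 1K blocks per member
raid6_usable_kib() {
    members=$1
    per_disk_kib=$2
    echo $(( (members - 2) * per_disk_kib ))
}

raid6_usable_kib 15 3907015168   # all 15 disks active
raid6_usable_kib 14 3907015168   # 14 active disks plus one spare
```

The 15-disk result, 50791197184, matches the block count mdstat shows for each array. The 14-plus-spare layout gives up one more disk of capacity in exchange for an automatic rebuild target.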
If you are truly just building an iSCSI target the next steps are pointless. I wanted to do a throughput test so I had to lay down a file system, but once again there were problems. The mke2fs that ships with Red Hat has a 16TB size limit, so you need to build a newer version of e2fsprogs.
[code lang="bash"]
git clone git://git.kernel.org/pub/scm/fs/ext2/e2fsprogs.git
cd e2fsprogs
mkdir build ; cd build/
../configure
make
make install
mke2fs -O 64bit,has_journal,extents,huge_file,flex_bg,uninit_bg,dir_nlink,extra_isize -i 4194304 /dev/md0
mke2fs 1.43-WIP (22-Sep-2012)
Warning: the fs_type huge is not defined in mke2fs.conf
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=0 blocks, Stripe width=0 blocks
11446336 inodes, 11721045504 blocks
586052275 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=13870563328
357698 block groups
32768 blocks per group, 32768 fragments per group
32 inodes per group
Superblock backups stored on blocks:
32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
102400000, 214990848, 512000000, 550731776, 644972544, 1934917632,
2560000000, 3855122432, 5804752896
Allocating group tables: done
Writing inode tables: done
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done
[/code]
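The 16TB ceiling comes from 32-bit block numbers: 2^32 blocks of 4096 bytes is exactly 16TiB, which is why the `64bit` feature is needed here. The `-i 4194304` option asks for one inode per 4MiB of space, which is fine for a volume that holds a small number of large backup files. Both numbers can be checked with shell arithmetic; the function names below are only for illustration.

```bash
#!/bin/sh
# Largest ext4 file system addressable with 32-bit block numbers, in TiB.
ext4_32bit_limit_tib() {
    block_size=$1                      # bytes per block, e.g. 4096
    echo $(( (1 << 32) * block_size / (1 << 40) ))
}

# Approximate inode count for a given "-i bytes-per-inode" setting.
# mke2fs rounds per block group, so the real figure differs slightly.
approx_inodes() {
    blocks=$1
    block_size=$2
    bytes_per_inode=$3
    echo $(( blocks * block_size / bytes_per_inode ))
}

ext4_32bit_limit_tib 4096
approx_inodes 11721045504 4096 4194304
```

The second call lands within a few inodes of the 11446336 that mke2fs reported above.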
Next, mount it and test.
[code lang="bash"]
mount -t ext4 /dev/md0 /backup0
mount
/dev/mapper/vg_leroy-lv_root on / type ext4 (rw)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
tmpfs on /dev/shm type tmpfs (rw)
/dev/mapper/ddf1_Rootp1 on /boot type ext4 (rw)
/dev/mapper/vg_leroy-lv_home on /home type ext4 (rw)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
/dev/md0 on /backup0 type ext4 (rw)
watch cat /proc/mdstat
Every 2.0s: cat /proc/mdstat Tue Nov 13 14:50:58 2012
md2 : active raid6 sdau1[14] sdat1[13] sdas1[12] sdar1[11] sdaq1[10] sdap1[9] sdao1[8] sdan1[7] sdam1[6] sdal1[5] sdak1[4] sdaj1[3] sdai1[2] sdah1[1] sdag1[0]
50791197184 blocks super 1.2 level 6, 512k chunk, algorithm 2 [15/15] [UUUUUUUUUUUUUUU]
[>....................] resync = 0.0% (72704/3907015168) finish=5372.5min speed=12117K/sec
md1 : active raid6 sdaf1[14] sdae1[13] sdac1[12] sdad1[11] sdab1[10] sdaa1[9] sdz1[8] sdy1[7] sdx1[6] sdw1[5] sdv1[4] sdu1[3] sdt1[2] sds1[1] sdr1[0]
50791197184 blocks super 1.2 level 6, 512k chunk, algorithm 2 [15/15] [UUUUUUUUUUUUUUU]
[>....................] resync = 0.0% (2583680/3907015168) finish=4776.4min speed=13623K/sec
md0 : active raid6 sdq1[14] sdp1[13] sdo1[12] sdn1[11] sdm1[10] sdl1[9] sdj1[8] sdk1[7] sdi1[6] sdg1[5] sdh1[4] sdf1[3] sde1[2] sdd1[1] sdc1[0]
50791197184 blocks super 1.2 level 6, 512k chunk, algorithm 2 [15/15] [UUUUUUUUUUUUUUU]
[>....................] resync = 0.0% (3255892/3907015168) finish=5886.7min speed=11052K/sec
[/code]
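The initial resync on arrays this size takes days (5886.7 minutes is a little over four days), so rather than leaving `watch` open it can be handy to pull out just the estimated finish time per array. A rough sketch that assumes the /proc/mdstat layout shown above; `resync_status` is a hypothetical helper name.

```bash
#!/bin/sh
# Print "<array> <estimated time to finish>" for every array that is
# currently resyncing or recovering, reading /proc/mdstat text on stdin.
resync_status() {
    awk '
        /^md[0-9]+ :/ { name = $1 }          # remember the current array
        /resync|recovery/ {
            for (i = 1; i <= NF; i++)
                if ($i ~ /^finish=/) {
                    sub(/^finish=/, "", $i)
                    print name, $i
                }
        }'
}

# Typical use:
#   resync_status < /proc/mdstat
```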
Finally you need to save the software raid configuration.
[code]
mdadm --detail --scan >> /etc/mdadm.conf
[/code]
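Because `>>` appends, running that command again after rebuilding an array leaves duplicate or stale ARRAY lines in /etc/mdadm.conf, and we rebuilt ours several times. One possible guard, sketched here with a made-up helper name, is to append only the lines that are not already in the file:

```bash
#!/bin/sh
# Append each line read from stdin to the file named in $1,
# skipping any line that is already present verbatim.
append_missing() {
    conf=$1
    while IFS= read -r line; do
        grep -qxF -- "$line" "$conf" 2>/dev/null ||
            printf '%s\n' "$line" >> "$conf"
    done
}

# Typical use:
#   mdadm --detail --scan | append_missing /etc/mdadm.conf
```

This does not remove entries for arrays that no longer exist, so the file is still worth a look by hand after a rebuild.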
Testing
I wanted to try a throughput test so I copied a CD over to the server. We just weren’t getting enough throughput with the reads and writes so I decided to create a ramdisk, read from it and write to the filesystem.
Create the ramdisk.
[code lang="bash"]
ls -alh /dev/ram*
mknod -m 660 /dev/ramdisk b 1 1
chown root.disk /dev/ramdisk
dd if=/dev/zero of=/dev/ramdisk bs=1k count=4194304
/sbin/mkfs -t ext2 -m 0 /dev/ramdisk 16384
mkdir /ramdisk
mount -t ext2 /dev/ramdisk /ramdisk
dd if=/dev/urandom of=/ramdisk/file.txt bs=1k count=15k
ls -alh /ramdisk/
[/code]
Now copy the 15MB file from the ramdisk 500,000 times. I ran this script for /backup0, /backup1 and /backup2.
[code lang="bash"]
for i in `seq 1 500000`; do cp /ramdisk/file.txt /backup0/test0-${i}; done
[/code]
And the test output. In one minute each array took on about 4GB (for example, /backup0 went from 78G to 82G), roughly 12GB across the three arrays.
[code lang="bash"]
date && df -h
Sat Nov 10 16:29:49 CST 2012
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/vg_leroy-lv_root
50G 3.2G 44G 7% /
tmpfs 3.9G 0 3.9G 0% /dev/shm
/dev/mapper/ddf1_Rootp1
485M 37M 423M 8% /boot
/dev/mapper/vg_leroy-lv_home
236G 188M 224G 1% /home
/dev/md0 48T 78G 45T 1% /backup0
/dev/md1 48T 84G 45T 1% /backup1
/dev/md2 48T 78G 45T 1% /backup2
/dev/ramdisk 16M 16M 302K 99% /ramdisk
date && df -h
Sat Nov 10 16:30:49 CST 2012
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/vg_leroy-lv_root
50G 3.2G 44G 7% /
tmpfs 3.9G 0 3.9G 0% /dev/shm
/dev/mapper/ddf1_Rootp1
485M 37M 423M 8% /boot
/dev/mapper/vg_leroy-lv_home
236G 188M 224G 1% /home
/dev/md0 48T 82G 45T 1% /backup0
/dev/md1 48T 88G 45T 1% /backup1
/dev/md2 48T 82G 45T 1% /backup2
/dev/ramdisk 16M 16M 302K 99% /ramdisk
[/code]
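The two df snapshots are 60 seconds apart, and each array's Used column grows by 4G in that time (78G to 82G on /backup0, for example). Turning that into a rate is simple arithmetic; keep in mind that `df -h` rounds to whole gigabytes, so this is only a ballpark figure. The function name is made up for this example.

```bash
#!/bin/sh
# Average write rate in MiB/s, given GiB written and elapsed seconds.
rate_mib_s() {
    gib_written=$1
    seconds=$2
    echo $(( gib_written * 1024 / seconds ))
}

rate_mib_s 4 60    # one array
rate_mib_s 12 60   # all three arrays together
```

That works out to about 68MiB/s per array and a little over 200MiB/s for the pod as a whole, while the arrays were still resyncing in the background.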
And the iostat output while it was writing.
[code]
avg-cpu: %user %nice %system %iowait %steal %idle
0.22 0.00 6.40 55.14 0.00 38.23
Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
sda 2.18 73.97 117.05 6527092 10328224
sdb 0.79 0.06 117.05 5536 10328224
sdc 87.69 882.70 82847.18 77890479 7310556800
sdd 86.17 715.18 82822.85 63108394 7308410360
sde 11.11 714.80 6402.75 63074967 564987674
sdf 4.92 714.24 135.85 63025858 11987866
sdh 5.16 714.32 134.02 63032209 11825930
sdg 5.10 714.65 135.50 63062197 11956714
sdi 4.98 714.30 133.72 63030809 11799450
sdk 4.92 714.45 133.54 63044265 11784026
sdj 4.95 714.24 133.88 63025249 11813618
sdl 5.02 714.18 134.05 63020313 11828514
sdm 5.08 714.06 133.96 63009609 11821122
sdn 5.00 714.15 133.74 63017213 11801082
sdo 4.97 714.16 133.85 63018757 11811058
sdp 4.62 714.38 130.34 63038351 11501106
sdq 4.59 714.04 128.39 63007996 11329122
sdr 4.82 784.45 45.74 69221236 4036394
sds 4.78 784.46 46.01 69221608 4060330
sdt 4.79 784.55 47.90 69229964 4226866
sdu 4.79 784.75 46.05 69247648 4063482
sdv 4.75 784.68 45.86 69241532 4046386
sdw 4.81 784.77 45.80 69249556 4041530
sdx 4.78 784.75 45.90 69247718 4050298
sdy 4.76 784.88 45.84 69259062 4045058
sdz 4.77 784.79 45.92 69251000 4052066
sdaa 4.75 784.69 45.99 69242304 4058370
sdab 4.47 784.89 45.81 69259548 4042074
sdad 4.40 784.83 45.80 69254484 4041794
sdac 4.32 784.92 47.64 69262304 4203442
sdae 4.23 784.84 45.64 69255316 4027730
sdaf 4.12 784.88 45.60 69258620 4024146
sdag 4.42 702.93 41.19 62027358 3634226
sdah 4.37 702.73 41.38 62009962 3651746
sdai 4.37 702.79 41.67 62015092 3677450
sdaj 4.35 702.87 41.50 62022040 3661962
sdak 4.35 703.19 41.26 62050556 3640690
sdal 4.37 703.53 40.96 62080556 3614570
sdam 4.34 703.60 40.85 62086828 3604554
sdan 4.33 703.42 41.02 62070532 3620082
sdao 4.34 703.41 41.22 62069532 3637226
sdap 4.32 703.41 41.15 62069548 3631570
sdaq 4.08 703.40 41.29 62069444 3643514
sdar 4.01 703.10 41.58 62042804 3669194
sdas 3.94 702.87 41.55 62021960 3666570
sdat 3.85 703.25 40.92 62055866 3611258
sdau 3.77 703.06 40.93 62039220 3611930
dm-0 16.60 73.91 117.05 6521508 10328224
dm-1 0.05 0.37 0.00 32976 168
dm-2 16.51 73.25 117.04 6463516 10328056
dm-3 5.79 72.21 32.65 6372058 2880648
dm-4 10.60 0.36 84.40 32112 7447384
dm-5 0.04 0.31 0.00 26938 24
md0 54.27 0.56 433.70 49578 38270384
md1 67.95 0.56 543.17 49602 47930328
md2 60.73 0.56 485.41 49594 42832904
[/code]
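The Blk_read/s and Blk_wrtn/s columns are in 512-byte blocks, so they need converting before they can be compared with the df numbers. Also note that the first report iostat prints is an average since boot; if the output above is that first report, it understates what the disks were doing during the test. A small conversion sketch (`blk_to_mib_s` is a name invented for this example):

```bash
#!/bin/sh
# Convert an iostat blocks-per-second figure (512-byte blocks) to MiB/s.
blk_to_mib_s() {
    awk -v blk="$1" 'BEGIN { printf "%.1f\n", blk * 512 / 1048576 }'
}

blk_to_mib_s 82847.18   # Blk_wrtn/s for sdc in the report above
```

Running `iostat` with an interval, for example `iostat 60 2`, and reading the second report gives the rate over that interval instead of the since-boot average.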
Create an iSCSI target
Once you create the iSCSI target and format the drive with a Windows file system, you lose any data that was on the array, including the test file system created earlier. Remember that with iSCSI you are presenting the target as a “physical” drive.
Install the iSCSI target utilities.
[code]
yum install scsi-target-utils
[/code]
The iSCSI configuration file.
[code]
cat /etc/tgt/targets.conf
default-driver iscsi
# Parameters below are only global. They can’t be configured per LUN.
# Only allow connections from 192.168.100.1 and 192.168.200.5
initiator-address 192.168.100.1
initiator-address 192.168.200.5
<target iqn.2012-11.org.eamc:leroy.target0>
backing-store /dev/md0
write-cache off
lun 11
</target>
<target iqn.2012-11.org.eamc:leroy.target1>
backing-store /dev/md1
write-cache off
lun 12
</target>
[/code]
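The file above only exports md0 and md1. If you want a target per array, the stanzas are repetitive enough to generate. This sketch simply reuses the directives from our file; the target2 name and LUN number in the usage line are hypothetical values chosen to follow the same pattern, not something we configured.

```bash
#!/bin/sh
# Print one <target> stanza in the same shape as the ones in targets.conf.
# $1 = target IQN, $2 = backing block device, $3 = LUN number
target_stanza() {
    iqn=$1
    dev=$2
    lun=$3
    printf '<target %s>\n' "$iqn"
    printf 'backing-store %s\n' "$dev"
    printf 'write-cache off\n'
    printf 'lun %s\n' "$lun"
    printf '</target>\n'
}

# Hypothetical third target, printed to the screen for review:
target_stanza iqn.2012-11.org.eamc:leroy.target2 /dev/md2 13
```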
Turn on tgtd.
[code]
chkconfig iptables off
chkconfig tgtd on
chkconfig --list tgtd
[/code]
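Turning iptables off entirely is the quick route. An alternative, if you would rather keep the firewall, is to allow only your initiators to reach TCP port 3260, the default iSCSI port. The sketch below only prints the rules so they can be reviewed first; the function name is made up, and the addresses in the usage line are the two initiators from the targets.conf above.

```bash
#!/bin/sh
# Print (do not run) one iptables rule per initiator address,
# allowing it to reach the iSCSI target on TCP port 3260.
iscsi_fw_rules() {
    for addr in "$@"; do
        echo "iptables -I INPUT -p tcp -s $addr --dport 3260 -j ACCEPT"
    done
}

iscsi_fw_rules 192.168.100.1 192.168.200.5
```

Also remember that chkconfig only changes what happens at boot, so the services keep their current state until they are started or stopped by hand or the server is rebooted.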
SMARTD
One of the guys on the team brought up that we should be doing some hard drive monitoring to make sure we knew if we were having trouble with a drive. As a result I installed smartmontools and configured the daemon to email when a drive starts to fail.
Install smartmontools.
[code]
yum install smartmontools
[/code]
Edit the configuration file so smartd sends email. The first time through, include -M test to make sure an email is actually sent.
[code]
cat /etc/smartd.conf
DEVICESCAN -a -I 194 -W 4,45,55 -R 5 -m jud@circus.org -M test
[/code]
Start the smartd daemon.
[code]
chkconfig smartd on
service smartd start
[/code]
Now go back and remove the -M test from the configuration file to make sure you don’t get emails every time the smartd daemon restarts. There are a number of configuration options, so read the /etc/smartd.conf file for a better understanding.
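Removing the test directive is a one-line edit, but it is easy to forget which part of the line to delete. A minimal sketch of a filter that strips only `-M test` and leaves the rest of the directive alone (the function name is made up for illustration):

```bash
#!/bin/sh
# Remove the "-M test" directive from smartd.conf lines read on stdin.
strip_mail_test() {
    sed 's/[[:space:]]*-M[[:space:]]*test//'
}

# Typical use, keeping a backup of the original file:
#   cp /etc/smartd.conf /etc/smartd.conf.bak
#   strip_mail_test < /etc/smartd.conf.bak > /etc/smartd.conf
#   service smartd restart
```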
Some random commands:
[code]
mdadm --stop /dev/md124
mdadm --remove /dev/md124
mdadm --query --detail /dev/md1
mdadm --detail-platform
mdadm --monitor
mdadm --explain /dev/md0
[/code]