Storage Pod

At the Circus we just built our first Backblaze storage pod, and I would like to take the time to document it. We rebuilt the server a number of times for testing and verification with different numbers of disks, so the output shown may differ throughout this post.

The cost per terabyte is right up our alley, as we are a non-profit hospital. We first tried to set ours up as a Windows server so it would have direct attached storage, but changed direction and decided to make it a Linux-based iSCSI target.

Disk Mapping
The first problem is mapping out the port multiplier backplanes. If you follow this link it shows the way the pod is supposed to be built; however, our drives did not map out accordingly. We took the time to map out our drives by literally shutting down, pulling a disk, and turning the server back on to find the layout (a less painful shortcut via sysfs is sketched after the mapping below). If you don't take the time to do this, I feel for you when a disk dies and you have to figure out how to replace it.

Boot Drives.
sd 0:0:0:0: [sda]
sd 1:0:0:0: [sdb]

First row from right.
sd 7:0:0:0: [sdh]
sd 7:1:0:0: [sdi]
sd 7:2:0:0: [sdj]
sd 7:3:0:0: [sdk]
sd 7:4:0:0: [sdl]

sd 6:0:0:0: [sdc]
sd 6:1:0:0: [sdd]
sd 6:2:0:0: [sde]
sd 6:3:0:0: [sdf]
sd 6:4:0:0: [sdg]

sd 8:0:0:0: [sdm]
sd 8:1:0:0: [sdn]
sd 8:2:0:0: [sdo]
sd 8:3:0:0: [sdp]
sd 8:4:0:0: [sdq]

Second row from right.
sd 11:0:0:0: [sdw]
sd 11:1:0:0: [sdx]
sd 11:2:0:0: [sdy]
sd 11:3:0:0: [sdz]
sd 11:4:0:0: [sdaa]

sd 10:0:0:0: [sdr]
sd 10:1:0:0: [sds]
sd 10:2:0:0: [sdt]
sd 10:3:0:0: [sdu]
sd 10:4:0:0: [sdv]

sd 12:0:0:0: [sdab]
sd 12:1:0:0: [sdac]
sd 12:2:0:0: [sdad]
sd 12:3:0:0: [sdae]
sd 12:4:0:0: [sdaf]

Third row from right.
sd 14:0:0:0: [sdag]
sd 14:1:0:0: [sdah]
sd 14:2:0:0: [sdai]
sd 14:3:0:0: [sdaj]
sd 14:4:0:0: [sdak]

sd 15:0:0:0: [sdal]
sd 15:1:0:0: [sdam]
sd 15:2:0:0: [sdan]
sd 15:3:0:0: [sdao]
sd 15:4:0:0: [sdap]

sd 16:0:0:0: [sdaq]
sd 16:1:0:0: [sdar]
sd 16:2:0:0: [sdas]
sd 16:3:0:0: [sdat]
sd 16:4:0:0: [sdau]
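
If you would rather not find the layout purely by pulling drives, most of the mapping can be recovered from sysfs, since every sdX device links back to its SCSI host:channel:target:lun address (lsscsi, if installed, prints the same information in one shot). You still need one physical pull per backplane to learn which host number is which, but not one per disk. A minimal sketch:

# print each disk with its host:channel:target:lun address
for d in /sys/block/sd*; do echo "${d##*/} -> $(basename $(readlink -f $d/device))"; done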

Disk Setup
The next problem is that fdisk will not handle partitions larger than 2TB; parted to the rescue. Because there were forty-five 4TB disks in the server, I did not want to partition them manually. The other problem was that we had also tested the server as a Windows server, so the disks already had partitions on them. As a result we had to remove the old partitions and then create new ones. Luckily, parted can be scripted. Please note that parts of the script are commented out because we ran it multiple times for different setups.

# build a list of every sd device the kernel has seen, one per line
# (check it and remove the boot drives before pointing parted at it)
for I in `dmesg|grep ^sd|cut -d \  -f 1,2,3|grep -v Attach |sort -u | cut -d [ -f 2 | cut -d ] -f 1 `; do echo /dev/${I}\ ; done >>devices-list.txt

cat /usr/local/bin/parted-script.sh 
#!/bin/sh
for i in `cat devices-list.txt`
do
# delete previous partitions
#parted $i --script -- rm 1
#parted $i --script -- rm 2
#parted $i --script -- rm 3

# create partition to take whole disk
parted $i --script -- mkpart primary ext4 1 -1

# set type lvm for jbod
# parted $i --script -- set 1 lvm on

# set type RAID for RAID 6.
parted $i --script -- set 1 raid on

parted $i --script print
done
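
One assumption worth noting: mkpart only works on a disk that already has a partition table. Ours did, left over from the Windows testing, but on factory-blank drives you would first need to write a GPT label (required anyway for partitions over 2TB). A line like this, added to the loop before the mkpart, would cover that case:

# only needed on disks with no existing partition table
parted $i --script -- mklabel gpt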

Create the RAID
The first time through we made all of the disks a JBOD to play, but long term that did not make sense. As a result I am only going to document creating a RAID 6 iSCSI target for Windows servers as this is the purpose of our storage pod.

I try not to do many tasks manually, so here is the workaround to avoid having to type 45 disk names.

dmesg|grep ^sd|cut -d \  -f 1,2,3|grep -v Attach |sort -u | cut -d [ -f 2 | cut -d ] -f 1 >>devices.txt
for I in `cat devices.txt`; do  echo -n /dev/${I}1\ ; done >devices1.txt

This creates a file with all of the partition names on a single line.

cat devices1.txt 
/dev/sda1 /dev/sdc1 /dev/sdb1 /dev/sdd1 /dev/sde1 /dev/sdf1 /dev/sdg1 /dev/sdh1 /dev/sdi1 /dev/sdj1 /dev/sdk1 /dev/sdl1 /dev/sdm1 /dev/sdn1 /dev/sdo1 /dev/sdp1 /dev/sdq1 /dev/sdr1 /dev/sds1 /dev/sdt1 /dev/sdu1 /dev/sdv1 /dev/sdw1 /dev/sdx1 /dev/sdy1 /dev/sdz1 /dev/sdaa1 /dev/sdab1 /dev/sdac1 /dev/sdad1 /dev/sdae1 /dev/sdaf1 

Create the different software RAID configurations. I created three RAID devices, md0, md1 and md2.

This mdadm command creates a RAID 6 array with 14 raid devices and one spare; we were being cautious with our data. (Keep in mind that mdadm expects the device list to contain --raid-devices plus --spare-devices entries.)

mdadm --create --verbose /dev/md1 --level=6 --chunk=512 --raid-devices=14 --spare-devices=1 /dev/sdr1 /dev/sds1 /dev/sdt1 /dev/sdu1 /dev/sdv1 /dev/sdw1 /dev/sdx1 /dev/sdy1 /dev/sdz1 /dev/sdaa1 /dev/sdab1

This mdadm command creates a RAID 6 array with all 15 physical disks; I used this configuration for the throughput testing later.


mdadm --create --verbose /dev/md0 --level=6 --chunk=512 --raid-devices=15 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1 /dev/sdh1 /dev/sdg1 /dev/sdi1 /dev/sdk1 /dev/sdj1 /dev/sdl1 /dev/sdm1 /dev/sdn1 /dev/sdo1 /dev/sdp1 /dev/sdq1 

mdadm --create --verbose /dev/md1 --level=6 --chunk=512 --raid-devices=15 /dev/sdr1 /dev/sds1 /dev/sdt1 /dev/sdu1 /dev/sdv1 /dev/sdw1 /dev/sdx1 /dev/sdy1 /dev/sdz1 /dev/sdaa1 /dev/sdab1 /dev/sdad1 /dev/sdac1 /dev/sdae1 /dev/sdaf1 

mdadm --create --verbose /dev/md2 --level=6 --chunk=512 --raid-devices=15 /dev/sdag1 /dev/sdah1 /dev/sdai1 /dev/sdaj1 /dev/sdak1 /dev/sdal1 /dev/sdam1 /dev/sdan1 /dev/sdao1 /dev/sdap1 /dev/sdaq1 /dev/sdar1 /dev/sdas1 /dev/sdat1 /dev/sdau1 

If you are truly just building an iSCSI target, the next steps are pointless. I wanted to do a throughput test, so I had to lay down a file system, but once again there were problems. The mke2fs that ships with Red Hat cannot create file systems larger than 16TB, so you need to build a newer version of e2fsprogs with 64-bit block support.

git clone git://git.kernel.org/pub/scm/fs/ext2/e2fsprogs.git
cd e2fsprogs
mkdir build ; cd build/
../configure
make
make install
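
The stock mke2fs is still sitting in /sbin, so it is worth a quick sanity check that the freshly built binary is the one your PATH finds before formatting (exact paths depend on the configure defaults):

which mke2fs
mke2fs -V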

mke2fs -O 64bit,has_journal,extents,huge_file,flex_bg,uninit_bg,dir_nlink,extra_isize -i 4194304 /dev/md0
mke2fs 1.43-WIP (22-Sep-2012)

Warning: the fs_type huge is not defined in mke2fs.conf

Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=0 blocks, Stripe width=0 blocks
11446336 inodes, 11721045504 blocks
586052275 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=13870563328
357698 block groups
32768 blocks per group, 32768 fragments per group
32 inodes per group
Superblock backups stored on blocks: 
	32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208, 
	4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968, 
	102400000, 214990848, 512000000, 550731776, 644972544, 1934917632, 
	2560000000, 3855122432, 5804752896

Allocating group tables: done                            
Writing inode tables: done                            
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done         

Next, mount it up and test.

mount -t ext4 /dev/md0 /backup0

mount
/dev/mapper/vg_leroy-lv_root on / type ext4 (rw)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
tmpfs on /dev/shm type tmpfs (rw)
/dev/mapper/ddf1_Rootp1 on /boot type ext4 (rw)
/dev/mapper/vg_leroy-lv_home on /home type ext4 (rw)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
/dev/md0 on /backup0 type ext4 (rw)

watch cat /proc/mdstat 

Every 2.0s: cat /proc/mdstat                                                                                                                                                              Tue Nov 13 14:50:58 2012
md2 : active raid6 sdau1[14] sdat1[13] sdas1[12] sdar1[11] sdaq1[10] sdap1[9] sdao1[8] sdan1[7] sdam1[6] sdal1[5] sdak1[4] sdaj1[3] sdai1[2] sdah1[1] sdag1[0]
      50791197184 blocks super 1.2 level 6, 512k chunk, algorithm 2 [15/15] [UUUUUUUUUUUUUUU]
      [>....................]  resync =  0.0% (72704/3907015168) finish=5372.5min speed=12117K/sec
      
md1 : active raid6 sdaf1[14] sdae1[13] sdac1[12] sdad1[11] sdab1[10] sdaa1[9] sdz1[8] sdy1[7] sdx1[6] sdw1[5] sdv1[4] sdu1[3] sdt1[2] sds1[1] sdr1[0]
      50791197184 blocks super 1.2 level 6, 512k chunk, algorithm 2 [15/15] [UUUUUUUUUUUUUUU]
      [>....................]  resync =  0.0% (2583680/3907015168) finish=4776.4min speed=13623K/sec
      
md0 : active raid6 sdq1[14] sdp1[13] sdo1[12] sdn1[11] sdm1[10] sdl1[9] sdj1[8] sdk1[7] sdi1[6] sdg1[5] sdh1[4] sdf1[3] sde1[2] sdd1[1] sdc1[0]
      50791197184 blocks super 1.2 level 6, 512k chunk, algorithm 2 [15/15] [UUUUUUUUUUUUUUU]
      [>....................]  resync =  0.0% (3255892/3907015168) finish=5886.7min speed=11052K/sec
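
At 11-13MB/s per array the initial resync was going to take the better part of four days. If you want it to finish sooner, at the cost of more load, the standard md throttles can be raised; a sketch, the values are just examples:

# raise the md resync speed limits (KiB/s)
sysctl -w dev.raid.speed_limit_min=50000
sysctl -w dev.raid.speed_limit_max=500000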

Finally, you need to save the software RAID configuration.

mdadm --detail --scan >> /etc/mdadm.conf
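
To confirm the saved configuration matches what is actually running, a quick check against the live arrays doesn't hurt:

cat /etc/mdadm.conf
mdadm --detail /dev/md0 /dev/md1 /dev/md2
cat /proc/mdstat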

Testing
I wanted to try a throughput test, so I copied a CD over to the server, but we just weren't getting enough read and write throughput that way. So I decided to create a ramdisk, read from it, and write to the filesystem instead.

Create the ramdisk.

# see what ram devices already exist
ls -alh /dev/ram*
# create a block device node for the ramdisk (major 1, minor 1)
mknod -m 660 /dev/ramdisk b 1 1
chown root.disk /dev/ramdisk
# zero it out; dd stops once the 16MB ramdisk is full
dd if=/dev/zero of=/dev/ramdisk bs=1k count=4194304
# put a small ext2 filesystem on it (16384 1K blocks, no reserved space)
/sbin/mkfs -t ext2 -m 0 /dev/ramdisk 16384
mkdir /ramdisk
mount -t ext2 /dev/ramdisk /ramdisk
# create a ~15MB file of random data to use as the copy source
dd if=/dev/urandom of=/ramdisk/file.txt bs=1k count=15k
ls -alh /ramdisk/

Now copy the 15MB file from the ramdisk 500,000 times. I ran this loop for /backup0, /backup1 and /backup2.

for i in `seq 1 500000`; do  cp /ramdisk/file.txt /backup0/test0-${i}; done

And the test output: in one minute we had written about 12GB, roughly 4GB to each of the three arrays (around 200MB/s aggregate).

date && df -h
Sat Nov 10 16:29:49 CST 2012
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/vg_leroy-lv_root
                       50G  3.2G   44G   7% /
tmpfs                 3.9G     0  3.9G   0% /dev/shm
/dev/mapper/ddf1_Rootp1
                      485M   37M  423M   8% /boot
/dev/mapper/vg_leroy-lv_home
                      236G  188M  224G   1% /home
/dev/md0               48T   78G   45T   1% /backup0
/dev/md1               48T   84G   45T   1% /backup1
/dev/md2               48T   78G   45T   1% /backup2
/dev/ramdisk           16M   16M  302K  99% /ramdisk


date && df -h
Sat Nov 10 16:30:49 CST 2012
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/vg_leroy-lv_root
                       50G  3.2G   44G   7% /
tmpfs                 3.9G     0  3.9G   0% /dev/shm
/dev/mapper/ddf1_Rootp1
                      485M   37M  423M   8% /boot
/dev/mapper/vg_leroy-lv_home
                      236G  188M  224G   1% /home
/dev/md0               48T   82G   45T   1% /backup0
/dev/md1               48T   88G   45T   1% /backup1
/dev/md2               48T   82G   45T   1% /backup2
/dev/ramdisk           16M   16M  302K  99% /ramdisk

And the iostat output while the test was writing.

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.22    0.00    6.40   55.14    0.00   38.23

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda               2.18        73.97       117.05    6527092   10328224
sdb               0.79         0.06       117.05       5536   10328224
sdc              87.69       882.70     82847.18   77890479 7310556800
sdd              86.17       715.18     82822.85   63108394 7308410360
sde              11.11       714.80      6402.75   63074967  564987674
sdf               4.92       714.24       135.85   63025858   11987866
sdh               5.16       714.32       134.02   63032209   11825930
sdg               5.10       714.65       135.50   63062197   11956714
sdi               4.98       714.30       133.72   63030809   11799450
sdk               4.92       714.45       133.54   63044265   11784026
sdj               4.95       714.24       133.88   63025249   11813618
sdl               5.02       714.18       134.05   63020313   11828514
sdm               5.08       714.06       133.96   63009609   11821122
sdn               5.00       714.15       133.74   63017213   11801082
sdo               4.97       714.16       133.85   63018757   11811058
sdp               4.62       714.38       130.34   63038351   11501106
sdq               4.59       714.04       128.39   63007996   11329122
sdr               4.82       784.45        45.74   69221236    4036394
sds               4.78       784.46        46.01   69221608    4060330
sdt               4.79       784.55        47.90   69229964    4226866
sdu               4.79       784.75        46.05   69247648    4063482
sdv               4.75       784.68        45.86   69241532    4046386
sdw               4.81       784.77        45.80   69249556    4041530
sdx               4.78       784.75        45.90   69247718    4050298
sdy               4.76       784.88        45.84   69259062    4045058
sdz               4.77       784.79        45.92   69251000    4052066
sdaa              4.75       784.69        45.99   69242304    4058370
sdab              4.47       784.89        45.81   69259548    4042074
sdad              4.40       784.83        45.80   69254484    4041794
sdac              4.32       784.92        47.64   69262304    4203442
sdae              4.23       784.84        45.64   69255316    4027730
sdaf              4.12       784.88        45.60   69258620    4024146
sdag              4.42       702.93        41.19   62027358    3634226
sdah              4.37       702.73        41.38   62009962    3651746
sdai              4.37       702.79        41.67   62015092    3677450
sdaj              4.35       702.87        41.50   62022040    3661962
sdak              4.35       703.19        41.26   62050556    3640690
sdal              4.37       703.53        40.96   62080556    3614570
sdam              4.34       703.60        40.85   62086828    3604554
sdan              4.33       703.42        41.02   62070532    3620082
sdao              4.34       703.41        41.22   62069532    3637226
sdap              4.32       703.41        41.15   62069548    3631570
sdaq              4.08       703.40        41.29   62069444    3643514
sdar              4.01       703.10        41.58   62042804    3669194
sdas              3.94       702.87        41.55   62021960    3666570
sdat              3.85       703.25        40.92   62055866    3611258
sdau              3.77       703.06        40.93   62039220    3611930
dm-0             16.60        73.91       117.05    6521508   10328224
dm-1              0.05         0.37         0.00      32976        168
dm-2             16.51        73.25       117.04    6463516   10328056
dm-3              5.79        72.21        32.65    6372058    2880648
dm-4             10.60         0.36        84.40      32112    7447384
dm-5              0.04         0.31         0.00      26938         24
md0              54.27         0.56       433.70      49578   38270384
md1              67.95         0.56       543.17      49602   47930328
md2              60.73         0.56       485.41      49594   42832904
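
If you just want a raw sequential-write number without the copy loop and the page cache in the way, a single dd with direct I/O is a simpler yardstick; a sketch, the file name is arbitrary:

# write 10GB straight to the array, bypassing the page cache
dd if=/dev/zero of=/backup0/ddtest.bin bs=1M count=10240 oflag=direct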

Create an iSCSI Target
Once you create the iSCSI target and format the drive with a Windows file system, you lose any data that was on the file system you created earlier. Remember, with iSCSI you are presenting the target as a "physical" drive.

Install the iSCSI target utilities.

yum install scsi-target-utils

The iSCSI configuration file.

cat /etc/tgt/targets.conf
default-driver iscsi

# Parameters below are only global. They can't be configured per LUN.
# Only allow connections from 192.168.100.1 and 192.168.200.5
initiator-address 192.168.100.1
initiator-address 192.168.200.5

<target iqn.2012-11.org.eamc:leroy.target0>
	backing-store /dev/md0	
	write-cache off
	lun 11
</target>
<target iqn.2012-11.org.eamc:leroy.target1>
	backing-store /dev/md1
	write-cache off
	lun 12
</target>

Turn off iptables and enable tgtd at boot.

chkconfig iptables off
chkconfig tgtd on
chkconfig tgtd --list
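
chkconfig only enables the service at boot, so start it now and make sure the targets and LUNs actually show up; tgt-admin ships with scsi-target-utils:

service tgtd start
# list the configured targets, LUNs and allowed initiator addresses
tgt-admin --show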

SMARTD
One of the guys on the team pointed out that we should be doing some hard-drive monitoring so we would know when a drive was having trouble. As a result, I installed smartmontools and configured the daemon to email us when a drive starts to fail.

Install smartmontools.

yum install smartmontools

Edit the configuration file to send email alerts; the first time through, leave -M test in place so a test email is sent and you can confirm delivery works.

cat /etc/smartd.conf
DEVICESCAN -a -I 194 -W 4,45,55 -R 5 -m jud@circus.org -M test

Start the smartd daemon.

chkconfig smartd on
service smartd start

Now go back and remove the -M test from the configuration file to make sure you don’t get emails every time the smartd daemon restarts. There are a number of configuration options, so read the /etc/smartd.conf file for a better understanding.
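
You can also poke at an individual drive by hand, which is handy once smartd does start complaining; for example, against one of the data disks from the mapping above:

# overall health, attributes and error log for one drive
smartctl -a /dev/sdc
# kick off a short self-test in the background
smartctl -t short /dev/sdc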

Some other mdadm commands that came in handy:

mdadm --stop /dev/md124           # stop a running array
mdadm --remove /dev/md124
mdadm --query --detail /dev/md1   # show the details of an array
mdadm --detail-platform           # show the platform's RAID capabilities
mdadm --monitor                   # watch arrays and report events
mdadm --examine /dev/md0          # look for md metadata on a device