Background

I’ve been fascinated by high-volume, fault-tolerant data storage systems for a long time. I started my data storage setup in earnest with a 4-disk RAID 5 array on an ARC-1210 controller installed in my daily-driver Windows desktop. When that array inevitably filled up, I added another array on a second ARC-1210. That slowly filled up, too, and I knew I couldn’t just keep stuffing RAID cards in my desktop; I had to build a serious file storage server eventually.

I considered many different storage options and configurations, including a large hardware-controlled RAID system on a Windows or Linux environment, a software-controlled array in a Windows-based server, a Drobo-type “keep it simple, stupid” system, and continuing to simply add more drives to my desktop computer. None of these options seemed to address all my requirements very well. I eventually stumbled upon FreeNAS, a FreeBSD-based, network-storage-oriented operating system that uses the ZFS file system and provides a web interface for configuring network sharing and other settings. While most of the setup and system management is done through this web interface, you can extend the machine’s capabilities quite a bit through the terminal via SSH. In this article, I’ll go over my hardware selections, the build and configuration process, some of the other applications I have running on the machine, and a bit of theory about how ZFS allocates array storage space and how it can be tuned to reduce allocation overhead.

I want to make a quick note before diving into this excessively long article. I started writing the first few sections intending to create a detailed build log for my server. I took pictures and documented every step as well as I could; I had been planning this server over the span of several years, so I was understandably excited to get into it. As the article progressed, it started to shift into a combination of a build log and a tutorial. While the exact set of parts and the sequence of their assembly and configuration will likely be fairly unique to the machine I built, I hope that people undertaking a similar project will find helpful information in some portion of this article. My contact information is at the bottom of this article if you would like to get in touch with me for any reason.

[2017 Update:] I've had this server for a little over a year now and have made several changes to the configuration. Those changes include the following:

[2018 Update:] The server has been going strong for two years now. I've made some more changes to the system in the past year:

[2019 Update:] Three years! I've made some major changes this year to support the addition of a second 24-bay chassis:

[2020 Update:] I made a few minor upgrades through 2020:

[2021 Update:] I added another expansion shelf in 2021:

I've made updates to the original text to reflect these changes, marking paragraphs as updated where appropriate. The system summary section just below will always reflect the latest configuration. In some places I've kept original text from the article even though it's no longer applicable to my build; in these cases, the old information might still be helpful to others (for example, the info on preparing the Ikea LACK table to hold the server).

Contents:

System Summary

The server is running on FreeNAS 12.0 with 6x 8-drive RAID-Z2 virtual devices (vdevs) of 8TB disks for a total of ~250 TB of usable, redundant space. The drives in the main chassis (or "head unit") are connected to an LSI 9305-24i. The drives in the second chassis (or "expansion shelf") are connected with an LSI 9305-16e and a SAS3 expander backplane. The boot volume is stored on 2x mirrored Intel 540s 120 GB SSDs. I have an internally-mounted pair of 2 TB SATA SSDs striped together for ~4TB of fast, temporary data storage. The head unit is housed in a SuperMicro SC846 chassis with two 920W redundant PSUs; the expansion shelf uses the same chassis and PSUs. The system is built on a SuperMicro X10SRL-F LGA2011-v3 motherboard. I’m using an Intel Xeon E5-2666 v3 (10C/20T @ 2.9GHz) and 8 modules of 16GB DDR4 Samsung ECC/registered memory for a total of 128GB of RAM. I’m using a Noctua cooler on the CPU and I replaced the noisy stock chassis fans with quieter Noctua fans. I also created a front fan shroud that holds 3x 140mm fans to increase cooling while keeping noise levels down; one shroud is mounted on the head unit and one on the expansion shelf. I have two APC 1500VA Smart-UPS units; they and the file server live in a 42U Dell rack cabinet from Craigslist.

I set my primary dataset with recordsize = 1MiB to account for data block placement in an 8-drive Z2 array. Most of the data is shared via SMB/CIFS. I also have an iSCSI share on the main pool mounted on my desktop for my Steam library, and an iSCSI share on the SATA SSD pool mounted in one of the VMs (and then reshared via SMB). The system hosts several different Debian-based bhyve VMs to run various services I use on a day-to-day basis (including nginx, pi-hole, irssi, an OpenVPN server, and an rclone client I use to back up my data to ~~the cloud~~). I have scripts set up to send SMART and zpool status reports to my email address on a weekly basis and scrubs/SMART checks scheduled every 2 weeks. I also have a script that automatically adjusts the chassis fan speeds based on live HDD and CPU temps.
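If you're curious what those reporting scripts look like, here's a rough sketch of the idea (simplified from what I actually run): it captures the output of zpool status and smartctl for each drive and emails the combined text. The drive list, email addresses, and the assumption of a local mail relay on the box are all placeholders you'd adjust for your own setup.

# Minimal sketch of a weekly status report script (not the exact script I run).
# Assumes smartmontools is installed and a local MTA is listening on localhost.
import subprocess
import smtplib
from email.message import EmailMessage

DRIVES = ["/dev/da{}".format(i) for i in range(8)]  # placeholder device list
TO_ADDR = "me@example.com"                          # placeholder address

def run(cmd):
    # Capture the stdout of a command as text
    return subprocess.run(cmd, capture_output=True, text=True).stdout

report = [run(["zpool", "status", "-v"])]
for dev in DRIVES:
    report.append(run(["smartctl", "-a", dev]))

msg = EmailMessage()
msg["Subject"] = "Weekly NAS status report"
msg["From"] = "nas@example.com"
msg["To"] = TO_ADDR
msg.set_content("\n\n".join(report))

with smtplib.SMTP("localhost") as smtp:
    smtp.send_message(msg)
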

The fan control for each chassis is now handled through Raspberry Pis that output the appropriate PWM signal. I still have the script running on the FreeNAS system to monitor drive temperatures, but now, instead of setting fan speeds by sending ipmitool commands, it sends the commands to the Raspberry Pi in each chassis via Python's sockets module. The Pis also measure fan speeds and ambient chassis temperature, all of which is sent to another Pi running a web server (built with flask, socket.io, and redis) that displays lots of system vitals.
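To give a rough idea of how that socket communication can work (this is a simplified sketch rather than my actual scripts), the FreeNAS-side script pushes a target fan duty cycle to each chassis Pi as a small text message over TCP, and a listener on the Pi hands whatever it receives to the PWM code. The port number, message format, and apply_duty_cycle() callback are assumptions made for the example.

# Simplified sketch of the FreeNAS -> Raspberry Pi fan command path.
# The port, message format, and apply_duty_cycle() callback are illustrative only.
import socket

FAN_PI_PORT = 5005  # arbitrary port chosen for this example

def send_fan_command(pi_host, duty_percent):
    # FreeNAS side: push a target duty cycle (0-100) to one chassis Pi
    with socket.create_connection((pi_host, FAN_PI_PORT), timeout=5) as conn:
        conn.sendall("{:d}\n".format(int(duty_percent)).encode())

def pi_listener(apply_duty_cycle):
    # Raspberry Pi side: accept commands and hand them to the PWM code
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind(("0.0.0.0", FAN_PI_PORT))
        srv.listen(1)
        while True:
            conn, _addr = srv.accept()
            with conn:
                data = conn.recv(64).decode().strip()
                if data.isdigit():
                    apply_duty_cycle(int(data))  # e.g., set the hardware PWM duty
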

Of course, the primary purpose of the NAS is data storage. The vast majority of the data I store is from the high-resolution photography and videography I've done over the past 10-15 years, most of which is stored in some sort of raw format. I could probably go through and delete about 95% (even 99%) of the data, but I would rather keep everything around so I can pretend that maybe some day, someone will be interested in looking at it (maybe I'll have very patient kids?). I look at this use case as a modern version of the boxes full of photo albums and slides my parents and grandparents kept in their basement for decades and never looked at. Even if no one ever looks at my pictures and videos, it's still been a really fun project to work on. By the way, if you're interested in looking at some of my favorite photographs, you can find them here!

Photos

The front of the server with the front fan shroud on. The top chassis is the head unit, the lower chassis is the expansion shelf.


The front of the server with the front fan shroud removed (note the weather stripping tape around the edges).


A few different views of the inside of the head unit.






A view of the fan wall, secured with zip ties. The strip on top of the fan wall helps to seal off the two sections of the chassis, making sure air flows through the drive bays rather than back over top of the fan wall.


Inside of the expansion shelf chassis.



All expansion shelf drives connected via a single SAS3 cable.



The fan control Raspberry Pi in the head unit and in the shelf.



The dedicated 11" display for system stats, including drive temperatures, ambient chassis temp, CPU temp and load, and fan speeds.



A view of the front and back of the rack. From top to bottom, I have my Proxmox server, the FreeNAS head unit, the expansion shelf, my new workstation, a FreeNAS mini, a UniFi NVR, and some UPSs.



The whole rack rolls out several feet so I can get behind it when needed.


Here is a screenshot of the mounted shares.


Parts and Price List

[2019 Update:] Now that the system has gone through so many changes, I've decided to restructure this table to group things based on the yearly upgrades. Parts that have been functionally replaced in a later upgrade will be marked with strikethrough text. These replaced parts are still included in the price totals.

Part Make/Model Qty $ Per $ Total From
Original Build (2016)
Chassis SuperMicro SC846 1 $200 $200 TheServerStore
Motherboard SuperMicro X10SRL-F 1 $272 $272 Amazon
CPU Intel Xeon E5-1630v3 1 $373 $373 SuperBiiz
RAM Samsung M393A2G40DB0-CPB (16GB) 4 $80 $320 Amazon
HBA IBM M1015 2 $75 $150 eBay
PSU SuperMicro PWS-920P-SQ 2 $118 $236 eBay
Backplane SuperMicro BPN-SAS-846A 1 $250 $250 eBay
Data Drive WD 8 TB Red (8 + 1 Spare) 9 $319 $2,871 Amazon
Data Drive WD 4 TB Red (Already Owned) 8 $0 $0 -
Boot Device Intel 540s 120GB SSD 2 $53 $105 Amazon
CPU Cooler Noctua NH-U9DXi4 1 $56 $56 Amazon
120mm Fan Noctua NF-F12 iPPC 3000 PWM 3 $24 $72 Amazon
80mm Fan Noctua NF-R8 PWM 2 $10 $20 Amazon
UPS APC SUA1500RM2U Smart-UPS 1 $300 $300 eBay
SAS Cable SFF-8087 to SFF-8087 4 $11 $44 Amazon
HDD Screws SuperMicro HDD Screws (100 ct) 1 $8 $8 Amazon
Tax, Misc. Cables, etc. Tax, misc. 1 $250 $250 -
SSD Cage SuperMicro MCP-220-84603-0N 1 $25 $25 eBay
Original System Total: $5,528
2017 Update
HBA IBM M1015 1 $75 $75 eBay
Data Drive WD 8 TB Red (from WD EasyStore) 8 $130 $1,040 Best Buy
Scratch Disk WD 8 TB Red 1 $130 $130 Best Buy
Scratch Disk Cage Supermicro MCP-220-84601-0N 1 $20 $20 Best Buy
VM SSD Samsung 960 Pro NVMe SSD 1 $290 $290 Amazon
M.2 PCIe Adapter StarTech Brand 1 $22 $22 Amazon
Front Fan Shroud (3D Printed PLA) 1 $180 $180 3DHubs
140mm Fan Noctua NF-A14 iPPC 3000 PWM 3 $25 $75 Amazon
Rack Cabinet Dell 42U 1 $300 $300 Craigslist
Server Rails Supermicro MCP-290-00057-0N 1 $75 $75 eBay
2017 Update Total: $2,207
2018 Update
10GbE NIC Intel X540T2 1 $248 $248 Amazon
SSD Cage SuperMicro MCP-220-84603-0N 1 $25 $25 eBay
SSD Scratch Disks Micron MX500 2 TB 2 $339 $678 Amazon
2018 Update Total: $976
2019 Update
Data Drive WD 8 TB Red (from WD EasyStore) 16 $130 $2,080 Best Buy
Expansion Chassis SuperMicro SC846 1 $250 $250 TheServerStore
Expansion PSU SuperMicro PWS-920P-SQ 2 $100 $200 TheServerStore
Server Rails Supermicro MCP-290-00057-0N 1 $100 $100 TheServerStore
Expansion Backplane SuperMicro BPN-SAS3-846EL1 1 $630 $630 eBay
Additional RAM Samsung M393A2G40DB0-CPB (16GB) 4 $151 $604 Newegg
L2ARC Drive Intel Optane 900P 280GB AIC 1 $133 $133 Amazon
Internal HBA LSI SAS 9305-24i 1 $570 $570 Amazon
External HBA LSI SAS 9305-16e 1 $550 $550 Amazon
Front Fan Shroud (3D Printed PLA) 1 $180 $180 3DHubs
140mm Fan Noctua NF-A14 iPPC 3000 PWM 3 $25 $75 Amazon
120mm Fan Noctua NF-F12 iPPC 3000 PWM 3 $24 $72 Amazon
80mm Fan Noctua NF-R8 PWM 2 $10 $20 Amazon
SAS Ext. to Int. Adapter Supermicro AOM-SAS3-8I8E 1 $80 $80 Supermicro Store
SAS3 Ext. Cable SuperMicro CBL-SAST-0573 1 $60 $60 Supermicro Store
SAS3 Int. Cable SuperMicro CBL-SAST-0593 1 $15 $15 Supermicro Store
SAS3 to SAS2 Cable SuperMicro CBL-SAST-0508-02 6 $13 $78 Supermicro Store
Cable Management Arm SuperMicro MCP-290-00073-0N 2 $56 $112 Supermicro Store
UPS APC SUA1500RM2U Smart-UPS 1 $300 $300 eBay
Dual PSU Adapter Generic Amazon Model 1 $7 $7 Amazon
Rasp. Pi + SD Card RPi Model B+, 32GB MicroSD Card 3 $45 $135 Amazon
Thermal Probes Aideepen 5pc DS18B20 1 $12 $12 Amazon
System Vitals Display 11" 1080p Generic Touchscreen 1 $159 $159 Amazon
Arm for Display VIVO VESA Arm 1 $32 $32 Amazon
2019 Update Total: $6,454
2020 Update
Data Drive WD 8 TB Red (from WD Elements) 13 $130 $1,690 B&H
CPU Intel Xeon E5-2666 V3 1 $190 $190 eBay
2020 Update Total: $1,880
2021 Update
Data Drive WD 8 TB Red (from WD EasyStore) 34 $140.00 $4,760.00 Best Buy
Expansion Chassis Supermicro SC847 1 $701.00 $701.00 TheServerStore
Front Backplane Supermicro BPN-SAS3-846EL1 1 - - TheServerStore
Rear Backplane Supermicro BPN-SAS3-826EL1 1 - - TheServerStore
PSUs Supermicro PWS-1K28P-SQ 2 - - TheServerStore
Rails Supermicro MCP-290-00057-0N 1 - - TheServerStore
SAS3 Int. Cable SuperMicro CBL-SAST-0593 2 - - TheServerStore
Front Fan Shroud (3D Printed PLA) 1 $339.00 $339.00 3DHubs
SAS3 Ext. Cable Supermicro CBL-SAST-0677 2 $90.00 $180.00 eBay
SAS Ext. to Int. Adapter Supermicro AOM-SAS3-8I8E 1 $45.00 $45.00 eBay
140mm Fan Noctua NF-A14 iPPC 3000 PWM 3 $28.00 $84.00 Amazon
120mm Fan Noctua NF-F12 iPPC 3000 PWM 3 $26.00 $78.00 Amazon
80mm Fan NF-A8 PWM 10 $16.00 $160.00 Amazon
PWM Fan Splitter 5-Way Fan Hub 2 $7.00 $14.00 Amazon
PSU Adapter Thermaltake Dual 24-Pin 1 $11.00 $11.00 Amazon
2.5" Tray Adapter Supermicro MCP-220-00118-0B 2 $24.00 $48.00 eBay
SSD Scratch Disks Micron MX500 2 TB 2 $200.00 $400.00 Amazon
Cable Management Arm SuperMicro MCP-290-00073-0N 1 $56.00 $56.00 Supermicro Store
Replacement UPS APC SMT1500RM2U 1 $230.00 $230.00 eBay
UPS Battery Pack APC APCRBC133 1 $260.00 $260.00 APC Store
2021 Update Total: $7,366
Grand Total: $24,411

Parts Selection Details

My primary objectives when selecting parts were as follows:

  1. Allow for up to 24 drives;
  2. Be able to saturate a gigabit Ethernet line over SMB/CIFS; and
  3. Be quiet enough that the system can sit next to me in my office.

Redundancy and overall system stability were also obvious objectives and led me to select server-grade components wherever appropriate. Here’s a breakdown of the reasoning behind each part I selected:

I’m very happy with the parts selection and I don’t think I would change anything if I had to do it again. I have a few future upgrades in mind, including a proper rack and rails (Update: Done!), getting another M1015 and filling the 8 empty HDD bays (Update: Also done!), installing 10GbE networking (Update: Done as well!), and replacing the 4TB drives with 8TB drives (Update: Did this too!), but the current setup will probably hold me for a while (Update: It's already like 75% full...).

Build Process

For the most part, the system build was pretty similar to a standard desktop computer build. The only non-standard steps I took were around the HDD fan wall modification, which I discussed briefly in the section above. The stock fan wall removal was pretty easy, but some of the screws securing it are hidden under the hot swap fan caddies, so I had to remove those first. With the fan wall structure out of the way, there were only two minor obstructions left – the chassis intrusion detection switch and a small metal tab near the PSUs that the fan wall screwed into. The intrusion detection switch was easily removed by taking out a pair of screws, and I cut the small metal tab off with a Dremel (but you could probably just bend it out of the way if you wanted to). With those gone, the chassis was ready for my 120mm fan wall, but because the fans would block easy access to the backplane once they’re installed, I waited until the very end of the build to install them.

With the fan wall gone, swapping out the EOL backplane (which came pre-installed in my chassis) for the new version I purchased was pretty easy. Some of the screws are a little tough to access (especially the bottom one closest to the PSUs), but they all came out easily enough with some persistence. There are 6x Molex 4-pin power connectors that plug into the backplane to provide power to all the drives. The SuperMicro backplanes have a ton of jumpers and connectors for things like I2C, activity LEDs, and PWM fans, but I didn’t use any of those. Drive activity information is carried over the SAS cable and all my fans are connected directly to the motherboard. If you’re interested, check the backplane manual on the SuperMicro website for more information on all the jumpers and connectors.

After I swapped out the backplane, the motherboard, CPU, RAM, CPU cooler, PSUs, SSDs, and HBA cards all went in like a standard computer build. The only noteworthy thing about this phase of the installation was the orange sticker over the motherboard’s 8 pin power connector that reads “Both 8pins required for heavy load configuration”. It’s noteworthy because there is only one 8 pin power connector on the board... Maybe they meant the 8 pin and 24 pin power connectors? Whatever the case may be, just make sure both the 8 pin power and 24 pin power connectors are attached and you’ll be fine. I also made note of the SAS addresses listed on the back of each of the M1015 cards before installing them. The SAS address is printed on a sticker on the back of the card and should start with “500605B”, then there will be a large blank space followed by 9 alpha-numeric characters interspersed with a couple of dashes. These addresses are needed in the initial system configuration process.

[2018 Update:] I ended up removing this scratch disk and replacing it with a pair of SATA SSDs. [2017 Update:] I wanted an extra drive to host a ZFS pool outside the main volume. This pool would just be a single disk and would be for data that didn't need to be stored with redundancy on the main pool. This drive is mounted to a tray that sits right up against the back of the power supplies on the inside of the chassis and it tended to get much hotter than all the front-mounted drives. My fan control script goes off of maximum drive temperature, so this scratch disk kept the fans running faster than they would otherwise. To help keep this drive cool, I drilled new holes in the drive tray to give a bit of space between the back of the drive and the chassis wall. I also cut a small hole in the side of the drive tray and mounted a little blower fan blowing into the hole so that air would circulate behind the drive. I had to cut away a portion of the SSD mounting tray to accommodate the blower fan. In the end, I'm not sure if the blower fan with its whopping 4 CFM of airflow makes any difference, but it was a pain to get in there so I'm leaving it. In fact, I ended up just modifying the fan script to ignore this scratch disk, but I do keep an eye on its temperature to make sure it's not burning up. A picture of the blower fan is below:


As this was my first server build, I was a little surprised that unlike consumer computer equipment, server equipment doesn’t come with any of the required screws, motherboard standoffs, etc., that I needed to mount everything. Make sure you order some extras or have some on-hand. I ordered a 100-pack of SuperMicro HDD tray screws on Amazon for $6 shipped; I would recommend using these screws over generic ones because if you use screws that don’t sit flush with the HDD sled rails, you’ll have a lot of trouble getting the sled back in the chassis and could even damage the chassis backplane.

As I mentioned above, the CPU cooler I’m using provides enough vertical clearance for the RAM, but I will probably have to remove the cooler to actually get new RAM into the slots if I ever need to add more. This isn’t a huge deal as the cooler is very easy to install. I will note here that the cooler came with 2 different sets of mounting brackets for the LGA2011-v3 narrow ILM system so you can orient the airflow direction either front-to-back or side-to-side (allowing you to rotate the cooler in 90 degree increments). Obviously, for this system, I wanted air flowing in from the HDD side and out the back side, so I had to use the appropriate mounting bracket (or, more accurately, I realized there were two sets of narrow ILM brackets only after I installed the incorrect set on the cooler).

The front panel connector was a little confusing as the non-maskable interrupt (NMI) button header is in the same assembly on the motherboard as all the front panel headers (this header assembly is marked “JF1” on the motherboard and is not very clearly described in the manual). The connectors for all the front panel controls and LEDs are also contained in one single plug with 16 holes and no discernible orientation markings. After studying the diagrams in the motherboard manual, I was able to determine that the NMI button header pins are the 2 pins on this JF1 assembly that are closest to the edge of the motherboard, then (moving inwards) there are 2 blank spots, and then the 16 pins for the rest of the front panel controls and LEDs. The 16 pin front panel connector plugs into these inner 16 pins and should be oriented so the cable exits the 16 pin connector towards the PSU side of the chassis. Basically, if you have the front panel connector plugged into these 16 pins but the power button isn’t working, try flipping the plug around. If you have an NMI button (not included in the stock chassis), it will plug into those last 2 pins closest to the motherboard’s edge. If you don’t have an NMI button, just leave those pins empty.

I also swapped out the rear fans for quieter Noctua 80mm models at this point. The only way to mount them in the chassis is with the hot swap caddies (the chassis isn’t drilled for directly-mounted fans), but the process is pretty straight-forward. The stock fans have very short cables, maybe 1 inch long, because the PWM connectors are mounted onto the side of the caddie so they can mate with the hot-swap plug on the chassis itself when you slide the caddie into its “rail” system. That plug connects to what is essentially a PWM extension cable mounted to the caddie rails which connects the fans to the motherboard’s PWM fan headers. I took out this whole hotswap mechanism because the Noctua fan cables are much longer than the stock fan cables and the Noctua PWM connectors are missing a small notch on the plug that is needed to secure it in the hot swap caddie. It’s tough to describe, but it would be pretty obvious what I mean if you examine the rear fan caddies yourself.

With all the server guts installed and connected, I did some basic cable management and prepared to install my 120mm fan wall. I started by using zip-ties to attach the 3 fans together (obviously ensuring they would all blow in the same direction). The Noctua fans have soft silicone pads in each corner, so vibrations shouldn’t be a big issue if you get the pads lined up right. I put the fan wall in place in the chassis and marked off where the zip tie mounts should be placed with a marker, stuck the mounts on the marks (4 in total on the bottom), and used more zip ties to mount the fan wall in place. With the bottom of the fan wall secured in place, the whole thing is pretty solid, but I added one more zip tie mount to the top of the wall on the PSU side. This sort of wedges the fan wall in place and makes it all feel very secure. Once the fans were secure, I connected them to the 3-to-1 PWM fan splitter, attached that to the FANA header (this is important for the fan control script discussed later), and cleaned up all the cables.

[2019 Update:] The center fan wall and the front fans discussed below are no longer connected to the motherboard. With the modifications I did to the fan control setup, these fans are connected directly to an independent Raspberry Pi system that generates the PWM signals based on commands received from a script on the FreeNAS itself. Power for the fans is provided by a +12V line spliced in from the PSU.

While I’m talking about the HDD fan wall, I’ll also mention here that after running the server for a few days, I noticed some of the drive temperatures were in the low 40s (Celsius), much higher than they should be. The Noctua fans I originally had installed maxed out at 1500 RPM, but I decided I would be safer with the Noctua iPPC fans that could hit 3000 RPM. I have a fan control script running (more on that below), so they hardly ever need to spin faster than 1500 RPM, but it’s nice to know the cooling is there if I ever need it. In addition to upgrading my original fans, I made a few minor modifications to improve overall cooling efficiency for the whole system:

  1. I used masking tape to cover the ventilation holes on the side of the chassis. These holes are on the hard drive side of the fan wall and are intended to prevent the stock fans from starving, but with lower-speed fans they allow air to bypass the hard drives, which cuts the total cooling efficiency.

  2. I cut pieces of index cards and used masking tape to block air from flowing through the empty drive bays. The air flow resistance through the empty bays was much lower than it was through the populated bays so most of the air was bypassing the hard drives. You can see a picture of it here. [2017 Update:] These bays are now populated with drives, so the index cards and masking tape came off!

  3. Air was flowing from the CPU side of the HDD fan wall back over the top of the fans rather than coming through the HDD trays, so I cut a long ~3/4” thick strip of wood to block the space between the top of the fans and the chassis lid. I measured the wood strip to be a very tight fit and zip-tied it to the fans to secure it in place. I even cut out little divots where the zip ties cross the top of the wood strip to be extra cautious. You can see this wood strip in the 3rd and 4th pictures in the section above.

  4. [2017 Update:] The fans started to get noisy in the summer when ambient temperatures went up, so I took more drastic measures. I designed a bezel that fits over the front part of the chassis and allows me to mount 3x 140mm fans blowing air into the drive bays from the outside. The bezel is secured in place with zip ties and the fans are powered via a PWM extension cable that I ran through one of the side vent holes and along the outside of the chassis. This fan bezel has made a substantial improvement in overall airflow and noise level. More information on it is just below.

With these simple modifications in place, effective airflow to the drives increased dramatically and HDD temps dropped by ~10C even with fan speeds under 1500 RPM. You can check relative airflow levels to the hard drive bays by holding a piece of paper up in front of the drive bays and observing the suction force from the incoming air. With a heavy workload, the fans sometimes spin up to 2000 RPM for a minute or two, but overall the system is very quiet. The fan control script I’m running is set to spin up when any drive gets above 36C.
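To make the control logic a bit more concrete, here's a simplified sketch of the kind of fan curve the script follows: idle below 36C, then ramp linearly up to full speed. The exact breakpoints and duty cycles in my real script differ, so treat these numbers as placeholders.

# Illustrative fan curve: map the hottest drive temperature to a PWM duty cycle.
# The 36C "spin up" point comes from the text above; the other numbers are placeholders.
def fan_duty_from_temp(max_hdd_temp_c):
    if max_hdd_temp_c < 36:
        return 30            # quiet idle speed
    if max_hdd_temp_c >= 44:
        return 100           # drives are hot, run the fans flat out
    # Linear ramp between 36C (30%) and 44C (100%)
    return 30 + (max_hdd_temp_c - 36) * (100 - 30) / (44 - 36)
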

[2017 Update:] I built this machine in the fall, and through the colder winter months, the cooling I had in place was able to keep up with the heat output without making too much noise. When summer rolled around, however, the fans started to get annoyingly loud. I eventually decided to design a fan shroud for the front of the server that would allow me to mount 3x 140mm fans in front of the drive bays blowing inward. I had the part 3D printed via 3DHubs in PLA (10um layer size) and it turned out pretty nice. There's a link to the 3D model of the bezel below. After a lot of sanding, priming, painting, and some light bondo application, I ended up with the piece shown below:



The 20d nail run through these knobs allows for more secure mounting.


The shroud is zip-tied to the chassis handles. Also note the weather stripping.


The PWM extension cable is run out of one of the side vent holes and along the bottom of the chassis (covered with black tape).


[2017 Update ctd.] The 3D model for the fan bezel can be found on Sketchfab. You should be able to download the STL file on that same page. There are a few things to note about the model for anyone that wants to try something similar:

[2017 Update ctd.] The fan bezel has had a significant impact on overall cooling performance and noise level. Without the bezel, the internal 120mm fans would need to run at 2000+ RPM almost constantly during the summer months. Now that the bezel is installed, I can keep the fans at 1200-1300 RPM and all the drives are just as cool as before.

[2017 Update ctd.] I made a short video that covers the various changes I made to my chassis' cooling system and demonstrates the noise level at various fan speeds:



The last step in the system build was to get all the hard drives loaded into their sleds and slide them into the chassis. If you aren’t populating all 24 bays in the chassis, be sure to note which mini-SAS ports connect to which bays; this is labeled on the rear of the backplane and in the backplane manual.

[2017 Update:] With everything built, I could load the server and the UPS into the rack cabinet. The inner rails snapped right into place on the sides of the chassis and the outer rails slotted directly into the square holes on the rack posts. I originally had the outer rails installed a third of a rack unit too low, so I had to move them up a slot. If you're unsure of which set of holes to use for the outer rails so your machine lines up with the marked rack units, check the photos above of my machine in the rack. The UPS is just sitting on the floor of the rack cabinet (which is solid steel and seems extremely sturdy) and occupies the lowest 2U.

[Original text with details on building the LackRack:] With everything built, I could load the server and the UPS into the LackRack. The UPS went on the floor and the server went on the lower shelf. I have all my networking gear on the top shelf along with some other knick-knacks. Assembly of the LackRack itself was pretty easy, but there were a few minor things worth noting. I picked up some basic metal corner braces from a hardware store for reinforcement of the lower shelf; they’re around 3” long and 3/4” wide and seem to work pretty well. I mounted the braces to the legs of the table and the underside of the lower shelf with basic wood screws. The lower shelf is only ~1/3” thick, so I got very stubby screws for that side of the brace. When measuring how low or high to install the lower shelf, I forgot to make sure to leave enough room for the server to sit in the space and had to re-do part of the installation at a lower height. For a 4U server (like the one I’ve got), you’ll need a smidge over 7”, so the shelf has to go an inch or two lower than the IKEA instructions would have you mount it. The legs of the table (into which I mounted the braces) are very lightweight; it feels like they’re totally hollow except for a small solid area around the middle where you’re supposed to mount the tiny IKEA-provided braces that come with the table. Don’t over-tighten the screws you put into the legs even a little bit, otherwise they will completely shred out the wood and won’t hold very securely. In fact, while I was installing one of the braces, I leaned on my screw gun a bit too hard and before I even started to turn the screw, it broke through the outer “wall” of the leg and just went chonk and fell down into place. Not a confidence-inspiring event while building the “rack” that will soon house my ~$5,000 server... Regardless, with all the corner braces installed, the two shorter ends of the shelf seem pretty sturdy. However, the shelf is so thin that it would have started to sag (and could have possibly broken) with any weight in the middle. With a file server, most of the weight is in the front due to the drives, but I thought it was still a good idea to brace the middle of the shelf from the underside. I cut a short piece of 2x4 that I could use to prop up the middle of the lower shelf from underneath.

With everything installed and mounted, I was finally ready to power on the system for the first time and move on to the installation and configuration process!

Flashing the M1015 HBA Cards & Installing FreeNAS

[2019 Update:] The two new LSI HBAs I bought to replace the M1015s do not require any crossflashing or reflashing (unless the firmware is out of date). The process below is for the IBM cards, which are just re-branded LSI 9211-8is. The crossflashing operation with the MEGAREC utility lets you erase the IBM firmware and put LSI's firmware back on them. Again, with the newer LSI cards I got, this isn't necessary because they're already running LSI's IT firmware out of the box.

I was pretty lucky and my server POST’d on the first try. Before actually installing an OS, I needed to flash the M1015 cards with the IT mode firmware. This article has instructions on that process. The download linked in that article goes down quite a bit, so I’ve rehosted the necessary firmware files here [.zip file]. This file contains 3 DOS applications (sas2flsh.exe, megarec.exe, and dos4gw.exe), the IT firmware image (2118it.bin), the BIOS image file (mptsas2.rom), and an empty file for backing up stock card firmware (sbrempty.bin). If you're flashing more than one card, you will want to copy this file so you have one per card. The sas2flsh and megarec applications are used below to back up, erase, and reflash the cards. The dos4gw application allows these applications to address more memory space than they would be able to otherwise, but you won't need to run it directly.

I used Rufus to create a bootable FreeDOS USB drive and copied in the files from the above .ZIP archive. Before performing the rest of the process, it is a good idea to visit the controller manufacturer’s website to make sure you’re using the most recent firmware image and BIOS. The layout and URLs of the official Broadcom pages that host the firmware change periodically, so just search Google for “SAS 9211-8i firmware”, find the downloads section, and open the firmware sub-section. The versions are marked by “phase” numbers; the firmware/BIOS version I included in the above ZIP file is from phase 20 or “P20” as it’s listed on the site. If a more recent version is available, download the MSDOS and Windows ZIP file, find the BIOS image (called mptsas2.rom) and the IT firmware (called 2118it.bin; you do not want the IR firmware called 2118ir.bin) and copy them both onto your bootable USB drive, overwriting the files I provided.

With the SAS addresses I wrote down during the build process on hand, I booted from my USB drive into the FreeDOS shell and executed the following from the DOS terminal:


megarec -adpList
	

MegaREC is a utility provided by LSI that I'll use to back up each card's original firmware and then wipe them. The above command lists all the adapters it finds; make sure all your cards are listed in its output. When I originally flashed my cards, I had two installed, so I ran each command once per card with the adapter number after the first flag. I made two copies of the sbrempty.bin file, called sbrempty0.bin and sbrempty1.bin; make sure to adjust your -writesbr commands accordingly. If you only have one card, you can omit the adapter number. Run the following commands to back up and wipe each of the cards:


megarec -writesbr 0 sbrempty0.bin
megarec -writesbr 1 sbrempty1.bin
megarec -cleanflash 0
megarec -cleanflash 1
	

(Reboot back to USB drive.)

Once I backed up and wiped all the cards, I rebooted the server. When it came online (again in FreeDOS), I could flash the cards with the IT mode firmware using the following commands:


sas2flsh -o -f 2118it.bin -c 0
sas2flsh -o -f 2118it.bin -c 1
sas2flsh -o -sasadd 500605bXXXXXXXXX -c 0
sas2flsh -o -sasadd 500605bXXXXXXXXX -c 1
	

(Shut down and remove USB drive.)

There are a couple of things to note here. As above, the -c 0 and -c 1 at the end of these commands specify the controller number. If you’re also following the guide I linked above, you may notice that I’ve left out the flag to flash a BIOS (-b mptsas2.rom) in the first set of commands. This is because I don’t need a BIOS on these cards for my purposes; you will need the BIOS if you want to boot from any of the drives attached to the controller (but don’t do that... Either use USB drives or connect your SSDs directly to the motherboard SATA ports). I’ve included the latest BIOS file in the zip just in case someone needs it; just add -b mptsas2.rom to the end of the first (set of) command(s), but again, you really shouldn’t need it. The last thing to note is the SAS addresses in the second set of commands. The XXXXXXXXX part should be replaced with the last part of the SAS address of that controller (without the dashes). Make sure the address matches up with the correct card; you can run sas2flsh -listall to check the PCI addresses if you aren’t sure which controller number maps to which physical card. The -listall command requires firmware to be flashed to the card or else it will throw an error and prompt for the firmware filename, so run it after the -f commands. After all the cards were flashed, I powered down the server, removed the USB drive, and prepared to install FreeNAS.

I downloaded the latest FreeNAS 9.10 ISO from here, used Rufus again to make a bootable USB drive with it, and started the install process by booting off the USB stick. The FreeNAS installation process is very easy. When selecting the boot volume, I checked off both my SSDs and FreeNAS handled the mirroring automatically. After the installation finished, I rebooted the system from the SSDs and the FreeNAS web UI came online a few minutes later.

Initial FreeNAS Configuration

The very first thing I did in the FreeNAS configuration was change the root password and enable SSH. I also created a group and user for myself (leaving the home directory blank to start with) so I didn’t have to do everything as root. If you’re having trouble getting in via SSH, make sure the SSH service is actually enabled; in the web UI, go to Services > Control Services and click the SSH slider to turn the service on.

With SSH access set up, I connected to a terminal session with my new FreeNAS machine and followed this guide on the FreeNAS forums for most of my initial setup, with a few minor modifications. The text in this section is largely based on that guide. My first step was to determine the device names for all the installed disks. You can do this by running:


camcontrol devlist
	

After determining the device names, I did a short SMART test on each of my drives using:


smartctl -t short /dev/da<#>
	

Where da<#> is the device name from the camcontrol devlist output. The test only takes a couple minutes and you can view the results (or the ongoing test progress) using:


smartctl -a /dev/da<#>
	

After checking that all the SMART tests passed, I created my primary volume. My process was a little non-standard because I moved my 4TB drives into the server after I transferred the data off them, so I’ll go through my process first and discuss the standard process afterwards. However, before diving into that, I want to review how ZFS allocates disk space and how it can be tuned to minimize storage overhead (by as much as 10 percent!). This next section gets pretty technical and if you aren’t interested in it, you can skip it for now.

Calculating & Minimizing ZFS Allocation Overhead

Calculating the disk allocation overhead requires some math and an understanding of how ZFS stripes data across your disks when storing files. Before we get into the math, let’s take a look at how ZFS stores data by discussing two examples:

  1. Storing a very small file, and

  2. Storing a large(r) file.

We’ll start out with the small file. Hard disks themselves have a minimum storage unit called a “sector”. Because a sector is the smallest unit of data a hard disk can write in a single operation, any data written to a disk that is smaller than the sector size will still take up the full sector. It's still possible for a drive to handle a write that's smaller than its sector size (for instance, changing a single byte in an already-written sector), but it needs to first read the sector, modify the relevant part of the sector's contents, and then re-write the modified data. Obviously this sequence of three operations will be a lot slower than simply writing a full sector’s worth of data. This read-modify-write cycle is what gives rise to "write amplification".

On older hard drives (pre ~2010), the user data portion of a sector (the part we care about) is typically 512 bytes wide. Newer drives (post ~2011) use 4096-byte sectors (4KiB, or simply 4K). Each hard disk sector also has some space for header information, error-correcting code (ECC), etc., so the total sector size is actually 577 bytes on older drives and 4211 bytes on newer drives, but we only care about the portion in each sector set aside for user data; when I refer to a “sector”, I’m referring only to the user data portion of that sector.

Because the hard disk sector size represents the smallest possible unit of storage on that disk, it is obviously a very important property for ZFS to keep track of. ZFS keeps track of disk sector sizes through the “alignment shift” or ashift parameter. The ashift parameter is calculated as the base 2 logarithm of a hard disk’s sector size and is set per virtual device (“vdev”). ZFS will attempt to automatically detect the sector size of its drives when you create a vdev; you should always double-check that the ashift value is set accurately on your vdev as some hard disks do not properly report their sector size. For a vdev made up of older disks with 512-byte sectors, the ashift value for that vdev will be 9 (\(2^9 = 512\)). For a vdev made up of newer disks with 4096-byte sectors, the ashift value for that vdev will be 12 (\(2^{12} = 4096\)). Obviously, mixing disks with 512-byte sectors and disks with 4096-byte sectors in a single vdev can cause issues and isn’t recommended; if you set ashift = 9 in a vdev with 4K drives, performance will be greatly degraded as every write will require the read-modify-write operation sequence I mentioned above in order to complete. It follows that \(2^{ashift}\) represents the smallest possible I/O operation that ZFS can make on a given vdev (at least before we account for parity data added on by RAID-Z).
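In code terms, ashift is just the base-2 logarithm of the sector size, and \(2^{ashift}\) is the minimum I/O size for the vdev; a quick sanity check:

# ashift is the base-2 logarithm of the disk's sector size
import math

for sector_size in (512, 4096):
    ashift = int(math.log2(sector_size))
    print("sector size {:>4} bytes -> ashift = {} (min I/O = {} bytes)".format(
        sector_size, ashift, 2 ** ashift))
# Prints ashift 9 for 512-byte sectors and ashift 12 for 4096-byte sectors
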

Let’s quickly review how data is stored on a “striped” RAID configuration (i.e., RAID 5, RAID 6, RAID-Z, RAID-Z2, and RAID-Z3) before going any further. On these RAID configurations, the data stored on the array will be spread across all the disks that make up that array; this is called “striping” because it writes the data in “stripes” across all the disks in the array. You can visualize this with a 2-dimensional array: the columns of the array are the individual disks and the rows are the sectors on those disks (a full row of sectors would then be called a “stripe”).

When you write data to a RAID 5 or RAID 6 system, the RAID controller (be it hardware or software) will write that data across the stripes in the array, using one sector per disk (or column). Obviously, when it hits the end of a row in the array, it will loop back around to the first column of the next row and continue writing the data. RAID 5 and RAID 6 systems can only handle full-stripe writes and will always have 1 parity sector per stripe for RAID 5 and 2 parity sectors per stripe for RAID 6. The parity data is not stored on the same disk(s) in every row; otherwise, there would be a lot of contention to access that disk. Instead, the parity sectors are staggered, typically in a sort of barber pole fashion, so that when you look at the whole array, each disk has roughly the same number of parity sectors as all the others. Again, this ensures that in the event of a bunch of small writes that should only involve writing to two or three disks, no single disk is bogged down handling all the parity data for every one of those writes. Because RAID 5 and 6 can only handle full-stripe writes, if the array is told to write data that is smaller than a single stripe (minus the parity sectors), it needs to read the data in that stripe, modify the relevant sectors, recalculate the parity sector(s), and rewrite all sectors in the stripe. Very similar to the write amplification example above, this long sequence of events to handle a single small write ends up hobbling performance.

RAID-Z can handle partial-stripe writes far more gracefully. It simply makes sure that for every block of data written, there are \(p\) parity sectors per stripe of data, where \(p\) is the parity level (1 for Z1, 2 for Z2, and 3 for Z3). Because ZFS can handle partial-stripe writes, ZFS doesn't pay special attention to making sure parity sectors are "barber poled" as in RAID 5 and 6. Lots of small write operations that would cause contention for a single parity disk as above would just get their own parity sectors in their own partial-stripe writes. It should be noted that ZFS stripes the data down the array rather than across it, so if the write data will occupy more than a single stripe, the second sector of the data will be written directly under the first sector (on the next sector in the same disk) rather than directly to the right of it (on a sector on the next disk). It still wraps the data around to the next disk in a similar fashion to RAID 5 and 6, it just does it in a different direction. If the write data fits in a single stripe, it stripes the data across the array in an almost identical manner to RAID 5 and 6. ZFS's vertical RAID-Z stripe orientation doesn't really impact anything we'll discuss below, but it is something to be aware of.

Getting back on track, we were discussing the smallest possible writes one can make to a ZFS array. Small writes will obviously be used for small file sizes (on the order of a couple KiB). The smallest possible write ZFS can make to an array is:

$$ n_{min} = 1+p $$

As above, \(p\) is the parity level (1 for RAID-Z1, 2 for RAID-Z2, and 3 for RAID-Z3) and the 1 represents the sector for the data itself. So \(n_{min}\) for various RAID-Z configurations will be as follows:

$$ \text{RAID-Z1: } n_{min} = 2 $$

$$ \text{RAID-Z2: } n_{min} = 3 $$

$$ \text{RAID-Z3: } n_{min} = 4 $$

When ZFS writes to an array, it makes sure the total number of sectors it writes is a multiple of this \(n_{min}\) value defined above. ZFS does this to avoid situations where data gets deleted and it ends up with a space on the disk that’s too small to be used (for example, a 2-sector wide space can’t be used by RAID-Z2 because there’s not enough room for even a single data sector and the necessary two parity sectors). Any sectors not filled by user data or parity information are known as “padding”; the data, parity information, and padding make up the full ZFS block. Padding in ZFS blocks is one of the forms of allocation overhead we’re going to look at more closely. Study the table below for a better idea of how block padding can cause storage efficiency loss. Note that this table assumes everything is written to a single stripe; we’ll look at how data is striped and how striping can cause additional overhead in the next section.

Data, Parity, and Padding Sectors with Efficiency (Note: Assumes Single Stripe)
Columns: Data Sectors | Parity Sectors (Z1, Z2, Z3) | Padding Sectors (Z1, Z2, Z3) | Total Sectors / Block Size (Z1, Z2, Z3) | Efficiency = Data/Total (Z1, Z2, Z3)
1 1 2 3 0 0 0 2 3 4 50.0% 33.3% 25.0%
2 1 2 3 1 2 3 4 6 8 50.0% 33.3% 25.0%
3 1 2 3 0 1 2 4 6 8 75.0% 50.0% 37.5%
4 1 2 3 1 0 1 6 6 8 66.7% 66.7% 50.0%
5 1 2 3 0 2 0 6 9 8 83.3% 55.6% 62.5%
6 1 2 3 1 1 3 8 9 12 75.0% 66.7% 50.0%
7 1 2 3 0 0 2 8 9 12 87.5% 77.8% 58.3%
8 1 2 3 1 2 1 10 12 12 80.0% 66.7% 66.7%
9 1 2 3 0 1 0 10 12 12 90.0% 75.0% 75.0%
10 1 2 3 1 0 3 12 12 16 83.3% 83.3% 62.5%
11 1 2 3 0 2 2 12 15 16 91.7% 73.3% 68.8%
12 1 2 3 1 1 1 14 15 16 85.7% 80.0% 75.0%
13 1 2 3 0 0 0 14 15 16 92.9% 86.7% 81.3%
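If you want to check these numbers yourself (or extend the table past 13 data sectors), the single-stripe padding calculation is easy to script. Here's a small sketch that reproduces the table above; the printed layout is my own.

# Reproduce the single-stripe padding/efficiency table above.
# For a block with d data sectors and parity level p, the block is padded
# up to the next multiple of n_min = 1 + p.
import math

def single_stripe(d, p):
    n_min = 1 + p
    total = math.ceil((d + p) / n_min) * n_min
    padding = total - d - p
    efficiency = 100.0 * d / total
    return p, padding, total, efficiency

for d in range(1, 14):
    row = [str(d)]
    for p in (1, 2, 3):  # Z1, Z2, Z3
        parity, padding, total, eff = single_stripe(d, p)
        row.append("p={} pad={} tot={} eff={:.1f}%".format(parity, padding, total, eff))
    print("  ".join(row))
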

If the data you’re writing fits in a single stripe, ZFS will allocate the block based on the above table, again making sure that the block size is a multiple of \(n_{min}\). When the data you’re writing doesn’t fit in a single stripe, ZFS simply stripes the data across all the disks in the array (again, in the vertical orientation discussed above), making sure that there is an appropriate quantity of parity sectors per stripe. It will still make sure the size of the block (which is now spread across multiple stripes and contains multiple sets of parity sectors) is a multiple of \(n_{min}\) to avoid the situation outlined above. When considering how ZFS stripes its data, remember that RAID-Z can handle partial stripe writes. This means that RAID-Z parity information is associated with each block rather than with each stripe; thus it is possible to have multiple sets of parity sectors on a given disk stripe if there are multiple blocks per stripe. The below figures show (roughly) how ZFS might store several data blocks of varying sizes on a 6-wide and an 8-wide RAID-Z2 array. Data sectors are preceded by a "D", parity sectors by a "P", and padding sectors are indicated by an "X". Each set of colored squares represents a different ZFS block.

If we define \(w\) as the stripe width (or the number of disks in the array), we can see that ZFS will make sure there are \(p\) parity sectors for every set of \(1\) to \(w-p\) data sectors so a disk failure doesn’t compromise the stripe. In other words, if there are between \(1\) and \(w-p\) user data sectors, the ZFS block will have \(p\) total parity sectors. If there are between \((w-p)+1\) and \(2(w-p)\) user data sectors, the block will have \(2p\) total parity sectors. This point can be tough to conceptualize, but if you study all the examples in the above figure, you should see what I mean by this. It is also interesting to compare the number of total sectors that are required to store a given number of data sectors for the 6-wide and 8-wide RAID-Z2 examples. The table below shows this comparison.

Note first that all the numbers in the “Total Sectors” column are divisible by

$$ n_{min} = 1+p = 1+2 = 3 $$

This is due to the “padding” sectors allocated at the end of the blocks so their lengths are divisible by \(n_{min}\). Because of this, the sequence in which these blocks are stored is irrelevant when we determine how many total sectors will be required to store that data. Comparing the values in the two “Total” columns (particularly for the larger data blocks) hints at the next form of overhead we will cover.

To review, ZFS dynamically sizes data blocks based on the amount of user data and parity in that block. The smallest block size is

$$ n_{min} = 1+p $$

Where \(p\) is the parity level. The blocks can grow in increments of \(n_{min}\). We also defined \(w\) as the stripe width. Next, we’ll look at how larger writes are handled (for files of a couple MiB and larger).

The maximum size of a ZFS data block is controlled by a user-definable parameter called recordsize. Its value represents the maximum amount of data (before parity and padding) that a file system block can contain. The default value for the recordsize parameter in FreeNAS is 128KiB, but you can set the value to any power of 2 between 512B and 1MiB. The recordsize parameter can be set per ZFS dataset and even modified after the dataset is created (but this will only affect data written after the parameter is changed). You may realize at this point that blocks of length recordsize might not always contain a total number of sectors that is divisible by \(n_{min}\)... We’ll get to this in just a bit.

We now have all four parameters we need to consider when calculating the allocation overhead of a ZFS array: A ZFS block’s recordsize, the vdev’s parity level (\(p\)), the vdev’s stripe width (\(w\)), and disks’ sector size (ashift). The allocation overhead will be calculated as a percentage of the total volume size so it is independent of individual disk size. To help us understand how all of these factors fit together, I will focus on 4 different examples. We will go through the math to calculate allocation overhead (defined below) for each example, then look at them all visually. The four examples are as follows:

Ex. Num Parity Level Stripe Width recordsize sector size (ashift)
1 2 (RAID-Z2) 6 128KiB 4KiB (12)
2 2 (RAID-Z2) 8 128KiB 4KiB (12)
3 2 (RAID-Z2) 6 1MiB 4KiB (12)
4 2 (RAID-Z2) 8 1MiB 4KiB (12)

As we will see later on, the parity level, stripe width, and ashift values are typically held constant while the recordsize value can be tuned to suit the application and maximize the storage efficiency by minimizing allocation overhead. If the parity level and stripe width are not held constant, decreasing parity level and/or increasing stripe width will always increase overall storage efficiency (more on this below). The ashift parameter should not be adjusted unless ZFS incorrectly computed its value.

For the first example, we’ll look at a 6-wide RAID-Z2 array with 4KiB sectors and a recordsize of 128KiB. 128KiB of data represents 128KiB/4KiB = 32 total sectors worth of user data. Since we’re using RAID-Z2, we need 2 parity sectors per stripe, leaving 6-2 = 4 sectors per stripe for user data. 32 total sectors divided by 4 user data sectors per stripe gives us 8 total stripes. 8 stripes * 2 parity sectors per stripe gives us 16 total parity sectors. 16 parity sectors + 32 data sectors give us 48 total sectors, which is divisible by 3, so no padding sectors are needed. In this example, you will notice that all the numbers divided into each other nicely. Unfortunately, this is not always the case in every configuration.

In our second example, we’ll now look at an 8-wide RAID-Z2 array. The array will still use 4KiB sector disks and will still have a recordsize of 128KiB. We will still need to store 128KiB/4KiB = 32 total sectors worth of user data, but now we have 8-2 = 6 sectors per stripe for user data. 32 data sectors/6 sectors per stripe gives us 5.333 total stripes. As we saw in the previous section, we can’t have .333 stripes worth of parity data. ZFS creates 5 full stripes (which cover 30 sectors worth of user data) and 1 partial stripe for the last 2 sectors of user data, but all 6 stripes (5 full stripes and 1 partial stripe) need full parity data to maintain data resiliency. This “extra” parity data for the partial stripe is our second source of ZFS allocation overhead. So we have 32 data sectors and 6*2 = 12 parity sectors giving us a total of 44 sectors. 44 is not divisible by 3, so we need one padding sector at the end of the block, bringing our total to 45 sectors for the data block.

Before continuing to the final two examples, it would be worthwhile to generalize this and combine it with what we discussed in the previous section. We have the two sources of allocation overhead defined, which are:

  1. Padding sectors added to the end of a data block so the total number of sectors in that block is a multiple of \(n_{min}\)

  2. Parity sector(s) on partial stripes

We can say that the size of a given ZFS data block (in terms of the number of disk sectors) will be dynamically allocated somewhere between \(n_{min}\) and \(n_{max}\) and will always be sized in multiples of \(n_{min}\) (where \(n_{min}\) and \(n_{max}\) are defined below):

$$ \bf{n_{min}} = 1 + p $$

$$ \bf{n_{max}} = n_{data} + n_{parity} + n_{padding} $$

$$ n_{data} = \frac{recordsize}{2^{ashift}} $$

$$ n_{parity} = ceiling\left(\frac{n_{data}}{w-p}\right) * p $$

$$ n_{padding} = ceiling\left(\frac{n_{data} + n_{parity}}{n_{min}}\right) * n_{min} - (n_{data} + n_{parity}) $$

Where

$$ p = \text{vdev parity level} $$

$$ w = \text{vdev stripe width} $$

Once a file grows larger than the dataset’s recordsize value, it will be stored in multiple blocks, each with a length of \(n_{max}\).

The value of \(n_{max}\) will be our primary focus when discussing allocation efficiency and how to maximize the amount of data you can fit on your array. If your application is anything like mine, the vast majority of your data is made up of files larger than 1MiB. By defining one additional value, we can calculate the allocation overhead percentage:

$$ n_{theoretical} = w * \frac{n_{data}}{w-p} $$

$$ overhead = \left(\frac{n_{max}}{n_{theoretical}} -1 \right) * 100\% $$
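These formulas translate directly into a few lines of code. Here's a small sketch that computes \(n_{data}\), \(n_{parity}\), \(n_{padding}\), \(n_{max}\), the allocation overhead, and the overall efficiency for a given parity level, stripe width, recordsize, and ashift; running it on the four example configurations reproduces the numbers worked through in this section.

# Compute ZFS allocation overhead and efficiency from the formulas above.
import math

def zfs_block_stats(parity, width, recordsize, ashift):
    sector = 2 ** ashift
    n_min = 1 + parity
    n_data = recordsize // sector
    n_parity = math.ceil(n_data / (width - parity)) * parity
    n_padding = math.ceil((n_data + n_parity) / n_min) * n_min - (n_data + n_parity)
    n_max = n_data + n_parity + n_padding
    n_theoretical = width * n_data / (width - parity)
    overhead = (n_max / n_theoretical - 1) * 100
    efficiency = n_data / n_max * 100
    return n_max, overhead, efficiency

KiB, MiB = 1024, 1024 ** 2
examples = [  # (parity, width, recordsize), all with ashift = 12 (4KiB sectors)
    (2, 6, 128 * KiB),
    (2, 8, 128 * KiB),
    (2, 6, 1 * MiB),
    (2, 8, 1 * MiB),
]
for i, (p, w, rs) in enumerate(examples, start=1):
    n_max, ovh, eff = zfs_block_stats(p, w, rs, ashift=12)
    print("Ex. {}: n_max = {} sectors, overhead = {:.3f}%, efficiency = {:.2f}%".format(
        i, n_max, ovh, eff))

The output matches the hand calculations: 48 sectors and 0% overhead for example 1, 45 sectors and ~5.5% overhead for example 2, and so on for examples 3 and 4 below.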

In the first example above, we calculated \(n_{max}\) as 48 sectors. For the same example,

$$ n_{theoretical}(Ex. 1) = 6 * \frac{32}{6-2} = 48 $$

$$ overhead(Ex. 1) = \left(\frac{48}{48}-1\right) * 100\% = 0\% $$

As we saw while working through the example, everything divided nicely, so we have no allocation overhead with this configuration.

In the second example, we calculated \(n_{max}\) as 45 sectors. From this, we can calculate the allocation overhead:

$$ n_{theoretical}(Ex. 2) = 8 * \frac{32}{8-2} = 42.6667 $$

$$ overhead(Ex. 2) = \left(\frac{45}{42.6667}-1\right) * 100\% = 5.469\% $$

Data written to the array in example 2 will take up ~5.5% more usable disk space than the same data written to the array in example one. Obviously, an overhead of this amount is undesirable in any system.

You may notice that in example one, it took 48 total sectors to store 128KiB of data while in example two it took only 45 total sectors to store the same 128KiB. This is because the overhead values calculated above do not account for the space consumed by parity data (except parity data written on partial stripes). As mentioned above, decreasing the parity level and/or increasing the stripe width will always increase overall storage efficiency, which is exactly what we are seeing here. For our purposes, we are looking at maximizing efficiency by decreasing overhead from data block padding and from parity data on partial stripes. If you wanted to factor the parity overhead of your configuration as well as the allocation overhead into an overall efficiency value, you could use the following:

$$ efficiency = \frac{n_{data}}{n_{max}} * 100\% $$

For our two examples:

$$ efficiency(Ex. 1) = \frac{32}{48} * 100\% = 66.67\% $$

$$ efficiency(Ex. 2) = \frac{32}{45} * 100\% = 71.11\% $$

In this comparison, example 2 stores its data more efficiently overall despite the ~5.5% allocation overhead calculated above. Our next steps will show how we can reduce this overhead value (thus increasing the overall efficiency) by adjusting the recordsize value.

Example three is a revisit of the first example (a 6-wide RAID-Z2 array with 4KiB sectors), but this time we will use a recordsize of 1MiB. 1MiB of data represents 1MiB/4KiB = 256 total sectors for user data. Since we’re still using RAID-Z2, we need 2 parity sectors per stripe, leaving 4 sectors per stripe for user data. 256 total sectors divided by 4 sectors per stripe gives us 64 stripes. 64 stripes * 2 parity sectors per stripe give us 128 total parity sectors. 128 parity sectors + 256 sectors give us 384 total sectors, which is divisible by 3, so no padding is needed. As before, everything divides nicely, so changing the recordsize value didn’t change the allocation overhead (which is still 0%) or the overall storage efficiency (still 66.67%):

$$ n_{theoretical}(Ex. 3) = 6 * \frac{256}{6-2} = 384 $$

$$ overhead(Ex. 3) = \left(\frac{384}{384}-1\right) * 100\% = 0\% $$

$$ efficiency(Ex. 3) = \frac{256}{384} * 100\% = 66.67\% $$

Example four will look at the 8-wide RAID-Z2 setup in example 2, but with a recordsize of 1MiB. Again, 1MiB of data represents 1MiB/4KiB = 256 total sectors for user data. We have 6 sectors per stripe for user data, so 256 total sectors divided by 6 sectors per stripe gives us 42.667 stripes. We end up with 42 full stripes and one partial stripe (but as before, all 43 stripes get full parity information). So we have 256 data sectors and 43*2 = 86 parity sectors giving us a total of 342 sectors. 342 is divisible by 3, so no padding sectors are required. The only allocation overhead we have is from the partial stripe parity data. Calculating the overhead, we find that:

$$ n_{theoretical}(Ex. 4) = 8 * \frac{256}{8-2} = 341.33 $$

$$ overhead(Ex. 4) = \left(\frac{342}{341.33}-1\right) * 100\% = 0.196\% $$

$$ efficiency(Ex. 4) = \frac{256}{342} * 100\% = 74.85\% $$

Compare that to the results from Example two:

$$ n_{theoretical}(Ex. 2) = 8 * \frac{32}{8-2} = 42.6667 $$

$$ overhead(Ex. 2) = \left(\frac{45}{42.6667}-1\right) * 100\% = 5.469\% $$

$$ efficiency(Ex. 2) = \frac{32}{45} * 100\% = 71.11\% $$

You’ll recall that the only difference between the configurations in examples two and four is the recordsize value. From these results, it is obvious that changing the recordsize from 128KiB to 1MiB in the 8-wide configuration reduced the allocation overhead, which in turn increased the overall storage efficiency of the configuration. You may wonder how such a substantial improvement was achieved when the only difference in overhead factors was the one padding sector in the 128KiB configuration (indeed, both configurations required extra parity data for a partial data stripe). It’s important to remember how much data we are storing per block in each configuration, as the overhead is “added” to each block; the first configuration required 3 overhead sectors per 128KiB of data stored, while the second configuration required 2 overhead sectors per 1MiB of data stored. The 3 overhead sectors per block in the 128KiB configuration get compounded very quickly when large amounts of data are written. It’s easy to see this effect visually by looking at the diagrams below. Overhead from padding sectors is highlighted in orange and overhead from partial stripe parity data is highlighted in red. The thick black lines separate the data blocks.

Examples One and Two:


Examples Three and Four:

Notice in these diagrams how changing the recordsize on the 6-wide array doesn’t impact allocation overhead; this is because the ZFS configuration aligns with the so-called \(2^n+p\) rule (which states that you should configure your vdev so its stripe width \(w\) is \(2^n+p\) for small-ish values of \(n\); for RAID-Z2, that means widths of 4, 6, 10, or 18 disks). Configurations that conform to this rule will always line up nicely with the default 128KiB recordsize and have an allocation overhead of 0%. If you’re not interested in fiddling with your dataset’s recordsize value, consider sticking with a configuration that conforms to this rule.

Examining how changing the recordsize value on the 8-wide array impacts the allocation overhead is worth a closer look. The figure below shows examples two and four side-by-side. In example two, notice how the overhead compounds much quicker for a given amount of user data than in example four. Also notice how many total stripes are required to store the given amount of user data in each configuration.

Examples Two and Four:

There are a couple final points I want to make on recordsize tuning before moving on. Determining the amount of data written to a disk by ZFS from a file size isn’t always easy because ZFS commonly employs data compression before making those writes. For example, if you’re hosting a database with 8KiB logical blocks, ZFS will likely be able to compress that 8KiB before it is written to disk. There are disadvantages to increasing recordsize to 1MiB in some applications that deal with only very small files (like databases). For my purposes of storing a lot of big files, setting recordsize to 1MiB is a no-brainer. In terms of tuning stripe width and parity level to optimize performance for your application, the articles linked below provide some excellent information.

Much of the above section is based on a calculation spreadsheet put together by /u/SirMaster on reddit. I also want to thank Timo Schrappe (twitter @trytuna) for catching a mistake in the above formula for \(n_{padding}\) as well as some typos! If you’re interested in getting an even deeper understanding of the inner mechanics of ZFS, I would encourage you to read the following three articles (all of which were tremendously helpful in writing this section):

Here are links to more general (but still very helpful) guides on tuning ZFS parameters:

If you're generally interested in the technical analysis of data systems, you may also enjoy reading through my R2-C2 page, which looks at maximizing the reliability of a RAID system from a purely statistical perspective.

Setting Up the Storage Volumes

[2018 Update:] I wrote a couple of blog posts for iXsystems that examine the mechanics of ZFS pool performance in various layouts. If you're unsure which pool layout would be ideal for your use case, you should check them out. Part 1 can be found here, and part 2 can be found here.

When I first set up my server, I didn’t fully understand recordsize tuning, so I created my main storage dataset with a recordsize of 128KiB (the default value). After doing more research, I realized my mistake and created a second dataset with recordsize set to 1MiB. I set up a second SMB share with this dataset and copied all my data from the 128KiB-based dataset to the 1MiB-based dataset; once all the data was moved over, I wiped the first dataset. The reduction in allocation overhead manifests itself on the share by a reduction in the reported “size on disk” value in Windows (where I have my share mounted). The reduction I saw when copying the same data from the 128KiB dataset to the 1MiB dataset was in line with the ~5% overhead reduction demonstrated above.

The proper way to set everything up would have been to first create a volume with the volume manager (close out of the wizard that pops up when you first log into FreeNAS). Volumes are traditionally named tank, but you can call yours whatever you want (the rest of this guide will assume you’ve named it tank). Make sure you have the volume layout correct in the volume manager before hitting “Ok”, otherwise you’ll have to destroy the volume and re-create it to change its layout. Once you create the volume, FreeNAS will create a top-level dataset inside that volume at /mnt/tank (assuming you’ve indeed named your volume tank). I recommend creating another dataset inside this one (named whatever you want). In this new dataset, I set the recordsize to 1024K as per the discussion above (hit “Advanced Mode” if you don’t see the recordsize option). Make sure compression is set to lz4 (typically its default value). I named my dataset britlib, but you can of course name yours as you please. If you’re interested in tuning the other dataset parameters, check the FreeNAS guide (linked below, or here for the version hosted on freenas.org) for details on what each item does; I left the rest of the parameters at their default values.
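
If you would rather check or set the recordsize from the shell instead of the web UI, the standard ZFS commands work fine over SSH; the examples below assume the tank/britlib names used in this section:

# show the current recordsize and compression settings on the dataset
zfs get recordsize,compression tank/britlib

# set a 1MiB recordsize (large records require the large_blocks pool feature);
# note this only affects data written after the change, which is why I ended up
# copying everything into a fresh dataset
zfs set recordsize=1M tank/britlib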

After I had my volume and dataset set up, I created a couple of user groups (one for primary users called nas and one for daemons/services called services) and some users. I set the user home folders to /mnt/tank/usr/<username>, but this is optional (I use my primary user’s home folder to store scripts and logs and stuff). Once you have a primary user account set up (other than root), you can go back and change the permissions on the dataset you created in the previous step so that the new user is the dataset owner:
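
If you prefer doing this part from the shell, the web UI change boils down to a recursive ownership change on the dataset’s mount point; the user name below is a placeholder, while the group and dataset names are the ones used in this guide:

# make your primary user (placeholder name) and the nas group the owners of the dataset
chown -R <your username>:nas /mnt/tank/britlib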

Further System Configuration

At this point, you have the basics set up, but there is still a lot to do. Most of the following items don’t require too much discussion, so I won’t go into as much depth as I did with previous topics. I recommend reading the relevant section from the FreeNAS user guide while going through these steps. You can access the user guide from your FreeNAS web UI by clicking “Guide” on the left-hand navigation pane (or by going to http://<FreeNAS server IP or host>/docs/freenas.html if you want it in a separate tab). Here was the general process that I took, but you don’t necessarily have to do these in order:

The next 3 items involve scheduling recurring tasks, some of which will impact overall system performance and can take 24+ hours to complete (depending on your pool size). For example, the main pool scrub and the long SMART test each typically take a long time and each slightly degrade system performance while they run. For that reason, I’ve scheduled them so they are never running at the same time. See the cron table below for an example of how you might balance the SMART tests and scrubs.

As I mentioned above, scheduling the SMART tests, scrubs, and email reports relative to each other is important. As an example, the table below shows what my cron schedule looks like. Each column is a different scheduled cron event, the rows represent the days of the month, and each cell has the time the event will run (in 24-hour format, so 00:00 is midnight, 06:00 is 6 AM). For those familiar with crontab basics, it's worth pointing out that you shouldn't edit the crontab directly as system reboots will reset it to whatever was set in the web UI.
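
If you ever want to kick one of these checks off by hand rather than waiting for the schedule, the underlying tools are all available from the FreeNAS shell (the /dev/da0 device name is just an example; your disks may be numbered differently):

smartctl -t long /dev/da0     # start a long SMART self-test on a single disk
smartctl -a /dev/da0          # view SMART attributes and past self-test results
zpool scrub tank              # kick off a scrub of the main pool
zpool status tank             # check scrub progress and overall pool health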

Setting Up SMB Sharing

With all the administrative and monitoring settings in place, I could move on to setting up some shares. This section will focus on SMB/CIFS-based shares because that’s what I use, but FreeNAS offers a wide variety of network file sharing protocols. On the subject of SMB/CIFS, Microsoft summarizes the common question “how are SMB and CIFS different” as follows: “The Server Message Block (SMB) Protocol is a network file sharing protocol, and as implemented in Microsoft Windows is known as Microsoft SMB Protocol. The set of message packets that defines a particular version of the protocol is called a dialect. The Common Internet File System (CIFS) Protocol is a dialect of SMB. Both SMB and CIFS are also available on VMS, several versions of Unix, and other operating systems.” The full article text is here. Samba, an open-source *nix SMB server, also comes up a lot in this context. It can do some other stuff too (related to Active Directory), but you won’t need to install or configure it yourself; FreeNAS’s built-in SMB sharing (which is itself based on Samba) supports several SMB “dialects” or versions (including CIFS).

Getting network file sharing fully configured can be a pain, mostly due to permissions configuration. Because I only work with SMB shares, I do all my permissions management from my primary Windows 10 machine. The Windows machines in my environment (all on Win10) connect over SMB protocol version 3.1.1 (listed as SMB3_11 in smbstatus); the *nix and OS X machines in my environment connect on SMB protocol version NT1. I’ll provide some basic examples from my configuration, but SMB sharing can get very tricky very fast. If you make your setup too complicated, it will become more of a pain than it’s worth, so be forewarned. If you find yourself at that point, take a step back and think through simpler ways to accomplish your goal.

The last 6 settings come from the FreeNAS forums post here and are all set to ‘no’ with the goal of speeding up SMB access (specifically, while browsing directories). The first two are ‘no’ by default, but I have them set explicitly. If you have legacy devices or applications that need to access your SMB shares, you may need to set these to ‘yes’, but doing so could cause a performance penalty. Setting all of these parameters to ‘no’ will prevent SMB from using extended attributes (EAs), tell SMB not to store the DOS attributes (any existing bits that are set are simply abandoned in place in the EAs), and will cause the four DOS parameter bits to be ignored by ZFS.
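
For reference, my best guess at the six auxiliary parameters being described here is the following set of smb.conf-style settings; treat this as an illustration and double-check the linked forum post before copying anything:

ea support = no
store dos attributes = no
map archive = no
map hidden = no
map readonly = no
map system = no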

Once I created the SMB share, I was able to mount it on another machine. In the following examples, I’ll show how to mount the share and manage permissions from a Windows 10 machine. To mount the share on Windows, open a new “My PC” window and click “Map network drive”. Select a drive letter and set the folder as


\\<server hostname or ip>\<smb share name>
	

Check “Reconnect at sign-in” if you want the share to be automatically mounted. You’ll likely also need to check “Connect using different credentials”. Once you hit “Finish” (assuming you checked “Connect using different credentials”), you’ll be prompted for connection credentials. Click “More choices”, “Use a different account”, set the username as \\<server hostname>\<the username you made in FreeNAS>, and enter your password. This will let you connect to the share with the credentials you created in FreeNAS rather than credentials stored on the Windows machine. If the username and password combination are exactly the same on FreeNAS and your Windows machine, sometimes you can get away with leaving the domain specification (the \\<server hostname>\ part) out of the username string, but it’s always best to be explicit.

Adjusting Permissions

With the share mounted, I could finally move some files in. As I mentioned before, everything that follows will be fairly specific to Windows 10, but you should be able to apply the same process to any modern Windows version. Once you have some data copied over, you can start adjusting the permissions on that data. Open the properties window for a directory and select the Security tab. The system will display a list of groups and user names and each of their respective permissions for this given directory (it may take a second to resolve the group IDs; be patient). You can adjust basic permissions by clicking the “Edit” button; a new window will pop up and you’ll be able to adjust or remove the permissions for each group or user and add new permission definitions. You may notice that the default set of permissions aren’t editable here; this is because they’re inherited from the parent folder (if the folder you’re looking at is in the share’s root directory, its permissions are inherited from the share itself; to adjust those permissions, open the properties window for the mounted share from the “My Computer” window and adjust its settings in the Security tab).

To adjust permissions inheritance settings for a file or folder (collectively referred to as an “object”), click the Advanced button in the Security tab of the object’s properties window. In this new window (referred to as the “Advanced Security Settings” window) you can see where each entry on the permission list (or “Access Control List”, ACL) is inherited from or if it is defined for that specific object. If you want to disable inheritance for a given folder, you can do so by clicking the “Disable inheritance” button on this window; you’ll then be able to define a unique set of permissions for that object that might be totally different from its parent object permissions. You can also control the permissions for all of this object’s children by clicking the check box “Replace all child object permissions...” at the bottom of the window. We’ll go through the process of adding a read/execute-only ACL entry for the services group to a given folder.

Open the Advanced Security Settings window for the folder you would like to allow the services group to access (Read/Execute only), click the Add button, click Select a principal at the top of the window (“principal” means user or group), type in services (or whatever user or group you want) and click Check Names. It should find the services group and resolve the entry (if it doesn’t, make sure you’ve actually added a services group in the FreeNAS web UI settings). You can adjust the “Type” and “Applies to” parameters if you like (each option is pretty self-explanatory), but I’m going to assume you’ve left them as the default values. Click “Show advanced permissions” on the right side of the window to view a full list of the very granular permissions that Windows offers. Each of these permission options is also pretty self-explanatory, and most of the time you can get away with using just basic permissions (meaning you don’t click this “Show advanced permissions” button). For read/execute only, you’ll want to select the following advanced permissions:

If you click “Show basic permissions”, you will be able to see that this set of selections will translate to:

You can leave the “Only apply these permissions...” check box unchecked. Go ahead and hit OK to be brought back to the Advanced Security Settings window where you’ll see your new ACL entry added to the list. It’s probably a good idea to check the “Replace all child object permission entries...” box to make sure everything within this folder gets the same set of permissions, but that’s obviously your choice. If you want to add or adjust other permissions, go ahead and do that now. When you’re happy with the settings, hit OK on the Advanced Security Settings window, hit OK on the folder properties window, and wait for it to go through and apply all the permission changes you just made. With the services group granted read/execute access to this folder, you should now be able to connect to it from another device (like a VM, as shown below) via any user in the services group. Once I had all my data moved into my SMB share, I went through and adjusted the permissions as needed by repeating the steps I outlined above.

I tend to prefer the Advanced Security Settings window (as opposed to the window you get when you hit the “Edit...” button in the Security tab) so I can make sure the settings are applied to all child objects, and the Advanced Security Settings window really isn’t any more difficult to use than the standard settings window. For more info on how to set up SMB share permissions, watch these videos in the FreeNAS resources section.

One final note here before moving on: if you want to grant a user permissions (whether it be read, execute, or write) to access some file or folder deep in your share’s directory structure, that user will also need at least read permissions for every parent folder in that structure. For example, if you want to grant the user “www” permission to access the directory “httpd” at location //SERVER/share/services/hosting/apache24/httpd, the user “www” will need to have read permission for:

...or else he won’t be able to access the “httpd” folder. In this scenario, you can see how useful automatic inheritance configuration can be.

Setting Up iSCSI Sharing [2018 Update]

iSCSI is a block-level sharing protocol that serves up a chunk of raw disk space. We can mount this raw disk space on the client and format it with whatever file system we want; the client sees it as a physical disk connected directly to the system. Block-level sharing protocols like iSCSI and Fibre Channel differ from file-level sharing protocols like SMB/CIFS, NFS, and AFP in that the server has no notion or understanding of the file system or files that the client writes to the disks. Because block-level protocols operate at a lower level, they tend to perform much better than file-level protocols. Because the server has no concept of the file system on the share, it doesn't have the ability to lock in-use files and prevent simultaneous conflicting editing like SMB can do. For that reason, iSCSI shares are typically mounted on only a single machine. Because it has low protocol overhead and good performance, iSCSI is a very popular choice for serving up storage to virtual machines. You should strongly consider using iSCSI over SMB or NFS for applications that will be very sensitive to latency, IOPS, and overall storage throughput.

Before we dive into configuration, it's worth going over some iSCSI nomenclature. iSCSI is a protocol for sending SCSI (pronounced "scuzzy") commands over an IP network. The iSCSI host creates a target that a client can mount. A target is a block of disk space that the host sets aside. A target could be the entire data pool, or a small chunk of it. A single iSCSI host could have many targets on it. The target refers to its storage via a Logical Unit Number or LUN. For this reason, you'll sometimes hear 'target' and 'LUN' used interchangeably. The iSCSI client is referred to as an initiator (or more specifically, the software or hardware they're using to connect to the target is the initiator). A client connects to the iSCSI storage via the server's portal. The server's portal specifies the IP and port the service will listen on as well as any user authentication used. FreeNAS adds an extra step called an extent that configures a zvol or a file on a file system as something that can be served up by iSCSI.

On ZFS, we set aside a chunk of disk space using a zvol. A zvol can be created as a child of a file system or clone dataset; a zvol cannot be created as the root dataset in a pool. In the example below, I walk through creating a zvol on my primary storage pool and then serving up that zvol via iSCSI. We'll set up an iSCSI portal, then configure a set of allowed initiators, then create a target, then configure an extent, and finally we'll map that extent to the target we created. I'll also show how to mount the iSCSI share on Windows (i.e., configure the initiator) where I'll use it to store my Steam library.
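
For those who prefer the shell, the zvol step described next can also be done with a single zfs command; here is a hypothetical example (the tank/steam name and the sizes are placeholders, not values from my actual setup):

# create a sparse 500 GiB zvol named "steam" under the tank pool;
# volblocksize is fixed at creation time and cannot be changed later
zfs create -s -V 500G -o volblocksize=16K tank/steam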

In FreeNAS, we can add a zvol by going into the storage manager, selecting the dataset we want to contain the zvol, and clicking the 'Create zvol' button at the bottom of the window. Here is an explanation of the options on the 'Create zvol' window:

With the zvol successfully created, we can start working on the iSCSI setup. Start by clicking Sharing > Block (iSCSI) > Target Global Configuration. You can either leave the default base name (iqn.2005-10.org.freenas.ctl) or specify your own. Technically, you can put whatever you want here, but if you want to comply with the various iSCSI RFCs, you should use the following format: iqn.yyyy-mm.domain:unique-name where domain is the reverse syntax of the domain on which the iSCSI host resides, yyyy-mm is the date that domain was registered, and unique-name is some unique identifier for the iSCSI host. For example, if I host my iSCSI server on iscsi.jro.io and I registered that domain in August of 2015, I should use iqn.2015-08.io.jro.iscsi:freenas as my base name. I don't use that as my base name because all of this seems very silly to me, but I'm sure that someone somewhere at some point had a seemingly valid reason for writing this specification. In any case, we press onwards... You can leave the ISNS Servers and "Pool Available Space Threshold (%)" fields blank. The threshold field can be used to trigger a warning if the pool holding your zvol extent gets too full. Note the pool will always trigger a warning at 80% regardless of what you put here, so it's not really necessary.

The next step is to create an iSCSI portal. Expand the 'Portals' menu and click 'Add Portal'. Add a comment to your portal to easily identify it later. I left the authentication options set to 'None' because my system is on a private home network, but if you're using iSCSI in a different type of network, you may need to do some research on how to configure authentication. Select your server's IP address from the IP drop-down menu, leave the port on its default value (3260), and click 'OK'.

Next, expand the 'Initiators' menu and click 'Add Initiator'. If you want to limit access to your iSCSI share to specific initiators or IPs/networks, you can do so here. I left both values as 'ALL', put in a descriptive comment, and hit 'OK'.

After your portal and set of allowed initiators is configured, you can configure your target. Expand the 'Targets' menu and click 'Add Target'. Enter a descriptive name and alias (they can be the same) and select the portal and initiator you just created. As above, I left the authentication-related fields as 'None'. Click 'OK' when you're done.

Now we can create an extent. Expand the 'Extents' menu and click 'Add Extent'. Enter a descriptive name and make sure 'Device' is selected for 'Extent Type'. You can create an extent (and thus an iSCSI target) backed by a file on a file system rather than a zvol, but I'm not aware of any use case where this would be advantageous. Select the zvol you created in the 'Device' drop down and leave the serial number default. Some iSCSI initiators (including the one on Xen Server) have trouble mounting iSCSI targets if the logical block size isn't 512 bytes. For that reason, selecting '512' is recommended for the 'Logical Block Size' field for compatibility, but you can select '4096' if you know your initiator can support it. I went with '4096'. The rest of the fields on this menu can be left alone in most cases. Mouse over the 'i' icons for more information on them if you're curious what they're for. Click 'OK' when you're done.

Next, we'll map the extent to the target we created. Expand the 'Targets / Extents' menu and click 'Add Target / Extent'. Select the target and extent you created, leave the 'LUN ID' field set as 0 and click 'OK'.

Finally, the iSCSI service must be enabled in the Services > Control Services menu. You should also check the 'Start on boot' box. Now we can move over to our Windows machine and get the iSCSI share mounted!

On Windows, you will need to start the iSCSI Initiator program from Microsoft. If your start menu search doesn't pull up any results, you can download and install the software from here. In the initiator program, click the 'Targets' tab, then enter the IP of your server in the 'Target:" field at the top and hit 'Quick Connect'. The program will pop up a new window with the targets it discovered at that address. Select the target you created and click 'Connect'. (Note, if you enabled user authentication, you'll get an error here. You will have to use the 'Connect' button towards the bottom of the main window and click the 'Advanced...' button to enter access credentials.) On the main window, it should list the base-name of your target and specify 'Connected' in the status column. Click 'OK' then start up the Windows Disk Management program. From here, you can initialize and format the disk just like you would any physical SATA disk in Windows. Once it's formatted, it will show up on your system and you can start copying data to it. On the FreeNAS side, you can still run snapshots on the zvol, but you won't be able to access the files directly without mounting it.

As I noted above, you can create multiple iSCSI targets on a single server. If you want to add more iSCSI targets, start by creating another zvol, then skip directly to the target configuration steps. Set up a target, an extent, and then map that extent to your target. From there, you can mount the new target on your client.

Performing Initial bhyve Configuration

Running virtual machines on a storage system is kind of a controversial subject (as you’ll quickly discover if you ask anything about running a bhyve in #freenas or the forums). In a business environment, it’s probably a good idea to have a dedicated VM host machine, but for personal use, I don’t see it as a huge risk. The VM manager (also called a “hypervisor”) I use is called bhyve (pronounced “beehive”, super-clever developers...). More information on bhyve can be found here. It’s native on FreeNAS 9.10+ and setting it up and managing bhyve VMs (simply called “bhyves”) is very easy. There’s a great video on the basics of bhyve setup here (from which I am going to shamelessly copy the following steps).

Before we get started, make sure you know the name of your pool (called “tank” if you’re following this guide verbatim) and the name of your primary network interface (which you can find by going to the web UI and looking at Network > Network Summary; mine is igb0, highlighted in yellow below).

Once you’ve got that information, SSH into your server and run the following command as root to set up bhyve (replacing the <pool name> and <network interface> parts, obviously):


iohyve setup pool=<pool name> kmod=1 net=<network interface>
	

It will return some information to let you know it’s created a new dataset on your pool (to house VM data) and set up a bridge between the provided network interface and the virtual interface that your VMs will use. The kmod=1 flag tells bhyve to automatically load the required kernel modules. This program iohyve will be what you use to manage all your bhyve VMs. You can run iohyve (with no arguments) to see a summary of all available commands.

After you run the above command, go into the FreeNAS web UI and go to System > Tunables > View Tunables. You’ll need to add two new tunables which will ensure that the bhyve settings you just configured above are re-applied when FreeNAS reboots. Click the Add Tunable button and enter the following settings:

Click OK and then add a second tunable with the following settings (make sure to change the network interface value):

Click OK and you’re all set; you’re now ready to install some bhyve VMs!
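
For reference, the two tunables are typically rc.conf-style entries along the lines shown below; substitute your own network interface for igb0, match whatever flags you passed to iohyve setup, and double-check against the video linked above:

# Tunable 1 -- Variable: iohyve_enable, Value: YES, Type: rc.conf
iohyve_enable="YES"

# Tunable 2 -- Variable: iohyve_flags, Value: kmod=1 net=igb0, Type: rc.conf
iohyve_flags="kmod=1 net=igb0"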

Creating a bhyve VM

Before we get into installing a bhyve, it will be useful to list out some of the more commonly used iohyve commands (most of which need to be run as root):
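
The descriptions below are paraphrased from iohyve’s own help output rather than being an exhaustive list; run iohyve with no arguments for the authoritative version:

iohyve list                         # show all guests and their status
iohyve isolist                      # show downloaded installation ISOs
iohyve create <name> <size>         # create a new guest with a disk of the given size
iohyve set <name> <property=value>  # set guest properties (RAM, CPU count, OS, boot flag, etc.)
iohyve install <name> <ISO>         # boot a guest from an ISO to install its OS
iohyve start <name>                 # start a guest
iohyve stop <name>                  # gracefully shut a guest down
iohyve console <name>               # attach to a guest's serial console
iohyve destroy <name>               # forcefully power a guest off
iohyve delete <name>                # delete a guest and its disk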

I’ll go through a basic example of installing a Debian bhyve guest (or VM; I’ll use the terms “bhyve”, “guest”, and “VM” interchangeably in this section) and mounting shares from your NAS on the VM so it can access data. The first thing you will want to do is download an ISO with the iohyve fetch command. Note that this is the only (simple) way to use a given ISO to install an OS. I use the Debian amd64 network installation ISO, which you can find here. Don’t download the ISO in your browser, but rather copy the URL for that ISO (for amd64 Debian 8.7.1, it’s at this link) and run the following command on your FreeNAS machine as root:


iohyve fetchiso <paste URL to ISO>
	

Wait for it to download the ISO file from the provided link. When it’s done, we can create the VM. For this example, I’m going to create a guest named “acd” (Amazon Cloud Drive) which we’ll use later to set up rclone for full system data backups. I’ll give it 5GB of disk space, 2 CPU threads, and 2GB of RAM. You can change the name, CPU threads, or RAM values later, but note that changing the disk space of the guest later on can cause issues (even though there is an iohyve command for it; check iohyve man page). When you’re ready, run the following commands:


iohyve create acd 5G
iohyve set acd ram=2G cpu=2 os=debian loader=grub-bhyve boot=1
	

The first command will create a new bhyve guest called acd with a 5GB disk. The second command will set the listed properties for that bhyve (2GB RAM, 2 CPU threads, debian-based OS, GRUB bootloader, auto-boot enabled). The next step is to install the Debian ISO on this bhyve guest. Get the name of the ISO file by running the following:


iohyve isolist
	

Copy the name of the listed Debian iso, then run the following:


iohyve install acd <paste ISO name>
	

The console will appear to hang, but don’t panic! As the terminal output message will tell you, GRUB can’t run in the background, so you need to open a second SSH session with your FreeNAS machine. Once you’re in (again) and have root, run the following command in your second terminal session to connect to the acd console:


iohyve console acd
	

This will drop you into the console for your new VM (you may have to hit Enter a few times) where you can go through the Debian installation. Follow the instructions, selecting a root password, new user (for this one, I’d suggest “acd”), and hostname when prompted. Make sure that when you get to the package selection screen, you unselect all desktop environment options and select the SSH server option. Other than that, the Debian installation process is pretty easy. When you’ve finished, the VM will shut itself down and you can close out of this second SSH window. If you ever have to use iohyve console for other purposes, you can exit it by typing ~~. or ~ Ctrl+D.

Back in your first SSH session, the terminal should be responsive again (you may have a few errors saying stuff about keyboard and mouse input but you can safely ignore those). Run the following command to start the bhyve VM back up:


iohyve start acd
	

While you’re waiting for it to boot back up, take a moment to create a new user in the FreeNAS web UI (Account > Users > Add User). Give it whatever user ID you want, but make sure the username and password are exactly the same as the user you created on your bhyve VM. I would also suggest unchecking “Create a new primary group for the user” and selecting the “services” group you created above as this user’s primary group.

By now, the bhyve VM should be fully booted (typically it only takes 15-30 seconds), so SSH into this new VM with the non-root user account you created; you may need to look at your router’s DHCP tables to figure out its assigned IP address. Once you’re in, you’ll want to run su to get root then update software through apt-get or aptitude and install any standard programs you like (like sudo, htop, ntp, and whatever else you might need). Once sudo is installed and configured (if needed, use Google for help), exit back out to your primary user. The next step will be to mount your SMB share from FreeNAS on your bhyve VM. Most of the following steps are based on this guide from the Ubuntu wiki and a couple folks from the FreeNAS forums (thanks Ericloewe and anodos!).
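
As a concrete example of that first-boot housekeeping (the package list is just the one mentioned above, and acd is the example user from earlier):

su -                                  # become root; sudo isn't installed yet
apt-get update && apt-get upgrade     # refresh package lists and pull in updates
apt-get install sudo htop ntp         # the basics mentioned above
usermod -aG sudo acd                  # let the acd user run commands via sudo
exit                                  # drop back to the regular user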

The first thing you need to do is install the cifs-utils package by running:


sudo apt-get install cifs-utils
	

I usually mount my shares in the /media directory, so go ahead and create a new directory for your share (I’ll use mountdir in this example, but you can call it whatever you want):


sudo mkdir /media/mountdir
	

Next, you will want to create a text file with the login credentials for your VM user. Run the following command to create and open a new text file in your user’s home directory (I use nano here, but use whatever editor you like):


nano ~/.smbcredentials
	

In this file, you will want to enter the following two lines of text. I’ll use “acd” as the username and “hunter2” as the password for the example, but obviously change the text in your credentials file. Make sure it’s formatted exactly as shown; no spaces before or after the equal signs:


username=acd
password=hunter2
	

Save and exit (for nano, Ctrl+O to save, then Ctrl+X to exit) then change the permissions on this file:


chmod 600 ~/.smbcredentials
	

Next, you’ll want to run the following command to edit the fstab file (the file system table) on your bhyve with root privileges:


sudo nano /etc/fstab
	

Add the following line at the bottom of the file, making sure to replace the <server name>, <share name>, and <user name> placeholders with the appropriate values for your system (obviously, leave out the <>; if you named your mount point in the /media directory something different, make sure to change that, too):


//<server name>/<share name>	/media/mountdir	cifs	uid=<user name>,credentials=/home/<user name>/.smbcredentials,iocharset=utf8,sec=ntlmssp	0	0
	

Sorry about the super-long, table-breaking statement; there's probably a way to split the above into two shorter lines, but whatever... Save and exit (for nano, Ctrl+O to save, then Ctrl+X to exit). Once you’re back at the command line, run the following to attempt to mount the share:


sudo mount -a
	

If it goes through, try to access the share and list its contents:


cd /media/mountdir
ls
	

If it prints out the contents of your share, you’re all set! If it throws an error, check the permissions on your share, check that the credentials are entered correctly on FreeNAS and in the ~/.smbcredentials file, and check that the VM can resolve the server name to the correct IP (if not, you may have to enter the IP in the mount string you wrote in the fstab file). Mounting shares and getting their permissions set up right can be extremely finicky, so anticipate at least a few issues here.

You can mount more than one share (or multiple points from a single share) by entering more than one line in the fstab. For example, if you wanted to mount //SERVER/share/photos and //SERVER/share/documents, you would enter both those lines in /etc/fstab:


//SERVER/share/photos       /media/photos       cifs    uid=user,credentials=/home/user/.smbcredentials,iocharset=utf8,sec=ntlmssp 0   0
//SERVER/share/documents    /media/documents    cifs    uid=user,credentials=/home/user/.smbcredentials,iocharset=utf8,sec=ntlmssp 0   0
	

Remember to create the /media/photos and /media/documents directories beforehand (otherwise you’ll get an error when you run the mount -a command).

Once the share is mounted, you’ll be able to access it in the bhyve’s file system as normal. If your user only has read permissions, you’ll obviously get an error if you attempt to modify anything.

Configuring rclone in a bhyve

[8/7/17 Note] Amazon Cloud Drive has banned the rclone API key, effectively breaking rclone's support for ACD. People on the rclone forums have posted workarounds, but I haven't tried any of them. The information below is still applicable to setting rclone up with any other remote. Check the rclone docs for detailed instructions on your specific remote.

The last topic I want to cover is the installation and configuration of rclone, which will help keep your data backed up in an Amazon Cloud Drive (ACD). rclone also allows you to encrypt all your data before it’s sent to ACD, so you don’t have to worry about Amazon or the Stasi snooping in on your stuff. ACD is a paid $60/yr service through Amazon.com that offers unlimited data storage, and unlike services like Backblaze and CrashPlan, you can get great upload and download speeds to and from their backup servers. rclone is a program that will connect with your ACD instance (via Amazon-provided APIs), encrypt all your data, and synchronize it with the backup servers. rclone is still in active development, so it can be a bit finicky at times, but hopefully this guide will help you get through all that.

Before we dive in, a quick word on the backup services market. If you don’t want to pay $60/yr, I would understand, but I would still strongly recommend some sort of backup mechanism for your data. For larger amounts of data, services that charge per GB can get very expensive very quickly, so I would recommend a service with unlimited storage. Other than ACD, the two best options are Backblaze and CrashPlan, both of which I used for at least several months (CrashPlan for a couple years). My primary issue with Backblaze was the upload speed; even after working with their support team, I was only able to get upload speeds of 50-100KB/s. If I only wanted to back up my most important ~2TB of data, at 100KB/s it would take nearly a year to get everything copied to their servers. I also used CrashPlan for about 2 years before building my NAS. The upload speeds were slightly faster than Backblaze (I was able to get ~1MB/s), but still not great. My biggest issue is the backup client’s huge memory consumption. The Java-based CrashPlan client consumes 1GB of RAM per 1TB of data you need to back up, and this memory is fully committed while the client is running. For a large backup size, this is obviously unacceptable. The client itself is also a bit finicky. For example, if you want to back up more than 1TB, you have to manually increase the amount of memory the client can use by accessing a hidden command line interface in the GUI. The final nail in the coffin of CrashPlan and Backblaze (at least for me) is the fact that they are both significantly more expensive than ACD. ACD is not without its issues, as we’ll see in the subsequent sections, but it seems to be the best of all the not-so-great options (granted, at a few dollars a month for unlimited data storage, expectations can’t be all that high).

Of course the first thing you’ll need to do is sign up for ACD, which you can do here. You get 3 months free when you sign up, so you have plenty of time to make sure the service will work for you. (Note that the Prime Photos service is not what you’re looking for; that only works for pictures.) Don’t worry about downloading the Amazon-provided sync client as we will be using rclone as our sync client. The instructions for setting up rclone are based on a guide (originally posted on reddit) which can be found here.

Start by SSHing into the bhyve VM you created in the previous step. You’ll want to make sure sudo and ntp are installed and configured. Run the following commands to download (via wget) rclone, unpack it, copy it to the correct location, change its permissions, and install its man pages:


wget http://downloads.rclone.org/rclone-current-linux-amd64.zip
unzip rclone-current-linux-amd64.zip
cd rclone-*-linux-amd64

sudo cp rclone /usr/sbin/
sudo chown root:root /usr/sbin/rclone
sudo chmod 755 /usr/sbin/rclone

sudo mkdir -p /usr/local/share/man/man1
sudo cp rclone.1 /usr/local/share/man/man1/
sudo mandb
	

The official rclone documentation recommends placing the rclone binary in /usr/sbin, but by default, the /usr/sbin directory isn’t in non-root users’ path variable (meaning a normal user can’t just run the command rclone and get a result, you would have to either run sudo rclone or /usr/sbin/rclone; more information on /usr/sbin here). You can either choose to run rclone as root (sudo rclone or su then rclone), type out the full path to the binary (/usr/sbin/rclone), or add /usr/sbin to your user’s path variable. I got tired of typing out the full path and didn’t want to have rclone running as root, so I added it to my path variable. You can do this by editing the ~/.profile file and adding the following line to the end:


export PATH=$PATH:/usr/sbin
	

This probably isn’t within the set of Linux best practices, but this user’s sole purpose is to run rclone, so I don’t see a huge issue with it.

[Updated 8/17/17]The next step requires you to (among other things) authorize rclone to access your ACD via OAuth. OAuth requires a web browser, but if you select the correct option during the setup, the rclone config script will give you a URL you can access on your desktop rather than having it try to open a browser on the server. To start the process, run the following command in your rclone machine's terminal:


rclone config
	

You should see the rclone configuration menu. Press n to create a new remote and name it; I named mine acd, which is what I’ll use in this guide. On the provider selection section, choose Amazon Drive (which should be number 1 on the list). You can leave client_id and client_secret blank. When prompted to use the auto config, say no. Follow the URL and you’ll be prompted for your Amazon login credentials then asked if you want to trust the rclone application (say “yes”). The website might prompt you for a string of characters; copy them from the rclone terminal and it should advance automatically to the next section. If everything looks ok, enter y to confirm and you’ll be brought back to the main rclone config menu where you can type q to quit.

The process for setting up encryption for your ACD remote connection is a little counter-intuitive, but bear with me; this is the official (and only) way to do it. It will initially appear that you’re creating a second remote connection, but that’s just the process for configuring encryption on top of an existing remote connection.

Back in the SSH session with your acd bhyve, run rclone config again. At the menu, type n to create a new remote, and name this new remote something different than your previous remote. For this example, I’ll use acdCrypt as the name for the encrypted version of the acd remote. On the provider selection screen, pick Encrypt/Decrypt a remote (which should be number 5). You’ll be prompted to enter the name of the remote you want to encrypt; if you named your previous remote “acd”, then just enter acd: (include the colon on the end). When prompted to choose how to encrypt the filenames, enter 2 to select “Standard”. You’ll then be prompted to pick a password for encryption and another password for the salt. I recommend letting rclone generate a 1024 bit password for both items; just make sure to copy both of them somewhere safe (I copied them to a text file on my desktop, archived the text file in a password-protected RAR archive, and uploaded the RAR file to my Google Drive). After you’re done with the passwords, enter y to confirm your settings and then q to exit the rclone configuration menu.

rclone should now be configured and ready to use, but before you start your first backup, it’s a good idea to configure rclone to run as a service so it automatically starts up on boot. We’ll do this by creating a systemd unit file for rclone. The guide I followed for this process can be found here.

Before we create the service itself, run the following to create an empty text file in your user’s home directory (which we’ll need later on):


touch ~/acd_exclude
	

Start by creating a new service file and setting its permissions:


sudo touch /etc/systemd/system/acd-backup.service
sudo chmod 664 /etc/systemd/system/acd-backup.service
	

Open this new service file in a text editor, paste the following text into the file, then save and exit (Ctrl+O, Ctrl+X in nano; be sure to edit your share’s mount directory and the path for the log file):


[Unit]
Description=rclone ACD data backup
After=network.target

[Service]
Type=simple
User=acd
ExecStartPre=/bin/sleep 10
ExecStart=/usr/sbin/rclone sync /media/mountdir acdCrypt: \
	--exclude-from /home/acd/acd_exclude \
	--transfers=3 \
	--size-only \
	--low-level-retries 10 \
	--retries 5 \
	--bwlimit "08:30,10M 00:30,off" \
	--acd-upload-wait-per-gb 5m \
	--log-file <path to log file> \
	--log-level INFO

[Install]
WantedBy=multi-user.target
	

You’ll likely want to tune the parameters passed to rclone for your own application, but this should be a good starting point for most people. Full documentation on all commands and parameters is available on the rclone website here. Here is a quick explanation of each of the parameters I set above (note the \ characters allow the lengthy command string to span multiple lines):

I also have the service set to sleep 10 seconds before starting rclone to make sure the SMB share has time to mount. I would highly recommend reading through the rclone documentation (linked above) to figure out which settings would be appropriate for your use case. My filter file (acd_exclude) includes a list of directories and files I want rclone to ignore. Once you’ve got everything set in the acd-backup.service file, run the following command to enable the service so it runs on system start:


sudo systemctl enable acd-backup.service
	

After that, you can tell systemd to reload its daemons (you’ll need to run this command again any time you make changes to the acd-backup.service file):


sudo systemctl daemon-reload
	

You can start your service with the following command:


sudo systemctl start acd-backup.service
	

If you ever need to stop the service, you can run the following:


sudo systemctl stop acd-backup.service
	

Note that even though you stop the service, it may not terminate the rclone process; run htop to check and terminate any running processes to completely stop everything (useful if you want to update the parameters rclone is using via the service file).

You can follow along with rclone’s progress by viewing the log file (in the location you specified in the acd-backup.service file). You can also use the following commands to see a summary of what’s been uploaded:
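
For example, rclone’s size and lsd subcommands (both standard rclone commands) give a quick overview when pointed at the encrypted remote configured above:

rclone size acdCrypt:    # total object count and total size stored on the remote
rclone lsd acdCrypt:     # list the top-level directories on the remote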

You’ll have to use the log file and these two commands to view the progress of an encrypted upload; if you try to view your files on the ACD website (or using the mobile app), all the filenames will appear as garbled text.

The final thing you may consider doing is adding an entry in the root user’s crontab to restart the rclone service should it ever fail or exit. You can do this by running the following:


sudo crontab -e
	

Add the following line to the end of the file:


0 * * * * /bin/systemctl start acd-backup
	

Save and exit (Ctrl+O, Ctrl+X) and you’re all set. This will tell the system to start the acd-backup service at 60-minute intervals; if the service is already started, no action will be taken. If the service has stopped, it will automatically be restarted. As I noted above, ACD can be finicky sometimes, so some upload errors (particularly for larger files) are normal. With this cron statement, rclone should automatically retry those uploads after it’s finished its initial pass on your share (rclone is set to terminate after it finishes a full pass; this cron statement will re-invoke it, causing it to check the remote against your share and sync any changes).

Expanding Beyond a Single Chassis [2019 Update]

When I originally designed and built this system, I never expected to outgrow 100TB of storage. Despite that, data seems to have a way of expanding to occupy whatever space it's in, and so in late 2018, the time for a system expansion finally arrived. Unfortunately, I had used all 24 drive bays in my chassis (including the internal spots for SSDs). So what does one do in this situation? Did I need to simply build a second full FreeNAS system and somehow cluster them together to provide one logical storage volume? That was my original thought when I first built the system. We hear about "clusters" in enterprise computing all the time, so clearly that was the answer.

As I've learned since I originally built this server, ZFS is not actually a clusterable file system, meaning it isn't capable of being distributed amongst an arbitrary number of storage nodes (or at least not natively). Examples of distributed file systems include DFS, Ceph, and Gluster. These file systems are said to "scale out", while ZFS "scales up". "Scale up" just means you attach additional disks directly to your system. But this brings me back to the previous question-- what am I supposed to do when my chassis is already full?

As it turns out, the answer is to add an expansion shelf! This is also sometimes called a JBOD (short for "just a bunch of disks"). It’s essentially another chassis with a bunch of drive bays and its own power supply. It doesn't need its own motherboard or CPU or memory or NIC or HBA or any of that other expensive stuff. You can connect the expansion shelf to the main chassis (or "head unit") with a host bus adapter ("HBA") that has external SAS ports. These external ports are functionally (and I believe electrically) identical to the internal ports on a normal SAS HBA except that they are located on the card's rear I/O shield rather than inside the chassis. You use an external SAS cable that has extra shielding and comes in lengths of up to 10 meters to attach the expansion shelf to this external HBA. The cable runs from the SAS port on the external HBA in the head unit to a passive external-to-internal SAS adapter mounted in a PCI bracket on the shelf and then from the adapter to the backplane in the shelf via an internal SAS cable. The drives in the shelf are powered by the shelf's own PSUs, so extra power load on the head unit isn't an issue. The FreeNAS system sees the drives in the shelf as if they were installed right inside the primary chassis with no extra configuration required.

While the basic idea is pretty straightforward, the execution of this expansion can be a bit tricky. I ran into some problems I had to solve before I could get the system running the way I wanted, namely:

On top of these three problems, I also needed to consider how to design things so I could scale beyond a single shelf down the road. I’d like to be able to support 5+ shelves if I need to.

PCIe Slot Count Issue

We'll start with the issue of PCIe slots and SAS connections. While adding another 3 PCIe HBAs to support 24 more drives might be possible on some systems, it isn't really practical to scale things beyond a single shelf in this manner. I might have been able to find a CPU and motherboard that support 6 PCIe x8 slots, but if I ever managed to outgrow the single expansion shelf, I would likely have a hard time finding a system with 9 PCIe x8 slots, never mind 12 or 16. I obviously needed to consolidate my PCIe cards. There were a couple ways of doing this. First, I could get an HBA that has more than 2 SAS ports per card (they make them with up to 6 per card). And second, I could use a backplane with a SAS expander (which works sort of like a network switch but for SAS devices). With an expander, I could connect all the drives over a single SAS cable instead of 6 cables.

Let's first take a step back and examine the capabilities of SAS cables. Each SAS cable carries 4 SAS channels. On SAS version 2 (which is what my current HBA uses), each SAS channel provides 6 gigabits per second of bandwidth, giving each SAS cable a total bandwidth of 24 gigabits per second. SAS version 3 offers 12 gigabits per second per channel or 48 gigabits per second on each cable. SAS 3 also implements a feature called DataBolt that automatically buffers or aggregates the 6 gigabit data streams from multiple SAS 2 or SATA 3 devices and presumably bolts those streams together (hence data bolt, I guess?). Anyway, it somehow glues the data streams together to let those older devices take advantage of the increased bandwidth offered by the SAS 3 cable rather than having those devices simply run at the slower SAS 2/SATA 3 speed.

I’m using SATA 3 drives in my system because they’re a bit cheaper than SAS drives, but thankfully SATA drives are compatible with SAS connections. The SATA protocol data is carried through the SAS cables via “SATA tunneling protocol” or STP. SATA 3 offers the same 6 Gb/s bandwidth as SAS 2.

All of this basically means that 24 SATA 3 drives connected to a SAS 2 expander backplane and then to a SAS 2 HBA via a single SAS 2 cable will see about 24 gigabits per second of bandwidth. 24 drives and 24 gigabit of bandwidth... that gives you 1 gigabit per second per drive. Note that's bits per second, not bytes. You would get about 125 megabytes per second on each drive. SAS uses 8b/10b encoding which has a 20% overhead, so you'll see closer to 100 megabytes per second per drive. That's starting to come a bit close to a bottleneck which is something that I really wanted to avoid. SAS3, on the other hand, with its 48 gigabits per second bandwidth and its fancy DataBolt technology would support about 200 megabytes per second per drive for 24 drives connected via a single SAS3 cable. That's not too bad.
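
To make that arithmetic explicit, here is the same back-of-the-envelope math as a pair of quick awk one-liners (the 0.8 factor accounts for the 8b/10b line coding overhead):

# SAS2: 4 lanes x 6 Gb/s, shared by 24 drives, minus 8b/10b overhead
awk 'BEGIN { printf "%.0f MB/s per drive\n", 4 * 6000 / 24 / 8 * 0.8 }'    # ~100 MB/s
# SAS3: 4 lanes x 12 Gb/s, same 24 drives and coding overhead
awk 'BEGIN { printf "%.0f MB/s per drive\n", 4 * 12000 / 24 / 8 * 0.8 }'   # ~200 MB/s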

Based on all of this, I ended up using a SAS3 expander backplane in the expansion shelf and a SAS3 HBA in the head unit for connectivity. The LSI 9305-16e HBA has 4 SAS3 ports on a single card, meaning one PCIe slot can support 4 24-bay shelves. That sounds perfect.

Quick side note here: SAS also supports full duplex data transmission, meaning SAS3 can do 48 gigabits down and 48 gigabits up simultaneously, but since I'm using SATA drives and SATA only does half duplex, the shelves will only see 48 gigabits total. That's still plenty of bandwidth for 24 drives to share.

With the new external HBA card, I'd consolidated the PCIe cards for the expansion shelves considerably, but I still had my head unit with 3 PCIe cards for the HBAs, plus another card for the 10 gig NIC, one for a PCIe to NVMe adapter card, and another for an Optane 900p drive. On my motherboard, this only leaves me with a single PCIe x4 slot which would totally bottleneck my 48Gb/s of SAS3 bandwidth. I might have been able to reshuffle the cards a bit and fit everything in; both the NVMe adapter and the 900p run on PCIe x4. However, some of my slots are running electrically at PCIe x4 even though they're physically x8 or x16. I could have replaced my direct-attach backplane with a SAS3 expander backplane like the one in the expansion shelf, but the SAS3 expander backplanes for the 846 chassis are like $500 and I'd still need to buy a SAS3 HBA; they aren't exactly cheap either. Instead of replacing the backplane, I just got an LSI 9305-24i, which has 6 internal SAS3 ports on it. This gave me enough channels to directly attach all 24 of the head unit drives. Each SAS3 channel (again, 4 channels per cable) now directly connects to one of my SATA disks, no expander or DataBolt required. I also needed 6 new SAS cables that have a SAS3 connector on one end and a SAS2 connector on the other, but those aren't too expensive.

My old HBAs are IBM M1015s, which are just rebranded LSI 9211-8i cards. On those IBM-branded cards (as we covered above), it's common practice to crossflash them by erasing the IBM RAID firmware with the MEGAREC utility and then using sas2flsh to load either LSI's IT or IR firmware. With the two new LSI cards I bought, there are no such hoops to jump through; they both run LSI's IT firmware out of the box. Unless the firmware on them is out of date, you likely won't have to worry about flashing firmware on these newer LSI cards. I booted FreeDOS with both cards just to check the firmware version and integrity, but didn't actually have to run any reflashing operations.

The photo below shows the new HBAs installed. Starting from the CPU cooler side, the first PCIe card on the right is the 9305-24i. You can see the 6x SAS3 connections to it. The next card is the M.2 to PCIe adapter, then the 9305-16e HBA with its 4x external SAS ports (obviously not visible here).


Power Supply Sync Issue

So that's the PCIe slot issue pretty much taken care of. Next up, I had the power supply issue: how do I get the power supplies in the shelf to come on when I press the power button on the head unit chassis? If the drives in the shelf aren't powered, FreeNAS won't see them and it will fault the storage pool, which is obviously something I'd like to avoid. It turns out there's a fairly simple solution to this. On ATX power supplies, there's one pin (called the PS_ON pin) in the main 24-pin power connector. When this pin is pulled low (meaning it's connected to ground), the PSU knows it needs to turn on. They make little adapters that plug into the 24-pin connector and tap the PS_ON pin as well as one of the ground pins so you can connect a second power supply and have both PSUs power on simultaneously. When the user pushes the power button, the primary PSU still sees that signal, but now the second PSU has a way of seeing it as well. The PS_ON and ground wires are connected to an otherwise empty 24-pin ATX connector, which is attached to the second power supply.

Dual-PSU setups are typically used for extreme overclocking and multi-GPU systems like crypto mining rigs, in which case the PSUs are sitting right next to each other, so the wires running between the two PSUs are pretty short. In my case, I needed about 6 to 8 feet (roughly 2 meters) of cable length to run the connection between the two chassis. To accomplish this, I cut off the secondary ATX connector, crimped on a standard female fan connector, made a long cable with male fan connectors on either end, and connected that cable between the two chassis. I basically just spliced an extra 2 meters of wire into each connection, but made the cable detachable for easier system management. Thankfully, this solution ended up working perfectly. Pressing the power button powers up both pairs of PSUs simultaneously; starting the system via IPMI works too. The shelf also powers down as expected. I was expecting some weird quirk to pop up in an edge case, but nope, it just worked. If I ever add more expansion shelves, I'll just need to tap those two wires and connect them to the new chassis. I'm not sure at what point the signal on the PS_ON line will degrade enough that the remote PSU won't flip on, but if and when that happens, I'll come up with another solution.

Fan Control Issue

The final problem, independent fan control for each chassis, proved to be the most difficult to solve. Totally independent cooling "zones" required a major overhaul of the software fan control setup I went through above. The major issue I faced was that while my motherboard does support two fan zones, I was already using both of them: one for the CPU and one for the disks. If I just ran a long PWM fan cable from the main chassis to the shelf, the fans in the shelf would have to run at the same speed as those in the head unit, which obviously isn't ideal. I naturally wanted to be able to control the fans in each chassis based on the drive temperatures in that chassis.

After some research and brainstorming, I decided to use an Arduino microcontroller to generate the extra PWM signal to control the fans in the expansion shelf. PWM stands for "Pulse Width Modulation", which is a very common method for controlling the speed of DC motors. With PWM-controlled computer fans, the PWM signal is essentially a 5V square wave running at about 25kHz. The fan motor runs at a speed proportional to the "duty cycle" of this square wave, or the percentage of time in one period that the signal is high (at 5V). If the wave is at 5V for 75% of the period and at 0V for the other 25%, that's a 75% duty cycle and the fans will run at three-quarter speed. If the signal is at 5V the whole time, that's a 100% duty cycle and the fans will run at full speed.

Four-pin PWM fans run the motor off a 12V line that's separate from the 5V PWM signal line, meaning the Arduino only had to put out a few milliamps and the fan motors could be powered directly from the PSU. The drive temperatures would still have to come from the main FreeNAS system; attaching 24 temperature probes to an Arduino would be a nightmare and the readings would likely not be very accurate. To get the temperature data from the FreeNAS to the Arduino, the two would have to be connected somehow. A simple USB connection, which would both power the Arduino and provide a serial link, seemed like the best solution.

I had to make some major modifications to the Perl-based script from Stux on the FreeNAS forum. I'm not great with Perl, so the first thing I did was port everything to Python. From there, I could more easily add the functionality I needed for more fan zones. I added the serial numbers for all my drives to the Python fan control script, along with an identifier that tells the script which shelf each drive belongs to. The script loops through all the disk device nodes and runs smartctl on each disk, finds the serial number in the output, and matches the serial number to the identifiers I programmed in so it knows which shelf the disk is installed in. It then determines the maximum temperature of the disks in each shelf, again from smartctl, and sends out the fan control commands accordingly. I wrote the updates so that the script can support an arbitrary number of shelves: the number of shelves is set in a variable at the top of the script, and everything just loops that number of times. I also made it so the fan speed to temperature curve can have an arbitrary number of points on it. The script runs through a list of temperatures, finds the closest match, then picks the corresponding duty cycle. I tested the script with 30 different points on the CPU fan speed curve and it worked just fine.
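To give a rough idea of how those pieces fit together, here's a trimmed-down sketch of the shelf-grouping and curve-lookup logic. The serial numbers, curve points, and smartctl parsing are simplified placeholders for illustration, not what the real script (linked later in this article) actually contains:

import glob
import re
import subprocess

# Which shelf each drive lives in, keyed by drive serial number (0 = head unit, 1 = first shelf).
# These serial numbers are made-up placeholders; the real script has an entry for every drive.
SHELF_BY_SERIAL = {"ZA1XXXXX": 0, "ZA1YYYYY": 0, "ZA2XXXXX": 1, "ZA2YYYYY": 1}
NUM_SHELVES = 2

# Temperature-to-duty-cycle curve: (max drive temp in deg C, fan duty cycle in %).
FAN_CURVE = [(30, 20), (35, 30), (40, 50), (45, 80), (50, 100)]

def drive_info(dev):
    # Pull the serial number and temperature for one disk out of smartctl's output.
    # The parsing here is deliberately simplistic compared to the real script.
    try:
        out = subprocess.check_output(["smartctl", "-a", dev]).decode("utf-8", "ignore")
    except subprocess.CalledProcessError as e:
        out = e.output.decode("utf-8", "ignore")  # smartctl sets nonzero exit bits for warnings
    serial, temp = None, None
    for line in out.splitlines():
        if line.startswith("Serial Number:"):
            serial = line.split()[-1]
        elif "Temperature_Celsius" in line:
            temp = int(line.split()[9])  # RAW_VALUE column of the SMART attribute table
    return serial, temp

def shelf_duty_cycles():
    # Find the hottest drive in each shelf and map that temperature onto the fan curve.
    max_temp = [0] * NUM_SHELVES
    disks = [d for d in glob.glob("/dev/da*") if re.fullmatch(r"/dev/da\d+", d)]
    for dev in disks:
        serial, temp = drive_info(dev)
        if serial in SHELF_BY_SERIAL and temp is not None:
            shelf = SHELF_BY_SERIAL[serial]
            max_temp[shelf] = max(max_temp[shelf], temp)
    duties = []
    for temp in max_temp:
        duty = FAN_CURVE[-1][1]  # fall back to full speed if we're past the end of the curve
        for curve_temp, curve_duty in FAN_CURVE:
            if temp <= curve_temp:
                duty = curve_duty
                break
        duties.append(duty)
    return duties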

Development and testing of both the Arduino code and the ported fan control script took a few weeks. I wanted to have the new fan control system tested and ready before buying all the hardware for the expansion shelf so I could install it right away and not have to deal with noisy fans. In order to do all the testing before I had the expansion shelf hardware in hand, I moved the head unit's hard drive fan control from the system's motherboard over to this new Arduino setup. This is when the first obstacle presented itself. When you connect an Arduino to a FreeBSD-based system (such as FreeNAS), it creates a device node for the serial connection at /dev/cuaU followed by a number. The first Arduino gets /dev/cuaU0, the second gets /dev/cuaU1, and so on. The thing is, while a connected device usually gets assigned the same device node from reboot to reboot, it's not guaranteed. For example, if you shut down your FreeNAS system and add a bunch of disks to it, the device nodes for the disks you already had installed might change (/dev/da5 could become /dev/da12, and /dev/da5 could be assigned to one of the new drives).

While I didn't expect the device nodes for the Arduinos to change very often (or really at all), I wanted a way to make sure I was sending the fan control commands to the correct device. If the drives in the new shelf started heating up and I had the device nodes wrong, the script would send commands to the wrong Arduino, which would spin up the fans in the head unit instead. What's worse, with the drives in the head unit now being cooled by rapidly-spinning fans, their temperatures would quickly drop, and the script would see this and erroneously spin down the fans in the overheating expansion shelf. This feedback loop would continue and I'd likely end up with drive failures from overheating before too long.

Adding some code to let the Arduinos identify themselves was pretty trivial. The serial connection between the FreeNAS and the Arduino is of course two-way; I can send and receive data on it. Most of the time, I'm just sending fan speed data from the FreeNAS to the Arduino in the form of an integer representing the desired duty cycle, but I also added a function to prompt the Arduino to respond with an ID code. The ID code is simply 0 for the Arduino in the head unit and 1 for the Arduino in the shelf. If I ever add a second shelf, its Arduino will get ID 2. In the fan control script on the FreeNAS, I start out by running through all the /dev/cuaU device nodes on the system, prompting each one to identify itself, then matching its response to the appropriate shelf. The issue I ran into here was consistently reading the data that the Arduino was sending and getting it into the Python script. You can use the cat command in the FreeBSD shell to read data on the serial line, but the command hangs until the serial connection closes. To get cat to return, I ended up having to pair it with the timeout command so it would wait about a second and then return whatever value it got. The full command I used to send the ID command and wait for a response was as follows:


echo <id> > /dev/cuaU0 && echo "$(timeout 1 cat /dev/cuaU0)"

To complicate things even further, this command didn't seem to work every time. When I ran it in the script, sometimes it wouldn't return anything. I ended up having to run it in a loop until it got a response. There were many instances where I had to issue this command 17 or 18 times in the loop before I got a response. It worked, but it was frustratingly hacky.

# Populate shelf tty device nodes by querying each /dev/cuaUX device for its ID.
# Sometimes the query isn't received on the first try, so keep trying until we get a response.
# "<id>" stands in for the actual ID-request command sent to the Arduino.
for shelf in range(0, num_chassis):
    shelf_id = ""
    while shelf_id == "":
        shelf_id = subprocess.check_output("echo <id> > /dev/cuaU" + str(shelf)
            + " && echo \"$(timeout 0.1 cat /dev/cuaU" + str(shelf) + ")\"", shell=True)
        shelf_id = shelf_id.decode("utf-8").replace("\n", "")
    shelf_tty[int(shelf_id)] = "/dev/cuaU" + str(shelf)

Once I had the Arduinos programmed and the fan control script modified, I had basic multi-zone fan control in place and things were working fairly well. However, I knew the Arduinos had a lot of potential that I wasn't taking advantage of, so I decided to expand the scope of the project a bit. I wanted to add a little display I could mount on the outside of each chassis to show some stats about the system. The displays an Arduino can drive are pretty tiny, so I couldn't show too much information. I ended up using a 1" I2C OLED display with a 128x64 resolution. I set the Arduino to display the fan duty cycle, the current fan speed, the temperature of the drives in the chassis, and the ambient temperature inside the chassis (measured via a temperature probe attached to the Arduino). The displays connect to the Arduino via cables that run through the same side vent holes I used for the front fan PWM cables. I 3D printed little mounts for the displays and attached them with Velcro tape to the top of the front fan shroud.

The photo below shows the test setup on a breadboard with the I2C display on the left, the thermal probe on the right, and the fan connection on the bottom.


This is the wiring harness I created for the Arduinos. It has (from left to right) a connection for all the fans, a MOLEX 4-pin connection for PSU power, a connection for the I2C displays, and a connection for the thermal probe. The display and thermal probe attach via the same 4-pin connectors that the PWM fans use (I ordered all the pins and connectors and stuff in bulk...)


Here are a few photos of the Arduino all soldered together. When I put it in the case, I slide a big heat-shrink tube around it to keep anything from shorting out against the chassis.







The photos below show the I2C display I used, first in testing, then in its 3D printed mount that was secured with Velcro to the top of each chassis.





All of this worked fairly well, but the 1" displays were so small that the text was almost impossible to read unless you were right in front of them. I also had an issue with one of the displays where the text would wrap around to the bottom of the screen every so often, likely due to RF interference or clock skew or something; I never did figure out what was causing it.

I ran this setup for a week or so and pretty quickly came upon some major issues. The Arduinos would freeze up every so often and I would have to reset them by power cycling them. When they locked up, the tiny OLED display and (more importantly) the fans’ duty cycles wouldn't update. The first time this happened, I didn't catch it until the drives were very hot, some close to 50 degrees C.

To avoid having to keep manually power cycling the Arduinos, I added an automatic reset function to the fan control script. The Arduinos I used have a feature where if you open a serial connection with them at 1600 baud and then drop that connection, the microcontroller will reset itself. In the script, I added a check that runs every 60 seconds and asks the Arduinos to identify themselves (just like I do at the beginning of the script). If an Arduino is locked up, it won't respond to the ID request. As before, I had to run the ID request command in a loop; I counted the number of iterations and bailed out and reset the Arduino after 20 attempts. With this automatic reset function in place, the whole setup was a lot more robust. I ran this setup for several more weeks and it did alright, but the Arduinos seemed to reset themselves far too often, typically five or six times per day. When the Arduinos reset, the fans would all suddenly ramp up to 100% and stay there for 30 to 60 seconds until they got the next duty cycle command from the fan control script. This ended up being very annoying. I tried using much shorter USB cables to connect the Arduinos, thinking that the 8' cables were losing too much signal, and I also added ferrite cores to the USB cables to try to cut down on noise. These did make an improvement, but the Arduinos were still resetting several times a day.
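For illustration, the watchdog ends up looking something like the sketch below. It's simplified from the real script (which interleaves this check with the normal fan speed updates), it assumes pyserial is available for the baud-rate reset trick, and "<id>" is the same placeholder for the ID-request command as before:

import subprocess
import time

import serial  # pyserial, used only for the baud-rate reset trick

# Populated at startup by the ID loop shown earlier; the values here are just examples.
shelf_tty = {0: "/dev/cuaU0", 1: "/dev/cuaU1"}

MAX_ID_ATTEMPTS = 20
RESET_BAUD = 1600  # opening and closing the port at this rate resets the Arduino

def query_id(tty):
    # Ask the Arduino on this tty for its ID; returns "" if it doesn't answer in time.
    cmd = "echo <id> > {0} && echo \"$(timeout 1 cat {0})\"".format(tty)
    return subprocess.check_output(cmd, shell=True).decode("utf-8").strip()

def check_and_reset(tty):
    # Returns True if the Arduino responded; otherwise resets it and returns False.
    for _ in range(MAX_ID_ATTEMPTS):
        if query_id(tty) != "":
            return True
    # No response after MAX_ID_ATTEMPTS tries: assume the Arduino is hung and reset it
    # by opening and immediately dropping a serial connection at the reset baud rate.
    serial.Serial(tty, RESET_BAUD).close()
    return False

while True:
    for tty in shelf_tty.values():
        check_and_reset(tty)
    time.sleep(60)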

After a few more weeks of sitting in a room with fans that would randomly ramp up to 3000 RPM and then spin themselves back down, I got so frustrated that I decided to scrap the whole Arduino approach and re-implement everything on Raspberry Pis. Of course the Pi is massive overkill for a project like this, but I wanted something more reliable than the serial connection on the Arduino. I probably could have gotten things working much more reliably on the Arduinos by doing more troubleshooting, but I was way too frustrated with them at that point, and I was also intrigued by the enhanced capabilities of moving all the communications to Ethernet. I could get rid of the little I2C displays and instead set up a simple web server to display all the system vitals. I could have it displayed on a much larger (but still fairly small) screen with much more information and I could also of course access that web server from any other device I wanted.

I started by porting the Arduino C code to Python to run on the Raspberry Pis, which lets them generate the necessary PWM signal to control the fans as well as measure the fan speed and ambient temperature in the chassis. Instead of receiving commands via serial, the Pis are connected via Ethernet and use Python's socket module to receive commands from the FreeNAS system. And instead of displaying the system statistics on attached I2C panels, the script sends all the data to another Raspberry Pi running flask, socket.io, and redis, which formats and displays everything on a live web page. I have the web page displayed on a dedicated 1080p 11" touchscreen I got from a Chinese retailer. The fan speed, duty cycle, and ambient temperature information is sent by the two Raspberry Pi fan controllers, while the FreeNAS system itself sends the individual drive temperatures to the display controller Pi, also via Python sockets. With the increased resolution of the 11" display, I can also show some extra information like the FreeNAS system's CPU temperatures, the CPU cooler's fan speed and duty cycle, and the average CPU load.
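As a rough sketch of the receiving end, a fan controller Pi boils down to something like the code below. This isn't lifted from my actual scripts; it's a minimal example that assumes the pigpio daemon is running to generate the 25kHz hardware PWM signal on GPIO 18, and the port number and message format are made up for illustration:

import socket

import pigpio  # requires the pigpiod daemon to be running

PWM_GPIO = 18        # hardware-PWM-capable pin driving the fans' PWM line
PWM_FREQ_HZ = 25000  # standard 4-pin fan PWM frequency
LISTEN_PORT = 5005   # placeholder; not necessarily what the real script uses

pi = pigpio.pi()

def set_duty_cycle(percent):
    # pigpio expects the duty cycle as a value from 0 to 1,000,000.
    pi.hardware_PWM(PWM_GPIO, PWM_FREQ_HZ, int(percent * 10000))

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
server.bind(("0.0.0.0", LISTEN_PORT))
server.listen(1)

# Run the fans at full speed until the FreeNAS box sends the first duty cycle command.
set_duty_cycle(100)

while True:
    conn, _ = server.accept()
    with conn:
        data = conn.recv(64).decode("utf-8").strip()
        if data.isdigit():
            set_duty_cycle(int(data))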

The bare Raspberry Pi and wiring harness are pictured below (attached via a ribbon cable). Since I'm using a separate unit to display system statistics, I didn't need a plug for a display on the Pi.



The Pi's wiring harness has a connection for the fans (top left), a 4-pin MOLEX for extra power (lower left), and a connection for the thermal probe (center).


Without the cable jacket, you can get a better idea of how everything is wired.


These photos show the Pis installed inside of each chassis.



These photos show the display console output.



Final Results and Future Plans

The system has been running on this Raspberry Pi setup for several months now and has been pretty much rock solid. The scripts are set up to gracefully handle socket disconnections and continue running while attempting to reconnect. There are a few other modifications and improvements I have planned, including better threading on the fan controller Pis, more robust socket reconnection logic between all parts of the system, buttons on the display web page to control variables like ramp speed and duty cycle mappings, display of the Raspberry Pis' own vitals, and some additional statistics from the FreeNAS system like pool capacity. I'd also like to have it display data from other systems in my lab, like my primary workstation and FreeNAS Mini. The source code for everything is still in a pretty rough state and it's obviously very specific to my setup, but I do have everything in a Github repository here if you're curious to look through it.

Once the new shelf was in and working, I added 16 new 8 TB drives to the system, so the usable storage capacity has increased from about 100 TB to about 180 TB and I still have room in the shelf for another 8 drives. I also added more RAM to the system (which now totals 128 GB) and got that Optane 900p drive I mentioned before to use as an L2ARC. And finally, I added cable management arms to all the chassis. For some reason, Supermicro doesn’t make arms for the 846, but their 2U/4U arm worked after doing a little bit of cutting on one of the included brackets.

With the added fans, the noise level in my office has definitely gone up a little, but I've found that the fans in the expansion shelf are almost always at 25-30%, or around 800-1000 RPM. I do still plan on finding a spot outside my office for the rack, but that's at least several years off. With the additional vdevs in my FreeNAS system, performance has also increased. I'm able to get 650-700 MB/s sequential reads and writes between the FreeNAS and my workstation, but I think I can do some Samba tuning to increase that to 1 GB/s. Note that if you do a similar expansion, you may want to manually re-balance your vdevs by moving all your data into a temporary dataset, then moving it all back into the primary dataset. This ensures the data is spread evenly across all the vdevs so reads and writes can be divided between all the drives.

Looking back on this update, I definitely took a more difficult route to achieve storage expansion, but I had a lot of fun planning, developing, and implementing everything. The fan control was obviously the biggest hurdle and could have been avoided completely if I had this system in a room where noise wasn't an issue (like by just letting the fans run at full speed). If you don’t have to worry about fan control, FreeNAS expansion is pretty simple. Enterprise storage systems like TrueNAS use more advanced expansion shelves with an integrated baseboard management controller (or BMC), often with its own web UI and fan control based on temperature sensors around the chassis. The shelf’s BMC doesn’t have access to the drives' internal temperatures though, so the fan control isn’t quite as tight as my setup.

That being said, these enterprise storage systems are scaled up in exactly the same way we just covered, sometimes to tens of petabytes. For maximum density, they can use top-loading expansion shelves, sometimes with over 100 drive bays in a single 4U chassis. On the opposite end of the spectrum, it would be easy enough for a home lab user to buy a second mid-tower case, stuff it full of drives, rig the PSUs together, and connect the expansion tower with an external SAS cable.

Closing and Summary

If you’ve been following along, you should now have a pretty robust file server configured. It should be able to tolerate the failure of one or more hard drives, automatically report on low-level disk and pool errors before they cause hardware failures, heal minor boot and data pool errors, adjust fan speed to keep itself cool, shut itself down gracefully when it loses wall power, back up its configuration files on a regular basis, back up all its user data to the cloud, and run any sort of Linux-based VM you might require for other tasks! Hopefully you’ve learned a few things as well. I happily welcome any feedback you might have on this write up; please let me know if you spot any mistakes, misconceptions, sections that aren’t very clear, or a task that can be tackled in an easier manner. Thank you for reading, and feel free to contact me with questions and comments, or if you're interested in having me build a similar system for you: jason@jro.io!