Datacenter Move post 4

YouTube videos are worth a thousand words, so I will let them do the talking.

Progress TCR Move May27 part 1 (the old location)



Progress TCR Move May27 part 2 (the new location)



What Mustafa thinks of IBM rackmount kits:

And some pics from the last 3 days:
(Click for larger versions)


IBM xseries 336, about 3 years old now

IMG_3452
ServeRAID 6M controller (in the PCI slot bay). It's IBM-branded, but it's basically an Adaptec.
This comes out of one of the 2 xseries 336 servers that, together with the EXP400 shelf, served as a Windows 2003 cluster. The ServeRAID controllers are needed to provide failover control of the shared disk shelf.

IMG_3453
Mainboard of an xseries 336, with the PCI card bay/thing removed. The blue bracket at the top is where an RSA-II management card would usually be sitting, but this one doesn't have one 😦

IMG_3454
ServeRAID 6M removed from PCI bay of the 336 server

IMG_3456
The rack is slowly emptying. I remember when I was building it all up, 3 years ago! Check it out

IMG_3459
Installing Windows on an IBM xseries 336. It's been a while. I noticed the IBM ServerGuide CD has a few more options now.

IMG_3460
Picture I needed to have to illustrate where to connect everything.

IMG_3461
Ready to move to new location

IMG_3464

IMG_3465
Not the most ideal way of moving servers, but it's better than nothing. At least they are softer here than in the back of the car.

IMG_3467
Richard, our project manager, trying to get more work in.

IMG_3469
Temporary cabling 😉

IMG_3472
Its slowly growing

IMG_3473
They are not done with the rack interconnects, dammit. I can't finish my patching like this.

IMG_3474
Our new firewall cluster

IMG_3475
I love the blue glow of the console. Kinda weird to have that out-of-place IBM in there, squashed between the HPs. The cable rail for the server is different from HP's as well, so that will be fun to cable.

IMG_3476
Moved some servers around, getting to our final config now.

IMG_3477
WAN comms rack in the new location

IMG_3478
LAN comms rack in the new location

IMG_3480
Khalid on servicedesk duty

IMG_3481
Gertjan is helping us decommission

IMG_3482

IMG_3483
Mustafa hard at work decommissioning servers


IMG_3486
A lot of servers were decommissioned today; these are all basically being scrapped.



The Vegas SuperNAP

Datacenter Knowledge has 2 posts up about Switch Communications' new datacenter in Las Vegas, which they are claiming is the highest-density datacenter in the world.

http://www.datacenterknowledge.com/archives/2008/May/27/the_vegas_supernap_a_data_center_revolution.html

http://www.datacenterknowledge.com/archives/2008/May/27/1500_watts_a_square_foot_a_look_at_tscif.html

switch-tscif-aisle.jpg swith-tscif.jpg

Switch Communications says it is successfully cooling a section of its Las Vegas data center running at nearly 1,500 watts per square foot using air cooling. How are they accomplishing this?

The key to Switch’s high-density cooling is a design known as Thermal Separate Compartment in Facility (TSCIF), according to company co-founder Rob Roy. The ingredients in this approach include high-capacity AC units placed outside the data center area, and a tightly integrated hot aisle containment system for the racks. Here’s an overview:

    * The cabinets are set on a slab, with no raised floor.

    * Chilled air is delivered into the cold aisle near the ceiling rather than through the floor, and enters the cabinets through the front.

    * Each cabinet fits into a slot in the TSCIF unit, which encapsulates the rear and sides of each cabinet, while the open front extends beyond the enclosure.

    * The hot aisle containment system delivers waste heat back into the ceiling plenum, where it can be returned to the chiller.

 

Very cool video of the SuperNAP setup:

http://www.switchnap.com/pages/products/the-supernap-video.php

More pics of their T-Scif cooling system: http://www.switchnap.com/pages/tech-specs/thermal-scif.php

The statistics off their site:

407,000 square feet of space
 
250 MVA Switch owned substation
 
146 MVA of generator capacity
 
84 MVA of UPS supply
 
30,000 tons of system plus system cooling
 
4,500,000 CFM

30 cooling towers

100% heat containment using thermal-scif™

Designed for 1500 watts per sq. ft. density
 
7000+ cabinets
 
Armed 24/7/365 military-trained, Switch-employed security staff
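
A quick back-of-the-envelope check of those numbers, just as a sketch. It assumes that a ton of refrigeration is about 3.517 kW and, purely for illustration, a power factor of 1.0 for the UPS figure. Neither assumption comes from Switch's site, and the function names are mine.

```python
# Rough sanity check of the published SuperNAP figures.
# Assumptions (mine, not Switch's): 1 ton of refrigeration ~= 3.517 kW,
# and a power factor of 1.0, so that 1 MVA ~= 1 MW.

KW_PER_TON = 3.517            # common conversion for a ton of refrigeration
COOLING_TONS = 30_000         # "30,000 tons" from the list above
UPS_MVA = 84                  # "84 MVA of UPS supply"
DENSITY_W_PER_SQFT = 1_500    # "1500 watts per sq. ft."
TOTAL_SQFT = 407_000          # "407,000 square feet of space"


def cooling_capacity_mw(tons: float) -> float:
    """Convert tons of refrigeration to megawatts of heat removal."""
    return tons * KW_PER_TON / 1_000


def floor_area_at_density(power_mva: float, watts_per_sqft: float,
                          power_factor: float = 1.0) -> float:
    """Square feet that a given supply could feed at a given density."""
    return power_mva * 1_000_000 * power_factor / watts_per_sqft


if __name__ == "__main__":
    print(f"Cooling: {cooling_capacity_mw(COOLING_TONS):.1f} MW")
    area = floor_area_at_density(UPS_MVA, DENSITY_W_PER_SQFT)
    print(f"UPS could feed {area:,.0f} sq ft at full density "
          f"({area / TOTAL_SQFT:.0%} of the total space)")
```

On those assumptions the UPS capacity would feed roughly 56,000 sq ft at the full 1,500 W density, which fits with the article saying that only a section of the facility runs at that density.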

 


Decision-making on the new Proxy solution

For your enjoyment, here is a slightly edited email I just sent to the department head and various other decision makers. It goes over some of the options we need to consider to solve the current issues with our internet access.

Names and places have been changed to protect the guilty 😉

And please excuse the spelling. I was in a hurry and I really don't care about spelling as much as I do about content.

—————————-

Hi all,

We are currently faced with some decisions that need to be made in regard to the Internet Proxy solution for the Netherlands and Belgium.

This is the current situation in regard to the proxy servers in Lala City and Chipville.

Server Lala City: LA-Server-S99
Server Chipville: CHIPVILLE-Server-S99

Both servers are HP DL360 G2 servers, and are now approaching 8 years of age. They are well out of warranty, and no hardware support can be expected from HP anymore for these.
Both servers run Windows 2000 Standard.
The Proxy software on both servers is ISA Server 2000, running on the SQL MSDE engine. This software is still supported by Microsoft, but has been superseded by 2 newer versions.
In addition, we currently run the Surfcontrol web-filtering software, as a plug-in for ISA.
This software allows us to tightly control web behaviour, for example to allow certain users access to certain sites, and to block entire categories of websites or web protocols.
We have built up a pretty extensive rule-set over the years on both machines, and both rulesets are largely identical.
The company “Surfcontrol” was acquired by Websense in 2007. Since that time the Surfcontrol software is no longer supported: no patches or service packs are being offered for download, and no licences are being extended or sold, forcing all former Surfcontrol customers, including us, to look for alternatives.
The software combination on these servers has caused us some issues in the past. Some elements of Surfcontrol have always been buggy, and as the hardware has aged, it has become unreliable.
Furthermore, the decision to use SQL MSDE has caused problems because of its inherent 2 GB database size limit.

Lala City
The Lala City proxy server is due to be replaced with new hardware, located in SiteB. This action is outstanding as part of the TCR move project.
As part of this, a new server was purchased, together with W ISA server 2006, and SQL 2005
At the moment, no replacement for Surfcontrol has yet been purchased, although Dick Dickerson did get a cost estimate for the Websense software, based on a single server, 500 users, and 3 years of licensing. (included as attachment)
A decision on this has been on the back burner, due to the fact that we were also planning on moving the current ISA server to SiteB anyway, and using the Chipville ISA server as a backup.

Chipville
The old Proxy server in Chipville is in a similar state to the one in Lala City, except that one of its 2 disks (which run in a mirror) failed last week.
This poses a serious risk to internet service continuity. It also represents a risk to the TCR move project, as this server is now no longer a reliable fallback while we move the Lala City server.

We need to decide how to proceed going forward.

The time factor
We have only a limited time to come up with a solution. Currently the situation in Chipville is more pressing, because of the hardware failure of the server there.
The big-bang server move from TCR Lala City is scheduled less than a month from now, and we need a stable and supportable solution at the very least in Chipville before that time, and ideally a solution for Lala City as well.

There are a number of options:

Option 1. Keep the current servers
The Lala City server can be moved to SiteB and continue to operate from there, serving Internet users (non-citrix) in the Netherlands.
However, the hardware and software is no longer supported, the software is in an unstable state due to past problems with Surfcontrol, and the ISA MSDE database.
Due to the advanced age of the hardware, it is only a matter of time before it fails. Moving it might actually break it too.
The Chipville server cannot operate as-is, on a failed hardware mirror. This absolutely needs to be replaced, more or less disqualifying this option.
Due to the above, I cannot recommend this option in any way.

Option 2. Outsource the Proxy service to European Datacenter / UK
This would involve redirecting all internet traffic from Netherlands and Belgium to an outside, centralised Proxy system for internet access.
This would simplify our support model somewhat, and remove the technical burden of supporting the solution ourselves.
The downside though, is that we no longer have direct control over what is allowed/disallowed over the Internet.
By default, as far as I have heard, no rules are in place on either the UK or the European Datacenter proxy solution, meaning that there are no limits on what people can do with the Internet connection. It would be a free-for-all, whereas right now we have strict limits on usage.
This option should be considered. But the question has to be asked why the web-filtering function was ever needed in the first place. If web-filtering and control remains a business requirement, then this option cannot be considered.

Option 3. Hybrid In-country hosting / European Datacenter hosting
I have been made aware of a version of the European Datacenter hosting scenario, that includes re-directing in-country internet traffic to European Datacenter, but in combination with a local Proxy/web-filter server, running the Websense software. This would involve installing a local server with the “Websense” filtering software, and “chaining” it to the Websense Proxy server in European Datacenter. Many countries apparently already follow this model.
This has the advantage of retaining local control of a rulebase, allowing us to continue to restrict internet use where necessary, but with the advantage of no longer needing a local Internet line for basic Internet use. MEGACORP(TM) can also retain an amount of corporate internet-use control via the gateway in European Datacenter, as all internet traffic eventually moves through there to get out. Currently MEGACORP(TM) does not impose any global restrictions on the Internet gateway in European Datacenter, as far as I have heard.
This option should be considered; however, it will take some time to study and set up properly. The support model may be complicated because you are dealing with possible web-filtering and proxying in 2 different locations, supported by 2 different organisations. It would also require that local Websense software be purchased and supported. I have also been told by some that the connection via European Datacenter is very slow and not that useful for many operational tasks. This could hurt us, as we run a number of line-of-business web-based applications over the internet (HP Shipview, etc.).
We would also benefit from the fact that the Websense software can be centrally managed from 1 console, making it very easy to keep the Netherlands and Belgium rulesets identical, and simplifying reporting and failover.
I would recommend this option if we can be sure the performance is adequate for our business needs, and if the support model can be agreed upon quickly. The major downside of this solution currently is that it will take time to set up, and we don't have much time anymore.

Option 4. New installation In-Country
This involves basically rebuilding the 2 Proxy servers on new hardware, and installing fresh, current and supported Proxy and web-filtering software.
In this scenario we would use our own local Intranet lines in SiteB and Chipville.
We would directly support the solution and maintain direct control over the web-filter ruleset. This is the simplest support scenario.
Hardware for this is already available: The replacement of the Lala City server was already part of the TCR move project, as is the licence for ISA 2006 and SQL 2005.
Hardware for Chipville is also already available on site, in the form of a 3-year-old IBM server, however, this server may soon fall out of hardware support (needs to be checked).
Apart from the ISA and SQL license that would be needed for the Chipville server, we need new web-filtering software for both servers, again, IF the business still deems this a requirement.
If they don't, then this solution would provide unfiltered internet access to all (non-citrix) internet users in the Netherlands and Belgium.
For the web-filtering requirement, I would at this time advise to also go with the Websense software, as they are currently regarded as the market leader, and their software is well supported and well known in the industry. (They are incorporating a lot of the Surfcontrol concepts as part of the acquisition.)
We need to look at the current available hardware for this. Almost all the hardware we have is 3 years old or older, so it may be advisable to consider purchasing a new piece of hardware for this solution in Chipville.
This option should be considered. It has the advantage of retaining central control and will be quick to set up, once the software has been purchased. The downside is that the Websense software is expensive, so we may want to consider looking at alternatives, even though it has become a de facto standard within MEGACORP(TM). Again, we have a time-constraint problem here.
We would also benefit from the fact that the Websense software can be centrally managed from 1 console, making it very easy to keep the Netherlands and Belgium rulesets identical, and simplifying reporting and failover.
I would recommend this option first and foremost, and it is the preferred solution technically, considering the circumstances.

Again, I wish to stress the time constraints we have: with less than a month before big-bang, we want a new solution up and running within the next 3 weeks!

————————-


Videos of server rooms and progress report

So I haven't posted about our Datacenter move in a while.

Well we are starting to pick up the pace now, all the new servers at the new location have been set up, and various other teams are now installing their stuff on there.

We are now getting into decommissioning more and more servers from the old server rooms. Here are two videos of rooms as they were 2 weeks ago.

 


Old Datacenter 1 from Robert Kloosterhuis on Vimeo.
(I focus on one of the comm pc’s for a while, as the mouse was moving just a minute or so before I filmed this, someone was working on the box remotely, and I was hoping to capture it )


Old Datacenter 2 from Robert Kloosterhuis on Vimeo.

The pics below show some of the servers, or lack thereof (gaps), as we are starting to take the first ones out

IMG_3431

IMG_3430

IMG_3429

IMG_3428
Ready to be disposed of.

IMG_3437
I had Mustafa looking at ways to make some Ghost-based images of some of the servers, just in case something happened during the move. He has a version of Bart’s Modular Boot Disk running there, which makes DOS-based network booting very easy to configure.

In clearing out the office, I also found some interesting items 😉

IMG_3438

IMG_3436
This nicely complements my collection 😉
All I miss now is the NT4 and 2008 resource kits!

IMG_3439

 


Friendfeed Sysadmin Room, Twhirl FF support, and Pretend-Sysadmins

So not long after I complained that there were so few system administrators on the social media scene, Friendfeed introduces “Rooms” and immediately Adnan takes the initiative and makes a Sysadmin Room.

The uptake was pretty damn fast, probably partly due to Adnan's blog being part of the Planetsysadmin collective. (Why am I not on there yet?!)

So, let's hope admins post “adminy” and interesting things there. Adnan and I are off to a good start, at least.

Meanwhile, I realised I now had a treasure trove of Sysadmins to add to Twitter, which has been a really successful strategy so far.

I was immediately struck by a number of things though. Why is it that I wanted to follow these guys on Twitter, and not just solely on Friendfeed, where they already were?

Well, the answer to that is very, very simple: Twhirl. Or more to the point, their completely crappy implementation of Friendfeed support, at least at the moment.

I am referring mostly to the lack of any kind of filtering, the fact that FF and Twitter are still two separate streams, double Twitter posts, and the lack of FF comment collapsing/expanding.

But to get back to my original action, the adding of the Sysadmins: as it turns out, many people who added themselves to the Sysadmin Room are not Sysadmins at all. Rather, many are developers or web entrepreneurs. At least that is how they describe themselves on Twitter. I have filtered who I add to Twitter accordingly 😉

 

 

 

 

 


Scheduled reboot batch job, unexpected “access denied” and how to handle security

So here is something silly I was running up against. In the end it's super simple, but it's not obvious, and not easy to google for.

I want to equip the new servers we are installing with a standard weekly reboot schedule.

I created a batch file that launched shutdown.exe with some fancy parameters, and set this up as a scheduled task for each server.
I created a special domain account called sa-scheduledreboot with normal user rights, and rights to access the share, and of course the famous “log on as a batch job” privilege, granted to each server via Group Policy.
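
For illustration, here is roughly what such a reboot command could look like, sketched in Python rather than as the actual batch file. The delay, the comment text and the choice of switches are assumptions for the example, not the contents of the real Reboot.cmd; `/r`, `/t`, `/c` and `/f` are standard shutdown.exe switches.

```python
import subprocess
import sys


def build_reboot_command(delay_seconds: int = 60,
                         comment: str = "Scheduled weekly reboot",
                         force: bool = True) -> list[str]:
    """Build a shutdown.exe argument list for a delayed reboot.

    The defaults are illustrative assumptions, not the real settings.
    """
    if delay_seconds < 0:
        raise ValueError("delay_seconds must not be negative")
    cmd = ["shutdown.exe", "/r",          # /r = reboot instead of power off
           "/t", str(delay_seconds),      # /t = delay before the reboot
           "/c", comment]                 # /c = comment shown to users
    if force:
        cmd.append("/f")                  # /f = force running apps to close
    return cmd


if __name__ == "__main__":
    command = build_reboot_command()
    if "--run" in sys.argv and sys.platform == "win32":
        subprocess.run(command, check=True)       # really reboots the box
    else:
        print(subprocess.list2cmdline(command))   # dry run: just show it
```

Without `--run` the script only prints the command line, which is handy for checking what the scheduled task would actually execute.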

But despite this rather textbook rights scenario, I was continuously getting “Could not Start”.

However, if I ran the command using Runas, using the credentials of the sa-scheduledreboot account, it would work fine.

The Scheduled Task eventlog showed the following:

“Task Scheduler Service”
5.2.3790.3959 (srv03_sp2_rtm.070216-1710)
“Sheduled Reboot.job” (Reboot.cmd) 5/13/2008 5:43:54 PM ** ERROR **
    Unable to start task.
    The specific error is:
    0x80070005: Access is denied.
    Try using the Task page Browse button to locate the application.
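
That error code is worth picking apart, because it shows the failure is a plain Windows "access denied" passed on by Task Scheduler, not something specific to the scheduler. The sketch below splits an HRESULT into its standard fields; the two small lookup tables are deliberately incomplete and only cover the values seen here.

```python
# Decode an HRESULT such as 0x80070005 into its standard fields.
# The lookup tables are intentionally tiny: only what this log needs.

FACILITIES = {7: "FACILITY_WIN32"}
WIN32_ERRORS = {5: "ERROR_ACCESS_DENIED"}


def decode_hresult(hresult: int) -> dict:
    """Split a 32-bit HRESULT into severity, facility and error code."""
    failed = bool((hresult >> 31) & 0x1)     # top bit set means failure
    facility = (hresult >> 16) & 0x7FF       # 11-bit facility field
    code = hresult & 0xFFFF                  # low 16 bits: the error code
    return {
        "failed": failed,
        "facility": FACILITIES.get(facility, str(facility)),
        "code": code,
        "name": WIN32_ERRORS.get(code, "unknown") if facility == 7 else "unknown",
    }


if __name__ == "__main__":
    print(decode_hresult(0x80070005))
```

Facility 7 means the low 16 bits are an ordinary Win32 error code, and Win32 error 5 is "Access is denied", so the thing being denied is a file or object the task tries to open.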

I spent several hours trying to find out where the “access denied” came from. Eventually, I stumbled upon this:

http://support.microsoft.com/kb/867466/en-us

as it turns out:

In Windows Server 2003, the Users group does not have Read and Execute permissions to the command processor (Cmd.exe). By default, the Cmd.exe program has the following permissions settings:
•    The Interactive implicit group and the Service implicit group have Read and Execute permissions.

Note On a member server, the TelnetClients group also has Read and Execute permissions. On a domain controller, the Batch implicit group also has Read and Execute permissions.
•    The Administrators group and the System implicit group have Full Control permissions.

One of those quirky things you just have to know.

The way I have solved this, is that I have created a special Domain Local security group called RG_command_processor_execute  (RG stands for Resource Group)

This group will allow me to control this specific privilege, and assign it to accounts, usually service accounts, that require the access to cmd.exe to run batch files.

I have added sa-scheduledreboot to this group.

I don't want to mess around on each individual server, so I have made it standard that -all- security settings, including changes to default ACLs, should happen via Group Policy.

For this we use the File System section of the Security Settings part of a Group Policy Object.
We can add files and folders here, and define how their ACL should look.

The tricky bit is that you have to remember that this Group Policy setting overrides and replaces the original ACL on the object.

That's a bit annoying, because it means I have to replicate its current ACL entries, including any special permissions assigned to implicit security groups.
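
Because the GPO entry replaces the ACL instead of adding to it, the new ACL has to be the defaults plus the one extra entry. Here is a minimal sketch of that idea using plain Python dictionaries, not any real Windows API. The default entries are taken from the KB excerpt above for a member server, and the RX / F abbreviations and function names are my own shorthand.

```python
# Model of "GPO replaces the ACL": build the full replacement ACL and
# check that none of the default entries would silently disappear.
# Plain data only; this does not touch real file permissions.

DEFAULT_CMD_ACL = {
    "INTERACTIVE": "RX",       # Read and Execute
    "SERVICE": "RX",
    "TelnetClients": "RX",     # member servers only, per the KB article
    "Administrators": "F",     # Full Control
    "SYSTEM": "F",
}


def build_replacement_acl(defaults: dict, extra: dict) -> dict:
    """Return the defaults plus the extra entries, refusing conflicts."""
    acl = dict(defaults)
    for trustee, permission in extra.items():
        if trustee in acl and acl[trustee] != permission:
            raise ValueError(f"{trustee} already has {acl[trustee]}")
        acl[trustee] = permission
    return acl


def lost_entries(defaults: dict, proposed: dict) -> list:
    """List default trustees that a proposed ACL would drop."""
    return sorted(t for t in defaults if t not in proposed)


if __name__ == "__main__":
    extra = {"RG_command_processor_execute": "RX"}
    print(lost_entries(DEFAULT_CMD_ACL, extra))    # what a naive GPO would drop
    print(build_replacement_acl(DEFAULT_CMD_ACL, extra))
```

The first print shows the trap: defining only the new group in the GPO would drop every default entry, including the ones for Administrators and SYSTEM.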

The KB article shows two ways to do this.
The first is to add the account or group directly to the cmd.exe ACL.
The second is to add the BATCH group to the cmd.exe ACL.

The second option is interesting, because the BATCH built-in group implicitly includes all batch files that run on the system.

The way that would go would be:

sa-scheduledreboot –>member of–> RG_command_processor_execute –>member of–> %hostname%/BATCH –>applied to–> (ACL of) cmd.exe

This looked like a good option for a while, until I realized it was perhaps a bit broad. (all batch files, including those run by rogue processes? )

And since it only applies to batch files, if I ever needed to grant that right to anything other than a batch file (say, a resident program or agent), I would have to assign the group directly anyway.

So I decided to add the group directly to the resource, which also makes it easier to see what the ACL change is for, for anyone examining the GPO.

sa-scheduledreboot –>member of–> RG_command_processor_execute –>applied to–> (ACL of) cmd.exe
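
The chain above can be sketched as a small nested-group lookup. This is plain Python standing in for what Active Directory and the file system do. The data structures and function names are hypothetical, only the account and group names come from the setup described here, and implicit identities such as INTERACTIVE are left out because they depend on how the account logs on.

```python
# Toy model: account -> member of -> resource group -> entry on cmd.exe ACL.

MEMBER_OF = {
    "sa-scheduledreboot": {"RG_command_processor_execute", "Domain Users"},
    "RG_command_processor_execute": set(),
    "Domain Users": set(),
}

CMD_EXE_ACL = {
    "RG_command_processor_execute": "RX",   # the entry added via the GPO
    "Administrators": "F",
    "SYSTEM": "F",
}


def effective_groups(principal: str, member_of: dict) -> set:
    """Collect all groups a principal belongs to, directly or nested."""
    seen = set()
    stack = [principal]
    while stack:
        current = stack.pop()
        for group in member_of.get(current, ()):
            if group not in seen:
                seen.add(group)
                stack.append(group)
    return seen


def can_execute(principal: str, acl: dict, member_of: dict) -> bool:
    """True if the principal or any of its groups has RX or F on the ACL."""
    identities = {principal} | effective_groups(principal, member_of)
    return any(acl.get(identity) in ("RX", "F") for identity in identities)


if __name__ == "__main__":
    print(can_execute("sa-scheduledreboot", CMD_EXE_ACL, MEMBER_OF))
    print(can_execute("some-other-account", CMD_EXE_ACL, MEMBER_OF))
```

The point of the resource group is visible in the data: access to cmd.exe follows from one group membership, so granting or revoking it for another service account is a single change.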

The scheduled reboot command works fine now. And I am confident I did not assign any more rights than I absolutely needed to get it to work. (In contrast, the previous reboot account had domain admin rights.)

The only thing I need to do now, is to remove many other rights from the sa-scheduledreboot service account.
It's currently a member of Domain Users, and that grants a load of rights this account certainly does not need. I will look more closely into that at a later time, as my solution will have to cover many service accounts, not just this one.

By giving out the exact rights needed in a very granular way for each service account I need, I can far more easily restrict ALL service accounts in other ways, all at once, making them useless for any purpose other than what they were intended for.

Documenting this is going to be a challenge.

I need to document exactly what I am doing in the GPO that assigns the rights to these servers, and why each option was chosen the way it was.

I need to document the exact rights of the sa-scheduledreboot account.

And if I develop a blanket method to restrict ALL service accounts in other, general ways, I need to document that too!

I better get to it!


Loic le Meur responds to my post

I pinged Loic le Meur, creator of Seesmic and owner of Twhirl, on Twitter to draw his attention to my post on some of the issues I have with Seesmic.

He responded as follows:

“you have great points and we are working exactly in the spirit you expect”

I will take this to mean that we can expect a groups/filter function in the Twhirl client or the Seesmic service in the near future. I can't wait 🙂