As some of you know, here in Texas the sport of football is a bit more than a team sport. In some locales, it's practically royalty. Back in high school and college, I was ever-aware of the sports fans that liked to wear the team colors. One in particular invited me to lunch while his parents were in town (I already knew them from back home) but I never knew that his father owned a car that was actually painted in Dallas Cowboys colors. Another colleague of mine was stuck in a parking garage with a dead battery as I was leaving work, and asked if I could help him with a jump. Upon saying yes, he produced a set of maroon-and-white jumper cables, honoring Texas A&M University. Oh yeah, he was a fan.
Then several of my friends kept large jars of team-color makeup and would smear it all over themselves before a game. No, not to go see the game at the stadium, but in their apartment! Yes, the team makeup was always a big hit, and their parties were centered solely on the tiny color TV in the middle of their rather modest apartment. I had once wondered what the people next door thought of their shouts, whoops and antics, until I learned that the folks next door were just as rowdy with their own team. Hey, ya gotta be a fan.
Okay, that's not the team makeup I'm talking about here, but those were just such funny memories I thought I'd share. And to make a point, of course. The enthusiasm of your developers and architects, and their desire to cheer you on toward your goals, are directly dependent on the team makeup. Of people, that is. So what does it take to make a data warehouse? Or for that matter, what does it take to effectively roll out a new one, or migrate an old one?
What most people fail to estimate in taming such a beast is the level of testing required to make it a reality. As an example, many of you must produce documents as part of your regular workday. Those documents are often hard to write, but even more work to proofread. In fact, proofreading is the same form of content-and-context testing we would do for a data warehouse. The chief reason is the product - information and knowledge. Business intelligence is the same way - it has a way of taking on a life of its own, but the only way we can reliably roll out a viable business intelligence platform is to test, test and test some more. Eyeballs-to-page may be required for a book or document, but it won't scale for a warehouse.
Many people don't realize that the testing portion of the warehouse can take as much as 80 percent of the project's resources. While we can compress this somewhat with agile methods, we cannot afford to test such quantities of data with simplistic manual approaches. And by that I mean eye-ball examination or screen-shot testing. No, the majority of testing is in the data itself, and on a Netezza machine it's in the billions of records. Eyeballs don't have the bandwidth. We need to use the actual power of the machine to scale this mountain.
So what should the team itself look like? May I suggest that for every architect you would have several developers, and for every developer you would have two or more testers. Ideally three testers for each developer, regardless of how many developers are actually doing the work. I will also suggest that you keep the total count of developers rather lean. Five is perhaps overkill for the back end. For the front end, three solid developers can be an army, and five is about the upper limit. The reason is simple: logistics. If you have five developers on the back end and five on the front end, and three testers in the wake of each, this is a team of 40 people (10 developers plus 30 testers) - which quite frankly is overkill in any sense of the word. Not to say that an overall team might not be made up of 40 people once we include all of the infrastructure folks, but not for pure develop-and-test. We can and should make it leaner.
As an aside, I have had the rather disturbing experience, numerous times, of encountering folks who worked these things out with overblown spreadsheets that they normally used for application development estimations. A data warehouse gig is completely different from an application development gig. But of course, whenever one of these spreadsheet guys showed up and plugged in his metrics, he would spout off that we need a 30-person team to migrate a couple of tables from one machine to the new one. Once this number is in the air, it becomes the de facto standard by which all discussions are measured, even though it is completely wrong. In another setting, another spreadsheet-guy plugged in his numbers and characterized a project as a $900k gig when our competitors were bidding $300k for the same work. Knock yourself out, dude, because the client ain't a-bitin' on a project three times more expensive just because they like our faces. True to form, the $300k bid actually won. But the irony was that the potential client had no desire whatsoever to pay more than about $400k, so the bid fit their budget just fine. The eventual winner of the bid took a bath, however. The truth is always somewhere in the middle.
I still say, watch those spreadsheet-guys. Somethin' up with that.
Perhaps it goes without saying that an architect needs to lay out a framework so that all can work comfortably in the same sandbox. This is a challenge and should not be left to the developers to forge on their own. Harnessing it later will be impossible, because too many opportunities for flow-based consolidation will be lost. Workarounds and repetitive logic will become the rule. Let's not go there.
If we have, say, three solid developers each in the back end and front end, they can and should cross-test each other's work. In this case, we have the senior developers and architect working on the core logic, and the junior developers bench-checking their work and zipping it up for a formal testing team. Here we have a synergy: a senior developer can crank out ten times the work and quality of the juniors (so say DeMarco and Lister; your actual mileage may vary). We would not want to put a junior developer in the front seat of this chain, because the testers will be waiting on him more often than not. But with the ten-times-more-power driving the front like a locomotive, the junior developers can wrap up the many tactical areas of the warehouse and cross-test each other, while also receiving the work products from the senior developer.
Now think about what this kind of model means. The senior developer is literally force-feeding the pipeline with work products and is doing it with the highest quality the team has to offer. The junior developers are learning from the senior without injecting their own inexperience into the mix, which would invariably have to be reviewed by the senior developer anyhow. No, the senior developer is more productive and experienced, so let him/her drive. Seems like every senior developer I talk to really, really wants to develop and have the testing tedium off their plate. The junior developers really, really want to learn from a senior developer, and of course want to do some development themselves. I'm not saying this is off-limits, but the senior developer can delegate-what-he-knows to the junior developers because they cannot go too far astray without his guidance anyhow, so it's a win-win. And of course, I and every other person who was ever a junior developer had to pay our dues, so not everyone can be the leader. I don't say this dismissively, but we know in a business intelligence project there has to be a driving mind. Too much consensus means too little leadership, and in the famous words of Margaret Thatcher, "Consensus is the absence of leadership."
But for the people doing the testing, they need something that will scale. To billions of records. And it had better be solid on the first round or they will be playing catch-up for every round after that. While writing and proofreading a document is an eyeballs-only model, don't you think I could at least do myself a favor and run a spell-check and grammar-check on the contents? Such set-based operations resolve a world of problems and let me focus my eyeballs on the harder stuff. But in a data warehouse, our eyeballs will never have enough bandwidth, and will never scale to the necessary heights. Set-based testing is all we have, but it's also all we need. And with a Netezza machine, we're so in the zone.
Now, testing of the report screens can involve eyeball-based activities but doesn't have to be so egregious. Automated testing tools go a long way to mitigate the necessity for eyeballs on these as well (especially for the subjective parts like positioning, banners or colors). However, if the data is wrong, no amount of pretty-pretty will fix it. As Murphy would say, "Beauty is only skin deep, but ugly goes to the bone."
Now, no sooner will I write this than I will get feedback from those junior developers who say that they have been relegated, but not to fear. This particular article is in the context of a high-productivity bubble of work, normally found with new projects or migrations. The priority is not to make people feel better about their role, but to get past the workload so that everyone can feel better about the work products. I am always looking for opportunities to stretch the developers, both junior and senior. When a junior is ready to sit in the driver's seat of the locomotive, it's because he's passed the DeMarco and Lister smoke test. Now what the heck is that, anyhow?
Get a copy of DeMarco and Lister's Peopleware, a classic in every decade. Something they have empirically measured is that a junior developer will start out at one level of productivity, and then in a sudden epiphany will transform into someone who is ten times more productive than before. Something mentally and/or emotionally clicks and they get this whoosh. They claim the timing is different for each developer, but it usually takes about two years to make this transition. This is perhaps one reason why so many job-search requirement listings show "X years of experience in Y" and the "X" is never less than two years. Not because the poster has ever read Peopleware, but because we who are in the field want folks who are two years along, since we already know they have (at least) transitioned into a high-productivity asset.
But this is the mechanism driving the team makeup - and the experience of the developers and their known levels of productivity should help us find the right role for them on the team. We don't want a low-productivity person in the locomotive chair. But having one in the wake of a strong developer only makes them stronger and exposes them to practices that will accelerate their transition into the higher-productivity person we always wanted anyhow. And then, of course, once the person has made the 10x transition and is aware of their value, we have another problem: They are aware of their salary level too! Making someone stronger makes them more valuable. Be prepared to recognize the value (or rest assured that your competitor will). But all this is the nature of the beast we purport to tame, no?
Back to set-based testing. This has as much to do with using the right data as it does the right method. The right data means - select a subset of known data that will deliberately exercise all of your business rules and software paths. Nothing is worse than discovering such errors in production. Then, we need set-based testing methods. This means we need three primary assets: (a) source data that we can sluice through our application transforms to get a result, (b) a saved baseline result to compare this recent result against, and (c) tested components that compare these two results in a reliable manner, so that we get a statistical report on what passed and what didn't, and a detailed report on which specific records didn't make the cut. Counts, amounts, checksums and summaries all reveal deviations, especially for regression testing. You might recognize this as an exception report, and this is exactly the spirit of the effort. Our testing has to deal with statistical exceptions, because it is the only practical and scalable way to validate billions of rows.
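To make the three assets a bit more concrete, here is a minimal sketch in Python. It uses the standard-library sqlite3 module purely as a stand-in so the example runs anywhere; it is not Netezza, and the table names (baseline_result, current_result), the columns and the helper compare_to_baseline are all hypothetical names invented for illustration. The idea is simply that counts and amount totals give the statistical report, and a pair of EXCEPT queries gives the detailed exception report.

```python
import sqlite3

def compare_to_baseline(conn, baseline, current, amount_col):
    """Compare two same-shaped tables using set-based SQL only.

    Table and column names are interpolated directly, so they must be
    trusted identifiers (hypothetical names in this sketch).
    """
    cur = conn.cursor()
    summary = {}
    # Statistical report: row counts and amount totals for each side
    for label, table in (("baseline", baseline), ("current", current)):
        rows, amount = cur.execute(
            f"SELECT COUNT(*), COALESCE(SUM({amount_col}), 0) FROM {table}"
        ).fetchone()
        summary[label] = {"rows": rows, "amount": amount}
    # Exception report: baseline rows the new run failed to reproduce
    missing = cur.execute(
        f"SELECT * FROM {baseline} EXCEPT SELECT * FROM {current}"
    ).fetchall()
    # Exception report: rows in the new run that the baseline lacks
    unexpected = cur.execute(
        f"SELECT * FROM {current} EXCEPT SELECT * FROM {baseline}"
    ).fetchall()
    summary["passed"] = not missing and not unexpected
    return summary, sorted(missing), sorted(unexpected)

# Tiny made-up data set: one amount was transformed incorrectly (250 -> 205)
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE baseline_result (order_id INTEGER, customer TEXT, amount INTEGER);
CREATE TABLE current_result  (order_id INTEGER, customer TEXT, amount INTEGER);
INSERT INTO baseline_result VALUES (1, 'acme', 100), (2, 'globex', 250), (3, 'initech', 75);
INSERT INTO current_result  VALUES (1, 'acme', 100), (2, 'globex', 205), (3, 'initech', 75);
""")

summary, missing, unexpected = compare_to_baseline(
    conn, "baseline_result", "current_result", "amount"
)
print(summary)
print("missing from current run:", missing)
print("unexpected in current run:", unexpected)
```

Note that the row counts match here while the amount totals do not, which is why counts alone are never enough. One caveat with this sketch: EXCEPT works on distinct rows, so exact duplicate rows would only show up in the count comparison, not in the exception lists.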
And also notice that such a practice would be the kiss-of-death in many other "secondhand" data warehouse platforms. Such platforms are in no wise optimized to compare monstrous sets of data to each other column-for-column, row-for-row. Queries like that can dim-the-lights and may not return for hours, if not days. We cannot afford a protracted testing phase, and with Netezza we don't have to. Scan times and comparison times are very objective and knowable. The tests will take the same amount of time each time they run, and we always have the option to optimize them further with the Netezza performance model. Power is in the physics.
And again, why all the focus on testing? I have seen data warehouses blind-side an organization that planned for 80 percent development and 20 percent testing when, more often than not, exactly the opposite is true. Under that (wrong) model, a two-month development effort would look like at most a three-month project. Why then does it metastasize into a ten-month effort? Because if those two months of development are really only 20 percent of the work, the other 80 percent (eight months) is testing.
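The arithmetic is easy to check. The sketch below is only an illustration of the two assumptions, and the function name total_months is made up for the example: it takes the development time and the share of the project that development is assumed to represent, and returns the total project length.

```python
def total_months(dev_months, dev_percent):
    """Total project length if development is dev_percent of the whole."""
    return dev_months * 100 / dev_percent

dev_months = 2

# The plan: development is 80 percent of the project
planned = total_months(dev_months, 80)
# The reality described above: development is only 20 percent
actual = total_months(dev_months, 20)

print("planned total:", planned, "months; testing:", planned - dev_months)
print("actual total:", actual, "months; testing:", actual - dev_months)
```

Same two months of development, but the testing tail grows from half a month to eight months once the percentages are flipped.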
That is, if we just embrace the standard model. By embracing the aforementioned model instead, we get the development out of the way quickly and deliberately, entering the testing phase much sooner, and if more heads are deliberately dedicated to set-based testing we can close this part off even sooner. I have watched very-large-scale projects, with a Netezza team in the middle and strong developers in the locomotive seat, enter their first UAT phase within two months of the project's inception. The funny thing is, the model requires rapid turnaround that only the Netezza workhorse can provide. Try pulling off this team makeup with any other lower-productivity technology, and it won't sing in the same key. A high-productivity developer is meaningless on low-productivity technology. And high-productivity testing methods are useless if enslaved to a low-productivity technology.
Start it, shape it, ship it. Netezza is the ticket home.








