Blog Posts


What's heating up about as fast as Summer here in Texas, is the excitement over the upcoming EnZee World Tour.

 

I am especially excited this year because I've been tapped to host/emcee the Best Practices sessions in each of the cities, which means that I'll get a front-row seat to hear how the masters of the technology ply their trade and make the Netezza machine sing.

 

After all my fellow Enzees - you are the ones gathered 'round the grill and the ones who make-it-happen. Others of us are often in awe of the rather inspired means and outcomes you so deftly deploy with the technology, and of how you integrate it with the technologies around you.

 

Of all the questions I hear at a customer site on the basic workin's of the machine, there's nothing like sharing war stories with people who pull all those things together and instantiate an operational environment. Especially when you do it by utterly eclipsing the performance of Netezza's displaced predecessor. And here's where we really want to hear the down-low on how things used-to-be versus how-things are.

 

In many cases, I hear that you had an easy time of bringing in the box and making it go. But making the technology go wasn't nearly as difficult as bringing-in-the-box - especially if you have to wheel it past the sneering eyes of doubters or political players who want to see it fail, or at least see it be not-so-wildly successful as the current expectations might dictate.

 

But Netezza really does meet those lofty expectations, doesn't it? And one of the stories we all love to hear is that type of victory - the dark horse so to speak - championing the cause amidst the pressure of anything-but-technology. The odd thing about new, better technologies is that they are so much better than old technologies that the older technologists cannot believe their own ears. Orders-of-magnitude more power you say? Tish tosh, you must be mad.

 

So when we get into best practice sessions, we speak of things like scanning a terabyte, or 2 or 10, and complain that our query can't seem to cross the X-number-of-seconds boundary. Seconds, mind you. And people hear this and wonder what the complaint really is - after all we can't be working with real data because terabyte-sized table queries always take hours to run, or hadn't you heard this?

 

I recall sitting in on a session with a bunch of people who honestly had money-to-burn. One of them complained that they could not get up to New York often enough, and every time they went their favorite restaurant/play/whatever seemed to be oversold. One of them complained about a broken drawer in his private jet, while another complained about the drafty interior of one of his summer homes. Still another said that they had spent 150k on custom teak wood in their 140-foot sailboat, and had it all ripped out and replaced because it "didn't look right". Ahh, money to burn. People with a completely different list of priorities than the average Joe like me.

 

I say this for contrast, because the things we speak of as Enzees, with the power available at our fingertips in the machine, are utterly foreign to people who have never experienced the power themselves. And it's interesting in best-practice space when we talk about squeezing 9-hour processes into 9 minutes, and then hear our business counterparts wonder if we could squeeze out just a few more. A best-practice balancing act is getting to the solution without over-engineering, and some of you consider this an art form.

 

So Enzees, Artists and those who would kick-the-tires, gather 'round the grill and let's fire up those steaks, veggies and what-have-you - then the only thing hotter than Summer will be the ideas coming off the cooker.

0 Comments Permalink

Just when I thought I would stop moaning about Oracle, and avoid accusations of paranoia, I learned that back in my hometown of England, Oracle has been bragging about how since they announced Exadata, Netezza haven't closed a single deal. Apart from the fact that this simply isn't true, it's an odd claim for a vendor 100 times our size and one known for making bold claims. How about a bit of bragging about how many Exadata deals Oracle has closed? Isn't that the yardstick for success?

Oracle has been in the data management business some 25 years longer than Netezza. Based on understanding their own clients’ business strategies, Oracle has advised thousands of companies on their IT strategy and - in numerous cases - this advice has even been paid for. So any notion that Oracle didn't see the explosion of data and analytics looming on the horizon is ridiculous. Quite frankly, if Oracle had done their job properly, Netezza would have never existed in the first place!

So here's the crunch. Given Oracle is a trusted IT advisor with the inside track, reputation and relationship with tens of thousands of companies, Exadata's progress in the market since September 2008 has been appalling. And let's not forget that with 100 times the muscle of Netezza other benefits come to play -- like brand recognition, marketing and sales coverage, misleading ads in the Wall Street Journal, partner leverage and so on.

Rather than crowing about their success, they should be blushing with sheer embarrassment.

With such a poor track record, one can only conclude that Exadata can't be very good. Think about it: if a little start-up, like Dataupia or Greenplum, had released it no one would be taking it seriously. This suggests that the only thing of any value with this machine is the little plastic button on the front that says "Oracle"!

I'm still sleeping at night.

0 Comments Permalink

Wonder Why Wonder How!!!!!!!!!!!!!!

 

I am Raj Guthikonda. I've been a Chief Architect in the Oracle BI & data warehouse arena.

 

This blog post is my initial musing in the Netezza arena, meant to provoke some thought.

 

I wonder why data has so many silos, and how we can overcome these silos.

 

Business data is an invaluable commodity in today's fast-changing analytic world, where we figure out how to beat our customers and how we can leverage the data that a business has.

 

I have recently been associated with one of the biggest financial services mergers in the mid-Atlantic region, in the greater Pittsburgh area.

 

This client of mine was using an analytics tool / application coming from a formidable provider who has the chunk of the market share,

 

but we lacked the basic adapters to bring in the data from various types of data sources such as VSAM & IDM.

 

 

I hope the next level of best-of-breed business intelligence database server or tool will have all these great features, so that we can overcome the application silos and build our enterprise-wide data warehouses in an efficient way.

 

 


The next breed should have great analytics capabilities and great user interfaces, and should also be usable for data integration and data quality assessments.

0 Comments Permalink

Some enterprises will stand up a Netezza machine and point all their data processing towards it. They wouldn't think of actually installing anything on the Netezza machine (such as database clients or other client software) and, of course, are strongly advised against doing so by the vendor. Why is this? The Netezza host has a lot of work to do in keeping those spinning SPUs happy and busy. Adding other duties can detract from this critical mission, and we don't want that.

 

But we can also abuse the host in subtle ways. A case in point follows - you may have other tales to tell.

 

We always have a need to pull in a wide variety of files. In this particular case, there were dozens of intake tables in their various staging locations. In many installations, the intake table definitions are few, discrete and stable. But in just as many, the staging tables will mirror the upstream sources, with one table for each upstream interface. In our case, we were handling source-to-target with no ETL in between. We extract directly from the source into an intake table definition that mimics source column names, but the data types are all varchar to facilitate "dirty" intake. The objective is to get the data into the machine.

 

Then we convert this intake table to its final form, the internal Netezza table that is identical to the source table in column name and type. This conversion is a simple table copy, mechanically speaking, but we have to do some light ELT to make it happen. For example, we need to guard against nulls, empty strings, bogus numeric values and the like. In our case, numerics could be dozens of characters in width because the upstream definition happened to be a view with no defined precision. A typical intake SQL could look like:

 

select

case when column is null then value else column end,
case when translate(column, '-+.0123456789', '') = '' then column else null end,

etc

 

Such that each column is wrapped with this kind of logic (call it "Intake ELT"). Now, we don't manually wrap these column defs, we do it dynamically from the Netezza catalog definition. (And for efficiency, we cache it for later reuse, but that's another story).
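
To make the "wrap each column dynamically" idea concrete, here is a minimal sketch in Python that builds the Intake ELT statement from a list of column definitions. It is only an illustration under assumptions: the function names, the table names and the hand-supplied column list are all hypothetical, and a real version would read the column names and types from the Netezza catalog rather than from a literal list. The numeric check is the same coarse translate test shown above, nothing stronger.

```python
# Sketch only: column metadata is supplied by hand here. In practice it would
# come from the Netezza catalog, which this example does not query.
NUMERIC_CHARS = "-+.0123456789"


def wrap_column(name, kind, default="''"):
    """Return the Intake ELT expression for one column (hypothetical helper)."""
    if kind == "numeric":
        # Keep the value only if nothing is left once numeric characters are stripped.
        return (f"case when translate({name}, '{NUMERIC_CHARS}', '') = '' "
                f"then {name} else null end as {name}")
    # Everything else: guard against nulls by substituting a default value.
    return f"case when {name} is null then {default} else {name} end as {name}"


def build_intake_elt(intake_table, staging_table, columns):
    """Build one insert-select; columns is a list of (name, kind) pairs in table order."""
    select_list = ",\n  ".join(wrap_column(name, kind) for name, kind in columns)
    return (f"insert into {staging_table}\nselect\n  {select_list}\n"
            f"from {intake_table};")


if __name__ == "__main__":
    # Hypothetical table and column names, for illustration only.
    print(build_intake_elt("intake_orders", "stage_orders",
                           [("customer_name", "text"), ("amount", "numeric")]))
```

Because the statement is generated text, caching it for reuse is simply a matter of saving the returned string.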

 

Now we have an intake-ELT that looks thus:


External Database Table -> network -> intake table ->  Intake ELT -> Staging Table

 

Note for clarity - the External Database Table and Staging Table are "book ends" to this operation, and have the same column names, data types and column order. We don't absolutely require common column ordering, but it's handy for troubleshooting.

 

Note also that this works just as well for flat file intake as database intake. Better, in fact, because we can more easily load multiple files at once than multiple tables at once (the database might not like multiple extracts).


All of this worked swimmingly until we encountered a slightly different kind of data feed, one that had to be extracted from an archival source into flat files. Rather than present the flat file as normal (on the network) the admins decided to use the available on-board Netezza storage pad (5 TB of space). Keep in mind that we were not allowed to execute anything directly on the machine, so we had to set up External Tables on top of these files to load them, rather than using NZLOAD. This, too, worked transparently and all was well. Then a "bright idea" occurred, that in the above equation the Intake ELT faced a table (our intake table) and couldn't we just use the intake ELT right on top of the External Table, eliminating the additional middle-man?

 

Like so:


flat File -> External Table -> Intake ELT -> Staging Table


The above configuration only appears more efficient by eliminating the Intake table. Looks are quite deceiving, considering how much "per-column work" the Intake ELT had to perform to get data into the Staging Table. What is not obvious is that the Intake ELT is now sitting on top of the External Table, which is a Host-managed table, not a SPU-managed table. In this configuration, we have reduced our power from a 108-SPU problem to a 4-(Host) CPU problem. The immediate loss of power was measurable in orders of magnitude.

 

So under the covers, here's the power-plant difference in the two models:

 

External Database Table -> network -> intake table ->  Intake ELT -> Staging Table
                          |----HOST -------------|---SPUs--------------------------------|

 

flat File -> External Table -> Intake ELT -> Staging Table
             |------HOST ------------------------------|


So we can see that the second model is abusing the host with the Intake ELT, and if we go with the original model, the ELT will be handled by the SPUs, offering the necessary scalability and power. In a continuum, we can see where we might initially install nzload or external tables and perhaps "tweak" them along the way. Then a maintenance developer comes along and sees that the "easiest" place to add a fix is in the external table or the nzload rather than pushing it to SPUs. The external table and nzload can (and should) do light-intake formatting per their interface specifications, but no further.
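
The directive can be sketched as two separate statements: a plain copy off the external table, then the Intake ELT against a SPU-managed table. The Python below only assembles the SQL text. The function name and all table names are hypothetical, and the select list is assumed to have been generated elsewhere.

```python
def two_step_load(external_table, intake_table, staging_table, elt_select_list):
    """Return the two statements of the original model, in the order they should run."""
    # Step 1: straight copy off the host-managed external table, with no per-column logic.
    land = f"insert into {intake_table} select * from {external_table};"
    # Step 2: the per-column Intake ELT now reads a SPU-managed table instead.
    transform = (f"insert into {staging_table} "
                 f"select {elt_select_list} from {intake_table};")
    return [land, transform]


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    for statement in two_step_load("ext_orders", "intake_orders", "stage_orders",
                                   "case when amount is null then '0' else amount end"):
        print(statement)
```

The point of keeping the statements separate is that the expensive per-column work never appears in the statement that touches the external table.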

 

The over-arching directive remains the same - get the data into the SPU-based tables as rapidly as possible and then do the "dirty-work" with massively parallel power.

0 Comments Permalink

The DB Lytix advanced analytics library that Fuzzy Logix announced yesterday is now on the NDN applications page.  Check it and other NDN applications out at:  http://www.netezza.com/data-warehouse-appliance-products/data-warehouse-applications.aspx.

0 Comments Permalink

Last week I wrote about some of the new work that I saw coming out of Fuzzy Logix and their partner Aperity.  It turns out that the Aperity deal is part of some bigger news from the guys at FL.  Today at the Netezza Midwest User Group meeting in Columbus, Fuzzy Logix announced the commercial release of their in-database analytics product called DB Lytix.  For the full background, you can see a new page on FL’s web site (http://www.fuzzyl.com/in-database_analytics.php).

Over the past year Fuzzy Logix has worked closely with a number of Netezza customers in a model that, to date, has been very consultative and solution-focused.  Some excellent case studies have come from their work, and we continue to get a significant amount of interest from the field in Fuzzy Logix’s approach.  From what’s in the online description, DB Lytix is a packaging and extension of many of the in-database algorithms that we’ve seen FL leverage in their engagements. 

The difference with this announcement is that Netezza users, application developers, and partners will now have access to FL’s foundational advanced analytics library to develop and deploy their own applications.  This is a clear move up the technology stack, and will empower a much larger audience to take advantage of OnStream.  And on a broader note, the release of a package like DB Lytix is an indication that the Netezza Developer Network is evolving as we hoped it would.  Our charter with the Netezza Developer Network is to put the OnStream technology directly in as many hands as possible and enable partners to run with it.

So far there’s no news on how Fuzzy Logix will be licensing DB Lytix and making it available to the Netezza community.  I’ll be speaking with the FL crew later today, and will try to get more news out to you in the coming days.  Stay tuned in the meantime…

0 Comments Permalink

Ahh, the theme of so many horror stories, where the heroes plod along life until met with something they don't understand. Like a member of the Borg Collective, Agent Smith from the bowels of the Matrix, or Tom Cruise, cruisin' along in a dead-end life before the Tripods from Mars rip a hole in his, er, reality.

 

Got a call last week from a buddy of mine who's in an all-RDBMS shop. And powerful, too.  Some of their SMP machines have forty and fifty-plus processors on them. They do heavy-duty processing, don'tcha know, and have no need for any technologies newer than what they already have. The reason this call was so strange, was that we'd just caught up not six months prior, at which time I reveled on about the Netezza technology. He didn't have anything to say about it, no opinion as to its benefit or purpose. This call was different.

 

Like many conversations about this, following is a composite of several, but I'll use my buddy as a springboard, because he's a good sport.

 

"David, need your help," he said, a tinge of urgency in his voice.
"Shoot."
"Some people here are talking about bringing in Netezza for a test drive," he tells me.
"Good for them. You - "
"Stop there," he pushed back, "Let me tell you what they want to do. They want to replace our primary data marts with this stuff."
"It's good for that," I said, "It's purpose-built for high-speed reporting."
"But our reports run fine," he asserted, but didn't really sound convinced.
"All that mess you went through last year," I reminded, "When you needed to add more dimensionality and had to take your entire schema back-to-formula?"
"So?" he said, "No technology is immune from that."
"Restructuring your entire indexing strategy to get better performance, and all that denormalization and renormalization to balance the workload?"
"Necessary evil," he asserted.
"Evil, true, but necessary is a function of the chosen technology."
"What, you mean Netezza can just keep assimilating information without ever having to refactor the indexing? Who are you kidding?"
"Not kidding at all," I said calmly, "Netezza has no index structures, so there's nothing to manage."
Silence.
"You still there?"
"I'll have to call you back."
"Okey-dokey - " but that's all I could summon before the line went dead.
Fifteen minutes later he called back, out of breath, "Okay, tell me more about this no-indexing thing."
"Just that, no indexes. Netezza doesn't need 'em or use 'em."
"Then it's slow as molasses, and not a threat."
"Don't kid yourself."
"I know data, dude. Don't try to - "
"It's not about the data. It's about the hardware. Netezza embraced the truth that power is found in the hardware, and bulk data processing needs access to a lot of it."
"We have a lot of hardware too -"
"But it's configured for general purpose processing, and I'll bet you don't do any of it inside the database."
"Well, no, that would be insane. We do the bulk processing with an ETL tool. It's just faster."
"Have you ever considered why it's so much faster? Or why the rise of the ETL tools? People generally agree we can't do bulk processing inside the SMP machine, because it's not built for it. It will pull data in quantity off the disk drives, process it in quantity and push it back, and the data is meeting itself coming and going on the SMP's backplane."
"So? How does that change anything?"
"Netezza processes data down in the parallel SPUs - data doesn't leave the disk at all, and if the database needs to process data, all of the CPUs handle their own little section of it. That's why you don't need indexes, because an SMP/RDBMS sees the data as a single logical table with monolithic physical data, where Netezza sees a single logical table on hundreds of physical drives."
"I'm not following."
"Okay, when a general marshals troops, does he give specific commands to each troop member, or does he formulate a plan and delegate it to the masses?"
"That's obvious."
"Because it works. It's the only way to manage physical scale. With multiple actors who are incapable of completing the mission alone, we need synergy."
"Or Jack Bauer."
"Not even Jack Bauer could - "
"Hey, you're bad-mouthin' Jack now!" he said playfully.
"Jack's good," I said, "But not that good. Imagine what he could do with a thousand Chloe's back at the ranch?"
"I see your point, but if this is just for reporting stuff, I don't really have much to worry about."
"For now."
"What do you mean, for now - that sounds sinister, like the Pod-People leader from Invasion of the - "
"Data Mart Snatchers?" I laughed, "Didn't mean to sound mysterious, but there's more to this."
"Oh?"
"Well, once the mart is in place and operating, someone will notice a pattern of activity. It goes like this: We spend hours processing the data to get it ready for Netezza, and then load it in seconds. The box sits there idle for the next few hours until the users start pounding it, then it goes idle for some protracted cycle until the few seconds it requires to load the next day's data. Something is amiss, because now the slow point is the ETL environment."
"Our ETL environment is state-of-the-art," he said, "We push data like crazy through it."
"And I'll bet if you examine the larger part of the load, you will find that it spends most of its cycles in joining or summarizing, even sorting. Those operations take a lot of hardcore CPU power."
"They always have."
"But what if you loaded the raw data from your ETL tool into Netezza? You said yourself it only takes seconds, right?"
"But then where will we integrate it?"
"Massively parallel joins and rollups inside a Netezza machine are orders of magnitude faster than your ETL."
Silence.
"You there?"
"No, I see what you mean," he said, "Our sixteen-way ETL machine cannot even theoretically compete with a 108-processor Netezza machine."
"And the 108-processor version is the development box."
"Thanks for that."
"Seriously, people will take a look at the 'T' in the ETL, and experiment by dividing it into row-level activities and bulk inter-row activities. They will keep the row-level stuff in the ETL and move the larger-scale transforms into the Netezza machine."
"Okay."
"Assimilating the data mart and the larger scale bulk processing into a single platform."
"Hmm, that would be troubling."
"Why is that?" I smiled.
"Because then our high-powered and expensive ETL tool is relegated to nothing more than row-level scrubbing and data transport."
"Want another kick in the pants?"
"May as well."
"You can do a lot of that scrubbing in regular SQL once the data is inside the box. In massively parallel form."
(Sigh)
"Meaning that your ETL tool is now nothing more than a raw data transport mechanism."
"We have one of those already," he sighed again, then laughed, "It's called 'scp'."
"At this point you've moved a lot of processing "under air" as they say."
"As who says?"
"Netezza of course."
"You mean, they assimilate all this stuff by design?"
"Resistance is useless. Close your eyes and join the collective."
"Aaaggghh!"

0 Comments Permalink

Everybody talks about having a partner ecosystem, but usually that’s just marketing speak for the typical us-and-them model of partner programs. One of my goals since taking over leadership of the Netezza Developer Network has been to encourage more peer-to-peer interaction among Netezza partners. We offer a lot of benefits to individual partners via our partner programs such as the NDN, but I’ve often wondered what sort of innovation might happen if we dumped the traditional hub-and-spoke partnering mindset and found a way of managing a partner program more like a small world network. Sure, at Netezza’s annual user conference we get our partners together in the conference hall for networking, but the real question is what happens during the rest of the year.

I just had a meeting that challenged my thoughts and proved that there are some really interesting initiatives bubbling up from our partner community. I had a meeting scheduled with one of our key NDN partners (Fuzzy Logix- http://www.fuzzyl.com) to review the latest and greatest advanced analytics functions that they’ve been developing in OnStream. Fuzzy Logix often pushes the envelope with the Netezza platform, so I was looking forward to seeing some neat stuff. I stepped into the meeting and noticed that there was a new guy in the room who turned out to be John Madalon of Aperity (https://www.aperity.com). Aperity is another Netezza partner who offers a hosted brand management solution for CPG. It’s not an OnStream application that falls under NDN, but it does run on Netezza as a complementary technology.

It turns out that Aperity and Fuzzy Logix have partnered to extend Aperity’s brand management solution with Fuzzy Logix’s OnStream-enabled technology for forecasting. The demo they walked me through was focused on the beverages industry. By embedding the forecasting analytics in-database, they were able to calculate extremely accurate (north of 95%) forecasts on the fly based on very granular, high volume historical data. I was surprised by what I saw—it was a cutting edge application of Netezza’s OnStream technology, and it happened without any direct technical support from Netezza. The Aperity / Fuzzy Logix product partnership is exactly the sort of direct peer-to-peer collaboration that we want to see come from the NDN. I was also very intrigued by the forecasting technology implemented by Fuzzy Logix. When I asked what it was, the answer I got from Partha Sen (CEO of Fuzzy Logix) was a smile and, “You’ll see…”. Normally when people are coy it means that something is up, so I wouldn’t be surprised if we see some additional product announcements coming out of Aperity and Fuzzy Logix in the coming months.

So… rest assured that good things are bubbling up from Netezza’s partner ecosystem…

0 Comments Permalink

April 1st is the one day every year when companies and individuals alike can throw their integrity to the wind for 24 straight hours - all in the name of April Fools' Day. I tend to think companies that have fun with the faux holiday show that they're more than just the lunky depiction of Corporate America. These companies are generally flexible, at least somewhat creative, and they obviously don't take themselves too seriously - all good traits I'd argue!


Anyway, here are some clever ones I've stumbled upon so far for 2009. If you have more - please share...don't hold back!


o     Amazon’s new “blimp-based” cloud computing: http://aws.typepad.com/aws/2009/03/up-up-and-away-cloud-computing-reaches-for-the-sky.html

o     YouTube turns itself upside-down: http://www.youtube.com/t/new_viewing_experience

o     Google debuts CADIE – an AI “tasked array” based system: http://www.google.com/intl/en/landing/cadie/index.html

o     Kodak’s eyeCamera: http://www.kodak.com/eknec/PageQuerier.jhtml?pq-path=2/6868&pq-locale=en_US&_requestid=355

o     CERN’s rather large “oopsie!”: http://www.thetechherald.com/article.php/200914/3354/CERN-admits-black-hole-ripped-in-space-by-Large-Hadron-Collider

o     From “ThinkGeek”: http://www.thinkgeek.com/ (gotta love the “Squeez Bacon” product!)

o     And last, but certainly not least, did anyone catch Netezza's website today? The evidence is all gone now from what I can tell...



0 Comments Permalink

As you might have noticed, the title of this blog recently changed to “Partners in Play”.  The new name is a play on words in a handful of ways.  I’ll leave it up to you to guess the full list, but as you can see the most obvious is to communicate that partnering with Netezza is oftentimes a creative, and even playful, experience.  Based on the strength of our core product architecture, Netezza’s partners often come up with interesting, unexpected, and sometimes even surprising ways to leverage the technology. 

One of the programs that we’ve established to support this creative interaction with our partners is the Netezza Developer Network (NDN).  Established in the fall of 2007 at our annual user’s conference, the NDN was formed as a technical community to encourage in-database analytics using Netezza’s OnStream development platform. 

At the time, OnStream was a revolution in the way that partners and customers could interact with the platform.  For the first time ever, it was possible not only to carve up and push queries down next to data, but also to leverage the same paradigm to drive key application logic down to the data.  OnStream opened up many new possibilities for Netezza’s partners, and it required partners to think in fundamentally different ways about what was possible to achieve with a data warehouse.  The NDN was founded as a sort of sandbox in which partners could initially play and experiment with the OnStream technology, and ultimately push early concepts to the market as complete solution offerings.

The NDN is many things to many people. On the surface, it is a technical program to promote and support OnStream development by any and all users of Netezza.  For commercial partners developing OnStream applications, the NDN is also a business program that facilitates the commercialization of third party applications.  But at its core, the NDN is a hugely diverse community that is home to a remarkable variety of development activity.  I sometimes describe it like the cantina scene from Star Wars— every strange and wonderful type of project that you can imagine has taken place in the NDN sandbox, with more being undertaken every month.

Now that the early days of OnStream NDN are behind us, the temptation would be to transition to a more serious and less experimental tone for the program.  But that’s not the Netezza way.  Our company culture and user community will always see us keep at least one foot in the playful side of our partnership programs.  You can expect that the NDN will mature and evolve (in fact, I’ll plant a teaser here that you can expect some important announcements about the program over the summer…), but it will always include a sandbox for partners in play.  Stay tuned for more details in the coming months…

0 Comments Permalink

One of the most significant challenges of ELT-based processing is the need for housekeeping infrastructure. I mean, we will find ourselves needing temporary tables, and that's okay. But we'll also need persistent temporary tables - that is - tables we create as processing resources in context of a given set of operations, that we might keep coming back to. Or that we might want for troubleshooting. We have to admit, a truly temporary table that evaporates at the end of the session is handy for housekeeping but lousy for troubleshooting. In many ways, the "temp table" should be a means to organize our immediate thoughts, like de-duping a resource list or whatnot. Utility-stuff makes them handy. But when we need to debug intermediate results, we need a persistent table. And then we need to ditch it, with some rules.

 

If we run things in shell-script, whether inside the Netezza Linux host or on a companion Linux box, we'll almost always need temporary files as well. Once we create these, we need a way to get rid of them. And upon creating any of these resources, we need a systematic way to keep them safe, meaning a completely unique way of identifying them. For temporary files, we have many options to timestamp their file names, but for a database table this can be somewhat daunting, considering that we could litter our database with lots of orphaned tables in no time flat. The last thing an operator wants to do is go into a littered database and clean out the trash.
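
One way to get that "completely unique" identification is to build every resource name from a common prefix, the job name, a timestamp and the process id. The sketch below assumes exactly that convention. The function names and the prefix are hypothetical, and nothing here is specific to Netezza beyond restricting the name to letters, digits and underscores.

```python
import os
import re
from datetime import datetime, timezone


def unique_name(prefix, job, now=None, pid=None):
    """Build a name that is safe for a file or a table (hypothetical convention)."""
    now = now or datetime.now(timezone.utc)
    pid = os.getpid() if pid is None else pid
    stamp = now.strftime("%Y%m%d_%H%M%S")
    raw = f"{prefix}_{job}_{stamp}_{pid}"
    # Replace anything other than letters, digits and underscores.
    return re.sub(r"[^A-Za-z0-9_]", "_", raw).lower()


def is_framework_resource(name, prefix):
    """True when a name carries our prefix, so an operator can spot orphans at a glance."""
    return name.lower().startswith(prefix.lower() + "_")


if __name__ == "__main__":
    name = unique_name("tmp", "load-orders")
    print(name, is_framework_resource(name, "tmp"))
```

The shared prefix is the useful part for operators: anything that starts with it is known to be disposable.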

 

The latest release of the Netezza environment provides for stored procedures, which will formalize an invitation (for some) to black-box everything inside the black box. Do not fall into this trap. The black box is only colored that way, it is not intended to be used that way. In the end, we might have componentized stored procedures that do a lot of handy stuff for us, but if their activities are too hidden, we'll hear the same complaints from operators as we hear now when someone violates Rule #10 and performs bulk-data-processing inside their RDBMS, usually with a stored proc. Don't go there. Use the stored procs in peace, but be kind to your operators. Then they won't call you at midnight for answers.

 

So it goes with any operational scenario, but if we embrace some simple housekeeping hooks in a frameworked sense, we can avoid all these woes and pitfalls. The clear objective is to get more business functionality and solutions "under air" - that is - let the black box do what it does best - crunch and munch the data at high volumes.

 

A framework is a systematic way to start up a process, provide common resources for the process, allow the process to consume and leverage the resources as a foundation, and, when the process completes, tear down and toss the resources. This allows a given process to simply request a resource from the framework, with the expectation of delivery of course, without having to worry about giving it back, tearing it down or anything else in a housekeeping sense.
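
As a rough illustration of that request-and-forget contract, here is a small Python sketch. It assumes the caller hands the framework a function that executes one SQL statement; the class, its method names and the table names are hypothetical and stand in for whatever the real framework provides.

```python
from contextlib import contextmanager


class Framework:
    """Hands out working tables and remembers how to remove them (hypothetical)."""

    def __init__(self, run_sql):
        self.run_sql = run_sql      # callable that executes one SQL statement
        self._teardown = []         # drop statements, recorded as resources are created

    def request_table(self, name, definition):
        self.run_sql(f"create table {name} ({definition});")
        self._teardown.append(f"drop table {name};")
        return name

    def close(self):
        # Tear down in reverse order of creation.
        while self._teardown:
            self.run_sql(self._teardown.pop())


@contextmanager
def framework(run_sql):
    fw = Framework(run_sql)
    try:
        yield fw
    finally:
        fw.close()


if __name__ == "__main__":
    log = []                        # stand-in for a real database connection
    with framework(log.append) as fw:
        work = fw.request_table("work_dedupe", "id integer")
        log.append(f"insert into {work} select 1;")
    print("\n".join(log))
```

The process only ever asks for a table. Whether teardown should be skipped after a failure, so the intermediate tables survive for troubleshooting, is a policy choice this sketch leaves out.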

 

However, just to keep the operators happy, there is yet another practical means to avoid this headache, and this is to provide more than one database to support the data flow. Many of you are aware of the simplest form of this, where we have an intake database, a workspace database (for transforms) and a target database such as a repository or mart.

 

Staging -> Transforms -> Reposit

 

What this means is that we can intake data from any number of sources and assume that the information is in an unknown or dirty state. We can then apply large-scale transformation (joins, rollups, etc) to the data in massively parallel form. Once completely done, we commit our work to the target with simple table copies. We can afford a table copy in Netezza because of its ability to move data rapidly at the SPU level. Given the right distribution, copying data is a minor penalty when we see what we get back for it:

 

The ability to control the transformation process in a safe zone so that if it fails or corrupts, it is never committed to the repository. We use this technique in ETL all the time, pull data from a source, transform it and prepare it for insert, then commit it. We need to embrace this for ELT as well, because we need to protect the target from corruption with intermediate results from an incomplete operation.


Once we assume the need for such a backbone, we don't really need a lot of high-functionality to support it. After all, if shell-script can help this along, I suppose any of you could provide a frameworked model in Perl, Java or .NET. But we might not want to launch an entire development environment just to support these simple activities. They seem simple because they are. Shell script is good because it is inherently a control language and does not tempt us to do a lot of programming. Since we don't need a lot of programming, this is good.

 

When we launch an ELT framework, we need a number of different items to provide context. One is a reference to the aforementioned databases. Understand also that if we have a database that is for our personal development use, it can behave as all three. Once we move toward integration, we would break apart the references and make sure nothing else breaks in the process.
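As a rough sketch of that context, a startup function might resolve the three database references and let a single development database stand in for all of them. The names here (set_elt_context, DEV_DB, STAGING_DB, TRANSFORM_DB, TARGET_DB, my_dev) are illustrative assumptions, not part of any Netezza tooling.

```shell
#!/bin/bash
# Hypothetical framework context; every name below is an illustrative assumption.
# In personal development, point DEV_DB at one database and it plays all three roles.
# In integration, export STAGING_DB, TRANSFORM_DB and TARGET_DB separately instead.
set_elt_context() {
    local fallback="${DEV_DB:-my_dev}"
    STAGING_DB="${STAGING_DB:-$fallback}"
    TRANSFORM_DB="${TRANSFORM_DB:-$fallback}"
    TARGET_DB="${TARGET_DB:-$fallback}"
    export STAGING_DB TRANSFORM_DB TARGET_DB
}
```

Breaking the references apart for integration then means exporting three different values before the framework starts, with no change to the ELT scripts themselves.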

 

We'll need a holding place for end-of-run teardown, such as a teardown file that we execute at the end of the run. As we create each resource, we simply write the command that tears it down into this file. So if nothing is created, there is nothing to tear down. At the end, we just execute the teardown file and it does the trick.

 

Now we'll need a way to create assets both as a local resource and as a more visible one. In building a local resource, we might create-table, or create-view or create-synonym etc and then call a housekeeping function with the database name and the resource name so that it's marked for teardown. This is as simple as an nzsql "drop table tablename" statement. Either way, we drop the resource at the end of the run.
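Such a housekeeping function could look something like the sketch below. The function names, the TEARDOWN_FILE variable and the "database|statement" line layout are all assumptions made for illustration. Recording the database next to each statement lets the teardown connect to the database that owns the resource.

```shell
#!/bin/bash
# Hypothetical housekeeping helpers; every name below is an illustrative assumption.
# Each line of the teardown file is "<database>|<drop statement>".
TEARDOWN_FILE="${TEARDOWN_FILE:-$(mktemp)}"

# usage: mark_for_teardown <database> <resource_name> [table|view|synonym]
mark_for_teardown() {
    local db="$1" name="$2" kind="${3:-table}"
    echo "${db}|drop ${kind} ${name};" >> "$TEARDOWN_FILE"
}

# Replays every recorded drop against the database that owns the resource.
# An empty file means nothing was created, so nothing is torn down.
run_teardown() {
    local db stmt
    while IFS='|' read -r db stmt; do
        # stdin is redirected so nzsql cannot swallow the rest of the file
        nzsql -d "$db" -c "$stmt" < /dev/null
    done < "$TEARDOWN_FILE"
    : > "$TEARDOWN_FILE"
}
```

A process would call mark_for_teardown right after each successful create, and the framework would call run_teardown once at the end of the run.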

 

We'll also need the more visible form, that is creating a local asset in our Transform database that is identical to the target version of the asset. CTAS magic works for us here, in that we offer up several parameters to a given function, such as the table-to-create name, the "real" table name and the target database. For example, if our transform database is my_transforms, and the target database is my_target, with a table name of my_table, we would want to create a table thusly:

 

create table my_table_temp as select * from my_target..my_table limit 0;


But we can see a limitation here. We cannot run an application with multiples of these without accidentally stomping on each other. If another thread of this ELT stream is launched (say this one is running behind) it will attempt to create my_table_temp as well, and will fail (but might think it succeeded, the table is there after all) and start to write its data into the same resource that is being used by another thread.

 

No, the most appropriate way to deal with this is the simplicity of the AUDIT_ID, a bigint value that we capture at the beginning of the framework's run (and yes, it's a simple sequence value we pull from a sequencer on the Transforms database).
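Capturing the AUDIT_ID might look like the sketch below. The sequence name audit_id_seq and the function name get_audit_id are assumptions for illustration, and the "next value for" form should be checked against your release.

```shell
#!/bin/bash
# Sketch only: the sequence name audit_id_seq and the database default are assumptions.
# -A (unaligned) and -t (tuples only) keep the output down to the bare value.
get_audit_id() {
    local db="${1:-my_transforms}" seq="${2:-audit_id_seq}"
    local id
    id=$(nzsql -d "$db" -A -t -c "select next value for ${seq};") || return 1
    # strip any stray whitespace around the value
    id="${id//[[:space:]]/}"
    # Refuse anything that is not a plain integer, so a failed query cannot
    # leak an error message into table names later on.
    case "$id" in
        ''|*[!0-9]*) echo "could not obtain AUDIT_ID" >&2; return 1 ;;
    esac
    echo "$id"
}
```

The framework would then start with something like AUDIT_ID=$( get_audit_id ) and stop right there if the call fails.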

 

With this in hand, we now create with confidence:


create table my_table_temp_$AUDIT_ID as select * from my_target..my_table limit 0;

 

Now we have our very own copy and no other thread, even for debug, will gain access to it. What we do after this is simply echo the drop command to the housekeeping file, and we're done:

 

echo "drop table my_table_temp_$AUDIT_ID ;" >> $housekeeping_file

 

At the end of the run, we'll execute each item in the housekeeping file and this will tear down all the assets we created.


Alternatively, we could echo this to another housekeeping file such as:

 

echo "drop table my_table_temp_$AUDIT_ID ;" >> $housekeeping_file_operator

 

And now we have a way to keep the tables around without losing track of them. The "drop" statement is logged to another external housekeeping file that we will not execute at the end of our current run. Rather, the operator can execute the file once a night or once a week, or whatever, which guarantees that the assets can be dropped without any surgical activity from the operator.

 

We'll do something else with this table name, though. How about a FINALIZE file that keeps track of each of these assets? After all, we pulled the definition from the Repository as a means to fill it with data that is destined for the same table in the Repository.

 

 

echo "insert into my_table select * from my_transforms..my_table_temp_$AUDIT_ID ;" >> $FINALIZE_FILE
echo "generate express statistics on my_table; " >> $FINALIZE_FILE

 

Now if we never make it to the end, we will toss the FINALIZE_FILE and do nothing. Otherwise we execute the FINALIZE file and commit the temporary tables to the target.

 

And the order of execution is, of course, FINALIZE first and HOUSEKEEPING second!
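That ordering can be sketched as a single end-of-run function. The function name and argument order are assumptions. It also assumes nzsql reports a failed script through its exit status, which is worth verifying on your release.

```shell
#!/bin/bash
# Sketch of an end-of-run step; function name and arguments are assumptions.
# usage: finish_run <run_status> <target_db> <transform_db> <finalize_file> <housekeeping_file>
finish_run() {
    local status="$1" target_db="$2" transform_db="$3" finalize="$4" housekeeping="$5"

    if [ "$status" -ne 0 ]; then
        # The run never made it to the end: toss the FINALIZE file, commit nothing,
        # and leave the temporary assets in place.
        rm -f "$finalize"
        return 0
    fi

    # FINALIZE first: commit the temporary tables to the target.
    if [ -s "$finalize" ]; then
        nzsql -d "$target_db" -f "$finalize" || return 1
    fi

    # HOUSEKEEPING second: drop whatever this run created.
    if [ -s "$housekeeping" ]; then
        nzsql -d "$transform_db" -f "$housekeeping" || return 1
    fi
    return 0
}
```

Running the two files in the other order would drop the temporary tables before their contents ever reached the target.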

 

Another simple suggestion is that we make this call into a formal function, such as

 

MY_TABLE=$( create_table_from  "my_table"  "temp"  "$AUDIT_ID" "my_target" )
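One possible body for such a function is sketched below. The argument order follows the call above. Everything else (the TRANSFORM_DB, HOUSEKEEPING_FILE and FINALIZE_FILE variables and their defaults) is an assumption made for illustration.

```shell
#!/bin/bash
# Hypothetical body for create_table_from; variable names and defaults are assumptions.
TRANSFORM_DB="${TRANSFORM_DB:-my_transforms}"
HOUSEKEEPING_FILE="${HOUSEKEEPING_FILE:-$(mktemp)}"
FINALIZE_FILE="${FINALIZE_FILE:-$(mktemp)}"

# usage: create_table_from <real_table> <prefix> <audit_id> <target_db>
create_table_from() {
    local real="$1" prefix="$2" audit_id="$3" target_db="$4"
    local new_table="${prefix}_${real}_${audit_id}"

    # Empty clone of the target table, built in the transform database.
    # nzsql chatter goes to stderr so that stdout carries only the table name.
    nzsql -d "$TRANSFORM_DB" -q -c \
        "create table ${new_table} as select * from ${target_db}..${real} limit 0;" >&2 || return 1

    # Register the teardown and the eventual commit to the target.
    echo "drop table ${new_table} ;" >> "$HOUSEKEEPING_FILE"
    echo "insert into ${real} select * from ${TRANSFORM_DB}..${new_table} ;" >> "$FINALIZE_FILE"
    echo "generate express statistics on ${real} ;" >> "$FINALIZE_FILE"

    echo "$new_table"
}
```

Because the function prints only the new table name, the caller can capture it in a variable and never hard-code the suffix.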

 

Now we have the variables aligned - the source table in my_target, the "temp" prefix and $AUDIT_ID suffix. We also have something else, as we can now reference our new table in simpler form, following:

 

insert into $MY_TABLE (
column1,
column2
)
select * from my_source;

 

But what if the "my_source" here were created in yet another similar thread - itself creating a temporary asset:

 

insert into $MY_TABLE (
column1,
column2
)
select * from $MY_SOURCE;

 

In this case, the value of "MY_SOURCE" is actually "staging_database..my_source". But we don't need to know that, do we? What if we loaded up another source table called "my_source_test"? Could we now point the variable $MY_SOURCE to this new table and have it remain transparent to the above ELT SQL statement? You see where this leads? Flexibility, portability, troubleshooting etc - all because we embrace a framework that is letting us out of the box.


We can build up resources and not worry about whether they exist locally, remotely, in staging or whatever. Another simple aspect of this approach is this - if we don't get rid of the resource(s) at the end of the run, or the run aborts prematurely for any particular reason, all of the resources we have already built remain built and filled - we don't need to start the ELT from the beginning.

 

Let's say we pull from ten different sources, integrate the data with a workload that takes about an hour (hard churning on billions of records) toward a final target of five reporting tables. If we get to the end and the last table has to abort for any number of reasons, we quit and don't commit. However, our threads are in a condition that allows us to restart from where we left off, and not repeat all that work again, with the added benefit that nothing has been committed to the repository - yet. So we don't lose all the processing time, we just pick up where we left off. Such a scenario requires a manual restart, but the primary takeaway is the ability to checkpoint our work de-facto without ever corrupting the final target. Netezza gives us the power to do these things inside the machine.

 

How, you might ask, do we determine where we left off so we can pick right back up? Hey, I'm out of space on this one, but later -

0 Comments Permalink

When punting data around inside our magical machine, one may wonder how to keep track of it all. Some will eschew ELT because it boils down to a pile of SQL statements, and it sometimes feels out of control. Control of course, is what we make of it. Even a well-defined development product is no match for someone who doesn't like controls.

 

However, we know this really does boil down to insert/select combinations like so:

 

Insert into Mytarget (

column list here

)

select

yet another column list

from some tables using complex join and filter logic

 

It seems we have a handle on the top and bottom, but the "select" clause is where the primary transform and integrations are applied. Things can get really ugly here, especially if we're moving from one legacy platform to another. Our select-statements will look very hairy, indeed.

 

The "insert" clause is largely along for the ride.

 

Now it doesn't seem likely that this could get out of control until we're presented with tables that contain, say, a hundred columns. Or even fifty, or say twenty-five. Just enough you see, to keep them from appearing on the same editor page. We might want to add a column to the mix. Hey, add it down below in the select - and make sure you add it in the top to the insert - and don't get anything out of order! And what if the columns are misaligned? Data corruption is a higher-than-average danger here.

 

It feels a little primitive, but all we really need is some assistance on the source-to-target mapping and we're good to go. It's impractical to do a source-to-target with unwieldy insert/select statements, so let's apply a little Netezza magic. Considering that the cost of an ELT statement can sometimes run into minutes of execution time, sacrificing a few extra seconds up front, just to support our weary eyes, is a fair trade.

 

Let's say we automate the scenario a little bit. I have a table called customers that I want to roll together from our old customers and customer-properties tables. The target table is a reporting table with denormalized stuff to support our ad-hoc folks with a lot of pre-calculated goodies. Once we have the calculations, we want to put them into business visibility.

 

insert into Customers (
customer_id,
first_name,
last_name,
most_recent_purchase_dt,
total_purchases_ctr,
-- lots of other columns here
)
select
c_id,
f_name,
l_name,
max(b.purchase_dt),
sum(b.daily_purchase_ctr),
-- lots of other columns here
from
  old_customer_table a,
  old_properties_table b
where a.customer_id = b.customer_id
group by a.customer_id, l_name, f_name, etc;

 

 

We can readily see that this very typical SQL statement is doing some heavy lifting for us, just like we want it to do inside the machine. But what if the insert/select clauses had a lot more columns? It wouldn't take much for us to feel nervous about its maintainability. What if we have to integrate another table into the mix? Left outer joins? The Select clause has pretty much unlimited potential for complexity.

 

ANSI SQL supports aliases, so let's run with that. We have our source columns in the Select and the Target columns in the Insert, so let's align them thusly (I'll just use the first few for brevity)

 

 

select

c_id                                      customer_id,

f_name                                  first_name,

l_name                                  last_name,

max(b.purchase_dt)               most_recent_purchase_dt,

sum(b.daily_purchase_ctr)      total_purchases_ctr,

$AUDIT_ID                            audit_id

from

  old_customer_table a,

  old_properties_table b

where a.customer_id = b.customer_id

group by a.customer_id, l_name, f_name, etc;

 

 

And lo, we have the makings of a source-to-target map. Don't we? Of course - the Insert-columns appear on the right, ready to functionally redefine the source values on the left. We do this all the time, don't we? But largely for spontaneous reports and the like. Let's look a little further, because having something like this in open-text doesn't really benefit us.

 

By circumscribing it with a "cat" we can gain two major benefits without sacrificing clarity - one is the ability to put the SQL statement into a place where we can use it, and one is to provide a means to resolve any $ variables that happen to be in the SQL statement. Note the use of the AUDIT_ID variable.

 

 

MY_SELECT=$( cat <<!
select
c_id                        customer_id,
f_name                      first_name,
l_name                      last_name,
max(b.purchase_dt)          most_recent_purchase_dt,
sum(b.daily_purchase_ctr)   total_purchases_ctr,
$AUDIT_ID                   audit_id
from
  old_customer_table a,
  old_properties_table b
where a.customer_id = b.customer_id
group by a.customer_id, l_name, f_name /* etc - and no trailing semicolon, so we can append to it */
!
)

 

Okay, now we have some options - so let's try this:

 

nzsql -a <<!

$MY_SELECT  limit 0 ;

!

 

This lets us test the SQL statement, but only to make sure we formatted it right. Now let's do something more useful:

 

 

nzsql -a <<!

create table temp_target as $MY_SELECT  limit 0 ;

!

 

Now we have a persistent table in the catalog, with correctly named and sequenced columns that align with the select statement. Note that the columns in the catalog will also have expected data types, which we could check against the target table's data types for consistency, but for now we just need something that the system will accept without complaining.

 

I'm a big fan of letting the Netezza environment do the heavy lifting. We could set up a parsing function to rip through our SELECT statement and find the alias'd column names, but this will fall apart with the more complex SQL statements. We already have a high-powered SQL parser at our disposal, don't we? And doesn't the CTAS have a thousand-and-one uses, after all?

 

Let's do a CTAS like the above - with "limit 0" - meaning that it won't do any real processing work, but will give us the power of its parsing engine to find the target columns, with the added benefit of registering them by name and in the proper order - but to a temporary table.

 

 

Now let's put the CTAS together with a way to pull the columns off the catalog - I'm throwing this to a flat file for debugging, but you probably know how to stream this directly into a loop - to follow

 

nzsql -q -A -t -o outputfile.txt   <<!
create temp table temp_target as $MY_SELECT  limit 0 ;
select attname from _v_relation_column where name=upper('temp_target') order by attnum;
!

 

Now let's pull this file into a quick loop,

 

M_SEP=""
INSERT_STR=""
# read the column names one per line and join them with commas
while read -r line
do
INSERT_STR="${INSERT_STR}${M_SEP}${line}"
M_SEP=","
done < outputfile.txt

 

 

or how about

 

INSERT_STR=$( nzsql -q -A -t  <<!
create temp table my_temp as  $MY_SELECT limit 0;
select
case when attnum = 1 then ''
        else ',' end ||
attname
from _v_relation_column where name = upper('my_temp') order by attnum ;
!
)

INSERT_STR="insert into TARGET_TBL ( ${INSERT_STR} )"

 

 

 

 

 

Either way, the column list inside INSERT_STR looks like this:

 

 

   customer_id
,  first_name
,  last_name
,  most_recent_purchase_dt
,  total_purchases_ctr
,  audit_id

 

Now what? Execute the Insert?

 

nzsql -a <<!

$INSERT_STR  $MY_SELECT ;

!

 

------------------------------------------------

 

If we put the above activities into a bash function call we would find a setup like this:

 

nz_insert_from_select()
{
    # put all the above activities in here
    :
}
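Filled in, the function might look roughly like the sketch below. It is assembled from the steps above under a few assumptions: the scratch-table naming, the TRANSFORM_DB default, and a select statement passed in without a trailing semicolon. The catalog view _v_relation_column is the one already used above.

```shell
#!/bin/bash
# Sketch of nz_insert_from_select; scratch-table naming and the TRANSFORM_DB
# default are assumptions made for illustration.
# usage: nz_insert_from_select <target_table> <select_statement_without_trailing_semicolon>
nz_insert_from_select() {
    local target="$1" select_sql="$2"
    local db="${TRANSFORM_DB:-my_transforms}"
    local scratch="s2t_map_$$"
    local columns

    # Let the Netezza parser name and order the columns: zero-row CTAS, then read
    # the catalog. The temp table disappears when this nzsql session ends.
    columns=$(nzsql -d "$db" -q -A -t <<EOF
create temp table ${scratch} as ${select_sql} limit 0;
select attname from _v_relation_column
where name = upper('${scratch}') order by attnum;
EOF
    ) || return 1

    # Drop blank lines, then join the one-per-line column names with commas.
    columns=$(echo "$columns" | sed '/^[[:space:]]*$/d' | paste -s -d ',' -)
    [ -n "$columns" ] || { echo "no columns found for ${target}" >&2; return 1; }

    # Pair the derived insert list with the original select and run it.
    nzsql -d "$db" -a <<EOF
insert into ${target} ( ${columns} )
${select_sql} ;
EOF
}
```

The insert list is derived from the aliases on every call, so adding or removing a line in the select is the only edit needed.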

 

 

-------------------------------------------------------------------------------------------------------------------

So here is what we would implement for any given ELT - we get a visual source-to-target map

 

MY_SELECT=$( cat <<!
select
c_id                        customer_id,
f_name                      first_name,
l_name                      last_name,
max(b.purchase_dt)          most_recent_purchase_dt,
sum(b.daily_purchase_ctr)   total_purchases_ctr,
$AUDIT_ID                   audit_id
from
  old_customer_table a,
  old_properties_table b
where a.customer_id = b.customer_id
group by a.customer_id, l_name, f_name /* etc - and no trailing semicolon, so we can append to it */
!
)

 

 

mret=$( nz_insert_from_select  target_table "$MY_SELECT" )

 

 

 

------------------------------------------------------------------------------------

 

 

So in ELT space, one of the keys is to balance how much we need to program versus how much is already programmed for us - in the Netezza parsing engine for starters. Catalog-hits are inconsequential when compared to the functional benefit we achieve, and the visually-aligned column names even for very large tables. We can then add or delete columns from the ELT by adding or deleting lines in the Select. We don't have to align the columns on the top (Insert) and bottom (Select) because they are side-by-side - and we know exactly what is going where.

0 Comments Permalink

Why does Oracle insist on dragging around an empty Exadata cabinet to all their tradeshows? How does this help people learn about Exadata? Perhaps it helps you size up the system to determine if it will fit in a data center or underscores the point that when Oracle say hardware they know what they are talking about? Or perhaps it sets a new benchmark for demo'ing vaporware? One can imagine the Oracle booth bunnies saying: "and it has a door which opens like this...and then closes like this!"

For the last few years, lil' old Netezza routinely bring real kit to tradeshows - stuff intended for grown ups looking to solve real problems. We know booth visitors don't want to load up terabytes of data or break performance barriers with complex SQL, but for Netezza this is a case of putting our money where our mouth is. We don't talk about simplicity: we just show it. If we can set up a full blown 50TB machine in a couple of hours in a hostile environment like a tradeshow then think how this very visible simplicity benefit translates to time to value and cost reduction at your site?

Conversely, what does it say about a great big huge enormous vendor with infinite resources like Oracle that cannot set up their own live machine? Or even take one to a customer’s site. Perhaps it says that Oracle is a software company who in truth, know nothing about hardware? Perhaps it says that without an army of trained specialists working behind the scenes, this stuff just doesn't work so well? Or maybe Oracle has lost their innovative edge to such an extent that they truly believe opening and closing a cabinet door really is on the cusp?

In reality, embracing hardware - as Oracle is doing with Exadata - is a huge departure for them, and don't underestimate the potential for them to screw this up. I can tell you anecdotally that the notion of "HARDWARE VENDORS ARE SCUM" was so ingrained in me at Oracle, that this was my only objection to joining Netezza several years after I left Oracle. I'm sure if you look closely at the Exadata box on the Oracle booth, it displays a sign that says: now wash your hands!

Meanwhile the little sign on the side of the Netezza cabinet just says: ours is real.

0 Comments Permalink

I met with the Netezza folks this week, and their nutty marketing team updated me on all the stuff they're excited about doing in the next few months. Here's the low-down:

 

Over the next couple of weeks, keep your eyes peeled for the next issue of the Enzee Frenzy magazine. I'm sure the whole thing is going to be great... especially my own signature Brian Teasers cartoon at the back!

 

And I got a sneak preview of one of the big announcements that will be coming out in Frenzy -- instead of the annual Enzee Universe conference, Netezza is doing an Enzee Universe World Tour this year so people don't have to travel to them. So don't worry about missing out on the fun this year because your travel budget has been cut - Netezza is bringing the whole Universe to your doorstep! From what I understand, the tour will kick off at Netezza's base camp in Boston MA and then they'll travel to other cities around the US, followed by Europe and then out to Japan. And it gets even better - I get to join them for the entire World Tour! I better make sure my passport hasn't expired...

 

The other cool thing Netezza is doing for enzees is they're hosting a Netezza Partner Webinar series where, each month, a Netezza partner will talk about the cool stuff they're doing with Netezza and how it can help optimize NPS system performance, optimize business practices, and more. Very cool. All of the webinars will be posted on the Enzee Events page on the enzee site, so check back there at least once a month!

 

One last thing - did you check out the Groups that have been added to this community site? Looks like they could be good places for subsets of the larger enzee community to exchange ideas about things they can all relate to.

Just throwing it out there... looks like they're just getting started but I'm sure they won't go very far without enzee participation!

0 Comments Permalink

About a year ago we encountered an environment where the client wanted the old system refactored into the new. The "new" here being the Netezza platform and the "old" here being an overwhelmed RDBMS that couldn't hope to keep up with the workload.

 

So the team landed on the ground with all hopes high. The client had purchased a 10200 (216 processors) for production deployment and a 10100 system for development. Oddly, the same thing happened here as happens in many places. The 10200 was dispatched to the protected production enclave and the 10100 was dropped into the local data center with the developers salivating to get started. And get started they did.

 

The first team inherited about half a terabyte of raw data from the old system and started crunching on it. The second team, starting a week later, began testing on the work of the first team. A third team entered the fray, building out test cases and a wide array of number-crunching exercises. While these three teams dogpiled onto and hammered the 10100, the 10200 sat elsewhere, humming with nothing to do.


We know that in any environment we encounter, with any technology you can name, the development machines are underpowered compared to the production environment. And while the production environment has a lot of growing priorities for ongoing projects, we don't have this problem for our first project, do we?

 

And this is the irony - for a first project we have a huge "first-bubble" of work before us that will never appear again. The bubble includes all the data movement, management and backfilling of structures that we will execute only once, right? Really? I've been in places where these processes have to be executed dozens if not hundreds of times in a development or integration environment as a means to boil out any latent bugs prior to their maiden - and only - conversion voyage. But is this a maiden-and-only voyage? Hardly - typically the production guys will want to make several dry runs of the stuff too. We can multiply their need for dry runs with ours, because we have no intention of invoking such a large-scale movement of data without extensive testing.

 

And yet, we're doing it on the smaller machine. No doubt the 10100 has some stuff - but I've seen cases where it might take us two weeks to wrap up a particularly heavy-lifting piece of logic. If we'd done this on the larger 10200, we would have finished it in a week or less. Double the power, half the time-to-deliver (when the time to deliver is governed by testing). In practically every case of a data warehouse conversion, the actual 'coding' and development itself is a nit compared to the timeline required for testing. I've noted this in a number of places and forms, in that the testing load for a data warehouse conversion is the largest and most protracted part of the effort. And if testing (as in our case) is largely loading, crunching and presenting the data, we need the strongest possible hardware to get past the first bubble.

 

So this is a case for any data warehouse project, not just one with a Netezza machine. The first bubble is the worst bubble. As our techs slave themselves over a hot CPU, sweating out the extreme workload of the initial conversion, they will quickly start to compare the machine they are working on versus that production machine sitting over there with nothing to do. It wouldn't matter what the technology happened to be - the equation is out of kilter. We need all the available power to get past the first bubble.

 

But I've had this conversation with more people than I can count. Why can't you deploy the production-destined machine with all its power, for development/testing use in getting past the first bubble, then scratch the system and deploy for production? What is the danger here? I know plenty of people, some of them vendor product engineers, who would be happy to validate such a 'scratch' so that the production system arrives with nothing but its birthday-suit - its originally deployed default environment. Yet another philosophy is that we would pre-configure the machine for production deployment, but nobody likes developers doing this kind of thing in a vacuum. They would rather see deployment/implementation scripts that blow-out and instantiate the implementation. I'm a big fan of that, too, for the first and every following deployment. That's why I would prefer we used the production-destined system to get past the first-bubble-blues, then scratch it, and get the original environment standing up straight, then treat it as an operational production asset.

 

Most projects like this have a very short runway, and we do a disservice to the hard-working folks who are doing their best to stand up this environment. They need all the power they can get, especially when they enter the testing cycle. And for this, it's an 80/20 rule for every technical work product we will ever produce. Take a look sometime at what it takes to roll out a simple Java Bean, or a C# application, or a web site. Part of the time is spent in raw development, and part of it in testing. If I include the total number of minutes spent by the developer in unit testing, and then by hardcore testers in a UAT or QA environment, it is clear that the total wall-clock hours spent in producing quality technology break into the 80/20 rule - 20 percent of the time is spent in development, and 80 percent in testing.

 

And if the majority of the time is spent in testing, what are we testing in Enzee space? The machine's ability to load, internally crunch and then publish the data. On a Netezza machine, this last operation is largely a function of the first two. But we have to test all the loading don't we? And when testing the full processing cycle we have to load-and-crunch in the same stream, no? What does it take to do this? Hardware, baby, and lots of it. So why are we doing it on one-third of the available hardware (seeing that we're on a 10100 and the 10200 is sitting over there, humming away and taunting us from a distance)?

 

I can say that multiple small teams can get a lot of "ongoing" work done on a 10100, no doubt a very powerful environment. I can also say that multiple teams working a first-bubble effort on a machine like this will gaze longingly at the 10200 in the hopes they can get to it soon, because so much testing is still before them, and they need the power to close. With that, Netezza gives us the power to close faster than any other environment, to get past this first-bubble without the blues - we only hurt ourselves with rules for the environment that are impractical for the first-bubble. So all things considered, if we were on a traditional platform we would see months pass for the relative weeks it would take for a Netezza machine to do the same work.

 

Alas, when one has a Netezza machine, it bends gravity and dilates time. Months become weeks. Weeks become days. And yet, we still need more power. More is never enough to wash away the blues.

 

Those first-bubble-blues.

0 Comments Permalink

We’ve received several calls recently from Oracle that sound like something from an episode of The Sopranos—all in jest I should add—the gist of which is that Oracle don’t like us poking fun at Exadata and we had better watch it! It’s hard to imagine that Oracle can bring any more attention to this issue than having Larry Ellison declare premature victory against Netezza during his keynote at Oracle OpenWorld last year.

 

The irony for me personally of all this, is that I cracked my marketing teeth at Oracle in the late eighties when Oracle was the upstart with the cool technology fighting the big ugly incumbents against all the odds. How the tables have turned! Oracle has grown all big and ugly, lost their innovative flair, rely on corporate muscle to get the job done and throw a tantrum when they don’t get their way.

Personally, I doubt whether Exadata will be ready for prime time for another 18 months yet. In the meantime, we have a phoney war on our hands: all talk and no action. I predict Oracle is trying hard to find an opportunity to announce that someone, somewhere is using this thing. At least that’s what the Dummies Guide to Blasé Big-Company Corporate Marketing advocates. The HP Neoview team—having been clearly sidelined from this arrangement with Oracle—is also yet to buck up the courage and admit there is no synergy with Oracle here.

 

In the meantime, while Oracle and Netezza throw their handbags at each other, innovative companies looking to solve today’s analytic problems will continue to buy from Netezza (and sometimes from our competitors!) leaving laggards, already bogged down in the recessionary mire, to worry about Exadata in the name of their corporate standardization initiatives and performance "incrementalism".

0 Comments Permalink

Before throwing the data on the grill, it needs prep. We might want to chop it, slice it, dress it and otherwise tenderize it before throwing it on the flames, but throw it on the fire we must.

 

Here's an interesting problem - what if we're presented with data that just won't intake? It's horribly formatted, if at all, and we don't have access to any kind of ETL or data-shaping environment to get the data inside. We have bad dates, bad numerics, and the only thing we have that actually works are the varchars! Whoo hoo. Okay, not so fast.


To make matters worse, we have intake records that face concatenated views, some of which have almost two-hundred columns. If any columns are bad, it's a lot like a submarine hunt, without the submarine. How bad is the data? Well, not all that bad where it actually resides. The users do a lot of inline and stored-proc math, and don't like those pesky overflows in their math. So the best option for them is to define all their numerics as "number", with no specified precision. Hey, that works great as long as the data never has to leave its home.

 

But now it's escaped, and it's on our front door with a trick-or-treat bag without the treats. When the source database doesn't define the numeric precision of its source columns, we'll find values with very oddball characteristics. Simple things, like 45 digits to the right of the decimal. Not particularly useful digits, but those leftovers like .33333 etc that just showed up without being invited.

 

How do we shock-the-system with these values, or just trim them out? On intake we have the potential of an empty column, too. The danger never ends. Check out this example, if we want to clamp the numeric data to a more palatable value:

 

select

-- for a numeric

     case when MyColumnName is null then null else substring(MyColumnName, 1, 38)::numeric(38,8) end MyColumnName

 

--or for a date

     case when mycol is null then null else date(MyColumnName) end MyColumnName

 

 

Now we've stripped the data to something we won't choke on, and it's within lasso-distance of our numeric precision. What's that? You don't want to do this for every column in case it's something pervasive (and it usually is) - and you don't like the idea of bringing the numeric precision out of the catalog and putting it into the intake statement? What kind of perfectionist are you?

 

Not to worry, I don't like this kind of construct either, at least, not "out in the open". If I really need to use this, I would rather find a way to automate its construction right off the catalog. Our "substring" doesn't have to change, it will chop the physical data to a non-choking size. The catalog-based precision we can get, well, off the catalog. I am a huge fan of using the catalog for meta-data-based constructs, especially for common, automated tasks like intake and publication.
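Here is a minimal sketch of what "automating it off the catalog" could look like, written in Python. It assumes the column names, types, precision and scale have already been pulled from the catalog into a simple list; the Column class, the function names and the stage_raw table are hypothetical and only there for illustration. The substring width stays fixed at 38, as in the hand-written example, while the precision and scale come from the metadata.

```python
from dataclasses import dataclass
from typing import List, Optional

CHOP_WIDTH = 38  # fixed substring width, same as the hand-written example


@dataclass
class Column:
    """One row of column metadata, assumed to be fetched from the catalog beforehand."""
    name: str
    type_name: str                  # e.g. "numeric", "date", "varchar"
    precision: Optional[int] = None
    scale: Optional[int] = None


def intake_expression(col: Column) -> str:
    """Return the select-list expression that clamps one column on intake."""
    if col.type_name == "numeric":
        return (
            f"case when {col.name} is null then null "
            f"else substring({col.name}, 1, {CHOP_WIDTH})"
            f"::numeric({col.precision},{col.scale}) end as {col.name}"
        )
    if col.type_name == "date":
        return (
            f"case when {col.name} is null then null "
            f"else date({col.name}) end as {col.name}"
        )
    # Anything else (the varchars that "actually work") passes through untouched.
    return col.name


def build_intake_select(source_table: str, columns: List[Column]) -> str:
    """Assemble the full intake select, one generated expression per column."""
    exprs = ",\n    ".join(intake_expression(c) for c in columns)
    return f"select\n    {exprs}\nfrom {source_table}"


# Hypothetical three-column intake record; a real one might have 200 columns.
example_columns = [
    Column("amount", "numeric", 38, 8),
    Column("posted", "date"),
    Column("memo", "varchar"),
]
print(build_intake_select("stage_raw", example_columns))
```

The point of the design is that the ugly construct still exists, but nobody has to type it or look at it. Change the catalog and the statement changes with it.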

 

Let's say we have a very-wide intake record, like 200 columns of varying types. Do we really want to carefully craft an intake statement including the construct above? The capability is willing, but the flesh is weak. I don't find such mind-numbing work to be profitable or productive, even though it's sometimes insanely necessary because the intake data is so junky.

 

The cool part is just this: get the data into the machine! We don't have to push load-ready data into the machine, we just need to push it to a safe location. Once inside the machine, we can use Netezza's SPUs to beat the living daylights out of the data. And when we think about it, once the data's in the machine, it's like putty in our hands. We can create, teardown and rebuild whole data models, several-a-day if we want, to shape and mold the structures to the form we want. But the potter's wheel is awfully lonely with no clay.

 

I've been places where we literally waited for weeks upon weeks to get data into the machine, largely because the data we want in the machine has never (by design) left its home for another machine. Once it leaves, people see how ugly the data looks "out in the open" -- but something funny happens. We might ask them "could you format those pesky dates and numerics to something more palatable?". The answer we get back is as honest and refreshing as Nestle iced tea: "You guys own the warehouse, and the whole chain of data cleansing. I'm not the cleaner, you are. Why are you asking me to do your job?"

 

Ouch, well, there it is, and quite frankly they're right about it. It's awfully hard to tell someone to restructure their data to a form that meets our needs - it's 2008 after all. What would it take to get information into the machine, especially junky data? Do we really need to push back on our DBAs? Our analysts? Our DBAs are well-paid to manage and deliver the information in a pre-defined form, usually not in bulk. It's not particularly daunting for them, but they cannot read your mind, either.

 

So how would we access the catalog to automate this intake problem? And while we're at it, why not solve other intake problems, not just the pesky numeric precisions? How about solving the need for file space to land a flat file for intake? Or that the data in the source doesn't completely match the final target tables (we've added some administrative columns and other items that don't have a source-side equivalent)? Intake-mechanics are actually pretty simple, and once we solve some of the basics, we can do some pretty advanced intake at the push of a button, and then use the common SQL-based ELT to take the data to its final home.
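As one illustration of the source-versus-target mismatch, here is a small Python sketch that lines up a target column list against what the source actually provides and fills the administrative columns from a table of defaults. Everything named here (build_insert_select, the customer and stage_customer tables, load_ts, batch_id) is hypothetical; the column lists are assumed to have been read from the catalog already.

```python
from typing import Dict, List


def build_insert_select(
    target_table: str,
    target_columns: List[str],
    source_table: str,
    source_columns: List[str],
    defaults: Dict[str, str],
) -> str:
    """Build an insert-select that fills target-only columns from defaults."""
    available = set(source_columns)
    select_items = []
    for col in target_columns:
        if col in available:
            # The source has this column, so carry it straight across.
            select_items.append(col)
        elif col in defaults:
            # Administrative column with no source-side equivalent.
            select_items.append(f"{defaults[col]} as {col}")
        else:
            # Fail loudly rather than silently load a half-mapped record.
            raise ValueError(f"no source or default for target column {col!r}")
    column_list = ", ".join(target_columns)
    select_list = ", ".join(select_items)
    return (
        f"insert into {target_table} ({column_list})\n"
        f"select {select_list}\n"
        f"from {source_table}"
    )


print(
    build_insert_select(
        "customer",
        ["id", "name", "load_ts", "batch_id"],
        "stage_customer",
        ["id", "name"],
        {"load_ts": "current_timestamp", "batch_id": "42"},
    )
)
```

A target column with neither a source nor a default raises an error on purpose, so a mapping gap shows up when the statement is built and not halfway through a load.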

 

But getting the data into the box is no different than getting a player on the field. No player, no game. If we examine the common failure points that are in our way, we have the potential for source-failure in a database or file system, network errors, power-outages - you name it - all between our machine and our data. If we can get the data into the machine and cut the ropes from whence it came, we can do anything we want. If we can't get the data into the machine, well, what on earth are we standing around for?

0 Comments Permalink

Sorry it's been so long. I've been scouring the internet and finding lots of interesting news in our space - here are my Five Fascinating Findings of the past couple of weeks:


  1. Have you seen this Data Liberation Movement? It's a pretty funny new "movement" that Netezza has launched to encourage transparency during the data warehousing selection and testing process. I'm very curious to see where this thing goes - but in the meantime, I've created a Facebook group for all the data liberators out there!  The Data Liberators' Twitter page is also an interesting place to see what they're up to.  My only question is... who is the data liberators' mysterious leader??
  2. Interesting article from "referencedatamanagement" on wordpress.com providing an objective point of view from someone who has used a variety of different tools for MDM projects, including short snippets on Informatica, Siperian, Teradata, GlobalIDs, Netezza, Talend, Pentaho, and Teleran. My favorite part is when the author says "I’ve done more ‘working around’ this product." with regards to Teradata.
  3. From baselinemag.com - a short slideshow on six technologies that were hyped up in 2008 but didn't actually get funding. The ones I found most surprising: RFID, SaaS and On-demand Computing.
  4. The anomaly of consolidating unstructured and structured data in a warehouse: "Data Warehousing Between a Rock and a Hard Place" by Lou Agosta on the B-EYE Network.
  5. What is going on in the world of BI? The various factors affecting today's BI purchasing decisions are explained in Enterprise Systems' "How BI Habits Are Changing" article by Stephen Swoyer.

 



0 Comments Permalink

For any accountants in the audience (or anyone else strange enough to enjoy reading financial statements…), you will know that Netezza’s fiscal year begins on February 1. Setting aside the quirks of history that made it so, this shifted Netezza New Year gives us a second chance to consider the year ahead, without all the pressures of turkey hangovers and soon-to-be-broken New Year’s resolutions in mind.

As a part of our new fiscal year planning, I just had the pleasure to attend Netezza’s annual customer team kick-off meeting. At the beginning of every year, all of Netezza’s customer-facing employees gather to discuss what made us successful in the prior year, and what we can do to serve our customers better in the year to come. At the same time, it’s an opportunity for us to educate each other on all of the partner activities happening out in the field.

This year, one of the main initiatives discussed was to focus additional efforts on the core commercial industries that have shown the greatest adoption of Netezza technology:

·        Telecommunications

·        Retail & Consumer Packaged Goods

·        Financial Services

·        Health & Life Sciences

·        Digital Media

From a partnership perspective, this means that in 2009 the Netezza Strategic Development team will be engaging with current partners and recruiting new partners in each of these areas. Our objective is to create a growing and vibrant ecosystem of partners who can help us to deliver complete business solutions and even greater value to our customers. By the end of 2009, you can expect an even greater number of applications will be available on the Netezza platform in these target industries.

While I was attending the team session on Retail & Consumer Packaged Goods, I was impressed by one partner who I think is an excellent example of the way that Netezza can engage, support, and elevate partners via our existing programs. Howard Cohn of EYC (http://www.eyc.com) described how EYC’s analysis software for customer loyalty card databases, combined with Netezza’s data warehouse appliance, has enabled customers to get better answers faster while working with even more comprehensive data sets.

With the support of Netezza’s technology, industry, and partner sales teams, EYC has successfully ported several of its applications to the NPS platform. Together, EYC’s and Netezza’s customer teams are promoting this joint solution to the retail industry. To enable even more compelling performance, the teams are now considering embedding some of EYC’s key analytics into the database itself using Netezza’s OnStream development platform and the supporting resources available through the Netezza Developer Network.

EYC’s progression from partnership concept through to market launch and product enhancement is a great example of the full partnership lifecycle that we support. EYC is just one example of many partners who have successfully engaged with Netezza, and a growing number of these are enhancing their applications with Netezza’s in-database OnStream capabilities. For a list of partners with OnStream applications already on the market, you can see an application listing at http://www.netezza.com/data-warehouse-appliance-products/data-warehouse-applications.aspx.

I expect that 2009 will see a number of partners take OnStream applications to market in each of the vertical industries that we’re focusing on. If you see any application needs in your industry, want to recommend a vendor for Netezza partnership, or are a vendor with a concept for an OnStream or vertical industry solution based on Netezza, we’d love to hear from you.

0 Comments Permalink

Here's a shout-out to all you ELT aficionados out there - those who have embraced the call to use the Netezza machine for hard-core data processing, and not just query acceleration. What's that? You've deployed it as a query accelerator because that was your functional requirement? Tish-tosh, you are under-utilizing the machine.

 

In ELT space, we see data arriving on our machine's eastern shore like immigrants from a foreign land. Give us your poor, tired and huddled data, and buried information yearning to be free, and all that. We need liberation! (a subject of another like-minded site) and the big-black-box is a beacon to collect the uncollectable, love the unlovable, and process the unprocessable data arriving in completely un-integrated form. We see information from this source or that, arriving on rafts, boats, inner-tubes and the like, and we want to believe that all are created equal, yet our process for assimilation and naturalization of the data has an uptake, doesn't it? Perhaps we'll stage the information (give it a green card) and maybe even load it partially-cleansed into an ODS - but one way or another we have to challenge the information, make it consumable "enough" even when it first arrives.

 

ELT is a practice already found in many RDBMS's, engaged by people who have no desire to purchase middleware, and honestly believe that pulling the data out of the database, processing it only to put it right back, is a waste of time and resources. Fire up a stored procedure, they will say, and process the data on the machine. Isn't this the most efficient means to achieve our goal?

 

On the Netezza machine, you bet. We have hundreds of processors working in purpose-built synergy toward this goal. But on an SMP-based RDBMS, no way. It won't scale and is destined to run out of gas. It's only a matter of time. And because we would have to use stored procs to effect our outcome, we also embrace a black-box processing scenario that really is black, lights-out, underground, all the bad things. Our poor operators will watch it kick off, run for hours and hope it finishes on time, and correctly. Once it becomes visible to an operator or admin, it's already running out of gas, and now we'll watch the engineers swarm to do what they do best - engineer - and the danse-macabre of propping up a dying process with artificial respiration.

 

Which is why we have Rule #10 isn't it?

 

One reason some don't embrace ELT - that is, simple data transport followed by hard-core data processing in the database engine - is that it's a bad practice to do bulk data processing in the (SMP-based) database engine. Since Netezza has broken this envelope, we now have freedom to proceed, but wait. We need a way to manage the ELT flow itself. After all, the ELT flow is just a series of SQL statements, right? Even the most robust "ETL" tools will only support "ELT" by firing off sequential SQL statements because they are not really in control of the data. What we'd like to see is a flow-based mechanism like Ab Initio, Informatica, Expressor or the like, to transparently harness the SQL statement like a true transformation component, even though in the background it will "only" fire off a SQL statement to effect the outcome. With the Netezza power we really can process the data with mind-bending speed, but after the smoke clears we need a way to report, track and manage this process. A SQL statement seems rather primitive, raw, and too much like hand-coding. Largely because it is. If we had a tool that could manufacture these kinds of transforms on-the-fly and manage the process as a visual flow, hey, life would be good.
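Until such a tool shows up, the "report, track and manage" part can at least be approximated in script. The sketch below is a minimal Python example, not any vendor's product: it treats the flow as a list of named SQL steps, runs them in order, and records status, row count and elapsed time for each. It uses Python's built-in sqlite3 module purely as a stand-in so the example runs anywhere; the run_flow function, the StepResult record and the table names are all made up for illustration.

```python
import sqlite3
import time
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class StepResult:
    """What the trail boss wants to know about one step of the flow."""
    name: str
    status: str       # "ok" or "failed"
    rows: int         # rows affected, as reported by the driver (-1 for DDL in sqlite3)
    seconds: float
    error: str = ""


def run_flow(conn: sqlite3.Connection, steps: List[Tuple[str, str]]) -> List[StepResult]:
    """Run named SQL steps in order, stopping at the first failure."""
    results: List[StepResult] = []
    for name, sql in steps:
        started = time.perf_counter()
        try:
            cursor = conn.execute(sql)
            conn.commit()
            elapsed = time.perf_counter() - started
            results.append(StepResult(name, "ok", cursor.rowcount, elapsed))
        except sqlite3.Error as exc:
            conn.rollback()
            elapsed = time.perf_counter() - started
            results.append(StepResult(name, "failed", 0, elapsed, str(exc)))
            break  # later steps depend on earlier ones, so do not continue
    return results


# Stand-in staging data so the flow has something to chew on.
demo = sqlite3.connect(":memory:")
demo.execute("create table stage_sales (region text, amount text)")
demo.executemany(
    "insert into stage_sales values (?, ?)",
    [("east", "10.5"), ("west", "3"), ("east", "2")],
)
demo.commit()

flow = [
    ("create target", "create table sales_by_region (region text, total real)"),
    ("aggregate", "insert into sales_by_region "
                  "select region, sum(cast(amount as real)) from stage_sales group by region"),
]
for result in run_flow(demo, flow):
    print(result)
```

It is still hand-coding, and it is no substitute for a visual flow, but every statement now leaves a record behind instead of disappearing into a black box.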

 

What would this look like? Well, sort of like what we see today in flow-based management. Expressor, for example, allows us to leverage Visio to describe the flow, then Expressor components consume the Visio diagram's metadata and manufacture a living program. Ab Initio uses its proprietary graphical canvas to effect a similar scenario. What we really need is the ability in one of these products (or another product entirely) to pull a Netezza ELT component onto the canvas, connect one end to a source table, one end to a target table (albeit a temp-table if necessary) and allow us to describe the transform between the two - just like any other transform. Ultimately this would provide us with graphical control over the flow in a visible, manageable and traceable form. Alas, we as the machine owners must (for now) embrace some degree of scripted logic to effect our desired outcome. I see this as a temporary state of affairs. Someone will rise to the challenge.

 

Oh, come on, David, we know that the folks who live and breathe Netezza are the pioneers, who eat from the back of wagons, sleep under the stars and change their own horse-shoes. Innovative problem solvers, braving the wilds of the prairie with fearless resolve. Yeah, uh, before we go down that path, describing a cowboy (and don't get me started) let's examine what the enterprise needs. Whether the cowboys (intone John Williams theme song here) get the job done rustling the data and wrangling the loads, at the end of the day the trail boss will want a status. Have we lost any dogies? Are all of them fed and watered? How much farther to the end of the trail? What about the weather? Wild animals? Data-rustlers (hackers)? The list of pitfalls and opportunities is boundless, and the trail boss wants to know "where we stand". You know, like the business intelligence dashboards.


The reason we might not see this "ELT" harness scenario any time soon from the power-hitter products is that ELT requires the power-hitter to maintain local control but externalize the processing power (delegating it to Netezza). This is unpalatable for the product vendors that claim we don't need Netezza to process the data (and of course, Netezza is horning-in on their action like a good competitor). Yet we have this big-black-box machine sitting on the floor that has the power to perform seriously hard-core processing on a breathtaking scale, achieving internal bandwidth that these power-hitter software products cannot achieve (because their hardware platforms constrain them). Let's face it, put a power-hitter software product on a 32-way Sun machine and then attempt to process data in the same scale as a 200+ processor Netezza machine. No slight on the software product, because it could probably process at Netezza's scale if given enough hardware, but do we really want to deploy a 200 processor Sun machine?

 

Another reason is that Netezza is the only product that truly unleashes the processing capacity to make ELT a practical and easy reality, and is seen as a competitor by the power-hitter software product vendors. Yet another reason is that those who would embark on this path have to commit resources to Netezza's (somewhat) more rarefied market, and for now are simply unwilling to do so. Time will change this, however.

 

I'm not one to tell competitors how they should behave in the marketplace, because competition always increases quality. But if we all get together and shout "we are here" - perhaps at least one maverick elephant will hear our cry. With all apologies to Dr Seuss, we could start our own web presence as Horton Hears an ELT, or Horton.com, or even MaverickElephant.com - I don't know - just thinkin' out loud here.

1 Comments Permalink

I was amazed at how many people showed up for the inauguration. Here in DC the roads were closed, the weather was cold and the atmosphere was, in a word, electric. I also witnessed some disturbing things, like people who had arrived from hundreds of miles away but could not get in, and many more who did get in but had to watch it all on the jumbo-tron screens anyhow. Still, being there is being there, and nobody can take it away.

 

And this is a segue into my observations on some interesting patterns of humanity that align perfectly with our solution domains, so follow me on this. As I sat in a hotel room on Tuesday, debating whether I should brave the extreme cold and traffic, I was reminded that one of our presidents (Harrison) gave a too-long inauguration speech in such conditions, developed a cold and died of pneumonia a month later. As for me, I had a problem to solve - when I turned on the morning news I found a common "of scale" ingress and egress scenario, something I had predicted, though I was stunned by the magnitude of its reality. Yes, an important day for America and for the world - and worth at least a superficial examination of the logistics of people-movement.

 

I noted that the cabs, trains, buses public and private, and other transportation mechanisms had been trickling people onto the scene for days, increasing by the hour. The DC train systems had dropped off over 200k people in a matter of hours, probably setting records as well. But when we think of the drop-off mechanics, these are not people arriving in-bulk. They are collecting from various places in a near-transactional manner. A handful here, a handful there, each one patted-down for their belongings one-at-a-time no differently than our transactional stored procedures check our inbound data one-at-a-time. Then the festivities started, the regalia of our peaceful transition of power, and when it was all done, we saw another interesting effect. It was time for everyone to make an orderly exit.

 

So people who had arrived early in the day and had likewise confirmed a front-row position for the festivities -  now did an about-face to exit - only to find a sea of humanity between themselves and the exits. I use the word "exits" here loosely, because they would not be patted-down to leave like they had been to enter, so the egress was a bit smoother and more steady. Uh, you know, like we pat-down the data when it first enters our warehouse, and deliver it clean on-demand. Ohhh, the parallels.

 

Discussing this at the watercooler, a couple of colleagues wondered out loud how to make a mass-egress work. How would we (safely) empty the Mall of the majority of people in a short period? After all, some of the attendees didn't make it back to their hotels until almost nine hours later! One person suggested to do it "the Netezza way" by providing 100 helicopter pads, plus 400 helicopters, each of them alighting on the pads every five minutes with a 20-minute round trip to spirit people out. This would leverage the vertical space, not just the horizontal. But this model won't serve, since the helicopters waste the return leg. One of them suggested conveyor belts, objecting to the Netezza way. But I suggested that this better represented the Netezza way, a streaming model of constantly moving data. The helicopters could move people at only 1/100th the speed of the same number of conveyors, and they wouldn't have to move all that fast.
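For anyone who wants to run the watercooler numbers, here is a back-of-the-envelope Python sketch. The pad count, landing interval and round-trip time come from the suggestion above; the passengers per helicopter, the belt speed and the people per meter of belt are pure assumptions plugged in for illustration, so the ratio it prints moves with whatever values we choose.

```python
def helicopters_needed(pads: int, minutes_between_landings: float,
                       round_trip_minutes: float) -> float:
    """Fleet size required to keep every pad supplied with a landing on schedule."""
    return pads * round_trip_minutes / minutes_between_landings


def helicopter_people_per_hour(pads: int, minutes_between_landings: float,
                               passengers_per_trip: int) -> float:
    """Batch model: people leave only when a helicopter lifts off."""
    landings_per_hour_per_pad = 60 / minutes_between_landings
    return pads * landings_per_hour_per_pad * passengers_per_trip


def conveyor_people_per_hour(belts: int, meters_per_minute: float,
                             people_per_meter: float) -> float:
    """Streaming model: people leave continuously as the belt moves."""
    return belts * meters_per_minute * people_per_meter * 60


# Figures from the watercooler suggestion.
PADS = 100
MINUTES_BETWEEN_LANDINGS = 5
ROUND_TRIP_MINUTES = 20

# Assumed values, not from the discussion.
PASSENGERS_PER_TRIP = 10
BELT_METERS_PER_MINUTE = 30
PEOPLE_PER_METER = 2

fleet = helicopters_needed(PADS, MINUTES_BETWEEN_LANDINGS, ROUND_TRIP_MINUTES)
by_air = helicopter_people_per_hour(PADS, MINUTES_BETWEEN_LANDINGS, PASSENGERS_PER_TRIP)
by_belt = conveyor_people_per_hour(PADS, BELT_METERS_PER_MINUTE, PEOPLE_PER_METER)

print(f"helicopters needed: {fleet:.0f}")
print(f"helicopter throughput: {by_air:.0f} people/hour")
print(f"conveyor throughput: {by_belt:.0f} people/hour")
print(f"conveyor advantage: {by_belt / by_air:.0f}x")
```

The fleet calculation does confirm the 400 helicopters in the suggestion: a 20-minute round trip with a landing every five minutes means four helicopters in rotation per pad.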

 

The streaming model is something that shakes the rafters on our reporting models, but as with any problem of scale, we must provide the physical plant first, and it has to address the problem on purpose.

0 Comments Permalink

Here I am in Washington DC. Yes, that's right, Washington DC on the eve of the inauguration. And one may well wonder why a boy from Texas is hanging out in DC? I don't really have plans to attend the events of the next several days, as I am ensconced at a client site on a new and challenging project. I'll share some of the more generic challenges and opportunities at this site at a later time, scrubbing the content, as always. However, the DC area is about to be inundated with what is nothing less than a problem-of-scale. Estimates are that over three million people will attend the inauguration tomorrow, and they've pulled out all the stops and added extra manpower to assist in this mass ingress (and egress) of humanity. The parks services here often use indirect readings to get a better picture of total headcount (like water usage, purchases of refreshments, etc). In the most recent Earth Day celebration, they over-estimated the total attendees because they left more trash on the ground than the usual events. The irony is not lost.

 

It is instructive what the local authorities have initiated - the bridges and roads all around the city are closed to motored traffic - allowing foot-traffic only. In case of emergency, people can make a quick and orderly exit from the city and into the surrounding areas. Of course, after watching the mayhem of 9/11, we know that if a catastrophic emergency arises, people will make it to the riverfront and then wait patiently in line to cross the closest bridge, right? Who are we kidding? Expect to see swimmers in the water, and all that. Not to put a damper on the festivities, of course, this is an important day for America in many ways, and the people in charge are very carefully seeing to it that all forms of chaos are under control.

 

But are we? How does chaos, even explosive, career-threatening chaos enter our environments and wreak havoc on our systems? I noted in a prior entry that simply watching for what is known while ignoring the unknown - or even accidentally allowing the unknown to pass through because we failed to recognize it - isn't good enough. It wouldn't be good enough this week for the Obama and Biden families, so why is it ever good enough for our mission-critical systems? It is because we are accustomed only to checking our information against the standard to which it must rise, rather than checking it against a minimum standard to which it cannot rise at all.

 

Now, that sounds a little oblique so here's an example: Whenever a police dog from any given K-9 unit in America is searching for a suspect in the dead of night amidst flashlights, annoying shouts and eerie background music, it is searching for one and only one thing - the scent that isn't there. In fact, for any of these dogs you could easily perform a test - fill a room with fifty people and have one of them leave and hide. Then bring in the dog. It will immediately lock upon the missing-scent and go after the person who isn't there. How amazing is that? Of course, their olfactory sensors are several hundred thousand times more sensitive than ours.

 

Within international terrorist discussions on the chit-chat shows, we hear a maxim of "our intelligence has to be right 100% of the time, but a terrorist only has to get it right once." In other words, the terrorist may make one thousand attempts to take down a target, and only has to get it right once to succeed. But the people protecting it have no options - it has to be right all the time. Errors and junk in our data are a form of virtual terrorism - but only by latent effort on their part. Like dust slowly settling on electronic parts, if the parts are not protected from the creeping effect of buildup, the layer will eventually reach a critical mass and bring down the system. No one speck of dust did the trick, only the lack of attention to the dirt. When we say, "what are the odds?" - that word "odd" has a meaning. It is the meaning of "what doesn't belong here".

 

In a data processing environment, chaos thrives amid neglect. If we're in hard pursuit of chaos it has less of a chance to succeed - but we're only talking about probability now. Sooner or later, will the odds be in the favor of the last speck of dust that really counts, or will we keep the dust to an acceptable level so that there is no appreciable or dangerous buildup? Before I start sounding like a cleaning commercial, let's try to keep in mind that we need some serious sifting power, and not just spot-check, sampling or rules-based checking of row-at-a-time data. Dirt and all its patterns show up slowly in some systems, but show up in bulk in ours.

 

We need a way to bulk-lift the dirt out and away. Or for that matter, redirect the dirt in quantity so that it never finds a home. How do we do this? By challenging as-suspect every row that arrives on the front door. Of course, doing this "upon arrival" is a bit daunting, highly inefficient and error-prone, and not really practical. No, what we want to do is pull all the candidates inside, take a look at those who belong and those that don't, and sift them while they are in the pipeline before they ever land in an Important Place. We need something akin to a switching device on a train track, or like a letter-sorter in a postal office, or like a coin-sifter in a change-making machine. We need for this sifting effect to fall-out using normal physics, not as a result of carefully handling the elements. We just don't have the time or bandwidth to taste-touch-handle each row.
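To make the sifting idea concrete, here is a small Python sketch using the built-in sqlite3 module as a stand-in for the warehouse. All candidates are landed as text first. Then a single predicate describes what a row that belongs looks like, and two set-based statements route everything in bulk: rows that match go to the clean table, and whatever does not match falls out into a reject table. The table names, the sift function and the deliberately simple validity rule (an unsigned decimal amount and a YYYY-MM-DD date) are assumptions for illustration only.

```python
import sqlite3

# What a row that belongs looks like. coalesce() keeps the predicate strictly
# true or false, so a null never slips past both the clean and the reject branch.
BELONGS = (
    "coalesce(amount, '') <> '' "
    "and coalesce(amount, '') not glob '*[^0-9.]*' "   # digits and dots only
    "and coalesce(amount, '') not glob '*.*.*' "       # at most one decimal point
    "and coalesce(sale_date, '') glob "
    "'[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]'"
)


def sift(conn: sqlite3.Connection) -> tuple:
    """Route every staged row in two bulk statements; return (clean, rejected) counts."""
    conn.execute("create table clean_sales (id integer, amount real, sale_date text)")
    conn.execute("create table reject_sales (id integer, amount text, sale_date text)")
    clean = conn.execute(
        "insert into clean_sales "
        "select id, cast(amount as real), sale_date "
        f"from stage_sales where {BELONGS}"
    ).rowcount
    rejected = conn.execute(
        "insert into reject_sales "
        "select id, amount, sale_date "
        f"from stage_sales where not ({BELONGS})"
    ).rowcount
    conn.commit()
    return clean, rejected


# Land everything as text first; nothing is inspected row-at-a-time on the way in.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("create table stage_sales (id integer, amount text, sale_date text)")
warehouse.executemany(
    "insert into stage_sales values (?, ?, ?)",
    [
        (1, "10.50", "2009-01-19"),
        (2, "abc", "2009-01-20"),
        (3, None, "2009-01-20"),
        (4, "7", "01/20/2009"),
    ],
)
warehouse.commit()
print(sift(warehouse))
```

Notice that nothing in the code names a specific kind of bad data. The rejects are simply whatever is left over once we have said what good looks like.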

 

Within a Netezza domain, start thinking about doing everything in scale. Rather than examining a problem as though we're trying to find a criminal, examine the problem in terms of what criminals look like, and what they don't. These provide clues to Netezza on "where not to look" - and we'll find our suspect faster.

 

Many years ago I assisted (from a virtual distance) a prosecutor in a dragnet operation where the various detectives had, over time, photographed or filmed a number of drug-purchases by nefarious characters and denizens of society. In one day and night of misfortune (for the perps), a task force netted all of these people in a several-hour sweep of three large districts in three states. Within a matter of hours, all but one of the over three-hundred perps had been captured, and the final one on the next morning when he walked in his front door. We hear about occasional "creative stings" - one in Dallas where every person with an outstanding warrant was sent a mailer saying that they had won an all-expense-paid cruise vacation and all they needed to do was show up at Texas Stadium to claim their prize. Of course, in reality only police officers awaited them, and scuttled the hapless souls out the back door to awaiting transportation to a local incarceration facility.

 

Why does all this matter? When we want to find the bad data, we need a way to call it out. We need a way to find it in a manner that separates it from the rest of the data. In human terms, we often think in terms of identifying the perps and going after them, because we cannot possibly sift through the counties on a door-to-door search. We might know for sure that a nefarious character lives in a certain neighborhood, but without specific intelligence to locate the character, he remains at large. Netezza doesn't do a door-to-door search either. In fact, if we can only identify "what isn't a perp" - the suspects stand out like the hapless lemmings at Texas Stadium. Both of them arrived by a form of natural physics. The perps stood out by the natural human gravity of something-for-nothing. Our bad data stands out using the natural physics of the Netezza architecture.

 

Or in another vein, Secret Service agents and bank employees are taught how to spot a counterfeit bill based on what the real thing looks like. While they may know various counterfeiting techniques, nothing beats the ability to know the real thing so well that a counterfeit stands out like a sore thumb. Got a pile of cash in your pocket? Could you tell which ones are real and which ones are counterfeit because you know "what isn't a counterfeit" better? Same technique applies here - give Netezza the cues on where not-to-look and we have our counterfeit data, our posers, our perps and wannabees. And make the domain safe before bedtime.

0 Comments Permalink

Welcome everyone to Linking In: The Partner Perspective, a new blog that will keep the Enzee world up to date on all of the news and developments coming from Netezza’s business partner community.

I am lucky to represent a dedicated team of people here at Netezza whose mission is to make our customers and partners successful by linking the right technology with the right people. We support an incredible group of partners who solve customer problems and enable adoption of our technology around the world. If you’re interested to see the next generation of advanced analytics and data warehouse appliance technology unfold, you’re at the right place.

Through this blog I will keep you up to date on innovations and customer success stories as they emerge from the Netezza partner community. You will see postings on partner activities, customer success stories, partner support programs, and everything else related to our goal of enabling broad adoption of Netezza’s technology platform. By providing a rich source of information, we want you to have all of the knowledge necessary to make the links out there in the marketplace and make things happen. And beyond just sharing information, this blog is meant to be a catalyst for communication in all directions. Ask us questions, tell me what you think, ask questions of other partners, debate among yourselves… it’s all good.

So… with that background, as the proud owner of a shiny new blog I need to figure out what to do with it. Suggestions or requests?  All comments are welcome!

By the way, I'm Kevin Kostuik, Director of Platform within the Strategic Development group here at Netezza.  As we go forward I'll introduce you to the other members of our team...

Cheers,

Kevin

0 Comments Permalink

Welcome to my blog! In this "Soundbytes from Brian" blog, you'll get to hear my thoughts and reports on hot topics to enzees, happenings in the data warehousing and BI industries, and other fun stuff that I think enzees around the globe care about. Hopefully you'll find this blog both entertaining and informative - and please comment so we can get some good discussions going!


My first posting has to do with - what else - the economy. Stop yawning - hopefully this enzee-oriented twist contains information that you care about! As we all know, the worldwide economy is shaping up to be a bear market for the bulk of 2009. So I’d feel overly optimistic saying that the BI/data warehousing industry will be a raging bull market this year.

But then again, when you listen to the experts, it’s hard not to think that we enzees may very well be sitting in the middle of 2009’s silver lining!

Here are a few words of wisdom that may convince you to jump on the BI/data warehousing bull market bandwagon:

  • Stephen Swoyer from Enterprise Systems says: With market watchers Gartner Inc., IDC, and Forrester revisiting and revising their IT spending forecasts downward for 2009 and beyond, spending on business intelligence (BI) and data warehousing (DW) appear, paradoxically, to be holding steady or poised for growth.
  • Dan Vesset and Brian McKnight from IDC report to Enterprise Systems: "[Our] analysis suggests that we are at the beginning of a new wave of business analytics deployments that will materialize over the next decade and will be focused on addressing two primary demands [i.e., an ability to handle both more data and more users]."
  • Brian McDonough from IDC says: "As information access, analysis, and management software markets continue to adapt to changes in demand, IDC believes that these technologies will be increasingly deployed in the context of specific business processes. Making information actionable, rather than just accessible, is paramount to successfully competing in these markets in 2009."
  • Jeff Gleason from Aegon USA Investment Management explained to InfoWorld: “… it's always tough times like now when we wish our enterprise architecture practice were more mature, that we didn't have so much redundancy, that system changes were easier and faster…”
  • David Stodder from Ventana Research explained to Intelligent Enterprise: "In the recent Ventana Research ‘Optimizing BI and Data Warehouse Performance’ research study, more than half (58%) of participants said they are experiencing sometimes nightmarish performance problems when they have to scale to run more complex queries, and nearly half (48%) said they have the same problem when scaling to load more data. As a result, they are canceling important queries when they simply run too long. Thus, it's not surprising that the study found that organizations are evaluating appliances and column-oriented databases to remedy problems in these areas."
  • Eric Lai from ComputerWorld says that, according to a recent Forrester study: “The survey also found BI becoming ubiquitous, with almost 60% of respondents saying they were deploying BI, data warehousing or data integration tools across their enterprises."
  • IDC Research reported in their 2009 predictions webinars: Companies across all industries – including the financial services! – will look to cut costs overall, but they’ll continue to invest in innovative BI solutions in 2009.

So, to all you enzees out there, breathe easy! I’m not going to lie – every industry will be tough in 2009 – but best-of-breed solutions in data warehousing and business intelligence, and the people that implement and run them, should be OK. Knock on wood.

 

- Brian

 

 

0 Comments Permalink


And now, a drum roll please, the inaugural entry for this auspicious occasion. I realize that many people who read this will be Enzees and non-Enzees, so for those who want to know a little more about the machine, sort of dipping the toes in the water, I'm talkin' to you.

 

And for those who are already swimming in the deep end, and those in the deep end without a Netezza machine, I'll try to shape some thoughts for your own discussions. I've noted in other venues (particularly in the Netezza Underground, and that's my only shameless plug for the book!) that the Netezza machine addresses problems-of-scale. We know what a function-point looks like, and we know when data is transformed from one shape to another. What we might not consider is what it takes to make this happen in scale. And not to belabor a point, but Netezza is an appliance. While we know that we can make enough drinks for a small dinner party in our blender-appliance, and serve coffee to them afterward from our automatic coffee maker-appliance, will these same appliances cover, say, one thousand people? Ten thousand people? For that matter, if we have a simple toaster-appliance, will even our four-slot toaster satisfy warm-toast-requirements for, say, a thousand hungry teenagers? I recall having a winter get-together for a bunch of teenagers in our home, and one of the parents had signed up to provide chili. I still remember the kids piling into the house, about thirty or more, cold and thirsty from being outside in the weather. The parent showed up with a mini-crock-pot containing the warmed contents of two cans of store-bought chili, and had clearly missed the memo to show up prepared to feed the masses. We ordered pizza.

 

The requirements for processing data in scale are no different, and we have several primary hurdles to overcome that Netezza has recognized, embraced, harnessed and solved, and I would be remiss in not pointing them out, because for some people these are not obvious. But before we delve, consider this: Let's say I have a compliance model and I want to find the few thousand records among millions or billions that are not in compliance. This is a needle-in-a-haystack-problem, and for some of you that pile of hay is pretty big! If we did this in a "typical" fashion (without Netezza), we would examine each row for a variety of anomalies, all based on some criteria or set of rules. We really need to get it right on the first pass, because a multiple-pass-model on the data is unthinkable.

 

Does this always work? What if one of the anomalies gets past the sifter because we did not actively include a rule that one of the bad-boy records is now deftly sidestepping without our knowledge? Will this dirt creep into our warehouse? Will our users see the dirt before we do? Will this chaos potentially spell doom for someone, somewhere, and can we rescue it from the tracks before the locomotive arrives? Stop throwing popcorn, it's only a melodrama.

 

But with Netezza we have a very interesting option, in that Netezza operates on the principle of where-not-to-look. If we give it enough information, it can ignore the data we don't care about and by default bubble-up the data we want. In the above case, let's say we have two broad categories of data, the "hay" and the "needles". We know that the needles don't belong, and we can certainly tell what hay looks like. Now what if we did a single pass on the data, identifying everything that looks like hay, and rolled the identifiers for these records into a temporary table? In the next pass (that's right, another pass!) we simply anti-join the original data with this temporary table (using a where-not-exists), and voila! - the needles fall out of this as a natural result. Will all of these results be needles for certain? Well, we know that if all the other records are definitely hay, we now have a much smaller and more objective subset of needle-candidates to work with, and that all of our needles are in it, even though some anomalous "borderline hay" might be there, too. Either way, the needle-candidates are in the tank, ready for examination. And more importantly - our haystack itself is pure and ready for the next downstream operation.
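
For those who like to see the two passes spelled out, here is a minimal sketch of the idea. It is not Netezza code: it uses Python's built-in sqlite3 module as a small stand-in database, and the table name (txn), the sample rows and the "looks like hay" rule (non-negative amount, currency present) are all made up for illustration. Only the shape of it - roll the hay identifiers into a temporary table, then anti-join with a where-not-exists - comes from the description above.

```python
import sqlite3

# Hypothetical sample data: (id, amount, currency). Invented for this sketch.
ROWS = [
    (1, 120.0, "USD"),
    (2, 75.5, "USD"),
    (3, -40.0, "USD"),   # negative amount: does not look like hay
    (4, 300.0, None),    # missing currency: does not look like hay
    (5, 12.25, "EUR"),
]


def find_needle_candidates(rows):
    """Return the ids of rows that did NOT match the 'looks like hay' rule."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE txn (id INTEGER PRIMARY KEY, amount REAL, currency TEXT)"
        )
        conn.executemany("INSERT INTO txn VALUES (?, ?, ?)", rows)

        # Pass 1: roll the identifiers of everything that is definitely hay
        # into a temporary table. The rule here is an assumed example.
        conn.execute(
            """
            CREATE TEMP TABLE hay_ids AS
            SELECT id FROM txn
            WHERE amount >= 0 AND currency IS NOT NULL
            """
        )

        # Pass 2: anti-join the original data against the hay identifiers.
        # Whatever is left over is a needle-candidate.
        cur = conn.execute(
            """
            SELECT t.id FROM txn AS t
            WHERE NOT EXISTS (SELECT 1 FROM hay_ids AS h WHERE h.id = t.id)
            ORDER BY t.id
            """
        )
        return [row[0] for row in cur.fetchall()]
    finally:
        conn.close()


if __name__ == "__main__":
    print(find_needle_candidates(ROWS))
```

Notice that we never wrote a rule describing a needle. Anything that fails to qualify as hay, for whatever reason, lands in the candidate set, including anomalies nobody thought to write a rule for.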

 

And this is what's important - if we have billions of rows in a table and we want to find the few thousand that will cause us trouble, we can perform this two-step carving using Netezza's massively parallel power, carve out the troublesome ones and pass good data to the next downstream process. We then deal with the anomalies in a more administrative manner. In short, we don't have to "stop the presses" for the sake of the needles. We just remove the needles and continue. I know that some people have the philosophy that if any needles exist at all, we must stop-the-presses and figure things out before proceeding. This might work for a few hundred anomalies in a few thousand, but it won't scale.

 

Our analysts will grow impatient and our troubleshooters catatonic from the sheer volume. Someone will scream an epithet that begins with the words "Why don't you just..." and hopefully the end of this sentence will be something professional like "take the dirt out and deal with it elsewhere, but don't hold the rest of our data hostage!"

 

I'll deal with hostage negotiation in a later entry.

0 Comments Permalink

Greetings to all Enzees everywhere and welcome to the Grill!

 

Now, to level-set a bit on where this "grill" concept came from, I was hanging out at the house with friends and family shooting the breeze, cooking food on the outdoor grill and it sort of dawned on me that this is a lot like the virtual atmosphere in the Enzee Community. Maybe some of the analogies fall short, but the spirit of the atmosphere is what we're after.

 

And of course, the grill itself. A place where we toss things on the cooker, or cook-things-up, or take a raw idea to its conclusion, along with the transformation that happens from raw-to-cooked, and hopefully we won't have anything half-baked!

 

But keep in mind this is a grill, not a roast! So let's keep the atmosphere jovial and forward-looking, because that's where all the action is. And like any good cookout, while there's a guy flippin' the food who seems to be in charge, everyone's opinion counts. I'll write most of the entries in essay-style form, perhaps even persuasive-essay style, because this is a blog and I have opinions too - but I don't do one-liner entries (mostly).

 

I'd like to thank the folks at Netezza for this opportunity to interact with the Enzee community at large, and hope it becomes a very fruitful discussion for all.

 

In the spirit of the political season, the "inaugural" discussion will appear next!

0 Comments Permalink

Watching my modest investments dissipate is a sure sign of a recession. Another sign is being continuously harassed by telemarketing calls from vendors with “recession proof” offerings. This has now reached such a fever pitch that, if these vendors are to be believed, they must be delighting in the economic downturn! For me, it’s a case of hear the word recession and tune out. So note to self: make sure Netezza doesn’t jump on the recession bandwagon.

 

As Netezza’s “spin doctor in chief”, I have the task of fathoming out a good recession story of our own. Fortunately, the answer is pretty straightforward: just continue to do what we’ve always done (“yuk” I hear you scream as you reach for a bucket).

 

In times of economic strife, corporate executives batten down the hatches and get back to basics - they watch expenses and kill strategic programs that have only long term value; they focus on quality and execute tactically on programs with a rapid ROI and immediate results. And when they’re not doing this, they look for market openings where they can step in and make a killing.

 

Data warehouse appliances provide a quick and easy solution to a business problem without all the fuss that we’ve become so accustomed to in IT projects. Of Netezza’s hundreds of customers, we are proud to rank among their number some very successful EDW implementations, but a great number of enzees deploy Netezza to solve pressing business problems and then grow from there, increasing business value (and IT credibility) at each step of the way.

 

So whether it’s dealing with huge amounts of data in revenue assurance, doing deep-dive analytics in fraud detection or just sorting out SLA breaches in business intelligence… Netezza data warehouse appliances go in easily, are quick to deploy, perform blisteringly fast, are as cheap as chips and are about as recession proof as you can get.

 

So, note to self: don’t mess with something that’s working.

0 Comments Permalink

 

We came across a series of blog posts the other day which seemed to insinuate that Netezza and other competitors might be trying to shape our 10-100X performance message on the backs of comparisons to antiquated, end-of-service-life systems rather than comparisons to current competitors' platforms. When we got to this one - "Database Customer Benchmarketing Reports" - I felt we just had to correct the record, so I wrote a response to Greg Rahn's posting to give Netezza's side of the story, namely that

 

  • we are as up-front as possible with prospective customers and use the customer benchmark testing/POC process to prove out Netezza's performance, value and simplicity value propositions;

  • the results of other products' performance come from our prospects/customers and not the result of Netezza running the tests on those platforms;

  • not only do we test against the incumbent systems, but there is almost always at least one other current competitive system that is included in the POC process;

  • the PowerPoint deck on which Greg was doing his analysis contained some rather ancient (in enzee-years, anyway) comparisons with versions of the NPS appliance that we have not sold in as long as 4.5 years & was really not much of a data set on which to base his analysis; and

  • the "proof of the pudding is in the tasting" - Netezza's success rate of converting prospects to customers through the customer benchmarking process remains very strong.

 

In short - we make every effort to keep the Netezza website contents both accurate and clear and we definitely feel confident in standing by our 10-100X performance claims. It would be great to have more than the Netezza "product marketing guy" clarify things - while I know some of the excellent results recent customers have seen in POC, no one knows them better than our customers & SI partners themselves.

 

 

5 Comments Permalink

 

We had quite a surprise the other day when it came to our attention that Netezza and the NPS data warehouse appliance are now the subjects of a new book: Netezza Underground: The unauthorized tales of derring-do and adventures in resilient data warehousing solutions, by David Birmingham (ISBN: 1-4392-0743-7 and now available in paperback version for $31.54 at Amazon.com).

 

 

This is not the first instance of the NPS system being the subject of a book sold by Amazon (e.g., SAS/ACCESS(R) 9.1.3 Supplement for Netezza), but this particular publication certainly brought feelings of both fun and reaching into the mainstream with it, starting right from its very clever cover art (above) to David's clever turns of phrase and real-life examples.

 

 

As the title suggests, it was not written or coordinated with any Netezza authorization. So of course we bought a copy and read/skimmed through it as quickly as we could. I will say this, David's self-publication skills are great - he keeps what could easily have been a boring, heavy technical tome both engaging and fun to read while still imparting lots of great information about the NPS system, its performance and its ease of operation. And the book's publication is incredibly current - with references to Netezza Developer Network and "BI Appliance" announcements made only as recently as the Enzee Universe user conference in September.

 

 

While I certainly could quibble with a point made here or there about the system, in general I thought it was an excellent book and even put up the following recommendation for it on the Amazon site:

 

I commend David Birmingham on a book that is at once as lightly entertaining and interesting to read as it is chock full of details about just the kind of performance and operational simplicity that is possible with the Netezza Performance Server (NPS) system. Straightaway from the opening pages, Birmingham's effusive, engaging style and excitement about Netezza's system is apparent, "It inhales, crunches and publishes Libraries-of-Congress-at-a-time - and fast."

He also captures the essence of the NPS appliance in an ultra-succinct two-sentence paragraph explaining just why his "Administration Stuff" chapter is so short, "It's an appliance. Put it in the corner and let it work." I couldn't have said it better myself!

This book is comprehensive and current - even reflecting some of the more recent announcements from Netezza regarding OnStream programmability, the Netezza Developer Network and analytic appliances.

As the guy who is responsible for projecting the Netezza products and our technology direction forward, I want to recommend David Birmingham's book to current and prospective customers and partners alike, or as David himself says on the book's Dedication page, "to Enzees everywhere".

--Phil Francisco, VP Product Management & Marketing, Netezza Corporation

So "to Enzees everywhere", have a read of David's book and welcome to the "Netezza Underground".

2 Comments Permalink

Okay, I'll admit that my first posting about the new Oracle Data Warehouse Appliance (DWA) tonight was a tad on the "snarky" side. But I have to say that I think it was because of all the influences in the environment around me. Straight away since the announcement yesterday afternoon, there's been a healthy degree of skepticism from industry insiders.

 

Beyond his commentary on Larry Ellison's hairstyle, Gavin Clarke of the UK's Channel Register virtually flogged Larry for flogging the "Oracle server appliance alliance with HP". Some of the best snippets included:

 

  • Gavin's subtitle: "(Not) a hardware provider"

  • "And so to chief executive Larry Ellison, who Wednesday afternoon announced Oracle's third effort in 10 years bundling his company's software with someone else's hardware. This time, it's a high-performance, Oracle data and storage server stack locking arms with old favorite Hewlett-Packard."

 

And after taking several informative paragraphs to expound on Oracle's two previously-failed attempts at ‘appliantization' - most recently the "Network Computer" initiative circa-2000 - to draw the clear analogy to yesterday's announcement, Clarke closed out his piece with this stinger:

 

  • "In a telling sign of how much faith Ellison places in his latest appliance, he did not sit down for his traditional, open-mic smack-down session with OpenWorld attendees to field questions."

 

 

 

Analyst/blogger Curt Monash summarized more than a few skeptical digs in his Oracle Exadata and Oracle data warehouse appliance sound bites posting earlier today. For example, here are a few "bites" from Curt's post:

 

 

 

VP & Global Marketing CTO Chuck Hollis of EMC weighed in with a couple good shots on his Chuck's Blog post: Oracle does hardware (emphasis mine):

  • "Of course, there's little in the way of performance comparisons to help us evaluate just how fast this beast might go, except the ‘Up To 10x Faster' which smells a bit optimistic, never mind that it's Oracle comparing with itself, rather than other data warehousing appliances."

  • "Every year at Oracle Open World, we hear about many "new initiatives" from Oracle. Well, not to be harsh here, but it's my impression that very few of them get talked about at next year's Oracle Open World. I routinely dig up past announcements from previous years, and it's relatively consistent pattern. I think it's fair to ask the question -- just how serious is Oracle about all of this?"

 

 

 

 

But the lead cynic was none other than Oracle CEO Larry Ellison himself. After years of denying performance issues at scale with various generations of Oracle DBMS software for data warehousing, Larry dropped this 11g-megaton bombshell about Oracle's data warehouse scalability, pre-Exadata - laying out the fundamental reason why Netezza has become the industry leader in Data Warehouse Appliances (source: ZDNet's Larry Dignan):

"Ellison, speaking at Oracle's OpenWorld conference, said large databases are creating a fundamental problem: Disk storage systems can't cope with data that has to be moved off of drives to database servers. He called it a ‘data bandwidth problem.'

"As data gets larger the slowdowns become more unbearable. At one terabyte you will notice data bandwidth slippage. At 10 terabytes, storage systems crawl. ‘At one terabyte the problem rears its ugly head and it gets worse every year,' said Ellison."

 

 

And that's not all - the barbs, skepticism and "bites" go on in site-after-site, and commentary-after-commentary. So please forgive my snarky-ness - I blame it on the "nurture" of my environment, not my personal "nature", per se.

1 Comments Permalink