BI Decision

Thursday, January 11, 2007

Is the One Size Database Dead?

Slashdot has an article based on some work at MIT showing that databases with non-traditional architectures can dramatically outperform the usual RDBMS.

It's an interesting race between new requirements and the mainstream. When requirements get ahead, we start seeing interesting new tools; when the big RDBMS's catch up, the new tools retreat to the niches.

I think we are coming out of an era when the big RDBMS's were ahead, but we may be heading into a new era of niche players.

Monday, December 04, 2006

Asking Good Questions

Asking good questions is key to building the right BI system. Interviewing is an art, and Seth Godin points out some question tips from an ESPN Pro.

Sunday, October 15, 2006

TOAD

No one who writes a lot of SQL should be using the clients that come with databases. SQL*Plus and the DB2 Command window work, but they lack the tools of a modern development environment.

My favorite is TOAD. Despite the acronym (Tool for Oracle Application Developers), TOAD also comes in flavors for DB2 and SQL Server. The freeware versions for DB2 and SQL Server are a relatively new development.

Try the free version (which can be a bit hard to find on the website).
Other good tools are PL/SQL Developer, WinSQL, Rapid SQL, and the new Oracle SQL Developer.

Tuesday, July 18, 2006

Tips for folks switching between DB2 and Oracle

Some Oracle and DB2 differences that are hard to find in the manuals:

Generic 1 row table for fast queries:
Oracle: dual
DB2: sysibm.sysdummy1

Current date and time for the server:
Oracle: select to_char(sysdate, 'yyyy-mm-dd hh24:mi:ss') from dual;
DB2: select current timestamp from sysibm.sysdummy1;

Delete all rows from a table, fast:
Oracle: truncate table x
DB2: load or import an empty file with the REPLACE option (e.g., import from /dev/null of del replace into x)

Monday, July 17, 2006

Ab Initio's Web Site

Ab Initio has an interesting corporate style. From what I've seen and heard, they make a heck of an ETL tool. But they are surprisingly private. Here is their website: Ab Initio

Notice the lack of the traditional "about us" tab that every other private software company has. No information on board, investors, or management team. No brag list of revenue growth or customers. No public forums or downloadable demos.

Intriguing.

Saturday, July 15, 2006

Agile development and BI

Interesting conversation going on at Slashdot about the rise of rapid prototyping tools and the increasing emphasis on fast delivery.
If somebody came to you and said "hey, I've got this great new way to build a bridge! Instead of making up plans, we'll just start building it! We'll build it out of popsicle sticks first, and then we'll go in and add some steel beams, and toss some pavement on top of that," you'd say they were insane. Nobody does stuff like that in the real world -- yet that's exactly what a lot of poorly-managed 'agile' software projects are doing. They're getting short-term prototyping gains but at the cost of maintainability and probably stability as well.
Slashdot is mostly focused on web development tools, but the same ideas apply in the Business Intelligence world. Tools like Ab Initio, Informatica, and Ascential (now IBM WebSphere Information Integration? IBM still has its touch for catchy naming) enable stunningly fast development.

Fast development tempts us to try to do the rest of the project fast as well. Faster is better, except when it comes at the cost of building to the wrong requirements or writing low quality code.

Wednesday, July 12, 2006

Inserting instead of updating

Why don't they teach this in school?

During prototyping or early development, I often want to add a column to a table that already has data in it.

The direct approach is alter table, update column.

Once the table gets bigger than a few hundred thousand rows, the update approach gets slow.

The faster approach is to create a new_table as select * (plus the new, populated column) from old_table, drop old_table, and rename new_table to the old name.

Why is that faster? Over to Tom Kyte.
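
A sketch of the pattern, using Python's sqlite3 in place of Oracle or DB2 (the table names and the tax column are made up for illustration; exact create-table-as-select syntax varies by database):

```python
import sqlite3

# In-memory database stands in for the real warehouse; the pattern is the same.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE old_table (id INTEGER, amount REAL)")
con.executemany("INSERT INTO old_table VALUES (?, ?)", [(1, 10.0), (2, 20.0)])

# Instead of ALTER TABLE ... ADD COLUMN followed by a slow UPDATE,
# build the new table in one pass with the column already populated.
con.execute(
    "CREATE TABLE new_table AS "
    "SELECT id, amount, amount * 0.1 AS tax FROM old_table"
)
con.execute("DROP TABLE old_table")
con.execute("ALTER TABLE new_table RENAME TO old_table")

print(con.execute("SELECT id, amount, tax FROM old_table ORDER BY id").fetchall())
# → [(1, 10.0, 1.0), (2, 20.0, 2.0)]
```

One pass through the data beats a row-by-row update once the table gets big.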

Tuesday, July 11, 2006

Nullable fields in fact tables

Why should fact table fields not be nullable?

Three big reasons:
  1. Null key fields can’t have descriptions from a domain table attached (without special processing in the report/query)
  2. Nulls can cause wrong group by/drilling results.
  3. Declaring fields NOT NULL gives the optimizer more information to choose faster query paths
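
A minimal sketch of reasons 1 and 2, using Python's sqlite3 as a stand-in for the warehouse database (the table and column names are invented for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE region_dim (region_key INTEGER PRIMARY KEY, region_name TEXT)")
con.executemany("INSERT INTO region_dim VALUES (?, ?)", [(1, "East"), (2, "West")])
con.execute("CREATE TABLE sales_fact (region_key INTEGER, amount INTEGER)")
con.executemany("INSERT INTO sales_fact VALUES (?, ?)",
                [(1, 100), (2, 200), (None, 50)])  # one row with a NULL key

total = con.execute("SELECT SUM(amount) FROM sales_fact").fetchone()[0]
# Join to the dimension for descriptions -- the NULL-key row can't match anything
joined = con.execute("""
    SELECT SUM(f.amount)
    FROM sales_fact f JOIN region_dim d ON f.region_key = d.region_key
""").fetchone()[0]
print(total, joined)  # → 350 300: the NULL-key row silently vanishes from the report
```

The usual fix is a surrogate "Unknown" row in the dimension (key 0, say) so every fact row joins to something.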

Kimball has thoughts on cases where fact table fields should be nullable in his latest design tip.

Friday, July 07, 2006

Kimball's Design Tips

Ralph Kimball publishes a periodic design tip. I recommend the email list.

Space is cheap, but every byte in a fact row counts

Replacing natural fields with synthetic integer keys reduces the row length. This can really speed up queries that scan significant numbers of rows.

For example, on a table I recently worked on, switching from natural fields to integer keys reduced the average row size from 122 bytes to 50 bytes. Switching the integer keys to smallints further reduced the row length to 32 bytes. On a 200 million row table, that reduced the core data storage from 24.4 GB to 6.4 GB (a 74% reduction).
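
The arithmetic above, as a quick sanity check (decimal gigabytes, core data only; this ignores page overhead, indexes, and free space):

```python
rows = 200_000_000  # 200 million fact rows

def core_gb(row_bytes):
    # Core data size in decimal gigabytes
    return rows * row_bytes / 1e9

natural, integer_keys, smallints = 122, 50, 32
print(core_gb(natural))    # → 24.4
print(core_gb(smallints))  # → 6.4
print(round(100 * (1 - smallints / natural)))  # → 74 (% reduction)
```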

Some quick testing on our prototyping box, a DB2-on-XP machine, showed a dramatic difference in full table scans: 18 minutes on the wide table versus 8 minutes on the narrow one.

Friday, May 19, 2006

Paving the Well-trodden Path

Tom Austin of Gartner hits on IT policies that don't serve the user (no permalinks on the Gartner blogs, scroll down to the 5/8 entry):

So, for example, we persist in the notion that end users abuse e-mail and store too much there and transport things that should be elsewise transported. So we rail against users, institute draconian quotas and issue "get tough" proclamations (along with "reeducation campaigns"). Meanwhile, IT professionals are often the worst offenders. Why? Well, because our proclamations are wrong.

Consider the town green with a few paved foot paths. They're where people are supposed to walk. What if the town finds that there are unpaved foot paths trodden by people ignoring the paving? What's the right thing to do? Erect fences to force people onto the paved paths? Or pave the unpaved but well-trodden foot paths?

Some of this thinking is an outgrowth of the days when computers were expensive compared to labor. IT had to focus on serving the hardware beast, and only indirectly on the end customer.

I think BI projects can be guilty of not paving the well-trodden paths. Some of this is in making a pragmatic trade-off between design purity and user needs.

What the elegant IT solution calls for, versus what users actually want:

Elegant IT solution: No canned reports – users should use flexible query tools
Users want: Canned reports

Elegant IT solution: Manageable/scalable/accurate web reporting tools
Users want: User-friendly Excel

Elegant IT solution: A pure data warehouse back room with no user access
Users want: Every kind of access to their data

Friday, May 05, 2006

Features don't exist unless you know about them

Jensen Harris of Microsoft at "An Office User Interface Blog" tells the story of the missing drawing tools in Word and Excel:

again and again we hear stories about people assiduously creating drawings in PowerPoint and copying them over piece by piece into their Word or Excel document. I remember during a site visit watching a man create a simple flowchart in Excel which should have taken 3 minutes actually take 15 minutes because of all of the cross-application, clipboard, and windowing work it took to keep moving shapes between the apps.

Why do many people believe the drawing tools are only in PowerPoint? Quite simply, PowerPoint is the only application which shipped with the Drawing toolbar turned on by default.


This is an idea with broad impact for BI design: we need to make sure that users, especially casual users, see everything they need.

Sunday, April 23, 2006

Is the spell checker telling me something?

MS Word replaces denormalized with demoralized.

How much does a terabyte cost?

A Terabyte is a “Big Round Number”. For a while, it was the benchmark of a big data warehouse. However, Moore’s law seems to apply to the cost of hard disks as well: the price of a gigabyte keeps falling.

Just for fun, here are some quick benchmarks:

1 Terabyte of inexpensive PC disks: $344
(4 x 250 GB SATA disks
http://www.newegg.com/Product/Product.asp?Item=N82E16822148065)

1 Terabyte of iPods: $6,650
(17 x 60 GB iPods http://store.apple.com/1-800-MY-APPLE/WebObjects/AppleStore.woa/wo/1.RSLID?mco=CC4D3CBB&nclm=iPod)

1 Terabyte of RAID 5 SCSI attached to a bare bones Xeon server: $8,148
(Dell Poweredge 2800 4x300GB SCSI 10,000 RPM)

1 Terabyte of entry-level SAN storage: $16,159
(Dell AX150 iSCSI SAN 3x500 GB)
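
For comparison's sake, here is a quick sketch of the dollars-per-gigabyte for each option, using the prices above and the raw (pre-RAID) capacities:

```python
# Back-of-the-envelope $/GB; capacities are raw, ignoring RAID-5 parity overhead
options = {
    "PC SATA disks":       (344,   1000),  # 4 x 250 GB
    "iPods":               (6650,  1020),  # 17 x 60 GB
    "SCSI RAID on server": (8148,  1200),  # 4 x 300 GB
    "iSCSI SAN":           (16159, 1500),  # 3 x 500 GB
}
for name, (dollars, gb) in options.items():
    print(f"{name}: ${dollars / gb:.2f}/GB")
```

The spread is roughly 50x from commodity disk to entry-level SAN, which is why the hardware line item is rarely where the terabyte really costs you.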


The problem is that the cost of storage hardware is only a small part of the cost of a Terabyte of data warehouse. A Terabyte has direct costs in server hardware, backup, and supporting software. It has indirect costs in program complexity, mostly in the labor of DBAs, ETL developers, and related program staff.

Storage is cheap. As the cost goes down, substituting storage for development effort will save time and money. For example: storing several aggregates, keeping both normalized and dimensional data, and extending retention periods. The direct costs of storage are unavoidable, but indirect costs scale with complexity.

Friday, April 21, 2006

The beginning

This blog will talk about business intelligence and data warehousing from a practitioner's perspective.

Particular interests are:
1. Applying information to business processes
2. Data modeling & system design
3. Project management