Simplifying Rdbms

Some have stated a reluctance to rely on Relational Data Base Management Systems (RDBMSs) because they allegedly:

are too hard to learn or train staff for;

are too cumbersome to install or manage;

are not standardized enough to write vendor-neutral queries or prototype "in the small";

require too much reliance on database administrators (see Dbas Gone Bad)

A lot of this may have to do with "Oracle Syndrome" (Is Oracle Too Complex) where vendors feel they have to match Oracle features rather than improve on the RDBMS concept. Here are some suggestions for improving future RDBMSs:

Provide a dynamic relational model with optional typing, on-the-fly column and table creation, etc. (See Dynamic Relational for more on this.) This would especially be useful for RAD and prototyping projects.

Provide the option of using Relational Languages other than SQL, in order to get away from Sql Flaws. SQL, with its awkward and inconsistent syntax, is sometimes called the "COBOL of relational languages".

Allow easy immediate views (temporary or user-defined) if stuck with SQL. A lot of queries I encounter that grow unwieldy would have been much simpler if immediate views were available.

Develop Nimble Databases whose feature sets are upward-compatible with Big Iron RDBMSs. That way, one can prototype on an easy-to-install system but migrate over to a production system without having to rewrite application queries. This might require either further standardizing SQL or developing an alternative language. For example, companies often prototype using Microsoft Access as the RDBMS, and later port to Oracle or SQL-Server for production. However, the dialects are different enough to make this process difficult at times.

Better Table Browsers.

View Columns. Table views are sometimes too big a view granularity. View columns may simplify certain things. (See Database Best Practices.)


I think the key is making wrappers and tools around good databases instead of implementing half complete solutions. Why can't we abstract from a good database that follows the relational model and make it easier to use? No one says you have to use the initial database by hand.. we can create GUI tools and terser languages that access the relational database in an easier to use fashion.. without completely bastardizing the relational model with some form of database that isn't even a database.. let's not reinvent the wheel but reuse the wheel and improve it. For example, some of the Tutorial Dee syntax may seem complicated to you if you just want to get her going, like a true Table wheenie would want (with tools like Ex Base). So instead of reinventing another fake relational model (like dbase etc.) that will just lead to the relational model again when you finally Just Get It, why not just improve the tools that access the relational model? Make it an easier entry point for people to realize and become relational! (Similarily, I don't think people need to completely reinvent Ada or Pascal or C, but rather we need to make it easier to access and less tedious.) --Lars

''Therefore, lets use (or create, or foment the use of a) TrueRelationalToPseudoRelationalMapper''


Using the relational model is not simple, is not meant to be simple because it is used to make the complicated manageable.

You can make the management of the complicated accurate, complete and reliable with "hard work". If for example you are using the relational database model to design, specify and track the installation and provide for the automatic control of the instrumentation of a modern chemical plant, an Oil Refinery Complex, an Automobile Assembly Complex, or an Electrical Generation Facilty, you can do it, but it is not an easy or simple task. It requires the application of domain knowledge as well as data processing and programming expertise. These do not come free, or easy. It takes "Hard Work", skill and knowledge of the each of the creational steps in building the model, structure and methods others will use in accomplishing the tasks for which the model exists. If one wants to work on something easy to access and less tedious, try a simple problem and another process. -- Donald Noyes.20080401.1405.m05

Not disagreeing per se, but a minor quibble: The relational model is very simple. Its original goal, in fact, was to achieve (among other things) simplicity in the face of complex, often hardware or systems-dependent hierarchical and network database systems like CODASYL. In terms of simplicity, the Relational Model was a breath of fresh air. Applying the model correctly, however, is not so simple. Bricks and mortar are simple too; now go build a skyscraper with them...

I also find the Relational Model to be remarkably simple. What Is Data is a good bit more complicated, and knowledge systems (with fuzzy and probabilistic inference and open-world models) are even more so. The Relational Model denies the existence of fuzzy relationships or issues of confidence and knowledge. It also doesn't readily admit to higher-order logic (though that is more a consequence of the typical Relational Algebra and Relational Calculus). Oh... and, of course, it is simpler than navigational for purposes of data access (though that's not a necessary feature... one could easily stick a micro-RDBMS into every object and encapsulate them, and once there are two or more RDBMSs the non-navigational feature disappears).

Most of the complications people introduce into the Relational Model seem to be due to the root of all evils in programming: Premature Optimization done by hand. In this case, in the form of denormalized logical tables to force denormalized physical tables. Table denormalization for efficiency should be something you 'suggest' to the RDBMS as an optimization.

People too often underestimate the value of required optimizations in a system - they think "Relational Model isn't about optimizations, you fool!" (and Relational Model isn't about optimization... but RDBMS certainly is; otherwise you'd be doing Relational Model by hand, and you'd be happy with it). Providing good and ideally portable optimization solutions for the RDBMS would go a very long way towards simplifying it. What it does is reduce any need programmers feel to 'hack' a faster solution, because they can get the same efficiency (through automated optimization) while maintaining a greater simplicity of the model itself. I.e. they could put everything into Fourth Normal Form or into the 'one row = exactly one propositional fact' form I prefer without gritting of teeth about whether 'SSN(Employee,Value)' and 'PHONE(Employee,Value)' should be rolled into one table at the logical level just to guarantee it happens at the physical level. This is a case where Simplifying Rdbms requires adding some useful complexity in practice, such that programmers feel little need for Premature Optimization.

Support for complex types (and queries over and indexes upon them) would also greatly simplify RDBMS. It may seem counterintuitive, but this allows for far more natural representations of complex facts. E.g. currently, if you have a fact that relates one set of entities to another set of entities (i.e. R(S1,S2)), modern RDBMSs lack any natural way to represent this fact. You need to hack in solutions involving strange extra tables that give artificial 'names' to these sets just so you can put them into the relationship. One should note that, at this point, there is no natural way to express a comparison between two sets to see whether they are equal or share certain features: one must hack together a massive, grossly complicated query to make such a comparison, or one must settle for identity-by-name. This same problem also presents itself when dealing with trees and graphs (i.e. where the relationship is with the whole tree-value, as opposed to individual parts of it). Properly, there should never need to be 'named values' in an RDBMS; the need for them indicates that the RDBMS is 'simplistic' rather than 'simple', and forces programmers to add a great many epicycles to get anything nontrivial done. Note that joins or selects can and should include functions over complex values that return multiple values and thus separate individual rows into multiple-rows - this is part of relational programming. This is a case where the only proper way to go about Simplifying Rdbms is to add some necessary complexity.

You need to demonstrate that "complex types" makes the world better. Complex types usually generate the kinds of navigational messes that motivated relational to begin with. I will agree that powerful data dictionaries, including relationship info, could simplify things and automate a lot of validation that otherwise would have to be hand-coded. Thus, perhaps we are in partial agreement, just going about it in different ways: meta-data versus type-systems.

You need to demonstrate that "Complex types usually generate the kinds of navigational messes that motivated relational to begin with" - because it seems to me that would only happen for improperly normalized databases.

As far as demonstrating that complex types make the world better, the proof is already there - even 'strings' are complex types if all you have are the simple integers. If all you had were integers, you'd have to use a hackish create-a-table-just-to-represent-strings approach whenever you wanted strings as elements of another table. You've grown used to using this same sort of utter hack when dealing with tree-values or graph-values. This hack is a form of unnecessary complexity - Adding Epicycles due to the RDBMS being 'simplistic' rather than 'simple'; it creates problems of garbage-collection and forces relational programmers to worry about issues of representation when attempting to perform queries, and that makes a horrendous navigational mess far more than does proper support for complex types.

It has already been proven and even accepted by you that at least one more-complex-type, the string, makes the world better, and the exact same set of reasons apply to the other potential more-complex-types (matrix, graph, tree, function, set (as per the example above), relation, another whole database-value, etc.). As a consequence, the only reasonable conclusion is that support for these other complex types would avoid yet more hacks and make the world better still. If you have an objection to what I consider the only reasonable conclusion based on the evidence, you should properly argue why the evidence already in play is not sufficient to constitute proof; simply ignoring it and implying my point hasn't already been demonstrated is foolish - like closing your eyes and demanding your opposition prove the sun is up. Anyhow, I'll note that I'm not actually demanding validation or anything related to it in that last paragraph - just support for complex structured values... arbitrary 'domains' in the Relational parlance including matrix-values, graph-values, vector-values, tree-values, function-values, relation-values, measurement-with-unit-and-precision-values, etc.

I find your "proof" indirect, round-a-about, and full of peculiar extrapolations. Strings being difficult to force of out integers does not prove anything other than forcing strings out of integers is difficult. It also seems to be making the fallacy (unjustified extrapolation) that if some quantity of "types" are "good", then lots more are better. But, this is not the place for yet another long type-fight.

Forcing graph, set, and relation values out of strings and other more basic data types is also difficult - even more so than for strings. If you're going to claim this extrapolation is 'unjustified', a little handwaving and claiming fallacy where none is obvious won't do the job. I just happen to dislike the enormous Discontinuity Spikes placed capriciously in my path in many of my favorite domains because RDBMS guys have people like you saying you know what's best for everyone else - 'all you need are strings', you whisper, 'strings, strings, strings!'. An RDBMS should be able to support any value-structure I require. It must perform proper equality testing and indexing - even for unlabeled graphs (where equality is isomorphism) and unordered sets, both of which are extremely tricky to get right when handling relations between set-values and graph-values.

Having to re-implement support for these by hand via named values in distinct tables is just a way of Adding Epicycles, forced by a simplistic model that doesn't support the necessary type. So, yes, you have two choices: either support a wide enough variety of types that programmers in all target domains for your RDBMS have whatever they need or force programmers to badly reinvent these types in-database, repeatedly, and inefficiently, along with reinventing support for garbage-collection of unused values (since two 'equal' values will need to have the same name or autonum to support joins) - all of which will often drive them away from the RDBMS and back to storing data in a more flexible application language and using the DBMS as a mere persistence layer alternative to the filesystem.

Simplifying Rdbms demands you choose to add proper types - unless by 'Simplifying Rdbms' you happen to mean 'make the RDBMS simpler by shoving complexity to the user of the RDBMS'.

Show my ideas/techniqes being "bad" in plenty of common, frequent, and typical circumstances; and then I might start believing you. (And remember, I agree a type-heavy approach has domains where it does help over-all.) Otherwise, it appears you are making mountains out of molehills, magnifying stuff into Adding Epicycles in just your mind because you mistake personal preferences for universal truths. If you wish to continue this, please do so at a type-related topic. --top

[What do you consider to be "common, frequent, and typical circumstances"? Unless you're working in exactly the same domain I am, our perception of these will probably be wildly different.]

Top, I've encountered this problem plenty often. I want an unordered set of words to be an identifier for an object? Need to hack it. I want relations between graph-values as part of an associative memory engine? Need to hack it. I want a database of proposition-structured beliefs held by an expert-system? Need to hack it. Hack, hack, hack - in some of my favorite domains the RDBMS isn't doing me any good, barely even as a persistence layer. These problems exist aplenty, especially in domains in which RDBMS have failed to find purchase (in part for this very reason). Perhaps your laughable 'variety' of experiences hasn't forced you to bang your head against these problems often enough for memories to penetrate your skull, but I have even asked you to encounter similar situations a few times... e.g. supporting predicate and propositions-values, which you hacked (badly) then apparently suppressed like you seem to do with all other evidence inconvenient to your views.

And I've presented evidence that I believe should convince any reasonable, unbiased person who has sufficient experience and education to follow the arguments - but I honestly don't expect people like you to be swayed without having your face shoved repeatedly into problems that would require dragging you kicking and screaming from your posh but rather limited domain. Of course, you have no power whatsoever anyway - you don't know jack about systems software and thus you'll never implement an RDBMS, you'll likely not even implement TQL before you're pushing daisies... so you aren't my target audience. If you aren't convinced, and you can't present any reasonable counter-argument, it is not my loss. But your snide hand-waving and FUD is unprofessional and pointless; either confront my argument head-on and demonstrate how and why it is insufficient, or go find something productive to do.

If you have special needs for your AI niche, that is understandable. But if you don't relate it to the reader's domain, the reader will not find the scenarios convincing. You need things like, "Look what happens when you add a new product category: without warning it breaks the following 7 subroutines....". I'm just the messenger. AI is not currently a large domain, so your evidence as is approaches a Walled Garden. Related: How To Sell Golden Hammers.

[What do you consider "niche" vs. "non-niche"? How do you know what the "reader's domain" is? What is a "large" domain, in your opinion, if AI is not large?]

I think it is safer to say that custom business apps (Category Business Domain) are a much larger niche than AI. This does not make AI an "invalid" niche, but using a narrow niche to demonstrate the power of heavy typing will probably get you a very narrow audience in turn. If you are going to make a movie to demonstrate your product, using English will get you a wider audience than one in Swahili. That's just the way it is. I'd suggest you put AI evidence into a topic such as How Types Help Ai

[I find it curious that you would consider a higher numeric quantity of Category Business Domain applications in the world to have a bearing on how you present arguments here, where there appears to be a diverse mix of domain interests. At the very least, Category Business Domain (or at least the end-user application, data-entry & reporting subcategory thereof) is not a larger (or even large) niche on Wards Wiki, where the usual concerns and interests are quite cross-domain. It seems, therefore, that material presented here should either (a) be intended to target as broad an audience as possible by covering multi-domain or cross-domain issues; or (b) have clear indications as to its scope and/or intended domain. By the way, where's the How Types Fail Category Business Domain page?]

This isn't a topic on SpecializingRdbms. The Relational Model isn't supposed to be domain-specific, and if a reader is only willing to consider benefits to a small subset of domains when considering future directions of the RDBMS, then that reader only deserves a small fraction of voice in any discussion regarding the future directions of a general-purpose RDBMS. A more reasonable reader might consider potential harm to his own domain and give that a greater weight, but would still consider net benefits across the wider variety of domains.

In reflection, I suggest Top Mind keep business-domain specific comments and ideas limited to topics such as Specializing Rdbms For Business Domain. From my perspective, Category Business Domain has a lot of liquid spending money but still constitutes a rather small conceptual space in the set of all domains - different businesses largely solve the same problems (personnel, service contracts, physical product, distribution, transport, warehousing, purchases, sales, accounting, etc.) with slightly different parameters and policies. BusinessDomain might be able to afford and provide the purchase-opportunity for a specialized RDBMS. Other domains, lacking this money (or at least the spending authority to dedicate it to an RDBMS) really need an RDBMS that is fit for a greater variety of all domains. And so a topic on Simplifying Rdbms should give a great deal of weight to simplifying the RDBMS for a greater variety of domains.

I find the "spare money" argument weak. If business is competitive, there is little if any spare money. Just about every domain wants to keep costs low if possible. Let's not assume that one domain has more spare cash than another.

It would take an idiot or a fool to not know that some domains have more spare cash than others. Knowing which ones have more money is a bit trickier, but businesses - being the entities that both make the most money and spend it most lavishly on offices, ponds, buildings, network upgrades, risky ventures, etc. - are obviously at or near the top of that pyramid. And academics, being forced to struggle to achieve yearly stipends from DARPA and the rare charitable business, happen to be near the bottom. Being competitive, businesses can and do seek specialized solutions to make them even more competitive and will spend money when opportunity arises to seek these advantages. That would be why 'fads' like Object Oriented and Agile Programming and so on sweep across businesses like diseases. It would also be why you have a job in attempting to fix up their database driven processes despite the lack of concrete proof that you'll actually make things better. I'm not going to act stupid or foolish just to help you support your ill-formed opinions. Besides, unless you plan to backpedal and argue that business is a small domain, having equal spending money to hire programmers to new tasks as the domain of AI, you really are showing off your enormous hypocrisy with this new line of argument.

The fancy rooms and ponds are not something that go to IT. IT'ers often complain about how the sales staff get all the perks and that they are stuck in the basement with desks and chairs from the 70's. Of course, each organization is different. However, I reject the notion that businesses over-fund IT in any way overall. Universities have boom-and-bust cycles also, I would note. And, many of the tools we use to today came out of university experiments, even if OO is not one of them.

The fact is, business has money to spend and they do spend it on IT. Career ITers don't complain about small-domain and academic projects because those guys cannot even afford career ITers. Those other domains need some people to start Simplifying Rdbms because they can't afford to pay dedicated experts to hack a solution together and maintain it. They need an RDBMS that won't create unnecessary complexity due to arbitrary and capricious limitations maintained with vociferous arguments by ignorant and egocentric people who believe 'custom biz apps' comprise the whole universe.

Universities have a fairly good supply of students and interns to do projects so that they can put something on their resume. If you could make a good case to the directors or committees for an AI-centric DB to be built...

Students and interns are, almost by definition, people who both barely know what they are doing and won't have much time to figure it out. They really aren't good for tasks that will take several years to complete. If you're lucky and an excellent judge of character and competence, you can perfectly select only the rare geniuses who can both learn the system as it exists and still have time to make useful contributions before they leave; if not, said rare geniuses will be spending all their time picking up the crud left behind by the previous programmers. Besides, why build an AI specialized DB when it is much smarter to start Simplifying Rdbms that already exist? Businesses collectively can afford the cost of Specializing Rdbms For Business Domain then maintaining it. Small-domain and academic projects cannot - not even collectively can they afford to pay Oracle or Sun to specialize to their domain and maintain the system. At best, they can hope they are targeted by an open source charity effort... which will usually take at least a decade just to catch up.

You are repeating your original arguments without any new evidence. At this point its anecdotes versus anecdotes, and so I am bailing.

It only makes sense to repeat a sufficient counter-argument after you repeat unreasonable claims. And, believe me, your claim - that we should assume that businesses both generally and collectively don't have any more spare money than any other domain (but can, uh, afford IT guys and stuff) - was the unreasonable one. Properly, you should have bailed that argument before you presented it. It wasn't ever a good one.

Projection.

Of course! It must have been me that was stating things that any glance at Forbes magazine would prove to be utterly false! I was just projecting that fallacy onto you! I'm very sorry, Top.

Meanwhile, unless top can locate damage caused to business-domain by supporting types other than just strings, numbers, dates, and the what-have-you tiny set supported by My Sql and related, then I think he has extremely little to contribute regarding this issue.

I did not mean to turn this into a domain Pissing Match. My point is that if you are going to chose a domain to demonstrate the power of something, it is best to find a common domain. Otherwise, fewer people will relate to it and it will not convince them of anything. If you use a domain that nobody relates to for your scenarios, it will be hollow evidence to them. If instead you are accusing me of being too dumb to extrapolate AI examples to custom biz examples, that may be the case, I don't know. Knowing how the domain changes over time is important to evaluating the decisions. Change impact analysis is important to estimating the cost and impact of maintenance scenarios. If you say, "adding a new foo will brake bar", then the question will come up as to whether adding new foo's is common because it may have to be weighed against the alternatives. We must know the domain to know the frequency. I'm just being practical. --top

You're not being practical at all. A practical person would be unconcerned about changes to an RDBMS unless they believe they have identified something to the change that directly harmed them. I've not claimed that type support will help you with your narrow domain of 'custom biz examples', but there is no good reason to believe it harms you either: it isn't as though I'm suggesting dates and strings and decimal numbers be banned from RDBMS. Simpler for some + No worse for others = Simpler Overall. In an attempt at Simplifying Rdbms, I don't need to help you; it is sufficient to simplify things for me and others while not forcibly making things more complex for you. The only 'dumb' you're obviously displaying is psychological egocentrism.

You keep missing my point. I shall take a brake from this section. Maybe I'll find a better way to explain my point later. Plus, you are in a pissy mood.

I believe your 'point' has been comprehended and rejected as irrelevant to any conclusions one should make in the context of this page. That isn't the same as 'missing' it. I've even offered explanation as to why your point is irrelevant. You are free to to explain again that to be 'convincing' I've got to show how a change makes life easier for the 'typical reader' (in your mind: you) and what the 'typical reader' considers to be the only domains worthy of consideration (in your mind: 'custom biz apps'), but I doubt the irrelevance of your point will be at all diminished by its repetition.

Logic would dictate that an example that targets a larger audience would be more relevant to more people (assuming you care to be relevant). This truth is independent of my alleged "psychological egocentrism" or whatever other sinister traits you associate with me. A scan of Dice-dot-com will demonstrate that there are not lot of AI development slots out there (other than maybe a handful for video games).

Unless you can say your larger audience is harmed, a larger audience isn't required: the case for supporting a wider variety of types is properly justified by their helping these other domains. It is sufficient to "demonstrate that 'complex types' makes the world better". Besides, logic would dictate that a collective audience of small domains that are currently disserviced by the limitations of RDBMS cannot be determined to be collectively called 'smaller' without proper evidence, which you have failed to provide. It isn't even as though 'AI' is the only domain where I've encountered problems with simplistic and unnecessary limitations existing in current RDBMS. The example involving unordered sets of words for identifiers was very much a systems software domain issue (relating to filesystem design).

Are you saying you disagree that AI is a smaller domain? I'd like to see your file-system scenario, if you would.

If you were a tad more literate or a hair less prejudiced, perhaps you would not have read "collective audience of small domains" as "just the AI domain". And I'll touch on the file-system example below (Page Anchor FS)

Your entire line of argument that "I'm not convinced" and rather egocentric and inconsiderate thoughts limited to "how does this help me?" (which is childish and immature, not 'sinister') is simply irrelevant. Your points are irrelevant. Your logic, thus far, is irrelevant. Your only potentially relevant claim thus far is: "Complex types usually generate the kinds of navigational messes that motivated relational to begin with. And you have yet to support that claim.

And do not wish to now. I'm just trying to get you to demonstrate something specific instead of speak in generalities.



Page Anchor FS: 'filesystem' example. (under construction)

Requirements demanded sets-of-words (unordered sets) be identifiers for objects, along with various selection properties for locating objects: ability to select for supersets and subsets of words, along with intersections. This is similar to indexing of files, but it is the filenames under operations here. All primary operations needed to be faster than polynomial with the number of objects. The 'whys' behind the set-of-words as identifiers aren't particularly relevant.

Representing sets in an RDBMS was the first approach. The RDBMS didn't support unordered sets of arbitrary words, just the basics provided by My Sql. Several ideas were fielded by members of the team; these included:

Using several columns, one per word. This was dismissed quite rapidly due to the flaws associated with it (which were presented immediately by the person who mentioned it). A lesser flaw included that it imposed an arbitrary limit to the number of words used to identify an object (which we could deal with; it would have been rare to use even six or seven words, so ten columns or so would have worked well enough). The greater flaw was the difficulty in selecting for words or sets of words (10 columns => (10 choose K) 'or' statements to match just K words).

Another option was to use a table IDENTIFIERS(fk-autonum, word). This table would essentially have one to ten entries per object in the system. It also requires proper maintenance in addition to the other table. Subset and intersections using this table were efficient enough (though sometimes involving multiple joins on the foreign-key) but superset operations were still difficult (requiring complex queries to prove that the the foreign-key wasn't associated with any additional words). All of them were still ugly and inelegant as sin.

Please clarify the above. Supersets can be done via GROUP BY, no?

No. Use of GROUP BY doesn't really solve any of the problems here. To find objects identified by a subset of words, you can do repeated joins (3 words -> 3 joins), essentially filtering objects on each pass. To find an object identified by a superset of words, you essentially union all of the objects that match on at least one word (no objects had empty idents, but if there were you'd just add it as well: there could only be one), then you need to filter down all identifiers that carry even one word from outside of that superset. This, done naively, will violate the complexity requirements - though our initial implementation went ahead and did it naively: simple join (where word = 'wordA' or word = 'wordB' or word = 'wordC'...) subtracting those with where 'not' (word = 'wordA' or word = 'wordB' or word = 'wordC'...). Then you need to go back and join this with the objects table. It is ultimately quite cludgy, especially when what we really want is: select * from objects where object.id subset of (and this could have been indexed for this purpose ahead of time).

A slight alteration to the previous idea was to use, instead of an autonum, a comma-delimited string of all the words as the identifier. This had the additional opportunity of guaranteeing fast access time once you knew all the involved words, even without lookups to identify the 'set' of words: you just had to guarantee the list was (always) properly collated and normalized (so two identical sets of words always formed identical strings). This was the option we ended up selecting at the start.

Another initially popular option was to limit the words to a dictionary, and represent each set as a 'vector' of bits, each bit indicating whether the associated word from the dictionary was included. 'Arbitrary' words could be handled by adding new words and position-information to the dictionary and simply limiting the vector to sizes (mod 32) where the last item still contained at least on set bit. Ultimately, this would have worked great for smaller dictionaries (and we seriously considered adding the bit-diddling support necessary for optimal performance to the RDBMS) but we were told that the 'arbitrary words' component was really important because all objects would have at least one semi-random word or number-word that was initially unique to that object. Thus, this became infeasible.

Now, if the RDBMS had properly supported unordered set-values, we would have avoided that hand-created index 'this word is in that set' table that essentially duplicated a ton of unnecessary facts, we wouldn't have had to hack together unnecessarily complicated queries to determine which sets were subsets or supersets of another set (that would be as much an operation as the numerical '<' or '>'), and we wouldn't have eventually said 'fsck this and its simplistic types' and destroyed the Big Ball Of Mud that the RDBMS solution eventually became. Ultimately, the RDBMS solution was scrapped in favor of a flat-file hashtable based solution that allowed us to make better guarantees on the performance complexity. It wasn't a big loss: a database simply isn't doing us much good if we need to carefully route all queries to it through a dedicated application or library that carefully adds all the required epicycles.

The complexity here would have been much lower if performance wasn't a concern or if the set-value didn't need to be a key, but part of 'supporting types' means making them perform well.

Parsing text and diddling with individual characters it not functionality that DBMS normally target heavily. DB's generally assume that the facts are more or less already "atomized".

Irrelevant and off-topic. This had nothing to do with parsing text. If the DBMS supported proper set-values in the first place, this would have been finished without any diddling or Adding Epicycles or dealing with the crap that your vaunted, utterly simplistic, string-based approach requires to make things work.

Okay, let's start from square one. What specifically do you mean by the following: "Alphabetize" - Most RDBMS readily provide sorting ability. Thus, where is the catch? "normalize spacing" - Do you mean like change 2 spaces to one space?

You need to alphabetize within a string or (for the multi-column solution) across columns in the same row. Most RDBMS's do not provide this. In the solution where a set of words is represented in a string, it is important to guarantee that the same set of words always forms the same string (e.g. "of,set,words" - alphabetized and normalized). Alphabetization and normalization were lesser issues (compared to subset and superset matching), but forcing even those sorts of complexities upon the application severely diminishes the value of having an RDBMS in the first place.

Are these going to change often? If not, have a process that makes them into a sorted long string. Databases cannot be expected to process everything. Some operations will probably need to be done procedurally (although it may be in a "database" language, like PL/SQL).

Object identifiers never 'changed' (any object with a different identifier was considered a truly different object), but objects were created and deleted fairly rapidly (up to several objects a second). And, yes, you essentially had to filter operations through a process that would sort these strings; we did it client-side since our RDBMS didn't support the ability to do it for arbitrary incoming query-strings (triggers or a delayed process would only help after the bad identifier was stored for some time, which was pointless (since clients would need to sort strings to perform queries anyway) and well and truly into error by the point a query was likely).

What would prevent you from implementing a central word-key preparation service in PL/SQL?

It wasn't an option; we didn't have Oracle. We had MySQL in about 2001.

Seems you were using the wrong tool for the job. (My Sql has been adding such features of late.)

Oracle would also have been a wrong tool for the job - and a much more expensive wrong tool. And just because I haven't analyzed PL/SQL does NOT mean you've provided a convincing argument that it would have been of significant benefit. What I do know is that the right tool for the job would be an RDBMS that supports unordered set-values.

PL/SQL (or a similar tool) would allow you to have a "central" service to produce those "phrase keys" via a Turing Complete language. You wouldn't have to do it on the client-side. One would add or update the "list" of words for a given object, and then run the re-keying process. (Perhaps have a flag-system to mark the process or key as complete to prevent un-updated lists/keys.) As far as "expensive", sometimes you gotta pay for features. That's life. One who ignores this begins to sound like Iwanta Pony.

Are you saying PL/SQL would allow one to intercept every single query, insert, and delete to the system and automatically 're-key' the identifier strings into alphabetized normalized comma-delimited lists of words? or only that it could support the maintenance of the additional word->id table that shouldn't be necessary anyway?

Require that all phrase-key changes go through stored procedures (except object deletion, which can automatically cascade), which clean up the words and regenerate the phrase-key. This ensures consistency and centrality.

The 'phrase-key' itself is supposed to be the object-identifier - a 'requirement' of the system. If some sort of surrogate key is used entirely inside the DBMS, that would be okay... but only so long as nobody outside the DBMS was ever expected to know about it. Since is the case, objects can't be created then "update" their word-set: that set of words must be 'the' identifier for the object as far as anyone else is concerned. I'm certain there are plenty more viable solutions if you're willing to drop the requirements...

This may be a case of Perfect Storm such that the requirements make otherwise acceptable work-arounds difficult or impossible. This just does not sound like a common-enough need to be put into the out-of-the-box features of a DBMS. (We don't have any hard-numbers to determine need frequency either way, so this is merely an educated guess.) Have you ever encountered a strong need for set-values before in a different application? If its the only case, then this would back the assertion of rareness.

The whole 'process' approach would just be another option when it comes to bad ways to hack what should be a simple problem, anyway. The good solution would be an RDBMS that supports set-values. I agree that Databases cannot be expected to process everything, but they should be expected to process as much as possible while keeping the Database itself 'pure' and 'generic'.

I don't see set-values as common enough of a need to make it a standard DB feature. However, stating this will probably trigger yet another domain Pissing Match like above regarding what is "common". Perhaps when it comes to what should and shouldn't be in the DB out-of-the-box, Everything Is Relative to the domain.

You say "Everything Is Relative to the domain", but you should realize that an RDBMS is NOT supposed to be a domain-specific tool. Assuming you are entirely correct, that what a DBMS should provide out-of-the-box depends on the domain it is intended to support, then one must logically conclude that any non-specialized RDBMS should provide a wide-enough range of values to support every domain people try to use it in - because, in a sense, people in every domain are grabbing the same 'box'. That includes supporting set-values.

Perhaps "not supposed to be a domain-specific tool" is an unrealistic goal? It is often pointed out that many existing RDBMS don't support declarative graph traversal, which some domains use heavily. Text processing/querying is also a known weakness. I have not surveyed enough domains by far to provide data on this issue. Perhaps "idioms used by many different domains" is a more realistic qualification than "all domains".

Being general-purpose as opposed to domain-specific mostly means supporting several different domains then embracing the extension of that set to more domains - as opposed to rejecting such change with elitist comments like "I don't see your domain as common enough to deserve support". Instead, one uses a different criterion for rejecting certain additions, e.g. "nothing specific to individual users, projects, businesses, frameworks, particular models, etc.". One can additionally provide support for more specific extensions (e.g. libraries or modules) that can add domain or project-specialized features.

So your suggestion for Simplifying Rdbms is to make them extensible for different domains? I'd accept that. Violent Agreement? However, the "how" may be tricky. As far as not supporting uncommon domains as being "elitist", I find that an odd accusation. Either something supports all domains or a subset. Since resources are finite, a subset is the likely result. And if a subset, then we must decide what to support and what not to support. Commonality is a very logical criteria to me. Vulcans should smile. How you view that as "elitist" escapes me. (Related: Left Handers Too Costly). What would be a non-elitist criteria for culling then? (Until we figure out an extension interface.) Then again, your sense of morality seems to not fit the norm. --top

Error. Resources are finite, therefore you need to prioritize which features should receive support. That is not the same as rejection. Rejection of a feature needs to occur by its own set of criterion (i.e. "that would break optimization/transaction/etc. features" or "adding GUI-features would really be outside the intended tasking of a 'pure' database-product, though it might make a good sister project" or "we won't allow operations for which we can't guarantee termination" or "err... this product is supposed to be an RDBMS, so no pointers-into-rows for you").

Understood. I did not mean the above from a hard-rejection ("forbidden") standpoint, I would note. It was in the context of things like set-values and graph traversal.

A better measure might be: if at least three different and otherwise unrelated projects profess to needing or desiring a feature, it is potentially a domain-generic feature and should be seriously considered for inclusion... i.e. put it on a card, then drop it into a priority-queue somewhere. As to how you include it: you might put it in a common and related extension set if possible; otherwise you may need to add core support.

As far as Violent Agreement goes: sure. I would be happy with an RDBMS that was truly extensible for whichever types and optimizations and trigger-systems etc. that were necessary, especially if one didn't need to bugger the RDBMS host and cut through red tape to 'install' extensions every time you discover you need them.

One idea proposed in Does Relational Require Types is to separate the relational engine (or at least the relational interface) from the "domain math".

Moved rest of discussion to Separation Of Database And Domain Math.


Proposal "Cindy"

table:
objects // generic stand-in for domain thing being labeled

------- objectID phraseKey // example: "All, Friday, Meetings" ...etc...

table:
objectWords

------ objectRef word

Stored procedures ------------ addWord(objectID, word) removeWord(objectID, word) // // These would lookup or generate word entry in "words" table // and regenerate the "phrasekey" if needed. All word changes // go through these procedures.

Note that this implies someone outside the system must possess an 'objectID' that is something other than a set of words, which would be a violation of the original system requirements.

I dropped the 'words' table for Yag Ni reasons.

(Yag Ni sometimes conflicts with clarity.)


Moved "elitist" issue to Being Offended Discussion.



See original on c2.com