File Naming, GUIDs, Duplication, Identity and Metadata: A Response to John O'Gorman

I love social network conversations. There has been a great one going on over in LinkedIn. In the AIIM Group for Intelligent Information Management, a conversation was started around whether or not file naming conventions are needed when we have robust EDMSs (enterprise document management systems). John O’Gorman, an Information Integration specialist, made a provocative post (in the spirit of great dialog) and I responded. The answers and debate have grown and now, rather than take up the whole forum, I am posting my reply here, so that you may participate as well!

If you have not read the thread, you can do so HERE (if you want to skip to the Billy vs. John debate, go to page 3). Without further ado, here is my reply:

I appreciate the engagement and invite others into the fray. I think this makes us all sharper! So in the spirit of mutual enlightenment and the disputational interrogative we engage!

We start off on common ground, agreeing that humans are *much* better suited to pattern recognition and discrimination than programs are. While programs can process vastly larger quantities of information, we can identify and consume “relevant” information more efficiently (at least for now).

1) You mention that a computer cannot have even 2 files with the same name in a folder, while we can pick out one from many quite easily. I agree with the example you give, but the question was not answered and your answer imports some assumptions that aren’t necessarily so. Let me explain. I would argue (based on my pop-sci understanding) that human discrimination is facilitated by the way our brains “tag” memories with unique identifiers. We use electro-chemical “naming conventions”; programs use other conventions. Same fundamental strategy though. To this extent, the argument that the strategy computer programs use is bad because it differs from what human brains use fails. Showing a difference in result does not impugn the process, merely the efficiency or execution or a host of other limiting factors. Secondly, your example imports some assumptions that do not hold. Why do you assume the Windows file system uniqueness requirements? I work with EDMSs that can store N number of files with the same file name in the same “location”, display those in a single collection and slap a “folder” icon on top of it. Computer programs? Yep. Windows file system limits? Nope.

2) Maybe I didn’t understand when you originally stated, “the reason we put meaningful labels on anything is because no one has come up with a better alternative”. This sounded to me like you considered this the worst of no possible alternatives. I am not sure why you feel this way. As you say, it seems to work pretty well for our brains, and while search isn’t up to our levels yet, it is getting there. I would also argue that basic keyword indexing (tokenization) has limitations. But aren’t you creating a straw man argument here? Why must disambiguation happen at this level? Search engines also incorporate (and are increasing their incorporation of) other weighting/prioritization/relevancy axes in order to achieve disambiguation and increase relevancy. This is why entity extraction and ontology-assisted querying are so promising. Before going there though, disambiguation and prioritization happen through incorporating inbound linking, folksonomies, usage/consumption patterns and other factors. Bibliographic tracing is quite popular in higher education systems in order to figure out which concepts are derivative and which are foundational. Computer programs are doing the tracing and therefore the discrimination here.

3) I grant that, as you say, two different GUIDs only assert that the two resources with which they are associated are deemed different. But again you brought up the distinction between systems and humans. Pictures of (the same) JaneSmith at age 3 and at age 33 are very different and, *depending on your prior relationship with her*, the difference may be enough to prevent your identifying her as the same person or (if you are her father) not enough to confuse you even for a moment. This raises two very important questions around meaningful (aka relevant) difference and identity. With a person we presume continuity of something that transcends cellular existence (personality, soul, spirit, whatever). With information what do we have? Changing a single byte in a document between versions alters checksum values, thereby creating an entirely new item (by one interpretation). But that one byte is not likely a meaningful difference, and so we are comfortable with maintaining the ID of the item. So hyper-content-centric identity of information seems not to be useful. Alternately, I can create an empty “unique id” for an item in my EDMS and then proceed to associate that ID with an image of JaneSmith, then a resume for JaneSmith, then a video of the Space Shuttle. That unique ID retains its uniqueness among all other items in the system but is a container for “nothing – image – document – video”. The ID is still able to be distinguished from all others in the system. The difference is the intentionality with which it is assigned. Humans import definition by creating the ID and therefore create the uniqueness regardless of what “thing” the identifier points to. This is extrinsic identification rather than intrinsic identification. I think there is a place for both kinds of identity in our world, but EDMSs generally focus on and utilize extrinsic identity. This is because we can import purpose more easily to extrinsically identified objects (since we control them) than to intrinsically identified objects (which we have to find a use for).
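
For readers who think better in code, the intrinsic vs. extrinsic distinction in point 3 can be sketched in a few lines of Python. This is only an illustration under my own assumptions, not how any particular EDMS works: the `Item` class and its fields are invented for the example, and SHA-256 simply stands in for whatever checksum a system might use.

```python
import hashlib
import uuid
from dataclasses import dataclass, field
from typing import Optional


def intrinsic_id(content: bytes) -> str:
    """Identity derived from the content itself (here, a SHA-256 checksum)."""
    return hashlib.sha256(content).hexdigest()


@dataclass
class Item:
    """Hypothetical EDMS item: the ID is assigned once; the content may change."""
    content: Optional[bytes] = None
    item_id: str = field(default_factory=lambda: str(uuid.uuid4()))


# Intrinsic identity: a one-byte change produces a different checksum.
v1 = b"Jane Smith resume, version 1"
v2 = b"Jane Smith resume, version 2"
print(intrinsic_id(v1) == intrinsic_id(v2))  # False

# Extrinsic identity: the assigned ID survives any change of content.
item = Item()                       # "nothing"
original_id = item.item_id
item.content = b"<image bytes>"     # then an image
item.content = b"<resume bytes>"    # then a document
item.content = b"<video bytes>"     # then a video
print(item.item_id == original_id)  # True
```

The checksum says "these bytes differ"; the assigned ID says "a human (or a rule) decided this is one thing". Neither answers on its own whether a difference is meaningful.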

4) I agree whole-heartedly that, as you say, “There is some philosophy in every human endeavor, just as there are some mathematics in every computer”. I, like you, enjoy engaging on this level as well! So props to us and the readers who enjoy it. I think this is an important area where we can elevate the discussion and practice in our community. So I’ll bite again. While I love talking about how interior angles of a triangle can add up to 270 when laid out on a sphere, I’m not seeing the information science analogy, but I am excited to hear it! In the meantime, I do not follow your LA example. You say that “without being told ‘Los Angeles’ (GUID h34dh23a4b7b33c8361) is different than ‘Los Angeles’ (GUID 7b33c8361h34dh23a4b) a computer is ignorant of that difference.” But I would reply that the *fact* of the different GUIDs is the discriminating factor for a computer. So by definition a computer *must* “know” that title attribute “Los Angeles” of GUID 1 is different and distinct from title attribute “Los Angeles” of GUID 2. The trouble comes up in how those GUIDs were assigned. If extrinsically assigned (i.e. through an act of intention), then we the human consumers of that information assume/rely on the idea that the difference is meaningful. If intrinsically assigned (e.g. automatically via a crawler or something else), then we get into potential duplication / overlap / synonym problems (e.g. the checksums were different but the objects were not meaningfully different). The trouble I have with this is that meaning is always imported by humans and is dependent on the scope of our problem domain. Google Earth can show the globe or my back yard. At one scope of the problem domain (e.g. where do you live?) both the “globe” and “my back yard” answers are correct. To a space alien the globe provides the appropriate level of meaning. To my mother the back yard provides the appropriate level of meaning. So by setting up the question the way you have, you seem to be assuming a common scope, which is only ever extrinsically identified and therefore unable to be held in common.

5) I think the “assigned to maintain” vs “derived to describe” strategy is very good, and the best part is that they are not mutually exclusive. I agree with you that these are “an interesting twist on randomly assigned object identifiers”. But these are quite common. They are simply at different levels of the problem domain. OWLs, most EDMSs and other relationship maps do this all the time. Using your example, ‘Management Salary Policy’ and ‘Management Gehalts Politik’ and ‘Política del Sueldo de la Gerencia’ and ‘???????? ???????? ??????????’ would share a common identifier that acts as a meta-identifier. The reason for this is that (I assume) you are using a localization example where we have 4 different translations of the same policy. In this case we humans understand that the collection, the set, is a common set and should be related and identified with a common identifier. The difference at the set level is not meaningful. At a deeper level the language difference becomes meaningful only after the set has been identified and located. Here is where we start delving into Derridean concepts of différance and what is meant by identity. Suffice it to say that this is the realm of extrinsically assigned (e.g. derived to describe) identification.
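
The meta-identifier idea in point 5 can be sketched like this. It is a toy model under my own assumptions: the `Rendition` record, the `set_id` field and the `find` helper are invented for illustration, and only three of the titles from the example are used.

```python
import uuid
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Rendition:
    """Hypothetical record: one language version of a document."""
    title: str
    language: str
    set_id: str  # shared meta-identifier for the whole set of translations
    item_id: str = field(default_factory=lambda: str(uuid.uuid4()))


policy_set = str(uuid.uuid4())
renditions = [
    Rendition("Management Salary Policy", "en", policy_set),
    Rendition("Management Gehalts Politik", "de", policy_set),
    Rendition("Política del Sueldo de la Gerencia", "es", policy_set),
]

# Index the renditions by the identifier they have in common.
by_set = defaultdict(list)
for r in renditions:
    by_set[r.set_id].append(r)


def find(set_id: str, language: str) -> Optional[Rendition]:
    """Locate the set first; only then does language become the meaningful difference."""
    return next((r for r in by_set[set_id] if r.language == language), None)


print(find(policy_set, "de").title)  # Management Gehalts Politik
```

Each rendition still has its own `item_id`, so the "assigned to maintain" and "derived to describe" identifiers coexist rather than compete.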

6) Here we’re getting down to brass tacks. I agree that search is a big pain point for most organizations. AIIM, Gartner, Forrester, Ovum, Gilbane and others all agree. But most EDMSs don’t care if multiple files with the same name are stored. I should be clear: by files with the same name I mean “file name” (e.g. BlogPost.doc or MyPresentation_Final.pptx or LinkedInReply.txt). This is because the EDMS will store that filename as an attribute but give the file a GUID. The good EDMSs (like Oracle UCM) will provide a set identifier (ContentID) that identifies the set of revisions, which may be substantially different from each other. Each revision has its own unique identifier (dID). Furthermore, ContentIDs can be auto-generated, auto-derived from rules / extraction processes or manually created each and every time. Additionally, content objects can be associated with each other along N number of axes for intentional purpose-based collecting/discovering/location/identification. These axes may or may not be indexable. This allows information discovery (classification, grouping, categorization) to be brought alongside information location (querying, retrieving) rather than relying simply on one or the other or requiring a serial approach of one then the other. I fundamentally disagree with your statement that, “nor is it considered best practice in an EDMS to encourage or even allow contributors to randomly assign their own metadata.” First, no user ever “randomly” assigns their own metadata. At least not in true random form. Second, your statement begs the question of what is metadata and what should end users create and consume? So are Flickr tags metadata? Yes. Should users create them and be enabled to create them? YES! Are star-based ratings on blog posts metadata? YES. Should users be allowed and empowered to engage with rating systems? YES! What about “comments” or “descriptions” or “due date”? I would argue that in contextually appropriate situations users should always have at least the option, if not the requirement, to add descriptive and intentional metadata to their content objects. The more that entity extraction systems (e.g. OpenCalais, GATE, CLARABRIDGE etc.) become commonplace, the easier we can make it for people. But I do not foresee a time when information creators will be able to, or should, stop describing and classifying what they have created or consumed.
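
To make the "file name as attribute" idea in point 6 concrete, here is a toy in-memory sketch. It is emphatically not the Oracle UCM API: the `Repository` and `Revision` classes, their methods and the `DOC-001` style identifiers are all hypothetical, with `content_id` and `revision_id` merely playing roles similar to ContentID and dID.

```python
import itertools
from dataclasses import dataclass, field
from typing import Dict, Iterable, List


@dataclass
class Revision:
    revision_id: int   # unique per revision (plays the role of a dID)
    content_id: str    # identifies the set of revisions (plays the role of a ContentID)
    file_name: str     # just an attribute, never used as an identifier
    tags: List[str] = field(default_factory=list)  # user-supplied metadata


class Repository:
    """Toy store for illustration only; not the API of any real EDMS."""

    def __init__(self) -> None:
        self._next_id = itertools.count(1)
        self._revisions: Dict[int, Revision] = {}

    def check_in(self, content_id: str, file_name: str,
                 tags: Iterable[str] = ()) -> Revision:
        rev = Revision(next(self._next_id), content_id, file_name, list(tags))
        self._revisions[rev.revision_id] = rev
        return rev

    def by_file_name(self, file_name: str) -> List[Revision]:
        return [r for r in self._revisions.values() if r.file_name == file_name]

    def revisions_of(self, content_id: str) -> List[Revision]:
        return [r for r in self._revisions.values() if r.content_id == content_id]


repo = Repository()
repo.check_in("DOC-001", "BlogPost.doc", tags=["draft"])
repo.check_in("DOC-001", "BlogPost.doc", tags=["final"])  # new revision, same set
repo.check_in("DOC-002", "BlogPost.doc")                  # unrelated item, same file name

print(len(repo.by_file_name("BlogPost.doc")))                  # 3
print([r.revision_id for r in repo.revisions_of("DOC-001")])   # [1, 2]
```

Three files named BlogPost.doc live side by side without conflict, because nothing in the store treats the file name as identity. The `tags` list is the hook for the user-contributed metadata argued for above.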

7) You write that “This issue of whether or not to have a convention for naming files is symptomatic of a systemic problem.” I agree wholeheartedly. But I disagree that this means that all systems fail to solve the problem. Indeed, I am very confident that with technologies like the Oracle ECM system and Fishbowl Solutions’ add-on modules such as our Subscription Notifier, Workflow solution set, CollabPoint and Advanced User Security Mapping, along with our solutions like Contract Management, Policies and Procedures, Admissions Office Onboarding and Research Solutions, we can and do solve the business problems around file naming conventions, information location and efficiency boosting.

So I will echo your sentiment at the end, with which I agree unequivocally: In the spirit of the community of intelligent information management, I’m just sayin’…
