I apologize in advance for the "brain dump" nature of this post. Just wanted to push this idea out there. Will flesh it out more later.
I work with data and databases. As most, if not all, database and business software developers know, data quality and duplicate data can be nightmare problems.
Related to this, I have been thinking about an idea that is likely already out there, or may even already be implemented. I have done some basic research but haven't come up with anything concrete; I may be looking in the wrong place. The 30-second elevator pitch is: mash up Wikipedia and search engines with data. The key to this idea is something called Master Data Management. The goal of Master Data Management is to eliminate duplicate (and likely differing) sets of data - in the end there is one "master" set of data that ensures everyone in a business or organization is looking at and working from the same page.
Lots of software/database companies are working on this problem, but they are usually limited to targeting a single business's or organization's data. Microsoft has a new SQL Server-related product called Master Data Services. What about expanding this idea so that the entire world uses these "master" sets of data? This idea has a good many similarities to one of the key backbones of the Internet - DNS (Domain Name System). Another great example is date/time synchronization back to the master clock at the US Naval Observatory in Washington, D.C.
Wikipedia has a great deal of data contained within it but it is wrapped in narrative. I am thinking of just the raw data.
Many companies and organizations/government agencies expose their public data on the web, through file downloads, through web services, and other methods. But it is usually inconsistent and often difficult to manipulate. What about a system where this data and access is standardized?
This idea is still fairly fuzzy so I think I am going to switch to a simple stream of consciousness:
- Tabular data/lists. Key-value pairs. Cloud computing. Crowdsourcing.
- Searching for a piece or set of data (a web-service-based Wolfram Alpha).
- Standardized API. Web services to write/read data. File upload/importing/exporting.
- Both secure (login/password) and public access.
- Multiple methods of read access - web services, RESTful HTTP requests, returned XML, nice web pages.
- Social aspect similar to Wikipedia - voting/discussions/etc.
- Caching/mirroring. Internationalization.
- Store data, or an API to allow a proxy against your own data.
- CPU cycles focused on providing master data to the world. A separate system with CPU cycles focused on mashing up data contained within the master data system.
- Could add narratives/explanations to raw data.
- Versioning/effective-dating component to data. Notifications on data changes. An RSS-feed-like system for the raw data.
- Initially textual/numeric data only - eventually maybe image/audio/video information.
- Related to government information - transparency, ready access to data/information, standardized.
- Initial implementation: CPUs not available beyond data retrieval - you would still need to process locally.
- Possibly substantial validation of parties to ensure trust.
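To make one of those bullets concrete, here is a toy Python sketch of the versioning/effective-dating idea: every write keeps the old value stamped with the date it became effective, so you can ask what a value was "as of" any date. This is purely illustrative - the class, keys, and dates are all invented, not a real design:

```python
from datetime import date

class VersionedStore:
    """Toy in-memory store with effective-dated versions (illustrative only)."""

    def __init__(self):
        self._history = {}  # key -> list of (effective_date, value)

    def put(self, key, value, effective):
        # Keep every version; never overwrite history.
        self._history.setdefault(key, []).append((effective, value))
        self._history[key].sort()  # keep versions in date order

    def get(self, key, as_of=None):
        """Return the value in effect on `as_of` (default: the latest)."""
        versions = self._history.get(key, [])
        if as_of is None:
            return versions[-1][1] if versions else None
        current = None
        for eff, value in versions:
            if eff <= as_of:
                current = value
        return current

store = VersionedStore()
store.put("usps/zipcode/90210/city", "Beverly Hills", date(1963, 7, 1))
store.put("usps/zipcode/90210/city", "Beverly Hills, CA", date(2020, 1, 1))
print(store.get("usps/zipcode/90210/city", as_of=date(2000, 1, 1)))  # prints "Beverly Hills"
```

A change-notification or RSS-like feed could hang off the same history: each `put` is an event that subscribers to the key could be told about.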
An example: www.blah.com/usps/zipcode would be the access point (web, web service) for official US Postal Service Zip Code information. The USPS would maintain their master data here (likely still a copy of the "real" master table), but it would be very close to the real thing, frequently updated, and maintained by the official organization. The access API would be standardized for writing and reading.
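To sketch the read side of such a standardized API, here is what a client might do with an XML record from an endpoint like the one above. The payload shape, field names, and version attribute are all invented for illustration (the sample response is embedded rather than fetched over the network):

```python
import xml.etree.ElementTree as ET

# Hypothetical XML payload that an endpoint like
# www.blah.com/usps/zipcode/90210 might return (all names invented).
SAMPLE_RESPONSE = """\
<record source="usps" dataset="zipcode" version="3">
    <zipcode>90210</zipcode>
    <city>Beverly Hills</city>
    <state>CA</state>
</record>
"""

def parse_zipcode_record(xml_text):
    """Turn one master-data XML record into a plain dict."""
    root = ET.fromstring(xml_text)
    record = {child.tag: child.text for child in root}
    record["_version"] = root.get("version")  # keep the version for auditing
    return record

record = parse_zipcode_record(SAMPLE_RESPONSE)
print(record["city"], record["state"], record["_version"])  # prints "Beverly Hills CA 3"
```

The point is that because the envelope (source, dataset, version) is standardized, the same tiny parser would work against any dataset in the system, not just zip codes.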
Many companies (Amazon, Twitter, etc.) have come out with APIs to work with their data, but they are all different and disparate - data and web-service silos. Wikipedia tries to be a global, all-inclusive encyclopedia of facts and narrative, and its primary UI is the web. A "Wikidatabase" or "Datawiki" or ? could focus on being a global, all-inclusive database/data repository of the world's raw data. The primary UI for the "Datawiki" would be the XML web service.
I don't know if this idea is even viable. I don't know if this idea has any commercial prospects. I don't know if this idea has any utility. There are some significant issues to overcome, such as scale, scope, performance, security, and reliability. But the idea could possibly have some merit. I am hoping that someone comes back and says, "This has already been done, or is being done here - X".
As a slight tangent: this whole idea is closely related to standards of weights and measurement (see NIST). I am reminded of a fantastic cartoon from the US Navy about calibration. With this Data Mashup idea we are not dealing with physical equipment but rather data standardization - but the lessons hold true.
Maybe someone smarter than me will take this idea and run with it. I can help with some of the technical aspects especially if it involves database development and SQL, but someone with a bigger brain would need to lead the way.
Please share any feedback that you may have. I will likely do more research and writing on this idea and would love others' viewpoints and thoughts.