medashare is meant to be both a standard for, and an implementation of, defining and sharing meta data about files. Some media can embed meta data, e.g. mp3, some documents, video files, but not all files are able to convey the complete and rich information that may be desired. Even if the file format has embedded data, it may not be modifiable, or it may not be desirable to modify the file to include the metadata. This will allow you to identify files, and to associate external meta data with various files.
It should be possible to classify parts of the meta data as private (aka TLP:Red) so that the information will not be shared. This lets a person mark files w/ their own tags and/or ratings, which will not be automatically shared.
The standard will also be able to define derivative works. For example, an album may have multiple tracks, so you can note that an mp3 file is a segment of the album, or that the mp3 file obtained from a music service is equivalent to the track. This will allow sharing meta data between mediums.
This derivation is also useful when files are programmatically transformed. Say an image is resized: adding a note that the smaller file is a derivative work allows others to reproduce the file, and also means you do not have to reenter all the meta data associated with the new version of the file.
This can be useful for things like raw files on a camera, where you associate general picture information w/ the raw image data (but none of the associated processing data that files like CR2 contain) so that the meta data is not lost.
This work is inspired by my work on [STIX], a Cyber Threat Intelligence standard, which has many requirements similar to those of meta data sharing.
Each object has a URN which uniquely describes it. XXX copy from STIX URN proposal, which is similar to the magnet proposal.
Something similar to:
urn:medashare:uuid[--modifieddate]
And the URI version:
medashare:?xt=<urn>
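As a rough sketch of the two forms above (not a final definition, since the URN format is still marked XXX), the helpers below build a URN and wrap it in the URI form. The function names are hypothetical, and the modified date is treated as an opaque string because its format is not specified here.

```python
import uuid
from typing import Optional
from urllib.parse import parse_qs, quote, urlsplit


def make_urn(obj_uuid: uuid.UUID, modified: Optional[str] = None) -> str:
    """Build urn:medashare:uuid[--modifieddate]."""
    urn = f"urn:medashare:{obj_uuid}"
    if modified is not None:
        # The date format is an assumption; it is passed through untouched.
        urn += f"--{modified}"
    return urn


def make_uri(urn: str) -> str:
    """Wrap a URN in the medashare:?xt=<urn> form."""
    # Keep ':' readable; anything else unsafe in a query is percent-encoded.
    return "medashare:?xt=" + quote(urn, safe=":")


def urn_from_uri(uri: str) -> str:
    """Extract the URN from a medashare URI, rejecting anything else."""
    parts = urlsplit(uri)
    if parts.scheme != "medashare":
        raise ValueError(f"not a medashare URI: {uri!r}")
    values = parse_qs(parts.query).get("xt", [])
    if len(values) != 1:
        raise ValueError("expected exactly one xt parameter")
    return values[0]
```

Round-tripping through `make_uri` and `urn_from_uri` should give back the original URN.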
Everything must have a type. Not having well defined types can lead to confusion and problems. Different encoding schemes have different ways of encoding types. If the encoding scheme has a native way to encode that type, it should be used. In some cases, e.g. JSON, there are no formal types beyond numbers and strings, and in this case, a type should (MUST? or via schemas?) be layered on top.
Look at adding units.
The hash string is the name of the hash (hash type) followed by a colon followed by the hex string (hash value).
The list of valid hashes is:
A reference is the UUID optionally followed by two dashes (--) followed by the modified date of the object. The modified date is necessary in some cases to know what version of the object is being referenced.
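A minimal sketch of parsing a reference as described above. The `Reference` type and `parse_reference` name are hypothetical, and the modified date is kept as an uninterpreted string since its format is not pinned down here.

```python
import uuid
from typing import NamedTuple, Optional


class Reference(NamedTuple):
    obj_uuid: uuid.UUID
    modified: Optional[str]  # None when the reference is unversioned


def parse_reference(ref: str) -> Reference:
    """Split 'UUID[--modified]' into its two parts."""
    # A canonical UUID never contains '--', so the first '--' is the separator.
    head, sep, tail = ref.partition("--")
    obj_uuid = uuid.UUID(head)  # raises ValueError on a malformed UUID
    if sep and not tail:
        raise ValueError("'--' must be followed by a modified date")
    return Reference(obj_uuid, tail if sep else None)
```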
For storing multiple hashes, a hashes type is used. This will be serialized as a list of hash strings. The list is used to reduce serialization overhead, and when loaded, may be parsed out into a dictionary for faster lookup. Only one hash string of each hash type is allowed. If there are duplicates, the object is invalid and MUST be ignored.
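The hash string and hashes type could be loaded roughly like this. It is a sketch only: the function names are hypothetical, and hash names are not checked against any list of valid hashes. An invalid list raises `ValueError`, which the caller would use to ignore the object.

```python
import string
from typing import Dict, List, Tuple

_HEX_DIGITS = set(string.hexdigits)


def parse_hash_string(hash_string: str) -> Tuple[str, str]:
    """Split 'name:hexvalue' into (hash type, hash value)."""
    name, sep, value = hash_string.partition(":")
    if not sep or not name or not value:
        raise ValueError(f"malformed hash string: {hash_string!r}")
    if not set(value) <= _HEX_DIGITS:
        raise ValueError(f"hash value is not hex: {hash_string!r}")
    return name, value


def load_hashes(hash_list: List[str]) -> Dict[str, str]:
    """Turn the serialized list into a dictionary keyed by hash type."""
    hashes: Dict[str, str] = {}
    for item in hash_list:
        name, value = parse_hash_string(item)
        if name in hashes:
            # Only one hash string of each hash type is allowed.
            raise ValueError(f"duplicate hash type: {name}")
        hashes[name] = value
    return hashes
```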
These are the nodes that contain a majority of the data.
The following properties are present on all (most?) objects:

- type: The type of the object.
- producer_ref: UUID of the producer that created this object.

Add signing info.

Properties:

- type: ‘metadata’
- uuid: UUIDv4
- modified: date of last modification of the metadata object
- dc: A [Dublin Core] property
- object_marking_refs: Imported from [STIX v2.0 Part 1]: Section 3.1
- granular_markings: Imported from [STIX v2.0 Part 1]: Section 3.1
- hashes: A list of hash strings.
- lang: RFC XXXX language of the properties.
- parent_refs: List of UUIDv4s of MetaData Objects that this object overlays. Any properties on this object override the parent. (allow deletion via None/null?) Any missing properties are passed through to the parent for resolution. The first/earliest object that has a property is used, in that objects later in the list are “hidden” by the earlier objects. The modified date must be included in this property.
- mimetype: The mime-type. If the set of bytes is polymorphic, there should be an object for each “type”.
- uri: List of URIs where the file may be located.
- child_files: A dictionary where the keys are the file names and the values are hash strings.

(One issue w/ using hashes is that you can’t tie YOUR idea of the metadata, but it also allows a person to have metadata about a file that is private and not be forced to share it, nor create a dummy object.)

Opinion Properties:

- qualityrating: On a scale from 1 (poor/terrible) to 5 (great/pristine), the subjective quality of the content.
The base object will contain all the data associated w/ the file (object). The base set of data is based upon the [Dublin Core] specification, as it provides a nice starting point, and will provide a good mapping to other systems out there.
There may be a link to another MetaData object from which this one is derived (via parent_refs). If there is, all the meta data from the parent object (and the ones it derives from) must be included, except for properties that have been marked deleted or were overridden. When a property is marked as opinion, it should not be inherited. If the new author agrees with the opinion, then they have to restate the opinion in their object.
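The inheritance rules can be sketched as below. Several things are assumptions for illustration, not part of the standard: objects are plain dictionaries, `lookup` is a hypothetical mapping from a reference string to the referenced object, deletion is expressed as `None` (still an open question above), the sets of opinion and structural property names are guesses, and the parent graph is assumed to be acyclic.

```python
from typing import Any, Dict, Mapping

# Assumed sets; the standard does not list these yet.
OPINION_PROPERTIES = frozenset({"qualityrating"})
STRUCTURAL_PROPERTIES = frozenset({"type", "uuid", "modified", "parent_refs"})


def resolve(obj: Mapping[str, Any],
            lookup: Mapping[str, Mapping[str, Any]]) -> Dict[str, Any]:
    """Return the effective properties of obj after applying parent_refs."""
    resolved: Dict[str, Any] = {}
    for ref in obj.get("parent_refs", []):
        parent = resolve(lookup[ref], lookup)
        for name, value in parent.items():
            if name in OPINION_PROPERTIES or name in STRUCTURAL_PROPERTIES:
                continue  # opinions and identity are never inherited
            # Earlier parents hide later ones, so only fill in gaps.
            resolved.setdefault(name, value)
    for name, value in obj.items():
        if value is None:
            resolved.pop(name, None)  # assumed deletion marker
        else:
            resolved[name] = value  # own properties override the parents
    return resolved
```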
Custom properties must be preceded w/ a namespace. The namespace is a name followed by a colon, as is demonstrated above w/ dc for [Dublin Core]. For existing standards, please submit them for inclusion, otherwise a reverse DNS name should be used, e.g. com.example:property.
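A small sketch of enforcing the namespace rule. The `CORE_PROPERTIES` set is an assumption collected from the property lists in this document, and `property_namespace` is a hypothetical helper name.

```python
from typing import Optional

# Assumed set of un-namespaced property names defined by the standard itself.
CORE_PROPERTIES = frozenset({
    "type", "uuid", "modified", "producer_ref", "object_marking_refs",
    "granular_markings", "hashes", "lang", "parent_refs", "mimetype",
    "uri", "child_files", "qualityrating", "stat",
})


def property_namespace(name: str) -> Optional[str]:
    """Return the namespace of a property, or None for a core property."""
    if name in CORE_PROPERTIES:
        return None
    namespace, sep, local = name.partition(":")
    if not sep or not namespace or not local:
        raise ValueError(f"custom property needs a namespace: {name!r}")
    return namespace
```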
The link to the meta data object must include the version referenced, as the referenced object may change. A three-way merge may be needed when updating an object where the derived object has also been updated, if the new information is to be incorporated.
If a property is imported from the file/object itself, it is recommended to mark it as such via the granular marking, see X for more info on how to do this.
The additional properties can be considered the RDF-MT s-p-o (subject-predicate-object) triple, where the subject is one of (each of) the hashes, the predicate is “has”, and the object is another s-p-o triple, which is (the property name, “is”, property value). (Use “be” instead of “is”?)
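To make the nested triple concrete, here is a sketch that expands a metadata object into such triples. The set of properties skipped as structural is an assumption, as is representing the object as a plain dictionary.

```python
from typing import Any, Iterator, Mapping, Tuple

Triple = Tuple[str, str, Tuple[str, str, Any]]

# Assumed: these describe the object itself rather than the file.
SKIPPED = frozenset({"type", "uuid", "modified", "hashes", "parent_refs"})


def to_triples(obj: Mapping[str, Any]) -> Iterator[Triple]:
    """Yield (hash, 'has', (property, 'is', value)) for each hash and property."""
    for hash_string in obj.get("hashes", []):
        for name, value in obj.items():
            if name in SKIPPED:
                continue
            yield (hash_string, "has", (name, "is", value))
```

An object with two hashes and one additional property therefore produces two triples, one per hash.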
Open Questions: When meta data is “declassified”, how do you maintain a link to the classified version?
Properties:

- type: ‘file’
- uuid: UUIDv5. If the stats do not match, check hash, create a derivative blob object, possibly?
- modified: date of last modification of the object
- stat: Stats for the file, modified time, file size, used to detect when file has been changed/modified.
A file object references a blob Object, and contains information about the file name in the file system associated w/ the blob. This is used to speed up looking up blob objects.
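One way the stat shortcut might work is sketched below: compare the cheap stat values first, and only rehash when they differ. The function names and the result labels are hypothetical, and sha256 is used only as an example hash since the list of valid hashes is not given here.

```python
import hashlib
import os
from typing import Dict


def stat_fingerprint(path: str) -> Dict[str, int]:
    """Collect the stat fields used to detect changes."""
    st = os.stat(path)
    return {"mtime_ns": st.st_mtime_ns, "size": st.st_size}


def sha256_hash_string(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks and return it as a hash string."""
    digest = hashlib.sha256()
    with open(path, "rb") as fp:
        for chunk in iter(lambda: fp.read(chunk_size), b""):
            digest.update(chunk)
    return "sha256:" + digest.hexdigest()


def check_file(path: str, recorded_stat: Dict[str, int], recorded_hash: str) -> str:
    """Return 'unchanged', 'touched' (stat differs, same bytes) or 'modified'."""
    if stat_fingerprint(path) == recorded_stat:
        return "unchanged"  # stats match, so skip the expensive hash
    if sha256_hash_string(path) == recorded_hash:
        return "touched"
    return "modified"  # a candidate for a derivative blob object
```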
A container object references one or more File objects. This is for representing containers such as zip or tar.gz files, but is also for BitTorrent hashes (even for single file torrents).
Similar to the File Object, but for web resources.
These are the edges that connect the nodes. For the most part they do not contain any data.
The two linked nodes, required to be File Objects, are equivalent.
Ask cvoid: Should a file system reference point to the blob hash or the uuidv4 of the blob object? blob hash requires a lookup, and maybe selection? Maybe both? To denote the selection. Likely File Objects are going to be private, so internal optimization? This will likely be different for URL objects as they are more public, where file system is often local only (unless on a shared, e.g. work, system).
The [Dublin Core] standard is also noted by [RFC5013].
[STIX]: https://www.oasis-open.org/committees/tc_home.php?wg_abbrev=cti
[Dublin Core]: http://dublincore.org/documents/dces/
[RFC5013]: https://tools.ietf.org/html/rfc5013
[STIX v2.0 Part 1]: http://docs.oasis-open.org/cti/stix/v2.0/cs01/part1-stix-core/stix-v2.0-cs01-part1-stix-core.html
DHT looks like a good option for finding things. IPFS is a great option for storing the data, and allowing peers to find the data, but it does NOT provide a search solution. It should be possible to combine the hash tree crypto solution along w/ the DHT to provide a way to build up an index for a piece of data.
Need to look at DSHT.

Thoughts:
For search, you need two functions:
How to do lookup:
How to add an object:
Validating the object:
Attacks to prevent:

- Adding random hashes that don’t map to anything.
- Adding valid hashes that don’t have the proper term in them.

When adding hashes, limit number of unverified hashes per block iteration.
Issue is, how do we accept that a new block is valid:
When updating, always check for n + 1 until n is not found. When publishing, depending upon timeframe, select n where n is smallest but still meets the time parameter.
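The update probing could look roughly like this. `exists` stands in for whatever DHT lookup ends up being used; it and `latest_block` are hypothetical names, and how n is selected when publishing is not covered.

```python
from typing import Callable, Optional


def latest_block(exists: Callable[[int], bool], start: int = 0) -> Optional[int]:
    """Probe n, n + 1, ... and return the last n that was found.

    Returns None when even the starting block is missing.
    """
    if not exists(start):
        return None
    n = start
    while exists(n + 1):
        n += 1
    return n
```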
Not sure if IPFS has a final hash mapping system (was unable to find one), but each object in IPFS may have a different multihash depending upon how the blob/list is broken out.
Need a mapping system like: sha256:XXX -> ipfs:XLYkgq61DYaa8Nh3cq1U7rLinSa7dSHQ16x
This could/should include multiple other items? like maybe block level hashing, though if there’s an ipfs, that provides it..
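As an in-memory illustration of such a mapping (not a proposal for how it would be stored or distributed), the sketch below keeps a set of locations per hash string. The `HashIndex` class and its method names are hypothetical.

```python
from collections import defaultdict
from typing import DefaultDict, Set


class HashIndex:
    """Map a content hash string to the places the content can be fetched."""

    def __init__(self) -> None:
        self._by_hash: DefaultDict[str, Set[str]] = defaultdict(set)

    def add(self, hash_string: str, location: str) -> None:
        # One file may chunk differently in IPFS, so several locations
        # can map to the same content hash.
        self._by_hash[hash_string].add(location)

    def lookup(self, hash_string: str) -> Set[str]:
        # Return a copy so callers cannot mutate the index.
        return set(self._by_hash.get(hash_string, ()))
```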
Reference Info:
- https://wiki.freedesktop.org/www/CommonExtendedAttributes/ (Dublin Core reference by Freedesktop)
- https://help.archive.org/hc/en-us/articles/360018818271-Internet-Archive-Metadata